From patchwork Fri Feb 26 04:59:20 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lynne <dev@lynne.ee>
X-Patchwork-Id: 25993
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
X-Original-To: patchwork@ffaux-bg.ffmpeg.org
Delivered-To: patchwork@ffaux-bg.ffmpeg.org
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100])
	by ffaux.localdomain (Postfix) with ESMTP id 7899944BC6A
	for <patchwork@ffaux-bg.ffmpeg.org>; Fri, 26 Feb 2021 06:59:27 +0200 (EET)
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 4219A68A148;
	Fri, 26 Feb 2021 06:59:27 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from w4.tutanota.de (w4.tutanota.de [81.3.6.165])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id E51D4689E8A
 for <ffmpeg-devel@ffmpeg.org>; Fri, 26 Feb 2021 06:59:20 +0200 (EET)
Received: from w3.tutanota.de (unknown [192.168.1.164])
 by w4.tutanota.de (Postfix) with ESMTP id 8311F1060254
 for <ffmpeg-devel@ffmpeg.org>; Fri, 26 Feb 2021 04:59:20 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; t=1614315560;
 s=s1; d=lynne.ee;
 h=From:From:To:To:Subject:Subject:Content-Description:Content-ID:Content-Type:Content-Type:Content-Transfer-Encoding:Cc:Date:Date:In-Reply-To:MIME-Version:MIME-Version:Message-ID:Message-ID:Reply-To:References:Sender;
 bh=R4UYxfU0XB9Asm5rsJYVc8yANgXhDxcScXGArlhnsBA=;
 b=aFIkDGKKDwiUe/4sL8Jl4Gcziaw1JTiMEhqI2A9vMwRa8Kdo/ycXMrNh2nGFOS4B
 d6kVN0gM3wMK9WxzOEOkA/Y3/ThxgM+00MTdgPHk+80gprWKMf0XjBjMloEmuwgVvnU
 U3+jrbJKjFI3Nauy3vAGsMSOKdHNcrdNb6+9KZKvVoT2gMnVH1h1pWdpFtviwrYiYZH
 AgbAGqZ6gqCCGyVoUFjeRlXfo/VDM88V84mJG3hcDB5obc6zpII1xCfJ8kHiSPlt19X
 m70gUk5rwTryi7vb3/b9gqSxfSZTjrYtZXp9zetzJHhI7WJWDgkR503RH9PJ5N6/M/t
 zfGnttsHzw==
Date: Fri, 26 Feb 2021 05:59:20 +0100 (CET)
From: Lynne <dev@lynne.ee>
To: Ffmpeg Devel <ffmpeg-devel@ffmpeg.org>
Message-ID: <MURh8bt--3-2@lynne.ee>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH] lavu/tx: WIP add x86 assembly
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>

This commit adds SSE3 and AVX assembly optimizations for 4-point and
8-point transforms only.
The code that recombines them into larger transforms does not work yet,
so it's not included here. This is just to get some feedback
on possible optimizations.

The 4-point assembly is based on this structure:
https://gist.github.com/cyanreg/665b9c79cbe51df9296a969257f2a16c

The 8-point assembly is based on this structure:
https://gist.github.com/cyanreg/bbf25c8a8dfb910ed3b9ae7663983ca6

They're implemented as macros because they're pasted several times in
the recombination code.

All code here is faster than both our own current assembly (by around 40%)
and FFTW3 (by around 10% to 40%).

The 8-point core assembly is barely 20 instructions! That's one fewer
than our current code, and it saves a lot of shuffles!
It's 40% faster than FFTW!
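For anyone who wants to check the 8-point output while reviewing, a naive
reference along these lines might help. It's only a sketch: the cplx type
and the dft8_ref name are made up for illustration, and it assumes the
forward transform uses exp(-2*pi*i*j*k/8) (negate the imaginary parts of
the table for the inverse). The twiddles are hardcoded, so nothing from
libm is needed.

```c
/* Hypothetical stand-in for one complex float sample (re, im). */
typedef struct { float re, im; } cplx;

#define SQRT1_2_F 0.70710678118654752440f /* sqrt(1/2) */

/* The eight 8th roots of unity, exp(-2*pi*i*m/8) for m = 0..7
 * (assumed forward sign convention). */
static const cplx w8[8] = {
    {  1.0f,       0.0f      }, {  SQRT1_2_F, -SQRT1_2_F },
    {  0.0f,      -1.0f      }, { -SQRT1_2_F, -SQRT1_2_F },
    { -1.0f,       0.0f      }, { -SQRT1_2_F,  SQRT1_2_F },
    {  0.0f,       1.0f      }, {  SQRT1_2_F,  SQRT1_2_F },
};

/* Naive O(n^2) 8-point DFT: out[k] = sum over j of in[j] * w8[(j*k) mod 8].
 * Slow, but simple enough to trust as a comparison target. */
static void dft8_ref(cplx out[8], const cplx in[8])
{
    for (int k = 0; k < 8; k++) {
        float re = 0.0f, im = 0.0f;
        for (int j = 0; j < 8; j++) {
            const cplx w = w8[(j * k) & 7];
            /* Complex multiply-accumulate: (in.re + i*in.im) * (w.re + i*w.im) */
            re += in[j].re * w.re - in[j].im * w.im;
            im += in[j].re * w.im + in[j].im * w.re;
        }
        out[k].re = re;
        out[k].im = im;
    }
}
```

Feeding the same input to dft8_ref and to the assembly and comparing with a
small tolerance should catch most permutation or sign mistakes.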

The 4-point core assembly is 10 instructions, which is one more than
our current code. However, it doesn't need to load anything from
external memory (a sign mask), trading that load for a shufps (faster).
It also uses one additional temporary register
to reduce latency.
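To illustrate the sign-mask point, here's a rough scalar model in C. It
assumes the mask-based approach is an xorps with a constant followed by an
addps, which is a guess at the general pattern rather than a copy of the
existing code. The function names are made up. addsubps subtracts in the
even lanes and adds in the odd lanes, so no constant has to be loaded.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of SSE3 addsubps on one 4-float register:
 * even lanes compute a - b, odd lanes compute a + b. */
static void addsub4(float dst[4], const float a[4], const float b[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = (i & 1) ? a[i] + b[i] : a[i] - b[i];
}

/* Assumed mask-based equivalent: flip the sign bit of b in the even lanes
 * (xorps with a constant that must come from memory), then a plain add. */
static void xor_add4(float dst[4], const float a[4], const float b[4])
{
    static const uint32_t mask[4] = { 0x80000000u, 0, 0x80000000u, 0 };

    for (int i = 0; i < 4; i++) {
        uint32_t bits;
        float flipped;

        memcpy(&bits, &b[i], sizeof(bits));       /* reinterpret float as bits */
        bits ^= mask[i];                          /* the "xorps" step */
        memcpy(&flipped, &bits, sizeof(flipped));
        dst[i] = a[i] + flipped;                  /* the "addps" step */
    }
}
```

Both functions give the same result for ordinary inputs. The difference is
only where the sign pattern comes from: the instruction itself or a table
in .rodata.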

I'll collect the suggestions and implement them when I'm ready
to post the full power-of-two assembly.
---
 libavutil/tx.c                |   2 +
 libavutil/tx_priv.h           |   2 +
 libavutil/x86/Makefile        |   2 +
 libavutil/x86/tx_float.asm    | 171 ++++++++++++++++++++++++++++++++++
 libavutil/x86/tx_float_init.c |  66 +++++++++++++
 5 files changed, 243 insertions(+)
 create mode 100644 libavutil/x86/tx_float.asm
 create mode 100644 libavutil/x86/tx_float_init.c

diff --git a/libavutil/tx.c b/libavutil/tx.c
index 4a5ec6975f..c0ae4b42ca 100644
--- a/libavutil/tx.c
+++ b/libavutil/tx.c
@@ -169,6 +169,8 @@ av_cold int av_tx_init(AVTXContext **ctx, av_tx_fn *tx, enum AVTXType type,
     case AV_TX_FLOAT_MDCT:
         if ((err = ff_tx_init_mdct_fft_float(s, tx, type, inv, len, scale, flags)))
             goto fail;
+        if (ARCH_X86)
+            ff_tx_init_float_x86(s, tx);
         break;
     case AV_TX_DOUBLE_FFT:
     case AV_TX_DOUBLE_MDCT:
diff --git a/libavutil/tx_priv.h b/libavutil/tx_priv.h
index e9fba02a35..b5c1cdd8c6 100644
--- a/libavutil/tx_priv.h
+++ b/libavutil/tx_priv.h
@@ -158,4 +158,6 @@ typedef struct CosTabsInitOnce {
     AVOnce control;
 } CosTabsInitOnce;
 
+void ff_tx_init_float_x86(AVTXContext *s, av_tx_fn *tx);
+
 #endif /* AVUTIL_TX_PRIV_H */
diff --git a/libavutil/x86/Makefile b/libavutil/x86/Makefile
index 5f5242b5bd..d747c37049 100644
--- a/libavutil/x86/Makefile
+++ b/libavutil/x86/Makefile
@@ -3,6 +3,7 @@ OBJS += x86/cpu.o                                                       \
         x86/float_dsp_init.o                                            \
         x86/imgutils_init.o                                             \
         x86/lls_init.o                                                  \
+        x86/tx_float_init.o                                             \
 
 OBJS-$(CONFIG_PIXELUTILS) += x86/pixelutils_init.o                      \
 
@@ -14,5 +15,6 @@ X86ASM-OBJS += x86/cpuid.o                                              \
              x86/float_dsp.o                                            \
              x86/imgutils.o                                             \
              x86/lls.o                                                  \
+             x86/tx_float.o                                             \
 
 X86ASM-OBJS-$(CONFIG_PIXELUTILS) += x86/pixelutils.o                    \
diff --git a/libavutil/x86/tx_float.asm b/libavutil/x86/tx_float.asm
new file mode 100644
index 0000000000..b7edc50e81
--- /dev/null
+++ b/libavutil/x86/tx_float.asm
@@ -0,0 +1,171 @@
+;******************************************************************************
+;* This file is part of FFmpeg.
+;*
+;* FFmpeg is free software; you can redistribute it and/or
+;* modify it under the terms of the GNU Lesser General Public
+;* License as published by the Free Software Foundation; either
+;* version 2.1 of the License, or (at your option) any later version.
+;*
+;* FFmpeg is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;* Lesser General Public License for more details.
+;*
+;* You should have received a copy of the GNU Lesser General Public
+;* License along with FFmpeg; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+;******************************************************************************
+
+%include "x86util.asm"
+
+%if ARCH_X86_64
+%define pointer resq
+%else
+%define pointer resd
+%endif
+
+struc AVTXContext
+    .n:           resd 1    ; Non-power-of-two part
+    .m:           resd 1    ; Power-of-two part
+    .inv:         resd 1    ; Is inverse
+    .type:        resd 1    ; Type
+    .flags:       resq 1    ; Flags
+    .scale:       resq 1    ; Scale
+
+    .exptab:      pointer 1 ; MDCT exptab
+    .tmp:         pointer 1 ; Temporary buffer needed for all compound transforms
+    .pfatab:      pointer 1 ; Input/Output mapping for compound transforms
+    .revtab:      pointer 1 ; Input mapping for power of two transforms
+    .inplace_idx: pointer 1 ; Required indices to revtab for in-place transforms
+endstruc
+
+SECTION_RODATA 32
+
+%define POS 0x00000000
+%define NEG 0x80000000
+%define M_SQRT1_2 0.707106781186547524401
+s8_mult_odd:   dd 1.0, 1.0, -1.0, 1.0, -M_SQRT1_2, -M_SQRT1_2, M_SQRT1_2, M_SQRT1_2
+
+s8_perm_even:  dd 1, 3, 0, 2, 1, 3, 2, 0
+
+s8_perm_odd1:  dd 3, 3, 1, 1, 1, 1, 3, 3
+s8_perm_odd2:  dd 1, 2, 0, 3, 1, 0, 0, 1
+
+mask_mmmmpppm: dd NEG, NEG, NEG, NEG, POS, POS, POS, NEG
+mask_ppmpmmpm: dd POS, POS, NEG, POS, NEG, NEG, POS, NEG
+
+SECTION .text
+
+; Single 4-point in-place complex FFT (will do 2 transforms at once in AVX mode)
+; %1 - even coefficients (r0.reim, r2.reim, r4.reim, r6.reim)
+; %2 - odd coefficients  (r1.reim, r3.reim, r5.reim, r7.reim)
+; %3 - temporary
+; %4 - temporary
+%macro DUET_FFT4 4
+    subps    %3, %1, %2         ;  r1,  r2,  r3,  r4, (r5,  r6,  r7,  r8)
+    addps    %1, %2             ;  t1,  t2,  t3,  t4, (t5,  t6,  t7,  t8)
+    shufps   %2, %3, %3, q1100  ;  r1,  r1,  r2,  r2, (r5,  r5,  r6,  r6)
+    shufps   %4, %1, %1, q3322  ;  t3,  t3,  t4,  t4, (t7,  t7,  t8,  t8)
+    shufps   %3, %3, q2233      ;  r4,  r4,  r3,  r3, (r8,  r8,  r7,  r7)
+    shufps   %1, %1, q1100      ;  t1,  t1,  t2,  t2, (t5,  t5,  t6,  t6)
+    addsubps %2, %3             ;  b3,  b1,  b4,  b2, (b7,  b5,  b8,  b6)
+    addsubps %1, %4             ;  a3,  a1,  a4,  a2, (a7,  a5,  a8,  a6)
+    shufps   %1, %1, q2031      ;  a1,  a2,  a3,  a4, (a5,  a6,  a7,  a8)
+    shufps   %2, %2, q3021      ;  b1,  b2,  b3,  b4, (b5,  b6,  b7,  b8)
+%endmacro
+
+INIT_XMM sse3
+cglobal fft4, 4, 4, 4, ctx, out, in, stride
+    mova m0, [inq + 0*mmsize]
+    mova m1, [inq + 1*mmsize]
+
+    cmp dword [r0 + AVTXContext.inv], 1
+    jl .s
+
+    shufps m2, m1, m0, q3210
+    shufps m0, m1, q3210
+    mova   m1, m2
+
+.s: DUET_FFT4 m0, m1, m2, m3
+
+    unpcklpd m2, m0, m1
+    unpckhpd m0, m1
+
+    mova [outq + 0*mmsize], m2
+    mova [outq + 1*mmsize], m0
+
+    ret
+
+; Single 8-point in-place complex FFT
+; %1 - even coefficients (r0.reim, r2.reim, r4.reim, r6.reim)
+; %2 - odd coefficients  (r1.reim, r3.reim, r5.reim, r7.reim)
+; %3 - temporary
+; %4 - temporary
+%macro SINGLET_FFT8_AVX 4
+    subps      %3, %1, %2               ;  r1,  r2,  r3,  r4,  r5,  r6,  r7,  r8
+    addps      %1, %2                   ;  q1,  q2,  q3,  q4,  q5,  q6,  q7,  q8
+    vpermilps  %2, %3, [s8_perm_odd1]   ;  r4,  r4,  r2,  r2,  r6,  r6,  r8,  r8
+    shufps     %4, %1, %1, q3322        ;  q1,  q1,  q2,  q2,  q5,  q5,  q6,  q6
+    movsldup   %3, %3                   ;  r1,  r1,  r3,  r3,  r5,  r5,  r7,  r7
+    shufps     %1, %1, q1100            ;  q3,  q3,  q4,  q4,  q7,  q7,  q8,  q8
+    addsubps   %3, %2                   ;  z1,  z2,  z3,  z4,  z5,  z6,  z7,  z8
+    addsubps   %1, %4                   ;  s3,  s1,  s4,  s2,  s7,  s5,  s8,  s6
+    mulps      %3, [s8_mult_odd]        ;  z * s8_mult_odd
+    vpermilps  %1, [s8_perm_even]       ;  s1,  s2,  s3,  s4,  s5,  s6,  s8,  s7
+    shufps     %2, %3, %3, q2332        ;   c,   r,   a,   p,  z7,  z8,  z8,  z7
+    xorps      %4, %1, [mask_mmmmpppm]  ;  e1,  e2,  e3,  e4,  e5,  e6,  e8,  e7
+    vpermilps  %3, %3, [s8_perm_odd2]   ;  z2,  z3,  z1,  z4,  z6,  z5,  z5,  z6
+    vperm2f128 %1, %4, q0003            ;  e5,  e6,  e8,  e7,  s1,  s2,  s3,  s4
+    addsubps   %2, %3                   ;   c,   r,   a,   p,  t5,  t6,  t7,  t8
+    subps      %1, %4                   ;  w1,  w2,  w3,  w4,  w5,  w6,  w7,  w8
+    vperm2f128 %2, %2, q0101            ;  t5,  t6,  t7,  t8,  t5,  t6,  t7,  t8
+    vperm2f128 %3, %3, q0000            ;  z2,  z3,  z1,  z4,  z2,  z3,  z1,  z4
+    xorps      %2, [mask_ppmpmmpm]      ;  t5,  t6, -t7,  t8, -t5, -t6,  t7, -t8
+    addps      %2, %3, %2               ;  u1,  u2,  u3,  u4,  u5,  u6,  u7,  u8
+%endmacro
+
+; Load complex values (64 bits) via a lookup table
+; %1 - output register
+; %2 - GPR of base input memory address
+; %3 - GPR of LUT (int32_t indices) address
+; %4 - temporary GPR or vector register
+; %5 - temporary register (for avx only)
+%macro LOADZ_LUT 4-5
+%if 0
+    pcmpeqb m%4, m%4
+    vgatherdpd m%1, [%2 + xm%5*8], m%4
+    mova xm%5, [%3]
+%else
+    mov     %4d, [%3 +     0]
+    movsd  xm%1, [%2 + %4q*8]
+    mov     %4d, [%3 +     4]
+    movhps xm%1, [%2 + %4q*8]
+%if mmsize == 32
+    mov     %4d, [%3 +     8]
+    movsd  xm%5, [%2 + %4q*8]
+    mov     %4d, [%3 +    12]
+    movhps xm%5, [%2 + %4q*8]
+    vinsertf128 m%1, m%1, xm%5, 1
+%endif ; mmsize
+%endif ; vgather
+%endmacro
+
+INIT_YMM avx
+cglobal fft8, 4, 5, 4, ctx, out, in, stride, tmp
+    mov strideq, [r0 + AVTXContext.revtab]
+
+    LOADZ_LUT 0, inq, strideq +  0, tmp, 3
+    LOADZ_LUT 1, inq, strideq + 16, tmp, 3
+
+    SINGLET_FFT8_AVX m0, m1, m2, m3
+
+    unpcklpd m2, m0, m1
+    unpckhpd m0, m1
+
+    vperm2f128 m1, m2, m0, q0301
+    vperm2f128 m0, m2, q0002
+
+    mova [outq + 0*mmsize], m0
+    mova [outq + 1*mmsize], m1
+
+    ret
diff --git a/libavutil/x86/tx_float_init.c b/libavutil/x86/tx_float_init.c
new file mode 100644
index 0000000000..96114b9080
--- /dev/null
+++ b/libavutil/x86/tx_float_init.c
@@ -0,0 +1,66 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#define TX_FLOAT
+#include "libavutil/tx_priv.h"
+#include "libavutil/x86/cpu.h"
+
+void ff_fft4_sse3(AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_fft8_avx(AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+
+/* Reorders the coefficients of the bottom-most transforms such that even
+ * coefficients appear first in the buffer while odd ones appear last.
+ * Saves on a lot of intra-lane shuffles.
+ * If the 16-point transform is rewritten to be monolithic (instead of 8x4x4)
+ * update the len check to cover it. */
+static void revtab_avx(int *revtab, int n, int inv, int offs, int len)
+{
+    len >>= 1;
+    if (len <= (8 >> 1)) {
+        for (int j = 0; j < len; j++) {
+            int k1 = -split_radix_permutation(offs + j*2 + 0, n, inv) & (n - 1);
+            int k2 = -split_radix_permutation(offs + j*2 + 1, n, inv) & (n - 1);
+            revtab[k1] = offs + j;
+            revtab[k2] = offs + j + len;
+        }
+        return;
+    }
+    revtab_avx(revtab, n, inv, offs                          , len >> 0);
+    revtab_avx(revtab, n, inv, offs +              (len >> 0), len >> 1);
+    revtab_avx(revtab, n, inv, offs + (len >> 0) + (len >> 1), len >> 1);
+}
+
+av_cold void ff_tx_init_float_x86(AVTXContext *s, av_tx_fn *tx)
+{
+    int cpu_flags = av_get_cpu_flags();
+
+    if (EXTERNAL_AVX_FAST(cpu_flags)) {
+        if (s->n == 1 && s->m == 8) {
+            revtab_avx(s->revtab, s->m, s->inv, 0, s->m);
+            *tx = ff_fft8_avx;
+            return;
+        }
+    }
+
+    if (EXTERNAL_SSE3(cpu_flags)) {
+        if (s->n == 1 && s->m == 4) {
+            *tx = ff_fft4_sse3;
+            return;
+        }
+    }
+}