From patchwork Mon Apr 19 20:20:34 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lynne <dev@lynne.ee>
X-Patchwork-Id: 27106
Date: Mon, 19 Apr 2021 22:20:34 +0200 (CEST)
From: Lynne <dev@lynne.ee>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Message-ID: <MYfmfS_--3-2@lynne.ee>
In-Reply-To: <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
References: <MYfmSp7--3-2@lynne.ee> <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 01/11] lavu/tx: minor code style improvements
 and additional comments
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: +m7m8Lg55W85

Patch attached.
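
For readers following the comments this patch adds to
ff_tx_gen_ptwo_inplace_revtab_idx(), here is a minimal standalone
sketch of the underlying idea: an in-place permutation has to enter
every closed loop (cycle) exactly once. This is not the code from
tx.c. The names find_cycle_leaders() and permute_in_place() are made
up for illustration, and the sketch picks the smallest index of each
cycle as its entry point instead of scanning a list of already
stored indices.

```c
#include <assert.h>

/*
 * Hypothetical helper: stores one entry point per cycle of the
 * permutation perm[0..n-1] into leaders[] (room for n entries) and
 * returns how many were stored. Fixed points need no work and are skipped.
 */
int find_cycle_leaders(const int *perm, int n, int *leaders)
{
    int nb = 0;

    assert(perm && leaders && n >= 0);

    for (int src = 0; src < n; src++) {
        int is_smallest = 1;

        if (perm[src] == src)
            continue;

        /* src leads its cycle only if no smaller index lies on it */
        for (int j = perm[src]; j != src; j = perm[j]) {
            if (j < src) {
                is_smallest = 0;
                break;
            }
        }

        if (is_smallest)
            leaders[nb++] = src;
    }

    return nb;
}

/*
 * Hypothetical helper: moves data[i] to data[perm[i]] for every i,
 * in place, walking each cycle exactly once from its leader.
 */
void permute_in_place(int *data, const int *perm, const int *leaders, int nb)
{
    for (int l = 0; l < nb; l++) {
        const int src = leaders[l];
        int pos   = src;
        int carry = data[src];

        do {
            const int dst = perm[pos];
            const int tmp = data[dst];
            data[dst] = carry; /* drop the carried value at its target */
            carry     = tmp;   /* pick up the value that was displaced */
            pos       = dst;
        } while (pos != src);
    }
}
```

With the 8-entry bit-reversal table {0, 4, 2, 6, 1, 5, 3, 7} the two
cycles are (1 4) and (3 6), so only indices 1 and 3 should be stored.
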
Subject: [PATCH 01/11] lavu/tx: minor code style improvements and additional
 comments
---
 libavutil/tx.c      | 17 +++++++++++++-
 libavutil/tx.h      |  2 ++
 libavutil/tx_priv.h | 57 ++++++++++++++++++++++++---------------------
 3 files changed, 48 insertions(+), 28 deletions(-)

diff --git a/libavutil/tx.c b/libavutil/tx.c
index 1161df3285..05d4de30cc 100644
--- a/libavutil/tx.c
+++ b/libavutil/tx.c
@@ -30,7 +30,7 @@ int ff_tx_type_is_mdct(enum AVTXType type)
     }
 }
 
-/* Calculates the modular multiplicative inverse, not fast, replace */
+/* Calculates the modular multiplicative inverse */
 static av_always_inline int mulinv(int n, int m)
 {
     n = n % m;
@@ -91,6 +91,17 @@ int ff_tx_gen_compound_mapping(AVTXContext *s)
     return 0;
 }
 
+static inline int split_radix_permutation(int i, int m, int inverse)
+{
+    m >>= 1;
+    if (m <= 1)
+        return i & 1;
+    if (!(i & m))
+        return (split_radix_permutation(i, m, inverse) << 1);
+    m >>= 1;
+    return (split_radix_permutation(i, m, inverse) << 2) + 1 - 2*(!(i & m) ^ inverse);
+}
+
 int ff_tx_gen_ptwo_revtab(AVTXContext *s, int invert_lookup)
 {
     const int m = s->m, inv = s->inv;
@@ -117,6 +128,7 @@ int ff_tx_gen_ptwo_inplace_revtab_idx(AVTXContext *s)
     if (!(s->inplace_idx = av_malloc(s->m*sizeof(*s->inplace_idx))))
         return AVERROR(ENOMEM);
 
+    /* The first coefficient is always already in-place */
     for (int src = 1; src < s->m; src++) {
         int dst = s->revtab[src];
         int found = 0;
@@ -124,6 +136,9 @@ int ff_tx_gen_ptwo_inplace_revtab_idx(AVTXContext *s)
         if (dst <= src)
             continue;
 
+        /* This just checks if a closed loop has been encountered before,
+         * and if so, skips it, since to fully permute a loop we must only
+         * enter it once. */
         do {
             for (int j = 0; j < nb_inplace_idx; j++) {
                 if (dst == s->inplace_idx[j]) {
diff --git a/libavutil/tx.h b/libavutil/tx.h
index bfc0c7f2a3..fccded8bc3 100644
--- a/libavutil/tx.h
+++ b/libavutil/tx.h
@@ -49,9 +49,11 @@ enum AVTXType {
      * float. Length is the frame size, not the window size (which is 2x frame)
      * For forward transforms, the stride specifies the spacing between each
      * sample in the output array in bytes. The input must be a flat array.
+     *
      * For inverse transforms, the stride specifies the spacing between each
      * sample in the input array in bytes. The output will be a flat array.
      * Stride must be a non-zero multiple of sizeof(float).
+     *
      * NOTE: the inverse transform is half-length, meaning the output will not
      * contain redundant data. This is what most codecs work with.
      */
diff --git a/libavutil/tx_priv.h b/libavutil/tx_priv.h
index e2f4314a4f..10d7ea3ade 100644
--- a/libavutil/tx_priv.h
+++ b/libavutil/tx_priv.h
@@ -20,9 +20,7 @@
 #define AVUTIL_TX_PRIV_H
 
 #include "tx.h"
-#include <stddef.h>
 #include "thread.h"
-#include "mem.h"
 #include "mem_internal.h"
 #include "avassert.h"
 #include "attributes.h"
@@ -48,12 +46,14 @@ typedef void FFTComplex;
 
 #if defined(TX_FLOAT) || defined(TX_DOUBLE)
 
-#define CMUL(dre, dim, are, aim, bre, bim) do {                                \
+#define CMUL(dre, dim, are, aim, bre, bim)                                     \
+    do {                                                                       \
         (dre) = (are) * (bre) - (aim) * (bim);                                 \
         (dim) = (are) * (bim) + (aim) * (bre);                                 \
     } while (0)
 
-#define SMUL(dre, dim, are, aim, bre, bim) do {                                \
+#define SMUL(dre, dim, are, aim, bre, bim)                                     \
+    do {                                                                       \
         (dre) = (are) * (bre) - (aim) * (bim);                                 \
         (dim) = (are) * (bim) - (aim) * (bre);                                 \
     } while (0)
@@ -66,7 +66,8 @@ typedef void FFTComplex;
 #elif defined(TX_INT32)
 
 /* Properly rounds the result */
-#define CMUL(dre, dim, are, aim, bre, bim) do {                                \
+#define CMUL(dre, dim, are, aim, bre, bim)                                     \
+    do {                                                                       \
         int64_t accu;                                                          \
         (accu)  = (int64_t)(bre) * (are);                                      \
         (accu) -= (int64_t)(bim) * (aim);                                      \
@@ -76,7 +77,8 @@ typedef void FFTComplex;
         (dim)   = (int)(((accu) + 0x40000000) >> 31);                          \
     } while (0)
 
-#define SMUL(dre, dim, are, aim, bre, bim) do {                                \
+#define SMUL(dre, dim, are, aim, bre, bim)                                     \
+    do {                                                                       \
         int64_t accu;                                                          \
         (accu)  = (int64_t)(bre) * (are);                                      \
         (accu) -= (int64_t)(bim) * (aim);                                      \
@@ -93,7 +95,8 @@ typedef void FFTComplex;
 
 #endif
 
-#define BF(x, y, a, b) do {                                                    \
+#define BF(x, y, a, b)                                                         \
+    do {                                                                       \
         x = (a) - (b);                                                         \
         y = (a) + (b);                                                         \
     } while (0)
@@ -101,7 +104,7 @@ typedef void FFTComplex;
 #define CMUL3(c, a, b)                                                         \
     CMUL((c).re, (c).im, (a).re, (a).im, (b).re, (b).im)
 
-#define COSTABLE(size) \
+#define COSTABLE(size)                                                         \
     DECLARE_ALIGNED(32, FFTSample, TX_NAME(ff_cos_##size))[size/2]
 
 /* Used by asm, reorder with care */
@@ -114,35 +117,35 @@ struct AVTXContext {
     double scale;       /* Scale */
 
     FFTComplex *exptab; /* MDCT exptab */
-    FFTComplex *tmp;    /* Temporary buffer needed for all compound transforms */
+    FFTComplex    *tmp; /* Temporary buffer needed for all compound transforms */
     int        *pfatab; /* Input/Output mapping for compound transforms */
     int        *revtab; /* Input mapping for power of two transforms */
     int   *inplace_idx; /* Required indices to revtab for in-place transforms */
 };
 
-/* Shared functions */
+/* Checks if type is an MDCT */
 int ff_tx_type_is_mdct(enum AVTXType type);
+
+/*
+ * Generates the PFA permutation table into AVTXContext->pfatab. The end table
+ * is appended to the start table.
+ */
 int ff_tx_gen_compound_mapping(AVTXContext *s);
+
+/*
+ * Generates a standard-ish (slightly modified) Split-Radix revtab into
+ * AVTXContext->revtab
+ */
 int ff_tx_gen_ptwo_revtab(AVTXContext *s, int invert_lookup);
+
+/*
+ * Generates an index into AVTXContext->inplace_idx that, if followed in the
+ * specific order, allows the revtab to be done in-place. AVTXContext->revtab
+ * must already exist.
+ */
 int ff_tx_gen_ptwo_inplace_revtab_idx(AVTXContext *s);
 
-/* Also used by SIMD init */
-static inline int split_radix_permutation(int i, int n, int inverse)
-{
-    int m;
-    if (n <= 2)
-        return i & 1;
-    m = n >> 1;
-    if (!(i & m))
-        return split_radix_permutation(i, m, inverse)*2;
-    m >>= 1;
-    if (inverse == !(i & m))
-        return split_radix_permutation(i, m, inverse)*4 + 1;
-    else
-        return split_radix_permutation(i, m, inverse)*4 - 1;
-}
-
-/* Templated functions */
+/* Templated init functions */
 int ff_tx_init_mdct_fft_float(AVTXContext *s, av_tx_fn *tx,
                               enum AVTXType type, int inv, int len,
                               const void *scale, uint64_t flags);

From patchwork Mon Apr 19 20:21:36 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lynne <dev@lynne.ee>
X-Patchwork-Id: 27104
Date: Mon, 19 Apr 2021 22:21:36 +0200 (CEST)
From: Lynne <dev@lynne.ee>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Message-ID: <MYfmuUU--3-2@lynne.ee>
In-Reply-To: <MYfmfS_--3-2@lynne.ee-MYfmiW7--3-2>
References: <MYfmSp7--3-2@lynne.ee> <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
 <MYfmfS_--3-2@lynne.ee-MYfmiW7--3-2>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 02/11] lavu/tx: refactor power-of-two FFT
X-TUID: bxRXnUSRYtKS

This commit refactors the power-of-two FFT. The transform gets
faster, all tables are halved in size, the code becomes much
smaller on all systems, and initialization becomes much faster.
It also removes the big/small pass split: on modern systems the
"big" pass is always faster, and even on older machines there
is no measurable speed difference.

Patch attached.
Subject: [PATCH 02/11] lavu/tx: refactor power-of-two FFT

This commit refactors the power-of-two FFT. The transform gets
faster, all tables are halved in size, and the code becomes much
smaller on all systems.
It also removes the big/small pass split: on modern systems the
"big" pass is always faster, and even on older machines there
is no measurable speed difference.
---
 libavutil/tx_priv.h     |   2 +-
 libavutil/tx_template.c | 164 +++++++++++++++++++---------------------
 2 files changed, 79 insertions(+), 87 deletions(-)

diff --git a/libavutil/tx_priv.h b/libavutil/tx_priv.h
index 10d7ea3ade..0b40234355 100644
--- a/libavutil/tx_priv.h
+++ b/libavutil/tx_priv.h
@@ -105,7 +105,7 @@ typedef void FFTComplex;
     CMUL((c).re, (c).im, (a).re, (a).im, (b).re, (b).im)
 
 #define COSTABLE(size)                                                         \
-    DECLARE_ALIGNED(32, FFTSample, TX_NAME(ff_cos_##size))[size/2]
+    DECLARE_ALIGNED(32, FFTSample, TX_NAME(ff_cos_##size))[size/4 + 1]
 
 /* Used by asm, reorder with care */
 struct AVTXContext {
diff --git a/libavutil/tx_template.c b/libavutil/tx_template.c
index a436f426d2..f78e7abfb1 100644
--- a/libavutil/tx_template.c
+++ b/libavutil/tx_template.c
@@ -1,6 +1,8 @@
 /*
- * Copyright (c) 2019 Lynne <dev@lynne.ee>
+ * Copyright (c) Lynne
+ *
  * Power of two FFT:
+ * Copyright (c) Lynne
  * Copyright (c) 2008 Loren Merritt
  * Copyright (c) 2002 Fabrice Bellard
  * Partly based on libdjbfft by D. J. Bernstein
@@ -65,10 +67,11 @@ static av_always_inline void init_cos_tabs_idx(int index)
     int m = 1 << index;
     double freq = 2*M_PI/m;
     FFTSample *tab = cos_tabs[index];
-    for(int i = 0; i <= m/4; i++)
-        tab[i] = RESCALE(cos(i*freq));
-    for(int i = 1; i < m/4; i++)
-        tab[m/2 - i] = tab[i];
+
+    for (int i = 0; i < m/4; i++)
+        *tab++ = RESCALE(cos(i*freq));
+
+    *tab = 0;
 }
 
 #define INIT_FF_COS_TABS_FUNC(index, size)                                     \
@@ -214,76 +217,60 @@ static av_always_inline void fft15(FFTComplex *out, FFTComplex *in,
     fft5_m3(out, tmp + 10, stride);
 }
 
-#define BUTTERFLIES(a0,a1,a2,a3) {\
-    BF(t3, t5, t5, t1);\
-    BF(a2.re, a0.re, a0.re, t5);\
-    BF(a3.im, a1.im, a1.im, t3);\
-    BF(t4, t6, t2, t6);\
-    BF(a3.re, a1.re, a1.re, t4);\
-    BF(a2.im, a0.im, a0.im, t6);\
-}
-
-// force loading all the inputs before storing any.
-// this is slightly slower for small data, but avoids store->load aliasing
-// for addresses separated by large powers of 2.
-#define BUTTERFLIES_BIG(a0,a1,a2,a3) {\
-    FFTSample r0=a0.re, i0=a0.im, r1=a1.re, i1=a1.im;\
-    BF(t3, t5, t5, t1);\
-    BF(a2.re, a0.re, r0, t5);\
-    BF(a3.im, a1.im, i1, t3);\
-    BF(t4, t6, t2, t6);\
-    BF(a3.re, a1.re, r1, t4);\
-    BF(a2.im, a0.im, i0, t6);\
-}
-
-#define TRANSFORM(a0,a1,a2,a3,wre,wim) {\
-    CMUL(t1, t2, a2.re, a2.im, wre, -wim);\
-    CMUL(t5, t6, a3.re, a3.im, wre,  wim);\
-    BUTTERFLIES(a0,a1,a2,a3)\
-}
-
-#define TRANSFORM_ZERO(a0,a1,a2,a3) {\
-    t1 = a2.re;\
-    t2 = a2.im;\
-    t5 = a3.re;\
-    t6 = a3.im;\
-    BUTTERFLIES(a0,a1,a2,a3)\
-}
+#define BUTTERFLIES(a0,a1,a2,a3)               \
+    do {                                       \
+        r0=a0.re;                              \
+        i0=a0.im;                              \
+        r1=a1.re;                              \
+        i1=a1.im;                              \
+        BF(t3, t5, t5, t1);                    \
+        BF(a2.re, a0.re, r0, t5);              \
+        BF(a3.im, a1.im, i1, t3);              \
+        BF(t4, t6, t2, t6);                    \
+        BF(a3.re, a1.re, r1, t4);              \
+        BF(a2.im, a0.im, i0, t6);              \
+    } while (0)
+
+#define TRANSFORM(a0,a1,a2,a3,wre,wim)         \
+    do {                                       \
+        CMUL(t1, t2, a2.re, a2.im, wre, -wim); \
+        CMUL(t5, t6, a3.re, a3.im, wre,  wim); \
+        BUTTERFLIES(a0, a1, a2, a3);           \
+    } while (0)
 
 /* z[0...8n-1], w[1...2n-1] */
-#define PASS(name)\
-static void name(FFTComplex *z, const FFTSample *wre, unsigned int n)\
-{\
-    FFTSample t1, t2, t3, t4, t5, t6;\
-    int o1 = 2*n;\
-    int o2 = 4*n;\
-    int o3 = 6*n;\
-    const FFTSample *wim = wre+o1;\
-    n--;\
-\
-    TRANSFORM_ZERO(z[0],z[o1],z[o2],z[o3]);\
-    TRANSFORM(z[1],z[o1+1],z[o2+1],z[o3+1],wre[1],wim[-1]);\
-    do {\
-        z += 2;\
-        wre += 2;\
-        wim -= 2;\
-        TRANSFORM(z[0],z[o1],z[o2],z[o3],wre[0],wim[0]);\
-        TRANSFORM(z[1],z[o1+1],z[o2+1],z[o3+1],wre[1],wim[-1]);\
-    } while(--n);\
+static void split_radix_combine(FFTComplex *z, const FFTSample *cos, int n)
+{
+    int o1 = 2*n;
+    int o2 = 4*n;
+    int o3 = 6*n;
+    const FFTSample *wim = cos + o1 - 7;
+    FFTSample t1, t2, t3, t4, t5, t6, r0, i0, r1, i1;
+
+    for (int i = 0; i < n; i += 4) {
+        TRANSFORM(z[0], z[o1 + 0], z[o2 + 0], z[o3 + 0], cos[0], wim[7]);
+        TRANSFORM(z[2], z[o1 + 2], z[o2 + 2], z[o3 + 2], cos[2], wim[5]);
+        TRANSFORM(z[4], z[o1 + 4], z[o2 + 4], z[o3 + 4], cos[4], wim[3]);
+        TRANSFORM(z[6], z[o1 + 6], z[o2 + 6], z[o3 + 6], cos[6], wim[1]);
+
+        TRANSFORM(z[1], z[o1 + 1], z[o2 + 1], z[o3 + 1], cos[1], wim[6]);
+        TRANSFORM(z[3], z[o1 + 3], z[o2 + 3], z[o3 + 3], cos[3], wim[4]);
+        TRANSFORM(z[5], z[o1 + 5], z[o2 + 5], z[o3 + 5], cos[5], wim[2]);
+        TRANSFORM(z[7], z[o1 + 7], z[o2 + 7], z[o3 + 7], cos[7], wim[0]);
+
+        z   += 2*4;
+        cos += 2*4;
+        wim -= 2*4;
+    }
 }
 
-PASS(pass)
-#undef BUTTERFLIES
-#define BUTTERFLIES BUTTERFLIES_BIG
-PASS(pass_big)
-
-#define DECL_FFT(n,n2,n4)\
-static void fft##n(FFTComplex *z)\
-{\
-    fft##n2(z);\
-    fft##n4(z+n4*2);\
-    fft##n4(z+n4*3);\
-    pass(z,TX_NAME(ff_cos_##n),n4/2);\
+#define DECL_FFT(n, n2, n4)                            \
+static void fft##n(FFTComplex *z)                      \
+{                                                      \
+    fft##n2(z);                                        \
+    fft##n4(z + n4*2);                                 \
+    fft##n4(z + n4*3);                                 \
+    split_radix_combine(z, TX_NAME(ff_cos_##n), n4/2); \
 }
 
 static void fft2(FFTComplex *z)
@@ -310,7 +297,7 @@ static void fft4(FFTComplex *z)
 
 static void fft8(FFTComplex *z)
 {
-    FFTSample t1, t2, t3, t4, t5, t6;
+    FFTSample t1, t2, t3, t4, t5, t6, r0, i0, r1, i1;
 
     fft4(z);
 
@@ -319,24 +306,30 @@ static void fft8(FFTComplex *z)
     BF(t5, z[7].re, z[6].re, -z[7].re);
     BF(t6, z[7].im, z[6].im, -z[7].im);
 
-    BUTTERFLIES(z[0],z[2],z[4],z[6]);
-    TRANSFORM(z[1],z[3],z[5],z[7],RESCALE(M_SQRT1_2),RESCALE(M_SQRT1_2));
+    BUTTERFLIES(z[0], z[2], z[4], z[6]);
+    TRANSFORM(z[1], z[3], z[5], z[7], RESCALE(M_SQRT1_2), RESCALE(M_SQRT1_2));
 }
 
 static void fft16(FFTComplex *z)
 {
-    FFTSample t1, t2, t3, t4, t5, t6;
+    FFTSample t1, t2, t3, t4, t5, t6, r0, i0, r1, i1;
     FFTSample cos_16_1 = TX_NAME(ff_cos_16)[1];
+    FFTSample cos_16_2 = TX_NAME(ff_cos_16)[2];
     FFTSample cos_16_3 = TX_NAME(ff_cos_16)[3];
 
-    fft8(z);
-    fft4(z+8);
-    fft4(z+12);
+    fft8(z +  0);
+    fft4(z +  8);
+    fft4(z + 12);
+
+    t1 = z[ 8].re;
+    t2 = z[ 8].im;
+    t5 = z[12].re;
+    t6 = z[12].im;
+    BUTTERFLIES(z[0], z[4], z[8], z[12]);
 
-    TRANSFORM_ZERO(z[0],z[4],z[8],z[12]);
-    TRANSFORM(z[2],z[6],z[10],z[14],RESCALE(M_SQRT1_2),RESCALE(M_SQRT1_2));
-    TRANSFORM(z[1],z[5],z[9],z[13],cos_16_1,cos_16_3);
-    TRANSFORM(z[3],z[7],z[11],z[15],cos_16_3,cos_16_1);
+    TRANSFORM(z[ 2], z[ 6], z[10], z[14], cos_16_2, cos_16_2);
+    TRANSFORM(z[ 1], z[ 5], z[ 9], z[13], cos_16_1, cos_16_3);
+    TRANSFORM(z[ 3], z[ 7], z[11], z[15], cos_16_3, cos_16_1);
 }
 
 DECL_FFT(32,16,8)
@@ -344,7 +337,6 @@ DECL_FFT(64,32,16)
 DECL_FFT(128,64,32)
 DECL_FFT(256,128,64)
 DECL_FFT(512,256,128)
-#define pass pass_big
 DECL_FFT(1024,512,256)
 DECL_FFT(2048,1024,512)
 DECL_FFT(4096,2048,1024)
@@ -386,8 +378,8 @@ DECL_COMP_FFT(3)
 DECL_COMP_FFT(5)
 DECL_COMP_FFT(15)
 
-static void monolithic_fft(AVTXContext *s, void *_out, void *_in,
-                           ptrdiff_t stride)
+static void split_radix_fft(AVTXContext *s, void *_out, void *_in,
+                            ptrdiff_t stride)
 {
     FFTComplex *in = _in;
     FFTComplex *out = _out;
@@ -730,7 +722,7 @@ int TX_NAME(ff_tx_init_mdct_fft)(AVTXContext *s, av_tx_fn *tx,
                   n == 5 ? inv ? compound_imdct_5xM  : compound_mdct_5xM :
                            inv ? compound_imdct_15xM : compound_mdct_15xM;
     } else { /* Direct transform case */
-        *tx = monolithic_fft;
+        *tx = split_radix_fft;
         if (is_mdct)
             *tx = inv ? monolithic_imdct : monolithic_mdct;
     }

From patchwork Mon Apr 19 20:22:22 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lynne <dev@lynne.ee>
X-Patchwork-Id: 27103
Date: Mon, 19 Apr 2021 22:22:22 +0200 (CEST)
From: Lynne <dev@lynne.ee>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Message-ID: <MYfn4l3--3-2@lynne.ee>
In-Reply-To: <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
References: <MYfmSp7--3-2@lynne.ee> <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 03/11] lavu/tx: add a 7-point FFT and (i)MDCT
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: EWA4kkOZf0Eq

Patch attached.
Subject: [PATCH 03/11] lavu/tx: add a 7-point FFT and (i)MDCT
---
 libavutil/tx_template.c | 126 ++++++++++++++++++++++++++++++++++++----
 1 file changed, 116 insertions(+), 10 deletions(-)

diff --git a/libavutil/tx_template.c b/libavutil/tx_template.c
index f78e7abfb1..2946c039be 100644
--- a/libavutil/tx_template.c
+++ b/libavutil/tx_template.c
@@ -40,6 +40,7 @@ COSTABLE(32768);
 COSTABLE(65536);
 COSTABLE(131072);
 DECLARE_ALIGNED(32, FFTComplex, TX_NAME(ff_cos_53))[4];
+DECLARE_ALIGNED(32, FFTComplex, TX_NAME(ff_cos_7))[3];
 
 static FFTSample * const cos_tabs[18] = {
     NULL,
@@ -103,9 +104,16 @@ static av_cold void ff_init_53_tabs(void)
     TX_NAME(ff_cos_53)[3] = (FFTComplex){ RESCALE(cos(2 * M_PI / 10)), RESCALE(sin(2 * M_PI / 10)) };
 }
 
+static av_cold void ff_init_7_tabs(void)
+{
+    TX_NAME(ff_cos_7)[0] = (FFTComplex){ RESCALE(cos(2 * M_PI /  7)), RESCALE(sin(2 * M_PI /  7)) };
+    TX_NAME(ff_cos_7)[1] = (FFTComplex){ RESCALE(sin(2 * M_PI / 28)), RESCALE(cos(2 * M_PI / 28)) };
+    TX_NAME(ff_cos_7)[2] = (FFTComplex){ RESCALE(cos(2 * M_PI / 14)), RESCALE(sin(2 * M_PI / 14)) };
+}
+
 static CosTabsInitOnce cos_tabs_init_once[] = {
     { ff_init_53_tabs, AV_ONCE_INIT },
-    { NULL },
+    { ff_init_7_tabs, AV_ONCE_INIT },
     { NULL },
     { NULL },
     { init_cos_tabs_16, AV_ONCE_INIT },
@@ -204,6 +212,93 @@ DECL_FFT5(fft5_m1,  0,  6, 12,  3,  9)
 DECL_FFT5(fft5_m2, 10,  1,  7, 13,  4)
 DECL_FFT5(fft5_m3,  5, 11,  2,  8, 14)
 
+static av_always_inline void fft7(FFTComplex *out, FFTComplex *in,
+                                  ptrdiff_t stride)
+{
+    FFTComplex t[6], z[3];
+    const FFTComplex *tab = TX_NAME(ff_cos_7);
+#ifdef TX_INT32
+    int64_t mtmp[12];
+#endif
+
+    BF(t[1].re, t[0].re, in[1].re, in[6].re);
+    BF(t[1].im, t[0].im, in[1].im, in[6].im);
+    BF(t[3].re, t[2].re, in[2].re, in[5].re);
+    BF(t[3].im, t[2].im, in[2].im, in[5].im);
+    BF(t[5].re, t[4].re, in[3].re, in[4].re);
+    BF(t[5].im, t[4].im, in[3].im, in[4].im);
+
+    out[0*stride].re = in[0].re + t[0].re + t[2].re + t[4].re;
+    out[0*stride].im = in[0].im + t[0].im + t[2].im + t[4].im;
+
+#ifdef TX_INT32 /* NOTE: it's possible to do this with 16 mults but 72 adds */
+    mtmp[ 0] = ((int64_t)tab[0].re)*t[0].re - ((int64_t)tab[2].re)*t[4].re;
+    mtmp[ 1] = ((int64_t)tab[0].re)*t[4].re - ((int64_t)tab[1].re)*t[0].re;
+    mtmp[ 2] = ((int64_t)tab[0].re)*t[2].re - ((int64_t)tab[2].re)*t[0].re;
+    mtmp[ 3] = ((int64_t)tab[0].re)*t[0].im - ((int64_t)tab[1].re)*t[2].im;
+    mtmp[ 4] = ((int64_t)tab[0].re)*t[4].im - ((int64_t)tab[1].re)*t[0].im;
+    mtmp[ 5] = ((int64_t)tab[0].re)*t[2].im - ((int64_t)tab[2].re)*t[0].im;
+
+    mtmp[ 6] = ((int64_t)tab[2].im)*t[1].im + ((int64_t)tab[1].im)*t[5].im;
+    mtmp[ 7] = ((int64_t)tab[0].im)*t[5].im + ((int64_t)tab[2].im)*t[3].im;
+    mtmp[ 8] = ((int64_t)tab[2].im)*t[5].im + ((int64_t)tab[1].im)*t[3].im;
+    mtmp[ 9] = ((int64_t)tab[0].im)*t[1].re + ((int64_t)tab[1].im)*t[3].re;
+    mtmp[10] = ((int64_t)tab[2].im)*t[3].re + ((int64_t)tab[0].im)*t[5].re;
+    mtmp[11] = ((int64_t)tab[2].im)*t[1].re + ((int64_t)tab[1].im)*t[5].re;
+
+    z[0].re = (int32_t)(mtmp[ 0] - ((int64_t)tab[1].re)*t[2].re + 0x40000000 >> 31);
+    z[1].re = (int32_t)(mtmp[ 1] - ((int64_t)tab[2].re)*t[2].re + 0x40000000 >> 31);
+    z[2].re = (int32_t)(mtmp[ 2] - ((int64_t)tab[1].re)*t[4].re + 0x40000000 >> 31);
+    z[0].im = (int32_t)(mtmp[ 3] - ((int64_t)tab[2].re)*t[4].im + 0x40000000 >> 31);
+    z[1].im = (int32_t)(mtmp[ 4] - ((int64_t)tab[2].re)*t[2].im + 0x40000000 >> 31);
+    z[2].im = (int32_t)(mtmp[ 5] - ((int64_t)tab[1].re)*t[4].im + 0x40000000 >> 31);
+
+    t[0].re = (int32_t)(mtmp[ 6] - ((int64_t)tab[0].im)*t[3].im + 0x40000000 >> 31);
+    t[2].re = (int32_t)(mtmp[ 7] - ((int64_t)tab[1].im)*t[1].im + 0x40000000 >> 31);
+    t[4].re = (int32_t)(mtmp[ 8] + ((int64_t)tab[0].im)*t[1].im + 0x40000000 >> 31);
+    t[0].im = (int32_t)(mtmp[ 9] + ((int64_t)tab[2].im)*t[5].re + 0x40000000 >> 31);
+    t[2].im = (int32_t)(mtmp[10] - ((int64_t)tab[1].im)*t[1].re + 0x40000000 >> 31);
+    t[4].im = (int32_t)(mtmp[11] - ((int64_t)tab[0].im)*t[3].re + 0x40000000 >> 31);
+#else
+    z[0].re = tab[0].re*t[0].re - tab[2].re*t[4].re - tab[1].re*t[2].re;
+    z[1].re = tab[0].re*t[4].re - tab[1].re*t[0].re - tab[2].re*t[2].re;
+    z[2].re = tab[0].re*t[2].re - tab[2].re*t[0].re - tab[1].re*t[4].re;
+    z[0].im = tab[0].re*t[0].im - tab[1].re*t[2].im - tab[2].re*t[4].im;
+    z[1].im = tab[0].re*t[4].im - tab[1].re*t[0].im - tab[2].re*t[2].im;
+    z[2].im = tab[0].re*t[2].im - tab[2].re*t[0].im - tab[1].re*t[4].im;
+
+    /* It's possible to do t[4].re and t[0].im with 2 multiplies only by
+     * multiplying the sum of all with the average of the twiddles */
+
+    t[0].re = tab[2].im*t[1].im + tab[1].im*t[5].im - tab[0].im*t[3].im;
+    t[2].re = tab[0].im*t[5].im + tab[2].im*t[3].im - tab[1].im*t[1].im;
+    t[4].re = tab[2].im*t[5].im + tab[1].im*t[3].im + tab[0].im*t[1].im;
+    t[0].im = tab[0].im*t[1].re + tab[1].im*t[3].re + tab[2].im*t[5].re;
+    t[2].im = tab[2].im*t[3].re + tab[0].im*t[5].re - tab[1].im*t[1].re;
+    t[4].im = tab[2].im*t[1].re + tab[1].im*t[5].re - tab[0].im*t[3].re;
+#endif
+
+    BF(t[1].re, z[0].re, z[0].re, t[4].re);
+    BF(t[3].re, z[1].re, z[1].re, t[2].re);
+    BF(t[5].re, z[2].re, z[2].re, t[0].re);
+    BF(t[1].im, z[0].im, z[0].im, t[0].im);
+    BF(t[3].im, z[1].im, z[1].im, t[2].im);
+    BF(t[5].im, z[2].im, z[2].im, t[4].im);
+
+    out[1*stride].re = in[0].re + z[0].re;
+    out[1*stride].im = in[0].im + t[1].im;
+    out[2*stride].re = in[0].re + t[3].re;
+    out[2*stride].im = in[0].im + z[1].im;
+    out[3*stride].re = in[0].re + z[2].re;
+    out[3*stride].im = in[0].im + t[5].im;
+    out[4*stride].re = in[0].re + t[5].re;
+    out[4*stride].im = in[0].im + z[2].im;
+    out[5*stride].re = in[0].re + z[1].re;
+    out[5*stride].im = in[0].im + t[3].im;
+    out[6*stride].re = in[0].re + t[1].re;
+    out[6*stride].im = in[0].im + z[0].im;
+}
+
 static av_always_inline void fft15(FFTComplex *out, FFTComplex *in,
                                    ptrdiff_t stride)
 {
@@ -376,6 +471,7 @@ static void compound_fft_##N##xM(AVTXContext *s, void *_out,                   \
 
 DECL_COMP_FFT(3)
 DECL_COMP_FFT(5)
+DECL_COMP_FFT(7)
 DECL_COMP_FFT(15)
 
 static void split_radix_fft(AVTXContext *s, void *_out, void *_in,
@@ -473,6 +569,7 @@ static void compound_imdct_##N##xM(AVTXContext *s, void *_dst, void *_src,     \
 
 DECL_COMP_IMDCT(3)
 DECL_COMP_IMDCT(5)
+DECL_COMP_IMDCT(7)
 DECL_COMP_IMDCT(15)
 
 #define DECL_COMP_MDCT(N)                                                      \
@@ -521,6 +618,7 @@ static void compound_mdct_##N##xM(AVTXContext *s, void *_dst, void *_src,      \
 
 DECL_COMP_MDCT(3)
 DECL_COMP_MDCT(5)
+DECL_COMP_MDCT(7)
 DECL_COMP_MDCT(15)
 
 static void monolithic_imdct(AVTXContext *s, void *_dst, void *_src,
@@ -675,6 +773,7 @@ int TX_NAME(ff_tx_init_mdct_fft)(AVTXContext *s, av_tx_fn *tx,
         SRC /= FACTOR;                                                         \
     }
     CHECK_FACTOR(n, 15, len)
+    CHECK_FACTOR(n,  7, len)
     CHECK_FACTOR(n,  5, len)
     CHECK_FACTOR(n,  3, len)
 #undef CHECK_FACTOR
@@ -714,22 +813,29 @@ int TX_NAME(ff_tx_init_mdct_fft)(AVTXContext *s, av_tx_fn *tx,
             return err;
         if (!(s->tmp = av_malloc(n*m*sizeof(*s->tmp))))
             return AVERROR(ENOMEM);
-        *tx = n == 3 ? compound_fft_3xM :
-              n == 5 ? compound_fft_5xM :
-                       compound_fft_15xM;
-        if (is_mdct)
-            *tx = n == 3 ? inv ? compound_imdct_3xM  : compound_mdct_3xM :
-                  n == 5 ? inv ? compound_imdct_5xM  : compound_mdct_5xM :
-                           inv ? compound_imdct_15xM : compound_mdct_15xM;
+        if (!(m & (m - 1))) {
+            *tx = n == 3 ? compound_fft_3xM :
+                  n == 5 ? compound_fft_5xM :
+                  n == 7 ? compound_fft_7xM :
+                           compound_fft_15xM;
+            if (is_mdct)
+                *tx = n == 3 ? inv ? compound_imdct_3xM  : compound_mdct_3xM :
+                      n == 5 ? inv ? compound_imdct_5xM  : compound_mdct_5xM :
+                      n == 7 ? inv ? compound_imdct_7xM  : compound_mdct_7xM :
+                               inv ? compound_imdct_15xM : compound_mdct_15xM;
+        }
     } else { /* Direct transform case */
         *tx = split_radix_fft;
         if (is_mdct)
             *tx = inv ? monolithic_imdct : monolithic_mdct;
     }
 
-    if (n != 1)
+    if (n == 3 || n == 5 || n == 15)
         init_cos_tabs(0);
-    if (m != 1) {
+    else if (n == 7)
+        init_cos_tabs(1);
+
+    if (m != 1 && !(m & (m - 1))) {
         if ((err = ff_tx_gen_ptwo_revtab(s, n == 1 && !is_mdct && !(flags & AV_TX_INPLACE))))
             return err;
         if (flags & AV_TX_INPLACE) {

From patchwork Mon Apr 19 20:22:59 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lynne <dev@lynne.ee>
X-Patchwork-Id: 27107
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a6b:5014:0:0:0:0:0 with SMTP id e20csp837868iob;
        Mon, 19 Apr 2021 13:23:09 -0700 (PDT)
X-Google-Smtp-Source: 
 ABdhPJxKChZs3XIr5CEYBU8H3vRVkHLiVtg6u+D6S1f+7pIdjTrOrEpXfkS4+tLDD/u5luANf2H3
X-Received: by 2002:a05:6402:447:: with SMTP id
 p7mr27779370edw.89.1618863789571;
        Mon, 19 Apr 2021 13:23:09 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1618863789; cv=none;
        d=google.com; s=arc-20160816;
        b=l4bVEwaAXjC3kQ4in7oY8D9bZCJYs5CCESnTysAI/r/pf3Y7BQVoo9B+IF3TwNC57h
         mOt0FTop4214xR7BFAEVUP/56eCvWLva2NmOR1ZPzIrw8qVjvhZ/Y8kCHQmxdYi/WKR0
         5REUQjXrcfVSI50eIB6Pl2RXBxgwDuJ8lxxkqZH33SaCU1rcbYPGNVClysTKzv6I63dp
         lFQ9K7SCYeABHMmGaDR2xXnghGOYHlttRL/DIkylb+rryHiiBm+eyyvJEY3v1CHWfyn+
         qiFoUURfzdUaa4KRcjCKyXAidA2iiOAsq01AbBT9xLxF0iKdNAT1RvAbZZuuUv3JRgRr
         hwUg==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:reply-to:list-subscribe:list-help:list-post
         :list-archive:list-unsubscribe:list-id:precedence:subject
         :mime-version:references:in-reply-to:message-id:to:from:date
         :dkim-signature:delivered-to;
        bh=4khF0qGhgsLjBlKC1qNZ6UdqhAuBKfruKm4EAlYsxks=;
        b=nQy1GcapOom98bEtrTs7sS9hC9fBf4MQI++nuMnIlc3jHfadWaxd23G3uIvT6pWWDA
         H6WcbtUudsySe2d9bTLj6lsCpSFNCB1iiz1BOECHgLAlvwVBg4oky9XzcQMQhBW3pb9Y
         ZXnVnCRpIXlz6zuWUKtchtGys3h2dHiirZFiaPjvd1OuRtrj1nqlcmhiyfuWZDxhYfTn
         N2xzvN1oXsyyz4bBVxMw/EYGYPmEDgHRQYnkg2ZIu8NdQuaJ40Fc3FjuW6AZCHvIgJLh
         WFx3QNpFVcWxEgvC5e1MxrRQmFnqeG57HxkbojIajqQVe09b1SRvfWU+hKwRpXuGfbFM
         YOSg==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b=MPuDK16Q;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 a15si13630479edr.346.2021.04.19.13.23.09;
        Mon, 19 Apr 2021 13:23:09 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b=MPuDK16Q;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id D16FD689A04;
	Mon, 19 Apr 2021 23:23:06 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from w4.tutanota.de (w4.tutanota.de [81.3.6.165])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 1AFD26804B9
 for <ffmpeg-devel@ffmpeg.org>; Mon, 19 Apr 2021 23:22:59 +0300 (EEST)
Received: from w3.tutanota.de (unknown [192.168.1.164])
 by w4.tutanota.de (Postfix) with ESMTP id 6B81010602D8
 for <ffmpeg-devel@ffmpeg.org>; Mon, 19 Apr 2021 20:22:59 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; t=1618863779;
 s=s1; d=lynne.ee;
 h=From:From:To:To:Subject:Subject:Content-Description:Content-ID:Content-Type:Content-Type:Content-Transfer-Encoding:Cc:Date:Date:In-Reply-To:In-Reply-To:MIME-Version:MIME-Version:Message-ID:Message-ID:Reply-To:References:References:Sender;
 bh=L9OJzcRgYXbdgsuSYnSVg32CIc6QGCs0m9Zn8Gnre10=;
 b=MPuDK16QShIY9OiC5RfwYFgSSJSXuXciUfMIBbpqVEgKiC+/lHkGWxs7CYy6OOdS
 S3z0NXepLznet7uJQMljZEliu/YsRW7iB/WoAB/mNljL0dT88WtIUoVgwNIpOZw/k7S
 6UCpLMHZytE3BHVQCV88cbl3s/CDr9P/haWif3UCv8jpzrsRYXSm3x4dhCtb2OP58VD
 tdyzBmeztcGXnZulp16xS3yy5L43fLErvet58CvO+HNr0qBS2oQ9MnViH/veXzKX9kf
 X9416YPG/fH7IdHa0f0ur0gNPyP2MGQLRtQNd2chOk86rHmZ4yltDwFzZ2nrBnofWVJ
 oGyys1V22g==
Date: Mon, 19 Apr 2021 22:22:59 +0200 (CEST)
From: Lynne <dev@lynne.ee>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Message-ID: <MYfnDm_--3-2@lynne.ee>
In-Reply-To: <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
References: <MYfmSp7--3-2@lynne.ee> <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 04/11] lavu/tx: add a 9-point FFT and (i)MDCT
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: R1VLlXqGeDeX

Patch attached.
Subject: [PATCH 04/11] lavu/tx: add a 9-point FFT and (i)MDCT
---
 libavutil/tx_template.c | 144 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 143 insertions(+), 1 deletion(-)
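
Note for reviewers (illustration only, not part of the patch): the new
CHECK_FACTOR(n, 9, len) line sits before the checks for 7, 5 and 3. Assuming
the macro keeps the first factor that divides the length, that order is what
lets a length such as 36 split as 9 x 4 rather than 3 x 12. The sketch below
is a simplified model of that selection plus the !(m & (m - 1)) power-of-two
test used in the init code. LenSplit and split_len are hypothetical names,
and the real init function performs further checks that are not visible in
this diff.

```c
#include <assert.h>

typedef struct LenSplit {
    int n;      /* odd factor handled by a dedicated kernel, or 1 if none */
    int m;      /* what remains of the length after dividing out n */
    int m_pow2; /* nonzero if m is a power of two (1 counts as 2^0) */
} LenSplit;

/* Assumption: the first factor in this list that divides len is taken,
 * mirroring the order of the CHECK_FACTOR lines after this patch. */
static LenSplit split_len(int len)
{
    static const int factors[] = { 15, 9, 7, 5, 3 };
    LenSplit s = { 1, len, 0 };

    assert(len > 0);
    for (int i = 0; i < 5; i++) {
        if (len % factors[i] == 0) {
            s.n = factors[i];
            s.m = len / factors[i];
            break;
        }
    }
    /* Same bit trick as in the patch: a power of two has a single set bit. */
    s.m_pow2 = !(s.m & (s.m - 1));
    return s;
}
```

Under these assumptions, split_len(36) yields n = 9 and m = 4, while
split_len(45) yields n = 15 and m = 3, where m is not a power of two and the
compound function pointers would be left unassigned by the guarded block.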

diff --git a/libavutil/tx_template.c b/libavutil/tx_template.c
index 2946c039be..b3532c1c5e 100644
--- a/libavutil/tx_template.c
+++ b/libavutil/tx_template.c
@@ -41,6 +41,7 @@ COSTABLE(65536);
 COSTABLE(131072);
 DECLARE_ALIGNED(32, FFTComplex, TX_NAME(ff_cos_53))[4];
 DECLARE_ALIGNED(32, FFTComplex, TX_NAME(ff_cos_7))[3];
+DECLARE_ALIGNED(32, FFTComplex, TX_NAME(ff_cos_9))[4];
 
 static FFTSample * const cos_tabs[18] = {
     NULL,
@@ -111,10 +112,19 @@ static av_cold void ff_init_7_tabs(void)
     TX_NAME(ff_cos_7)[2] = (FFTComplex){ RESCALE(cos(2 * M_PI / 14)), RESCALE(sin(2 * M_PI / 14)) };
 }
 
+static av_cold void ff_init_9_tabs(void)
+{
+    TX_NAME(ff_cos_9)[0] = (FFTComplex){ RESCALE(cos(2 * M_PI /  3)), RESCALE( sin(2 * M_PI /  3)) };
+    TX_NAME(ff_cos_9)[1] = (FFTComplex){ RESCALE(cos(2 * M_PI /  9)), RESCALE( sin(2 * M_PI /  9)) };
+    TX_NAME(ff_cos_9)[2] = (FFTComplex){ RESCALE(cos(2 * M_PI / 36)), RESCALE( sin(2 * M_PI / 36)) };
+    TX_NAME(ff_cos_9)[3] = (FFTComplex){ TX_NAME(ff_cos_9)[1].re + TX_NAME(ff_cos_9)[2].im,
+                                         TX_NAME(ff_cos_9)[1].im - TX_NAME(ff_cos_9)[2].re };
+}
+
 static CosTabsInitOnce cos_tabs_init_once[] = {
     { ff_init_53_tabs, AV_ONCE_INIT },
     { ff_init_7_tabs, AV_ONCE_INIT },
-    { NULL },
+    { ff_init_9_tabs, AV_ONCE_INIT },
     { NULL },
     { init_cos_tabs_16, AV_ONCE_INIT },
     { init_cos_tabs_32, AV_ONCE_INIT },
@@ -299,6 +309,130 @@ static av_always_inline void fft7(FFTComplex *out, FFTComplex *in,
     out[6*stride].im = in[0].im + z[0].im;
 }
 
+static av_always_inline void fft9(FFTComplex *out, FFTComplex *in,
+                                  ptrdiff_t stride)
+{
+    const FFTComplex *tab = TX_NAME(ff_cos_9);
+    FFTComplex t[16], w[4], x[5], y[5], z[2];
+#ifdef TX_INT32
+    int64_t mtmp[12];
+#endif
+
+    BF(t[1].re, t[0].re, in[1].re, in[8].re);
+    BF(t[1].im, t[0].im, in[1].im, in[8].im);
+    BF(t[3].re, t[2].re, in[2].re, in[7].re);
+    BF(t[3].im, t[2].im, in[2].im, in[7].im);
+    BF(t[5].re, t[4].re, in[3].re, in[6].re);
+    BF(t[5].im, t[4].im, in[3].im, in[6].im);
+    BF(t[7].re, t[6].re, in[4].re, in[5].re);
+    BF(t[7].im, t[6].im, in[4].im, in[5].im);
+
+    w[0].re = t[0].re - t[6].re;
+    w[0].im = t[0].im - t[6].im;
+    w[1].re = t[2].re - t[6].re;
+    w[1].im = t[2].im - t[6].im;
+    w[2].re = t[1].re - t[7].re;
+    w[2].im = t[1].im - t[7].im;
+    w[3].re = t[3].re + t[7].re;
+    w[3].im = t[3].im + t[7].im;
+
+    z[0].re = in[0].re + t[4].re;
+    z[0].im = in[0].im + t[4].im;
+
+    z[1].re = t[0].re + t[2].re + t[6].re;
+    z[1].im = t[0].im + t[2].im + t[6].im;
+
+    out[0*stride].re = z[0].re + z[1].re;
+    out[0*stride].im = z[0].im + z[1].im;
+
+#ifdef TX_INT32
+    mtmp[0] = t[1].re - t[3].re + t[7].re;
+    mtmp[1] = t[1].im - t[3].im + t[7].im;
+
+    y[3].re = (int32_t)(((int64_t)tab[0].im)*mtmp[0] + 0x40000000 >> 31);
+    y[3].im = (int32_t)(((int64_t)tab[0].im)*mtmp[1] + 0x40000000 >> 31);
+
+    mtmp[0] = (int32_t)(((int64_t)tab[0].re)*z[1].re + 0x40000000 >> 31);
+    mtmp[1] = (int32_t)(((int64_t)tab[0].re)*z[1].im + 0x40000000 >> 31);
+    mtmp[2] = (int32_t)(((int64_t)tab[0].re)*t[4].re + 0x40000000 >> 31);
+    mtmp[3] = (int32_t)(((int64_t)tab[0].re)*t[4].im + 0x40000000 >> 31);
+
+    x[3].re = z[0].re  + (int32_t)mtmp[0];
+    x[3].im = z[0].im  + (int32_t)mtmp[1];
+    z[0].re = in[0].re + (int32_t)mtmp[2];
+    z[0].im = in[0].im + (int32_t)mtmp[3];
+
+    mtmp[0] = ((int64_t)tab[1].re)*w[0].re;
+    mtmp[1] = ((int64_t)tab[1].re)*w[0].im;
+    mtmp[2] = ((int64_t)tab[2].im)*w[0].re;
+    mtmp[3] = ((int64_t)tab[2].im)*w[0].im;
+    mtmp[4] = ((int64_t)tab[1].im)*w[2].re;
+    mtmp[5] = ((int64_t)tab[1].im)*w[2].im;
+    mtmp[6] = ((int64_t)tab[2].re)*w[2].re;
+    mtmp[7] = ((int64_t)tab[2].re)*w[2].im;
+
+    x[1].re = (int32_t)(mtmp[0] + ((int64_t)tab[2].im)*w[1].re + 0x40000000 >> 31);
+    x[1].im = (int32_t)(mtmp[1] + ((int64_t)tab[2].im)*w[1].im + 0x40000000 >> 31);
+    x[2].re = (int32_t)(mtmp[2] - ((int64_t)tab[3].re)*w[1].re + 0x40000000 >> 31);
+    x[2].im = (int32_t)(mtmp[3] - ((int64_t)tab[3].re)*w[1].im + 0x40000000 >> 31);
+    y[1].re = (int32_t)(mtmp[4] + ((int64_t)tab[2].re)*w[3].re + 0x40000000 >> 31);
+    y[1].im = (int32_t)(mtmp[5] + ((int64_t)tab[2].re)*w[3].im + 0x40000000 >> 31);
+    y[2].re = (int32_t)(mtmp[6] - ((int64_t)tab[3].im)*w[3].re + 0x40000000 >> 31);
+    y[2].im = (int32_t)(mtmp[7] - ((int64_t)tab[3].im)*w[3].im + 0x40000000 >> 31);
+
+    y[0].re = (int32_t)(((int64_t)tab[0].im)*t[5].re + 0x40000000 >> 31);
+    y[0].im = (int32_t)(((int64_t)tab[0].im)*t[5].im + 0x40000000 >> 31);
+
+#else
+    y[3].re = tab[0].im*(t[1].re - t[3].re + t[7].re);
+    y[3].im = tab[0].im*(t[1].im - t[3].im + t[7].im);
+
+    x[3].re = z[0].re  + tab[0].re*z[1].re;
+    x[3].im = z[0].im  + tab[0].re*z[1].im;
+    z[0].re = in[0].re + tab[0].re*t[4].re;
+    z[0].im = in[0].im + tab[0].re*t[4].im;
+
+    x[1].re = tab[1].re*w[0].re + tab[2].im*w[1].re;
+    x[1].im = tab[1].re*w[0].im + tab[2].im*w[1].im;
+    x[2].re = tab[2].im*w[0].re - tab[3].re*w[1].re;
+    x[2].im = tab[2].im*w[0].im - tab[3].re*w[1].im;
+    y[1].re = tab[1].im*w[2].re + tab[2].re*w[3].re;
+    y[1].im = tab[1].im*w[2].im + tab[2].re*w[3].im;
+    y[2].re = tab[2].re*w[2].re - tab[3].im*w[3].re;
+    y[2].im = tab[2].re*w[2].im - tab[3].im*w[3].im;
+
+    y[0].re = tab[0].im*t[5].re;
+    y[0].im = tab[0].im*t[5].im;
+#endif
+
+    x[4].re = x[1].re + x[2].re;
+    x[4].im = x[1].im + x[2].im;
+
+    y[4].re = y[1].re - y[2].re;
+    y[4].im = y[1].im - y[2].im;
+    x[1].re = z[0].re + x[1].re;
+    x[1].im = z[0].im + x[1].im;
+    y[1].re = y[0].re + y[1].re;
+    y[1].im = y[0].im + y[1].im;
+    x[2].re = z[0].re + x[2].re;
+    x[2].im = z[0].im + x[2].im;
+    y[2].re = y[2].re - y[0].re;
+    y[2].im = y[2].im - y[0].im;
+    x[4].re = z[0].re - x[4].re;
+    x[4].im = z[0].im - x[4].im;
+    y[4].re = y[0].re - y[4].re;
+    y[4].im = y[0].im - y[4].im;
+
+    out[1*stride] = (FFTComplex){ x[1].re + y[1].im, x[1].im - y[1].re };
+    out[2*stride] = (FFTComplex){ x[2].re + y[2].im, x[2].im - y[2].re };
+    out[3*stride] = (FFTComplex){ x[3].re + y[3].im, x[3].im - y[3].re };
+    out[4*stride] = (FFTComplex){ x[4].re + y[4].im, x[4].im - y[4].re };
+    out[5*stride] = (FFTComplex){ x[4].re - y[4].im, x[4].im + y[4].re };
+    out[6*stride] = (FFTComplex){ x[3].re - y[3].im, x[3].im + y[3].re };
+    out[7*stride] = (FFTComplex){ x[2].re - y[2].im, x[2].im + y[2].re };
+    out[8*stride] = (FFTComplex){ x[1].re - y[1].im, x[1].im + y[1].re };
+}
+
 static av_always_inline void fft15(FFTComplex *out, FFTComplex *in,
                                    ptrdiff_t stride)
 {
@@ -472,6 +606,7 @@ static void compound_fft_##N##xM(AVTXContext *s, void *_out,                   \
 DECL_COMP_FFT(3)
 DECL_COMP_FFT(5)
 DECL_COMP_FFT(7)
+DECL_COMP_FFT(9)
 DECL_COMP_FFT(15)
 
 static void split_radix_fft(AVTXContext *s, void *_out, void *_in,
@@ -570,6 +705,7 @@ static void compound_imdct_##N##xM(AVTXContext *s, void *_dst, void *_src,     \
 DECL_COMP_IMDCT(3)
 DECL_COMP_IMDCT(5)
 DECL_COMP_IMDCT(7)
+DECL_COMP_IMDCT(9)
 DECL_COMP_IMDCT(15)
 
 #define DECL_COMP_MDCT(N)                                                      \
@@ -619,6 +755,7 @@ static void compound_mdct_##N##xM(AVTXContext *s, void *_dst, void *_src,      \
 DECL_COMP_MDCT(3)
 DECL_COMP_MDCT(5)
 DECL_COMP_MDCT(7)
+DECL_COMP_MDCT(9)
 DECL_COMP_MDCT(15)
 
 static void monolithic_imdct(AVTXContext *s, void *_dst, void *_src,
@@ -773,6 +910,7 @@ int TX_NAME(ff_tx_init_mdct_fft)(AVTXContext *s, av_tx_fn *tx,
         SRC /= FACTOR;                                                         \
     }
     CHECK_FACTOR(n, 15, len)
+    CHECK_FACTOR(n,  9, len)
     CHECK_FACTOR(n,  7, len)
     CHECK_FACTOR(n,  5, len)
     CHECK_FACTOR(n,  3, len)
@@ -817,11 +955,13 @@ int TX_NAME(ff_tx_init_mdct_fft)(AVTXContext *s, av_tx_fn *tx,
             *tx = n == 3 ? compound_fft_3xM :
                   n == 5 ? compound_fft_5xM :
                   n == 7 ? compound_fft_7xM :
+                  n == 9 ? compound_fft_9xM :
                            compound_fft_15xM;
             if (is_mdct)
                 *tx = n == 3 ? inv ? compound_imdct_3xM  : compound_mdct_3xM :
                       n == 5 ? inv ? compound_imdct_5xM  : compound_mdct_5xM :
                       n == 7 ? inv ? compound_imdct_7xM  : compound_mdct_7xM :
+                      n == 9 ? inv ? compound_imdct_9xM  : compound_mdct_9xM :
                                inv ? compound_imdct_15xM : compound_mdct_15xM;
         }
     } else { /* Direct transform case */
@@ -834,6 +974,8 @@ int TX_NAME(ff_tx_init_mdct_fft)(AVTXContext *s, av_tx_fn *tx,
         init_cos_tabs(0);
     else if (n == 7)
         init_cos_tabs(1);
+    else if (n == 9)
+        init_cos_tabs(2);
 
     if (m != 1 && !(m & (m - 1))) {
         if ((err = ff_tx_gen_ptwo_revtab(s, n == 1 && !is_mdct && !(flags & AV_TX_INPLACE))))

From patchwork Mon Apr 19 20:23:27 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lynne <dev@lynne.ee>
X-Patchwork-Id: 27105
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a6b:5014:0:0:0:0:0 with SMTP id e20csp838251iob;
        Mon, 19 Apr 2021 13:23:37 -0700 (PDT)
X-Google-Smtp-Source: 
 ABdhPJwTgZ7VMm6bZeo0x7v7O2D57F6zdsml/nqoWfp4uHY06jpRKtrpeh+ufwOnafwGEK9+xIcD
X-Received: by 2002:a05:6402:1a:: with SMTP id
 d26mr27746497edu.99.1618863817687;
        Mon, 19 Apr 2021 13:23:37 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1618863817; cv=none;
        d=google.com; s=arc-20160816;
        b=U++d1wAiHRVeLYj6WVl8eF2EufdZ/IrkvuzmAG7gbC/n8LQacGRSNjmzYpiMRm49cY
         AHS+w8AO/b2IHdNp5NWUcYD33ZPXvYcGJ8Qsn7ZAzIb7KVzMGPOzzsK2p+CNvqNKJg7r
         YxeYEhtytupx5B27kK/HoUt7duQw331r+CpVuukxipDy1tqUwjJj+zk3+H7nox0zfrzW
         ed7O0SJOf7bBnE3kd91PuA/D1Nol2UmdmNwtqsvKyxmtQgWoJdPlCoOXX166O9JeABG2
         IDv3reDtMPDNc3n9BT0yHttpHzkU1krsXvAxRTg8irFI54Q9gDU63kSzJDHB4p88/N3t
         VLAg==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:reply-to:list-subscribe:list-help:list-post
         :list-archive:list-unsubscribe:list-id:precedence:subject
         :mime-version:references:in-reply-to:message-id:to:from:date
         :dkim-signature:delivered-to;
        bh=A2MxlJxkHymhbYPSINts1brKMbC1j/K8FQn2MvgQO2c=;
        b=xrPgJ0mhUHFRYCggCj+cG9JbKMC6mkGnD3/sjDY3n0V8zvGOI49wLy6ymEKp/J+BSy
         WtO1GSpZRPQ2PGf38UWOQkCOgGpzdSvde1ew87z+w8ptVU9I9VRxJUomn5DrlayAROvq
         aVSism4KhwLB7JE/hoTBLaDF6y5bNxGklDHJaJ762NwLwavWdjUxCqdIQpvPY1CWJMEw
         5qe1r7Szf2lc8EgP3NwWGJr/6t2MJ9dzL24Tzg6AeyY1mDXN/mdxrNpQFQcTav5I6/27
         qomLR8LntE/dQkmk68xvam3lR8Q86GFNS+44z7yg6v1WdNjNcJV/iLMYUShbZJZxsjBw
         ILZw==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b=Q0rf5bKP;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 g8si13248129ejm.206.2021.04.19.13.23.37;
        Mon, 19 Apr 2021 13:23:37 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b=Q0rf5bKP;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id E131D689A8D;
	Mon, 19 Apr 2021 23:23:34 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from w4.tutanota.de (w4.tutanota.de [81.3.6.165])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 0B6AB6802B3
 for <ffmpeg-devel@ffmpeg.org>; Mon, 19 Apr 2021 23:23:28 +0300 (EEST)
Received: from w3.tutanota.de (unknown [192.168.1.164])
 by w4.tutanota.de (Postfix) with ESMTP id ADE31106015A
 for <ffmpeg-devel@ffmpeg.org>; Mon, 19 Apr 2021 20:23:27 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; t=1618863807;
 s=s1; d=lynne.ee;
 h=From:From:To:To:Subject:Subject:Content-Description:Content-ID:Content-Type:Content-Type:Content-Transfer-Encoding:Cc:Date:Date:In-Reply-To:In-Reply-To:MIME-Version:MIME-Version:Message-ID:Message-ID:Reply-To:References:References:Sender;
 bh=trwWZK/SUFmVwn8f6CZCGWTFjh5mJJFcm2p/nWa5xqM=;
 b=Q0rf5bKP0Uibkc+fqSAd/bltvHtyq1zmtLZdve/Kc58On1LcZ0n/QnOEW+8oQVC5
 Jjd083QSPaK4uQsOiFS2OOXH3gKwJB/THH6Q9sQBAOPb9Qh7rAXKZVRDyJxKX6Nolqo
 1TW/Ys/zLY9ekVgnVScs/ZgsGxuhOGMUq2x3wioV3L2I0eQ9mRbwyNoO26CuBfSxH+C
 2Oq2qTf8KSn7PTlFip5aT/GVtCTDj+iEuXsgxrox7wCeQJKnHkUI49UxXaef7dARwqV
 phWCmimOx8rQI38wuAHabG6qrRwfnu1i2VfhXvisSyO+tdJXdBeOn9nyroVVabs/jJM
 RIYulm4pGA==
Date: Mon, 19 Apr 2021 22:23:27 +0200 (CEST)
From: Lynne <dev@lynne.ee>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Message-ID: <MYfnKjG--3-2@lynne.ee>
In-Reply-To: <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
References: <MYfmSp7--3-2@lynne.ee> <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 05/11] lavu/tx: add unaligned flag to the API
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: ts/xL1pVj7Qa

Patch attached.
Subject: [PATCH 05/11] lavu/tx: add unaligned flag to the API
---
 libavutil/tx.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
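
Note for reviewers (illustration only, not part of the patch): a caller that
cannot guarantee buffer alignment could decide at init time whether to
request the relaxed mode. The sketch below builds without libavutil, so the
flag value is redefined locally from the enum in this patch. SKETCH_ALIGN is
an assumption (the header only says "the maximum required by the CPU
architecture"), and sketch_is_aligned and sketch_pick_flags are hypothetical
helpers.

```c
#include <stddef.h>
#include <stdint.h>

/* Local copy of the value this patch gives AV_TX_UNALIGNED. */
#define SKETCH_TX_UNALIGNED (1ULL << 1)

/* Assumed alignment in bytes; must be a power of two for the mask below. */
#define SKETCH_ALIGN 32

/* Returns nonzero if p is a multiple of align (align is a power of two). */
static int sketch_is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Returns the flags a caller might pass at init time for these buffers:
 * the unaligned flag is requested only when either pointer needs it. */
static uint64_t sketch_pick_flags(const void *in, const void *out)
{
    uint64_t flags = 0;

    if (!sketch_is_aligned(in, SKETCH_ALIGN) ||
        !sketch_is_aligned(out, SKETCH_ALIGN))
        flags |= SKETCH_TX_UNALIGNED;
    return flags;
}
```

Since the flag "may be slower with certain transform types", requesting it
only when a pointer actually fails the check keeps the aligned path
available for well-aligned buffers.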

diff --git a/libavutil/tx.h b/libavutil/tx.h
index fccded8bc3..a3d70644e4 100644
--- a/libavutil/tx.h
+++ b/libavutil/tx.h
@@ -95,7 +95,7 @@ enum AVTXType {
  * @param stride the input or output stride in bytes
  *
  * The out and in arrays must be aligned to the maximum required by the CPU
- * architecture.
+ * architecture unless the AV_TX_UNALIGNED flag was set in av_tx_init().
  * The stride must follow the constraints the transform type has specified.
  */
 typedef void (*av_tx_fn)(AVTXContext *s, void *out, void *in, ptrdiff_t stride);
@@ -110,6 +110,12 @@ enum AVTXFlags {
      * transform types.
      */
     AV_TX_INPLACE = 1ULL << 0,
+
+    /**
+     * Relaxes the alignment requirement for the in and out arrays of
+     * av_tx_fn(). May be slower with certain transform types.
+     */
+    AV_TX_UNALIGNED = 1ULL << 1,
 };
 
 /**
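
Illustration (not part of the patch): a minimal sketch of how a caller might
decide whether to request the unaligned path. The SKETCH_* macros only mirror
the flag values shown in the diff above; pick_tx_flags() and is_aligned() are
hypothetical helpers, and the alignment value is whatever the caller assumes
for its target, since the header only says "the maximum required by the CPU
architecture".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the flag bits from the patch (assumption: a real caller
 * would use AV_TX_INPLACE / AV_TX_UNALIGNED from libavutil/tx.h instead). */
#define SKETCH_TX_INPLACE   (1ULL << 0)
#define SKETCH_TX_UNALIGNED (1ULL << 1)

/* Returns nonzero if p is a multiple of align; align must be a power of two. */
static int is_aligned(const void *p, size_t align)
{
    assert(align && !(align & (align - 1)));
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical helper: ask for the relaxed-alignment path only when one of
 * the two buffers fails the alignment the caller assumes for its target. */
static uint64_t pick_tx_flags(const void *in, const void *out, size_t align)
{
    uint64_t flags = 0;

    if (!is_aligned(in, align) || !is_aligned(out, align))
        flags |= SKETCH_TX_UNALIGNED;

    return flags;
}
```

The returned value would then be passed as the flags argument of av_tx_init().
Keeping the flag off for aligned buffers matters because the patch notes the
unaligned path may be slower with certain transform types.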

From patchwork Mon Apr 19 20:23:56 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lynne <dev@lynne.ee>
X-Patchwork-Id: 27120
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a6b:5014:0:0:0:0:0 with SMTP id e20csp838637iob;
        Mon, 19 Apr 2021 13:24:07 -0700 (PDT)
X-Google-Smtp-Source: 
 ABdhPJzVyqFVONqnt2nPPpcC8uj4uT77gYFCblnBFLLcjIZbvwIsox+EoGt9mDt+pIf8sOzNA925
X-Received: by 2002:a05:6402:3591:: with SMTP id
 y17mr5730773edc.67.1618863846907;
        Mon, 19 Apr 2021 13:24:06 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1618863846; cv=none;
        d=google.com; s=arc-20160816;
        b=lfFTAoHy9XlkStT7c7SHuCesIlBT1I4ZMqwxITUkONFbBrx95mEMNgeoTggk5CGpC3
         Mh+1Ey5uF18Osk62jYsQpgafROYo0t2hiqWY0kVnTlck4gUR7genSebazWW1vDTdSVD3
         DFbAshjexh98ySTwVIN2Vtr+5YZ3HPYvGEtSnlaR1MoKeTVoPej50h4nnUdMa6mWDiyu
         tHRAWYsC72jxxi3hJWeR1BWXpvtuLua6zPhKeIwDPujOZj+ME7EsK05kZ4Oif1OqP+WX
         vrmiQC9ahRgTptCs8dNcs1kSOwSHRe8exyi44WLh5kgMLnoFVhIYbRcqOFIb99zlkrsC
         wP7A==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:reply-to:list-subscribe:list-help:list-post
         :list-archive:list-unsubscribe:list-id:precedence:subject
         :mime-version:references:in-reply-to:message-id:to:from:date
         :dkim-signature:delivered-to;
        bh=mwzLmcpQ1i+AEEbOX1R7fRL229JelDKpwLg7Pvz/RbU=;
        b=l+axqV8l9DZHra/z/j38LMgLwT9SwRNXYtum1n3AwBmsklVf2as0QuRQdVekaCKScs
         zYSVFkgOov1HEQVYYq48thahAqYRwZJ3skP3nAAL0pTVCAPvABZ03KM0hDgsEqzlxn/J
         uUhfr3vaHRtAx7nV1/zPVpiUNArzX67F62eVhiLNpImdh4iX3g90EBSMM6GzUe0BnIhV
         ajUcEfdRHSuu7KdBMA6IZZuuo2qumcpdP2BInFRqHGelUEoRK4riHpVBKKrhnC2267cO
         ctDSftTR2U6LJkMGgIV813Psbr/kL4KPSy7aDl2hr621Xo5JyID1MOU+XZ7w5CNiBEDR
         T3eg==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b=GreaLsDL;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 bs9si12068669edb.595.2021.04.19.13.24.06;
        Mon, 19 Apr 2021 13:24:06 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b=GreaLsDL;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 27FD6680903;
	Mon, 19 Apr 2021 23:24:04 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from w4.tutanota.de (w4.tutanota.de [81.3.6.165])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 46717680903
 for <ffmpeg-devel@ffmpeg.org>; Mon, 19 Apr 2021 23:23:57 +0300 (EEST)
Received: from w3.tutanota.de (unknown [192.168.1.164])
 by w4.tutanota.de (Postfix) with ESMTP id E8D0A1060318
 for <ffmpeg-devel@ffmpeg.org>; Mon, 19 Apr 2021 20:23:56 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; t=1618863836;
 s=s1; d=lynne.ee;
 h=From:From:To:To:Subject:Subject:Content-Description:Content-ID:Content-Type:Content-Type:Content-Transfer-Encoding:Cc:Date:Date:In-Reply-To:In-Reply-To:MIME-Version:MIME-Version:Message-ID:Message-ID:Reply-To:References:References:Sender;
 bh=M/7196AaoRVPY+IIMj6BhxIZlYe3QHay+lZLIUUFlIE=;
 b=GreaLsDLigs1LEWcd3HHez8xbUBMfZexT4mRQQ0kQThy1wCE6ydyfkwZGNBjWByc
 rJAw6bkKi6ou2p8HP3z+7eI1hvxWohCT7RWiQebJbgA5MWLU6MlVEy6Kso2FPfA+zee
 qP3my0CMjUxrWcR3AnlL7H/QE26qc+f0JNrQ4NuvYg3FhVSZy5LJWHt8f/SdRsN+laW
 I+uXXOB0iFXpXmwTcBXnx2/oJVA7hnNrfn0jaxh+lvFSzGzZ9JkvReHYD70ULgyQfSu
 k+SjjwGQyY3ZsASaAe053JuU9KeOcVrZFJAozzWlcC3cCGnBUnTQ4rgQWpAqvpuC0Hz
 jVSqi2a2bg==
Date: Mon, 19 Apr 2021 22:23:56 +0200 (CEST)
From: Lynne <dev@lynne.ee>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Message-ID: <MYfnRrc--3-2@lynne.ee>
In-Reply-To: <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
References: <MYfmSp7--3-2@lynne.ee> <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 06/11] lavu/tx: add full-sized iMDCT
 transform flag
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: IWiN30x9Y22J

Patch attached.
Subject: [PATCH 06/11] lavu/tx: add full-sized iMDCT transform flag
---
 libavutil/tx.h          | 11 ++++++++++-
 libavutil/tx_priv.h     |  4 ++++
 libavutil/tx_template.c | 29 ++++++++++++++++++++++++++++-
 3 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/libavutil/tx.h b/libavutil/tx.h
index a3d70644e4..55173810ee 100644
--- a/libavutil/tx.h
+++ b/libavutil/tx.h
@@ -55,7 +55,8 @@ enum AVTXType {
      * Stride must be a non-zero multiple of sizeof(float).
      *
      * NOTE: the inverse transform is half-length, meaning the output will not
-     * contain redundant data. This is what most codecs work with.
+     * contain redundant data. This is what most codecs work with. To do a full
+     * inverse transform, set the AV_TX_FULL_IMDCT flag on init.
      */
     AV_TX_FLOAT_MDCT = 1,
 
@@ -116,6 +117,14 @@ enum AVTXFlags {
      * May be slower with certain transform types.
      */
     AV_TX_UNALIGNED = 1ULL << 1,
+
+    /**
+     * Performs a full inverse MDCT rather than leaving out samples that can be
+     * derived through symmetry. Requires an output array of 'len' floats,
+     * rather than the usual 'len/2' floats.
+     * Ignored for all transforms but inverse MDCTs.
+     */
+    AV_TX_FULL_IMDCT = 1ULL << 2,
 };
 
 /**
diff --git a/libavutil/tx_priv.h b/libavutil/tx_priv.h
index 0b40234355..1d4245e71b 100644
--- a/libavutil/tx_priv.h
+++ b/libavutil/tx_priv.h
@@ -121,6 +121,10 @@ struct AVTXContext {
     int        *pfatab; /* Input/Output mapping for compound transforms */
     int        *revtab; /* Input mapping for power of two transforms */
     int   *inplace_idx; /* Required indices to revtab for in-place transforms */
+
+    av_tx_fn    top_tx; /* Used for computing transforms derived from other
+                         * transforms, like full-length iMDCTs and RDFTs.
+                         * NOTE: Do NOT use this to mix assembly with C code. */
 };
 
 /* Checks if type is an MDCT */
diff --git a/libavutil/tx_template.c b/libavutil/tx_template.c
index b3532c1c5e..a68a84dcd5 100644
--- a/libavutil/tx_template.c
+++ b/libavutil/tx_template.c
@@ -875,6 +875,24 @@ static void naive_mdct(AVTXContext *s, void *_dst, void *_src,
     }
 }
 
+static void full_imdct_wrapper_fn(AVTXContext *s, void *_dst, void *_src,
+                                  ptrdiff_t stride)
+{
+    int len = s->m*s->n*4;
+    int len2 = len >> 1;
+    int len4 = len >> 2;
+    FFTSample *dst = _dst;
+
+    s->top_tx(s, dst + len4, _src, stride);
+
+    stride /= sizeof(*dst);
+
+    for (int i = 0; i < len4; i++) {
+        dst[            i*stride] = -dst[(len2 - i - 1)*stride];
+        dst[(len - i - 1)*stride] =  dst[(len2 + i + 0)*stride];
+    }
+}
+
 static int gen_mdct_exptab(AVTXContext *s, int len4, double scale)
 {
     const double theta = (scale < 0 ? len4 : 0) + 1.0/8.0;
@@ -942,6 +960,10 @@ int TX_NAME(ff_tx_init_mdct_fft)(AVTXContext *s, av_tx_fn *tx,
         if (is_mdct) {
             s->scale = *((SCALE_TYPE *)scale);
             *tx = inv ? naive_imdct : naive_mdct;
+            if (inv && (flags & AV_TX_FULL_IMDCT)) {
+                s->top_tx = *tx;
+                *tx = full_imdct_wrapper_fn;
+            }
         }
         return 0;
     }
@@ -990,8 +1012,13 @@ int TX_NAME(ff_tx_init_mdct_fft)(AVTXContext *s, av_tx_fn *tx,
             init_cos_tabs(i);
     }
 
-    if (is_mdct)
+    if (is_mdct) {
+        if (inv && (flags & AV_TX_FULL_IMDCT)) {
+            s->top_tx = *tx;
+            *tx = full_imdct_wrapper_fn;
+        }
         return gen_mdct_exptab(s, n*m, *((SCALE_TYPE *)scale));
+    }
 
     return 0;
 }
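
Illustration (not part of the patch): the symmetry step of
full_imdct_wrapper_fn() in isolation. This sketch assumes a contiguous output
(a stride of one element) and that the half-length result has already been
written to dst[len/4 .. len/4 + len/2 - 1], which is where the wrapper points
top_tx. The name mirror_full_imdct() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fill the first and last quarter of a len-sample block from the
 * half-length iMDCT output stored in the middle half of dst. */
static void mirror_full_imdct(float *dst, size_t len)
{
    size_t len2 = len / 2;
    size_t len4 = len / 4;

    assert(len % 4 == 0);

    for (size_t i = 0; i < len4; i++) {
        /* First quarter: negated mirror of the first half of the middle. */
        dst[i]           = -dst[len2 - i - 1];
        /* Last quarter: plain mirror of the second half of the middle. */
        dst[len - i - 1] =  dst[len2 + i];
    }
}
```

For len = 8 and a middle section of 1, 2, 3, 4, the block becomes
-2, -1, 1, 2, 3, 4, 4, 3. This is why the flag requires room for 'len' floats
rather than 'len/2'.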

From patchwork Mon Apr 19 20:24:29 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lynne <dev@lynne.ee>
X-Patchwork-Id: 27119
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a6b:5014:0:0:0:0:0 with SMTP id e20csp839084iob;
        Mon, 19 Apr 2021 13:24:40 -0700 (PDT)
X-Google-Smtp-Source: 
 ABdhPJw9riCLKj/04XUD+dBpsf4/mMWyPLpUtGhgBCGdgaBnuufXxgFiA7mnZblbBIu9mh2Dici7
X-Received: by 2002:a17:906:dc92:: with SMTP id
 cs18mr24538416ejc.27.1618863879901;
        Mon, 19 Apr 2021 13:24:39 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1618863879; cv=none;
        d=google.com; s=arc-20160816;
        b=ZAhmFHEDz6tmwhKEE8Lmv0i95SJ5BnpqbT/qHcWHLjztnOHjFPVexrIyRMS/r8/ssp
         AKBTKLax0M3U6qbg6Wn+B72gw1NxMK11DAXDxZm2SzKDvfGtCYT42IaOhFynM4+dcf6o
         mzxieChgIknPpsudiyo8pmu8PibpMFcX6TKYFDXY6CdqSDH7zAD4A/wYiGjT1eIex7X4
         TQBZ4CR/XHbPWw09HKyO8VPQ9FJKdgxldQlF7/2GgdMTeZwoM5ktu1MqpzPR9rShiIXP
         lkSifwCG/f3wvfn59+vm3rXM/9h6a82w1GXVOBp19ODgOd7j/CyFJLHXvLhwGi7LD4lr
         oudw==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:reply-to:list-subscribe:list-help:list-post
         :list-archive:list-unsubscribe:list-id:precedence:subject
         :mime-version:references:in-reply-to:message-id:to:from:date
         :dkim-signature:delivered-to;
        bh=wBLKIh9KSx+FkjEDSns7XEnT/QFgX40HQbxnYUZF8Ug=;
        b=cLbJp0gWvsyiPo9ejX76IubbY7I2F7ZdUNxU5iimcM5E1ZaQIagZnY/1/jGfFvl6q/
         mv52LjmH1H9ZrRgxQt5aeHlwx2Ru2l02tWjCuuYKdtrrmf5F9jp4e+sp+IsZ7/CeGXpi
         wsUkiasnSRBI4zpigtEOa57s7wTnDnmUI9a9zsDKdVPCNnTh+jJooXYqR2ByZDTAxwV7
         voRQWtcBYdW5mKjcetMCb8oqh9EKCcNonlctLqRakaXX2qMcAn8RWTX2S4R1VP4cruV9
         Wc8YCrEu2igc7frCKF1WDYDN12M1eIOXO0pQ7tlmo5f45Q7RSzsrq0HfudS0gFLZUWTD
         gF/Q==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b=1LHq9Esh;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 cq21si12127123ejc.542.2021.04.19.13.24.39;
        Mon, 19 Apr 2021 13:24:39 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b=1LHq9Esh;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 1CB4D689B09;
	Mon, 19 Apr 2021 23:24:37 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from w4.tutanota.de (w4.tutanota.de [81.3.6.165])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 0DE946802B3
 for <ffmpeg-devel@ffmpeg.org>; Mon, 19 Apr 2021 23:24:30 +0300 (EEST)
Received: from w3.tutanota.de (unknown [192.168.1.164])
 by w4.tutanota.de (Postfix) with ESMTP id B0FD4106015A
 for <ffmpeg-devel@ffmpeg.org>; Mon, 19 Apr 2021 20:24:29 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; t=1618863869;
 s=s1; d=lynne.ee;
 h=From:From:To:To:Subject:Subject:Content-Description:Content-ID:Content-Type:Content-Type:Content-Transfer-Encoding:Cc:Date:Date:In-Reply-To:In-Reply-To:MIME-Version:MIME-Version:Message-ID:Message-ID:Reply-To:References:References:Sender;
 bh=+Ij3qMMtMKxxKvhIx3anJVnGvB0JzrE4EB3GPRF1Thg=;
 b=1LHq9EshpD5Dy9EQfqT2XXdgDykK8uCZz+B3U/Un6m/ge/jj0fOUMEPRdJewYyO3
 QSmplM/IQGLxZUbebi41eYWmQi0CRa7B4E9uPB+3RlH79KNadb7veVoXYgXmCDlPe0b
 EFL6gKWH3LY2ZNE3lwwrCYffzzWyNBRCVM+Y+Uvd+Y6yHh3dru9uALz2KjOhzksRTI9
 q2IY/PI/XAT/6QPxDPKSFv3sDd0+9VU5HcDSItxOauOZDvAToHtZFNP6A4MDWT1hvw8
 YAuBI/4Se9G3WmwFMmVnCl/JETX9jHreBYYubXRGTHUuXGF0pnNMoT8oMyWXboU7sib
 Z/u8JtgY0Q==
Date: Mon, 19 Apr 2021 22:24:29 +0200 (CEST)
From: Lynne <dev@lynne.ee>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Message-ID: <MYfnZr_--3-2@lynne.ee>
In-Reply-To: <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
References: <MYfmSp7--3-2@lynne.ee> <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 07/11] lavu: bump minor and add APIchanges
 entry for the lavu/tx changes
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: Q2Bpj0R1WvDQ

Patch attached.
Subject: [PATCH 07/11] lavu: bump minor and add APIchanges entry for the
 lavu/tx changes
---
 doc/APIchanges      | 3 +++
 libavutil/version.h | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/doc/APIchanges b/doc/APIchanges
index cd3ea3c865..095e9aed99 100644
--- a/doc/APIchanges
+++ b/doc/APIchanges
@@ -14,6 +14,9 @@ libavutil:     2017-10-21
 
 
 API changes, most recent first:
+2021-04-19 - xxxxxxxxxx - lavu 56.74.100 - tx.h
+  Add AV_TX_FULL_IMDCT and AV_TX_UNALIGNED.
+
 2021-04-17 - xxxxxxxxxx - lavu 56.73.100 - frame.h detection_bbox.h
   Add AV_FRAME_DATA_DETECTION_BBOXES
 
diff --git a/libavutil/version.h b/libavutil/version.h
index 658bcd402e..fb6700511b 100644
--- a/libavutil/version.h
+++ b/libavutil/version.h
@@ -79,7 +79,7 @@
  */
 
 #define LIBAVUTIL_VERSION_MAJOR  56
-#define LIBAVUTIL_VERSION_MINOR  73
+#define LIBAVUTIL_VERSION_MINOR  74
 #define LIBAVUTIL_VERSION_MICRO 100
 
 #define LIBAVUTIL_VERSION_INT   AV_VERSION_INT(LIBAVUTIL_VERSION_MAJOR, \

From patchwork Mon Apr 19 20:25:03 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lynne <dev@lynne.ee>
X-Patchwork-Id: 27118
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a6b:5014:0:0:0:0:0 with SMTP id e20csp839426iob;
        Mon, 19 Apr 2021 13:25:09 -0700 (PDT)
X-Google-Smtp-Source: 
 ABdhPJwRLzNBLVBwZksJPE1V5mww5VT/UzejkYQMlEuKOum7QQahvDBRopGu/Ztp8u4NEa1X5kZX
X-Received: by 2002:a17:907:c10:: with SMTP id
 ga16mr9698178ejc.402.1618863908871;
        Mon, 19 Apr 2021 13:25:08 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1618863908; cv=none;
        d=google.com; s=arc-20160816;
        b=be8tkGxe3xDb2kx9T/SddhTT7/GhrliEl3WDxkQjnsMnV6sN4cXZKdIJPiqS0qq66f
         MGHquDTIqHlE8wdQujiGfDD6Mz/ZqkzGiHy0LBaU0ZSBYZlS/QM5/HQet230IBmnUt7W
         tL5FaO2mmcArt9KR9R5lQCUtI6b9F8n+SuoPvWWls2+oAm+TcOhtaky93hTyISS7E8+m
         T/sKUstRBz4Zt8siMq6BOuC9lIIqTFCB6K+CMaABqzyKpN6C2p0NWUB0GGzQ3665EjV1
         f/Me7qDW/lF1N4SA5+mpJEg+WroTBHChqsc/4l8ChajujCnYeZmdAwNKiOALqTpEC6LI
         sbfQ==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:reply-to:list-subscribe:list-help:list-post
         :list-archive:list-unsubscribe:list-id:precedence:subject
         :mime-version:references:in-reply-to:message-id:to:from:date
         :dkim-signature:delivered-to;
        bh=MKso3r9gaRIOk5S9nQGSx9HOJOrrE0jtNB8Ax0Kzhdo=;
        b=Td1O2MeYPN4R0So68yCPwLoYPgvd+xUtxdUvTDG6WDPgnM2oCCyGDYnxmRNa0QPcCw
         8YRrm9v5ICDi5cYSj1n3tzjXOvYuSjyGq4hlRLopAsj2ykOwcwl6N0G0rEVQ67um2B2x
         uevsgaRwfxYSbRaOWbK8IWay+eFPjUQ+0Z6cHAdKO1Hukz7sbeFHaYtq9sKny+d8oj68
         U3oesx3LhyDu9rXo134jcFD+/W7fABh7ych8bIrJ+qFQQeypHyOkEIsGsNfHE9tcxMvy
         q6F3XC9PyciOk7NRssh3HX8wvFzj5WbAiOxz6E4Ea65w/wt6bLwJbQ/8zxbHcVhlj2e/
         uOmA==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b=zugMwxwY;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 dd21si5302853edb.45.2021.04.19.13.25.08;
        Mon, 19 Apr 2021 13:25:08 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b=zugMwxwY;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 3600D689B7A;
	Mon, 19 Apr 2021 23:25:06 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from w4.tutanota.de (w4.tutanota.de [81.3.6.165])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 3BD826806A2
 for <ffmpeg-devel@ffmpeg.org>; Mon, 19 Apr 2021 23:25:03 +0300 (EEST)
Received: from w3.tutanota.de (unknown [192.168.1.164])
 by w4.tutanota.de (Postfix) with ESMTP id A2086106015A
 for <ffmpeg-devel@ffmpeg.org>; Mon, 19 Apr 2021 20:25:03 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; t=1618863903;
 s=s1; d=lynne.ee;
 h=From:From:To:To:Subject:Subject:Content-Description:Content-ID:Content-Type:Content-Type:Content-Transfer-Encoding:Cc:Date:Date:In-Reply-To:In-Reply-To:MIME-Version:MIME-Version:Message-ID:Message-ID:Reply-To:References:References:Sender;
 bh=V0depmjPVa8OwA6u/eLbEoO2uW5IodqVb2QzShLrGfg=;
 b=zugMwxwYaGj0PYkhfeTCrl+O3jJ/d5OrzVnJyVK7po1GKcMzQSoN0e0BJbVMht4V
 buizCDKOSe9mqiz1T1nmGE/aFxgTXtTzxhGEspj7LCWbyVTkGZROp/6w9IMVMj/bnuQ
 Gi0UIeIU3us7fuR4yrjWv0rvnrDMqf11XJbODbYlXo8ZwG6Vz0zShwnr233EchknARb
 gqlnM1Dt0fY8H0tY/VTmJnYGwFX+m5GzyspXccet5BApJ1XlIOpC+Rus6miXLVwz1Ta
 ZIimgdtULvYgoLzYLQW2gOFo/bDQzkqGgq0cdMIooIjqoO+tuddeJ73TjJZEz5GYf4j
 hkcNTYWFWQ==
Date: Mon, 19 Apr 2021 22:25:03 +0200 (CEST)
From: Lynne <dev@lynne.ee>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Message-ID: <MYfnh8S--3-2@lynne.ee>
In-Reply-To: <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
References: <MYfmSp7--3-2@lynne.ee> <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 08/11] lavu/tx: add parity revtab generator
 version
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: QVh4Xa3sWghj

This will be used for SIMD support.
Patch attached.
Subject: [PATCH 08/11] lavu/tx: add parity revtab generator version

This will be used for SIMD support.
---
 libavutil/tx.c      | 49 +++++++++++++++++++++++++++++++++++++++++++++
 libavutil/tx_priv.h | 31 ++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+)

diff --git a/libavutil/tx.c b/libavutil/tx.c
index 05d4de30cc..6d0e854084 100644
--- a/libavutil/tx.c
+++ b/libavutil/tx.c
@@ -158,6 +158,55 @@ int ff_tx_gen_ptwo_inplace_revtab_idx(AVTXContext *s)
     return 0;
 }
 
+static void parity_revtab_generator(int *revtab, int n, int inv, int offset,
+                                    int is_dual, int dual_high, int len,
+                                    int basis, int dual_stride)
+{
+    len >>= 1;
+
+    if (len <= basis) {
+        int k1, k2, *even, *odd, stride;
+
+        is_dual = is_dual && dual_stride;
+        dual_high = is_dual & dual_high;
+        stride = is_dual ? FFMIN(dual_stride, len) : 0;
+
+        even = &revtab[offset + dual_high*(stride - 2*len)];
+        odd  = &even[len + (is_dual && !dual_high)*len + dual_high*len];
+
+        for (int i = 0; i < len; i++) {
+            k1 = -split_radix_permutation(offset + i*2 + 0, n, inv) & (n - 1);
+            k2 = -split_radix_permutation(offset + i*2 + 1, n, inv) & (n - 1);
+            *even++ = k1;
+            *odd++  = k2;
+            if (stride && !((i + 1) % stride)) {
+                even += stride;
+                odd  += stride;
+            }
+        }
+
+        return;
+    }
+
+    parity_revtab_generator(revtab, n, inv, offset,
+                            0, 0, len >> 0, basis, dual_stride);
+    parity_revtab_generator(revtab, n, inv, offset + (len >> 0),
+                            1, 0, len >> 1, basis, dual_stride);
+    parity_revtab_generator(revtab, n, inv, offset + (len >> 0) + (len >> 1),
+                            1, 1, len >> 1, basis, dual_stride);
+}
+
+void ff_tx_gen_split_radix_parity_revtab(int *revtab, int len, int inv,
+                                         int basis, int dual_stride)
+{
+    basis >>= 1;
+    if (len < basis)
+        return;
+    av_assert0(!dual_stride || !(dual_stride & (dual_stride - 1)));
+    av_assert0(dual_stride <= basis);
+    parity_revtab_generator(revtab, len, inv, 0, 0, 0, len, basis, dual_stride);
+}
+
 av_cold void av_tx_uninit(AVTXContext **ctx)
 {
     if (!(*ctx))
diff --git a/libavutil/tx_priv.h b/libavutil/tx_priv.h
index 1d4245e71b..b889f6d3b4 100644
--- a/libavutil/tx_priv.h
+++ b/libavutil/tx_priv.h
@@ -149,6 +149,37 @@ int ff_tx_gen_ptwo_revtab(AVTXContext *s, int invert_lookup);
  */
 int ff_tx_gen_ptwo_inplace_revtab_idx(AVTXContext *s);
 
+/*
+ * This generates a parity-based revtab of length len and direction inv.
+ *
+ * Parity means even and odd complex numbers will be split, i.e. the even
+ * coefficients will come first, after which the odd coefficients will be
+ * placed. For example, a 4-point transform's coefficients after reordering:
+ * z[0].re, z[0].im, z[2].re, z[2].im, z[1].re, z[1].im, z[3].re, z[3].im
+ *
+ * The basis argument is the length of the largest non-composite transform
+ * supported, and also implies that the basis/2 transform is supported as well,
+ * as the split-radix algorithm requires it to be.
+ *
+ * The dual_stride argument indicates that both the basis, as well as the
+ * basis/2 transforms support doing two transforms at once, and the coefficients
+ * will be interleaved between each pair in a split-radix like so (stride == 2):
+ * tx1[0], tx1[2], tx2[0], tx2[2], tx1[1], tx1[3], tx2[1], tx2[3]
+ * A non-zero number switches this on, with the value indicating the stride
+ * (how many values of one transform to put first before switching to the
+ * other). Must be a power of two or 0, and must not exceed basis/2.
+ * Value will be clipped to the transform size, so for a basis of 16 and a
+ * dual_stride of 8, dual 8-point transforms will be laid out as if dual_stride
+ * was set to 4.
+ * Usually you'll set this to half the number of complex numbers that fit in
+ * a single register, or to 0. This allows reusing SSE functions as
+ * dual-transform functions in AVX mode.
+ *
+ * If len is smaller than basis/2, this function does nothing.
+ */
+void ff_tx_gen_split_radix_parity_revtab(int *revtab, int len, int inv,
+                                         int basis, int dual_stride);
+
 /* Templated init functions */
 int ff_tx_init_mdct_fft_float(AVTXContext *s, av_tx_fn *tx,
                               enum AVTXType type, int inv, int len,
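
Illustration (not part of the patch): a sketch of the dual-transform parity
layout described in the tx_priv.h comment above. It reproduces only the
ordering of slots (even coefficients of both transforms in chunks of
dual_stride, then the odd ones) and deliberately leaves out the split-radix
permutation that the real generator applies. The name dual_parity_layout()
and its two output arrays are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* For two len-point transforms, record for each of the 2*len output slots
 * which transform it belongs to (0 or 1) and which coefficient it holds. */
static void dual_parity_layout(int *which, int *index, int len, int stride)
{
    int half = len / 2;
    int pos  = 0;

    assert(len > 0 && len % 2 == 0);
    assert(stride > 0 && !(stride & (stride - 1)));

    /* Clip the stride to the number of even (or odd) coefficients. */
    if (stride > half)
        stride = half;
    assert(half % stride == 0);

    for (int parity = 0; parity < 2; parity++)           /* evens, then odds */
        for (int base = 0; base < half; base += stride)  /* chunk start      */
            for (int t = 0; t < 2; t++)                  /* tx1, then tx2    */
                for (int j = 0; j < stride; j++) {
                    which[pos] = t;
                    index[pos] = 2*(base + j) + parity;
                    pos++;
                }
}
```

With len = 4 and stride = 2 this yields tx1[0], tx1[2], tx2[0], tx2[2],
tx1[1], tx1[3], tx2[1], tx2[3], matching the example in the comment. With
len = 8 and stride = 8 the stride is clipped to 4, as the comment describes
for dual 8-point transforms.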

From patchwork Mon Apr 19 20:25:52 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lynne <dev@lynne.ee>
X-Patchwork-Id: 27123
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a6b:5014:0:0:0:0:0 with SMTP id e20csp839953iob;
        Mon, 19 Apr 2021 13:26:02 -0700 (PDT)
X-Google-Smtp-Source: 
 ABdhPJw8FXDA0S3nbQeMkl6WlCFaEs7531Fi+1p4wFXOh4Pdj4VwyLXi5dKLjm0lXBRFetFxaNWy
X-Received: by 2002:aa7:d617:: with SMTP id c23mr5529663edr.207.1618863962411;
        Mon, 19 Apr 2021 13:26:02 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1618863962; cv=none;
        d=google.com; s=arc-20160816;
        b=nOvwAtiE2+r0naBYo6QUoj7Rn4Bi8Gq7WKdDLoykVunPYcBfU3ft+JE8bvkLZHPunO
         deHiCXtFBhz737nfsmEOlHXT/ixLnMNQP9RDqNDyBPQL/6xTf4xYudXdDYR7jknElQ8R
         6I4iZ89h0IFxyBJza8bYKhDDphSHH7DsmA5wfWm9+t8n+rwhGOuOebO0XPWT8t2fGhLQ
         xPxRfaozJHt8p+w3jAiLD3faRnR2p6soLJhT3Anps5ziCk7LHpZBJarp/2fLq/Fy6AlE
         A7TZpfT9hXXinkSQWEtINtdGDlxu1XFWc6gmFxvAJg/uDADIqi6xyPVVUjw6+x0N+xMa
         XTbA==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:reply-to:list-subscribe:list-help:list-post
         :list-archive:list-unsubscribe:list-id:precedence:subject
         :mime-version:references:in-reply-to:message-id:to:from:date
         :dkim-signature:delivered-to;
        bh=44CTg41UUpJQfRDmmyZDpbdmx7egzx2C7Wk/rmOOVgU=;
        b=aHzk2wWovNANy0aSqHlcZJR/pvFpotIiYyFebKS7eI2hXTY0qqvexpJoUotRLa9Iri
         wgh+YWUQClr/3aKa0j5L+vIXqSQk76fju50oF7dKmb3d5PHPrSMy8trqwF9fs5tRIFLo
         Vu6JboAKOYaObn7VJDXW3cqseB6EzfYuIZmNfa8SbVSWNc0hU5RmlH5lgARXEyalF4qs
         imRVXCmvGnAgTpMJiCV2zy8dKFfdUidA0he2B0SfaLCx6eCWiqThMH/eVrKt5jo6jQ2W
         vJzxVstyiogw79bztp+Dzz1RAHZtxDJ0r89uDicOWqIGBy6yNryVTEoGaF3YRioBSFFO
         VpIg==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b=ZIoh7qLJ;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 v5si11733279ejq.224.2021.04.19.13.26.01;
        Mon, 19 Apr 2021 13:26:02 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b=ZIoh7qLJ;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 4D445689C28;
	Mon, 19 Apr 2021 23:25:59 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from w4.tutanota.de (w4.tutanota.de [81.3.6.165])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 854A968004A
 for <ffmpeg-devel@ffmpeg.org>; Mon, 19 Apr 2021 23:25:52 +0300 (EEST)
Received: from w3.tutanota.de (unknown [192.168.1.164])
 by w4.tutanota.de (Postfix) with ESMTP id 3332B10602D8
 for <ffmpeg-devel@ffmpeg.org>; Mon, 19 Apr 2021 20:25:52 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; t=1618863952;
 s=s1; d=lynne.ee;
 h=From:From:To:To:Subject:Subject:Content-Description:Content-ID:Content-Type:Content-Type:Content-Transfer-Encoding:Cc:Date:Date:In-Reply-To:In-Reply-To:MIME-Version:MIME-Version:Message-ID:Message-ID:Reply-To:References:References:Sender;
 bh=k1719Jp47p4xiIALGZNhcyrHpGftye2swRSSysRnkrA=;
 b=ZIoh7qLJXCdI3sP82t8YJImWPRxwWYYBOsm88lhwrG13jB0U9UrflJBVlbusK4C7
 jqCpFceKQLxsxZwwG/32QCDrg/WTgLYT1sVd2Bqaa49FNYbzIcZGS3KU197b/+ULafo
 g1eWybjSuAMRLdw9bc7AQgNydUWq79i45hasRaM+7G3kilMVPETiy/66ENhJD3ft+o6
 sIvBuQ4/LO8HMKtpFGOsgWlz3UWFZ0C2llGWTYxLpSd9DGaw1Ad9t7MH9rCCHvAJYsv
 7wZZH7UtnmWvl+omGbmnN6MrdM5FKwPnxKlQstM5OFe+tifKMvRA1KUeaysFROerDkS
 KPjMriBl2w==
Date: Mon, 19 Apr 2021 22:25:52 +0200 (CEST)
From: Lynne <dev@lynne.ee>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Message-ID: <MYfnt-I--7-2@lynne.ee>
In-Reply-To: <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
References: <MYfmSp7--3-2@lynne.ee> <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 09/11] checkasm: add av_tx FFT SIMD testing
 code
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: PDte9c2OO4hk

Unfortunately, this required changing the transform code itself, because
the same context has to be reused for both the C and the SIMD versions.
The lookup table had to be duplicated, one copy per version.

Patch attached.
Subject: [PATCH 09/11] checkasm: add av_tx FFT SIMD testing code

Unfortunately, this required changes to the transform code itself, because
checkasm reuses the same context for both the C and SIMD versions. As a result,
the lookup table had to be duplicated, one copy per version.
---
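Reviewer note: the in-place index list is now generated from an explicitly
passed table, and the permutation loop in split_radix_fft walks revtab_c, so
both must refer to the same table. Below is a minimal standalone sketch of that
cycle-following scheme, not the patch's code: it uses plain ints instead of
FFTComplex, a simplified leader search, and the hypothetical helper names
gen_cycle_leaders and permute_inplace.

```c
#include <assert.h>

/* Hypothetical helper: records one entry point ("leader") per cycle of the
 * permutation tab[0..n-1], skipping fixed points. The list is terminated by
 * 0, which works because index 0 is assumed to be a fixed point.
 * leaders[] must have room for n entries. Returns the number of leaders. */
static int gen_cycle_leaders(const int *tab, int n, int *leaders)
{
    int nb = 0;

    assert(n > 0 && tab[0] == 0);

    for (int src = 1; src < n; src++) {
        int is_leader = tab[src] != src; /* fixed points need no work */

        /* src leads its cycle only if it is the smallest index in it */
        for (int dst = tab[src]; is_leader && dst != src; dst = tab[dst])
            if (dst < src)
                is_leader = 0;

        if (is_leader)
            leaders[nb++] = src;
    }

    leaders[nb] = 0; /* terminator */
    return nb;
}

/* Moves every element data[i] to data[tab[i]] without a second buffer, by
 * walking each cycle exactly once, starting from its leader. */
static void permute_inplace(int *data, const int *tab, const int *leaders)
{
    int src;

    while ((src = *leaders++)) {
        int tmp = data[src];
        int dst = tab[src];

        do {
            int swap = data[dst];
            data[dst] = tmp;
            tmp = swap;
            dst = tab[dst];
        } while (dst != src);

        data[dst] = tmp;
    }
}
```

If the leader list were built from one table and the loop walked another, a
cycle could be entered twice or not at all, which is why the generator takes
the table as an argument here.
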
 libavutil/tx.c            |  15 +++---
 libavutil/tx_priv.h       |   5 +-
 libavutil/tx_template.c   |  18 +++----
 tests/checkasm/Makefile   |   1 +
 tests/checkasm/av_tx.c    | 109 ++++++++++++++++++++++++++++++++++++++
 tests/checkasm/checkasm.c |   1 +
 tests/checkasm/checkasm.h |   1 +
 tests/fate/checkasm.mak   |   1 +
 8 files changed, 135 insertions(+), 16 deletions(-)
 create mode 100644 tests/checkasm/av_tx.c

diff --git a/libavutil/tx.c b/libavutil/tx.c
index 6d0e854084..dcfb257899 100644
--- a/libavutil/tx.c
+++ b/libavutil/tx.c
@@ -106,22 +106,24 @@ int ff_tx_gen_ptwo_revtab(AVTXContext *s, int invert_lookup)
 {
     const int m = s->m, inv = s->inv;
 
-    if (!(s->revtab = av_malloc(m*sizeof(*s->revtab))))
+    if (!(s->revtab = av_malloc(s->m*sizeof(*s->revtab))))
+        return AVERROR(ENOMEM);
+    if (!(s->revtab_c = av_malloc(m*sizeof(*s->revtab_c))))
         return AVERROR(ENOMEM);
 
     /* Default */
     for (int i = 0; i < m; i++) {
         int k = -split_radix_permutation(i, m, inv) & (m - 1);
         if (invert_lookup)
-            s->revtab[i] = k;
+            s->revtab[i] = s->revtab_c[i] = k;
         else
-            s->revtab[k] = i;
+            s->revtab[k] = s->revtab_c[k] = i;
     }
 
     return 0;
 }
 
-int ff_tx_gen_ptwo_inplace_revtab_idx(AVTXContext *s)
+int ff_tx_gen_ptwo_inplace_revtab_idx(AVTXContext *s, int *revtab)
 {
     int nb_inplace_idx = 0;
 
@@ -130,7 +132,7 @@ int ff_tx_gen_ptwo_inplace_revtab_idx(AVTXContext *s)
 
     /* The first coefficient is always already in-place */
     for (int src = 1; src < s->m; src++) {
-        int dst = s->revtab[src];
+        int dst = revtab[src];
         int found = 0;
 
         if (dst <= src)
@@ -146,7 +148,7 @@ int ff_tx_gen_ptwo_inplace_revtab_idx(AVTXContext *s)
                     break;
                 }
             }
-            dst = s->revtab[dst];
+            dst = revtab[dst];
         } while (dst != src && !found);
 
         if (!found)
@@ -215,6 +217,7 @@ av_cold void av_tx_uninit(AVTXContext **ctx)
     av_free((*ctx)->pfatab);
     av_free((*ctx)->exptab);
     av_free((*ctx)->revtab);
+    av_free((*ctx)->revtab_c);
     av_free((*ctx)->inplace_idx);
     av_free((*ctx)->tmp);
 
diff --git a/libavutil/tx_priv.h b/libavutil/tx_priv.h
index b889f6d3b4..88589fcbb4 100644
--- a/libavutil/tx_priv.h
+++ b/libavutil/tx_priv.h
@@ -122,6 +122,9 @@ struct AVTXContext {
     int        *revtab; /* Input mapping for power of two transforms */
     int   *inplace_idx; /* Required indices to revtab for in-place transforms */
 
+    int      *revtab_c; /* Revtab for only the C transforms, needed because
+                         * checkasm makes us reuse the same context. */
+
     av_tx_fn    top_tx; /* Used for computing transforms derived from other
                          * transforms, like full-length iMDCTs and RDFTs.
                          * NOTE: Do NOT use this to mix assembly with C code. */
@@ -147,7 +150,7 @@ int ff_tx_gen_ptwo_revtab(AVTXContext *s, int invert_lookup);
  * specific order,  allows the revtab to be done in-place. AVTXContext->revtab
  * must already exist.
  */
-int ff_tx_gen_ptwo_inplace_revtab_idx(AVTXContext *s);
+int ff_tx_gen_ptwo_inplace_revtab_idx(AVTXContext *s, int *revtab);
 
 /*
  * This generates a parity-based revtab of length len and direction inv.
diff --git a/libavutil/tx_template.c b/libavutil/tx_template.c
index a68a84dcd5..cad66a8bc0 100644
--- a/libavutil/tx_template.c
+++ b/libavutil/tx_template.c
@@ -593,7 +593,7 @@ static void compound_fft_##N##xM(AVTXContext *s, void *_out,                   \
     for (int i = 0; i < m; i++) {                                              \
         for (int j = 0; j < N; j++)                                            \
             fft##N##in[j] = in[in_map[i*N + j]];                               \
-        fft##N(s->tmp + s->revtab[i], fft##N##in, m);                          \
+        fft##N(s->tmp + s->revtab_c[i], fft##N##in, m);                        \
     }                                                                          \
                                                                                \
     for (int i = 0; i < N; i++)                                                \
@@ -624,16 +624,16 @@ static void split_radix_fft(AVTXContext *s, void *_out, void *_in,
 
         do {
             tmp = out[src];
-            dst = s->revtab[src];
+            dst = s->revtab_c[src];
             do {
                 FFSWAP(FFTComplex, tmp, out[dst]);
-                dst = s->revtab[dst];
+                dst = s->revtab_c[dst];
             } while (dst != src); /* Can be > as well, but is less predictable */
             out[dst] = tmp;
         } while ((src = *inplace_idx++));
     } else {
         for (int i = 0; i < m; i++)
-            out[i] = in[s->revtab[i]];
+            out[i] = in[s->revtab_c[i]];
     }
 
     fft_dispatch[mb](out);
@@ -685,7 +685,7 @@ static void compound_imdct_##N##xM(AVTXContext *s, void *_dst, void *_src,     \
             FFTComplex tmp = { in2[-k*stride], in1[k*stride] };                \
             CMUL3(fft##N##in[j], tmp, exp[k >> 1]);                            \
         }                                                                      \
-        fft##N(s->tmp + s->revtab[i], fft##N##in, m);                          \
+        fft##N(s->tmp + s->revtab_c[i], fft##N##in, m);                        \
     }                                                                          \
                                                                                \
     for (int i = 0; i < N; i++)                                                \
@@ -733,7 +733,7 @@ static void compound_mdct_##N##xM(AVTXContext *s, void *_dst, void *_src,      \
             CMUL(fft##N##in[j].im, fft##N##in[j].re, tmp.re, tmp.im,           \
                  exp[k >> 1].re, exp[k >> 1].im);                              \
         }                                                                      \
-        fft##N(s->tmp + s->revtab[i], fft##N##in, m);                          \
+        fft##N(s->tmp + s->revtab_c[i], fft##N##in, m);                        \
     }                                                                          \
                                                                                \
     for (int i = 0; i < N; i++)                                                \
@@ -772,7 +772,7 @@ static void monolithic_imdct(AVTXContext *s, void *_dst, void *_src,
 
     for (int i = 0; i < m; i++) {
         FFTComplex tmp = { in2[-2*i*stride], in1[2*i*stride] };
-        CMUL3(z[s->revtab[i]], tmp, exp[i]);
+        CMUL3(z[s->revtab_c[i]], tmp, exp[i]);
     }
 
     fftp(z);
@@ -806,7 +806,7 @@ static void monolithic_mdct(AVTXContext *s, void *_dst, void *_src,
             tmp.re = FOLD(-src[ len4 + k], -src[5*len4 - 1 - k]);
             tmp.im = FOLD( src[-len4 + k], -src[1*len3 - 1 - k]);
         }
-        CMUL(z[s->revtab[i]].im, z[s->revtab[i]].re, tmp.re, tmp.im,
+        CMUL(z[s->revtab_c[i]].im, z[s->revtab_c[i]].re, tmp.re, tmp.im,
              exp[i].re, exp[i].im);
     }
 
@@ -1005,7 +1005,7 @@ int TX_NAME(ff_tx_init_mdct_fft)(AVTXContext *s, av_tx_fn *tx,
         if (flags & AV_TX_INPLACE) {
             if (is_mdct) /* In-place MDCTs are not supported yet */
                 return AVERROR(ENOSYS);
-            if ((err = ff_tx_gen_ptwo_inplace_revtab_idx(s)))
+            if ((err = ff_tx_gen_ptwo_inplace_revtab_idx(s, s->revtab_c)))
                 return err;
         }
         for (int i = 4; i <= av_log2(m); i++)
diff --git a/tests/checkasm/Makefile b/tests/checkasm/Makefile
index 1827a4e134..4ef5fa87da 100644
--- a/tests/checkasm/Makefile
+++ b/tests/checkasm/Makefile
@@ -50,6 +50,7 @@ SWSCALEOBJS                             += sw_rgb.o sw_scale.o
 CHECKASMOBJS-$(CONFIG_SWSCALE)  += $(SWSCALEOBJS)
 
 # libavutil tests
+AVUTILOBJS                              += av_tx.o
 AVUTILOBJS                              += fixed_dsp.o
 AVUTILOBJS                              += float_dsp.o
 
diff --git a/tests/checkasm/av_tx.c b/tests/checkasm/av_tx.c
new file mode 100644
index 0000000000..6ffbce2b4a
--- /dev/null
+++ b/tests/checkasm/av_tx.c
@@ -0,0 +1,109 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with FFmpeg; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include "libavutil/mem_internal.h"
+#include "libavutil/tx.h"
+#include "libavutil/error.h"
+
+#include "checkasm.h"
+
+#define EPS 0.0001
+
+#define SCALE_NOOP(x) (x)
+#define SCALE_INT20(x) (av_clip64(lrintf((x) * 2147483648.0), INT32_MIN, INT32_MAX) >> 12)
+
+#define randomize_complex(BUF, LEN, TYPE, SCALE)                \
+    do {                                                        \
+        TYPE *buf = (TYPE *)BUF;                                \
+        for (int i = 0; i < LEN; i++) {                         \
+            double fre = (double)rnd() / UINT_MAX;              \
+            double fim = (double)rnd() / UINT_MAX;              \
+            buf[i] = (TYPE){ SCALE(fre), SCALE(fim) };          \
+        }                                                       \
+    } while (0)
+
+static const int check_lens[] = {
+    2, 4, 8, 16, 32, 64, 1024, 16384,
+    3*2, 5*2, 7*2, 9*2, 15*2,
+};
+
+#define CHECK_TEMPLATE(PREFIX, TYPE, DATA_TYPE, SCALE, LENGTHS, CHECK_EXPRESSION) \
+    do {                                                                          \
+        int err;                                                                  \
+        AVTXContext *tx;                                                          \
+        av_tx_fn fn;                                                              \
+        int num_checks = 0;                                                       \
+        int last_check = 0;                                                       \
+        const void *scale = &SCALE;                                               \
+                                                                                  \
+        for (int i = 0; i < FF_ARRAY_ELEMS(LENGTHS); i++) {                       \
+            int len = LENGTHS[i];                                                 \
+                                                                                  \
+            if ((err = av_tx_init(&tx, &fn, TYPE, 0, len, &scale, 0x0)) < 0) {    \
+                fprintf(stderr, "av_tx: %s\n", av_err2str(err));                  \
+                return;                                                           \
+            }                                                                     \
+                                                                                  \
+            if (check_func(fn, PREFIX "_%i", len)) {                              \
+                num_checks++;                                                     \
+                last_check = len;                                                 \
+                call_ref(tx, out_ref, in, sizeof(DATA_TYPE));                     \
+                call_new(tx, out_new, in, sizeof(DATA_TYPE));                     \
+                if (CHECK_EXPRESSION) {                                           \
+                    fail();                                                       \
+                    break;                                                        \
+                }                                                                 \
+                bench_new(tx, out_new, in, sizeof(DATA_TYPE));                    \
+            }                                                                     \
+                                                                                  \
+            av_tx_uninit(&tx);                                                    \
+            fn = NULL;                                                            \
+        }                                                                         \
+                                                                                  \
+        av_tx_uninit(&tx);                                                        \
+        fn = NULL;                                                                \
+                                                                                  \
+        if (num_checks == 1)                                                      \
+            report(PREFIX "_%i", last_check);                                     \
+        else if (num_checks)                                                      \
+            report(PREFIX);                                                       \
+    } while (0)
+
+void checkasm_check_av_tx(void)
+{
+    const float scale_float = 1.0f;
+    const double scale_double = 1.0;
+
+    declare_func(void, AVTXContext *tx, void *out, void *in, ptrdiff_t stride);
+
+    void *in      = av_malloc(16384*2*8);
+    void *out_ref = av_malloc(16384*2*8);
+    void *out_new = av_malloc(16384*2*8);
+
+    randomize_complex(in, 16384, AVComplexFloat, SCALE_NOOP);
+    CHECK_TEMPLATE("float_fft", AV_TX_FLOAT_FFT, AVComplexFloat, scale_float, check_lens,
+                   !float_near_abs_eps_array(out_ref, out_new, EPS, len*2));
+
+    randomize_complex(in, 16384, AVComplexDouble, SCALE_NOOP);
+    CHECK_TEMPLATE("double_fft", AV_TX_DOUBLE_FFT, AVComplexDouble, scale_double, check_lens,
+                   !double_near_abs_eps_array(out_ref, out_new, EPS, len*2));
+
+    av_free(in);
+    av_free(out_ref);
+    av_free(out_new);
+}
diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index 8338e8ff58..e2e17d2b11 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -198,6 +198,7 @@ static const struct {
 #if CONFIG_AVUTIL
         { "fixed_dsp", checkasm_check_fixed_dsp },
         { "float_dsp", checkasm_check_float_dsp },
+        { "av_tx",     checkasm_check_av_tx },
 #endif
     { NULL }
 };
diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
index ef6645e3a2..0593d0edac 100644
--- a/tests/checkasm/checkasm.h
+++ b/tests/checkasm/checkasm.h
@@ -43,6 +43,7 @@ void checkasm_check_aacpsdsp(void);
 void checkasm_check_afir(void);
 void checkasm_check_alacdsp(void);
 void checkasm_check_audiodsp(void);
+void checkasm_check_av_tx(void);
 void checkasm_check_blend(void);
 void checkasm_check_blockdsp(void);
 void checkasm_check_bswapdsp(void);
diff --git a/tests/fate/checkasm.mak b/tests/fate/checkasm.mak
index 07f1d8238e..3108fcd510 100644
--- a/tests/fate/checkasm.mak
+++ b/tests/fate/checkasm.mak
@@ -2,6 +2,7 @@ FATE_CHECKASM = fate-checkasm-aacpsdsp                                  \
                 fate-checkasm-af_afir                                   \
                 fate-checkasm-alacdsp                                   \
                 fate-checkasm-audiodsp                                  \
+                fate-checkasm-av_tx                                     \
                 fate-checkasm-blockdsp                                  \
                 fate-checkasm-bswapdsp                                  \
                 fate-checkasm-exrdsp                                    \

From patchwork Mon Apr 19 20:26:39 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lynne <dev@lynne.ee>
X-Patchwork-Id: 27121
Delivered-To: ffmpegpatchwork2@gmail.com
Date: Mon, 19 Apr 2021 22:26:39 +0200 (CEST)
From: Lynne <dev@lynne.ee>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Message-ID: <MYfo3UD--3-2@lynne.ee>
In-Reply-To: <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
References: <MYfmSp7--3-2@lynne.ee> <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 10/11] doc/transforms: add documentation for
 the FFT transforms
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: B4FQO97vJitb

This makes the code far easier to follow, and makes writing new SIMD
for the transforms far easier.
Patch attached.
Subject: [PATCH 10/11] doc/transforms: add documentation for the FFT
 transforms

This makes the code far easier to follow, and makes writing new SIMD
for the transforms far easier.
---
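Reviewer note: the 4-point unrolling in the new document can be sanity-checked
against a direct DFT. The sketch below assumes natural-order input and the
forward direction. FFTSample and FFTComplex are redefined locally so that it
stands alone, and naive_dft4 and fft4_matches_reference are hypothetical
helpers that are not part of the patch.

```c
#include <assert.h>

typedef float FFTSample;
typedef struct FFTComplex {
    FFTSample re, im;
} FFTComplex;

/* The 4-point unrolling, copied from the document (comments trimmed). */
static void fft4(FFTComplex *z)
{
    FFTSample r1 = z[0].re - z[2].re;
    FFTSample r2 = z[0].im - z[2].im;
    FFTSample r3 = z[1].re - z[3].re;
    FFTSample r4 = z[1].im - z[3].im;

    FFTSample t1 = z[0].re + z[2].re;
    FFTSample t2 = z[0].im + z[2].im;
    FFTSample t3 = z[1].re + z[3].re;
    FFTSample t4 = z[1].im + z[3].im;

    FFTSample a3 = t1 - t3;
    FFTSample a4 = t2 - t4;
    FFTSample b3 = r1 - r4;
    FFTSample b2 = r2 - r3;

    FFTSample a1 = t1 + t3;
    FFTSample a2 = t2 + t4;
    FFTSample b1 = r1 + r4;
    FFTSample b4 = r2 + r3;

    z[0].re = a1; z[0].im = a2;
    z[2].re = a3; z[2].im = a4;
    z[1].re = b1; z[1].im = b2;
    z[3].re = b3; z[3].im = b4;
}

/* Hypothetical reference: direct forward DFT of exactly 4 points,
 * out[k] = sum_j in[j] * exp(-2*pi*i*j*k/4). The twiddles are exact. */
static void naive_dft4(FFTComplex *out, const FFTComplex *in)
{
    static const FFTComplex w[4] = { { 1, 0 }, { 0, -1 }, { -1, 0 }, { 0, 1 } };

    for (int k = 0; k < 4; k++) {
        out[k].re = out[k].im = 0;
        for (int j = 0; j < 4; j++) {
            const FFTComplex t = w[(j * k) & 3];
            out[k].re += in[j].re * t.re - in[j].im * t.im;
            out[k].im += in[j].re * t.im + in[j].im * t.re;
        }
    }
}

/* Returns 1 if fft4() agrees with naive_dft4() on in[0..3] within eps. */
static int fft4_matches_reference(const FFTComplex *in, FFTSample eps)
{
    FFTComplex ref[4], z[4];

    assert(eps >= 0);
    naive_dft4(ref, in);

    for (int i = 0; i < 4; i++)
        z[i] = in[i];
    fft4(z);

    for (int i = 0; i < 4; i++) {
        FFTSample dre = z[i].re - ref[i].re;
        FFTSample dim = z[i].im - ref[i].im;
        if (dre < -eps || dre > eps || dim < -eps || dim > eps)
            return 0;
    }
    return 1;
}
```

Calling fft4_matches_reference() on a few hand-picked inputs is enough to
catch sign or index slips when transcribing the unrolling into SIMD.
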
 doc/transforms.md | 706 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 706 insertions(+)
 create mode 100644 doc/transforms.md

diff --git a/doc/transforms.md b/doc/transforms.md
new file mode 100644
index 0000000000..78f3f68d65
--- /dev/null
+++ b/doc/transforms.md
@@ -0,0 +1,706 @@
+The basis transforms used for FFT and various other derived functions are based
+on the following unrollings.
+The functions can be easily adapted to double precision floats as well.
+
+# Parity permutation
+The basis transforms described here all use the following permutation:
+
+``` C
+void ff_tx_gen_split_radix_parity_revtab(int *revtab, int len, int inv,
+                                         int basis, int dual_stride);
+```
+Parity means that even and odd complex numbers are split, i.e. the even
+coefficients come first, followed by the odd coefficients.
+For example, a 4-point transform's coefficients after reordering:
+`z[0].re, z[0].im, z[2].re, z[2].im, z[1].re, z[1].im, z[3].re, z[3].im`
+
+The basis argument is the length of the largest non-composite transform
+supported, and also implies that the basis/2 transform is supported as well,
+as the split-radix algorithm requires it to be.
+
+The dual_stride argument indicates that both the basis and the basis/2
+transforms support doing two transforms at once, with the coefficients
+interleaved between each pair in a split-radix like so (stride == 2):
+`tx1[0], tx1[2], tx2[0], tx2[2], tx1[1], tx1[3], tx2[1], tx2[3]`
+A non-zero number switches this on, with the value indicating the stride
+(how many values of one transform to put first before switching to the other).
+It must be 0 or a power of two, and must be less than the basis.
+The value is clipped to the transform size, so for a basis of 16 and a
+dual_stride of 8, dual 8-point transforms will be laid out as if dual_stride
+was set to 4.
+Usually you'll set this to 0, or to half the number of complex values that
+fit in a single register. This allows reusing SSE functions as
+dual-transform functions in AVX mode.
+If len is smaller than basis/2, this function does nothing.
+
+# 4-point FFT transform
+The only permutation this transform needs is to swap the `z[1]` and `z[2]`
+elements when performing an inverse transform. In the assembly code, this
+swap is hardcoded, with the function itself being templated and duplicated
+for each direction.
+
+``` C
+static void fft4(FFTComplex *z)
+{
+    FFTSample r1 = z[0].re - z[2].re;
+    FFTSample r2 = z[0].im - z[2].im;
+    FFTSample r3 = z[1].re - z[3].re;
+    FFTSample r4 = z[1].im - z[3].im;
+    /* r5-r8 second transform */
+
+    FFTSample t1 = z[0].re + z[2].re;
+    FFTSample t2 = z[0].im + z[2].im;
+    FFTSample t3 = z[1].re + z[3].re;
+    FFTSample t4 = z[1].im + z[3].im;
+    /* t5-t8 second transform */
+
+    /* 1sub + 1add = 2 instructions */
+
+    /* 2 shufs */
+    FFTSample a3 = t1 - t3;
+    FFTSample a4 = t2 - t4;
+    FFTSample b3 = r1 - r4;
+    FFTSample b2 = r2 - r3;
+
+    FFTSample a1 = t1 + t3;
+    FFTSample a2 = t2 + t4;
+    FFTSample b1 = r1 + r4;
+    FFTSample b4 = r2 + r3;
+    /* 1 add 1 sub 3 shufs */
+
+    z[0].re = a1;
+    z[0].im = a2;
+    z[2].re = a3;
+    z[2].im = a4;
+
+    z[1].re = b1;
+    z[1].im = b2;
+    z[3].re = b3;
+    z[3].im = b4;
+}
+```
+
+# 8-point AVX FFT transform
+Input must be pre-permuted using the parity lookup table, generated via
+`ff_tx_gen_split_radix_parity_revtab`.
+
+``` C
+static void fft8(FFTComplex *z)
+{
+    FFTSample r1 = z[0].re - z[4].re;
+    FFTSample r2 = z[0].im - z[4].im;
+    FFTSample r3 = z[1].re - z[5].re;
+    FFTSample r4 = z[1].im - z[5].im;
+
+    FFTSample r5 = z[2].re - z[6].re;
+    FFTSample r6 = z[2].im - z[6].im;
+    FFTSample r7 = z[3].re - z[7].re;
+    FFTSample r8 = z[3].im - z[7].im;
+
+    FFTSample q1 = z[0].re + z[4].re;
+    FFTSample q2 = z[0].im + z[4].im;
+    FFTSample q3 = z[1].re + z[5].re;
+    FFTSample q4 = z[1].im + z[5].im;
+
+    FFTSample q5 = z[2].re + z[6].re;
+    FFTSample q6 = z[2].im + z[6].im;
+    FFTSample q7 = z[3].re + z[7].re;
+    FFTSample q8 = z[3].im + z[7].im;
+
+    FFTSample s3 = q1 - q3;
+    FFTSample s1 = q1 + q3;
+    FFTSample s4 = q2 - q4;
+    FFTSample s2 = q2 + q4;
+
+    FFTSample s7 = q5 - q7;
+    FFTSample s5 = q5 + q7;
+    FFTSample s8 = q6 - q8;
+    FFTSample s6 = q6 + q8;
+
+    FFTSample e1 = s1 * -1;
+    FFTSample e2 = s2 * -1;
+    FFTSample e3 = s3 * -1;
+    FFTSample e4 = s4 * -1;
+
+    FFTSample e5 = s5 *  1;
+    FFTSample e6 = s6 *  1;
+    FFTSample e7 = s7 * -1;
+    FFTSample e8 = s8 *  1;
+
+    FFTSample w1 =  e5 - e1;
+    FFTSample w2 =  e6 - e2;
+    FFTSample w3 =  e8 - e3;
+    FFTSample w4 =  e7 - e4;
+
+    FFTSample w5 =  s1 - e5;
+    FFTSample w6 =  s2 - e6;
+    FFTSample w7 =  s3 - e8;
+    FFTSample w8 =  s4 - e7;
+
+    z[0].re = w1;
+    z[0].im = w2;
+    z[2].re = w3;
+    z[2].im = w4;
+    z[4].re = w5;
+    z[4].im = w6;
+    z[6].re = w7;
+    z[6].im = w8;
+
+    FFTSample z1 = r1 - r4;
+    FFTSample z2 = r1 + r4;
+    FFTSample z3 = r3 - r2;
+    FFTSample z4 = r3 + r2;
+
+    FFTSample z5 = r5 - r6;
+    FFTSample z6 = r5 + r6;
+    FFTSample z7 = r7 - r8;
+    FFTSample z8 = r7 + r8;
+
+    z3 *= -1;
+    z5 *= -M_SQRT1_2;
+    z6 *= -M_SQRT1_2;
+    z7 *=  M_SQRT1_2;
+    z8 *=  M_SQRT1_2;
+
+    FFTSample t5 = z7 - z6;
+    FFTSample t6 = z8 + z5;
+    FFTSample t7 = z8 - z5;
+    FFTSample t8 = z7 + z6;
+
+    FFTSample u1 =  z2 + t5;
+    FFTSample u2 =  z3 + t6;
+    FFTSample u3 =  z1 - t7;
+    FFTSample u4 =  z4 + t8;
+
+    FFTSample u5 =  z2 - t5;
+    FFTSample u6 =  z3 - t6;
+    FFTSample u7 =  z1 + t7;
+    FFTSample u8 =  z4 - t8;
+
+    z[1].re = u1;
+    z[1].im = u2;
+    z[3].re = u3;
+    z[3].im = u4;
+    z[5].re = u5;
+    z[5].im = u6;
+    z[7].re = u7;
+    z[7].im = u8;
+}
+```
+
+As you can see, there are 2 independent paths, one for even and one for odd coefficients.
+This theme continues throughout the document. Note that in the actual assembly code,
+the paths are interleaved to improve unit saturation and CPU dependency tracking, so
+to more clearly see them, you'll need to deinterleave the instructions.
+
+# 8-point SSE/ARM64 FFT transform
+Input must be pre-permuted using the parity lookup table, generated via
+`ff_tx_gen_split_radix_parity_revtab`.
+
+``` C
+static void fft8(FFTComplex *z)
+{
+    FFTSample r1 = z[0].re - z[4].re;
+    FFTSample r2 = z[0].im - z[4].im;
+    FFTSample r3 = z[1].re - z[5].re;
+    FFTSample r4 = z[1].im - z[5].im;
+
+    FFTSample j1 = z[2].re - z[6].re;
+    FFTSample j2 = z[2].im - z[6].im;
+    FFTSample j3 = z[3].re - z[7].re;
+    FFTSample j4 = z[3].im - z[7].im;
+
+    FFTSample q1 = z[0].re + z[4].re;
+    FFTSample q2 = z[0].im + z[4].im;
+    FFTSample q3 = z[1].re + z[5].re;
+    FFTSample q4 = z[1].im + z[5].im;
+
+    FFTSample k1 = z[2].re + z[6].re;
+    FFTSample k2 = z[2].im + z[6].im;
+    FFTSample k3 = z[3].re + z[7].re;
+    FFTSample k4 = z[3].im + z[7].im;
+    /* 2 add 2 sub = 4 */
+
+    /* 2 shufs, 1 add 1 sub = 4 */
+    FFTSample s1 = q1 + q3;
+    FFTSample s2 = q2 + q4;
+    FFTSample g1 = k3 + k1;
+    FFTSample g2 = k2 + k4;
+
+    FFTSample s3 = q1 - q3;
+    FFTSample s4 = q2 - q4;
+    FFTSample g4 = k3 - k1;
+    FFTSample g3 = k2 - k4;
+
+    /* 1 unpack + 1 shuffle = 2 */
+
+    /* 1 add */
+    FFTSample w1 =  s1 + g1;
+    FFTSample w2 =  s2 + g2;
+    FFTSample w3 =  s3 + g3;
+    FFTSample w4 =  s4 + g4;
+
+    /* 1 sub */
+    FFTSample h1 =  s1 - g1;
+    FFTSample h2 =  s2 - g2;
+    FFTSample h3 =  s3 - g3;
+    FFTSample h4 =  s4 - g4;
+
+    z[0].re = w1;
+    z[0].im = w2;
+    z[2].re = w3;
+    z[2].im = w4;
+    z[4].re = h1;
+    z[4].im = h2;
+    z[6].re = h3;
+    z[6].im = h4;
+
+    /* 1 shuf + 1 shuf + 1 xor + 1 addsub */
+    FFTSample z1 = r1 + r4;
+    FFTSample z2 = r2 - r3;
+    FFTSample z3 = r1 - r4;
+    FFTSample z4 = r2 + r3;
+
+    /* 1 mult */
+    j1 *=  M_SQRT1_2;
+    j2 *= -M_SQRT1_2;
+    j3 *= -M_SQRT1_2;
+    j4 *=  M_SQRT1_2;
+
+    /* 1 shuf + 1 addsub */
+    FFTSample l2 = j1 - j2;
+    FFTSample l1 = j2 + j1;
+    FFTSample l4 = j3 - j4;
+    FFTSample l3 = j4 + j3;
+
+    /* 1 shuf + 1 addsub */
+    FFTSample t1 = l3 - l2;
+    FFTSample t2 = l4 + l1;
+    FFTSample t3 = l1 - l4;
+    FFTSample t4 = l2 + l3;
+
+    /* 1 sub */
+    FFTSample u1 =  z1 - t1;
+    FFTSample u2 =  z2 - t2;
+    FFTSample u3 =  z3 - t3;
+    FFTSample u4 =  z4 - t4;
+
+    /* 1 add */
+    FFTSample o1 =  z1 + t1;
+    FFTSample o2 =  z2 + t2;
+    FFTSample o3 =  z3 + t3;
+    FFTSample o4 =  z4 + t4;
+
+    z[1].re = u1;
+    z[1].im = u2;
+    z[3].re = u3;
+    z[3].im = u4;
+    z[5].re = o1;
+    z[5].im = o2;
+    z[7].re = o3;
+    z[7].im = o4;
+}
+```
+
+Most functions here are highly tuned to use x86's addsub instruction to save on
+external sign mask loading.
+
+# 16-point AVX FFT transform
+This version expects the output of the 8 and 4-point transforms to follow the
+even/odd convention established above.
+
+``` C
+static void fft16(FFTComplex *z)
+{
+    FFTSample cos_16_1 = 0.92387950420379638671875f;
+    FFTSample cos_16_3 = 0.3826834261417388916015625f;
+
+    fft8(z);
+    fft4(z+8);
+    fft4(z+12);
+
+    FFTSample s[32];
+
+    /*
+        xorps m1, m1 - free
+        mulps m0
+        shufps m1, m1, m0
+        xorps
+        addsub
+        shufps
+        mulps
+        mulps
+        addps
+        or (fma3)
+        shufps
+        shufps
+        mulps
+        mulps
+        fma
+        fma
+     */
+
+    s[0]  =  z[8].re*( 1) - z[8].im*( 0);
+    s[1]  =  z[8].im*( 1) + z[8].re*( 0);
+    s[2]  =  z[9].re*( 1) - z[9].im*(-1);
+    s[3]  =  z[9].im*( 1) + z[9].re*(-1);
+
+    s[4]  = z[10].re*( 1) - z[10].im*( 0);
+    s[5]  = z[10].im*( 1) + z[10].re*( 0);
+    s[6]  = z[11].re*( 1) - z[11].im*( 1);
+    s[7]  = z[11].im*( 1) + z[11].re*( 1);
+
+    s[8]  = z[12].re*(  cos_16_1) - z[12].im*( -cos_16_3);
+    s[9]  = z[12].im*(  cos_16_1) + z[12].re*( -cos_16_3);
+    s[10] = z[13].re*(  cos_16_3) - z[13].im*( -cos_16_1);
+    s[11] = z[13].im*(  cos_16_3) + z[13].re*( -cos_16_1);
+
+    s[12] = z[14].re*(  cos_16_1) - z[14].im*(  cos_16_3);
+    s[13] = z[14].im*( -cos_16_1) + z[14].re*( -cos_16_3);
+    s[14] = z[15].re*(  cos_16_3) - z[15].im*(  cos_16_1);
+    s[15] = z[15].im*( -cos_16_3) + z[15].re*( -cos_16_1);
+
+    s[2] *=  M_SQRT1_2;
+    s[3] *=  M_SQRT1_2;
+    s[5] *= -1;
+    s[6] *=  M_SQRT1_2;
+    s[7] *= -M_SQRT1_2;
+
+    FFTSample w5 =  s[0] + s[4];
+    FFTSample w6 =  s[1] - s[5];
+    FFTSample x5 =  s[2] + s[6];
+    FFTSample x6 =  s[3] - s[7];
+
+    FFTSample w3 =  s[4] - s[0];
+    FFTSample w4 =  s[5] + s[1];
+    FFTSample x3 =  s[6] - s[2];
+    FFTSample x4 =  s[7] + s[3];
+
+    FFTSample y5 =  s[8] + s[12];
+    FFTSample y6 =  s[9] - s[13];
+    FFTSample u5 = s[10] + s[14];
+    FFTSample u6 = s[11] - s[15];
+
+    FFTSample y3 = s[12] - s[8];
+    FFTSample y4 = s[13] + s[9];
+    FFTSample u3 = s[14] - s[10];
+    FFTSample u4 = s[15] + s[11];
+
+    /* 2xorps, 2vperm2fs, 2 adds, 2 vpermilps = 8 */
+
+    FFTSample o1  = z[0].re + w5;
+    FFTSample o2  = z[0].im + w6;
+    FFTSample o5  = z[1].re + x5;
+    FFTSample o6  = z[1].im + x6;
+    FFTSample o9  = z[2].re + w4; //h
+    FFTSample o10 = z[2].im + w3;
+    FFTSample o13 = z[3].re + x4;
+    FFTSample o14 = z[3].im + x3;
+
+    FFTSample o17 = z[0].re - w5;
+    FFTSample o18 = z[0].im - w6;
+    FFTSample o21 = z[1].re - x5;
+    FFTSample o22 = z[1].im - x6;
+    FFTSample o25 = z[2].re - w4; //h
+    FFTSample o26 = z[2].im - w3;
+    FFTSample o29 = z[3].re - x4;
+    FFTSample o30 = z[3].im - x3;
+
+    FFTSample o3  = z[4].re + y5;
+    FFTSample o4  = z[4].im + y6;
+    FFTSample o7  = z[5].re + u5;
+    FFTSample o8  = z[5].im + u6;
+    FFTSample o11 = z[6].re + y4; //h
+    FFTSample o12 = z[6].im + y3;
+    FFTSample o15 = z[7].re + u4;
+    FFTSample o16 = z[7].im + u3;
+
+    FFTSample o19 = z[4].re - y5;
+    FFTSample o20 = z[4].im - y6;
+    FFTSample o23 = z[5].re - u5;
+    FFTSample o24 = z[5].im - u6;
+    FFTSample o27 = z[6].re - y4; //h
+    FFTSample o28 = z[6].im - y3;
+    FFTSample o31 = z[7].re - u4;
+    FFTSample o32 = z[7].im - u3;
+
+    /* This is just deinterleaving, happens separately */
+    z[0]  = (FFTComplex){  o1,  o2 };
+    z[1]  = (FFTComplex){  o3,  o4 };
+    z[2]  = (FFTComplex){  o5,  o6 };
+    z[3]  = (FFTComplex){  o7,  o8 };
+    z[4]  = (FFTComplex){  o9, o10 };
+    z[5]  = (FFTComplex){ o11, o12 };
+    z[6]  = (FFTComplex){ o13, o14 };
+    z[7]  = (FFTComplex){ o15, o16 };
+
+    z[8]  = (FFTComplex){ o17, o18 };
+    z[9]  = (FFTComplex){ o19, o20 };
+    z[10] = (FFTComplex){ o21, o22 };
+    z[11] = (FFTComplex){ o23, o24 };
+    z[12] = (FFTComplex){ o25, o26 };
+    z[13] = (FFTComplex){ o27, o28 };
+    z[14] = (FFTComplex){ o29, o30 };
+    z[15] = (FFTComplex){ o31, o32 };
+}
+```
+
+# AVX split-radix synthesis
+To create larger transforms, the following unrolling of the C split-radix
+function is used.
+
+``` C
+#define BF(x, y, a, b)                           \
+    do {                                         \
+        x = (a) - (b);                           \
+        y = (a) + (b);                           \
+    } while (0)
+
+#define BUTTERFLIES(a0,a1,a2,a3)               \
+    do {                                       \
+        r0=a0.re;                              \
+        i0=a0.im;                              \
+        r1=a1.re;                              \
+        i1=a1.im;                              \
+        BF(q3, q5, q5, q1);                    \
+        BF(a2.re, a0.re, r0, q5);              \
+        BF(a3.im, a1.im, i1, q3);              \
+        BF(q4, q6, q2, q6);                    \
+        BF(a3.re, a1.re, r1, q4);              \
+        BF(a2.im, a0.im, i0, q6);              \
+    } while (0)
+
+#undef TRANSFORM
+#define TRANSFORM(a0,a1,a2,a3,wre,wim)         \
+    do {                                       \
+        CMUL(q1, q2, a2.re, a2.im, wre, -wim); \
+        CMUL(q5, q6, a3.re, a3.im, wre,  wim); \
+        BUTTERFLIES(a0, a1, a2, a3);           \
+    } while (0)
+
+#define CMUL(dre, dim, are, aim, bre, bim)       \
+    do {                                         \
+        (dre) = (are) * (bre) - (aim) * (bim);   \
+        (dim) = (are) * (bim) + (aim) * (bre);   \
+    } while (0)
+
+static void recombine(FFTComplex *z, const FFTSample *cos,
+                      unsigned int n)
+{
+    const int o1 = 2*n;
+    const int o2 = 4*n;
+    const int o3 = 6*n;
+    const FFTSample *wim = cos + o1 - 7;
+    FFTSample q1, q2, q3, q4, q5, q6, r0, i0, r1, i1;
+
+#if 0
+    for (int i = 0; i < n; i += 4) {
+#endif
+
+#if 0
+        TRANSFORM(z[ 0 + 0], z[ 0 + 4], z[o2 + 0], z[o2 + 2], cos[0], wim[7]);
+        TRANSFORM(z[ 0 + 1], z[ 0 + 5], z[o2 + 1], z[o2 + 3], cos[2], wim[5]);
+        TRANSFORM(z[ 0 + 2], z[ 0 + 6], z[o2 + 4], z[o2 + 6], cos[4], wim[3]);
+        TRANSFORM(z[ 0 + 3], z[ 0 + 7], z[o2 + 5], z[o2 + 7], cos[6], wim[1]);
+
+        TRANSFORM(z[o1 + 0], z[o1 + 4], z[o3 + 0], z[o3 + 2], cos[1], wim[6]);
+        TRANSFORM(z[o1 + 1], z[o1 + 5], z[o3 + 1], z[o3 + 3], cos[3], wim[4]);
+        TRANSFORM(z[o1 + 2], z[o1 + 6], z[o3 + 4], z[o3 + 6], cos[5], wim[2]);
+        TRANSFORM(z[o1 + 3], z[o1 + 7], z[o3 + 5], z[o3 + 7], cos[7], wim[0]);
+#else
+        FFTSample h[8], j[8], r[8], w[8];
+        FFTSample t[8];
+        FFTComplex *m0 = &z[0];
+        FFTComplex *m1 = &z[4];
+        FFTComplex *m2 = &z[o2 + 0];
+        FFTComplex *m3 = &z[o2 + 4];
+
+        const FFTSample *t1  = &cos[0];
+        const FFTSample *t2  = &wim[0];
+
+        /* 2 loads (tabs) */
+
+        /* 2 vperm2fs, 2 shufs (im), 2 shufs (tabs) */
+        /* 1 xor, 1 add, 1 sub, 4 mults OR 2 mults, 2 fmas */
+        /* 13 OR 10ish (-2 each for second passovers!) */
+
+        w[0] = m2[0].im*t1[0] - m2[0].re*t2[7];
+        w[1] = m2[0].re*t1[0] + m2[0].im*t2[7];
+        w[2] = m2[1].im*t1[2] - m2[1].re*t2[5];
+        w[3] = m2[1].re*t1[2] + m2[1].im*t2[5];
+        w[4] = m3[0].im*t1[4] - m3[0].re*t2[3];
+        w[5] = m3[0].re*t1[4] + m3[0].im*t2[3];
+        w[6] = m3[1].im*t1[6] - m3[1].re*t2[1];
+        w[7] = m3[1].re*t1[6] + m3[1].im*t2[1];
+
+        j[0] = m2[2].im*t1[0] + m2[2].re*t2[7];
+        j[1] = m2[2].re*t1[0] - m2[2].im*t2[7];
+        j[2] = m2[3].im*t1[2] + m2[3].re*t2[5];
+        j[3] = m2[3].re*t1[2] - m2[3].im*t2[5];
+        j[4] = m3[2].im*t1[4] + m3[2].re*t2[3];
+        j[5] = m3[2].re*t1[4] - m3[2].im*t2[3];
+        j[6] = m3[3].im*t1[6] + m3[3].re*t2[1];
+        j[7] = m3[3].re*t1[6] - m3[3].im*t2[1];
+
+        /* 1 add + 1 shuf */
+        t[1] = j[0] + w[0];
+        t[0] = j[1] + w[1];
+        t[3] = j[2] + w[2];
+        t[2] = j[3] + w[3];
+        t[5] = j[4] + w[4];
+        t[4] = j[5] + w[5];
+        t[7] = j[6] + w[6];
+        t[6] = j[7] + w[7];
+
+        /* 1 sub + 1 xor */
+        r[0] =  (w[0] - j[0]);
+        r[1] = -(w[1] - j[1]);
+        r[2] =  (w[2] - j[2]);
+        r[3] = -(w[3] - j[3]);
+        r[4] =  (w[4] - j[4]);
+        r[5] = -(w[5] - j[5]);
+        r[6] =  (w[6] - j[6]);
+        r[7] = -(w[7] - j[7]);
+
+        /* Min: 2 subs, 2 adds, 2 vperm2fs (OPTIONAL) */
+        m2[0].re = m0[0].re - t[0];
+        m2[0].im = m0[0].im - t[1];
+        m2[1].re = m0[1].re - t[2];
+        m2[1].im = m0[1].im - t[3];
+        m3[0].re = m0[2].re - t[4];
+        m3[0].im = m0[2].im - t[5];
+        m3[1].re = m0[3].re - t[6];
+        m3[1].im = m0[3].im - t[7];
+
+        m2[2].re = m1[0].re - r[0];
+        m2[2].im = m1[0].im - r[1];
+        m2[3].re = m1[1].re - r[2];
+        m2[3].im = m1[1].im - r[3];
+        m3[2].re = m1[2].re - r[4];
+        m3[2].im = m1[2].im - r[5];
+        m3[3].re = m1[3].re - r[6];
+        m3[3].im = m1[3].im - r[7];
+
+        m0[0].re = m0[0].re + t[0];
+        m0[0].im = m0[0].im + t[1];
+        m0[1].re = m0[1].re + t[2];
+        m0[1].im = m0[1].im + t[3];
+        m0[2].re = m0[2].re + t[4];
+        m0[2].im = m0[2].im + t[5];
+        m0[3].re = m0[3].re + t[6];
+        m0[3].im = m0[3].im + t[7];
+
+        m1[0].re = m1[0].re + r[0];
+        m1[0].im = m1[0].im + r[1];
+        m1[1].re = m1[1].re + r[2];
+        m1[1].im = m1[1].im + r[3];
+        m1[2].re = m1[2].re + r[4];
+        m1[2].im = m1[2].im + r[5];
+        m1[3].re = m1[3].re + r[6];
+        m1[3].im = m1[3].im + r[7];
+
+        /* Identical for below, but with the following parameters */
+        m0 = &z[o1];
+        m1 = &z[o1 + 4];
+        m2 = &z[o3 + 0];
+        m3 = &z[o3 + 4];
+        t1  = &cos[1];
+        t2  = &wim[-1];
+
+        w[0] = m2[0].im*t1[0] - m2[0].re*t2[7];
+        w[1] = m2[0].re*t1[0] + m2[0].im*t2[7];
+        w[2] = m2[1].im*t1[2] - m2[1].re*t2[5];
+        w[3] = m2[1].re*t1[2] + m2[1].im*t2[5];
+        w[4] = m3[0].im*t1[4] - m3[0].re*t2[3];
+        w[5] = m3[0].re*t1[4] + m3[0].im*t2[3];
+        w[6] = m3[1].im*t1[6] - m3[1].re*t2[1];
+        w[7] = m3[1].re*t1[6] + m3[1].im*t2[1];
+
+        j[0] = m2[2].im*t1[0] + m2[2].re*t2[7];
+        j[1] = m2[2].re*t1[0] - m2[2].im*t2[7];
+        j[2] = m2[3].im*t1[2] + m2[3].re*t2[5];
+        j[3] = m2[3].re*t1[2] - m2[3].im*t2[5];
+        j[4] = m3[2].im*t1[4] + m3[2].re*t2[3];
+        j[5] = m3[2].re*t1[4] - m3[2].im*t2[3];
+        j[6] = m3[3].im*t1[6] + m3[3].re*t2[1];
+        j[7] = m3[3].re*t1[6] - m3[3].im*t2[1];
+
+        /* 1 add + 1 shuf */
+        t[1] = j[0] + w[0];
+        t[0] = j[1] + w[1];
+        t[3] = j[2] + w[2];
+        t[2] = j[3] + w[3];
+        t[5] = j[4] + w[4];
+        t[4] = j[5] + w[5];
+        t[7] = j[6] + w[6];
+        t[6] = j[7] + w[7];
+
+        /* 1 sub + 1 xor */
+        r[0] =  (w[0] - j[0]);
+        r[1] = -(w[1] - j[1]);
+        r[2] =  (w[2] - j[2]);
+        r[3] = -(w[3] - j[3]);
+        r[4] =  (w[4] - j[4]);
+        r[5] = -(w[5] - j[5]);
+        r[6] =  (w[6] - j[6]);
+        r[7] = -(w[7] - j[7]);
+
+        /* Min: 2 subs, 2 adds, 2 vperm2fs (OPTIONAL) */
+        m2[0].re = m0[0].re - t[0];
+        m2[0].im = m0[0].im - t[1];
+        m2[1].re = m0[1].re - t[2];
+        m2[1].im = m0[1].im - t[3];
+        m3[0].re = m0[2].re - t[4];
+        m3[0].im = m0[2].im - t[5];
+        m3[1].re = m0[3].re - t[6];
+        m3[1].im = m0[3].im - t[7];
+
+        m2[2].re = m1[0].re - r[0];
+        m2[2].im = m1[0].im - r[1];
+        m2[3].re = m1[1].re - r[2];
+        m2[3].im = m1[1].im - r[3];
+        m3[2].re = m1[2].re - r[4];
+        m3[2].im = m1[2].im - r[5];
+        m3[3].re = m1[3].re - r[6];
+        m3[3].im = m1[3].im - r[7];
+
+        m0[0].re = m0[0].re + t[0];
+        m0[0].im = m0[0].im + t[1];
+        m0[1].re = m0[1].re + t[2];
+        m0[1].im = m0[1].im + t[3];
+        m0[2].re = m0[2].re + t[4];
+        m0[2].im = m0[2].im + t[5];
+        m0[3].re = m0[3].re + t[6];
+        m0[3].im = m0[3].im + t[7];
+
+        m1[0].re = m1[0].re + r[0];
+        m1[0].im = m1[0].im + r[1];
+        m1[1].re = m1[1].re + r[2];
+        m1[1].im = m1[1].im + r[3];
+        m1[2].re = m1[2].re + r[4];
+        m1[2].im = m1[2].im + r[5];
+        m1[3].re = m1[3].re + r[6];
+        m1[3].im = m1[3].im + r[7];
+#endif
+
+#if 0
+        z   +=   4; // !!!
+        cos += 2*4;
+        wim -= 2*4;
+    }
+#endif
+}
+```
+
+The macros used are identical to those in the generic C version, only with all
+variable declarations exported to the function body.
+An important point here is that the high frequency registers (m2 and m3) have
+their high and low halves swapped in the output. This is intentional, as the
+inputs must also have the same layout, and therefore, the input swapping is only
+performed once for the bottom-most basis transform, with all subsequent combinations
+using the already swapped halves.
+
+Also note that this function requires a special iteration scheme, because the
+coefficients begin to overlap, particularly `[o1]` with `[0]` after the second iteration.
+To iterate further, set `z = &z[16]` via `z += 8` for the second iteration. After
+the 4th iteration, the layout resets, so the same pattern repeats.

From patchwork Mon Apr 19 20:27:49 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Lynne <dev@lynne.ee>
X-Patchwork-Id: 27122
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a6b:5014:0:0:0:0:0 with SMTP id e20csp841037iob;
        Mon, 19 Apr 2021 13:28:00 -0700 (PDT)
X-Google-Smtp-Source: 
 ABdhPJw3FCRiJoe844/7DuuH1NQAxynEj5X6+L6+9aK7Ld7bKc1iBvE9ahfNwDNmSguQ7KVZymZr
X-Received: by 2002:a05:6402:1907:: with SMTP id
 e7mr10392229edz.313.1618864079839;
        Mon, 19 Apr 2021 13:27:59 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1618864079; cv=none;
        d=google.com; s=arc-20160816;
        b=m7740n40R/U2oiM2qUNQeq54FK78/2LTEbOF8i4/2AbbEpMGnW3V1iZVrloJRVqpIq
         lkUPbFpplm3FoKqKkVN74KFMSiuEFB1GrZHsHwmGwxZzOg35KgaKzCMnaUyMqjGGnD9v
         +sK9W4WxXm69yO7u+Hz74N6XXz6qrCignz6SGxM50kWPIH/N72EcLffJdKlIgUBnNVfR
         C/LXvAvKuWuan593jhGzSIjTx1dDPoOLhAlLkF7NxCcuapowYbPTAQI84dCnhjhZX/h/
         sc66ekc+BBDmM9xzUrMsBhdbFFzwGFNVCA8Vz4DOKvPzx2aCZIAndqcy3m/QZaf4rB2v
         3FoA==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:reply-to:list-subscribe:list-help:list-post
         :list-archive:list-unsubscribe:list-id:precedence:subject
         :mime-version:references:in-reply-to:message-id:to:from:date
         :dkim-signature:delivered-to;
        bh=DDquG/vqc/BVomFZfmYWUXxgHKc13yYGMlLeaFR8hwk=;
        b=pYbClXb168RXGo+oqj8/EWUQi9OTanjAb4LhpZD9UUP7vlrhLnRLZZq9HsAZ3uXAFw
         dYQWkdUapPfsGoBj/uaRO4N3/dgHdXthla/Eulst/BIwTftXHhLXlB3CuDat6U265Q08
         DoXNlHB1SNCUZSEgbjJFm0nWEgDfsYWyXTj5C+7wqewlinMd/JuaZlRs6HDWVx4RsBPE
         uMjk27ngbz9dorRiDn71XOIDGCuJISXrf7eMuj5tvqthBeM2WPChdgBc3r0jy5U+2zBc
         J0uuQSP1dpfOyeQkwEreTTU6qrbGbGP8d/KBgiACA27lNGqwFZILvK3rqy0H0sz90HjH
         qlZg==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b="odjHM/+G";
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 j12si4483884eds.348.2021.04.19.13.27.59;
        Mon, 19 Apr 2021 13:27:59 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b="odjHM/+G";
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 9861C689D59;
	Mon, 19 Apr 2021 23:27:56 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from w4.tutanota.de (w4.tutanota.de [81.3.6.165])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 4FBD96881DC
 for <ffmpeg-devel@ffmpeg.org>; Mon, 19 Apr 2021 23:27:50 +0300 (EEST)
Received: from w3.tutanota.de (unknown [192.168.1.164])
 by w4.tutanota.de (Postfix) with ESMTP id DE83410602F7
 for <ffmpeg-devel@ffmpeg.org>; Mon, 19 Apr 2021 20:27:49 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; t=1618864069;
 s=s1; d=lynne.ee;
 h=From:From:To:To:Subject:Subject:Content-Description:Content-ID:Content-Type:Content-Type:Content-Transfer-Encoding:Cc:Date:Date:In-Reply-To:In-Reply-To:MIME-Version:MIME-Version:Message-ID:Message-ID:Reply-To:References:References:Sender;
 bh=F/lWXkUQ5XyB4Fk1Gi98vwlwu+d09ARF/05ewqvf9VQ=;
 b=odjHM/+GvgIWGgnPxalW5J5a55Zj0EYzCAtRPVvFy0uwQ0mdm4qvhDTJWwypWd3N
 mvBu82wyQQCP/bjdjK5TlpqtODBFPQplZkBfRM2w3Z1pL4CU388pzdyfFDjVUsy8ZLN
 vg5IBoQHYBXDs+ccqeNeSDm6iatqMGCmzW7a/KTpXbMX9eyN2AeYHmk95hvLniosBBa
 OOKldjGqlsSzzXyHx2Sc4VkhXxxqv9m+KOL6162XjThMnS5855DVzxpinfsIqOLB72Q
 KqleTmVCMBM96aiywrfGX1Iore4Tn2gDGDiD8B9X1y3pMJVcDxJajEesrpOPQpE+G/U
 NtpB9pI9ig==
Date: Mon, 19 Apr 2021 22:27:49 +0200 (CEST)
From: Lynne <dev@lynne.ee>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Message-ID: <MYfoKfb--3-2@lynne.ee>
In-Reply-To: <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
References: <MYfmSp7--3-2@lynne.ee> <MYfmSp7--3-2@lynne.ee-MYfmXar----2>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 11/11] lavu/x86: add FFT assembly
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: nn4BYN8g+8vF

This commit adds a pure x86 assembly SIMD version of the FFT in libavutil/tx. 
The design of this pure assembly FFT is pretty unconventional.

On the lowest level, instead of splitting the complex numbers into
real and imaginary parts, we keep complex numbers together but split
them in terms of parity. This saves a number of shuffles in each transform,
but more importantly, it splits each transform into two independent
paths, which we process using separate registers in parallel.
This allows us to keep all units saturated and lets us use all available
registers to avoid dependencies.
Moreover, it allows us to double the granularity of our per-load permutation,
skipping many expensive lookups and allowing us to use just 4 loads per register,
rather than 8, or, in the case of FMA3 (and by extension, AVX2), to use the vgatherdpd
instruction, which is at least as fast as 4 separate loads on old hardware,
and quite a bit faster on modern CPUs.
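
As a rough illustration of the per-load permutation, here is a minimal C sketch
of what a 4-load, table-driven fetch of complex values could look like. The
names `Cplx` and `load64_lut` and the index ordering in the usage note are
assumptions made for this example; the actual table layout is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for FFTComplex: one complex value is 64 bits wide. */
typedef struct { float re, im; } Cplx;

/*
 * Fill one 4-element "register" with complex values picked through a table
 * of int32_t indices. Each iteration moves a whole (re, im) pair, so a
 * register of 8 floats needs 4 loads rather than 8 scalar ones.
 */
static void load64_lut(Cplx dst[4], const Cplx *src, const int32_t *lut)
{
    for (int i = 0; i < 4; i++) {
        assert(lut[i] >= 0);       /* indices are offsets into src */
        dst[i] = src[lut[i]];
    }
}
```

With an (assumed) table of { 0, 2, 4, 6, 1, 3, 5, 7 }, calling load64_lut() on
the first four entries yields the even-indexed inputs and calling it on the
last four yields the odd-indexed ones, so the parity split costs nothing beyond
the loads themselves.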

Higher up, we go for a bottom-up construction of large transforms, forgoing
the traditional per-transform call-return recursion chains. Instead, we always
start at the bottom-most basis transform (in this case, a 32-point transform),
and continue constructing larger and larger transforms until we return to the
top-most transform.
This way, we only touch the stack 3 times per complete target transform:
once for the 1/2 length transform and two times for the 1/4 length transform.

The combination algorithm we use is a standard Split-Radix algorithm,
as used in our C code. Although a version with fewer operations exists
(Steven G. Johnson and Matteo Frigo's "A modified split-radix FFT with fewer
arithmetic operations", IEEE Trans. Signal Process. 55 (1), 111–119 (2007),
which is the one FFTW uses), it only has 2% fewer operations and requires at least 4x
the binary code (due to it needing 4 different paths to do a single transform).
That version also has other issues which prevent it from being implemented
with SIMD code as efficiently, which makes it lose the marginal gains it offered,
and it cannot be performed bottom-up, requiring many recursive call-return chains,
whose overhead adds up.
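
For readers unfamiliar with the standard split-radix combination, the following
is a minimal scalar C sketch of it, limited to sizes up to 16 so that the
twiddles fit in one small table. It is not the code from this patch: it uses
the conventional recursive form with strided input (precisely the call-return
pattern the assembly avoids), and `Cplx`, `tw16`, `cmul` and `split_radix` are
names invented for the example.

```c
#include <assert.h>

/* Hypothetical stand-in for FFTComplex, in double precision for clarity. */
typedef struct { double re, im; } Cplx;

#define C1 0.92387953251128675613 /* cos(pi/8)   */
#define C3 0.38268343236508977173 /* cos(3*pi/8) */
#define H  0.70710678118654752440 /* sqrt(1/2)   */

/* tw16[k] = exp(-2*pi*i*k/16); smaller sizes index it with a larger step. */
static const Cplx tw16[16] = {
    {  1,   0 }, {  C1, -C3 }, {  H, -H }, {  C3, -C1 },
    {  0,  -1 }, { -C3, -C1 }, { -H, -H }, { -C1, -C3 },
    { -1,   0 }, { -C1,  C3 }, { -H,  H }, { -C3,  C1 },
    {  0,   1 }, {  C3,  C1 }, {  H,  H }, {  C1,  C3 },
};

static Cplx cmul(Cplx a, Cplx b)
{
    Cplx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

/* Forward DFT of in[0], in[stride], ...; n is a power of two, 1 <= n <= 16. */
static void split_radix(const Cplx *in, Cplx *out, int n, int stride)
{
    assert(n >= 1 && n <= 16 && (n & (n - 1)) == 0);

    if (n == 1) {
        out[0] = in[0];
        return;
    }
    if (n == 2) {
        Cplx a = in[0], b = in[stride];
        out[0].re = a.re + b.re; out[0].im = a.im + b.im;
        out[1].re = a.re - b.re; out[1].im = a.im - b.im;
        return;
    }

    /* One half-length transform (even inputs), two quarter-length ones. */
    split_radix(in,              out,             n / 2, 2 * stride);
    split_radix(in +     stride, out +     n / 2, n / 4, 4 * stride);
    split_radix(in + 3 * stride, out + 3 * n / 4, n / 4, 4 * stride);

    for (int k = 0; k < n / 4; k++) {
        int  step = 16 / n;
        Cplx e0 = out[k], e1 = out[k + n / 4];
        Cplx o1 = cmul(out[k +     n / 2], tw16[(    k * step) % 16]);
        Cplx o3 = cmul(out[k + 3 * n / 4], tw16[(3 * k * step) % 16]);
        Cplx s  = { o1.re + o3.re, o1.im + o3.im };
        Cplx d  = { o1.re - o3.re, o1.im - o3.im };

        out[k            ].re = e0.re + s.re; out[k            ].im = e0.im + s.im;
        out[k +     n / 2].re = e0.re - s.re; out[k +     n / 2].im = e0.im - s.im;
        /* -i*d = (d.im, -d.re) and +i*d = (-d.im, d.re) */
        out[k +     n / 4].re = e1.re + d.im; out[k +     n / 4].im = e1.im - d.re;
        out[k + 3 * n / 4].re = e1.re - d.im; out[k + 3 * n / 4].im = e1.im + d.re;
    }
}
```

Calling split_radix(x, X, 16, 1) on a unit impulse at x[1] should reproduce the
twiddle table itself, which is a convenient sanity check. The three recursive
calls correspond to the one 1/2 length and two 1/4 length sub-transforms
mentioned above.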

We go through a lot of effort to minimize loads/stores by keeping as much as possible
in registers between constructing transforms. This saves us around 32 cycles,
on paper, but in reality a lot more due to load/store aliasing (a load from a
memory location cannot be issued while there's a store pending, and there are
only so many (2 for Zen 3) load/store units in a CPU).
Also, we interleave coefficients during the last stage to save on a store+load
per register.

Each of the smallest, basis transforms (4, 8 and 16-point in our case)
has been extremely optimized. Our 8-point transform is barely 20 instructions
in total, beating our old implementation's 8-point transform by 1 instruction.
Our 2x8-point transform is 23 instructions, beating our old implementation by
6 instructions and needing 50% fewer cycles. Our 16-point transform's combination
code takes slightly more instructions than our old implementation, but makes up
for it by requiring far fewer arithmetic operations.

Overall, the transform was optimized for the timings of Zen 3, which at the
time of writing has the highest IPC of all documented CPUs. Shuffles were
preferred over arithmetic operations due to their 1/0.5 latency/throughput.

On average, this code is 30% faster than our old libavcodec implementation.
It's able to trade blows with the previously-untouchable FFTW on small transforms,
and due to its tiny size and better prediction, outdoes FFTW on larger transforms
by 11% on the largest currently supported size.

Full benchmark available here: https://files.lynne.ee/fft_benchmark_i7-6700HQ.txt

Patch attached.
Subject: [PATCH 11/11] lavu/x86: add FFT assembly
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This commit adds a pure x86 assembly SIMD version of the FFT in libavutil/tx.
The design of this pure assembly FFT is pretty unconventional.

On the lowest level, instead of splitting the complex numbers into
real and imaginary parts, we keep complex numbers together but split
them in terms of parity. This saves a number of shuffles in each transform,
but more importantly, it splits each transform into two independent
paths, which we process using separate registers in parallel.
This allows us to keep all units saturated and lets us use all available
registers to avoid dependencies.
Moreover, it allows us to double the granularity of our per-load permutation,
skipping many expensive lookups and allowing us to use just 4 loads per register,
rather than 8, or, in the case of FMA3 (and by extension, AVX2), to use the vgatherdpd
instruction, which is at least as fast as 4 separate loads on old hardware,
and quite a bit faster on modern CPUs.

Higher up, we go for a bottom-up construction of large transforms, forgoing
the traditional per-transform call-return recursion chains. Instead, we always
start at the bottom-most basis transform (in this case, a 32-point transform),
and continue constructing larger and larger transforms until we return to the
top-most transform.
This way, we only touch the stack 3 times per complete target transform:
once for the 1/2 length transform and two times for the 1/4 length transform.

The combination algorithm we use is a standard Split-Radix algorithm,
as used in our C code. Although a version with fewer operations exists
(Steven G. Johnson and Matteo Frigo's "A modified split-radix FFT with fewer
arithmetic operations", IEEE Trans. Signal Process. 55 (1), 111–119 (2007),
which is the one FFTW uses), it only has 2% fewer operations and requires at least 4x
the binary code (due to it needing 4 different paths to do a single transform).
That version also has other issues which prevent it from being implemented
with SIMD code as efficiently, which makes it lose the marginal gains it offered,
and it cannot be performed bottom-up, requiring many recursive call-return chains,
whose overhead adds up.

We go through a lot of effort to minimize loads/stores by keeping as much as possible
in registers between constructing transforms. This saves us around 32 cycles,
on paper, but in reality a lot more due to load/store aliasing (a load from a
memory location cannot be issued while there's a store pending, and there are
only so many (2 for Zen 3) load/store units in a CPU).
Also, we interleave coefficients during the last stage to save on a store+load
per register.

Each of the smallest, basis transforms (4, 8 and 16-point in our case)
has been extremely optimized. Our 8-point transform is barely 20 instructions
in total, beating our old implementation's 8-point transform by 1 instruction.
Our 2x8-point transform is 23 instructions, beating our old implementation by
6 instructions and needing 50% fewer cycles. Our 16-point transform's combination
code takes slightly more instructions than our old implementation, but makes up
for it by requiring far fewer arithmetic operations.

Overall, the transform was optimized for the timings of Zen 3, which at the
time of writing has the highest IPC of all documented CPUs. Shuffles were
preferred over arithmetic operations due to their 1/0.5 latency/throughput.

On average, this code is 30% faster than our old libavcodec implementation.
It's able to trade blows with the previously-untouchable FFTW on small transforms,
and due to its tiny size and better prediction, outdoes FFTW on larger transforms
by 11% on the largest currently supported size.
---
 libavutil/tx.c                |    2 +
 libavutil/tx_priv.h           |    2 +
 libavutil/x86/Makefile        |    2 +
 libavutil/x86/tx_float.asm    | 1216 +++++++++++++++++++++++++++++++++
 libavutil/x86/tx_float_init.c |  101 +++
 5 files changed, 1323 insertions(+)
 create mode 100644 libavutil/x86/tx_float.asm
 create mode 100644 libavutil/x86/tx_float_init.c

diff --git a/libavutil/tx.c b/libavutil/tx.c
index dcfb257899..8da04e99ca 100644
--- a/libavutil/tx.c
+++ b/libavutil/tx.c
@@ -237,6 +237,8 @@ av_cold int av_tx_init(AVTXContext **ctx, av_tx_fn *tx, enum AVTXType type,
     case AV_TX_FLOAT_MDCT:
         if ((err = ff_tx_init_mdct_fft_float(s, tx, type, inv, len, scale, flags)))
             goto fail;
+        if (ARCH_X86)
+            ff_tx_init_float_x86(s, tx);
         break;
     case AV_TX_DOUBLE_FFT:
     case AV_TX_DOUBLE_MDCT:
diff --git a/libavutil/tx_priv.h b/libavutil/tx_priv.h
index 88589fcbb4..ab44a1843c 100644
--- a/libavutil/tx_priv.h
+++ b/libavutil/tx_priv.h
@@ -199,4 +199,6 @@ typedef struct CosTabsInitOnce {
     AVOnce control;
 } CosTabsInitOnce;
 
+void ff_tx_init_float_x86(AVTXContext *s, av_tx_fn *tx);
+
 #endif /* AVUTIL_TX_PRIV_H */
diff --git a/libavutil/x86/Makefile b/libavutil/x86/Makefile
index 5f5242b5bd..d747c37049 100644
--- a/libavutil/x86/Makefile
+++ b/libavutil/x86/Makefile
@@ -3,6 +3,7 @@ OBJS += x86/cpu.o                                                       \
         x86/float_dsp_init.o                                            \
         x86/imgutils_init.o                                             \
         x86/lls_init.o                                                  \
+        x86/tx_float_init.o                                             \
 
 OBJS-$(CONFIG_PIXELUTILS) += x86/pixelutils_init.o                      \
 
@@ -14,5 +15,6 @@ X86ASM-OBJS += x86/cpuid.o                                              \
              x86/float_dsp.o                                            \
              x86/imgutils.o                                             \
              x86/lls.o                                                  \
+             x86/tx_float.o                                             \
 
 X86ASM-OBJS-$(CONFIG_PIXELUTILS) += x86/pixelutils.o                    \
diff --git a/libavutil/x86/tx_float.asm b/libavutil/x86/tx_float.asm
new file mode 100644
index 0000000000..bcfa9dc4e6
--- /dev/null
+++ b/libavutil/x86/tx_float.asm
@@ -0,0 +1,1216 @@
+;******************************************************************************
+;* Copyright (c) Lynne
+;*
+;* This file is part of FFmpeg.
+;*
+;* FFmpeg is free software; you can redistribute it and/or
+;* modify it under the terms of the GNU Lesser General Public
+;* License as published by the Free Software Foundation; either
+;* version 2.1 of the License, or (at your option) any later version.
+;*
+;* FFmpeg is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;* Lesser General Public License for more details.
+;*
+;* You should have received a copy of the GNU Lesser General Public
+;* License along with FFmpeg; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+;******************************************************************************
+
+; Open `doc/transforms.md` to see the code on which the transforms here were
+; based, and compare.
+
+; TODO:
+;       carry over registers from smaller transforms to save on ~8 loads/stores
+;       check if vinsertf could be faster than vperm2f128 for duplication
+;       even faster FFT8 (current one is very #instructions optimized)
+;       replace some xors with blends + addsubs?
+;       replace some shuffles with vblends?
+;       avx512 split-radix
+
+%include "x86util.asm"
+
+%if ARCH_X86_64
+%define ptr resq
+%else
+%define ptr resd
+%endif
+
+%assign i 16
+%rep 14
+cextern cos_ %+ i %+ _float ; ff_cos_i_float...
+%assign i (i << 1)
+%endrep
+
+struc AVTXContext
+    .n:           resd 1 ; Non-power-of-two part
+    .m:           resd 1 ; Power-of-two part
+    .inv:         resd 1 ; Is inverse
+    .type:        resd 1 ; Type
+    .flags:       resq 1 ; Flags
+    .scale:       resq 1 ; Scale
+
+    .exptab:       ptr 1 ; MDCT exptab
+    .tmp:          ptr 1 ; Temporary buffer needed for all compound transforms
+    .pfatab:       ptr 1 ; Input/Output mapping for compound transforms
+    .revtab:       ptr 1 ; Input mapping for power of two transforms
+    .inplace_idx:  ptr 1 ; Required indices to revtab for in-place transforms
+
+    .top_tx        ptr 1 ;  Used for transforms derived from other transforms
+endstruc
+
+SECTION_RODATA 32
+
+%define POS 0x00000000
+%define NEG 0x80000000
+
+%define M_SQRT1_2 0.707106781186547524401
+%define COS16_1   0.92387950420379638671875
+%define COS16_3   0.3826834261417388916015625
+
+d8_mult_odd:   dd M_SQRT1_2, -M_SQRT1_2, -M_SQRT1_2, M_SQRT1_2, \
+                  M_SQRT1_2, -M_SQRT1_2, -M_SQRT1_2, M_SQRT1_2
+
+s8_mult_odd:   dd 1.0, 1.0, -1.0, 1.0, -M_SQRT1_2, -M_SQRT1_2, M_SQRT1_2, M_SQRT1_2
+s8_perm_even:  dd 1, 3, 0, 2, 1, 3, 2, 0
+s8_perm_odd1:  dd 3, 3, 1, 1, 1, 1, 3, 3
+s8_perm_odd2:  dd 1, 2, 0, 3, 1, 0, 0, 1
+
+s16_mult_even: dd 1.0, 1.0, M_SQRT1_2, M_SQRT1_2, 1.0, -1.0, M_SQRT1_2, -M_SQRT1_2
+s16_mult_odd1: dd COS16_1,  COS16_1,  COS16_3,  COS16_3,  COS16_1, -COS16_1,  COS16_3, -COS16_3
+s16_mult_odd2: dd COS16_3, -COS16_3,  COS16_1, -COS16_1, -COS16_3, -COS16_3, -COS16_1, -COS16_1
+s16_perm:      dd 0, 1, 2, 3, 1, 0, 3, 2
+
+mask_mmmmpppm: dd NEG, NEG, NEG, NEG, POS, POS, POS, NEG
+mask_ppmpmmpm: dd POS, POS, NEG, POS, NEG, NEG, POS, NEG
+mask_mppmmpmp: dd NEG, POS, POS, NEG, NEG, POS, NEG, POS
+mask_mpmppmpm: dd NEG, POS, NEG, POS, POS, NEG, POS, NEG
+mask_pmmppmmp: dd POS, NEG, NEG, POS, POS, NEG, NEG, POS
+mask_pmpmpmpm: times 4 dd POS, NEG
+
+SECTION .text
+
+; Load complex values (64 bits) via a lookup table
+; %1 - output register
+; %2 - GPR of base input memory address
+; %3 - GPR of LUT (int32_t indices) address
+; %4 - LUT offset
+; %5 - temporary GPR (only used if vgather is not used)
+; %6 - temporary register (for avx only)
+; %7 - temporary register (for avx only, enables vgatherdpd (AVX2) if FMA3 is set)
+%macro LOAD64_LUT 5-7
+%if %0 > 6 && cpuflag(fma3)
+    pcmpeqd %6, %6 ; pcmpeqq has a 0.5 throughput on Zen 3, this has 0.25
+    movapd xmm%7, [%3 + %4] ; float mov since vgatherdpd is a float instruction
+    vgatherdpd %1, [%2 + xmm%7*8], %6 ; must use separate registers for args
+%else
+    mov      %5d, [%3 + %4 + 0]
+    movsd  xmm%1, [%2 + %5q*8]
+%if mmsize == 32
+    mov      %5d, [%3 + %4 + 8]
+    movsd  xmm%6, [%2 + %5q*8]
+%endif
+    mov      %5d, [%3 + %4 + 4]
+    movhps xmm%1, [%2 + %5q*8]
+%if mmsize == 32
+    mov      %5d, [%3 + %4 + 12]
+    movhps xmm%6, [%2 + %5q*8]
+    vinsertf128 %1, %1, xmm%6, 1
+%endif
+%endif
+%endmacro
+
+; Single 2-point in-place complex FFT (will do 2 transforms at once in AVX mode)
+; %1 - coefficients (r0.reim, r1.reim)
+; %2 - temporary
+%macro FFT2 2
+    shufps   %2, %1, %1, q3322
+    shufps   %1, %1, q1100
+
+    addsubps %1, %2
+
+    shufps   %1, %1, q2031
+%endmacro
+
+; Single 4-point in-place complex FFT (will do 2 transforms at once in [AVX] mode)
+; %1 - even coefficients (r0.reim, r2.reim, r4.reim, r6.reim)
+; %2 - odd coefficients  (r1.reim, r3.reim, r5.reim, r7.reim)
+; %3 - temporary
+%macro FFT4 3
+    subps  %3, %1, %2         ;  r1234, [r5678]
+    addps  %1, %2             ;  t1234, [t5678]
+
+    shufps %2, %1, %3, q1010  ;  t12, r12
+    shufps %1, %3, q2332      ;  t34, r43
+
+    subps  %3, %2, %1         ;  a34, b32
+    addps  %2, %1             ;  a12, b14
+
+    shufps %1, %2, %3, q1010  ;  a1234     even
+
+    shufps %2, %3, q2332      ;  b1423
+    shufps %2, %2, q1320      ;  b1234     odd
+%endmacro
+
+; Single/Dual 8-point in-place complex FFT (will do 2 transforms in [AVX] mode)
+; %1 - even coefficients (a0.reim, a2.reim, [b0.reim, b2.reim])
+; %2 - even coefficients (a4.reim, a6.reim, [b4.reim, b6.reim])
+; %3 - odd coefficients  (a1.reim, a3.reim, [b1.reim, b3.reim])
+; %4 - odd coefficients  (a5.reim, a7.reim, [b5.reim, b7.reim])
+; %5 - temporary
+; %6 - temporary
+%macro FFT8 6
+    addps    %5, %1, %3           ; q1-8
+    addps    %6, %2, %4           ; k1-8
+
+    subps    %1, %3               ; r1-8
+    subps    %2, %4               ; j1-8
+
+    shufps   %4, %1, %1, q2323    ; r4343
+    shufps   %3, %5, %6, q3032    ; q34, k14
+
+    shufps   %1, %1, q1010        ; r1212
+    shufps   %5, %6, q1210        ; q12, k32
+
+    xorps    %4, [mask_pmmppmmp]  ; r4343 * pmmp
+    addps    %6, %5, %3           ; s12, g12
+
+    mulps    %2, [d8_mult_odd]    ; r8 * d8_mult_odd
+    subps    %5, %3               ; s34, g43
+
+    addps    %3, %1, %4           ; z1234
+    unpcklpd %1, %6, %5           ; s1234
+
+    shufps   %4, %2, %2, q2301    ; j2143
+    shufps   %6, %5, q2332        ; g1234
+
+    addsubps %2, %4               ; l2143
+    shufps   %5, %2, %2, q0123    ; l3412
+    addsubps %5, %2               ; t1234
+
+    subps    %2, %1, %6           ; h1234 even
+    subps    %4, %3, %5           ; u1234 odd
+
+    addps    %1, %6               ; w1234 even
+    addps    %3, %5               ; o1234 odd
+%endmacro
+
+; Single 8-point in-place complex FFT in 20 instructions
+; %1 - even coefficients (r0.reim, r2.reim, r4.reim, r6.reim)
+; %2 - odd coefficients  (r1.reim, r3.reim, r5.reim, r7.reim)
+; %3 - temporary
+; %4 - temporary
+%macro FFT8_AVX 4
+    subps      %3, %1, %2               ;  r1234, r5678
+    addps      %1, %2                   ;  q1234, q5678
+
+    vpermilps  %2, %3, [s8_perm_odd1]   ;  r4422, r6688
+    shufps     %4, %1, %1, q3322        ;  q1122, q5566
+
+    movsldup   %3, %3                   ;  r1133, r5577
+    shufps     %1, %1, q1100            ;  q3344, q7788
+
+    addsubps   %3, %2                   ;  z1234, z5678
+    addsubps   %1, %4                   ;  s3142, s7586
+
+    mulps      %3, [s8_mult_odd]        ;  z * s8_mult_odd
+    vpermilps  %1, [s8_perm_even]       ;  s1234, s5687 !
+
+    shufps     %2, %3, %3, q2332        ;   junk, z7887
+    xorps      %4, %1, [mask_mmmmpppm]  ;  e1234, e5687 !
+
+    vpermilps  %3, %3, [s8_perm_odd2]   ;  z2314, z6556
+    vperm2f128 %1, %4, 0x03             ;  e5687, s1234
+
+    addsubps   %2, %3                   ;   junk, t5678
+    subps      %1, %4                   ;  w1234, w5678 even
+
+    vperm2f128 %2, %2, 0x11             ;  t5678, t5678
+    vperm2f128 %3, %3, 0x00             ;  z2314, z2314
+
+    xorps      %2, [mask_ppmpmmpm]      ;  t * ppmpmmpm
+    addps      %2, %3, %2               ;  u1234, u5678 odd
+%endmacro
+
+; Single 16-point in-place complex FFT
+; %1 - even coefficients (r0.reim, r2.reim,  r4.reim,  r6.reim)
+; %2 - even coefficients (r8.reim, r10.reim, r12.reim, r14.reim)
+; %3 - odd coefficients  (r1.reim, r3.reim,  r5.reim,  r7.reim)
+; %4 - odd coefficients  (r9.reim, r11.reim, r13.reim, r15.reim)
+; %5, %6 - temporary
+; %7, %8 - temporary (optional)
+%macro FFT16 6-8
+    FFT4       %3, %4, %5
+%if %0 > 7
+    FFT8_AVX   %1, %2, %6, %7
+    movaps     %8, [mask_mpmppmpm]
+    movaps     %7, [s16_perm]
+%define mask %8
+%define perm %7
+%elif %0 > 6
+    FFT8_AVX   %1, %2, %6, %7
+    movaps     %7, [s16_perm]
+%define mask [mask_mpmppmpm]
+%define perm %7
+%else
+    FFT8_AVX   %1, %2, %6, %5
+%define mask [mask_mpmppmpm]
+%define perm [s16_perm]
+%endif
+    xorps      %5, %5                       ; 0
+
+    shufps     %6, %4, %4, q2301            ; z12.imre, z13.imre...
+    shufps     %5, %5, %3, q2301            ; 0, 0, z8.imre...
+
+    mulps      %4, [s16_mult_odd1]          ; z.reim * costab
+    xorps      %5, [mask_mppmmpmp]
+%if cpuflag(fma3)
+    fmaddps    %6, %6, [s16_mult_odd2], %4  ; s[8..15]
+    addps      %5, %3, %5                   ; s[0...7]
+%else
+    mulps      %6, [s16_mult_odd2]          ; z.imre * costab
+
+    addps      %5, %3, %5                   ; s[0...7]
+    addps      %6, %4, %6                   ; s[8..15]
+%endif
+    mulps      %5, [s16_mult_even]          ; s[0...7]*costab
+
+    xorps      %4, %6, mask                 ; s[8..15]*mpmppmpm
+    xorps      %3, %5, mask                 ; s[0...7]*mpmppmpm
+
+    vperm2f128 %4, %4, 0x01                 ; s[12..15, 8..11]
+    vperm2f128 %3, %3, 0x01                 ; s[4..7, 0..3]
+
+    addps      %6, %4                       ; y56, u56, y34, u34
+    addps      %5, %3                       ; w56, x56, w34, x34
+
+    vpermilps  %6, perm                     ; y56, u56, y43, u43
+    vpermilps  %5, perm                     ; w56, x56, w43, x43
+
+    subps      %4, %2, %6                   ; odd  part 2
+    addps      %3, %2, %6                   ; odd  part 1
+
+    subps      %2, %1, %5                   ; even part 2
+    addps      %1, %1, %5                   ; even part 1
+%undef mask
+%undef perm
+%endmacro
+
+; Combines m0...m8 (tx1[even, even, odd, odd], tx2,3[even], tx2,3[odd]) coeffs
+; Uses all 16 registers.
+; Output is slightly permuted such that tx2,3's coefficients are interleaved
+; on a 2-point basis (look at `doc/transforms.md`)
+%macro SPLIT_RADIX_COMBINE 17
+%if %1 && mmsize == 32
+    vperm2f128 %14, %6, %7, 0x20     ; m2[0], m2[1], m3[0], m3[1] even
+    vperm2f128 %16, %9, %8, 0x20     ; m2[0], m2[1], m3[0], m3[1] odd
+    vperm2f128 %15, %6, %7, 0x31     ; m2[2], m2[3], m3[2], m3[3] even
+    vperm2f128 %17, %9, %8, 0x31     ; m2[2], m2[3], m3[2], m3[3] odd
+%endif
+
+    shufps     %12, %10, %10, q2200  ; cos00224466
+    shufps     %13, %11, %11, q1133  ; wim77553311
+    movshdup   %10, %10              ; cos11335577
+    shufps     %11, %11, q0022       ; wim66442200
+
+%if %1 && mmsize == 32
+    shufps     %6, %14, %14, q2301   ; m2[0].imre, m2[1].imre, m2[2].imre, m2[3].imre even
+    shufps     %8, %16, %16, q2301   ; m2[0].imre, m2[1].imre, m2[2].imre, m2[3].imre odd
+    shufps     %7, %15, %15, q2301   ; m3[0].imre, m3[1].imre, m3[2].imre, m3[3].imre even
+    shufps     %9, %17, %17, q2301   ; m3[0].imre, m3[1].imre, m3[2].imre, m3[3].imre odd
+
+    mulps      %14, %13              ; m2[0123]reim * wim7531 even
+    mulps      %16, %11              ; m2[0123]reim * wim6420 odd
+    mulps      %15, %13              ; m3[0123]reim * wim7531 even
+    mulps      %17, %11              ; m3[0123]reim * wim6420 odd
+%else
+    mulps      %14, %6, %13          ; m2,3[01]reim * wim7531 even
+    mulps      %16, %8, %11          ; m2,3[01]reim * wim6420 odd
+    mulps      %15, %7, %13          ; m2,3[23]reim * wim7531 even
+    mulps      %17, %9, %11          ; m2,3[23]reim * wim6420 odd
+    ; reorder the multiplies to save movs reg, reg in the %if above
+    shufps     %6, %6, q2301         ; m2[0].imre, m2[1].imre, m3[0].imre, m3[1].imre even
+    shufps     %8, %8, q2301         ; m2[0].imre, m2[1].imre, m3[0].imre, m3[1].imre odd
+    shufps     %7, %7, q2301         ; m2[2].imre, m2[3].imre, m3[2].imre, m3[3].imre even
+    shufps     %9, %9, q2301         ; m2[2].imre, m2[3].imre, m3[2].imre, m3[3].imre odd
+%endif
+
+%if cpuflag(fma3) ; 11 - 5 = 6 instructions saved through FMA!
+    fmaddsubps %6, %6, %12, %14      ; w[0..8] even
+    fmaddsubps %8, %8, %10, %16      ; w[0..8] odd
+    fmsubaddps %7, %7, %12, %15      ; j[0..8] even
+    fmsubaddps %9, %9, %10, %17      ; j[0..8] odd
+    movaps     %13, [mask_pmpmpmpm]  ; "subaddps? pfft, who needs that!"
+%else
+    mulps      %6, %12               ; m2,3[01]imre * cos0246
+    mulps      %8, %10               ; m2,3[01]imre * cos1357
+    movaps     %13, [mask_pmpmpmpm]  ; "subaddps? pfft, who needs that!"
+    mulps      %7, %12               ; m2,3[23]imre * cos0246
+    mulps      %9, %10               ; m2,3[23]imre * cos1357
+    addsubps   %6, %14               ; w[0..8]
+    addsubps   %8, %16               ; w[0..8]
+    xorps      %15, %13              ; +-m2,3[23]imre * wim7531
+    xorps      %17, %13              ; +-m2,3[23]imre * wim7531
+    addps      %7, %15               ; j[0..8]
+    addps      %9, %17               ; j[0..8]
+%endif
+
+    addps      %14, %6, %7           ; t10235476 even
+    addps      %16, %8, %9           ; t10235476 odd
+    subps      %15, %6, %7           ; +-r[0..7] even
+    subps      %17, %8, %9           ; +-r[0..7] odd
+
+    shufps     %14, %14, q2301       ; t[0..7] even
+    shufps     %16, %16, q2301       ; t[0..7] odd
+    xorps      %15, %13              ; r[0..7] even
+    xorps      %17, %13              ; r[0..7] odd
+
+    subps      %6, %2, %14           ; m2,3[01] even
+    subps      %8, %4, %16           ; m2,3[01] odd
+    subps      %7, %3, %15           ; m2,3[23] even
+    subps      %9, %5, %17           ; m2,3[23] odd
+
+    addps      %2, %14               ; m0 even
+    addps      %4, %16               ; m0 odd
+    addps      %3, %15               ; m1 even
+    addps      %5, %17               ; m1 odd
+%endmacro
+
+; Same as above, but only does one parity at a time and takes 3 temporary
+; registers. If the twiddles aren't needed after this, the registers they use
+; can be reused as any of the temporary registers.
+%macro SPLIT_RADIX_COMBINE_HALF 10
+%if %1
+    shufps     %8, %6, %6, q2200     ; cos00224466
+    shufps     %9, %7, %7, q1133     ; wim77553311
+%else
+    shufps     %8, %6, %6, q3311     ; cos11335577
+    shufps     %9, %7, %7, q0022     ; wim66442200
+%endif
+
+    mulps      %10, %4, %9           ; m2,3[01]reim * wim7531 even
+    mulps      %9, %5                ; m2,3[23]reim * wim7531 even
+
+    shufps     %4, %4, q2301         ; m2[0].imre, m2[1].imre, m3[0].imre, m3[1].imre even
+    shufps     %5, %5, q2301         ; m2[2].imre, m2[3].imre, m3[2].imre, m3[3].imre even
+
+%if cpuflag(fma3)
+    fmaddsubps %4, %4, %8, %10       ; w[0..8] even
+    fmsubaddps %5, %5, %8, %9        ; j[0..8] even
+    movaps     %10, [mask_pmpmpmpm]
+%else
+    mulps      %4, %8                ; m2,3[01]imre * cos0246
+    mulps      %5, %8                ; m2,3[23]imre * cos0246
+    addsubps   %4, %10               ; w[0..8]
+    movaps     %10, [mask_pmpmpmpm]
+    xorps      %9, %10               ; +-m2,3[23]imre * wim7531
+    addps      %5, %9                ; j[0..8]
+%endif
+
+    addps      %8, %4, %5            ; t10235476
+    subps      %9, %4, %5            ; +-r[0..7]
+
+    shufps     %8, %8, q2301         ; t[0..7]
+    xorps      %9, %10               ; r[0..7]
+
+    subps      %4, %2, %8            ; m2,3[01]
+    subps      %5, %3, %9            ; m2,3[23]
+
+    addps      %2, %8                ; m0
+    addps      %3, %9                ; m1
+%endmacro
+
+; Same as above, tries REALLY hard to use 2 temporary registers.
+%macro SPLIT_RADIX_COMBINE_LITE 9
+%if %1
+    shufps     %8, %6, %6, q2200     ; cos00224466
+    shufps     %9, %7, %7, q1133     ; wim77553311
+%else
+    shufps     %8, %6, %6, q3311     ; cos11335577
+    shufps     %9, %7, %7, q0022     ; wim66442200
+%endif
+
+    mulps      %9, %4                ; m2,3[01]reim * wim7531 even
+    shufps     %4, %4, q2301         ; m2[0].imre, m2[1].imre, m3[0].imre, m3[1].imre even
+
+%if cpuflag(fma3)
+    fmaddsubps %4, %4, %8, %9        ; w[0..8] even
+%else
+    mulps      %4, %8                ; m2,3[01]imre * cos0246
+    addsubps   %4, %9                ; w[0..8]
+%endif
+
+%if %1
+    shufps     %9, %7, %7, q1133     ; wim77553311
+%else
+    shufps     %9, %7, %7, q0022     ; wim66442200
+%endif
+
+    mulps      %9, %5                ; m2,3[23]reim * wim7531 even
+    shufps     %5, %5, q2301         ; m2[2].imre, m2[3].imre, m3[2].imre, m3[3].imre even
+%if cpuflag(fma3)
+    fmsubaddps %5, %5, %8, %9        ; j[0..8] even
+%else
+    mulps      %5, %8                ; m2,3[23]imre * cos0246
+    xorps      %9, [mask_pmpmpmpm]   ; +-m2,3[23]imre * wim7531
+    addps      %5, %9                ; j[0..8]
+%endif
+
+    addps      %8, %4, %5            ; t10235476
+    subps      %9, %4, %5            ; +-r[0..7]
+
+    shufps     %8, %8, q2301         ; t[0..7]
+    xorps      %9, [mask_pmpmpmpm]   ; r[0..7]
+
+    subps      %4, %2, %8            ; m2,3[01]
+    subps      %5, %3, %9            ; m2,3[23]
+
+    addps      %2, %8                ; m0
+    addps      %3, %9                ; m1
+%endmacro
+
+%macro SPLIT_RADIX_COMBINE_64 0
+    SPLIT_RADIX_COMBINE_LITE 1, m0, m1, tx1_e0, tx2_e0, tw_e, tw_o, tmp1, tmp2
+
+    movaps [outq +  0*mmsize], m0
+    movaps [outq +  4*mmsize], m1
+    movaps [outq +  8*mmsize], tx1_e0
+    movaps [outq + 12*mmsize], tx2_e0
+
+    SPLIT_RADIX_COMBINE_HALF 0, m2, m3, tx1_o0, tx2_o0, tw_e, tw_o, tmp1, tmp2, m0
+
+    movaps [outq +  2*mmsize], m2
+    movaps [outq +  6*mmsize], m3
+    movaps [outq + 10*mmsize], tx1_o0
+    movaps [outq + 14*mmsize], tx2_o0
+
+    movaps tw_e, [ff_cos_64_float + mmsize]
+    vperm2f128 tw_o, [ff_cos_64_float + 64 - 4*7 - mmsize], 0x23
+
+    movaps m0, [outq +  1*mmsize]
+    movaps m1, [outq +  3*mmsize]
+    movaps m2, [outq +  5*mmsize]
+    movaps m3, [outq +  7*mmsize]
+
+    SPLIT_RADIX_COMBINE 0, m0, m2, m1, m3, tx1_e1, tx2_e1, tx1_o1, tx2_o1, tw_e, tw_o, \
+                           tmp1, tmp2, tx2_o0, tx1_o0, tx2_e0, tx1_e0 ; temporary registers
+
+    movaps [outq +  1*mmsize], m0
+    movaps [outq +  3*mmsize], m1
+    movaps [outq +  5*mmsize], m2
+    movaps [outq +  7*mmsize], m3
+
+    movaps [outq +  9*mmsize], tx1_e1
+    movaps [outq + 11*mmsize], tx1_o1
+    movaps [outq + 13*mmsize], tx2_e1
+    movaps [outq + 15*mmsize], tx2_o1
+%endmacro
+
+; Perform a single even/odd split radix combination with loads and stores
+; The _4 indicates this is a quarter of the iterations required to complete a full
+; combine loop
+%macro SPLIT_RADIX_LOAD_COMBINE_4 5
+    movaps m8,     [rtabq + (%3)*mmsize + %5]
+    vperm2f128 m9, [itabq - (%3)*mmsize - %5], 0x23
+
+    movaps m0, [outq + (                   0 + %2)*mmsize + %4]
+    movaps m2, [outq + (                   2 + %2)*mmsize + %4]
+    movaps m1, [outq + (         (%1/16) + 0 + %2)*mmsize + %4]
+    movaps m3, [outq + (         (%1/16) + 2 + %2)*mmsize + %4]
+
+    movaps m4, [outq + ((%1/8)           + 0 + %2)*mmsize + %4]
+    movaps m6, [outq + ((%1/8)           + 2 + %2)*mmsize + %4]
+    movaps m5, [outq + ((%1/8) + (%1/16) + 0 + %2)*mmsize + %4]
+    movaps m7, [outq + ((%1/8) + (%1/16) + 2 + %2)*mmsize + %4]
+
+    SPLIT_RADIX_COMBINE 0, m0, m1, m2, m3, \
+                           m4, m5, m6, m7, \
+                           m8, m9, \
+                           m10, m11, m12, m13, m14, m15
+
+    movaps [outq + (                   0 + %2)*mmsize + %4], m0
+    movaps [outq + (                   2 + %2)*mmsize + %4], m2
+    movaps [outq + (         (%1/16) + 0 + %2)*mmsize + %4], m1
+    movaps [outq + (         (%1/16) + 2 + %2)*mmsize + %4], m3
+
+    movaps [outq + ((%1/8)           + 0 + %2)*mmsize + %4], m4
+    movaps [outq + ((%1/8)           + 2 + %2)*mmsize + %4], m6
+    movaps [outq + ((%1/8) + (%1/16) + 0 + %2)*mmsize + %4], m5
+    movaps [outq + ((%1/8) + (%1/16) + 2 + %2)*mmsize + %4], m7
+%endmacro
+
+%macro SPLIT_RADIX_LOAD_COMBINE_FULL 1-3
+%if %0 > 1
+%define offset_c %2
+%define offset_c %2
+%else
+%define offset_c 0
+%define offset_c 0
+%endif
+%if %0 > 2
+%define offset_t %3
+%define offset_t %3
+%else
+%define offset_t 0
+%define offset_t 0
+%endif
+
+    SPLIT_RADIX_LOAD_COMBINE_4 %1, 0, 0, offset_c, offset_t
+    SPLIT_RADIX_LOAD_COMBINE_4 %1, 1, 1, offset_c, offset_t
+    SPLIT_RADIX_LOAD_COMBINE_4 %1, 4, 2, offset_c, offset_t
+    SPLIT_RADIX_LOAD_COMBINE_4 %1, 5, 3, offset_c, offset_t
+%endmacro
+
+; Perform a single even/odd split radix combination with loads, deinterleaves and
+; stores. The _2 indicates this is a half of the iterations required to complete
+; a full combine+deinterleave loop
+; %3 must contain len*2, %4 must contain len*4, %5 must contain len*6
+%macro SPLIT_RADIX_COMBINE_DEINTERLEAVE_2 6
+    movaps m8,     [rtabq + (0 + %2)*mmsize]
+    vperm2f128 m9, [itabq - (0 + %2)*mmsize], 0x23
+
+    movaps m0, [outq +      (0 + 0 + %1)*mmsize + %6]
+    movaps m2, [outq +      (2 + 0 + %1)*mmsize + %6]
+    movaps m1, [outq + %3 + (0 + 0 + %1)*mmsize + %6]
+    movaps m3, [outq + %3 + (2 + 0 + %1)*mmsize + %6]
+
+    movaps m4, [outq + %4 + (0 + 0 + %1)*mmsize + %6]
+    movaps m6, [outq + %4 + (2 + 0 + %1)*mmsize + %6]
+    movaps m5, [outq + %5 + (0 + 0 + %1)*mmsize + %6]
+    movaps m7, [outq + %5 + (2 + 0 + %1)*mmsize + %6]
+
+    SPLIT_RADIX_COMBINE 0, m0, m1, m2, m3, \
+       m4, m5, m6, m7, \
+       m8, m9, \
+       m10, m11, m12, m13, m14, m15
+
+    unpckhpd m10, m0, m2
+    unpckhpd m11, m1, m3
+    unpckhpd m12, m4, m6
+    unpckhpd m13, m5, m7
+    unpcklpd m0, m2
+    unpcklpd m1, m3
+    unpcklpd m4, m6
+    unpcklpd m5, m7
+
+    vextractf128 [outq +      (0 + 0 + %1)*mmsize + %6 +  0], m0,  0
+    vextractf128 [outq +      (0 + 0 + %1)*mmsize + %6 + 16], m10, 0
+    vextractf128 [outq + %3 + (0 + 0 + %1)*mmsize + %6 +  0], m1,  0
+    vextractf128 [outq + %3 + (0 + 0 + %1)*mmsize + %6 + 16], m11, 0
+
+    vextractf128 [outq + %4 + (0 + 0 + %1)*mmsize + %6 +  0], m4,  0
+    vextractf128 [outq + %4 + (0 + 0 + %1)*mmsize + %6 + 16], m12, 0
+    vextractf128 [outq + %5 + (0 + 0 + %1)*mmsize + %6 +  0], m5,  0
+    vextractf128 [outq + %5 + (0 + 0 + %1)*mmsize + %6 + 16], m13, 0
+
+    vperm2f128 m10, m10, m0, 0x13
+    vperm2f128 m11, m11, m1, 0x13
+    vperm2f128 m12, m12, m4, 0x13
+    vperm2f128 m13, m13, m5, 0x13
+
+    movaps m8,     [rtabq + (1 + %2)*mmsize]
+    vperm2f128 m9, [itabq - (1 + %2)*mmsize], 0x23
+
+    movaps m0, [outq +      (0 + 1 + %1)*mmsize + %6]
+    movaps m2, [outq +      (2 + 1 + %1)*mmsize + %6]
+    movaps m1, [outq + %3 + (0 + 1 + %1)*mmsize + %6]
+    movaps m3, [outq + %3 + (2 + 1 + %1)*mmsize + %6]
+
+    movaps [outq +      (0 + 1 + %1)*mmsize + %6], m10 ; m0 conflict
+    movaps [outq + %3 + (0 + 1 + %1)*mmsize + %6], m11 ; m1 conflict
+
+    movaps m4, [outq + %4 + (0 + 1 + %1)*mmsize + %6]
+    movaps m6, [outq + %4 + (2 + 1 + %1)*mmsize + %6]
+    movaps m5, [outq + %5 + (0 + 1 + %1)*mmsize + %6]
+    movaps m7, [outq + %5 + (2 + 1 + %1)*mmsize + %6]
+
+    movaps [outq + %4 + (0 + 1 + %1)*mmsize + %6], m12 ; m4 conflict
+    movaps [outq + %5 + (0 + 1 + %1)*mmsize + %6], m13 ; m5 conflict
+
+    SPLIT_RADIX_COMBINE 0, m0, m1, m2, m3, \
+                           m4, m5, m6, m7, \
+                           m8, m9, \
+                           m10, m11, m12, m13, m14, m15 ; temporary registers
+
+    unpcklpd m8,  m0, m2
+    unpcklpd m9,  m1, m3
+    unpcklpd m10, m4, m6
+    unpcklpd m11, m5, m7
+    unpckhpd m0, m2
+    unpckhpd m1, m3
+    unpckhpd m4, m6
+    unpckhpd m5, m7
+
+    vextractf128 [outq +      (2 + 0 + %1)*mmsize + %6 +  0], m8,  0
+    vextractf128 [outq +      (2 + 0 + %1)*mmsize + %6 + 16], m0,  0
+    vextractf128 [outq +      (2 + 1 + %1)*mmsize + %6 +  0], m8,  1
+    vextractf128 [outq +      (2 + 1 + %1)*mmsize + %6 + 16], m0,  1
+
+    vextractf128 [outq + %3 + (2 + 0 + %1)*mmsize + %6 +  0], m9,  0
+    vextractf128 [outq + %3 + (2 + 0 + %1)*mmsize + %6 + 16], m1,  0
+    vextractf128 [outq + %3 + (2 + 1 + %1)*mmsize + %6 +  0], m9,  1
+    vextractf128 [outq + %3 + (2 + 1 + %1)*mmsize + %6 + 16], m1,  1
+
+    vextractf128 [outq + %4 + (2 + 0 + %1)*mmsize + %6 +  0], m10, 0
+    vextractf128 [outq + %4 + (2 + 0 + %1)*mmsize + %6 + 16], m4,  0
+    vextractf128 [outq + %4 + (2 + 1 + %1)*mmsize + %6 +  0], m10, 1
+    vextractf128 [outq + %4 + (2 + 1 + %1)*mmsize + %6 + 16], m4,  1
+
+    vextractf128 [outq + %5 + (2 + 0 + %1)*mmsize + %6 +  0], m11, 0
+    vextractf128 [outq + %5 + (2 + 0 + %1)*mmsize + %6 + 16], m5,  0
+    vextractf128 [outq + %5 + (2 + 1 + %1)*mmsize + %6 +  0], m11, 1
+    vextractf128 [outq + %5 + (2 + 1 + %1)*mmsize + %6 + 16], m5,  1
+%endmacro
+
+%macro SPLIT_RADIX_COMBINE_DEINTERLEAVE_FULL 3-4
+%if %0 > 3
+%define offset %4
+%define offset %4
+%else
+%define offset 0
+%define offset 0
+%endif
+    SPLIT_RADIX_COMBINE_DEINTERLEAVE_2 0, 0, %1, %2, %3, offset
+    SPLIT_RADIX_COMBINE_DEINTERLEAVE_2 4, 2, %1, %2, %3, offset
+%endmacro
+
+INIT_XMM sse3
+cglobal fft2_float, 4, 4, 2, ctx, out, in, stride
+    movaps m0, [inq]
+    FFT2 m0, m1
+    movaps [outq], m0
+    RET
+
+%macro FFT4 2
+INIT_XMM sse2
+cglobal fft4_ %+ %1 %+ _float, 4, 4, 3, ctx, out, in, stride
+    movaps m0, [inq + 0*mmsize]
+    movaps m1, [inq + 1*mmsize]
+
+%if %2
+    shufps m2, m1, m0, q3210
+    shufps m0, m1, q3210
+    movaps m1, m2
+%endif
+
+    FFT4 m0, m1, m2
+
+    unpcklpd m2, m0, m1
+    unpckhpd m0, m1
+
+    movaps [outq + 0*mmsize], m2
+    movaps [outq + 1*mmsize], m0
+
+    RET
+%endmacro
+
+FFT4 fwd, 0
+FFT4 inv, 1
+
+INIT_XMM sse3
+cglobal fft8_float, 4, 4, 6, ctx, out, in, tmp
+    mov ctxq, [ctxq + AVTXContext.revtab]
+
+    LOAD64_LUT m0, inq, ctxq, (mmsize/2)*0, tmpq
+    LOAD64_LUT m1, inq, ctxq, (mmsize/2)*1, tmpq
+    LOAD64_LUT m2, inq, ctxq, (mmsize/2)*2, tmpq
+    LOAD64_LUT m3, inq, ctxq, (mmsize/2)*3, tmpq
+
+    FFT8 m0, m1, m2, m3, m4, m5
+
+    unpcklpd m4, m0, m3
+    unpcklpd m5, m1, m2
+    unpckhpd m0, m3
+    unpckhpd m1, m2
+
+    movups [outq + 0*mmsize], m4
+    movups [outq + 1*mmsize], m0
+    movups [outq + 2*mmsize], m5
+    movups [outq + 3*mmsize], m1
+
+    RET
+
+INIT_YMM avx
+cglobal fft8_float, 4, 4, 4, ctx, out, in, tmp
+    mov ctxq, [ctxq + AVTXContext.revtab]
+
+    LOAD64_LUT m0, inq, ctxq, (mmsize/2)*0, tmpq, m2
+    LOAD64_LUT m1, inq, ctxq, (mmsize/2)*1, tmpq, m3
+
+    FFT8_AVX m0, m1, m2, m3
+
+    unpcklpd m2, m0, m1
+    unpckhpd m0, m1
+
+    ; Around 2% faster than 2x vperm2f128 + 2x movapd
+    vextractf128 [outq + 16*0], m2, 0
+    vextractf128 [outq + 16*1], m0, 0
+    vextractf128 [outq + 16*2], m2, 1
+    vextractf128 [outq + 16*3], m0, 1
+
+    RET
+
+%macro FFT16_FN 1
+INIT_YMM %1
+cglobal fft16_float, 4, 4, 8, ctx, out, in, tmp
+    mov ctxq, [ctxq + AVTXContext.revtab]
+
+    LOAD64_LUT m0, inq, ctxq, (mmsize/2)*0, tmpq, m4
+    LOAD64_LUT m1, inq, ctxq, (mmsize/2)*1, tmpq, m5
+    LOAD64_LUT m2, inq, ctxq, (mmsize/2)*2, tmpq, m6
+    LOAD64_LUT m3, inq, ctxq, (mmsize/2)*3, tmpq, m7
+
+    FFT16 m0, m1, m2, m3, m4, m5, m6, m7
+
+    unpcklpd m5, m1, m3
+    unpcklpd m4, m0, m2
+    unpckhpd m1, m3
+    unpckhpd m0, m2
+
+    vextractf128 [outq + 16*0], m4, 0
+    vextractf128 [outq + 16*1], m0, 0
+    vextractf128 [outq + 16*2], m4, 1
+    vextractf128 [outq + 16*3], m0, 1
+    vextractf128 [outq + 16*4], m5, 0
+    vextractf128 [outq + 16*5], m1, 0
+    vextractf128 [outq + 16*6], m5, 1
+    vextractf128 [outq + 16*7], m1, 1
+
+    RET
+%endmacro
+
+FFT16_FN avx
+FFT16_FN fma3
+
+%macro FFT32_FN 1
+INIT_YMM %1
+cglobal fft32_float, 4, 4, 16, ctx, out, in, tmp
+    mov ctxq, [ctxq + AVTXContext.revtab]
+
+    LOAD64_LUT m4, inq, ctxq, (mmsize/2)*4, tmpq,  m8,  m9
+    LOAD64_LUT m5, inq, ctxq, (mmsize/2)*5, tmpq, m10, m11
+    LOAD64_LUT m6, inq, ctxq, (mmsize/2)*6, tmpq, m12, m13
+    LOAD64_LUT m7, inq, ctxq, (mmsize/2)*7, tmpq, m14, m15
+
+    FFT8 m4, m5, m6, m7, m8, m9
+
+    LOAD64_LUT m0, inq, ctxq, (mmsize/2)*0, tmpq,  m8,  m9
+    LOAD64_LUT m1, inq, ctxq, (mmsize/2)*1, tmpq, m10, m11
+    LOAD64_LUT m2, inq, ctxq, (mmsize/2)*2, tmpq, m12, m13
+    LOAD64_LUT m3, inq, ctxq, (mmsize/2)*3, tmpq, m14, m15
+
+    movaps m8, [ff_cos_32_float]
+    vperm2f128 m9, [ff_cos_32_float + 4*8 - 4*7], 0x23
+
+    FFT16 m0, m1, m2, m3, m10, m11, m12, m13
+
+    SPLIT_RADIX_COMBINE 1, m0, m1, m2, m3, m4, m5, m6, m7, m8, m9, \
+                           m10, m11, m12, m13, m14, m15 ; temporary registers
+
+    unpcklpd  m9, m1, m3
+    unpcklpd m10, m5, m7
+    unpcklpd  m8, m0, m2
+    unpcklpd m11, m4, m6
+    unpckhpd  m1, m3
+    unpckhpd  m5, m7
+    unpckhpd  m0, m2
+    unpckhpd  m4, m6
+
+    vextractf128 [outq + 16* 0],  m8, 0
+    vextractf128 [outq + 16* 1],  m0, 0
+    vextractf128 [outq + 16* 2],  m8, 1
+    vextractf128 [outq + 16* 3],  m0, 1
+    vextractf128 [outq + 16* 4],  m9, 0
+    vextractf128 [outq + 16* 5],  m1, 0
+    vextractf128 [outq + 16* 6],  m9, 1
+    vextractf128 [outq + 16* 7],  m1, 1
+
+    vextractf128 [outq + 16* 8], m11, 0
+    vextractf128 [outq + 16* 9],  m4, 0
+    vextractf128 [outq + 16*10], m11, 1
+    vextractf128 [outq + 16*11],  m4, 1
+    vextractf128 [outq + 16*12], m10, 0
+    vextractf128 [outq + 16*13],  m5, 0
+    vextractf128 [outq + 16*14], m10, 1
+    vextractf128 [outq + 16*15],  m5, 1
+
+    RET
+%endmacro
+
+%if ARCH_X86_64
+FFT32_FN avx
+FFT32_FN fma3
+%endif
+
+%macro FFT_SPLIT_RADIX_DEF 1-2
+ALIGN 16
+.%1 %+ pt:
+    PUSH lenq
+    mov lenq, (%1/4)
+
+    add outq, (%1*4) - (%1/1)
+    call .32pt
+
+    add outq, (%1*2) - (%1/2) ; the synth loops also increment outq
+    call .32pt
+
+    POP lenq
+    sub outq, (%1*4) + (%1*2) + (%1/2)
+
+    mov rtabq, ff_cos_ %+ %1 %+ _float
+    lea itabq, [rtabq + %1 - 4*7]
+
+%if %0 > 1
+    cmp tgtq, %1
+    je .deinterleave
+
+    mov tmpq, (%1 >> 7)
+
+.synth_ %+ %1:
+    SPLIT_RADIX_LOAD_COMBINE_FULL %1
+    add outq, 8*mmsize
+    add rtabq, 4*mmsize
+    sub itabq, 4*mmsize
+    sub tmpq, 1
+    jg .synth_ %+ %1
+
+    cmp lenq, %1
+    jg %2 ; can't do math here, nasm doesn't get it
+    ret
+%endif
+%endmacro
+
+%macro FFT_SPLIT_RADIX_FN 1
+INIT_YMM %1
+cglobal split_radix_fft_float, 4, 8, 16, 272, lut, out, in, len, tmp, itab, rtab, tgt
+    mov lenq, [lutq + AVTXContext.m]
+    mov lutq, [lutq + AVTXContext.revtab]
+    mov tgtq, lenq
+
+; Bottom-most/32-point transform ===============================================
+ALIGN 16
+.32pt:
+    LOAD64_LUT m4, inq, lutq, (mmsize/2)*4, tmpq,  m8,  m9
+    LOAD64_LUT m5, inq, lutq, (mmsize/2)*5, tmpq, m10, m11
+    LOAD64_LUT m6, inq, lutq, (mmsize/2)*6, tmpq, m12, m13
+    LOAD64_LUT m7, inq, lutq, (mmsize/2)*7, tmpq, m14, m15
+
+    FFT8 m4, m5, m6, m7, m8, m9
+
+    LOAD64_LUT m0, inq, lutq, (mmsize/2)*0, tmpq,  m8,  m9
+    LOAD64_LUT m1, inq, lutq, (mmsize/2)*1, tmpq, m10, m11
+    LOAD64_LUT m2, inq, lutq, (mmsize/2)*2, tmpq, m12, m13
+    LOAD64_LUT m3, inq, lutq, (mmsize/2)*3, tmpq, m14, m15
+
+    movaps m8, [ff_cos_32_float]
+    vperm2f128 m9, [ff_cos_32_float + 32 - 4*7], 0x23
+
+    FFT16 m0, m1, m2, m3, m10, m11, m12, m13
+
+    SPLIT_RADIX_COMBINE 1, m0, m1, m2, m3, m4, m5, m6, m7, m8, m9, \
+                           m10, m11, m12, m13, m14, m15 ; temporary registers
+
+    movaps [outq + 1*mmsize], m1
+    movaps [outq + 3*mmsize], m3
+    movaps [outq + 5*mmsize], m5
+    movaps [outq + 7*mmsize], m7
+
+    add lutq, (mmsize/2)*8
+    cmp lenq, 32
+    jg .64pt
+
+    movaps [outq + 0*mmsize], m0
+    movaps [outq + 2*mmsize], m2
+    movaps [outq + 4*mmsize], m4
+    movaps [outq + 6*mmsize], m6
+
+    ret
+
+; 64-point transform ===========================================================
+ALIGN 16
+.64pt:
+; Helper defines, these make it easier to track what's happening
+%define tx1_e0 m4
+%define tx1_e1 m5
+%define tx1_o0 m6
+%define tx1_o1 m7
+%define tx2_e0 m8
+%define tx2_e1 m9
+%define tx2_o0 m10
+%define tx2_o1 m11
+%define tw_e m12
+%define tw_o m13
+%define tmp1 m14
+%define tmp2 m15
+
+    SWAP m4, m1
+    SWAP m6, m3
+
+    LOAD64_LUT tx1_e0, inq, lutq, (mmsize/2)*0, tmpq, tw_e, tw_o
+    LOAD64_LUT tx1_e1, inq, lutq, (mmsize/2)*1, tmpq, tmp1, tmp2
+    LOAD64_LUT tx1_o0, inq, lutq, (mmsize/2)*2, tmpq, tw_e, tw_o
+    LOAD64_LUT tx1_o1, inq, lutq, (mmsize/2)*3, tmpq, tmp1, tmp2
+
+    FFT16 tx1_e0, tx1_e1, tx1_o0, tx1_o1, tw_e, tw_o, tx2_o0, tx2_o1
+
+    LOAD64_LUT tx2_e0, inq, lutq, (mmsize/2)*4, tmpq, tmp1, tmp2
+    LOAD64_LUT tx2_e1, inq, lutq, (mmsize/2)*5, tmpq, tw_e, tw_o
+    LOAD64_LUT tx2_o0, inq, lutq, (mmsize/2)*6, tmpq, tmp1, tmp2
+    LOAD64_LUT tx2_o1, inq, lutq, (mmsize/2)*7, tmpq, tw_e, tw_o
+
+    movaps tw_e, [ff_cos_64_float]
+    vperm2f128 tw_o, [ff_cos_64_float + 64 - 4*7], 0x23
+
+    FFT16 tx2_e0, tx2_e1, tx2_o0, tx2_o1, tmp1, tmp2
+
+    add lutq, (mmsize/2)*8
+    cmp tgtq, 64
+    je .deinterleave
+
+    SPLIT_RADIX_COMBINE_64
+
+    cmp lenq, 64
+    jg .128pt
+    ret
+
+; 128-point transform ==========================================================
+ALIGN 16
+.128pt:
+    PUSH lenq
+    mov lenq, 32
+
+    add outq, 16*mmsize
+    call .32pt
+
+    add outq, 8*mmsize
+    call .32pt
+
+    POP lenq
+    sub outq, 24*mmsize
+
+    mov rtabq, ff_cos_128_float
+    lea itabq, [rtabq + 128 - 4*7]
+
+    cmp tgtq, 128
+    je .deinterleave
+
+    SPLIT_RADIX_LOAD_COMBINE_FULL 128
+
+    cmp lenq, 128
+    jg .256pt
+    ret
+
+; 256-point transform ==========================================================
+ALIGN 16
+.256pt:
+    PUSH lenq
+    mov lenq, 64
+
+    add outq, 32*mmsize
+    call .32pt
+
+    add outq, 16*mmsize
+    call .32pt
+
+    POP lenq
+    sub outq, 48*mmsize
+
+    mov rtabq, ff_cos_256_float
+    lea itabq, [rtabq + 256 - 4*7]
+
+    cmp tgtq, 256
+    je .deinterleave
+
+    SPLIT_RADIX_LOAD_COMBINE_FULL 256
+    SPLIT_RADIX_LOAD_COMBINE_FULL 256, 8*mmsize, 4*mmsize
+
+    cmp lenq, 256
+    jg .512pt
+    ret
+
+; 512-point transform ==========================================================
+ALIGN 16
+.512pt:
+    PUSH lenq
+    mov lenq, 128
+
+    add outq, 64*mmsize
+    call .32pt
+
+    add outq, 32*mmsize
+    call .32pt
+
+    POP lenq
+    sub outq, 96*mmsize
+
+    mov rtabq, ff_cos_512_float
+    lea itabq, [rtabq + 512 - 4*7]
+
+    cmp tgtq, 512
+    je .deinterleave
+
+    mov tmpq, 4
+
+.synth_512:
+    SPLIT_RADIX_LOAD_COMBINE_FULL 512
+    add outq, 8*mmsize
+    add rtabq, 4*mmsize
+    sub itabq, 4*mmsize
+    sub tmpq, 1
+    jg .synth_512
+
+    cmp lenq, 512
+    jg .1024pt
+    ret
+
+; 1024-point transform =========================================================
+ALIGN 16
+.1024pt:
+    PUSH lenq
+    mov lenq, 256
+
+    add outq, 96*mmsize
+    call .32pt
+
+    add outq, 64*mmsize
+    call .32pt
+
+    POP lenq
+    sub outq, 192*mmsize
+
+    mov rtabq, ff_cos_1024_float
+    lea itabq, [rtabq + 1024 - 4*7]
+
+    cmp tgtq, 1024
+    je .deinterleave
+
+    mov tmpq, 8
+
+.synth_1024:
+    SPLIT_RADIX_LOAD_COMBINE_FULL 1024
+    add outq, 8*mmsize
+    add rtabq, 4*mmsize
+    sub itabq, 4*mmsize
+    sub tmpq, 1
+    jg .synth_1024
+
+    cmp lenq, 1024
+    jg .2048pt
+    ret
+
+; 2048 to 131072-point transforms ==============================================
+FFT_SPLIT_RADIX_DEF 2048,  .4096pt
+FFT_SPLIT_RADIX_DEF 4096,  .8192pt
+FFT_SPLIT_RADIX_DEF 8192,  .16384pt
+FFT_SPLIT_RADIX_DEF 16384, .32768pt
+FFT_SPLIT_RADIX_DEF 32768, .65536pt
+FFT_SPLIT_RADIX_DEF 65536, .131072pt
+FFT_SPLIT_RADIX_DEF 131072
+
+;===============================================================================
+; Final synthesis + deinterleaving code
+;===============================================================================
+.deinterleave:
+    cmp lenq, 64
+    je .64pt_deint
+
+    mov tmpq, lenq
+    shr tmpq, 7
+
+    imul lenq, 2
+    imul tgtq, lenq, 2
+    lea lutq, [lenq + tgtq]
+
+.synth_deinterleave:
+    SPLIT_RADIX_COMBINE_DEINTERLEAVE_FULL lenq, tgtq, lutq
+    add outq, 8*mmsize
+    add rtabq, 4*mmsize
+    sub itabq, 4*mmsize
+    sub tmpq, 1
+    jg .synth_deinterleave
+
+    RET
+
+; 64-point deinterleave which only has to load 4 registers =====================
+.64pt_deint:
+    SPLIT_RADIX_COMBINE_LITE 1, m0, m1, tx1_e0, tx2_e0, tw_e, tw_o, tmp1, tmp2
+    SPLIT_RADIX_COMBINE_HALF 0, m2, m3, tx1_o0, tx2_o0, tw_e, tw_o, tmp1, tmp2, tw_e
+
+    unpcklpd tmp1, m0, m2
+    unpcklpd tmp2, m1, m3
+    unpcklpd tw_o, tx1_e0, tx1_o0
+    unpcklpd tw_e, tx2_e0, tx2_o0
+    unpckhpd m0, m2
+    unpckhpd m1, m3
+    unpckhpd tx1_e0, tx1_o0
+    unpckhpd tx2_e0, tx2_o0
+
+    vextractf128 [outq +  0*mmsize +  0], tmp1,   0
+    vextractf128 [outq +  0*mmsize + 16], m0,     0
+    vextractf128 [outq +  4*mmsize +  0], tmp2,   0
+    vextractf128 [outq +  4*mmsize + 16], m1,     0
+
+    vextractf128 [outq +  8*mmsize +  0], tw_o,   0
+    vextractf128 [outq +  8*mmsize + 16], tx1_e0, 0
+    vextractf128 [outq +  9*mmsize +  0], tw_o,   1
+    vextractf128 [outq +  9*mmsize + 16], tx1_e0, 1
+
+    vperm2f128 tmp1, tmp1, m0, 0x31
+    vperm2f128 tmp2, tmp2, m1, 0x31
+
+    vextractf128 [outq + 12*mmsize +  0], tw_e,   0
+    vextractf128 [outq + 12*mmsize + 16], tx2_e0, 0
+    vextractf128 [outq + 13*mmsize +  0], tw_e,   1
+    vextractf128 [outq + 13*mmsize + 16], tx2_e0, 1
+
+    movaps tw_e, [ff_cos_64_float + mmsize]
+    vperm2f128 tw_o, [ff_cos_64_float + 64 - 4*7 - mmsize], 0x23
+
+    movaps m0, [outq +  1*mmsize]
+    movaps m1, [outq +  3*mmsize]
+    movaps m2, [outq +  5*mmsize]
+    movaps m3, [outq +  7*mmsize]
+
+    movaps [outq +  1*mmsize], tmp1
+    movaps [outq +  5*mmsize], tmp2
+
+    SPLIT_RADIX_COMBINE 0, m0, m2, m1, m3, tx1_e1, tx2_e1, tx1_o1, tx2_o1, tw_e, tw_o, \
+                           tmp1, tmp2, tx2_o0, tx1_o0, tx2_e0, tx1_e0 ; temporary registers
+
+    unpcklpd tmp1, m0, m1
+    unpcklpd tmp2, m2, m3
+    unpcklpd tw_e, tx1_e1, tx1_o1
+    unpcklpd tw_o, tx2_e1, tx2_o1
+    unpckhpd m0, m1
+    unpckhpd m2, m3
+    unpckhpd tx1_e1, tx1_o1
+    unpckhpd tx2_e1, tx2_o1
+
+    vextractf128 [outq +  2*mmsize +  0], tmp1,   0
+    vextractf128 [outq +  2*mmsize + 16], m0,     0
+    vextractf128 [outq +  3*mmsize +  0], tmp1,   1
+    vextractf128 [outq +  3*mmsize + 16], m0,     1
+
+    vextractf128 [outq +  6*mmsize +  0], tmp2,   0
+    vextractf128 [outq +  6*mmsize + 16], m2,     0
+    vextractf128 [outq +  7*mmsize +  0], tmp2,   1
+    vextractf128 [outq +  7*mmsize + 16], m2,     1
+
+    vextractf128 [outq + 10*mmsize +  0], tw_e,   0
+    vextractf128 [outq + 10*mmsize + 16], tx1_e1, 0
+    vextractf128 [outq + 11*mmsize +  0], tw_e,   1
+    vextractf128 [outq + 11*mmsize + 16], tx1_e1, 1
+
+    vextractf128 [outq + 14*mmsize +  0], tw_o,   0
+    vextractf128 [outq + 14*mmsize + 16], tx2_e1, 0
+    vextractf128 [outq + 15*mmsize +  0], tw_o,   1
+    vextractf128 [outq + 15*mmsize + 16], tx2_e1, 1
+
+    RET
+%endmacro
+
+%if ARCH_X86_64
+FFT_SPLIT_RADIX_FN avx
+FFT_SPLIT_RADIX_FN fma3
+%endif
diff --git a/libavutil/x86/tx_float_init.c b/libavutil/x86/tx_float_init.c
new file mode 100644
index 0000000000..993933317c
--- /dev/null
+++ b/libavutil/x86/tx_float_init.c
@@ -0,0 +1,101 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#define TX_FLOAT
+#include "libavutil/tx_priv.h"
+#include "libavutil/attributes.h"
+#include "libavutil/x86/cpu.h"
+
+void ff_fft2_float_sse3     (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_fft4_inv_float_sse2 (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_fft4_fwd_float_sse2 (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_fft8_float_sse3     (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_fft8_float_avx      (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_fft16_float_avx     (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_fft16_float_fma3    (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_fft32_float_avx     (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_fft32_float_fma3    (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+
+void ff_split_radix_fft_float_avx (AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+void ff_split_radix_fft_float_fma3(AVTXContext *s, void *out, void *in, ptrdiff_t stride);
+
+av_cold void ff_tx_init_float_x86(AVTXContext *s, av_tx_fn *tx)
+{
+    int cpu_flags = av_get_cpu_flags();
+    int gen_revtab = 0, basis, revtab_interleave;
+
+    if (s->flags & AV_TX_UNALIGNED)
+        return;
+
+    if (ff_tx_type_is_mdct(s->type))
+        return;
+
+#define TXFN(fn, gentab, sr_basis, interleave) \
+    do {                                       \
+        *tx = fn;                              \
+        gen_revtab = gentab;                   \
+        basis = sr_basis;                      \
+        revtab_interleave = interleave;        \
+    } while (0)
+
+    if (s->n == 1) {
+        if (EXTERNAL_SSE2(cpu_flags)) {
+            if (s->m == 4 && s->inv)
+                TXFN(ff_fft4_inv_float_sse2, 0, 0, 0);
+            else if (s->m == 4)
+                TXFN(ff_fft4_fwd_float_sse2, 0, 0, 0);
+        }
+
+        if (EXTERNAL_SSE3(cpu_flags)) {
+            if (s->m == 2)
+                TXFN(ff_fft2_float_sse3, 0, 0, 0);
+            else if (s->m == 8)
+                TXFN(ff_fft8_float_sse3, 1, 8, 0);
+        }
+
+        if (EXTERNAL_AVX_FAST(cpu_flags)) {
+            if (s->m == 8)
+                TXFN(ff_fft8_float_avx, 1, 8, 0);
+            else if (s->m == 16)
+                TXFN(ff_fft16_float_avx, 1, 8, 2);
+#if ARCH_X86_64
+            else if (s->m == 32)
+                TXFN(ff_fft32_float_avx, 1, 8, 2);
+            else if (s->m >= 64 && s->m <= 131072 && !(s->flags & AV_TX_INPLACE))
+                TXFN(ff_split_radix_fft_float_avx, 1, 8, 2);
+#endif
+        }
+
+        if (EXTERNAL_FMA3_FAST(cpu_flags)) {
+            if (s->m == 16)
+                TXFN(ff_fft16_float_fma3, 1, 8, 2);
+#if ARCH_X86_64
+            else if (s->m == 32)
+                TXFN(ff_fft32_float_fma3, 1, 8, 2);
+            else if (s->m >= 64 && s->m <= 131072 && !(s->flags & AV_TX_INPLACE))
+                TXFN(ff_split_radix_fft_float_fma3, 1, 8, 2);
+#endif
+        }
+    }
+
+    if (gen_revtab)
+        ff_tx_gen_split_radix_parity_revtab(s->revtab, s->m, s->inv, basis,
+                                            revtab_interleave);
+
+#undef TXFN
+}