From patchwork Thu Jan 18 18:52:56 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Lynne <dev@lynne.ee>
X-Patchwork-Id: 45656
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a05:6a20:968f:b0:199:de12:6fa6 with SMTP id
 hp15csp467172pzc;
        Thu, 18 Jan 2024 10:53:09 -0800 (PST)
X-Google-Smtp-Source: 
 AGHT+IHXoZaHJPWjr3ohqHf8s2RzRBj2fbMhwyAmRuO5biK2yEeRd0BaCOjYnEVTMU205aOMDvBz
X-Received: by 2002:a2e:6e07:0:b0:2cc:d663:2e49 with SMTP id
 j7-20020a2e6e07000000b002ccd6632e49mr19033ljc.4.1705603988833;
        Thu, 18 Jan 2024 10:53:08 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1705603988; cv=none;
        d=google.com; s=arc-20160816;
        b=BShvW+xQKMILeMuU4IdY6KANM8//svOPMwO2jBf6Bd8mTFIApdBf66PiTEssTy9uNg
         fEN/hzOSApaPS05jUgdmptdDNGV/UHXZLrRCUoPS+weAWtxOup9/KYUMDgLeWwSWA2H8
         HYbEf51z9EBj1wCQnhJwaIlApDQ2sIEqtkxE9Xpz1+q5GJNjo7+QUn9jmAH7pZ4V2bJj
         WrUW/St4rO6eq0W0CcYsrXYiYMNoYmqSj+Cr/527Yfu7pfMtU+jzQvnVApLyE6lHN97X
         /rG9nw70Xd91xj+4ROnQFMkLz1k0kTjAHqW/c7aALr6Zss5MbZSlCk3O8X/34AZies6m
         xWew==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:reply-to:list-subscribe:list-help:list-post
         :list-archive:list-unsubscribe:list-id:precedence:subject
         :mime-version:message-id:to:from:date:dkim-signature:delivered-to;
        bh=1axfEliIfur999UviNI9fRS58i1mEKM5IoEEeMWJ1qE=;
        fh=Q46kXK7oI5D1Jhi90JBr53c7NIaTxGaU4KPeRZyM/hI=;
        b=BSzOIqbCTsryuGyNx1RKbiXnuRseQn2t6oxRvkdoix9SrulFme8+OViAIhyXvuKm5l
         IYYCCdrPtvy7WIh5IehR1oyyuc5R3Eq9id2jyqUHiKwdqYYGrOCcGIwV6EJzNfv7yx3m
         q+PQqvOp0gucGtGozZZJomdXF6mnXzKBHM40oPo0o83vgcs0F7MePgLD7AW6eEC1ngwJ
         PUPfixmMz1dcIBf7OkgbfAdGb8Nx8I1DtTj4EE3SXDMPllECnq2yoHU2BuNl4dmjr9bA
         ceS7lt2iQI814noFCPZEsQOyT1+k8WRCjB3US8va6iYBy7Jpkfa2UWP7heB4IH8BE0MN
         lnlw==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b=Hw3vZela;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 k7-20020a17090627c700b00a28d3917ff5si6755593ejc.650.2024.01.18.10.53.08;
        Thu, 18 Jan 2024 10:53:08 -0800 (PST)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b=Hw3vZela;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 3FBC668D032;
	Thu, 18 Jan 2024 20:53:04 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from w4.tutanota.de (w4.tutanota.de [81.3.6.165])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 78C60689245
 for <ffmpeg-devel@ffmpeg.org>; Thu, 18 Jan 2024 20:52:57 +0200 (EET)
Received: from tutadb.w10.tutanota.de (unknown [192.168.1.10])
 by w4.tutanota.de (Postfix) with ESMTP id C058B106029E
 for <ffmpeg-devel@ffmpeg.org>; Thu, 18 Jan 2024 18:52:56 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; t=1705603976;
 s=s1; d=lynne.ee;
 h=From:From:To:To:Subject:Subject:Content-Description:Content-ID:Content-Type:Content-Type:Content-Transfer-Encoding:Cc:Date:Date:In-Reply-To:MIME-Version:MIME-Version:Message-ID:Message-ID:Reply-To:References:Sender;
 bh=TjrA1+CSOEfzJEczTDiPEtHGMdVvgtMSTCYl1f9O154=;
 b=Hw3vZelafWIv/2P+Iuepe1DpJLXiigH4WpTzprk3pU2cOGnWmj4UkWR9SYohR+mW
 Jd9ZQdBEeHfvBLgn3nwVepwm6/APsZ58AC2bPT6UIRtL8FII8pkbPJJ6hH6kWUgqOe/
 yNK40ylD9PHVufJvVMF1UtBe5wTMxSLu39yNs9HRfPml+ltignnYoOf3nFrMz0h5IUP
 WnpwtzQ4vHswKQHwzoaol7goA4n4kXhDo5q9TXsGx3fonoh+2Hg8ARWrAP8ipwao6a5
 oRHadlUt1PZ/iu3vSZF96CtBiiqscotP3ZC+CFiw9uMW4ZrBLW8G85iaTxPQgdR0LFw
 CsHxzMFrWg==
Date: Thu, 18 Jan 2024 19:52:56 +0100 (CET)
From: Lynne <dev@lynne.ee>
To: Ffmpeg Devel <ffmpeg-devel@ffmpeg.org>
Message-ID: <NoSuodr--3-9@lynne.ee>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH] x86/tx_float: AVX2 SIMD for R2C and C2R RDFTs
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: p6VUFEOyt3tt

Adds full AVX2 assembly for the R2C and C2R RDFT transforms. In the av_tx
timings below, R2C is roughly 2.6x faster and C2R roughly 2.8x faster.

R2C Before:
145370 decicycles in           av_tx (r2c),  131072 runs,      0 skips
R2C After:
56897 decicycles in           av_tx (r2c),  131072 runs,      0 skips

C2R Before:
140958 decicycles in           av_tx (c2r),  131071 runs,      1 skips
C2R After:
50427 decicycles in           av_tx (c2r),  131061 runs,     11 skips

C2R scatters its output through the FFT's lookup table, so the FFT
then runs in-place.
R2C still uses a separate pre-shuffle pass. It could be made a little
faster by adding an assembly-only version of the regular lookup-enabled
FFT, though in theory that may only help for very large transforms.
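
For readers unfamiliar with the two map directions, here is a minimal C
sketch of a gather versus a scatter permutation over a plain index map.
The names cplx, permute_gather and permute_scatter are hypothetical and
not part of libavutil; this only illustrates the idea and is not the code
path the assembly takes.

```c
#include <stddef.h>

/* Hypothetical stand-in for a complex float sample. */
typedef struct { float re, im; } cplx;

/* Gather: output slot i pulls its value from input slot map[i]. */
static void permute_gather(cplx *dst, const cplx *src,
                           const int *map, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[map[i]];
}

/* Scatter: input slot i pushes its value to output slot map[i]. */
static void permute_scatter(cplx *dst, const cplx *src,
                            const int *map, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[map[i]] = src[i];
}
```

In these terms, the R2C .preshuf loop behaves like a gather into the
output buffer, while C2R writes each converted value directly to its
mapped slot, so no separate permutation pass is needed before the FFT.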

From f5281404f5789498b854d33f808c133820540281 Mon Sep 17 00:00:00 2001
From: Lynne <dev@lynne.ee>
Date: Mon, 11 Dec 2023 13:29:21 +0100
Subject: [PATCH] x86/tx_float: AVX2 SIMD for R2C and C2R RDFTs

R2C Before:
145370 decicycles in           av_tx (r2c),  131072 runs,      0 skips
R2C After:
56897 decicycles in           av_tx (r2c),  131072 runs,      0 skips

C2R Before:
140958 decicycles in           av_tx (c2r),  131071 runs,      1 skips
C2R After:
50427 decicycles in           av_tx (c2r),  131061 runs,     11 skips

C2R scatters its output through the FFT's lookup table, so the FFT
then runs in-place.
R2C could be made a little faster by adding an assembly-only
version of the regular lookup-enabled FFT.

This also adds a small optimization to the C version (an interleaved
cos/sin twiddle table) and brings it more in line with what the
assembly does.
---
 libavutil/tx_template.c       |  20 ++-
 libavutil/x86/tx_float.asm    | 226 ++++++++++++++++++++++++++++++++++
 libavutil/x86/tx_float_init.c |  65 ++++++++++
 tests/checkasm/av_tx.c        |   7 ++
 4 files changed, 306 insertions(+), 12 deletions(-)
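
Note on the table layout change in tx_template.c: a minimal C sketch of
moving from split cos/sin tables to one interleaved table, and of reading
a twiddle pair back for a complex multiply. The helper names
interleave_twiddles and cmul_twiddle are hypothetical, and the multiply is
the ordinary complex product, assumed here to match what CMUL computes.

```c
#include <stddef.h>

/* Build an interleaved table: exp_tab[2*i] = cos, exp_tab[2*i + 1] = sin.
 * exp_tab must have room for 2*n floats. */
static void interleave_twiddles(float *exp_tab, const float *tcos,
                                const float *tsin, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        exp_tab[2*i + 0] = tcos[i];
        exp_tab[2*i + 1] = tsin[i];
    }
}

/* Multiply (are + j*aim) by twiddle i; both factors come from one
 * contiguous pair, which is what a single vector load can pick up. */
static void cmul_twiddle(float *dre, float *dim, float are, float aim,
                         const float *exp_tab, size_t i)
{
    const float c = exp_tab[2*i + 0];
    const float s = exp_tab[2*i + 1];

    *dre = are * c - aim * s;
    *dim = are * s + aim * c;
}
```

With the pair stored contiguously, the loop touches one stream of memory
instead of two, which is the layout the assembly loads at expq.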

diff --git a/libavutil/tx_template.c b/libavutil/tx_template.c
index a2c27465cb..08a0243a2e 100644
--- a/libavutil/tx_template.c
+++ b/libavutil/tx_template.c
@@ -1647,13 +1647,10 @@ static av_cold int TX_NAME(ff_tx_rdft_init)(AVTXContext *s,
     *tab++ = RESCALE( (0.5 - inv) * m);
     *tab++ = RESCALE(-(0.5 - inv) * m);
 
-    for (int i = 0; i < len4; i++)
+    for (int i = 0; i < len4; i++) {
         *tab++ = RESCALE(cos(i*f));
-
-    tab = ((TXSample *)s->exp) + len4 + 8;
-
-    for (int i = 0; i < len4; i++)
         *tab++ = RESCALE(cos(((len - i*4)/4.0)*f)) * (inv ? 1 : -1);
+    }
 
     return 0;
 }
@@ -1665,8 +1662,7 @@ static void TX_NAME(ff_tx_rdft_ ##n)(AVTXContext *s, void *_dst,               \
     const int len2 = s->len >> 1;                                              \
     const int len4 = s->len >> 2;                                              \
     const TXSample *fact = (void *)s->exp;                                     \
-    const TXSample *tcos = fact + 8;                                           \
-    const TXSample *tsin = tcos + len4;                                        \
+    const TXSample *exp = fact + 8;                                            \
     TXComplex *data = inv ? _src : _dst;                                       \
     TXComplex t[3];                                                            \
                                                                                \
@@ -1688,18 +1684,18 @@ static void TX_NAME(ff_tx_rdft_ ##n)(AVTXContext *s, void *_dst,               \
                                                                                \
     for (int i = 1; i < len4; i++) {                                           \
         /* Separate even and odd FFTs */                                       \
-        t[0].re = MULT(fact[4], (data[i].re + data[len2 - i].re));             \
         t[0].im = MULT(fact[5], (data[i].im - data[len2 - i].im));             \
-        t[1].re = MULT(fact[6], (data[i].im + data[len2 - i].im));             \
+        t[0].re = MULT(fact[4], (data[i].re + data[len2 - i].re));             \
         t[1].im = MULT(fact[7], (data[i].re - data[len2 - i].re));             \
+        t[1].re = MULT(fact[6], (data[i].im + data[len2 - i].im));             \
                                                                                \
         /* Apply twiddle factors to the odd FFT and add to the even FFT */     \
-        CMUL(t[2].re, t[2].im, t[1].re, t[1].im, tcos[i], tsin[i]);            \
+        CMUL(t[2].re, t[2].im, t[1].re, t[1].im, exp[i*2 + 0], exp[i*2 + 1]);  \
                                                                                \
-        data[       i].re = t[0].re + t[2].re;                                 \
+        data[       i].re = t[2].re + t[0].re;                                 \
         data[       i].im = t[2].im - t[0].im;                                 \
         data[len2 - i].re = t[0].re - t[2].re;                                 \
-        data[len2 - i].im = t[2].im + t[0].im;                                 \
+        data[len2 - i].im = t[0].im + t[2].im;                                 \
     }                                                                          \
                                                                                \
     if (inv) {                                                                 \
diff --git a/libavutil/x86/tx_float.asm b/libavutil/x86/tx_float.asm
index e1533a8595..11d5e946db 100644
--- a/libavutil/x86/tx_float.asm
+++ b/libavutil/x86/tx_float.asm
@@ -91,6 +91,11 @@ s16_perm:      dd 0, 1, 2, 3, 1, 0, 3, 2
 
 s15_perm:      dd 0, 6, 5, 3, 2, 4, 7, 1
 
+rdft_perm_pos: dd 3, 2, 2, 3, 1, 0, 0, 1
+rdft_perm_neg: dd 1, 0, 0, 1, 3, 2, 2, 3
+rdft_perm_exp: dd 1, 1, 0, 0, 3, 3, 2, 2
+rdft_m11:      times 2 dd 0x0, 0x0, 0x3f800000, 0x3f800000 ; 0, 0, 1, 1
+
 mask_mmppmmmm: dd NEG, NEG, POS, POS, NEG, NEG, NEG, NEG
 mask_mmmmpppm: dd NEG, NEG, NEG, NEG, POS, POS, POS, NEG
 mask_ppmpmmpm: dd POS, POS, NEG, POS, NEG, NEG, POS, NEG
@@ -1934,3 +1939,224 @@ cglobal fft_pfa_15xM_ns_float, 4, 14, 16, 320, ctx, out, in, stride, len, lut, b
 PFA_15_FN avx2, 0
 PFA_15_FN avx2, 1
 %endif
+
+%macro RDFT_CONV_LOAD 0
+    movaps m10, [rdft_perm_neg]
+    movaps m11, [rdft_perm_pos]
+    vbroadcastf128 m12, [expq + 4*4]                ; fact[5476]
+    movaps m13, [mask_pmmppmmp]                     ; +--+
+
+    movaps m14, [rdft_m11]                          ; 0.0, 0.0, 1.0, 1.0
+    movaps m15, [rdft_perm_exp]
+%endmacro
+
+; %1 - source, front
+; %2 - source, rear
+; Results are left in m0 (front) and m2 (rear)
+%macro RDFT_CONV_ITER 2
+    movups m8, [expq + (8 + 2)*4]
+
+    vperm2f128 m4, m8, m8, 0x00                     ; cos,sin,cos,sin x2
+    vperm2f128 m6, m8, m8, 0x11                     ; cos,sin,cos,sin x2
+
+    vpermilps m5, m4, m15                           ; sin1,sin1,cos1,cos1,sin2,sin2,cos2,cos2
+    vpermilps m7, m6, m15                           ; sin1,sin1,cos1,cos1,sin2,sin2,cos2,cos2
+
+    shufpd m4, m14, m5, 1111b                       ; 1,1,cos1,cos1,1,1,cos2,cos2
+    shufpd m5, m14, m5, 0000b                       ; 0,0,sin1,sin1,0,0,sin2,sin2
+    shufpd m6, m14, m7, 1111b                       ; 1,1,cos1,cos1,1,1,cos2,cos2
+    shufpd m7, m14, m7, 0000b                       ; 0,0,sin1,sin1,0,0,sin2,sin2
+
+    movups m2, [%1q]
+    movups m3, [%2q]
+
+    vperm2f128 m0, m2, m2, 0x00
+    vperm2f128 m1, m3, m3, 0x11
+    vperm2f128 m2, m2, m2, 0x11
+    vperm2f128 m3, m3, m3, 0x00
+
+    vpermilps m0, m0, m10                           ; data[0].imrereim, data[1].imrereim
+    vpermilps m1, m1, m11                           ; data[len - 01].imrereim
+    vpermilps m2, m2, m10                           ; data[0].imrereim, data[1].imrereim
+    vpermilps m3, m3, m11                           ; data[len - 01].imrereim
+
+    addsubps m0, m1                                 ; data[0] - data[len - 0] x2
+    addsubps m2, m3                                 ; data[0] - data[len - 0] x2
+
+    mulps m0, m12                                   ; t[01].imre
+    mulps m2, m12                                   ; t[01].imre
+
+    shufps m1, m0, m0, q2301                        ; t[01].reim
+    shufps m3, m2, m2, q2301                        ; t[01].reim
+
+    mulps m1, m4                                    ; 1, 1, tcos, tcos x2
+    mulps m0, m5                                    ; 0, 0, tsin, tsin x2
+    mulps m3, m6                                    ; 1, 1, tcos, tcos x2
+    mulps m2, m7                                    ; 0, 0, tsin, tsin x2
+
+    addsubps m1, m0                                 ; t[02].reim
+    addsubps m3, m2                                 ; t[02].reim
+
+    shufpd m0, m1, m1, 0101b                        ; t[20].reim
+    shufpd m2, m3, m3, 0101b                        ; t[20].reim
+
+    xorps m1, m13                                   ; +--+t[02].reim
+    xorps m3, m13                                   ; +--+t[02].reim
+
+    addps m0, m1                                    ; data[0].reim, data[len2 - 0].reim x2
+    addps m2, m3                                    ; data[0].reim, data[len2 - 0].reim x2
+
+    shufpd m1, m0, m2, 0000b                        ; high
+    shufpd m3, m0, m2, 1111b                        ; low
+
+    vpermpd m0, m1, q3120
+    vpermpd m2, m3, q0213
+%endmacro
+
+%macro RDFT_R2C 1
+INIT_YMM %1
+cglobal rdft_r2c_float, 4, 14, 16, 320, ctx, out, in, stride, len, lut, exp, t1, t2, t3, \
+                                        t4, t5, btmp
+    ; FFT setup
+    mov btmpq, ctxq                                 ; backup original context
+    mov t3q, [ctxq + AVTXContext.fn]                ; subtransform's jump point
+
+    mov ctxq, [ctxq + AVTXContext.sub]              ; load subtransform's context
+    mov lutq, [ctxq + AVTXContext.map]              ; load subtransform's map
+    movsxd lenq, dword [ctxq + AVTXContext.len]     ; load subtransform's length
+
+    mov expq, outq
+.preshuf:
+    LOAD64_LUT m0, inq, lutq, 0, t4q, m1, m2
+    movaps [outq], m0
+    add outq, mmsize
+    add lutq, (mmsize/2)
+    sub lenq, (mmsize/8)
+    jg .preshuf
+
+    mov outq, expq
+    mov inq, expq
+    movsxd lenq, dword [ctxq + AVTXContext.len]     ; load subtransform's length
+
+    call t3q                                        ; call the FFT
+
+    mov ctxq, btmpq                                 ; restore original context
+
+    movsxd lenq, dword [ctxq + AVTXContext.len]
+    mov expq, [ctxq + AVTXContext.exp]
+
+    movsd  xm0, [outq]                              ; data[0].reim
+    movhps xm0, [outq + lenq*2]                     ; data[len4].reim
+
+    shufps xm1, xm0, xm0, q2301                     ; data[0].imre, junk
+    addsubps xm2, xm1, xm0                          ; t[0].imre, junk
+    shufps xm1, xm2, xm1, q2301                     ; t[0].reim, data[len4].reim
+    mulps xm9, xm1, [expq]                          ; data[0,len4].reim
+
+    mov inq, outq
+    lea t1q, [outq + lenq*4 - mmsize]
+    mov t2q, lenq
+    add outq, 8
+
+    ; Perform in-place RDFT conversion
+    RDFT_CONV_LOAD
+.loop:
+    RDFT_CONV_ITER out, t1
+    movups [outq], m0
+    movups [t1q],  m2
+
+    add expq, mmsize
+    add outq, mmsize
+    sub t1q,  mmsize
+
+    sub t2q, mmsize/2
+    jg .loop
+
+    ; Write DC, middle and tail
+    movhps [inq + lenq*2], xm9
+    xorps xm0, xm0
+    shufps xm9, xm9, xm0, q3210
+    shufps xm8, xm9, xm9, q2120
+    movsd [inq], xm8
+    movhps [inq + lenq*4], xm8
+
+    RET
+%endmacro
+
+%macro RDFT_C2R 1
+INIT_YMM %1
+cglobal rdft_c2r_float, 4, 14, 16, 320, ctx, out, in, stride, len, lut, exp, t1, t2, t3, \
+                                        t4, t5, btmp
+    movsxd lenq, dword [ctxq + AVTXContext.len]
+    mov expq, [ctxq + AVTXContext.exp]
+    mov btmpq, [ctxq + AVTXContext.fn]             ; subtransform's jump point
+
+    mov ctxq, [ctxq + AVTXContext.sub]             ; load subtransform's context
+    mov lutq, [ctxq + AVTXContext.map]             ; load subtransform's map
+
+    movss xm0, [inq]                               ; data[0].re
+    insertps xm0, [inq + lenq*4], 0b00_01_0000     ; src, dst, zero flags
+    movhps xm0, [inq + lenq*2]                     ; data[0,len4]
+
+    shufps xm1, xm0, xm0, q2301                    ; data[0].imre, junk
+    addsubps xm2, xm1, xm0                         ; t[0].imre, junk
+    shufps xm1, xm2, xm1, q2301                    ; t[0].reim, data[len4].reim
+    mulps xm9, xm1, [expq]                         ; data[0,len4].reim
+
+    mov t1q, lenq
+    lea t2q, [inq + lenq*4 - mmsize]
+    lea inq, [inq + 8]
+
+    lea t4q, [lutq + 4]
+    lea t5q, [lutq + lenq*2 - 4*4]
+
+    RDFT_CONV_LOAD                                 ; Perform in-place RDFT conversion
+.loop:
+    RDFT_CONV_ITER in, t2
+
+    vextractf128 xm1, m0, 1
+    vextractf128 xm3, m2, 1
+
+    movsxd t3q, dword [t4q + 4*0]
+    movlpd [outq + 8*t3q], xm0
+    movsxd t3q, dword [t4q + 4*1]
+    movhpd [outq + 8*t3q], xm0
+    movsxd t3q, dword [t4q + 4*2]
+    movlpd [outq + 8*t3q], xm1
+    movsxd t3q, dword [t4q + 4*3]
+    movhpd [outq + 8*t3q], xm1
+
+    movsxd t3q, dword [t5q + 4*0]
+    movlpd [outq + 8*t3q], xm2
+    movsxd t3q, dword [t5q + 4*1]
+    movhpd [outq + 8*t3q], xm2
+    movsxd t3q, dword [t5q + 4*2]
+    movlpd [outq + 8*t3q], xm3
+    movsxd t3q, dword [t5q + 4*3]
+    movhpd [outq + 8*t3q], xm3
+
+    add expq, mmsize
+    add inq,  mmsize
+    sub t2q,  mmsize
+    add t4q, 4*4
+    sub t5q, 4*4
+
+    sub t1q, mmsize/2
+    jg .loop
+
+    movsxd t3q, dword [lutq + 0]
+    movsd [outq + 8*t3q], xm9
+    movsxd t3q, dword [lutq + lenq]
+    movhps [outq + 8*t3q], xm9
+
+    mov inq, outq
+    movsxd lenq, dword [ctxq + AVTXContext.len]     ; load subtransform's length
+    call btmpq                                      ; call the FFT
+
+    RET
+%endmacro
+
+%if ARCH_X86_64 && HAVE_AVX2_EXTERNAL
+RDFT_R2C avx2
+RDFT_C2R avx2
+%endif
diff --git a/libavutil/x86/tx_float_init.c b/libavutil/x86/tx_float_init.c
index d3c0beb50f..2f3d7899a9 100644
--- a/libavutil/x86/tx_float_init.c
+++ b/libavutil/x86/tx_float_init.c
@@ -52,6 +52,9 @@ TX_DECL_FN(fft_pfa_15xM_ns, avx2)
 
 TX_DECL_FN(mdct_inv, avx2)
 
+TX_DECL_FN(rdft_r2c, avx2)
+TX_DECL_FN(rdft_c2r, avx2)
+
 TX_DECL_FN(fft2_asm, sse3)
 TX_DECL_FN(fft4_fwd_asm, sse2)
 TX_DECL_FN(fft4_inv_asm, sse2)
@@ -167,6 +170,63 @@ static av_cold int m_inv_init(AVTXContext *s, const FFTXCodelet *cd,
     return 0;
 }
 
+static av_cold int rdft_init(AVTXContext *s, const FFTXCodelet *cd,
+                             uint64_t flags, FFTXCodeletOptions *opts,
+                             int len, int inv, const void *scale)
+{
+    int ret;
+    double f, m;
+    TXSample *tab;
+    uint64_t r2r = flags & AV_TX_REAL_TO_REAL;
+    int len4 = FFALIGN(len, 4) / 4;
+    FFTXCodeletOptions sub_opts = { .map_dir = inv ? FF_TX_MAP_SCATTER : FF_TX_MAP_GATHER };
+
+    s->scale_d = *((SCALE_TYPE *)scale);
+    s->scale_f = s->scale_d;
+
+    flags &= ~(AV_TX_REAL_TO_REAL | AV_TX_REAL_TO_IMAGINARY);
+    flags |=  FF_TX_PRESHUFFLE;   /* This function handles the permute step */
+    flags |=  AV_TX_INPLACE;      /* in-place */
+    flags |=  FF_TX_ASM_CALL;     /* We want an assembly function, not C */
+
+    if ((ret = ff_tx_init_subtx(s, TX_TYPE(FFT), flags, &sub_opts,
+                                len >> 1, inv, scale)))
+        return ret;
+
+    if (!(s->exp = av_mallocz((8 + 2*len4)*sizeof(*s->exp))))
+        return AVERROR(ENOMEM);
+
+    if (!(s->tmp = av_malloc(len*sizeof(*s->tmp))))
+        return AVERROR(ENOMEM);
+
+    tab = (TXSample *)s->exp;
+
+    f = 2*M_PI/len;
+
+    m = (inv ? 2*s->scale_d : s->scale_d);
+
+    *tab++ =  RESCALE((inv ? 0.5 : 1.0) * m);
+    *tab++ = -RESCALE(inv ? 0.5*m : 1.0*m);
+    *tab++ =  RESCALE( m);
+    *tab++ =  RESCALE(-m);
+
+    if (r2r)
+        *tab++ = 1 / s->scale_f;
+    else
+        *tab++ = RESCALE( (0.0 - 0.5) * m);
+    *tab++ = RESCALE( (0.5 - 0.0) * m);
+
+    *tab++ = RESCALE(-(0.5 - inv) * m);
+    *tab++ = RESCALE( (0.5 - inv) * m);
+
+    for (int i = 0; i < len4; i++) {
+        *tab++ = RESCALE(cos(i*f));
+        *tab++ = RESCALE(cos(((len - i*4)/4.0)*f)) * (inv ? 1 : -1);
+    }
+
+    return 0;
+}
+
 static av_cold int fft_pfa_init(AVTXContext *s,
                                 const FFTXCodelet *cd,
                                 uint64_t flags,
@@ -303,6 +363,11 @@ const FFTXCodelet * const ff_tx_codelet_list_float_x86[] = {
 
     TX_DEF(mdct_inv, MDCT, 16, TX_LEN_UNLIMITED, 2, TX_FACTOR_ANY, 384, m_inv_init, avx2, AVX2,
            FF_TX_INVERSE_ONLY, AV_CPU_FLAG_AVXSLOW | AV_CPU_FLAG_SLOW_GATHER),
+
+    TX_DEF(rdft_r2c, RDFT, 16, TX_LEN_UNLIMITED, 2, TX_FACTOR_ANY, 384, rdft_init, avx2, AVX2,
+           FF_TX_FORWARD_ONLY, AV_CPU_FLAG_AVXSLOW | AV_CPU_FLAG_SLOW_GATHER),
+    TX_DEF(rdft_c2r, RDFT, 16, TX_LEN_UNLIMITED, 2, TX_FACTOR_ANY, 384, rdft_init, avx2, AVX2,
+           FF_TX_INVERSE_ONLY, AV_CPU_FLAG_AVXSLOW | AV_CPU_FLAG_SLOW_GATHER),
 #endif
 #endif
 
diff --git a/tests/checkasm/av_tx.c b/tests/checkasm/av_tx.c
index aa8fc6b4e9..676c39ed86 100644
--- a/tests/checkasm/av_tx.c
+++ b/tests/checkasm/av_tx.c
@@ -43,6 +43,10 @@ static const int check_lens[] = {
     2, 4, 8, 16, 32, 64, 120, 960, 1024, 1920, 16384,
 };
 
+static const int rdft_check_lens[] = {
+    32, 1024,
+};
+
 static AVTXContext *tx_refs[AV_TX_NB][2 /* Direction */][FF_ARRAY_ELEMS(check_lens)] = { 0 };
 static int init = 0;
 
@@ -113,6 +117,9 @@ void checkasm_check_av_tx(void)
     CHECK_TEMPLATE("float_imdct", AV_TX_FLOAT_MDCT, 1, float, float, check_lens,
                    !float_near_abs_eps_array(out_ref, out_new, EPS, len));
 
+    CHECK_TEMPLATE("float_r2c", AV_TX_FLOAT_RDFT, 0, float, float, rdft_check_lens,
+                   !float_near_abs_eps_array(out_ref, out_new, EPS, len));
+
     randomize_complex(in, 16384, AVComplexDouble, SCALE_NOOP);
     CHECK_TEMPLATE("double_fft", AV_TX_DOUBLE_FFT, 0, AVComplexDouble, double, check_lens,
                    !double_near_abs_eps_array(out_ref, out_new, EPS, len*2));
-- 
2.43.0