From patchwork Mon Apr 19 20:26:39 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Lynne X-Patchwork-Id: 27121 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:a6b:5014:0:0:0:0:0 with SMTP id e20csp840385iob; Mon, 19 Apr 2021 13:26:49 -0700 (PDT) X-Google-Smtp-Source: ABdhPJxFsDVcFwWbfYlYVFCIMQdYdixzwEfOU6H/j4FLfOk4wQOAAuEVJSfwOWg2GwSPHieCHmGY X-Received: by 2002:a17:906:2746:: with SMTP id a6mr9525865ejd.265.1618864009107; Mon, 19 Apr 2021 13:26:49 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1618864009; cv=none; d=google.com; s=arc-20160816; b=1BU6KhRNQlA5YZHZ+uLxM+Fm9vVeS+cIThsvQYUux7OWqmyKm9LzFjNerpI0VFhBW1 kvjEM/dip+FUTpmIvJGFl2VwHxeUN7nafL9o2K734WaLnxtUHf3ci/VY4ZXEd1ShHBKB qlpDGYsCy30F/WZO0u3kk7xbXYys2Fo6Fs/I5StBt6RKf58OENRo2eid9LpXlTLKzywU g8aaKJbj8q+tFZeYtsGr0DeIwNFQge54KSRNDiLljLejp+2kxRTMm3idsMDUXxSijADB oV5xxeugCCjqAAj/nYtpcO5tpc58hvbURrYwpY57sEyJzvoqtrpBHWAAv+I+cJ2zVhhE HSHA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:reply-to:list-subscribe:list-help:list-post :list-archive:list-unsubscribe:list-id:precedence:subject :mime-version:references:in-reply-to:message-id:to:from:date :dkim-signature:delivered-to; bh=lPzTY051b4zJhqTbRdfoI7OpUzj12M0EeqGCmRKaZK4=; b=egPVjPQbdRrT92X++IeLme4M7tbQFnZa6zOVuDXlXs9X69Vf2YY+DEJKl1Jv0Ft/GF mcgoiwKzJO7Gu3pexiejH3tMSOr4zRc3dvQX9lYQWy+xzcMt+uBOGBvKKW19skpTD1iB rr2SeNRmGeQUWZX+7bCX1UWwConRBBV7i3c8I/+PpNCHiMxz5t7vCyZKrCfKeSD5Yjy4 G7XzPCMCNZgLt+bZtS9aqGK3AgXkQVWInf1lGwOJ8Ur/hvbxdMB2C6NYlCYhyzTbiWdd srKPXwObxtxbc3jmbQ4uPUkaA7n/gPtKotX3xeus2KOqToppaeRHiP8DIPo1PtmxzN/x 1Icw== ARC-Authentication-Results: i=1; mx.google.com; dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1 header.b=DDiDXmGz; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100]) by mx.google.com with ESMTP id f1si12489042ejx.485.2021.04.19.13.26.48; Mon, 19 Apr 2021 13:26:49 -0700 (PDT) Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; Authentication-Results: mx.google.com; dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1 header.b=DDiDXmGz; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 8355C680BE0; Mon, 19 Apr 2021 23:26:46 +0300 (EEST) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from w4.tutanota.de (w4.tutanota.de [81.3.6.165]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 79733680BE0 for ; Mon, 19 Apr 2021 23:26:39 +0300 (EEST) Received: from w3.tutanota.de (unknown [192.168.1.164]) by w4.tutanota.de (Postfix) with ESMTP id 2865110602E4 for ; Mon, 19 Apr 2021 20:26:39 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; t=1618863999; s=s1; d=lynne.ee; h=From:From:To:To:Subject:Subject:Content-Description:Content-ID:Content-Type:Content-Type:Content-Transfer-Encoding:Cc:Date:Date:In-Reply-To:In-Reply-To:MIME-Version:MIME-Version:Message-ID:Message-ID:Reply-To:References:References:Sender; bh=UWdlEYBLd2s0SzpAp/jxuSGBRg1Zc2AtsPhegFvJaow=; b=DDiDXmGzKU0o6naMPCEvIlu1cFrTDWZgJfNeNUUVgJZMoCU/Zfbpw/IJH620oRkW sDEYCrWEDSjRNf2kDZeoCLACn1wDl6er6tMek6JBGOWqhw8Ms4NLBQ6rUsYUqcEq5rN Sa3r7UdkruZhC459Yb5XgEsAjLrnCqsmsY73VqPQNFKtHbT2WKvHXwVDlHm0VqF0efV g0s2SvsAq2Kqi5wjB5BEKhfBK+++1cklkByb4XDhFDZ5gEMfM09vv43+wYrkxU1H007 P64wyyH4dZd0qQuErCearCOUBlMHBAZvvmi3jInVgtg5rlTZtDNYE40kBhbeuyuWD9h FmjBOtbsvg== Date: Mon, 19 Apr 2021 22:26:39 +0200 (CEST) From: Lynne To: FFmpeg development discussions and patches Message-ID: In-Reply-To: References: MIME-Version: 1.0 Subject: [FFmpeg-devel] [PATCH 10/11] doc/transforms: add documentation for the FFT transforms X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: B4FQO97vJitb Makes the code far easier to follow, and makes creating new SIMD for the transforms far, far easier. Patch attached. Subject: [PATCH 10/11] doc/transforms: add documentation for the FFT transforms Makes the code far easier to follow, and makes creating new SIMD for the transforms far, far easier. --- doc/transforms.md | 706 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 706 insertions(+) create mode 100644 doc/transforms.md diff --git a/doc/transforms.md b/doc/transforms.md new file mode 100644 index 0000000000..78f3f68d65 --- /dev/null +++ b/doc/transforms.md @@ -0,0 +1,706 @@ +The basis transforms used for FFT and various other derived functions are based +on the following unrollings. +The functions can be easily adapted to double precision floats as well. + +# Parity permutation +The basis transforms described here all use the following permutation: + +``` C +void ff_tx_gen_split_radix_parity_revtab(int *revtab, int len, int inv, + int basis, int dual_stride); +``` +Parity means even and odd complex numbers will be split, e.g. the even +coefficients will come first, after which the odd coefficients will be +placed. For example, a 4-point transform's coefficients after reordering: +`z[0].re, z[0].im, z[2].re, z[2].im, z[1].re, z[1].im, z[3].re, z[3].im` + +The basis argument is the length of the largest non-composite transform +supported, and also implies that the basis/2 transform is supported as well, +as the split-radix algorithm requires it to be. + +The dual_stride argument indicates that both the basis, as well as the +basis/2 transforms support doing two transforms at once, and the coefficients +will be interleaved between each pair in a split-radix like so (stride == 2): +`tx1[0], tx1[2], tx2[0], tx2[2], tx1[1], tx1[3], tx2[1], tx2[3]` +A non-zero number switches this on, with the value indicating the stride +(how many values of 1 transform to put first before switching to the other). +Must be a power of two or 0. Must be less than the basis. +Value will be clipped to the transform size, so for a basis of 16 and a +dual_stride of 8, dual 8-point transforms will be laid out as if dual_stride +was set to 4. +Usually you'll set this to half the complex numbers that fit in a single +register or 0. This allows to reuse SSE functions as dual-transform +functions in AVX mode. +If length is smaller than basis/2 this function will not do anything. + +# 4-point FFT transform +The only permutation this transform needs is to swap the `z[1]` and `z[2]` +elements when performing an inverse transform, which in the assembly code is +hardcoded with the function itself being templated and duplicated for each +direction. + +``` C +static void fft4(FFTComplex *z) +{ + FFTSample r1 = z[0].re - z[2].re; + FFTSample r2 = z[0].im - z[2].im; + FFTSample r3 = z[1].re - z[3].re; + FFTSample r4 = z[1].im - z[3].im; + /* r5-r8 second transform */ + + FFTSample t1 = z[0].re + z[2].re; + FFTSample t2 = z[0].im + z[2].im; + FFTSample t3 = z[1].re + z[3].re; + FFTSample t4 = z[1].im + z[3].im; + /* t5-t8 second transform */ + + /* 1sub + 1add = 2 instructions */ + + /* 2 shufs */ + FFTSample a3 = t1 - t3; + FFTSample a4 = t2 - t4; + FFTSample b3 = r1 - r4; + FFTSample b2 = r2 - r3; + + FFTSample a1 = t1 + t3; + FFTSample a2 = t2 + t4; + FFTSample b1 = r1 + r4; + FFTSample b4 = r2 + r3; + /* 1 add 1 sub 3 shufs */ + + z[0].re = a1; + z[0].im = a2; + z[2].re = a3; + z[2].im = a4; + + z[1].re = b1; + z[1].im = b2; + z[3].re = b3; + z[3].im = b4; +} +``` + +# 8-point AVX FFT transform +Input must be pre-permuted using the parity lookup table, generated via +`ff_tx_gen_split_radix_parity_revtab`. + +``` C +static void fft8(FFTComplex *z) +{ + FFTSample r1 = z[0].re - z[4].re; + FFTSample r2 = z[0].im - z[4].im; + FFTSample r3 = z[1].re - z[5].re; + FFTSample r4 = z[1].im - z[5].im; + + FFTSample r5 = z[2].re - z[6].re; + FFTSample r6 = z[2].im - z[6].im; + FFTSample r7 = z[3].re - z[7].re; + FFTSample r8 = z[3].im - z[7].im; + + FFTSample q1 = z[0].re + z[4].re; + FFTSample q2 = z[0].im + z[4].im; + FFTSample q3 = z[1].re + z[5].re; + FFTSample q4 = z[1].im + z[5].im; + + FFTSample q5 = z[2].re + z[6].re; + FFTSample q6 = z[2].im + z[6].im; + FFTSample q7 = z[3].re + z[7].re; + FFTSample q8 = z[3].im + z[7].im; + + FFTSample s3 = q1 - q3; + FFTSample s1 = q1 + q3; + FFTSample s4 = q2 - q4; + FFTSample s2 = q2 + q4; + + FFTSample s7 = q5 - q7; + FFTSample s5 = q5 + q7; + FFTSample s8 = q6 - q8; + FFTSample s6 = q6 + q8; + + FFTSample e1 = s1 * -1; + FFTSample e2 = s2 * -1; + FFTSample e3 = s3 * -1; + FFTSample e4 = s4 * -1; + + FFTSample e5 = s5 * 1; + FFTSample e6 = s6 * 1; + FFTSample e7 = s7 * -1; + FFTSample e8 = s8 * 1; + + FFTSample w1 = e5 - e1; + FFTSample w2 = e6 - e2; + FFTSample w3 = e8 - e3; + FFTSample w4 = e7 - e4; + + FFTSample w5 = s1 - e5; + FFTSample w6 = s2 - e6; + FFTSample w7 = s3 - e8; + FFTSample w8 = s4 - e7; + + z[0].re = w1; + z[0].im = w2; + z[2].re = w3; + z[2].im = w4; + z[4].re = w5; + z[4].im = w6; + z[6].re = w7; + z[6].im = w8; + + FFTSample z1 = r1 - r4; + FFTSample z2 = r1 + r4; + FFTSample z3 = r3 - r2; + FFTSample z4 = r3 + r2; + + FFTSample z5 = r5 - r6; + FFTSample z6 = r5 + r6; + FFTSample z7 = r7 - r8; + FFTSample z8 = r7 + r8; + + z3 *= -1; + z5 *= -M_SQRT1_2; + z6 *= -M_SQRT1_2; + z7 *= M_SQRT1_2; + z8 *= M_SQRT1_2; + + FFTSample t5 = z7 - z6; + FFTSample t6 = z8 + z5; + FFTSample t7 = z8 - z5; + FFTSample t8 = z7 + z6; + + FFTSample u1 = z2 + t5; + FFTSample u2 = z3 + t6; + FFTSample u3 = z1 - t7; + FFTSample u4 = z4 + t8; + + FFTSample u5 = z2 - t5; + FFTSample u6 = z3 - t6; + FFTSample u7 = z1 + t7; + FFTSample u8 = z4 - t8; + + z[1].re = u1; + z[1].im = u2; + z[3].re = u3; + z[3].im = u4; + z[5].re = u5; + z[5].im = u6; + z[7].re = u7; + z[7].im = u8; +} +``` + +As you can see, there are 2 independent paths, one for even and one for odd coefficients. +This theme continues throughout the document. Note that in the actual assembly code, +the paths are interleaved to improve unit saturation and CPU dependency tracking, so +to more clearly see them, you'll need to deinterleave the instructions. + +# 8-point SSE/ARM64 FFT transform +Input must be pre-permuted using the parity lookup table, generated via +`ff_tx_gen_split_radix_parity_revtab`. + +``` C +static void fft8(FFTComplex *z) +{ + FFTSample r1 = z[0].re - z[4].re; + FFTSample r2 = z[0].im - z[4].im; + FFTSample r3 = z[1].re - z[5].re; + FFTSample r4 = z[1].im - z[5].im; + + FFTSample j1 = z[2].re - z[6].re; + FFTSample j2 = z[2].im - z[6].im; + FFTSample j3 = z[3].re - z[7].re; + FFTSample j4 = z[3].im - z[7].im; + + FFTSample q1 = z[0].re + z[4].re; + FFTSample q2 = z[0].im + z[4].im; + FFTSample q3 = z[1].re + z[5].re; + FFTSample q4 = z[1].im + z[5].im; + + FFTSample k1 = z[2].re + z[6].re; + FFTSample k2 = z[2].im + z[6].im; + FFTSample k3 = z[3].re + z[7].re; + FFTSample k4 = z[3].im + z[7].im; + /* 2 add 2 sub = 4 */ + + /* 2 shufs, 1 add 1 sub = 4 */ + FFTSample s1 = q1 + q3; + FFTSample s2 = q2 + q4; + FFTSample g1 = k3 + k1; + FFTSample g2 = k2 + k4; + + FFTSample s3 = q1 - q3; + FFTSample s4 = q2 - q4; + FFTSample g4 = k3 - k1; + FFTSample g3 = k2 - k4; + + /* 1 unpack + 1 shuffle = 2 */ + + /* 1 add */ + FFTSample w1 = s1 + g1; + FFTSample w2 = s2 + g2; + FFTSample w3 = s3 + g3; + FFTSample w4 = s4 + g4; + + /* 1 sub */ + FFTSample h1 = s1 - g1; + FFTSample h2 = s2 - g2; + FFTSample h3 = s3 - g3; + FFTSample h4 = s4 - g4; + + z[0].re = w1; + z[0].im = w2; + z[2].re = w3; + z[2].im = w4; + z[4].re = h1; + z[4].im = h2; + z[6].re = h3; + z[6].im = h4; + + /* 1 shuf + 1 shuf + 1 xor + 1 addsub */ + FFTSample z1 = r1 + r4; + FFTSample z2 = r2 - r3; + FFTSample z3 = r1 - r4; + FFTSample z4 = r2 + r3; + + /* 1 mult */ + j1 *= M_SQRT1_2; + j2 *= -M_SQRT1_2; + j3 *= -M_SQRT1_2; + j4 *= M_SQRT1_2; + + /* 1 shuf + 1 addsub */ + FFTSample l2 = j1 - j2; + FFTSample l1 = j2 + j1; + FFTSample l4 = j3 - j4; + FFTSample l3 = j4 + j3; + + /* 1 shuf + 1 addsub */ + FFTSample t1 = l3 - l2; + FFTSample t2 = l4 + l1; + FFTSample t3 = l1 - l4; + FFTSample t4 = l2 + l3; + + /* 1 add */ + FFTSample u1 = z1 - t1; + FFTSample u2 = z2 - t2; + FFTSample u3 = z3 - t3; + FFTSample u4 = z4 - t4; + + /* 1 sub */ + FFTSample o1 = z1 + t1; + FFTSample o2 = z2 + t2; + FFTSample o3 = z3 + t3; + FFTSample o4 = z4 + t4; + + z[1].re = u1; + z[1].im = u2; + z[3].re = u3; + z[3].im = u4; + z[5].re = o1; + z[5].im = o2; + z[7].re = o3; + z[7].im = o4; +} +``` + +Most functions here are highly tuned to use x86's addsub instruction to save on +external sign mask loading. + +# 16-point AVX FFT transform +This version expects the output of the 8 and 4-point transforms to follow the +even/odd convention established above. + +``` C +static void fft16(FFTComplex *z) +{ + FFTSample cos_16_1 = 0.92387950420379638671875f; + FFTSample cos_16_3 = 0.3826834261417388916015625f; + + fft8(z); + fft4(z+8); + fft4(z+10); + + FFTSample s[32]; + + /* + xorps m1, m1 - free + mulps m0 + shufps m1, m1, m0 + xorps + addsub + shufps + mulps + mulps + addps + or (fma3) + shufps + shufps + mulps + mulps + fma + fma + */ + + s[0] = z[8].re*( 1) - z[8].im*( 0); + s[1] = z[8].im*( 1) + z[8].re*( 0); + s[2] = z[9].re*( 1) - z[9].im*(-1); + s[3] = z[9].im*( 1) + z[9].re*(-1); + + s[4] = z[10].re*( 1) - z[10].im*( 0); + s[5] = z[10].im*( 1) + z[10].re*( 0); + s[6] = z[11].re*( 1) - z[11].im*( 1); + s[7] = z[11].im*( 1) + z[11].re*( 1); + + s[8] = z[12].re*( cos_16_1) - z[12].im*( -cos_16_3); + s[9] = z[12].im*( cos_16_1) + z[12].re*( -cos_16_3); + s[10] = z[13].re*( cos_16_3) - z[13].im*( -cos_16_1); + s[11] = z[13].im*( cos_16_3) + z[13].re*( -cos_16_1); + + s[12] = z[14].re*( cos_16_1) - z[14].im*( cos_16_3); + s[13] = z[14].im*( -cos_16_1) + z[14].re*( -cos_16_3); + s[14] = z[15].re*( cos_16_3) - z[15].im*( cos_16_1); + s[15] = z[15].im*( -cos_16_3) + z[15].re*( -cos_16_1); + + s[2] *= M_SQRT1_2; + s[3] *= M_SQRT1_2; + s[5] *= -1; + s[6] *= M_SQRT1_2; + s[7] *= -M_SQRT1_2; + + FFTSample w5 = s[0] + s[4]; + FFTSample w6 = s[1] - s[5]; + FFTSample x5 = s[2] + s[6]; + FFTSample x6 = s[3] - s[7]; + + FFTSample w3 = s[4] - s[0]; + FFTSample w4 = s[5] + s[1]; + FFTSample x3 = s[6] - s[2]; + FFTSample x4 = s[7] + s[3]; + + FFTSample y5 = s[8] + s[12]; + FFTSample y6 = s[9] - s[13]; + FFTSample u5 = s[10] + s[14]; + FFTSample u6 = s[11] - s[15]; + + FFTSample y3 = s[12] - s[8]; + FFTSample y4 = s[13] + s[9]; + FFTSample u3 = s[14] - s[10]; + FFTSample u4 = s[15] + s[11]; + + /* 2xorps, 2vperm2fs, 2 adds, 2 vpermilps = 8 */ + + FFTSample o1 = z[0].re + w5; + FFTSample o2 = z[0].im + w6; + FFTSample o5 = z[1].re + x5; + FFTSample o6 = z[1].im + x6; + FFTSample o9 = z[2].re + w4; //h + FFTSample o10 = z[2].im + w3; + FFTSample o13 = z[3].re + x4; + FFTSample o14 = z[3].im + x3; + + FFTSample o17 = z[0].re - w5; + FFTSample o18 = z[0].im - w6; + FFTSample o21 = z[1].re - x5; + FFTSample o22 = z[1].im - x6; + FFTSample o25 = z[2].re - w4; //h + FFTSample o26 = z[2].im - w3; + FFTSample o29 = z[3].re - x4; + FFTSample o30 = z[3].im - x3; + + FFTSample o3 = z[4].re + y5; + FFTSample o4 = z[4].im + y6; + FFTSample o7 = z[5].re + u5; + FFTSample o8 = z[5].im + u6; + FFTSample o11 = z[6].re + y4; //h + FFTSample o12 = z[6].im + y3; + FFTSample o15 = z[7].re + u4; + FFTSample o16 = z[7].im + u3; + + FFTSample o19 = z[4].re - y5; + FFTSample o20 = z[4].im - y6; + FFTSample o23 = z[5].re - u5; + FFTSample o24 = z[5].im - u6; + FFTSample o27 = z[6].re - y4; //h + FFTSample o28 = z[6].im - y3; + FFTSample o31 = z[7].re - u4; + FFTSample o32 = z[7].im - u3; + + /* This is just deinterleaving, happens separately */ + z[0] = (FFTComplex){ o1, o2 }; + z[1] = (FFTComplex){ o3, o4 }; + z[2] = (FFTComplex){ o5, o6 }; + z[3] = (FFTComplex){ o7, o8 }; + z[4] = (FFTComplex){ o9, o10 }; + z[5] = (FFTComplex){ o11, o12 }; + z[6] = (FFTComplex){ o13, o14 }; + z[7] = (FFTComplex){ o15, o16 }; + + z[8] = (FFTComplex){ o17, o18 }; + z[9] = (FFTComplex){ o19, o20 }; + z[10] = (FFTComplex){ o21, o22 }; + z[11] = (FFTComplex){ o23, o24 }; + z[12] = (FFTComplex){ o25, o26 }; + z[13] = (FFTComplex){ o27, o28 }; + z[14] = (FFTComplex){ o29, o30 }; + z[15] = (FFTComplex){ o31, o32 }; +} +``` + +# AVX split-radix synthesis +To create larger transforms, the following unrolling of the C split-radix +function is used. + +``` C +#define BF(x, y, a, b) \ + do { \ + x = (a) - (b); \ + y = (a) + (b); \ + } while (0) + +#define BUTTERFLIES(a0,a1,a2,a3) \ + do { \ + r0=a0.re; \ + i0=a0.im; \ + r1=a1.re; \ + i1=a1.im; \ + BF(q3, q5, q5, q1); \ + BF(a2.re, a0.re, r0, q5); \ + BF(a3.im, a1.im, i1, q3); \ + BF(q4, q6, q2, q6); \ + BF(a3.re, a1.re, r1, q4); \ + BF(a2.im, a0.im, i0, q6); \ + } while (0) + +#undef TRANSFORM +#define TRANSFORM(a0,a1,a2,a3,wre,wim) \ + do { \ + CMUL(q1, q2, a2.re, a2.im, wre, -wim); \ + CMUL(q5, q6, a3.re, a3.im, wre, wim); \ + BUTTERFLIES(a0, a1, a2, a3); \ + } while (0) + +#define CMUL(dre, dim, are, aim, bre, bim) \ + do { \ + (dre) = (are) * (bre) - (aim) * (bim); \ + (dim) = (are) * (bim) + (aim) * (bre); \ + } while (0) + +static void recombine(FFTComplex *z, const FFTSample *cos, + unsigned int n) +{ + const int o1 = 2*n; + const int o2 = 4*n; + const int o3 = 6*n; + const FFTSample *wim = cos + o1 - 7; + FFTSample q1, q2, q3, q4, q5, q6, r0, i0, r1, i1; + +#if 0 + for (int i = 0; i < n; i += 4) { +#endif + +#if 0 + TRANSFORM(z[ 0 + 0], z[ 0 + 4], z[o2 + 0], z[o2 + 2], cos[0], wim[7]); + TRANSFORM(z[ 0 + 1], z[ 0 + 5], z[o2 + 1], z[o2 + 3], cos[2], wim[5]); + TRANSFORM(z[ 0 + 2], z[ 0 + 6], z[o2 + 4], z[o2 + 6], cos[4], wim[3]); + TRANSFORM(z[ 0 + 3], z[ 0 + 7], z[o2 + 5], z[o2 + 7], cos[6], wim[1]); + + TRANSFORM(z[o1 + 0], z[o1 + 4], z[o3 + 0], z[o3 + 2], cos[1], wim[6]); + TRANSFORM(z[o1 + 1], z[o1 + 5], z[o3 + 1], z[o3 + 3], cos[3], wim[4]); + TRANSFORM(z[o1 + 2], z[o1 + 6], z[o3 + 4], z[o3 + 6], cos[5], wim[2]); + TRANSFORM(z[o1 + 3], z[o1 + 7], z[o3 + 5], z[o3 + 7], cos[7], wim[0]); +#else + FFTSample h[8], j[8], r[8], w[8]; + FFTSample t[8]; + FFTComplex *m0 = &z[0]; + FFTComplex *m1 = &z[4]; + FFTComplex *m2 = &z[o2 + 0]; + FFTComplex *m3 = &z[o2 + 4]; + + const FFTSample *t1 = &cos[0]; + const FFTSample *t2 = &wim[0]; + + /* 2 loads (tabs) */ + + /* 2 vperm2fs, 2 shufs (im), 2 shufs (tabs) */ + /* 1 xor, 1 add, 1 sub, 4 mults OR 2 mults, 2 fmas */ + /* 13 OR 10ish (-2 each for second passovers!) */ + + w[0] = m2[0].im*t1[0] - m2[0].re*t2[7]; + w[1] = m2[0].re*t1[0] + m2[0].im*t2[7]; + w[2] = m2[1].im*t1[2] - m2[1].re*t2[5]; + w[3] = m2[1].re*t1[2] + m2[1].im*t2[5]; + w[4] = m3[0].im*t1[4] - m3[0].re*t2[3]; + w[5] = m3[0].re*t1[4] + m3[0].im*t2[3]; + w[6] = m3[1].im*t1[6] - m3[1].re*t2[1]; + w[7] = m3[1].re*t1[6] + m3[1].im*t2[1]; + + j[0] = m2[2].im*t1[0] + m2[2].re*t2[7]; + j[1] = m2[2].re*t1[0] - m2[2].im*t2[7]; + j[2] = m2[3].im*t1[2] + m2[3].re*t2[5]; + j[3] = m2[3].re*t1[2] - m2[3].im*t2[5]; + j[4] = m3[2].im*t1[4] + m3[2].re*t2[3]; + j[5] = m3[2].re*t1[4] - m3[2].im*t2[3]; + j[6] = m3[3].im*t1[6] + m3[3].re*t2[1]; + j[7] = m3[3].re*t1[6] - m3[3].im*t2[1]; + + /* 1 add + 1 shuf */ + t[1] = j[0] + w[0]; + t[0] = j[1] + w[1]; + t[3] = j[2] + w[2]; + t[2] = j[3] + w[3]; + t[5] = j[4] + w[4]; + t[4] = j[5] + w[5]; + t[7] = j[6] + w[6]; + t[6] = j[7] + w[7]; + + /* 1 sub + 1 xor */ + r[0] = (w[0] - j[0]); + r[1] = -(w[1] - j[1]); + r[2] = (w[2] - j[2]); + r[3] = -(w[3] - j[3]); + r[4] = (w[4] - j[4]); + r[5] = -(w[5] - j[5]); + r[6] = (w[6] - j[6]); + r[7] = -(w[7] - j[7]); + + /* Min: 2 subs, 2 adds, 2 vperm2fs (OPTIONAL) */ + m2[0].re = m0[0].re - t[0]; + m2[0].im = m0[0].im - t[1]; + m2[1].re = m0[1].re - t[2]; + m2[1].im = m0[1].im - t[3]; + m3[0].re = m0[2].re - t[4]; + m3[0].im = m0[2].im - t[5]; + m3[1].re = m0[3].re - t[6]; + m3[1].im = m0[3].im - t[7]; + + m2[2].re = m1[0].re - r[0]; + m2[2].im = m1[0].im - r[1]; + m2[3].re = m1[1].re - r[2]; + m2[3].im = m1[1].im - r[3]; + m3[2].re = m1[2].re - r[4]; + m3[2].im = m1[2].im - r[5]; + m3[3].re = m1[3].re - r[6]; + m3[3].im = m1[3].im - r[7]; + + m0[0].re = m0[0].re + t[0]; + m0[0].im = m0[0].im + t[1]; + m0[1].re = m0[1].re + t[2]; + m0[1].im = m0[1].im + t[3]; + m0[2].re = m0[2].re + t[4]; + m0[2].im = m0[2].im + t[5]; + m0[3].re = m0[3].re + t[6]; + m0[3].im = m0[3].im + t[7]; + + m1[0].re = m1[0].re + r[0]; + m1[0].im = m1[0].im + r[1]; + m1[1].re = m1[1].re + r[2]; + m1[1].im = m1[1].im + r[3]; + m1[2].re = m1[2].re + r[4]; + m1[2].im = m1[2].im + r[5]; + m1[3].re = m1[3].re + r[6]; + m1[3].im = m1[3].im + r[7]; + + /* Identical for below, but with the following parameters */ + m0 = &z[o1]; + m1 = &z[o1 + 4]; + m2 = &z[o3 + 0]; + m3 = &z[o3 + 4]; + t1 = &cos[1]; + t2 = &wim[-1]; + + w[0] = m2[0].im*t1[0] - m2[0].re*t2[7]; + w[1] = m2[0].re*t1[0] + m2[0].im*t2[7]; + w[2] = m2[1].im*t1[2] - m2[1].re*t2[5]; + w[3] = m2[1].re*t1[2] + m2[1].im*t2[5]; + w[4] = m3[0].im*t1[4] - m3[0].re*t2[3]; + w[5] = m3[0].re*t1[4] + m3[0].im*t2[3]; + w[6] = m3[1].im*t1[6] - m3[1].re*t2[1]; + w[7] = m3[1].re*t1[6] + m3[1].im*t2[1]; + + j[0] = m2[2].im*t1[0] + m2[2].re*t2[7]; + j[1] = m2[2].re*t1[0] - m2[2].im*t2[7]; + j[2] = m2[3].im*t1[2] + m2[3].re*t2[5]; + j[3] = m2[3].re*t1[2] - m2[3].im*t2[5]; + j[4] = m3[2].im*t1[4] + m3[2].re*t2[3]; + j[5] = m3[2].re*t1[4] - m3[2].im*t2[3]; + j[6] = m3[3].im*t1[6] + m3[3].re*t2[1]; + j[7] = m3[3].re*t1[6] - m3[3].im*t2[1]; + + /* 1 add + 1 shuf */ + t[1] = j[0] + w[0]; + t[0] = j[1] + w[1]; + t[3] = j[2] + w[2]; + t[2] = j[3] + w[3]; + t[5] = j[4] + w[4]; + t[4] = j[5] + w[5]; + t[7] = j[6] + w[6]; + t[6] = j[7] + w[7]; + + /* 1 sub + 1 xor */ + r[0] = (w[0] - j[0]); + r[1] = -(w[1] - j[1]); + r[2] = (w[2] - j[2]); + r[3] = -(w[3] - j[3]); + r[4] = (w[4] - j[4]); + r[5] = -(w[5] - j[5]); + r[6] = (w[6] - j[6]); + r[7] = -(w[7] - j[7]); + + /* Min: 2 subs, 2 adds, 2 vperm2fs (OPTIONAL) */ + m2[0].re = m0[0].re - t[0]; + m2[0].im = m0[0].im - t[1]; + m2[1].re = m0[1].re - t[2]; + m2[1].im = m0[1].im - t[3]; + m3[0].re = m0[2].re - t[4]; + m3[0].im = m0[2].im - t[5]; + m3[1].re = m0[3].re - t[6]; + m3[1].im = m0[3].im - t[7]; + + m2[2].re = m1[0].re - r[0]; + m2[2].im = m1[0].im - r[1]; + m2[3].re = m1[1].re - r[2]; + m2[3].im = m1[1].im - r[3]; + m3[2].re = m1[2].re - r[4]; + m3[2].im = m1[2].im - r[5]; + m3[3].re = m1[3].re - r[6]; + m3[3].im = m1[3].im - r[7]; + + m0[0].re = m0[0].re + t[0]; + m0[0].im = m0[0].im + t[1]; + m0[1].re = m0[1].re + t[2]; + m0[1].im = m0[1].im + t[3]; + m0[2].re = m0[2].re + t[4]; + m0[2].im = m0[2].im + t[5]; + m0[3].re = m0[3].re + t[6]; + m0[3].im = m0[3].im + t[7]; + + m1[0].re = m1[0].re + r[0]; + m1[0].im = m1[0].im + r[1]; + m1[1].re = m1[1].re + r[2]; + m1[1].im = m1[1].im + r[3]; + m1[2].re = m1[2].re + r[4]; + m1[2].im = m1[2].im + r[5]; + m1[3].re = m1[3].re + r[6]; + m1[3].im = m1[3].im + r[7]; +#endif + +#if 0 + z += 4; // !!! + cos += 2*4; + wim -= 2*4; + } +#endif +} +``` + +The macros used are identical to those in the generic C version, only with all +variable declarations exported to the function body. +An important point here is that the high frequency registers (m2 and m3) have +their high and low halves swapped in the output. This is intentional, as the +inputs must also have the same layout, and therefore, the input swapping is only +performed once for the bottom-most basis transform, with all subsequent combinations +using the already swapped halves. + +Also note that this function requires a special iteration way, due to coefficients +beginning to overlap, particularly `[o1]` with `[0]` after the second iteration. +To iterate further, set `z = &z[16]` via `z += 8` for the second iteration. After +the 4th iteration, the layout resets, so repeat the same.