[FFmpeg-devel] lavu/tx: WIP add x86 assembly

This commit adds SSE3 and AVX assembly for the 4-point and 8-point
transforms only.
The code that recombines them into larger transforms does not work yet,
so it's not included here. This patch is only meant to gather feedback
on possible optimizations.

The 4-point assembly is based on this structure:
https://gist.github.com/cyanreg/665b9c79cbe51df9296a969257f2a16c
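
For readers who want something to compare against, here is a minimal scalar
sketch of a 4-point forward transform in C. It is a textbook radix-2
butterfly and is not claimed to match the structure in the gist; the `cpx`
type and the `fft4_ref` name are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex type: interleaved real/imaginary floats. */
typedef struct { float re, im; } cpx;

/* In-place 4-point forward DFT (exponent sign -1), natural order in/out. */
static void fft4_ref(cpx z[4])
{
    assert(z != NULL);

    /* First stage: butterflies on (z0, z2) and (z1, z3). */
    cpx a = { z[0].re + z[2].re, z[0].im + z[2].im };
    cpx b = { z[0].re - z[2].re, z[0].im - z[2].im };
    cpx c = { z[1].re + z[3].re, z[1].im + z[3].im };
    cpx d = { z[1].re - z[3].re, z[1].im - z[3].im };

    /* Second stage: multiplying d by -i is a swap plus one sign flip. */
    z[0] = (cpx){ a.re + c.re, a.im + c.im };
    z[2] = (cpx){ a.re - c.re, a.im - c.im };
    z[1] = (cpx){ b.re + d.im, b.im - d.re };
    z[3] = (cpx){ b.re - d.im, b.im + d.re };
}
```

A scalar reference like this is mainly useful for checking the output of
the SIMD macros on a few known inputs.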

The 8-point assembly is based on this structure:
https://gist.github.com/cyanreg/bbf25c8a8dfb910ed3b9ae7663983ca6
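
As a rough illustration of what an 8-point transform computes, the sketch
below builds one from two 4-point transforms plus twiddle factors
(radix-2 decimation in time). This is an assumption made for clarity, not
the structure in the gist, which may well be a dedicated 8-point layout.
The `cpx`, `fft4_ref` and `fft8_ref` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex type: interleaved real/imaginary floats. */
typedef struct { float re, im; } cpx;

#define SQRT1_2 0.70710678118654752440f /* sqrt(2)/2 */

/* In-place 4-point forward DFT, natural order in/out. */
static void fft4_ref(cpx z[4])
{
    cpx a = { z[0].re + z[2].re, z[0].im + z[2].im };
    cpx b = { z[0].re - z[2].re, z[0].im - z[2].im };
    cpx c = { z[1].re + z[3].re, z[1].im + z[3].im };
    cpx d = { z[1].re - z[3].re, z[1].im - z[3].im };
    z[0] = (cpx){ a.re + c.re, a.im + c.im };
    z[2] = (cpx){ a.re - c.re, a.im - c.im };
    z[1] = (cpx){ b.re + d.im, b.im - d.re };
    z[3] = (cpx){ b.re - d.im, b.im + d.re };
}

/* In-place 8-point forward DFT from two 4-point transforms. */
static void fft8_ref(cpx z[8])
{
    /* Twiddles w^k = exp(-2*pi*i*k/8) for k = 0..3. */
    static const cpx w[4] = {
        { 1.0f, 0.0f }, { SQRT1_2, -SQRT1_2 },
        { 0.0f, -1.0f }, { -SQRT1_2, -SQRT1_2 },
    };
    assert(z != NULL);

    /* Split into even- and odd-indexed samples and transform each half. */
    cpx e[4] = { z[0], z[2], z[4], z[6] };
    cpx o[4] = { z[1], z[3], z[5], z[7] };
    fft4_ref(e);
    fft4_ref(o);

    /* Recombine: X[k] = E[k] + w^k O[k], X[k+4] = E[k] - w^k O[k]. */
    for (int k = 0; k < 4; k++) {
        cpx t = { w[k].re * o[k].re - w[k].im * o[k].im,
                  w[k].re * o[k].im + w[k].im * o[k].re };
        z[k]     = (cpx){ e[k].re + t.re, e[k].im + t.im };
        z[k + 4] = (cpx){ e[k].re - t.re, e[k].im - t.im };
    }
}
```

The recombination loop is the part that costs shuffles and multiplies in a
SIMD version, which is why a dedicated 8-point kernel can be shorter than
this generic form.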

Both are implemented as macros because the recombination code
instantiates them several times.

All code here is faster than both our own current assembly (by around 40%)
and FFTW3 (by around 10% to 40%).

The 8-point core assembly is barely 20 instructions! That's 1 fewer
than our current code, and it saves a lot of shuffles!
It's 40% faster than FFTW3!

The 4-point core assembly is 10 instructions, which is 1 more than
our current code. However, it no longer needs to load a sign mask
from memory, trading that load for a shufps (faster). It also uses
one additional temporary register to reduce latency.
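
To make the sign-mask trade-off concrete, here is a small sketch using SSE
intrinsics in C. It shows one common way to replace a sign-mask XOR with an
add, a subtract into a temporary, and a shufps; it is an assumption about
the general idea and not the code from this patch. The function names
`addsub_mask` and `addsub_shuf` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* SSE intrinsics */

/* out = (a0+b0, a1+b1, a2-b2, a3-b3), using a sign mask.
 * The mask is a constant that typically has to be loaded from memory. */
static void addsub_mask(const float *a, const float *b, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    const __m128 mask = _mm_set_ps(-0.0f, -0.0f, 0.0f, 0.0f);
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    vb = _mm_xor_ps(vb, mask);            /* flip the sign of lanes 2 and 3 */
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* Same result without a mask: compute the sum and the difference,
 * then pick the low lanes from the sum and the high lanes from the
 * difference with a single shufps. Costs one extra temporary register. */
static void addsub_shuf(const float *a, const float *b, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    __m128 va   = _mm_loadu_ps(a);
    __m128 vb   = _mm_loadu_ps(b);
    __m128 sum  = _mm_add_ps(va, vb);
    __m128 diff = _mm_sub_ps(va, vb);     /* the extra temporary */
    _mm_storeu_ps(out, _mm_shuffle_ps(sum, diff, _MM_SHUFFLE(3, 2, 1, 0)));
}
```

The add and subtract are independent of each other, so they can issue in
parallel, while the XOR variant has to wait for the mask before the add
can start.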

I'll collect the suggestions and implement them when I'm ready
to post the full power-of-two assembly.
---
 libavutil/tx.c                |   2 +
 libavutil/tx_priv.h           |   2 +
 libavutil/x86/Makefile        |   2 +
 libavutil/x86/tx_float.asm    | 171 ++++++++++++++++++++++++++++++++++
 libavutil/x86/tx_float_init.c |  66 +++++++++++++
 5 files changed, 243 insertions(+)
 create mode 100644 libavutil/x86/tx_float.asm
 create mode 100644 libavutil/x86/tx_float_init.c

Message ID	MURh8bt--3-2@lynne.ee
State	New
From	Lynne <dev@lynne.ee>
To	Ffmpeg Devel <ffmpeg-devel@ffmpeg.org>
Date	Fri, 26 Feb 2021 05:59:20 +0100 (CET)
Series	[FFmpeg-devel] lavu/tx: WIP add x86 assembly

Context	Check	Description
andriy/x86_make	fail	Make failed
andriy/PPC64_make	success	Make finished
andriy/PPC64_make_fate	success	Make fate finished
