[FFmpeg-devel,0/5] Provide optimized neon implementation

Message ID	20220908092507.63319-1-hum@semihalf.com
Headers
Delivered-To: ffmpegpatchwork2@gmail.com
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
From: Hubert Mazur <hum@semihalf.com>
To: ffmpeg-devel@ffmpeg.org
Date: Thu,  8 Sep 2022 11:25:02 +0200
Message-Id: <20220908092507.63319-1-hum@semihalf.com>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 0/5] Provide optimized neon implementation
Precedence: list
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: gjb@semihalf.com, upstream@semihalf.com, jswinney@amazon.com,
 Hubert Mazur <hum@semihalf.com>, martin@martin.st, mw@semihalf.com,
 spop@amazon.com
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
Series	Provide optimized neon implementation
  [FFmpeg-devel,0/5] Provide optimized neon implementation
  [FFmpeg-devel,1/5] lavc/aarch64: Add neon implementation for vsad16
  [FFmpeg-devel,2/5] lavc/aarch64: Add neon implementation of vsse16
  [FFmpeg-devel,3/5] lavc/aarch64: Add neon implementation for vsad_intra16
  [FFmpeg-devel,4/5] lavc/aarch64: Add neon implementation for vsse_intra16
  [FFmpeg-devel,5/5] lavc/aarch64: Provide neon implementation of nsse16

Message

Hubert Mazur Sept. 8, 2022, 9:25 a.m. UTC

This revision fixes minor issues in the patches.
Regarding vsse16, I didn't change saba & umlal to sub & smlal.
It doesn't affect performance, so I left it as it was.
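
For context, here is a minimal scalar sketch of why either instruction pair gives the same score: the function accumulates squared differences, and squaring the absolute value of a difference yields the same number as squaring the signed difference. The names vsse16_signed and vsse16_abs are hypothetical, and the loop shape (squared differences of vertical gradients over 16-pixel rows) is an assumption based on the scalar C version, not a transcription of the NEON code in the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar model: take the signed difference and square it
 * (the "sub & smlal" flavour). Assumed loop shape, not the patch code. */
static int vsse16_signed(const uint8_t *s1, const uint8_t *s2,
                         ptrdiff_t stride, int h)
{
    int score = 0;
    for (int y = 1; y < h; y++) {
        for (int x = 0; x < 16; x++) {
            int d = s1[x] - s2[x] - s1[x + stride] + s2[x + stride];
            score += d * d;
        }
        s1 += stride;
        s2 += stride;
    }
    return score;
}

/* Same sum, but the absolute value of the difference is squared
 * (the "saba & umlal" flavour). */
static int vsse16_abs(const uint8_t *s1, const uint8_t *s2,
                      ptrdiff_t stride, int h)
{
    int score = 0;
    for (int y = 1; y < h; y++) {
        for (int x = 0; x < 16; x++) {
            int d = abs(s1[x] - s2[x] - s1[x + stride] + s2[x + stride]);
            score += d * d;
        }
        s1 += stride;
        s2 += stride;
    }
    return score;
}
```

Both functions return identical values for any input, so the choice between the two pairs is purely a question of throughput on a given core.
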
Most of the changes concern nsse16:
 - fixed indentation (thanks for pointing it out),
 - applied the patch from Martin which fixes the balance
   within instructions,
 - interleaved instructions - apparently this helped a little
   to improve the benchmark results.

I have also updated the benchmark results for each function.
The improvement is not huge, but it was worth the effort.
The results for vsse16 and nsse16 are shown below (these are the
biggest changes):
 - vsse16 asm from 64.7 to 59.2,
 - nsse16 asm from 120.0 to 116.5.
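
To put these numbers in relative terms, a small sketch of the arithmetic follows. The helper name reduction_percent is hypothetical, and it only assumes that a lower benchmark score is better.

```c
/* Relative reduction, in percent, when a benchmark score drops
 * from `before` to `after` (lower is better). Hypothetical helper. */
static double reduction_percent(double before, double after)
{
    return (before - after) / before * 100.0;
}
```

With the figures above, reduction_percent(64.7, 59.2) comes out at roughly 8.5 for vsse16 and reduction_percent(120.0, 116.5) at roughly 2.9 for nsse16.
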

Hubert Mazur (5):
  lavc/aarch64: Add neon implementation for vsad16
  lavc/aarch64: Add neon implementation of vsse16
  lavc/aarch64: Add neon implementation for vsad_intra16
  lavc/aarch64: Add neon implementation for vsse_intra16
  lavc/aarch64: Provide neon implementation of nsse16

 libavcodec/aarch64/me_cmp_init_aarch64.c |  30 ++
 libavcodec/aarch64/me_cmp_neon.S         | 385 +++++++++++++++++++++++
 2 files changed, 415 insertions(+)

Comments

Martin Storsjö Sept. 9, 2022, 7:32 a.m. UTC | #1

On Thu, 8 Sep 2022, Hubert Mazur wrote:

> This revision fixes minor issues in the patches.
> Regarding vsse16, I didn't change saba & umlal to sub & smlal.
> It doesn't affect performance, so I left it as it was.
> Most of the changes concern nsse16:
> - fixed indentation (thanks for pointing it out),
> - applied the patch from Martin which fixes the balance
>   within instructions,
> - interleaved instructions - apparently this helped a little
>   to improve the benchmark results.

Thanks! I measured a small further improvement on A53 with this change:
from 377 to 370 cycles.

> I have also updated the benchmark results for each function.
> The improvement is not huge, but it was worth the effort.
> The results for vsse16 and nsse16 are shown below (these are the
> biggest changes):
> - vsse16 asm from 64.7 to 59.2,
> - nsse16 asm from 120.0 to 116.5.

It's kinda surprising that the difference is so small, since we reduced
the amount of work done in the functions quite significantly (IIRC on A53,
the speedup was something like 1.5x compared with the original), but I
guess it's understandable if the Graviton 3 is so powerful that there are
enough spare execution units that a bunch of redundant instructions
don't really matter.

Anyway, this revision of the patchset looked good to me, so I pushed it 
now. Thanks!

// Martin