diff mbox series

[FFmpeg-devel] Add support for "omp simd" pragma.

Message ID 20210110164351.86350-1-Reimar.Doeffinger@gmx.de
State New
Headers show
Series [FFmpeg-devel] Add support for "omp simd" pragma.
Related show

Checks

Context Check Description
andriy/x86_make success Make finished
andriy/x86_make_fate success Make fate finished
andriy/PPC64_make success Make finished
andriy/PPC64_make_fate success Make fate finished

Commit Message

Reimar Döffinger Jan. 10, 2021, 4:43 p.m. UTC
From: Reimar Döffinger <Reimar.Doeffinger@gmx.de>

This requests loops to be vectorized using SIMD
instructions.
The performance increase is far from hand-optimized
assembly but still significant over the plain C version.
Typical values are a 2-4x speedup where a hand-written
version would achieve 4x-10x.
So it is far from a replacement; however, some architectures
will get hand-written assembler quite late or not at all,
and this is a good improvement for a trivial amount of work.
The cause, besides the compiler being a compiler, is
usually that it does not manage to use saturating instructions
and thus has to use 32-bit operations where actually
saturating 16-bit operations would be sufficient.
Other causes are for example the av_clip functions that
are not ideal for vectorization (and even as scalar code
not optimal for any modern CPU that has either CSEL or
MAX/MIN instructions).
And of course this only works for relatively simple
loops; the IDCT functions, for example, did not seem
possible to optimize this way.
Also note that while clang may accept the code and sometimes
produces warnings, it does not seem to do anything actually
useful at all.
Here are example measurements using gcc 10 under Linux (in a VM unfortunately)
on AArch64 on Apple M1:
Command:
time ./ffplay_g LG\ 4K\ HDR\ Demo\ -\ New\ York.ts -t 10 -autoexit -threads 1 -noframedrop

Original code:
real    0m19.572s
user    0m23.386s
sys     0m0.213s

Changing all put_hevc:
real    0m15.648s
user    0m19.503s (83.4% of original)
sys     0m0.186s

In addition changing add_residual:
real    0m15.424s
user    0m19.278s (82.4% of original)
sys     0m0.133s

In addition changing planar copy dither:
real    0m15.040s
user    0m18.874s (80.7% of original)
sys     0m0.168s

Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
---
 configure                     | 23 +++++++++++++++++
 libavcodec/hevcdsp_template.c | 47 +++++++++++++++++++++++++++++++++++
 libavutil/internal.h          |  6 +++++
 libswscale/swscale_unscaled.c |  3 +++
 4 files changed, 79 insertions(+)

Comments

Lynne Jan. 10, 2021, 6:55 p.m. UTC | #1
Jan 10, 2021, 17:43 by Reimar.Doeffinger@gmx.de:

> From: Reimar Döffinger <Reimar.Doeffinger@gmx.de>

I think I have to disagree.
The performance gains are marginal, it's definitely something the compiler should
be able to decide on its own, and it makes performance highly compiler-dependent.
And I'm not even resorting to the painfully obvious FUD arguments that could be made.

Most of the loops this is added to are trivially SIMDable. Just because no one has
had the motivation to do SIMD for a pretty unpopular codec doesn't mean we should
compromise.
Carl Eugen Hoyos Jan. 11, 2021, 12:26 a.m. UTC | #2
Am So., 10. Jan. 2021 um 19:55 Uhr schrieb Lynne <dev@lynne.ee>:
>
> Jan 10, 2021, 17:43 by Reimar.Doeffinger@gmx.de:
>
> > From: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
> >
> > This requests loops to be vectorized using SIMD
> > instructions.
> > The performance increase is far from hand-optimized
> > assembly but still significant over the plain C version.
> > Typical values are a 2-4x speedup where a hand-written
> > version would achieve 4x-10x.
> > So it is far from a replacement, however some architures
> > will get hand-written assembler quite late or not at all,
> > and this is a good improvement for a trivial amount of work.
> > The cause, besides the compiler being a compiler, is
> > usually that it does not manage to use saturating instructions
> > and thus has to use 32-bit operations where actually
> > saturating 16-bit operations would be sufficient.
> > Other causes are for example the av_clip functions that
> > are not ideal for vectorization (and even as scalar code
> > not optimal for any modern CPU that has either CSEL or
> > MAX/MIN instructions).
> > And of course this only works for relatively simple
> > loops, the IDCT functions for example seemed not possible
> > to optimize that way.
> > Also note that while clang may accept the code and sometimes
> > produces warnings, it does not seem to do anything actually
> > useful at all.
> > Here are example measurements using gcc 10 under Linux (in a VM unfortunately)
> > on AArch64 on Apple M1:
> > Commad:
> > time ./ffplay_g LG\ 4K\ HDR\ Demo\ -\ New\ York.ts -t 10 -autoexit -threads 1 -noframedrop
> >
> > Original code:
> > real    0m19.572s
> > user    0m23.386s
> > sys     0m0.213s
> >
> > Changing all put_hevc:
> > real    0m15.648s
> > user    0m19.503s (83.4% of original)
> > sys     0m0.186s
> >
> > In addition changing add_residual:
> > real    0m15.424s
> > user    0m19.278s (82.4% of original)
> > sys     0m0.133s
> >
> > In addition changing planar copy dither:
> > real    0m15.040s
> > user    0m18.874s (80.7% of original)
> > sys     0m0.168s
> >
>
> I think I have to disagree.

> The performance gains are marginal

This sounds wrong.

Carl Eugen
Paul B Mahol Jan. 11, 2021, 11:03 a.m. UTC | #3
On Mon, Jan 11, 2021 at 1:26 AM Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote:

> Am So., 10. Jan. 2021 um 19:55 Uhr schrieb Lynne <dev@lynne.ee>:
> >
> > Jan 10, 2021, 17:43 by Reimar.Doeffinger@gmx.de:
> > I think I have to disagree.
>
> > The performance gains are marginal
>
> This sounds wrong.
>

I disagree with Carl.


>
> Carl Eugen
Reimar Döffinger Jan. 12, 2021, 6:28 p.m. UTC | #4
> 
> On 10 Jan 2021, at 19:55, Lynne <dev@lynne.ee> wrote:
> 
> Jan 10, 2021, 17:43 by Reimar.Doeffinger@gmx.de:
> 
>> From: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
>> 
>> real    0m15.040s
>> user    0m18.874s (80.7% of original)
>> sys     0m0.168s
>> 
> 
> I think I have to disagree.
> The performance gains are marginal,

It’s almost 20%. At least for this combination of
codec and stream a large amount of time is spent in
non-DSP functions, so even hand-written assembler
won’t give you huge gains.


> its definitely something the compiler should
> be able to decide on its own,

So you object to unlikely() macros as well?
It’s really just giving the compiler a hint it should try, though I admit the configure part makes it
look otherwise.

> Most of the loops this is added to are trivially SIMDable.

How many hours of effort do you consider “trivial”?
Especially if it’s someone not experienced?
It might be fairly trivial with intrinsics, however
many of your counter-arguments also apply
to intrinsics (and to a degree inline assembly).
That’s btw not just a rhetorical question because
I’m pretty sure I am not going to all the trouble
to port more of the arm 32-bit assembler functions
since it’s a huge PITA, and I was wondering if there
was a point to even have a try with intrinsics...

> Just because no one has
> had the motivation to do SIMD for a pretty unpopular codec doesn't mean we should
> compromise.

If you think of AArch64 specifically, I can
kind of agree.
However I wouldn’t say the word “compromise”
is appropriate when there’s a good chance nothing
better will ever come to exist.
But the real point is not AArch64, that is just
a very convenient test platform.
The point is to raise the minimum bar.
A new architecture, RISC-V for example or something
else should not be stuck at scalar performance
until someone actually gets around to implementing
assembler optimizations.
And just to be clear: I don’t actually care about
HEVC, it just seemed a nice target to do some
experiments.

Best regards,
Reimar
Soft Works Jan. 12, 2021, 6:52 p.m. UTC | #5
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Reimar.Doeffinger@gmx.de
> Sent: Sunday, January 10, 2021 5:44 PM
> To: ffmpeg-devel@ffmpeg.org
> Cc: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
> Subject: [FFmpeg-devel] [PATCH] Add support for "omp simd" pragma.
> 
> From: Reimar Döffinger <Reimar.Doeffinger@gmx.de>

...

> +if enabled openmp_simd; then
> +    ompopt="-fopenmp"
> +    if ! test_cflags $ompopt ; then
> +        test_cflags -Xpreprocessor -fopenmp && ompopt="-Xpreprocessor -
> fopenmp"

Isn't it sufficient to specify -fopenmp-simd instead of -fopenmp for this patch?

As OMP SIMD is the only openmp feature that is used, there's no need to link
to the openmp lib. 
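A configure-style probe along these lines could look roughly like the following sketch (a hypothetical standalone script, not the actual configure code): it prefers -fopenmp-simd, which enables only the simd pragmas and pulls in no OpenMP runtime, and falls back to -fopenmp:

```shell
#!/bin/sh
# Sketch of an "omp simd" compiler-flag probe (illustrative only).
tmpc=$(mktemp /tmp/omp_test_XXXXXX.c)
tmpo=${tmpc%.c}.o
cat > "$tmpc" <<'EOF'
int sum(const int *a, int n)
{
    int s = 0;
#pragma omp simd reduction(+:s)
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}
EOF
CC=${CC:-cc}
if $CC -fopenmp-simd -c "$tmpc" -o "$tmpo" 2>/dev/null; then
    echo "omp_simd: enabled via -fopenmp-simd"   # no libgomp needed
elif $CC -fopenmp -c "$tmpc" -o "$tmpo" 2>/dev/null; then
    echo "omp_simd: enabled via -fopenmp"
else
    echo "omp_simd: disabled"
fi
rm -f "$tmpc" "$tmpo"
```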

softworkz
Reimar Döffinger Jan. 12, 2021, 7:17 p.m. UTC | #6
> On 12 Jan 2021, at 19:52, Soft Works <softworkz@hotmail.com> wrote:
> 
> 
> 
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
>> Reimar.Doeffinger@gmx.de
>> Sent: Sunday, January 10, 2021 5:44 PM
>> To: ffmpeg-devel@ffmpeg.org
>> Cc: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
>> Subject: [FFmpeg-devel] [PATCH] Add support for "omp simd" pragma.
>> 
>> From: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
> 
> ...
> 
>> +if enabled openmp_simd; then
>> +    ompopt="-fopenmp"
>> +    if ! test_cflags $ompopt ; then
>> +        test_cflags -Xpreprocessor -fopenmp && ompopt="-Xpreprocessor -
>> fopenmp"
> 
> Isn't it sufficient to specify -fopenmp-simd instead of -fopenmp for this patch?

I think so, I just didn’t know/even expect that option to exist!
Thanks a lot for the tip!

> As OMP SIMD is the only openmp feature that is used, there's no need to link
> to the openmp lib. 


That it doesn’t do anyway because -fopenmp is not in the linker flags,
but I admit that was a bit of a hacky solution.

Thanks,
Reimar
Lynne Jan. 12, 2021, 8:46 p.m. UTC | #7
Jan 12, 2021, 19:28 by Reimar.Doeffinger@gmx.de:

>>
>> On 10 Jan 2021, at 19:55, Lynne <dev@lynne.ee> wrote:
>>
>> Jan 10, 2021, 17:43 by Reimar.Doeffinger@gmx.de:
>>
>>> From: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
>>>
>>> real    0m15.040s
>>> user    0m18.874s (80.7% of original)
>>> sys     0m0.168s
>>>
>>
>> I think I have to disagree.
>> The performance gains are marginal,
>>
>
> It’s almost 20%. At least for this combination of
> codec and stream a large amount of time is spend in
> non-DSP functions, so even hand-written assembler
> won’t give you huge gains.
>
It's non-guaranteed 20% on a single system. It could change, and it could very
well mess up like gcc does with autovectorization, which we still explicitly disable
(via -fno-tree-vectorize) because FATE fails. I was the one who sent an RFC somewhat
recently to try to undo that; even though it was only an RFC, the reaction from devs
was quite cold.



>> its definitely something the compiler should
>> be able to decide on its own,
>>
>
> So you object to unlikely() macros as well?
> It’s really just giving the compiler a hint it should try, though I admit the configure part makes it
> look otherwise.
>
I'm more against the macro and changes to the code itself. If you can make it
work without adding a macro to individual loops or the likes of av_cold/av_hot or
any other changes to the code, I'll be more welcoming.
I really _hate_ compiler hints. Take a look at the upipe source code to see what
a Cthulhian monstrosity made of hint flags looks like. Every single branch had
a cold/hot macro and it was the project's coding style. It's completely irredeemable.



>> Most of the loops this is added to are trivially SIMDable.
>>
>
> How many hours of effort do you consider “trivial”?
> Especially if it’s someone not experienced?
> It might be fairly trivial with intrinsics, however
> many of your counter-arguments also apply
> to intrinsics (and to a degree inline assembly).
> That’s btw not just a rhetorical question because
> I’m pretty sure I am not going to all the trouble
> to port more of the arm 32-bit assembler functions
> since it’s a huge PITA, and I was wondering if there
> was a point to even have a try with intrinsics...
>
Intrinsics and inline assembly are a whole different thing than magic
macros that tell and force the compiler what a well written compiler
should already very well know about.



>> Just because no one has
>> had the motivation to do SIMD for a pretty unpopular codec doesn't mean we should
>> compromise.
>>
>
> If you think of AArch64 specifically, I can
> kind of agree.
> However I wouldn’t say the word “compromise”
> is appropriate when there’s a good chance nothing
> better will ever come to exist.
> But the real point is not AArch64, that is just
> a very convenient test platform.
> The point is to raise the minimum bar.
> A new architecture, RISC-V for example or something
> else should not be stuck at scalar performance
> until someone actually gets around to implementing
> assembler optimizations.
> And just to be clear: I don’t actually care about
> HEVC, it just seemed a nice target to do some
> experiments.
>
I already said all that can be said here: this will halt efforts on actually
optimizing the code in exchange for naive trust in compilers.
New platforms will be stuck at scalar performance anyway until
the compilers for the arch are smart enough to deal with vectorization.
Reimar Döffinger Jan. 12, 2021, 9:32 p.m. UTC | #8
> On 12 Jan 2021, at 21:46, Lynne <dev@lynne.ee> wrote:
> 
> Jan 12, 2021, 19:28 by Reimar.Doeffinger@gmx.de:
> 
>> It’s almost 20%. At least for this combination of
>> codec and stream a large amount of time is spend in
>> non-DSP functions, so even hand-written assembler
>> won’t give you huge gains.
>> 
> It's non-guaranteed 20% on a single system. It could change, and it could very
> well mess up like gcc does with autovectorization, which we still explicitly disable
> because FATE fails (-fno-tree-vectorize, and I was the one who sent an RFC to
> try to undo it somewhat recently. Even though it was an RFC the reaction from devs
> was quite cold).

Oh, thanks for the reminder, I thought that was gone because it seems
it’s not used for clang, and MPlayer does not seem to set that.
I need to compare it, however the problem with the auto-vectorization
is exactly that the compiler will try to apply it to everything,
which has at least 2 issues:
1) it gigantically increases the risk of bugs when it applies to every
single loop, instead of only to loops that we already wrote assembler
for somewhere.
2) it will quite often make things worse, by vectorizing loops
that are rarely iterated over more than a few times (and the
compiler needs to generate a whole lot of extra code to handle
loop counts that are not a multiple of the vector size) - because all too often the compiler
can only take a wild guess if “width” is usually 1 or 1920,
while we DO know.

>>> its definitely something the compiler should
>>> be able to decide on its own,
>>> 
>> 
>> So you object to unlikely() macros as well?
>> It’s really just giving the compiler a hint it should try, though I admit the configure part makes it
>> look otherwise.
>> 
> I'm more against the macro and changes to the code itself. If you can make it
> work without adding a macro to individual loops or the likes of av_cold/av_hot or
> any other changes to the code, I'll be more welcoming.

I expect that will just run into the same issue as the tree-vectorize...

> I really _hate_ compiler hints. Take a look at the upipe source code to see what
> a cthulian monstrosity made of hint flags looks like. Every single branch had
> a cold/hot macro and it was the project's coding style. It's completely irredeemable.

I guess my suggested solution would be to require proof of
clearly measurable performance benefit.
But I see the point that if it gets “randomly” added to loops
that might turn out quite a mess.

>>> Most of the loops this is added to are trivially SIMDable.
>>> 
>> 
>> How many hours of effort do you consider “trivial”?
>> Especially if it’s someone not experienced?
>> It might be fairly trivial with intrinsics, however
>> many of your counter-arguments also apply
>> to intrinsics (and to a degree inline assembly).
>> That’s btw not just a rhetorical question because
>> I’m pretty sure I am not going to all the trouble
>> to port more of the arm 32-bit assembler functions
>> since it’s a huge PITA, and I was wondering if there
>> was a point to even have a try with intrinsics...
>> 
> Intrinsics and inline assembly are a whole different thing than magic
> macros that tell and force the compiler what a well written compiler
> should already very well know about.

There are no well written compilers, in a way ;)
I would also argue that most of what intrinsics do,
such a compiler should figure out on its own, too.
And the first time I tried intrinsics they slowed the
loop down by a factor 2 because the compiler stored and
loaded the value to the stack between every intrinsic,
so it’s not like they are without problems.
But I was actually thinking that it might be somewhat
interesting to have a kind of “generic SIMD intrinsics”.
Though I think I read that such a thing has already been
tried, so it might just be wasted time.

> I already said all that can be said here: this will halt efforts on actually
> optimizing the code in exchange for naive trust in compilers.

I’m not sure it will discourage it more than having to write
the optimizations over and over, for Armv7 NEON, for Armv8 Linux,
for Armv8 Windows, then SVE/SVE2, who knows maybe Armv9
will also need a rewrite.
SSE2, AVX2, AVX-512 for x86; so much stuff never gets ported
to the new versions.
I’d also claim anyone naively trusting in compilers is unlikely
to write SIMD optimizations either way :)

> New platforms will be stuck at scalar performance anyway until
> the compilers for the arch are smart enough to deal with vectorization.

That seems to happen a long time before someone gets around to
optimising FFmpeg though.
This is particularly true when it’s a new OS and not CPU architecture
platform.
For example macOS we are lucky enough that the assembler etc. are
largely compatible to Linux.
But for Windows-on-Arm there is no GNU assembler, and the Microsoft
assembler needs a completely different syntax, so even the assembly
we DO have just doesn’t work.

Anyway, thanks for the discussion.
I still think the situation with SIMD optimizations should be improved
SOMEHOW, but I have nothing but wild ideas on the HOW.
If anyone feels the same, I’d welcome further discussion.

Thanks,
Reimar
Soft Works Jan. 12, 2021, 9:32 p.m. UTC | #9
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Lynne
> Sent: Tuesday, January 12, 2021 9:47 PM
> To: FFmpeg development discussions and patches <ffmpeg-
> devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH] Add support for "omp simd" pragma.
> 
> Jan 12, 2021, 19:28 by Reimar.Doeffinger@gmx.de:
> 
> >>
> >> On 10 Jan 2021, at 19:55, Lynne <dev@lynne.ee> wrote:
> >>
> >> Jan 10, 2021, 17:43 by Reimar.Doeffinger@gmx.de:
> >>
> >>> From: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
> >>>
> >>> real    0m15.040s
> >>> user    0m18.874s (80.7% of original)
> >>> sys     0m0.168s
> >>>
> >>
> >> I think I have to disagree.
> >> The performance gains are marginal,
> >>
> >
> > It’s almost 20%. At least for this combination of codec and stream a
> > large amount of time is spend in non-DSP functions, so even
> > hand-written assembler won’t give you huge gains.
> >
> It's non-guaranteed 20% on a single system. It could change, and it could very
> well mess up like gcc does with autovectorization, which we still explicitly
> disable because FATE fails (-fno-tree-vectorize, and I was the one who sent
> an RFC to try to undo it somewhat recently. Even though it was an RFC the
> reaction from devs was quite cold).

I wonder whether there's a way to enable autovectorization only for 
specific loops? But that would probably be compiler-specific.

> >> its definitely something the compiler should be able to decide on its
> >> own,
> >>
> >
> > So you object to unlikely() macros as well?
> > It’s really just giving the compiler a hint it should try, though I
> > admit the configure part makes it look otherwise.
> >
> I'm more against the macro and changes to the code itself. If you can make it
> work without adding a macro to individual loops or the likes of
> av_cold/av_hot or any other changes to the code, I'll be more welcoming.
> I really _hate_ compiler hints. Take a look at the upipe source code to see
> what a cthulian monstrosity made of hint flags looks like. Every single branch
> had a cold/hot macro and it was the project's coding style. It's completely
> irredeemable.

OpenMP is a standard at least, which is supported by many compilers and
#pragma omp simd is not really a "monstrosity".


> >> Most of the loops this is added to are trivially SIMDable.

Could you provide some examples? What constructs would you suggest
that can be applied trivially? And would they be compiled as SIMD even
though -fno-tree-vectorize is set?

Thanks,
softworkz
Martin Storsjö Jan. 13, 2021, 8:04 a.m. UTC | #10
Hi,

On Tue, 12 Jan 2021, Reimar Döffinger wrote:

> I’m not sure it will discourage it more than having to write
> the optimizations over and over, for Armv7 NEON, for Armv8 Linux,
> for Armv8 Windows, then SVE/SVE2, who knows maybe Armv9
> will also need a rewrite.

NEON code for armv8 windows and armv8 linux uses the exact same 
source, no need to write it twice.

> For example macOS we are lucky enough that the assembler etc. are
> largely compatible to Linux.

I'm not sure I'd say it's luck, it's pretty much by design there.

Historically, macOS build tools used an ancient fork of GAS, with very 
limited macroing capabilities. To remedy this, the gas-preprocessor tool 
was invented, for expanding modern gas macros, producing just a straight 
up feed of instructions, passed on to the native platform tools.

In modern times, the build tools are based on Clang/LLVM, and they support 
essentially all modern gas macro features (including altmacro, which was 
added in Clang 5). There are many parties that have an interest in this 
feature, e.g. support for building the Linux kernel with Clang.

Due to backwards compatibility with the old GAS fork's macroing capability 
(I think), there are some subtle differences between LLVM's macro 
support for other platforms and darwin, e.g. on other platforms, it's ok 
to invoke a macro as either "mymacro param1, param2" or "mymacro param1 
param2" (without commas between the arguments). On darwin targets, only 
the former works as intended.

All other platform differences are abstracted away with our macros in 
libavutil/aarch64/asm.S, see e.g. 
http://git.videolan.org/?p=ffmpeg.git;a=blob;f=libavutil/aarch64/asm.S;h=d1fa72b3c65a4a58e76029e94b998d935649aa90;hb=ca21cb1e36ccae2ee71d4299d477fa9284c1f551#l85.
For darwin platforms, the movrel macro expands to "add rX, rX, 
symbol@PAGEOFF" while it expands to "add rX, rX, :lo12:symbol" on other 
platforms (ELF and COFF).

All the source files just use the high level macros function, endfunc, 
movrel, etc, which handle the few platform specific details that differ.


If you write code with just one tool, it's of course certainly possible to 
accidentally use some corner case detail that another tool objects to, but 
that's why one needs testing on multiple platforms, via a CI system or 
FATE or whatever. Just like you regularly need to test C code on various 
platforms, even if you'd expect it to work if it's properly written.

> But for Windows-on-Arm there is no GNU assembler, and the Microsoft
> assembler needs a completely different syntax, so even the assembly
> we DO have just doesn’t work.

This is not true at all.

GCC and binutils don't support windows on arm/arm64 at all, that far is 
true.

But Clang/LLVM do (with https://github.com/mstorsjo/llvm-mingw you have an 
easily available packaged cross compiler and all), and they support the 
GAS syntax asm just fine.

If building with MSVC tools, yes you're right that armasm.exe/armasm64.exe 
takes a different syntax. But the gas-preprocessor tool (which is picked 
up automatically by our configure, one just needs to make sure it's 
available) handles expanding all the macros and rewriting directives into 
the armasm form, and feeding it to the armasm tools. Works fine and have 
done so for years. There's even a wiki page which tries to explain how to 
do it (although it's probably outdated in some aspects), see 
https://trac.ffmpeg.org/wiki/CompilationGuide/WinRT.

We even have regular fate tests of these configurations, see e.g. these:

http://fate.ffmpeg.org/report.cgi?slot=aarch64-mingw32-clang-trunk&time=20210113064430

http://fate.ffmpeg.org/report.cgi?time=20210109152105&slot=arm64-msvc2019

http://fate.ffmpeg.org/report.cgi?slot=armv7-mingw32-clang-trunk&time=20210113055653

http://fate.ffmpeg.org/report.cgi?time=20210109163844&slot=arm-msvc2019-phone

All of these run with full assembly optimizations enabled. So please don't 
tell me that our assembly doesn't work on windows on arm, because it does, 
and it has for years.

// Martin
Reimar Döffinger Jan. 13, 2021, 1:48 p.m. UTC | #11
> If building with MSVC tools, yes you're right that armasm.exe/armasm64.exe takes a different syntax. But the gas-preprocessor tool (which is picked up automatically by our configure, one just needs to make sure it's available) handles expanding all the macros and rewriting directives into the armasm form, and feeding it to the armasm tools. Works fine and have done so for years. There's even a wiki page which tries to explain how to do it (although it's probably outdated in some aspects), see https://trac.ffmpeg.org/wiki/CompilationGuide/WinRT.
> 

I went with the instructions in doc/platform.texi and that did not work at all;
it even tried to use cl.exe to compile the assembler files!

> All of these run with full assembly optimizations enabled. So please don't tell me that our assembly doesn't work on windows on arm, because it does, and it has for years.
> 

My apologies. I’ll correct that to: it doesn’t work using the instructions
shipped with the source code (as far as I can tell) :)
Martin Storsjö Jan. 13, 2021, 2:16 p.m. UTC | #12
On Wed, 13 Jan 2021, Reimar Döffinger wrote:

>> If building with MSVC tools, yes you're right that armasm.exe/armasm64.exe takes a different syntax. But the gas-preprocessor tool (which is picked up automatically by our configure, one just needs to make sure it's available) handles expanding all the macros and rewriting directives into the armasm form, and feeding it to the armasm tools. Works fine and have done so for years. There's even a wiki page which tries to explain how to do it (although it's probably outdated in some aspects), see https://trac.ffmpeg.org/wiki/CompilationGuide/WinRT.
>> 
>
> I went with the instructions in doc/platform.texi and that did not work at
> all; it even tried to use cl.exe to compile the assembler files!

What did you end up trying/doing in this case? That sounds rather broken 
to me.

The main issue is just having gas-preprocessor available (but since you 
need a POSIX make, e.g. from msys2, it should be pretty easy to have a 
perl installation for gas-preprocessor there). If it isn't found, 
configure really should be erroring out rather than silently using cl as 
the assembler...

My own setups for fate are a bit special as they're cross-compiled from 
Linux (with MSVC wrapped in Wine), but it should essentially just be 
"./configure --arch=arm64 --target-os=win32 --toolchain=msvc 
--enable-cross-compile", assuming you have MSVC targeting arm64 in $PATH.

// Martin
diff mbox series

Patch

diff --git a/configure b/configure
index 900505756b..73b7c3daeb 100755
--- a/configure
+++ b/configure
@@ -406,6 +406,7 @@  Toolchain options:
   --enable-pic             build position-independent code
   --enable-thumb           compile for Thumb instruction set
   --enable-lto             use link-time optimization
+  --enable-openmp-simd     use the "omp simd" pragma to optimize code
   --env="ENV=override"     override the environment variables
 
 Advanced options (experts only):
@@ -2335,6 +2336,7 @@  HAVE_LIST="
     opencl_dxva2
     opencl_vaapi_beignet
     opencl_vaapi_intel_media
+    openmp_simd
     perl
     pod2man
     texi2html
@@ -2446,6 +2448,7 @@  CMDLINE_SELECT="
     extra_warnings
     logging
     lto
+    openmp_simd
     optimizations
     rpath
     stripping
@@ -6926,6 +6929,26 @@  if enabled lto; then
     disable inline_asm_direct_symbol_refs
 fi
 
+if enabled openmp_simd; then
+    ompopt="-fopenmp"
+    if ! test_cflags $ompopt ; then
+        test_cflags -Xpreprocessor -fopenmp && ompopt="-Xpreprocessor -fopenmp"
+    fi
+    test_cc $ompopt <<EOF && add_cflags "$ompopt" || die "failed to enable openmp SIMD"
+#ifndef _OPENMP
+#error _OPENMP is not defined
+#endif
+void test(unsigned char *c)
+{
+    _Pragma("omp simd")
+    for (int i = 0; i < 256; i++)
+    {
+        c[i] *= 16;
+    }
+}
+EOF
+fi
+
 enabled ftrapv && check_cflags -ftrapv
 
 test_cc -mno-red-zone <<EOF && noredzone_flags="-mno-red-zone"
diff --git a/libavcodec/hevcdsp_template.c b/libavcodec/hevcdsp_template.c
index 56cd9e605d..1a8b4160ec 100644
--- a/libavcodec/hevcdsp_template.c
+++ b/libavcodec/hevcdsp_template.c
@@ -50,6 +50,7 @@  static av_always_inline void FUNC(add_residual)(uint8_t *_dst, int16_t *res,
     stride /= sizeof(pixel);
 
     for (y = 0; y < size; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < size; x++) {
             dst[x] = av_clip_pixel(dst[x] + *res);
             res++;
@@ -247,6 +248,7 @@  static void FUNC(idct_ ## H ## x ## H )(int16_t *coeffs,          \
     int16_t *src   = coeffs;                                      \
     IDCT_VAR ## H(H);                                             \
                                                                   \
+    FF_OMP_SIMD                                                   \
     for (i = 0; i < H; i++) {                                     \
         TR_ ## H(src, src, H, H, SCALE, limit2);                  \
         if (limit2 < H && i%4 == 0 && !!i)                        \
@@ -256,6 +258,7 @@  static void FUNC(idct_ ## H ## x ## H )(int16_t *coeffs,          \
                                                                   \
     shift = 20 - BIT_DEPTH;                                       \
     add   = 1 << (shift - 1);                                     \
+    FF_OMP_SIMD                                                   \
     for (i = 0; i < H; i++) {                                     \
         TR_ ## H(coeffs, coeffs, 1, 1, SCALE, limit);             \
         coeffs += H;                                              \
@@ -502,6 +505,7 @@  static void FUNC(put_hevc_pel_pixels)(int16_t *dst,
     ptrdiff_t srcstride = _srcstride / sizeof(pixel);
 
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = src[x] << (14 - BIT_DEPTH);
         src += srcstride;
@@ -543,6 +547,7 @@  static void FUNC(put_hevc_pel_bi_pixels)(uint8_t *_dst, ptrdiff_t _dststride, ui
 #endif
 
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((src[x] << (14 - BIT_DEPTH)) + src2[x] + offset) >> shift);
         src  += srcstride;
@@ -568,6 +573,7 @@  static void FUNC(put_hevc_pel_uni_w_pixels)(uint8_t *_dst, ptrdiff_t _dststride,
 
     ox     = ox * (1 << (BIT_DEPTH - 8));
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel((((src[x] << (14 - BIT_DEPTH)) * wx + offset) >> shift) + ox);
         src += srcstride;
@@ -592,6 +598,7 @@  static void FUNC(put_hevc_pel_bi_w_pixels)(uint8_t *_dst, ptrdiff_t _dststride,
     ox0     = ox0 * (1 << (BIT_DEPTH - 8));
     ox1     = ox1 * (1 << (BIT_DEPTH - 8));
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++) {
             dst[x] = av_clip_pixel(( (src[x] << (14 - BIT_DEPTH)) * wx1 + src2[x] * wx0 + (ox0 + ox1 + 1) * (1 << log2Wd)) >> (log2Wd + 1));
         }
@@ -623,6 +630,7 @@  static void FUNC(put_hevc_qpel_h)(int16_t *dst,
     ptrdiff_t     srcstride = _srcstride / sizeof(pixel);
     const int8_t *filter    = ff_hevc_qpel_filters[mx - 1];
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
         src += srcstride;
@@ -639,6 +647,7 @@  static void FUNC(put_hevc_qpel_v)(int16_t *dst,
     ptrdiff_t     srcstride = _srcstride / sizeof(pixel);
     const int8_t *filter    = ff_hevc_qpel_filters[my - 1];
     for (y = 0; y < height; y++)  {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = QPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8);
         src += srcstride;
@@ -662,6 +671,7 @@  static void FUNC(put_hevc_qpel_hv)(int16_t *dst,
     src   -= QPEL_EXTRA_BEFORE * srcstride;
     filter = ff_hevc_qpel_filters[mx - 1];
     for (y = 0; y < height + QPEL_EXTRA; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             tmp[x] = QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
         src += srcstride;
@@ -671,6 +681,7 @@  static void FUNC(put_hevc_qpel_hv)(int16_t *dst,
     tmp    = tmp_array + QPEL_EXTRA_BEFORE * MAX_PB_SIZE;
     filter = ff_hevc_qpel_filters[my - 1];
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = QPEL_FILTER(tmp, MAX_PB_SIZE) >> 6;
         tmp += MAX_PB_SIZE;
@@ -697,6 +708,7 @@  static void FUNC(put_hevc_qpel_uni_h)(uint8_t *_dst,  ptrdiff_t _dststride,
 #endif
 
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8)) + offset) >> shift);
         src += srcstride;
@@ -724,6 +736,7 @@  static void FUNC(put_hevc_qpel_bi_h)(uint8_t *_dst, ptrdiff_t _dststride, uint8_
 #endif
 
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8)) + src2[x] + offset) >> shift);
         src  += srcstride;
@@ -751,6 +764,7 @@  static void FUNC(put_hevc_qpel_uni_v)(uint8_t *_dst,  ptrdiff_t _dststride,
 #endif
 
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((QPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8)) + offset) >> shift);
         src += srcstride;
@@ -779,6 +793,7 @@  static void FUNC(put_hevc_qpel_bi_v)(uint8_t *_dst, ptrdiff_t _dststride, uint8_
 #endif
 
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((QPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8)) + src2[x] + offset) >> shift);
         src  += srcstride;
@@ -810,6 +825,7 @@  static void FUNC(put_hevc_qpel_uni_hv)(uint8_t *_dst,  ptrdiff_t _dststride,
     src   -= QPEL_EXTRA_BEFORE * srcstride;
     filter = ff_hevc_qpel_filters[mx - 1];
     for (y = 0; y < height + QPEL_EXTRA; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             tmp[x] = QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
         src += srcstride;
@@ -820,6 +836,7 @@  static void FUNC(put_hevc_qpel_uni_hv)(uint8_t *_dst,  ptrdiff_t _dststride,
     filter = ff_hevc_qpel_filters[my - 1];
 
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((QPEL_FILTER(tmp, MAX_PB_SIZE) >> 6) + offset) >> shift);
         tmp += MAX_PB_SIZE;
@@ -849,6 +866,7 @@  static void FUNC(put_hevc_qpel_bi_hv)(uint8_t *_dst, ptrdiff_t _dststride, uint8
     src   -= QPEL_EXTRA_BEFORE * srcstride;
     filter = ff_hevc_qpel_filters[mx - 1];
     for (y = 0; y < height + QPEL_EXTRA; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             tmp[x] = QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
         src += srcstride;
@@ -859,6 +877,7 @@  static void FUNC(put_hevc_qpel_bi_hv)(uint8_t *_dst, ptrdiff_t _dststride, uint8
     filter = ff_hevc_qpel_filters[my - 1];
 
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((QPEL_FILTER(tmp, MAX_PB_SIZE) >> 6) + src2[x] + offset) >> shift);
         tmp  += MAX_PB_SIZE;
@@ -887,6 +906,7 @@  static void FUNC(put_hevc_qpel_uni_w_h)(uint8_t *_dst,  ptrdiff_t _dststride,
 
     ox = ox * (1 << (BIT_DEPTH - 8));
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel((((QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8)) * wx + offset) >> shift) + ox);
         src += srcstride;
@@ -913,6 +933,7 @@  static void FUNC(put_hevc_qpel_bi_w_h)(uint8_t *_dst, ptrdiff_t _dststride, uint
     ox0     = ox0 * (1 << (BIT_DEPTH - 8));
     ox1     = ox1 * (1 << (BIT_DEPTH - 8));
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8)) * wx1 + src2[x] * wx0 +
                                     ((ox0 + ox1 + 1) * (1 << log2Wd))) >> (log2Wd + 1));
@@ -942,6 +963,7 @@  static void FUNC(put_hevc_qpel_uni_w_v)(uint8_t *_dst,  ptrdiff_t _dststride,
 
     ox = ox * (1 << (BIT_DEPTH - 8));
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel((((QPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8)) * wx + offset) >> shift) + ox);
         src += srcstride;
@@ -968,6 +990,7 @@  static void FUNC(put_hevc_qpel_bi_w_v)(uint8_t *_dst, ptrdiff_t _dststride, uint
     ox0     = ox0 * (1 << (BIT_DEPTH - 8));
     ox1     = ox1 * (1 << (BIT_DEPTH - 8));
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((QPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8)) * wx1 + src2[x] * wx0 +
                                     ((ox0 + ox1 + 1) * (1 << log2Wd))) >> (log2Wd + 1));
@@ -1000,6 +1023,7 @@  static void FUNC(put_hevc_qpel_uni_w_hv)(uint8_t *_dst,  ptrdiff_t _dststride,
     src   -= QPEL_EXTRA_BEFORE * srcstride;
     filter = ff_hevc_qpel_filters[mx - 1];
     for (y = 0; y < height + QPEL_EXTRA; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             tmp[x] = QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
         src += srcstride;
@@ -1011,6 +1035,7 @@  static void FUNC(put_hevc_qpel_uni_w_hv)(uint8_t *_dst,  ptrdiff_t _dststride,
 
     ox = ox * (1 << (BIT_DEPTH - 8));
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel((((QPEL_FILTER(tmp, MAX_PB_SIZE) >> 6) * wx + offset) >> shift) + ox);
         tmp += MAX_PB_SIZE;
@@ -1037,6 +1062,7 @@  static void FUNC(put_hevc_qpel_bi_w_hv)(uint8_t *_dst, ptrdiff_t _dststride, uin
     src   -= QPEL_EXTRA_BEFORE * srcstride;
     filter = ff_hevc_qpel_filters[mx - 1];
     for (y = 0; y < height + QPEL_EXTRA; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             tmp[x] = QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
         src += srcstride;
@@ -1049,6 +1075,7 @@  static void FUNC(put_hevc_qpel_bi_w_hv)(uint8_t *_dst, ptrdiff_t _dststride, uin
     ox0     = ox0 * (1 << (BIT_DEPTH - 8));
     ox1     = ox1 * (1 << (BIT_DEPTH - 8));
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((QPEL_FILTER(tmp, MAX_PB_SIZE) >> 6) * wx1 + src2[x] * wx0 +
                                     ((ox0 + ox1 + 1) * (1 << log2Wd))) >> (log2Wd + 1));
@@ -1076,6 +1103,7 @@  static void FUNC(put_hevc_epel_h)(int16_t *dst,
     ptrdiff_t srcstride  = _srcstride / sizeof(pixel);
     const int8_t *filter = ff_hevc_epel_filters[mx - 1];
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
         src += srcstride;
@@ -1093,6 +1121,7 @@  static void FUNC(put_hevc_epel_v)(int16_t *dst,
     const int8_t *filter = ff_hevc_epel_filters[my - 1];
 
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = EPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8);
         src += srcstride;
@@ -1114,6 +1143,7 @@  static void FUNC(put_hevc_epel_hv)(int16_t *dst,
     src -= EPEL_EXTRA_BEFORE * srcstride;
 
     for (y = 0; y < height + EPEL_EXTRA; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             tmp[x] = EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
         src += srcstride;
@@ -1124,6 +1154,7 @@  static void FUNC(put_hevc_epel_hv)(int16_t *dst,
     filter = ff_hevc_epel_filters[my - 1];
 
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = EPEL_FILTER(tmp, MAX_PB_SIZE) >> 6;
         tmp += MAX_PB_SIZE;
@@ -1148,6 +1179,7 @@  static void FUNC(put_hevc_epel_uni_h)(uint8_t *_dst, ptrdiff_t _dststride, uint8
 #endif
 
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8)) + offset) >> shift);
         src += srcstride;
@@ -1173,6 +1205,7 @@  static void FUNC(put_hevc_epel_bi_h)(uint8_t *_dst, ptrdiff_t _dststride, uint8_
 #endif
 
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++) {
             dst[x] = av_clip_pixel(((EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8)) + src2[x] + offset) >> shift);
         }
@@ -1199,6 +1232,7 @@  static void FUNC(put_hevc_epel_uni_v)(uint8_t *_dst, ptrdiff_t _dststride, uint8
 #endif
 
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((EPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8)) + offset) >> shift);
         src += srcstride;
@@ -1224,6 +1258,7 @@  static void FUNC(put_hevc_epel_bi_v)(uint8_t *_dst, ptrdiff_t _dststride, uint8_
 #endif
 
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((EPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8)) + src2[x] + offset) >> shift);
         dst  += dststride;
@@ -1253,6 +1288,7 @@  static void FUNC(put_hevc_epel_uni_hv)(uint8_t *_dst, ptrdiff_t _dststride, uint
     src -= EPEL_EXTRA_BEFORE * srcstride;
 
     for (y = 0; y < height + EPEL_EXTRA; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             tmp[x] = EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
         src += srcstride;
@@ -1263,6 +1299,7 @@  static void FUNC(put_hevc_epel_uni_hv)(uint8_t *_dst, ptrdiff_t _dststride, uint
     filter = ff_hevc_epel_filters[my - 1];
 
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((EPEL_FILTER(tmp, MAX_PB_SIZE) >> 6) + offset) >> shift);
         tmp += MAX_PB_SIZE;
@@ -1292,6 +1329,7 @@  static void FUNC(put_hevc_epel_bi_hv)(uint8_t *_dst, ptrdiff_t _dststride, uint8
     src -= EPEL_EXTRA_BEFORE * srcstride;
 
     for (y = 0; y < height + EPEL_EXTRA; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             tmp[x] = EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
         src += srcstride;
@@ -1302,6 +1340,7 @@  static void FUNC(put_hevc_epel_bi_hv)(uint8_t *_dst, ptrdiff_t _dststride, uint8
     filter = ff_hevc_epel_filters[my - 1];
 
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((EPEL_FILTER(tmp, MAX_PB_SIZE) >> 6) + src2[x] + offset) >> shift);
         tmp  += MAX_PB_SIZE;
@@ -1328,6 +1367,7 @@  static void FUNC(put_hevc_epel_uni_w_h)(uint8_t *_dst, ptrdiff_t _dststride, uin
 
     ox     = ox * (1 << (BIT_DEPTH - 8));
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++) {
             dst[x] = av_clip_pixel((((EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8)) * wx + offset) >> shift) + ox);
         }
@@ -1353,6 +1393,7 @@  static void FUNC(put_hevc_epel_bi_w_h)(uint8_t *_dst, ptrdiff_t _dststride, uint
     ox0     = ox0 * (1 << (BIT_DEPTH - 8));
     ox1     = ox1 * (1 << (BIT_DEPTH - 8));
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8)) * wx1 + src2[x] * wx0 +
                                     ((ox0 + ox1 + 1) * (1 << log2Wd))) >> (log2Wd + 1));
@@ -1380,6 +1421,7 @@  static void FUNC(put_hevc_epel_uni_w_v)(uint8_t *_dst, ptrdiff_t _dststride, uin
 
     ox     = ox * (1 << (BIT_DEPTH - 8));
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++) {
             dst[x] = av_clip_pixel((((EPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8)) * wx + offset) >> shift) + ox);
         }
@@ -1405,6 +1447,7 @@  static void FUNC(put_hevc_epel_bi_w_v)(uint8_t *_dst, ptrdiff_t _dststride, uint
     ox0     = ox0 * (1 << (BIT_DEPTH - 8));
     ox1     = ox1 * (1 << (BIT_DEPTH - 8));
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((EPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8)) * wx1 + src2[x] * wx0 +
                                     ((ox0 + ox1 + 1) * (1 << log2Wd))) >> (log2Wd + 1));
@@ -1435,6 +1478,7 @@  static void FUNC(put_hevc_epel_uni_w_hv)(uint8_t *_dst, ptrdiff_t _dststride, ui
     src -= EPEL_EXTRA_BEFORE * srcstride;
 
     for (y = 0; y < height + EPEL_EXTRA; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             tmp[x] = EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
         src += srcstride;
@@ -1446,6 +1490,7 @@  static void FUNC(put_hevc_epel_uni_w_hv)(uint8_t *_dst, ptrdiff_t _dststride, ui
 
     ox     = ox * (1 << (BIT_DEPTH - 8));
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel((((EPEL_FILTER(tmp, MAX_PB_SIZE) >> 6) * wx + offset) >> shift) + ox);
         tmp += MAX_PB_SIZE;
@@ -1472,6 +1517,7 @@  static void FUNC(put_hevc_epel_bi_w_hv)(uint8_t *_dst, ptrdiff_t _dststride, uin
     src -= EPEL_EXTRA_BEFORE * srcstride;
 
     for (y = 0; y < height + EPEL_EXTRA; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             tmp[x] = EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
         src += srcstride;
@@ -1484,6 +1530,7 @@  static void FUNC(put_hevc_epel_bi_w_hv)(uint8_t *_dst, ptrdiff_t _dststride, uin
     ox0     = ox0 * (1 << (BIT_DEPTH - 8));
     ox1     = ox1 * (1 << (BIT_DEPTH - 8));
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((EPEL_FILTER(tmp, MAX_PB_SIZE) >> 6) * wx1 + src2[x] * wx0 +
                                     ((ox0 + ox1 + 1) * (1 << log2Wd))) >> (log2Wd + 1));
diff --git a/libavutil/internal.h b/libavutil/internal.h
index 93ea57c324..b0543bbf02 100644
--- a/libavutil/internal.h
+++ b/libavutil/internal.h
@@ -299,4 +299,10 @@  int avpriv_dict_set_timestamp(AVDictionary **dict, const char *key, int64_t time
 #define FF_PSEUDOPAL 0
 #endif
 
+#if HAVE_OPENMP_SIMD
+#define FF_OMP_SIMD _Pragma("omp simd")
+#else
+#define FF_OMP_SIMD
+#endif
+
 #endif /* AVUTIL_INTERNAL_H */
diff --git a/libswscale/swscale_unscaled.c b/libswscale/swscale_unscaled.c
index c4dd8a4d83..c112a61037 100644
--- a/libswscale/swscale_unscaled.c
+++ b/libswscale/swscale_unscaled.c
@@ -1743,6 +1743,7 @@  static int packedCopyWrapper(SwsContext *c, const uint8_t *src[],
     unsigned shift= src_depth-dst_depth, tmp;\
     if (c->dither == SWS_DITHER_NONE) {\
         for (i = 0; i < height; i++) {\
+            FF_OMP_SIMD \
             for (j = 0; j < length-7; j+=8) {\
                 dst[j+0] = dbswap(bswap(src[j+0])>>shift);\
                 dst[j+1] = dbswap(bswap(src[j+1])>>shift);\
@@ -1762,6 +1763,7 @@  static int packedCopyWrapper(SwsContext *c, const uint8_t *src[],
     } else if (shiftonly) {\
         for (i = 0; i < height; i++) {\
             const uint8_t *dither= dithers[shift-1][i&7];\
+            FF_OMP_SIMD \
             for (j = 0; j < length-7; j+=8) {\
                 tmp = (bswap(src[j+0]) + dither[0])>>shift; dst[j+0] = dbswap(tmp - (tmp>>dst_depth));\
                 tmp = (bswap(src[j+1]) + dither[1])>>shift; dst[j+1] = dbswap(tmp - (tmp>>dst_depth));\
@@ -1781,6 +1783,7 @@  static int packedCopyWrapper(SwsContext *c, const uint8_t *src[],
     } else {\
         for (i = 0; i < height; i++) {\
             const uint8_t *dither= dithers[shift-1][i&7];\
+            FF_OMP_SIMD \
             for (j = 0; j < length-7; j+=8) {\
                 tmp = bswap(src[j+0]); dst[j+0] = dbswap((tmp - (tmp>>dst_depth) + dither[0])>>shift);\
                 tmp = bswap(src[j+1]); dst[j+1] = dbswap((tmp - (tmp>>dst_depth) + dither[1])>>shift);\
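To illustrate what the FF_OMP_SIMD macro introduced in libavutil/internal.h does, here is a minimal standalone sketch. Note the names clip_u8 and add_residual_sketch are hypothetical stand-ins invented for illustration (for av_clip_pixel at 8-bit depth and the add_residual loop, respectively), and the guard is keyed on _OPENMP so the file compiles on its own, unlike the HAVE_OPENMP_SIMD configure-generated macro used in the patch:

```c
#include <stdint.h>

/* Sketch of the patch's idea: FF_OMP_SIMD expands to _Pragma("omp simd")
 * when OpenMP is enabled (gcc: -fopenmp or the lighter -fopenmp-simd),
 * asking the compiler to vectorize the following loop, and expands to
 * nothing otherwise, leaving the plain scalar C code unchanged. */
#ifdef _OPENMP
#define FF_OMP_SIMD _Pragma("omp simd")
#else
#define FF_OMP_SIMD
#endif

/* Hypothetical stand-in for av_clip_pixel() at 8-bit depth. */
static inline uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : v;
}

/* Hypothetical stand-in for the add_residual() inner loop in the patch:
 * add a 16-bit residual to each pixel with saturation to [0, 255]. */
void add_residual_sketch(uint8_t *dst, const int16_t *res, int n)
{
    FF_OMP_SIMD
    for (int i = 0; i < n; i++)
        dst[i] = clip_u8(dst[i] + res[i]);
}
```

Compiled without OpenMP the pragma is a no-op; with it enabled the loop becomes a vectorization candidate, which is all the pragma promises: a hint, not a guarantee, as the measurements in the commit message show.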