Message ID: 20210110164351.86350-1-Reimar.Doeffinger@gmx.de
State: New
Series: [FFmpeg-devel] Add support for "omp simd" pragma.
Context | Check | Description |
---|---|---|
andriy/x86_make | success | Make finished |
andriy/x86_make_fate | success | Make fate finished |
andriy/PPC64_make | success | Make finished |
andriy/PPC64_make_fate | success | Make fate finished |
Jan 10, 2021, 17:43 by Reimar.Doeffinger@gmx.de:

> From: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
>
> This requests loops to be vectorized using SIMD instructions.
> The performance increase is far from hand-optimized assembly but still
> significant over the plain C version. Typical values are a 2-4x speedup
> where a hand-written version would achieve 4x-10x.
> So it is far from a replacement, however some architectures will get
> hand-written assembler quite late or not at all, and this is a good
> improvement for a trivial amount of work.
> The cause, besides the compiler being a compiler, is usually that it
> does not manage to use saturating instructions and thus has to use
> 32-bit operations where actually saturating 16-bit operations would be
> sufficient.
> Other causes are for example the av_clip functions, which are not ideal
> for vectorization (and even as scalar code not optimal for any modern
> CPU that has either CSEL or MAX/MIN instructions).
> And of course this only works for relatively simple loops; the IDCT
> functions for example seemed not possible to optimize that way.
> Also note that while clang may accept the code and sometimes produces
> warnings, it does not seem to do anything actually useful at all.
>
> Here are example measurements using gcc 10 under Linux (in a VM
> unfortunately) on AArch64 on Apple M1:
> Command:
> time ./ffplay_g LG\ 4K\ HDR\ Demo\ -\ New\ York.ts -t 10 -autoexit -threads 1 -noframedrop
>
> Original code:
> real 0m19.572s
> user 0m23.386s
> sys 0m0.213s
>
> Changing all put_hevc:
> real 0m15.648s
> user 0m19.503s (83.4% of original)
> sys 0m0.186s
>
> In addition changing add_residual:
> real 0m15.424s
> user 0m19.278s (82.4% of original)
> sys 0m0.133s
>
> In addition changing planar copy dither:
> real 0m15.040s
> user 0m18.874s (80.7% of original)
> sys 0m0.168s

I think I have to disagree.
The performance gains are marginal, it's definitely something the compiler should be able to decide on its own, and it makes performance highly compiler dependent. And I'm not even resorting to the painfully obvious FUD arguments that could be made. Most of the loops this is added to are trivially SIMDable. Just because no one has had the motivation to do SIMD for a pretty unpopular codec doesn't mean we should compromise.
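The kind of loop the patch description targets can be illustrated with a small standalone sketch (this is not code from the patch; the av_clip_int16-style clamp is written out inline). The pragma is only a hint: built without OpenMP support it is ignored and the scalar code is unchanged.

```c
#include <assert.h>
#include <stdint.h>

/* av_clip_int16-style clamp, written with plain comparisons so the
 * compiler can map it to MIN/MAX or saturating SIMD instructions. */
static inline int clip_int16(int v)
{
    if (v < -32768) return -32768;
    if (v >  32767) return  32767;
    return v;
}

/* Clamped 16-bit add, the pattern described above: with "omp simd"
 * enabled this can vectorize to saturating 16-bit SIMD operations
 * instead of widening everything to 32 bits. */
static void add_clipped(int16_t *dst, const int16_t *src, int n)
{
#ifdef _OPENMP
#pragma omp simd
#endif
    for (int i = 0; i < n; i++)
        dst[i] = clip_int16(dst[i] + src[i]);
}
```

Compiled with GCC's -fopenmp-simd (or -fopenmp), the pragma asks the vectorizer to handle this loop even when autovectorization is otherwise disabled.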
On Sun, Jan 10, 2021 at 19:55, Lynne <dev@lynne.ee> wrote:
> Jan 10, 2021, 17:43 by Reimar.Doeffinger@gmx.de:
> [...]
> I think I have to disagree.
> The performance gains are marginal

This sounds wrong.

Carl Eugen
On Mon, Jan 11, 2021 at 1:26 AM Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote:
> On Sun, Jan 10, 2021 at 19:55, Lynne <dev@lynne.ee> wrote:
> > [...]
> > I think I have to disagree.
> > The performance gains are marginal
>
> This sounds wrong.

I disagree with Carl.
On 10 Jan 2021, at 19:55, Lynne <dev@lynne.ee> wrote:
> Jan 10, 2021, 17:43 by Reimar.Doeffinger@gmx.de:
>> From: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
>> [...]
>> real 0m15.040s
>> user 0m18.874s (80.7% of original)
>> sys 0m0.168s
>
> I think I have to disagree.
> The performance gains are marginal,

It's almost 20%. At least for this combination of codec and stream a large amount of time is spent in non-DSP functions, so even hand-written assembler won't give you huge gains.

> it's definitely something the compiler should be able to decide on its own,

So you object to unlikely() macros as well? It's really just giving the compiler a hint it should try, though I admit the configure part makes it look otherwise.

> Most of the loops this is added to are trivially SIMDable.

How many hours of effort do you consider "trivial"? Especially if it's someone not experienced? It might be fairly trivial with intrinsics, however many of your counter-arguments also apply to intrinsics (and to a degree inline assembly). That's btw not just a rhetorical question, because I'm pretty sure I am not going to go to all the trouble of porting more of the arm 32-bit assembler functions since it's a huge PITA, and I was wondering if there was a point to even have a try with intrinsics...

> Just because no one has had the motivation to do SIMD for a pretty unpopular codec doesn't mean we should compromise.

If you think of AArch64 specifically, I can kind of agree. However I wouldn't say the word "compromise" is appropriate when there's a good chance nothing better will ever come to exist. But the real point is not AArch64, that is just a very convenient test platform. The point is to raise the minimum bar. A new architecture, RISC-V for example or something else, should not be stuck at scalar performance until someone actually gets around to implementing assembler optimizations.

And just to be clear: I don't actually care about HEVC, it just seemed a nice target to do some experiments.

Best regards,
Reimar
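The unlikely() hint macros mentioned above are conventionally built on __builtin_expect in GCC and Clang; a minimal sketch of the idiom (the macro names and checked_div example are illustrative, not FFmpeg code):

```c
#include <assert.h>

/* Branch-probability hints in the style discussed above. GCC and Clang
 * implement them via __builtin_expect; other compilers simply see the
 * bare expression, so the hint degrades gracefully. */
#if defined(__GNUC__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

/* Illustrative use: the error path is marked as rarely taken, so the
 * compiler lays out the hot path fall-through. */
static int checked_div(int a, int b)
{
    if (unlikely(b == 0))
        return 0;
    return a / b;
}
```

Like "omp simd", this changes nothing semantically; it only nudges code generation.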
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Reimar.Doeffinger@gmx.de
> Sent: Sunday, January 10, 2021 5:44 PM
> To: ffmpeg-devel@ffmpeg.org
> Cc: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
> Subject: [FFmpeg-devel] [PATCH] Add support for "omp simd" pragma.
>
> From: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
> [...]
>
> +if enabled openmp_simd; then
> +    ompopt="-fopenmp"
> +    if ! test_cflags $ompopt ; then
> +        test_cflags -Xpreprocessor -fopenmp && ompopt="-Xpreprocessor -fopenmp"

Isn't it sufficient to specify -fopenmp-simd instead of -fopenmp for this patch? As OMP SIMD is the only OpenMP feature that is used, there's no need to link to the OpenMP lib.

softworkz
On 12 Jan 2021, at 19:52, Soft Works <softworkz@hotmail.com> wrote:
> [...]
> Isn't it sufficient to specify -fopenmp-simd instead of -fopenmp for this patch?

I think so, I just didn't know/even expect that option to exist! Thanks a lot for the tip!

> As OMP SIMD is the only OpenMP feature that is used, there's no need to link to the OpenMP lib.

That it doesn't do anyway, because -fopenmp is not in the linker flags, but I admit that was a bit of a hacky solution.

Thanks,
Reimar
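The FF_OMP_SIMD macro the patch adds to the DSP loops is not shown in this excerpt; based on the configure test in the patch (which compiles a `_Pragma("omp simd")` loop and checks `_OPENMP`), it presumably expands along these lines. The exact definition and guard are an assumption, not copied from the patch:

```c
#include <assert.h>

/* Sketch of the patch's hint macro (assumption: the real patch keys it
 * off the configure-detected openmp_simd feature). With -fopenmp-simd
 * or -fopenmp, _OPENMP is defined and the pragma requests vectorization
 * of the following loop; otherwise the macro expands to nothing and the
 * code compiles as plain C. */
#ifdef _OPENMP
#define FF_OMP_SIMD _Pragma("omp simd")
#else
#define FF_OMP_SIMD
#endif

/* Loop in the style of the patched DSP functions (and of the
 * configure test itself). */
static void scale_block(unsigned char *c, int n)
{
    FF_OMP_SIMD
    for (int i = 0; i < n; i++)
        c[i] *= 16;
}
```

With GCC's -fopenmp-simd, only the simd pragmas are honored and no OpenMP runtime library is linked, which is what the discussion above converges on.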
Jan 12, 2021, 19:28 by Reimar.Doeffinger@gmx.de:
> On 10 Jan 2021, at 19:55, Lynne <dev@lynne.ee> wrote:
> [...]
>
> It's almost 20%. At least for this combination of codec and stream a large amount of time is spent in non-DSP functions, so even hand-written assembler won't give you huge gains.

It's a non-guaranteed 20% on a single system. It could change, and it could very well mess up like gcc does with autovectorization, which we still explicitly disable because FATE fails (-fno-tree-vectorize; I was the one who sent an RFC to try to undo it somewhat recently. Even though it was an RFC the reaction from devs was quite cold).

> So you object to unlikely() macros as well? It's really just giving the compiler a hint it should try, though I admit the configure part makes it look otherwise.

I'm more against the macro and changes to the code itself. If you can make it work without adding a macro to individual loops or the likes of av_cold/av_hot or any other changes to the code, I'll be more welcoming. I really _hate_ compiler hints. Take a look at the upipe source code to see what a cthulian monstrosity made of hint flags looks like. Every single branch had a cold/hot macro and it was the project's coding style. It's completely irredeemable.

> How many hours of effort do you consider "trivial"? Especially if it's someone not experienced? It might be fairly trivial with intrinsics, however many of your counter-arguments also apply to intrinsics (and to a degree inline assembly).
> That's btw not just a rhetorical question, because I'm pretty sure I am not going to go to all the trouble of porting more of the arm 32-bit assembler functions since it's a huge PITA, and I was wondering if there was a point to even have a try with intrinsics...

Intrinsics and inline assembly are a whole different thing than magic macros that tell and force the compiler what a well written compiler should already very well know about.

> If you think of AArch64 specifically, I can kind of agree. However I wouldn't say the word "compromise" is appropriate when there's a good chance nothing better will ever come to exist. But the real point is not AArch64, that is just a very convenient test platform. The point is to raise the minimum bar. A new architecture, RISC-V for example or something else, should not be stuck at scalar performance until someone actually gets around to implementing assembler optimizations.
> And just to be clear: I don't actually care about HEVC, it just seemed a nice target to do some experiments.

I already said all that can be said here: this will halt efforts on actually optimizing the code in exchange for naive trust in compilers. New platforms will be stuck at scalar performance anyway until the compilers for the arch are smart enough to deal with vectorization.
On 12 Jan 2021, at 21:46, Lynne <dev@lynne.ee> wrote:
> Jan 12, 2021, 19:28 by Reimar.Doeffinger@gmx.de:
>> It's almost 20%. [...]
>
> It's a non-guaranteed 20% on a single system. It could change, and it could very well mess up like gcc does with autovectorization, which we still explicitly disable because FATE fails (-fno-tree-vectorize; I was the one who sent an RFC to try to undo it somewhat recently. Even though it was an RFC the reaction from devs was quite cold).

Oh, thanks for the reminder, I thought that was gone because it seems it's not used for clang, and MPlayer does not seem to set that. I need to compare it. However, the problem with auto-vectorization is exactly that the compiler will try to apply it to everything, which has at least two issues:
1) it gigantically increases the risk of bugs when it's every single loop instead of loops that we already wrote assembler for somewhere.
2) it will quite often make things worse, by vectorizing loops that are rarely iterated over more than a few times (and it needs to emit a whole lot of code to handle loop counts that are not a multiple of the vector size) - because all too often the compiler can only take a wild guess whether "width" is usually 1 or 1920, while we DO know.

> I'm more against the macro and changes to the code itself. If you can make it work without adding a macro to individual loops or the likes of av_cold/av_hot or any other changes to the code, I'll be more welcoming.

I expect that will just run into the same issue as the tree-vectorize...

> I really _hate_ compiler hints. Take a look at the upipe source code to see what a cthulian monstrosity made of hint flags looks like. Every single branch had a cold/hot macro and it was the project's coding style. It's completely irredeemable.

I guess my suggested solution would be to require proof of a clearly measurable performance benefit. But I see the point that if it gets "randomly" added to loops, that might turn out quite a mess.

> Intrinsics and inline assembly are a whole different thing than magic macros that tell and force the compiler what a well written compiler should already very well know about.

There are no well written compilers, in a way ;) I would also argue that most of what intrinsics do, such a compiler should figure out on its own, too. And the first time I tried intrinsics they slowed the loop down by a factor of 2, because the compiler stored and loaded the value to stack between every intrinsic, so it's not like they are without problems.
But I was actually thinking that it might be somewhat interesting to have a kind of "generic SIMD intrinsics". Though I think I read that such a thing has already been tried, so it might just be wasted time.

> I already said all that can be said here: this will halt efforts on actually optimizing the code in exchange for naive trust in compilers.

I'm not sure it will discourage it more than having to write the optimizations over and over: for Armv7 NEON, for Armv8 Linux, for Armv8 Windows, then SVE/SVE2, and who knows, maybe Armv9 will also need a rewrite. SSE2, AVX256, AVX512 for x86; so much stuff never gets ported to the new versions. I'd also claim anyone naively trusting in compilers is unlikely to write SIMD optimizations either way :)

> New platforms will be stuck at scalar performance anyway until the compilers for the arch are smart enough to deal with vectorization.

That seems to happen a long time before someone gets around to optimizing FFmpeg though. This is particularly true when it's a new OS rather than a new CPU architecture. For example on macOS we are lucky enough that the assembler etc. are largely compatible with Linux. But for Windows-on-Arm there is no GNU assembler, and the Microsoft assembler needs a completely different syntax, so even the assembly we DO have just doesn't work.
Anyway, thanks for the discussion. I still think the situation with SIMD optimizations should be improved SOMEHOW, but I have nothing but wild ideas on the HOW. If anyone feels the same, I'd welcome further discussion.

Thanks,
Reimar
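One existing approach in the "generic SIMD intrinsics" direction mentioned above is the GCC/Clang vector extension, which gives portable fixed-width vector types and element-wise arithmetic without per-ISA intrinsics; a minimal sketch (not FFmpeg code, and only one possible shape of such an abstraction):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) || defined(__clang__)
/* GCC/Clang vector extension: a 16-byte vector of eight int16_t.
 * Arithmetic on the type compiles to SIMD where the target has it,
 * and to scalar code elsewhere - one source for every ISA. */
typedef int16_t v8i16 __attribute__((vector_size(16)));

/* Element-wise add of two 8-element int16 blocks, no intrinsics. */
static void add8(int16_t *dst, const int16_t *a, const int16_t *b)
{
    v8i16 va, vb, vr;
    /* memcpy keeps the loads/stores alignment-safe */
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    vr = va + vb;
    memcpy(dst, &vr, sizeof vr);
}
#endif
```

The catch, as the thread notes, is that saturating operations and clips still need either compiler pattern-matching or target-specific builtins, which is where such generic layers have historically fallen short.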
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Lynne
> Sent: Tuesday, January 12, 2021 9:47 PM
> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH] Add support for "omp simd" pragma.
>
> It's a non-guaranteed 20% on a single system. It could change, and it could very well mess up like gcc does with autovectorization, which we still explicitly disable because FATE fails (-fno-tree-vectorize; I was the one who sent an RFC to try to undo it somewhat recently. Even though it was an RFC the reaction from devs was quite cold).

I wonder whether there's a way to enable autovectorization only for specific loops? But that would probably be compiler-specific.

> I'm more against the macro and changes to the code itself. If you can make it work without adding a macro to individual loops or the likes of av_cold/av_hot or any other changes to the code, I'll be more welcoming.
> I really _hate_ compiler hints. Take a look at the upipe source code to see what a cthulian monstrosity made of hint flags looks like. Every single branch had a cold/hot macro and it was the project's coding style. It's completely irredeemable.

OpenMP is a standard at least, which is supported by many compilers, and #pragma omp simd is not really a "monstrosity".

> Most of the loops this is added to are trivially SIMDable.

Could you provide some examples? What constructs would you suggest that can be applied trivially? And that would be compiled as SIMD even though -fno-tree-vectorize is set?

Thanks,
softworkz
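The question of enabling autovectorization only for specific loops has at least a compiler-specific partial answer: GCC accepts an `optimize` function attribute that can re-enable tree vectorization for a single function even when -fno-tree-vectorize is set globally. This is GCC-only (Clang ignores the attribute with a warning), and the function below is illustrative, not from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* GCC-specific: override a global -fno-tree-vectorize for just this
 * function, so only loops the developer vouches for are considered for
 * autovectorization. Not a portable mechanism - other compilers simply
 * compile the plain loop. */
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("tree-vectorize")))
#endif
static void scale_row(uint8_t *dst, const uint8_t *src, int w)
{
    for (int x = 0; x < w; x++)
        dst[x] = src[x] >> 1;
}
```

Semantics are unchanged either way; only the per-function optimization settings differ, which sidesteps the "pragma on every loop" objection at function granularity rather than loop granularity.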
Hi,

On Tue, 12 Jan 2021, Reimar Döffinger wrote:

> I'm not sure it will discourage it more than having to write the optimizations over and over: for Armv7 NEON, for Armv8 Linux, for Armv8 Windows, then SVE/SVE2, and who knows, maybe Armv9 will also need a rewrite.

NEON code for armv8 windows and armv8 linux all use the exact same source, no need to write it twice.

> For example macOS we are lucky enough that the assembler etc. are largely compatible with Linux.

I'm not sure I'd say it's luck, it's pretty much by design there. Historically, macOS build tools used an ancient fork of GAS, with very limited macroing capabilities. To remedy this, the gas-preprocessor tool was invented, for expanding modern gas macros, producing just a straight-up feed of instructions, passed on to the native platform tools.

In modern times, the build tools are based on Clang/LLVM, and they support essentially all modern gas macro features (including altmacro, which was added in Clang 5). There are many parties that have an interest in this feature, e.g. support for building the Linux kernel with Clang.

Due to backwards compatibility with the old GAS fork's macroing capability (I think), there are some very vague differences between LLVM's macro support for other platforms and darwin, e.g. on other platforms, it's ok to invoke a macro as either "mymacro param1, param2" or "mymacro param1 param2" (without commas between the arguments). On darwin targets, only the former works as intended.

All other platform differences are abstracted away with our macros in libavutil/aarch64/asm.S, see e.g. http://git.videolan.org/?p=ffmpeg.git;a=blob;f=libavutil/aarch64/asm.S;h=d1fa72b3c65a4a58e76029e94b998d935649aa90;hb=ca21cb1e36ccae2ee71d4299d477fa9284c1f551#l85. For darwin platforms, the movrel macro expands to "add rX, rX, symbol@PAGEOFF" while it expands to "add rX, rX, :lo12:symbol" on other platforms (ELF and COFF).

All the source files just use the high-level macros function, endfunc, movrel, etc., which handle the few platform-specific details that differ. If you write code with just one tool, it's of course certainly possible to accidentally use some corner-case detail that another tool objects to, but that's why one needs testing on multiple platforms, via a CI system or FATE or whatever. Just like you regularly need to test C code on various platforms, even if you'd expect it to work if it's properly written.

> But for Windows-on-Arm there is no GNU assembler, and the Microsoft assembler needs a completely different syntax, so even the assembly we DO have just doesn't work.

This is not true at all. GCC and binutils don't support windows on arm/arm64 at all, that far is true. But Clang/LLVM do (with https://github.com/mstorsjo/llvm-mingw you have an easily available packaged cross compiler and all), and they support the GAS-syntax asm just fine.

If building with MSVC tools, yes, you're right that armasm.exe/armasm64.exe takes a different syntax. But the gas-preprocessor tool (which is picked up automatically by our configure, one just needs to make sure it's available) handles expanding all the macros and rewriting directives into the armasm form, and feeding it to the armasm tools. Works fine and has done so for years. There's even a wiki page which tries to explain how to do it (although it's probably outdated in some aspects), see https://trac.ffmpeg.org/wiki/CompilationGuide/WinRT.

We even have regular fate tests of these configurations, see e.g. these:
http://fate.ffmpeg.org/report.cgi?slot=aarch64-mingw32-clang-trunk&time=20210113064430
http://fate.ffmpeg.org/report.cgi?time=20210109152105&slot=arm64-msvc2019
http://fate.ffmpeg.org/report.cgi?slot=armv7-mingw32-clang-trunk&time=20210113055653
http://fate.ffmpeg.org/report.cgi?time=20210109163844&slot=arm-msvc2019-phone

All of these run with full assembly optimizations enabled. So please don't tell me that our assembly doesn't work on windows on arm, because it does, and it has for years.

// Martin
> If building with MSVC tools, yes, you're right that armasm.exe/armasm64.exe takes a different syntax. But the gas-preprocessor tool (which is picked up automatically by our configure, one just needs to make sure it's available) handles expanding all the macros and rewriting directives into the armasm form, and feeding it to the armasm tools. Works fine and has done so for years. There's even a wiki page which tries to explain how to do it (although it's probably outdated in some aspects), see https://trac.ffmpeg.org/wiki/CompilationGuide/WinRT.

I went with the instructions in doc/platform.texi and that did not work at all. It even tried to use cl.exe to compile the assembler files!

> All of these run with full assembly optimizations enabled. So please don't tell me that our assembly doesn't work on windows on arm, because it does, and it has for years.

My apologies. I'll correct to: it doesn't work using the instructions shipping with the source code (as far as I can tell) :)
On Wed, 13 Jan 2021, Reimar Döffinger wrote:

>> [...] There's even a wiki page which tries to explain how to do it (although it's probably outdated in some aspects), see https://trac.ffmpeg.org/wiki/CompilationGuide/WinRT.
>
> I went with the instructions in doc/platform.texi and that did not work at all. It even tried to use cl.exe to compile the assembler files!

What did you end up trying/doing in this case? That sounds rather broken to me. The main issue just is having gas-preprocessor available (but since you need a posix make, like from msys2, it should be pretty easy to have a perl installation for gas-preprocessor there) - but if it isn't found, configure really should be erroring out and not silently using cl as assembler...

My own setups for fate are a bit special as they're cross compiled from linux (with msvc wrapped in wine), but it should essentially just be "./configure --arch=arm64 --target-os=win32 --toolchain=msvc --enable-cross-compile", assuming you have MSVC targeting arm64 in $PATH.

// Martin
diff --git a/configure b/configure
index 900505756b..73b7c3daeb 100755
--- a/configure
+++ b/configure
@@ -406,6 +406,7 @@ Toolchain options:
   --enable-pic             build position-independent code
   --enable-thumb           compile for Thumb instruction set
   --enable-lto             use link-time optimization
+  --enable-openmp-simd     use the "omp simd" pragma to optimize code
   --env="ENV=override"     override the environment variables

 Advanced options (experts only):
@@ -2335,6 +2336,7 @@ HAVE_LIST="
     opencl_dxva2
     opencl_vaapi_beignet
     opencl_vaapi_intel_media
+    openmp_simd
     perl
     pod2man
     texi2html
@@ -2446,6 +2448,7 @@ CMDLINE_SELECT="
     extra_warnings
     logging
     lto
+    openmp_simd
     optimizations
     rpath
     stripping
@@ -6926,6 +6929,26 @@ if enabled lto; then
     disable inline_asm_direct_symbol_refs
 fi

+if enabled openmp_simd; then
+    ompopt="-fopenmp"
+    if ! test_cflags $ompopt ; then
+        test_cflags -Xpreprocessor -fopenmp && ompopt="-Xpreprocessor -fopenmp"
+    fi
+    test_cc $ompopt <<EOF && add_cflags "$ompopt" || die "failed to enable openmp SIMD"
+#ifndef _OPENMP
+#error _OPENMP is not defined
+#endif
+void test(unsigned char *c)
+{
+    _Pragma("omp simd")
+    for (int i = 0; i < 256; i++)
+    {
+        c[i] *= 16;
+    }
+}
+EOF
+fi
+
 enabled ftrapv && check_cflags -ftrapv

 test_cc -mno-red-zone <<EOF && noredzone_flags="-mno-red-zone"
diff --git a/libavcodec/hevcdsp_template.c b/libavcodec/hevcdsp_template.c
index 56cd9e605d..1a8b4160ec 100644
--- a/libavcodec/hevcdsp_template.c
+++ b/libavcodec/hevcdsp_template.c
@@ -50,6 +50,7 @@ static av_always_inline void FUNC(add_residual)(uint8_t *_dst, int16_t *res,
     stride /= sizeof(pixel);

     for (y = 0; y < size; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < size; x++) {
             dst[x] = av_clip_pixel(dst[x] + *res);
             res++;
@@ -247,6 +248,7 @@ static void FUNC(idct_ ## H ## x ## H )(int16_t *coeffs,                \
     int16_t *src = coeffs;                                              \
     IDCT_VAR ## H(H);                                                   \
                                                                         \
+    FF_OMP_SIMD                                                         \
     for (i = 0; i < H; i++) {                                           \
         TR_ ## H(src, src, H, H, SCALE, limit2);                        \
         if (limit2 < H && i%4 == 0 && !!i)                              \
@@ -256,6 +258,7 @@ static void FUNC(idct_ ## H ## x ## H )(int16_t *coeffs,                \
                                                                         \
     shift = 20 - BIT_DEPTH;                                             \
     add   = 1 << (shift - 1);                                           \
+    FF_OMP_SIMD                                                         \
     for (i = 0; i < H; i++) {                                           \
         TR_ ## H(coeffs, coeffs, 1, 1, SCALE, limit);                   \
         coeffs += H;                                                    \
@@ -502,6 +505,7 @@ static void FUNC(put_hevc_pel_pixels)(int16_t *dst,
     ptrdiff_t srcstride  = _srcstride / sizeof(pixel);

     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = src[x] << (14 - BIT_DEPTH);
         src += srcstride;
@@ -543,6 +547,7 @@ static void FUNC(put_hevc_pel_bi_pixels)(uint8_t *_dst, ptrdiff_t _dststride, ui
 #endif

     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((src[x] << (14 - BIT_DEPTH)) + src2[x] + offset) >> shift);
         src  += srcstride;
@@ -568,6 +573,7 @@ static void FUNC(put_hevc_pel_uni_w_pixels)(uint8_t *_dst, ptrdiff_t _dststride,
     ox     = ox * (1 << (BIT_DEPTH - 8));

     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel((((src[x] << (14 - BIT_DEPTH)) * wx + offset) >> shift) + ox);
         src += srcstride;
@@ -592,6 +598,7 @@ static void FUNC(put_hevc_pel_bi_w_pixels)(uint8_t *_dst, ptrdiff_t _dststride,
     ox0     = ox0 * (1 << (BIT_DEPTH - 8));
     ox1     = ox1 * (1 << (BIT_DEPTH - 8));
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++) {
             dst[x] = av_clip_pixel(( (src[x] << (14 - BIT_DEPTH)) * wx1 + src2[x] * wx0 + (ox0 + ox1 + 1) * (1 << log2Wd)) >> (log2Wd + 1));
         }
@@ -623,6 +630,7 @@ static void FUNC(put_hevc_qpel_h)(int16_t *dst,
     ptrdiff_t     srcstride = _srcstride / sizeof(pixel);
     const int8_t *filter    = ff_hevc_qpel_filters[mx - 1];
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
         src += srcstride;
@@ -639,6 +647,7 @@ static void FUNC(put_hevc_qpel_v)(int16_t *dst,
     ptrdiff_t     srcstride = _srcstride / sizeof(pixel);
     const int8_t *filter    = ff_hevc_qpel_filters[my - 1];
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = QPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8);
         src += srcstride;
@@ -662,6 +671,7 @@ static void FUNC(put_hevc_qpel_hv)(int16_t *dst,
     src   -= QPEL_EXTRA_BEFORE * srcstride;
     filter = ff_hevc_qpel_filters[mx - 1];
     for (y = 0; y < height + QPEL_EXTRA; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             tmp[x] = QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
         src += srcstride;
@@ -671,6 +681,7 @@ static void FUNC(put_hevc_qpel_hv)(int16_t *dst,
     tmp    = tmp_array + QPEL_EXTRA_BEFORE * MAX_PB_SIZE;
     filter = ff_hevc_qpel_filters[my - 1];
     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = QPEL_FILTER(tmp, MAX_PB_SIZE) >> 6;
         tmp += MAX_PB_SIZE;
@@ -697,6 +708,7 @@ static void FUNC(put_hevc_qpel_uni_h)(uint8_t *_dst, ptrdiff_t _dststride,
 #endif

     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8)) + offset) >> shift);
         src += srcstride;
@@ -724,6 +736,7 @@ static void FUNC(put_hevc_qpel_bi_h)(uint8_t *_dst, ptrdiff_t _dststride, uint8_
 #endif

     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8)) + src2[x] + offset) >> shift);
         src  += srcstride;
@@ -751,6 +764,7 @@ static void FUNC(put_hevc_qpel_uni_v)(uint8_t *_dst, ptrdiff_t _dststride,
 #endif

     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((QPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8)) + offset) >> shift);
         src += srcstride;
@@ -779,6 +793,7 @@ static void FUNC(put_hevc_qpel_bi_v)(uint8_t *_dst, ptrdiff_t _dststride, uint8_
 #endif

     for (y = 0; y < height; y++) {
+        FF_OMP_SIMD
         for (x = 0; x < width; x++)
             dst[x] = av_clip_pixel(((QPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8)) + src2[x] + offset) >> shift);
         src  += srcstride;
@@ -810,6 +825,7 @@ static void FUNC(put_hevc_qpel_uni_hv)(uint8_t *_dst, ptrdiff_t _dststride,
     src   -= QPEL_EXTRA_BEFORE * srcstride;
     filter = ff_hevc_qpel_filters[mx - 1];
for (y = 0; y < height + QPEL_EXTRA; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) tmp[x] = QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8); src += srcstride; @@ -820,6 +836,7 @@ static void FUNC(put_hevc_qpel_uni_hv)(uint8_t *_dst, ptrdiff_t _dststride, filter = ff_hevc_qpel_filters[my - 1]; for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) dst[x] = av_clip_pixel(((QPEL_FILTER(tmp, MAX_PB_SIZE) >> 6) + offset) >> shift); tmp += MAX_PB_SIZE; @@ -849,6 +866,7 @@ static void FUNC(put_hevc_qpel_bi_hv)(uint8_t *_dst, ptrdiff_t _dststride, uint8 src -= QPEL_EXTRA_BEFORE * srcstride; filter = ff_hevc_qpel_filters[mx - 1]; for (y = 0; y < height + QPEL_EXTRA; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) tmp[x] = QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8); src += srcstride; @@ -859,6 +877,7 @@ static void FUNC(put_hevc_qpel_bi_hv)(uint8_t *_dst, ptrdiff_t _dststride, uint8 filter = ff_hevc_qpel_filters[my - 1]; for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) dst[x] = av_clip_pixel(((QPEL_FILTER(tmp, MAX_PB_SIZE) >> 6) + src2[x] + offset) >> shift); tmp += MAX_PB_SIZE; @@ -887,6 +906,7 @@ static void FUNC(put_hevc_qpel_uni_w_h)(uint8_t *_dst, ptrdiff_t _dststride, ox = ox * (1 << (BIT_DEPTH - 8)); for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) dst[x] = av_clip_pixel((((QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8)) * wx + offset) >> shift) + ox); src += srcstride; @@ -913,6 +933,7 @@ static void FUNC(put_hevc_qpel_bi_w_h)(uint8_t *_dst, ptrdiff_t _dststride, uint ox0 = ox0 * (1 << (BIT_DEPTH - 8)); ox1 = ox1 * (1 << (BIT_DEPTH - 8)); for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) dst[x] = av_clip_pixel(((QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8)) * wx1 + src2[x] * wx0 + ((ox0 + ox1 + 1) * (1 << log2Wd))) >> (log2Wd + 1)); @@ -942,6 +963,7 @@ static void FUNC(put_hevc_qpel_uni_w_v)(uint8_t *_dst, ptrdiff_t _dststride, ox = ox * (1 << (BIT_DEPTH - 8)); for (y = 0; y < height; y++) { + 
FF_OMP_SIMD for (x = 0; x < width; x++) dst[x] = av_clip_pixel((((QPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8)) * wx + offset) >> shift) + ox); src += srcstride; @@ -968,6 +990,7 @@ static void FUNC(put_hevc_qpel_bi_w_v)(uint8_t *_dst, ptrdiff_t _dststride, uint ox0 = ox0 * (1 << (BIT_DEPTH - 8)); ox1 = ox1 * (1 << (BIT_DEPTH - 8)); for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) dst[x] = av_clip_pixel(((QPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8)) * wx1 + src2[x] * wx0 + ((ox0 + ox1 + 1) * (1 << log2Wd))) >> (log2Wd + 1)); @@ -1000,6 +1023,7 @@ static void FUNC(put_hevc_qpel_uni_w_hv)(uint8_t *_dst, ptrdiff_t _dststride, src -= QPEL_EXTRA_BEFORE * srcstride; filter = ff_hevc_qpel_filters[mx - 1]; for (y = 0; y < height + QPEL_EXTRA; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) tmp[x] = QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8); src += srcstride; @@ -1011,6 +1035,7 @@ static void FUNC(put_hevc_qpel_uni_w_hv)(uint8_t *_dst, ptrdiff_t _dststride, ox = ox * (1 << (BIT_DEPTH - 8)); for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) dst[x] = av_clip_pixel((((QPEL_FILTER(tmp, MAX_PB_SIZE) >> 6) * wx + offset) >> shift) + ox); tmp += MAX_PB_SIZE; @@ -1037,6 +1062,7 @@ static void FUNC(put_hevc_qpel_bi_w_hv)(uint8_t *_dst, ptrdiff_t _dststride, uin src -= QPEL_EXTRA_BEFORE * srcstride; filter = ff_hevc_qpel_filters[mx - 1]; for (y = 0; y < height + QPEL_EXTRA; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) tmp[x] = QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8); src += srcstride; @@ -1049,6 +1075,7 @@ static void FUNC(put_hevc_qpel_bi_w_hv)(uint8_t *_dst, ptrdiff_t _dststride, uin ox0 = ox0 * (1 << (BIT_DEPTH - 8)); ox1 = ox1 * (1 << (BIT_DEPTH - 8)); for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) dst[x] = av_clip_pixel(((QPEL_FILTER(tmp, MAX_PB_SIZE) >> 6) * wx1 + src2[x] * wx0 + ((ox0 + ox1 + 1) * (1 << log2Wd))) >> (log2Wd + 1)); @@ -1076,6 +1103,7 @@ static void FUNC(put_hevc_epel_h)(int16_t *dst, 
ptrdiff_t srcstride = _srcstride / sizeof(pixel); const int8_t *filter = ff_hevc_epel_filters[mx - 1]; for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) dst[x] = EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8); src += srcstride; @@ -1093,6 +1121,7 @@ static void FUNC(put_hevc_epel_v)(int16_t *dst, const int8_t *filter = ff_hevc_epel_filters[my - 1]; for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) dst[x] = EPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8); src += srcstride; @@ -1114,6 +1143,7 @@ static void FUNC(put_hevc_epel_hv)(int16_t *dst, src -= EPEL_EXTRA_BEFORE * srcstride; for (y = 0; y < height + EPEL_EXTRA; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) tmp[x] = EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8); src += srcstride; @@ -1124,6 +1154,7 @@ static void FUNC(put_hevc_epel_hv)(int16_t *dst, filter = ff_hevc_epel_filters[my - 1]; for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) dst[x] = EPEL_FILTER(tmp, MAX_PB_SIZE) >> 6; tmp += MAX_PB_SIZE; @@ -1148,6 +1179,7 @@ static void FUNC(put_hevc_epel_uni_h)(uint8_t *_dst, ptrdiff_t _dststride, uint8 #endif for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) dst[x] = av_clip_pixel(((EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8)) + offset) >> shift); src += srcstride; @@ -1173,6 +1205,7 @@ static void FUNC(put_hevc_epel_bi_h)(uint8_t *_dst, ptrdiff_t _dststride, uint8_ #endif for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) { dst[x] = av_clip_pixel(((EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8)) + src2[x] + offset) >> shift); } @@ -1199,6 +1232,7 @@ static void FUNC(put_hevc_epel_uni_v)(uint8_t *_dst, ptrdiff_t _dststride, uint8 #endif for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) dst[x] = av_clip_pixel(((EPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8)) + offset) >> shift); src += srcstride; @@ -1224,6 +1258,7 @@ static void FUNC(put_hevc_epel_bi_v)(uint8_t *_dst, ptrdiff_t _dststride, uint8_ #endif 
for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) dst[x] = av_clip_pixel(((EPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8)) + src2[x] + offset) >> shift); dst += dststride; @@ -1253,6 +1288,7 @@ static void FUNC(put_hevc_epel_uni_hv)(uint8_t *_dst, ptrdiff_t _dststride, uint src -= EPEL_EXTRA_BEFORE * srcstride; for (y = 0; y < height + EPEL_EXTRA; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) tmp[x] = EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8); src += srcstride; @@ -1263,6 +1299,7 @@ static void FUNC(put_hevc_epel_uni_hv)(uint8_t *_dst, ptrdiff_t _dststride, uint filter = ff_hevc_epel_filters[my - 1]; for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) dst[x] = av_clip_pixel(((EPEL_FILTER(tmp, MAX_PB_SIZE) >> 6) + offset) >> shift); tmp += MAX_PB_SIZE; @@ -1292,6 +1329,7 @@ static void FUNC(put_hevc_epel_bi_hv)(uint8_t *_dst, ptrdiff_t _dststride, uint8 src -= EPEL_EXTRA_BEFORE * srcstride; for (y = 0; y < height + EPEL_EXTRA; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) tmp[x] = EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8); src += srcstride; @@ -1302,6 +1340,7 @@ static void FUNC(put_hevc_epel_bi_hv)(uint8_t *_dst, ptrdiff_t _dststride, uint8 filter = ff_hevc_epel_filters[my - 1]; for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) dst[x] = av_clip_pixel(((EPEL_FILTER(tmp, MAX_PB_SIZE) >> 6) + src2[x] + offset) >> shift); tmp += MAX_PB_SIZE; @@ -1328,6 +1367,7 @@ static void FUNC(put_hevc_epel_uni_w_h)(uint8_t *_dst, ptrdiff_t _dststride, uin ox = ox * (1 << (BIT_DEPTH - 8)); for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) { dst[x] = av_clip_pixel((((EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8)) * wx + offset) >> shift) + ox); } @@ -1353,6 +1393,7 @@ static void FUNC(put_hevc_epel_bi_w_h)(uint8_t *_dst, ptrdiff_t _dststride, uint ox0 = ox0 * (1 << (BIT_DEPTH - 8)); ox1 = ox1 * (1 << (BIT_DEPTH - 8)); for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) dst[x] = 
av_clip_pixel(((EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8)) * wx1 + src2[x] * wx0 + ((ox0 + ox1 + 1) * (1 << log2Wd))) >> (log2Wd + 1)); @@ -1380,6 +1421,7 @@ static void FUNC(put_hevc_epel_uni_w_v)(uint8_t *_dst, ptrdiff_t _dststride, uin ox = ox * (1 << (BIT_DEPTH - 8)); for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) { dst[x] = av_clip_pixel((((EPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8)) * wx + offset) >> shift) + ox); } @@ -1405,6 +1447,7 @@ static void FUNC(put_hevc_epel_bi_w_v)(uint8_t *_dst, ptrdiff_t _dststride, uint ox0 = ox0 * (1 << (BIT_DEPTH - 8)); ox1 = ox1 * (1 << (BIT_DEPTH - 8)); for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) dst[x] = av_clip_pixel(((EPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8)) * wx1 + src2[x] * wx0 + ((ox0 + ox1 + 1) * (1 << log2Wd))) >> (log2Wd + 1)); @@ -1435,6 +1478,7 @@ static void FUNC(put_hevc_epel_uni_w_hv)(uint8_t *_dst, ptrdiff_t _dststride, ui src -= EPEL_EXTRA_BEFORE * srcstride; for (y = 0; y < height + EPEL_EXTRA; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) tmp[x] = EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8); src += srcstride; @@ -1446,6 +1490,7 @@ static void FUNC(put_hevc_epel_uni_w_hv)(uint8_t *_dst, ptrdiff_t _dststride, ui ox = ox * (1 << (BIT_DEPTH - 8)); for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) dst[x] = av_clip_pixel((((EPEL_FILTER(tmp, MAX_PB_SIZE) >> 6) * wx + offset) >> shift) + ox); tmp += MAX_PB_SIZE; @@ -1472,6 +1517,7 @@ static void FUNC(put_hevc_epel_bi_w_hv)(uint8_t *_dst, ptrdiff_t _dststride, uin src -= EPEL_EXTRA_BEFORE * srcstride; for (y = 0; y < height + EPEL_EXTRA; y++) { + FF_OMP_SIMD for (x = 0; x < width; x++) tmp[x] = EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8); src += srcstride; @@ -1484,6 +1530,7 @@ static void FUNC(put_hevc_epel_bi_w_hv)(uint8_t *_dst, ptrdiff_t _dststride, uin ox0 = ox0 * (1 << (BIT_DEPTH - 8)); ox1 = ox1 * (1 << (BIT_DEPTH - 8)); for (y = 0; y < height; y++) { + FF_OMP_SIMD for (x = 0; x < 
width; x++) dst[x] = av_clip_pixel(((EPEL_FILTER(tmp, MAX_PB_SIZE) >> 6) * wx1 + src2[x] * wx0 + ((ox0 + ox1 + 1) * (1 << log2Wd))) >> (log2Wd + 1)); diff --git a/libavutil/internal.h b/libavutil/internal.h index 93ea57c324..b0543bbf02 100644 --- a/libavutil/internal.h +++ b/libavutil/internal.h @@ -299,4 +299,10 @@ int avpriv_dict_set_timestamp(AVDictionary **dict, const char *key, int64_t time #define FF_PSEUDOPAL 0 #endif +#if HAVE_OPENMP_SIMD +#define FF_OMP_SIMD _Pragma("omp simd") +#else +#define FF_OMP_SIMD +#endif + #endif /* AVUTIL_INTERNAL_H */ diff --git a/libswscale/swscale_unscaled.c b/libswscale/swscale_unscaled.c index c4dd8a4d83..c112a61037 100644 --- a/libswscale/swscale_unscaled.c +++ b/libswscale/swscale_unscaled.c @@ -1743,6 +1743,7 @@ static int packedCopyWrapper(SwsContext *c, const uint8_t *src[], unsigned shift= src_depth-dst_depth, tmp;\ if (c->dither == SWS_DITHER_NONE) {\ for (i = 0; i < height; i++) {\ + FF_OMP_SIMD \ for (j = 0; j < length-7; j+=8) {\ dst[j+0] = dbswap(bswap(src[j+0])>>shift);\ dst[j+1] = dbswap(bswap(src[j+1])>>shift);\ @@ -1762,6 +1763,7 @@ static int packedCopyWrapper(SwsContext *c, const uint8_t *src[], } else if (shiftonly) {\ for (i = 0; i < height; i++) {\ const uint8_t *dither= dithers[shift-1][i&7];\ + FF_OMP_SIMD \ for (j = 0; j < length-7; j+=8) {\ tmp = (bswap(src[j+0]) + dither[0])>>shift; dst[j+0] = dbswap(tmp - (tmp>>dst_depth));\ tmp = (bswap(src[j+1]) + dither[1])>>shift; dst[j+1] = dbswap(tmp - (tmp>>dst_depth));\ @@ -1781,6 +1783,7 @@ static int packedCopyWrapper(SwsContext *c, const uint8_t *src[], } else {\ for (i = 0; i < height; i++) {\ const uint8_t *dither= dithers[shift-1][i&7];\ + FF_OMP_SIMD \ for (j = 0; j < length-7; j+=8) {\ tmp = bswap(src[j+0]); dst[j+0] = dbswap((tmp - (tmp>>dst_depth) + dither[0])>>shift);\ tmp = bswap(src[j+1]); dst[j+1] = dbswap((tmp - (tmp>>dst_depth) + dither[1])>>shift);\