Message ID | DB6PR0101MB22142E76F282AD72079ABFB48F6E9@DB6PR0101MB2214.eurprd01.prod.exchangelabs.com |
---|---|
State | Accepted |
Commit | de33506e4b3e3362095aab167ad8bb87c1bd9488 |
Series | [FFmpeg-devel] swscale/x86/rgb2_rgb: Empty MMX state in ff_shuffle_bytes_2103_mmxext |
Context | Check | Description |
---|---|---|
yinshiyou/make_loongarch64 | success | Make finished |
yinshiyou/make_fate_loongarch64 | success | Make fate finished |
andriy/make_x86 | success | Make finished |
andriy/make_fate_x86 | success | Make fate finished |
Andreas Rheinhardt:
> Fixes FATE-failures with the filter-2xbr filter-3xbr filter-4xbr
> filter-ep2x filter-ep3x filter-hq2x filter-hq3x filter-hq4x
> filter-paletteuse-bayer filter-paletteuse-bayer0
> filter-paletteuse-nodither and filter-paletteuse-sierra2_4a tests
> when using 32bit x86 with CPUFLAGS ranging from "mmx+mmxext" to
> "mmx+mmxext+sse+sse2+sse3" (the relevant function is only overwritten
> when using SSSE3).
>
> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
> ---
>  libswscale/x86/rgb_2_rgb.asm | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/libswscale/x86/rgb_2_rgb.asm b/libswscale/x86/rgb_2_rgb.asm
> index c695c61d5c..76ca1eec03 100644
> --- a/libswscale/x86/rgb_2_rgb.asm
> +++ b/libswscale/x86/rgb_2_rgb.asm
> @@ -104,6 +104,7 @@ jge .end
>      jl .loop_simd
>
>  .end:
> +    emms
>      RET
>
>  ;------------------------------------------------------------------------------

I'd really love it if someone with x86 assembly skills could look over this
trivial patch and confirm whether it is indeed correct. All I currently
know is that it works for me.

- Andreas
On Mon, Aug 22, 2022 at 11:59:17PM +0200, Andreas Rheinhardt wrote:
> Andreas Rheinhardt:
> > [...]
> >  .end:
> > +    emms
> >      RET
> > [...]
>
> I'd really love it if someone with x86 assembly skills could look over this
> trivial patch and confirm whether it is indeed correct. All I currently
> know is that it works for me.

emms needs to be called between MMX and float code, as far outside of loops
as possible. That would suggest outside the for() loops in rgbToRgbWrapper()
and any other code using it. That's what we did and what is most efficient.

One can make an argument that emms must be called before returning to C code
when it's needed. That, though, would also imply that all uses of emms_c()
are wrong.

The above assumes I am not missing something.

thx

[...]
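The placement described in this reply can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not FFmpeg's code: `shuffle_2103_row()`, `convert_image()` and `convert_demo_ok()` are hypothetical names, the row kernel is written with the standard `<mmintrin.h>` intrinsics instead of the real assembly, and a scalar fallback is used when the compiler does not define `__MMX__`. The kernel leaves the MMX state untouched on return, and the caller issues a single `_mm_empty()` (which emits `emms`) after the loop over rows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__MMX__)
#include <mmintrin.h>
#endif

/* Hypothetical stand-in for an MMX row kernel: swaps bytes 0 and 2 of every
 * 4-byte pixel (2103 order). It deliberately does NOT issue emms itself. */
static void shuffle_2103_row(const uint8_t *src, uint8_t *dst, size_t pixels)
{
    size_t i = 0;
#if defined(__MMX__)
    const __m64 keep = _mm_set1_pi32((int)0xff00ff00u); /* bytes 1 and 3 stay */
    const __m64 low  = _mm_set1_pi32(0x000000ff);       /* selects byte 0 */
    for (; i + 2 <= pixels; i += 2) {                   /* two pixels per step */
        __m64 v, r;
        memcpy(&v, src + 4 * i, 8);
        r = _mm_and_si64(v, keep);
        r = _mm_or_si64(r, _mm_slli_pi32(_mm_and_si64(v, low), 16)); /* byte 0 -> 2 */
        r = _mm_or_si64(r, _mm_and_si64(_mm_srli_pi32(v, 16), low)); /* byte 2 -> 0 */
        memcpy(dst + 4 * i, &r, 8);
    }
#endif
    for (; i < pixels; i++) {                           /* scalar tail / fallback */
        dst[4 * i + 0] = src[4 * i + 2];
        dst[4 * i + 1] = src[4 * i + 1];
        dst[4 * i + 2] = src[4 * i + 0];
        dst[4 * i + 3] = src[4 * i + 3];
    }
}

/* Hypothetical wrapper: the MMX state is emptied once, outside the row loop,
 * before control returns to code that might use the x87 FPU. */
static void convert_image(const uint8_t *src, uint8_t *dst,
                          size_t width, size_t height, size_t stride)
{
    assert(stride >= 4 * width);
    for (size_t y = 0; y < height; y++)
        shuffle_2103_row(src + y * stride, dst + y * stride, width);
#if defined(__MMX__)
    _mm_empty(); /* one emms for the whole image */
#endif
}

/* Returns 1 if a small test image was converted as expected. */
static int convert_demo_ok(void)
{
    enum { W = 3, H = 2, STRIDE = 4 * W };
    uint8_t src[H * STRIDE], dst[H * STRIDE];

    for (size_t i = 0; i < sizeof(src); i++)
        src[i] = (uint8_t)i;
    memset(dst, 0, sizeof(dst));

    convert_image(src, dst, W, H, STRIDE);

    for (size_t p = 0; p < (size_t)W * H; p++) {
        if (dst[4 * p + 0] != src[4 * p + 2] || dst[4 * p + 1] != src[4 * p + 1] ||
            dst[4 * p + 2] != src[4 * p + 0] || dst[4 * p + 3] != src[4 * p + 3])
            return 0;
    }
    return 1;
}
```

The patch under discussion takes the other route: it puts `emms` at the end of the kernel itself, so the instruction runs once per call rather than once per image.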
Michael Niedermayer:
> On Mon, Aug 22, 2022 at 11:59:17PM +0200, Andreas Rheinhardt wrote:
>> [...]
>> I'd really love it if someone with x86 assembly skills could look over this
>> trivial patch and confirm whether it is indeed correct. All I currently
>> know is that it works for me.
>
> emms needs to be called between MMX and float code, as far outside of loops
> as possible. That would suggest outside the for() loops in rgbToRgbWrapper()
> and any other code using it.

But there is another aspect that the above is missing: if emms_c() is put
outside of MMX functions, then it will be called even when it is
unnecessary. In this case it is unnecessary for all modern CPUs, as this
function is overridden when SSSE3 is available.

> That's what we did and what is most efficient.
>
> One can make an argument that emms must be called before returning to C code
> when it's needed. That, though, would also imply that all uses of emms_c()
> are wrong.

Well, e.g. the x64 psABI contains this clause: "The CPU shall be in x87 mode
upon entry to a function. Therefore, every function that uses the MMX
registers is required to issue an emms or femms instruction after using MMX
registers, before returning or calling another function." So using emms_c()
is not ABI-compliant. If I add an av_assert0_fpu() at the beginning of
av_log_default_callback (a function that may be overridden by a user-defined
callback that actually relies on us conforming to the ABI), several FATE
tests fail. I am sure that there are lots of av_log calls or other functions
in parts of the code where the CPU is not in x87 mode, and that they are
simply not executed in FATE because they are error logs.

- Andreas

PS: On the brighter side: fate.ffmpeg.org now contains three more green boxes!
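To make the "x87 mode" requirement from the quoted psABI clause more concrete, here is a minimal sketch of the kind of check an FPU-state assertion could perform. It is an assumption-laden illustration and not FFmpeg's implementation: `x87_tags_all_empty()`, `touch_mmx()`, `empty_mmx_state()` and `check_after_mmx()` are hypothetical names, the code assumes a GNU-style inline-assembly compiler on x86 with MMX available, and on any other target the helpers degrade to no-ops. It stores the x87 environment with `fstenv` and inspects the tag word, which reads 0xffff when all eight registers are marked empty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if (defined(__i386__) || defined(__x86_64__)) && defined(__GNUC__)
#define SKETCH_HAVE_X87 1
#else
#define SKETCH_HAVE_X87 0
#endif

/* Hypothetical helper: returns 1 if every x87 register is tagged empty,
 * i.e. the CPU is in the state the psABI expects on function entry. */
static int x87_tags_all_empty(void)
{
#if SKETCH_HAVE_X87
    uint16_t env[14] = { 0 };      /* 28-byte protected-mode x87 environment */
    assert(sizeof(env) == 28);
    __asm__ volatile (
        "fstenv %0 \n\t"           /* store environment (also masks exceptions) */
        "fldenv %0 \n\t"           /* reload it so the control word is unchanged */
        : "+m" (env)
        :
        : "memory"
    );
    return env[4] == 0xffff;       /* tag word lives at byte offset 8 */
#else
    return 1;                      /* nothing to check on non-x86 targets */
#endif
}

/* Hypothetical helper: executes one real MMX instruction, which marks the
 * x87 registers as in use. */
static void touch_mmx(void)
{
#if SKETCH_HAVE_X87
    __asm__ volatile ("pxor %%mm0, %%mm0" ::: "memory");
#endif
}

/* Hypothetical helper: same idea as an emms_c()-style wrapper. */
static void empty_mmx_state(void)
{
#if SKETCH_HAVE_X87
    __asm__ volatile ("emms" ::: "memory");
#endif
}

/* Uses MMX, optionally reports whether the state looked dirty afterwards,
 * then issues emms and returns 1 if the state is clean again. */
static int check_after_mmx(int *was_dirty)
{
    touch_mmx();
    if (was_dirty)
        *was_dirty = !x87_tags_all_empty();
    empty_mmx_state();
    return x87_tags_all_empty();
}
```

A check of this kind placed at the top of a callback would fail whenever a caller reached it with MMX state still live, which is the situation the message above describes for the logging callback.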
On Tue, Aug 23, 2022 at 07:28:19PM +0200, Andreas Rheinhardt wrote:
> Michael Niedermayer:
> > [...]
> > emms needs to be called between MMX and float code, as far outside of loops
> > as possible. That would suggest outside the for() loops in rgbToRgbWrapper()
> > and any other code using it.
>
> But there is another aspect that the above is missing: if emms_c() is put
> outside of MMX functions, then it will be called even when it is
> unnecessary. In this case it is unnecessary for all modern CPUs, as this
> function is overridden when SSSE3 is available.

If you don't like that,
don't call it when it's not needed, or call it a few hundred times
unnecessarily like your patch does,
or write only code that doesn't need emms.
Maybe there are more options ...

thx

[...]
Michael Niedermayer:
> On Tue, Aug 23, 2022 at 07:28:19PM +0200, Andreas Rheinhardt wrote:
>> [...]
>> But there is another aspect that the above is missing: if emms_c() is put
>> outside of MMX functions, then it will be called even when it is
>> unnecessary. In this case it is unnecessary for all modern CPUs, as this
>> function is overridden when SSSE3 is available.
>
> If you don't like that,
> don't call it when it's not needed, or call it a few hundred times
> unnecessarily like your patch does,
> or write only code that doesn't need emms.
> Maybe there are more options ...

If emms_c() is used outside of MMX functions as it is now, then a "don't
call it when it's not needed" would involve a check and would therefore
still incur a cost for users who don't use this. It is also unclear what
such a check would even look like, given that one can use
av_force_cpu_flags(). See also 55fc2c5a892c50feb1b9a8f55b74ec6594755ddb.
This patch also only calls it a few hundred times unnecessarily if one
runs this without SSSE3. CPUs without SSSE3 are ancient today. For the
non-ancient CPUs, using emms_c() adds an EMMS.

- Andreas
On Tue, Aug 23, 2022 at 08:09:09PM +0200, Andreas Rheinhardt wrote:
> Michael Niedermayer:
> > [...]
> > If you don't like that,
> > don't call it when it's not needed, or call it a few hundred times
> > unnecessarily like your patch does,
> > or write only code that doesn't need emms.
> > Maybe there are more options ...
>
> If emms_c() is used outside of MMX functions as it is now, then a "don't
> call it when it's not needed" would involve a check and would therefore
> still incur a cost for users who don't use this. It is also unclear what
> such a check would even look like, given that one can use
> av_force_cpu_flags(). See also 55fc2c5a892c50feb1b9a8f55b74ec6594755ddb.
> This patch also only calls it a few hundred times unnecessarily if one
> runs this without SSSE3. CPUs without SSSE3 are ancient today. For the
> non-ancient CPUs, using emms_c() adds an EMMS.

Do whatever you prefer.
The best solution depends on assumptions.
The impact is biggest on old CPUs, where EMMS is also a slow instruction.
But as you say, these are ancient today: a very small impact on many vs. a
small to moderate impact on a setup that is rare today.
The worst outcome is if the bug is left open and time is wasted on
bikeshedding.

thx

[...]
Michael Niedermayer:
> On Tue, Aug 23, 2022 at 08:09:09PM +0200, Andreas Rheinhardt wrote:
>> [...]
>
> Do whatever you prefer.
> The best solution depends on assumptions.
> The impact is biggest on old CPUs, where EMMS is also a slow instruction.
> But as you say, these are ancient today: a very small impact on many vs. a
> small to moderate impact on a setup that is rare today.
> The worst outcome is if the bug is left open and time is wasted on
> bikeshedding.

Given that Lynne already approved this on IRC, I have already applied it as
de33506e4b3e3362095aab167ad8bb87c1bd9488. Several of your FATE-boxes are
now green, e.g.
https://fate.ffmpeg.org/history.cgi?slot=x86_32-debian-kfreebsd-gcc-4.4-cpuflags-sse
Rejoice!

- Andreas
diff --git a/libswscale/x86/rgb_2_rgb.asm b/libswscale/x86/rgb_2_rgb.asm
index c695c61d5c..76ca1eec03 100644
--- a/libswscale/x86/rgb_2_rgb.asm
+++ b/libswscale/x86/rgb_2_rgb.asm
@@ -104,6 +104,7 @@ jge .end
     jl .loop_simd
 
 .end:
+    emms
     RET
 
 ;------------------------------------------------------------------------------
Fixes FATE-failures with the filter-2xbr filter-3xbr filter-4xbr
filter-ep2x filter-ep3x filter-hq2x filter-hq3x filter-hq4x
filter-paletteuse-bayer filter-paletteuse-bayer0
filter-paletteuse-nodither and filter-paletteuse-sierra2_4a tests
when using 32bit x86 with CPUFLAGS ranging from "mmx+mmxext" to
"mmx+mmxext+sse+sse2+sse3" (the relevant function is only overwritten
when using SSSE3).

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
---
 libswscale/x86/rgb_2_rgb.asm | 1 +
 1 file changed, 1 insertion(+)
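The remark that the function "is only overwritten when using SSSE3" can be pictured with a small dispatch sketch. Everything here is hypothetical and exists only for illustration: the `SKETCH_CPU_*` bits are not FFmpeg's flag values, `select_shuffle_2103()` is an invented name, and the "mmxext" and "ssse3" functions are plain C stand-ins for the assembly versions. The point is only that every flag combination from "mmx+mmxext" up to "mmx+mmxext+sse+sse2+sse3" ends up on the MMXEXT path, while adding SSSE3 replaces it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical CPU flag bits for this sketch only. */
enum {
    SKETCH_CPU_MMX    = 1 << 0,
    SKETCH_CPU_MMXEXT = 1 << 1,
    SKETCH_CPU_SSE    = 1 << 2,
    SKETCH_CPU_SSE2   = 1 << 3,
    SKETCH_CPU_SSE3   = 1 << 4,
    SKETCH_CPU_SSSE3  = 1 << 5
};

typedef void (*shuffle_fn)(const uint8_t *src, uint8_t *dst, int src_size);

/* Shared scalar body: swap bytes 0 and 2 of each 4-byte pixel. */
static void shuffle_2103_scalar(const uint8_t *src, uint8_t *dst, int src_size)
{
    assert(src_size % 4 == 0);
    for (int i = 0; i < src_size; i += 4) {
        dst[i + 0] = src[i + 2];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 0];
        dst[i + 3] = src[i + 3];
    }
}

/* Stand-ins for the C, MMXEXT and SSSE3 implementations. */
static void shuffle_2103_c(const uint8_t *s, uint8_t *d, int n)
{
    shuffle_2103_scalar(s, d, n);
}

static void shuffle_2103_mmxext_standin(const uint8_t *s, uint8_t *d, int n)
{
    shuffle_2103_scalar(s, d, n);
}

static void shuffle_2103_ssse3_standin(const uint8_t *s, uint8_t *d, int n)
{
    shuffle_2103_scalar(s, d, n);
}

/* Later assignments override earlier ones, so the most capable
 * implementation allowed by the flags wins. */
static shuffle_fn select_shuffle_2103(int cpu_flags)
{
    shuffle_fn fn = shuffle_2103_c;
    if (cpu_flags & SKETCH_CPU_MMXEXT)
        fn = shuffle_2103_mmxext_standin;
    if (cpu_flags & SKETCH_CPU_SSSE3)
        fn = shuffle_2103_ssse3_standin;
    return fn;
}
```

Under this model, forcing the CPU flags down to anything below SSSE3 is what exposes the MMXEXT path on otherwise modern hardware, which matches the FATE configurations named in the commit message.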