Message ID | 20160903132522.29006-1-timo@rothenpieler.org |
---|---|
State | Superseded |
Headers | show |
> @@ -236,6 +237,57 @@ static int planarToP010Wrapper(SwsContext *c, const uint8_t *src8[], > return srcSliceH; > } > > +#if AV_HAVE_BIGENDIAN || 1 Nevermind the || 1, left over from testing speed differences and forgot to remove it.
2016-09-03 15:25 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>:
> + output_pixel(tdstY++, (t | (t << 8)) & 0xFFC0);
Please remove the "& 0xFFC0" here and below.
Carl Eugen
On 9/4/16, Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote: > 2016-09-03 15:25 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: > >> + output_pixel(tdstY++, (t | (t << 8)) & 0xFFC0); > > Please remove the "& 0xFFC0" here and below. Please explain your reasoning here.
2016-09-04 13:10 GMT+02:00 Paul B Mahol <onemda@gmail.com>: > On 9/4/16, Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote: >> 2016-09-03 15:25 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: >> >>> + output_pixel(tdstY++, (t | (t << 8)) & 0xFFC0); >> >> Please remove the "& 0xFFC0" here and below. > > Please explain your reasoning here. The linked specification clearly explains that the content of P016 and P010 is identical and that it is not necessary to set the lsb's to 0. I believe we agree that the patch is slightly ugly because of the defines but developers seem to agree that it has to be accepted for performance reasons (nobody objected). Even if "& 0xFFC0" is very cheap, I don't think it comes for free. (Or does it?) Finally, with the change, the function can also be used for P016, note that I tried to object to P010: It does not serve any real purpose, if I remember correctly, the explanation for the commit was that there is a bug in FFmpeg's pix_fmt decision routine that needed to be worked-around ("hacked"). Carl Eugen
On Sun, Sep 4, 2016 at 1:18 PM, Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote: > 2016-09-04 13:10 GMT+02:00 Paul B Mahol <onemda@gmail.com>: >> On 9/4/16, Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote: >>> 2016-09-03 15:25 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: >>> >>>> + output_pixel(tdstY++, (t | (t << 8)) & 0xFFC0); >>> >>> Please remove the "& 0xFFC0" here and below. >> >> Please explain your reasoning here. > > The linked specification clearly explains that the content > of P016 and P010 is identical and that it is not necessary > to set the lsb's to 0. While the specification does say that P010's memory layout was particularly chosen to be able to simply alias to P016 if needed, having random garbage in there is not good, so the zeroing of those bits should remain. - Hendrik
2016-09-04 15:42 GMT+02:00 Hendrik Leppkes <h.leppkes@gmail.com>: > On Sun, Sep 4, 2016 at 1:18 PM, Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote: >> 2016-09-04 13:10 GMT+02:00 Paul B Mahol <onemda@gmail.com>: >>> On 9/4/16, Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote: >>>> 2016-09-03 15:25 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: >>>> >>>>> + output_pixel(tdstY++, (t | (t << 8)) & 0xFFC0); >>>> >>>> Please remove the "& 0xFFC0" here and below. >>> >>> Please explain your reasoning here. >> >> The linked specification clearly explains that the content >> of P016 and P010 is identical and that it is not necessary >> to set the lsb's to 0. > > While the specification does say that P010's memory layout > was particularly chosen to be able to simply alias to P016 if > needed, having random garbage in there is not good, so > the zeroing of those bits should remain. Since it is not garbage (and not random), your argumentation makes no sense. Is the operation really free as you all seem to insinuate? Carl Eugen
> Finally, with the change, the function can also be used > for P016, note that I tried to object to P010: It does not > serve any real purpose, if I remember correctly, the > explanation for the commit was that there is a bug in > FFmpeg's pix_fmt decision routine that needed to > be worked-around ("hacked"). It's the input format to nvenc in 10bit mode. The purpose of this patch is to make conversion from yuv420p (8 bit) to p010 (10 bit) fast.
On Sep 4, 2016 5:02 PM, "Timo Rothenpieler" <timo@rothenpieler.org> wrote: > > > Finally, with the change, the function can also be used > > for P016, note that I tried to object to P010: It does not > > serve any real purpose, if I remember correctly, the > > explanation for the commit was that there is a bug in > > FFmpeg's pix_fmt decision routine that needed to > > be worked-around ("hacked"). > > It's the input format to nvenc in 10bit mode. > The purpose of this patch is to make conversion from yuv420p (8 bit) to > p010 (10 bit) fast. > _______________________________________________ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > http://ffmpeg.org/mailman/listinfo/ffmpeg-devel Especially for 4K and UHD frames. You may not notice conversion speed change in HD or SD resolutions.
2016-09-04 16:02 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: > The purpose of this patch is to make conversion from > yuv420p (8 bit) to p010 (10 bit) fast. Do I understand you correctly that your patch is faster without the change I suggested? Carl Eugen
Hi, On Sun, Sep 4, 2016 at 10:06 AM, Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote: > 2016-09-04 16:02 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: > > The purpose of this patch is to make conversion from > > yuv420p (8 bit) to p010 (10 bit) fast. > > Do I understand you correctly that your patch is > faster without the change I suggested? We should also consider whether not zeroing out the LSBs affects other pieces of code (or hardware) which support p010 as input. If they all ignore the LSBs as the spec seems to require, I agree with Carl. If they break or the output changes, it may be more complicated. Ronald
Hi, On Sep 4, 2016 5:33 PM, "Ronald S. Bultje" <rsbultje@gmail.com> wrote: > > Hi, > > On Sun, Sep 4, 2016 at 10:06 AM, Carl Eugen Hoyos <ceffmpeg@gmail.com> > wrote: > > > 2016-09-04 16:02 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: > > > The purpose of this patch is to make conversion from > > > yuv420p (8 bit) to p010 (10 bit) fast. > > > > Do I understand you correctly that your patch is > > faster without the change I suggested? > > > We should also consider whether not zeroing out the LSBs affects other > pieces of code (or hardware) which support p010 as input. > > If they all ignore the LSBs as the spec seems to require, I agree with > Carl. If they break or the output changes, it may be more complicated. > > Ronald > _______________________________________________ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > http://ffmpeg.org/mailman/listinfo/ffmpeg-devel I can test the patch in Monday with and without zeros and report if it break anything for Nvenc Hevc encoding.
On 9/4/2016 4:06 PM, Carl Eugen Hoyos wrote: > 2016-09-04 16:02 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: >> The purpose of this patch is to make conversion from >> yuv420p (8 bit) to p010 (10 bit) fast. > > Do I understand you correctly that your patch is > faster without the change I suggested? With the &: [bench @ 0x600045b80] t:0.011178 avg:0.011172 max:0.018297 min:0.010505 Without it: [bench @ 0x600045b80] t:0.008455 avg:0.008517 max:0.015815 min:0.007941 So it is quite a bit faster. Tested with nvenc hevc10 encoding, and the output is visually identical, and the file size is also exactly the same. So it seems to cleanly ignore the unused bits. Also, given that at least microsoft argues with upcasting to 16 bit, the approach without zeroing the lsb would be more accurate, as t << 8 | t is how one would convert 8 bit to 16 bit. So I'd say going with the faster approach here should be fine. If at some point someone runs into something that chokes on the bits being non-zero, which I think is highly unlikely, it can be changed back.
Hi, On Sep 4, 2016 5:42 PM, "Timo Rothenpieler" <timo@rothenpieler.org> wrote: > > On 9/4/2016 4:06 PM, Carl Eugen Hoyos wrote: > > 2016-09-04 16:02 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: > >> The purpose of this patch is to make conversion from > >> yuv420p (8 bit) to p010 (10 bit) fast. > > > > Do I understand you correctly that your patch is > > faster without the change I suggested? > > With the &: > [bench @ 0x600045b80] t:0.011178 avg:0.011172 max:0.018297 min:0.010505 > > Without it: > [bench @ 0x600045b80] t:0.008455 avg:0.008517 max:0.015815 min:0.007941 > > So it is quite a bit faster. > > Tested with nvenc hevc10 encoding, and the output is visually identical, > and the file size is also exactly the same. > So it seems to cleanly ignore the unused bits. > > Also, given that at least microsoft argues with upcasting to 16 bit, the > approach without zeroing the lsb would be more accurate, as > > t << 8 | t > > is how one would convert 8 bit to 16 bit. > > > So I'd say going with the faster approach here should be fine. > If at some point someone runs into something that chokes on the bits > being non-zero, which I think is highly unlikely, it can be changed back. Sounds good, thanks for testing and confirming. Ronald
2016-09-04 20:34 GMT+03:00 Ronald S. Bultje <rsbultje@gmail.com>: > Hi, > > On Sep 4, 2016 5:42 PM, "Timo Rothenpieler" <timo@rothenpieler.org> wrote: > > > > On 9/4/2016 4:06 PM, Carl Eugen Hoyos wrote: > > > 2016-09-04 16:02 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: > > >> The purpose of this patch is to make conversion from > > >> yuv420p (8 bit) to p010 (10 bit) fast. > > > > > > Do I understand you correctly that your patch is > > > faster without the change I suggested? > > > > With the &: > > [bench @ 0x600045b80] t:0.011178 avg:0.011172 max:0.018297 min:0.010505 > > > > Without it: > > [bench @ 0x600045b80] t:0.008455 avg:0.008517 max:0.015815 min:0.007941 > > > > So it is quite a bit faster. > > > > Tested with nvenc hevc10 encoding, and the output is visually identical, > > and the file size is also exactly the same. > > So it seems to cleanly ignore the unused bits. > > > > Also, given that at least microsoft argues with upcasting to 16 bit, the > > approach without zeroing the lsb would be more accurate, as > > > > t << 8 | t > > > > is how one would convert 8 bit to 16 bit. > > > > > > So I'd say going with the faster approach here should be fine. > > If at some point someone runs into something that chokes on the bits > > being non-zero, which I think is highly unlikely, it can be changed back. > > Sounds good, thanks for testing and confirming. > > Ronald > _______________________________________________ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > http://ffmpeg.org/mailman/listinfo/ffmpeg-devel > Hello, Is there any obstacle to release this patch ?
2016-09-05 10:53 GMT+02:00 Ali KIZIL <alikizil@gmail.com>:
> Is there any obstacle to release this patch ?
Which patch?
Carl Eugen
2016-09-05 12:04 GMT+03:00 Carl Eugen Hoyos <ceffmpeg@gmail.com>: > 2016-09-05 10:53 GMT+02:00 Ali KIZIL <alikizil@gmail.com>: > > > Is there any obstacle to release this patch ? > > Which patch? > > Carl Eugen > _______________________________________________ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > http://ffmpeg.org/mailman/listinfo/ffmpeg-devel > Sorry Carl, maybe I missed. I mean the patch for "add unscaled conversion from yuv420p to p010". I see the one "add unscaled conversion from yuv420p10 to p010" in github, but I can not see the one "add unscaled conversion from yuv420p to p010". Lastly it was confirmed that padding zeros were not needed. Kind Regards,
2016-09-03 15:25 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>:
> +static int planar8ToP010leWrapper
If you want, you can rename this function to planar8ToP01xleWrapper
or planar8ToP016leWrapper after removing the binary and.
Carl Eugen
2016-09-05 11:06 GMT+02:00 Ali KIZIL <alikizil@gmail.com>: > 2016-09-05 12:04 GMT+03:00 Carl Eugen Hoyos <ceffmpeg@gmail.com>: > >> 2016-09-05 10:53 GMT+02:00 Ali KIZIL <alikizil@gmail.com>: >> >> > Is there any obstacle to release this patch ? >> >> Which patch? > > Sorry Carl, maybe I missed. I mean the patch for "add unscaled > conversion from yuv420p to p010". Everybody who commented since yesterday afternoon agrees that the patch should be changed (slightly). So yes, there still is an obstacle. Please cut your quotes and please give every OP more than a few hours to resend. Carl Eugen
diff --git a/libswscale/swscale_unscaled.c b/libswscale/swscale_unscaled.c index ca7374a..cca2302 100644 --- a/libswscale/swscale_unscaled.c +++ b/libswscale/swscale_unscaled.c @@ -33,6 +33,7 @@ #include "libavutil/bswap.h" #include "libavutil/pixdesc.h" #include "libavutil/avassert.h" +#include "libavutil/avconfig.h" DECLARE_ALIGNED(8, static const uint8_t, dithers)[8][8][8]={ { @@ -236,6 +237,57 @@ static int planarToP010Wrapper(SwsContext *c, const uint8_t *src8[], return srcSliceH; } +#if AV_HAVE_BIGENDIAN || 1 +#define output_pixel(p, v) do { \ + uint16_t *pp = (p); \ + AV_WL16(pp, (v)); \ + } while(0) +#else +#define output_pixel(p, v) (*p) = (v) +#endif + +static int planar8ToP010leWrapper(SwsContext *c, const uint8_t *src[], + int srcStride[], int srcSliceY, + int srcSliceH, uint8_t *dstParam8[], + int dstStride[]) +{ + uint16_t *dstY = (uint16_t*)(dstParam8[0] + dstStride[0] * srcSliceY); + uint16_t *dstUV = (uint16_t*)(dstParam8[1] + dstStride[1] * srcSliceY / 2); + int x, y, t; + + av_assert0(!(dstStride[0] % 2 || dstStride[1] % 2)); + + for (y = 0; y < srcSliceH; y++) { + uint16_t *tdstY = dstY; + const uint8_t *tsrc0 = src[0]; + for (x = c->srcW; x > 0; x--) { + t = *tsrc0++; + output_pixel(tdstY++, (t | (t << 8)) & 0xFFC0); + } + src[0] += srcStride[0]; + dstY += dstStride[0] / 2; + + if (!(y & 1)) { + uint16_t *tdstUV = dstUV; + const uint8_t *tsrc1 = src[1]; + const uint8_t *tsrc2 = src[2]; + for (x = c->srcW / 2; x > 0; x--) { + t = *tsrc1++; + output_pixel(tdstUV++, (t | (t << 8)) & 0xFFC0); + t = *tsrc2++; + output_pixel(tdstUV++, (t | (t << 8)) & 0xFFC0); + } + src[1] += srcStride[1]; + src[2] += srcStride[2]; + dstUV += dstStride[1] / 2; + } + } + + return srcSliceH; +} + +#undef output_pixel + static int planarToYuy2Wrapper(SwsContext *c, const uint8_t *src[], int srcStride[], int srcSliceY, int srcSliceH, uint8_t *dstParam[], int dstStride[]) @@ -1645,6 +1697,11 @@ void ff_get_unscaled_swscale(SwsContext *c) dstFormat == AV_PIX_FMT_P010) { c->swscale = planarToP010Wrapper; } + /* yuv420p_to_p010le */ + if ((srcFormat == AV_PIX_FMT_YUV420P || srcFormat == AV_PIX_FMT_YUVA420P) && + dstFormat == AV_PIX_FMT_P010LE) { + c->swscale = planar8ToP010leWrapper; + } if (srcFormat == AV_PIX_FMT_YUV410P && !(dstH & 3) && (dstFormat == AV_PIX_FMT_YUV420P || dstFormat == AV_PIX_FMT_YUVA420P) && diff --git a/tests/ref/fate/filter-pixdesc-p010le b/tests/ref/fate/filter-pixdesc-p010le index cac2635..2500604 100644 --- a/tests/ref/fate/filter-pixdesc-p010le +++ b/tests/ref/fate/filter-pixdesc-p010le @@ -1 +1 @@ -pixdesc-p010le 0268fd44f63022e21ada69704534fc85 +pixdesc-p010le 7b4a503997eb4e14cba80ee52db85e39 diff --git a/tests/ref/fate/filter-pixfmts-copy b/tests/ref/fate/filter-pixfmts-copy index ce957f7..bcc4475 100644 --- a/tests/ref/fate/filter-pixfmts-copy +++ b/tests/ref/fate/filter-pixfmts-copy @@ -36,7 +36,7 @@ monow 54d16d2c01abfd72ecdb5e51e283937c nv12 8e24feb2c544dc26a20047a71e4c27aa nv21 335d85c9af6110f26ae9e187a82ed2cf p010be 7f9842d6015026136bad60d03c035cc3 -p010le 1929db89609c4b8c6d9c9030a9e7843d +p010le 9ba7bc4611e36b2435eb2dff353b8af5 pal8 ff5929f5b42075793b2c34cb441bede5 rgb0 0de71e5a1f97f81fb51397a0435bfa72 rgb24 f4438057d046e6d98ade4e45294b21be diff --git a/tests/ref/fate/filter-pixfmts-crop b/tests/ref/fate/filter-pixfmts-crop index e2c77a8..51c6df9 100644 --- a/tests/ref/fate/filter-pixfmts-crop +++ b/tests/ref/fate/filter-pixfmts-crop @@ -34,7 +34,7 @@ gray16le 9ff7c866bd98def4e6c91542c1c45f80 nv12 92cda427f794374731ec0321ee00caac nv21 1bcfc197f4fb95de85ba58182d8d2f69 p010be 8b2de2eb6b099bbf355bfc55a0694ddc -p010le a1e4f713e145dfc465bfe0cc77096a03 +p010le fa78436272020be0d2569139808429b6 pal8 1f2cdc8e718f95c875dbc1034a688bfb rgb0 736646b70dd9a0be22b8da8041e35035 rgb24 c5fbbf816bb2000f4d2914e335698ef5 diff --git a/tests/ref/fate/filter-pixfmts-field b/tests/ref/fate/filter-pixfmts-field index 20c76a1..8232a5c 100644 --- a/tests/ref/fate/filter-pixfmts-field +++ b/tests/ref/fate/filter-pixfmts-field @@ -36,7 +36,7 @@ monow 03d783611d265cae78293f88ea126ea1 nv12 16f7a46708ef25ebd0b72e47920cc11e nv21 7294574037cc7f9373ef5695d8ebe809 p010be a0311a09bba7383553267d2b3b9c075e -p010le f1cc90d292046109a626db2da9f0f9b6 +p010le 634c62ef33b362795339a03907a33137 pal8 0658c18dcd8d052d59dfbe23f5b368d9 rgb0 ca3fa6e865b91b3511c7f2bf62830059 rgb24 25ab271e26a5785be169578d99da5dd0 diff --git a/tests/ref/fate/filter-pixfmts-hflip b/tests/ref/fate/filter-pixfmts-hflip index 8e902fb..9417eb8 100644 --- a/tests/ref/fate/filter-pixfmts-hflip +++ b/tests/ref/fate/filter-pixfmts-hflip @@ -34,7 +34,7 @@ gray16le d91ce41e304419bcf32ac792f01bd64f nv12 801e58f1be5fd0b5bc4bf007c604b0b4 nv21 9f10dfff8963dc327d3395af21f0554f p010be 744b13e44d39e1ff7588983fa03e0101 -p010le aeb31f50c66f376b0530c7bb6287212b +p010le 447768b443c5cd8ff591ccb53463f220 pal8 5b7c77d99817b4f52339742a47de7797 rgb0 0092452f37d73da20193265ace0b7d57 rgb24 21571104e6091a689feabb7867e513dd diff --git a/tests/ref/fate/filter-pixfmts-il b/tests/ref/fate/filter-pixfmts-il index e568843..2ca37a3 100644 --- a/tests/ref/fate/filter-pixfmts-il +++ b/tests/ref/fate/filter-pixfmts-il @@ -36,7 +36,7 @@ monow 6e9cfb8d3a344c5f0c3e1d5e1297e580 nv12 3c3ba9b1b4c4dfff09c26f71b51dd146 nv21 ab586d8781246b5a32d8760a61db9797 p010be 3df51286ef66b53e3e283dbbab582263 -p010le 38945445b360fa737e9e37257393e823 +p010le 7c101300e86f25e5528583ed811f8d25 rgb0 cfaf68671e43248267d8cd50cae8c13f rgb24 88894f608cf33ba310f21996748d77a7 rgb444be 99d36d814988fb388aacdef575dacfcf diff --git a/tests/ref/fate/filter-pixfmts-null b/tests/ref/fate/filter-pixfmts-null index ce957f7..bcc4475 100644 --- a/tests/ref/fate/filter-pixfmts-null +++ b/tests/ref/fate/filter-pixfmts-null @@ -36,7 +36,7 @@ monow 54d16d2c01abfd72ecdb5e51e283937c nv12 8e24feb2c544dc26a20047a71e4c27aa nv21 335d85c9af6110f26ae9e187a82ed2cf p010be 7f9842d6015026136bad60d03c035cc3 -p010le 1929db89609c4b8c6d9c9030a9e7843d +p010le 9ba7bc4611e36b2435eb2dff353b8af5 pal8 ff5929f5b42075793b2c34cb441bede5 rgb0 0de71e5a1f97f81fb51397a0435bfa72 rgb24 f4438057d046e6d98ade4e45294b21be diff --git a/tests/ref/fate/filter-pixfmts-scale b/tests/ref/fate/filter-pixfmts-scale index 62ea6fa..6320557 100644 --- a/tests/ref/fate/filter-pixfmts-scale +++ b/tests/ref/fate/filter-pixfmts-scale @@ -36,7 +36,7 @@ monow 35c68b86c226d6990b2dcb573a05ff6b nv12 b118d24a3653fe66e5d9e079033aef79 nv21 c74bb1c10dbbdee8a1f682b194486c4d p010be 1d6726d94bf1385996a9a9840dd0e878 -p010le 5d436e6b35292a0e356d81f37f989b66 +p010le 4b316f2b9e18972299beb73511278fa8 pal8 29e10892009b2cfe431815ec3052ed3b rgb0 fbd27e98154efb7535826afed41e9bb0 rgb24 e022e741451e81f2ecce1c7240b93e87 diff --git a/tests/ref/fate/filter-pixfmts-vflip b/tests/ref/fate/filter-pixfmts-vflip index 9651724..667a1b6 100644 --- a/tests/ref/fate/filter-pixfmts-vflip +++ b/tests/ref/fate/filter-pixfmts-vflip @@ -36,7 +36,7 @@ monow 90a947bfcd5f2261e83b577f48ec57b1 nv12 261ebe585ae2aa4e70d39a10c1679294 nv21 2909feacd27bebb080c8e0fa41795269 p010be 06e9354b6e0e38ba41736352cedc0bd5 -p010le cdf6a3c38d9d4e3f079fa369e1dda662 +p010le 9686439b0d21cae1949d6aebe98b5e88 pal8 450b0155d0f2d5628bf95a442db5f817 rgb0 56a7ea69541bcd27bef6a5615784722b rgb24 195e6dae1c3a488b9d3ceb7560d25d85