Message ID | 20160902143635.31585-2-timo@rothenpieler.org |
---|---|
State | Superseded |
Headers | show |
2016-09-02 16:36 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: > +#if AV_HAVE_BIGENDIAN > +static int planar8ToP010leWrapper(SwsContext *c, const uint8_t *src[], Why does this function not work on both big and little endian hardware? Carl Eugen
On 9/2/2016 7:16 PM, Carl Eugen Hoyos wrote: > 2016-09-02 16:36 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: > >> +#if AV_HAVE_BIGENDIAN >> +static int planar8ToP010leWrapper(SwsContext *c, const uint8_t *src[], > > Why does this function not work on both big and little endian hardware? It does, but it's significantly slower. In my tests, it takes double the time than the pure native one.
Hi!
2016-09-02 16:36 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>:
> + AV_WL16(&dstUV[2*x ], (t | (t << 8)) & 0xFFC0);
Why is "& 0xFFC0" necessary?
(Same below.)
Carl Eugen
2016-09-03 0:06 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: > On 9/2/2016 7:16 PM, Carl Eugen Hoyos wrote: >> 2016-09-02 16:36 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: >> >>> +#if AV_HAVE_BIGENDIAN >>> +static int planar8ToP010leWrapper(SwsContext *c, const uint8_t *src[], >> >> Why does this function not work on both big and little endian hardware? > > It does, but it's significantly slower. > In my tests, it takes double the time than the pure native one. Do you know why exactly it is slower? If performance matters, this likely can be SIMD-optimized, no reason to duplicate the function. Carl Eugen
On 9/3/2016 1:47 PM, Carl Eugen Hoyos wrote: > 2016-09-03 0:06 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: >> On 9/2/2016 7:16 PM, Carl Eugen Hoyos wrote: >>> 2016-09-02 16:36 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: >>> >>>> +#if AV_HAVE_BIGENDIAN >>>> +static int planar8ToP010leWrapper(SwsContext *c, const uint8_t *src[], >>> >>> Why does this function not work on both big and little endian hardware? >> >> It does, but it's significantly slower. >> In my tests, it takes double the time than the pure native one. > > Do you know why exactly it is slower? > > If performance matters, this likely can be SIMD-optimized, no reason to > duplicate the function. No idea, but it was hinted that the AV_WL macros do some thing to assure it works on systems with strict alignment requirements. And it's slow enough to be no longer capable of processing in real time, while the other implementation easily handles 100+ fps. I have another idea how to reduce the overhead of having two versions.
On 9/3/2016 1:46 PM, Carl Eugen Hoyos wrote: > Hi! > > 2016-09-02 16:36 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: > >> + AV_WL16(&dstUV[2*x ], (t | (t << 8)) & 0xFFC0); > > Why is "& 0xFFC0" necessary? > (Same below.) Because P010 expects the 10 bits in the 10 most significant bit. I'm not 100% sure if the other 6 bits are undefined or 0, but all the other implementations treat them as zeroes.
2016-09-03 14:54 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: >>> + AV_WL16(&dstUV[2*x ], (t | (t << 8)) & 0xFFC0); >> >> Why is "& 0xFFC0" necessary? >> (Same below.) > > Because P010 expects the 10 bits in the 10 most significant bit. > I'm not 100% sure if the other 6 bits are undefined or 0, but all the > other implementations treat them as zeroes. I suggest to remove this. Carl Eugen
On 9/3/2016 3:15 PM, Carl Eugen Hoyos wrote: > 2016-09-03 14:54 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: > >>>> + AV_WL16(&dstUV[2*x ], (t | (t << 8)) & 0xFFC0); >>> >>> Why is "& 0xFFC0" necessary? >>> (Same below.) >> >> Because P010 expects the 10 bits in the 10 most significant bit. >> I'm not 100% sure if the other 6 bits are undefined or 0, but all the >> other implementations treat them as zeroes. > > I suggest to remove this. At least https://technet.microsoft.com/pt-br/library/bb970578.aspx describes the lower 6 bits as set to 0, so leaving them in an undefined state might have unintended sideeffects.
2016-09-03 15:18 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: > On 9/3/2016 3:15 PM, Carl Eugen Hoyos wrote: >> 2016-09-03 14:54 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>: >> >>>>> + AV_WL16(&dstUV[2*x ], (t | (t << 8)) & 0xFFC0); >>>> >>>> Why is "& 0xFFC0" necessary? >>>> (Same below.) >>> >>> Because P010 expects the 10 bits in the 10 most significant bit. >>> I'm not 100% sure if the other 6 bits are undefined or 0, but all the >>> other implementations treat them as zeroes. >> >> I suggest to remove this. > > At least https://technet.microsoft.com/pt-br/library/bb970578.aspx > describes the lower 6 bits as set to 0 And the very next sentence explicitely contradicts. > so leaving them in an undefined > state might have unintended sideeffects. That is more than unlikely. Carl Eugen
diff --git a/libswscale/swscale_unscaled.c b/libswscale/swscale_unscaled.c index f0b2fbf..bdbedee 100644 --- a/libswscale/swscale_unscaled.c +++ b/libswscale/swscale_unscaled.c @@ -33,6 +33,7 @@ #include "libavutil/bswap.h" #include "libavutil/pixdesc.h" #include "libavutil/avassert.h" +#include "libavutil/avconfig.h" DECLARE_ALIGNED(8, static const uint8_t, dithers)[8][8][8]={ { @@ -236,6 +237,83 @@ static int planarToP010Wrapper(SwsContext *c, const uint8_t *src8[], return srcSliceH; } +#if AV_HAVE_BIGENDIAN +static int planar8ToP010leWrapper(SwsContext *c, const uint8_t *src[], + int srcStride[], int srcSliceY, + int srcSliceH, uint8_t *dstParam8[], + int dstStride[]) +{ + uint16_t *dstY = (uint16_t*)(dstParam8[0] + dstStride[0] * srcSliceY); + uint16_t *dstUV = (uint16_t*)(dstParam8[1] + dstStride[1] * srcSliceY / 2); + int x, y, t; + + av_assert0(!(dstStride[0] % 2 || dstStride[1] % 2)); + + for (y = 0; y < srcSliceH; y++) { + for (x = 0; x < c->srcW; x++) { + t = src[0][x]; + AV_WL16(&dstY[x], (t | (t << 8)) & 0xFFC0); + } + src[0] += srcStride[0]; + dstY += dstStride[0] / 2; + + if (!(y & 1)) { + for (x = 0; x < c->srcW / 2; x++) { + t = src[1][x]; + AV_WL16(&dstUV[2*x ], (t | (t << 8)) & 0xFFC0); + t = src[2][x]; + AV_WL16(&dstUV[2*x+1], (t | (t << 8)) & 0xFFC0); + } + src[1] += srcStride[1]; + src[2] += srcStride[2]; + dstUV += dstStride[1] / 2; + } + } + + return srcSliceH; +} +#else +static int planar8ToP010leWrapper(SwsContext *c, const uint8_t *src[], + int srcStride[], int srcSliceY, + int srcSliceH, uint8_t *dstParam8[], + int dstStride[]) +{ + uint16_t *dstY = (uint16_t*)(dstParam8[0] + dstStride[0] * srcSliceY); + uint16_t *dstUV = (uint16_t*)(dstParam8[1] + dstStride[1] * srcSliceY / 2); + int x, y, t; + + av_assert0(!(dstStride[0] % 2 || dstStride[1] % 2)); + + for (y = 0; y < srcSliceH; y++) { + uint16_t *tdstY = dstY; + const uint8_t *tsrc0 = src[0]; + for (x = c->srcW; x > 0; x--) { + t = *tsrc0++; + *tdstY++ = (t | (t << 8)) & 0xFFC0; + } + src[0] += srcStride[0]; + dstY += dstStride[0] / 2; + + if (!(y & 1)) { + uint16_t *tdstUV = dstUV; + const uint8_t *tsrc1 = src[1]; + const uint8_t *tsrc2 = src[2]; + for (x = c->srcW / 2; x > 0; x--) { + t = *tsrc1++; + *tdstUV++ = (t | (t << 8)) & 0xFFC0; + t = *tsrc2++; + *tdstUV++ = (t | (t << 8)) & 0xFFC0; + } + src[1] += srcStride[1]; + src[2] += srcStride[2]; + dstUV += dstStride[1] / 2; + } + } + + return srcSliceH; +} +#endif + static int planarToYuy2Wrapper(SwsContext *c, const uint8_t *src[], int srcStride[], int srcSliceY, int srcSliceH, uint8_t *dstParam[], int dstStride[]) @@ -1644,6 +1722,11 @@ void ff_get_unscaled_swscale(SwsContext *c) dstFormat == AV_PIX_FMT_P010) { c->swscale = planarToP010Wrapper; } + /* yuv420p_to_p010le */ + if ((srcFormat == AV_PIX_FMT_YUV420P || srcFormat == AV_PIX_FMT_YUVA420P) && + dstFormat == AV_PIX_FMT_P010LE) { + c->swscale = planar8ToP010leWrapper; + } if (srcFormat == AV_PIX_FMT_YUV410P && !(dstH & 3) && (dstFormat == AV_PIX_FMT_YUV420P || dstFormat == AV_PIX_FMT_YUVA420P) && diff --git a/tests/ref/fate/filter-pixdesc-p010le b/tests/ref/fate/filter-pixdesc-p010le index cac2635..2500604 100644 --- a/tests/ref/fate/filter-pixdesc-p010le +++ b/tests/ref/fate/filter-pixdesc-p010le @@ -1 +1 @@ -pixdesc-p010le 0268fd44f63022e21ada69704534fc85 +pixdesc-p010le 7b4a503997eb4e14cba80ee52db85e39 diff --git a/tests/ref/fate/filter-pixfmts-copy b/tests/ref/fate/filter-pixfmts-copy index ce957f7..bcc4475 100644 --- a/tests/ref/fate/filter-pixfmts-copy +++ b/tests/ref/fate/filter-pixfmts-copy @@ -36,7 +36,7 @@ monow 54d16d2c01abfd72ecdb5e51e283937c nv12 8e24feb2c544dc26a20047a71e4c27aa nv21 335d85c9af6110f26ae9e187a82ed2cf p010be 7f9842d6015026136bad60d03c035cc3 -p010le 1929db89609c4b8c6d9c9030a9e7843d +p010le 9ba7bc4611e36b2435eb2dff353b8af5 pal8 ff5929f5b42075793b2c34cb441bede5 rgb0 0de71e5a1f97f81fb51397a0435bfa72 rgb24 f4438057d046e6d98ade4e45294b21be diff --git a/tests/ref/fate/filter-pixfmts-crop b/tests/ref/fate/filter-pixfmts-crop index e2c77a8..51c6df9 100644 --- a/tests/ref/fate/filter-pixfmts-crop +++ b/tests/ref/fate/filter-pixfmts-crop @@ -34,7 +34,7 @@ gray16le 9ff7c866bd98def4e6c91542c1c45f80 nv12 92cda427f794374731ec0321ee00caac nv21 1bcfc197f4fb95de85ba58182d8d2f69 p010be 8b2de2eb6b099bbf355bfc55a0694ddc -p010le a1e4f713e145dfc465bfe0cc77096a03 +p010le fa78436272020be0d2569139808429b6 pal8 1f2cdc8e718f95c875dbc1034a688bfb rgb0 736646b70dd9a0be22b8da8041e35035 rgb24 c5fbbf816bb2000f4d2914e335698ef5 diff --git a/tests/ref/fate/filter-pixfmts-field b/tests/ref/fate/filter-pixfmts-field index 20c76a1..8232a5c 100644 --- a/tests/ref/fate/filter-pixfmts-field +++ b/tests/ref/fate/filter-pixfmts-field @@ -36,7 +36,7 @@ monow 03d783611d265cae78293f88ea126ea1 nv12 16f7a46708ef25ebd0b72e47920cc11e nv21 7294574037cc7f9373ef5695d8ebe809 p010be a0311a09bba7383553267d2b3b9c075e -p010le f1cc90d292046109a626db2da9f0f9b6 +p010le 634c62ef33b362795339a03907a33137 pal8 0658c18dcd8d052d59dfbe23f5b368d9 rgb0 ca3fa6e865b91b3511c7f2bf62830059 rgb24 25ab271e26a5785be169578d99da5dd0 diff --git a/tests/ref/fate/filter-pixfmts-hflip b/tests/ref/fate/filter-pixfmts-hflip index 8e902fb..9417eb8 100644 --- a/tests/ref/fate/filter-pixfmts-hflip +++ b/tests/ref/fate/filter-pixfmts-hflip @@ -34,7 +34,7 @@ gray16le d91ce41e304419bcf32ac792f01bd64f nv12 801e58f1be5fd0b5bc4bf007c604b0b4 nv21 9f10dfff8963dc327d3395af21f0554f p010be 744b13e44d39e1ff7588983fa03e0101 -p010le aeb31f50c66f376b0530c7bb6287212b +p010le 447768b443c5cd8ff591ccb53463f220 pal8 5b7c77d99817b4f52339742a47de7797 rgb0 0092452f37d73da20193265ace0b7d57 rgb24 21571104e6091a689feabb7867e513dd diff --git a/tests/ref/fate/filter-pixfmts-il b/tests/ref/fate/filter-pixfmts-il index e568843..2ca37a3 100644 --- a/tests/ref/fate/filter-pixfmts-il +++ b/tests/ref/fate/filter-pixfmts-il @@ -36,7 +36,7 @@ monow 6e9cfb8d3a344c5f0c3e1d5e1297e580 nv12 3c3ba9b1b4c4dfff09c26f71b51dd146 nv21 ab586d8781246b5a32d8760a61db9797 p010be 3df51286ef66b53e3e283dbbab582263 -p010le 38945445b360fa737e9e37257393e823 +p010le 7c101300e86f25e5528583ed811f8d25 rgb0 cfaf68671e43248267d8cd50cae8c13f rgb24 88894f608cf33ba310f21996748d77a7 rgb444be 99d36d814988fb388aacdef575dacfcf diff --git a/tests/ref/fate/filter-pixfmts-null b/tests/ref/fate/filter-pixfmts-null index ce957f7..bcc4475 100644 --- a/tests/ref/fate/filter-pixfmts-null +++ b/tests/ref/fate/filter-pixfmts-null @@ -36,7 +36,7 @@ monow 54d16d2c01abfd72ecdb5e51e283937c nv12 8e24feb2c544dc26a20047a71e4c27aa nv21 335d85c9af6110f26ae9e187a82ed2cf p010be 7f9842d6015026136bad60d03c035cc3 -p010le 1929db89609c4b8c6d9c9030a9e7843d +p010le 9ba7bc4611e36b2435eb2dff353b8af5 pal8 ff5929f5b42075793b2c34cb441bede5 rgb0 0de71e5a1f97f81fb51397a0435bfa72 rgb24 f4438057d046e6d98ade4e45294b21be diff --git a/tests/ref/fate/filter-pixfmts-scale b/tests/ref/fate/filter-pixfmts-scale index 62ea6fa..6320557 100644 --- a/tests/ref/fate/filter-pixfmts-scale +++ b/tests/ref/fate/filter-pixfmts-scale @@ -36,7 +36,7 @@ monow 35c68b86c226d6990b2dcb573a05ff6b nv12 b118d24a3653fe66e5d9e079033aef79 nv21 c74bb1c10dbbdee8a1f682b194486c4d p010be 1d6726d94bf1385996a9a9840dd0e878 -p010le 5d436e6b35292a0e356d81f37f989b66 +p010le 4b316f2b9e18972299beb73511278fa8 pal8 29e10892009b2cfe431815ec3052ed3b rgb0 fbd27e98154efb7535826afed41e9bb0 rgb24 e022e741451e81f2ecce1c7240b93e87 diff --git a/tests/ref/fate/filter-pixfmts-vflip b/tests/ref/fate/filter-pixfmts-vflip index 9651724..667a1b6 100644 --- a/tests/ref/fate/filter-pixfmts-vflip +++ b/tests/ref/fate/filter-pixfmts-vflip @@ -36,7 +36,7 @@ monow 90a947bfcd5f2261e83b577f48ec57b1 nv12 261ebe585ae2aa4e70d39a10c1679294 nv21 2909feacd27bebb080c8e0fa41795269 p010be 06e9354b6e0e38ba41736352cedc0bd5 -p010le cdf6a3c38d9d4e3f079fa369e1dda662 +p010le 9686439b0d21cae1949d6aebe98b5e88 pal8 450b0155d0f2d5628bf95a442db5f817 rgb0 56a7ea69541bcd27bef6a5615784722b rgb24 195e6dae1c3a488b9d3ceb7560d25d85