diff mbox

[FFmpeg-devel,2/2] swscale: add unscaled conversion from yuv420p to p010

Message ID 20160902143635.31585-2-timo@rothenpieler.org
State Superseded
Headers show

Commit Message

Timo Rothenpieler Sept. 2, 2016, 2:36 p.m. UTC
---
 libswscale/swscale_unscaled.c        | 83 ++++++++++++++++++++++++++++++++++++
 tests/ref/fate/filter-pixdesc-p010le |  2 +-
 tests/ref/fate/filter-pixfmts-copy   |  2 +-
 tests/ref/fate/filter-pixfmts-crop   |  2 +-
 tests/ref/fate/filter-pixfmts-field  |  2 +-
 tests/ref/fate/filter-pixfmts-hflip  |  2 +-
 tests/ref/fate/filter-pixfmts-il     |  2 +-
 tests/ref/fate/filter-pixfmts-null   |  2 +-
 tests/ref/fate/filter-pixfmts-scale  |  2 +-
 tests/ref/fate/filter-pixfmts-vflip  |  2 +-
 10 files changed, 92 insertions(+), 9 deletions(-)

Comments

Carl Eugen Hoyos Sept. 2, 2016, 5:16 p.m. UTC | #1
2016-09-02 16:36 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>:

> +#if AV_HAVE_BIGENDIAN
> +static int planar8ToP010leWrapper(SwsContext *c, const uint8_t *src[],

Why does this function not work on both big and little endian hardware?

Carl Eugen
Timo Rothenpieler Sept. 2, 2016, 10:06 p.m. UTC | #2
On 9/2/2016 7:16 PM, Carl Eugen Hoyos wrote:
> 2016-09-02 16:36 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>:
> 
>> +#if AV_HAVE_BIGENDIAN
>> +static int planar8ToP010leWrapper(SwsContext *c, const uint8_t *src[],
> 
> Why does this function not work on both big and little endian hardware?

It does, but it's significantly slower.
In my tests, it takes double the time than the pure native one.
Carl Eugen Hoyos Sept. 3, 2016, 11:46 a.m. UTC | #3
Hi!

2016-09-02 16:36 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>:

> +                AV_WL16(&dstUV[2*x  ], (t | (t << 8)) & 0xFFC0);

Why is "& 0xFFC0" necessary?
(Same below.)

Carl Eugen
Carl Eugen Hoyos Sept. 3, 2016, 11:47 a.m. UTC | #4
2016-09-03 0:06 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>:
> On 9/2/2016 7:16 PM, Carl Eugen Hoyos wrote:
>> 2016-09-02 16:36 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>:
>>
>>> +#if AV_HAVE_BIGENDIAN
>>> +static int planar8ToP010leWrapper(SwsContext *c, const uint8_t *src[],
>>
>> Why does this function not work on both big and little endian hardware?
>
> It does, but it's significantly slower.
> In my tests, it takes double the time than the pure native one.

Do you know why exactly it is slower?

If performance matters, this likely can be SIMD-optimized, no reason to
duplicate the function.

Carl Eugen
Timo Rothenpieler Sept. 3, 2016, 12:53 p.m. UTC | #5
On 9/3/2016 1:47 PM, Carl Eugen Hoyos wrote:
> 2016-09-03 0:06 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>:
>> On 9/2/2016 7:16 PM, Carl Eugen Hoyos wrote:
>>> 2016-09-02 16:36 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>:
>>>
>>>> +#if AV_HAVE_BIGENDIAN
>>>> +static int planar8ToP010leWrapper(SwsContext *c, const uint8_t *src[],
>>>
>>> Why does this function not work on both big and little endian hardware?
>>
>> It does, but it's significantly slower.
>> In my tests, it takes double the time than the pure native one.
> 
> Do you know why exactly it is slower?
> 
> If performance matters, this likely can be SIMD-optimized, no reason to
> duplicate the function.

No idea, but it was hinted that the AV_WL macros do some thing to assure
it works on systems with strict alignment requirements.

And it's slow enough to be no longer capable of processing in real time,
while the other implementation easily handles 100+ fps.

I have another idea how to reduce the overhead of having two versions.
Timo Rothenpieler Sept. 3, 2016, 12:54 p.m. UTC | #6
On 9/3/2016 1:46 PM, Carl Eugen Hoyos wrote:
> Hi!
> 
> 2016-09-02 16:36 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>:
> 
>> +                AV_WL16(&dstUV[2*x  ], (t | (t << 8)) & 0xFFC0);
> 
> Why is "& 0xFFC0" necessary?
> (Same below.)

Because P010 expects the 10 bits in the 10 most significant bit.
I'm not 100% sure if the other 6 bits are undefined or 0, but all the
other implementations treat them as zeroes.
Carl Eugen Hoyos Sept. 3, 2016, 1:15 p.m. UTC | #7
2016-09-03 14:54 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>:

>>> +                AV_WL16(&dstUV[2*x  ], (t | (t << 8)) & 0xFFC0);
>>
>> Why is "& 0xFFC0" necessary?
>> (Same below.)
>
> Because P010 expects the 10 bits in the 10 most significant bit.
> I'm not 100% sure if the other 6 bits are undefined or 0, but all the
> other implementations treat them as zeroes.

I suggest to remove this.

Carl Eugen
Timo Rothenpieler Sept. 3, 2016, 1:18 p.m. UTC | #8
On 9/3/2016 3:15 PM, Carl Eugen Hoyos wrote:
> 2016-09-03 14:54 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>:
> 
>>>> +                AV_WL16(&dstUV[2*x  ], (t | (t << 8)) & 0xFFC0);
>>>
>>> Why is "& 0xFFC0" necessary?
>>> (Same below.)
>>
>> Because P010 expects the 10 bits in the 10 most significant bit.
>> I'm not 100% sure if the other 6 bits are undefined or 0, but all the
>> other implementations treat them as zeroes.
> 
> I suggest to remove this.

At least https://technet.microsoft.com/pt-br/library/bb970578.aspx
describes the lower 6 bits as set to 0, so leaving them in an undefined
state might have unintended sideeffects.
Carl Eugen Hoyos Sept. 3, 2016, 1:22 p.m. UTC | #9
2016-09-03 15:18 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>:
> On 9/3/2016 3:15 PM, Carl Eugen Hoyos wrote:
>> 2016-09-03 14:54 GMT+02:00 Timo Rothenpieler <timo@rothenpieler.org>:
>>
>>>>> +                AV_WL16(&dstUV[2*x  ], (t | (t << 8)) & 0xFFC0);
>>>>
>>>> Why is "& 0xFFC0" necessary?
>>>> (Same below.)
>>>
>>> Because P010 expects the 10 bits in the 10 most significant bit.
>>> I'm not 100% sure if the other 6 bits are undefined or 0, but all the
>>> other implementations treat them as zeroes.
>>
>> I suggest to remove this.
>
> At least https://technet.microsoft.com/pt-br/library/bb970578.aspx
> describes the lower 6 bits as set to 0

And the very next sentence explicitely contradicts.

> so leaving them in an undefined
> state might have unintended sideeffects.

That is more than unlikely.

Carl Eugen
diff mbox

Patch

diff --git a/libswscale/swscale_unscaled.c b/libswscale/swscale_unscaled.c
index f0b2fbf..bdbedee 100644
--- a/libswscale/swscale_unscaled.c
+++ b/libswscale/swscale_unscaled.c
@@ -33,6 +33,7 @@ 
 #include "libavutil/bswap.h"
 #include "libavutil/pixdesc.h"
 #include "libavutil/avassert.h"
+#include "libavutil/avconfig.h"
 
 DECLARE_ALIGNED(8, static const uint8_t, dithers)[8][8][8]={
 {
@@ -236,6 +237,83 @@  static int planarToP010Wrapper(SwsContext *c, const uint8_t *src8[],
     return srcSliceH;
 }
 
+#if AV_HAVE_BIGENDIAN
+static int planar8ToP010leWrapper(SwsContext *c, const uint8_t *src[],
+                                  int srcStride[], int srcSliceY,
+                                  int srcSliceH, uint8_t *dstParam8[],
+                                  int dstStride[])
+{
+    uint16_t *dstY = (uint16_t*)(dstParam8[0] + dstStride[0] * srcSliceY);
+    uint16_t *dstUV = (uint16_t*)(dstParam8[1] + dstStride[1] * srcSliceY / 2);
+    int x, y, t;
+
+    av_assert0(!(dstStride[0] % 2 || dstStride[1] % 2));
+
+    for (y = 0; y < srcSliceH; y++) {
+        for (x = 0; x < c->srcW; x++) {
+            t = src[0][x];
+            AV_WL16(&dstY[x], (t | (t << 8)) & 0xFFC0);
+        }
+        src[0] += srcStride[0];
+        dstY += dstStride[0] / 2;
+
+        if (!(y & 1)) {
+            for (x = 0; x < c->srcW / 2; x++) {
+                t = src[1][x];
+                AV_WL16(&dstUV[2*x  ], (t | (t << 8)) & 0xFFC0);
+                t = src[2][x];
+                AV_WL16(&dstUV[2*x+1], (t | (t << 8)) & 0xFFC0);
+            }
+            src[1] += srcStride[1];
+            src[2] += srcStride[2];
+            dstUV += dstStride[1] / 2;
+        }
+    }
+
+    return srcSliceH;
+}
+#else
+static int planar8ToP010leWrapper(SwsContext *c, const uint8_t *src[],
+                                  int srcStride[], int srcSliceY,
+                                  int srcSliceH, uint8_t *dstParam8[],
+                                  int dstStride[])
+{
+    uint16_t *dstY = (uint16_t*)(dstParam8[0] + dstStride[0] * srcSliceY);
+    uint16_t *dstUV = (uint16_t*)(dstParam8[1] + dstStride[1] * srcSliceY / 2);
+    int x, y, t;
+
+    av_assert0(!(dstStride[0] % 2 || dstStride[1] % 2));
+
+    for (y = 0; y < srcSliceH; y++) {
+        uint16_t *tdstY = dstY;
+        const uint8_t *tsrc0 = src[0];
+        for (x = c->srcW; x > 0; x--) {
+            t = *tsrc0++;
+            *tdstY++ = (t | (t << 8)) & 0xFFC0;
+        }
+        src[0] += srcStride[0];
+        dstY += dstStride[0] / 2;
+
+        if (!(y & 1)) {
+            uint16_t *tdstUV = dstUV;
+            const uint8_t *tsrc1 = src[1];
+            const uint8_t *tsrc2 = src[2];
+            for (x = c->srcW / 2; x > 0; x--) {
+                t = *tsrc1++;
+                *tdstUV++ = (t | (t << 8)) & 0xFFC0;
+                t = *tsrc2++;
+                *tdstUV++ = (t | (t << 8)) & 0xFFC0;
+            }
+            src[1] += srcStride[1];
+            src[2] += srcStride[2];
+            dstUV += dstStride[1] / 2;
+        }
+    }
+
+    return srcSliceH;
+}
+#endif
+
 static int planarToYuy2Wrapper(SwsContext *c, const uint8_t *src[],
                                int srcStride[], int srcSliceY, int srcSliceH,
                                uint8_t *dstParam[], int dstStride[])
@@ -1644,6 +1722,11 @@  void ff_get_unscaled_swscale(SwsContext *c)
         dstFormat == AV_PIX_FMT_P010) {
         c->swscale = planarToP010Wrapper;
     }
+    /* yuv420p_to_p010le */
+    if ((srcFormat == AV_PIX_FMT_YUV420P || srcFormat == AV_PIX_FMT_YUVA420P) &&
+        dstFormat == AV_PIX_FMT_P010LE) {
+        c->swscale = planar8ToP010leWrapper;
+    }
 
     if (srcFormat == AV_PIX_FMT_YUV410P && !(dstH & 3) &&
         (dstFormat == AV_PIX_FMT_YUV420P || dstFormat == AV_PIX_FMT_YUVA420P) &&
diff --git a/tests/ref/fate/filter-pixdesc-p010le b/tests/ref/fate/filter-pixdesc-p010le
index cac2635..2500604 100644
--- a/tests/ref/fate/filter-pixdesc-p010le
+++ b/tests/ref/fate/filter-pixdesc-p010le
@@ -1 +1 @@ 
-pixdesc-p010le      0268fd44f63022e21ada69704534fc85
+pixdesc-p010le      7b4a503997eb4e14cba80ee52db85e39
diff --git a/tests/ref/fate/filter-pixfmts-copy b/tests/ref/fate/filter-pixfmts-copy
index ce957f7..bcc4475 100644
--- a/tests/ref/fate/filter-pixfmts-copy
+++ b/tests/ref/fate/filter-pixfmts-copy
@@ -36,7 +36,7 @@  monow               54d16d2c01abfd72ecdb5e51e283937c
 nv12                8e24feb2c544dc26a20047a71e4c27aa
 nv21                335d85c9af6110f26ae9e187a82ed2cf
 p010be              7f9842d6015026136bad60d03c035cc3
-p010le              1929db89609c4b8c6d9c9030a9e7843d
+p010le              9ba7bc4611e36b2435eb2dff353b8af5
 pal8                ff5929f5b42075793b2c34cb441bede5
 rgb0                0de71e5a1f97f81fb51397a0435bfa72
 rgb24               f4438057d046e6d98ade4e45294b21be
diff --git a/tests/ref/fate/filter-pixfmts-crop b/tests/ref/fate/filter-pixfmts-crop
index e2c77a8..51c6df9 100644
--- a/tests/ref/fate/filter-pixfmts-crop
+++ b/tests/ref/fate/filter-pixfmts-crop
@@ -34,7 +34,7 @@  gray16le            9ff7c866bd98def4e6c91542c1c45f80
 nv12                92cda427f794374731ec0321ee00caac
 nv21                1bcfc197f4fb95de85ba58182d8d2f69
 p010be              8b2de2eb6b099bbf355bfc55a0694ddc
-p010le              a1e4f713e145dfc465bfe0cc77096a03
+p010le              fa78436272020be0d2569139808429b6
 pal8                1f2cdc8e718f95c875dbc1034a688bfb
 rgb0                736646b70dd9a0be22b8da8041e35035
 rgb24               c5fbbf816bb2000f4d2914e335698ef5
diff --git a/tests/ref/fate/filter-pixfmts-field b/tests/ref/fate/filter-pixfmts-field
index 20c76a1..8232a5c 100644
--- a/tests/ref/fate/filter-pixfmts-field
+++ b/tests/ref/fate/filter-pixfmts-field
@@ -36,7 +36,7 @@  monow               03d783611d265cae78293f88ea126ea1
 nv12                16f7a46708ef25ebd0b72e47920cc11e
 nv21                7294574037cc7f9373ef5695d8ebe809
 p010be              a0311a09bba7383553267d2b3b9c075e
-p010le              f1cc90d292046109a626db2da9f0f9b6
+p010le              634c62ef33b362795339a03907a33137
 pal8                0658c18dcd8d052d59dfbe23f5b368d9
 rgb0                ca3fa6e865b91b3511c7f2bf62830059
 rgb24               25ab271e26a5785be169578d99da5dd0
diff --git a/tests/ref/fate/filter-pixfmts-hflip b/tests/ref/fate/filter-pixfmts-hflip
index 8e902fb..9417eb8 100644
--- a/tests/ref/fate/filter-pixfmts-hflip
+++ b/tests/ref/fate/filter-pixfmts-hflip
@@ -34,7 +34,7 @@  gray16le            d91ce41e304419bcf32ac792f01bd64f
 nv12                801e58f1be5fd0b5bc4bf007c604b0b4
 nv21                9f10dfff8963dc327d3395af21f0554f
 p010be              744b13e44d39e1ff7588983fa03e0101
-p010le              aeb31f50c66f376b0530c7bb6287212b
+p010le              447768b443c5cd8ff591ccb53463f220
 pal8                5b7c77d99817b4f52339742a47de7797
 rgb0                0092452f37d73da20193265ace0b7d57
 rgb24               21571104e6091a689feabb7867e513dd
diff --git a/tests/ref/fate/filter-pixfmts-il b/tests/ref/fate/filter-pixfmts-il
index e568843..2ca37a3 100644
--- a/tests/ref/fate/filter-pixfmts-il
+++ b/tests/ref/fate/filter-pixfmts-il
@@ -36,7 +36,7 @@  monow               6e9cfb8d3a344c5f0c3e1d5e1297e580
 nv12                3c3ba9b1b4c4dfff09c26f71b51dd146
 nv21                ab586d8781246b5a32d8760a61db9797
 p010be              3df51286ef66b53e3e283dbbab582263
-p010le              38945445b360fa737e9e37257393e823
+p010le              7c101300e86f25e5528583ed811f8d25
 rgb0                cfaf68671e43248267d8cd50cae8c13f
 rgb24               88894f608cf33ba310f21996748d77a7
 rgb444be            99d36d814988fb388aacdef575dacfcf
diff --git a/tests/ref/fate/filter-pixfmts-null b/tests/ref/fate/filter-pixfmts-null
index ce957f7..bcc4475 100644
--- a/tests/ref/fate/filter-pixfmts-null
+++ b/tests/ref/fate/filter-pixfmts-null
@@ -36,7 +36,7 @@  monow               54d16d2c01abfd72ecdb5e51e283937c
 nv12                8e24feb2c544dc26a20047a71e4c27aa
 nv21                335d85c9af6110f26ae9e187a82ed2cf
 p010be              7f9842d6015026136bad60d03c035cc3
-p010le              1929db89609c4b8c6d9c9030a9e7843d
+p010le              9ba7bc4611e36b2435eb2dff353b8af5
 pal8                ff5929f5b42075793b2c34cb441bede5
 rgb0                0de71e5a1f97f81fb51397a0435bfa72
 rgb24               f4438057d046e6d98ade4e45294b21be
diff --git a/tests/ref/fate/filter-pixfmts-scale b/tests/ref/fate/filter-pixfmts-scale
index 62ea6fa..6320557 100644
--- a/tests/ref/fate/filter-pixfmts-scale
+++ b/tests/ref/fate/filter-pixfmts-scale
@@ -36,7 +36,7 @@  monow               35c68b86c226d6990b2dcb573a05ff6b
 nv12                b118d24a3653fe66e5d9e079033aef79
 nv21                c74bb1c10dbbdee8a1f682b194486c4d
 p010be              1d6726d94bf1385996a9a9840dd0e878
-p010le              5d436e6b35292a0e356d81f37f989b66
+p010le              4b316f2b9e18972299beb73511278fa8
 pal8                29e10892009b2cfe431815ec3052ed3b
 rgb0                fbd27e98154efb7535826afed41e9bb0
 rgb24               e022e741451e81f2ecce1c7240b93e87
diff --git a/tests/ref/fate/filter-pixfmts-vflip b/tests/ref/fate/filter-pixfmts-vflip
index 9651724..667a1b6 100644
--- a/tests/ref/fate/filter-pixfmts-vflip
+++ b/tests/ref/fate/filter-pixfmts-vflip
@@ -36,7 +36,7 @@  monow               90a947bfcd5f2261e83b577f48ec57b1
 nv12                261ebe585ae2aa4e70d39a10c1679294
 nv21                2909feacd27bebb080c8e0fa41795269
 p010be              06e9354b6e0e38ba41736352cedc0bd5
-p010le              cdf6a3c38d9d4e3f079fa369e1dda662
+p010le              9686439b0d21cae1949d6aebe98b5e88
 pal8                450b0155d0f2d5628bf95a442db5f817
 rgb0                56a7ea69541bcd27bef6a5615784722b
 rgb24               195e6dae1c3a488b9d3ceb7560d25d85