[FFmpeg-devel,v3,2/5] avcodec/v210enc: make 8bit and 10bit function consistent

Message ID 20190901132023.28531-2-lance.lmwang@gmail.com
State New

Commit Message

Lance Wang Sept. 1, 2019, 1:20 p.m. UTC
From: Limin Wang <lance.lmwang@gmail.com>

I have benchmarked the performance with the C code and haven't seen any
performance impact.

Signed-off-by: Limin Wang <lance.lmwang@gmail.com>
---
 libavcodec/v210enc.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

Comments

Michael Niedermayer Sept. 16, 2019, 7:06 p.m. UTC | #1
On Sun, Sep 01, 2019 at 09:20:20PM +0800, lance.lmwang@gmail.com wrote:
> From: Limin Wang <lance.lmwang@gmail.com>
> 
> I have benchmarked the performance with the C code and haven't seen any
> performance impact.
> 
> Signed-off-by: Limin Wang <lance.lmwang@gmail.com>
> ---
>  libavcodec/v210enc.c | 7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/libavcodec/v210enc.c b/libavcodec/v210enc.c
> index 1b840b2..69a2efe 100644
> --- a/libavcodec/v210enc.c
> +++ b/libavcodec/v210enc.c
> @@ -43,12 +43,7 @@ static void v210_planar_pack_8_c(const uint8_t *y, const uint8_t *u,
>      uint32_t val;
>      int i;
>  
> -    /* unroll this to match the assembly */
> -    for (i = 0; i < width - 11; i += 12) {
> -        WRITE_PIXELS(u, y, v, 8);
> -        WRITE_PIXELS(y, u, y, 8);
> -        WRITE_PIXELS(v, y, u, 8);
> -        WRITE_PIXELS(y, v, y, 8);
> +    for (i = 0; i < width - 5; i += 6) {
>          WRITE_PIXELS(u, y, v, 8);
>          WRITE_PIXELS(y, u, y, 8);
>          WRITE_PIXELS(v, y, u, 8);

I have retested this with START/STOP_TIMER, and the more unrolled loop is
consistently faster.

./ffmpeg -cpuflags 0 -v 99 -i matrixbench_mpeg2.mpg -vcodec v210 -an test.avi

 31620 decicycles in TEST, 2096691 runs,    461 skips  0  0  0  0  0  0  0  0  0  0  0 21 13  9  8  7  8  7  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 31509 decicycles in TEST, 2096892 runs,    260 skips  0  0  0  0  0  0  0  0  0  0  0 21 10  9  8  6  7  3  2  0  0  0  0  0  0  0  0  0  0  0  0  0
 32069 decicycles in TEST, 2096965 runs,    187 skips  0  0  0  0  0  0  0  0  0  0  0 21 16 10  8  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 31522 decicycles in TEST, 2096962 runs,    190 skips  0  0  0  0  0  0  0  0  0  0  0 21 10  9  8  6  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 31537 decicycles in TEST, 2096784 runs,    368 skips  0  0  0  0  0  0  0  0  0  0  0 21 12  8  9  7  7  7  0  0  0  0  0  0  0  0  0  0  0  0  0  0

prev:
 30705 decicycles in TEST, 2096875 runs,    277 skips  0  0  0  0  0  0  0  0  0  0  0 21 15  9  9  7  5  3  1  0  0  0  0  0  0  0  0  0  0  0  0  0
 30771 decicycles in TEST, 2096907 runs,    245 skips  0  0  0  0  0  0  0  0  0  0  0 21 15  9  8  6  7  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 30560 decicycles in TEST, 2096904 runs,    248 skips  0  0  0  0  0  0  0  0  0  0  0 21 10  9  9  6  2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 31020 decicycles in TEST, 2096974 runs,    178 skips  0  0  0  0  0  0  0  0  0  0  0 21 16  9  8  6  2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 31018 decicycles in TEST, 2096980 runs,    172 skips  0  0  0  0  0  0  0  0  0  0  0 21 16  9  8  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
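
To put the two result blocks side by side, the readings can be averaged and compared. The following is only a rough sketch, not part of the patch: the helper names `mean_decicycles` and `slowdown_percent` and the array names are hypothetical, the numbers are copied from the timer lines above, and it assumes the first block is the patched 6-pixel loop and "prev" is the existing 12-pixel loop. It also assumes a decicycle is a tenth of a CPU cycle.

```c
#include <assert.h>
#include <stddef.h>

/* Average of n timer readings, in decicycles (assumed: tenths of a cycle). */
static double mean_decicycles(const int *samples, size_t n)
{
    double sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}

/* How much slower `candidate` is than `baseline`, in percent. */
static double slowdown_percent(double baseline, double candidate)
{
    return (candidate - baseline) / baseline * 100.0;
}

/* Readings copied from the two blocks above (hypothetical array names). */
static const int patched_6px[5]   = { 31620, 31509, 32069, 31522, 31537 };
static const int previous_12px[5] = { 30705, 30771, 30560, 31020, 31018 };
```

Calling `slowdown_percent(mean_decicycles(previous_12px, 5), mean_decicycles(patched_6px, 5))` gives roughly 2.7, i.e. the less unrolled loop averages about 2.7% more decicycles per call in these five readings.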

[...]
Lance Wang Sept. 17, 2019, 1:27 a.m. UTC | #2
On Mon, Sep 16, 2019 at 09:06:06PM +0200, Michael Niedermayer wrote:
> I have retested this with START/STOP_TIMER
> and the more unrolled loop is consistently faster

Sorry, I hadn't used START/STOP_TIMER before, so I only did a quick check with -benchmark.
Since the more unrolled loop is faster, we can't make the two functions consistent, so I'll
update the patch set and drop patch #2 and patch #3.


Lance Wang Sept. 17, 2019, 10:15 a.m. UTC | #3

Michael, I have posted v4 of the patch set for review; the old patches #2 and #3 are dropped because of the performance difference.


Patch

diff --git a/libavcodec/v210enc.c b/libavcodec/v210enc.c
index 1b840b2..69a2efe 100644
--- a/libavcodec/v210enc.c
+++ b/libavcodec/v210enc.c
@@ -43,12 +43,7 @@  static void v210_planar_pack_8_c(const uint8_t *y, const uint8_t *u,
     uint32_t val;
     int i;
 
-    /* unroll this to match the assembly */
-    for (i = 0; i < width - 11; i += 12) {
-        WRITE_PIXELS(u, y, v, 8);
-        WRITE_PIXELS(y, u, y, 8);
-        WRITE_PIXELS(v, y, u, 8);
-        WRITE_PIXELS(y, v, y, 8);
+    for (i = 0; i < width - 5; i += 6) {
         WRITE_PIXELS(u, y, v, 8);
         WRITE_PIXELS(y, u, y, 8);
         WRITE_PIXELS(v, y, u, 8);
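
For readers who want to see what the two loop shapes in this hunk do, here is a minimal standalone sketch. It is not the FFmpeg code: `clip8`, `write_word`, `pack_group`, `pack_unroll6` and `pack_unroll12` are hypothetical names, the clamp range and bit positions are assumptions about what WRITE_PIXELS does for depth 8, and the 6-pixel remainder step in `pack_unroll12` is added here only so both variants cover the same width. It shows that the two shapes should produce identical bytes, so the difference discussed in this thread is purely one of speed.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp an 8-bit sample to [1, 254] (assumed to match CLIP for depth 8). */
static uint32_t clip8(uint8_t v)
{
    return v < 1 ? 1u : v > 254 ? 254u : v;
}

/* Store three 8-bit samples as 10-bit fields of one little-endian word. */
static uint8_t *write_word(uint8_t *dst, uint8_t a, uint8_t b, uint8_t c)
{
    uint32_t val = (clip8(a) << 2) | (clip8(b) << 12) | (clip8(c) << 22);

    dst[0] = val & 0xff;
    dst[1] = (val >> 8) & 0xff;
    dst[2] = (val >> 16) & 0xff;
    dst[3] = val >> 24;
    return dst + 4;
}

/* Pack one group of 6 pixels (6 Y, 3 U, 3 V) into 16 bytes. */
static uint8_t *pack_group(uint8_t *dst, const uint8_t **y,
                           const uint8_t **u, const uint8_t **v)
{
    dst = write_word(dst, (*u)[0], (*y)[0], (*v)[0]); /* u, y, v */
    dst = write_word(dst, (*y)[1], (*u)[1], (*y)[2]); /* y, u, y */
    dst = write_word(dst, (*v)[1], (*y)[3], (*u)[2]); /* v, y, u */
    dst = write_word(dst, (*y)[4], (*v)[2], (*y)[5]); /* y, v, y */
    *y += 6;
    *u += 3;
    *v += 3;
    return dst;
}

/* One group per iteration: the loop shape this patch proposes. */
static void pack_unroll6(uint8_t *dst, const uint8_t *y, const uint8_t *u,
                         const uint8_t *v, int width)
{
    for (int i = 0; i < width - 5; i += 6)
        dst = pack_group(dst, &y, &u, &v);
}

/* Two groups per iteration: the loop shape this patch removes. */
static void pack_unroll12(uint8_t *dst, const uint8_t *y, const uint8_t *u,
                          const uint8_t *v, int width)
{
    int i;

    for (i = 0; i < width - 11; i += 12) {
        dst = pack_group(dst, &y, &u, &v);
        dst = pack_group(dst, &y, &u, &v);
    }
    /* Sketch-only remainder: at most one full 6-pixel group is left. */
    if (i < width - 5)
        pack_group(dst, &y, &u, &v);
}
```

Both functions expect `width` to be a multiple of 6 and `dst` to hold `width / 6 * 16` bytes; pixels beyond the last full group are ignored in this sketch.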