
[FFmpeg-devel,v2] avcodec/mips: [loongson] reoptimize h264_chroma_mc8_mmi v2.

Message ID 1535610244-8195-1-git-send-email-yinshiyou-hf@loongson.cn
State Superseded
Commit f91237baf6f7c11a84075b1cea1f0dc2e1c70cff

Commit Message

Shiyou Yin Aug. 30, 2018, 6:24 a.m. UTC
Reoptimize functions ff_put_h264_chroma_mc8_mmi and ff_avg_h264_chroma_mc8_mmi.
Performance of H.264 decoding improved by about 5% (from 69 fps to 73 fps, tested on Loongson 3A3000).

Change-Id: Iccd7f4e480b2d0bfc47e4d409874c4adb77416cc
---
 libavcodec/mips/h264chroma_mmi.c | 744 ++++++++++++++++++++++++---------------
 1 file changed, 455 insertions(+), 289 deletions(-)
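For context, both functions compute the H.264 chroma bilinear interpolation whose weights appear in the removed lines of the patch: A = (8 - x)(8 - y), B = x(8 - y), C = (8 - x)y and D = xy. The weights sum to 64, so the result is rounded by adding 32 and shifting right by 6. The scalar sketch below is a hypothetical reference, not FFmpeg's own C code: the name `chroma_mc8_ref` and its `avg` flag are illustrative. The averaging rule `(dst + v + 1) >> 1` is assumed to match the generic avg variant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for 8-pixel-wide H.264 chroma MC.
 * x, y: eighth-pel fractional offsets in [0, 7].
 * avg == 0 gives the "put" variant, avg != 0 the "avg" variant.
 * Assumes src has h + 1 readable rows of at least 9 bytes, because the
 * filter touches the right and lower neighbours even at weight zero. */
static void chroma_mc8_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride,
                           int h, int x, int y, int avg)
{
    const int A = (8 - x) * (8 - y);
    const int B = x * (8 - y);
    const int C = (8 - x) * y;
    const int D = x * y;

    assert(x >= 0 && x < 8 && y >= 0 && y < 8);

    for (int i = 0; i < h; i++) {
        for (int j = 0; j < 8; j++) {
            /* weights sum to 64, so (+32) >> 6 rounds to nearest */
            int v = (A * src[j] + B * src[j + 1] +
                     C * src[j + stride] + D * src[j + stride + 1] + 32) >> 6;
            /* avg: round-up average with the existing destination pixel */
            dst[j] = avg ? (dst[j] + v + 1) >> 1 : v;
        }
        dst += stride;
        src += stride;
    }
}
```

A reference like this can be compared byte-for-byte against the MMI versions for every (x, y) pair. That includes the x == 0, y == 0 path, where the patch's shift, add and shift sequence reduces to a plain copy because (64p + 32) >> 6 == p.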

Comments

Michael Niedermayer Aug. 30, 2018, 9:46 p.m. UTC | #1
On Thu, Aug 30, 2018 at 02:24:04PM +0800, Shiyou Yin wrote:
> Reoptimize functions ff_put_h264_chroma_mc8_mmi and ff_avg_h264_chroma_mc8_mmi.
> Performance of H.264 decoding improved by about 5% (from 69 fps to 73 fps, tested on Loongson 3A3000).

What do you mean by "Reoptimize"?
Does this port some optimizations from elsewhere?
Or does this take the same code as the previous optimizations did and reimplement
(better/faster) MIPS code based on it?

What is the speed difference?


> 
> Change-Id: Iccd7f4e480b2d0bfc47e4d409874c4adb77416cc

What is this?


[...]
Shiyou Yin Aug. 31, 2018, 2:30 a.m. UTC | #2
>-----Original Message-----
>From: ffmpeg-devel-bounces@ffmpeg.org [mailto:ffmpeg-devel-bounces@ffmpeg.org] On Behalf Of
>Michael Niedermayer
>Sent: Friday, August 31, 2018 5:47 AM
>To: FFmpeg development discussions and patches
>Subject: Re: [FFmpeg-devel] [PATCH v2] avcodec/mips: [loongson] reoptimize h264_chroma_mc8_mmi v2.
>
>On Thu, Aug 30, 2018 at 02:24:04PM +0800, Shiyou Yin wrote:
>> Reoptimize functions ff_put_h264_chroma_mc8_mmi and ff_avg_h264_chroma_mc8_mmi.
>> Performance of H.264 decoding improved by about 5% (from 69 fps to 73 fps, tested on Loongson 3A3000).
>
>What do you mean by "Reoptimize"?
>Does this port some optimizations from elsewhere?
>Or does this take the same code as the previous optimizations did and reimplement
>(better/faster) MIPS code based on it?
>
>What is the speed difference?

Both functions had already been optimized with MMI.
This patch is based on the previous version; it optimizes the branch conditions and the code
inside each branch.
This patch speeds up H.264 decoding by about 5% on the Loongson platform.

>>
>> Change-Id: Iccd7f4e480b2d0bfc47e4d409874c4adb77416cc
>
>What is this?

The original patch was made from Loongson's local repository (we use Gerrit to manage patch review
and merging), where each commit has its own Change-Id.
When I used git am to apply this patch to the FFmpeg source, the Change-Id was kept in the commit
message.
I will remove it in the next version.
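The reworked branch structure described in this reply can be sketched in scalar C. Instead of computing all four products up front, the new code tests x and y first, derives the weights from D = x * y with shifts and subtractions, and handles the one-dimensional cases with a two-tap filter, so it only loads the extra column (y == 0) or the extra row (x == 0) it needs. `struct chroma_weights` and `patch_weights` are hypothetical names used only for illustration; checking all 64 (x, y) pairs against A = (8 - x)(8 - y), B = x(8 - y), C = (8 - x)y, D = xy is a quick way to confirm the rewrite is exact.

```c
#include <assert.h>

/* Hypothetical container for the four bilinear chroma MC weights. */
struct chroma_weights {
    int A, B, C, D;
};

/* Mirrors the branch order of the reworked MMI code: test x and y
 * first, then derive only the weights each case needs. */
static struct chroma_weights patch_weights(int x, int y)
{
    struct chroma_weights w = { 64, 0, 0, 0 };

    assert(x >= 0 && x < 8 && y >= 0 && y < 8);

    if (!(x || y)) {
        /* x == 0, y == 0: A = 64, the filter degenerates to a copy */
    } else if (x && y) {
        /* both offsets nonzero: full 4-tap filter */
        w.D = x * y;
        w.B = (x << 3) - w.D;        /* == x * (8 - y)       */
        w.C = (y << 3) - w.D;        /* == (8 - x) * y       */
        w.A = 64 - w.D - w.B - w.C;  /* == (8 - x) * (8 - y) */
    } else if (x) {
        /* y == 0: horizontal 2-tap filter (the patch names this weight E) */
        w.B = x << 3;
        w.A = 64 - w.B;
    } else {
        /* x == 0: vertical 2-tap filter (the patch names this weight E) */
        w.C = y << 3;
        w.A = 64 - w.C;
    }
    return w;
}
```

In the two-tap cases the patch keeps the single nonzero secondary weight in a variable called E; the sketch maps it back onto B or C so the result can be compared directly with the four-weight form.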
Michael Niedermayer Aug. 31, 2018, 11:03 a.m. UTC | #3
On Fri, Aug 31, 2018 at 10:30:21AM +0800, Shiyou Yin wrote:
> >-----Original Message-----
> >From: ffmpeg-devel-bounces@ffmpeg.org [mailto:ffmpeg-devel-bounces@ffmpeg.org] On Behalf Of
> >Michael Niedermayer
> >Sent: Friday, August 31, 2018 5:47 AM
> >To: FFmpeg development discussions and patches
> >Subject: Re: [FFmpeg-devel] [PATCH v2] avcodec/mips: [loongson] reoptimize h264_chroma_mc8_mmi v2.
> >
> >On Thu, Aug 30, 2018 at 02:24:04PM +0800, Shiyou Yin wrote:
> >> Reoptimize functions ff_put_h264_chroma_mc8_mmi and ff_avg_h264_chroma_mc8_mmi.
> >> Performance of H.264 decoding improved by about 5% (from 69 fps to 73 fps, tested on Loongson 3A3000).
> >
> >What do you mean by "Reoptimize"?
> >Does this port some optimizations from elsewhere?
> >Or does this take the same code as the previous optimizations did and reimplement
> >(better/faster) MIPS code based on it?
> >
> >What is the speed difference?
> 
> Both functions had already been optimized with MMI.
> This patch is based on the previous version; it optimizes the branch conditions and the code
> inside each branch.
> This patch speeds up H.264 decoding by about 5% on the Loongson platform.
> 

> >>
> >> Change-Id: Iccd7f4e480b2d0bfc47e4d409874c4adb77416cc
> >
> >What is this?
> 
> The original patch was made from Loongson's local repository (we use Gerrit to manage patch review
> and merging), where each commit has its own Change-Id.
> When I used git am to apply this patch to the FFmpeg source, the Change-Id was kept in the commit
> message.
> I will remove it in the next version.

You can use git notes to keep track of internal Change-Ids, but they
should not be in public commits, as no one except you can do anything with them.

[...]

Patch

diff --git a/libavcodec/mips/h264chroma_mmi.c b/libavcodec/mips/h264chroma_mmi.c
index bafe0f9..91b2cc4 100644
--- a/libavcodec/mips/h264chroma_mmi.c
+++ b/libavcodec/mips/h264chroma_mmi.c
@@ -29,326 +29,322 @@ 
 void ff_put_h264_chroma_mc8_mmi(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
         int h, int x, int y)
 {
-    const int A = (8 - x) * (8 - y);
-    const int B = x * (8 - y);
-    const int C = (8 - x) * y;
-    const int D = x * y;
-    const int E = B + C;
+    int A = 64, B, C, D, E;
     double ftmp[10];
     uint64_t tmp[1];
-    mips_reg addr[1];
-    DECLARE_VAR_ALL64;
 
-    if (D) {
+    if (!(x || y)) {
+        /* x=0, y=0, A=64 */
         __asm__ volatile (
-            "xor        %[ftmp0],   %[ftmp0],       %[ftmp0]            \n\t"
-            "dli        %[tmp0],    0x06                                \n\t"
-            "pshufh     %[A],       %[A],           %[ftmp0]            \n\t"
-            "pshufh     %[B],       %[B],           %[ftmp0]            \n\t"
-            "mtc1       %[tmp0],    %[ftmp9]                            \n\t"
-            "pshufh     %[C],       %[C],           %[ftmp0]            \n\t"
-            "pshufh     %[D],       %[D],           %[ftmp0]            \n\t"
+            "xor        %[ftmp0],   %[ftmp0],       %[ftmp0]           \n\t"
+            "dli        %[tmp0],    0x06                               \n\t"
+            "mtc1       %[tmp0],    %[ftmp4]                           \n\t"
 
-            "1:                                                         \n\t"
-            PTR_ADDU   "%[addr0],   %[src],         %[stride]           \n\t"
+            "1:                                                        \n\t"
             MMI_ULDC1(%[ftmp1], %[src], 0x00)
-            MMI_ULDC1(%[ftmp2], %[src], 0x01)
-            MMI_ULDC1(%[ftmp3], %[addr0], 0x00)
-            MMI_ULDC1(%[ftmp4], %[addr0], 0x01)
-
-            "punpcklbh  %[ftmp5],   %[ftmp1],       %[ftmp0]            \n\t"
-            "punpckhbh  %[ftmp6],   %[ftmp1],       %[ftmp0]            \n\t"
-            "punpcklbh  %[ftmp7],   %[ftmp2],       %[ftmp0]            \n\t"
-            "punpckhbh  %[ftmp8],   %[ftmp2],       %[ftmp0]            \n\t"
-            "pmullh     %[ftmp5],   %[ftmp5],       %[A]                \n\t"
-            "pmullh     %[ftmp7],   %[ftmp7],       %[B]                \n\t"
-            "paddh      %[ftmp1],   %[ftmp5],       %[ftmp7]            \n\t"
-            "pmullh     %[ftmp6],   %[ftmp6],       %[A]                \n\t"
-            "pmullh     %[ftmp8],   %[ftmp8],       %[B]                \n\t"
-            "paddh      %[ftmp2],   %[ftmp6],       %[ftmp8]            \n\t"
-
-            "punpcklbh  %[ftmp5],   %[ftmp3],       %[ftmp0]            \n\t"
-            "punpckhbh  %[ftmp6],   %[ftmp3],       %[ftmp0]            \n\t"
-            "punpcklbh  %[ftmp7],   %[ftmp4],       %[ftmp0]            \n\t"
-            "punpckhbh  %[ftmp8],   %[ftmp4],       %[ftmp0]            \n\t"
-            "pmullh     %[ftmp5],   %[ftmp5],       %[C]                \n\t"
-            "pmullh     %[ftmp7],   %[ftmp7],       %[D]                \n\t"
-            "paddh      %[ftmp3],   %[ftmp5],       %[ftmp7]            \n\t"
-            "pmullh     %[ftmp6],   %[ftmp6],       %[C]                \n\t"
-            "pmullh     %[ftmp8],   %[ftmp8],       %[D]                \n\t"
-            "paddh      %[ftmp4],   %[ftmp6],       %[ftmp8]            \n\t"
-
-            "paddh      %[ftmp1],   %[ftmp1],       %[ftmp3]            \n\t"
-            "paddh      %[ftmp2],   %[ftmp2],       %[ftmp4]            \n\t"
-            "paddh      %[ftmp1],   %[ftmp1],       %[ff_pw_32]         \n\t"
-            "paddh      %[ftmp2],   %[ftmp2],       %[ff_pw_32]         \n\t"
-            "psrlh      %[ftmp1],   %[ftmp1],       %[ftmp9]            \n\t"
-            "psrlh      %[ftmp2],   %[ftmp2],       %[ftmp9]            \n\t"
-            "packushb   %[ftmp1],   %[ftmp1],       %[ftmp2]            \n\t"
-            "addi       %[h],       %[h],           -0x01               \n\t"
+            "addi       %[h],       %[h],           -0x04              \n\t"
+            PTR_ADDU   "%[src],     %[src],         %[stride]          \n\t"
+            MMI_ULDC1(%[ftmp5], %[src], 0x00)
+            PTR_ADDU   "%[src],     %[src],         %[stride]          \n\t"
+            MMI_ULDC1(%[ftmp6], %[src], 0x00)
+            PTR_ADDU   "%[src],     %[src],         %[stride]          \n\t"
+            MMI_ULDC1(%[ftmp7], %[src], 0x00)
+
+            "punpcklbh  %[ftmp2],   %[ftmp1],       %[ftmp0]           \n\t"
+            "punpckhbh  %[ftmp3],   %[ftmp1],       %[ftmp0]           \n\t"
+            "psllh      %[ftmp1],   %[ftmp2],       %[ftmp4]           \n\t"
+            "psllh      %[ftmp2],   %[ftmp3],       %[ftmp4]           \n\t"
+            "paddh      %[ftmp1],   %[ftmp1],       %[ff_pw_32]        \n\t"
+            "paddh      %[ftmp2],   %[ftmp2],       %[ff_pw_32]        \n\t"
+            "psrlh      %[ftmp1],   %[ftmp1],       %[ftmp4]           \n\t"
+            "psrlh      %[ftmp2],   %[ftmp2],       %[ftmp4]           \n\t"
+            "packushb   %[ftmp1],   %[ftmp1],       %[ftmp2]           \n\t"
             MMI_SDC1(%[ftmp1], %[dst], 0x00)
-            PTR_ADDU   "%[src],     %[src],         %[stride]           \n\t"
-            PTR_ADDU   "%[dst],     %[dst],         %[stride]           \n\t"
-            "bnez       %[h],       1b                                  \n\t"
-            : [ftmp0]"=&f"(ftmp[0]),        [ftmp1]"=&f"(ftmp[1]),
-              [ftmp2]"=&f"(ftmp[2]),        [ftmp3]"=&f"(ftmp[3]),
-              [ftmp4]"=&f"(ftmp[4]),        [ftmp5]"=&f"(ftmp[5]),
-              [ftmp6]"=&f"(ftmp[6]),        [ftmp7]"=&f"(ftmp[7]),
-              [ftmp8]"=&f"(ftmp[8]),        [ftmp9]"=&f"(ftmp[9]),
-              [tmp0]"=&r"(tmp[0]),
-              RESTRICT_ASM_ALL64
-              [addr0]"=&r"(addr[0]),
-              [dst]"+&r"(dst),              [src]"+&r"(src),
-              [h]"+&r"(h)
-            : [stride]"r"((mips_reg)stride),[ff_pw_32]"f"(ff_pw_32),
-              [A]"f"(A),                    [B]"f"(B),
-              [C]"f"(C),                    [D]"f"(D)
-            : "memory"
-        );
-    } else if (E) {
-        const int step = C ? stride : 1;
-
-        __asm__ volatile (
-            "xor        %[ftmp0],   %[ftmp0],       %[ftmp0]            \n\t"
-            "dli        %[tmp0],    0x06                                \n\t"
-            "pshufh     %[A],       %[A],           %[ftmp0]            \n\t"
-            "pshufh     %[E],       %[E],           %[ftmp0]            \n\t"
-            "mtc1       %[tmp0],    %[ftmp7]                            \n\t"
-
-            "1:                                                         \n\t"
-            PTR_ADDU   "%[addr0],   %[src],         %[step]             \n\t"
-            MMI_ULDC1(%[ftmp1], %[src], 0x00)
-            MMI_ULDC1(%[ftmp2], %[addr0], 0x00)
 
-            "punpcklbh  %[ftmp3],   %[ftmp1],       %[ftmp0]            \n\t"
-            "punpckhbh  %[ftmp4],   %[ftmp1],       %[ftmp0]            \n\t"
-            "punpcklbh  %[ftmp5],   %[ftmp2],       %[ftmp0]            \n\t"
-            "punpckhbh  %[ftmp6],   %[ftmp2],       %[ftmp0]            \n\t"
-            "pmullh     %[ftmp3],   %[ftmp3],       %[A]                \n\t"
-            "pmullh     %[ftmp5],   %[ftmp5],       %[E]                \n\t"
-            "paddh      %[ftmp1],   %[ftmp3],       %[ftmp5]            \n\t"
-            "pmullh     %[ftmp4],   %[ftmp4],       %[A]                \n\t"
-            "pmullh     %[ftmp6],   %[ftmp6],       %[E]                \n\t"
-            "paddh      %[ftmp2],   %[ftmp4],       %[ftmp6]            \n\t"
-
-            "paddh      %[ftmp1],   %[ftmp1],       %[ff_pw_32]         \n\t"
-            "paddh      %[ftmp2],   %[ftmp2],       %[ff_pw_32]         \n\t"
-            "psrlh      %[ftmp1],   %[ftmp1],       %[ftmp7]            \n\t"
-            "psrlh      %[ftmp2],   %[ftmp2],       %[ftmp7]            \n\t"
-            "packushb   %[ftmp1],   %[ftmp1],       %[ftmp2]            \n\t"
-            "addi       %[h],       %[h],           -0x01               \n\t"
+            "punpcklbh  %[ftmp2],   %[ftmp5],       %[ftmp0]           \n\t"
+            "punpckhbh  %[ftmp3],   %[ftmp5],       %[ftmp0]           \n\t"
+            "psllh      %[ftmp1],   %[ftmp2],       %[ftmp4]           \n\t"
+            "psllh      %[ftmp2],   %[ftmp3],       %[ftmp4]           \n\t"
+            "paddh      %[ftmp1],   %[ftmp1],       %[ff_pw_32]        \n\t"
+            "paddh      %[ftmp2],   %[ftmp2],       %[ff_pw_32]        \n\t"
+            "psrlh      %[ftmp1],   %[ftmp1],       %[ftmp4]           \n\t"
+            "psrlh      %[ftmp2],   %[ftmp2],       %[ftmp4]           \n\t"
+            "packushb   %[ftmp1],   %[ftmp1],       %[ftmp2]           \n\t"
+            PTR_ADDU   "%[dst],     %[dst],         %[stride]          \n\t"
             MMI_SDC1(%[ftmp1], %[dst], 0x00)
-            PTR_ADDU   "%[src],     %[src],         %[stride]           \n\t"
-            PTR_ADDU   "%[dst],     %[dst],         %[stride]           \n\t"
-            "bnez       %[h],       1b                                  \n\t"
-            : [ftmp0]"=&f"(ftmp[0]),        [ftmp1]"=&f"(ftmp[1]),
-              [ftmp2]"=&f"(ftmp[2]),        [ftmp3]"=&f"(ftmp[3]),
-              [ftmp4]"=&f"(ftmp[4]),        [ftmp5]"=&f"(ftmp[5]),
-              [ftmp6]"=&f"(ftmp[6]),        [ftmp7]"=&f"(ftmp[7]),
-              [tmp0]"=&r"(tmp[0]),
-              RESTRICT_ASM_ALL64
-              [addr0]"=&r"(addr[0]),
-              [dst]"+&r"(dst),              [src]"+&r"(src),
-              [h]"+&r"(h)
-            : [stride]"r"((mips_reg)stride),[step]"r"((mips_reg)step),
-              [ff_pw_32]"f"(ff_pw_32),
-              [A]"f"(A),                    [E]"f"(E)
-            : "memory"
-        );
-    } else {
-        __asm__ volatile (
-            "xor        %[ftmp0],   %[ftmp0],       %[ftmp0]            \n\t"
-            "dli        %[tmp0],    0x06                                \n\t"
-            "pshufh     %[A],       %[A],           %[ftmp0]            \n\t"
-            "mtc1       %[tmp0],    %[ftmp4]                            \n\t"
 
-            "1:                                                         \n\t"
-            MMI_ULDC1(%[ftmp1], %[src], 0x00)
-            "punpcklbh  %[ftmp2],   %[ftmp1],       %[ftmp0]            \n\t"
-            "punpckhbh  %[ftmp3],   %[ftmp1],       %[ftmp0]            \n\t"
-            "pmullh     %[ftmp1],   %[ftmp2],       %[A]                \n\t"
-            "pmullh     %[ftmp2],   %[ftmp3],       %[A]                \n\t"
-            "paddh      %[ftmp1],   %[ftmp1],       %[ff_pw_32]         \n\t"
-            "paddh      %[ftmp2],   %[ftmp2],       %[ff_pw_32]         \n\t"
-            "psrlh      %[ftmp1],   %[ftmp1],       %[ftmp4]            \n\t"
-            "psrlh      %[ftmp2],   %[ftmp2],       %[ftmp4]            \n\t"
-            "packushb   %[ftmp1],   %[ftmp1],       %[ftmp2]            \n\t"
-            PTR_ADDU   "%[src],     %[src],         %[stride]           \n\t"
+            "punpcklbh  %[ftmp2],   %[ftmp6],       %[ftmp0]           \n\t"
+            "punpckhbh  %[ftmp3],   %[ftmp6],       %[ftmp0]           \n\t"
+            "psllh      %[ftmp1],   %[ftmp2],       %[ftmp4]           \n\t"
+            "psllh      %[ftmp2],   %[ftmp3],       %[ftmp4]           \n\t"
+            "paddh      %[ftmp1],   %[ftmp1],       %[ff_pw_32]        \n\t"
+            "paddh      %[ftmp2],   %[ftmp2],       %[ff_pw_32]        \n\t"
+            "psrlh      %[ftmp1],   %[ftmp1],       %[ftmp4]           \n\t"
+            "psrlh      %[ftmp2],   %[ftmp2],       %[ftmp4]           \n\t"
+            "packushb   %[ftmp1],   %[ftmp1],       %[ftmp2]           \n\t"
+            PTR_ADDU   "%[dst],     %[dst],         %[stride]          \n\t"
             MMI_SDC1(%[ftmp1], %[dst], 0x00)
 
-            PTR_ADDU   "%[dst],     %[dst],         %[stride]           \n\t"
-            MMI_ULDC1(%[ftmp1], %[src], 0x00)
-            "punpcklbh  %[ftmp2],   %[ftmp1],       %[ftmp0]            \n\t"
-            "punpckhbh  %[ftmp3],   %[ftmp1],       %[ftmp0]            \n\t"
-            "pmullh     %[ftmp1],   %[ftmp2],       %[A]                \n\t"
-            "pmullh     %[ftmp2],   %[ftmp3],       %[A]                \n\t"
-            "paddh      %[ftmp1],   %[ftmp1],       %[ff_pw_32]         \n\t"
-            "paddh      %[ftmp2],   %[ftmp2],       %[ff_pw_32]         \n\t"
-            "psrlh      %[ftmp1],   %[ftmp1],       %[ftmp4]            \n\t"
-            "psrlh      %[ftmp2],   %[ftmp2],       %[ftmp4]            \n\t"
-            "packushb   %[ftmp1],   %[ftmp1],       %[ftmp2]            \n\t"
-            "addi       %[h],       %[h],           -0x02               \n\t"
+            "punpcklbh  %[ftmp2],   %[ftmp7],       %[ftmp0]           \n\t"
+            "punpckhbh  %[ftmp3],   %[ftmp7],       %[ftmp0]           \n\t"
+            "psllh      %[ftmp1],   %[ftmp2],       %[ftmp4]           \n\t"
+            "psllh      %[ftmp2],   %[ftmp3],       %[ftmp4]           \n\t"
+            "paddh      %[ftmp1],   %[ftmp1],       %[ff_pw_32]        \n\t"
+            "paddh      %[ftmp2],   %[ftmp2],       %[ff_pw_32]        \n\t"
+            "psrlh      %[ftmp1],   %[ftmp1],       %[ftmp4]           \n\t"
+            "psrlh      %[ftmp2],   %[ftmp2],       %[ftmp4]           \n\t"
+            "packushb   %[ftmp1],   %[ftmp1],       %[ftmp2]           \n\t"
+            PTR_ADDU   "%[dst],     %[dst],         %[stride]          \n\t"
             MMI_SDC1(%[ftmp1], %[dst], 0x00)
 
-            PTR_ADDU   "%[src],     %[src],         %[stride]           \n\t"
-            PTR_ADDU   "%[dst],     %[dst],         %[stride]           \n\t"
-            "bnez       %[h],       1b                                  \n\t"
+            PTR_ADDU   "%[src],     %[src],         %[stride]          \n\t"
+            PTR_ADDU   "%[dst],     %[dst],         %[stride]          \n\t"
+            "bnez       %[h],       1b                                 \n\t"
             : [ftmp0]"=&f"(ftmp[0]),        [ftmp1]"=&f"(ftmp[1]),
               [ftmp2]"=&f"(ftmp[2]),        [ftmp3]"=&f"(ftmp[3]),
-              [ftmp4]"=&f"(ftmp[4]),
+              [ftmp4]"=&f"(ftmp[4]),        [ftmp5]"=&f"(ftmp[5]),
+              [ftmp6]"=&f"(ftmp[6]),        [ftmp7]"=&f"(ftmp[7]),
               [tmp0]"=&r"(tmp[0]),
-              RESTRICT_ASM_ALL64
               [dst]"+&r"(dst),              [src]"+&r"(src),
               [h]"+&r"(h)
-            : [stride]"r"((mips_reg)stride),[ff_pw_32]"f"(ff_pw_32),
-              [A]"f"(A)
+            : [stride]"r"((mips_reg)stride),[ff_pw_32]"f"(ff_pw_32)
             : "memory"
         );
+    } else {
+        if (x && y) {
+            /* x!=0, y!=0 */
+            D = x * y;
+            B = (x << 3) - D;
+            C = (y << 3) - D;
+            A = 64 - D - B - C;
+
+            __asm__ volatile (
+                "xor        %[ftmp0],   %[ftmp0],       %[ftmp0]           \n\t"
+                "dli        %[tmp0],    0x06                               \n\t"
+                "pshufh     %[A],       %[A],           %[ftmp0]           \n\t"
+                "pshufh     %[B],       %[B],           %[ftmp0]           \n\t"
+                "mtc1       %[tmp0],    %[ftmp9]                           \n\t"
+                "pshufh     %[C],       %[C],           %[ftmp0]           \n\t"
+                "pshufh     %[D],       %[D],           %[ftmp0]           \n\t"
+
+                "1:                                                        \n\t"
+                MMI_ULDC1(%[ftmp1], %[src], 0x00)
+                MMI_ULDC1(%[ftmp2], %[src], 0x01)
+                PTR_ADDU   "%[src],     %[src],         %[stride]          \n\t"
+                MMI_ULDC1(%[ftmp3], %[src], 0x00)
+                MMI_ULDC1(%[ftmp4], %[src], 0x01)
+                "addi       %[h],       %[h],           -0x02              \n\t"
+
+                "punpcklbh  %[ftmp5],   %[ftmp1],       %[ftmp0]           \n\t"
+                "punpckhbh  %[ftmp6],   %[ftmp1],       %[ftmp0]           \n\t"
+                "punpcklbh  %[ftmp7],   %[ftmp2],       %[ftmp0]           \n\t"
+                "punpckhbh  %[ftmp8],   %[ftmp2],       %[ftmp0]           \n\t"
+                "pmullh     %[ftmp5],   %[ftmp5],       %[A]               \n\t"
+                "pmullh     %[ftmp7],   %[ftmp7],       %[B]               \n\t"
+                "paddh      %[ftmp1],   %[ftmp5],       %[ftmp7]           \n\t"
+                "pmullh     %[ftmp6],   %[ftmp6],       %[A]               \n\t"
+                "pmullh     %[ftmp8],   %[ftmp8],       %[B]               \n\t"
+                "paddh      %[ftmp2],   %[ftmp6],       %[ftmp8]           \n\t"
+
+                "punpcklbh  %[ftmp5],   %[ftmp3],       %[ftmp0]           \n\t"
+                "punpckhbh  %[ftmp6],   %[ftmp3],       %[ftmp0]           \n\t"
+                "punpcklbh  %[ftmp7],   %[ftmp4],       %[ftmp0]           \n\t"
+                "punpckhbh  %[ftmp8],   %[ftmp4],       %[ftmp0]           \n\t"
+                "pmullh     %[ftmp5],   %[ftmp5],       %[C]               \n\t"
+                "pmullh     %[ftmp7],   %[ftmp7],       %[D]               \n\t"
+                "paddh      %[ftmp3],   %[ftmp5],       %[ftmp7]           \n\t"
+                "pmullh     %[ftmp6],   %[ftmp6],       %[C]               \n\t"
+                "pmullh     %[ftmp8],   %[ftmp8],       %[D]               \n\t"
+                "paddh      %[ftmp4],   %[ftmp6],       %[ftmp8]           \n\t"
+
+                "paddh      %[ftmp1],   %[ftmp1],       %[ftmp3]           \n\t"
+                "paddh      %[ftmp2],   %[ftmp2],       %[ftmp4]           \n\t"
+                "paddh      %[ftmp1],   %[ftmp1],       %[ff_pw_32]        \n\t"
+                "paddh      %[ftmp2],   %[ftmp2],       %[ff_pw_32]        \n\t"
+                "psrlh      %[ftmp1],   %[ftmp1],       %[ftmp9]           \n\t"
+                "psrlh      %[ftmp2],   %[ftmp2],       %[ftmp9]           \n\t"
+                "packushb   %[ftmp1],   %[ftmp1],       %[ftmp2]           \n\t"
+                MMI_SDC1(%[ftmp1], %[dst], 0x00)
+                PTR_ADDU   "%[dst],     %[dst],         %[stride]          \n\t"
+
+                MMI_ULDC1(%[ftmp1], %[src], 0x00)
+                MMI_ULDC1(%[ftmp2], %[src], 0x01)
+                PTR_ADDU   "%[src],     %[src],         %[stride]          \n\t"
+                MMI_ULDC1(%[ftmp3], %[src], 0x00)
+                MMI_ULDC1(%[ftmp4], %[src], 0x01)
+
+                "punpcklbh  %[ftmp5],   %[ftmp1],       %[ftmp0]           \n\t"
+                "punpckhbh  %[ftmp6],   %[ftmp1],       %[ftmp0]           \n\t"
+                "punpcklbh  %[ftmp7],   %[ftmp2],       %[ftmp0]           \n\t"
+                "punpckhbh  %[ftmp8],   %[ftmp2],       %[ftmp0]           \n\t"
+                "pmullh     %[ftmp5],   %[ftmp5],       %[A]               \n\t"
+                "pmullh     %[ftmp7],   %[ftmp7],       %[B]               \n\t"
+                "paddh      %[ftmp1],   %[ftmp5],       %[ftmp7]           \n\t"
+                "pmullh     %[ftmp6],   %[ftmp6],       %[A]               \n\t"
+                "pmullh     %[ftmp8],   %[ftmp8],       %[B]               \n\t"
+                "paddh      %[ftmp2],   %[ftmp6],       %[ftmp8]           \n\t"
+
+                "punpcklbh  %[ftmp5],   %[ftmp3],       %[ftmp0]           \n\t"
+                "punpckhbh  %[ftmp6],   %[ftmp3],       %[ftmp0]           \n\t"
+                "punpcklbh  %[ftmp7],   %[ftmp4],       %[ftmp0]           \n\t"
+                "punpckhbh  %[ftmp8],   %[ftmp4],       %[ftmp0]           \n\t"
+                "pmullh     %[ftmp5],   %[ftmp5],       %[C]               \n\t"
+                "pmullh     %[ftmp7],   %[ftmp7],       %[D]               \n\t"
+                "paddh      %[ftmp3],   %[ftmp5],       %[ftmp7]           \n\t"
+                "pmullh     %[ftmp6],   %[ftmp6],       %[C]               \n\t"
+                "pmullh     %[ftmp8],   %[ftmp8],       %[D]               \n\t"
+                "paddh      %[ftmp4],   %[ftmp6],       %[ftmp8]           \n\t"
+
+                "paddh      %[ftmp1],   %[ftmp1],       %[ftmp3]           \n\t"
+                "paddh      %[ftmp2],   %[ftmp2],       %[ftmp4]           \n\t"
+                "paddh      %[ftmp1],   %[ftmp1],       %[ff_pw_32]        \n\t"
+                "paddh      %[ftmp2],   %[ftmp2],       %[ff_pw_32]        \n\t"
+                "psrlh      %[ftmp1],   %[ftmp1],       %[ftmp9]           \n\t"
+                "psrlh      %[ftmp2],   %[ftmp2],       %[ftmp9]           \n\t"
+                "packushb   %[ftmp1],   %[ftmp1],       %[ftmp2]           \n\t"
+                MMI_SDC1(%[ftmp1], %[dst], 0x00)
+                PTR_ADDU   "%[dst],     %[dst],         %[stride]          \n\t"
+
+                "bnez       %[h],       1b                                 \n\t"
+                : [ftmp0]"=&f"(ftmp[0]),        [ftmp1]"=&f"(ftmp[1]),
+                  [ftmp2]"=&f"(ftmp[2]),        [ftmp3]"=&f"(ftmp[3]),
+                  [ftmp4]"=&f"(ftmp[4]),        [ftmp5]"=&f"(ftmp[5]),
+                  [ftmp6]"=&f"(ftmp[6]),        [ftmp7]"=&f"(ftmp[7]),
+                  [ftmp8]"=&f"(ftmp[8]),        [ftmp9]"=&f"(ftmp[9]),
+                  [tmp0]"=&r"(tmp[0]),
+                  [dst]"+&r"(dst),              [src]"+&r"(src),
+                  [h]"+&r"(h)
+                : [stride]"r"((mips_reg)stride),[ff_pw_32]"f"(ff_pw_32),
+                  [A]"f"(A),                    [B]"f"(B),
+                  [C]"f"(C),                    [D]"f"(D)
+                : "memory"
+            );
+        } else {
+            if (x) {
+                /* x!=0, y==0 */
+                E = x << 3;
+                A = 64 - E;
+
+                __asm__ volatile (
+                    "xor        %[ftmp0],   %[ftmp0],       %[ftmp0]           \n\t"
+                    "dli        %[tmp0],    0x06                               \n\t"
+                    "pshufh     %[A],       %[A],           %[ftmp0]           \n\t"
+                    "pshufh     %[E],       %[E],           %[ftmp0]           \n\t"
+                    "mtc1       %[tmp0],    %[ftmp7]                           \n\t"
+
+                    "1:                                                        \n\t"
+                    MMI_ULDC1(%[ftmp1], %[src], 0x00)
+                    MMI_ULDC1(%[ftmp2], %[src], 0x01)
+                    "addi       %[h],       %[h],           -0x01              \n\t"
+                    PTR_ADDU   "%[src],     %[src],         %[stride]          \n\t"
+
+                    "punpcklbh  %[ftmp3],   %[ftmp1],       %[ftmp0]           \n\t"
+                    "punpckhbh  %[ftmp4],   %[ftmp1],       %[ftmp0]           \n\t"
+                    "punpcklbh  %[ftmp5],   %[ftmp2],       %[ftmp0]           \n\t"
+                    "punpckhbh  %[ftmp6],   %[ftmp2],       %[ftmp0]           \n\t"
+                    "pmullh     %[ftmp3],   %[ftmp3],       %[A]               \n\t"
+                    "pmullh     %[ftmp5],   %[ftmp5],       %[E]               \n\t"
+                    "paddh      %[ftmp1],   %[ftmp3],       %[ftmp5]           \n\t"
+                    "pmullh     %[ftmp4],   %[ftmp4],       %[A]               \n\t"
+                    "pmullh     %[ftmp6],   %[ftmp6],       %[E]               \n\t"
+                    "paddh      %[ftmp2],   %[ftmp4],       %[ftmp6]           \n\t"
+
+                    "paddh      %[ftmp1],   %[ftmp1],       %[ff_pw_32]        \n\t"
+                    "paddh      %[ftmp2],   %[ftmp2],       %[ff_pw_32]        \n\t"
+                    "psrlh      %[ftmp1],   %[ftmp1],       %[ftmp7]           \n\t"
+                    "psrlh      %[ftmp2],   %[ftmp2],       %[ftmp7]           \n\t"
+                    "packushb   %[ftmp1],   %[ftmp1],       %[ftmp2]           \n\t"
+                    MMI_SDC1(%[ftmp1], %[dst], 0x00)
+                    PTR_ADDU   "%[dst],     %[dst],         %[stride]          \n\t"
+                    "bnez       %[h],       1b                                 \n\t"
+                    : [ftmp0]"=&f"(ftmp[0]),        [ftmp1]"=&f"(ftmp[1]),
+                      [ftmp2]"=&f"(ftmp[2]),        [ftmp3]"=&f"(ftmp[3]),
+                      [ftmp4]"=&f"(ftmp[4]),        [ftmp5]"=&f"(ftmp[5]),
+                      [ftmp6]"=&f"(ftmp[6]),        [ftmp7]"=&f"(ftmp[7]),
+                      [tmp0]"=&r"(tmp[0]),
+                      [dst]"+&r"(dst),              [src]"+&r"(src),
+                      [h]"+&r"(h)
+                    : [stride]"r"((mips_reg)stride),
+                      [ff_pw_32]"f"(ff_pw_32),
+                      [A]"f"(A),                    [E]"f"(E)
+                    : "memory"
+                );
+            } else {
+                /* x==0, y!=0 */
+                E = y << 3;
+                A = 64 - E;
+
+                __asm__ volatile (
+                    "xor        %[ftmp0],   %[ftmp0],       %[ftmp0]           \n\t"
+                    "dli        %[tmp0],    0x06                               \n\t"
+                    "pshufh     %[A],       %[A],           %[ftmp0]           \n\t"
+                    "pshufh     %[E],       %[E],           %[ftmp0]           \n\t"
+                    "mtc1       %[tmp0],    %[ftmp7]                           \n\t"
+
+                    "1:                                                        \n\t"
+                    MMI_ULDC1(%[ftmp1], %[src], 0x00)
+                    PTR_ADDU   "%[src],     %[src],         %[stride]          \n\t"
+                    MMI_ULDC1(%[ftmp2], %[src], 0x00)
+                    "addi       %[h],       %[h],           -0x01              \n\t"
+
+                    "punpcklbh  %[ftmp3],   %[ftmp1],       %[ftmp0]           \n\t"
+                    "punpckhbh  %[ftmp4],   %[ftmp1],       %[ftmp0]           \n\t"
+                    "punpcklbh  %[ftmp5],   %[ftmp2],       %[ftmp0]           \n\t"
+                    "punpckhbh  %[ftmp6],   %[ftmp2],       %[ftmp0]           \n\t"
+                    "pmullh     %[ftmp3],   %[ftmp3],       %[A]               \n\t"
+                    "pmullh     %[ftmp5],   %[ftmp5],       %[E]               \n\t"
+                    "paddh      %[ftmp1],   %[ftmp3],       %[ftmp5]           \n\t"
+                    "pmullh     %[ftmp4],   %[ftmp4],       %[A]               \n\t"
+                    "pmullh     %[ftmp6],   %[ftmp6],       %[E]               \n\t"
+                    "paddh      %[ftmp2],   %[ftmp4],       %[ftmp6]           \n\t"
+
+                    "paddh      %[ftmp1],   %[ftmp1],       %[ff_pw_32]        \n\t"
+                    "paddh      %[ftmp2],   %[ftmp2],       %[ff_pw_32]        \n\t"
+                    "psrlh      %[ftmp1],   %[ftmp1],       %[ftmp7]           \n\t"
+                    "psrlh      %[ftmp2],   %[ftmp2],       %[ftmp7]           \n\t"
+                    "packushb   %[ftmp1],   %[ftmp1],       %[ftmp2]           \n\t"
+                    MMI_SDC1(%[ftmp1], %[dst], 0x00)
+
+                    PTR_ADDU   "%[dst],     %[dst],         %[stride]          \n\t"
+                    "bnez       %[h],       1b                                 \n\t"
+                    : [ftmp0]"=&f"(ftmp[0]),        [ftmp1]"=&f"(ftmp[1]),
+                      [ftmp2]"=&f"(ftmp[2]),        [ftmp3]"=&f"(ftmp[3]),
+                      [ftmp4]"=&f"(ftmp[4]),        [ftmp5]"=&f"(ftmp[5]),
+                      [ftmp6]"=&f"(ftmp[6]),        [ftmp7]"=&f"(ftmp[7]),
+                      [tmp0]"=&r"(tmp[0]),
+                      [dst]"+&r"(dst),              [src]"+&r"(src),
+                      [h]"+&r"(h)
+                    : [stride]"r"((mips_reg)stride),
+                      [ff_pw_32]"f"(ff_pw_32),
+                      [A]"f"(A),                    [E]"f"(E)
+                    : "memory"
+                );
+            }
+        }
     }
 }
 
 void ff_avg_h264_chroma_mc8_mmi(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
         int h, int x, int y)
 {
-    const int A = (8 - x) * (8 - y);
-    const int B = x * (8 - y);
-    const int C = (8 - x) * y;
-    const int D = x * y;
-    const int E = B + C;
+    int A = 64, B, C, D, E;
     double ftmp[10];
     uint64_t tmp[1];
-    mips_reg addr[1];
-    DECLARE_VAR_ALL64;
 
-    if (D) {
+    if(!(x || y)){
+        /* x=0, y=0, A=64 */
         __asm__ volatile (
             "xor        %[ftmp0],   %[ftmp0],       %[ftmp0]            \n\t"
             "dli        %[tmp0],    0x06                                \n\t"
             "pshufh     %[A],       %[A],           %[ftmp0]            \n\t"
-            "pshufh     %[B],       %[B],           %[ftmp0]            \n\t"
-            "mtc1       %[tmp0],    %[ftmp9]                            \n\t"
-            "pshufh     %[C],       %[C],           %[ftmp0]            \n\t"
-            "pshufh     %[D],       %[D],           %[ftmp0]            \n\t"
+            "mtc1       %[tmp0],    %[ftmp4]                            \n\t"
 
             "1:                                                         \n\t"
-            PTR_ADDU   "%[addr0],   %[src],         %[stride]           \n\t"
             MMI_ULDC1(%[ftmp1], %[src], 0x00)
-            MMI_ULDC1(%[ftmp2], %[src], 0x01)
-            MMI_ULDC1(%[ftmp3], %[addr0], 0x00)
-            MMI_ULDC1(%[ftmp4], %[addr0], 0x01)
-
-            "punpcklbh  %[ftmp5],   %[ftmp1],       %[ftmp0]            \n\t"
-            "punpckhbh  %[ftmp6],   %[ftmp1],       %[ftmp0]            \n\t"
-            "punpcklbh  %[ftmp7],   %[ftmp2],       %[ftmp0]            \n\t"
-            "punpckhbh  %[ftmp8],   %[ftmp2],       %[ftmp0]            \n\t"
-            "pmullh     %[ftmp5],   %[ftmp5],       %[A]                \n\t"
-            "pmullh     %[ftmp7],   %[ftmp7],       %[B]                \n\t"
-            "paddh      %[ftmp1],   %[ftmp5],       %[ftmp7]            \n\t"
-            "pmullh     %[ftmp6],   %[ftmp6],       %[A]                \n\t"
-            "pmullh     %[ftmp8],   %[ftmp8],       %[B]                \n\t"
-            "paddh      %[ftmp2],   %[ftmp6],       %[ftmp8]            \n\t"
-
-            "punpcklbh  %[ftmp5],   %[ftmp3],       %[ftmp0]            \n\t"
-            "punpckhbh  %[ftmp6],   %[ftmp3],       %[ftmp0]            \n\t"
-            "punpcklbh  %[ftmp7],   %[ftmp4],       %[ftmp0]            \n\t"
-            "punpckhbh  %[ftmp8],   %[ftmp4],       %[ftmp0]            \n\t"
-            "pmullh     %[ftmp5],   %[ftmp5],       %[C]                \n\t"
-            "pmullh     %[ftmp7],   %[ftmp7],       %[D]                \n\t"
-            "paddh      %[ftmp3],   %[ftmp5],       %[ftmp7]            \n\t"
-            "pmullh     %[ftmp6],   %[ftmp6],       %[C]                \n\t"
-            "pmullh     %[ftmp8],   %[ftmp8],       %[D]                \n\t"
-            "paddh      %[ftmp4],   %[ftmp6],       %[ftmp8]            \n\t"
-
-            "paddh      %[ftmp1],   %[ftmp1],       %[ftmp3]            \n\t"
-            "paddh      %[ftmp2],   %[ftmp2],       %[ftmp4]            \n\t"
-            "paddh      %[ftmp1],   %[ftmp1],       %[ff_pw_32]         \n\t"
-            "paddh      %[ftmp2],   %[ftmp2],       %[ff_pw_32]         \n\t"
-            "psrlh      %[ftmp1],   %[ftmp1],       %[ftmp9]            \n\t"
-            "psrlh      %[ftmp2],   %[ftmp2],       %[ftmp9]            \n\t"
-            "packushb   %[ftmp1],   %[ftmp1],       %[ftmp2]            \n\t"
-            MMI_LDC1(%[ftmp2], %[dst], 0x00)
-            "pavgb      %[ftmp1],   %[ftmp1],       %[ftmp2]            \n\t"
-            "addi       %[h],       %[h],           -0x01               \n\t"
-            MMI_SDC1(%[ftmp1], %[dst], 0x00)
-            PTR_ADDU   "%[dst],     %[dst],         %[stride]           \n\t"
             PTR_ADDU   "%[src],     %[src],         %[stride]           \n\t"
-            "bnez       %[h],       1b                                  \n\t"
-            : [ftmp0]"=&f"(ftmp[0]),        [ftmp1]"=&f"(ftmp[1]),
-              [ftmp2]"=&f"(ftmp[2]),        [ftmp3]"=&f"(ftmp[3]),
-              [ftmp4]"=&f"(ftmp[4]),        [ftmp5]"=&f"(ftmp[5]),
-              [ftmp6]"=&f"(ftmp[6]),        [ftmp7]"=&f"(ftmp[7]),
-              [ftmp8]"=&f"(ftmp[8]),        [ftmp9]"=&f"(ftmp[9]),
-              [tmp0]"=&r"(tmp[0]),
-              RESTRICT_ASM_ALL64
-              [addr0]"=&r"(addr[0]),
-              [dst]"+&r"(dst),              [src]"+&r"(src),
-              [h]"+&r"(h)
-            : [stride]"r"((mips_reg)stride),[ff_pw_32]"f"(ff_pw_32),
-              [A]"f"(A),                    [B]"f"(B),
-              [C]"f"(C),                    [D]"f"(D)
-            : "memory"
-        );
-    } else if (E) {
-        const int step = C ? stride : 1;
-
-        __asm__ volatile (
-            "xor        %[ftmp0],   %[ftmp0],       %[ftmp0]            \n\t"
-            "dli        %[tmp0],    0x06                                \n\t"
-            "pshufh     %[A],       %[A],           %[ftmp0]            \n\t"
-            "pshufh     %[E],       %[E],           %[ftmp0]            \n\t"
-            "mtc1       %[tmp0],    %[ftmp7]                            \n\t"
-
-            "1:                                                         \n\t"
-            PTR_ADDU   "%[addr0],   %[src],         %[step]             \n\t"
-            MMI_ULDC1(%[ftmp1], %[src], 0x00)
-            MMI_ULDC1(%[ftmp2], %[addr0], 0x00)
-
-            "punpcklbh  %[ftmp3],   %[ftmp1],       %[ftmp0]            \n\t"
-            "punpckhbh  %[ftmp4],   %[ftmp1],       %[ftmp0]            \n\t"
-            "punpcklbh  %[ftmp5],   %[ftmp2],       %[ftmp0]            \n\t"
-            "punpckhbh  %[ftmp6],   %[ftmp2],       %[ftmp0]            \n\t"
-            "pmullh     %[ftmp3],   %[ftmp3],       %[A]                \n\t"
-            "pmullh     %[ftmp5],   %[ftmp5],       %[E]                \n\t"
-            "paddh      %[ftmp1],   %[ftmp3],       %[ftmp5]            \n\t"
-            "pmullh     %[ftmp4],   %[ftmp4],       %[A]                \n\t"
-            "pmullh     %[ftmp6],   %[ftmp6],       %[E]                \n\t"
-            "paddh      %[ftmp2],   %[ftmp4],       %[ftmp6]            \n\t"
-
-            "paddh      %[ftmp1],   %[ftmp1],       %[ff_pw_32]         \n\t"
-            "paddh      %[ftmp2],   %[ftmp2],       %[ff_pw_32]         \n\t"
-            "psrlh      %[ftmp1],   %[ftmp1],       %[ftmp7]            \n\t"
-            "psrlh      %[ftmp2],   %[ftmp2],       %[ftmp7]            \n\t"
-            "packushb   %[ftmp1],   %[ftmp1],       %[ftmp2]            \n\t"
-            MMI_LDC1(%[ftmp2], %[dst], 0x00)
-            "pavgb      %[ftmp1],   %[ftmp1],       %[ftmp2]            \n\t"
-            "addi       %[h],       %[h],           -0x01               \n\t"
-            MMI_SDC1(%[ftmp1], %[dst], 0x00)
+            MMI_ULDC1(%[ftmp5], %[src], 0x00)
             PTR_ADDU   "%[src],     %[src],         %[stride]           \n\t"
-            PTR_ADDU   "%[dst],     %[dst],         %[stride]           \n\t"
-            "bnez       %[h],       1b                                  \n\t"
-            : [ftmp0]"=&f"(ftmp[0]),        [ftmp1]"=&f"(ftmp[1]),
-              [ftmp2]"=&f"(ftmp[2]),        [ftmp3]"=&f"(ftmp[3]),
-              [ftmp4]"=&f"(ftmp[4]),        [ftmp5]"=&f"(ftmp[5]),
-              [ftmp6]"=&f"(ftmp[6]),        [ftmp7]"=&f"(ftmp[7]),
-              [tmp0]"=&r"(tmp[0]),
-              RESTRICT_ASM_ALL64
-              [addr0]"=&r"(addr[0]),
-              [dst]"+&r"(dst),              [src]"+&r"(src),
-              [h]"+&r"(h)
-            : [stride]"r"((mips_reg)stride),[step]"r"((mips_reg)step),
-              [ff_pw_32]"f"(ff_pw_32),
-              [A]"f"(A),                    [E]"f"(E)
-            : "memory"
-        );
-    } else {
-        __asm__ volatile (
-            "xor        %[ftmp0],   %[ftmp0],       %[ftmp0]            \n\t"
-            "dli        %[tmp0],    0x06                                \n\t"
-            "pshufh     %[A],       %[A],           %[ftmp0]            \n\t"
-            "mtc1       %[tmp0],    %[ftmp4]                            \n\t"
 
-            "1:                                                         \n\t"
-            MMI_ULDC1(%[ftmp1], %[src], 0x00)
             "punpcklbh  %[ftmp2],   %[ftmp1],       %[ftmp0]            \n\t"
             "punpckhbh  %[ftmp3],   %[ftmp1],       %[ftmp0]            \n\t"
             "pmullh     %[ftmp1],   %[ftmp2],       %[A]                \n\t"
@@ -360,13 +356,11 @@  void ff_avg_h264_chroma_mc8_mmi(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
             "packushb   %[ftmp1],   %[ftmp1],       %[ftmp2]            \n\t"
             MMI_LDC1(%[ftmp2], %[dst], 0x00)
             "pavgb      %[ftmp1],   %[ftmp1],       %[ftmp2]            \n\t"
-            PTR_ADDU   "%[src],     %[src],         %[stride]           \n\t"
             MMI_SDC1(%[ftmp1], %[dst], 0x00)
             PTR_ADDU   "%[dst],     %[dst],         %[stride]           \n\t"
 
-            MMI_ULDC1(%[ftmp1], %[src], 0x00)
-            "punpcklbh  %[ftmp2],   %[ftmp1],       %[ftmp0]            \n\t"
-            "punpckhbh  %[ftmp3],   %[ftmp1],       %[ftmp0]            \n\t"
+            "punpcklbh  %[ftmp2],   %[ftmp5],       %[ftmp0]            \n\t"
+            "punpckhbh  %[ftmp3],   %[ftmp5],       %[ftmp0]            \n\t"
             "pmullh     %[ftmp1],   %[ftmp2],       %[A]                \n\t"
             "pmullh     %[ftmp2],   %[ftmp3],       %[A]                \n\t"
             "paddh      %[ftmp1],   %[ftmp1],       %[ff_pw_32]         \n\t"
@@ -376,23 +370,195 @@  void ff_avg_h264_chroma_mc8_mmi(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
             "packushb   %[ftmp1],   %[ftmp1],       %[ftmp2]            \n\t"
             MMI_LDC1(%[ftmp2], %[dst], 0x00)
             "pavgb      %[ftmp1],   %[ftmp1],       %[ftmp2]            \n\t"
-            "addi       %[h],       %[h],           -0x02               \n\t"
             MMI_SDC1(%[ftmp1], %[dst], 0x00)
-
-            PTR_ADDU   "%[src],     %[src],         %[stride]           \n\t"
             PTR_ADDU   "%[dst],     %[dst],         %[stride]           \n\t"
+
+            "addi       %[h],       %[h],           -0x02               \n\t"
             "bnez       %[h],       1b                                  \n\t"
             : [ftmp0]"=&f"(ftmp[0]),        [ftmp1]"=&f"(ftmp[1]),
               [ftmp2]"=&f"(ftmp[2]),        [ftmp3]"=&f"(ftmp[3]),
-              [ftmp4]"=&f"(ftmp[4]),
+              [ftmp4]"=&f"(ftmp[4]),        [ftmp5]"=&f"(ftmp[5]),
               [tmp0]"=&r"(tmp[0]),
-              RESTRICT_ASM_ALL64
               [dst]"+&r"(dst),              [src]"+&r"(src),
               [h]"+&r"(h)
             : [stride]"r"((mips_reg)stride),[ff_pw_32]"f"(ff_pw_32),
               [A]"f"(A)
             : "memory"
         );
+    } else {
+        if(x && y) {
+            /* x!=0, y!=0 */
+            D = x * y;
+            B = (x << 3) - D;
+            C = (y << 3) - D;
+            A = 64 - D - B - C;
+            __asm__ volatile (
+                "xor        %[ftmp0],   %[ftmp0],       %[ftmp0]       \n\t"
+                "dli        %[tmp0],    0x06                           \n\t"
+                "pshufh     %[A],       %[A],           %[ftmp0]       \n\t"
+                "pshufh     %[B],       %[B],           %[ftmp0]       \n\t"
+                "mtc1       %[tmp0],    %[ftmp9]                       \n\t"
+                "pshufh     %[C],       %[C],           %[ftmp0]       \n\t"
+                "pshufh     %[D],       %[D],           %[ftmp0]       \n\t"
+
+                "1:                                                    \n\t"
+                MMI_ULDC1(%[ftmp1], %[src], 0x00)
+                MMI_ULDC1(%[ftmp2], %[src], 0x01)
+                PTR_ADDU   "%[src],     %[src],         %[stride]      \n\t"
+                MMI_ULDC1(%[ftmp3], %[src], 0x00)
+                MMI_ULDC1(%[ftmp4], %[src], 0x01)
+                "addi       %[h],       %[h],           -0x01          \n\t"
+
+                "punpcklbh  %[ftmp5],   %[ftmp1],       %[ftmp0]       \n\t"
+                "punpckhbh  %[ftmp6],   %[ftmp1],       %[ftmp0]       \n\t"
+                "punpcklbh  %[ftmp7],   %[ftmp2],       %[ftmp0]       \n\t"
+                "punpckhbh  %[ftmp8],   %[ftmp2],       %[ftmp0]       \n\t"
+                "pmullh     %[ftmp5],   %[ftmp5],       %[A]           \n\t"
+                "pmullh     %[ftmp7],   %[ftmp7],       %[B]           \n\t"
+                "paddh      %[ftmp1],   %[ftmp5],       %[ftmp7]       \n\t"
+                "pmullh     %[ftmp6],   %[ftmp6],       %[A]           \n\t"
+                "pmullh     %[ftmp8],   %[ftmp8],       %[B]           \n\t"
+                "paddh      %[ftmp2],   %[ftmp6],       %[ftmp8]       \n\t"
+
+                "punpcklbh  %[ftmp5],   %[ftmp3],       %[ftmp0]       \n\t"
+                "punpckhbh  %[ftmp6],   %[ftmp3],       %[ftmp0]       \n\t"
+                "punpcklbh  %[ftmp7],   %[ftmp4],       %[ftmp0]       \n\t"
+                "punpckhbh  %[ftmp8],   %[ftmp4],       %[ftmp0]       \n\t"
+                "pmullh     %[ftmp5],   %[ftmp5],       %[C]           \n\t"
+                "pmullh     %[ftmp7],   %[ftmp7],       %[D]           \n\t"
+                "paddh      %[ftmp3],   %[ftmp5],       %[ftmp7]       \n\t"
+                "pmullh     %[ftmp6],   %[ftmp6],       %[C]           \n\t"
+                "pmullh     %[ftmp8],   %[ftmp8],       %[D]           \n\t"
+                "paddh      %[ftmp4],   %[ftmp6],       %[ftmp8]       \n\t"
+
+                "paddh      %[ftmp1],   %[ftmp1],       %[ftmp3]       \n\t"
+                "paddh      %[ftmp2],   %[ftmp2],       %[ftmp4]       \n\t"
+                "paddh      %[ftmp1],   %[ftmp1],       %[ff_pw_32]    \n\t"
+                "paddh      %[ftmp2],   %[ftmp2],       %[ff_pw_32]    \n\t"
+                "psrlh      %[ftmp1],   %[ftmp1],       %[ftmp9]       \n\t"
+                "psrlh      %[ftmp2],   %[ftmp2],       %[ftmp9]       \n\t"
+                "packushb   %[ftmp1],   %[ftmp1],       %[ftmp2]       \n\t"
+                MMI_LDC1(%[ftmp2], %[dst], 0x00)
+                "pavgb      %[ftmp1],   %[ftmp1],       %[ftmp2]       \n\t"
+                MMI_SDC1(%[ftmp1], %[dst], 0x00)
+                PTR_ADDU   "%[dst],     %[dst],         %[stride]      \n\t"
+                "bnez       %[h],       1b                             \n\t"
+                : [ftmp0]"=&f"(ftmp[0]),        [ftmp1]"=&f"(ftmp[1]),
+                  [ftmp2]"=&f"(ftmp[2]),        [ftmp3]"=&f"(ftmp[3]),
+                  [ftmp4]"=&f"(ftmp[4]),        [ftmp5]"=&f"(ftmp[5]),
+                  [ftmp6]"=&f"(ftmp[6]),        [ftmp7]"=&f"(ftmp[7]),
+                  [ftmp8]"=&f"(ftmp[8]),        [ftmp9]"=&f"(ftmp[9]),
+                  [tmp0]"=&r"(tmp[0]),
+                  [dst]"+&r"(dst),              [src]"+&r"(src),
+                  [h]"+&r"(h)
+                : [stride]"r"((mips_reg)stride),[ff_pw_32]"f"(ff_pw_32),
+                  [A]"f"(A),                    [B]"f"(B),
+                  [C]"f"(C),                    [D]"f"(D)
+                : "memory"
+            );
+        } else {
+            if(x) {
+                /* x!=0, y==0 */
+                E = x << 3;
+                A = 64 - E;
+                __asm__ volatile (
+                    "xor        %[ftmp0],   %[ftmp0],       %[ftmp0]       \n\t"
+                    "dli        %[tmp0],    0x06                           \n\t"
+                    "pshufh     %[A],       %[A],           %[ftmp0]       \n\t"
+                    "pshufh     %[E],       %[E],           %[ftmp0]       \n\t"
+                    "mtc1       %[tmp0],    %[ftmp7]                       \n\t"
+
+                    "1:                                                    \n\t"
+                    MMI_ULDC1(%[ftmp1], %[src], 0x00)
+                    MMI_ULDC1(%[ftmp2], %[src], 0x01)
+                    PTR_ADDU   "%[src],     %[src],         %[stride]      \n\t"
+                    "addi       %[h],       %[h],           -0x01          \n\t"
+
+                    "punpcklbh  %[ftmp3],   %[ftmp1],       %[ftmp0]       \n\t"
+                    "punpckhbh  %[ftmp4],   %[ftmp1],       %[ftmp0]       \n\t"
+                    "punpcklbh  %[ftmp5],   %[ftmp2],       %[ftmp0]       \n\t"
+                    "punpckhbh  %[ftmp6],   %[ftmp2],       %[ftmp0]       \n\t"
+                    "pmullh     %[ftmp3],   %[ftmp3],       %[A]           \n\t"
+                    "pmullh     %[ftmp5],   %[ftmp5],       %[E]           \n\t"
+                    "paddh      %[ftmp1],   %[ftmp3],       %[ftmp5]       \n\t"
+                    "pmullh     %[ftmp4],   %[ftmp4],       %[A]           \n\t"
+                    "pmullh     %[ftmp6],   %[ftmp6],       %[E]           \n\t"
+                    "paddh      %[ftmp2],   %[ftmp4],       %[ftmp6]       \n\t"
+
+                    "paddh      %[ftmp1],   %[ftmp1],       %[ff_pw_32]    \n\t"
+                    "paddh      %[ftmp2],   %[ftmp2],       %[ff_pw_32]    \n\t"
+                    "psrlh      %[ftmp1],   %[ftmp1],       %[ftmp7]       \n\t"
+                    "psrlh      %[ftmp2],   %[ftmp2],       %[ftmp7]       \n\t"
+                    "packushb   %[ftmp1],   %[ftmp1],       %[ftmp2]       \n\t"
+                    MMI_LDC1(%[ftmp2], %[dst], 0x00)
+                    "pavgb      %[ftmp1],   %[ftmp1],       %[ftmp2]       \n\t"
+                    MMI_SDC1(%[ftmp1], %[dst], 0x00)
+                    PTR_ADDU   "%[dst],     %[dst],         %[stride]      \n\t"
+                    "bnez       %[h],       1b                             \n\t"
+                    : [ftmp0]"=&f"(ftmp[0]),        [ftmp1]"=&f"(ftmp[1]),
+                      [ftmp2]"=&f"(ftmp[2]),        [ftmp3]"=&f"(ftmp[3]),
+                      [ftmp4]"=&f"(ftmp[4]),        [ftmp5]"=&f"(ftmp[5]),
+                      [ftmp6]"=&f"(ftmp[6]),        [ftmp7]"=&f"(ftmp[7]),
+                      [tmp0]"=&r"(tmp[0]),
+                      [dst]"+&r"(dst),              [src]"+&r"(src),
+                      [h]"+&r"(h)
+                    : [stride]"r"((mips_reg)stride),
+                      [ff_pw_32]"f"(ff_pw_32),
+                      [A]"f"(A),                    [E]"f"(E)
+                    : "memory"
+                );
+            } else {
+                /* x==0, y!=0 */
+                E = y << 3;
+                A = 64 - E;
+                __asm__ volatile (
+                    "xor        %[ftmp0],   %[ftmp0],       %[ftmp0]       \n\t"
+                    "dli        %[tmp0],    0x06                           \n\t"
+                    "pshufh     %[A],       %[A],           %[ftmp0]       \n\t"
+                    "pshufh     %[E],       %[E],           %[ftmp0]       \n\t"
+                    "mtc1       %[tmp0],    %[ftmp7]                       \n\t"
+
+                    "1:                                                    \n\t"
+                    MMI_ULDC1(%[ftmp1], %[src], 0x00)
+                    PTR_ADDU   "%[src],     %[src],         %[stride]      \n\t"
+                    MMI_ULDC1(%[ftmp2], %[src], 0x00)
+                    "addi       %[h],       %[h],           -0x01          \n\t"
+
+                    "punpcklbh  %[ftmp3],   %[ftmp1],       %[ftmp0]       \n\t"
+                    "punpckhbh  %[ftmp4],   %[ftmp1],       %[ftmp0]       \n\t"
+                    "punpcklbh  %[ftmp5],   %[ftmp2],       %[ftmp0]       \n\t"
+                    "punpckhbh  %[ftmp6],   %[ftmp2],       %[ftmp0]       \n\t"
+                    "pmullh     %[ftmp3],   %[ftmp3],       %[A]           \n\t"
+                    "pmullh     %[ftmp5],   %[ftmp5],       %[E]           \n\t"
+                    "paddh      %[ftmp1],   %[ftmp3],       %[ftmp5]       \n\t"
+                    "pmullh     %[ftmp4],   %[ftmp4],       %[A]           \n\t"
+                    "pmullh     %[ftmp6],   %[ftmp6],       %[E]           \n\t"
+                    "paddh      %[ftmp2],   %[ftmp4],       %[ftmp6]       \n\t"
+
+                    "paddh      %[ftmp1],   %[ftmp1],       %[ff_pw_32]    \n\t"
+                    "paddh      %[ftmp2],   %[ftmp2],       %[ff_pw_32]    \n\t"
+                    "psrlh      %[ftmp1],   %[ftmp1],       %[ftmp7]       \n\t"
+                    "psrlh      %[ftmp2],   %[ftmp2],       %[ftmp7]       \n\t"
+                    "packushb   %[ftmp1],   %[ftmp1],       %[ftmp2]       \n\t"
+                    MMI_LDC1(%[ftmp2], %[dst], 0x00)
+                    "pavgb      %[ftmp1],   %[ftmp1],       %[ftmp2]       \n\t"
+                    MMI_SDC1(%[ftmp1], %[dst], 0x00)
+                    PTR_ADDU   "%[dst],     %[dst],         %[stride]      \n\t"
+                    "bnez       %[h],       1b                             \n\t"
+                    : [ftmp0]"=&f"(ftmp[0]),        [ftmp1]"=&f"(ftmp[1]),
+                      [ftmp2]"=&f"(ftmp[2]),        [ftmp3]"=&f"(ftmp[3]),
+                      [ftmp4]"=&f"(ftmp[4]),        [ftmp5]"=&f"(ftmp[5]),
+                      [ftmp6]"=&f"(ftmp[6]),        [ftmp7]"=&f"(ftmp[7]),
+                      [tmp0]"=&r"(tmp[0]),
+                      [dst]"+&r"(dst),              [src]"+&r"(src),
+                      [h]"+&r"(h)
+                    : [stride]"r"((mips_reg)stride),
+                      [ff_pw_32]"f"(ff_pw_32),
+                      [A]"f"(A),                    [E]"f"(E)
+                    : "memory"
+                );
+            }
+        }
     }
 }
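The reworked `ff_avg_h264_chroma_mc8_mmi` above branches on `x` and `y` directly instead of on the precomputed `D` and `E`. In the `x && y` path it derives the bilinear weights from `D = x*y`, `B = 8x - D`, `C = 8y - D` and `A = 64 - D - B - C`. When `y == 0` the filter collapses to two taps with `E = 8x`, and when `x == 0` it collapses to two taps with `E = 8y`. The scalar model below is a hypothetical sketch for checking that reasoning; it is not FFmpeg API. The names `chroma_weights`, `chroma_weights_ok` and `chroma_mc8_ref` are illustrative. It assumes the MMI `pavgb` rounds like x86 `pavgb`, i.e. `(a + b + 1) >> 1`. It confirms that the patch's weight arithmetic matches the textbook `(8-x)(8-y)`, `x(8-y)`, `(8-x)y`, `xy` weights for every eighth-pel offset, and it computes what the 8-wide put/avg loops are expected to produce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Weights computed the way the patch does it: D first, then B and C
 * from 8*x and 8*y, and A as the remainder of 64. */
static void chroma_weights(int x, int y, int *A, int *B, int *C, int *D)
{
    *D = x * y;
    *B = (x << 3) - *D;      /* equals x * (8 - y)       */
    *C = (y << 3) - *D;      /* equals (8 - x) * y       */
    *A = 64 - *D - *B - *C;  /* equals (8 - x) * (8 - y) */
}

/* Returns 1 if the decomposition matches the direct bilinear weights
 * for all x, y in [0, 7], else 0. */
static int chroma_weights_ok(void)
{
    for (int x = 0; x < 8; x++)
        for (int y = 0; y < 8; y++) {
            int A, B, C, D;
            chroma_weights(x, y, &A, &B, &C, &D);
            if (A != (8 - x) * (8 - y) || B != x * (8 - y) ||
                C != (8 - x) * y || D != x * y)
                return 0;
        }
    return 1;
}

/* Hypothetical scalar reference for the 8-pixel-wide chroma MC.
 * avg == 0 models the "put" variant; avg != 0 models the "avg" variant,
 * which blends with dst using round-up averaging (assumed pavgb semantics).
 * Like the x && y MMI path, it always reads one extra column and one extra
 * row of src (the weights are simply 0 where a tap is unused), so src must
 * hold h + 1 rows of at least 9 bytes. */
static void chroma_mc8_ref(uint8_t *dst, const uint8_t *src, ptrdiff_t stride,
                           int h, int x, int y, int avg)
{
    int A, B, C, D;

    assert(x >= 0 && x < 8 && y >= 0 && y < 8);
    chroma_weights(x, y, &A, &B, &C, &D);

    for (int j = 0; j < h; j++) {
        for (int i = 0; i < 8; i++) {
            /* Max value is (64 * 255 + 32) >> 6 = 255, so the 16-bit lanes
             * in the MMI code cannot overflow and no clamp is needed here. */
            int v = (A * src[i]          + B * src[i + 1] +
                     C * src[i + stride] + D * src[i + stride + 1] + 32) >> 6;
            dst[i] = avg ? (uint8_t)((dst[i] + v + 1) >> 1) : (uint8_t)v;
        }
        src += stride;
        dst += stride;
    }
}
```

For example, with `x = 4, y = 4` all four weights are 16, so each output pixel is the rounded mean of a 2x2 neighbourhood. With `x = y = 0`, `A = 64` and the put variant reduces to a plain copy, which is why the patch gives that case its own unrolled loop without any multiplies by `B`, `C` or `D`.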