From patchwork Fri Apr 3 20:41:58 2020
X-Patchwork-Id: 18616
Date: Fri, 3 Apr 2020 22:41:58 +0200 (CEST)
From: Frédéric RECOULES
To: ffmpeg-devel@ffmpeg.org
Cc: Richard Bonichon, Sébastien Bardin
Message-ID: <1873931254.3575536.1585946518805.JavaMail.zimbra@univ-grenoble-alpes.fr>
Subject: [FFmpeg-devel] [inline assembly compliance] Issues and patches

Dear developers,

we are academic researchers working in automated program analysis. We are
currently interested in checking the compliance of inline asm chunks as found
in C programs. While benchmarking our tool and technique, we found a number
of issues in FFmpeg, which we report to you here together with appropriate
patches. In total we found 59 significant compliance issues in your code. We
attach 3 patches for some of them, together with explanations, and we can
send you further patches on demand.

* All these bugs concern the compliance between an asm block and its
  surrounding "contract" (in gcc-style notation). They are akin to undefined
  or implementation-defined behaviour in C: they do not currently manifest
  themselves in your program, but as compiler optimizations become more and
  more aggressive, or as undocumented compiler choices regarding asm chunks
  change, they can suddenly trigger a (hard-to-find) bug.

* The typical problems come from the compiler missing dataflow information
  and performing undue optimizations on that wrong basis, or from the
  compiler allocating an already used register. We have demonstrated "in lab"
  problems for all these categories of bugs in the presence of inlining
  (especially with LTO enabled) or code refactoring. A minimal sketch of the
  register-allocation hazard follows right after this list.

* Some of these issues may seem benign or unrealistic, but it costs nothing
  to patch them, so why not do it?
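To make the register-allocation hazard concrete, here is a minimal,
self-contained sketch. It is not taken from FFmpeg, the function names are
made up for illustration, and whether the non-compliant variant actually
misbehaves depends on the compiler's register choices:

#include <stdio.h>

/* Non-compliant: the template uses %eax as a scratch register, but the
 * contract never mentions it.  The compiler is free to keep the "+r"
 * operand a in %eax; the first movl then destroys it and the function
 * silently computes 2*b instead of a+b. */
static int add_broken(int a, int b)
{
    __asm__ ("movl %1, %%eax \n\t"
             "addl %%eax, %0 \n\t"
             : "+r" (a)
             : "r" (b));
    return a;
}

/* Compliant: the scratch register is declared in the clobber list, so the
 * register allocator keeps a and b out of %eax. */
static int add_fixed(int a, int b)
{
    __asm__ ("movl %1, %%eax \n\t"
             "addl %%eax, %0 \n\t"
             : "+r" (a)
             : "r" (b)
             : "eax");
    return a;
}

int main(void)
{
    printf("%d %d\n", add_broken(1, 2), add_fixed(1, 2));
    return 0;
}

Declaring the scratch register costs nothing at run time; it only tells the
register allocator the truth about what the template does.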
We would be very interested to hear your opinion on these matters. Are you
interested in such errors and patches? Besides the patches, we are also
working on a code analyzer prototype designed to check asm compliance and to
propose patches when a chunk is not compliant. This is still work in progress
and we are finalizing it; the errors and patches reported here come from this
prototype. If such a prototype were made available, would you consider using
it?

Best regards,

Frédéric Recoules


1. Overview of found issues
---------------------------

1.1 missing "memory" clobber (x 8)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The "memory" clobber is missing, which breaks data dependencies. Every store
initializing the memory blocks pointed to by the input pointers can be
discarded, and every later read of those blocks may return the same value as
before the chunk was executed. (A minimal illustration is sketched right
after this overview.)

1.2 xmm registers are clobbered (x 2)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
XMM registers are not declared in the clobber list (whereas other chunks
using XMM registers do list them).

1.3 mmx registers are clobbered (x 56)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
MMX registers are never listed in the clobber list.

1.4 inter-chunk dependency with mmx register (x 23)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some MMX registers are initialized by one chunk and used by another. Even if
the chunks are volatile and contiguous in the sources, the compiler is still
allowed to insert instructions between them.

1.5 static symbols are hard written in assembly (x 27)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some static const values are accessed by name in the chunk without being
stated in the interface. The risk here is that the compiler deems the value
unused and gets rid of it. If so, compilation fails, so this is not a "big"
issue, but it is bad practice.
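To illustrate categories 1.1 and 1.3 on a toy example (again not taken from
the FFmpeg tree, the function names are ours), consider a chunk that stores
8 bytes through a pointer using an MMX register:

#include <stdint.h>

/* Non-compliant (1.1 and 1.3): the chunk writes 8 bytes through dst and
 * trashes %mm0, but the contract only says "dst is an input register".
 * The compiler may delete earlier stores into the buffer as dead, move
 * later loads before the chunk, or keep live data in %mm0 across it. */
static void clear8_broken(uint8_t *dst)
{
    __asm__ volatile ("pxor %%mm0, %%mm0 \n\t"
                      "movq %%mm0, (%0)  \n\t"
                      "emms              \n\t"
                      :: "r" (dst));
}

/* Compliant: the written bytes appear as an "m" output (precise dataflow,
 * no blanket "memory" clobber needed) and %mm0 is listed as clobbered.
 * Listing "mm0" assumes the compiler accepts MMX registers in clobber
 * lists, which is what the HAVE_MMX_CLOBBERS check added to configure by
 * the patch below tests for. */
static void clear8_fixed(uint8_t *dst)
{
    __asm__ ("pxor %%mm0, %%mm0 \n\t"
             "movq %%mm0, %0    \n\t"
             "emms              \n\t"
             : "=m" (*(uint8_t (*)[8]) dst)
             :
             : "mm0");
}

The same pattern, generalized with the MMX_CLOBBERS macro and symbolic
operand names, is what the patches below apply to the FFmpeg chunks.

2.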
Proposed patches (x 3 functions) ------------------- Refactor assembly chunk contained in: - libavcodec/x86/lossless_videoencdsp_init.c - libavcodec/x86/rnd_template.c Following GNU inline assembly guidelines, the patch: - add missing MMX clobbers - replace positional placeholder (%[0-9]) by symbolic names - replace volatile keyword and "memory" clobber by corresponding i/o entries - replace register clobbering (ex: FF_REG_a) by scratch registers - refine some macros - [Cosmetic] mnemonic alignment [build] HAVE_MMX_CLOBBERS have been added to configure following the XMM scheme --- configure | 3 + libavcodec/x86/hpeldsp_init.c | 8 + libavcodec/x86/inline_asm.h | 30 ++- libavcodec/x86/lossless_videoencdsp_init.c | 55 ++-- libavcodec/x86/rnd_template.c | 282 +++++++++++---------- libavutil/x86/asm.h | 20 ++ 6 files changed, 231 insertions(+), 167 deletions(-) diff --git a/configure b/configure index dcead3a300..2e4b68915d 100755 --- a/configure +++ b/configure @@ -2250,6 +2250,7 @@ TOOLCHAIN_FEATURES=" symver_gnu_asm vfp_args xform_asm + mmx_clobbers xmm_clobbers " @@ -5792,6 +5793,8 @@ EOF check_inline_asm ebx_available '""::"b"(0)' && check_inline_asm ebx_available '"":::"%ebx"' + # check whether xmm clobbers are supported + check_inline_asm mmx_clobbers '"":::"%mm0"' # check whether xmm clobbers are supported check_inline_asm xmm_clobbers '"":::"%xmm0"' diff --git a/libavcodec/x86/hpeldsp_init.c b/libavcodec/x86/hpeldsp_init.c index d89928cec6..17501a2b5a 100644 --- a/libavcodec/x86/hpeldsp_init.c +++ b/libavcodec/x86/hpeldsp_init.c @@ -95,6 +95,8 @@ void ff_avg_approx_pixels8_xy2_3dnow(uint8_t *block, const uint8_t *pixels, /* MMX no rounding */ #define DEF(x, y) x ## _no_rnd_ ## y ## _mmx #define SET_RND MOVQ_WONE +#define SET_RND_TPL MOVQ_WONE_TPL +#define SET_RND_IN_COMMA #define PAVGBP(a, b, c, d, e, f) PAVGBP_MMX_NO_RND(a, b, c, d, e, f) #define PAVGB(a, b, c, e) PAVGB_MMX_NO_RND(a, b, c, e) #define STATIC static @@ -104,6 +106,8 @@ void ff_avg_approx_pixels8_xy2_3dnow(uint8_t *block, const uint8_t *pixels, #undef DEF #undef SET_RND +#undef SET_RND_TPL +#undef SET_RND_IN_COMMA #undef PAVGBP #undef PAVGB #undef STATIC @@ -121,6 +125,8 @@ CALL_2X_PIXELS(put_no_rnd_pixels16_xy2_mmx, put_no_rnd_pixels8_xy2_mmx, 8) #define DEF(x, y) x ## _ ## y ## _mmx #define SET_RND MOVQ_WTWO +#define SET_RND_TPL MOVQ_WTWO_TPL +#define SET_RND_IN_COMMA MOVQ_WTWO_IN_COMMA #define PAVGBP(a, b, c, d, e, f) PAVGBP_MMX(a, b, c, d, e, f) #define PAVGB(a, b, c, e) PAVGB_MMX(a, b, c, e) @@ -134,6 +140,8 @@ CALL_2X_PIXELS(put_no_rnd_pixels16_xy2_mmx, put_no_rnd_pixels8_xy2_mmx, 8) #undef DEF #undef SET_RND +#undef SET_RND_TPL +#undef SET_RND_IN_COMMA #undef PAVGBP #undef PAVGB diff --git a/libavcodec/x86/inline_asm.h b/libavcodec/x86/inline_asm.h index 0198746719..565e9e260a 100644 --- a/libavcodec/x86/inline_asm.h +++ b/libavcodec/x86/inline_asm.h @@ -23,13 +23,16 @@ #include "constants.h" -#define MOVQ_WONE(regd) \ - __asm__ volatile ( \ - "pcmpeqd %%" #regd ", %%" #regd " \n\t" \ - "psrlw $15, %%" #regd ::) +#define MOVQ_WONE_TPL(regd) \ + "pcmpeqd %%"#regd", %%"#regd" \n\t" \ + "psrlw $15, %%" #regd" \n\t" + +#define MOVQ_WONE(regd) __asm__ volatile (MOVQ_WONE_TPL(regd) ::) #define JUMPALIGN() __asm__ volatile (".p2align 3"::) -#define MOVQ_ZERO(regd) __asm__ volatile ("pxor %%"#regd", %%"#regd ::) + +#define MOVQ_ZERO_TPL(regd) "pxor %%"#regd", %%"#regd" \n\t" +#define MOVQ_ZERO(regd) __asm__ volatile (MOVQ_ZERO_TPL(regd) ::) #define MOVQ_BFE(regd) \ __asm__ volatile ( \ @@ -37,17 +40,20 @@ "paddb 
%%"#regd", %%"#regd" \n\t" ::) #ifndef PIC -#define MOVQ_WTWO(regd) __asm__ volatile ("movq %0, %%"#regd" \n\t" :: "m"(ff_pw_2)) +#define MOVQ_WTWO_TPL(regd) "movq %[ff_pw_2], %%"#regd" \n\t" +#define MOVQ_WTWO_IN [ff_pw_2] "m" (ff_pw_2) +#define MOVQ_WTWO_IN_COMMA MOVQ_WTWO_IN, #else // for shared library it's better to use this way for accessing constants // pcmpeqd -> -1 -#define MOVQ_WTWO(regd) \ - __asm__ volatile ( \ - "pcmpeqd %%"#regd", %%"#regd" \n\t" \ - "psrlw $15, %%"#regd" \n\t" \ - "psllw $1, %%"#regd" \n\t"::) - +#define MOVQ_WTWO_TPL(regd) \ + "pcmpeqd %%"#regd", %%"#regd" \n\t" \ + "psrlw $15, %%"#regd" \n\t" \ + "psllw $1, %%"#regd" \n\t" +#define MOVQ_WTWO_IN +#define MOVQ_WTWO_IN_COMMA #endif +#define MOVQ_WTWO(regd) __asm__ volatile (MOVQ_WTWO_TPL(regd) :: MOVQ_WTWO_IN) // using regr as temporary and for the output result // first argument is unmodified and second is trashed diff --git a/libavcodec/x86/lossless_videoencdsp_init.c b/libavcodec/x86/lossless_videoencdsp_init.c index 40407add52..3f2d9968b7 100644 --- a/libavcodec/x86/lossless_videoencdsp_init.c +++ b/libavcodec/x86/lossless_videoencdsp_init.c @@ -48,29 +48,38 @@ static void sub_median_pred_mmxext(uint8_t *dst, const uint8_t *src1, x86_reg i = 0; uint8_t l, lt; - __asm__ volatile ( - "movq (%1, %0), %%mm0 \n\t" // LT - "psllq $8, %%mm0 \n\t" - "1: \n\t" - "movq (%1, %0), %%mm1 \n\t" // T - "movq -1(%2, %0), %%mm2 \n\t" // L - "movq (%2, %0), %%mm3 \n\t" // X - "movq %%mm2, %%mm4 \n\t" // L - "psubb %%mm0, %%mm2 \n\t" - "paddb %%mm1, %%mm2 \n\t" // L + T - LT - "movq %%mm4, %%mm5 \n\t" // L - "pmaxub %%mm1, %%mm4 \n\t" // max(T, L) - "pminub %%mm5, %%mm1 \n\t" // min(T, L) - "pminub %%mm2, %%mm4 \n\t" - "pmaxub %%mm1, %%mm4 \n\t" - "psubb %%mm4, %%mm3 \n\t" // dst - pred - "movq %%mm3, (%3, %0) \n\t" - "add $8, %0 \n\t" - "movq -1(%1, %0), %%mm0 \n\t" // LT - "cmp %4, %0 \n\t" - " jb 1b \n\t" - : "+r" (i) - : "r" (src1), "r" (src2), "r" (dst), "r" ((x86_reg) w)); + __asm__ + ( + "movq (%[src1], %[i]), %%mm0 \n\t" // LT + "psllq $8, %%mm0 \n\t" + "1: \n\t" + "movq (%[src1], %[i]), %%mm1 \n\t" // T + "movq -1(%[src2], %[i]), %%mm2 \n\t" // L + "movq (%[src2], %[i]), %%mm3 \n\t" // X + "movq %%mm2, %%mm4 \n\t" // L + "psubb %%mm0, %%mm2 \n\t" + "paddb %%mm1, %%mm2 \n\t" // L + T - LT + "movq %%mm4, %%mm5 \n\t" // L + "pmaxub %%mm1, %%mm4 \n\t" // max(T, L) + "pminub %%mm5, %%mm1 \n\t" // min(T, L) + "pminub %%mm2, %%mm4 \n\t" + "pmaxub %%mm1, %%mm4 \n\t" + "psubb %%mm4, %%mm3 \n\t" // dst - pred + "movq %%mm3, (%[dst], %[i]) \n\t" + "add $8, %[i] \n\t" + "movq -1(%[src1], %[i]), %%mm0 \n\t" // LT + "cmp %[w], %[i] \n\t" + "jb 1b \n\t" + : "=m" (*(uint8_t (*)[])dst), + [i] "+&r" (i) + : "m" (*(const uint8_t (*)[])src1), + "m" (*(const uint8_t (*)[])src2), + [src1] "r" (src1), + [src2] "r" (src2), + [dst] "r" (dst), + [w] "r" ((x86_reg) w) + : MMX_CLOBBERS("mm0", "mm1", "mm2", "mm3", "mm4", "mm5") + ); l = *left; lt = *left_top; diff --git a/libavcodec/x86/rnd_template.c b/libavcodec/x86/rnd_template.c index 09946bd23f..1be010e066 100644 --- a/libavcodec/x86/rnd_template.c +++ b/libavcodec/x86/rnd_template.c @@ -30,146 +30,164 @@ #include "inline_asm.h" // put_pixels -av_unused STATIC void DEF(put, pixels8_xy2)(uint8_t *block, const uint8_t *pixels, - ptrdiff_t line_size, int h) +av_unused STATIC void DEF(put, pixels8_xy2)(uint8_t *block, + const uint8_t *pixels, + ptrdiff_t line_size, int h) { - MOVQ_ZERO(mm7); - SET_RND(mm6); // =2 for rnd and =1 for no_rnd version - __asm__ volatile( - "movq (%1), %%mm0 \n\t" 
- "movq 1(%1), %%mm4 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm4, %%mm5 \n\t" - "punpcklbw %%mm7, %%mm0 \n\t" - "punpcklbw %%mm7, %%mm4 \n\t" - "punpckhbw %%mm7, %%mm1 \n\t" - "punpckhbw %%mm7, %%mm5 \n\t" - "paddusw %%mm0, %%mm4 \n\t" - "paddusw %%mm1, %%mm5 \n\t" - "xor %%"FF_REG_a", %%"FF_REG_a" \n\t" - "add %3, %1 \n\t" - ".p2align 3 \n\t" - "1: \n\t" - "movq (%1, %%"FF_REG_a"), %%mm0 \n\t" - "movq 1(%1, %%"FF_REG_a"), %%mm2 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm2, %%mm3 \n\t" - "punpcklbw %%mm7, %%mm0 \n\t" - "punpcklbw %%mm7, %%mm2 \n\t" - "punpckhbw %%mm7, %%mm1 \n\t" - "punpckhbw %%mm7, %%mm3 \n\t" - "paddusw %%mm2, %%mm0 \n\t" - "paddusw %%mm3, %%mm1 \n\t" - "paddusw %%mm6, %%mm4 \n\t" - "paddusw %%mm6, %%mm5 \n\t" - "paddusw %%mm0, %%mm4 \n\t" - "paddusw %%mm1, %%mm5 \n\t" - "psrlw $2, %%mm4 \n\t" - "psrlw $2, %%mm5 \n\t" - "packuswb %%mm5, %%mm4 \n\t" - "movq %%mm4, (%2, %%"FF_REG_a") \n\t" - "add %3, %%"FF_REG_a" \n\t" + x86_reg i = 0; + __asm__ ( + MOVQ_ZERO_TPL(mm7) + SET_RND_TPL(mm6) // =2 for rnd and =1 for no_rnd version + "movq (%[pixels]), %%mm0 \n\t" + "movq 1(%[pixels]), %%mm4 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm4, %%mm5 \n\t" + "punpcklbw %%mm7, %%mm0 \n\t" + "punpcklbw %%mm7, %%mm4 \n\t" + "punpckhbw %%mm7, %%mm1 \n\t" + "punpckhbw %%mm7, %%mm5 \n\t" + "paddusw %%mm0, %%mm4 \n\t" + "paddusw %%mm1, %%mm5 \n\t" + "add %[line_size], %[pixels] \n\t" + ".p2align 3 \n\t" + "1: \n\t" + "movq (%[pixels], %[i]), %%mm0 \n\t" + "movq 1(%[pixels], %[i]), %%mm2 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm2, %%mm3 \n\t" + "punpcklbw %%mm7, %%mm0 \n\t" + "punpcklbw %%mm7, %%mm2 \n\t" + "punpckhbw %%mm7, %%mm1 \n\t" + "punpckhbw %%mm7, %%mm3 \n\t" + "paddusw %%mm2, %%mm0 \n\t" + "paddusw %%mm3, %%mm1 \n\t" + "paddusw %%mm6, %%mm4 \n\t" + "paddusw %%mm6, %%mm5 \n\t" + "paddusw %%mm0, %%mm4 \n\t" + "paddusw %%mm1, %%mm5 \n\t" + "psrlw $2, %%mm4 \n\t" + "psrlw $2, %%mm5 \n\t" + "packuswb %%mm5, %%mm4 \n\t" + "movq %%mm4, (%[block], %[i]) \n\t" + "add %[line_size], %[i] \n\t" - "movq (%1, %%"FF_REG_a"), %%mm2 \n\t" // 0 <-> 2 1 <-> 3 - "movq 1(%1, %%"FF_REG_a"), %%mm4 \n\t" - "movq %%mm2, %%mm3 \n\t" - "movq %%mm4, %%mm5 \n\t" - "punpcklbw %%mm7, %%mm2 \n\t" - "punpcklbw %%mm7, %%mm4 \n\t" - "punpckhbw %%mm7, %%mm3 \n\t" - "punpckhbw %%mm7, %%mm5 \n\t" - "paddusw %%mm2, %%mm4 \n\t" - "paddusw %%mm3, %%mm5 \n\t" - "paddusw %%mm6, %%mm0 \n\t" - "paddusw %%mm6, %%mm1 \n\t" - "paddusw %%mm4, %%mm0 \n\t" - "paddusw %%mm5, %%mm1 \n\t" - "psrlw $2, %%mm0 \n\t" - "psrlw $2, %%mm1 \n\t" - "packuswb %%mm1, %%mm0 \n\t" - "movq %%mm0, (%2, %%"FF_REG_a") \n\t" - "add %3, %%"FF_REG_a" \n\t" + "movq (%[pixels], %[i]), %%mm2 \n\t" + // 0 <-> 2 1 <-> 3 + "movq 1(%[pixels], %[i]), %%mm4 \n\t" + "movq %%mm2, %%mm3 \n\t" + "movq %%mm4, %%mm5 \n\t" + "punpcklbw %%mm7, %%mm2 \n\t" + "punpcklbw %%mm7, %%mm4 \n\t" + "punpckhbw %%mm7, %%mm3 \n\t" + "punpckhbw %%mm7, %%mm5 \n\t" + "paddusw %%mm2, %%mm4 \n\t" + "paddusw %%mm3, %%mm5 \n\t" + "paddusw %%mm6, %%mm0 \n\t" + "paddusw %%mm6, %%mm1 \n\t" + "paddusw %%mm4, %%mm0 \n\t" + "paddusw %%mm5, %%mm1 \n\t" + "psrlw $2, %%mm0 \n\t" + "psrlw $2, %%mm1 \n\t" + "packuswb %%mm1, %%mm0 \n\t" + "movq %%mm0, (%[block], %[i]) \n\t" + "add %[line_size], %[i] \n\t" - "subl $2, %0 \n\t" - "jnz 1b \n\t" - :"+g"(h), "+S"(pixels) - :"D"(block), "r"((x86_reg)line_size) - :FF_REG_a, "memory"); + "subl $2, %[h] \n\t" + "jnz 1b \n\t" + : "=m" (*(uint8_t (*)[])block), + [h] "+&g" (h), + [pixels] "+&S" (pixels), + [i] "+&r" (i) + : SET_RND_IN_COMMA + "m" (*(const 
uint8_t (*)[])pixels), + [block] "D" (block), + [line_size] "r" ((x86_reg)line_size) + : MMX_CLOBBERS("mm0", "mm1", "mm2", "mm3", + "mm4", "mm5", "mm6", "mm7")); } // avg_pixels // this routine is 'slightly' suboptimal but mostly unused -av_unused STATIC void DEF(avg, pixels8_xy2)(uint8_t *block, const uint8_t *pixels, - ptrdiff_t line_size, int h) +av_unused STATIC void DEF(avg, pixels8_xy2)(uint8_t *block, + const uint8_t *pixels, + ptrdiff_t line_size, int h) { - MOVQ_ZERO(mm7); - SET_RND(mm6); // =2 for rnd and =1 for no_rnd version - __asm__ volatile( - "movq (%1), %%mm0 \n\t" - "movq 1(%1), %%mm4 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm4, %%mm5 \n\t" - "punpcklbw %%mm7, %%mm0 \n\t" - "punpcklbw %%mm7, %%mm4 \n\t" - "punpckhbw %%mm7, %%mm1 \n\t" - "punpckhbw %%mm7, %%mm5 \n\t" - "paddusw %%mm0, %%mm4 \n\t" - "paddusw %%mm1, %%mm5 \n\t" - "xor %%"FF_REG_a", %%"FF_REG_a" \n\t" - "add %3, %1 \n\t" - ".p2align 3 \n\t" - "1: \n\t" - "movq (%1, %%"FF_REG_a"), %%mm0 \n\t" - "movq 1(%1, %%"FF_REG_a"), %%mm2 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm2, %%mm3 \n\t" - "punpcklbw %%mm7, %%mm0 \n\t" - "punpcklbw %%mm7, %%mm2 \n\t" - "punpckhbw %%mm7, %%mm1 \n\t" - "punpckhbw %%mm7, %%mm3 \n\t" - "paddusw %%mm2, %%mm0 \n\t" - "paddusw %%mm3, %%mm1 \n\t" - "paddusw %%mm6, %%mm4 \n\t" - "paddusw %%mm6, %%mm5 \n\t" - "paddusw %%mm0, %%mm4 \n\t" - "paddusw %%mm1, %%mm5 \n\t" - "psrlw $2, %%mm4 \n\t" - "psrlw $2, %%mm5 \n\t" - "movq (%2, %%"FF_REG_a"), %%mm3 \n\t" - "packuswb %%mm5, %%mm4 \n\t" - "pcmpeqd %%mm2, %%mm2 \n\t" - "paddb %%mm2, %%mm2 \n\t" - PAVGB_MMX(%%mm3, %%mm4, %%mm5, %%mm2) - "movq %%mm5, (%2, %%"FF_REG_a") \n\t" - "add %3, %%"FF_REG_a" \n\t" + x86_reg i = 0; + __asm__ ( + MOVQ_ZERO_TPL(mm7) + SET_RND_TPL(mm6) // =2 for rnd and =1 for no_rnd version + "movq (%[pixels]), %%mm0 \n\t" + "movq 1(%[pixels]), %%mm4 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm4, %%mm5 \n\t" + "punpcklbw %%mm7, %%mm0 \n\t" + "punpcklbw %%mm7, %%mm4 \n\t" + "punpckhbw %%mm7, %%mm1 \n\t" + "punpckhbw %%mm7, %%mm5 \n\t" + "paddusw %%mm0, %%mm4 \n\t" + "paddusw %%mm1, %%mm5 \n\t" + "add %[line_size], %[pixels] \n\t" + ".p2align 3 \n\t" + "1: \n\t" + "movq (%[pixels], %[i]), %%mm0 \n\t" + "movq 1(%[pixels], %[i]), %%mm2 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm2, %%mm3 \n\t" + "punpcklbw %%mm7, %%mm0 \n\t" + "punpcklbw %%mm7, %%mm2 \n\t" + "punpckhbw %%mm7, %%mm1 \n\t" + "punpckhbw %%mm7, %%mm3 \n\t" + "paddusw %%mm2, %%mm0 \n\t" + "paddusw %%mm3, %%mm1 \n\t" + "paddusw %%mm6, %%mm4 \n\t" + "paddusw %%mm6, %%mm5 \n\t" + "paddusw %%mm0, %%mm4 \n\t" + "paddusw %%mm1, %%mm5 \n\t" + "psrlw $2, %%mm4 \n\t" + "psrlw $2, %%mm5 \n\t" + "movq (%[block], %[i]), %%mm3 \n\t" + "packuswb %%mm5, %%mm4 \n\t" + "pcmpeqd %%mm2, %%mm2 \n\t" + "paddb %%mm2, %%mm2 \n\t" + PAVGB_MMX(%%mm3, %%mm4, %%mm5, %%mm2) + "movq %%mm5, (%[block], %[i]) \n\t" + "add %[line_size], %[i] \n\t" - "movq (%1, %%"FF_REG_a"), %%mm2 \n\t" // 0 <-> 2 1 <-> 3 - "movq 1(%1, %%"FF_REG_a"), %%mm4 \n\t" - "movq %%mm2, %%mm3 \n\t" - "movq %%mm4, %%mm5 \n\t" - "punpcklbw %%mm7, %%mm2 \n\t" - "punpcklbw %%mm7, %%mm4 \n\t" - "punpckhbw %%mm7, %%mm3 \n\t" - "punpckhbw %%mm7, %%mm5 \n\t" - "paddusw %%mm2, %%mm4 \n\t" - "paddusw %%mm3, %%mm5 \n\t" - "paddusw %%mm6, %%mm0 \n\t" - "paddusw %%mm6, %%mm1 \n\t" - "paddusw %%mm4, %%mm0 \n\t" - "paddusw %%mm5, %%mm1 \n\t" - "psrlw $2, %%mm0 \n\t" - "psrlw $2, %%mm1 \n\t" - "movq (%2, %%"FF_REG_a"), %%mm3 \n\t" - "packuswb %%mm1, %%mm0 \n\t" - "pcmpeqd %%mm2, %%mm2 \n\t" - "paddb %%mm2, %%mm2 \n\t" - 
PAVGB_MMX(%%mm3, %%mm0, %%mm1, %%mm2) - "movq %%mm1, (%2, %%"FF_REG_a") \n\t" - "add %3, %%"FF_REG_a" \n\t" + "movq (%[pixels], %[i]), %%mm2 \n\t" + // 0 <-> 2 1 <-> 3 + "movq 1(%[pixels], %[i]), %%mm4 \n\t" + "movq %%mm2, %%mm3 \n\t" + "movq %%mm4, %%mm5 \n\t" + "punpcklbw %%mm7, %%mm2 \n\t" + "punpcklbw %%mm7, %%mm4 \n\t" + "punpckhbw %%mm7, %%mm3 \n\t" + "punpckhbw %%mm7, %%mm5 \n\t" + "paddusw %%mm2, %%mm4 \n\t" + "paddusw %%mm3, %%mm5 \n\t" + "paddusw %%mm6, %%mm0 \n\t" + "paddusw %%mm6, %%mm1 \n\t" + "paddusw %%mm4, %%mm0 \n\t" + "paddusw %%mm5, %%mm1 \n\t" + "psrlw $2, %%mm0 \n\t" + "psrlw $2, %%mm1 \n\t" + "movq (%[block], %[i]), %%mm3 \n\t" + "packuswb %%mm1, %%mm0 \n\t" + "pcmpeqd %%mm2, %%mm2 \n\t" + "paddb %%mm2, %%mm2 \n\t" + PAVGB_MMX(%%mm3, %%mm0, %%mm1, %%mm2) + "movq %%mm1, (%[block], %[i]) \n\t" + "add %[line_size], %[i] \n\t" - "subl $2, %0 \n\t" - "jnz 1b \n\t" - :"+g"(h), "+S"(pixels) - :"D"(block), "r"((x86_reg)line_size) - :FF_REG_a, "memory"); + "subl $2, %[h] \n\t" + "jnz 1b \n\t" + : "=m" (*(uint8_t (*)[])block), + [h] "+&g" (h), + [pixels] "+&S" (pixels), + [i] "+&r" (i) + : SET_RND_IN_COMMA + "m" (*(const uint8_t (*)[])pixels), + [block] "D" (block), + [line_size] "r" ((x86_reg)line_size) + : MMX_CLOBBERS("mm0", "mm1", "mm2", "mm3", + "mm4", "mm5", "mm6", "mm7")); } diff --git a/libavutil/x86/asm.h b/libavutil/x86/asm.h index 9bff42d628..bb3c13f5c1 100644 --- a/libavutil/x86/asm.h +++ b/libavutil/x86/asm.h @@ -79,6 +79,26 @@ typedef int x86_reg; # define BROKEN_RELOCATIONS 1 #endif +/* + * If gcc is not set to support mmx (-mmmx) it will not accept mmx registers + * in the clobber list for inline asm. MMX_CLOBBERS takes a list of mmx + * registers to be marked as clobbered and evaluates to nothing if they are + * not supported, or to the list itself if they are supported. Since a clobber + * list may not be empty, XMM_CLOBBERS_ONLY should be used if the mmx + * registers are the only in the clobber list. + * For example a list with "eax" and "mm0" as clobbers should become: + * : MMX_CLOBBERS("mm0",) "eax" + * and a list with only "mm0" should become: + * MMX_CLOBBERS_ONLY("mm0") + */ +#if HAVE_MMX_CLOBBERS +# define MMX_CLOBBERS(...) __VA_ARGS__ +# define MMX_CLOBBERS_ONLY(...) : __VA_ARGS__ +#else +# define MMX_CLOBBERS(...) +# define MMX_CLOBBERS_ONLY(...) +#endif + /* * If gcc is not set to support sse (-msse) it will not accept xmm registers * in the clobber list for inline asm. XMM_CLOBBERS takes a list of xmm -- 2.17.1