From patchwork Mon Mar 25 15:02:26 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?q?Martin_Storsj=C3=B6?=
X-Patchwork-Id: 47431
From: =?utf-8?q?Martin_Storsj=C3=B6?=
To: ffmpeg-devel@ffmpeg.org
Date: Mon, 25 Mar 2024 17:02:26 +0200
Message-Id: <20240325150243.59058-5-martin@martin.st>
X-Mailer: git-send-email 2.39.3 (Apple Git-146)
In-Reply-To: <20240325150243.59058-1-martin@martin.st>
References: <20240325150243.59058-1-martin@martin.st>
Subject: [FFmpeg-devel] [PATCH 04/21] aarch64: hevc: Specialize put_hevc_\type\()_h*_8_neon for horizontal looping
Cc: Logan Lyu , "J . Dekker"

For widths of 32 pixels and more, loop first horizontally,
then vertically.

Previously, this function would process a 16 pixel wide slice
of the block, looping vertically. After processing the whole
height, it would backtrack and process the next 16 pixel wide
slice.

When doing 8tap filtering horizontally, the function must load
7 more pixels (in practice, 8) following the actual inputs, and
this was done for each slice.

By iterating first horizontally throughout each line, then
vertically, we access data in a more cache friendly order, and
we don't need to reload data unnecessarily.

Keep the original order in put_hevc_\type\()_h12_8_neon; the
only suboptimal case there is for width=24. But specializing
an optimal variant for that would require more code, which
might not be worth it.

For the h16 case, this implementation would give a slowdown,
as it now loads the first 8 pixels separately from the rest, but
for larger widths, it is a gain.
Therefore, keep the h16 case
as it was (but remove the outer loop), and create a new specialized
version for horizontal looping with 16 pixels at a time.

Before:                    Cortex A53      A72      A73  Graviton 3
put_hevc_qpel_h16_8_neon:       710.5    667.7    692.5       211.0
put_hevc_qpel_h32_8_neon:      2791.5   2643.5   2732.0       883.5
put_hevc_qpel_h64_8_neon:     10954.0  10657.0  10874.2      3241.5
After:
put_hevc_qpel_h16_8_neon:       697.5    663.5    705.7       212.5
put_hevc_qpel_h32_8_neon:      2767.2   2684.5   2791.2       920.5
put_hevc_qpel_h64_8_neon:     10559.2  10471.5  10932.2      3051.7
---
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  20 +++--
 libavcodec/aarch64/hevcdsp_qpel_neon.S    | 103 +++++++++++++++++-----
 2 files changed, 94 insertions(+), 29 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index d2f2a3681f..1e9f5e32db 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -109,6 +109,8 @@ void ff_hevc_put_hevc_qpel_h12_8_neon(int16_t *dst, const uint8_t *_src, ptrdiff
                                       intptr_t mx, intptr_t my, int width);
 void ff_hevc_put_hevc_qpel_h16_8_neon(int16_t *dst, const uint8_t *_src, ptrdiff_t _srcstride, int height,
                                       intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_h32_8_neon(int16_t *dst, const uint8_t *_src, ptrdiff_t _srcstride, int height,
+                                      intptr_t mx, intptr_t my, int width);
 void ff_hevc_put_hevc_qpel_uni_h4_8_neon(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
                                          ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t
                                          my, int width);
@@ -124,6 +126,9 @@ void ff_hevc_put_hevc_qpel_uni_h12_8_neon(uint8_t *_dst, ptrdiff_t _dststride, c
 void ff_hevc_put_hevc_qpel_uni_h16_8_neon(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
                                           ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t
                                           my, int width);
+void ff_hevc_put_hevc_qpel_uni_h32_8_neon(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
+                                          ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t
+                                          my, int width);
 void ff_hevc_put_hevc_qpel_bi_h4_8_neon(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
                                         ptrdiff_t _srcstride, const int16_t *src2, int height, intptr_t
                                         mx, intptr_t my, int width);
@@ -139,6 +144,9 @@ void ff_hevc_put_hevc_qpel_bi_h12_8_neon(uint8_t *_dst, ptrdiff_t _dststride, co
 void ff_hevc_put_hevc_qpel_bi_h16_8_neon(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
                                          ptrdiff_t _srcstride, const int16_t *src2, int height, intptr_t
                                          mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_bi_h32_8_neon(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
+                                         ptrdiff_t _srcstride, const int16_t *src2, int height, intptr_t
+                                         mx, intptr_t my, int width);
 
 #define NEON8_FNPROTO(fn, args, ext) \
     void ff_hevc_put_hevc_##fn##4_8_neon##ext args; \
@@ -335,28 +343,28 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
         c->put_hevc_qpel[3][0][1] = ff_hevc_put_hevc_qpel_h8_8_neon;
         c->put_hevc_qpel[4][0][1] =
         c->put_hevc_qpel[6][0][1] = ff_hevc_put_hevc_qpel_h12_8_neon;
-        c->put_hevc_qpel[5][0][1] =
+        c->put_hevc_qpel[5][0][1] = ff_hevc_put_hevc_qpel_h16_8_neon;
         c->put_hevc_qpel[7][0][1] =
         c->put_hevc_qpel[8][0][1] =
-        c->put_hevc_qpel[9][0][1] = ff_hevc_put_hevc_qpel_h16_8_neon;
+        c->put_hevc_qpel[9][0][1] = ff_hevc_put_hevc_qpel_h32_8_neon;
         c->put_hevc_qpel_uni[1][0][1] = ff_hevc_put_hevc_qpel_uni_h4_8_neon;
         c->put_hevc_qpel_uni[2][0][1] = ff_hevc_put_hevc_qpel_uni_h6_8_neon;
         c->put_hevc_qpel_uni[3][0][1] = ff_hevc_put_hevc_qpel_uni_h8_8_neon;
         c->put_hevc_qpel_uni[4][0][1] =
         c->put_hevc_qpel_uni[6][0][1] = ff_hevc_put_hevc_qpel_uni_h12_8_neon;
-        c->put_hevc_qpel_uni[5][0][1] =
+        c->put_hevc_qpel_uni[5][0][1] = ff_hevc_put_hevc_qpel_uni_h16_8_neon;
         c->put_hevc_qpel_uni[7][0][1] =
         c->put_hevc_qpel_uni[8][0][1] =
-        c->put_hevc_qpel_uni[9][0][1] = ff_hevc_put_hevc_qpel_uni_h16_8_neon;
+        c->put_hevc_qpel_uni[9][0][1] = ff_hevc_put_hevc_qpel_uni_h32_8_neon;
         c->put_hevc_qpel_bi[1][0][1] = ff_hevc_put_hevc_qpel_bi_h4_8_neon;
         c->put_hevc_qpel_bi[2][0][1] = ff_hevc_put_hevc_qpel_bi_h6_8_neon;
         c->put_hevc_qpel_bi[3][0][1] = ff_hevc_put_hevc_qpel_bi_h8_8_neon;
         c->put_hevc_qpel_bi[4][0][1] =
         c->put_hevc_qpel_bi[6][0][1] = ff_hevc_put_hevc_qpel_bi_h12_8_neon;
-        c->put_hevc_qpel_bi[5][0][1] =
+        c->put_hevc_qpel_bi[5][0][1] = ff_hevc_put_hevc_qpel_bi_h16_8_neon;
         c->put_hevc_qpel_bi[7][0][1] =
         c->put_hevc_qpel_bi[8][0][1] =
-        c->put_hevc_qpel_bi[9][0][1] = ff_hevc_put_hevc_qpel_bi_h16_8_neon;
+        c->put_hevc_qpel_bi[9][0][1] = ff_hevc_put_hevc_qpel_bi_h32_8_neon;
 
         NEON8_FNASSIGN(c->put_hevc_epel, 0, 0, pel_pixels,);
         NEON8_FNASSIGN(c->put_hevc_epel, 1, 0, epel_v,);
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 432558bb95..0fcded344b 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -383,11 +383,9 @@ endfunc
 
 .ifc \type, qpel
 function ff_hevc_put_hevc_h16_8_neon, export=0
-        uxtl            v16.8h, v16.8b
         uxtl            v17.8h, v17.8b
         uxtl            v18.8h, v18.8b
 
-        uxtl            v19.8h, v19.8b
         uxtl            v20.8h, v20.8b
         uxtl            v21.8h, v21.8b
 
@@ -408,7 +406,6 @@ function ff_hevc_put_hevc_h16_8_neon, export=0
         mla             v28.8h, v24.8h, v0.h[\i]
         mla             v29.8h, v25.8h, v0.h[\i]
 .endr
-        subs            x9, x9, #2
         ret
 endfunc
 .endif
@@ -439,7 +436,10 @@ function ff_hevc_put_hevc_\type\()_h12_8_neon, export=1
 1:      ld1             {v16.8b-v18.8b}, [src], x13
         ld1             {v19.8b-v21.8b}, [x12], x13
 
+        uxtl            v16.8h, v16.8b
+        uxtl            v19.8h, v19.8b
         bl              ff_hevc_put_hevc_h16_8_neon
+        subs            x9, x9, #2
 
 .ifc \type, qpel
         st1             {v26.8h}, [dst], #16
@@ -504,7 +504,6 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1
 .ifc \type, qpel_bi
         ldrh            w8, [sp] // width
         mov             x16, #(MAX_PB_SIZE << 2) // src2bstridel
-        lsl             x17, x5, #7 // src2b reset
         add             x15, x4, #(MAX_PB_SIZE << 1) // src2b
 .endif
         sub             src, src, #3
@@ -519,11 +518,14 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1
 .endif
         add             x10, dst, dststride // dstb
         add             x12, src, srcstride // srcb
-0:      mov             x9, height
+
 1:      ld1             {v16.8b-v18.8b}, [src], x13
         ld1             {v19.8b-v21.8b}, [x12], x13
 
+        uxtl            v16.8h, v16.8b
+        uxtl            v19.8h, v19.8b
         bl              ff_hevc_put_hevc_h16_8_neon
+        subs            height, height, #2
 
 .ifc \type, qpel
         st1             {v26.8h, v27.8h}, [dst], x14
@@ -550,28 +552,83 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1
         st1             {v28.8b, v29.8b}, [x10], x14
 .endif
         b.gt            1b // double line
-        subs            width, width, #16
-        // reset src
-        msub            src, srcstride, height, src
-        msub            x12, srcstride, height, x12
-        // reset dst
-        msub            dst, dststride, height, dst
-        msub            x10, dststride, height, x10
+        ret             mx
+endfunc
+
+function ff_hevc_put_hevc_\type\()_h32_8_neon, export=1
+        load_filter     mx
+        sxtw            height, heightw
+        mov             mx, x30
 .ifc \type, qpel_bi
-        // reset xsrc
-        sub             x4, x4, x17
-        sub             x15, x15, x17
-        add             x4, x4, #32
-        add             x15, x15, #32
+        ldrh            w8, [sp] // width
+        mov             x16, #(MAX_PB_SIZE << 2) // src2bstridel
+        lsl             x17, x5, #7 // src2b reset
+        add             x15, x4, #(MAX_PB_SIZE << 1) // src2b
+        sub             x16, x16, width, uxtw #1
 .endif
-        add             src, src, #16
-        add             x12, x12, #16
+        sub             src, src, #3
+        mov             mx, x30
+.ifc \type, qpel
+        mov             dststride, #(MAX_PB_SIZE << 1)
+        lsl             x13, srcstride, #1 // srcstridel
+        mov             x14, #(MAX_PB_SIZE << 2)
+        sub             x14, x14, width, uxtw #1
+.else
+        lsl             x14, dststride, #1 // dststridel
+        lsl             x13, srcstride, #1 // srcstridel
+        sub             x14, x14, width, uxtw
+.endif
+        sub             x13, x13, width, uxtw
+        sub             x13, x13, #8
+        add             x10, dst, dststride // dstb
+        add             x12, src, srcstride // srcb
+0:      mov             w9, width
+        ld1             {v16.8b}, [src], #8
+        ld1             {v19.8b}, [x12], #8
+        uxtl            v16.8h, v16.8b
+        uxtl            v19.8h, v19.8b
+1:
+        ld1             {v17.8b-v18.8b}, [src], #16
+        ld1             {v20.8b-v21.8b}, [x12], #16
+
+        bl              ff_hevc_put_hevc_h16_8_neon
+        subs            w9, w9, #16
+
+        mov             v16.16b, v18.16b
+        mov             v19.16b, v21.16b
 .ifc \type, qpel
-        add             dst, dst, #32
-        add             x10, x10, #32
+        st1             {v26.8h, v27.8h}, [dst], #32
+        st1             {v28.8h, v29.8h}, [x10], #32
+.else
+.ifc \type, qpel_bi
+        ld1             {v20.8h, v21.8h}, [ x4], #32
+        ld1             {v22.8h, v23.8h}, [x15], #32
+        sqadd           v26.8h, v26.8h, v20.8h
+        sqadd           v27.8h, v27.8h, v21.8h
+        sqadd           v28.8h, v28.8h, v22.8h
+        sqadd           v29.8h, v29.8h, v23.8h
+        sqrshrun        v26.8b, v26.8h, #7
+        sqrshrun        v27.8b, v27.8h, #7
+        sqrshrun        v28.8b, v28.8h, #7
+        sqrshrun        v29.8b, v29.8h, #7
 .else
-        add             dst, dst, #16
-        add             x10, x10, #16
+        sqrshrun        v26.8b, v26.8h, #6
+        sqrshrun        v27.8b, v27.8h, #6
+        sqrshrun        v28.8b, v28.8h, #6
+        sqrshrun        v29.8b, v29.8h, #6
+.endif
+        st1             {v26.8b, v27.8b}, [dst], #16
+        st1             {v28.8b, v29.8b}, [x10], #16
+.endif
+        b.gt            1b // double line
+        subs            height, height, #2
+        add             src, src, x13
+        add             x12, x12, x13
+        add             dst, dst, x14
+        add             x10, x10, x14
+.ifc \type, qpel_bi
+        add             x4, x4, x16
+        add             x15, x15, x16
 .endif
         b.gt            0b
         ret             mx
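
As a rough illustration of the reloading argument in the commit message (not part of the patch), the C sketch below counts how many source bytes each loop order loads for one block. The function names are hypothetical. The per-step byte counts are read off the ld1 instructions in the diff: 24 bytes per 16-pixel step in the old slice-first order, and 8 leading bytes per row plus 16 bytes per step in the new row-first order. The sketch assumes widths that are multiples of 16 and ignores that the assembly handles two rows per iteration, which does not change the per-row totals.

```c
#include <assert.h>
#include <stddef.h>

/* Old order: for each 16-pixel-wide slice, walk all rows.
 * Every step loads 16 input bytes plus 8 bytes of filter overlap. */
size_t bytes_loaded_slice_first(int width, int height)
{
    assert(width > 0 && width % 16 == 0 && height > 0);
    size_t bytes = 0;
    for (int x = 0; x < width; x += 16)     /* outer loop: vertical slices */
        for (int y = 0; y < height; y++)    /* inner loop: rows */
            bytes += 24;
    return bytes;
}

/* New order: for each row, walk all 16-pixel steps.
 * The leading 8 bytes are loaded once per row; each step then loads
 * 16 new bytes and reuses the last 8 bytes of the previous step. */
size_t bytes_loaded_row_first(int width, int height)
{
    assert(width > 0 && width % 16 == 0 && height > 0);
    size_t bytes = 0;
    for (int y = 0; y < height; y++) {      /* outer loop: rows */
        bytes += 8;
        for (int x = 0; x < width; x += 16) /* inner loop: 16-pixel steps */
            bytes += 16;
    }
    return bytes;
}
```

Under these assumptions a 64x64 block loads 6144 bytes in the slice-first order and 4608 bytes in the row-first order, while a 16-pixel-wide block loads 24 bytes per row either way, so this model shows no saving at width 16.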