From patchwork Thu Nov 9 18:34:53 2023
X-Patchwork-Id: 44602
From: Rémi Denis-Courmont
To: ffmpeg-devel@ffmpeg.org
Date: Thu, 9 Nov 2023 20:34:53 +0200
Message-ID: <20231109183453.12390-2-remi@remlab.net>
In-Reply-To: <20231109183453.12390-1-remi@remlab.net>
References: <20231109183453.12390-1-remi@remlab.net>
Subject: [FFmpeg-devel] [PATCH 2/2] sws/rgb2rgb: fix unaligned accesses in R-V V YUYV to I422p

In my personal opinion, we should not need to support
unaligned YUY2 pixel maps. They should always be aligned to at least 32
bits, and the current code assumes just 16 bits. However, checkasm does
test for unaligned input bitmaps. QEMU accepts it, but real hardware
does not.

In this particular case, we can at the same time improve performance and
handle unaligned inputs, so do just that.

uyvytoyuv422_c:       104060.0
uyvytoyuv422_rvv_i32:  25284.0 (before)
uyvytoyuv422_rvv_i32:  20148.2 (after)
---
 libswscale/riscv/rgb2rgb_rvv.S | 45 +++++++++++++++++-----------------
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/libswscale/riscv/rgb2rgb_rvv.S b/libswscale/riscv/rgb2rgb_rvv.S
index 172f5918dc..716948dc82 100644
--- a/libswscale/riscv/rgb2rgb_rvv.S
+++ b/libswscale/riscv/rgb2rgb_rvv.S
@@ -126,32 +126,33 @@ func ff_deinterleave_bytes_rvv, zve32x
         ret
 endfunc
 
-.macro  yuy2_to_i422p y_shift
-        slli    t4, a4, 1 // pixel width -> (source) byte width
+.macro  yuy2_to_i422p luma, chroma
+        srai    t4, a4, 1 // pixel width -> chroma width
         lw      t6, (sp)
+        slli    t5, a4, 1 // pixel width -> (source) byte width
         sub     a6, a6, a4
-        srai    a4, a4, 1 // pixel width -> chroma width
-        sub     a7, a7, a4
-        sub     t6, t6, t4
+        sub     a7, a7, t4
+        sub     t6, t6, t5
 1:
         mv      t4, a4
         addi    a5, a5, -1
 2:
-        vsetvli     t5, t4, e8, m2, ta, ma
-        vlseg2e16.v v16, (a3)
-        sub         t4, t4, t5
-        vnsrl.wi    v24, v16, \y_shift // Y0
-        sh2add      a3, t5, a3
-        vnsrl.wi    v26, v20, \y_shift // Y1
-        vnsrl.wi    v28, v16, 8 - \y_shift // U
-        vnsrl.wi    v30, v20, 8 - \y_shift // V
-        vsseg2e8.v  v24, (a0)
-        sh1add      a0, t5, a0
-        vse8.v      v28, (a1)
-        add         a1, t5, a1
-        vse8.v      v30, (a2)
-        add         a2, t5, a2
-        bnez        t4, 2b
+        vsetvli     t5, t4, e8, m4, ta, ma
+        vlseg2e8.v  v16, (a3)
+        srli        t1, t5, 1
+        vsetvli     zero, t1, e8, m2, ta, ma
+        vnsrl.wi    v24, \chroma, 0 // U
+        sub         t4, t4, t5
+        vnsrl.wi    v28, \chroma, 8 // V
+        sh1add      a3, t5, a3
+        vse8.v      v24, (a1)
+        add         a1, t1, a1
+        vse8.v      v28, (a2)
+        add         a2, t1, a2
+        vsetvli     zero, t5, e8, m4, ta, ma
+        vse8.v      \luma, (a0)
+        add         a0, t5, a0
+        bnez        t4, 2b
 
         add     a3, a3, t6
         add     a0, a0, a6
@@ -163,9 +164,9 @@ endfunc
 .endm
 
 func ff_uyvytoyuv422_rvv, zve32x
-        yuy2_to_i422p 8
+        yuy2_to_i422p v20, v16
 endfunc
 
 func ff_yuyvtoyuv422_rvv, zve32x
-        yuy2_to_i422p 0
+        yuy2_to_i422p v16, v20
 endfunc
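
For reference, the byte-wise deinterleave that the new loop performs for
UYVY can be modelled in scalar C roughly as below. This is only a sketch
for illustration, not the actual uyvytoyuv422_c code: the function name
uyvy_row_to_i422p and its single-row signature are hypothetical, and the
width is assumed to be even. Since every access is a single byte, the
source pointer needs no particular alignment, which is the property the
switch from vlseg2e16.v to vlseg2e8.v is after.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical scalar model of one UYVY row (illustration only).
 * width is in pixels and assumed even; src holds 2 * width bytes.
 * Only byte loads are used, so src may sit at any address.
 */
static void uyvy_row_to_i422p(uint8_t *y, uint8_t *u, uint8_t *v,
                              const uint8_t *src, size_t width)
{
    for (size_t i = 0; i < width / 2; i++) {
        /* Even source bytes carry chroma, alternating U and V. */
        u[i]         = src[4 * i + 0];
        v[i]         = src[4 * i + 2];
        /* Odd source bytes carry luma. */
        y[2 * i + 0] = src[4 * i + 1];
        y[2 * i + 1] = src[4 * i + 3];
    }
}
```

For YUYV the roles of the even and odd bytes are swapped, which matches
the swapped luma and chroma arguments of the two macro invocations.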