From patchwork Tue Jan 10 19:39:03 2023
X-Patchwork-Submitter: "Helmrich, Christian"
X-Patchwork-Id: 39954
From: "Helmrich, Christian"
To: "ffmpeg-devel@ffmpeg.org"
Cc: "Stoffers, Christian"
Date: Tue, 10 Jan 2023 19:39:03 +0000
Subject: [FFmpeg-devel] [PATCH] Request for adding XPSNR avfilter
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit

Hi,

please find attached a patch (relative to FFmpeg master as of early January 10, 2023) adding avfilter support for extended perceptually weighted peak signal-to-noise ratio (XPSNR) measurements for videos, as described in the related addition to filters.texi. The XPSNR code was originally vectorized using SIMD intrinsics, but we concluded that FFmpeg code requires asm instead of such intrinsics, so we let gcc auto-convert these instructions to pure assembly; see the vf_xpsnr.asm file. If the added asm code is too lengthy, if intrinsics would be possible instead, or if something else is missing, please let us know.
Best, Christian Helmrich and Christian Stoffers Fraunhofer HHI diff --git a/Changelog b/Changelog index 179f63c7d5..35f6b6e64f 100644 --- a/Changelog +++ b/Changelog @@ -28,6 +28,7 @@ version : - showcwt multimedia filter - corr video filter - adrc audio filter +- XPSNR video filter version 5.1: diff --git a/doc/filters.texi b/doc/filters.texi index 57088ccc6c..ab725c506f 100644 --- a/doc/filters.texi +++ b/doc/filters.texi @@ -19141,6 +19141,7 @@ pseudocolor="'if(between(val,ymax,amax),lerp(ymin,ymax,(val-ymax)/(amax-ymax)),- @end example @end itemize +@anchor{psnr} @section psnr Obtain the average, maximum and minimum PSNR (Peak Signal to Noise @@ -24820,6 +24821,72 @@ minimum values, and @code{1} maximum values. This filter supports all above options as @ref{commands}, excluding option @code{inputs}. +@section xpsnr + +Obtain the average (across all input frames) and minimum (across all color plane averages) +eXtended Perceptually weighted peak Signal-to-Noise Ratio (XPSNR) between two input videos. + +The XPSNR is a low-complexity psychovisually motivated distortion measurement algorithm for +assessing the difference between two video streams or images. This is especially useful for +objectively quantifying the distortions caused by video and image codecs, as an alternative +to a formal subjective test. The logarithmic XPSNR output values are in a similar range as +those of traditional @ref{psnr} assessments but better reflect human impressions of visual +coding quality. More details on the XPSNR measure, which essentially represents a blockwise +weighted variant of the PSNR measure, can be found in the following freely available papers: + +@itemize +@item +C. R. Helmrich, M. Siekmann, S. Becker, S. Bosse, D. Marpe, and T. Wiegand, "XPSNR: A +Low-Complexity Extension of the Perceptually Weighted Peak Signal-to-Noise Ratio for +High-Resolution Video Quality Assessment," in Proc. IEEE Int. Conf. Acoustics, Speech, +Sig. Process. (ICASSP), virt./online, May 2020. 
@url{www.ecodis.de/xpsnr.htm} + +@item +C. R. Helmrich, S. Bosse, H. Schwarz, D. Marpe, and T. Wiegand, "A Study of the +Extended Perceptually Weighted Peak Signal-to-Noise Ratio (XPSNR) for Video Compression +with Different Resolutions and Bit Depths," ITU Journal: ICT Discoveries, vol. 3, no. +1, pp. 65 - 72, May 2020. @url{http://handle.itu.int/11.1002/pub/8153d78b-en} +@end itemize + +When publishing the results of XPSNR assessments obtained using, e.g., this FFmpeg filter, a +reference to the above papers as a means of documentation is strongly encouraged. The filter +requires two input videos. The first input is considered a (usually not distorted) reference +source and is passed unchanged to the output, whereas the second input is a (distorted) test +signal. Except for the bit depth, these two video inputs must have the same pixel format. In +addition, for best performance, both compared input videos should be in YCbCr color format. + +The obtained overall XPSNR values mentioned above are printed through the logging system. In +case of input with multiple color planes, we suggest reporting of the minimum XPSNR average. + +The following parameter, which behaves like the one for the @ref{psnr} filter, is accepted: + +@table @option +@item stats_file, f +If specified, the filter will use the named file to save the XPSNR value of each individual +frame and color plane. When the file name equals "-", that data is sent to standard output. +@end table + +This filter also supports the @ref{framesync} options. 
+ +@subsection Examples +@itemize +@item +XPSNR analysis of two 1080p HD videos, ref_source.yuv and test_video.yuv, both at 24 frames +per second, with color format 4:2:0, bit depth 8, and output of a logfile named "xpsnr.log": +@example +ffmpeg -s 1920x1080 -framerate 24 -pix_fmt yuv420p -i ref_source.yuv -s 1920x1080 -framerate +24 -pix_fmt yuv420p -i test_video.yuv -lavfi xpsnr="stats_file=xpsnr.log" -f null - +@end example + +@item +XPSNR analysis of two 2160p UHD videos, ref_source.yuv with bit depth 8 and test_video.yuv +with bit depth 10, both at 60 frames per second with color format 4:2:0, no logfile output: +@example +ffmpeg -s 3840x2160 -framerate 60 -pix_fmt yuv420p -i ref_source.yuv -s 3840x2160 -framerate +60 -pix_fmt yuv420p10le -i test_video.yuv -lavfi xpsnr="stats_file=-" -f null - +@end example +@end itemize + @section xstack Stack video inputs into custom layout. diff --git a/libavfilter/Makefile b/libavfilter/Makefile index 5783be281d..14ba19fa4e 100644 --- a/libavfilter/Makefile +++ b/libavfilter/Makefile @@ -544,6 +544,7 @@ OBJS-$(CONFIG_XCORRELATE_FILTER) += vf_convolve.o framesync.o OBJS-$(CONFIG_XFADE_FILTER) += vf_xfade.o OBJS-$(CONFIG_XFADE_OPENCL_FILTER) += vf_xfade_opencl.o opencl.o opencl/xfade.o OBJS-$(CONFIG_XMEDIAN_FILTER) += vf_xmedian.o framesync.o +OBJS-$(CONFIG_XPSNR_FILTER) += vf_xpsnr.o framesync.o OBJS-$(CONFIG_XSTACK_FILTER) += vf_stack.o framesync.o OBJS-$(CONFIG_YADIF_FILTER) += vf_yadif.o yadif_common.o OBJS-$(CONFIG_YADIF_CUDA_FILTER) += vf_yadif_cuda.o vf_yadif_cuda.ptx.o \ diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c index 52741b60e4..3b93a9af06 100644 --- a/libavfilter/allfilters.c +++ b/libavfilter/allfilters.c @@ -513,6 +513,7 @@ extern const AVFilter ff_vf_xcorrelate; extern const AVFilter ff_vf_xfade; extern const AVFilter ff_vf_xfade_opencl; extern const AVFilter ff_vf_xmedian; +extern const AVFilter ff_vf_xpsnr; extern const AVFilter ff_vf_xstack; extern const AVFilter ff_vf_yadif; extern 
const AVFilter ff_vf_yadif_cuda; diff --git a/libavfilter/version.h b/libavfilter/version.h index 9fabc544b5..a56ba3bb6d 100644 --- a/libavfilter/version.h +++ b/libavfilter/version.h @@ -31,7 +31,7 @@ #include "version_major.h" -#define LIBAVFILTER_VERSION_MINOR 53 +#define LIBAVFILTER_VERSION_MINOR 54 #define LIBAVFILTER_VERSION_MICRO 100 diff --git a/libavfilter/x86/Makefile b/libavfilter/x86/Makefile index e87481bd7a..641b1f740f 100644 --- a/libavfilter/x86/Makefile +++ b/libavfilter/x86/Makefile @@ -38,6 +38,7 @@ OBJS-$(CONFIG_TRANSPOSE_FILTER) += x86/vf_transpose_init.o OBJS-$(CONFIG_VOLUME_FILTER) += x86/af_volume_init.o OBJS-$(CONFIG_V360_FILTER) += x86/vf_v360_init.o OBJS-$(CONFIG_W3FDIF_FILTER) += x86/vf_w3fdif_init.o +OBJS-$(CONFIG_XPSNR_FILTER) += x86/vf_xpsnr_init.o OBJS-$(CONFIG_YADIF_FILTER) += x86/vf_yadif_init.o X86ASM-OBJS-$(CONFIG_SCENE_SAD) += x86/scene_sad.o @@ -80,4 +81,5 @@ X86ASM-OBJS-$(CONFIG_TRANSPOSE_FILTER) += x86/vf_transpose.o X86ASM-OBJS-$(CONFIG_VOLUME_FILTER) += x86/af_volume.o X86ASM-OBJS-$(CONFIG_V360_FILTER) += x86/vf_v360.o X86ASM-OBJS-$(CONFIG_W3FDIF_FILTER) += x86/vf_w3fdif.o +X86ASM-OBJS-$(CONFIG_XPSNR_FILTER) += x86/vf_xpsnr.o X86ASM-OBJS-$(CONFIG_YADIF_FILTER) += x86/vf_yadif.o x86/yadif-16.o x86/yadif-10.o diff --git a/libavfilter/vf_xpsnr.c b/libavfilter/vf_xpsnr.c new file mode 100644 index 0000000000..5b6a47aa69 --- /dev/null +++ b/libavfilter/vf_xpsnr.c @@ -0,0 +1,832 @@ +/* + * Copyright (c) 2023 Christian R. Helmrich + * Copyright (c) 2023 Christian Stoffers + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. 
+ * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +/** + * @file + * Calculate the extended perceptually weighted PSNR (XPSNR) between two input videos. + * + * Authors: Christian Helmrich and Christian Stoffers, Fraunhofer HHI, Berlin, Germany + */ + +#include +#include "libavutil/avstring.h" +#include "libavutil/file_open.h" +#include "libavutil/opt.h" +#include "libavutil/pixdesc.h" +#include "avfilter.h" +#include "drawutils.h" +#include "framesync.h" +#include "internal.h" +#include "xpsnr.h" + +/* XPSNR structure definition */ + +typedef struct XPSNRContext +{ + /* required basic variables */ + const AVClass *class; + int bpp; /* unpacked */ + int depth; /* packed */ + char comps[4]; + int num_comps; + uint64_t num_frames_64; + unsigned frame_rate; + FFFrameSync fs; + int line_sizes[4]; + int plane_height[4]; + int plane_width[4]; + uint8_t rgba_map[4]; + FILE *stats_file; + char *stats_file_str; + /* XPSNR specific variables */ + double *sse_luma; + double *weights; + AVBufferRef* buf_org [3]; + AVBufferRef* buf_org_m1[3]; + AVBufferRef* buf_org_m2[3]; + AVBufferRef* buf_rec [3]; + uint64_t max_error_64; + double sum_wdist [3]; + double sum_xpsnr [3]; + bool and_is_inf[3]; + bool is_rgb; + PSNRDSPContext dsp; +} +XPSNRContext; + +/* required macro definitions */ + +#define FLAGS AV_OPT_FLAG_FILTERING_PARAM | AV_OPT_FLAG_VIDEO_PARAM +#ifndef MAX +#define MAX(a, b) (((a) > (b)) ? 
(a) : (b)) +#endif +#define OFFSET(x) offsetof(XPSNRContext, x) +#define XPSNR_GAMMA 2 + +static const AVOption xpsnr_options[] = +{ + {"stats_file", "Set file where to store per-frame XPSNR information", OFFSET (stats_file_str), AV_OPT_TYPE_STRING, {.str = NULL}, 0, 0, FLAGS}, + {"f", "Set file where to store per-frame XPSNR information", OFFSET (stats_file_str), AV_OPT_TYPE_STRING, {.str = NULL}, 0, 0, FLAGS}, + { NULL } +}; + +FRAMESYNC_DEFINE_CLASS (xpsnr, XPSNRContext, fs); + +/* XPSNR function definitions */ + +static uint64_t highds (const int x_act, const int y_act, const int w_act, const int h_act, const int16_t *o_m0, const int o) +{ + uint64_t sa_act = 0; + + for (int y = y_act; y < h_act; y += 2) + { + for (int x = x_act; x < w_act; x += 2) + { + const int f = 12 * ((int)o_m0[ y *o + x ] + (int)o_m0[ y *o + x+1] + (int)o_m0[(y+1)*o + x ] + (int)o_m0[(y+1)*o + x+1]) + - 3 * ((int)o_m0[(y-1)*o + x ] + (int)o_m0[(y-1)*o + x+1] + (int)o_m0[(y+2)*o + x ] + (int)o_m0[(y+2)*o + x+1]) + - 3 * ((int)o_m0[ y *o + x-1] + (int)o_m0[ y *o + x+2] + (int)o_m0[(y+1)*o + x-1] + (int)o_m0[(y+1)*o + x+2]) + - 2 * ((int)o_m0[(y-1)*o + x-1] + (int)o_m0[(y-1)*o + x+2] + (int)o_m0[(y+2)*o + x-1] + (int)o_m0[(y+2)*o + x+2]) + - ((int)o_m0[(y-2)*o + x-1] + (int)o_m0[(y-2)*o + x ] + (int)o_m0[(y-2)*o + x+1] + (int)o_m0[(y-2)*o + x+2] + + (int)o_m0[(y+3)*o + x-1] + (int)o_m0[(y+3)*o + x ] + (int)o_m0[(y+3)*o + x+1] + (int)o_m0[(y+3)*o + x+2] + + (int)o_m0[(y-1)*o + x-2] + (int)o_m0[ y *o + x-2] + (int)o_m0[(y+1)*o + x-2] + (int)o_m0[(y+2)*o + x-2] + + (int)o_m0[(y-1)*o + x+3] + (int)o_m0[ y *o + x+3] + (int)o_m0[(y+1)*o + x+3] + (int)o_m0[(y+2)*o + x+3]); + sa_act += (uint64_t) abs(f); + } + } + return sa_act; +} + +static uint64_t diff1st (const uint32_t w_act, const uint32_t h_act, const int16_t *o_m0, int16_t *o_m1, const int o) +{ + uint64_t ta_act = 0; + + for (uint32_t y = 0; y < h_act; y += 2) + { + for (uint32_t x = 0; x < w_act; x += 2) + { + const int t = (int)o_m0[y*o 
+ x] + (int)o_m0[y*o + x+1] + (int)o_m0[(y+1)*o + x] + (int)o_m0[(y+1)*o + x+1] + - ((int)o_m1[y*o + x] + (int)o_m1[y*o + x+1] + (int)o_m1[(y+1)*o + x] + (int)o_m1[(y+1)*o + x+1]); + ta_act += (uint64_t) abs(t); + o_m1[y*o + x ] = o_m0[y*o + x ]; o_m1[(y+1)*o + x ] = o_m0[(y+1)*o + x ]; + o_m1[y*o + x+1] = o_m0[y*o + x+1]; o_m1[(y+1)*o + x+1] = o_m0[(y+1)*o + x+1]; + } + } + return (ta_act * XPSNR_GAMMA); +} + +static uint64_t diff2nd (const uint32_t w_act, const uint32_t h_act, const int16_t *o_m0, int16_t *o_m1, int16_t *o_m2, const int o) +{ + uint64_t ta_act = 0; + + for (uint32_t y = 0; y < h_act; y += 2) + { + for (uint32_t x = 0; x < w_act; x += 2) + { + const int t = (int)o_m0[y*o + x] + (int)o_m0[y*o + x+1] + (int)o_m0[(y+1)*o + x] + (int)o_m0[(y+1)*o + x+1] + - 2 * ((int)o_m1[y*o + x] + (int)o_m1[y*o + x+1] + (int)o_m1[(y+1)*o + x] + (int)o_m1[(y+1)*o + x+1]) + + (int)o_m2[y*o + x] + (int)o_m2[y*o + x+1] + (int)o_m2[(y+1)*o + x] + (int)o_m2[(y+1)*o + x+1]; + ta_act += (uint64_t) abs(t); + o_m2[y*o + x ] = o_m1[y*o + x ]; o_m2[(y+1)*o + x ] = o_m1[(y+1)*o + x ]; + o_m2[y*o + x+1] = o_m1[y*o + x+1]; o_m2[(y+1)*o + x+1] = o_m1[(y+1)*o + x+1]; + o_m1[y*o + x ] = o_m0[y*o + x ]; o_m1[(y+1)*o + x ] = o_m0[(y+1)*o + x ]; + o_m1[y*o + x+1] = o_m0[y*o + x+1]; o_m1[(y+1)*o + x+1] = o_m0[(y+1)*o + x+1]; + } + } + return (ta_act * XPSNR_GAMMA); +} + +static uint64_t sse_line_16bit (const uint8_t *blk_org8, const uint8_t *blk_rec8, int block_width) +{ + const uint16_t *blk_org = (const uint16_t*) blk_org8; + const uint16_t *blk_rec = (const uint16_t*) blk_rec8; + uint64_t sse = 0; /* sum for one pixel line */ + + for (int x = 0; x < block_width; x++) + { + const int64_t error = (int64_t) blk_org[x] - (int64_t) blk_rec[x]; + + sse += error * error; + } + + /* sum of squared errors for the pixel line */ + return sse; +} + +static inline uint64_t calc_squared_error(XPSNRContext const *s, + const int16_t *blk_org, const uint32_t stride_org, + const int16_t *blk_rec, const 
uint32_t stride_rec, + const uint32_t block_width, const uint32_t block_height) +{ + uint64_t sse = 0; /* sum of squared errors */ + + for (uint32_t y = 0; y < block_height; y++) + { + sse += s->dsp.sse_line ((const uint8_t*) blk_org, (const uint8_t*) blk_rec, (int) block_width); + blk_org += stride_org; + blk_rec += stride_rec; + } + + /* return nonweighted sum of squared errors */ + return sse; +} + +static inline double calc_squared_error_and_weight (XPSNRContext const *s, + const int16_t *pic_org, const uint32_t stride_org, + int16_t *pic_org_m1, int16_t *pic_org_m2, + const int16_t *pic_rec, const uint32_t stride_rec, + const uint32_t offset_x, const uint32_t offset_y, + const uint32_t block_width, const uint32_t block_height, + const uint32_t bit_depth, const uint32_t int_frame_rate, double *ms_act) +{ + const int o = (int) stride_org; + const int r = (int) stride_rec; + const int16_t *o_m0 = pic_org + offset_y*o + offset_x; + int16_t *o_m1 = pic_org_m1 + offset_y*o + offset_x; + int16_t *o_m2 = pic_org_m2 + offset_y*o + offset_x; + const int16_t *r_m0 = pic_rec + offset_y*r + offset_x; + const int b_val = (s->plane_width[0] * s->plane_height[0] > 2048 * 1152 ? 2 : 1); /* threshold is a bit more than HD resolution */ + const int x_act = (offset_x > 0 ? 0 : b_val); + const int y_act = (offset_y > 0 ? 0 : b_val); + const int w_act = (offset_x + block_width < (uint32_t) s->plane_width [0] ? (int) block_width : (int) block_width - b_val); + const int h_act = (offset_y + block_height < (uint32_t) s->plane_height[0] ? (int) block_height : (int) block_height - b_val); + + const double sse = (double) calc_squared_error (s, o_m0, stride_org, + r_m0, stride_rec, + block_width, block_height); + uint64_t sa_act = 0; /* spatial abs. activity */ + uint64_t ta_act = 0; /* temporal abs. 
activity */ + + if (w_act <= x_act || h_act <= y_act) /* small */ + { + return sse; + } + + if (b_val > 1) /* highpass with downsampling */ + { + if (w_act > 12) + { + sa_act = s->dsp.highds_func (x_act, y_act, w_act, h_act, o_m0, o); + } + else + { + sa_act = highds (x_act, y_act, w_act, h_act, o_m0, o); + } + } + else /* <= HD, highpass without downsampling */ + { + for (int y = y_act; y < h_act; y++) + { + for (int x = x_act; x < w_act; x++) + { + const int f = 12 * (int)o_m0[y*o + x] - 2 * ((int)o_m0[y*o + x-1] + (int)o_m0[y*o + x+1] + (int)o_m0[(y-1)*o + x] + (int)o_m0[(y+1)*o + x]) + - ((int)o_m0[(y-1)*o + x-1] + (int)o_m0[(y-1)*o + x+1] + (int)o_m0[(y+1)*o + x-1] + (int)o_m0[(y+1)*o + x+1]); + sa_act += (uint64_t) abs(f); + } + } + } + + /* calculate weights (mean squared activity) */ + *ms_act = (double) sa_act / ((double)(w_act - x_act) * (double)(h_act - y_act)); + + if (b_val > 1) /* highpass with downsampling */ + { + if (int_frame_rate < 32) /* 1st-order diff */ + { + ta_act = s->dsp.diff1st_func (block_width, block_height, o_m0, o_m1, o); + } + else /* 2nd-order diff (diff of two diffs) */ + { + ta_act = s->dsp.diff2nd_func (block_width, block_height, o_m0, o_m1, o_m2, o); + } + } + else /* <= HD, highpass without downsampling */ + { + if (int_frame_rate < 32) /* 1st-order diff */ + { + for (uint32_t y = 0; y < block_height; y++) + { + for (uint32_t x = 0; x < block_width; x++) + { + const int t = (int)o_m0[y*o + x] - (int)o_m1[y*o + x]; + + ta_act += XPSNR_GAMMA * (uint64_t) abs(t); + o_m1[y*o + x] = o_m0[y*o + x]; + } + } + } + else /* 2nd-order diff (diff of two diffs) */ + { + for (uint32_t y = 0; y < block_height; y++) + { + for (uint32_t x = 0; x < block_width; x++) + { + const int t = (int)o_m0[y*o + x] - 2 * (int)o_m1[y*o + x] + (int)o_m2[y*o + x]; + + ta_act += XPSNR_GAMMA * (uint64_t) abs(t); + o_m2[y*o + x] = o_m1[y*o + x]; + o_m1[y*o + x] = o_m0[y*o + x]; + } + } + } + } + + /* weight += mean squared temporal activity */ + *ms_act += (double) 
ta_act / ((double) block_width * (double) block_height); + + /* lower limit, accounts for high-pass gain */ + if (*ms_act < (double)(1 << (bit_depth - 6))) *ms_act = (double)(1 << (bit_depth - 6)); + + *ms_act *= *ms_act; /* since SSE is squared */ + + /* return nonweighted sum of squared errors */ + return sse; +} + +static inline double get_avg_xpsnr (const double sqrt_wsse_val, const double sum_xpsnr_val, + const uint32_t image_width, const uint32_t image_height, + const uint64_t max_error_64, const uint64_t num_frames_64) +{ + if (num_frames_64 == 0) return INFINITY; + + if (sqrt_wsse_val >= (double) num_frames_64) /* sq.-mean-root dist average */ + { + const double avg_dist = sqrt_wsse_val / (double) num_frames_64; + const uint64_t num64 = (uint64_t) image_width * (uint64_t) image_height * max_error_64; + + return 10.0 * log10 ((double) num64 / ((double) avg_dist * (double) avg_dist)); + } + + return sum_xpsnr_val / (double) num_frames_64; /* older log-domain average */ +} + +static int get_wsse (AVFilterContext *ctx, int16_t **org, int16_t **org_m1, int16_t **org_m2, int16_t **rec, uint64_t* const wsse64) +{ + XPSNRContext* const s = ctx->priv; + const uint32_t w = s->plane_width [0]; /* luma image width in pixels */ + const uint32_t h = s->plane_height[0];/* luma image height in pixels */ + const double r = (double)(w * h) / (3840.0 * 2160.0); /* UHD ratio */ + const uint32_t b = MAX (0, 4 * (int32_t)(32.0 * sqrt (r) + 0.5)); /* block size, integer multiple of 4 for SIMD */ + const uint32_t w_blk = (w + b - 1) / b; /* luma width in units of blocks */ + const double avg_act = sqrt (16.0 * (double)(1 << (2 * s->depth - 9)) / sqrt (MAX (0.00001, r))); /* = sqrt (a_pic) */ + const int* stride_org = (s->bpp == 1 ? 
s->plane_width : s->line_sizes); + uint32_t x, y, idx_blk = 0; /* the "16.0" above is due to fixed-point code */ + double* const sse_luma = s->sse_luma; + double* const weights = s->weights; + int c; + + if ((wsse64 == NULL) || (s->depth < 6) || (s->depth > 16) || (s->num_comps <= 0) || (s->num_comps > 3) || (w == 0) || (h == 0)) + { + av_log (ctx, AV_LOG_ERROR, "Error in XPSNR routine: invalid argument(s).\n"); + + return AVERROR (EINVAL); + } + + if ((weights == NULL) || (b >= 4 && sse_luma == NULL)) + { + av_log (ctx, AV_LOG_ERROR, "Failed to allocate temporary block memory.\n"); + + return AVERROR (ENOMEM); + } + + if (b >= 4) + { + const int16_t *p_org = org[0]; + const uint32_t s_org = stride_org[0] / s->bpp; + const int16_t *p_rec = rec[0]; + const uint32_t s_rec = s->plane_width[0]; + int16_t *p_org_m1 = org_m1[0]; /* pixel */ + int16_t *p_org_m2 = org_m2[0]; /* memory */ + double wsse_luma = 0.0; + + for (y = 0; y < h; y += b) /* calculate block SSE and perceptual weights */ + { + const uint32_t block_height = (y + b > h ? h - y : b); + + for (x = 0; x < w; x += b, idx_blk++) + { + const uint32_t block_width = (x + b > w ? w - x : b); + double ms_act = 1.0, ms_act_prev = 0.0; + + sse_luma[idx_blk] = calc_squared_error_and_weight(s, p_org, s_org, + p_org_m1, p_org_m2, + p_rec, s_rec, + x, y, + block_width, block_height, + s->depth, s->frame_rate, &ms_act); + weights[idx_blk] = 1.0 / sqrt (ms_act); + + if (w * h <= 640u * 480u) /* in-line "minimum-smoothing" as in paper */ + { + if (x == 0) /* first column */ + { + ms_act_prev = (idx_blk > 1 ? weights[idx_blk - 2] : 0); + } + else /* after first column */ + { + ms_act_prev = (x > b ? 
MAX (weights[idx_blk - 2], weights[idx_blk]) : weights[idx_blk]); + } + if (idx_blk > w_blk) /* after first row and first column */ + { + ms_act_prev = MAX (ms_act_prev, weights[idx_blk - 1 - w_blk]); /* min (left, top) */ + } + if ((idx_blk > 0) && (weights[idx_blk - 1] > ms_act_prev)) + { + weights[idx_blk - 1] = ms_act_prev; + } + if ((x + b >= w) && (y + b >= h) && (idx_blk > w_blk)) /* last block in picture */ + { + ms_act_prev = MAX (weights[idx_blk - 1], weights[idx_blk - w_blk]); + if (weights[idx_blk] > ms_act_prev) + { + weights[idx_blk] = ms_act_prev; + } + } + } + } /* for x */ + } /* for y */ + + for (y = idx_blk = 0; y < h; y += b) /* calculate sum for luma (Y) XPSNR */ + { + for (x = 0; x < w; x += b, idx_blk++) + { + wsse_luma += sse_luma[idx_blk] * weights[idx_blk]; + } + } + wsse64[0] = (wsse_luma <= 0.0 ? 0 : (uint64_t)(wsse_luma * avg_act + 0.5)); + } /* b >= 4 */ + + for (c = 0; c < s->num_comps; c++) /* finalize SSE data for all components */ + { + const int16_t *p_org = org[c]; + const uint32_t s_org = stride_org[c] / s->bpp; + const int16_t *p_rec = rec[c]; + const uint32_t s_rec = s->plane_width[c]; + const uint32_t w_pln = s->plane_width[c]; + const uint32_t h_pln = s->plane_height[c]; + + if (b < 4) /* picture is too small for XPSNR, calculate nonweighted PSNR */ + { + wsse64[c] = calc_squared_error (s, p_org, s_org, + p_rec, s_rec, + w_pln, h_pln); + } + else if (c > 0) /* b >= 4, so Y XPSNR has already been calculated above! */ + { + const uint32_t bx = (b * w_pln) / w; + const uint32_t by = (b * h_pln) / h; /* up to chroma downsampling by 4 */ + double wsse_chroma = 0.0; + + for (y = idx_blk = 0; y < h_pln; y += by) /* calc chroma (Cb/Cr) XPSNR */ + { + const uint32_t block_height = (y + by > h_pln ? h_pln - y : by); + + for (x = 0; x < w_pln; x += bx, idx_blk++) + { + const uint32_t block_width = (x + bx > w_pln ? 
w_pln - x : bx); + + wsse_chroma += (double) calc_squared_error (s, p_org + y*s_org + x, s_org, + p_rec + y*s_rec + x, s_rec, + block_width, block_height) * weights[idx_blk]; + } + } + wsse64[c] = (wsse_chroma <= 0.0 ? 0 : (uint64_t)(wsse_chroma * avg_act + 0.5)); + } + } /* for c */ + + return 0; +} + +static int do_xpsnr (FFFrameSync *fs) +{ + AVFilterContext *ctx = fs->parent; + XPSNRContext* const s = ctx->priv; + const uint32_t w = s->plane_width [0]; /* luma image width in pixels */ + const uint32_t h = s->plane_height[0]; /* luma image height in pixels */ + const uint32_t b = MAX (0, 4 * (int32_t)(32.0 * sqrt ((double)(w * h) / (3840.0 * 2160.0)) + 0.5)); /* block size */ + const uint32_t w_blk = (w + b - 1) / b; /* luma width in units of blocks */ + const uint32_t h_blk = (h + b - 1) / b; /* luma height in units of blocks */ + AVFrame *master, *ref = NULL; + int16_t *porg [3]; + int16_t *porg_m1[3]; + int16_t *porg_m2[3]; + int16_t *prec [3]; + uint64_t wsse64 [3] = {0, 0, 0}; + double cur_xpsnr[3] = {INFINITY, INFINITY, INFINITY}; + int c, ret_value; + + if ((ret_value = ff_framesync_dualinput_get (fs, &master, &ref)) < 0) return ret_value; + if (ref == NULL) return ff_filter_frame (ctx->outputs[0], master); + + /* prepare XPSNR calculations: allocate temporary picture and block memory */ + if (s->sse_luma == NULL) s->sse_luma = (double*) av_mallocz (w_blk * h_blk * sizeof (double)); + if (s->weights == NULL) s->weights = (double*) av_mallocz (w_blk * h_blk * sizeof (double)); + + for (c = 0; c < s->num_comps; c++) /* allocate temporal org buffer memory */ + { + s->line_sizes[c] = master->linesize[c]; + + if (c == 0) /* luma ch. */ + { + const int stride_org_bpp = (s->bpp == 1 ? 
s->plane_width[c] : s->line_sizes[c] / s->bpp); + + if (s->buf_org_m1[c] == NULL) s->buf_org_m1[c] = av_buffer_allocz (stride_org_bpp * s->plane_height[c] * sizeof (int16_t)); + if (s->buf_org_m2[c] == NULL) s->buf_org_m2[c] = av_buffer_allocz (stride_org_bpp * s->plane_height[c] * sizeof (int16_t)); + + porg_m1[c] = (int16_t*) s->buf_org_m1[c]->data; + porg_m2[c] = (int16_t*) s->buf_org_m2[c]->data; + } + } + + if (s->bpp == 1) /* 8 bit */ + { + for (c = 0; c < s->num_comps; c++) /* allocate the org/rec buffer memory */ + { + const int m = s->line_sizes[c]; /* master stride */ + const int o = s->plane_width[c]; /* XPSNR stride */ + + if (s->buf_org[c] == NULL) s->buf_org[c] = av_buffer_allocz (s->plane_width[c] * s->plane_height[c] * sizeof (int16_t)); + if (s->buf_rec[c] == NULL) s->buf_rec[c] = av_buffer_allocz (s->plane_width[c] * s->plane_height[c] * sizeof (int16_t)); + + porg[c] = (int16_t*) s->buf_org[c]->data; + prec[c] = (int16_t*) s->buf_rec[c]->data; + + for (int y = 0; y < s->plane_height[c]; y++) + { + for (int x = 0; x < s->plane_width[c]; x++) + { + porg[c][y*o + x] = (int16_t) master->data[c][y*m + x]; + prec[c][y*o + x] = (int16_t) ref->data[c][y*ref->linesize[c] + x]; /* index ref by its own stride */ + } + } + } + } + else /* 9 to 16 bit */ + { + for (c = 0; c < s->num_comps; c++) + { + porg[c] = (int16_t*) master->data[c]; + prec[c] = (int16_t*) ref->data[c]; + } + } + + /* extended perceptually weighted peak signal-to-noise ratio (XPSNR) value */ + + if ((ret_value = get_wsse (ctx, (int16_t **)&porg, (int16_t **)&porg_m1, (int16_t **)&porg_m2, (int16_t **)&prec, wsse64)) < 0) + { + return ret_value; /* an error here implies something went wrong earlier! 
*/ + } + + for (c = 0; c < s->num_comps; c++) + { + const double sqrt_wsse = sqrt ((double) wsse64[c]); + + cur_xpsnr[c] = get_avg_xpsnr (sqrt_wsse, INFINITY, + s->plane_width[c], s->plane_height[c], + s->max_error_64, 1 /* single frame */); + s->sum_wdist[c] += sqrt_wsse; + s->sum_xpsnr[c] += cur_xpsnr[c]; + s->and_is_inf[c] &= isinf (cur_xpsnr[c]); + } + s->num_frames_64++; + + if (s->stats_file) /* print out the frame and component-wise XPSNR average */ + { + fprintf (s->stats_file, "n: %4"PRIu64"", s->num_frames_64); + + for (c = 0; c < s->num_comps; c++) + { + fprintf (s->stats_file, " XPSNR %c: %3.4f", s->comps[c], cur_xpsnr[c]); + } + fprintf (s->stats_file, "\n"); + } + + return ff_filter_frame (ctx->outputs[0], master); +} + +static av_cold int init (AVFilterContext *ctx) +{ + XPSNRContext* const s = ctx->priv; + int c; + + if (s->stats_file_str) + { + if (!strcmp (s->stats_file_str, "-")) /* no statistics file, take stdout */ + { + s->stats_file = stdout; + } + else + { + s->stats_file = avpriv_fopen_utf8 (s->stats_file_str, "w"); + + if (s->stats_file == NULL) + { + const int err = AVERROR (errno); + char buf[128]; + + av_strerror (err, buf, sizeof (buf)); + av_log (ctx, AV_LOG_ERROR, "Could not open statistics file %s: %s\n", s->stats_file_str, buf); + + return err; + } + } + } + + s->sse_luma = NULL; + s->weights = NULL; + + for (c = 0; c < 3; c++) /* initialize XPSNR value of every color component */ + { + s->buf_org [c] = NULL; + s->buf_org_m1[c] = NULL; + s->buf_org_m2[c] = NULL; + s->buf_rec [c] = NULL; + s->sum_wdist [c] = 0.0; + s->sum_xpsnr [c] = 0.0; + s->and_is_inf[c] = true; + } + + s->fs.on_event = do_xpsnr; + + return 0; +} + +static const enum AVPixelFormat pix_fmts[] = +{ + AV_PIX_FMT_GRAY8, AV_PIX_FMT_GRAY9, AV_PIX_FMT_GRAY10, AV_PIX_FMT_GRAY12, AV_PIX_FMT_GRAY14, AV_PIX_FMT_GRAY16, +#define PF_NOALPHA(suf) AV_PIX_FMT_YUV420##suf, AV_PIX_FMT_YUV422##suf, AV_PIX_FMT_YUV444##suf +#define PF_ALPHA(suf) AV_PIX_FMT_YUVA420##suf, 
AV_PIX_FMT_YUVA422##suf, AV_PIX_FMT_YUVA444##suf +#define PF(suf) PF_NOALPHA(suf), PF_ALPHA(suf) + PF(P), PF(P9), PF(P10), PF_NOALPHA(P12), PF_NOALPHA(P14), PF(P16), + AV_PIX_FMT_YUV440P, AV_PIX_FMT_YUV411P, AV_PIX_FMT_YUV410P, + AV_PIX_FMT_YUVJ411P, AV_PIX_FMT_YUVJ420P, AV_PIX_FMT_YUVJ422P, + AV_PIX_FMT_YUVJ440P, AV_PIX_FMT_YUVJ444P, + AV_PIX_FMT_GBRP, AV_PIX_FMT_GBRP9, AV_PIX_FMT_GBRP10, + AV_PIX_FMT_GBRP12, AV_PIX_FMT_GBRP14, AV_PIX_FMT_GBRP16, + AV_PIX_FMT_GBRAP, AV_PIX_FMT_GBRAP10, AV_PIX_FMT_GBRAP12, AV_PIX_FMT_GBRAP16, + AV_PIX_FMT_NONE +}; + +static int config_input_ref (AVFilterLink *inlink) +{ + const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get (inlink->format); + AVFilterContext *ctx = inlink->dst; + XPSNRContext* const s = ctx->priv; + + if ((ctx->inputs[0]->w != ctx->inputs[1]->w) || + (ctx->inputs[0]->h != ctx->inputs[1]->h)) + { + av_log (ctx, AV_LOG_ERROR, "Width and height of the input videos must match.\n"); + + return AVERROR (EINVAL); + } + + if (ctx->inputs[0]->format != ctx->inputs[1]->format) + { + av_log (ctx, AV_LOG_ERROR, "The input videos must be of the same pixel format.\n"); + + return AVERROR (EINVAL); + } + + s->bpp = (desc->comp[0].depth <= 8 ? 1 : 2); + s->depth = desc->comp[0].depth; +#if 1 + s->max_error_64 = (1 << s->depth) - 1; /* conventional limit */ +#else + s->max_error_64 = 255 * (1 << (s->depth - 8)); /* JVET style */ +#endif + s->max_error_64 *= s->max_error_64; + + s->frame_rate = inlink->frame_rate.num / inlink->frame_rate.den; + + s->num_comps = (desc->nb_components > 3 ? 3 : desc->nb_components); + + s->is_rgb = (ff_fill_rgba_map (s->rgba_map, inlink->format) >= 0); + s->comps[0] = (s->is_rgb ? 'R' : 'Y'); + s->comps[1] = (s->is_rgb ? 'G' : 'U'); + s->comps[2] = (s->is_rgb ? 
'B' : 'V'); + s->comps[3] = 'A'; + + s->plane_width [1] = s->plane_width [2] = AV_CEIL_RSHIFT (inlink->w, desc->log2_chroma_w); + s->plane_width [0] = s->plane_width [3] = inlink->w; + s->plane_height[1] = s->plane_height[2] = AV_CEIL_RSHIFT (inlink->h, desc->log2_chroma_h); + s->plane_height[0] = s->plane_height[3] = inlink->h; + + s->dsp.sse_line = sse_line_16bit; + s->dsp.highds_func = highds; /* initialize customized AVX2 */ + s->dsp.diff1st_func = diff1st; /* SIMD routines from XPSNR */ + s->dsp.diff2nd_func = diff2nd; +#if ARCH_X86 + ff_xpsnr_init_x86 (&s->dsp, 15); /* inheritances from PSNR */ +#endif + + return 0; +} + +static int config_output (AVFilterLink *outlink) +{ + AVFilterContext *ctx = outlink->src; + AVFilterLink *inlink = ctx->inputs[0]; + XPSNRContext *s = ctx->priv; + int ret_value; + + if ((ret_value = ff_framesync_init_dualinput (&s->fs, ctx)) < 0) return ret_value; + + outlink->w = inlink->w; + outlink->h = inlink->h; + outlink->frame_rate = inlink->frame_rate; + outlink->sample_aspect_ratio = inlink->sample_aspect_ratio; + outlink->time_base = inlink->time_base; + + if ((ret_value = ff_framesync_configure (&s->fs)) < 0) return ret_value; + + outlink->time_base = s->fs.time_base; + + if (av_cmp_q (inlink->time_base, outlink->time_base) || + av_cmp_q (ctx->inputs[1]->time_base, outlink->time_base)) + { + av_log (ctx, AV_LOG_WARNING, "Not matching timebases found between first input: %d/%d and second input %d/%d, results may be incorrect!\n", + inlink->time_base.num, inlink->time_base.den, + ctx->inputs[1]->time_base.num, ctx->inputs[1]->time_base.den); + } + + return 0; +} + +static int activate (AVFilterContext *ctx) +{ + XPSNRContext *s = ctx->priv; + + return ff_framesync_activate (&s->fs); +} + +static av_cold void uninit (AVFilterContext *ctx) +{ + XPSNRContext* const s = ctx->priv; + int c; + + if (s->num_frames_64 > 0) /* print out overall per-component XPSNR average */ + { + const double xpsnr_luma = get_avg_xpsnr(s->sum_wdist[0], 
s->sum_xpsnr[0], + s->plane_width[0], s->plane_height[0], + s->max_error_64, s->num_frames_64); + double xpsnr_min = xpsnr_luma; + + /* luma */ + av_log (ctx, AV_LOG_INFO, "XPSNR %c: %3.4f", s->comps[0], xpsnr_luma); + if (s->stats_file) + { + fprintf (s->stats_file, "\nXPSNR average, %"PRId64" frames", s->num_frames_64); + fprintf (s->stats_file, " %c: %3.4f", s->comps[0], xpsnr_luma); + } + /* chroma */ + for (c = 1; c < s->num_comps; c++) + { + const double xpsnr_chroma = get_avg_xpsnr(s->sum_wdist[c], s->sum_xpsnr[c], + s->plane_width[c], s->plane_height[c], + s->max_error_64, s->num_frames_64); + if (xpsnr_min > xpsnr_chroma) xpsnr_min = xpsnr_chroma; + + av_log (ctx, AV_LOG_INFO, " %c: %3.4f", s->comps[c], xpsnr_chroma); + if (s->stats_file && s->stats_file != stdout) + { + fprintf (s->stats_file, " %c: %3.4f", s->comps[c], xpsnr_chroma); + } + } + /* print out line break (with minimum XPSNR across the color components) */ + if (s->num_comps > 1) + { + av_log (ctx, AV_LOG_INFO, " (minimum: %3.4f)\n", xpsnr_min); + if (s->stats_file && s->stats_file != stdout) + { + fprintf (s->stats_file, " (minimum: %3.4f)\n", xpsnr_min); + } + } + else + { + av_log (ctx, AV_LOG_INFO, "\n"); + if (s->stats_file && s->stats_file != stdout) + { + fprintf (s->stats_file, "\n"); + } + } + } + + ff_framesync_uninit (&s->fs); /* free temporary picture and block memory */ + + if (s->stats_file && s->stats_file != stdout) fclose (s->stats_file); + + if (s->sse_luma) av_freep (&s->sse_luma); + if (s->weights ) av_freep (&s->weights ); + + for (c = 0; c < s->num_comps; c++) /* free additional temporal org buffer memory */ + { + av_buffer_unref (&s->buf_org_m1[c]); /* AVBufferRef: unref instead of av_freep, NULL-safe */ + av_buffer_unref (&s->buf_org_m2[c]); + } + if (s->bpp == 1) /* 8 bit */ + { + for (c = 0; c < s->num_comps; c++) /* free org/rec picture buffer memory */ + { + av_buffer_unref (&s->buf_org[c]); + av_buffer_unref (&s->buf_rec[c]); + } + } +} + +static const AVFilterPad 
xpsnr_inputs[] = +{ + { + .name = "main", + .type = AVMEDIA_TYPE_VIDEO, + }, + { + .name = "reference", + .type = AVMEDIA_TYPE_VIDEO, + .config_props = config_input_ref, + } +}; + +static const AVFilterPad xpsnr_outputs[] = +{ + { + .name = "default", + .type = AVMEDIA_TYPE_VIDEO, + .config_props = config_output, + } +}; + +const AVFilter ff_vf_xpsnr = +{ + .name = "xpsnr", + .description = NULL_IF_CONFIG_SMALL ("Calculate the extended perceptually weighted peak signal-to-noise ratio (XPSNR) between two video streams."), + .preinit = xpsnr_framesync_preinit, + .init = init, + .uninit = uninit, + .activate = activate, + .priv_size = sizeof (XPSNRContext), + .priv_class = &xpsnr_class, + FILTER_INPUTS (xpsnr_inputs), + FILTER_OUTPUTS(xpsnr_outputs), + FILTER_PIXFMTS_ARRAY(pix_fmts), + .flags = AVFILTER_FLAG_SUPPORT_TIMELINE_INTERNAL | + AVFILTER_FLAG_SLICE_THREADS | + AVFILTER_FLAG_METADATA_ONLY, +}; diff --git a/libavfilter/x86/vf_xpsnr.asm b/libavfilter/x86/vf_xpsnr.asm new file mode 100644 index 0000000000..bfeeff718f --- /dev/null +++ b/libavfilter/x86/vf_xpsnr.asm @@ -0,0 +1,2108 @@ +%if ARCH_X86_64 +default rel +global highds_simd +global diff1st_simd +global diff2nd_simd +SECTION .text +highds_simd: + push rbp + mov rbp, rsp + and rsp, 0FFFFFFFFFFFFFFE0H + sub rsp, 1576 + mov dword [rsp-4CH], edi + mov dword [rsp-50H], esi + mov dword [rsp-54H], edx + mov dword [rsp-58H], ecx + mov qword [rsp-60H], r8 + mov dword [rsp-64H], r9d + mov qword [rsp], 0 + mov word [rsp-20H], 0 + mov word [rsp-1EH], 0 + mov word [rsp-1CH], -1 + mov word [rsp-1AH], -2 + mov word [rsp-18H], -3 + mov word [rsp-16H], -3 + mov word [rsp-14H], -2 + mov word [rsp-12H], -1 + movzx eax, word [rsp-12H] + vmovd xmm0, eax + movzx eax, word [rsp-14H] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm1, xmm0 + movzx eax, word [rsp-16H] + vmovd xmm0, eax + movzx eax, word [rsp-18H] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm4, xmm0 + movzx eax, word [rsp-1AH] + vmovd xmm0, eax + movzx eax, word [rsp-1CH] + 
vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm2, xmm0 + movzx eax, word [rsp-1EH] + vmovd xmm0, eax + movzx eax, word [rsp-20H] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm3, xmm0 + vpunpckldq xmm0, xmm1, xmm4 + vmovdqa xmm1, xmm0 + vpunpckldq xmm0, xmm2, xmm3 + vpunpcklqdq xmm0, xmm1, xmm0 + vmovaps oword [rsp+98H], xmm0 + mov word [rsp-30H], 0 + mov word [rsp-2EH], 0 + mov word [rsp-2CH], -1 + mov word [rsp-2AH], -3 + mov word [rsp-28H], 12 + mov word [rsp-26H], 12 + mov word [rsp-24H], -3 + mov word [rsp-22H], -1 + movzx eax, word [rsp-22H] + vmovd xmm0, eax + movzx eax, word [rsp-24H] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm1, xmm0 + movzx eax, word [rsp-26H] + vmovd xmm0, eax + movzx eax, word [rsp-28H] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm4, xmm0 + movzx eax, word [rsp-2AH] + vmovd xmm0, eax + movzx eax, word [rsp-2CH] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm2, xmm0 + movzx eax, word [rsp-2EH] + vmovd xmm0, eax + movzx eax, word [rsp-30H] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm3, xmm0 + vpunpckldq xmm0, xmm1, xmm4 + vmovdqa xmm1, xmm0 + vpunpckldq xmm0, xmm2, xmm3 + vpunpcklqdq xmm0, xmm1, xmm0 + vmovaps oword [rsp+0A8H], xmm0 + mov word [rsp-40H], 0 + mov word [rsp-3EH], 0 + mov word [rsp-3CH], 0 + mov word [rsp-3AH], -1 + mov word [rsp-38H], -1 + mov word [rsp-36H], -1 + mov word [rsp-34H], -1 + mov word [rsp-32H], 0 + movzx eax, word [rsp-32H] + vmovd xmm0, eax + movzx eax, word [rsp-34H] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm1, xmm0 + movzx eax, word [rsp-36H] + vmovd xmm0, eax + movzx eax, word [rsp-38H] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm4, xmm0 + movzx eax, word [rsp-3AH] + vmovd xmm0, eax + movzx eax, word [rsp-3CH] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm2, xmm0 + movzx eax, word [rsp-3EH] + vmovd xmm0, eax + movzx eax, word [rsp-40H] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm3, xmm0 + vpunpckldq xmm0, xmm1, xmm4 + vmovdqa xmm1, xmm0 + vpunpckldq xmm0, xmm2, xmm3 + vpunpcklqdq xmm0, xmm1, xmm0 + vmovaps oword [rsp+0B8H], xmm0 + mov 
eax, dword [rsp-50H] + mov dword [rsp-10H], eax + jmp L_008 + +L_001: mov eax, dword [rsp-4CH] + mov dword [rsp-0CH], eax + jmp L_007 + +L_002: mov eax, dword [rsp-10H] + sub eax, 2 + imul eax, dword [rsp-64H] + mov edx, eax + mov eax, dword [rsp-0CH] + add eax, edx + cdqe + add rax, rax + lea rdx, [rax-4H] + mov rax, qword [rsp-60H] + add rax, rdx + mov qword [rsp+30H], rax + mov rax, qword [rsp+30H] + vlddqu ymm0, yword [rax] + vmovdqa yword [rsp+4A8H], ymm0 + mov eax, dword [rsp-10H] + sub eax, 1 + imul eax, dword [rsp-64H] + mov edx, eax + mov eax, dword [rsp-0CH] + add eax, edx + cdqe + add rax, rax + lea rdx, [rax-4H] + mov rax, qword [rsp-60H] + add rax, rdx + mov qword [rsp+28H], rax + mov rax, qword [rsp+28H] + vlddqu ymm0, yword [rax] + vmovdqa yword [rsp+4C8H], ymm0 + mov eax, dword [rsp-10H] + imul eax, dword [rsp-64H] + mov edx, eax + mov eax, dword [rsp-0CH] + add eax, edx + cdqe + add rax, rax + lea rdx, [rax-4H] + mov rax, qword [rsp-60H] + add rax, rdx + mov qword [rsp+20H], rax + mov rax, qword [rsp+20H] + vlddqu ymm0, yword [rax] + vmovdqa yword [rsp+4E8H], ymm0 + mov eax, dword [rsp-10H] + add eax, 1 + imul eax, dword [rsp-64H] + mov edx, eax + mov eax, dword [rsp-0CH] + add eax, edx + cdqe + add rax, rax + lea rdx, [rax-4H] + mov rax, qword [rsp-60H] + add rax, rdx + mov qword [rsp+18H], rax + mov rax, qword [rsp+18H] + vlddqu ymm0, yword [rax] + vmovdqa yword [rsp+508H], ymm0 + mov eax, dword [rsp-10H] + add eax, 2 + imul eax, dword [rsp-64H] + mov edx, eax + mov eax, dword [rsp-0CH] + add eax, edx + cdqe + add rax, rax + lea rdx, [rax-4H] + mov rax, qword [rsp-60H] + add rax, rdx + mov qword [rsp+10H], rax + mov rax, qword [rsp+10H] + vlddqu ymm0, yword [rax] + vmovdqa yword [rsp+528H], ymm0 + mov eax, dword [rsp-10H] + add eax, 3 + imul eax, dword [rsp-64H] + mov edx, eax + mov eax, dword [rsp-0CH] + add eax, edx + cdqe + add rax, rax + lea rdx, [rax-4H] + mov rax, qword [rsp-60H] + add rax, rdx + mov qword [rsp+8H], rax + mov rax, qword 
[rsp+8H] + vlddqu ymm0, yword [rax] + vmovdqa yword [rsp+548H], ymm0 + mov dword [rsp-8H], 0 + jmp L_006 + +L_003: mov eax, dword [rsp-8H] + lea edx, [rax*4] + mov eax, dword [rsp-0CH] + add eax, edx + cmp dword [rsp-54H], eax + jle L_004 + mov dword [rsp-4H], 0 + vmovdqa ymm0, yword [rsp+4E8H] + vmovdqa yword [rsp+608H], ymm0 + vmovdqa ymm0, yword [rsp+608H] + vmovaps oword [rsp+38H], xmm0 + vmovdqa ymm0, yword [rsp+508H] + vmovdqa yword [rsp+5E8H], ymm0 + vmovdqa ymm0, yword [rsp+5E8H] + vmovaps oword [rsp+48H], xmm0 + vmovdqa xmm0, oword [rsp+38H] + vmovaps oword [rsp+2A8H], xmm0 + vmovdqa xmm0, oword [rsp+0A8H] + vmovaps oword [rsp+2B8H], xmm0 + vmovdqa xmm0, oword [rsp+2B8H] + vmovdqa xmm1, oword [rsp+2A8H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+48H] + vmovaps oword [rsp+288H], xmm0 + vmovdqa xmm0, oword [rsp+0A8H] + vmovaps oword [rsp+298H], xmm0 + vmovdqa xmm0, oword [rsp+298H] + vmovdqa xmm1, oword [rsp+288H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [rsp+0D8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+268H], xmm0 + vmovdqa xmm0, oword [rsp+0D8H] + vmovaps oword [rsp+278H], xmm0 + vmovdqa xmm1, oword [rsp+278H] + vmovdqa xmm0, oword [rsp+268H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+248H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+258H], xmm0 + vmovdqa xmm1, oword [rsp+258H] + vmovdqa xmm0, oword [rsp+248H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+228H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+238H], xmm0 + vmovdqa xmm1, oword [rsp+238H] + vmovdqa xmm0, oword [rsp+228H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovd eax, xmm0 + add dword [rsp-4H], eax + vmovdqa ymm0, yword [rsp+4C8H] + vmovdqa yword [rsp+5C8H], ymm0 + vmovdqa ymm0, yword [rsp+5C8H] + 
vmovaps oword [rsp+58H], xmm0 + vmovdqa ymm0, yword [rsp+528H] + vmovdqa yword [rsp+5A8H], ymm0 + vmovdqa ymm0, yword [rsp+5A8H] + vmovaps oword [rsp+68H], xmm0 + vmovdqa xmm0, oword [rsp+58H] + vmovaps oword [rsp+208H], xmm0 + vmovdqa xmm0, oword [rsp+98H] + vmovaps oword [rsp+218H], xmm0 + vmovdqa xmm0, oword [rsp+218H] + vmovdqa xmm1, oword [rsp+208H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+68H] + vmovaps oword [rsp+1E8H], xmm0 + vmovdqa xmm0, oword [rsp+98H] + vmovaps oword [rsp+1F8H], xmm0 + vmovdqa xmm0, oword [rsp+1F8H] + vmovdqa xmm1, oword [rsp+1E8H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [rsp+0D8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+1C8H], xmm0 + vmovdqa xmm0, oword [rsp+0D8H] + vmovaps oword [rsp+1D8H], xmm0 + vmovdqa xmm1, oword [rsp+1D8H] + vmovdqa xmm0, oword [rsp+1C8H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+1A8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+1B8H], xmm0 + vmovdqa xmm1, oword [rsp+1B8H] + vmovdqa xmm0, oword [rsp+1A8H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+188H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+198H], xmm0 + vmovdqa xmm1, oword [rsp+198H] + vmovdqa xmm0, oword [rsp+188H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovd eax, xmm0 + add dword [rsp-4H], eax + vmovdqa ymm0, yword [rsp+4A8H] + vmovdqa yword [rsp+588H], ymm0 + vmovdqa ymm0, yword [rsp+588H] + vmovaps oword [rsp+78H], xmm0 + vmovdqa ymm0, yword [rsp+548H] + vmovdqa yword [rsp+568H], ymm0 + vmovdqa ymm0, yword [rsp+568H] + vmovaps oword [rsp+88H], xmm0 + vmovdqa xmm0, oword [rsp+78H] + vmovaps oword [rsp+168H], xmm0 + vmovdqa xmm0, oword [rsp+0B8H] + vmovaps oword [rsp+178H], xmm0 + vmovdqa xmm0, oword [rsp+178H] + vmovdqa xmm1, oword [rsp+168H] + vpmaddwd 
xmm0, xmm1, xmm0 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+88H] + vmovaps oword [rsp+148H], xmm0 + vmovdqa xmm0, oword [rsp+0B8H] + vmovaps oword [rsp+158H], xmm0 + vmovdqa xmm0, oword [rsp+158H] + vmovdqa xmm1, oword [rsp+148H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [rsp+0D8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+128H], xmm0 + vmovdqa xmm0, oword [rsp+0D8H] + vmovaps oword [rsp+138H], xmm0 + vmovdqa xmm1, oword [rsp+138H] + vmovdqa xmm0, oword [rsp+128H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+108H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+118H], xmm0 + vmovdqa xmm1, oword [rsp+118H] + vmovdqa xmm0, oword [rsp+108H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+0E8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+0F8H], xmm0 + vmovdqa xmm1, oword [rsp+0F8H] + vmovdqa xmm0, oword [rsp+0E8H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovd eax, xmm0 + add dword [rsp-4H], eax + mov eax, dword [rsp-4H] + cdq + mov eax, edx + xor eax, dword [rsp-4H] + sub eax, edx + cdqe + add qword [rsp], rax +L_004: mov eax, dword [rsp-8H] + lea edx, [rax*4] + mov eax, dword [rsp-0CH] + add eax, edx + add eax, 2 + cmp dword [rsp-54H], eax + jle L_005 + mov dword [rsp-4H], 0 + vmovdqa xmm0, oword [rsp+38H] + vpsrldq xmm0, xmm0, 4 + vmovaps oword [rsp+38H], xmm0 + vmovdqa xmm0, oword [rsp+48H] + vpsrldq xmm0, xmm0, 4 + vmovaps oword [rsp+48H], xmm0 + vmovdqa xmm0, oword [rsp+38H] + vmovaps oword [rsp+488H], xmm0 + vmovdqa xmm0, oword [rsp+0A8H] + vmovaps oword [rsp+498H], xmm0 + vmovdqa xmm0, oword [rsp+498H] + vmovdqa xmm1, oword [rsp+488H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+48H] + vmovaps oword [rsp+468H], xmm0 + vmovdqa xmm0, oword [rsp+0A8H] + vmovaps oword 
[rsp+478H], xmm0 + vmovdqa xmm0, oword [rsp+478H] + vmovdqa xmm1, oword [rsp+468H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [rsp+0D8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+448H], xmm0 + vmovdqa xmm0, oword [rsp+0D8H] + vmovaps oword [rsp+458H], xmm0 + vmovdqa xmm1, oword [rsp+458H] + vmovdqa xmm0, oword [rsp+448H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+428H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+438H], xmm0 + vmovdqa xmm1, oword [rsp+438H] + vmovdqa xmm0, oword [rsp+428H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+408H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+418H], xmm0 + vmovdqa xmm1, oword [rsp+418H] + vmovdqa xmm0, oword [rsp+408H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovd eax, xmm0 + add dword [rsp-4H], eax + vmovdqa xmm0, oword [rsp+58H] + vpsrldq xmm0, xmm0, 4 + vmovaps oword [rsp+58H], xmm0 + vmovdqa xmm0, oword [rsp+68H] + vpsrldq xmm0, xmm0, 4 + vmovaps oword [rsp+68H], xmm0 + vmovdqa xmm0, oword [rsp+58H] + vmovaps oword [rsp+3E8H], xmm0 + vmovdqa xmm0, oword [rsp+98H] + vmovaps oword [rsp+3F8H], xmm0 + vmovdqa xmm0, oword [rsp+3F8H] + vmovdqa xmm1, oword [rsp+3E8H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+68H] + vmovaps oword [rsp+3C8H], xmm0 + vmovdqa xmm0, oword [rsp+98H] + vmovaps oword [rsp+3D8H], xmm0 + vmovdqa xmm0, oword [rsp+3D8H] + vmovdqa xmm1, oword [rsp+3C8H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [rsp+0D8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+3A8H], xmm0 + vmovdqa xmm0, oword [rsp+0D8H] + vmovaps oword [rsp+3B8H], xmm0 + vmovdqa xmm1, oword [rsp+3B8H] + vmovdqa xmm0, oword [rsp+3A8H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword 
[rsp+388H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+398H], xmm0 + vmovdqa xmm1, oword [rsp+398H] + vmovdqa xmm0, oword [rsp+388H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+368H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+378H], xmm0 + vmovdqa xmm1, oword [rsp+378H] + vmovdqa xmm0, oword [rsp+368H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovd eax, xmm0 + add dword [rsp-4H], eax + vmovdqa xmm0, oword [rsp+78H] + vpsrldq xmm0, xmm0, 4 + vmovaps oword [rsp+78H], xmm0 + vmovdqa xmm0, oword [rsp+88H] + vpsrldq xmm0, xmm0, 4 + vmovaps oword [rsp+88H], xmm0 + vmovdqa xmm0, oword [rsp+78H] + vmovaps oword [rsp+348H], xmm0 + vmovdqa xmm0, oword [rsp+0B8H] + vmovaps oword [rsp+358H], xmm0 + vmovdqa xmm0, oword [rsp+358H] + vmovdqa xmm1, oword [rsp+348H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+88H] + vmovaps oword [rsp+328H], xmm0 + vmovdqa xmm0, oword [rsp+0B8H] + vmovaps oword [rsp+338H], xmm0 + vmovdqa xmm0, oword [rsp+338H] + vmovdqa xmm1, oword [rsp+328H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [rsp+0D8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+308H], xmm0 + vmovdqa xmm0, oword [rsp+0D8H] + vmovaps oword [rsp+318H], xmm0 + vmovdqa xmm1, oword [rsp+318H] + vmovdqa xmm0, oword [rsp+308H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+2E8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+2F8H], xmm0 + vmovdqa xmm1, oword [rsp+2F8H] + vmovdqa xmm0, oword [rsp+2E8H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+2C8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovaps oword [rsp+2D8H], xmm0 + vmovdqa xmm1, oword [rsp+2D8H] + vmovdqa xmm0, oword [rsp+2C8H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword 
[rsp+0C8H], xmm0 + vmovdqa xmm0, oword [rsp+0C8H] + vmovd eax, xmm0 + add dword [rsp-4H], eax + mov eax, dword [rsp-4H] + cdq + mov eax, edx + xor eax, dword [rsp-4H] + sub eax, edx + cdqe + add qword [rsp], rax + vpermq ymm0, yword [rsp+4A8H], 39H + vmovdqa yword [rsp+4A8H], ymm0 + vpermq ymm0, yword [rsp+4C8H], 39H + vmovdqa yword [rsp+4C8H], ymm0 + vpermq ymm0, yword [rsp+4E8H], 39H + vmovdqa yword [rsp+4E8H], ymm0 + vpermq ymm0, yword [rsp+508H], 39H + vmovdqa yword [rsp+508H], ymm0 + vpermq ymm0, yword [rsp+528H], 39H + vmovdqa yword [rsp+528H], ymm0 + vpermq ymm0, yword [rsp+548H], 39H + vmovdqa yword [rsp+548H], ymm0 +L_005: add dword [rsp-8H], 1 +L_006: cmp dword [rsp-8H], 2 + jle L_003 + add dword [rsp-0CH], 12 +L_007: mov eax, dword [rsp-0CH] + cmp eax, dword [rsp-54H] + jl L_002 + add dword [rsp-10H], 2 +L_008: mov eax, dword [rsp-10H] + cmp eax, dword [rsp-58H] + jl L_001 + mov rax, qword [rsp] + leave + ret + +diff1st_simd: + push rbp + mov rbp, rsp + sub rsp, 480 + mov dword [rbp-1C4H], edi + mov dword [rbp-1C8H], esi + mov qword [rbp-1D0H], rdx + mov qword [rbp-1D8H], rcx + mov dword [rbp-1DCH], r8d + + + mov rax, qword [fs:abs 28H] + mov qword [rbp-8H], rax + xor eax, eax + mov qword [rbp-1A8H], 0 + + mov word [rbp-1B2H], 0 + mov dword [rbp-1B0H], 0 + jmp L_012 + +L_009: mov dword [rbp-1ACH], 0 + jmp L_011 + +L_010: mov eax, dword [rbp-1DCH] + imul eax, dword [rbp-1B0H] + mov edx, eax + mov eax, dword [rbp-1ACH] + add eax, edx + mov eax, eax + lea rdx, [rax+rax] + mov rax, qword [rbp-1D0H] + add rax, rdx + mov qword [rbp-178H], rax + mov rax, qword [rbp-178H] + vlddqu xmm0, oword [rax] + vmovaps oword [rbp-170H], xmm0 + mov eax, dword [rbp-1B0H] + lea edx, [rax+1H] + mov eax, dword [rbp-1DCH] + imul edx, eax + mov eax, dword [rbp-1ACH] + add eax, edx + mov eax, eax + lea rdx, [rax+rax] + mov rax, qword [rbp-1D0H] + add rax, rdx + mov qword [rbp-180H], rax + mov rax, qword [rbp-180H] + vlddqu xmm0, oword [rax] + vmovaps oword [rbp-160H], xmm0 + mov 
eax, dword [rbp-1DCH] + imul eax, dword [rbp-1B0H] + mov edx, eax + mov eax, dword [rbp-1ACH] + add eax, edx + mov eax, eax + lea rdx, [rax+rax] + mov rax, qword [rbp-1D8H] + add rax, rdx + mov qword [rbp-188H], rax + mov rax, qword [rbp-188H] + vlddqu xmm0, oword [rax] + vmovaps oword [rbp-150H], xmm0 + mov eax, dword [rbp-1B0H] + lea edx, [rax+1H] + mov eax, dword [rbp-1DCH] + imul edx, eax + mov eax, dword [rbp-1ACH] + add eax, edx + mov eax, eax + lea rdx, [rax+rax] + mov rax, qword [rbp-1D8H] + add rax, rdx + mov qword [rbp-190H], rax + mov rax, qword [rbp-190H] + vlddqu xmm0, oword [rax] + vmovaps oword [rbp-140H], xmm0 + vmovdqa xmm0, oword [rbp-170H] + vmovaps oword [rbp-30H], xmm0 + vmovdqa xmm0, oword [rbp-160H] + vmovaps oword [rbp-20H], xmm0 + vmovdqa xmm1, oword [rbp-30H] + vmovdqa xmm0, oword [rbp-20H] + vpaddw xmm0, xmm1, xmm0 + vmovaps oword [rbp-130H], xmm0 + vmovdqa xmm0, oword [rbp-150H] + vmovaps oword [rbp-50H], xmm0 + vmovdqa xmm0, oword [rbp-140H] + vmovaps oword [rbp-40H], xmm0 + vmovdqa xmm1, oword [rbp-50H] + vmovdqa xmm0, oword [rbp-40H] + vpaddw xmm0, xmm1, xmm0 + vmovaps oword [rbp-120H], xmm0 + vmovdqa xmm0, oword [rbp-130H] + vmovaps oword [rbp-70H], xmm0 + vmovdqa xmm0, oword [rbp-120H] + vmovaps oword [rbp-60H], xmm0 + vmovdqa xmm0, oword [rbp-70H] + vmovdqa xmm1, oword [rbp-60H] + vpsubw xmm0, xmm0, xmm1 + vmovaps oword [rbp-120H], xmm0 + vmovdqa xmm0, oword [rbp-120H] + vmovaps oword [rbp-90H], xmm0 + vmovdqa xmm0, oword [rbp-120H] + vmovaps oword [rbp-80H], xmm0 + vmovdqa xmm1, oword [rbp-80H] + vmovdqa xmm0, oword [rbp-90H] + vphaddw xmm0, xmm0, xmm1 + vmovaps oword [rbp-120H], xmm0 + vmovdqa xmm0, oword [rbp-120H] + vmovaps oword [rbp-0A0H], xmm0 + vmovdqa xmm0, oword [rbp-0A0H] + vpabsw xmm0, xmm0 + vmovaps oword [rbp-120H], xmm0 + vmovdqa xmm0, oword [rbp-120H] + vmovaps oword [rbp-0C0H], xmm0 + vmovdqa xmm0, oword [rbp-120H] + vmovaps oword [rbp-0B0H], xmm0 + vmovdqa xmm1, oword [rbp-0B0H] + vmovdqa xmm0, oword [rbp-0C0H] + 
vphaddsw xmm0, xmm0, xmm1 + vmovaps oword [rbp-120H], xmm0 + vmovdqa xmm0, oword [rbp-120H] + vmovaps oword [rbp-0E0H], xmm0 + vmovdqa xmm0, oword [rbp-120H] + vmovaps oword [rbp-0D0H], xmm0 + vmovdqa xmm1, oword [rbp-0D0H] + vmovdqa xmm0, oword [rbp-0E0H] + vphaddsw xmm0, xmm0, xmm1 + vmovaps oword [rbp-120H], xmm0 + vmovdqa xmm0, oword [rbp-120H] + vmovaps oword [rbp-0F0H], xmm0 + vmovdqa xmm0, oword [rbp-0F0H] + vmovd edx, xmm0 + lea rax, [rbp-1B2H] + mov word [rax], dx + movzx eax, word [rbp-1B2H] + movzx eax, ax + add qword [rbp-1A8H], rax + mov eax, dword [rbp-1DCH] + imul eax, dword [rbp-1B0H] + mov edx, eax + mov eax, dword [rbp-1ACH] + add eax, edx + mov eax, eax + lea rdx, [rax+rax] + mov rax, qword [rbp-1D8H] + add rax, rdx + mov qword [rbp-198H], rax + vmovdqa xmm0, oword [rbp-170H] + vmovaps oword [rbp-100H], xmm0 + vmovdqa xmm0, oword [rbp-100H] + mov rax, qword [rbp-198H] + vmovups oword [rax], xmm0 + nop + mov eax, dword [rbp-1B0H] + lea edx, [rax+1H] + mov eax, dword [rbp-1DCH] + imul edx, eax + mov eax, dword [rbp-1ACH] + add eax, edx + mov eax, eax + lea rdx, [rax+rax] + mov rax, qword [rbp-1D8H] + add rax, rdx + mov qword [rbp-1A0H], rax + vmovdqa xmm0, oword [rbp-160H] + vmovaps oword [rbp-110H], xmm0 + vmovdqa xmm0, oword [rbp-110H] + mov rax, qword [rbp-1A0H] + vmovups oword [rax], xmm0 + nop + add dword [rbp-1ACH], 8 +L_011: mov eax, dword [rbp-1ACH] + cmp eax, dword [rbp-1C4H] + jc L_010 + add dword [rbp-1B0H], 2 +L_012: mov eax, dword [rbp-1B0H] + cmp eax, dword [rbp-1C8H] + jc L_009 + mov rax, qword [rbp-1A8H] + add rax, rax + mov rcx, qword [rbp-8H] + + + xor rcx, qword [fs:abs 28H] + jz L_013 +L_013: leave + ret + +diff2nd_simd: + push rbp + mov rbp, rsp + sub rsp, 688 + mov dword [rbp-284H], edi + mov dword [rbp-288H], esi + mov qword [rbp-290H], rdx + mov qword [rbp-298H], rcx + mov qword [rbp-2A0H], r8 + mov dword [rbp-2A4H], r9d + + + mov rax, qword [fs:abs 28H] + mov qword [rbp-8H], rax + xor eax, eax + mov qword [rbp-268H], 0 + + 
mov word [rbp-276H], 0 + mov dword [rbp-274H], 0 + jmp L_017 + +L_014: mov dword [rbp-270H], 0 + jmp L_016 + +L_015: mov eax, dword [rbp-2A4H] + imul eax, dword [rbp-274H] + mov edx, eax + mov eax, dword [rbp-270H] + add eax, edx + mov eax, eax + lea rdx, [rax+rax] + mov rax, qword [rbp-290H] + add rax, rdx + mov qword [rbp-218H], rax + mov rax, qword [rbp-218H] + vlddqu xmm0, oword [rax] + vmovaps oword [rbp-210H], xmm0 + mov eax, dword [rbp-274H] + lea edx, [rax+1H] + mov eax, dword [rbp-2A4H] + imul edx, eax + mov eax, dword [rbp-270H] + add eax, edx + mov eax, eax + lea rdx, [rax+rax] + mov rax, qword [rbp-290H] + add rax, rdx + mov qword [rbp-220H], rax + mov rax, qword [rbp-220H] + vlddqu xmm0, oword [rax] + vmovaps oword [rbp-200H], xmm0 + mov eax, dword [rbp-2A4H] + imul eax, dword [rbp-274H] + mov edx, eax + mov eax, dword [rbp-270H] + add eax, edx + mov eax, eax + lea rdx, [rax+rax] + mov rax, qword [rbp-298H] + add rax, rdx + mov qword [rbp-228H], rax + mov rax, qword [rbp-228H] + vlddqu xmm0, oword [rax] + vmovaps oword [rbp-1F0H], xmm0 + mov eax, dword [rbp-274H] + lea edx, [rax+1H] + mov eax, dword [rbp-2A4H] + imul edx, eax + mov eax, dword [rbp-270H] + add eax, edx + mov eax, eax + lea rdx, [rax+rax] + mov rax, qword [rbp-298H] + add rax, rdx + mov qword [rbp-230H], rax + mov rax, qword [rbp-230H] + vlddqu xmm0, oword [rax] + vmovaps oword [rbp-1E0H], xmm0 + mov eax, dword [rbp-2A4H] + imul eax, dword [rbp-274H] + mov edx, eax + mov eax, dword [rbp-270H] + add eax, edx + mov eax, eax + lea rdx, [rax+rax] + mov rax, qword [rbp-2A0H] + add rax, rdx + mov qword [rbp-238H], rax + mov rax, qword [rbp-238H] + vlddqu xmm0, oword [rax] + vmovaps oword [rbp-1D0H], xmm0 + mov eax, dword [rbp-274H] + lea edx, [rax+1H] + mov eax, dword [rbp-2A4H] + imul edx, eax + mov eax, dword [rbp-270H] + add eax, edx + mov eax, eax + lea rdx, [rax+rax] + mov rax, qword [rbp-2A0H] + add rax, rdx + mov qword [rbp-240H], rax + mov rax, qword [rbp-240H] + vlddqu xmm0, oword 
[rax] + vmovaps oword [rbp-1C0H], xmm0 + vmovdqa xmm0, oword [rbp-210H] + vmovaps oword [rbp-30H], xmm0 + vmovdqa xmm0, oword [rbp-200H] + vmovaps oword [rbp-20H], xmm0 + vmovdqa xmm1, oword [rbp-30H] + vmovdqa xmm0, oword [rbp-20H] + vpaddw xmm0, xmm1, xmm0 + vmovaps oword [rbp-1B0H], xmm0 + vmovdqa xmm0, oword [rbp-1F0H] + vmovaps oword [rbp-50H], xmm0 + vmovdqa xmm0, oword [rbp-1E0H] + vmovaps oword [rbp-40H], xmm0 + vmovdqa xmm1, oword [rbp-50H] + vmovdqa xmm0, oword [rbp-40H] + vpaddw xmm0, xmm1, xmm0 + vmovaps oword [rbp-1A0H], xmm0 + vmovdqa xmm0, oword [rbp-1D0H] + vmovaps oword [rbp-70H], xmm0 + vmovdqa xmm0, oword [rbp-1C0H] + vmovaps oword [rbp-60H], xmm0 + vmovdqa xmm1, oword [rbp-70H] + vmovdqa xmm0, oword [rbp-60H] + vpaddw xmm0, xmm1, xmm0 + vmovaps oword [rbp-190H], xmm0 + vmovdqa xmm0, oword [rbp-1B0H] + vmovaps oword [rbp-90H], xmm0 + vmovdqa xmm0, oword [rbp-190H] + vmovaps oword [rbp-80H], xmm0 + vmovdqa xmm1, oword [rbp-90H] + vmovdqa xmm0, oword [rbp-80H] + vpaddw xmm0, xmm1, xmm0 + vmovaps oword [rbp-1B0H], xmm0 + vmovdqa xmm0, oword [rbp-1B0H] + vmovaps oword [rbp-0B0H], xmm0 + vmovdqa xmm0, oword [rbp-1A0H] + vmovaps oword [rbp-0A0H], xmm0 + vmovdqa xmm1, oword [rbp-0A0H] + vmovdqa xmm0, oword [rbp-0B0H] + vphaddw xmm0, xmm0, xmm1 + vmovaps oword [rbp-1B0H], xmm0 + vmovdqa xmm0, oword [rbp-1B0H] + vpshufd xmm0, xmm0, 0EEH + vmovaps oword [rbp-1A0H], xmm0 + vmovdqa xmm0, oword [rbp-1A0H] + vmovaps oword [rbp-0C0H], xmm0 + mov dword [rbp-26CH], 1 + vmovdqa xmm1, oword [rbp-0C0H] + vmovd xmm0, dword [rbp-26CH] + vpsllw xmm0, xmm1, xmm0 + vmovaps oword [rbp-1A0H], xmm0 + vmovdqa xmm0, oword [rbp-1B0H] + vmovaps oword [rbp-0E0H], xmm0 + vmovdqa xmm0, oword [rbp-1A0H] + vmovaps oword [rbp-0D0H], xmm0 + vmovdqa xmm0, oword [rbp-0E0H] + vmovdqa xmm1, oword [rbp-0D0H] + vpsubw xmm0, xmm0, xmm1 + vmovaps oword [rbp-1A0H], xmm0 + vmovdqa xmm0, oword [rbp-1A0H] + vmovaps oword [rbp-0F0H], xmm0 + vmovdqa xmm0, oword [rbp-0F0H] + vpabsw xmm0, xmm0 + 
vmovaps oword [rbp-1A0H], xmm0 + vmovdqa xmm0, oword [rbp-1A0H] + vmovaps oword [rbp-110H], xmm0 + vmovdqa xmm0, oword [rbp-1A0H] + vmovaps oword [rbp-100H], xmm0 + vmovdqa xmm1, oword [rbp-100H] + vmovdqa xmm0, oword [rbp-110H] + vphaddsw xmm0, xmm0, xmm1 + vmovaps oword [rbp-1A0H], xmm0 + vmovdqa xmm0, oword [rbp-1A0H] + vmovaps oword [rbp-130H], xmm0 + vmovdqa xmm0, oword [rbp-1A0H] + vmovaps oword [rbp-120H], xmm0 + vmovdqa xmm1, oword [rbp-120H] + vmovdqa xmm0, oword [rbp-130H] + vphaddsw xmm0, xmm0, xmm1 + vmovaps oword [rbp-1A0H], xmm0 + vmovdqa xmm0, oword [rbp-1A0H] + vmovaps oword [rbp-140H], xmm0 + vmovdqa xmm0, oword [rbp-140H] + vmovd edx, xmm0 + lea rax, [rbp-276H] + mov word [rax], dx + movzx eax, word [rbp-276H] + movzx eax, ax + add qword [rbp-268H], rax + mov eax, dword [rbp-2A4H] + imul eax, dword [rbp-274H] + mov edx, eax + mov eax, dword [rbp-270H] + add eax, edx + mov eax, eax + lea rdx, [rax+rax] + mov rax, qword [rbp-2A0H] + add rax, rdx + mov qword [rbp-248H], rax + vmovdqa xmm0, oword [rbp-1F0H] + vmovaps oword [rbp-150H], xmm0 + vmovdqa xmm0, oword [rbp-150H] + mov rax, qword [rbp-248H] + vmovups oword [rax], xmm0 + nop + mov eax, dword [rbp-274H] + lea edx, [rax+1H] + mov eax, dword [rbp-2A4H] + imul edx, eax + mov eax, dword [rbp-270H] + add eax, edx + mov eax, eax + lea rdx, [rax+rax] + mov rax, qword [rbp-2A0H] + add rax, rdx + mov qword [rbp-250H], rax + vmovdqa xmm0, oword [rbp-1E0H] + vmovaps oword [rbp-160H], xmm0 + vmovdqa xmm0, oword [rbp-160H] + mov rax, qword [rbp-250H] + vmovups oword [rax], xmm0 + nop + mov eax, dword [rbp-2A4H] + imul eax, dword [rbp-274H] + mov edx, eax + mov eax, dword [rbp-270H] + add eax, edx + mov eax, eax + lea rdx, [rax+rax] + mov rax, qword [rbp-298H] + add rax, rdx + mov qword [rbp-258H], rax + vmovdqa xmm0, oword [rbp-210H] + vmovaps oword [rbp-170H], xmm0 + vmovdqa xmm0, oword [rbp-170H] + mov rax, qword [rbp-258H] + vmovups oword [rax], xmm0 + nop + mov eax, dword [rbp-274H] + lea edx, [rax+1H] 
+ mov eax, dword [rbp-2A4H] + imul edx, eax + mov eax, dword [rbp-270H] + add eax, edx + mov eax, eax + lea rdx, [rax+rax] + mov rax, qword [rbp-298H] + add rax, rdx + mov qword [rbp-260H], rax + vmovdqa xmm0, oword [rbp-200H] + vmovaps oword [rbp-180H], xmm0 + vmovdqa xmm0, oword [rbp-180H] + mov rax, qword [rbp-260H] + vmovups oword [rax], xmm0 + nop + add dword [rbp-270H], 8 +L_016: mov eax, dword [rbp-270H] + cmp eax, dword [rbp-284H] + jc L_015 + add dword [rbp-274H], 2 +L_017: mov eax, dword [rbp-274H] + cmp eax, dword [rbp-288H] + jc L_014 + mov rax, qword [rbp-268H] + add rax, rax + mov rcx, qword [rbp-8H] + + + xor rcx, qword [fs:abs 28H] + jz L_018 +L_018: leave + ret + + +SECTION .data + + +SECTION .bss + + +SECTION .note.gnu.property align=8 + + db 04H, 00H, 00H, 00H, 10H, 00H, 00H, 00H + db 05H, 00H, 00H, 00H, 47H, 4EH, 55H, 00H + db 02H, 00H, 00H, 0C0H, 04H, 00H, 00H, 00H + db 03H, 00H, 00H, 00H, 00H, 00H, 00H, 00H + +%else +global __x86.get_pc_thunk.ax +extern __stack_chk_fail_local +extern _GLOBAL_OFFSET_TABLE_ +global highds_simd +global diff1st_simd +global diff2nd_simd + +SECTION .text + +highds_simd: + push ebp + mov ebp, esp + and esp, 0FFFFFFE0H + sub esp, 1632 + call __x86.get_pc_thunk.ax + add eax, _GLOBAL_OFFSET_TABLE_-$ + mov dword [esp+68H], 0 + mov dword [esp+6CH], 0 + mov word [esp+30H], 0 + mov word [esp+32H], 0 + mov word [esp+34H], -1 + mov word [esp+36H], -2 + mov word [esp+38H], -3 + mov word [esp+3AH], -3 + mov word [esp+3CH], -2 + mov word [esp+3EH], -1 + movzx eax, word [esp+3EH] + vmovd xmm0, eax + movzx eax, word [esp+3CH] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm1, xmm0 + movzx eax, word [esp+3AH] + vmovd xmm0, eax + movzx eax, word [esp+38H] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm4, xmm0 + movzx eax, word [esp+36H] + vmovd xmm0, eax + movzx eax, word [esp+34H] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm2, xmm0 + movzx eax, word [esp+32H] + vmovd xmm0, eax + movzx eax, word [esp+30H] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa 
xmm3, xmm0 + vpunpckldq xmm0, xmm1, xmm4 + vmovdqa xmm1, xmm0 + vpunpckldq xmm0, xmm2, xmm3 + vpunpcklqdq xmm0, xmm1, xmm0 + vmovaps oword [esp+0D0H], xmm0 + mov word [esp+20H], 0 + mov word [esp+22H], 0 + mov word [esp+24H], -1 + mov word [esp+26H], -3 + mov word [esp+28H], 12 + mov word [esp+2AH], 12 + mov word [esp+2CH], -3 + mov word [esp+2EH], -1 + movzx eax, word [esp+2EH] + vmovd xmm0, eax + movzx eax, word [esp+2CH] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm1, xmm0 + movzx eax, word [esp+2AH] + vmovd xmm0, eax + movzx eax, word [esp+28H] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm4, xmm0 + movzx eax, word [esp+26H] + vmovd xmm0, eax + movzx eax, word [esp+24H] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm2, xmm0 + movzx eax, word [esp+22H] + vmovd xmm0, eax + movzx eax, word [esp+20H] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm3, xmm0 + vpunpckldq xmm0, xmm1, xmm4 + vmovdqa xmm1, xmm0 + vpunpckldq xmm0, xmm2, xmm3 + vpunpcklqdq xmm0, xmm1, xmm0 + vmovaps oword [esp+0E0H], xmm0 + mov word [esp+10H], 0 + mov word [esp+12H], 0 + mov word [esp+14H], 0 + mov word [esp+16H], -1 + mov word [esp+18H], -1 + mov word [esp+1AH], -1 + mov word [esp+1CH], -1 + mov word [esp+1EH], 0 + movzx eax, word [esp+1EH] + vmovd xmm0, eax + movzx eax, word [esp+1CH] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm1, xmm0 + movzx eax, word [esp+1AH] + vmovd xmm0, eax + movzx eax, word [esp+18H] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm4, xmm0 + movzx eax, word [esp+16H] + vmovd xmm0, eax + movzx eax, word [esp+14H] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm2, xmm0 + movzx eax, word [esp+12H] + vmovd xmm0, eax + movzx eax, word [esp+10H] + vpinsrw xmm0, xmm0, eax, 1 + vmovdqa xmm3, xmm0 + vpunpckldq xmm0, xmm1, xmm4 + vmovdqa xmm1, xmm0 + vpunpckldq xmm0, xmm2, xmm3 + vpunpcklqdq xmm0, xmm1, xmm0 + vmovaps oword [esp+0F0H], xmm0 + mov eax, dword [ebp+0CH] + mov dword [esp+40H], eax + jmp L_008 + +L_001: mov eax, dword [ebp+8H] + mov dword [esp+44H], eax + jmp L_007 + +L_002: mov eax, dword 
[esp+40H] + sub eax, 2 + imul eax, dword [ebp+1CH] + mov edx, eax + mov eax, dword [esp+44H] + add eax, edx + add eax, 2147483646 + lea edx, [eax+eax] + mov eax, dword [ebp+18H] + add eax, edx + mov dword [esp+64H], eax + mov eax, dword [esp+64H] + vlddqu ymm0, yword [eax] + vmovdqa yword [esp+4E0H], ymm0 + mov eax, dword [esp+40H] + sub eax, 1 + imul eax, dword [ebp+1CH] + mov edx, eax + mov eax, dword [esp+44H] + add eax, edx + add eax, 2147483646 + lea edx, [eax+eax] + mov eax, dword [ebp+18H] + add eax, edx + mov dword [esp+60H], eax + mov eax, dword [esp+60H] + vlddqu ymm0, yword [eax] + vmovdqa yword [esp+500H], ymm0 + mov eax, dword [esp+40H] + imul eax, dword [ebp+1CH] + mov edx, eax + mov eax, dword [esp+44H] + add eax, edx + add eax, 2147483646 + lea edx, [eax+eax] + mov eax, dword [ebp+18H] + add eax, edx + mov dword [esp+5CH], eax + mov eax, dword [esp+5CH] + vlddqu ymm0, yword [eax] + vmovdqa yword [esp+520H], ymm0 + mov eax, dword [esp+40H] + add eax, 1 + imul eax, dword [ebp+1CH] + mov edx, eax + mov eax, dword [esp+44H] + add eax, edx + add eax, 2147483646 + lea edx, [eax+eax] + mov eax, dword [ebp+18H] + add eax, edx + mov dword [esp+58H], eax + mov eax, dword [esp+58H] + vlddqu ymm0, yword [eax] + vmovdqa yword [esp+540H], ymm0 + mov eax, dword [esp+40H] + add eax, 2 + imul eax, dword [ebp+1CH] + mov edx, eax + mov eax, dword [esp+44H] + add eax, edx + add eax, 2147483646 + lea edx, [eax+eax] + mov eax, dword [ebp+18H] + add eax, edx + mov dword [esp+54H], eax + mov eax, dword [esp+54H] + vlddqu ymm0, yword [eax] + vmovdqa yword [esp+560H], ymm0 + mov eax, dword [esp+40H] + add eax, 3 + imul eax, dword [ebp+1CH] + mov edx, eax + mov eax, dword [esp+44H] + add eax, edx + add eax, 2147483646 + lea edx, [eax+eax] + mov eax, dword [ebp+18H] + add eax, edx + mov dword [esp+50H], eax + mov eax, dword [esp+50H] + vlddqu ymm0, yword [eax] + vmovdqa yword [esp+580H], ymm0 + mov dword [esp+48H], 0 + jmp L_006 + +L_003: mov eax, dword [esp+48H] + lea edx, 
[eax*4] + mov eax, dword [esp+44H] + add eax, edx + cmp dword [ebp+10H], eax + jle L_004 + mov dword [esp+4CH], 0 + vmovdqa ymm0, yword [esp+520H] + vmovdqa yword [esp+640H], ymm0 + vmovdqa ymm0, yword [esp+640H] + vmovaps oword [esp+70H], xmm0 + vmovdqa ymm0, yword [esp+540H] + vmovdqa yword [esp+620H], ymm0 + vmovdqa ymm0, yword [esp+620H] + vmovaps oword [esp+80H], xmm0 + vmovdqa xmm0, oword [esp+70H] + vmovaps oword [esp+2E0H], xmm0 + vmovdqa xmm0, oword [esp+0E0H] + vmovaps oword [esp+2F0H], xmm0 + vmovdqa xmm0, oword [esp+2F0H] + vmovdqa xmm1, oword [esp+2E0H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+80H] + vmovaps oword [esp+2C0H], xmm0 + vmovdqa xmm0, oword [esp+0E0H] + vmovaps oword [esp+2D0H], xmm0 + vmovdqa xmm0, oword [esp+2D0H] + vmovdqa xmm1, oword [esp+2C0H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [esp+110H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+2A0H], xmm0 + vmovdqa xmm0, oword [esp+110H] + vmovaps oword [esp+2B0H], xmm0 + vmovdqa xmm1, oword [esp+2B0H] + vmovdqa xmm0, oword [esp+2A0H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+280H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+290H], xmm0 + vmovdqa xmm1, oword [esp+290H] + vmovdqa xmm0, oword [esp+280H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+260H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+270H], xmm0 + vmovdqa xmm1, oword [esp+270H] + vmovdqa xmm0, oword [esp+260H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovd eax, xmm0 + add dword [esp+4CH], eax + vmovdqa ymm0, yword [esp+500H] + vmovdqa yword [esp+600H], ymm0 + vmovdqa ymm0, yword [esp+600H] + vmovaps oword [esp+90H], xmm0 + vmovdqa ymm0, yword [esp+560H] + vmovdqa yword [esp+5E0H], ymm0 + vmovdqa ymm0, yword [esp+5E0H] + vmovaps oword 
[esp+0A0H], xmm0 + vmovdqa xmm0, oword [esp+90H] + vmovaps oword [esp+240H], xmm0 + vmovdqa xmm0, oword [esp+0D0H] + vmovaps oword [esp+250H], xmm0 + vmovdqa xmm0, oword [esp+250H] + vmovdqa xmm1, oword [esp+240H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+0A0H] + vmovaps oword [esp+220H], xmm0 + vmovdqa xmm0, oword [esp+0D0H] + vmovaps oword [esp+230H], xmm0 + vmovdqa xmm0, oword [esp+230H] + vmovdqa xmm1, oword [esp+220H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [esp+110H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+200H], xmm0 + vmovdqa xmm0, oword [esp+110H] + vmovaps oword [esp+210H], xmm0 + vmovdqa xmm1, oword [esp+210H] + vmovdqa xmm0, oword [esp+200H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+1E0H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+1F0H], xmm0 + vmovdqa xmm1, oword [esp+1F0H] + vmovdqa xmm0, oword [esp+1E0H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+1C0H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+1D0H], xmm0 + vmovdqa xmm1, oword [esp+1D0H] + vmovdqa xmm0, oword [esp+1C0H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovd eax, xmm0 + add dword [esp+4CH], eax + vmovdqa ymm0, yword [esp+4E0H] + vmovdqa yword [esp+5C0H], ymm0 + vmovdqa ymm0, yword [esp+5C0H] + vmovaps oword [esp+0B0H], xmm0 + vmovdqa ymm0, yword [esp+580H] + vmovdqa yword [esp+5A0H], ymm0 + vmovdqa ymm0, yword [esp+5A0H] + vmovaps oword [esp+0C0H], xmm0 + vmovdqa xmm0, oword [esp+0B0H] + vmovaps oword [esp+1A0H], xmm0 + vmovdqa xmm0, oword [esp+0F0H] + vmovaps oword [esp+1B0H], xmm0 + vmovdqa xmm0, oword [esp+1B0H] + vmovdqa xmm1, oword [esp+1A0H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+0C0H] + vmovaps oword [esp+180H], xmm0 + vmovdqa xmm0, oword 
[esp+0F0H] + vmovaps oword [esp+190H], xmm0 + vmovdqa xmm0, oword [esp+190H] + vmovdqa xmm1, oword [esp+180H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [esp+110H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+160H], xmm0 + vmovdqa xmm0, oword [esp+110H] + vmovaps oword [esp+170H], xmm0 + vmovdqa xmm1, oword [esp+170H] + vmovdqa xmm0, oword [esp+160H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+140H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+150H], xmm0 + vmovdqa xmm1, oword [esp+150H] + vmovdqa xmm0, oword [esp+140H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+120H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+130H], xmm0 + vmovdqa xmm1, oword [esp+130H] + vmovdqa xmm0, oword [esp+120H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovd eax, xmm0 + add dword [esp+4CH], eax + mov eax, dword [esp+4CH] + cdq + mov eax, edx + xor eax, dword [esp+4CH] + sub eax, edx + cdq + add dword [esp+68H], eax + adc dword [esp+6CH], edx +L_004: mov eax, dword [esp+48H] + lea edx, [eax*4] + mov eax, dword [esp+44H] + add eax, edx + add eax, 2 + cmp dword [ebp+10H], eax + jle L_005 + mov dword [esp+4CH], 0 + vmovdqa xmm0, oword [esp+70H] + vpsrldq xmm0, xmm0, 4 + vmovaps oword [esp+70H], xmm0 + vmovdqa xmm0, oword [esp+80H] + vpsrldq xmm0, xmm0, 4 + vmovaps oword [esp+80H], xmm0 + vmovdqa xmm0, oword [esp+70H] + vmovaps oword [esp+4C0H], xmm0 + vmovdqa xmm0, oword [esp+0E0H] + vmovaps oword [esp+4D0H], xmm0 + vmovdqa xmm0, oword [esp+4D0H] + vmovdqa xmm1, oword [esp+4C0H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+80H] + vmovaps oword [esp+4A0H], xmm0 + vmovdqa xmm0, oword [esp+0E0H] + vmovaps oword [esp+4B0H], xmm0 + vmovdqa xmm0, oword [esp+4B0H] + vmovdqa xmm1, oword [esp+4A0H] + vpmaddwd xmm0, xmm1, 
xmm0 + vmovaps oword [esp+110H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+480H], xmm0 + vmovdqa xmm0, oword [esp+110H] + vmovaps oword [esp+490H], xmm0 + vmovdqa xmm1, oword [esp+490H] + vmovdqa xmm0, oword [esp+480H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+460H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+470H], xmm0 + vmovdqa xmm1, oword [esp+470H] + vmovdqa xmm0, oword [esp+460H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+440H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+450H], xmm0 + vmovdqa xmm1, oword [esp+450H] + vmovdqa xmm0, oword [esp+440H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovd eax, xmm0 + add dword [esp+4CH], eax + vmovdqa xmm0, oword [esp+90H] + vpsrldq xmm0, xmm0, 4 + vmovaps oword [esp+90H], xmm0 + vmovdqa xmm0, oword [esp+0A0H] + vpsrldq xmm0, xmm0, 4 + vmovaps oword [esp+0A0H], xmm0 + vmovdqa xmm0, oword [esp+90H] + vmovaps oword [esp+420H], xmm0 + vmovdqa xmm0, oword [esp+0D0H] + vmovaps oword [esp+430H], xmm0 + vmovdqa xmm0, oword [esp+430H] + vmovdqa xmm1, oword [esp+420H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+0A0H] + vmovaps oword [esp+400H], xmm0 + vmovdqa xmm0, oword [esp+0D0H] + vmovaps oword [esp+410H], xmm0 + vmovdqa xmm0, oword [esp+410H] + vmovdqa xmm1, oword [esp+400H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [esp+110H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+3E0H], xmm0 + vmovdqa xmm0, oword [esp+110H] + vmovaps oword [esp+3F0H], xmm0 + vmovdqa xmm1, oword [esp+3F0H] + vmovdqa xmm0, oword [esp+3E0H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+3C0H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+3D0H], xmm0 + vmovdqa xmm1, oword 
[esp+3D0H] + vmovdqa xmm0, oword [esp+3C0H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+3A0H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+3B0H], xmm0 + vmovdqa xmm1, oword [esp+3B0H] + vmovdqa xmm0, oword [esp+3A0H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovd eax, xmm0 + add dword [esp+4CH], eax + vmovdqa xmm0, oword [esp+0B0H] + vpsrldq xmm0, xmm0, 4 + vmovaps oword [esp+0B0H], xmm0 + vmovdqa xmm0, oword [esp+0C0H] + vpsrldq xmm0, xmm0, 4 + vmovaps oword [esp+0C0H], xmm0 + vmovdqa xmm0, oword [esp+0B0H] + vmovaps oword [esp+380H], xmm0 + vmovdqa xmm0, oword [esp+0F0H] + vmovaps oword [esp+390H], xmm0 + vmovdqa xmm0, oword [esp+390H] + vmovdqa xmm1, oword [esp+380H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+0C0H] + vmovaps oword [esp+360H], xmm0 + vmovdqa xmm0, oword [esp+0F0H] + vmovaps oword [esp+370H], xmm0 + vmovdqa xmm0, oword [esp+370H] + vmovdqa xmm1, oword [esp+360H] + vpmaddwd xmm0, xmm1, xmm0 + vmovaps oword [esp+110H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+340H], xmm0 + vmovdqa xmm0, oword [esp+110H] + vmovaps oword [esp+350H], xmm0 + vmovdqa xmm1, oword [esp+350H] + vmovdqa xmm0, oword [esp+340H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+320H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+330H], xmm0 + vmovdqa xmm1, oword [esp+330H] + vmovdqa xmm0, oword [esp+320H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+300H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovaps oword [esp+310H], xmm0 + vmovdqa xmm1, oword [esp+310H] + vmovdqa xmm0, oword [esp+300H] + vphaddd xmm0, xmm0, xmm1 + vmovaps oword [esp+100H], xmm0 + vmovdqa xmm0, oword [esp+100H] + vmovd eax, xmm0 + add dword [esp+4CH], eax + mov 
eax, dword [esp+4CH] + cdq + mov eax, edx + xor eax, dword [esp+4CH] + sub eax, edx + cdq + add dword [esp+68H], eax + adc dword [esp+6CH], edx + vpermq ymm0, yword [esp+4E0H], 39H + vmovdqa yword [esp+4E0H], ymm0 + vpermq ymm0, yword [esp+500H], 39H + vmovdqa yword [esp+500H], ymm0 + vpermq ymm0, yword [esp+520H], 39H + vmovdqa yword [esp+520H], ymm0 + vpermq ymm0, yword [esp+540H], 39H + vmovdqa yword [esp+540H], ymm0 + vpermq ymm0, yword [esp+560H], 39H + vmovdqa yword [esp+560H], ymm0 + vpermq ymm0, yword [esp+580H], 39H + vmovdqa yword [esp+580H], ymm0 +L_005: add dword [esp+48H], 1 +L_006: cmp dword [esp+48H], 2 + jle L_003 + add dword [esp+44H], 12 +L_007: mov eax, dword [esp+44H] + cmp eax, dword [ebp+10H] + jl L_002 + add dword [esp+40H], 2 +L_008: mov eax, dword [esp+40H] + cmp eax, dword [ebp+14H] + jl L_001 + mov eax, dword [esp+68H] + mov edx, dword [esp+6CH] + leave + ret + +diff1st_simd: + push ebp + mov ebp, esp + sub esp, 440 + call __x86.get_pc_thunk.ax + add eax, _GLOBAL_OFFSET_TABLE_-$ + mov eax, dword [ebp+10H] + mov dword [ebp-1ACH], eax + mov eax, dword [ebp+14H] + mov dword [ebp-1B0H], eax + + mov eax, dword [gs:14H] + mov dword [ebp-0CH], eax + xor eax, eax + mov dword [ebp-180H], 0 + mov dword [ebp-17CH], 0 + + mov word [ebp-1A2H], 0 + mov dword [ebp-1A0H], 0 + jmp L_012 + +L_009: mov dword [ebp-19CH], 0 + jmp L_011 + +L_010: mov eax, dword [ebp+18H] + imul eax, dword [ebp-1A0H] + mov edx, eax + mov eax, dword [ebp-19CH] + add eax, edx + lea edx, [eax+eax] + mov eax, dword [ebp-1ACH] + add eax, edx + mov dword [ebp-184H], eax + mov eax, dword [ebp-184H] + vlddqu xmm0, oword [eax] + vmovaps oword [ebp-178H], xmm0 + mov eax, dword [ebp-1A0H] + lea edx, [eax+1H] + mov eax, dword [ebp+18H] + imul edx, eax + mov eax, dword [ebp-19CH] + add eax, edx + lea edx, [eax+eax] + mov eax, dword [ebp-1ACH] + add eax, edx + mov dword [ebp-188H], eax + mov eax, dword [ebp-188H] + vlddqu xmm0, oword [eax] + vmovaps oword [ebp-168H], xmm0 + mov eax, dword 
[ebp+18H] + imul eax, dword [ebp-1A0H] + mov edx, eax + mov eax, dword [ebp-19CH] + add eax, edx + lea edx, [eax+eax] + mov eax, dword [ebp-1B0H] + add eax, edx + mov dword [ebp-18CH], eax + mov eax, dword [ebp-18CH] + vlddqu xmm0, oword [eax] + vmovaps oword [ebp-158H], xmm0 + mov eax, dword [ebp-1A0H] + lea edx, [eax+1H] + mov eax, dword [ebp+18H] + imul edx, eax + mov eax, dword [ebp-19CH] + add eax, edx + lea edx, [eax+eax] + mov eax, dword [ebp-1B0H] + add eax, edx + mov dword [ebp-190H], eax + mov eax, dword [ebp-190H] + vlddqu xmm0, oword [eax] + vmovaps oword [ebp-148H], xmm0 + vmovdqa xmm0, oword [ebp-178H] + vmovaps oword [ebp-38H], xmm0 + vmovdqa xmm0, oword [ebp-168H] + vmovaps oword [ebp-28H], xmm0 + vmovdqa xmm1, oword [ebp-38H] + vmovdqa xmm0, oword [ebp-28H] + vpaddw xmm0, xmm1, xmm0 + vmovaps oword [ebp-138H], xmm0 + vmovdqa xmm0, oword [ebp-158H] + vmovaps oword [ebp-58H], xmm0 + vmovdqa xmm0, oword [ebp-148H] + vmovaps oword [ebp-48H], xmm0 + vmovdqa xmm1, oword [ebp-58H] + vmovdqa xmm0, oword [ebp-48H] + vpaddw xmm0, xmm1, xmm0 + vmovaps oword [ebp-128H], xmm0 + vmovdqa xmm0, oword [ebp-138H] + vmovaps oword [ebp-78H], xmm0 + vmovdqa xmm0, oword [ebp-128H] + vmovaps oword [ebp-68H], xmm0 + vmovdqa xmm0, oword [ebp-78H] + vmovdqa xmm1, oword [ebp-68H] + vpsubw xmm0, xmm0, xmm1 + vmovaps oword [ebp-128H], xmm0 + vmovdqa xmm0, oword [ebp-128H] + vmovaps oword [ebp-98H], xmm0 + vmovdqa xmm0, oword [ebp-128H] + vmovaps oword [ebp-88H], xmm0 + vmovdqa xmm1, oword [ebp-88H] + vmovdqa xmm0, oword [ebp-98H] + vphaddw xmm0, xmm0, xmm1 + vmovaps oword [ebp-128H], xmm0 + vmovdqa xmm0, oword [ebp-128H] + vmovaps oword [ebp-0A8H], xmm0 + vmovdqa xmm0, oword [ebp-0A8H] + vpabsw xmm0, xmm0 + vmovaps oword [ebp-128H], xmm0 + vmovdqa xmm0, oword [ebp-128H] + vmovaps oword [ebp-0C8H], xmm0 + vmovdqa xmm0, oword [ebp-128H] + vmovaps oword [ebp-0B8H], xmm0 + vmovdqa xmm1, oword [ebp-0B8H] + vmovdqa xmm0, oword [ebp-0C8H] + vphaddsw xmm0, xmm0, xmm1 + vmovaps oword 
[ebp-128H], xmm0 + vmovdqa xmm0, oword [ebp-128H] + vmovaps oword [ebp-0E8H], xmm0 + vmovdqa xmm0, oword [ebp-128H] + vmovaps oword [ebp-0D8H], xmm0 + vmovdqa xmm1, oword [ebp-0D8H] + vmovdqa xmm0, oword [ebp-0E8H] + vphaddsw xmm0, xmm0, xmm1 + vmovaps oword [ebp-128H], xmm0 + vmovdqa xmm0, oword [ebp-128H] + vmovaps oword [ebp-0F8H], xmm0 + vmovdqa xmm0, oword [ebp-0F8H] + vmovd edx, xmm0 + lea eax, [ebp-1A2H] + mov word [eax], dx + movzx ecx, word [ebp-1A2H] + movzx eax, cx + mov edx, 0 + add dword [ebp-180H], eax + adc dword [ebp-17CH], edx + mov eax, dword [ebp+18H] + imul eax, dword [ebp-1A0H] + mov edx, eax + mov eax, dword [ebp-19CH] + add eax, edx + lea edx, [eax+eax] + mov eax, dword [ebp-1B0H] + add eax, edx + mov dword [ebp-194H], eax + vmovdqa xmm0, oword [ebp-178H] + vmovaps oword [ebp-108H], xmm0 + vmovdqa xmm0, oword [ebp-108H] + mov eax, dword [ebp-194H] + vmovups oword [eax], xmm0 + nop + mov eax, dword [ebp-1A0H] + lea edx, [eax+1H] + mov eax, dword [ebp+18H] + imul edx, eax + mov eax, dword [ebp-19CH] + add eax, edx + lea edx, [eax+eax] + mov eax, dword [ebp-1B0H] + add eax, edx + mov dword [ebp-198H], eax + vmovdqa xmm0, oword [ebp-168H] + vmovaps oword [ebp-118H], xmm0 + vmovdqa xmm0, oword [ebp-118H] + mov eax, dword [ebp-198H] + vmovups oword [eax], xmm0 + nop + add dword [ebp-19CH], 8 +L_011: mov eax, dword [ebp-19CH] + cmp eax, dword [ebp+8H] + jc L_010 + add dword [ebp-1A0H], 2 +L_012: mov eax, dword [ebp-1A0H] + cmp eax, dword [ebp+0CH] + jc L_009 + mov eax, dword [ebp-180H] + mov edx, dword [ebp-17CH] + shld edx, eax, 1 + add eax, eax + mov ecx, dword [ebp-0CH] + + xor ecx, dword [gs:14H] + jz L_013 +L_013: leave + ret + +diff2nd_simd: + push ebp + mov ebp, esp + sub esp, 616 + call __x86.get_pc_thunk.ax + add eax, _GLOBAL_OFFSET_TABLE_-$ + mov eax, dword [ebp+10H] + mov dword [ebp-25CH], eax + mov eax, dword [ebp+14H] + mov dword [ebp-260H], eax + mov eax, dword [ebp+18H] + mov dword [ebp-264H], eax + + mov eax, dword [gs:14H] + mov 
dword [ebp-0CH], eax + xor eax, eax + mov dword [ebp-220H], 0 + mov dword [ebp-21CH], 0 + + mov word [ebp-256H], 0 + mov dword [ebp-254H], 0 + jmp L_017 + +L_014: mov dword [ebp-250H], 0 + jmp L_016 + +L_015: mov eax, dword [ebp+1CH] + imul eax, dword [ebp-254H] + mov edx, eax + mov eax, dword [ebp-250H] + add eax, edx + lea edx, [eax+eax] + mov eax, dword [ebp-25CH] + add eax, edx + mov dword [ebp-224H], eax + mov eax, dword [ebp-224H] + vlddqu xmm0, oword [eax] + vmovaps oword [ebp-218H], xmm0 + mov eax, dword [ebp-254H] + lea edx, [eax+1H] + mov eax, dword [ebp+1CH] + imul edx, eax + mov eax, dword [ebp-250H] + add eax, edx + lea edx, [eax+eax] + mov eax, dword [ebp-25CH] + add eax, edx + mov dword [ebp-228H], eax + mov eax, dword [ebp-228H] + vlddqu xmm0, oword [eax] + vmovaps oword [ebp-208H], xmm0 + mov eax, dword [ebp+1CH] + imul eax, dword [ebp-254H] + mov edx, eax + mov eax, dword [ebp-250H] + add eax, edx + lea edx, [eax+eax] + mov eax, dword [ebp-260H] + add eax, edx + mov dword [ebp-22CH], eax + mov eax, dword [ebp-22CH] + vlddqu xmm0, oword [eax] + vmovaps oword [ebp-1F8H], xmm0 + mov eax, dword [ebp-254H] + lea edx, [eax+1H] + mov eax, dword [ebp+1CH] + imul edx, eax + mov eax, dword [ebp-250H] + add eax, edx + lea edx, [eax+eax] + mov eax, dword [ebp-260H] + add eax, edx + mov dword [ebp-230H], eax + mov eax, dword [ebp-230H] + vlddqu xmm0, oword [eax] + vmovaps oword [ebp-1E8H], xmm0 + mov eax, dword [ebp+1CH] + imul eax, dword [ebp-254H] + mov edx, eax + mov eax, dword [ebp-250H] + add eax, edx + lea edx, [eax+eax] + mov eax, dword [ebp-264H] + add eax, edx + mov dword [ebp-234H], eax + mov eax, dword [ebp-234H] + vlddqu xmm0, oword [eax] + vmovaps oword [ebp-1D8H], xmm0 + mov eax, dword [ebp-254H] + lea edx, [eax+1H] + mov eax, dword [ebp+1CH] + imul edx, eax + mov eax, dword [ebp-250H] + add eax, edx + lea edx, [eax+eax] + mov eax, dword [ebp-264H] + add eax, edx + mov dword [ebp-238H], eax + mov eax, dword [ebp-238H] + vlddqu xmm0, oword [eax] + 
vmovaps oword [ebp-1C8H], xmm0 + vmovdqa xmm0, oword [ebp-218H] + vmovaps oword [ebp-38H], xmm0 + vmovdqa xmm0, oword [ebp-208H] + vmovaps oword [ebp-28H], xmm0 + vmovdqa xmm1, oword [ebp-38H] + vmovdqa xmm0, oword [ebp-28H] + vpaddw xmm0, xmm1, xmm0 + vmovaps oword [ebp-1B8H], xmm0 + vmovdqa xmm0, oword [ebp-1F8H] + vmovaps oword [ebp-58H], xmm0 + vmovdqa xmm0, oword [ebp-1E8H] + vmovaps oword [ebp-48H], xmm0 + vmovdqa xmm1, oword [ebp-58H] + vmovdqa xmm0, oword [ebp-48H] + vpaddw xmm0, xmm1, xmm0 + vmovaps oword [ebp-1A8H], xmm0 + vmovdqa xmm0, oword [ebp-1D8H] + vmovaps oword [ebp-78H], xmm0 + vmovdqa xmm0, oword [ebp-1C8H] + vmovaps oword [ebp-68H], xmm0 + vmovdqa xmm1, oword [ebp-78H] + vmovdqa xmm0, oword [ebp-68H] + vpaddw xmm0, xmm1, xmm0 + vmovaps oword [ebp-198H], xmm0 + vmovdqa xmm0, oword [ebp-1B8H] + vmovaps oword [ebp-98H], xmm0 + vmovdqa xmm0, oword [ebp-198H] + vmovaps oword [ebp-88H], xmm0 + vmovdqa xmm1, oword [ebp-98H] + vmovdqa xmm0, oword [ebp-88H] + vpaddw xmm0, xmm1, xmm0 + vmovaps oword [ebp-1B8H], xmm0 + vmovdqa xmm0, oword [ebp-1B8H] + vmovaps oword [ebp-0B8H], xmm0 + vmovdqa xmm0, oword [ebp-1A8H] + vmovaps oword [ebp-0A8H], xmm0 + vmovdqa xmm1, oword [ebp-0A8H] + vmovdqa xmm0, oword [ebp-0B8H] + vphaddw xmm0, xmm0, xmm1 + vmovaps oword [ebp-1B8H], xmm0 + vmovdqa xmm0, oword [ebp-1B8H] + vpshufd xmm0, xmm0, 0EEH + vmovaps oword [ebp-1A8H], xmm0 + vmovdqa xmm0, oword [ebp-1A8H] + vmovaps oword [ebp-0C8H], xmm0 + mov dword [ebp-23CH], 1 + vmovdqa xmm1, oword [ebp-0C8H] + vmovd xmm0, dword [ebp-23CH] + vpsllw xmm0, xmm1, xmm0 + vmovaps oword [ebp-1A8H], xmm0 + vmovdqa xmm0, oword [ebp-1B8H] + vmovaps oword [ebp-0E8H], xmm0 + vmovdqa xmm0, oword [ebp-1A8H] + vmovaps oword [ebp-0D8H], xmm0 + vmovdqa xmm0, oword [ebp-0E8H] + vmovdqa xmm1, oword [ebp-0D8H] + vpsubw xmm0, xmm0, xmm1 + vmovaps oword [ebp-1A8H], xmm0 + vmovdqa xmm0, oword [ebp-1A8H] + vmovaps oword [ebp-0F8H], xmm0 + vmovdqa xmm0, oword [ebp-0F8H] + vpabsw xmm0, xmm0 + vmovaps 
oword [ebp-1A8H], xmm0 + vmovdqa xmm0, oword [ebp-1A8H] + vmovaps oword [ebp-118H], xmm0 + vmovdqa xmm0, oword [ebp-1A8H] + vmovaps oword [ebp-108H], xmm0 + vmovdqa xmm1, oword [ebp-108H] + vmovdqa xmm0, oword [ebp-118H] + vphaddsw xmm0, xmm0, xmm1 + vmovaps oword [ebp-1A8H], xmm0 + vmovdqa xmm0, oword [ebp-1A8H] + vmovaps oword [ebp-138H], xmm0 + vmovdqa xmm0, oword [ebp-1A8H] + vmovaps oword [ebp-128H], xmm0 + vmovdqa xmm1, oword [ebp-128H] + vmovdqa xmm0, oword [ebp-138H] + vphaddsw xmm0, xmm0, xmm1 + vmovaps oword [ebp-1A8H], xmm0 + vmovdqa xmm0, oword [ebp-1A8H] + vmovaps oword [ebp-148H], xmm0 + vmovdqa xmm0, oword [ebp-148H] + vmovd edx, xmm0 + lea eax, [ebp-256H] + mov word [eax], dx + movzx ecx, word [ebp-256H] + movzx eax, cx + mov edx, 0 + add dword [ebp-220H], eax + adc dword [ebp-21CH], edx + mov eax, dword [ebp+1CH] + imul eax, dword [ebp-254H] + mov edx, eax + mov eax, dword [ebp-250H] + add eax, edx + lea edx, [eax+eax] + mov eax, dword [ebp-264H] + add eax, edx + mov dword [ebp-240H], eax + vmovdqa xmm0, oword [ebp-1F8H] + vmovaps oword [ebp-158H], xmm0 + vmovdqa xmm0, oword [ebp-158H] + mov eax, dword [ebp-240H] + vmovups oword [eax], xmm0 + nop + mov eax, dword [ebp-254H] + lea edx, [eax+1H] + mov eax, dword [ebp+1CH] + imul edx, eax + mov eax, dword [ebp-250H] + add eax, edx + lea edx, [eax+eax] + mov eax, dword [ebp-264H] + add eax, edx + mov dword [ebp-244H], eax + vmovdqa xmm0, oword [ebp-1E8H] + vmovaps oword [ebp-168H], xmm0 + vmovdqa xmm0, oword [ebp-168H] + mov eax, dword [ebp-244H] + vmovups oword [eax], xmm0 + nop + mov eax, dword [ebp+1CH] + imul eax, dword [ebp-254H] + mov edx, eax + mov eax, dword [ebp-250H] + add eax, edx + lea edx, [eax+eax] + mov eax, dword [ebp-260H] + add eax, edx + mov dword [ebp-248H], eax + vmovdqa xmm0, oword [ebp-218H] + vmovaps oword [ebp-178H], xmm0 + vmovdqa xmm0, oword [ebp-178H] + mov eax, dword [ebp-248H] + vmovups oword [eax], xmm0 + nop + mov eax, dword [ebp-254H] + lea edx, [eax+1H] + mov eax, 
dword [ebp+1CH] + imul edx, eax + mov eax, dword [ebp-250H] + add eax, edx + lea edx, [eax+eax] + mov eax, dword [ebp-260H] + add eax, edx + mov dword [ebp-24CH], eax + vmovdqa xmm0, oword [ebp-208H] + vmovaps oword [ebp-188H], xmm0 + vmovdqa xmm0, oword [ebp-188H] + mov eax, dword [ebp-24CH] + vmovups oword [eax], xmm0 + nop + add dword [ebp-250H], 8 +L_016: mov eax, dword [ebp-250H] + cmp eax, dword [ebp+8H] + jc L_015 + add dword [ebp-254H], 2 +L_017: mov eax, dword [ebp-254H] + cmp eax, dword [ebp+0CH] + jc L_014 + mov eax, dword [ebp-220H] + mov edx, dword [ebp-21CH] + shld edx, eax, 1 + add eax, eax + mov ecx, dword [ebp-0CH] + + xor ecx, dword [gs:14H] + jz L_018 +L_018: leave + ret + + +SECTION .data + + +SECTION .bss + + +SECTION .text.__x86.get_pc_thunk.ax + +__x86.get_pc_thunk.ax: + mov eax, dword [esp] + ret + + + +SECTION .note.gnu.property align=4 + + db 04H, 00H, 00H, 00H, 0CH, 00H, 00H, 00H + db 05H, 00H, 00H, 00H, 47H, 4EH, 55H, 00H + db 02H, 00H, 00H, 0C0H, 04H, 00H, 00H, 00H + db 03H, 00H, 00H, 00H + +%endif diff --git a/libavfilter/x86/vf_xpsnr_init.c b/libavfilter/x86/vf_xpsnr_init.c new file mode 100644 index 0000000000..825fc2f995 --- /dev/null +++ b/libavfilter/x86/vf_xpsnr_init.c @@ -0,0 +1,58 @@ +/* + * Copyright (c) 2023 Christian R. Helmrich + * Copyright (c) 2023 Christian Stoffers + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/**
+ * @file
+ * SIMD initialization for calculation of extended perceptually weighted PSNR (XPSNR).
+ *
+ * Authors: Christian Helmrich and Christian Stoffers, Fraunhofer HHI, Berlin, Germany
+ */
+
+#include "libavutil/x86/cpu.h"
+#include "libavfilter/xpsnr.h"
+
+uint64_t ff_sse_line_16bit_sse2 (const uint8_t *buf, const uint8_t *ref, const int w);
+#ifdef __AVX2__
+uint64_t highds_simd (const int x_act, const int y_act, const int w_act, const int h_act, const int16_t *o_m0, const int o);
+uint64_t diff1st_simd(const uint32_t w_act, const uint32_t h_act, const int16_t *o_m0, int16_t *o_m1, const int o);
+uint64_t diff2nd_simd(const uint32_t w_act, const uint32_t h_act, const int16_t *o_m0, int16_t *o_m1, int16_t *o_m2, const int o);
+#endif
+
+void ff_xpsnr_init_x86 (PSNRDSPContext *dsp, const int bpp)
+{
+    if (bpp <= 15) /* XPSNR always operates with 16-bit internal precision */
+    {
+        const int cpu_flags = av_get_cpu_flags();
+
+        if (EXTERNAL_SSE2 (cpu_flags))
+        {
+            dsp->sse_line = ff_sse_line_16bit_sse2;
+        }
+        if (EXTERNAL_AVX2 (cpu_flags))
+        {
+#ifdef __AVX2__
+            dsp->highds_func = highds_simd;
+            dsp->diff1st_func = diff1st_simd;
+            dsp->diff2nd_func = diff2nd_simd;
+#endif
+        }
+    }
+}
diff --git a/libavfilter/xpsnr.h b/libavfilter/xpsnr.h
new file mode 100644
index 0000000000..f07179e449
--- /dev/null
+++ b/libavfilter/xpsnr.h
@@ -0,0 +1,48 @@
+/*
+ * Copyright (c) 2023 Christian R. Helmrich
+ * Copyright (c) 2023 Christian Stoffers
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/**
+ * @file
+ * Public declaration of DSP context structure of XPSNR measurement filter for FFmpeg.
+ *
+ * Authors: Christian Helmrich and Christian Stoffers, Fraunhofer HHI, Berlin, Germany
+ */
+
+#ifndef AVFILTER_XPSNR_H
+#define AVFILTER_XPSNR_H
+
+#include
+#include
+#include "libavutil/x86/cpu.h"
+
+/* public XPSNR DSP structure definition */
+
+typedef struct XPSNRDSPContext
+{
+    uint64_t (*sse_line) (const uint8_t *buf, const uint8_t *ref, const int w);
+    uint64_t (*highds_func) (const int x_act, const int y_act, const int w_act, const int h_act, const int16_t *o_m0, const int o);
+    uint64_t (*diff1st_func)(const uint32_t w_act, const uint32_t h_act, const int16_t *o_m0, int16_t *o_m1, const int o);
+    uint64_t (*diff2nd_func)(const uint32_t w_act, const uint32_t h_act, const int16_t *o_m0, int16_t *o_m1, int16_t *o_m2, const int o);
+} PSNRDSPContext;
+
+void ff_xpsnr_init_x86 (PSNRDSPContext *dsp, const int bpp);
+
+#endif /* AVFILTER_XPSNR_H */