diff mbox series

[FFmpeg-devel] Request for adding XPSNR avfilter

Message ID b18a683b853f49ffa1f66c4b62c3de91@hhi.fraunhofer.de
State New

Checks

Context | Check | Description
andriy/commit_msg_x86 | warning | The first line of the commit message must start with a context terminated by a colon and a space, for example "lavu/opt: " or "doc: ".
yinshiyou/commit_msg_loongarch64 | warning | The first line of the commit message must start with a context terminated by a colon and a space, for example "lavu/opt: " or "doc: ".
yinshiyou/make_loongarch64 | success | Make finished
yinshiyou/make_fate_loongarch64 | fail | Make fate failed
andriy/make_x86 | success | Make finished
andriy/make_fate_x86 | fail | Make fate failed

Commit Message

Helmrich, Christian Jan. 10, 2023, 7:39 p.m. UTC
Hi,

please find attached a patch (relative to FFmpeg master as of early January 10, 2023)
adding avfilter support for extended perceptually weighted peak signal-to-noise ratio
(XPSNR) measurements for videos, as described in the related addition to filters.texi.

The XPSNR code was originally vectorized using SIMD intrinsics, but we concluded that
FFmpeg code requires asm instead of such intrinsics, so we let gcc auto-convert these
instructions to pure assembly; see the vf_xpsnr.asm file. If the added asm code is too
lengthy, intrinsics would be possible, or something else is missing, please let us know.

Best,

Christian Helmrich and Christian Stoffers
Fraunhofer HHI

Comments

Paul B Mahol Jan. 10, 2023, 8:43 p.m. UTC | #1
On 1/10/23, Helmrich, Christian <christian.helmrich@hhi.fraunhofer.de> wrote:
> Hi,
>
> please find attached a patch (relative to FFmpeg master as of early January
> 10, 2023)
> adding avfilter support for extended perceptually weighted peak
> signal-to-noise ratio
> (XPSNR) measurements for videos, as described in the related addition to
> filters.texi.
>
> The XPSNR code was originally vectorized using SIMD intrinsics, but we
> concluded that
> FFmpeg code requires asm instead of such intrinsics, so we let gcc
> auto-convert these

So it's better to use that instead of human-written assembly?
Does clang generate faster code without this asm?

> instructions to pure assembly; see the vf_xpsnr.asm file. If the added asm
> code is too
> lengthy, intrinsics would be possible, or something else is missing, please
> let us know.
>

Please remove SLICE_THREADS related flag as there is no call to
execute to filter in slices.
Please remove stdbool.h header and adapt code to compile without it.
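
For illustration, one way to drop the stdbool.h dependency is to store the flags as plain int. This is only a sketch under assumptions: DemoContext and its helpers are hypothetical names that merely mirror the and_is_inf and is_rgb fields of the patch, and the meaning given to and_is_inf here is a guess.

```c
#include <stdint.h>

/* Hypothetical stand-in for the flag fields of XPSNRContext;
 * plain int replaces bool so that <stdbool.h> is not needed. */
typedef struct DemoContext {
    int and_is_inf[3]; /* assumed meaning: nonzero while all frames so far had zero error */
    int is_rgb;        /* nonzero for RGB input */
} DemoContext;

/* Reset the per-plane flags to "true" (1) and assume non-RGB input. */
static void demo_init(DemoContext *s)
{
    for (int c = 0; c < 3; c++)
        s->and_is_inf[c] = 1;
    s->is_rgb = 0;
}

/* Fold one frame's result into the running flag: the flag stays set
 * only if the squared error of plane c was zero for every frame. */
static void demo_update(DemoContext *s, int c, uint64_t sse)
{
    s->and_is_inf[c] &= (sse == 0);
}
```

Comparisons such as (sse == 0) already yield 0 or 1 in C, so the logic carries over without any other change.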

Helmrich, Christian Jan. 11, 2023, 11:42 a.m. UTC | #2
Hi,


> So it's better to use that instead of human-written assembly? Does clang generate faster code without this asm?


I'm not sure I fully understand your questions, but I hope the following answers them. The reason we auto-converted our intrinsics code to asm is not a technical one; we unfortunately just don't have the knowledge or resources to write asm code by hand. If I remember correctly, the SIMD-optimized code runs about twice as fast as the C code, especially on UHD input.


> Please remove SLICE_THREADS related flag as there is no call to execute to filter in slices. Please remove stdbool.h header and adapt code to compile without it.


Done, please find attached a second version (v1) of the XPSNR avfilter patch.


Thanks and best,


Christian Helmrich

Fraunhofer HHI, Video Coding and Analytics Department
Paul B Mahol Jan. 11, 2023, 11:48 a.m. UTC | #3
On 1/11/23, Helmrich, Christian <christian.helmrich@hhi.fraunhofer.de> wrote:
> Hi,
>
>
>> So it's better to use that instead of human-written assembly? Does clang
>> generate faster code without this asm?
>
>
> I'm not sure I fully understand your questions, but I hope the following
> answers it. The reason why we auto-converted our intrinsics code to asm is
> not a technical one, we unfortunately just don't have the knowledge or
> resources to manually write asm code. If I remember correctly, the SIMD
> optimized code runs about twice as fast as the C code, especially on UHD
> input.

Compare clang-compiled ffmpeg with and without this asm code, and
report whether there is any difference.
I might do it anyway later.

Paul B Mahol Jan. 11, 2023, 11:53 a.m. UTC | #4
On 1/11/23, Paul B Mahol <onemda@gmail.com> wrote:
> On 1/11/23, Helmrich, Christian <christian.helmrich@hhi.fraunhofer.de>
> wrote:
>> Hi,
>>
>>
>>> So it's better to use that instead of human-written assembly? Does clang
>>> generate faster code without this asm?
>>
>>
>> I'm not sure I fully understand your questions, but I hope the following
>> answers it. The reason why we auto-converted our intrinsics code to asm
>> is
>> not a technical one, we unfortunately just don't have the knowledge or
>> resources to manually write asm code. If I remember correctly, the SIMD
>> optimized code runs about twice as fast as the C code, especially on UHD
>> input.
>
> Compare clang compiled ffmpeg without this asm code and with it, and
> tell if any difference.
> I might do it anyway later.

Also please fix style of code, look at other filters in codebase, for
example vf_psnr.c filter

Use "for () {\n" instead of "for () \n{\n}"
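
For illustration, the requested brace placement might look like the following sketch. The helper sum_abs_block is hypothetical and not taken from the patch; it only shows the layout: the opening brace of each "for" stays on the statement line, and only the function body brace sits on its own line.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the patch: sums absolute sample
 * values over a w x h block whose rows are `stride` samples apart. */
static uint64_t sum_abs_block(const int16_t *src, int stride, int w, int h)
{
    uint64_t sum = 0;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            const int v = src[y * stride + x];

            sum += (uint64_t)(v < 0 ? -v : v);
        }
    }
    return sum;
}
```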

Helmrich, Christian Jan. 11, 2023, 4:53 p.m. UTC | #5
Hi,


> Also please fix style of code, ... example vf_psnr.c filter ... "for () {\n" instead of "for () \n{\n}"


Done, I aligned block encapsulation, indentation, and some other things with those in vf_psnr.c


> Compare clang compiled ffmpeg without this asm code and with it, and tell if any difference.

> I might do it anyway later.

Strange, the asm code is now only barely (a few percent at most) faster than the C-loop code on
our side. Maybe the compilers or CPUs have improved since we last tested? Anyway, we decided
to make a new patch without the asm file, but keep the function pointers in case we manage to
write better SIMD for the highds, diff1st, and diff2nd functions later (for a smaller patch then).
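
For illustration, keeping function pointers so that SIMD versions can be plugged in later could be sketched as below. This is a sketch under assumptions: DemoDSPContext, sse_line_c, and demo_dsp_init are hypothetical names, and the actual DSP context in the patch may be laid out differently.

```c
#include <stdint.h>

/* Hypothetical DSP table: callers go through the pointer, so an
 * optimized routine can replace the C version without touching them. */
typedef struct DemoDSPContext {
    uint64_t (*sse_line)(const uint16_t *a, const uint16_t *b, int w);
} DemoDSPContext;

/* Plain C reference: sum of squared differences over one pixel line. */
static uint64_t sse_line_c(const uint16_t *a, const uint16_t *b, int w)
{
    uint64_t sse = 0;

    for (int x = 0; x < w; x++) {
        const int64_t e = (int64_t)a[x] - (int64_t)b[x];

        sse += (uint64_t)(e * e);
    }
    return sse;
}

/* Install the C fallback. An architecture-specific init could override
 * the pointer afterwards, e.g. after checking CPU flags (omitted here). */
static void demo_dsp_init(DemoDSPContext *dsp)
{
    dsp->sse_line = sse_line_c;
}
```

With this layout, removing the asm file only means that the pointer keeps its C default; the call sites stay unchanged.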

I prepared a new avfilter_xpsnr_v2.patch. Do I need to change the email (thread) title somehow
so that a new pipeline is being triggered?


Best,


Christian Helmrich

Fraunhofer HHI, Video Coding and Analytics Department
Paul B Mahol Jan. 11, 2023, 6:33 p.m. UTC | #6
On 1/11/23, Helmrich, Christian <christian.helmrich@hhi.fraunhofer.de> wrote:
> Hi,
>
>
>> Also please fix style of code, ... example vf_psnr.c filter ... "for ()
>> {\n" instead of "for () \n{\n}"
>
>
> Done, I aligned block encapsulation, indentation, and some other things with
> those in vf_psnr.c
>
>
>> Compare clang compiled ffmpeg without this asm code and with it, and tell
>> if any difference.
>
>> I might do it anyway later.
>
> Strange, the asm code is now only barely (a few percent at most) faster than
> the C-loop code on
> our side. Maybe the compilers or CPUs have improved since we last tested?
> Anyway, we decided
> to make a new patch without the asm file, but keep the function pointers in
> case we manage to
> write better SIMD for the highds, diff1st, and diff2nd function later (for a
> smaller patch then).
>
> I prepared a new avfilter_xpsnr_v2.patch. Do I need to change the email
> (thread) title somehow
> so that a new pipeline is being triggered?

Not needed, just attach patch with proper authorship, made with git
format-patch.

Helmrich, Christian Jan. 11, 2023, 7:15 p.m. UTC | #7
Hi,


> just attach patch with proper authorship, made with git format-patch.


Done, attached.


Best,


Christian Helmrich

Fraunhofer HHI, Video Coding and Analytics Department
Helmrich, Christian Jan. 17, 2023, 1:46 p.m. UTC | #8
> just attach patch with proper authorship, made with git format-patch.


Small update, replacing our own MAX( ) define with FFmpeg's existing FFMAX( ) and

adding a v2 to the patch so that, hopefully, the fate pipeline is triggered now.
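
For illustration, the MAX-to-FFMAX replacement could look like the sketch below. Inside the FFmpeg tree the macro comes from libavutil (for example via libavutil/common.h); the local definition here is only a stand-in so that the snippet compiles on its own, and clamp_block_size is a hypothetical helper.

```c
/* Stand-in so the sketch compiles outside the FFmpeg tree; inside
 * FFmpeg the macro is provided by libavutil and must not be redefined. */
#ifndef FFMAX
#define FFMAX(a, b) ((a) > (b) ? (a) : (b))
#endif

/* Hypothetical helper: make sure a block dimension is at least min_size. */
static int clamp_block_size(int size, int min_size)
{
    return FFMAX(size, min_size);
}
```

As with any such macro, each argument may be evaluated twice, so arguments with side effects should be avoided.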


Best,


Christian Helmrich

Fraunhofer HHI, Video Coding and Analytics Department

Patch

diff --git a/Changelog b/Changelog
index 179f63c7d5..35f6b6e64f 100644
--- a/Changelog
+++ b/Changelog
@@ -28,6 +28,7 @@  version <next>:
 - showcwt multimedia filter
 - corr video filter
 - adrc audio filter
+- XPSNR video filter
 
 
 version 5.1:
diff --git a/doc/filters.texi b/doc/filters.texi
index 57088ccc6c..ab725c506f 100644
--- a/doc/filters.texi
+++ b/doc/filters.texi
@@ -19141,6 +19141,7 @@  pseudocolor="'if(between(val,ymax,amax),lerp(ymin,ymax,(val-ymax)/(amax-ymax)),-
 @end example
 @end itemize
 
+@anchor{psnr}
 @section psnr
 
 Obtain the average, maximum and minimum PSNR (Peak Signal to Noise
@@ -24820,6 +24821,72 @@  minimum values, and @code{1} maximum values.
 
 This filter supports all above options as @ref{commands}, excluding option @code{inputs}.
 
+@section xpsnr
+
+Obtain the average (across all input frames) and minimum (across all color plane averages)
+eXtended Perceptually weighted peak Signal-to-Noise Ratio (XPSNR) between two input videos.
+
+The XPSNR is a low-complexity psychovisually motivated distortion measurement algorithm for
+assessing the difference between two video streams or images. This is especially useful for
+objectively quantifying the distortions caused by video and image codecs, as an alternative
+to a formal subjective test. The logarithmic XPSNR output values are in a similar range as
+those of traditional @ref{psnr} assessments but better reflect human impressions of visual
+coding quality. More details on the XPSNR measure, which essentially represents a blockwise
+weighted variant of the PSNR measure, can be found in the following freely available papers:
+
+@itemize
+@item
+C. R. Helmrich, M. Siekmann, S. Becker, S. Bosse, D. Marpe, and T. Wiegand, "XPSNR: A
+Low-Complexity Extension of the Perceptually Weighted Peak Signal-to-Noise Ratio for
+High-Resolution Video Quality Assessment," in Proc. IEEE Int. Conf. Acoustics, Speech,
+Sig. Process. (ICASSP), virt./online, May 2020. @url{www.ecodis.de/xpsnr.htm}
+
+@item
+C. R. Helmrich, S. Bosse, H. Schwarz, D. Marpe, and T. Wiegand, "A Study of the
+Extended Perceptually Weighted Peak Signal-to-Noise Ratio (XPSNR) for Video Compression
+with Different Resolutions and Bit Depths," ITU Journal: ICT Discoveries, vol. 3, no.
+1, pp. 65 - 72, May 2020. @url{http://handle.itu.int/11.1002/pub/8153d78b-en}
+@end itemize
+
+When publishing the results of XPSNR assessments obtained using, e.g., this FFmpeg filter, a
+reference to the above papers as a means of documentation is strongly encouraged. The filter
+requires two input videos. The first input is considered a (usually not distorted) reference
+source and is passed unchanged to the output, whereas the second input is a (distorted) test
+signal. Except for the bit depth, these two video inputs must have the same pixel format. In
+addition, for best performance, both compared input videos should be in YCbCr color format.
+
+The obtained overall XPSNR values mentioned above are printed through the logging system. In
+case of input with multiple color planes, we suggest reporting of the minimum XPSNR average.
+
+The following parameter, which behaves like the one for the @ref{psnr} filter, is accepted:
+
+@table @option
+@item stats_file, f
+If specified, the filter will use the named file to save the XPSNR value of each individual
+frame and color plane. When the file name equals "-", that data is sent to standard output.
+@end table
+
+This filter also supports the @ref{framesync} options.
+
+@subsection Examples
+@itemize
+@item
+XPSNR analysis of two 1080p HD videos, ref_source.yuv and test_video.yuv, both at 24 frames
+per second, with color format 4:2:0, bit depth 8, and output of a logfile named "xpsnr.log":
+@example
+ffmpeg -s 1920x1080 -framerate 24 -pix_fmt yuv420p -i ref_source.yuv -s 1920x1080 -framerate
+24 -pix_fmt yuv420p -i test_video.yuv -lavfi xpsnr="stats_file=xpsnr.log" -f null -
+@end example
+
+@item
+XPSNR analysis of two 2160p UHD videos, ref_source.yuv with bit depth 8 and test_video.yuv
+with bit depth 10, both at 60 frames per second with color format 4:2:0, no logfile output:
+@example
+ffmpeg -s 3840x2160 -framerate 60 -pix_fmt yuv420p -i ref_source.yuv -s 3840x2160 -framerate
+60 -pix_fmt yuv420p10le -i test_video.yuv -lavfi xpsnr="stats_file=-" -f null -
+@end example
+@end itemize
+
 @section xstack
 Stack video inputs into custom layout.
 
diff --git a/libavfilter/Makefile b/libavfilter/Makefile
index 5783be281d..14ba19fa4e 100644
--- a/libavfilter/Makefile
+++ b/libavfilter/Makefile
@@ -544,6 +544,7 @@  OBJS-$(CONFIG_XCORRELATE_FILTER)             += vf_convolve.o framesync.o
 OBJS-$(CONFIG_XFADE_FILTER)                  += vf_xfade.o
 OBJS-$(CONFIG_XFADE_OPENCL_FILTER)           += vf_xfade_opencl.o opencl.o opencl/xfade.o
 OBJS-$(CONFIG_XMEDIAN_FILTER)                += vf_xmedian.o framesync.o
+OBJS-$(CONFIG_XPSNR_FILTER)                  += vf_xpsnr.o framesync.o
 OBJS-$(CONFIG_XSTACK_FILTER)                 += vf_stack.o framesync.o
 OBJS-$(CONFIG_YADIF_FILTER)                  += vf_yadif.o yadif_common.o
 OBJS-$(CONFIG_YADIF_CUDA_FILTER)             += vf_yadif_cuda.o vf_yadif_cuda.ptx.o \
diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c
index 52741b60e4..3b93a9af06 100644
--- a/libavfilter/allfilters.c
+++ b/libavfilter/allfilters.c
@@ -513,6 +513,7 @@  extern const AVFilter ff_vf_xcorrelate;
 extern const AVFilter ff_vf_xfade;
 extern const AVFilter ff_vf_xfade_opencl;
 extern const AVFilter ff_vf_xmedian;
+extern const AVFilter ff_vf_xpsnr;
 extern const AVFilter ff_vf_xstack;
 extern const AVFilter ff_vf_yadif;
 extern const AVFilter ff_vf_yadif_cuda;
diff --git a/libavfilter/version.h b/libavfilter/version.h
index 9fabc544b5..a56ba3bb6d 100644
--- a/libavfilter/version.h
+++ b/libavfilter/version.h
@@ -31,7 +31,7 @@ 
 
 #include "version_major.h"
 
-#define LIBAVFILTER_VERSION_MINOR  53
+#define LIBAVFILTER_VERSION_MINOR  54
 #define LIBAVFILTER_VERSION_MICRO 100
 
 
diff --git a/libavfilter/x86/Makefile b/libavfilter/x86/Makefile
index e87481bd7a..641b1f740f 100644
--- a/libavfilter/x86/Makefile
+++ b/libavfilter/x86/Makefile
@@ -38,6 +38,7 @@  OBJS-$(CONFIG_TRANSPOSE_FILTER)              += x86/vf_transpose_init.o
 OBJS-$(CONFIG_VOLUME_FILTER)                 += x86/af_volume_init.o
 OBJS-$(CONFIG_V360_FILTER)                   += x86/vf_v360_init.o
 OBJS-$(CONFIG_W3FDIF_FILTER)                 += x86/vf_w3fdif_init.o
+OBJS-$(CONFIG_XPSNR_FILTER)                  += x86/vf_xpsnr_init.o
 OBJS-$(CONFIG_YADIF_FILTER)                  += x86/vf_yadif_init.o
 
 X86ASM-OBJS-$(CONFIG_SCENE_SAD)              += x86/scene_sad.o
@@ -80,4 +81,5 @@  X86ASM-OBJS-$(CONFIG_TRANSPOSE_FILTER)       += x86/vf_transpose.o
 X86ASM-OBJS-$(CONFIG_VOLUME_FILTER)          += x86/af_volume.o
 X86ASM-OBJS-$(CONFIG_V360_FILTER)            += x86/vf_v360.o
 X86ASM-OBJS-$(CONFIG_W3FDIF_FILTER)          += x86/vf_w3fdif.o
+X86ASM-OBJS-$(CONFIG_XPSNR_FILTER)           += x86/vf_xpsnr.o
 X86ASM-OBJS-$(CONFIG_YADIF_FILTER)           += x86/vf_yadif.o x86/yadif-16.o x86/yadif-10.o
diff --git a/libavfilter/vf_xpsnr.c b/libavfilter/vf_xpsnr.c
new file mode 100644
index 0000000000..5b6a47aa69
--- /dev/null
+++ b/libavfilter/vf_xpsnr.c
@@ -0,0 +1,832 @@ 
+/*
+ * Copyright (c) 2023 Christian R. Helmrich
+ * Copyright (c) 2023 Christian Stoffers
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/**
+ * @file
+ * Calculate the extended perceptually weighted PSNR (XPSNR) between two input videos.
+ *
+ * Authors: Christian Helmrich and Christian Stoffers, Fraunhofer HHI, Berlin, Germany
+ */
+
+#include <stdbool.h>
+#include "libavutil/avstring.h"
+#include "libavutil/file_open.h"
+#include "libavutil/opt.h"
+#include "libavutil/pixdesc.h"
+#include "avfilter.h"
+#include "drawutils.h"
+#include "framesync.h"
+#include "internal.h"
+#include "xpsnr.h"
+
+/* XPSNR structure definition */
+
+typedef struct XPSNRContext
+{
+  /* required basic variables */
+  const AVClass   *class;
+  int             bpp; /* unpacked */
+  int             depth; /* packed */
+  char            comps[4];
+  int             num_comps;
+  uint64_t        num_frames_64;
+  unsigned        frame_rate;
+  FFFrameSync     fs;
+  int             line_sizes[4];
+  int             plane_height[4];
+  int             plane_width[4];
+  uint8_t         rgba_map[4];
+  FILE            *stats_file;
+  char            *stats_file_str;
+  /* XPSNR specific variables */
+  double          *sse_luma;
+  double          *weights;
+  AVBufferRef*    buf_org   [3];
+  AVBufferRef*    buf_org_m1[3];
+  AVBufferRef*    buf_org_m2[3];
+  AVBufferRef*    buf_rec   [3];
+  uint64_t        max_error_64;
+  double          sum_wdist [3];
+  double          sum_xpsnr [3];
+  bool            and_is_inf[3];
+  bool            is_rgb;
+  PSNRDSPContext  dsp;
+}
+XPSNRContext;
+
+/* required macro definitions */
+
+#define FLAGS     AV_OPT_FLAG_FILTERING_PARAM | AV_OPT_FLAG_VIDEO_PARAM
+#ifndef MAX
+#define MAX(a, b) (((a) > (b)) ? (a) : (b))
+#endif
+#define OFFSET(x) offsetof(XPSNRContext, x)
+#define XPSNR_GAMMA 2
+
+static const AVOption xpsnr_options[] =
+{
+  {"stats_file", "Set file where to store per-frame XPSNR information", OFFSET (stats_file_str), AV_OPT_TYPE_STRING, {.str = NULL}, 0, 0, FLAGS},
+  {"f",          "Set file where to store per-frame XPSNR information", OFFSET (stats_file_str), AV_OPT_TYPE_STRING, {.str = NULL}, 0, 0, FLAGS},
+  { NULL }
+};
+
+FRAMESYNC_DEFINE_CLASS (xpsnr, XPSNRContext, fs);
+
+/* XPSNR function definitions */
+
+static uint64_t highds (const int x_act, const int y_act, const int w_act, const int h_act, const int16_t *o_m0, const int o)
+{
+  uint64_t sa_act = 0;
+
+  for (int y = y_act; y < h_act; y += 2)
+  {
+    for (int x = x_act; x < w_act; x += 2)
+    {
+      const int f = 12 * ((int)o_m0[ y   *o + x  ] + (int)o_m0[ y   *o + x+1] + (int)o_m0[(y+1)*o + x  ] + (int)o_m0[(y+1)*o + x+1])
+                   - 3 * ((int)o_m0[(y-1)*o + x  ] + (int)o_m0[(y-1)*o + x+1] + (int)o_m0[(y+2)*o + x  ] + (int)o_m0[(y+2)*o + x+1])
+                   - 3 * ((int)o_m0[ y   *o + x-1] + (int)o_m0[ y   *o + x+2] + (int)o_m0[(y+1)*o + x-1] + (int)o_m0[(y+1)*o + x+2])
+                   - 2 * ((int)o_m0[(y-1)*o + x-1] + (int)o_m0[(y-1)*o + x+2] + (int)o_m0[(y+2)*o + x-1] + (int)o_m0[(y+2)*o + x+2])
+                       - ((int)o_m0[(y-2)*o + x-1] + (int)o_m0[(y-2)*o + x  ] + (int)o_m0[(y-2)*o + x+1] + (int)o_m0[(y-2)*o + x+2]
+                        + (int)o_m0[(y+3)*o + x-1] + (int)o_m0[(y+3)*o + x  ] + (int)o_m0[(y+3)*o + x+1] + (int)o_m0[(y+3)*o + x+2]
+                        + (int)o_m0[(y-1)*o + x-2] + (int)o_m0[ y   *o + x-2] + (int)o_m0[(y+1)*o + x-2] + (int)o_m0[(y+2)*o + x-2]
+                        + (int)o_m0[(y-1)*o + x+3] + (int)o_m0[ y   *o + x+3] + (int)o_m0[(y+1)*o + x+3] + (int)o_m0[(y+2)*o + x+3]);
+      sa_act += (uint64_t) abs(f);
+    }
+  }
+  return sa_act;
+}
+
+static uint64_t diff1st (const uint32_t w_act, const uint32_t h_act, const int16_t *o_m0, int16_t *o_m1, const int o)
+{
+  uint64_t ta_act = 0;
+
+  for (uint32_t y = 0; y < h_act; y += 2)
+  {
+    for (uint32_t x = 0; x < w_act; x += 2)
+    {
+      const int t = (int)o_m0[y*o + x] + (int)o_m0[y*o + x+1] + (int)o_m0[(y+1)*o + x] + (int)o_m0[(y+1)*o + x+1]
+                 - ((int)o_m1[y*o + x] + (int)o_m1[y*o + x+1] + (int)o_m1[(y+1)*o + x] + (int)o_m1[(y+1)*o + x+1]);
+      ta_act += (uint64_t) abs(t);
+      o_m1[y*o + x  ] = o_m0[y*o + x  ];  o_m1[(y+1)*o + x  ] = o_m0[(y+1)*o + x  ];
+      o_m1[y*o + x+1] = o_m0[y*o + x+1];  o_m1[(y+1)*o + x+1] = o_m0[(y+1)*o + x+1];
+    }
+  }
+  return (ta_act * XPSNR_GAMMA);
+}
+
+static uint64_t diff2nd (const uint32_t w_act, const uint32_t h_act, const int16_t *o_m0, int16_t *o_m1, int16_t *o_m2, const int o)
+{
+  uint64_t ta_act = 0;
+
+  for (uint32_t y = 0; y < h_act; y += 2)
+  {
+    for (uint32_t x = 0; x < w_act; x += 2)
+    {
+      const int t = (int)o_m0[y*o + x] + (int)o_m0[y*o + x+1] + (int)o_m0[(y+1)*o + x] + (int)o_m0[(y+1)*o + x+1]
+             - 2 * ((int)o_m1[y*o + x] + (int)o_m1[y*o + x+1] + (int)o_m1[(y+1)*o + x] + (int)o_m1[(y+1)*o + x+1])
+                  + (int)o_m2[y*o + x] + (int)o_m2[y*o + x+1] + (int)o_m2[(y+1)*o + x] + (int)o_m2[(y+1)*o + x+1];
+      ta_act += (uint64_t) abs(t);
+      o_m2[y*o + x  ] = o_m1[y*o + x  ];  o_m2[(y+1)*o + x  ] = o_m1[(y+1)*o + x  ];
+      o_m2[y*o + x+1] = o_m1[y*o + x+1];  o_m2[(y+1)*o + x+1] = o_m1[(y+1)*o + x+1];
+      o_m1[y*o + x  ] = o_m0[y*o + x  ];  o_m1[(y+1)*o + x  ] = o_m0[(y+1)*o + x  ];
+      o_m1[y*o + x+1] = o_m0[y*o + x+1];  o_m1[(y+1)*o + x+1] = o_m0[(y+1)*o + x+1];
+    }
+  }
+  return (ta_act * XPSNR_GAMMA);
+}
+
+static uint64_t sse_line_16bit (const uint8_t *blk_org8, const uint8_t *blk_rec8, int block_width)
+{
+  const uint16_t *blk_org = (const uint16_t*) blk_org8;
+  const uint16_t *blk_rec = (const uint16_t*) blk_rec8;
+  uint64_t sse = 0; /* sum for one pixel line */
+
+  for (int x = 0; x < block_width; x++)
+  {
+    const int64_t error = (int64_t) blk_org[x] - (int64_t) blk_rec[x];
+
+    sse += error * error;
+  }
+
+  /* sum of squared errors for the pixel line */
+  return sse;
+}
+
+static inline uint64_t calc_squared_error(XPSNRContext const *s,
+                                          const int16_t *blk_org,     const uint32_t stride_org,
+                                          const int16_t *blk_rec,     const uint32_t stride_rec,
+                                          const uint32_t block_width, const uint32_t block_height)
+{
+  uint64_t sse = 0;  /* sum of squared errors */
+
+  for (uint32_t y = 0; y < block_height; y++)
+  {
+    sse += s->dsp.sse_line ((const uint8_t*) blk_org, (const uint8_t*) blk_rec, (int) block_width);
+    blk_org += stride_org;
+    blk_rec += stride_rec;
+  }
+
+  /* return nonweighted sum of squared errors */
+  return sse;
+}
+
+static inline double calc_squared_error_and_weight (XPSNRContext const *s,
+                                                    const int16_t *pic_org,     const uint32_t stride_org,
+                                                    int16_t       *pic_org_m1,  int16_t       *pic_org_m2,
+                                                    const int16_t *pic_rec,     const uint32_t stride_rec,
+                                                    const uint32_t offset_x,    const uint32_t offset_y,
+                                                    const uint32_t block_width, const uint32_t block_height,
+                                                    const uint32_t bit_depth,   const uint32_t int_frame_rate, double *ms_act)
+{
+  const int         o = (int) stride_org;
+  const int         r = (int) stride_rec;
+  const int16_t *o_m0 = pic_org    + offset_y*o + offset_x;
+  int16_t       *o_m1 = pic_org_m1 + offset_y*o + offset_x;
+  int16_t       *o_m2 = pic_org_m2 + offset_y*o + offset_x;
+  const int16_t *r_m0 = pic_rec    + offset_y*r + offset_x;
+  const int     b_val = (s->plane_width[0] * s->plane_height[0] > 2048 * 1152 ? 2 : 1); /* threshold is a bit more than HD resolution */
+  const int     x_act = (offset_x > 0 ? 0 : b_val);
+  const int     y_act = (offset_y > 0 ? 0 : b_val);
+  const int     w_act = (offset_x + block_width  < (uint32_t) s->plane_width [0] ? (int) block_width  : (int) block_width  - b_val);
+  const int     h_act = (offset_y + block_height < (uint32_t) s->plane_height[0] ? (int) block_height : (int) block_height - b_val);
+
+  const double sse = (double) calc_squared_error (s, o_m0, stride_org,
+                                                  r_m0, stride_rec,
+                                                  block_width, block_height);
+  uint64_t sa_act = 0;  /* spatial abs. activity */
+  uint64_t ta_act = 0; /* temporal abs. activity */
+
+  if (w_act <= x_act || h_act <= y_act) /* small */
+  {
+    return sse;
+  }
+
+  if (b_val > 1) /* highpass with downsampling */
+  {
+    if (w_act > 12)
+    {
+      sa_act = s->dsp.highds_func (x_act, y_act, w_act, h_act, o_m0, o);
+    }
+    else
+    {
+      sa_act = highds (x_act, y_act, w_act, h_act, o_m0, o);
+    }
+  }
+  else /* <= HD, highpass without downsampling */
+  {
+    for (int y = y_act; y < h_act; y++)
+    {
+      for (int x = x_act; x < w_act; x++)
+      {
+        const int f = 12 * (int)o_m0[y*o + x] - 2 * ((int)o_m0[y*o + x-1] + (int)o_m0[y*o + x+1] + (int)o_m0[(y-1)*o + x] + (int)o_m0[(y+1)*o + x])
+                        - ((int)o_m0[(y-1)*o + x-1] + (int)o_m0[(y-1)*o + x+1] + (int)o_m0[(y+1)*o + x-1] + (int)o_m0[(y+1)*o + x+1]);
+        sa_act += (uint64_t) abs(f);
+      }
+    }
+  }
+
+  /* calculate weights (mean squared activity) */
+  *ms_act = (double) sa_act / ((double)(w_act - x_act) * (double)(h_act - y_act));
+
+  if (b_val > 1) /* highpass with downsampling */
+  {
+    if (int_frame_rate < 32) /* 1st-order diff */
+    {
+      ta_act = s->dsp.diff1st_func (block_width, block_height, o_m0, o_m1, o);
+    }
+    else /* 2nd-order diff (diff of two diffs) */
+    {
+      ta_act = s->dsp.diff2nd_func (block_width, block_height, o_m0, o_m1, o_m2, o);
+    }
+  }
+  else /* <= HD, highpass without downsampling */
+  {
+    if (int_frame_rate < 32) /* 1st-order diff */
+    {
+      for (uint32_t y = 0; y < block_height; y++)
+      {
+        for (uint32_t x = 0; x < block_width; x++)
+        {
+          const int t = (int)o_m0[y*o + x] - (int)o_m1[y*o + x];
+
+          ta_act += XPSNR_GAMMA * (uint64_t) abs(t);
+          o_m1[y*o + x] = o_m0[y*o + x];
+        }
+      }
+    }
+    else /* 2nd-order diff (diff of two diffs) */
+    {
+      for (uint32_t y = 0; y < block_height; y++)
+      {
+        for (uint32_t x = 0; x < block_width; x++)
+        {
+          const int t = (int)o_m0[y*o + x] - 2 * (int)o_m1[y*o + x] + (int)o_m2[y*o + x];
+
+          ta_act += XPSNR_GAMMA * (uint64_t) abs(t);
+          o_m2[y*o + x] = o_m1[y*o + x];
+          o_m1[y*o + x] = o_m0[y*o + x];
+        }
+      }
+    }
+  }
+
+  /* weight += mean squared temporal activity */
+  *ms_act += (double) ta_act / ((double) block_width * (double) block_height);
+
+  /* lower limit, accounts for high-pass gain */
+  if (*ms_act < (double)(1 << (bit_depth - 6))) *ms_act = (double)(1 << (bit_depth - 6));
+
+  *ms_act *= *ms_act; /* since SSE is squared */
+
+  /* return nonweighted sum of squared errors */
+  return sse;
+}
+
+static inline double get_avg_xpsnr (const double sqrt_wsse_val,  const double sum_xpsnr_val,
+                                    const uint32_t image_width,  const uint32_t image_height,
+                                    const uint64_t max_error_64, const uint64_t num_frames_64)
+{
+  if (num_frames_64 == 0) return INFINITY;
+
+  if (sqrt_wsse_val >= (double) num_frames_64) /* sq.-mean-root dist average */
+  {
+    const double avg_dist = sqrt_wsse_val / (double) num_frames_64;
+    const uint64_t  num64 = (uint64_t) image_width * (uint64_t) image_height * max_error_64;
+
+    return 10.0 * log10 ((double) num64 / ((double) avg_dist * (double) avg_dist));
+  }
+
+  return sum_xpsnr_val / (double) num_frames_64; /* older log-domain average */
+}
+
+static int get_wsse (AVFilterContext *ctx, int16_t **org, int16_t **org_m1, int16_t **org_m2, int16_t **rec, uint64_t* const wsse64)
+{
+  XPSNRContext* const  s = ctx->priv;
+  const uint32_t       w = s->plane_width [0]; /* luma image width in pixels */
+  const uint32_t       h = s->plane_height[0];/* luma image height in pixels */
+  const double         r = (double)(w * h) / (3840.0 * 2160.0); /* UHD ratio */
+  const uint32_t       b = MAX (0, 4 * (int32_t)(32.0 * sqrt (r) + 0.5)); /* block size, integer multiple of 4 for SIMD */
+  const uint32_t   w_blk = (b > 0 ? (w + b - 1) / b : 1); /* luma width in units of blocks, b == 0 for tiny pictures */
+  const double   avg_act = sqrt (16.0 * (double)(1 << (2 * s->depth - 9)) / sqrt (MAX (0.00001, r))); /* = sqrt (a_pic) */
+  const int*  stride_org = (s->bpp == 1 ? s->plane_width : s->line_sizes);
+  uint32_t x, y, idx_blk = 0; /* the "16.0" above is due to fixed-point code */
+  double* const sse_luma = s->sse_luma;
+  double* const  weights = s->weights;
+  int c;
+
+  if ((wsse64 == NULL) || (s->depth < 6) || (s->depth > 16) || (s->num_comps <= 0) || (s->num_comps > 3) || (w == 0) || (h == 0))
+  {
+    av_log (ctx, AV_LOG_ERROR, "Error in XPSNR routine: invalid argument(s).\n");
+
+    return AVERROR (EINVAL);
+  }
+
+  if ((weights == NULL) || (b >= 4 && sse_luma == NULL))
+  {
+    av_log (ctx, AV_LOG_ERROR, "Failed to allocate temporary block memory.\n");
+
+    return AVERROR (ENOMEM);
+  }
+
+  if (b >= 4)
+  {
+    const int16_t *p_org = org[0];
+    const uint32_t s_org = stride_org[0] / s->bpp;
+    const int16_t *p_rec = rec[0];
+    const uint32_t s_rec = s->plane_width[0];
+    int16_t    *p_org_m1 = org_m1[0]; /* pixel  */
+    int16_t    *p_org_m2 = org_m2[0]; /* memory */
+    double     wsse_luma = 0.0;
+
+    for (y = 0; y < h; y += b) /* calculate block SSE and perceptual weights */
+    {
+      const uint32_t block_height = (y + b > h ? h - y : b);
+
+      for (x = 0; x < w; x += b, idx_blk++)
+      {
+        const uint32_t block_width = (x + b > w ? w - x : b);
+        double ms_act = 1.0, ms_act_prev = 0.0;
+
+        sse_luma[idx_blk] = calc_squared_error_and_weight(s, p_org, s_org,
+                                                          p_org_m1, p_org_m2,
+                                                          p_rec, s_rec,
+                                                          x, y,
+                                                          block_width, block_height,
+                                                          s->depth, s->frame_rate, &ms_act);
+        weights[idx_blk] = 1.0 / sqrt (ms_act);
+
+        if (w * h <= 640u * 480u) /* in-line "minimum-smoothing" as in paper */
+        {
+          if (x == 0) /* first column */
+          {
+            ms_act_prev = (idx_blk > 1 ? weights[idx_blk - 2] : 0);
+          }
+          else  /* after first column */
+          {
+            ms_act_prev = (x > b ? MAX (weights[idx_blk - 2], weights[idx_blk]) : weights[idx_blk]);
+          }
+          if (idx_blk > w_blk) /* after first row and first column */
+          {
+            ms_act_prev = MAX (ms_act_prev, weights[idx_blk - 1 - w_blk]); /* min (left, top) */
+          }
+          if ((idx_blk > 0) && (weights[idx_blk - 1] > ms_act_prev))
+          {
+            weights[idx_blk - 1] = ms_act_prev;
+          }
+          if ((x + b >= w) && (y + b >= h) && (idx_blk > w_blk)) /* last block in picture */
+          {
+            ms_act_prev = MAX (weights[idx_blk - 1], weights[idx_blk - w_blk]);
+            if (weights[idx_blk] > ms_act_prev)
+            {
+              weights[idx_blk] = ms_act_prev;
+            }
+          }
+        }
+      } /* for x */
+    } /* for y */
+
+    for (y = idx_blk = 0; y < h; y += b) /* calculate sum for luma (Y) XPSNR */
+    {
+      for (x = 0; x < w; x += b, idx_blk++)
+      {
+        wsse_luma += sse_luma[idx_blk] * weights[idx_blk];
+      }
+    }
+    wsse64[0] = (wsse_luma <= 0.0 ? 0 : (uint64_t)(wsse_luma * avg_act + 0.5));
+  } /* b >= 4 */
+
+  for (c = 0; c < s->num_comps; c++) /* finalize SSE data for all components */
+  {
+    const int16_t *p_org = org[c];
+    const uint32_t s_org = stride_org[c] / s->bpp;
+    const int16_t *p_rec = rec[c];
+    const uint32_t s_rec = s->plane_width[c];
+    const uint32_t w_pln = s->plane_width[c];
+    const uint32_t h_pln = s->plane_height[c];
+
+    if (b < 4) /* picture is too small for XPSNR, calculate nonweighted PSNR */
+    {
+      wsse64[c] = calc_squared_error (s, p_org, s_org,
+                                      p_rec, s_rec,
+                                      w_pln, h_pln);
+    }
+    else if (c > 0) /* b >= 4, so Y XPSNR has already been calculated above! */
+    {
+      const uint32_t bx = (b * w_pln) / w;
+      const uint32_t by = (b * h_pln) / h; /* up to chroma downsampling by 4 */
+      double wsse_chroma = 0.0;
+
+      for (y = idx_blk = 0; y < h_pln; y += by) /* calc chroma (Cb/Cr) XPSNR */
+      {
+        const uint32_t block_height = (y + by > h_pln ? h_pln - y : by);
+
+        for (x = 0; x < w_pln; x += bx, idx_blk++)
+        {
+          const uint32_t block_width = (x + bx > w_pln ? w_pln - x : bx);
+
+          wsse_chroma += (double) calc_squared_error (s, p_org + y*s_org + x, s_org,
+                                                      p_rec + y*s_rec + x, s_rec,
+                                                      block_width, block_height) * weights[idx_blk];
+        }
+      }
+      wsse64[c] = (wsse_chroma <= 0.0 ? 0 : (uint64_t)(wsse_chroma * avg_act + 0.5));
+    }
+  } /* for c */
+
+  return 0;
+}
+
+static int do_xpsnr (FFFrameSync *fs)
+{
+  AVFilterContext  *ctx = fs->parent;
+  XPSNRContext* const s = ctx->priv;
+  const uint32_t      w = s->plane_width [0];  /* luma image width in pixels */
+  const uint32_t      h = s->plane_height[0]; /* luma image height in pixels */
+  const uint32_t      b = MAX (0, 4 * (int32_t)(32.0 * sqrt ((double)(w * h) / (3840.0 * 2160.0)) + 0.5)); /* block size */
+  const uint32_t  w_blk = (b > 0 ? (w + b - 1) / b : 1);  /* luma width in units of blocks, b == 0 for tiny pictures */
+  const uint32_t  h_blk = (b > 0 ? (h + b - 1) / b : 1); /* luma height in units of blocks */
+  AVFrame *master, *ref = NULL;
+  int16_t *porg   [3];
+  int16_t *porg_m1[3];
+  int16_t *porg_m2[3];
+  int16_t *prec   [3];
+  uint64_t wsse64 [3] = {0, 0, 0};
+  double cur_xpsnr[3] = {INFINITY, INFINITY, INFINITY};
+  int c, ret_value;
+
+  if ((ret_value = ff_framesync_dualinput_get (fs, &master, &ref)) < 0) return ret_value;
+  if (ref == NULL) return ff_filter_frame (ctx->outputs[0], master);
+
+  /* prepare XPSNR calculations: allocate temporary picture and block memory */
+  if (s->sse_luma == NULL) s->sse_luma = (double*) av_mallocz (w_blk * h_blk * sizeof (double));
+  if (s->weights  == NULL) s->weights  = (double*) av_mallocz (w_blk * h_blk * sizeof (double));
+
+  for (c = 0; c < s->num_comps; c++)  /* allocate temporal org buffer memory */
+  {
+    s->line_sizes[c] = master->linesize[c];
+
+    if (c == 0) /* luma ch. */
+    {
+      const int stride_org_bpp = (s->bpp == 1 ? s->plane_width[c] : s->line_sizes[c] / s->bpp);
+
+      if (s->buf_org_m1[c] == NULL) s->buf_org_m1[c] = av_buffer_allocz (stride_org_bpp * s->plane_height[c] * sizeof (int16_t));
+      if (s->buf_org_m2[c] == NULL) s->buf_org_m2[c] = av_buffer_allocz (stride_org_bpp * s->plane_height[c] * sizeof (int16_t));
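+      if (s->buf_org_m1[c] == NULL || s->buf_org_m2[c] == NULL) { av_frame_free (&master); return AVERROR (ENOMEM); }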
+      porg_m1[c] = (int16_t*) s->buf_org_m1[c]->data;
+      porg_m2[c] = (int16_t*) s->buf_org_m2[c]->data;
+    }
+  }
+
+  if (s->bpp == 1) /* 8 bit */
+  {
+    for (c = 0; c < s->num_comps; c++) /* allocate the org/rec buffer memory */
+    {
+      const int m = s->line_sizes[c]; /* master stride */
+      const int o = s->plane_width[c]; /* XPSNR stride */
+
+      if (s->buf_org[c] == NULL) s->buf_org[c] = av_buffer_allocz (s->plane_width[c] * s->plane_height[c] * sizeof (int16_t));
+      if (s->buf_rec[c] == NULL) s->buf_rec[c] = av_buffer_allocz (s->plane_width[c] * s->plane_height[c] * sizeof (int16_t));
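+      if (s->buf_org[c] == NULL || s->buf_rec[c] == NULL) { av_frame_free (&master); return AVERROR (ENOMEM); }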
+      porg[c] = (int16_t*) s->buf_org[c]->data;
+      prec[c] = (int16_t*) s->buf_rec[c]->data;
+
+      for (int y = 0; y < s->plane_height[c]; y++)
+      {
+        for (int x = 0; x < s->plane_width[c]; x++)
+        {
+          porg[c][y*o + x] = (int16_t) master->data[c][y*m + x];
+          prec[c][y*o + x] = (int16_t)    ref->data[c][y*ref->linesize[c] + x]; /* ref has its own stride */
+        }
+      }
+    }
+  }
+  else /* more than 8 bit (9 to 16 bit samples) */
+  {
+    for (c = 0; c < s->num_comps; c++)
+    {
+      porg[c] = (int16_t*) master->data[c];
+      prec[c] = (int16_t*)    ref->data[c];
+    }
+  }
+
+  /* extended perceptually weighted peak signal-to-noise ratio (XPSNR) value */
+
+  if ((ret_value = get_wsse (ctx, porg, porg_m1, porg_m2, prec, wsse64)) < 0)
+  {
+    return ret_value; /* an error here implies something went wrong earlier! */
+  }
+
+  for (c = 0; c < s->num_comps; c++)
+  {
+    const double sqrt_wsse = sqrt ((double) wsse64[c]);
+
+    cur_xpsnr[c] = get_avg_xpsnr (sqrt_wsse, INFINITY,
+                                  s->plane_width[c], s->plane_height[c],
+                                  s->max_error_64, 1 /* single frame */);
+    s->sum_wdist[c] += sqrt_wsse;
+    s->sum_xpsnr[c] += cur_xpsnr[c];
+    s->and_is_inf[c] &= (isinf (cur_xpsnr[c]) != 0); /* isinf() only guarantees a nonzero value */
+  }
+  s->num_frames_64++;
+
+  if (s->stats_file) /* print out the frame and component-wise XPSNR average */
+  {
+    fprintf (s->stats_file, "n: %4"PRId64"", s->num_frames_64);
+
+    for (c = 0; c < s->num_comps; c++)
+    {
+      fprintf (s->stats_file, "  XPSNR %c: %3.4f", s->comps[c], cur_xpsnr[c]);
+    }
+    fprintf (s->stats_file, "\n");
+  }
+
+  return ff_filter_frame (ctx->outputs[0], master);
+}
+
+static av_cold int init (AVFilterContext *ctx)
+{
+  XPSNRContext* const s = ctx->priv;
+  int c;
+
+  if (s->stats_file_str)
+  {
+    if (!strcmp (s->stats_file_str, "-")) /* no statistics file, take stdout */
+    {
+      s->stats_file = stdout;
+    }
+    else
+    {
+      s->stats_file = avpriv_fopen_utf8 (s->stats_file_str, "w");
+
+      if (s->stats_file == NULL)
+      {
+        const int err = AVERROR (errno);
+        char buf[128];
+
+        av_strerror (err, buf, sizeof (buf));
+        av_log (ctx, AV_LOG_ERROR, "Could not open statistics file %s: %s\n", s->stats_file_str, buf);
+
+        return err;
+      }
+    }
+  }
+
+  s->sse_luma = NULL;
+  s->weights  = NULL;
+
+  for (c = 0; c < 3; c++) /* initialize XPSNR value of every color component */
+  {
+    s->buf_org   [c] = NULL;
+    s->buf_org_m1[c] = NULL;
+    s->buf_org_m2[c] = NULL;
+    s->buf_rec   [c] = NULL;
+    s->sum_wdist [c] = 0.0;
+    s->sum_xpsnr [c] = 0.0;
+    s->and_is_inf[c] = true;
+  }
+
+  s->fs.on_event = do_xpsnr;
+
+  return 0;
+}
+
+static const enum AVPixelFormat pix_fmts[] =
+{
+  AV_PIX_FMT_GRAY8, AV_PIX_FMT_GRAY9, AV_PIX_FMT_GRAY10, AV_PIX_FMT_GRAY12, AV_PIX_FMT_GRAY14, AV_PIX_FMT_GRAY16,
+#define PF_NOALPHA(suf) AV_PIX_FMT_YUV420##suf,  AV_PIX_FMT_YUV422##suf,  AV_PIX_FMT_YUV444##suf
+#define PF_ALPHA(suf)   AV_PIX_FMT_YUVA420##suf, AV_PIX_FMT_YUVA422##suf, AV_PIX_FMT_YUVA444##suf
+#define PF(suf)         PF_NOALPHA(suf), PF_ALPHA(suf)
+  PF(P), PF(P9), PF(P10), PF_NOALPHA(P12), PF_NOALPHA(P14), PF(P16),
+  AV_PIX_FMT_YUV440P, AV_PIX_FMT_YUV411P, AV_PIX_FMT_YUV410P,
+  AV_PIX_FMT_YUVJ411P, AV_PIX_FMT_YUVJ420P, AV_PIX_FMT_YUVJ422P,
+  AV_PIX_FMT_YUVJ440P, AV_PIX_FMT_YUVJ444P,
+  AV_PIX_FMT_GBRP, AV_PIX_FMT_GBRP9, AV_PIX_FMT_GBRP10,
+  AV_PIX_FMT_GBRP12, AV_PIX_FMT_GBRP14, AV_PIX_FMT_GBRP16,
+  AV_PIX_FMT_GBRAP, AV_PIX_FMT_GBRAP10, AV_PIX_FMT_GBRAP12, AV_PIX_FMT_GBRAP16,
+  AV_PIX_FMT_NONE
+};
+
+static int config_input_ref (AVFilterLink *inlink)
+{
+  const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get (inlink->format);
+  AVFilterContext  *ctx = inlink->dst;
+  XPSNRContext* const s = ctx->priv;
+
+  if ((ctx->inputs[0]->w != ctx->inputs[1]->w) ||
+      (ctx->inputs[0]->h != ctx->inputs[1]->h))
+  {
+    av_log (ctx, AV_LOG_ERROR, "Width and height of the input videos must match.\n");
+
+    return AVERROR (EINVAL);
+  }
+
+  if (ctx->inputs[0]->format != ctx->inputs[1]->format)
+  {
+    av_log (ctx, AV_LOG_ERROR, "The input videos must be of the same pixel format.\n");
+
+    return AVERROR (EINVAL);
+  }
+
+  s->bpp =  (desc->comp[0].depth <= 8 ? 1 : 2);
+  s->depth = desc->comp[0].depth;
+#if 1
+  s->max_error_64 = (1 << s->depth) - 1; /* conventional limit */
+#else
+  s->max_error_64 = 255 * (1 << (s->depth - 8)); /* JVET style */
+#endif
+  s->max_error_64 *= s->max_error_64;
+
+  s->frame_rate = (inlink->frame_rate.den > 0 ? inlink->frame_rate.num / inlink->frame_rate.den : 0); /* rate may be 1/0 if unknown */
+
+  s->num_comps = (desc->nb_components > 3 ? 3 : desc->nb_components);
+
+  s->is_rgb = (ff_fill_rgba_map (s->rgba_map, inlink->format) >= 0);
+  s->comps[0] = (s->is_rgb ? 'R' : 'Y');
+  s->comps[1] = (s->is_rgb ? 'G' : 'U');
+  s->comps[2] = (s->is_rgb ? 'B' : 'V');
+  s->comps[3] = 'A';
+
+  s->plane_width [1] = s->plane_width [2] = AV_CEIL_RSHIFT (inlink->w, desc->log2_chroma_w);
+  s->plane_width [0] = s->plane_width [3] = inlink->w;
+  s->plane_height[1] = s->plane_height[2] = AV_CEIL_RSHIFT (inlink->h, desc->log2_chroma_h);
+  s->plane_height[0] = s->plane_height[3] = inlink->h;
+
+  s->dsp.sse_line = sse_line_16bit;
+  s->dsp.highds_func = highds; /* C fallbacks, which the x86 */
+  s->dsp.diff1st_func = diff1st; /* init below may override */
+  s->dsp.diff2nd_func = diff2nd;
+#if ARCH_X86
+  ff_xpsnr_init_x86 (&s->dsp, 15); /* inheritances from PSNR */
+#endif
+
+  return 0;
+}
+
+static int config_output (AVFilterLink *outlink)
+{
+  AVFilterContext *ctx = outlink->src;
+  AVFilterLink *inlink = ctx->inputs[0];
+  XPSNRContext *s = ctx->priv;
+  int ret_value;
+
+  if ((ret_value = ff_framesync_init_dualinput (&s->fs, ctx)) < 0) return ret_value;
+
+  outlink->w = inlink->w;
+  outlink->h = inlink->h;
+  outlink->frame_rate = inlink->frame_rate;
+  outlink->sample_aspect_ratio = inlink->sample_aspect_ratio;
+  outlink->time_base = inlink->time_base;
+
+  if ((ret_value = ff_framesync_configure (&s->fs)) < 0) return ret_value;
+
+  outlink->time_base = s->fs.time_base;
+
+  if (av_cmp_q (inlink->time_base, outlink->time_base) ||
+      av_cmp_q (ctx->inputs[1]->time_base, outlink->time_base))
+  {
+    av_log (ctx, AV_LOG_WARNING, "Mismatching timebases between first input (%d/%d) and second input (%d/%d), results may be incorrect!\n",
+            inlink->time_base.num, inlink->time_base.den,
+            ctx->inputs[1]->time_base.num, ctx->inputs[1]->time_base.den);
+  }
+
+  return 0;
+}
+
+static int activate (AVFilterContext *ctx)
+{
+  XPSNRContext *s = ctx->priv;
+
+  return ff_framesync_activate (&s->fs);
+}
+
+static av_cold void uninit (AVFilterContext *ctx)
+{
+  XPSNRContext* const s = ctx->priv;
+  int c;
+
+  if (s->num_frames_64 > 0) /* print out overall per-component XPSNR average */
+  {
+    const double xpsnr_luma = get_avg_xpsnr(s->sum_wdist[0],   s->sum_xpsnr[0],
+                                            s->plane_width[0], s->plane_height[0],
+                                            s->max_error_64,   s->num_frames_64);
+    double xpsnr_min = xpsnr_luma;
+
+    /* luma */
+    av_log (ctx, AV_LOG_INFO, "XPSNR  %c: %3.4f", s->comps[0], xpsnr_luma);
+    if (s->stats_file)
+    {
+      fprintf (s->stats_file, "\nXPSNR average, %"PRId64" frames", s->num_frames_64);
+      fprintf (s->stats_file, "  %c: %3.4f", s->comps[0], xpsnr_luma);
+    }
+    /* chroma */
+    for (c = 1; c < s->num_comps; c++)
+    {
+      const double xpsnr_chroma = get_avg_xpsnr(s->sum_wdist[c],   s->sum_xpsnr[c],
+                                                s->plane_width[c], s->plane_height[c],
+                                                s->max_error_64,   s->num_frames_64);
+      if (xpsnr_min > xpsnr_chroma) xpsnr_min = xpsnr_chroma;
+
+      av_log (ctx, AV_LOG_INFO, "  %c: %3.4f", s->comps[c], xpsnr_chroma);
+      if (s->stats_file && s->stats_file != stdout)
+      {
+        fprintf (s->stats_file, "  %c: %3.4f", s->comps[c], xpsnr_chroma);
+      }
+    }
+    /* print out line break (with minimum XPSNR across the color components) */
+    if (s->num_comps > 1)
+    {
+      av_log (ctx, AV_LOG_INFO, "  (minimum: %3.4f)\n", xpsnr_min);
+      if (s->stats_file && s->stats_file != stdout)
+      {
+        fprintf (s->stats_file, "  (minimum: %3.4f)\n", xpsnr_min);
+      }
+    }
+    else
+    {
+      av_log (ctx, AV_LOG_INFO, "\n");
+      if (s->stats_file && s->stats_file != stdout)
+      {
+        fprintf (s->stats_file, "\n");
+      }
+    }
+  }
+
+  ff_framesync_uninit (&s->fs);   /* free temporary picture and block memory */
+
+  if (s->stats_file && s->stats_file != stdout) fclose (s->stats_file);
+
+  if (s->sse_luma) av_freep (&s->sse_luma);
+  if (s->weights ) av_freep (&s->weights );
+
+  for (c = 0; c < s->num_comps; c++) /* free addl temporal org buffer memory */
+  {
+    av_buffer_unref (&s->buf_org_m1[c]); /* NULL-safe, also releases the data */
+    av_buffer_unref (&s->buf_org_m2[c]);
+  }
+  if (s->bpp == 1) /* 8 bit */
+  {
+    for (c = 0; c < s->num_comps; c++) /* free org/rec picture buffer memory */
+    {
+      av_buffer_unref (&s->buf_org[c]); /* NULL-safe, also releases the data */
+      av_buffer_unref (&s->buf_rec[c]);
+    }
+  }
+}
+
+static const AVFilterPad xpsnr_inputs[] =
+{
+  {
+    .name         = "main",
+    .type         = AVMEDIA_TYPE_VIDEO,
+  },
+  {
+    .name         = "reference",
+    .type         = AVMEDIA_TYPE_VIDEO,
+    .config_props = config_input_ref,
+  }
+};
+
+static const AVFilterPad xpsnr_outputs[] =
+{
+  {
+    .name         = "default",
+    .type         = AVMEDIA_TYPE_VIDEO,
+    .config_props = config_output,
+  }
+};
+
+const AVFilter ff_vf_xpsnr =
+{
+  .name           = "xpsnr",
+  .description    = NULL_IF_CONFIG_SMALL ("Calculate the extended perceptually weighted peak signal-to-noise ratio (XPSNR) between two video streams."),
+  .preinit        = xpsnr_framesync_preinit,
+  .init           = init,
+  .uninit         = uninit,
+  .activate       = activate,
+  .priv_size      = sizeof (XPSNRContext),
+  .priv_class     = &xpsnr_class,
+  FILTER_INPUTS (xpsnr_inputs),
+  FILTER_OUTPUTS(xpsnr_outputs),
+  FILTER_PIXFMTS_ARRAY(pix_fmts),
+  .flags          = AVFILTER_FLAG_SUPPORT_TIMELINE_INTERNAL |
+                    AVFILTER_FLAG_SLICE_THREADS             |
+                    AVFILTER_FLAG_METADATA_ONLY,
+};
diff --git a/libavfilter/x86/vf_xpsnr.asm b/libavfilter/x86/vf_xpsnr.asm
new file mode 100644
index 0000000000..bfeeff718f
--- /dev/null
+++ b/libavfilter/x86/vf_xpsnr.asm
@@ -0,0 +1,2108 @@ 
+%if ARCH_X86_64
+default rel
+global highds_simd
+global diff1st_simd
+global diff2nd_simd
+SECTION .text
+highds_simd:
+        push    rbp
+        mov     rbp, rsp
+        and     rsp, 0FFFFFFFFFFFFFFE0H
+        sub     rsp, 1576
+        mov     dword [rsp-4CH], edi
+        mov     dword [rsp-50H], esi
+        mov     dword [rsp-54H], edx
+        mov     dword [rsp-58H], ecx
+        mov     qword [rsp-60H], r8
+        mov     dword [rsp-64H], r9d
+        mov     qword [rsp], 0
+        mov     word [rsp-20H], 0
+        mov     word [rsp-1EH], 0
+        mov     word [rsp-1CH], -1
+        mov     word [rsp-1AH], -2
+        mov     word [rsp-18H], -3
+        mov     word [rsp-16H], -3
+        mov     word [rsp-14H], -2
+        mov     word [rsp-12H], -1
+        movzx   eax, word [rsp-12H]
+        vmovd   xmm0, eax
+        movzx   eax, word [rsp-14H]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm1, xmm0
+        movzx   eax, word [rsp-16H]
+        vmovd   xmm0, eax
+        movzx   eax, word [rsp-18H]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm4, xmm0
+        movzx   eax, word [rsp-1AH]
+        vmovd   xmm0, eax
+        movzx   eax, word [rsp-1CH]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm2, xmm0
+        movzx   eax, word [rsp-1EH]
+        vmovd   xmm0, eax
+        movzx   eax, word [rsp-20H]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm3, xmm0
+        vpunpckldq xmm0, xmm1, xmm4
+        vmovdqa xmm1, xmm0
+        vpunpckldq xmm0, xmm2, xmm3
+        vpunpcklqdq xmm0, xmm1, xmm0
+        vmovaps oword [rsp+98H], xmm0
+        mov     word [rsp-30H], 0
+        mov     word [rsp-2EH], 0
+        mov     word [rsp-2CH], -1
+        mov     word [rsp-2AH], -3
+        mov     word [rsp-28H], 12
+        mov     word [rsp-26H], 12
+        mov     word [rsp-24H], -3
+        mov     word [rsp-22H], -1
+        movzx   eax, word [rsp-22H]
+        vmovd   xmm0, eax
+        movzx   eax, word [rsp-24H]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm1, xmm0
+        movzx   eax, word [rsp-26H]
+        vmovd   xmm0, eax
+        movzx   eax, word [rsp-28H]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm4, xmm0
+        movzx   eax, word [rsp-2AH]
+        vmovd   xmm0, eax
+        movzx   eax, word [rsp-2CH]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm2, xmm0
+        movzx   eax, word [rsp-2EH]
+        vmovd   xmm0, eax
+        movzx   eax, word [rsp-30H]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm3, xmm0
+        vpunpckldq xmm0, xmm1, xmm4
+        vmovdqa xmm1, xmm0
+        vpunpckldq xmm0, xmm2, xmm3
+        vpunpcklqdq xmm0, xmm1, xmm0
+        vmovaps oword [rsp+0A8H], xmm0
+        mov     word [rsp-40H], 0
+        mov     word [rsp-3EH], 0
+        mov     word [rsp-3CH], 0
+        mov     word [rsp-3AH], -1
+        mov     word [rsp-38H], -1
+        mov     word [rsp-36H], -1
+        mov     word [rsp-34H], -1
+        mov     word [rsp-32H], 0
+        movzx   eax, word [rsp-32H]
+        vmovd   xmm0, eax
+        movzx   eax, word [rsp-34H]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm1, xmm0
+        movzx   eax, word [rsp-36H]
+        vmovd   xmm0, eax
+        movzx   eax, word [rsp-38H]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm4, xmm0
+        movzx   eax, word [rsp-3AH]
+        vmovd   xmm0, eax
+        movzx   eax, word [rsp-3CH]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm2, xmm0
+        movzx   eax, word [rsp-3EH]
+        vmovd   xmm0, eax
+        movzx   eax, word [rsp-40H]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm3, xmm0
+        vpunpckldq xmm0, xmm1, xmm4
+        vmovdqa xmm1, xmm0
+        vpunpckldq xmm0, xmm2, xmm3
+        vpunpcklqdq xmm0, xmm1, xmm0
+        vmovaps oword [rsp+0B8H], xmm0
+        mov     eax, dword [rsp-50H]
+        mov     dword [rsp-10H], eax
+        jmp     L_008
+
+L_001:  mov     eax, dword [rsp-4CH]
+        mov     dword [rsp-0CH], eax
+        jmp     L_007
+
+L_002:  mov     eax, dword [rsp-10H]
+        sub     eax, 2
+        imul    eax, dword [rsp-64H]
+        mov     edx, eax
+        mov     eax, dword [rsp-0CH]
+        add     eax, edx
+        cdqe
+        add     rax, rax
+        lea     rdx, [rax-4H]
+        mov     rax, qword [rsp-60H]
+        add     rax, rdx
+        mov     qword [rsp+30H], rax
+        mov     rax, qword [rsp+30H]
+        vlddqu  ymm0, yword [rax]
+        vmovdqa yword [rsp+4A8H], ymm0
+        mov     eax, dword [rsp-10H]
+        sub     eax, 1
+        imul    eax, dword [rsp-64H]
+        mov     edx, eax
+        mov     eax, dword [rsp-0CH]
+        add     eax, edx
+        cdqe
+        add     rax, rax
+        lea     rdx, [rax-4H]
+        mov     rax, qword [rsp-60H]
+        add     rax, rdx
+        mov     qword [rsp+28H], rax
+        mov     rax, qword [rsp+28H]
+        vlddqu  ymm0, yword [rax]
+        vmovdqa yword [rsp+4C8H], ymm0
+        mov     eax, dword [rsp-10H]
+        imul    eax, dword [rsp-64H]
+        mov     edx, eax
+        mov     eax, dword [rsp-0CH]
+        add     eax, edx
+        cdqe
+        add     rax, rax
+        lea     rdx, [rax-4H]
+        mov     rax, qword [rsp-60H]
+        add     rax, rdx
+        mov     qword [rsp+20H], rax
+        mov     rax, qword [rsp+20H]
+        vlddqu  ymm0, yword [rax]
+        vmovdqa yword [rsp+4E8H], ymm0
+        mov     eax, dword [rsp-10H]
+        add     eax, 1
+        imul    eax, dword [rsp-64H]
+        mov     edx, eax
+        mov     eax, dword [rsp-0CH]
+        add     eax, edx
+        cdqe
+        add     rax, rax
+        lea     rdx, [rax-4H]
+        mov     rax, qword [rsp-60H]
+        add     rax, rdx
+        mov     qword [rsp+18H], rax
+        mov     rax, qword [rsp+18H]
+        vlddqu  ymm0, yword [rax]
+        vmovdqa yword [rsp+508H], ymm0
+        mov     eax, dword [rsp-10H]
+        add     eax, 2
+        imul    eax, dword [rsp-64H]
+        mov     edx, eax
+        mov     eax, dword [rsp-0CH]
+        add     eax, edx
+        cdqe
+        add     rax, rax
+        lea     rdx, [rax-4H]
+        mov     rax, qword [rsp-60H]
+        add     rax, rdx
+        mov     qword [rsp+10H], rax
+        mov     rax, qword [rsp+10H]
+        vlddqu  ymm0, yword [rax]
+        vmovdqa yword [rsp+528H], ymm0
+        mov     eax, dword [rsp-10H]
+        add     eax, 3
+        imul    eax, dword [rsp-64H]
+        mov     edx, eax
+        mov     eax, dword [rsp-0CH]
+        add     eax, edx
+        cdqe
+        add     rax, rax
+        lea     rdx, [rax-4H]
+        mov     rax, qword [rsp-60H]
+        add     rax, rdx
+        mov     qword [rsp+8H], rax
+        mov     rax, qword [rsp+8H]
+        vlddqu  ymm0, yword [rax]
+        vmovdqa yword [rsp+548H], ymm0
+        mov     dword [rsp-8H], 0
+        jmp     L_006
+
+L_003:  mov     eax, dword [rsp-8H]
+        lea     edx, [rax*4]
+        mov     eax, dword [rsp-0CH]
+        add     eax, edx
+        cmp     dword [rsp-54H], eax
+        jle     L_004
+        mov     dword [rsp-4H], 0
+        vmovdqa ymm0, yword [rsp+4E8H]
+        vmovdqa yword [rsp+608H], ymm0
+        vmovdqa ymm0, yword [rsp+608H]
+        vmovaps oword [rsp+38H], xmm0
+        vmovdqa ymm0, yword [rsp+508H]
+        vmovdqa yword [rsp+5E8H], ymm0
+        vmovdqa ymm0, yword [rsp+5E8H]
+        vmovaps oword [rsp+48H], xmm0
+        vmovdqa xmm0, oword [rsp+38H]
+        vmovaps oword [rsp+2A8H], xmm0
+        vmovdqa xmm0, oword [rsp+0A8H]
+        vmovaps oword [rsp+2B8H], xmm0
+        vmovdqa xmm0, oword [rsp+2B8H]
+        vmovdqa xmm1, oword [rsp+2A8H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+48H]
+        vmovaps oword [rsp+288H], xmm0
+        vmovdqa xmm0, oword [rsp+0A8H]
+        vmovaps oword [rsp+298H], xmm0
+        vmovdqa xmm0, oword [rsp+298H]
+        vmovdqa xmm1, oword [rsp+288H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [rsp+0D8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+268H], xmm0
+        vmovdqa xmm0, oword [rsp+0D8H]
+        vmovaps oword [rsp+278H], xmm0
+        vmovdqa xmm1, oword [rsp+278H]
+        vmovdqa xmm0, oword [rsp+268H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+248H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+258H], xmm0
+        vmovdqa xmm1, oword [rsp+258H]
+        vmovdqa xmm0, oword [rsp+248H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+228H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+238H], xmm0
+        vmovdqa xmm1, oword [rsp+238H]
+        vmovdqa xmm0, oword [rsp+228H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovd   eax, xmm0
+        add     dword [rsp-4H], eax
+        vmovdqa ymm0, yword [rsp+4C8H]
+        vmovdqa yword [rsp+5C8H], ymm0
+        vmovdqa ymm0, yword [rsp+5C8H]
+        vmovaps oword [rsp+58H], xmm0
+        vmovdqa ymm0, yword [rsp+528H]
+        vmovdqa yword [rsp+5A8H], ymm0
+        vmovdqa ymm0, yword [rsp+5A8H]
+        vmovaps oword [rsp+68H], xmm0
+        vmovdqa xmm0, oword [rsp+58H]
+        vmovaps oword [rsp+208H], xmm0
+        vmovdqa xmm0, oword [rsp+98H]
+        vmovaps oword [rsp+218H], xmm0
+        vmovdqa xmm0, oword [rsp+218H]
+        vmovdqa xmm1, oword [rsp+208H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+68H]
+        vmovaps oword [rsp+1E8H], xmm0
+        vmovdqa xmm0, oword [rsp+98H]
+        vmovaps oword [rsp+1F8H], xmm0
+        vmovdqa xmm0, oword [rsp+1F8H]
+        vmovdqa xmm1, oword [rsp+1E8H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [rsp+0D8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+1C8H], xmm0
+        vmovdqa xmm0, oword [rsp+0D8H]
+        vmovaps oword [rsp+1D8H], xmm0
+        vmovdqa xmm1, oword [rsp+1D8H]
+        vmovdqa xmm0, oword [rsp+1C8H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+1A8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+1B8H], xmm0
+        vmovdqa xmm1, oword [rsp+1B8H]
+        vmovdqa xmm0, oword [rsp+1A8H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+188H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+198H], xmm0
+        vmovdqa xmm1, oword [rsp+198H]
+        vmovdqa xmm0, oword [rsp+188H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovd   eax, xmm0
+        add     dword [rsp-4H], eax
+        vmovdqa ymm0, yword [rsp+4A8H]
+        vmovdqa yword [rsp+588H], ymm0
+        vmovdqa ymm0, yword [rsp+588H]
+        vmovaps oword [rsp+78H], xmm0
+        vmovdqa ymm0, yword [rsp+548H]
+        vmovdqa yword [rsp+568H], ymm0
+        vmovdqa ymm0, yword [rsp+568H]
+        vmovaps oword [rsp+88H], xmm0
+        vmovdqa xmm0, oword [rsp+78H]
+        vmovaps oword [rsp+168H], xmm0
+        vmovdqa xmm0, oword [rsp+0B8H]
+        vmovaps oword [rsp+178H], xmm0
+        vmovdqa xmm0, oword [rsp+178H]
+        vmovdqa xmm1, oword [rsp+168H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+88H]
+        vmovaps oword [rsp+148H], xmm0
+        vmovdqa xmm0, oword [rsp+0B8H]
+        vmovaps oword [rsp+158H], xmm0
+        vmovdqa xmm0, oword [rsp+158H]
+        vmovdqa xmm1, oword [rsp+148H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [rsp+0D8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+128H], xmm0
+        vmovdqa xmm0, oword [rsp+0D8H]
+        vmovaps oword [rsp+138H], xmm0
+        vmovdqa xmm1, oword [rsp+138H]
+        vmovdqa xmm0, oword [rsp+128H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+108H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+118H], xmm0
+        vmovdqa xmm1, oword [rsp+118H]
+        vmovdqa xmm0, oword [rsp+108H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+0E8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+0F8H], xmm0
+        vmovdqa xmm1, oword [rsp+0F8H]
+        vmovdqa xmm0, oword [rsp+0E8H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovd   eax, xmm0
+        add     dword [rsp-4H], eax
+        mov     eax, dword [rsp-4H]
+        cdq                                     ; edx = sign mask of the summed dot products
+        mov     eax, edx
+        xor     eax, dword [rsp-4H]
+        sub     eax, edx                        ; eax = abs(sum)
+        cdqe
+        add     qword [rsp], rax                ; accumulate into the 64-bit total
+L_004:  mov     eax, dword [rsp-8H]
+        lea     edx, [rax*4]
+        mov     eax, dword [rsp-0CH]
+        add     eax, edx
+        add     eax, 2
+        cmp     dword [rsp-54H], eax
+        jle     L_005
+        mov     dword [rsp-4H], 0
+        vmovdqa xmm0, oword [rsp+38H]
+        vpsrldq xmm0, xmm0, 4
+        vmovaps oword [rsp+38H], xmm0
+        vmovdqa xmm0, oword [rsp+48H]
+        vpsrldq xmm0, xmm0, 4
+        vmovaps oword [rsp+48H], xmm0
+        vmovdqa xmm0, oword [rsp+38H]
+        vmovaps oword [rsp+488H], xmm0
+        vmovdqa xmm0, oword [rsp+0A8H]
+        vmovaps oword [rsp+498H], xmm0
+        vmovdqa xmm0, oword [rsp+498H]
+        vmovdqa xmm1, oword [rsp+488H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+48H]
+        vmovaps oword [rsp+468H], xmm0
+        vmovdqa xmm0, oword [rsp+0A8H]
+        vmovaps oword [rsp+478H], xmm0
+        vmovdqa xmm0, oword [rsp+478H]
+        vmovdqa xmm1, oword [rsp+468H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [rsp+0D8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+448H], xmm0
+        vmovdqa xmm0, oword [rsp+0D8H]
+        vmovaps oword [rsp+458H], xmm0
+        vmovdqa xmm1, oword [rsp+458H]
+        vmovdqa xmm0, oword [rsp+448H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+428H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+438H], xmm0
+        vmovdqa xmm1, oword [rsp+438H]
+        vmovdqa xmm0, oword [rsp+428H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+408H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+418H], xmm0
+        vmovdqa xmm1, oword [rsp+418H]
+        vmovdqa xmm0, oword [rsp+408H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovd   eax, xmm0
+        add     dword [rsp-4H], eax
+        vmovdqa xmm0, oword [rsp+58H]
+        vpsrldq xmm0, xmm0, 4
+        vmovaps oword [rsp+58H], xmm0
+        vmovdqa xmm0, oword [rsp+68H]
+        vpsrldq xmm0, xmm0, 4
+        vmovaps oword [rsp+68H], xmm0
+        vmovdqa xmm0, oword [rsp+58H]
+        vmovaps oword [rsp+3E8H], xmm0
+        vmovdqa xmm0, oword [rsp+98H]
+        vmovaps oword [rsp+3F8H], xmm0
+        vmovdqa xmm0, oword [rsp+3F8H]
+        vmovdqa xmm1, oword [rsp+3E8H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+68H]
+        vmovaps oword [rsp+3C8H], xmm0
+        vmovdqa xmm0, oword [rsp+98H]
+        vmovaps oword [rsp+3D8H], xmm0
+        vmovdqa xmm0, oword [rsp+3D8H]
+        vmovdqa xmm1, oword [rsp+3C8H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [rsp+0D8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+3A8H], xmm0
+        vmovdqa xmm0, oword [rsp+0D8H]
+        vmovaps oword [rsp+3B8H], xmm0
+        vmovdqa xmm1, oword [rsp+3B8H]
+        vmovdqa xmm0, oword [rsp+3A8H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+388H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+398H], xmm0
+        vmovdqa xmm1, oword [rsp+398H]
+        vmovdqa xmm0, oword [rsp+388H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+368H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+378H], xmm0
+        vmovdqa xmm1, oword [rsp+378H]
+        vmovdqa xmm0, oword [rsp+368H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovd   eax, xmm0
+        add     dword [rsp-4H], eax
+        vmovdqa xmm0, oword [rsp+78H]
+        vpsrldq xmm0, xmm0, 4
+        vmovaps oword [rsp+78H], xmm0
+        vmovdqa xmm0, oword [rsp+88H]
+        vpsrldq xmm0, xmm0, 4
+        vmovaps oword [rsp+88H], xmm0
+        vmovdqa xmm0, oword [rsp+78H]
+        vmovaps oword [rsp+348H], xmm0
+        vmovdqa xmm0, oword [rsp+0B8H]
+        vmovaps oword [rsp+358H], xmm0
+        vmovdqa xmm0, oword [rsp+358H]
+        vmovdqa xmm1, oword [rsp+348H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+88H]
+        vmovaps oword [rsp+328H], xmm0
+        vmovdqa xmm0, oword [rsp+0B8H]
+        vmovaps oword [rsp+338H], xmm0
+        vmovdqa xmm0, oword [rsp+338H]
+        vmovdqa xmm1, oword [rsp+328H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [rsp+0D8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+308H], xmm0
+        vmovdqa xmm0, oword [rsp+0D8H]
+        vmovaps oword [rsp+318H], xmm0
+        vmovdqa xmm1, oword [rsp+318H]
+        vmovdqa xmm0, oword [rsp+308H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+2E8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+2F8H], xmm0
+        vmovdqa xmm1, oword [rsp+2F8H]
+        vmovdqa xmm0, oword [rsp+2E8H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+2C8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovaps oword [rsp+2D8H], xmm0
+        vmovdqa xmm1, oword [rsp+2D8H]
+        vmovdqa xmm0, oword [rsp+2C8H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [rsp+0C8H], xmm0
+        vmovdqa xmm0, oword [rsp+0C8H]
+        vmovd   eax, xmm0
+        add     dword [rsp-4H], eax
+        mov     eax, dword [rsp-4H]
+        cdq                                     ; edx = sign mask of the summed dot products
+        mov     eax, edx
+        xor     eax, dword [rsp-4H]
+        sub     eax, edx                        ; eax = abs(sum)
+        cdqe
+        add     qword [rsp], rax                ; accumulate into the 64-bit total
+        vpermq  ymm0, yword [rsp+4A8H], 39H
+        vmovdqa yword [rsp+4A8H], ymm0
+        vpermq  ymm0, yword [rsp+4C8H], 39H
+        vmovdqa yword [rsp+4C8H], ymm0
+        vpermq  ymm0, yword [rsp+4E8H], 39H
+        vmovdqa yword [rsp+4E8H], ymm0
+        vpermq  ymm0, yword [rsp+508H], 39H
+        vmovdqa yword [rsp+508H], ymm0
+        vpermq  ymm0, yword [rsp+528H], 39H
+        vmovdqa yword [rsp+528H], ymm0
+        vpermq  ymm0, yword [rsp+548H], 39H
+        vmovdqa yword [rsp+548H], ymm0
+L_005:  add     dword [rsp-8H], 1
+L_006:  cmp     dword [rsp-8H], 2
+        jle     L_003
+        add     dword [rsp-0CH], 12
+L_007:  mov     eax, dword [rsp-0CH]
+        cmp     eax, dword [rsp-54H]
+        jl      L_002
+        add     dword [rsp-10H], 2
+L_008:  mov     eax, dword [rsp-10H]
+        cmp     eax, dword [rsp-58H]
+        jl      L_001
+        mov     rax, qword [rsp]
+        leave
+        ret
+
+diff1st_simd:
+        ; Args (System V AMD64): edi = width, esi = height, rdx = plane A (read only),
+        ; rcx = plane B (read, then overwritten with A), r8d = stride in int16 samples.
+        ; Per 8x2 tile: adds |sum2x2(A) - sum2x2(B)| for each of its four 2x2 blocks;
+        ; returns twice the accumulated sum.
+        push    rbp
+        mov     rbp, rsp
+        sub     rsp, 480
+        mov     dword [rbp-1C4H], edi
+        mov     dword [rbp-1C8H], esi
+        mov     qword [rbp-1D0H], rdx
+        mov     qword [rbp-1D8H], rcx
+        mov     dword [rbp-1DCH], r8d
+
+
+        mov     rax, qword [fs:abs 28H]
+        mov     qword [rbp-8H], rax
+        xor     eax, eax
+        mov     qword [rbp-1A8H], 0
+
+        mov     word [rbp-1B2H], 0
+        mov     dword [rbp-1B0H], 0
+        jmp     L_012
+
+L_009:  mov     dword [rbp-1ACH], 0
+        jmp     L_011
+
+L_010:  mov     eax, dword [rbp-1DCH]
+        imul    eax, dword [rbp-1B0H]
+        mov     edx, eax
+        mov     eax, dword [rbp-1ACH]
+        add     eax, edx
+        mov     eax, eax
+        lea     rdx, [rax+rax]
+        mov     rax, qword [rbp-1D0H]
+        add     rax, rdx
+        mov     qword [rbp-178H], rax
+        mov     rax, qword [rbp-178H]
+        vlddqu  xmm0, oword [rax]
+        vmovaps oword [rbp-170H], xmm0
+        mov     eax, dword [rbp-1B0H]
+        lea     edx, [rax+1H]
+        mov     eax, dword [rbp-1DCH]
+        imul    edx, eax
+        mov     eax, dword [rbp-1ACH]
+        add     eax, edx
+        mov     eax, eax
+        lea     rdx, [rax+rax]
+        mov     rax, qword [rbp-1D0H]
+        add     rax, rdx
+        mov     qword [rbp-180H], rax
+        mov     rax, qword [rbp-180H]
+        vlddqu  xmm0, oword [rax]
+        vmovaps oword [rbp-160H], xmm0
+        mov     eax, dword [rbp-1DCH]
+        imul    eax, dword [rbp-1B0H]
+        mov     edx, eax
+        mov     eax, dword [rbp-1ACH]
+        add     eax, edx
+        mov     eax, eax
+        lea     rdx, [rax+rax]
+        mov     rax, qword [rbp-1D8H]
+        add     rax, rdx
+        mov     qword [rbp-188H], rax
+        mov     rax, qword [rbp-188H]
+        vlddqu  xmm0, oword [rax]
+        vmovaps oword [rbp-150H], xmm0
+        mov     eax, dword [rbp-1B0H]
+        lea     edx, [rax+1H]
+        mov     eax, dword [rbp-1DCH]
+        imul    edx, eax
+        mov     eax, dword [rbp-1ACH]
+        add     eax, edx
+        mov     eax, eax
+        lea     rdx, [rax+rax]
+        mov     rax, qword [rbp-1D8H]
+        add     rax, rdx
+        mov     qword [rbp-190H], rax
+        mov     rax, qword [rbp-190H]
+        vlddqu  xmm0, oword [rax]
+        vmovaps oword [rbp-140H], xmm0
+        vmovdqa xmm0, oword [rbp-170H]
+        vmovaps oword [rbp-30H], xmm0
+        vmovdqa xmm0, oword [rbp-160H]
+        vmovaps oword [rbp-20H], xmm0
+        vmovdqa xmm1, oword [rbp-30H]
+        vmovdqa xmm0, oword [rbp-20H]
+        vpaddw  xmm0, xmm1, xmm0
+        vmovaps oword [rbp-130H], xmm0
+        vmovdqa xmm0, oword [rbp-150H]
+        vmovaps oword [rbp-50H], xmm0
+        vmovdqa xmm0, oword [rbp-140H]
+        vmovaps oword [rbp-40H], xmm0
+        vmovdqa xmm1, oword [rbp-50H]
+        vmovdqa xmm0, oword [rbp-40H]
+        vpaddw  xmm0, xmm1, xmm0
+        vmovaps oword [rbp-120H], xmm0
+        vmovdqa xmm0, oword [rbp-130H]
+        vmovaps oword [rbp-70H], xmm0
+        vmovdqa xmm0, oword [rbp-120H]
+        vmovaps oword [rbp-60H], xmm0
+        vmovdqa xmm0, oword [rbp-70H]
+        vmovdqa xmm1, oword [rbp-60H]
+        vpsubw  xmm0, xmm0, xmm1
+        vmovaps oword [rbp-120H], xmm0
+        vmovdqa xmm0, oword [rbp-120H]
+        vmovaps oword [rbp-90H], xmm0
+        vmovdqa xmm0, oword [rbp-120H]
+        vmovaps oword [rbp-80H], xmm0
+        vmovdqa xmm1, oword [rbp-80H]
+        vmovdqa xmm0, oword [rbp-90H]
+        vphaddw xmm0, xmm0, xmm1
+        vmovaps oword [rbp-120H], xmm0
+        vmovdqa xmm0, oword [rbp-120H]
+        vmovaps oword [rbp-0A0H], xmm0
+        vmovdqa xmm0, oword [rbp-0A0H]
+        vpabsw  xmm0, xmm0
+        vmovaps oword [rbp-120H], xmm0
+        vmovdqa xmm0, oword [rbp-120H]
+        vmovaps oword [rbp-0C0H], xmm0
+        vmovdqa xmm0, oword [rbp-120H]
+        vmovaps oword [rbp-0B0H], xmm0
+        vmovdqa xmm1, oword [rbp-0B0H]
+        vmovdqa xmm0, oword [rbp-0C0H]
+        vphaddsw xmm0, xmm0, xmm1
+        vmovaps oword [rbp-120H], xmm0
+        vmovdqa xmm0, oword [rbp-120H]
+        vmovaps oword [rbp-0E0H], xmm0
+        vmovdqa xmm0, oword [rbp-120H]
+        vmovaps oword [rbp-0D0H], xmm0
+        vmovdqa xmm1, oword [rbp-0D0H]
+        vmovdqa xmm0, oword [rbp-0E0H]
+        vphaddsw xmm0, xmm0, xmm1
+        vmovaps oword [rbp-120H], xmm0
+        vmovdqa xmm0, oword [rbp-120H]
+        vmovaps oword [rbp-0F0H], xmm0
+        vmovdqa xmm0, oword [rbp-0F0H]
+        vmovd   edx, xmm0
+        lea     rax, [rbp-1B2H]
+        mov     word [rax], dx
+        movzx   eax, word [rbp-1B2H]
+        movzx   eax, ax
+        add     qword [rbp-1A8H], rax
+        mov     eax, dword [rbp-1DCH]
+        imul    eax, dword [rbp-1B0H]
+        mov     edx, eax
+        mov     eax, dword [rbp-1ACH]
+        add     eax, edx
+        mov     eax, eax
+        lea     rdx, [rax+rax]
+        mov     rax, qword [rbp-1D8H]
+        add     rax, rdx
+        mov     qword [rbp-198H], rax
+        vmovdqa xmm0, oword [rbp-170H]
+        vmovaps oword [rbp-100H], xmm0
+        vmovdqa xmm0, oword [rbp-100H]
+        mov     rax, qword [rbp-198H]
+        vmovups oword [rax], xmm0
+        nop
+        mov     eax, dword [rbp-1B0H]
+        lea     edx, [rax+1H]
+        mov     eax, dword [rbp-1DCH]
+        imul    edx, eax
+        mov     eax, dword [rbp-1ACH]
+        add     eax, edx
+        mov     eax, eax
+        lea     rdx, [rax+rax]
+        mov     rax, qword [rbp-1D8H]
+        add     rax, rdx
+        mov     qword [rbp-1A0H], rax
+        vmovdqa xmm0, oword [rbp-160H]
+        vmovaps oword [rbp-110H], xmm0
+        vmovdqa xmm0, oword [rbp-110H]
+        mov     rax, qword [rbp-1A0H]
+        vmovups oword [rax], xmm0
+        nop
+        add     dword [rbp-1ACH], 8
+L_011:  mov     eax, dword [rbp-1ACH]
+        cmp     eax, dword [rbp-1C4H]
+        jc      L_010
+        add     dword [rbp-1B0H], 2
+L_012:  mov     eax, dword [rbp-1B0H]
+        cmp     eax, dword [rbp-1C8H]
+        jc      L_009
+        mov     rax, qword [rbp-1A8H]
+        add     rax, rax
+        mov     rcx, qword [rbp-8H]
+
+
+        xor     rcx, qword [fs:abs 28H]
+        jz      L_013                           ; no-op as written: no failure call follows, both paths reach L_013
+L_013:  leave
+        ret
+
+diff2nd_simd:
+        ; Args (System V AMD64): edi = width, esi = height, rdx = plane A (read only),
+        ; rcx = plane B, r8 = plane C, r9d = stride in int16 samples.
+        ; Per 8x2 tile: adds |sum2x2(A) - 2*sum2x2(B) + sum2x2(C)| for each of its four
+        ; 2x2 blocks, then copies B into C and A into B; returns twice the accumulated sum.
+        push    rbp
+        mov     rbp, rsp
+        sub     rsp, 688
+        mov     dword [rbp-284H], edi
+        mov     dword [rbp-288H], esi
+        mov     qword [rbp-290H], rdx
+        mov     qword [rbp-298H], rcx
+        mov     qword [rbp-2A0H], r8
+        mov     dword [rbp-2A4H], r9d
+
+
+        mov     rax, qword [fs:abs 28H]
+        mov     qword [rbp-8H], rax
+        xor     eax, eax
+        mov     qword [rbp-268H], 0
+
+        mov     word [rbp-276H], 0
+        mov     dword [rbp-274H], 0
+        jmp     L_017
+
+L_014:  mov     dword [rbp-270H], 0
+        jmp     L_016
+
+L_015:  mov     eax, dword [rbp-2A4H]
+        imul    eax, dword [rbp-274H]
+        mov     edx, eax
+        mov     eax, dword [rbp-270H]
+        add     eax, edx
+        mov     eax, eax
+        lea     rdx, [rax+rax]
+        mov     rax, qword [rbp-290H]
+        add     rax, rdx
+        mov     qword [rbp-218H], rax
+        mov     rax, qword [rbp-218H]
+        vlddqu  xmm0, oword [rax]
+        vmovaps oword [rbp-210H], xmm0
+        mov     eax, dword [rbp-274H]
+        lea     edx, [rax+1H]
+        mov     eax, dword [rbp-2A4H]
+        imul    edx, eax
+        mov     eax, dword [rbp-270H]
+        add     eax, edx
+        mov     eax, eax
+        lea     rdx, [rax+rax]
+        mov     rax, qword [rbp-290H]
+        add     rax, rdx
+        mov     qword [rbp-220H], rax
+        mov     rax, qword [rbp-220H]
+        vlddqu  xmm0, oword [rax]
+        vmovaps oword [rbp-200H], xmm0
+        mov     eax, dword [rbp-2A4H]
+        imul    eax, dword [rbp-274H]
+        mov     edx, eax
+        mov     eax, dword [rbp-270H]
+        add     eax, edx
+        mov     eax, eax
+        lea     rdx, [rax+rax]
+        mov     rax, qword [rbp-298H]
+        add     rax, rdx
+        mov     qword [rbp-228H], rax
+        mov     rax, qword [rbp-228H]
+        vlddqu  xmm0, oword [rax]
+        vmovaps oword [rbp-1F0H], xmm0
+        mov     eax, dword [rbp-274H]
+        lea     edx, [rax+1H]
+        mov     eax, dword [rbp-2A4H]
+        imul    edx, eax
+        mov     eax, dword [rbp-270H]
+        add     eax, edx
+        mov     eax, eax
+        lea     rdx, [rax+rax]
+        mov     rax, qword [rbp-298H]
+        add     rax, rdx
+        mov     qword [rbp-230H], rax
+        mov     rax, qword [rbp-230H]
+        vlddqu  xmm0, oword [rax]
+        vmovaps oword [rbp-1E0H], xmm0
+        mov     eax, dword [rbp-2A4H]
+        imul    eax, dword [rbp-274H]
+        mov     edx, eax
+        mov     eax, dword [rbp-270H]
+        add     eax, edx
+        mov     eax, eax
+        lea     rdx, [rax+rax]
+        mov     rax, qword [rbp-2A0H]
+        add     rax, rdx
+        mov     qword [rbp-238H], rax
+        mov     rax, qword [rbp-238H]
+        vlddqu  xmm0, oword [rax]
+        vmovaps oword [rbp-1D0H], xmm0
+        mov     eax, dword [rbp-274H]
+        lea     edx, [rax+1H]
+        mov     eax, dword [rbp-2A4H]
+        imul    edx, eax
+        mov     eax, dword [rbp-270H]
+        add     eax, edx
+        mov     eax, eax
+        lea     rdx, [rax+rax]
+        mov     rax, qword [rbp-2A0H]
+        add     rax, rdx
+        mov     qword [rbp-240H], rax
+        mov     rax, qword [rbp-240H]
+        vlddqu  xmm0, oword [rax]
+        vmovaps oword [rbp-1C0H], xmm0
+        vmovdqa xmm0, oword [rbp-210H]
+        vmovaps oword [rbp-30H], xmm0
+        vmovdqa xmm0, oword [rbp-200H]
+        vmovaps oword [rbp-20H], xmm0
+        vmovdqa xmm1, oword [rbp-30H]
+        vmovdqa xmm0, oword [rbp-20H]
+        vpaddw  xmm0, xmm1, xmm0
+        vmovaps oword [rbp-1B0H], xmm0
+        vmovdqa xmm0, oword [rbp-1F0H]
+        vmovaps oword [rbp-50H], xmm0
+        vmovdqa xmm0, oword [rbp-1E0H]
+        vmovaps oword [rbp-40H], xmm0
+        vmovdqa xmm1, oword [rbp-50H]
+        vmovdqa xmm0, oword [rbp-40H]
+        vpaddw  xmm0, xmm1, xmm0
+        vmovaps oword [rbp-1A0H], xmm0
+        vmovdqa xmm0, oword [rbp-1D0H]
+        vmovaps oword [rbp-70H], xmm0
+        vmovdqa xmm0, oword [rbp-1C0H]
+        vmovaps oword [rbp-60H], xmm0
+        vmovdqa xmm1, oword [rbp-70H]
+        vmovdqa xmm0, oword [rbp-60H]
+        vpaddw  xmm0, xmm1, xmm0
+        vmovaps oword [rbp-190H], xmm0
+        vmovdqa xmm0, oword [rbp-1B0H]
+        vmovaps oword [rbp-90H], xmm0
+        vmovdqa xmm0, oword [rbp-190H]
+        vmovaps oword [rbp-80H], xmm0
+        vmovdqa xmm1, oword [rbp-90H]
+        vmovdqa xmm0, oword [rbp-80H]
+        vpaddw  xmm0, xmm1, xmm0
+        vmovaps oword [rbp-1B0H], xmm0
+        vmovdqa xmm0, oword [rbp-1B0H]
+        vmovaps oword [rbp-0B0H], xmm0
+        vmovdqa xmm0, oword [rbp-1A0H]
+        vmovaps oword [rbp-0A0H], xmm0
+        vmovdqa xmm1, oword [rbp-0A0H]
+        vmovdqa xmm0, oword [rbp-0B0H]
+        vphaddw xmm0, xmm0, xmm1
+        vmovaps oword [rbp-1B0H], xmm0
+        vmovdqa xmm0, oword [rbp-1B0H]
+        vpshufd xmm0, xmm0, 0EEH
+        vmovaps oword [rbp-1A0H], xmm0
+        vmovdqa xmm0, oword [rbp-1A0H]
+        vmovaps oword [rbp-0C0H], xmm0
+        mov     dword [rbp-26CH], 1
+        vmovdqa xmm1, oword [rbp-0C0H]
+        vmovd   xmm0, dword [rbp-26CH]
+        vpsllw  xmm0, xmm1, xmm0
+        vmovaps oword [rbp-1A0H], xmm0
+        vmovdqa xmm0, oword [rbp-1B0H]
+        vmovaps oword [rbp-0E0H], xmm0
+        vmovdqa xmm0, oword [rbp-1A0H]
+        vmovaps oword [rbp-0D0H], xmm0
+        vmovdqa xmm0, oword [rbp-0E0H]
+        vmovdqa xmm1, oword [rbp-0D0H]
+        vpsubw  xmm0, xmm0, xmm1
+        vmovaps oword [rbp-1A0H], xmm0
+        vmovdqa xmm0, oword [rbp-1A0H]
+        vmovaps oword [rbp-0F0H], xmm0
+        vmovdqa xmm0, oword [rbp-0F0H]
+        vpabsw  xmm0, xmm0
+        vmovaps oword [rbp-1A0H], xmm0
+        vmovdqa xmm0, oword [rbp-1A0H]
+        vmovaps oword [rbp-110H], xmm0
+        vmovdqa xmm0, oword [rbp-1A0H]
+        vmovaps oword [rbp-100H], xmm0
+        vmovdqa xmm1, oword [rbp-100H]
+        vmovdqa xmm0, oword [rbp-110H]
+        vphaddsw xmm0, xmm0, xmm1
+        vmovaps oword [rbp-1A0H], xmm0
+        vmovdqa xmm0, oword [rbp-1A0H]
+        vmovaps oword [rbp-130H], xmm0
+        vmovdqa xmm0, oword [rbp-1A0H]
+        vmovaps oword [rbp-120H], xmm0
+        vmovdqa xmm1, oword [rbp-120H]
+        vmovdqa xmm0, oword [rbp-130H]
+        vphaddsw xmm0, xmm0, xmm1
+        vmovaps oword [rbp-1A0H], xmm0
+        vmovdqa xmm0, oword [rbp-1A0H]
+        vmovaps oword [rbp-140H], xmm0
+        vmovdqa xmm0, oword [rbp-140H]
+        vmovd   edx, xmm0
+        lea     rax, [rbp-276H]
+        mov     word [rax], dx
+        movzx   eax, word [rbp-276H]
+        movzx   eax, ax
+        add     qword [rbp-268H], rax
+        mov     eax, dword [rbp-2A4H]
+        imul    eax, dword [rbp-274H]
+        mov     edx, eax
+        mov     eax, dword [rbp-270H]
+        add     eax, edx
+        mov     eax, eax
+        lea     rdx, [rax+rax]
+        mov     rax, qword [rbp-2A0H]
+        add     rax, rdx
+        mov     qword [rbp-248H], rax
+        vmovdqa xmm0, oword [rbp-1F0H]
+        vmovaps oword [rbp-150H], xmm0
+        vmovdqa xmm0, oword [rbp-150H]
+        mov     rax, qword [rbp-248H]
+        vmovups oword [rax], xmm0
+        nop
+        mov     eax, dword [rbp-274H]
+        lea     edx, [rax+1H]
+        mov     eax, dword [rbp-2A4H]
+        imul    edx, eax
+        mov     eax, dword [rbp-270H]
+        add     eax, edx
+        mov     eax, eax
+        lea     rdx, [rax+rax]
+        mov     rax, qword [rbp-2A0H]
+        add     rax, rdx
+        mov     qword [rbp-250H], rax
+        vmovdqa xmm0, oword [rbp-1E0H]
+        vmovaps oword [rbp-160H], xmm0
+        vmovdqa xmm0, oword [rbp-160H]
+        mov     rax, qword [rbp-250H]
+        vmovups oword [rax], xmm0
+        nop
+        mov     eax, dword [rbp-2A4H]
+        imul    eax, dword [rbp-274H]
+        mov     edx, eax
+        mov     eax, dword [rbp-270H]
+        add     eax, edx
+        mov     eax, eax
+        lea     rdx, [rax+rax]
+        mov     rax, qword [rbp-298H]
+        add     rax, rdx
+        mov     qword [rbp-258H], rax
+        vmovdqa xmm0, oword [rbp-210H]
+        vmovaps oword [rbp-170H], xmm0
+        vmovdqa xmm0, oword [rbp-170H]
+        mov     rax, qword [rbp-258H]
+        vmovups oword [rax], xmm0
+        nop
+        mov     eax, dword [rbp-274H]
+        lea     edx, [rax+1H]
+        mov     eax, dword [rbp-2A4H]
+        imul    edx, eax
+        mov     eax, dword [rbp-270H]
+        add     eax, edx
+        mov     eax, eax
+        lea     rdx, [rax+rax]
+        mov     rax, qword [rbp-298H]
+        add     rax, rdx
+        mov     qword [rbp-260H], rax
+        vmovdqa xmm0, oword [rbp-200H]
+        vmovaps oword [rbp-180H], xmm0
+        vmovdqa xmm0, oword [rbp-180H]
+        mov     rax, qword [rbp-260H]
+        vmovups oword [rax], xmm0
+        nop
+        add     dword [rbp-270H], 8
+L_016:  mov     eax, dword [rbp-270H]
+        cmp     eax, dword [rbp-284H]
+        jc      L_015
+        add     dword [rbp-274H], 2
+L_017:  mov     eax, dword [rbp-274H]
+        cmp     eax, dword [rbp-288H]
+        jc      L_014
+        mov     rax, qword [rbp-268H]
+        add     rax, rax
+        mov     rcx, qword [rbp-8H]
+
+
+        xor     rcx, qword [fs:abs 28H]
+        jz      L_018                           ; no-op as written: no failure call follows, both paths reach L_018
+L_018:  leave
+        ret
+
+
+SECTION .data
+
+
+SECTION .bss
+
+
+SECTION .note.gnu.property align=8
+
+        db 04H, 00H, 00H, 00H, 10H, 00H, 00H, 00H
+        db 05H, 00H, 00H, 00H, 47H, 4EH, 55H, 00H
+        db 02H, 00H, 00H, 0C0H, 04H, 00H, 00H, 00H
+        db 03H, 00H, 00H, 00H, 00H, 00H, 00H, 00H
+
+%else
+global __x86.get_pc_thunk.ax
+extern __stack_chk_fail_local
+extern _GLOBAL_OFFSET_TABLE_
+global highds_simd
+global diff1st_simd
+global diff2nd_simd
+
+SECTION .text
+
+highds_simd:
+        push    ebp
+        mov     ebp, esp
+        and     esp, 0FFFFFFE0H
+        sub     esp, 1632
+        call    __x86.get_pc_thunk.ax
+        add     eax, _GLOBAL_OFFSET_TABLE_-$
+        mov     dword [esp+68H], 0
+        mov     dword [esp+6CH], 0
+        mov     word [esp+30H], 0
+        mov     word [esp+32H], 0
+        mov     word [esp+34H], -1
+        mov     word [esp+36H], -2
+        mov     word [esp+38H], -3
+        mov     word [esp+3AH], -3
+        mov     word [esp+3CH], -2
+        mov     word [esp+3EH], -1
+        movzx   eax, word [esp+3EH]
+        vmovd   xmm0, eax
+        movzx   eax, word [esp+3CH]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm1, xmm0
+        movzx   eax, word [esp+3AH]
+        vmovd   xmm0, eax
+        movzx   eax, word [esp+38H]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm4, xmm0
+        movzx   eax, word [esp+36H]
+        vmovd   xmm0, eax
+        movzx   eax, word [esp+34H]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm2, xmm0
+        movzx   eax, word [esp+32H]
+        vmovd   xmm0, eax
+        movzx   eax, word [esp+30H]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm3, xmm0
+        vpunpckldq xmm0, xmm1, xmm4
+        vmovdqa xmm1, xmm0
+        vpunpckldq xmm0, xmm2, xmm3
+        vpunpcklqdq xmm0, xmm1, xmm0
+        vmovaps oword [esp+0D0H], xmm0
+        mov     word [esp+20H], 0
+        mov     word [esp+22H], 0
+        mov     word [esp+24H], -1
+        mov     word [esp+26H], -3
+        mov     word [esp+28H], 12
+        mov     word [esp+2AH], 12
+        mov     word [esp+2CH], -3
+        mov     word [esp+2EH], -1
+        movzx   eax, word [esp+2EH]
+        vmovd   xmm0, eax
+        movzx   eax, word [esp+2CH]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm1, xmm0
+        movzx   eax, word [esp+2AH]
+        vmovd   xmm0, eax
+        movzx   eax, word [esp+28H]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm4, xmm0
+        movzx   eax, word [esp+26H]
+        vmovd   xmm0, eax
+        movzx   eax, word [esp+24H]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm2, xmm0
+        movzx   eax, word [esp+22H]
+        vmovd   xmm0, eax
+        movzx   eax, word [esp+20H]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm3, xmm0
+        vpunpckldq xmm0, xmm1, xmm4
+        vmovdqa xmm1, xmm0
+        vpunpckldq xmm0, xmm2, xmm3
+        vpunpcklqdq xmm0, xmm1, xmm0
+        vmovaps oword [esp+0E0H], xmm0
+        mov     word [esp+10H], 0
+        mov     word [esp+12H], 0
+        mov     word [esp+14H], 0
+        mov     word [esp+16H], -1
+        mov     word [esp+18H], -1
+        mov     word [esp+1AH], -1
+        mov     word [esp+1CH], -1
+        mov     word [esp+1EH], 0
+        movzx   eax, word [esp+1EH]
+        vmovd   xmm0, eax
+        movzx   eax, word [esp+1CH]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm1, xmm0
+        movzx   eax, word [esp+1AH]
+        vmovd   xmm0, eax
+        movzx   eax, word [esp+18H]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm4, xmm0
+        movzx   eax, word [esp+16H]
+        vmovd   xmm0, eax
+        movzx   eax, word [esp+14H]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm2, xmm0
+        movzx   eax, word [esp+12H]
+        vmovd   xmm0, eax
+        movzx   eax, word [esp+10H]
+        vpinsrw xmm0, xmm0, eax, 1
+        vmovdqa xmm3, xmm0
+        vpunpckldq xmm0, xmm1, xmm4
+        vmovdqa xmm1, xmm0
+        vpunpckldq xmm0, xmm2, xmm3
+        vpunpcklqdq xmm0, xmm1, xmm0
+        vmovaps oword [esp+0F0H], xmm0
+        mov     eax, dword [ebp+0CH]
+        mov     dword [esp+40H], eax
+        jmp     L_008
+
+L_001:  mov     eax, dword [ebp+8H]
+        mov     dword [esp+44H], eax
+        jmp     L_007
+
+L_002:  mov     eax, dword [esp+40H]
+        sub     eax, 2
+        imul    eax, dword [ebp+1CH]
+        mov     edx, eax
+        mov     eax, dword [esp+44H]
+        add     eax, edx
+        add     eax, 2147483646                 ; 7FFFFFFEH: doubled by the lea below, this wraps to -4 bytes (2 samples to the left)
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp+18H]
+        add     eax, edx
+        mov     dword [esp+64H], eax
+        mov     eax, dword [esp+64H]
+        vlddqu  ymm0, yword [eax]
+        vmovdqa yword [esp+4E0H], ymm0
+        mov     eax, dword [esp+40H]
+        sub     eax, 1
+        imul    eax, dword [ebp+1CH]
+        mov     edx, eax
+        mov     eax, dword [esp+44H]
+        add     eax, edx
+        add     eax, 2147483646
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp+18H]
+        add     eax, edx
+        mov     dword [esp+60H], eax
+        mov     eax, dword [esp+60H]
+        vlddqu  ymm0, yword [eax]
+        vmovdqa yword [esp+500H], ymm0
+        mov     eax, dword [esp+40H]
+        imul    eax, dword [ebp+1CH]
+        mov     edx, eax
+        mov     eax, dword [esp+44H]
+        add     eax, edx
+        add     eax, 2147483646
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp+18H]
+        add     eax, edx
+        mov     dword [esp+5CH], eax
+        mov     eax, dword [esp+5CH]
+        vlddqu  ymm0, yword [eax]
+        vmovdqa yword [esp+520H], ymm0
+        mov     eax, dword [esp+40H]
+        add     eax, 1
+        imul    eax, dword [ebp+1CH]
+        mov     edx, eax
+        mov     eax, dword [esp+44H]
+        add     eax, edx
+        add     eax, 2147483646
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp+18H]
+        add     eax, edx
+        mov     dword [esp+58H], eax
+        mov     eax, dword [esp+58H]
+        vlddqu  ymm0, yword [eax]
+        vmovdqa yword [esp+540H], ymm0
+        mov     eax, dword [esp+40H]
+        add     eax, 2
+        imul    eax, dword [ebp+1CH]
+        mov     edx, eax
+        mov     eax, dword [esp+44H]
+        add     eax, edx
+        add     eax, 2147483646
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp+18H]
+        add     eax, edx
+        mov     dword [esp+54H], eax
+        mov     eax, dword [esp+54H]
+        vlddqu  ymm0, yword [eax]
+        vmovdqa yword [esp+560H], ymm0
+        mov     eax, dword [esp+40H]
+        add     eax, 3
+        imul    eax, dword [ebp+1CH]
+        mov     edx, eax
+        mov     eax, dword [esp+44H]
+        add     eax, edx
+        add     eax, 2147483646
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp+18H]
+        add     eax, edx
+        mov     dword [esp+50H], eax
+        mov     eax, dword [esp+50H]
+        vlddqu  ymm0, yword [eax]
+        vmovdqa yword [esp+580H], ymm0
+        mov     dword [esp+48H], 0
+        jmp     L_006
+
+L_003:  mov     eax, dword [esp+48H]
+        lea     edx, [eax*4]
+        mov     eax, dword [esp+44H]
+        add     eax, edx
+        cmp     dword [ebp+10H], eax
+        jle     L_004
+        mov     dword [esp+4CH], 0
+        vmovdqa ymm0, yword [esp+520H]
+        vmovdqa yword [esp+640H], ymm0
+        vmovdqa ymm0, yword [esp+640H]
+        vmovaps oword [esp+70H], xmm0
+        vmovdqa ymm0, yword [esp+540H]
+        vmovdqa yword [esp+620H], ymm0
+        vmovdqa ymm0, yword [esp+620H]
+        vmovaps oword [esp+80H], xmm0
+        vmovdqa xmm0, oword [esp+70H]
+        vmovaps oword [esp+2E0H], xmm0
+        vmovdqa xmm0, oword [esp+0E0H]
+        vmovaps oword [esp+2F0H], xmm0
+        vmovdqa xmm0, oword [esp+2F0H]
+        vmovdqa xmm1, oword [esp+2E0H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+80H]
+        vmovaps oword [esp+2C0H], xmm0
+        vmovdqa xmm0, oword [esp+0E0H]
+        vmovaps oword [esp+2D0H], xmm0
+        vmovdqa xmm0, oword [esp+2D0H]
+        vmovdqa xmm1, oword [esp+2C0H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [esp+110H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+2A0H], xmm0
+        vmovdqa xmm0, oword [esp+110H]
+        vmovaps oword [esp+2B0H], xmm0
+        vmovdqa xmm1, oword [esp+2B0H]
+        vmovdqa xmm0, oword [esp+2A0H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+280H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+290H], xmm0
+        vmovdqa xmm1, oword [esp+290H]
+        vmovdqa xmm0, oword [esp+280H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+260H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+270H], xmm0
+        vmovdqa xmm1, oword [esp+270H]
+        vmovdqa xmm0, oword [esp+260H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovd   eax, xmm0
+        add     dword [esp+4CH], eax
+        vmovdqa ymm0, yword [esp+500H]
+        vmovdqa yword [esp+600H], ymm0
+        vmovdqa ymm0, yword [esp+600H]
+        vmovaps oword [esp+90H], xmm0
+        vmovdqa ymm0, yword [esp+560H]
+        vmovdqa yword [esp+5E0H], ymm0
+        vmovdqa ymm0, yword [esp+5E0H]
+        vmovaps oword [esp+0A0H], xmm0
+        vmovdqa xmm0, oword [esp+90H]
+        vmovaps oword [esp+240H], xmm0
+        vmovdqa xmm0, oword [esp+0D0H]
+        vmovaps oword [esp+250H], xmm0
+        vmovdqa xmm0, oword [esp+250H]
+        vmovdqa xmm1, oword [esp+240H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+0A0H]
+        vmovaps oword [esp+220H], xmm0
+        vmovdqa xmm0, oword [esp+0D0H]
+        vmovaps oword [esp+230H], xmm0
+        vmovdqa xmm0, oword [esp+230H]
+        vmovdqa xmm1, oword [esp+220H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [esp+110H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+200H], xmm0
+        vmovdqa xmm0, oword [esp+110H]
+        vmovaps oword [esp+210H], xmm0
+        vmovdqa xmm1, oword [esp+210H]
+        vmovdqa xmm0, oword [esp+200H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+1E0H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+1F0H], xmm0
+        vmovdqa xmm1, oword [esp+1F0H]
+        vmovdqa xmm0, oword [esp+1E0H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+1C0H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+1D0H], xmm0
+        vmovdqa xmm1, oword [esp+1D0H]
+        vmovdqa xmm0, oword [esp+1C0H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovd   eax, xmm0
+        add     dword [esp+4CH], eax
+        vmovdqa ymm0, yword [esp+4E0H]
+        vmovdqa yword [esp+5C0H], ymm0
+        vmovdqa ymm0, yword [esp+5C0H]
+        vmovaps oword [esp+0B0H], xmm0
+        vmovdqa ymm0, yword [esp+580H]
+        vmovdqa yword [esp+5A0H], ymm0
+        vmovdqa ymm0, yword [esp+5A0H]
+        vmovaps oword [esp+0C0H], xmm0
+        vmovdqa xmm0, oword [esp+0B0H]
+        vmovaps oword [esp+1A0H], xmm0
+        vmovdqa xmm0, oword [esp+0F0H]
+        vmovaps oword [esp+1B0H], xmm0
+        vmovdqa xmm0, oword [esp+1B0H]
+        vmovdqa xmm1, oword [esp+1A0H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+0C0H]
+        vmovaps oword [esp+180H], xmm0
+        vmovdqa xmm0, oword [esp+0F0H]
+        vmovaps oword [esp+190H], xmm0
+        vmovdqa xmm0, oword [esp+190H]
+        vmovdqa xmm1, oword [esp+180H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [esp+110H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+160H], xmm0
+        vmovdqa xmm0, oword [esp+110H]
+        vmovaps oword [esp+170H], xmm0
+        vmovdqa xmm1, oword [esp+170H]
+        vmovdqa xmm0, oword [esp+160H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+140H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+150H], xmm0
+        vmovdqa xmm1, oword [esp+150H]
+        vmovdqa xmm0, oword [esp+140H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+120H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+130H], xmm0
+        vmovdqa xmm1, oword [esp+130H]
+        vmovdqa xmm0, oword [esp+120H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovd   eax, xmm0
+        add     dword [esp+4CH], eax
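+        mov     eax, dword [esp+4CH]            ; eax = signed accumulated sum
+        cdq                                     ; edx = sign mask (0 or -1)
+        mov     eax, edx
+        xor     eax, dword [esp+4CH]
+        sub     eax, edx                        ; eax = abs(sum), computed without a branch
+        cdq                                     ; sign-extend the result into edx:eax
+        add     dword [esp+68H], eax            ; 64-bit accumulator, low dword
+        adc     dword [esp+6CH], edx            ; 64-bit accumulator, high dword plus carry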
+L_004:  mov     eax, dword [esp+48H]
+        lea     edx, [eax*4]
+        mov     eax, dword [esp+44H]
+        add     eax, edx
+        add     eax, 2
+        cmp     dword [ebp+10H], eax
+        jle     L_005
+        mov     dword [esp+4CH], 0
+        vmovdqa xmm0, oword [esp+70H]
+        vpsrldq xmm0, xmm0, 4
+        vmovaps oword [esp+70H], xmm0
+        vmovdqa xmm0, oword [esp+80H]
+        vpsrldq xmm0, xmm0, 4
+        vmovaps oword [esp+80H], xmm0
+        vmovdqa xmm0, oword [esp+70H]
+        vmovaps oword [esp+4C0H], xmm0
+        vmovdqa xmm0, oword [esp+0E0H]
+        vmovaps oword [esp+4D0H], xmm0
+        vmovdqa xmm0, oword [esp+4D0H]
+        vmovdqa xmm1, oword [esp+4C0H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+80H]
+        vmovaps oword [esp+4A0H], xmm0
+        vmovdqa xmm0, oword [esp+0E0H]
+        vmovaps oword [esp+4B0H], xmm0
+        vmovdqa xmm0, oword [esp+4B0H]
+        vmovdqa xmm1, oword [esp+4A0H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [esp+110H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+480H], xmm0
+        vmovdqa xmm0, oword [esp+110H]
+        vmovaps oword [esp+490H], xmm0
+        vmovdqa xmm1, oword [esp+490H]
+        vmovdqa xmm0, oword [esp+480H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+460H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+470H], xmm0
+        vmovdqa xmm1, oword [esp+470H]
+        vmovdqa xmm0, oword [esp+460H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+440H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+450H], xmm0
+        vmovdqa xmm1, oword [esp+450H]
+        vmovdqa xmm0, oword [esp+440H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovd   eax, xmm0
+        add     dword [esp+4CH], eax
+        vmovdqa xmm0, oword [esp+90H]
+        vpsrldq xmm0, xmm0, 4
+        vmovaps oword [esp+90H], xmm0
+        vmovdqa xmm0, oword [esp+0A0H]
+        vpsrldq xmm0, xmm0, 4
+        vmovaps oword [esp+0A0H], xmm0
+        vmovdqa xmm0, oword [esp+90H]
+        vmovaps oword [esp+420H], xmm0
+        vmovdqa xmm0, oword [esp+0D0H]
+        vmovaps oword [esp+430H], xmm0
+        vmovdqa xmm0, oword [esp+430H]
+        vmovdqa xmm1, oword [esp+420H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+0A0H]
+        vmovaps oword [esp+400H], xmm0
+        vmovdqa xmm0, oword [esp+0D0H]
+        vmovaps oword [esp+410H], xmm0
+        vmovdqa xmm0, oword [esp+410H]
+        vmovdqa xmm1, oword [esp+400H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [esp+110H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+3E0H], xmm0
+        vmovdqa xmm0, oword [esp+110H]
+        vmovaps oword [esp+3F0H], xmm0
+        vmovdqa xmm1, oword [esp+3F0H]
+        vmovdqa xmm0, oword [esp+3E0H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+3C0H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+3D0H], xmm0
+        vmovdqa xmm1, oword [esp+3D0H]
+        vmovdqa xmm0, oword [esp+3C0H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+3A0H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+3B0H], xmm0
+        vmovdqa xmm1, oword [esp+3B0H]
+        vmovdqa xmm0, oword [esp+3A0H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovd   eax, xmm0
+        add     dword [esp+4CH], eax
+        vmovdqa xmm0, oword [esp+0B0H]
+        vpsrldq xmm0, xmm0, 4
+        vmovaps oword [esp+0B0H], xmm0
+        vmovdqa xmm0, oword [esp+0C0H]
+        vpsrldq xmm0, xmm0, 4
+        vmovaps oword [esp+0C0H], xmm0
+        vmovdqa xmm0, oword [esp+0B0H]
+        vmovaps oword [esp+380H], xmm0
+        vmovdqa xmm0, oword [esp+0F0H]
+        vmovaps oword [esp+390H], xmm0
+        vmovdqa xmm0, oword [esp+390H]
+        vmovdqa xmm1, oword [esp+380H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+0C0H]
+        vmovaps oword [esp+360H], xmm0
+        vmovdqa xmm0, oword [esp+0F0H]
+        vmovaps oword [esp+370H], xmm0
+        vmovdqa xmm0, oword [esp+370H]
+        vmovdqa xmm1, oword [esp+360H]
+        vpmaddwd xmm0, xmm1, xmm0
+        vmovaps oword [esp+110H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+340H], xmm0
+        vmovdqa xmm0, oword [esp+110H]
+        vmovaps oword [esp+350H], xmm0
+        vmovdqa xmm1, oword [esp+350H]
+        vmovdqa xmm0, oword [esp+340H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+320H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+330H], xmm0
+        vmovdqa xmm1, oword [esp+330H]
+        vmovdqa xmm0, oword [esp+320H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+300H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovaps oword [esp+310H], xmm0
+        vmovdqa xmm1, oword [esp+310H]
+        vmovdqa xmm0, oword [esp+300H]
+        vphaddd xmm0, xmm0, xmm1
+        vmovaps oword [esp+100H], xmm0
+        vmovdqa xmm0, oword [esp+100H]
+        vmovd   eax, xmm0
+        add     dword [esp+4CH], eax
+        mov     eax, dword [esp+4CH]
+        cdq
+        mov     eax, edx
+        xor     eax, dword [esp+4CH]
+        sub     eax, edx
+        cdq
+        add     dword [esp+68H], eax
+        adc     dword [esp+6CH], edx
+        vpermq  ymm0, yword [esp+4E0H], 39H     ; imm 39H rotates the four qwords down by one, i.e. advances by 4 int16 samples
+        vmovdqa yword [esp+4E0H], ymm0
+        vpermq  ymm0, yword [esp+500H], 39H
+        vmovdqa yword [esp+500H], ymm0
+        vpermq  ymm0, yword [esp+520H], 39H
+        vmovdqa yword [esp+520H], ymm0
+        vpermq  ymm0, yword [esp+540H], 39H
+        vmovdqa yword [esp+540H], ymm0
+        vpermq  ymm0, yword [esp+560H], 39H
+        vmovdqa yword [esp+560H], ymm0
+        vpermq  ymm0, yword [esp+580H], 39H
+        vmovdqa yword [esp+580H], ymm0
+L_005:  add     dword [esp+48H], 1
+L_006:  cmp     dword [esp+48H], 2
+        jle     L_003
+        add     dword [esp+44H], 12
+L_007:  mov     eax, dword [esp+44H]
+        cmp     eax, dword [ebp+10H]
+        jl      L_002
+        add     dword [esp+40H], 2
+L_008:  mov     eax, dword [esp+40H]
+        cmp     eax, dword [ebp+14H]
+        jl      L_001
+        mov     eax, dword [esp+68H]
+        mov     edx, dword [esp+6CH]
+        leave
+        ret
+
+diff1st_simd:
+        push    ebp
+        mov     ebp, esp
+        sub     esp, 440
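+        call    __x86.get_pc_thunk.ax           ; PIC setup: eax = address of the next instruction
+        add     eax, _GLOBAL_OFFSET_TABLE_-$    ; eax = GOT base; overwritten by the next mov, so never used in this function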
+        mov     eax, dword [ebp+10H]
+        mov     dword [ebp-1ACH], eax
+        mov     eax, dword [ebp+14H]
+        mov     dword [ebp-1B0H], eax
+
+        mov     eax, dword [gs:14H]             ; load the stack-protector canary
+        mov     dword [ebp-0CH], eax            ; and keep a copy in the frame
+        xor     eax, eax
+        mov     dword [ebp-180H], 0
+        mov     dword [ebp-17CH], 0
+
+        mov     word [ebp-1A2H], 0
+        mov     dword [ebp-1A0H], 0
+        jmp     L_012
+
+L_009:  mov     dword [ebp-19CH], 0
+        jmp     L_011
+
+L_010:  mov     eax, dword [ebp+18H]
+        imul    eax, dword [ebp-1A0H]
+        mov     edx, eax
+        mov     eax, dword [ebp-19CH]
+        add     eax, edx
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp-1ACH]
+        add     eax, edx
+        mov     dword [ebp-184H], eax
+        mov     eax, dword [ebp-184H]
+        vlddqu  xmm0, oword [eax]
+        vmovaps oword [ebp-178H], xmm0
+        mov     eax, dword [ebp-1A0H]
+        lea     edx, [eax+1H]
+        mov     eax, dword [ebp+18H]
+        imul    edx, eax
+        mov     eax, dword [ebp-19CH]
+        add     eax, edx
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp-1ACH]
+        add     eax, edx
+        mov     dword [ebp-188H], eax
+        mov     eax, dword [ebp-188H]
+        vlddqu  xmm0, oword [eax]
+        vmovaps oword [ebp-168H], xmm0
+        mov     eax, dword [ebp+18H]
+        imul    eax, dword [ebp-1A0H]
+        mov     edx, eax
+        mov     eax, dword [ebp-19CH]
+        add     eax, edx
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp-1B0H]
+        add     eax, edx
+        mov     dword [ebp-18CH], eax
+        mov     eax, dword [ebp-18CH]
+        vlddqu  xmm0, oword [eax]
+        vmovaps oword [ebp-158H], xmm0
+        mov     eax, dword [ebp-1A0H]
+        lea     edx, [eax+1H]
+        mov     eax, dword [ebp+18H]
+        imul    edx, eax
+        mov     eax, dword [ebp-19CH]
+        add     eax, edx
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp-1B0H]
+        add     eax, edx
+        mov     dword [ebp-190H], eax
+        mov     eax, dword [ebp-190H]
+        vlddqu  xmm0, oword [eax]
+        vmovaps oword [ebp-148H], xmm0
+        vmovdqa xmm0, oword [ebp-178H]
+        vmovaps oword [ebp-38H], xmm0
+        vmovdqa xmm0, oword [ebp-168H]
+        vmovaps oword [ebp-28H], xmm0
+        vmovdqa xmm1, oword [ebp-38H]
+        vmovdqa xmm0, oword [ebp-28H]
+        vpaddw  xmm0, xmm1, xmm0
+        vmovaps oword [ebp-138H], xmm0
+        vmovdqa xmm0, oword [ebp-158H]
+        vmovaps oword [ebp-58H], xmm0
+        vmovdqa xmm0, oword [ebp-148H]
+        vmovaps oword [ebp-48H], xmm0
+        vmovdqa xmm1, oword [ebp-58H]
+        vmovdqa xmm0, oword [ebp-48H]
+        vpaddw  xmm0, xmm1, xmm0
+        vmovaps oword [ebp-128H], xmm0
+        vmovdqa xmm0, oword [ebp-138H]
+        vmovaps oword [ebp-78H], xmm0
+        vmovdqa xmm0, oword [ebp-128H]
+        vmovaps oword [ebp-68H], xmm0
+        vmovdqa xmm0, oword [ebp-78H]
+        vmovdqa xmm1, oword [ebp-68H]
+        vpsubw  xmm0, xmm0, xmm1
+        vmovaps oword [ebp-128H], xmm0
+        vmovdqa xmm0, oword [ebp-128H]
+        vmovaps oword [ebp-98H], xmm0
+        vmovdqa xmm0, oword [ebp-128H]
+        vmovaps oword [ebp-88H], xmm0
+        vmovdqa xmm1, oword [ebp-88H]
+        vmovdqa xmm0, oword [ebp-98H]
+        vphaddw xmm0, xmm0, xmm1
+        vmovaps oword [ebp-128H], xmm0
+        vmovdqa xmm0, oword [ebp-128H]
+        vmovaps oword [ebp-0A8H], xmm0
+        vmovdqa xmm0, oword [ebp-0A8H]
+        vpabsw  xmm0, xmm0
+        vmovaps oword [ebp-128H], xmm0
+        vmovdqa xmm0, oword [ebp-128H]
+        vmovaps oword [ebp-0C8H], xmm0
+        vmovdqa xmm0, oword [ebp-128H]
+        vmovaps oword [ebp-0B8H], xmm0
+        vmovdqa xmm1, oword [ebp-0B8H]
+        vmovdqa xmm0, oword [ebp-0C8H]
+        vphaddsw xmm0, xmm0, xmm1
+        vmovaps oword [ebp-128H], xmm0
+        vmovdqa xmm0, oword [ebp-128H]
+        vmovaps oword [ebp-0E8H], xmm0
+        vmovdqa xmm0, oword [ebp-128H]
+        vmovaps oword [ebp-0D8H], xmm0
+        vmovdqa xmm1, oword [ebp-0D8H]
+        vmovdqa xmm0, oword [ebp-0E8H]
+        vphaddsw xmm0, xmm0, xmm1
+        vmovaps oword [ebp-128H], xmm0
+        vmovdqa xmm0, oword [ebp-128H]
+        vmovaps oword [ebp-0F8H], xmm0
+        vmovdqa xmm0, oword [ebp-0F8H]
+        vmovd   edx, xmm0
+        lea     eax, [ebp-1A2H]
+        mov     word [eax], dx
+        movzx   ecx, word [ebp-1A2H]
+        movzx   eax, cx
+        mov     edx, 0
+        add     dword [ebp-180H], eax
+        adc     dword [ebp-17CH], edx
+        mov     eax, dword [ebp+18H]
+        imul    eax, dword [ebp-1A0H]
+        mov     edx, eax
+        mov     eax, dword [ebp-19CH]
+        add     eax, edx
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp-1B0H]
+        add     eax, edx
+        mov     dword [ebp-194H], eax
+        vmovdqa xmm0, oword [ebp-178H]
+        vmovaps oword [ebp-108H], xmm0
+        vmovdqa xmm0, oword [ebp-108H]
+        mov     eax, dword [ebp-194H]
+        vmovups oword [eax], xmm0
+        nop
+        mov     eax, dword [ebp-1A0H]
+        lea     edx, [eax+1H]
+        mov     eax, dword [ebp+18H]
+        imul    edx, eax
+        mov     eax, dword [ebp-19CH]
+        add     eax, edx
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp-1B0H]
+        add     eax, edx
+        mov     dword [ebp-198H], eax
+        vmovdqa xmm0, oword [ebp-168H]
+        vmovaps oword [ebp-118H], xmm0
+        vmovdqa xmm0, oword [ebp-118H]
+        mov     eax, dword [ebp-198H]
+        vmovups oword [eax], xmm0
+        nop
+        add     dword [ebp-19CH], 8
+L_011:  mov     eax, dword [ebp-19CH]
+        cmp     eax, dword [ebp+8H]
+        jc      L_010
+        add     dword [ebp-1A0H], 2
+L_012:  mov     eax, dword [ebp-1A0H]
+        cmp     eax, dword [ebp+0CH]
+        jc      L_009
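+        mov     eax, dword [ebp-180H]           ; low dword of the 64-bit sum
+        mov     edx, dword [ebp-17CH]           ; high dword of the 64-bit sum
+        shld    edx, eax, 1                     ; shift edx:eax left by one bit ...
+        add     eax, eax                        ; ... so the function returns 2 * sum
+        mov     ecx, dword [ebp-0CH]            ; reload the saved canary
+
+        xor     ecx, dword [gs:14H]             ; compare it against the live canary
+        jz      L_013                           ; both outcomes fall through to L_013; no failure handler is called here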
+L_013:  leave
+        ret
+
+diff2nd_simd:
+        push    ebp
+        mov     ebp, esp
+        sub     esp, 616
+        call    __x86.get_pc_thunk.ax
+        add     eax, _GLOBAL_OFFSET_TABLE_-$
+        mov     eax, dword [ebp+10H]
+        mov     dword [ebp-25CH], eax
+        mov     eax, dword [ebp+14H]
+        mov     dword [ebp-260H], eax
+        mov     eax, dword [ebp+18H]
+        mov     dword [ebp-264H], eax
+
+        mov     eax, dword [gs:14H]
+        mov     dword [ebp-0CH], eax
+        xor     eax, eax
+        mov     dword [ebp-220H], 0
+        mov     dword [ebp-21CH], 0
+
+        mov     word [ebp-256H], 0
+        mov     dword [ebp-254H], 0
+        jmp     L_017
+
+L_014:  mov     dword [ebp-250H], 0
+        jmp     L_016
+
+L_015:  mov     eax, dword [ebp+1CH]
+        imul    eax, dword [ebp-254H]
+        mov     edx, eax
+        mov     eax, dword [ebp-250H]
+        add     eax, edx
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp-25CH]
+        add     eax, edx
+        mov     dword [ebp-224H], eax
+        mov     eax, dword [ebp-224H]
+        vlddqu  xmm0, oword [eax]
+        vmovaps oword [ebp-218H], xmm0
+        mov     eax, dword [ebp-254H]
+        lea     edx, [eax+1H]
+        mov     eax, dword [ebp+1CH]
+        imul    edx, eax
+        mov     eax, dword [ebp-250H]
+        add     eax, edx
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp-25CH]
+        add     eax, edx
+        mov     dword [ebp-228H], eax
+        mov     eax, dword [ebp-228H]
+        vlddqu  xmm0, oword [eax]
+        vmovaps oword [ebp-208H], xmm0
+        mov     eax, dword [ebp+1CH]
+        imul    eax, dword [ebp-254H]
+        mov     edx, eax
+        mov     eax, dword [ebp-250H]
+        add     eax, edx
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp-260H]
+        add     eax, edx
+        mov     dword [ebp-22CH], eax
+        mov     eax, dword [ebp-22CH]
+        vlddqu  xmm0, oword [eax]
+        vmovaps oword [ebp-1F8H], xmm0
+        mov     eax, dword [ebp-254H]
+        lea     edx, [eax+1H]
+        mov     eax, dword [ebp+1CH]
+        imul    edx, eax
+        mov     eax, dword [ebp-250H]
+        add     eax, edx
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp-260H]
+        add     eax, edx
+        mov     dword [ebp-230H], eax
+        mov     eax, dword [ebp-230H]
+        vlddqu  xmm0, oword [eax]
+        vmovaps oword [ebp-1E8H], xmm0
+        mov     eax, dword [ebp+1CH]
+        imul    eax, dword [ebp-254H]
+        mov     edx, eax
+        mov     eax, dword [ebp-250H]
+        add     eax, edx
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp-264H]
+        add     eax, edx
+        mov     dword [ebp-234H], eax
+        mov     eax, dword [ebp-234H]
+        vlddqu  xmm0, oword [eax]
+        vmovaps oword [ebp-1D8H], xmm0
+        mov     eax, dword [ebp-254H]
+        lea     edx, [eax+1H]
+        mov     eax, dword [ebp+1CH]
+        imul    edx, eax
+        mov     eax, dword [ebp-250H]
+        add     eax, edx
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp-264H]
+        add     eax, edx
+        mov     dword [ebp-238H], eax
+        mov     eax, dword [ebp-238H]
+        vlddqu  xmm0, oword [eax]
+        vmovaps oword [ebp-1C8H], xmm0
+        vmovdqa xmm0, oword [ebp-218H]
+        vmovaps oword [ebp-38H], xmm0
+        vmovdqa xmm0, oword [ebp-208H]
+        vmovaps oword [ebp-28H], xmm0
+        vmovdqa xmm1, oword [ebp-38H]
+        vmovdqa xmm0, oword [ebp-28H]
+        vpaddw  xmm0, xmm1, xmm0
+        vmovaps oword [ebp-1B8H], xmm0
+        vmovdqa xmm0, oword [ebp-1F8H]
+        vmovaps oword [ebp-58H], xmm0
+        vmovdqa xmm0, oword [ebp-1E8H]
+        vmovaps oword [ebp-48H], xmm0
+        vmovdqa xmm1, oword [ebp-58H]
+        vmovdqa xmm0, oword [ebp-48H]
+        vpaddw  xmm0, xmm1, xmm0
+        vmovaps oword [ebp-1A8H], xmm0
+        vmovdqa xmm0, oword [ebp-1D8H]
+        vmovaps oword [ebp-78H], xmm0
+        vmovdqa xmm0, oword [ebp-1C8H]
+        vmovaps oword [ebp-68H], xmm0
+        vmovdqa xmm1, oword [ebp-78H]
+        vmovdqa xmm0, oword [ebp-68H]
+        vpaddw  xmm0, xmm1, xmm0
+        vmovaps oword [ebp-198H], xmm0
+        vmovdqa xmm0, oword [ebp-1B8H]
+        vmovaps oword [ebp-98H], xmm0
+        vmovdqa xmm0, oword [ebp-198H]
+        vmovaps oword [ebp-88H], xmm0
+        vmovdqa xmm1, oword [ebp-98H]
+        vmovdqa xmm0, oword [ebp-88H]
+        vpaddw  xmm0, xmm1, xmm0
+        vmovaps oword [ebp-1B8H], xmm0
+        vmovdqa xmm0, oword [ebp-1B8H]
+        vmovaps oword [ebp-0B8H], xmm0
+        vmovdqa xmm0, oword [ebp-1A8H]
+        vmovaps oword [ebp-0A8H], xmm0
+        vmovdqa xmm1, oword [ebp-0A8H]
+        vmovdqa xmm0, oword [ebp-0B8H]
+        vphaddw xmm0, xmm0, xmm1
+        vmovaps oword [ebp-1B8H], xmm0
+        vmovdqa xmm0, oword [ebp-1B8H]
+        vpshufd xmm0, xmm0, 0EEH
+        vmovaps oword [ebp-1A8H], xmm0
+        vmovdqa xmm0, oword [ebp-1A8H]
+        vmovaps oword [ebp-0C8H], xmm0
+        mov     dword [ebp-23CH], 1
+        vmovdqa xmm1, oword [ebp-0C8H]
+        vmovd   xmm0, dword [ebp-23CH]
+        vpsllw  xmm0, xmm1, xmm0
+        vmovaps oword [ebp-1A8H], xmm0
+        vmovdqa xmm0, oword [ebp-1B8H]
+        vmovaps oword [ebp-0E8H], xmm0
+        vmovdqa xmm0, oword [ebp-1A8H]
+        vmovaps oword [ebp-0D8H], xmm0
+        vmovdqa xmm0, oword [ebp-0E8H]
+        vmovdqa xmm1, oword [ebp-0D8H]
+        vpsubw  xmm0, xmm0, xmm1
+        vmovaps oword [ebp-1A8H], xmm0
+        vmovdqa xmm0, oword [ebp-1A8H]
+        vmovaps oword [ebp-0F8H], xmm0
+        vmovdqa xmm0, oword [ebp-0F8H]
+        vpabsw  xmm0, xmm0
+        vmovaps oword [ebp-1A8H], xmm0
+        vmovdqa xmm0, oword [ebp-1A8H]
+        vmovaps oword [ebp-118H], xmm0
+        vmovdqa xmm0, oword [ebp-1A8H]
+        vmovaps oword [ebp-108H], xmm0
+        vmovdqa xmm1, oword [ebp-108H]
+        vmovdqa xmm0, oword [ebp-118H]
+        vphaddsw xmm0, xmm0, xmm1
+        vmovaps oword [ebp-1A8H], xmm0
+        vmovdqa xmm0, oword [ebp-1A8H]
+        vmovaps oword [ebp-138H], xmm0
+        vmovdqa xmm0, oword [ebp-1A8H]
+        vmovaps oword [ebp-128H], xmm0
+        vmovdqa xmm1, oword [ebp-128H]
+        vmovdqa xmm0, oword [ebp-138H]
+        vphaddsw xmm0, xmm0, xmm1
+        vmovaps oword [ebp-1A8H], xmm0
+        vmovdqa xmm0, oword [ebp-1A8H]
+        vmovaps oword [ebp-148H], xmm0
+        vmovdqa xmm0, oword [ebp-148H]
+        vmovd   edx, xmm0
+        lea     eax, [ebp-256H]
+        mov     word [eax], dx
+        movzx   ecx, word [ebp-256H]
+        movzx   eax, cx
+        mov     edx, 0
+        add     dword [ebp-220H], eax
+        adc     dword [ebp-21CH], edx
+        mov     eax, dword [ebp+1CH]
+        imul    eax, dword [ebp-254H]
+        mov     edx, eax
+        mov     eax, dword [ebp-250H]
+        add     eax, edx
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp-264H]
+        add     eax, edx
+        mov     dword [ebp-240H], eax
+        vmovdqa xmm0, oword [ebp-1F8H]
+        vmovaps oword [ebp-158H], xmm0
+        vmovdqa xmm0, oword [ebp-158H]
+        mov     eax, dword [ebp-240H]
+        vmovups oword [eax], xmm0
+        nop
+        mov     eax, dword [ebp-254H]
+        lea     edx, [eax+1H]
+        mov     eax, dword [ebp+1CH]
+        imul    edx, eax
+        mov     eax, dword [ebp-250H]
+        add     eax, edx
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp-264H]
+        add     eax, edx
+        mov     dword [ebp-244H], eax
+        vmovdqa xmm0, oword [ebp-1E8H]
+        vmovaps oword [ebp-168H], xmm0
+        vmovdqa xmm0, oword [ebp-168H]
+        mov     eax, dword [ebp-244H]
+        vmovups oword [eax], xmm0
+        nop
+        mov     eax, dword [ebp+1CH]
+        imul    eax, dword [ebp-254H]
+        mov     edx, eax
+        mov     eax, dword [ebp-250H]
+        add     eax, edx
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp-260H]
+        add     eax, edx
+        mov     dword [ebp-248H], eax
+        vmovdqa xmm0, oword [ebp-218H]
+        vmovaps oword [ebp-178H], xmm0
+        vmovdqa xmm0, oword [ebp-178H]
+        mov     eax, dword [ebp-248H]
+        vmovups oword [eax], xmm0
+        nop
+        mov     eax, dword [ebp-254H]
+        lea     edx, [eax+1H]
+        mov     eax, dword [ebp+1CH]
+        imul    edx, eax
+        mov     eax, dword [ebp-250H]
+        add     eax, edx
+        lea     edx, [eax+eax]
+        mov     eax, dword [ebp-260H]
+        add     eax, edx
+        mov     dword [ebp-24CH], eax
+        vmovdqa xmm0, oword [ebp-208H]
+        vmovaps oword [ebp-188H], xmm0
+        vmovdqa xmm0, oword [ebp-188H]
+        mov     eax, dword [ebp-24CH]
+        vmovups oword [eax], xmm0
+        nop
+        add     dword [ebp-250H], 8
+L_016:  mov     eax, dword [ebp-250H]
+        cmp     eax, dword [ebp+8H]
+        jc      L_015
+        add     dword [ebp-254H], 2
+L_017:  mov     eax, dword [ebp-254H]
+        cmp     eax, dword [ebp+0CH]
+        jc      L_014
+        mov     eax, dword [ebp-220H]
+        mov     edx, dword [ebp-21CH]
+        shld    edx, eax, 1
+        add     eax, eax
+        mov     ecx, dword [ebp-0CH]
+
+        xor     ecx, dword [gs:14H]
+        jz      L_018
+L_018:  leave
+        ret
+
+
+SECTION .data
+
+
+SECTION .bss
+
+
+SECTION .text.__x86.get_pc_thunk.ax
+
+__x86.get_pc_thunk.ax:
+        mov     eax, dword [esp]
+        ret
+
+
+
+SECTION .note.gnu.property align=4
+
+        db 04H, 00H, 00H, 00H, 0CH, 00H, 00H, 00H
+        db 05H, 00H, 00H, 00H, 47H, 4EH, 55H, 00H
+        db 02H, 00H, 00H, 0C0H, 04H, 00H, 00H, 00H
+        db 03H, 00H, 00H, 00H
+
+%endif
diff --git a/libavfilter/x86/vf_xpsnr_init.c b/libavfilter/x86/vf_xpsnr_init.c
new file mode 100644
index 0000000000..825fc2f995
--- /dev/null
+++ b/libavfilter/x86/vf_xpsnr_init.c
@@ -0,0 +1,58 @@ 
+/*
+ * Copyright (c) 2023 Christian R. Helmrich
+ * Copyright (c) 2023 Christian Stoffers
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/**
+ * @file
+ * SIMD initialization for calculation of extended perceptually weighted PSNR (XPSNR).
+ *
+ * Authors: Christian Helmrich and Christian Stoffers, Fraunhofer HHI, Berlin, Germany
+ */
+
+#include "libavutil/x86/cpu.h"
+#include "libavfilter/xpsnr.h"
+
+uint64_t ff_sse_line_16bit_sse2 (const uint8_t *buf, const uint8_t *ref, const int w);
+#ifdef __AVX2__
+uint64_t highds_simd (const int x_act, const int y_act, const int w_act, const int h_act, const int16_t *o_m0, const int o);
+uint64_t diff1st_simd(const uint32_t w_act, const uint32_t h_act, const int16_t *o_m0, int16_t *o_m1, const int o);
+uint64_t diff2nd_simd(const uint32_t w_act, const uint32_t h_act, const int16_t *o_m0, int16_t *o_m1, int16_t *o_m2, const int o);
+#endif
+
+void ff_xpsnr_init_x86 (PSNRDSPContext *dsp, const int bpp)
+{
+  if (bpp <= 15) /* XPSNR always operates with 16-bit internal precision */
+  {
+    const int cpu_flags = av_get_cpu_flags();
+
+    if (EXTERNAL_SSE2 (cpu_flags))
+    {
+      dsp->sse_line = ff_sse_line_16bit_sse2;
+    }
+    if (EXTERNAL_AVX2 (cpu_flags))
+    {
+#ifdef __AVX2__
+      dsp->highds_func  = highds_simd;
+      dsp->diff1st_func = diff1st_simd;
+      dsp->diff2nd_func = diff2nd_simd;
+#endif
+    }
+  }
+}
diff --git a/libavfilter/xpsnr.h b/libavfilter/xpsnr.h
new file mode 100644
index 0000000000..f07179e449
--- /dev/null
+++ b/libavfilter/xpsnr.h
@@ -0,0 +1,48 @@ 
+/*
+ * Copyright (c) 2023 Christian R. Helmrich
+ * Copyright (c) 2023 Christian Stoffers
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/**
+ * @file
+ * Public declaration of DSP context structure of XPSNR measurement filter for FFmpeg.
+ *
+ * Authors: Christian Helmrich and Christian Stoffers, Fraunhofer HHI, Berlin, Germany
+ */
+
+#ifndef AVFILTER_XPSNR_H
+#define AVFILTER_XPSNR_H
+
+#include <stddef.h>
+#include <stdint.h>
+#include "libavutil/x86/cpu.h"
+
+/* public XPSNR DSP structure definition */
+
+typedef struct XPSNRDSPContext
+{
+  uint64_t (*sse_line) (const uint8_t *buf, const uint8_t *ref, const int w);
+  uint64_t (*highds_func) (const int x_act, const int y_act, const int w_act, const int h_act, const int16_t *o_m0, const int o);
+  uint64_t (*diff1st_func)(const uint32_t w_act, const uint32_t h_act, const int16_t *o_m0, int16_t *o_m1, const int o);
+  uint64_t (*diff2nd_func)(const uint32_t w_act, const uint32_t h_act, const int16_t *o_m0, int16_t *o_m1, int16_t *o_m2, const int o);
+} PSNRDSPContext;
+
+void ff_xpsnr_init_x86 (PSNRDSPContext *dsp, const int bpp);
+
+#endif /* AVFILTER_XPSNR_H */
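
For readers following the dispatch logic in vf_xpsnr_init.c above: the init routine only overwrites function pointers in the DSP context when the CPU reports the matching feature, so the filter is expected to have installed portable C routines beforehand. Below is a minimal sketch of that pattern for the sse_line slot alone. The names SketchDSPContext, sketch_sse_line_16bit_c and sketch_dsp_init are hypothetical and not part of the patch, and the C reference is an assumption modelled on what a 16-bit sum-of-squared-errors line routine typically computes, not the authors' code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the patch's DSP context; only one slot is modelled. */
typedef struct SketchDSPContext {
    uint64_t (*sse_line)(const uint8_t *buf, const uint8_t *ref, int w);
} SketchDSPContext;

/* Assumed portable reference: sum of squared differences over w 16-bit samples.
 * The byte pointers must actually point at suitably aligned uint16_t data. */
static uint64_t sketch_sse_line_16bit_c(const uint8_t *buf, const uint8_t *ref, int w)
{
    const uint16_t *b = (const uint16_t *)buf;
    const uint16_t *r = (const uint16_t *)ref;
    uint64_t sse = 0;

    for (int i = 0; i < w; i++) {
        /* widen before squaring: 65535 * 65535 does not fit in a 32-bit int */
        const int64_t d = (int64_t)b[i] - (int64_t)r[i];
        sse += (uint64_t)(d * d);
    }
    return sse;
}

/* Install the C fallback first, then let an optional SIMD routine override it.
 * In the patch the override is chosen from the CPU flags; here the caller
 * passes the candidate directly (NULL means "no SIMD available"). */
static void sketch_dsp_init(SketchDSPContext *dsp,
                            uint64_t (*simd_sse_line)(const uint8_t *, const uint8_t *, int))
{
    assert(dsp != NULL);
    dsp->sse_line = sketch_sse_line_16bit_c;
    if (simd_sse_line)
        dsp->sse_line = simd_sse_line;
}
```

Calling sketch_dsp_init(&dsp, NULL) leaves the C routine in place, so dsp.sse_line can always be called without a NULL check. Passing a SIMD candidate swaps the implementation without touching any call sites, which is the property the function-pointer table in xpsnr.h relies on.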