[FFmpeg-devel,00/16] swscale/range_convert: fix mpeg ranges in yuv range conversion for non-8-bit pixel formats

Message ID 20240927125241.15887-1-ramiro.polla@gmail.com

Ramiro Polla Sept. 27, 2024, 12:52 p.m. UTC
There is an issue with the constants used in YUV to YUV range conversion,
where the upper bound is not respected when converting to mpeg range.

With this patchset, the constants are calculated at runtime, depending on
the bit depth. This approach also makes it easier to see how the constants
are derived.
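
As an illustration of deriving such constants from the bit depth, here is a
minimal sketch for the luma full-range to mpeg-range direction. It is not the
code from this patchset: the names RangeCoeffs, luma_to_mpeg_coeffs,
luma_to_mpeg and RANGE_SHIFT are hypothetical, the 20-bit fixed-point
precision is chosen only for the example, and it works on plain sample values
rather than on swscale's intermediate format. It assumes the usual limited
range of 16..235 scaled by 2^(bit_depth - 8).

```c
#include <stdint.h>

/* Illustrative fixed-point precision; an assumption, not swscale's value. */
#define RANGE_SHIFT 20

/* Hypothetical container for the derived constants. */
typedef struct RangeCoeffs {
    int64_t coeff;   /* fixed-point multiplier */
    int64_t offset;  /* fixed-point offset, rounding term included */
} RangeCoeffs;

/* Derive full-range -> mpeg-range luma constants for a given bit depth. */
static RangeCoeffs luma_to_mpeg_coeffs(int bit_depth)
{
    const int64_t full_max = ((int64_t)1 << bit_depth) - 1;
    const int64_t mpeg_min = (int64_t)16  << (bit_depth - 8);
    const int64_t mpeg_max = (int64_t)235 << (bit_depth - 8);
    RangeCoeffs c;

    /* Scale factor (mpeg_max - mpeg_min) / full_max, rounded to nearest. */
    c.coeff  = (((mpeg_max - mpeg_min) << RANGE_SHIFT) + full_max / 2) / full_max;
    /* Lower bound of the mpeg range plus 0.5 for rounding the final shift. */
    c.offset = (mpeg_min << RANGE_SHIFT) + ((int64_t)1 << (RANGE_SHIFT - 1));
    return c;
}

/* Apply the constants to one full-range sample. */
static int luma_to_mpeg(int v, RangeCoeffs c)
{
    return (int)((v * c.coeff + c.offset) >> RANGE_SHIFT);
}
```

With constants derived this way, the full-range maximum maps onto the mpeg
upper bound for each bit depth (255 -> 235 for 8-bit, 1023 -> 940 for 10-bit),
which is the property the hardcoded constants did not guarantee.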

These are the speedups for the entire patchset (columns: before, after,
speedup as before/after):
x86_64:
chrRangeFromJpeg8_1920_c:     5827.4   5845.2 ( 1.00x)
chrRangeFromJpeg8_1920_sse2:  1945.6   1955.2 ( 1.00x)
chrRangeFromJpeg8_1920_avx2:   992.0    988.9 ( 1.00x)
chrRangeFromJpeg16_1920_c:    5793.2   5809.1 ( 1.00x)
chrRangeToJpeg8_1920_c:      11726.2   9462.2 ( 1.24x)
chrRangeToJpeg8_1920_sse2:    1965.5   1949.9 ( 1.01x)
chrRangeToJpeg8_1920_avx2:     984.2    988.5 ( 1.00x)
chrRangeToJpeg16_1920_c:     10610.8   9261.5 ( 1.15x)
lumRangeFromJpeg8_1920_c:     4165.7   4191.4 ( 0.99x)
lumRangeFromJpeg8_1920_sse2:  1032.0   1040.5 ( 0.99x)
lumRangeFromJpeg8_1920_avx2:   575.2    520.5 ( 1.11x)
lumRangeFromJpeg16_1920_c:    4530.0   4143.4 ( 1.09x)
lumRangeToJpeg8_1920_c:       6044.8   5720.5 ( 1.06x)
lumRangeToJpeg8_1920_sse2:    1034.2   1046.0 ( 0.99x)
lumRangeToJpeg8_1920_avx2:     513.5    540.5 ( 0.95x)
lumRangeToJpeg16_1920_c:      5343.6   5139.5 ( 1.04x)

aarch64 A55:
chrRangeFromJpeg8_1920_c:    28839.3  28834.8 ( 1.00x)
chrRangeFromJpeg8_1920_neon:  5312.2   5313.1 ( 1.00x)
chrRangeFromJpeg16_1920_c:   28843.8  28840.6 ( 1.00x)
chrRangeToJpeg8_1920_c:      44196.1  23072.5 ( 1.92x)
chrRangeToJpeg8_1920_neon:    6035.9   5550.8 ( 1.09x)
chrRangeToJpeg16_1920_c:     36526.7  23075.1 ( 1.58x)
lumRangeFromJpeg8_1920_c:    15384.3  15386.7 ( 1.00x)
lumRangeFromJpeg8_1920_neon:  3148.6   3145.8 ( 1.00x)
lumRangeFromJpeg16_1920_c:   15390.1  15383.8 ( 1.00x)
lumRangeToJpeg8_1920_c:      23066.7  19223.6 ( 1.20x)
lumRangeToJpeg8_1920_neon:    3868.8   3624.9 ( 1.07x)
lumRangeToJpeg16_1920_c:     19224.6  19225.5 ( 1.00x)

aarch64 A76:
chrRangeFromJpeg8_1920_c:     6316.2   6318.5 ( 1.00x)
chrRangeFromJpeg8_1920_neon:  2263.5   2304.2 ( 0.98x)
chrRangeFromJpeg16_1920_c:    6321.9   6323.5 ( 1.00x)
chrRangeToJpeg8_1920_c:      11389.3   9170.0 ( 1.24x)
chrRangeToJpeg8_1920_neon:    2644.2   2793.8 ( 0.95x)
chrRangeToJpeg16_1920_c:      9514.4   9195.6 ( 1.03x)
lumRangeFromJpeg8_1920_c:     4376.0   4425.5 ( 0.99x)
lumRangeFromJpeg8_1920_neon:  1110.8   1105.0 ( 1.01x)
lumRangeFromJpeg16_1920_c:    4437.9   4436.8 ( 1.00x)
lumRangeToJpeg8_1920_c:       6667.0   6017.2 ( 1.11x)
lumRangeToJpeg8_1920_neon:    1327.5   1328.0 ( 1.00x)
lumRangeToJpeg16_1920_c:      6062.5   6017.2 ( 1.01x)

NOTE: SIMD optimizations for x86 and aarch64 have been updated, but riscv
      and loongarch are still missing (and therefore disabled).

NOTE2: the same issue still exists in rgb2yuv conversions, which are not
       addressed in this patchset.

Changes from v1:
- Saturate the output value instead of limiting the input with amax;
- Add more comprehensive benchmarks to commit messages;
- Add comments when disabling code with "#if 0".
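
To illustrate the first item, here is a minimal sketch of saturating the
output in the mpeg-range to full-range luma direction, rather than clamping
the input before the multiply. It is an assumption-laden example and not the
code from this patchset: luma_to_jpeg_saturated is a hypothetical name, the
20-bit fixed-point precision is arbitrary, it works on plain sample values
rather than on swscale's intermediate format, and it also clamps the low side,
which the patchset may or may not do.

```c
#include <stdint.h>

/* Hypothetical helper: expand one mpeg-range luma sample to full range and
 * saturate the result, instead of limiting the input beforehand. */
static int luma_to_jpeg_saturated(int v, int bit_depth)
{
    const int shift = 20;  /* illustrative precision, an assumption */
    const int64_t full_max = ((int64_t)1 << bit_depth) - 1;
    const int64_t mpeg_min = (int64_t)16  << (bit_depth - 8);
    const int64_t mpeg_max = (int64_t)235 << (bit_depth - 8);
    const int64_t range    = mpeg_max - mpeg_min;

    /* Scale factor full_max / range, rounded to nearest. */
    const int64_t coeff = ((full_max << shift) + range / 2) / range;
    /* Fixed-point result, with 0.5 added for rounding the final shift. */
    const int64_t acc = (v - mpeg_min) * coeff + ((int64_t)1 << (shift - 1));
    int64_t out;

    /* Inputs below the mpeg range saturate to 0 (also avoids shifting a
     * negative value). */
    if (acc < 0)
        return 0;

    out = acc >> shift;
    /* Inputs above the mpeg range saturate to the full-range maximum. */
    return out > full_max ? (int)full_max : (int)out;
}
```

In this sketch an out-of-range 8-bit input such as 255 would compute to 278
before saturation and is returned as 255, while in-range inputs are unaffected
by the clamp.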

Ramiro Polla (16):
  swscale/range_convert: call arch-specific init functions from main
    init function
  swscale/range_convert: drop redundant conditionals from arch-specific
    init functions
  swscale/range_convert: indent after previous commit
  checkasm: use FF_ARRAY_ELEMS instead of hardcoding size of arrays
  checkasm/sw_range_convert: use YUV pixel formats instead of YUVJ
  checkasm/sw_range_convert: reduce number of input sizes tested
  checkasm/sw_range_convert: only run benchmarks on largest input width
  checkasm/sw_range_convert: test all supported bit depths
  checkasm/sw_range_convert: indent after previous couple of commits
  swscale/range_convert: saturate output instead of limiting input
  swscale/aarch64/range_convert: saturate output instead of limiting
    input
  swscale/range_convert: fix mpeg ranges in yuv range conversion for
    non-8-bit pixel formats
  swscale/x86/range_convert: update sse2 and avx2 range_convert
    functions to new API
  swscale/x86: add sse2, sse4, and avx2 {lum,chr}ConvertRange16
  swscale/aarch64/range_convert: update neon range_convert functions to
    new API
  swscale/aarch64: add neon {lum,chr}ConvertRange16

 libswscale/aarch64/range_convert_neon.S       | 152 ++++++++++----
 libswscale/aarch64/swscale.c                  |  41 +++-
 libswscale/hscale.c                           |   6 +-
 libswscale/loongarch/swscale_init_loongarch.c |  38 ++--
 libswscale/riscv/swscale.c                    |  15 +-
 libswscale/swscale.c                          | 122 ++++++++++--
 libswscale/swscale_internal.h                 |  11 +-
 libswscale/utils.c                            |  10 +-
 libswscale/x86/range_convert.asm              | 161 ++++++++++-----
 libswscale/x86/swscale.c                      |  56 ++++--
 tests/checkasm/sw_gbrp.c                      |  15 +-
 tests/checkasm/sw_range_convert.c             | 186 +++++++++++++-----
 tests/checkasm/sw_scale.c                     |  11 +-
 .../fate/filter-alphaextract_alphamerge_rgb   | 100 +++++-----
 tests/ref/fate/filter-pixdesc-gray10be        |   2 +-
 tests/ref/fate/filter-pixdesc-gray10le        |   2 +-
 tests/ref/fate/filter-pixdesc-gray12be        |   2 +-
 tests/ref/fate/filter-pixdesc-gray12le        |   2 +-
 tests/ref/fate/filter-pixdesc-gray14be        |   2 +-
 tests/ref/fate/filter-pixdesc-gray14le        |   2 +-
 tests/ref/fate/filter-pixdesc-gray16be        |   2 +-
 tests/ref/fate/filter-pixdesc-gray16le        |   2 +-
 tests/ref/fate/filter-pixdesc-gray9be         |   2 +-
 tests/ref/fate/filter-pixdesc-gray9le         |   2 +-
 tests/ref/fate/filter-pixdesc-ya16be          |   2 +-
 tests/ref/fate/filter-pixdesc-ya16le          |   2 +-
 tests/ref/fate/filter-pixdesc-yuvj411p        |   2 +-
 tests/ref/fate/filter-pixdesc-yuvj420p        |   2 +-
 tests/ref/fate/filter-pixdesc-yuvj422p        |   2 +-
 tests/ref/fate/filter-pixdesc-yuvj440p        |   2 +-
 tests/ref/fate/filter-pixdesc-yuvj444p        |   2 +-
 tests/ref/fate/filter-pixfmts-copy            |  34 ++--
 tests/ref/fate/filter-pixfmts-crop            |  34 ++--
 tests/ref/fate/filter-pixfmts-field           |  34 ++--
 tests/ref/fate/filter-pixfmts-fieldorder      |  30 +--
 tests/ref/fate/filter-pixfmts-hflip           |  34 ++--
 tests/ref/fate/filter-pixfmts-il              |  34 ++--
 tests/ref/fate/filter-pixfmts-lut             |  18 +-
 tests/ref/fate/filter-pixfmts-null            |  34 ++--
 tests/ref/fate/filter-pixfmts-pad             |  22 +--
 tests/ref/fate/filter-pixfmts-pullup          |  10 +-
 tests/ref/fate/filter-pixfmts-rotate          |   4 +-
 tests/ref/fate/filter-pixfmts-scale           |  34 ++--
 tests/ref/fate/filter-pixfmts-swapuv          |  10 +-
 .../ref/fate/filter-pixfmts-tinterlace_cvlpf  |   8 +-
 .../ref/fate/filter-pixfmts-tinterlace_merge  |   8 +-
 tests/ref/fate/filter-pixfmts-tinterlace_pad  |   8 +-
 tests/ref/fate/filter-pixfmts-tinterlace_vlpf |   8 +-
 tests/ref/fate/filter-pixfmts-transpose       |  28 +--
 tests/ref/fate/filter-pixfmts-vflip           |  34 ++--
 tests/ref/fate/fitsenc-gray                   |   2 +-
 tests/ref/fate/fitsenc-gray16be               |  10 +-
 tests/ref/fate/gifenc-gray                    | 186 +++++++++---------
 tests/ref/fate/idroq-video-encode             |   2 +-
 tests/ref/fate/jpg-icc                        |   8 +-
 tests/ref/fate/sws-yuv-colorspace             |   2 +-
 tests/ref/fate/sws-yuv-range                  |   2 +-
 tests/ref/fate/vvc-conformance-SCALING_A_1    | 128 ++++++------
 tests/ref/lavf/gray16be.fits                  |   4 +-
 tests/ref/lavf/gray16be.pam                   |   4 +-
 tests/ref/lavf/gray16be.png                   |   6 +-
 tests/ref/lavf/jpg                            |   6 +-
 tests/ref/lavf/smjpeg                         |   6 +-
 tests/ref/pixfmt/yuvj420p                     |   2 +-
 tests/ref/pixfmt/yuvj422p                     |   2 +-
 tests/ref/pixfmt/yuvj440p                     |   2 +-
 tests/ref/pixfmt/yuvj444p                     |   2 +-
 tests/ref/seek/lavf-jpg                       |   8 +-
 tests/ref/seek/vsynth_lena-mjpeg              |  40 ++--
 tests/ref/seek/vsynth_lena-roqvideo           |   2 +-
 tests/ref/vsynth/vsynth1-amv                  |   8 +-
 tests/ref/vsynth/vsynth1-mjpeg                |   6 +-
 tests/ref/vsynth/vsynth1-mjpeg-422            |   6 +-
 tests/ref/vsynth/vsynth1-mjpeg-444            |   6 +-
 tests/ref/vsynth/vsynth1-mjpeg-huffman        |   6 +-
 tests/ref/vsynth/vsynth1-mjpeg-trell          |   8 +-
 tests/ref/vsynth/vsynth1-mjpeg-trell-huffman  |   8 +-
 tests/ref/vsynth/vsynth1-roqvideo             |   8 +-
 tests/ref/vsynth/vsynth2-amv                  |   6 +-
 tests/ref/vsynth/vsynth2-mjpeg                |   6 +-
 tests/ref/vsynth/vsynth2-mjpeg-422            |   6 +-
 tests/ref/vsynth/vsynth2-mjpeg-444            |   6 +-
 tests/ref/vsynth/vsynth2-mjpeg-huffman        |   6 +-
 tests/ref/vsynth/vsynth2-mjpeg-trell          |   8 +-
 tests/ref/vsynth/vsynth2-mjpeg-trell-huffman  |   8 +-
 tests/ref/vsynth/vsynth2-roqvideo             |   8 +-
 tests/ref/vsynth/vsynth3-amv                  |   8 +-
 tests/ref/vsynth/vsynth3-mjpeg                |   8 +-
 tests/ref/vsynth/vsynth3-mjpeg-422            |   8 +-
 tests/ref/vsynth/vsynth3-mjpeg-444            |   6 +-
 tests/ref/vsynth/vsynth3-mjpeg-huffman        |   8 +-
 tests/ref/vsynth/vsynth3-mjpeg-trell          |   6 +-
 tests/ref/vsynth/vsynth3-mjpeg-trell-huffman  |   6 +-
 tests/ref/vsynth/vsynth_lena-amv              |   6 +-
 tests/ref/vsynth/vsynth_lena-mjpeg            |   8 +-
 tests/ref/vsynth/vsynth_lena-mjpeg-422        |   6 +-
 tests/ref/vsynth/vsynth_lena-mjpeg-444        |   6 +-
 tests/ref/vsynth/vsynth_lena-mjpeg-huffman    |   8 +-
 tests/ref/vsynth/vsynth_lena-mjpeg-trell      |   8 +-
 .../vsynth/vsynth_lena-mjpeg-trell-huffman    |   8 +-
 tests/ref/vsynth/vsynth_lena-roqvideo         |   8 +-
 101 files changed, 1193 insertions(+), 833 deletions(-)