Message ID | 20240725155336.37121-1-remi@remlab.net |
---|---|
State | New |
Headers | show |
Series | [FFmpeg-devel] lavc/pixblockdsp: specialise aligned 16-bit get_pixels | expand |
Context | Check | Description |
---|---|---|
andriy/make_x86 | success | Make finished |
andriy/make_fate_x86 | success | Make fate finished |
On 7/25/2024 12:53 PM, Rémi Denis-Courmont wrote: > The current code assumes that we have unaligned rows, which hurts on > platforms with slower unaligned accesses. (Also, this lets the compiler > unroll manually, which it seems to do in practice.) > --- > libavcodec/pixblockdsp.c | 9 ++++++++- > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/libavcodec/pixblockdsp.c b/libavcodec/pixblockdsp.c > index bbbeca1618..1fff244511 100644 > --- a/libavcodec/pixblockdsp.c > +++ b/libavcodec/pixblockdsp.c > @@ -26,6 +26,13 @@ > > static void get_pixels_16_c(int16_t *restrict block, const uint8_t *pixels, > ptrdiff_t stride) Is there a way to hint the compiler that block is 16 byte aligned? GCC 14 at least emits unaligned loads and stores for these. > +{ > + for (int i = 0; i < 8; i++) > + AV_COPY128(block + i * 8, pixels + i * stride); > +} > + > +static void get_pixels_unaligned_16_c(int16_t *restrict block, > + const uint8_t *pixels, ptrdiff_t stride) > { > AV_COPY128U(block + 0 * 8, pixels + 0 * stride); > AV_COPY128U(block + 1 * 8, pixels + 1 * stride); > @@ -90,7 +97,7 @@ av_cold void ff_pixblockdsp_init(PixblockDSPContext *c, AVCodecContext *avctx) > case 10: > case 12: > case 14: > - c->get_pixels_unaligned = > + c->get_pixels_unaligned = get_pixels_unaligned_16_c; > c->get_pixels = get_pixels_16_c; > break; > default:
Le torstaina 25. heinäkuuta 2024, 19.16.21 EEST James Almer a écrit : > On 7/25/2024 12:53 PM, Rémi Denis-Courmont wrote: > > The current code assumes that we have unaligned rows, which hurts on > > platforms with slower unaligned accesses. (Also, this lets the compiler > > unroll manually, which it seems to do in practice.) > > --- > > > > libavcodec/pixblockdsp.c | 9 ++++++++- > > 1 file changed, 8 insertions(+), 1 deletion(-) > > > > diff --git a/libavcodec/pixblockdsp.c b/libavcodec/pixblockdsp.c > > index bbbeca1618..1fff244511 100644 > > --- a/libavcodec/pixblockdsp.c > > +++ b/libavcodec/pixblockdsp.c > > @@ -26,6 +26,13 @@ > > > > static void get_pixels_16_c(int16_t *restrict block, const uint8_t > > *pixels, > > > > ptrdiff_t stride) > > Is there a way to hint the compiler that block is 16 byte aligned? GCC > 14 at least emits unaligned loads and stores for these. We don't have uint128_t, so the best we could do is cast to uint64_t *. Though GCC 13 emits 64-bit loads and stores on RV64 here with the given code. Is this maybe a problem with the COPY128 macro definition on x86?
On 7/25/2024 1:50 PM, Rémi Denis-Courmont wrote: > Le torstaina 25. heinäkuuta 2024, 19.16.21 EEST James Almer a écrit : >> On 7/25/2024 12:53 PM, Rémi Denis-Courmont wrote: >>> The current code assumes that we have unaligned rows, which hurts on >>> platforms with slower unaligned accesses. (Also, this lets the compiler >>> unroll manually, which it seems to do in practice.) >>> --- >>> >>> libavcodec/pixblockdsp.c | 9 ++++++++- >>> 1 file changed, 8 insertions(+), 1 deletion(-) >>> >>> diff --git a/libavcodec/pixblockdsp.c b/libavcodec/pixblockdsp.c >>> index bbbeca1618..1fff244511 100644 >>> --- a/libavcodec/pixblockdsp.c >>> +++ b/libavcodec/pixblockdsp.c >>> @@ -26,6 +26,13 @@ >>> >>> static void get_pixels_16_c(int16_t *restrict block, const uint8_t >>> *pixels, >>> >>> ptrdiff_t stride) >> >> Is there a way to hint the compiler that block is 16 byte aligned? GCC >> 14 at least emits unaligned loads and stores for these. > > We don't have uint128_t, so the best we could do is cast to uint64_t *. Though > GCC 13 emits 64-bit loads and stores on RV64 here with the given code. Is this > maybe a problem with the COPY128 macro definition on x86? AV_COPY128 with GCC x86 uses aligned load intrinsics, but at least GCC 14 emits movdqu instructions here for some reason.
Le torstaina 25. heinäkuuta 2024, 21.25.11 EEST James Almer a écrit : > On 7/25/2024 1:50 PM, Rémi Denis-Courmont wrote: > > Le torstaina 25. heinäkuuta 2024, 19.16.21 EEST James Almer a écrit : > >> On 7/25/2024 12:53 PM, Rémi Denis-Courmont wrote: > >>> The current code assumes that we have unaligned rows, which hurts on > >>> platforms with slower unaligned accesses. (Also, this lets the compiler > >>> unroll manually, which it seems to do in practice.) > >>> --- > >>> > >>> libavcodec/pixblockdsp.c | 9 ++++++++- > >>> 1 file changed, 8 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/libavcodec/pixblockdsp.c b/libavcodec/pixblockdsp.c > >>> index bbbeca1618..1fff244511 100644 > >>> --- a/libavcodec/pixblockdsp.c > >>> +++ b/libavcodec/pixblockdsp.c > >>> @@ -26,6 +26,13 @@ > >>> > >>> static void get_pixels_16_c(int16_t *restrict block, const uint8_t > >>> *pixels, > >>> > >>> ptrdiff_t stride) > >> > >> Is there a way to hint the compiler that block is 16 byte aligned? GCC > >> 14 at least emits unaligned loads and stores for these. > > > > We don't have uint128_t, so the best we could do is cast to uint64_t *. > > Though GCC 13 emits 64-bit loads and stores on RV64 here with the given > > code. Is this maybe a problem with the COPY128 macro definition on x86? > > AV_COPY128 with GCC x86 uses aligned load intrinsics, but at least GCC > 14 emits movdqu instructions here for some reason. Another approach would be to define a structure with size and alignment of 16 bytes, but I am none too sure that compilers will like it all that much. TBH, I am merely aiming for 64-bit aligned loads and stores here, which is a big improvement over 8-bit ones.
diff --git a/libavcodec/pixblockdsp.c b/libavcodec/pixblockdsp.c index bbbeca1618..1fff244511 100644 --- a/libavcodec/pixblockdsp.c +++ b/libavcodec/pixblockdsp.c @@ -26,6 +26,13 @@ static void get_pixels_16_c(int16_t *restrict block, const uint8_t *pixels, ptrdiff_t stride) +{ + for (int i = 0; i < 8; i++) + AV_COPY128(block + i * 8, pixels + i * stride); +} + +static void get_pixels_unaligned_16_c(int16_t *restrict block, + const uint8_t *pixels, ptrdiff_t stride) { AV_COPY128U(block + 0 * 8, pixels + 0 * stride); AV_COPY128U(block + 1 * 8, pixels + 1 * stride); @@ -90,7 +97,7 @@ av_cold void ff_pixblockdsp_init(PixblockDSPContext *c, AVCodecContext *avctx) case 10: case 12: case 14: - c->get_pixels_unaligned = + c->get_pixels_unaligned = get_pixels_unaligned_16_c; c->get_pixels = get_pixels_16_c; break; default: