[FFmpeg-devel,1/3] lavc/vp8dsp: R-V V put_bilin_h

Message ID	CAEa-L+uTQAYtgBovGdc7aW69nwCUhAWU5jPkk=g4HJtcF=Xrug@mail.gmail.com
State	New
Headers	show Delivered-To: ffmpegpatchwork2@gmail.com Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; MIME-Version: 1.0 From: flow gg <hlefthleft@gmail.com> Date: Fri, 23 Feb 2024 22:45:46 +0800 Message-ID: <CAEa-L+uTQAYtgBovGdc7aW69nwCUhAWU5jPkk=g4HJtcF=Xrug@mail.gmail.com> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org> Content-Type: multipart/mixed; boundary="00000000000029a10806120d9e43" Subject: [FFmpeg-devel] [PATCH 1/3] lavc/vp8dsp: R-V V put_bilin_h Precedence: list Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org> Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
Series	[FFmpeg-devel,1/3] lavc/vp8dsp: R-V V put_bilin_h \| expand [FFmpeg-devel,1/3] lavc/vp8dsp: R-V V put_bilin_h [FFmpeg-devel,2/3] lavc/vp8dsp: R-V V put_bilin_v [FFmpeg-devel,3/3] lavc/vp8dsp: R-V V put_bilin_hv

flow gg Feb. 23, 2024, 2:45 p.m. UTC

Rémi Denis-Courmont Feb. 23, 2024, 5:17 p.m. UTC | #1

Hi,

+
+.macro bilin_h_load dst len
+.ifc \len,4
+        vsetivli        zero, 5, e8, mf2, ta, ma

Don't use fractional multipliers if you don't mix element widths.

+.elseif \len == 8
+        vsetivli        zero, 9, e8, m1, ta, ma
+.else
+        vsetivli        zero, 17, e8, m2, ta, ma
+.endif
+
+        vle8.v          \dst, (a2)
+        vslide1down.vx  v2, \dst, t5
+

+.ifc \len,4
+        vsetivli        zero, 4, e8, mf4, ta, ma

Same as above.

+.elseif \len == 8
+        vsetivli        zero, 8, e8, mf2, ta, ma

Also.

+.else
+        vsetivli        zero, 16, e8, m1, ta, ma
+.endif

+        vwmulu.vx       v28, \dst, t1
+        vwmaccu.vx      v28, a5, v2
+        vwaddu.wx       v24, v28, t4
+        vnsra.wi        \dst, v24, 3
+.endm
+
+.macro put_vp8_bilin_h len
+        li              t1, 8
+        li              t4, 4
+        li              t5, 1
+        sub             t1, t1, a5
+1:
+        addi            a4, a4, -1
+        bilin_h_load    v0, \len
+        vse8.v          v0, (a0)
+        add             a2, a2, a3
+        add             a0, a0, a1
+        bnez            a4, 1b
+
+        ret
+.endm
+
+func ff_put_vp8_bilin16_h_rvv, zve32x
+        put_vp8_bilin_h 16
+endfunc
+
+func ff_put_vp8_bilin8_h_rvv, zve32x
+        put_vp8_bilin_h 8
+endfunc
+
+func ff_put_vp8_bilin4_h_rvv, zve32x
+        put_vp8_bilin_h 4
+endfunc

flow gg Feb. 24, 2024, 1:07 a.m. UTC | #2

.ifc \len,4
-        vsetivli        zero, 5, e8, mf2, ta, ma
+        vsetivli        zero, 5, e8, m1, ta, ma
 .elseif \len == 8
         vsetivli        zero, 9, e8, m1, ta, ma
 .else
@@ -112,9 +112,9 @@ endfunc
         vslide1down.vx  v2, \dst, t5

 .ifc \len,4
-        vsetivli        zero, 4, e8, mf4, ta, ma
+        vsetivli        zero, 4, e8, m1, ta, ma
 .elseif \len == 8
-        vsetivli        zero, 8, e8, mf2, ta, ma
+        vsetivli        zero, 8, e8, m1, ta, ma

What are the benefits of not using fractional multipliers here? Making this
change would result in a 10%-20% slowdown.

                                              mf2/4   m1
vp8_put_bilin4_h_rvv_i32:   158.7   193.7
vp8_put_bilin4_hv_rvv_i32:  255.7   302.7
vp8_put_bilin8_h_rvv_i32:   318.7   358.7
vp8_put_bilin8_hv_rvv_i32:  528.7   569.7

Rémi Denis-Courmont <remi@remlab.net> 于2024年2月24日周六 01:18写道：

> Hi,
>
> +
> +.macro bilin_h_load dst len
> +.ifc \len,4
> +        vsetivli        zero, 5, e8, mf2, ta, ma
>
> Don't use fractional multipliers if you don't mix element widths.
>
> +.elseif \len == 8
> +        vsetivli        zero, 9, e8, m1, ta, ma
> +.else
> +        vsetivli        zero, 17, e8, m2, ta, ma
> +.endif
> +
> +        vle8.v          \dst, (a2)
> +        vslide1down.vx  v2, \dst, t5
> +
>
> +.ifc \len,4
> +        vsetivli        zero, 4, e8, mf4, ta, ma
>
> Same as above.
>
> +.elseif \len == 8
> +        vsetivli        zero, 8, e8, mf2, ta, ma
>
> Also.
>
> +.else
> +        vsetivli        zero, 16, e8, m1, ta, ma
> +.endif
>
> +        vwmulu.vx       v28, \dst, t1
> +        vwmaccu.vx      v28, a5, v2
> +        vwaddu.wx       v24, v28, t4
> +        vnsra.wi        \dst, v24, 3
> +.endm
> +
> +.macro put_vp8_bilin_h len
> +        li              t1, 8
> +        li              t4, 4
> +        li              t5, 1
> +        sub             t1, t1, a5
> +1:
> +        addi            a4, a4, -1
> +        bilin_h_load    v0, \len
> +        vse8.v          v0, (a0)
> +        add             a2, a2, a3
> +        add             a0, a0, a1
> +        bnez            a4, 1b
> +
> +        ret
> +.endm
> +
> +func ff_put_vp8_bilin16_h_rvv, zve32x
> +        put_vp8_bilin_h 16
> +endfunc
> +
> +func ff_put_vp8_bilin8_h_rvv, zve32x
> +        put_vp8_bilin_h 8
> +endfunc
> +
> +func ff_put_vp8_bilin4_h_rvv, zve32x
> +        put_vp8_bilin_h 4
> +endfunc
>
> --
> レミ・デニ-クールモン
> http://www.remlab.net/
>
>
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
>

Rémi Denis-Courmont Feb. 24, 2024, 7:38 a.m. UTC | #3

Hi,

Le 24 février 2024 03:07:36 GMT+02:00, flow gg <hlefthleft@gmail.com> a écrit :
> .ifc \len,4
>-        vsetivli        zero, 5, e8, mf2, ta, ma
>+        vsetivli        zero, 5, e8, m1, ta, ma
> .elseif \len == 8
>         vsetivli        zero, 9, e8, m1, ta, ma
> .else
>@@ -112,9 +112,9 @@ endfunc
>         vslide1down.vx  v2, \dst, t5
>
> .ifc \len,4
>-        vsetivli        zero, 4, e8, mf4, ta, ma
>+        vsetivli        zero, 4, e8, m1, ta, ma
> .elseif \len == 8
>-        vsetivli        zero, 8, e8, mf2, ta, ma
>+        vsetivli        zero, 8, e8, m1, ta, ma
>
>What are the benefits of not using fractional multipliers here?

Insofar as E8/MF4 is guaranteed to work for Zve32x, there are no benefits per se.

However fractional multipliers were added to the specification to enable addressing invididual vectors whilst the effective multiplier is larger than one. This can only happen with mixed widths. Fractions were not intended to make vector shorter - there is the vector length for that already.

That's why "E64/MF2" doesn't work, even though it's the same vector bit size as "E8/MF2".

> Making this
>change would result in a 10%-20% slowdown.

That's kind of odd. This may be caused by the slides, but it's strange to go out of the way for hardware to optimise a case that's not even intended.

>                                              mf2/4   m1
>vp8_put_bilin4_h_rvv_i32:   158.7   193.7
>vp8_put_bilin4_hv_rvv_i32:  255.7   302.7
>vp8_put_bilin8_h_rvv_i32:   318.7   358.7
>vp8_put_bilin8_hv_rvv_i32:  528.7   569.7
>
>Rémi Denis-Courmont <remi@remlab.net> 于2024年2月24日周六 01:18写道：
>
>> Hi,
>>
>> +
>> +.macro bilin_h_load dst len
>> +.ifc \len,4
>> +        vsetivli        zero, 5, e8, mf2, ta, ma
>>
>> Don't use fractional multipliers if you don't mix element widths.
>>
>> +.elseif \len == 8
>> +        vsetivli        zero, 9, e8, m1, ta, ma
>> +.else
>> +        vsetivli        zero, 17, e8, m2, ta, ma
>> +.endif
>> +
>> +        vle8.v          \dst, (a2)
>> +        vslide1down.vx  v2, \dst, t5
>> +
>>
>> +.ifc \len,4
>> +        vsetivli        zero, 4, e8, mf4, ta, ma
>>
>> Same as above.
>>
>> +.elseif \len == 8
>> +        vsetivli        zero, 8, e8, mf2, ta, ma
>>
>> Also.
>>
>> +.else
>> +        vsetivli        zero, 16, e8, m1, ta, ma
>> +.endif
>>
>> +        vwmulu.vx       v28, \dst, t1
>> +        vwmaccu.vx      v28, a5, v2
>> +        vwaddu.wx       v24, v28, t4
>> +        vnsra.wi        \dst, v24, 3
>> +.endm
>> +
>> +.macro put_vp8_bilin_h len
>> +        li              t1, 8
>> +        li              t4, 4
>> +        li              t5, 1
>> +        sub             t1, t1, a5
>> +1:
>> +        addi            a4, a4, -1
>> +        bilin_h_load    v0, \len
>> +        vse8.v          v0, (a0)
>> +        add             a2, a2, a3
>> +        add             a0, a0, a1
>> +        bnez            a4, 1b
>> +
>> +        ret
>> +.endm
>> +
>> +func ff_put_vp8_bilin16_h_rvv, zve32x
>> +        put_vp8_bilin_h 16
>> +endfunc
>> +
>> +func ff_put_vp8_bilin8_h_rvv, zve32x
>> +        put_vp8_bilin_h 8
>> +endfunc
>> +
>> +func ff_put_vp8_bilin4_h_rvv, zve32x
>> +        put_vp8_bilin_h 4
>> +endfunc
>>
>> --
>> レミ・デニ-クールモン
>> http://www.remlab.net/
>>
>>
>>
>> _______________________________________________
>> ffmpeg-devel mailing list
>> ffmpeg-devel@ffmpeg.org
>> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>>
>> To unsubscribe, visit link above, or email
>> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
>>
>_______________________________________________
>ffmpeg-devel mailing list
>ffmpeg-devel@ffmpeg.org
>https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
>To unsubscribe, visit link above, or email
>ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

flow gg Feb. 24, 2024, 8:31 a.m. UTC | #4

Okay, Thanks for clarifying.

I have used many fractional multipliers, mostly not for correctness, but
often for performance improvements (though I don't know why),
and there are no obvious downsides, How about leaving this code?

Rémi Denis-Courmont <remi@remlab.net> 于2024年2月24日周六 15:39写道：

> Hi,
>
> Le 24 février 2024 03:07:36 GMT+02:00, flow gg <hlefthleft@gmail.com> a
> écrit :
> > .ifc \len,4
> >-        vsetivli        zero, 5, e8, mf2, ta, ma
> >+        vsetivli        zero, 5, e8, m1, ta, ma
> > .elseif \len == 8
> >         vsetivli        zero, 9, e8, m1, ta, ma
> > .else
> >@@ -112,9 +112,9 @@ endfunc
> >         vslide1down.vx  v2, \dst, t5
> >
> > .ifc \len,4
> >-        vsetivli        zero, 4, e8, mf4, ta, ma
> >+        vsetivli        zero, 4, e8, m1, ta, ma
> > .elseif \len == 8
> >-        vsetivli        zero, 8, e8, mf2, ta, ma
> >+        vsetivli        zero, 8, e8, m1, ta, ma
> >
> >What are the benefits of not using fractional multipliers here?
>
> Insofar as E8/MF4 is guaranteed to work for Zve32x, there are no benefits
> per se.
>
> However fractional multipliers were added to the specification to enable
> addressing invididual vectors whilst the effective multiplier is larger
> than one. This can only happen with mixed widths. Fractions were not
> intended to make vector shorter - there is the vector length for that
> already.
>
> That's why "E64/MF2" doesn't work, even though it's the same vector bit
> size as "E8/MF2".
>
> > Making this
> >change would result in a 10%-20% slowdown.
>
> That's kind of odd. This may be caused by the slides, but it's strange to
> go out of the way for hardware to optimise a case that's not even intended.
>
> >                                              mf2/4   m1
> >vp8_put_bilin4_h_rvv_i32:   158.7   193.7
> >vp8_put_bilin4_hv_rvv_i32:  255.7   302.7
> >vp8_put_bilin8_h_rvv_i32:   318.7   358.7
> >vp8_put_bilin8_hv_rvv_i32:  528.7   569.7
> >
> >Rémi Denis-Courmont <remi@remlab.net> 于2024年2月24日周六 01:18写道：
> >
> >> Hi,
> >>
> >> +
> >> +.macro bilin_h_load dst len
> >> +.ifc \len,4
> >> +        vsetivli        zero, 5, e8, mf2, ta, ma
> >>
> >> Don't use fractional multipliers if you don't mix element widths.
> >>
> >> +.elseif \len == 8
> >> +        vsetivli        zero, 9, e8, m1, ta, ma
> >> +.else
> >> +        vsetivli        zero, 17, e8, m2, ta, ma
> >> +.endif
> >> +
> >> +        vle8.v          \dst, (a2)
> >> +        vslide1down.vx  v2, \dst, t5
> >> +
> >>
> >> +.ifc \len,4
> >> +        vsetivli        zero, 4, e8, mf4, ta, ma
> >>
> >> Same as above.
> >>
> >> +.elseif \len == 8
> >> +        vsetivli        zero, 8, e8, mf2, ta, ma
> >>
> >> Also.
> >>
> >> +.else
> >> +        vsetivli        zero, 16, e8, m1, ta, ma
> >> +.endif
> >>
> >> +        vwmulu.vx       v28, \dst, t1
> >> +        vwmaccu.vx      v28, a5, v2
> >> +        vwaddu.wx       v24, v28, t4
> >> +        vnsra.wi        \dst, v24, 3
> >> +.endm
> >> +
> >> +.macro put_vp8_bilin_h len
> >> +        li              t1, 8
> >> +        li              t4, 4
> >> +        li              t5, 1
> >> +        sub             t1, t1, a5
> >> +1:
> >> +        addi            a4, a4, -1
> >> +        bilin_h_load    v0, \len
> >> +        vse8.v          v0, (a0)
> >> +        add             a2, a2, a3
> >> +        add             a0, a0, a1
> >> +        bnez            a4, 1b
> >> +
> >> +        ret
> >> +.endm
> >> +
> >> +func ff_put_vp8_bilin16_h_rvv, zve32x
> >> +        put_vp8_bilin_h 16
> >> +endfunc
> >> +
> >> +func ff_put_vp8_bilin8_h_rvv, zve32x
> >> +        put_vp8_bilin_h 8
> >> +endfunc
> >> +
> >> +func ff_put_vp8_bilin4_h_rvv, zve32x
> >> +        put_vp8_bilin_h 4
> >> +endfunc
> >>
> >> --
> >> レミ・デニ-クールモン
> >> http://www.remlab.net/
> >>
> >>
> >>
> >> _______________________________________________
> >> ffmpeg-devel mailing list
> >> ffmpeg-devel@ffmpeg.org
> >> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
> >>
> >> To unsubscribe, visit link above, or email
> >> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
> >>
> >_______________________________________________
> >ffmpeg-devel mailing list
> >ffmpeg-devel@ffmpeg.org
> >https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
> >
> >To unsubscribe, visit link above, or email
> >ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
>

Rémi Denis-Courmont Feb. 28, 2024, 8:25 p.m. UTC | #5

Le lauantaina 24. helmikuuta 2024, 10.31.36 EET flow gg a écrit :
> Okay, Thanks for clarifying.
> 
> I have used many fractional multipliers, mostly not for correctness, but
> often for performance improvements (though I don't know why),
> and there are no obvious downsides, How about leaving this code?

In this case, it does not affect the baseline requirements. It will be 
problematic if performance ends up worse on other future designs, but we can 
cross that bridge if and then.

Rémi Denis-Courmont March 3, 2024, 2:39 p.m. UTC | #6

Le perjantaina 23. helmikuuta 2024, 16.45.46 EET flow gg a écrit :
> 

Looks like this needs rebasing, or otherwise does not apply.

flow gg March 3, 2024, 3:03 p.m. UTC | #7

Sorry since I did not send the emails all at once, so cannot apply all 4
patches together with git am *.patch. Instead, it needs to first apply the
patch with 'git am '[PATCH] lavc/vp8dsp: R-V V put_vp8_pixels'', and then
apply the patches 1-3 in the series with 'git am *.patch'.

Rémi Denis-Courmont <remi@remlab.net> 于2024年3月3日周日 22:39写道：

> Le perjantaina 23. helmikuuta 2024, 16.45.46 EET flow gg a écrit :
> >
>
> Looks like this needs rebasing, or otherwise does not apply.
>
> --
> Rémi Denis-Courmont
> http://www.remlab.net/
>
>
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
>

flow gg March 17, 2024, 4:42 p.m. UTC | #8

ping

flow gg <hlefthleft@gmail.com> 于2024年3月3日周日 23:03写道：

> Sorry since I did not send the emails all at once, so cannot apply all 4
> patches together with git am *.patch. Instead, it needs to first apply the
> patch with 'git am '[PATCH] lavc/vp8dsp: R-V V put_vp8_pixels'', and then
> apply the patches 1-3 in the series with 'git am *.patch'.
>
> Rémi Denis-Courmont <remi@remlab.net> 于2024年3月3日周日 22:39写道：
>
>> Le perjantaina 23. helmikuuta 2024, 16.45.46 EET flow gg a écrit :
>> >
>>
>> Looks like this needs rebasing, or otherwise does not apply.
>>
>> --
>> Rémi Denis-Courmont
>> http://www.remlab.net/
>>
>>
>>
>> _______________________________________________
>> ffmpeg-devel mailing list
>> ffmpeg-devel@ffmpeg.org
>> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>>
>> To unsubscribe, visit link above, or email
>> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
>>
>

[FFmpeg-devel,1/3] lavc/vp8dsp: R-V V put_bilin_h

Commit Message

Comments

Patch