diff mbox series

[FFmpeg-devel,4/7] checkasm: use pointers for start/stop functions

Message ID 20230714182835.66326-4-remi@remlab.net
State New
Headers show
Series checkasm RISC-V Linux perf enablement | expand

Checks

Context Check Description
yinshiyou/make_loongarch64 success Make finished
yinshiyou/make_fate_loongarch64 success Make fate finished
andriy/make_x86 success Make finished
andriy/make_fate_x86 success Make fate finished

Commit Message

Rémi Denis-Courmont July 14, 2023, 6:28 p.m. UTC
This makes all calls to the bench start and stop functions via
function pointers. While the primary goal is to support run-time
selection of the performance measurement back-end in later commits,
this has the side benefit of containing platform dependencies in to
checkasm.c and out of checkasm.h.
---
 tests/checkasm/checkasm.c | 33 ++++++++++++++++++++++++++++-----
 tests/checkasm/checkasm.h | 31 ++++---------------------------
 2 files changed, 32 insertions(+), 32 deletions(-)

Comments

Lynne July 15, 2023, 8:05 a.m. UTC | #1
Jul 14, 2023, 20:29 by remi@remlab.net:

> This makes all calls to the bench start and stop functions via
> function pointers. While the primary goal is to support run-time
> selection of the performance measurement back-end in later commits,
> this has the side benefit of containing platform dependencies in to
> checkasm.c and out of checkasm.h.
> ---
>  tests/checkasm/checkasm.c | 33 ++++++++++++++++++++++++++++-----
>  tests/checkasm/checkasm.h | 31 ++++---------------------------
>  2 files changed, 32 insertions(+), 32 deletions(-)
>

Not sure I agree with this commit, the overhead can be detectable,
and we have a lot of small functions with runtime a few times that
of a null function call.
Can you store the function pointers out of the loop to reduce
the derefs needed?
Rémi Denis-Courmont July 15, 2023, 8:25 a.m. UTC | #2
Le lauantaina 15. heinäkuuta 2023, 11.05.51 EEST Lynne a écrit :
> Jul 14, 2023, 20:29 by remi@remlab.net:
> > This makes all calls to the bench start and stop functions via
> > function pointers. While the primary goal is to support run-time
> > selection of the performance measurement back-end in later commits,
> > this has the side benefit of containing platform dependencies in to
> > checkasm.c and out of checkasm.h.
> > ---
> > 
> >  tests/checkasm/checkasm.c | 33 ++++++++++++++++++++++++++++-----
> >  tests/checkasm/checkasm.h | 31 ++++---------------------------
> >  2 files changed, 32 insertions(+), 32 deletions(-)
> 
> Not sure I agree with this commit, the overhead can be detectable,
> and we have a lot of small functions with runtime a few times that
> of a null function call.

I don't think the function call is ever null. The pointers are left NULL only 
if none of the backend initialise. But then, checkasm will bail out and exit 
before we try to benchmark anything anyway.

As for the real functions, they always do *something*. None of them "just 
return 0".

> Can you store the function pointers out of the loop to reduce
> the derefs needed?

Taking just the two loads is out of the loop should be feasible but it seems a 
rather vain. You will still have the overhead of the indirect function call, 
the function, and most importantly in the case of Linux perf and MacOS kperf, 
the system calls.

The only way to avoid the indirect function calls are to use IFUNC (tricky and 
not portable), or to make horrible macros to spawn one bench loop for each 
backend.

In the end, I think we should rather aim for as constant time as possible, 
rather than as fast as possible, so that the nop loop can estimate the 
benchmarking overhead as well as possible. In this respect, I think it is 
actually marginally better *not* to cache the function pointers in local 
variables, which could end up spilled on the stack, or not, depending on local 
compiler optimisations for any given test case.
Lynne July 15, 2023, 5:43 p.m. UTC | #3
Jul 15, 2023, 10:26 by remi@remlab.net:

> Le lauantaina 15. heinäkuuta 2023, 11.05.51 EEST Lynne a écrit :
>
>> Jul 14, 2023, 20:29 by remi@remlab.net:
>> > This makes all calls to the bench start and stop functions via
>> > function pointers. While the primary goal is to support run-time
>> > selection of the performance measurement back-end in later commits,
>> > this has the side benefit of containing platform dependencies in to
>> > checkasm.c and out of checkasm.h.
>> > ---
>> > 
>> >  tests/checkasm/checkasm.c | 33 ++++++++++++++++++++++++++++-----
>> >  tests/checkasm/checkasm.h | 31 ++++---------------------------
>> >  2 files changed, 32 insertions(+), 32 deletions(-)
>>
>> Not sure I agree with this commit, the overhead can be detectable,
>> and we have a lot of small functions with runtime a few times that
>> of a null function call.
>>
>
> I don't think the function call is ever null. The pointers are left NULL only 
> if none of the backend initialise. But then, checkasm will bail out and exit 
> before we try to benchmark anything anyway.
>
> As for the real functions, they always do *something*. None of them "just 
> return 0".
>

I meant a no-op function call to measure the overhead of function
calls themselves, complete with all the ABI stuff.



>> Can you store the function pointers out of the loop to reduce
>> the derefs needed?
>>
>
> Taking just the two loads is out of the loop should be feasible but it seems a 
> rather vain. You will still have the overhead of the indirect function call, 
> the function, and most importantly in the case of Linux perf and MacOS kperf, 
> the system calls.
>
> The only way to avoid the indirect function calls are to use IFUNC (tricky and 
> not portable), or to make horrible macros to spawn one bench loop for each 
> backend.
>
> In the end, I think we should rather aim for as constant time as possible, 
> rather than as fast as possible, so that the nop loop can estimate the 
> benchmarking overhead as well as possible. In this respect, I think it is 
> actually marginally better *not* to cache the function pointers in local 
> variables, which could end up spilled on the stack, or not, depending on local 
> compiler optimisations for any given test case.
>

I disagree, uninlining the timer fetches adds another source of
inconsistency. It may be messy, but I think accuracy here is more
important than cleanliness, especially as it's a development tool.
Rémi Denis-Courmont July 15, 2023, 8:13 p.m. UTC | #4
Le lauantaina 15. heinäkuuta 2023, 20.43.26 EEST Lynne a écrit :
> Jul 15, 2023, 10:26 by remi@remlab.net:
> > Le lauantaina 15. heinäkuuta 2023, 11.05.51 EEST Lynne a écrit :
> >> Jul 14, 2023, 20:29 by remi@remlab.net:
> >> > This makes all calls to the bench start and stop functions via
> >> > function pointers. While the primary goal is to support run-time
> >> > selection of the performance measurement back-end in later commits,
> >> > this has the side benefit of containing platform dependencies in to
> >> > checkasm.c and out of checkasm.h.
> >> > ---
> >> > 
> >> >  tests/checkasm/checkasm.c | 33 ++++++++++++++++++++++++++++-----
> >> >  tests/checkasm/checkasm.h | 31 ++++---------------------------
> >> >  2 files changed, 32 insertions(+), 32 deletions(-)
> >> 
> >> Not sure I agree with this commit, the overhead can be detectable,
> >> and we have a lot of small functions with runtime a few times that
> >> of a null function call.
> > 
> > I don't think the function call is ever null. The pointers are left NULL
> > only if none of the backend initialise. But then, checkasm will bail out
> > and exit before we try to benchmark anything anyway.
> > 
> > As for the real functions, they always do *something*. None of them "just
> > return 0".
> 
> I meant a no-op function call to measure the overhead of function
> calls themselves, complete with all the ABI stuff.

I

> 
> >> Can you store the function pointers out of the loop to reduce
> >> the derefs needed?
> > 
> > Taking just the two loads is out of the loop should be feasible but it
> > seems a rather vain. You will still have the overhead of the indirect
> > function call, the function, and most importantly in the case of Linux
> > perf and MacOS kperf, the system calls.
> > 
> > The only way to avoid the indirect function calls are to use IFUNC (tricky
> > and not portable), or to make horrible macros to spawn one bench loop for
> > each backend.
> > 
> > In the end, I think we should rather aim for as constant time as possible,
> > rather than as fast as possible, so that the nop loop can estimate the
> > benchmarking overhead as well as possible. In this respect, I think it is
> > actually marginally better *not* to cache the function pointers in local
> > variables, which could end up spilled on the stack, or not, depending on
> > local compiler optimisations for any given test case.
> 
> I disagree, uninlining the timer fetches adds another source of
> inconsistency.

Err, outlining the timer makes sure that it's always the exact same code 
that's run, and not differently optimised inlinings, at least if LTO is absent. 
(And even with LTO, it vastly reduces the compiler's ability to optimise and 
vary the compilation.) Again, given how the calculations are made at the 
moment, the stability of the overhead is important, so that we can *compare* 
measurements. The absolute value of the overhead, not so much.

But I still argue that that is, either way, completely negligible compared to 
the *existing* overhead. Each loop is making 4 system calls, and each of those 
system call requires a direct call (to PLT) and an indirect branch (from GOT). 
If you have a problem with the two additional function calls, then you can't 
be using Linux perf in the first place.
Lynne July 16, 2023, 8:32 p.m. UTC | #5
Jul 15, 2023, 22:13 by remi@remlab.net:

> Le lauantaina 15. heinäkuuta 2023, 20.43.26 EEST Lynne a écrit :
>
>> Jul 15, 2023, 10:26 by remi@remlab.net:
>> > Le lauantaina 15. heinäkuuta 2023, 11.05.51 EEST Lynne a écrit :
>> >> Jul 14, 2023, 20:29 by remi@remlab.net:
>> >> > This makes all calls to the bench start and stop functions via
>> >> > function pointers. While the primary goal is to support run-time
>> >> > selection of the performance measurement back-end in later commits,
>> >> > this has the side benefit of containing platform dependencies in to
>> >> > checkasm.c and out of checkasm.h.
>> >> > ---
>> >> > 
>> >> >  tests/checkasm/checkasm.c | 33 ++++++++++++++++++++++++++++-----
>> >> >  tests/checkasm/checkasm.h | 31 ++++---------------------------
>> >> >  2 files changed, 32 insertions(+), 32 deletions(-)
>> >> 
>> >> Not sure I agree with this commit, the overhead can be detectable,
>> >> and we have a lot of small functions with runtime a few times that
>> >> of a null function call.
>> > 
>> > I don't think the function call is ever null. The pointers are left NULL
>> > only if none of the backend initialise. But then, checkasm will bail out
>> > and exit before we try to benchmark anything anyway.
>> > 
>> > As for the real functions, they always do *something*. None of them "just
>> > return 0".
>>
>> I meant a no-op function call to measure the overhead of function
>> calls themselves, complete with all the ABI stuff.
>>
>
> I
>
>>
>> >> Can you store the function pointers out of the loop to reduce
>> >> the derefs needed?
>> > 
>> > Taking just the two loads is out of the loop should be feasible but it
>> > seems a rather vain. You will still have the overhead of the indirect
>> > function call, the function, and most importantly in the case of Linux
>> > perf and MacOS kperf, the system calls.
>> > 
>> > The only way to avoid the indirect function calls are to use IFUNC (tricky
>> > and not portable), or to make horrible macros to spawn one bench loop for
>> > each backend.
>> > 
>> > In the end, I think we should rather aim for as constant time as possible,
>> > rather than as fast as possible, so that the nop loop can estimate the
>> > benchmarking overhead as well as possible. In this respect, I think it is
>> > actually marginally better *not* to cache the function pointers in local
>> > variables, which could end up spilled on the stack, or not, depending on
>> > local compiler optimisations for any given test case.
>>
>> I disagree, uninlining the timer fetches adds another source of
>> inconsistency.
>>
>
> Err, outlining the timer makes sure that it's always the exact same code 
> that's run, and not differently optimised inlinings, at least if LTO is absent. 
> (And even with LTO, it vastly reduces the compiler's ability to optimise and 
> vary the compilation.) Again, given how the calculations are made at the 
> moment, the stability of the overhead is important, so that we can *compare* 
> measurements. The absolute value of the overhead, not so much.
>

Introducing additional overhead in the form of a dereference is a point
where instability can creep in. Can you guarantee that a context will
always remain in L1D cache, as opposed to just reading the raw CPU timing
directly where that's supported.


> But I still argue that that is, either way, completely negligible compared to 
> the *existing* overhead. Each loop is making 4 system calls, and each of those 
> system call requires a direct call (to PLT) and an indirect branch (from GOT). 
> If you have a problem with the two additional function calls, then you can't 
> be using Linux perf in the first place.
>

You don't want to ever use linux perf in the first place, it's second class.
I don't think it's worth changing the direct inlining we had before. You're not
interested in whether or not the same exact code is ran between platforms,
just that the code that's measuring timing is as efficient and low overhead
as possible.
Rémi Denis-Courmont July 17, 2023, 5:18 a.m. UTC | #6
Le sunnuntaina 16. heinäkuuta 2023, 23.32.21 EEST Lynne a écrit :
> Introducing additional overhead in the form of a dereference is a point
> where instability can creep in. Can you guarantee that a context will
> always remain in L1D cache,

L1D is not involved here. In version 2, the pointers are cached locally.

> as opposed to just reading the raw CPU timing
> directly where that's supported.

Of course not. Raw CPU timing is subject to noise from interrupts (and 
whatever those interrupts trigger). And that's not just theoretical. I've 
experienced it and it sucks. Raw CPU timing is much noisier than Linux perf.

And because it has also been proven vastly insecure, it's been disabled on Arm 
for a long time, and is being disabled on RISC-V too now.

> > But I still argue that that is, either way, completely negligible compared
> > to the *existing* overhead. Each loop is making 4 system calls, and each
> > of those system call requires a direct call (to PLT) and an indirect
> > branch (from GOT). If you have a problem with the two additional function
> > calls, then you can't be using Linux perf in the first place.
> 
> You don't want to ever use linux perf in the first place, it's second class.

No it isn't. The interface is more involved than just reading a CSR; and sure 
I'd prefer the simple interface that RDCYCLE is all other things being equal. 
But other things are not equal. Linux perf is in fact *more* accurate by 
virtue of not *wrongly* counting other things. And it does not threaten the 
security of the entire system, so it will work inside a rented VM or an 
unprivileged process.

> I don't think it's worth changing the direct inlining we had before. You're
> not interested in whether or not the same exact code is ran between
> platforms,

Err, I am definitely interested in doing exactly that. I don't want to have to 
reconfigure and recompile the entire FFmpeg just to switch between Linux perf 
and raw cycle counter. A contrario, I *do* want to compare performance between 
vendors once the hardware is available.

> just that the code that's measuring timing is as efficient and
> low overhead as possible.

Of course not. Low overhead is irrelevant here. The measurement overhead is 
know and is subtracted. What we need is stable/reproducible overhead, and 
accurate measurements.

And that's assuming the stuff works at all. You can argue that we should use 
Arm PMU and RISC-V RDCYCLE, and that Linux perf sucks, all you want. PMU 
access will just throw a SIGILL and end the checkasm process with zero 
measurements. The rest of the industry wants to use system calls for informed 
reasons. I don't think you, or even the whole FFmpeg project, can win that 
argument against OS and CPU vendors.
Lynne July 17, 2023, 5:48 p.m. UTC | #7
Jul 17, 2023, 07:18 by remi@remlab.net:

> Le sunnuntaina 16. heinäkuuta 2023, 23.32.21 EEST Lynne a écrit :
>
>> Introducing additional overhead in the form of a dereference is a point
>> where instability can creep in. Can you guarantee that a context will
>> always remain in L1D cache,
>>
>
> L1D is not involved here. In version 2, the pointers are cached locally.
>
>> as opposed to just reading the raw CPU timing
>> directly where that's supported.
>>
>
> Of course not. Raw CPU timing is subject to noise from interrupts (and 
> whatever those interrupts trigger). And that's not just theoretical. I've 
> experienced it and it sucks. Raw CPU timing is much noisier than Linux perf.
>
> And because it has also been proven vastly insecure, it's been disabled on Arm 
> for a long time, and is being disabled on RISC-V too now.
>
>> > But I still argue that that is, either way, completely negligible compared
>> > to the *existing* overhead. Each loop is making 4 system calls, and each
>> > of those system call requires a direct call (to PLT) and an indirect
>> > branch (from GOT). If you have a problem with the two additional function
>> > calls, then you can't be using Linux perf in the first place.
>>
>> You don't want to ever use linux perf in the first place, it's second class.
>>
>
> No it isn't. The interface is more involved than just reading a CSR; and sure 
> I'd prefer the simple interface that RDCYCLE is all other things being equal. 
> But other things are not equal. Linux perf is in fact *more* accurate by 
> virtue of not *wrongly* counting other things. And it does not threaten the 
> security of the entire system, so it will work inside a rented VM or an 
> unprivileged process.
>

Threaten? This is a development tool first and foremost.
If anyone doesn't want to use rdcycle, they can use linux perf, it still works,
with or without the patch.


>> I don't think it's worth changing the direct inlining we had before. You're
>> not interested in whether or not the same exact code is ran between
>> platforms,
>>
>
> Err, I am definitely interested in doing exactly that. I don't want to have to 
> reconfigure and recompile the entire FFmpeg just to switch between Linux perf 
> and raw cycle counter. A contrario, I *do* want to compare performance between 
> vendors once the hardware is available.
>

That's a weak reason to compromise the accuracy of a development tool.


>> just that the code that's measuring timing is as efficient and
>> low overhead as possible.
>>
>
> Of course not. Low overhead is irrelevant here. The measurement overhead is 
> know and is subtracted. What we need is stable/reproducible overhead, and 
> accurate measurements.
>

Which is what TSC or the equivalent gets you. It's noisy, but that's because
it's better and higher accuracy than having to roundtrip through the kernel.


> And that's assuming the stuff works at all. You can argue that we should use 
> Arm PMU and RISC-V RDCYCLE, and that Linux perf sucks, all you want. PMU 
> access will just throw a SIGILL and end the checkasm process with zero 
> measurements. The rest of the industry wants to use system calls for informed 
> reasons. I don't think you, or even the whole FFmpeg project, can win that 
> argument against OS and CPU vendors.
>

Either way, I don't agree with this patch, not accepting it.
Rémi Denis-Courmont July 17, 2023, 6:09 p.m. UTC | #8
Le maanantaina 17. heinäkuuta 2023, 20.48.40 EEST Lynne a écrit :
> >> > But I still argue that that is, either way, completely negligible
> >> > compared
> >> > to the *existing* overhead. Each loop is making 4 system calls, and
> >> > each
> >> > of those system call requires a direct call (to PLT) and an indirect
> >> > branch (from GOT). If you have a problem with the two additional
> >> > function
> >> > calls, then you can't be using Linux perf in the first place.
> >> 
> >> You don't want to ever use linux perf in the first place, it's second
> >> class.> 
> > No it isn't. The interface is more involved than just reading a CSR; and
> > sure I'd prefer the simple interface that RDCYCLE is all other things
> > being equal. But other things are not equal. Linux perf is in fact *more*
> > accurate by virtue of not *wrongly* counting other things. And it does
> > not threaten the security of the entire system, so it will work inside a
> > rented VM or an unprivileged process.
> 
> Threaten?

User-space access to the cycle counter has been deemed a security threat due 
to the Cycle Drift attack, and is disabled as of OpenSBI 1.3.

If FFmpeg does not support Linux perf, FFmpeg will get _no_ performance 
benchmarks on Linux.

> This is a development tool first and foremost.

A development tool is not a justification for leaving a security whole in the 
system. I don't make the rules, and you don't either. OpenSBI and Linux make 
them.

> If anyone doesn't want to use rdcycle, they can use linux perf, it still
> works, with or without the patch.

It does not.

> >> I don't think it's worth changing the direct inlining we had before.
> >> You're
> >> not interested in whether or not the same exact code is ran between
> >> platforms,
> > 
> > Err, I am definitely interested in doing exactly that. I don't want to
> > have to reconfigure and recompile the entire FFmpeg just to switch
> > between Linux perf and raw cycle counter. A contrario, I *do* want to
> > compare performance between vendors once the hardware is available.
> 
> That's a weak reason to compromise the accuracy of a development tool.

This is not compromising any accuracy of any development tool. Again, in a 
scenario where both RDCYCLE and Linux perf work, Linux perf is more accurate 
for reasons that I already outlined in previous messages. And on systems with 
newer OpenSBI and kernel, RDCYCLE does not work at all.

> >> just that the code that's measuring timing is as efficient and
> >> low overhead as possible.
> > 
> > Of course not. Low overhead is irrelevant here. The measurement overhead
> > is know and is subtracted. What we need is stable/reproducible overhead,
> > and accurate measurements.
> 
> Which is what TSC or the equivalent gets you. It's noisy, but that's because
> it's better and higher accuracy than having to roundtrip through the
> kernel.

We _could_ use RDTIME. That's not blocked.

And while that should be "low overhead", but measuring time is also obviously 
way _less_ accurate than measuring cycles (RDCYCLE), which is in turn _less_ 
accurate than measuring cycles only in user mode and only of the current 
process (Linux perf).

> Either way, I don't agree with this patch, not accepting it.

The only vaguely valid reason you've given is that this should cache the 
function pointers locally, which version 2 does.

All you've done is make it abundantly clear that you don't like Linux perf and 
don't care about the rationale for this patch because it doesn't suit your 
personal preferences, especially on IRC:

20:49 <@Lynne> "security"
20:49 <@Lynne> it's a fucking timer
20:49 <@Lynne> "insecure"
20:51 <@Lynne> computers are very good at knowing how much time has passed, to 
               the point of the speed of light being an issue
20:52 <@Lynne> getting rid of this because some script kiddie could 
potentially 
               figure out a bit by calling this a million times while the CPU 
               is busy is paranoid
20:58 <@Lynne> it's a cache issue, so you fix the cache lifetime, not nerf a 
               timer

And it's not a cache issue. Cycle Drift is not a cache issue, it's an 
instruction timing issue.

Since you're giving zero valid reasons, I'm invoking the TC.
Lynne July 18, 2023, 9:32 p.m. UTC | #9
Jul 17, 2023, 20:09 by remi@remlab.net:

> Le maanantaina 17. heinäkuuta 2023, 20.48.40 EEST Lynne a écrit :
>
>> >> > But I still argue that that is, either way, completely negligible
>> >> > compared
>> >> > to the *existing* overhead. Each loop is making 4 system calls, and
>> >> > each
>> >> > of those system call requires a direct call (to PLT) and an indirect
>> >> > branch (from GOT). If you have a problem with the two additional
>> >> > function
>> >> > calls, then you can't be using Linux perf in the first place.
>> >>
>> >> You don't want to ever use linux perf in the first place, it's second
>> >> class.>
>> > No it isn't. The interface is more involved than just reading a CSR; and
>> > sure I'd prefer the simple interface that RDCYCLE is all other things
>> > being equal. But other things are not equal. Linux perf is in fact *more*
>> > accurate by virtue of not *wrongly* counting other things. And it does
>> > not threaten the security of the entire system, so it will work inside a
>> > rented VM or an unprivileged process.
>>
>> Threaten?
>>
>
> User-space access to the cycle counter has been deemed a security threat due
> to the Cycle Drift attack, and is disabled as of OpenSBI 1.3.
>
> If FFmpeg does not support Linux perf, FFmpeg will get _no_ performance
> benchmarks on Linux.
>

It does support linux perf. I've used it many times.


>> This is a development tool first and foremost.
>>
>
> A development tool is not a justification for leaving a security whole in the
> system. I don't make the rules, and you don't either. OpenSBI and Linux make
> them.
>

You're not required to leave them on always. It's a development tool.
You don't disable frequency scaling on all the time on other platforms,
just when doing benchmarks, which is what the tool is meant for.

For ARM, we routinely use an external kernel module to enable
the performance timers just so we avoid linux perf.
Specifically on arm, the linux perf noise is very high, which is where
my concerns are.


>> If anyone doesn't want to use rdcycle, they can use linux perf, it still
>> works, with or without the patch.
>>
>
> It does not.
>

Why not? It works on arm. The same exact code is used on risc-v
and on any other platform too.


>> >> I don't think it's worth changing the direct inlining we had before.
>> >> You're
>> >> not interested in whether or not the same exact code is ran between
>> >> platforms,
>> >
>> > Err, I am definitely interested in doing exactly that. I don't want to
>> > have to reconfigure and recompile the entire FFmpeg just to switch
>> > between Linux perf and raw cycle counter. A contrario, I *do* want to
>> > compare performance between vendors once the hardware is available.
>>
>> That's a weak reason to compromise the accuracy of a development tool.
>>
>
> This is not compromising any accuracy of any development tool. Again, in a
> scenario where both RDCYCLE and Linux perf work, Linux perf is more accurate
> for reasons that I already outlined in previous messages. And on systems with
> newer OpenSBI and kernel, RDCYCLE does not work at all.
>
>> >> just that the code that's measuring timing is as efficient and
>> >> low overhead as possible.
>> >
>> > Of course not. Low overhead is irrelevant here. The measurement overhead
>> > is know and is subtracted. What we need is stable/reproducible overhead,
>> > and accurate measurements.
>>
>> Which is what TSC or the equivalent gets you. It's noisy, but that's because
>> it's better and higher accuracy than having to roundtrip through the
>> kernel.
>>
>
> We _could_ use RDTIME. That's not blocked.
>
> And while that should be "low overhead", but measuring time is also obviously
> way _less_ accurate than measuring cycles (RDCYCLE), which is in turn _less_
> accurate than measuring cycles only in user mode and only of the current
> process (Linux perf).
>
>> Either way, I don't agree with this patch, not accepting it.
>>
>
> The only vaguely valid reason you've given is that this should cache the
> function pointers locally, which version 2 does.
>

I disagree with it doing an indirection at all.
It's overhead and a potential source of inaccuracy.
It's up to you to prove it wouldn't affect major platforms.


> Since you're giving zero valid reasons, I'm invoking the TC.
>

What, you expect them to materialize an opinion out of thin air?
You have to call for a vote. And bring up why it hasn't been changed
in 3 years, even though in the duties listed, it's only supposed
to last for a year before elections are called again. Which possibly
involves doing elections again.

You could've asked the actual maintainer, and one who knows the
most about the code, Gramner. But you're focusing too hard
on arguing about irrelevant issues, even bringing up security,
rather than using reasoning or clearly describing the need for the patch.

> All you've done is make it abundantly clear that you don't like Linux perf and 
> don't care about the rationale for this patch because it doesn't suit your 
> personal preferences, especially on IRC:
By the way, if you're going to start doing personal attacks, right down
to bringing up random quotes from the IRC, unrelated to you, I would
ask the community committee to step in. Those, on the other hand,
are well-known figures.
Rémi Denis-Courmont July 19, 2023, 3:58 p.m. UTC | #10
Le keskiviikkona 19. heinäkuuta 2023, 0.32.26 EEST Lynne a écrit :
> > User-space access to the cycle counter has been deemed a security threat
> > due to the Cycle Drift attack, and is disabled as of OpenSBI 1.3.
> > 
> > If FFmpeg does not support Linux perf, FFmpeg will get _no_ performance
> > benchmarks on Linux.
> 
> It does support linux perf. I've used it many times.

Yes of course. This MR modifies the Linux perf code, so I can see that it's 
there. My point is that *if* Linux perf is *not* enabled (in ./configure), 
there will be no benchmarks anymore on newer Linux RISC-V systems. But if it 
is enabled it, there will be no benchmarks on existing systems. Hence the need 
for the run-time indirection, or some very major PITA for development and 
supporting new developers (and we've seen already 3 different people sending 
RISC-V patches besides me in the past few months).

> >> This is a development tool first and foremost.
> > 
> > A development tool is not a justification for leaving a security whole in
> > the system. I don't make the rules, and you don't either. OpenSBI and
> > Linux make them.
> 
> You're not required to leave them on always. It's a development tool.
> You don't disable frequency scaling on all the time on other platforms,
> just when doing benchmarks, which is what the tool is meant for.

There is no need to disable frequency scaling if you use the most _suited_ 
counter for microbencmarks, that is to say the cycle counter. And then, 
disabling frequency scaling is not always feasible.

> For ARM, we routinely use an external kernel module to enable
> the performance timers just so we avoid linux perf.

I don't think an external ARM-specific kernel module is an excuse for breaking 
checkasm for _normal_ out-of-the-box kernel setups. Just because checkasm is a 
development tool doesn't mean that it only runs on custom development systems.

> Specifically on arm, the linux perf noise is very high, which is where
> my concerns are.

I fail to see how this MR makes this any worse. You can still disable Linux 
perf at build time as you could before. Quite the contrary in fact: if the 
noise is so high then there are no ways that a single indirect function call 
would make any meaningful difference for the worse.

> >> If anyone doesn't want to use rdcycle, they can use linux perf, it still
> >> works, with or without the patch.
> > 
> > It does not.
> 
> Why not? It works on arm. The same exact code is used on risc-v
> and on any other platform too.

Currently it's disabled by default on RISC-V (really anything but Arm) in 
FFmpeg's configure, and hardly anybody has the necessary OpenSBI and Linux 
updates for Linux perf to work on RISC-V.

We can't simply assume that Linux perf is supported. We're a few years too 
early for that. And by then, Android will have blocked Linux perf, so we will 
have the same problem anyway, only for different reasons.

> >> Either way, I don't agree with this patch, not accepting it.
> > 
> > The only vaguely valid reason you've given is that this should cache the
> > function pointers locally, which version 2 does.
> 
> I disagree with it doing an indirection at all.

I can see that, and you're entitled to your opinion.
But that's neither technical reasoning, nor actionable review feedback.

> It's overhead and a potential source of inaccuracy.

If Linux perf is enabled, it's blatantly obvious that whatever additional 
overhead this adds is negligible. Since you insist, on top of my head, Linux 
perf already now will do quite a few indirect calls:
- from checkasm's PLT to libc's ioctl(),
- from libc's syscall to kernel exception vector,
- from exception vector to ioctl syscall handler,
- from ioctl handler to PMU framework,
- from PMU framework to PMU backend driver.
And that's grossly over-simplified and glossing over all the other calls and 
stuff that happen in-between.

As for the case that Linux perf is disabled at build time, then I can make a 
different version that calls AV_READ_TIME directly without indirection if 
that's what bothers Gramner and/or you. Though you've noted yourself that that 
path is also very noisy as it is so not like adding one indirect call will 
matter here :shrug:

> It's up to you to prove it wouldn't affect major platforms.

I think that I have supplied plenty of technical arguments that there 
shouldn't be a problem with this MR. And in the unlikely event that there 
would be an unexpected problem, I'm sure somebody would revert it; it's not an 
irreversible situation. Not to mention that, in the mean time, Martin did 
indeed verify that the overhead is small and stable (on Arm).

> > Since you're giving zero valid reasons, I'm invoking the TC.
> 
> What, you expect them to materialize an opinion out of thin air?

I think that they already have plenty to materialise an opinion from, between 
the MR, this thread, Martin's benchmarks, and their experience. Otherwise, 
they always have the possibility to ask for clarifications (it's even in the 
community rulebook).

> You could've asked the actual maintainer, and one who knows the
> most about the code, Gramner.

Henrik was and still is welcome to review and comment. I am not in the habit 
of asking permission to send code for review.

> But you're focusing too hard on arguing about irrelevant issues, even
> bringing up security, rather than using reasoning or clearly describing the
> need for the patch.

Security is the reason why RDCYCLE will stop working, and that in turn is the 
motivation for this set. I believe it's well described in the commit message 
(notably patch 6/7), and this is the first time in this thread that you state 
otherwise. You don't like the security restriction and you don't have to. But 
that does not make it "irrelevant".

And if you want to argue against the change and/or its security motivation, 
you need to make your case on Linux RISC-V, Linux pref and Linux kernel 
mailing lists, not here.

> By the way, if you're going to start doing personal attacks,

Quoting somebody is not attacking them. I didn't even slice or edit the quote 
or otherwise attempt to distort it.

> right down to bringing up random quotes from the IRC,
> unrelated to you,

It looked very much like it was related to the *MR*. Maybe it was just a 
"random" coincidence, and I am sorry that you apparently took so badly to 
being quoted.

> I would ask the community committee to step in.
> Those, on the other hand, are well-known figures.
Martin Storsjö July 24, 2023, 9:26 p.m. UTC | #11
On Mon, 17 Jul 2023, Rémi Denis-Courmont wrote:

> Since you're giving zero valid reasons, I'm invoking the TC.

Just for public record and visibility:

The TC has discussed the matter at hand. Overall, the TC considered that 
the approach of using an indirect call seemed tolerable given the benefits 
and could be accepted if necessary. But avoiding the indirect call unless 
there actually are multiple timing providers would be even better, 
retaining the same low overhead as before.

As the updated patchset seems to implement that, we consider the case 
closed - and actual TC involvement wasn't needed after all here.

// Martin Storsjö, on behalf of the FFmpeg Technical Committee
Nicolas George July 24, 2023, 9:33 p.m. UTC | #12
Martin Storsjo (12023-07-25):
> The TC has discussed the matter at hand.

The TC (and CC) has been elected for a year in July of 2020. That means
your mandate is two years expired. This decision is therefore not
binding.

Regards,
Lynne July 24, 2023, 10:19 p.m. UTC | #13
Jul 24, 2023, 23:27 by martin@martin.st:

> On Mon, 17 Jul 2023, Rémi Denis-Courmont wrote:
>
>> Since you're giving zero valid reasons, I'm invoking the TC.
>>
>
> Just for public record and visibility:
>
> The TC has discussed the matter at hand. Overall, the TC considered that the approach of using an indirect call seemed tolerable given the benefits and could be accepted if necessary. But avoiding the indirect call unless there actually are multiple timing providers would be even better, retaining the same low overhead as before.
>
> As the updated patchset seems to implement that, we consider the case closed - and actual TC involvement wasn't needed after all here.
>
> // Martin Storsjö, on behalf of the FFmpeg Technical Committee
>

I do think TC involvement here was unnecessary - I do not object to this
version of the patch, so LGTM.
I do agree with the decision overall, but this is nothing that couldn't
be solved with a 3 minute discussion.

I think, however, the process has become rather opaque in this case.
Usually, there has to be a clear writeup of the issue, with all context
removed, that all parties have to agree on is presentable to the TC
for guidelines. In the past, whenever developers have thrown in random
comments for a TC discussion, this has been followed, and the TC
has not responded, but what makes this case so special, when this
was also the case?

Plus, even now, I have zero idea who is on the TC - it has been three years
since it was made, and although it's not meant to last for so long, I doubt
there's a need for a vote as the final results would probably remain the
same. But I would like to know who is still actively involved, and whether
all of the people on the TC discussed this, as it's necessary to make a decision.

If this was an informative guideline, rather than an actual decision where
all members voted and agreed, I think this should be stated.

If the TC were to actively get involved and give decisions without following
proper procedures, it would do more than what the TC was meant to do -
resolve issues.
Kieran Kunhya July 24, 2023, 10:57 p.m. UTC | #14
>
> I think, however, the process has become rather opaque in this case.
> Usually, there has to be a clear writeup of the issue, with all context
> removed, that all parties have to agree on is presentable to the TC
> for guidelines.
>

I don't see why such a writeup is relevant, the mailing list discussions
and the patch are enough.
The TC isn't a debating society.

Kieran
Martin Storsjö July 25, 2023, 6:44 a.m. UTC | #15
On Tue, 25 Jul 2023, Lynne wrote:

> I think, however, the process has become rather opaque in this case.
> Usually, there has to be a clear writeup of the issue, with all context
> removed, that all parties have to agree on is presentable to the TC
> for guidelines. In the past, whenever developers have thrown in random
> comments for a TC discussion, this has been followed, and the TC
> has not responded, but what makes this case so special, when this
> was also the case?

This case was admittedly very opaque. I've seen numerous cases threatening 
to escalate disputes to the TC. The difference here was that an actual 
direct mail was sent to the TC requesting to take a stance on the matter 
to unblock the patch. It wasn't a case of the TC deciding on its own to 
get involved.

Now admittedly, to follow correct procedures, the TC should have announced 
on the ML that we are discussing this issue and trying to make a decision. 
Unfortunately I didn't notice that part in the description of procedures 
until the discussion was done (and the patch review on the ML had 
progressed with a new patchset that made good progress anyway), but we 
wanted to make it publicly known that we had been invoked and actually had 
had a discussion on the matter and made a decision, as was requested.

// Martin
diff mbox series

Patch

diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index 69a709175f..4b6164301a 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -57,6 +57,14 @@ 
 #if HAVE_UNISTD_H
 #include <unistd.h>
 #endif
+#if CONFIG_LINUX_PERF
+#include <sys/ioctl.h>
+#include <asm/unistd.h>
+#include <linux/perf_event.h>
+#endif
+#if CONFIG_MACOS_KPERF
+#include "libavutil/macos_kperf.h"
+#endif
 
 #if !HAVE_ISATTY
 #define isatty(fd) 1
@@ -508,6 +516,9 @@  static int cmp_nop(const void *a, const void *b)
     return *(const uint16_t*)a - *(const uint16_t*)b;
 }
 
+static uint64_t (*checkasm_bench_start)(void);
+static uint64_t (*checkasm_bench_stop)(void);
+
 /* Measure the overhead of the timing code (in decicycles) */
 static int measure_nop_time(void)
 {
@@ -651,7 +662,7 @@  static void print_cpu_name(void)
 }
 
 #if CONFIG_LINUX_PERF
-uint64_t checkasm_bench_linux_perf_start(void)
+static uint64_t checkasm_bench_linux_perf_start(void)
 {
     int fd = state.sysfd;
 
@@ -660,7 +671,7 @@  uint64_t checkasm_bench_linux_perf_start(void)
     return 0;
 }
 
-uint64_t checkasm_bench_linux_perf_stop(void)
+static uint64_t checkasm_bench_linux_perf_stop(void)
 {
     uint64_t t;
     int fd = state.sysfd;
@@ -687,24 +698,34 @@  static int bench_init_linux(void)
         perror("perf_event_open");
         return -1;
     }
+    checkasm_bench_start = checkasm_bench_linux_perf_start;
+    checkasm_bench_stop = checkasm_bench_linux_perf_stop;
     return 0;
 }
 #elif CONFIG_MACOS_KPERF
 static int bench_init_kperf(void)
 {
     ff_kperf_init();
+    checkasm_bench_start = checkasm_bench_stop = ff_kperf_cycles;
     return 0;
 }
-#else
+#elif defined (AV_READ_TIME)
+static uint64_t ff_read_time(void)
+{
+    return AV_READ_TIME();
+}
+
 static int bench_init_ffmpeg(void)
 {
-#ifdef AV_READ_TIME
     printf("benchmarking with native FFmpeg timers\n");
+    checkasm_bench_start = checkasm_bench_stop = ff_read_time;
     return 0;
+}
 #else
+static int bench_init_ffmpeg(void)
+{
     fprintf(stderr, "checkasm: --bench is not supported on your system\n");
     return -1;
-#endif
 }
 #endif
 
@@ -871,6 +892,8 @@  CheckasmPerf *checkasm_get_perf_context(void)
     CheckasmPerf *perf = &state.current_func_ver->perf;
     memset(perf, 0, sizeof(*perf));
     perf->sysfd = state.sysfd;
+    perf->start = checkasm_bench_start;
+    perf->stop = checkasm_bench_start;
     return perf;
 }
 
diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
index 8a62b98f3e..4962815345 100644
--- a/tests/checkasm/checkasm.h
+++ b/tests/checkasm/checkasm.h
@@ -26,15 +26,6 @@ 
 #include <stdint.h>
 #include "config.h"
 
-#if CONFIG_LINUX_PERF
-#include <unistd.h> // read(3)
-#include <sys/ioctl.h>
-#include <asm/unistd.h>
-#include <linux/perf_event.h>
-#elif CONFIG_MACOS_KPERF
-#include "libavutil/macos_kperf.h"
-#endif
-
 #include "libavutil/avstring.h"
 #include "libavutil/cpu.h"
 #include "libavutil/internal.h"
@@ -236,24 +227,10 @@  typedef struct CheckasmPerf {
     int sysfd;
     uint64_t cycles;
     int iterations;
+    uint64_t (*start)(void);
+    uint64_t (*stop)(void);
 } CheckasmPerf;
 
-#if CONFIG_LINUX_PERF
-uint64_t checkasm_bench_linux_perf_start(void);
-uint64_t checkasm_bench_linux_perf_stop(void);
-#define checkasm_bench_start() checkasm_bench_linux_perf_start()
-#define checkasm_bench_stop()  checkasm_bench_linux_perf_stop()
-#elif CONFIG_MACOS_KPERF
-#define checkasm_bench_start() ff_kperf_cycles()
-#define checkasm_bench_stop()  ff_kperf_cycles()
-#elif defined (AV_READ_TIME)
-#define checkasm_bench_start() AV_READ_TIME()
-#define checkasm_bench_stop()  AV_READ_TIME()
-#else
-#define checkasm_bench_start() UINT64_C(0)
-#define checkasm_bench_stop()  UINT64_C(0)
-#endif
-
 /* Benchmark the function */
 #define bench_new(...)\
     do {\
@@ -265,12 +242,12 @@  uint64_t checkasm_bench_linux_perf_stop(void);
             int ti, tcount = 0;\
             uint64_t t = 0; \
             for (ti = 0; ti < BENCH_RUNS; ti++) {\
-                t = checkasm_bench_start(); \
+                t = perf->start(); \
                 tfunc(__VA_ARGS__);\
                 tfunc(__VA_ARGS__);\
                 tfunc(__VA_ARGS__);\
                 tfunc(__VA_ARGS__);\
-                t = checkasm_bench_stop() - t;\
+                t = perf->stop() - t;\
                 if (t*tcount <= tsum*4 && ti > 0) {\
                     tsum += t;\
                     tcount++;\