Message ID: 20230714182835.66326-4-remi@remlab.net
State: New
Series: checkasm RISC-V Linux perf enablement
Context | Check | Description |
---|---|---|
yinshiyou/make_loongarch64 | success | Make finished |
yinshiyou/make_fate_loongarch64 | success | Make fate finished |
andriy/make_x86 | success | Make finished |
andriy/make_fate_x86 | success | Make fate finished |
Jul 14, 2023, 20:29 by remi@remlab.net:

> This makes all calls to the bench start and stop functions via
> function pointers. While the primary goal is to support run-time
> selection of the performance measurement back-end in later commits,
> this has the side benefit of containing platform dependencies in to
> checkasm.c and out of checkasm.h.
> ---
> tests/checkasm/checkasm.c | 33 ++++++++++++++++++++++++++++-----
> tests/checkasm/checkasm.h | 31 ++++---------------------------
> 2 files changed, 32 insertions(+), 32 deletions(-)

Not sure I agree with this commit, the overhead can be detectable,
and we have a lot of small functions with runtime a few times that
of a null function call.

Can you store the function pointers out of the loop to reduce
the derefs needed?
On Saturday, 15 July 2023 at 11.05.51 EEST, Lynne wrote:
> Jul 14, 2023, 20:29 by remi@remlab.net:
> > This makes all calls to the bench start and stop functions via
> > function pointers. While the primary goal is to support run-time
> > selection of the performance measurement back-end in later commits,
> > this has the side benefit of containing platform dependencies in to
> > checkasm.c and out of checkasm.h.
> > ---
> >
> > tests/checkasm/checkasm.c | 33 ++++++++++++++++++++++++++++-----
> > tests/checkasm/checkasm.h | 31 ++++---------------------------
> > 2 files changed, 32 insertions(+), 32 deletions(-)
>
> Not sure I agree with this commit, the overhead can be detectable,
> and we have a lot of small functions with runtime a few times that
> of a null function call.

I don't think the function call is ever null. The pointers are left NULL
only if none of the backends initialise. But then, checkasm will bail out
and exit before we try to benchmark anything anyway.

As for the real functions, they always do *something*. None of them "just
return 0".

> Can you store the function pointers out of the loop to reduce
> the derefs needed?

Taking just the two loads out of the loop should be feasible, but it seems
rather vain. You will still have the overhead of the indirect function
call, the function itself and, most importantly in the case of Linux perf
and MacOS kperf, the system calls.

The only ways to avoid the indirect function calls are to use IFUNC (tricky
and not portable), or to make horrible macros to spawn one bench loop for
each backend.

In the end, I think we should rather aim for as constant a time as
possible, rather than as fast as possible, so that the nop loop can
estimate the benchmarking overhead as well as possible. In this respect, I
think it is actually marginally better *not* to cache the function pointers
in local variables, which could end up spilled on the stack, or not,
depending on local compiler optimisations for any given test case.
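For context, the overhead subtraction referred to above works roughly as follows: checkasm times a do-nothing callee with the same start/stop calls it uses for real functions, and subtracts that estimate from every measurement, so a *stable* overhead matters more than a small one. This is a simplified sketch of the idea, not the actual measure_nop_time() code from checkasm.c:

    #include <stdint.h>

    /* Hypothetical stand-in for a "do nothing" callee. */
    static void nop(void) { }

    /* Simplified sketch: benchmark an empty function to estimate the fixed
     * cost of the timing calls themselves, then subtract that estimate
     * from every real measurement.  As long as the overhead is stable,
     * its absolute size cancels out. */
    static uint64_t bench_overhead(uint64_t (*start)(void),
                                   uint64_t (*stop)(void), int runs)
    {
        uint64_t total = 0;
        for (int i = 0; i < runs; i++) {
            uint64_t t = start();
            nop();
            total += stop() - t;
        }
        return total / runs; /* average cost of start()+stop() around a no-op */
    }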
Jul 15, 2023, 10:26 by remi@remlab.net: > Le lauantaina 15. heinäkuuta 2023, 11.05.51 EEST Lynne a écrit : > >> Jul 14, 2023, 20:29 by remi@remlab.net: >> > This makes all calls to the bench start and stop functions via >> > function pointers. While the primary goal is to support run-time >> > selection of the performance measurement back-end in later commits, >> > this has the side benefit of containing platform dependencies in to >> > checkasm.c and out of checkasm.h. >> > --- >> > >> > tests/checkasm/checkasm.c | 33 ++++++++++++++++++++++++++++----- >> > tests/checkasm/checkasm.h | 31 ++++--------------------------- >> > 2 files changed, 32 insertions(+), 32 deletions(-) >> >> Not sure I agree with this commit, the overhead can be detectable, >> and we have a lot of small functions with runtime a few times that >> of a null function call. >> > > I don't think the function call is ever null. The pointers are left NULL only > if none of the backend initialise. But then, checkasm will bail out and exit > before we try to benchmark anything anyway. > > As for the real functions, they always do *something*. None of them "just > return 0". > I meant a no-op function call to measure the overhead of function calls themselves, complete with all the ABI stuff. >> Can you store the function pointers out of the loop to reduce >> the derefs needed? >> > > Taking just the two loads is out of the loop should be feasible but it seems a > rather vain. You will still have the overhead of the indirect function call, > the function, and most importantly in the case of Linux perf and MacOS kperf, > the system calls. > > The only way to avoid the indirect function calls are to use IFUNC (tricky and > not portable), or to make horrible macros to spawn one bench loop for each > backend. > > In the end, I think we should rather aim for as constant time as possible, > rather than as fast as possible, so that the nop loop can estimate the > benchmarking overhead as well as possible. In this respect, I think it is > actually marginally better *not* to cache the function pointers in local > variables, which could end up spilled on the stack, or not, depending on local > compiler optimisations for any given test case. > I disagree, uninlining the timer fetches adds another source of inconsistency. It may be messy, but I think accuracy here is more important than cleanliness, especially as it's a development tool.
Le lauantaina 15. heinäkuuta 2023, 20.43.26 EEST Lynne a écrit : > Jul 15, 2023, 10:26 by remi@remlab.net: > > Le lauantaina 15. heinäkuuta 2023, 11.05.51 EEST Lynne a écrit : > >> Jul 14, 2023, 20:29 by remi@remlab.net: > >> > This makes all calls to the bench start and stop functions via > >> > function pointers. While the primary goal is to support run-time > >> > selection of the performance measurement back-end in later commits, > >> > this has the side benefit of containing platform dependencies in to > >> > checkasm.c and out of checkasm.h. > >> > --- > >> > > >> > tests/checkasm/checkasm.c | 33 ++++++++++++++++++++++++++++----- > >> > tests/checkasm/checkasm.h | 31 ++++--------------------------- > >> > 2 files changed, 32 insertions(+), 32 deletions(-) > >> > >> Not sure I agree with this commit, the overhead can be detectable, > >> and we have a lot of small functions with runtime a few times that > >> of a null function call. > > > > I don't think the function call is ever null. The pointers are left NULL > > only if none of the backend initialise. But then, checkasm will bail out > > and exit before we try to benchmark anything anyway. > > > > As for the real functions, they always do *something*. None of them "just > > return 0". > > I meant a no-op function call to measure the overhead of function > calls themselves, complete with all the ABI stuff. I > > >> Can you store the function pointers out of the loop to reduce > >> the derefs needed? > > > > Taking just the two loads is out of the loop should be feasible but it > > seems a rather vain. You will still have the overhead of the indirect > > function call, the function, and most importantly in the case of Linux > > perf and MacOS kperf, the system calls. > > > > The only way to avoid the indirect function calls are to use IFUNC (tricky > > and not portable), or to make horrible macros to spawn one bench loop for > > each backend. > > > > In the end, I think we should rather aim for as constant time as possible, > > rather than as fast as possible, so that the nop loop can estimate the > > benchmarking overhead as well as possible. In this respect, I think it is > > actually marginally better *not* to cache the function pointers in local > > variables, which could end up spilled on the stack, or not, depending on > > local compiler optimisations for any given test case. > > I disagree, uninlining the timer fetches adds another source of > inconsistency. Err, outlining the timer makes sure that it's always the exact same code that's run, and not differently optimised inlinings, at least if LTO is absent. (And even with LTO, it vastly reduces the compiler's ability to optimise and vary the compilation.) Again, given how the calculations are made at the moment, the stability of the overhead is important, so that we can *compare* measurements. The absolute value of the overhead, not so much. But I still argue that that is, either way, completely negligible compared to the *existing* overhead. Each loop is making 4 system calls, and each of those system call requires a direct call (to PLT) and an indirect branch (from GOT). If you have a problem with the two additional function calls, then you can't be using Linux perf in the first place.
Jul 15, 2023, 22:13 by remi@remlab.net: > Le lauantaina 15. heinäkuuta 2023, 20.43.26 EEST Lynne a écrit : > >> Jul 15, 2023, 10:26 by remi@remlab.net: >> > Le lauantaina 15. heinäkuuta 2023, 11.05.51 EEST Lynne a écrit : >> >> Jul 14, 2023, 20:29 by remi@remlab.net: >> >> > This makes all calls to the bench start and stop functions via >> >> > function pointers. While the primary goal is to support run-time >> >> > selection of the performance measurement back-end in later commits, >> >> > this has the side benefit of containing platform dependencies in to >> >> > checkasm.c and out of checkasm.h. >> >> > --- >> >> > >> >> > tests/checkasm/checkasm.c | 33 ++++++++++++++++++++++++++++----- >> >> > tests/checkasm/checkasm.h | 31 ++++--------------------------- >> >> > 2 files changed, 32 insertions(+), 32 deletions(-) >> >> >> >> Not sure I agree with this commit, the overhead can be detectable, >> >> and we have a lot of small functions with runtime a few times that >> >> of a null function call. >> > >> > I don't think the function call is ever null. The pointers are left NULL >> > only if none of the backend initialise. But then, checkasm will bail out >> > and exit before we try to benchmark anything anyway. >> > >> > As for the real functions, they always do *something*. None of them "just >> > return 0". >> >> I meant a no-op function call to measure the overhead of function >> calls themselves, complete with all the ABI stuff. >> > > I > >> >> >> Can you store the function pointers out of the loop to reduce >> >> the derefs needed? >> > >> > Taking just the two loads is out of the loop should be feasible but it >> > seems a rather vain. You will still have the overhead of the indirect >> > function call, the function, and most importantly in the case of Linux >> > perf and MacOS kperf, the system calls. >> > >> > The only way to avoid the indirect function calls are to use IFUNC (tricky >> > and not portable), or to make horrible macros to spawn one bench loop for >> > each backend. >> > >> > In the end, I think we should rather aim for as constant time as possible, >> > rather than as fast as possible, so that the nop loop can estimate the >> > benchmarking overhead as well as possible. In this respect, I think it is >> > actually marginally better *not* to cache the function pointers in local >> > variables, which could end up spilled on the stack, or not, depending on >> > local compiler optimisations for any given test case. >> >> I disagree, uninlining the timer fetches adds another source of >> inconsistency. >> > > Err, outlining the timer makes sure that it's always the exact same code > that's run, and not differently optimised inlinings, at least if LTO is absent. > (And even with LTO, it vastly reduces the compiler's ability to optimise and > vary the compilation.) Again, given how the calculations are made at the > moment, the stability of the overhead is important, so that we can *compare* > measurements. The absolute value of the overhead, not so much. > Introducing additional overhead in the form of a dereference is a point where instability can creep in. Can you guarantee that a context will always remain in L1D cache, as opposed to just reading the raw CPU timing directly where that's supported. > But I still argue that that is, either way, completely negligible compared to > the *existing* overhead. Each loop is making 4 system calls, and each of those > system call requires a direct call (to PLT) and an indirect branch (from GOT). 
> If you have a problem with the two additional function calls, then you can't
> be using Linux perf in the first place.

You don't want to ever use linux perf in the first place; it's second class.

I don't think it's worth changing the direct inlining we had before. You're
not interested in whether or not the same exact code is run between
platforms, just that the code that's measuring timing is as efficient and
low overhead as possible.
On Sunday, 16 July 2023 at 23.32.21 EEST, Lynne wrote:
> Introducing additional overhead in the form of a dereference is a point
> where instability can creep in. Can you guarantee that a context will
> always remain in L1D cache,

L1D is not involved here. In version 2, the pointers are cached locally.

> as opposed to just reading the raw CPU timing
> directly where that's supported.

Of course not. Raw CPU timing is subject to noise from interrupts (and
whatever those interrupts trigger). And that's not just theoretical. I've
experienced it and it sucks. Raw CPU timing is much noisier than Linux perf.

And because it has also been proven vastly insecure, it's been disabled on
Arm for a long time, and is being disabled on RISC-V too now.

> > But I still argue that that is, either way, completely negligible compared
> > to the *existing* overhead. Each loop is making 4 system calls, and each
> > of those system call requires a direct call (to PLT) and an indirect
> > branch (from GOT). If you have a problem with the two additional function
> > calls, then you can't be using Linux perf in the first place.
>
> You don't want to ever use linux perf in the first place, it's second class.

No it isn't. The interface is more involved than just reading a CSR; and
sure, I'd prefer the simple interface that RDCYCLE is, all other things
being equal. But other things are not equal. Linux perf is in fact *more*
accurate by virtue of not *wrongly* counting other things. And it does not
threaten the security of the entire system, so it will work inside a rented
VM or an unprivileged process.

> I don't think it's worth changing the direct inlining we had before. You're
> not interested in whether or not the same exact code is ran between
> platforms,

Err, I am definitely interested in doing exactly that. I don't want to have
to reconfigure and recompile the entire FFmpeg just to switch between Linux
perf and the raw cycle counter. A contrario, I *do* want to compare
performance between vendors once the hardware is available.

> just that the code that's measuring timing is as efficient and
> low overhead as possible.

Of course not. Low overhead is irrelevant here. The measurement overhead is
known and is subtracted. What we need is stable/reproducible overhead, and
accurate measurements.

And that's assuming the stuff works at all. You can argue that we should
use the Arm PMU and RISC-V RDCYCLE, and that Linux perf sucks, all you
want. PMU access will just throw a SIGILL and end the checkasm process with
zero measurements. The rest of the industry wants to use system calls for
informed reasons. I don't think you, or even the whole FFmpeg project, can
win that argument against OS and CPU vendors.
Jul 17, 2023, 07:18 by remi@remlab.net: > Le sunnuntaina 16. heinäkuuta 2023, 23.32.21 EEST Lynne a écrit : > >> Introducing additional overhead in the form of a dereference is a point >> where instability can creep in. Can you guarantee that a context will >> always remain in L1D cache, >> > > L1D is not involved here. In version 2, the pointers are cached locally. > >> as opposed to just reading the raw CPU timing >> directly where that's supported. >> > > Of course not. Raw CPU timing is subject to noise from interrupts (and > whatever those interrupts trigger). And that's not just theoretical. I've > experienced it and it sucks. Raw CPU timing is much noisier than Linux perf. > > And because it has also been proven vastly insecure, it's been disabled on Arm > for a long time, and is being disabled on RISC-V too now. > >> > But I still argue that that is, either way, completely negligible compared >> > to the *existing* overhead. Each loop is making 4 system calls, and each >> > of those system call requires a direct call (to PLT) and an indirect >> > branch (from GOT). If you have a problem with the two additional function >> > calls, then you can't be using Linux perf in the first place. >> >> You don't want to ever use linux perf in the first place, it's second class. >> > > No it isn't. The interface is more involved than just reading a CSR; and sure > I'd prefer the simple interface that RDCYCLE is all other things being equal. > But other things are not equal. Linux perf is in fact *more* accurate by > virtue of not *wrongly* counting other things. And it does not threaten the > security of the entire system, so it will work inside a rented VM or an > unprivileged process. > Threaten? This is a development tool first and foremost. If anyone doesn't want to use rdcycle, they can use linux perf, it still works, with or without the patch. >> I don't think it's worth changing the direct inlining we had before. You're >> not interested in whether or not the same exact code is ran between >> platforms, >> > > Err, I am definitely interested in doing exactly that. I don't want to have to > reconfigure and recompile the entire FFmpeg just to switch between Linux perf > and raw cycle counter. A contrario, I *do* want to compare performance between > vendors once the hardware is available. > That's a weak reason to compromise the accuracy of a development tool. >> just that the code that's measuring timing is as efficient and >> low overhead as possible. >> > > Of course not. Low overhead is irrelevant here. The measurement overhead is > know and is subtracted. What we need is stable/reproducible overhead, and > accurate measurements. > Which is what TSC or the equivalent gets you. It's noisy, but that's because it's better and higher accuracy than having to roundtrip through the kernel. > And that's assuming the stuff works at all. You can argue that we should use > Arm PMU and RISC-V RDCYCLE, and that Linux perf sucks, all you want. PMU > access will just throw a SIGILL and end the checkasm process with zero > measurements. The rest of the industry wants to use system calls for informed > reasons. I don't think you, or even the whole FFmpeg project, can win that > argument against OS and CPU vendors. > Either way, I don't agree with this patch, not accepting it.
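For reference, the "raw CPU timing" and "TSC or the equivalent" being argued about are unprivileged counter reads such as x86 RDTSC or the RISC-V cycle CSR. A minimal illustration of such reads (this is not FFmpeg's actual AV_READ_TIME macro, just a sketch):

    #include <stdint.h>

    static inline uint64_t read_raw_counter(void)
    {
    #if defined(__x86_64__)
        unsigned int lo, hi;
        /* RDTSC returns the time-stamp counter in EDX:EAX. */
        __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    #elif defined(__riscv) && __riscv_xlen == 64
        uint64_t cycles;
        /* Reads the cycle CSR; traps (SIGILL) if user access to the
         * counter has been disabled by the SBI/kernel. */
        __asm__ volatile ("rdcycle %0" : "=r"(cycles));
        return cycles;
    #else
        return 0; /* no raw counter available in this sketch */
    #endif
    }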
On Monday, 17 July 2023 at 20.48.40 EEST, Lynne wrote:
> >> > But I still argue that that is, either way, completely negligible
> >> > compared to the *existing* overhead. Each loop is making 4 system
> >> > calls, and each of those system call requires a direct call (to PLT)
> >> > and an indirect branch (from GOT). If you have a problem with the two
> >> > additional function calls, then you can't be using Linux perf in the
> >> > first place.
> >>
> >> You don't want to ever use linux perf in the first place, it's second
> >> class.
> >
> > No it isn't. The interface is more involved than just reading a CSR; and
> > sure I'd prefer the simple interface that RDCYCLE is all other things
> > being equal. But other things are not equal. Linux perf is in fact *more*
> > accurate by virtue of not *wrongly* counting other things. And it does
> > not threaten the security of the entire system, so it will work inside a
> > rented VM or an unprivileged process.
>
> Threaten?

User-space access to the cycle counter has been deemed a security threat
due to the Cycle Drift attack, and is disabled as of OpenSBI 1.3.

If FFmpeg does not support Linux perf, FFmpeg will get _no_ performance
benchmarks on Linux.

> This is a development tool first and foremost.

A development tool is not a justification for leaving a security hole in
the system. I don't make the rules, and you don't either. OpenSBI and Linux
make them.

> If anyone doesn't want to use rdcycle, they can use linux perf, it still
> works, with or without the patch.

It does not.

> >> I don't think it's worth changing the direct inlining we had before.
> >> You're not interested in whether or not the same exact code is ran
> >> between platforms,
> >
> > Err, I am definitely interested in doing exactly that. I don't want to
> > have to reconfigure and recompile the entire FFmpeg just to switch
> > between Linux perf and raw cycle counter. A contrario, I *do* want to
> > compare performance between vendors once the hardware is available.
>
> That's a weak reason to compromise the accuracy of a development tool.

This is not compromising any accuracy of any development tool. Again, in a
scenario where both RDCYCLE and Linux perf work, Linux perf is more accurate
for reasons that I already outlined in previous messages. And on systems
with newer OpenSBI and kernel, RDCYCLE does not work at all.

> >> just that the code that's measuring timing is as efficient and
> >> low overhead as possible.
> >
> > Of course not. Low overhead is irrelevant here. The measurement overhead
> > is know and is subtracted. What we need is stable/reproducible overhead,
> > and accurate measurements.
>
> Which is what TSC or the equivalent gets you. It's noisy, but that's
> because it's better and higher accuracy than having to roundtrip through
> the kernel.

We _could_ use RDTIME. That's not blocked.

And while that should be "low overhead", measuring time is also obviously
way _less_ accurate than measuring cycles (RDCYCLE), which is in turn
_less_ accurate than measuring cycles only in user mode and only of the
current process (Linux perf).

> Either way, I don't agree with this patch, not accepting it.

The only vaguely valid reason you've given is that this should cache the
function pointers locally, which version 2 does.

All you've done is make it abundantly clear that you don't like Linux perf
and don't care about the rationale for this patch because it doesn't suit
your personal preferences, especially on IRC:

20:49 <@Lynne> "security"
20:49 <@Lynne> it's a fucking timer
20:49 <@Lynne> "insecure"
20:51 <@Lynne> computers are very good at knowing how much time has passed, to the point of the speed of light being an issue
20:52 <@Lynne> getting rid of this because some script kiddie could potentially figure out a bit by calling this a million times while the CPU is busy is paranoid
20:58 <@Lynne> it's a cache issue, so you fix the cache lifetime, not nerf a timer

And it's not a cache issue. Cycle Drift is not a cache issue, it's an
instruction timing issue.

Since you're giving zero valid reasons, I'm invoking the TC.
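For illustration only, "caching the function pointers locally" as discussed above could look roughly like the following hypothetical helper; this is a sketch of the idea, not the actual version 2 patch:

    #include <stdint.h>

    #define BENCH_RUNS 1000

    /* Hypothetical globals standing in for the run-time selected back-end. */
    extern uint64_t (*checkasm_bench_start)(void);
    extern uint64_t (*checkasm_bench_stop)(void);

    /* Hoist the pointers into locals once, outside the timed loop, so each
     * iteration pays only for the two indirect calls and the callee itself,
     * not for reloading the globals. */
    static uint64_t bench_with_cached_pointers(void (*tfunc)(void))
    {
        uint64_t (*const start)(void) = checkasm_bench_start;
        uint64_t (*const stop)(void)  = checkasm_bench_stop;
        uint64_t total = 0;

        for (int i = 0; i < BENCH_RUNS; i++) {
            uint64_t t = start();
            tfunc();
            total += stop() - t;
        }
        return total;
    }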
Jul 17, 2023, 20:09 by remi@remlab.net: > Le maanantaina 17. heinäkuuta 2023, 20.48.40 EEST Lynne a écrit : > >> >> > But I still argue that that is, either way, completely negligible >> >> > compared >> >> > to the *existing* overhead. Each loop is making 4 system calls, and >> >> > each >> >> > of those system call requires a direct call (to PLT) and an indirect >> >> > branch (from GOT). If you have a problem with the two additional >> >> > function >> >> > calls, then you can't be using Linux perf in the first place. >> >> >> >> You don't want to ever use linux perf in the first place, it's second >> >> class.> >> > No it isn't. The interface is more involved than just reading a CSR; and >> > sure I'd prefer the simple interface that RDCYCLE is all other things >> > being equal. But other things are not equal. Linux perf is in fact *more* >> > accurate by virtue of not *wrongly* counting other things. And it does >> > not threaten the security of the entire system, so it will work inside a >> > rented VM or an unprivileged process. >> >> Threaten? >> > > User-space access to the cycle counter has been deemed a security threat due > to the Cycle Drift attack, and is disabled as of OpenSBI 1.3. > > If FFmpeg does not support Linux perf, FFmpeg will get _no_ performance > benchmarks on Linux. > It does support linux perf. I've used it many times. >> This is a development tool first and foremost. >> > > A development tool is not a justification for leaving a security whole in the > system. I don't make the rules, and you don't either. OpenSBI and Linux make > them. > You're not required to leave them on always. It's a development tool. You don't disable frequency scaling on all the time on other platforms, just when doing benchmarks, which is what the tool is meant for. For ARM, we routinely use an external kernel module to enable the performance timers just so we avoid linux perf. Specifically on arm, the linux perf noise is very high, which is where my concerns are. >> If anyone doesn't want to use rdcycle, they can use linux perf, it still >> works, with or without the patch. >> > > It does not. > Why not? It works on arm. The same exact code is used on risc-v and on any other platform too. >> >> I don't think it's worth changing the direct inlining we had before. >> >> You're >> >> not interested in whether or not the same exact code is ran between >> >> platforms, >> > >> > Err, I am definitely interested in doing exactly that. I don't want to >> > have to reconfigure and recompile the entire FFmpeg just to switch >> > between Linux perf and raw cycle counter. A contrario, I *do* want to >> > compare performance between vendors once the hardware is available. >> >> That's a weak reason to compromise the accuracy of a development tool. >> > > This is not compromising any accuracy of any development tool. Again, in a > scenario where both RDCYCLE and Linux perf work, Linux perf is more accurate > for reasons that I already outlined in previous messages. And on systems with > newer OpenSBI and kernel, RDCYCLE does not work at all. > >> >> just that the code that's measuring timing is as efficient and >> >> low overhead as possible. >> > >> > Of course not. Low overhead is irrelevant here. The measurement overhead >> > is know and is subtracted. What we need is stable/reproducible overhead, >> > and accurate measurements. >> >> Which is what TSC or the equivalent gets you. It's noisy, but that's because >> it's better and higher accuracy than having to roundtrip through the >> kernel. 
>> > > We _could_ use RDTIME. That's not blocked. > > And while that should be "low overhead", but measuring time is also obviously > way _less_ accurate than measuring cycles (RDCYCLE), which is in turn _less_ > accurate than measuring cycles only in user mode and only of the current > process (Linux perf). > >> Either way, I don't agree with this patch, not accepting it. >> > > The only vaguely valid reason you've given is that this should cache the > function pointers locally, which version 2 does. > I disagree with it doing an indirection at all. It's overhead and a potential source of inaccuracy. It's up to you to prove it wouldn't affect major platforms. > Since you're giving zero valid reasons, I'm invoking the TC. > What, you expect them to materialize an opinion out of thin air? You have to call for a vote. And bring up why it hasn't been changed in 3 years, even though in the duties listed, it's only supposed to last for a year before elections are called again. Which possibly involves doing elections again. You could've asked the actual maintainer, and one who knows the most about the code, Gramner. But you're focusing too hard on arguing about irrelevant issues, even bringing up security, rather than using reasoning or clearly describing the need for the patch. > All you've done is make it abundantly clear that you don't like Linux perf and > don't care about the rationale for this patch because it doesn't suit your > personal preferences, especially on IRC: By the way, if you're going to start doing personal attacks, right down to bringing up random quotes from the IRC, unrelated to you, I would ask the community committee to step in. Those, on the other hand, are well-known figures.
On Wednesday, 19 July 2023 at 0.32.26 EEST, Lynne wrote:
> > User-space access to the cycle counter has been deemed a security threat
> > due to the Cycle Drift attack, and is disabled as of OpenSBI 1.3.
> >
> > If FFmpeg does not support Linux perf, FFmpeg will get _no_ performance
> > benchmarks on Linux.
>
> It does support linux perf. I've used it many times.

Yes of course. This MR modifies the Linux perf code, so I can see that it's
there. My point is that *if* Linux perf is *not* enabled (in ./configure),
there will be no benchmarks anymore on newer Linux RISC-V systems. But if
it is enabled, there will be no benchmarks on existing systems.

Hence the need for the run-time indirection, or some very major PITA for
development and supporting new developers (and we've seen already 3
different people sending RISC-V patches besides me in the past few months).

> >> This is a development tool first and foremost.
> >
> > A development tool is not a justification for leaving a security whole in
> > the system. I don't make the rules, and you don't either. OpenSBI and
> > Linux make them.
>
> You're not required to leave them on always. It's a development tool.
> You don't disable frequency scaling on all the time on other platforms,
> just when doing benchmarks, which is what the tool is meant for.

There is no need to disable frequency scaling if you use the most _suited_
counter for microbenchmarks, that is to say the cycle counter. And then,
disabling frequency scaling is not always feasible.

> For ARM, we routinely use an external kernel module to enable
> the performance timers just so we avoid linux perf.

I don't think an external ARM-specific kernel module is an excuse for
breaking checkasm for _normal_ out-of-the-box kernel setups. Just because
checkasm is a development tool doesn't mean that it only runs on custom
development systems.

> Specifically on arm, the linux perf noise is very high, which is where
> my concerns are.

I fail to see how this MR makes this any worse. You can still disable Linux
perf at build time as you could before. Quite the contrary in fact: if the
noise is so high then there is no way that a single indirect function call
would make any meaningful difference for the worse.

> >> If anyone doesn't want to use rdcycle, they can use linux perf, it still
> >> works, with or without the patch.
> >
> > It does not.
>
> Why not? It works on arm. The same exact code is used on risc-v
> and on any other platform too.

Currently it's disabled by default on RISC-V (really anything but Arm) in
FFmpeg's configure, and hardly anybody has the necessary OpenSBI and Linux
updates for Linux perf to work on RISC-V. We can't simply assume that Linux
perf is supported. We're a few years too early for that. And by then,
Android will have blocked Linux perf, so we will have the same problem
anyway, only for different reasons.

> >> Either way, I don't agree with this patch, not accepting it.
> >
> > The only vaguely valid reason you've given is that this should cache the
> > function pointers locally, which version 2 does.
>
> I disagree with it doing an indirection at all.

I can see that, and you're entitled to your opinion. But that's neither
technical reasoning, nor actionable review feedback.

> It's overhead and a potential source of inaccuracy.

If Linux perf is enabled, it's blatantly obvious that whatever additional
overhead this adds is negligible.

Since you insist, off the top of my head, Linux perf already does quite a
few indirect calls:
- from checkasm's PLT to libc's ioctl(),
- from libc's syscall to kernel exception vector,
- from exception vector to ioctl syscall handler,
- from ioctl handler to PMU framework,
- from PMU framework to PMU backend driver.

And that's grossly over-simplified and glossing over all the other calls
and stuff that happen in-between.

As for the case where Linux perf is disabled at build time, I can make a
different version that calls AV_READ_TIME directly without indirection if
that's what bothers Gramner and/or you. Though you've noted yourself that
that path is also very noisy as it is, so it's not like adding one indirect
call will matter here :shrug:

> It's up to you to prove it wouldn't affect major platforms.

I think that I have supplied plenty of technical arguments that there
shouldn't be a problem with this MR. And in the unlikely event that there
would be an unexpected problem, I'm sure somebody would revert it; it's not
an irreversible situation. Not to mention that, in the mean time, Martin
did indeed verify that the overhead is small and stable (on Arm).

> > Since you're giving zero valid reasons, I'm invoking the TC.
>
> What, you expect them to materialize an opinion out of thin air?

I think that they already have plenty to materialise an opinion from,
between the MR, this thread, Martin's benchmarks, and their experience.
Otherwise, they always have the possibility to ask for clarifications (it's
even in the community rulebook).

> You could've asked the actual maintainer, and one who knows the
> most about the code, Gramner.

Henrik was and still is welcome to review and comment. I am not in the
habit of asking permission to send code for review.

> But you're focusing too hard on arguing about irrelevant issues, even
> bringing up security, rather than using reasoning or clearly describing the
> need for the patch.

Security is the reason why RDCYCLE will stop working, and that in turn is
the motivation for this set. I believe it's well described in the commit
message (notably patch 6/7), and this is the first time in this thread that
you state otherwise.

You don't like the security restriction and you don't have to. But that
does not make it "irrelevant". And if you want to argue against the change
and/or its security motivation, you need to make your case on the Linux
RISC-V, Linux perf and Linux kernel mailing lists, not here.

> By the way, if you're going to start doing personal attacks,

Quoting somebody is not attacking them. I didn't even slice or edit the
quote or otherwise attempt to distort it.

> right down to bringing up random quotes from the IRC,
> unrelated to you,

It looked very much like it was related to the *MR*. Maybe it was just a
"random" coincidence, and I am sorry that you apparently took so badly to
being quoted.

> I would ask the community committee to step in.
> Those, on the other hand, are well-known figures.
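For readers unfamiliar with the interface: the Linux perf back-end discussed throughout opens one hardware cycle counter per process with perf_event_open(2) and then brackets each measurement with ioctl() and read() calls. The following is a simplified sketch of that API usage, not the exact checkasm code:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static int perf_fd = -1;

    /* One-time setup: count user-space CPU cycles of this process only. */
    static int perf_init(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.exclude_kernel = 1;   /* do not count cycles spent in the kernel */
        attr.exclude_hv = 1;

        perf_fd = syscall(__NR_perf_event_open, &attr, 0 /* this process */,
                          -1 /* any CPU */, -1 /* no group */, 0);
        if (perf_fd < 0) {
            perror("perf_event_open");
            return -1;
        }
        return 0;
    }

    /* Per-measurement bracketing: two ioctls to start, one ioctl plus one
     * read() to stop -- the four system calls per loop mentioned above. */
    static void perf_start(void)
    {
        ioctl(perf_fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(perf_fd, PERF_EVENT_IOC_ENABLE, 0);
    }

    static uint64_t perf_stop(void)
    {
        uint64_t cycles = 0;
        ioctl(perf_fd, PERF_EVENT_IOC_DISABLE, 0);
        read(perf_fd, &cycles, sizeof(cycles));
        return cycles;
    }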
On Mon, 17 Jul 2023, Rémi Denis-Courmont wrote:
> Since you're giving zero valid reasons, I'm invoking the TC.
Just for public record and visibility:
The TC has discussed the matter at hand. Overall, the TC considered that
the approach of using an indirect call seemed tolerable given the benefits
and could be accepted if necessary. But avoiding the indirect call unless
there actually are multiple timing providers would be even better,
retaining the same low overhead as before.
As the updated patchset seems to implement that, we consider the case
closed - and actual TC involvement wasn't needed after all here.
// Martin Storsjö, on behalf of the FFmpeg Technical Committee
Martin Storsjo (2023-07-25):
> The TC has discussed the matter at hand.
The TC (and CC) has been elected for a year in July of 2020. That means
your mandate is two years expired. This decision is therefore not
binding.
Regards,
Jul 24, 2023, 23:27 by martin@martin.st:

> On Mon, 17 Jul 2023, Rémi Denis-Courmont wrote:
>
>> Since you're giving zero valid reasons, I'm invoking the TC.
>
> Just for public record and visibility:
>
> The TC has discussed the matter at hand. Overall, the TC considered that
> the approach of using an indirect call seemed tolerable given the benefits
> and could be accepted if necessary. But avoiding the indirect call unless
> there actually are multiple timing providers would be even better,
> retaining the same low overhead as before.
>
> As the updated patchset seems to implement that, we consider the case
> closed - and actual TC involvement wasn't needed after all here.
>
> // Martin Storsjö, on behalf of the FFmpeg Technical Committee

I do think TC involvement here was unnecessary - I do not object to this
version of the patch, so LGTM. I do agree with the decision overall, but
this is nothing that couldn't be solved with a 3 minute discussion.

I think, however, the process has become rather opaque in this case.
Usually, there has to be a clear writeup of the issue, with all context
removed, that all parties have to agree on, which is then presented to the
TC for guidance. In the past, whenever developers have thrown in random
comments calling for a TC discussion, this procedure has been followed and
the TC has not responded, so what makes this case so special, when this was
also the case here?

Plus, even now, I have zero idea who is on the TC - it has been three years
since it was formed, and although it's not meant to last for so long, I
doubt there's a need for a vote, as the final results would probably remain
the same. But I would like to know who is still actively involved, and
whether all of the people on the TC discussed this, as that is necessary to
make a decision.

If this was an informative guideline, rather than an actual decision where
all members voted and agreed, I think this should be stated. If the TC were
to actively get involved and give decisions without following proper
procedures, it would do more than what the TC was meant to do - resolve
issues.
>
> I think, however, the process has become rather opaque in this case.
> Usually, there has to be a clear writeup of the issue, with all context
> removed, that all parties have to agree on is presentable to the TC
> for guidelines.
>

I don't see why such a writeup is relevant; the mailing list discussions
and the patch are enough. The TC isn't a debating society.

Kieran
On Tue, 25 Jul 2023, Lynne wrote: > I think, however, the process has become rather opaque in this case. > Usually, there has to be a clear writeup of the issue, with all context > removed, that all parties have to agree on is presentable to the TC > for guidelines. In the past, whenever developers have thrown in random > comments for a TC discussion, this has been followed, and the TC > has not responded, but what makes this case so special, when this > was also the case? This case was admittedly very opaque. I've seen numerous cases threatening to escalate disputes to the TC. The difference here was that an actual direct mail was sent to the TC requesting to take a stance on the matter to unblock the patch. It wasn't a case of the TC deciding on its own to get involved. Now admittedly, to follow correct procedures, the TC should have announced on the ML that we are discussing this issue and trying to make a decision. Unfortunately I didn't notice that part in the description of procedures until the discussion was done (and the patch review on the ML had progressed with a new patchset that made good progress anyway), but we wanted to make it publicly known that we had been invoked and actually had had a discussion on the matter and made a decision, as was requested. // Martin
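As a hypothetical illustration of the TC's suggestion (avoid the indirect call unless several timing providers are actually compiled in), a header fragment along these lines would keep the old zero-overhead macros in single-provider builds; this is a sketch, not the code from the updated patchset. The v1 diff under discussion follows below.

    /* Sketch: keep the direct macros when the build has exactly one timing
     * provider, and fall back to the function pointers in CheckasmPerf only
     * when several back-ends can be selected at run time. */
    #if CONFIG_LINUX_PERF && !CONFIG_MACOS_KPERF && !defined(AV_READ_TIME)
    #define checkasm_bench_start() checkasm_bench_linux_perf_start()
    #define checkasm_bench_stop()  checkasm_bench_linux_perf_stop()
    #elif CONFIG_MACOS_KPERF && !CONFIG_LINUX_PERF
    #define checkasm_bench_start() ff_kperf_cycles()
    #define checkasm_bench_stop()  ff_kperf_cycles()
    #elif defined(AV_READ_TIME) && !CONFIG_LINUX_PERF && !CONFIG_MACOS_KPERF
    #define checkasm_bench_start() AV_READ_TIME()
    #define checkasm_bench_stop()  AV_READ_TIME()
    #else
    /* More than one possible provider: pay one indirect call per read. */
    #define checkasm_bench_start() perf->start()
    #define checkasm_bench_stop()  perf->stop()
    #endif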
diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index 69a709175f..4b6164301a 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -57,6 +57,14 @@
 #if HAVE_UNISTD_H
 #include <unistd.h>
 #endif
+#if CONFIG_LINUX_PERF
+#include <sys/ioctl.h>
+#include <asm/unistd.h>
+#include <linux/perf_event.h>
+#endif
+#if CONFIG_MACOS_KPERF
+#include "libavutil/macos_kperf.h"
+#endif
 
 #if !HAVE_ISATTY
 #define isatty(fd) 1
@@ -508,6 +516,9 @@ static int cmp_nop(const void *a, const void *b)
     return *(const uint16_t*)a - *(const uint16_t*)b;
 }
 
+static uint64_t (*checkasm_bench_start)(void);
+static uint64_t (*checkasm_bench_stop)(void);
+
 /* Measure the overhead of the timing code (in decicycles) */
 static int measure_nop_time(void)
 {
@@ -651,7 +662,7 @@ static void print_cpu_name(void)
 }
 
 #if CONFIG_LINUX_PERF
-uint64_t checkasm_bench_linux_perf_start(void)
+static uint64_t checkasm_bench_linux_perf_start(void)
 {
     int fd = state.sysfd;
 
@@ -660,7 +671,7 @@ uint64_t checkasm_bench_linux_perf_start(void)
     return 0;
 }
 
-uint64_t checkasm_bench_linux_perf_stop(void)
+static uint64_t checkasm_bench_linux_perf_stop(void)
 {
     uint64_t t;
     int fd = state.sysfd;
@@ -687,24 +698,34 @@ static int bench_init_linux(void)
         perror("perf_event_open");
         return -1;
     }
+    checkasm_bench_start = checkasm_bench_linux_perf_start;
+    checkasm_bench_stop = checkasm_bench_linux_perf_stop;
     return 0;
 }
 #elif CONFIG_MACOS_KPERF
 static int bench_init_kperf(void)
 {
     ff_kperf_init();
+    checkasm_bench_start = checkasm_bench_stop = ff_kperf_cycles;
     return 0;
 }
-#else
+#elif defined (AV_READ_TIME)
+static uint64_t ff_read_time(void)
+{
+    return AV_READ_TIME();
+}
+
 static int bench_init_ffmpeg(void)
 {
-#ifdef AV_READ_TIME
     printf("benchmarking with native FFmpeg timers\n");
+    checkasm_bench_start = checkasm_bench_stop = ff_read_time;
     return 0;
+}
 #else
+static int bench_init_ffmpeg(void)
+{
     fprintf(stderr, "checkasm: --bench is not supported on your system\n");
     return -1;
-#endif
 }
 #endif
@@ -871,6 +892,8 @@ CheckasmPerf *checkasm_get_perf_context(void)
     CheckasmPerf *perf = &state.current_func_ver->perf;
     memset(perf, 0, sizeof(*perf));
     perf->sysfd = state.sysfd;
+    perf->start = checkasm_bench_start;
+    perf->stop = checkasm_bench_stop;
+
     return perf;
 }
diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
index 8a62b98f3e..4962815345 100644
--- a/tests/checkasm/checkasm.h
+++ b/tests/checkasm/checkasm.h
@@ -26,15 +26,6 @@
 #include <stdint.h>
 #include "config.h"
 
-#if CONFIG_LINUX_PERF
-#include <unistd.h> // read(3)
-#include <sys/ioctl.h>
-#include <asm/unistd.h>
-#include <linux/perf_event.h>
-#elif CONFIG_MACOS_KPERF
-#include "libavutil/macos_kperf.h"
-#endif
-
 #include "libavutil/avstring.h"
 #include "libavutil/cpu.h"
 #include "libavutil/internal.h"
@@ -236,24 +227,10 @@ typedef struct CheckasmPerf {
     int sysfd;
     uint64_t cycles;
     int iterations;
+    uint64_t (*start)(void);
+    uint64_t (*stop)(void);
 } CheckasmPerf;
 
-#if CONFIG_LINUX_PERF
-uint64_t checkasm_bench_linux_perf_start(void);
-uint64_t checkasm_bench_linux_perf_stop(void);
-#define checkasm_bench_start() checkasm_bench_linux_perf_start()
-#define checkasm_bench_stop() checkasm_bench_linux_perf_stop()
-#elif CONFIG_MACOS_KPERF
-#define checkasm_bench_start() ff_kperf_cycles()
-#define checkasm_bench_stop() ff_kperf_cycles()
-#elif defined (AV_READ_TIME)
-#define checkasm_bench_start() AV_READ_TIME()
-#define checkasm_bench_stop() AV_READ_TIME()
-#else
-#define checkasm_bench_start() UINT64_C(0)
-#define checkasm_bench_stop() UINT64_C(0)
-#endif
-
 /* Benchmark the function */
 #define bench_new(...)\
     do {\
@@ -265,12 +242,12 @@ uint64_t checkasm_bench_linux_perf_stop(void);
             int ti, tcount = 0;\
             uint64_t t = 0; \
             for (ti = 0; ti < BENCH_RUNS; ti++) {\
-                t = checkasm_bench_start(); \
+                t = perf->start(); \
                 tfunc(__VA_ARGS__);\
                 tfunc(__VA_ARGS__);\
                 tfunc(__VA_ARGS__);\
                 tfunc(__VA_ARGS__);\
-                t = checkasm_bench_stop() - t;\
+                t = perf->stop() - t;\
                 if (t*tcount <= tsum*4 && ti > 0) {\
                     tsum += t;\
                     tcount++;\