Submitted by Andreas Rheinhardt on June 9, 2019, 11 a.m.

Message ID: 20190609110053.4012-6-andreas.rheinhardt@gmail.com

State: New


Andreas,

On Sun, 09. Jun 13:00, Andreas Rheinhardt wrote:
> Up until now, the bitmasks used to initially find out when one needs
> to take a closer look and search for startcodes were rather primitive:
> if a block (of four or eight bytes, depending on the system) contained a
> zero byte, it was treated as a target for closer inspection.
> This can be improved: using the same technique of bitmasking as before,
> one can test whether an arbitrary contiguous range of bits consists of
> zeros alone (and one can OR the results of tests for non-overlapping
> ranges of bits). A simple improvement would be to test whether every
> even byte is nonzero. But even better masks are possible, in particular
> for big-endian systems:
> If all of the bits called 'a' or all of the bits called 'b' in the
> uint32_t aaaa aaaa aaaa aaax bbbb bbbb bbbb bbby vanish, then this is
> a candidate for closer inspection. If not, then it follows that none of
> the three least significant bytes can be the position of the 0x01 byte
> of a 0x00 0x00 0x01 startcode. And if the most significant byte is the
> position of the 0x01 of a startcode and the same test is performed on
> each block of four bytes in sequence, then all the b bits (as well as
> the y bit) of the previous uint32_t would vanish and it would have been
> considered a candidate for closer inspection. In other words: all
> startcodes can be found with this test. The number of candidates is
> thereby reduced by a factor of about 256 (from about 4/2^8 to about
> 2/2^15). For eight bytes at a time, the patterns are simply applied to
> the high and low 32 bits at the same time.
> Unfortunately, not as much is possible for little-endian systems,
> because contiguous blocks of bits in the input byte array won't be
> contiguous blocks in the integer read.
> But one can nevertheless improve upon the "check every even/odd byte"
> test using the following mask:
> aaaa aaaa aaaa bbbb bbbb bbbb cccc cccc
> If any of these three blocks of bits vanishes, the block is a candidate
> for further inspection. In the byte array, this mask corresponds to the
> following situation:
> cccc cccc bbbb bbbb aaaa bbbb aaaa aaaa
> If a startcode's 0x01 is at the second or third position, the c block
> vanishes; if it is at the fourth position, the b block vanishes; and if
> it is at the first position, the a block of the previous four-byte block
> vanishes. (This derivation only used the fact that there must be two
> vanishing bytes in sequence; I couldn't find a way to exploit the fact
> that the third byte also has nearly no bits set.) So all startcodes can
> be found with this mask. Doubling this mask yields a mask for eight
> bytes at a time; I haven't found a better mask for eight bytes than
> this one. Using this, one can reduce the number of false positives by
> about 72% compared to the earlier check of whether one of the bytes
> vanishes.
>
> Being more effective at initial filtering makes the whole process more
> efficient, of course. Here are some benchmarks for a 30.2 Mb/s transport
> stream A (10 iterations of 8192 runs each) and another 7.4 Mb/s
> transport stream B (10 iterations of 131072 runs each) on an x64;
> except for "vanilla", none of the versions tested returned false
> positives.
>
>   | vanilla     | aligned   | aligned    |
>   | (unaligned) | bad masks | good masks |
> --|-------------|-----------|------------|
> A | 411578      | 323355    | 233494     |
> B | 55476       | 44147     | 31793      |
>
> "vanilla" is the version before this patchset, i.e. before the switch
> to aligned reads. "aligned bad masks" is the version before this
> commit. "aligned good masks" is the current version.
> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@gmail.com>
> ---
>  libavcodec/startcode.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/libavcodec/startcode.c b/libavcodec/startcode.c
> index b29c1490c0..fa506a37d1 100644
> --- a/libavcodec/startcode.c
> +++ b/libavcodec/startcode.c
> @@ -95,10 +95,18 @@ int ff_startcode_find_candidate_c(const uint8_t *buf, int size)
>
>  #if HAVE_FAST_64BIT
>      INITIALIZATION(8);
> -    MAIN_LOOP(64, 0x0101010101010101ULL, 0x8080808080808080ULL);
> +#if HAVE_BIGENDIAN
> +    MAIN_LOOP(64, 0x0002000200020002ULL, 0x8000800080008000ULL);
> +#else
> +    MAIN_LOOP(64, 0x0010010100100101ULL, 0x8008008080080080ULL);
> +#endif
>  #else
>      INITIALIZATION(4);
> -    MAIN_LOOP(32, 0x01010101U, 0x80808080U);
> +#if HAVE_BIGENDIAN
> +    MAIN_LOOP(32, 0x00020002U, 0x80008000U);
> +#else
> +    MAIN_LOOP(32, 0x00100101U, 0x80080080U);
> +#endif
>  #endif
>
>      /* For the end, buf is the earliest possible position
> --
> 2.21.0

Below are the results of this patchset on Armv7 using an Odroid XU4.
The Odroid XU4 has 8 cores (4x ARM Cortex-A15 and 4x ARM Cortex-A7);
only the A15 cores were used for this test.

Test sequences:
Sequence C - 2560x1440, 20.4 Mb/s, 8105 frames
Source: https://www.youtube.com/watch?v=1La4QzGeaaQ
Sequence D - 1920x1080, 4.8 Mb/s, 18808 frames
Source: https://www.youtube.com/watch?v=LXb3EKWsInQ

Averaged results:
Sequence C - 10 iterations with 16384 runs
Sequence D - 10 iterations with ~32768 runs

  | before      | first patch | full patchset |
  | patchset    | only - 1/5  | 5/5           |
  | (decicycles)| (decicycles)| (decicycles)  |
--|-------------|-------------|---------------|
C | 910181      | 864827      | 861376        |
D | 74798       | 70529       | 69574         |

I also have a Jetson Nano that uses Armv8. Let me know if you need these
benchmarks on 64-bit Arm.

Andriy

