diff mbox

[FFmpeg-devel,5/5] startcode: Filter out non-startcodes earlier

Message ID 20190609110053.4012-6-andreas.rheinhardt@gmail.com
State New
Headers show

Commit Message

Andreas Rheinhardt June 9, 2019, 11 a.m. UTC
Up until now, the bitmasks used to initially find out when one needs
to take a closer look and search for startcodes were rather primitive:
If a block (of four or eight bytes, depending on the system) contained a
zero, it was treated as a target for closer inspection.
This can be improved: Using the same technique of bitmasking as before,
one can test whether an arbitrary continuous range of bits consists of
zeros alone (and one can or the results of tests for non-overlapping
ranges of bits). A simple improvement would consist in simply testing
for whether every even byte does not vanish. But even better masks are
possible, in particular for big endian systems:
If all of the bits called 'a' or all of the bits called 'b' in the
uint32_t aaaa aaaa aaaa aaax bbbb bbbb bbbb bbby vanish, then this is
a candidate for closer inspection. If not, then it follows that
the three least serious bytes can't be the position of the 0x01 byte of
a 0x00 0x00 0x01 startcode. And if the most significant byte is the
position of the 0x01 of a startcode and if the same test is performed on
each block of four bytes in sequence, then all the b bits (as well as
the y bit) of the previous uint32_t would vanish and it would have been
considered a candidate for closer inspection. In other words: All
startcodes can be found with this test. The amount of candidates is
thereby reduced by a factor of about 256 (from about 4/2^8 to about
2/2^15). For eight bytes at a time, the patterns are simply applied to
the high and low 32 bits at the same time.
Unfortunately, not so much is possible for little-endian systems because
continuous blocks of bits in the input byte array won't be continuous
blocks in the integer read. But one can nevertheless even improve upon
the "check every even/odd byte" test using the following mask:
aaaa aaaa aaaa bbbb bbbb bbbb cccc cccc
If any of these three blocks of bits vanishes, the block is a candidate
for further inspection. In the byte array, this mask corresponds to the
following situation:
cccc cccc bbbb bbbb aaaa bbbb aaaa aaaa
If a startcode's 0x01 is at the second or third position, the c block
vanishes; if it is at the fourth position, the b block vanishes and if
it is at the first position, the a block of the previous four-byte block
vanishes. (This derivation just made use of the fact that there must be
two vanishing bytes in sequence; I couldn't find a way to utilize that
the third byte also has nearly no bits set.) So all startcodes can be
found with this mask. Doubling this mask yields a mask for eight bytes
at a time. I haven't found a better mask for eight bytes than this one.
Using this, one can detect reduce the amount of false positives by about
72% compared to the earlier check whether one of the bytes vanishes.

Being more effective at initial filtering makes the whole process more
efficient, of course. Here are some benchmarks for a 30.2 Mb/s transport
stream A (10 iterations of 8192 runs each) and another 7.4 Mb/s transport
stream B (10 iterations of 131072 runs each) on an x64; except for
"vanilla" all versions tested didn't return false positives.

  |   vanilla   |  aligned  |  aligned   |
  | (unaligned) | bad masks | good masks |
--|-------------|-----------|------------|
A |    411578   |   323355  |    233494  |
B |     55476   |    44147  |     31793  |

"vanilla" is the version before this patchset, i.e. before the switch to
aligned reads. "aligned bad masks" is the version before this commit.
"aligned good masks" is the current version.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@gmail.com>
---
 libavcodec/startcode.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

Comments

Andriy Gelman June 30, 2019, 11:53 p.m. UTC | #1
Andreas, 

On Sun, 09. Jun 13:00, Andreas Rheinhardt wrote:
> Up until now, the bitmasks used to initially find out when one needs
> to take a closer look and search for startcodes were rather primitive:
> If a block (of four or eight bytes, depending on the system) contained a
> zero, it was treated as a target for closer inspection.
> This can be improved: Using the same technique of bitmasking as before,
> one can test whether an arbitrary continuous range of bits consists of
> zeros alone (and one can or the results of tests for non-overlapping
> ranges of bits). A simple improvement would consist in simply testing
> for whether every even byte does not vanish. But even better masks are
> possible, in particular for big endian systems:
> If all of the bits called 'a' or all of the bits called 'b' in the
> uint32_t aaaa aaaa aaaa aaax bbbb bbbb bbbb bbby vanish, then this is
> a candidate for closer inspection. If not, then it follows that
> the three least serious bytes can't be the position of the 0x01 byte of
> a 0x00 0x00 0x01 startcode. And if the most significant byte is the
> position of the 0x01 of a startcode and if the same test is performed on
> each block of four bytes in sequence, then all the b bits (as well as
> the y bit) of the previous uint32_t would vanish and it would have been
> considered a candidate for closer inspection. In other words: All
> startcodes can be found with this test. The amount of candidates is
> thereby reduced by a factor of about 256 (from about 4/2^8 to about
> 2/2^15). For eight bytes at a time, the patterns are simply applied to
> the high and low 32 bits at the same time.
> Unfortunately, not so much is possible for little-endian systems because
> continuous blocks of bits in the input byte array won't be continuous
> blocks in the integer read. But one can nevertheless even improve upon
> the "check every even/odd byte" test using the following mask:
> aaaa aaaa aaaa bbbb bbbb bbbb cccc cccc
> If any of these three blocks of bits vanishes, the block is a candidate
> for further inspection. In the byte array, this mask corresponds to the
> following situation:
> cccc cccc bbbb bbbb aaaa bbbb aaaa aaaa
> If a startcode's 0x01 is at the second or third position, the c block
> vanishes; if it is at the fourth position, the b block vanishes and if
> it is at the first position, the a block of the previous four-byte block
> vanishes. (This derivation just made use of the fact that there must be
> two vanishing bytes in sequence; I couldn't find a way to utilize that
> the third byte also has nearly no bits set.) So all startcodes can be
> found with this mask. Doubling this mask yields a mask for eight bytes
> at a time. I haven't found a better mask for eight bytes than this one.
> Using this, one can detect reduce the amount of false positives by about
> 72% compared to the earlier check whether one of the bytes vanishes.
> 
> Being more effective at initial filtering makes the whole process more
> efficient, of course. Here are some benchmarks for a 30.2 Mb/s transport
> stream A (10 iterations of 8192 runs each) and another 7.4 Mb/s transport
> stream B (10 iterations of 131072 runs each) on an x64; except for
> "vanilla" all versions tested didn't return false positives.
> 
>   |   vanilla   |  aligned  |  aligned   |
>   | (unaligned) | bad masks | good masks |
> --|-------------|-----------|------------|
> A |    411578   |   323355  |    233494  |
> B |     55476   |    44147  |     31793  |
> 
> "vanilla" is the version before this patchset, i.e. before the switch to
> aligned reads. "aligned bad masks" is the version before this commit.
> "aligned good masks" is the current version.
> 
> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@gmail.com>
> ---
>  libavcodec/startcode.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/libavcodec/startcode.c b/libavcodec/startcode.c
> index b29c1490c0..fa506a37d1 100644
> --- a/libavcodec/startcode.c
> +++ b/libavcodec/startcode.c
> @@ -95,10 +95,18 @@ int ff_startcode_find_candidate_c(const uint8_t *buf, int size)
>  
>  #if HAVE_FAST_64BIT
>      INITIALIZATION(8);
> -    MAIN_LOOP(64, 0x0101010101010101ULL, 0x8080808080808080ULL);
> +#if HAVE_BIGENDIAN
> +    MAIN_LOOP(64, 0x0002000200020002ULL, 0x8000800080008000ULL);
> +#else
> +    MAIN_LOOP(64, 0x0010010100100101ULL, 0x8008008080080080ULL);
> +#endif
>  #else
>      INITIALIZATION(4);
> -    MAIN_LOOP(32, 0x01010101U, 0x80808080U);
> +#if HAVE_BIGENDIAN
> +    MAIN_LOOP(32, 0x00020002U, 0x80008000U);
> +#else
> +    MAIN_LOOP(32, 0x00100101U, 0x80080080U);
> +#endif
>  #endif
>  
>      /* For the end, buf is the earliest possible position
> -- 
> 2.21.0

Below are the results of this patchset on Armv7 using OdroidXU4. 
OdroidXU4 has 8 cores - 4xARM Cortex A15 and 4xARM Cortex A7. Only the A15 cores were used for this test. 

Test sequences:
Sequence C - 2560x1440, 20.4Mb/s, 8105 frames      
Source: https://www.youtube.com/watch?v=1La4QzGeaaQ

Sequence D - 1920x1080, 4.8Mb/s, 18808 frames  
Source: https://www.youtube.com/watch\?v\=LXb3EKWsInQ

Averaged results:
Sequence C - 10 iterations with 16384  runs
Sequence D - 10 iterations with ~32768 runs

  |   before    | first patch | full patchset |
  |  patchset   |  only - 1/5 |     5/5       | 
  | (decicyles) | (decicycles)|  (decicyles)  |
--|-------------|-------------|---------------|
C |   910181    |   864827    |    861376     |
D |    74798    |    70529    |     69574     |   

I also have a Jetson Nano that uses Armv8. Let me know if you need these simulations on 64bit Arm.

Andriy
diff mbox

Patch

diff --git a/libavcodec/startcode.c b/libavcodec/startcode.c
index b29c1490c0..fa506a37d1 100644
--- a/libavcodec/startcode.c
+++ b/libavcodec/startcode.c
@@ -95,10 +95,18 @@  int ff_startcode_find_candidate_c(const uint8_t *buf, int size)
 
 #if HAVE_FAST_64BIT
     INITIALIZATION(8);
-    MAIN_LOOP(64, 0x0101010101010101ULL, 0x8080808080808080ULL);
+#if HAVE_BIGENDIAN
+    MAIN_LOOP(64, 0x0002000200020002ULL, 0x8000800080008000ULL);
+#else
+    MAIN_LOOP(64, 0x0010010100100101ULL, 0x8008008080080080ULL);
+#endif
 #else
     INITIALIZATION(4);
-    MAIN_LOOP(32, 0x01010101U, 0x80808080U);
+#if HAVE_BIGENDIAN
+    MAIN_LOOP(32, 0x00020002U, 0x80008000U);
+#else
+    MAIN_LOOP(32, 0x00100101U, 0x80080080U);
+#endif
 #endif
 
     /* For the end, buf is the earliest possible position