diff mbox series

[FFmpeg-devel,33/57] avcodec/mpv_reconstruct_mb_template: Don't unnecessarily copy data

Message ID GV1P250MB0737B8590B04701B3570871B8FC02@GV1P250MB0737.EURP250.PROD.OUTLOOK.COM
State New
Headers show
Series [FFmpeg-devel,01/57] avcodec/vc1: Combine identical checks | expand

Commit Message

Andreas Rheinhardt June 12, 2024, 1:48 p.m. UTC
There is no reason to use a temporary buffer as destination
for the new macroblock before copying it into its proper place.

(History: 3994623df2efd2749631c3492184dd8d4ffa9d1b changed
the precursor of ff_mpv_reconstruct_mb() to always decode
to the first row of macroblocks for B pictures when
a draw_horiz_band callback is set (they are exported to the caller
via said callback and each row overwrites the previously decoded
row; this was probably intended as a cache-optimization).
Later b68ab2609c67d07b6f12ed65125d76bf9a054479 changed this
to the current form in which a scratchpad buffer is used when
decoding B-pictures without draw_horiz_band callback, followed
by copying the block from the scratchpad buffer to the actual
destination. I do not know what the aim of this was. When thinking
of it as a cache optimization, it makes sense to not use it
when the aforementioned draw_horiz_band optimization is in effect,
because then the destination row can be presumed to be hot
already. But then it makes no sense to restrict this optimization
to B-frames.)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
---
 libavcodec/mpegpicture.h                 |  1 -
 libavcodec/mpv_reconstruct_mb_template.c | 23 ++---------------------
 2 files changed, 2 insertions(+), 22 deletions(-)

Comments

Michael Niedermayer June 13, 2024, 7:36 a.m. UTC | #1
On Wed, Jun 12, 2024 at 03:48:29PM +0200, Andreas Rheinhardt wrote:
> There is no reason to use a temporary buffer as destination
> for the new macroblock before copying it into its proper place.
> 
> (History: 3994623df2efd2749631c3492184dd8d4ffa9d1b changed
> the precursor of ff_mpv_reconstruct_mb() to always decode
> to the first row of macroblocks for B pictures when
> a draw_horiz_band callback is set (they are exported to the caller
> via said callback and each row overwrites the previously decoded
> row; this was probably intended as a cache-optimization).
> Later b68ab2609c67d07b6f12ed65125d76bf9a054479 changed this
> to the current form in which a scratchpad buffer is used when
> decoding B-pictures without draw_horiz_band callback, followed
> by copying the block from the scratchpad buffer to the actual
> destination. I do not know what the aim of this was. When thinking
> of it as a cache optimization, it makes sense to not use it
> when the aforementioned draw_horiz_band optimization is in effect,
> because then the destination row can be presumed to be hot
> already. But then it makes no sense to restrict this optimization
> to B-frames.)

IIRC
The B frames where directly placed in GPU memory, which is slow to read
from, so building up MC + IDCT (which could read depending on caches)
was avoided by a a scratchpad but its really long ago so i might remember
this only partly

thx

[...]
Andreas Rheinhardt June 13, 2024, 8:50 a.m. UTC | #2
Michael Niedermayer:
> On Wed, Jun 12, 2024 at 03:48:29PM +0200, Andreas Rheinhardt wrote:
>> There is no reason to use a temporary buffer as destination
>> for the new macroblock before copying it into its proper place.
>>
>> (History: 3994623df2efd2749631c3492184dd8d4ffa9d1b changed
>> the precursor of ff_mpv_reconstruct_mb() to always decode
>> to the first row of macroblocks for B pictures when
>> a draw_horiz_band callback is set (they are exported to the caller
>> via said callback and each row overwrites the previously decoded
>> row; this was probably intended as a cache-optimization).
>> Later b68ab2609c67d07b6f12ed65125d76bf9a054479 changed this
>> to the current form in which a scratchpad buffer is used when
>> decoding B-pictures without draw_horiz_band callback, followed
>> by copying the block from the scratchpad buffer to the actual
>> destination. I do not know what the aim of this was. When thinking
>> of it as a cache optimization, it makes sense to not use it
>> when the aforementioned draw_horiz_band optimization is in effect,
>> because then the destination row can be presumed to be hot
>> already. But then it makes no sense to restrict this optimization
>> to B-frames.)
> 
> IIRC
> The B frames where directly placed in GPU memory, which is slow to read
> from, so building up MC + IDCT (which could read depending on caches)
> was avoided by a a scratchpad but its really long ago so i might remember
> this only partly
> 

The code here is not executed for hardware acceleration at all. For the
ordinary case, this copy incurs overhead; furthermore it increases
complexity (and for these reasons no other codec has similar code). So
I'd like to remove it (with an amended commit message that says that
this was done due to concerns about reading from GPU memory). Do you
object to this?

- Andreas
Michael Niedermayer June 13, 2024, 9:19 p.m. UTC | #3
On Thu, Jun 13, 2024 at 10:50:36AM +0200, Andreas Rheinhardt wrote:
> Michael Niedermayer:
> > On Wed, Jun 12, 2024 at 03:48:29PM +0200, Andreas Rheinhardt wrote:
> >> There is no reason to use a temporary buffer as destination
> >> for the new macroblock before copying it into its proper place.
> >>
> >> (History: 3994623df2efd2749631c3492184dd8d4ffa9d1b changed
> >> the precursor of ff_mpv_reconstruct_mb() to always decode
> >> to the first row of macroblocks for B pictures when
> >> a draw_horiz_band callback is set (they are exported to the caller
> >> via said callback and each row overwrites the previously decoded
> >> row; this was probably intended as a cache-optimization).
> >> Later b68ab2609c67d07b6f12ed65125d76bf9a054479 changed this
> >> to the current form in which a scratchpad buffer is used when
> >> decoding B-pictures without draw_horiz_band callback, followed
> >> by copying the block from the scratchpad buffer to the actual
> >> destination. I do not know what the aim of this was. When thinking
> >> of it as a cache optimization, it makes sense to not use it
> >> when the aforementioned draw_horiz_band optimization is in effect,
> >> because then the destination row can be presumed to be hot
> >> already. But then it makes no sense to restrict this optimization
> >> to B-frames.)
> > 
> > IIRC
> > The B frames where directly placed in GPU memory, which is slow to read
> > from, so building up MC + IDCT (which could read depending on caches)
> > was avoided by a a scratchpad but its really long ago so i might remember
> > this only partly
> > 
> 
> The code here is not executed for hardware acceleration at all. For the
> ordinary case, this copy incurs overhead; furthermore it increases
> complexity (and for these reasons no other codec has similar code). So
> I'd like to remove it (with an amended commit message that says that
> this was done due to concerns about reading from GPU memory). Do you
> object to this?

i dont object, i was trying to explain why this was there. This predates
modern hw acceleration. The memory in AVFrame.data[] for a normal
pixel format can be mapped to GPU memory. FFmpeg itself probably did not
set it up this way but some applications did do this with get_buffer()
callbacks VERY long ago IIRC to avoid having to copy the image later
when the application belived the image would not be read from. (like a B frame
in MPEG-2 times)

thx

[...]
diff mbox series

Patch

diff --git a/libavcodec/mpegpicture.h b/libavcodec/mpegpicture.h
index 86504fe8ca..d3f39bbae6 100644
--- a/libavcodec/mpegpicture.h
+++ b/libavcodec/mpegpicture.h
@@ -35,7 +35,6 @@  typedef struct ScratchpadContext {
     uint8_t *obmc_scratchpad;
     union {
         uint8_t *scratchpad_buf;  ///< the other *_scratchpad point into this buffer
-        uint8_t *b_scratchpad;    ///< scratchpad used for writing into write only buffers
         uint8_t *rd_scratchpad;   ///< scratchpad for rate distortion mb decision
     };
     int      linesize;            ///< linesize that the buffers in this context have been allocated for
diff --git a/libavcodec/mpv_reconstruct_mb_template.c b/libavcodec/mpv_reconstruct_mb_template.c
index 6ad353ddfd..e39b3d0e73 100644
--- a/libavcodec/mpv_reconstruct_mb_template.c
+++ b/libavcodec/mpv_reconstruct_mb_template.c
@@ -80,11 +80,10 @@  void mpv_reconstruct_mb_internal(MpegEncContext *s, int16_t block[12][64],
           s->avctx->mb_decision != FF_MB_DECISION_RD))  // FIXME precalc
 #endif /* IS_ENCODER */
     {
-        uint8_t *dest_y, *dest_cb, *dest_cr;
+        uint8_t *dest_y = s->dest[0], *dest_cb = s->dest[1], *dest_cr = s->dest[2];
         int dct_linesize, dct_offset;
         const int linesize   = s->cur_pic.linesize[0]; //not s->linesize as this would be wrong for field pics
         const int uvlinesize = s->cur_pic.linesize[1];
-        const int readable   = IS_ENCODER || lowres_flag || s->pict_type != AV_PICTURE_TYPE_B;
         const int block_size = lowres_flag ? 8 >> s->avctx->lowres : 8;
 
         /* avoid copy if macroblock skipped in last frame too */
@@ -106,16 +105,6 @@  void mpv_reconstruct_mb_internal(MpegEncContext *s, int16_t block[12][64],
         dct_linesize = linesize << s->interlaced_dct;
         dct_offset   = s->interlaced_dct ? linesize : linesize * block_size;
 
-        if (readable) {
-            dest_y  = s->dest[0];
-            dest_cb = s->dest[1];
-            dest_cr = s->dest[2];
-        } else {
-            dest_y  = s->sc.b_scratchpad;
-            dest_cb = s->sc.b_scratchpad + 16 * linesize;
-            dest_cr = s->sc.b_scratchpad + 32 * linesize;
-        }
-
         if (!s->mb_intra) {
             /* motion handling */
             /* decoding or more than one mb_type (MC was already done otherwise) */
@@ -169,7 +158,7 @@  void mpv_reconstruct_mb_internal(MpegEncContext *s, int16_t block[12][64],
                 if(  (s->avctx->skip_idct >= AVDISCARD_NONREF && s->pict_type == AV_PICTURE_TYPE_B)
                    ||(s->avctx->skip_idct >= AVDISCARD_NONKEY && s->pict_type != AV_PICTURE_TYPE_I)
                    || s->avctx->skip_idct >= AVDISCARD_ALL)
-                    goto skip_idct;
+                    return;
             }
 
             /* add dct residue */
@@ -288,14 +277,6 @@  void mpv_reconstruct_mb_internal(MpegEncContext *s, int16_t block[12][64],
                     }
                 } //gray
             }
-        }
-skip_idct:
-        if (!readable) {
-            s->hdsp.put_pixels_tab[0][0](s->dest[0], dest_y, linesize, 16);
-            if (!CONFIG_GRAY || !(s->avctx->flags & AV_CODEC_FLAG_GRAY)) {
-                s->hdsp.put_pixels_tab[s->chroma_x_shift][0](s->dest[1], dest_cb, uvlinesize, 16 >> s->chroma_y_shift);
-                s->hdsp.put_pixels_tab[s->chroma_x_shift][0](s->dest[2], dest_cr, uvlinesize, 16 >> s->chroma_y_shift);
-            }
 #endif /* !IS_ENCODER */
         }
     }