[FFmpeg-devel,v2] fbdetile cpu based framebuffer layout detiling v02

Message ID 20200627195739.9056-1-hanishkvc@gmail.com
State New
Series [FFmpeg-devel,v2] fbdetile cpu based framebuffer layout detiling v02

Checks

Context Check Description
andriy/default pending
andriy/make_warn warning New warnings during build
andriy/make success Make finished
andriy/make_fate success Make fate finished

Commit Message

C Hanish Menon June 27, 2020, 7:57 p.m. UTC
v02-20200627IST2331

Unrolled Intel Legacy Tile-Y detiling logic.

Also consolidated into a single patch file, instead of the previous
development flow based multiple patch files.

v01-20200627IST1308

Implemented Intel Legacy Tile-X and Tile-Y detiling logic

NOTES:

This video filter converts tiled framebuffers into a linear layout,
using logic running on the CPU.

Currently it supports Intel Legacy Tile-X and Tile-Y layout detiling.
This should help one to work with frames captured (say using kmsgrab)
on laptops having an Intel GPU.

The Tile-X conversion logic has been explicitly cross-checked with Tile-X
based frames. However, the Tile-Y conversion logic has not yet been tested
with Tile-Y based frames; it should do the job based on my current
understanding of the Tile-Y layout format, but that remains unverified.

TODO1: At a later time, generate Tile-Y based frames and then
cross-check the corresponding logic explicitly.

TODO2: Maybe use OpenGL or Vulkan buffer helper routines to do the
layout conversion. However, some online discussions from a while back seem
to indicate that this path is not fully bug-free currently.
---
 Changelog                 |   1 +
 doc/filters.texi          |  62 ++++++++
 libavfilter/Makefile      |   1 +
 libavfilter/allfilters.c  |   1 +
 libavfilter/vf_fbdetile.c | 309 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 374 insertions(+)
 create mode 100644 libavfilter/vf_fbdetile.c
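
For illustration, the Tile-X copy pattern that the patch unrolls by hand
can also be written as a plain loop. This is only a minimal sketch and not
code from the patch: the helper name detile_tilex_sketch and its parameters
are hypothetical, it assumes 512-byte by 8-row tiles stored back to back in
row-major tile order (the same assumption the patch makes), and it simply
rejects frames that do not cover a whole number of tiles.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed Tile-X geometry: each tile is 512 bytes wide and 8 rows high. */
enum { TILEX_ROW_BYTES = 512, TILEX_ROWS = 8 };

/*
 * Hypothetical helper (not part of the patch): copy a Tile-X surface into
 * a linear buffer. width_bytes is the visible width in bytes, e.g. w*4 for
 * a 32-bit format. Returns 0 on success, -1 if the geometry is unsupported.
 */
static int detile_tilex_sketch(uint8_t *dst, int dst_linesize,
                               const uint8_t *src,
                               int width_bytes, int height)
{
    assert(dst && src);

    /* Partial tiles and too-small destination pitches are not handled. */
    if (width_bytes <= 0 || height <= 0 ||
        width_bytes % TILEX_ROW_BYTES || height % TILEX_ROWS ||
        dst_linesize < width_bytes)
        return -1;

    /* Walk the tiles in storage order: left to right, then top to bottom. */
    for (int ty = 0; ty < height; ty += TILEX_ROWS) {
        for (int tx = 0; tx < width_bytes; tx += TILEX_ROW_BYTES) {
            /* Inside a tile, the 8 rows are stored one after another. */
            for (int r = 0; r < TILEX_ROWS; r++) {
                memcpy(dst + (size_t)(ty + r) * dst_linesize + tx,
                       src, TILEX_ROW_BYTES);
                src += TILEX_ROW_BYTES;
            }
        }
    }
    return 0;
}
```

For a 32-bit 1920x1080 frame, width_bytes would be 7680, i.e. 15 tiles per
tile row, and 1080 is a multiple of 8, so such a frame passes the
whole-tile check in this sketch.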

Comments

Paul B Mahol June 27, 2020, 8 p.m. UTC | #1
What is this?

Missing documentation.
NAK

On 6/27/20, hanishkvc <hanishkvc@gmail.com> wrote:
> v02-20200627IST2331
>
> Unrolled Intel Legacy Tile-Y detiling logic.
>
> Also consolidated into a single patch file, instead of the previous
> development flow based multiple patch files.
>
> v01-20200627IST1308
>
> Implemented Intel Legacy Tile-X and Tile-Y detiling logic
>
> NOTES:
>
> This video filter converts tiled framebuffers into a linear layout,
> using logic running on the CPU.
>
> Currently it supports Intel Legacy Tile-X and Tile-Y layout detiling.
> This should help one to work with frames captured (say using kmsgrab)
> on laptops having an Intel GPU.
>
> The Tile-X conversion logic has been explicitly cross-checked with Tile-X
> based frames. However, the Tile-Y conversion logic has not yet been tested
> with Tile-Y based frames; it should do the job based on my current
> understanding of the Tile-Y layout format, but that remains unverified.
>
> TODO1: At a later time, generate Tile-Y based frames and then
> cross-check the corresponding logic explicitly.
>
> TODO2: Maybe use OpenGL or Vulkan buffer helper routines to do the
> layout conversion. However, some online discussions from a while back seem
> to indicate that this path is not fully bug-free currently.
> ---
>  Changelog                 |   1 +
>  doc/filters.texi          |  62 ++++++++
>  libavfilter/Makefile      |   1 +
>  libavfilter/allfilters.c  |   1 +
>  libavfilter/vf_fbdetile.c | 309 ++++++++++++++++++++++++++++++++++++++
>  5 files changed, 374 insertions(+)
>  create mode 100644 libavfilter/vf_fbdetile.c
>
> diff --git a/Changelog b/Changelog
> index a60e7d2eb8..0e03491f6a 100644
> --- a/Changelog
> +++ b/Changelog
> @@ -2,6 +2,7 @@ Entries are sorted chronologically from oldest to youngest within each release,
>  releases are sorted from youngest to oldest.
>
>  version <next>:
> +- fbdetile cpu based framebuffer layout detiling video filter
>  - AudioToolbox output device
>  - MacCaption demuxer
>
> diff --git a/doc/filters.texi b/doc/filters.texi
> index 3c2dd2eb90..73ba21af89 100644
> --- a/doc/filters.texi
> +++ b/doc/filters.texi
> @@ -12210,6 +12210,68 @@ It accepts the following optional parameters:
>  The number of the CUDA device to use
>  @end table
>
> +@anchor{fbdetile}
> +@section fbdetile
> +
> +Detile a tiled framebuffer layout into a linear layout using the CPU.
> +
> +It currently supports conversion from the Intel legacy Tile-X and Tile-Y
> +layouts into a linear layout. This is useful when using kmsgrab and hwdownload
> +to capture a screen that uses one of these non-linear layouts.
> +
> +It currently expects the data to be in a 32-bit RGB based pixel format, and
> +it does not do any pixel format conversion. 16-bit RGB data will be enabled
> +later, as the logic is, at one level, transparent to it.
> +
> +This filter can be inserted into the filter chain during capture itself, or,
> +if that slows things down, it can instead be inserted into the filter
> +chain later, during playback or transcoding.
> +
> +It supports the following optional parameters:
> +
> +@table @option
> +@item type
> +Specify which detiling conversion to apply. The supported values are:
> +@table @var
> +@item 0, intelx
> +Intel Tile-X to linear conversion (the default).
> +@item 1, intely
> +Intel Tile-Y to linear conversion.
> +@end table
> +@end table
> +
> +To convert during capture itself, one could use
> +@example
> +ffmpeg -f kmsgrab -i - -vf "hwdownload, fbdetile" OUTPUT
> +@end example
> +
> +To convert after the tiled data has already been captured, one could use
> +@example
> +ffmpeg -i INPUT -vf "fbdetile" OUTPUT
> +@end example
> +@example
> +ffplay -i INPUT -vf "fbdetile"
> +@end example
> +
> +NOTE: While transcoding a test 1080p h264 stream of 276 frames, with two
> +runs of each case, the performance was as given below. However, this was
> +measured with the older, initial version of the logic, and it was run in
> +the default Linux chromebook->vm->container setup, so the absolute values
> +may not be representative. In a relative sense the overhead should be similar.
> +@example
> +rm out.mp4; time ./ffmpeg -i input.mp4 out.mp4
> +rm out.mp4; time ./ffmpeg -i input.mp4 -vf fbdetile=0 out.mp4
> +rm out.mp4; time ./ffmpeg -i input.mp4 -vf fbdetile=1 out.mp4
> +@end example
> +@table @option
> +@item with no fbdetile filter
> +it took ~7.28 secs,
> +@item with fbdetile=0 filter
> +it took ~8.69 secs,
> +@item with fbdetile=1 filter
> +it took ~9.20 secs.
> +@end table
> +
>  @section hqx
>
>  Apply a high-quality magnification filter designed for pixel art. This filter
> diff --git a/libavfilter/Makefile b/libavfilter/Makefile
> index 5123540653..bdb0c379ae 100644
> --- a/libavfilter/Makefile
> +++ b/libavfilter/Makefile
> @@ -280,6 +280,7 @@ OBJS-$(CONFIG_HWDOWNLOAD_FILTER)             += vf_hwdownload.o
>  OBJS-$(CONFIG_HWMAP_FILTER)                  += vf_hwmap.o
>  OBJS-$(CONFIG_HWUPLOAD_CUDA_FILTER)          += vf_hwupload_cuda.o
>  OBJS-$(CONFIG_HWUPLOAD_FILTER)               += vf_hwupload.o
> +OBJS-$(CONFIG_FBDETILE_FILTER)               += vf_fbdetile.o
>  OBJS-$(CONFIG_HYSTERESIS_FILTER)             += vf_hysteresis.o framesync.o
>  OBJS-$(CONFIG_IDET_FILTER)                   += vf_idet.o
>  OBJS-$(CONFIG_IL_FILTER)                     += vf_il.o
> diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c
> index 1183e40267..f8dceb2a88 100644
> --- a/libavfilter/allfilters.c
> +++ b/libavfilter/allfilters.c
> @@ -265,6 +265,7 @@ extern AVFilter ff_vf_hwdownload;
>  extern AVFilter ff_vf_hwmap;
>  extern AVFilter ff_vf_hwupload;
>  extern AVFilter ff_vf_hwupload_cuda;
> +extern AVFilter ff_vf_fbdetile;
>  extern AVFilter ff_vf_hysteresis;
>  extern AVFilter ff_vf_idet;
>  extern AVFilter ff_vf_il;
> diff --git a/libavfilter/vf_fbdetile.c b/libavfilter/vf_fbdetile.c
> new file mode 100644
> index 0000000000..8b20c96d2c
> --- /dev/null
> +++ b/libavfilter/vf_fbdetile.c
> @@ -0,0 +1,309 @@
> +/*
> + * Copyright (c) 2020 HanishKVC
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +/**
> + * @file
> + * Detile the framebuffer's tile layout using the CPU.
> + * Currently it supports detiling of the legacy Intel Tile-X and Tile-Y layouts.
> + *
> + */
> +
> +/*
> + * ToThink|Check: Optimisations
> + *
> + * Do the gcc settings used by ffmpeg allow memcpy | stringops inlining,
> + * loop unrolling, better native matching instructions, additional
> + * optimisations, ...
> + *
> + * Does gcc map to the optimal memcpy logic, based on the situation it is
> + * used in?
> + *
> + * If not, maybe look at vector_size or intrinsics or appropriate arch
> + * and cpu specific inline asm or ...
> + *
> + */
> +
> +#include "libavutil/avassert.h"
> +#include "libavutil/imgutils.h"
> +#include "libavutil/opt.h"
> +#include "avfilter.h"
> +#include "formats.h"
> +#include "internal.h"
> +#include "video.h"
> +
> +enum FilterMode {
> +    TYPE_INTELX,
> +    TYPE_INTELY,
> +    NB_TYPE
> +};
> +
> +typedef struct FBDetileContext {
> +    const AVClass *class;
> +    int width, height;
> +    int type;
> +} FBDetileContext;
> +
> +#define OFFSET(x) offsetof(FBDetileContext, x)
> +#define FLAGS AV_OPT_FLAG_FILTERING_PARAM|AV_OPT_FLAG_VIDEO_PARAM
> +static const AVOption fbdetile_options[] = {
> +    { "type", "set framebuffer format_modifier type", OFFSET(type), AV_OPT_TYPE_INT, {.i64=TYPE_INTELX}, 0, NB_TYPE-1, FLAGS, "type" },
> +        { "intelx", "Intel Tile-X layout", 0, AV_OPT_TYPE_CONST, {.i64=TYPE_INTELX}, INT_MIN, INT_MAX, FLAGS, "type" },
> +        { "intely", "Intel Tile-Y layout", 0, AV_OPT_TYPE_CONST, {.i64=TYPE_INTELY}, INT_MIN, INT_MAX, FLAGS, "type" },
> +    { NULL }
> +};
> +
> +AVFILTER_DEFINE_CLASS(fbdetile);
> +
> +static av_cold int init(AVFilterContext *ctx)
> +{
> +    FBDetileContext *fbdetile = ctx->priv;
> +
> +    if (fbdetile->type == TYPE_INTELX) {
> +        fprintf(stderr,"INFO:fbdetile:init: Intel tile-x to linear\n");
> +    } else if (fbdetile->type == TYPE_INTELY) {
> +        fprintf(stderr,"INFO:fbdetile:init: Intel tile-y to linear\n");
> +    } else {
> +        fprintf(stderr,"DBUG:fbdetile:init: Unknown Tile format specified, shouldnt reach here\n");
> +    }
> +    fbdetile->width = 1920;
> +    fbdetile->height = 1080;
> +    return 0;
> +}
> +
> +static int query_formats(AVFilterContext *ctx)
> +{
> +    // Currently only RGB based 32bit formats are specified
> +    // TODO: Technically the logic is transparent to 16bit RGB formats also
> +    static const enum AVPixelFormat pix_fmts[] = {AV_PIX_FMT_RGB0, AV_PIX_FMT_0RGB, AV_PIX_FMT_BGR0, AV_PIX_FMT_0BGR,
> +                                                  AV_PIX_FMT_RGBA, AV_PIX_FMT_ARGB, AV_PIX_FMT_BGRA, AV_PIX_FMT_ABGR,
> +                                                  AV_PIX_FMT_NONE};
> +    AVFilterFormats *fmts_list;
> +
> +    fmts_list = ff_make_format_list(pix_fmts);
> +    if (!fmts_list)
> +        return AVERROR(ENOMEM);
> +    return ff_set_common_formats(ctx, fmts_list);
> +}
> +
> +static int config_props(AVFilterLink *inlink)
> +{
> +    AVFilterContext *ctx = inlink->dst;
> +    FBDetileContext *fbdetile = ctx->priv;
> +
> +    fbdetile->width = inlink->w;
> +    fbdetile->height = inlink->h;
> +    fprintf(stderr,"DBUG:fbdetile:config_props: %d x %d\n", fbdetile->width, fbdetile->height);
> +
> +    return 0;
> +}
> +
> +static void detile_intelx(AVFilterContext *ctx, int w, int h,
> +                                uint8_t *dst, int dstLineSize,
> +                          const uint8_t *src, int srcLineSize)
> +{
> +    // Offsets and LineSize are in bytes
> +    int tileW = 128; // For a 32Bit / Pixel framebuffer, 512/4
> +    int tileH = 8;
> +
> +    if (w*4 != srcLineSize) {
> +        fprintf(stderr,"DBUG:fbdetile:intelx: w%dxh%d, dL%d, sL%d\n", w, h, dstLineSize, srcLineSize);
> +        fprintf(stderr,"ERRR:fbdetile:intelx: dont support LineSize | Pitch going beyond width\n");
> +    }
> +    int sO = 0;
> +    int dX = 0;
> +    int dY = 0;
> +    int nTRows = (w*h)/tileW;
> +    int cTR = 0;
> +    while (cTR < nTRows) {
> +        int dO = dY*dstLineSize + dX*4;
> +#ifdef DEBUG_FBTILE
> +        fprintf(stderr,"DBUG:fbdetile:intelx: dX%d dY%d, sO%d, dO%d\n", dX, dY, sO, dO);
> +#endif
> +        memcpy(dst+dO+0*dstLineSize, src+sO+0*512, 512);
> +        memcpy(dst+dO+1*dstLineSize, src+sO+1*512, 512);
> +        memcpy(dst+dO+2*dstLineSize, src+sO+2*512, 512);
> +        memcpy(dst+dO+3*dstLineSize, src+sO+3*512, 512);
> +        memcpy(dst+dO+4*dstLineSize, src+sO+4*512, 512);
> +        memcpy(dst+dO+5*dstLineSize, src+sO+5*512, 512);
> +        memcpy(dst+dO+6*dstLineSize, src+sO+6*512, 512);
> +        memcpy(dst+dO+7*dstLineSize, src+sO+7*512, 512);
> +        dX += tileW;
> +        if (dX >= w) {
> +            dX = 0;
> +            dY += 8;
> +        }
> +        sO = sO + 8*512;
> +        cTR += 8;
> +    }
> +}
> +
> +/*
> + * Intel Legacy Tile-Y layout conversion support
> + *
> + * Currently done in a simple, straightforward way. Two low-hanging
> + * optimisations that could be readily applied are
> + *
> + * a) unrolling the inner for loop
> + *    --- Given the small memcpy size, this should help. DONE
> + *
> + * b) using SIMD based 128-bit loads and stores along with prefetch
> + *    hinting.
> + *
> + *    TOTHINK|CHECK: Does memcpy already do this, and more, if the
> + *    situation is right?!
> + *
> + *    As the code (or even intrinsics) would be specific to each architecture,
> + *    avoiding this for now. Later have to check whether the vector_size
> + *    attribute and its implementation by gcc can handle different architectures
> + *    properly, such that it won't become worse than the memcpy provided for that
> + *    architecture.
> + *
> + * Or maybe the two Intel detiling logics could even be merged into one, as
> + * the semantics and flow are almost the same for both.
> + *
> + */
> +static void detile_intely(AVFilterContext *ctx, int w, int h,
> +                                uint8_t *dst, int dstLineSize,
> +                          const uint8_t *src, int srcLineSize)
> +{
> +    // Offsets and LineSize are in bytes
> +    int tileW = 4; // For a 32Bit / Pixel framebuffer, 16/4
> +    int tileH = 32;
> +
> +    if (w*4 != srcLineSize) {
> +        fprintf(stderr,"DBUG:fbdetile:intely: w%dxh%d, dL%d, sL%d\n", w, h, dstLineSize, srcLineSize);
> +        fprintf(stderr,"ERRR:fbdetile:intely: dont support LineSize | Pitch going beyond width\n");
> +    }
> +    int sO = 0;
> +    int dX = 0;
> +    int dY = 0;
> +    int nTRows = (w*h)/tileW;
> +    int cTR = 0;
> +    while (cTR < nTRows) {
> +        int dO = dY*dstLineSize + dX*4;
> +#ifdef DEBUG_FBTILE
> +        fprintf(stderr,"DBUG:fbdetile:intely: dX%d dY%d, sO%d, dO%d\n", dX, dY, sO, dO);
> +#endif
> +
> +        memcpy(dst+dO+0*dstLineSize, src+sO+0*16, 16);
> +        memcpy(dst+dO+1*dstLineSize, src+sO+1*16, 16);
> +        memcpy(dst+dO+2*dstLineSize, src+sO+2*16, 16);
> +        memcpy(dst+dO+3*dstLineSize, src+sO+3*16, 16);
> +        memcpy(dst+dO+4*dstLineSize, src+sO+4*16, 16);
> +        memcpy(dst+dO+5*dstLineSize, src+sO+5*16, 16);
> +        memcpy(dst+dO+6*dstLineSize, src+sO+6*16, 16);
> +        memcpy(dst+dO+7*dstLineSize, src+sO+7*16, 16);
> +        memcpy(dst+dO+8*dstLineSize, src+sO+8*16, 16);
> +        memcpy(dst+dO+9*dstLineSize, src+sO+9*16, 16);
> +        memcpy(dst+dO+10*dstLineSize, src+sO+10*16, 16);
> +        memcpy(dst+dO+11*dstLineSize, src+sO+11*16, 16);
> +        memcpy(dst+dO+12*dstLineSize, src+sO+12*16, 16);
> +        memcpy(dst+dO+13*dstLineSize, src+sO+13*16, 16);
> +        memcpy(dst+dO+14*dstLineSize, src+sO+14*16, 16);
> +        memcpy(dst+dO+15*dstLineSize, src+sO+15*16, 16);
> +        memcpy(dst+dO+16*dstLineSize, src+sO+16*16, 16);
> +        memcpy(dst+dO+17*dstLineSize, src+sO+17*16, 16);
> +        memcpy(dst+dO+18*dstLineSize, src+sO+18*16, 16);
> +        memcpy(dst+dO+19*dstLineSize, src+sO+19*16, 16);
> +        memcpy(dst+dO+20*dstLineSize, src+sO+20*16, 16);
> +        memcpy(dst+dO+21*dstLineSize, src+sO+21*16, 16);
> +        memcpy(dst+dO+22*dstLineSize, src+sO+22*16, 16);
> +        memcpy(dst+dO+23*dstLineSize, src+sO+23*16, 16);
> +        memcpy(dst+dO+24*dstLineSize, src+sO+24*16, 16);
> +        memcpy(dst+dO+25*dstLineSize, src+sO+25*16, 16);
> +        memcpy(dst+dO+26*dstLineSize, src+sO+26*16, 16);
> +        memcpy(dst+dO+27*dstLineSize, src+sO+27*16, 16);
> +        memcpy(dst+dO+28*dstLineSize, src+sO+28*16, 16);
> +        memcpy(dst+dO+29*dstLineSize, src+sO+29*16, 16);
> +        memcpy(dst+dO+30*dstLineSize, src+sO+30*16, 16);
> +        memcpy(dst+dO+31*dstLineSize, src+sO+31*16, 16);
> +
> +        dX += tileW;
> +        if (dX >= w) {
> +            dX = 0;
> +            dY += 32;
> +        }
> +        sO = sO + 32*16;
> +        cTR += 32;
> +    }
> +}
> +
> +static int filter_frame(AVFilterLink *inlink, AVFrame *in)
> +{
> +    AVFilterContext *ctx = inlink->dst;
> +    FBDetileContext *fbdetile = ctx->priv;
> +    AVFilterLink *outlink = ctx->outputs[0];
> +    AVFrame *out;
> +
> +    out = ff_get_video_buffer(outlink, outlink->w, outlink->h);
> +    if (!out) {
> +        av_frame_free(&in);
> +        return AVERROR(ENOMEM);
> +    }
> +    av_frame_copy_props(out, in);
> +
> +    if (fbdetile->type == TYPE_INTELX) {
> +        detile_intelx(ctx, fbdetile->width, fbdetile->height,
> +                      out->data[0], out->linesize[0],
> +                      in->data[0], in->linesize[0]);
> +    } else if (fbdetile->type == TYPE_INTELY) {
> +        detile_intely(ctx, fbdetile->width, fbdetile->height,
> +                      out->data[0], out->linesize[0],
> +                      in->data[0], in->linesize[0]);
> +    }
> +
> +    av_frame_free(&in);
> +    return ff_filter_frame(outlink, out);
> +}
> +
> +static av_cold void uninit(AVFilterContext *ctx)
> +{
> +
> +}
> +
> +static const AVFilterPad fbdetile_inputs[] = {
> +    {
> +        .name         = "default",
> +        .type         = AVMEDIA_TYPE_VIDEO,
> +        .config_props = config_props,
> +        .filter_frame = filter_frame,
> +    },
> +    { NULL }
> +};
> +
> +static const AVFilterPad fbdetile_outputs[] = {
> +    {
> +        .name = "default",
> +        .type = AVMEDIA_TYPE_VIDEO,
> +    },
> +    { NULL }
> +};
> +
> +AVFilter ff_vf_fbdetile = {
> +    .name          = "fbdetile",
> +    .description   = NULL_IF_CONFIG_SMALL("Detile Framebuffer using CPU"),
> +    .priv_size     = sizeof(FBDetileContext),
> +    .init          = init,
> +    .uninit        = uninit,
> +    .query_formats = query_formats,
> +    .inputs        = fbdetile_inputs,
> +    .outputs       = fbdetile_outputs,
> +    .priv_class    = &fbdetile_class,
> +};
> --
> 2.20.1
C Hanish Menon June 27, 2020, 8:12 p.m. UTC | #2
Hi,

It is a new video filter which I created to detile the Intel Tile-X and
Tile-Y framebuffer layouts into a linear layout, using logic which runs
on the CPU. It can be used when one uses kmsgrab and hwdownload to capture
the screen on an Intel GPU based system, so that one gets a proper
screen capture.

Without this, kmsgrab will generate an unusable, scrambled capture, because
the contents will be tiled. I had this issue a few days back when trying to
capture the screen under Wayland, so I created this.
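
To make the scrambling concrete, here is a rough sketch of where a given
byte of the picture ends up inside a Tile-X buffer. The function name
tilex_offset_sketch is hypothetical and not part of the patch, and it
assumes 512-byte by 8-row tiles in row-major tile order with a pitch that
is a multiple of 512 bytes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: byte offset inside a Tile-X surface of the byte
 * that belongs at linear position (x_bytes, y).
 * Assumes 512-byte x 8-row tiles stored in row-major tile order.
 */
static size_t tilex_offset_sketch(size_t x_bytes, size_t y, size_t pitch_bytes)
{
    const size_t tile_w = 512, tile_h = 8, tile_size = tile_w * tile_h;

    assert(pitch_bytes % tile_w == 0 && x_bytes < pitch_bytes);

    size_t tiles_per_row = pitch_bytes / tile_w;
    /* Which tile the byte falls into, counting tiles in storage order. */
    size_t tile_index = (y / tile_h) * tiles_per_row + x_bytes / tile_w;

    /* Start of that tile, plus the row inside it, plus the byte in the row. */
    return tile_index * tile_size + (y % tile_h) * tile_w + x_bytes % tile_w;
}
```

With a 7680-byte pitch (1920 pixels at 32 bits), the byte at x=0, y=1 sits
at offset 512 in the tiled buffer rather than at 7680, which is why a tiled
frame read as if it were linear looks scrambled.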

The submitted patch adds a section to doc/filters.texi, which describes
the same.
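
The Tile-Y path can be sketched as a plain loop as well. This mirrors the
patch's view of Tile-Y as consecutive 16-byte wide, 32-row columns and,
like the patch, it has not been checked against real Tile-Y frames.
detile_tiley_sketch is a hypothetical name, not something the patch
defines, and frames that do not cover whole columns are simply rejected.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed Tile-Y geometry as used by the patch: 16-byte x 32-row columns. */
enum { TILEY_COL_BYTES = 16, TILEY_ROWS = 32 };

/*
 * Hypothetical helper (not part of the patch): copy a Tile-Y surface into
 * a linear buffer. Returns 0 on success, -1 if the geometry is unsupported.
 */
static int detile_tiley_sketch(uint8_t *dst, int dst_linesize,
                               const uint8_t *src,
                               int width_bytes, int height)
{
    assert(dst && src);

    if (width_bytes <= 0 || height <= 0 ||
        width_bytes % TILEY_COL_BYTES || height % TILEY_ROWS ||
        dst_linesize < width_bytes)
        return -1;

    /* Columns are stored left to right, then the next band of 32 rows. */
    for (int ty = 0; ty < height; ty += TILEY_ROWS) {
        for (int cx = 0; cx < width_bytes; cx += TILEY_COL_BYTES) {
            /* Each column holds 32 consecutive 16-byte row segments. */
            for (int r = 0; r < TILEY_ROWS; r++) {
                memcpy(dst + (size_t)(ty + r) * dst_linesize + cx,
                       src, TILEY_COL_BYTES);
                src += TILEY_COL_BYTES;
            }
        }
    }
    return 0;
}
```

Feeding this a synthetic buffer in which every 16-byte segment carries its
own index is one cheap way to see the mapping before real Tile-Y frames
are available.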



On Sun, Jun 28, 2020 at 1:30 AM Paul B Mahol <onemda@gmail.com> wrote:

> What is this?
>
> Missing documentation.
> NAK
>
> On 6/27/20, hanishkvc <hanishkvc@gmail.com> wrote:
> > v02-20200627IST2331
> >
> > Unrolled Intel Legacy Tile-Y detiling logic.
> >
> > Also consolidated into a single patch file, instead of the previous
> > development flow based multiple patch files.
> >
> > v01-20200627IST1308
> >
> > Implemented Intel Legacy Tile-X and Tile-Y detiling logic
> >
> > NOTES:
> >
> > This video filter converts tiled framebuffers into a linear layout,
> > using logic running on the CPU.
> >
> > Currently it supports Intel Legacy Tile-X and Tile-Y layout detiling.
> > This should help one to work with frames captured (say using kmsgrab)
> > on laptops having an Intel GPU.
> >
> > The Tile-X conversion logic has been explicitly cross-checked with Tile-X
> > based frames. However, the Tile-Y conversion logic has not yet been tested
> > with Tile-Y based frames; it should do the job based on my current
> > understanding of the Tile-Y layout format, but that remains unverified.
> >
> > TODO1: At a later time, generate Tile-Y based frames and then
> > cross-check the corresponding logic explicitly.
> >
> > TODO2: Maybe use OpenGL or Vulkan buffer helper routines to do the
> > layout conversion. However, some online discussions from a while back seem
> > to indicate that this path is not fully bug-free currently.
> > ---
> >  Changelog                 |   1 +
> >  doc/filters.texi          |  62 ++++++++
> >  libavfilter/Makefile      |   1 +
> >  libavfilter/allfilters.c  |   1 +
> >  libavfilter/vf_fbdetile.c | 309 ++++++++++++++++++++++++++++++++++++++
> >  5 files changed, 374 insertions(+)
> >  create mode 100644 libavfilter/vf_fbdetile.c
> >
> > diff --git a/Changelog b/Changelog
> > index a60e7d2eb8..0e03491f6a 100644
> > --- a/Changelog
> > +++ b/Changelog
> > @@ -2,6 +2,7 @@ Entries are sorted chronologically from oldest to
> youngest
> > within each release,
> >  releases are sorted from youngest to oldest.
> >
> >  version <next>:
> > +- fbdetile cpu based framebuffer layout detiling video filter
> >  - AudioToolbox output device
> >  - MacCaption demuxer
> >
> > diff --git a/doc/filters.texi b/doc/filters.texi
> > index 3c2dd2eb90..73ba21af89 100644
> > --- a/doc/filters.texi
> > +++ b/doc/filters.texi
> > @@ -12210,6 +12210,68 @@ It accepts the following optional parameters:
> >  The number of the CUDA device to use
> >  @end table
> >
> > +@anchor{fbdetile}
> > +@section fbdetile
> > +
> > +Detiles the Framebuffer tile layout into a linear layout using CPU.
> > +
> > +It currently supports conversion from Intel legacy tile-x and tile-y
> > layouts
> > +into a linear layout. This is useful if one is using kmsgrab and
> hwdownload
> > +to capture a screen which is using one of these non-linear layouts.
> > +
> > +Currently it expects the data to be a 32bit RGB based pixel format.
> However
> > +the logic doesnt do any pixel format conversion or so. Later will be
> > enabling
> > +16bit RGB data also, as the logic is transparent to it at one level.
> > +
> > +One could either insert this into the filter chain while capturing
> itself,
> > +or else, if it is slowing things down or so, then one could instead
> insert
> > +it into the filter chain during playback or transcoding or so.
> > +
> > +It supports the following optional parameters
> > +
> > +@table @option
> > +@item type
> > +Specify which detiling conversion to apply. The supported values are
> > +@table @var
> > +@item 0
> > +intel tile-x to linear conversion (the default)
> > +@item 1
> > +intel tile-y to linear conversion.
> > +@end table
> > +@end table
> > +
> > +If one wants to convert during capture itself, one could do
> > +@example
> > +ffmpeg -f kmsgrab -i - -vf "hwdownload, fbdetile" OUTPUT
> > +@end example
> > +
> > +However if one wants to convert after the tiled data has been already
> > captured
> > +@example
> > +ffmpeg -i INPUT -vf "fbdetile" OUTPUT
> > +@end example
> > +@example
> > +ffplay -i INPUT -vf "fbdetile"
> > +@end example
> > +
> > +NOTE: While transcoding a test 1080p h264 stream of 276 frames, with two
> > +runs of each case, the performance was as given below. However, this was
> > +measured with the older, initial version of the logic, and it was run in
> > +the default Linux chromebook->vm->container setup, so the absolute values
> > +may not be representative. In a relative sense the overhead should be similar.
> > +@example
> > +rm out.mp4; time ./ffmpeg -i input.mp4 out.mp4
> > +rm out.mp4; time ./ffmpeg -i input.mp4 -vf fbdetile=0 out.mp4
> > +rm out.mp4; time ./ffmpeg -i input.mp4 -vf fbdetile=1 out.mp4
> > +@end example
> > +@table @option
> > +@item with no fbdetile filter
> > +it took ~7.28 secs,
> > +@item with fbdetile=0 filter
> > +it took ~8.69 secs,
> > +@item with fbdetile=1 filter
> > +it took ~9.20 secs.
> > +@end table
> > +
> >  @section hqx
> >
> >  Apply a high-quality magnification filter designed for pixel art. This
> > filter
> > diff --git a/libavfilter/Makefile b/libavfilter/Makefile
> > index 5123540653..bdb0c379ae 100644
> > --- a/libavfilter/Makefile
> > +++ b/libavfilter/Makefile
> > @@ -280,6 +280,7 @@ OBJS-$(CONFIG_HWDOWNLOAD_FILTER)             +=
> > vf_hwdownload.o
> >  OBJS-$(CONFIG_HWMAP_FILTER)                  += vf_hwmap.o
> >  OBJS-$(CONFIG_HWUPLOAD_CUDA_FILTER)          += vf_hwupload_cuda.o
> >  OBJS-$(CONFIG_HWUPLOAD_FILTER)               += vf_hwupload.o
> > +OBJS-$(CONFIG_FBDETILE_FILTER)               += vf_fbdetile.o
> >  OBJS-$(CONFIG_HYSTERESIS_FILTER)             += vf_hysteresis.o
> framesync.o
> >  OBJS-$(CONFIG_IDET_FILTER)                   += vf_idet.o
> >  OBJS-$(CONFIG_IL_FILTER)                     += vf_il.o
> > diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c
> > index 1183e40267..f8dceb2a88 100644
> > --- a/libavfilter/allfilters.c
> > +++ b/libavfilter/allfilters.c
> > @@ -265,6 +265,7 @@ extern AVFilter ff_vf_hwdownload;
> >  extern AVFilter ff_vf_hwmap;
> >  extern AVFilter ff_vf_hwupload;
> >  extern AVFilter ff_vf_hwupload_cuda;
> > +extern AVFilter ff_vf_fbdetile;
> >  extern AVFilter ff_vf_hysteresis;
> >  extern AVFilter ff_vf_idet;
> >  extern AVFilter ff_vf_il;
> > diff --git a/libavfilter/vf_fbdetile.c b/libavfilter/vf_fbdetile.c
> > new file mode 100644
> > index 0000000000..8b20c96d2c
> > --- /dev/null
> > +++ b/libavfilter/vf_fbdetile.c
> > @@ -0,0 +1,309 @@
> > +/*
> > + * Copyright (c) 2020 HanishKVC
> > + *
> > + * This file is part of FFmpeg.
> > + *
> > + * FFmpeg is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU Lesser General Public
> > + * License as published by the Free Software Foundation; either
> > + * version 2.1 of the License, or (at your option) any later version.
> > + *
> > + * FFmpeg is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * Lesser General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU Lesser General Public
> > + * License along with FFmpeg; if not, write to the Free Software
> > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
> 02110-1301
> > USA
> > + */
> > +
> > +/**
> > + * @file
> > + * Detile the framebuffer's tile layout using the CPU.
> > + * Currently it supports detiling of the legacy Intel Tile-X and Tile-Y layouts.
> > + *
> > + */
> > +
> > +/*
> > + * ToThink|Check: Optimisations
> > + *
> > + * Does gcc setting used by ffmpeg allows memcpy | stringops inlining,
> > + * loop unrolling, better native matching instructions, additional
> > + * optimisations, ...
> > + *
> > + * Does gcc map to optimal memcpy logic, based on the situation it is
> > + * used in.
> > + *
> > + * If not, may be look at vector_size or intrinsics or appropriate arch
> > + * and cpu specific inline asm or ...
> > + *
> > + */
> > +
> > +#include "libavutil/avassert.h"
> > +#include "libavutil/imgutils.h"
> > +#include "libavutil/opt.h"
> > +#include "avfilter.h"
> > +#include "formats.h"
> > +#include "internal.h"
> > +#include "video.h"
> > +
> > +enum FilterMode {
> > +    TYPE_INTELX,
> > +    TYPE_INTELY,
> > +    NB_TYPE
> > +};
> > +
> > +typedef struct FBDetileContext {
> > +    const AVClass *class;
> > +    int width, height;
> > +    int type;
> > +} FBDetileContext;
> > +
> > +#define OFFSET(x) offsetof(FBDetileContext, x)
> > +#define FLAGS AV_OPT_FLAG_FILTERING_PARAM|AV_OPT_FLAG_VIDEO_PARAM
> > +static const AVOption fbdetile_options[] = {
> > +    { "type", "set framebuffer format_modifier type", OFFSET(type),
> > AV_OPT_TYPE_INT, {.i64=TYPE_INTELX}, 0, NB_TYPE-1, FLAGS, "type" },
> > +        { "intelx", "Intel Tile-X layout", 0, AV_OPT_TYPE_CONST,
> > {.i64=TYPE_INTELX}, INT_MIN, INT_MAX, FLAGS, "type" },
> > +        { "intely", "Intel Tile-Y layout", 0, AV_OPT_TYPE_CONST,
> > {.i64=TYPE_INTELY}, INT_MIN, INT_MAX, FLAGS, "type" },
> > +    { NULL }
> > +};
> > +
> > +AVFILTER_DEFINE_CLASS(fbdetile);
> > +
> > +static av_cold int init(AVFilterContext *ctx)
> > +{
> > +    FBDetileContext *fbdetile = ctx->priv;
> > +
> > +    if (fbdetile->type == TYPE_INTELX) {
> > +        fprintf(stderr,"INFO:fbdetile:init: Intel tile-x to linear\n");
> > +    } else if (fbdetile->type == TYPE_INTELY) {
> > +        fprintf(stderr,"INFO:fbdetile:init: Intel tile-y to linear\n");
> > +    } else {
> > +        fprintf(stderr,"DBUG:fbdetile:init: Unknown Tile format
> specified,
> > shouldnt reach here\n");
> > +    }
> > +    fbdetile->width = 1920;
> > +    fbdetile->height = 1080;
> > +    return 0;
> > +}
> > +
> > +static int query_formats(AVFilterContext *ctx)
> > +{
> > +    // Currently only RGB based 32bit formats are specified
> > +    // TODO: Technically the logic is transparent to 16bit RGB formats
> also
> > +    static const enum AVPixelFormat pix_fmts[] = {AV_PIX_FMT_RGB0,
> > AV_PIX_FMT_0RGB, AV_PIX_FMT_BGR0, AV_PIX_FMT_0BGR,
> > +                                                  AV_PIX_FMT_RGBA,
> > AV_PIX_FMT_ARGB, AV_PIX_FMT_BGRA, AV_PIX_FMT_ABGR,
> > +                                                  AV_PIX_FMT_NONE};
> > +    AVFilterFormats *fmts_list;
> > +
> > +    fmts_list = ff_make_format_list(pix_fmts);
> > +    if (!fmts_list)
> > +        return AVERROR(ENOMEM);
> > +    return ff_set_common_formats(ctx, fmts_list);
> > +}
> > +
> > +static int config_props(AVFilterLink *inlink)
> > +{
> > +    AVFilterContext *ctx = inlink->dst;
> > +    FBDetileContext *fbdetile = ctx->priv;
> > +
> > +    fbdetile->width = inlink->w;
> > +    fbdetile->height = inlink->h;
> > +    fprintf(stderr,"DBUG:fbdetile:config_props: %d x %d\n",
> > fbdetile->width, fbdetile->height);
> > +
> > +    return 0;
> > +}
> > +
> > +static void detile_intelx(AVFilterContext *ctx, int w, int h,
> > +                                uint8_t *dst, int dstLineSize,
> > +                          const uint8_t *src, int srcLineSize)
> > +{
> > +    // Offsets and LineSize are in bytes
> > +    int tileW = 128; // For a 32Bit / Pixel framebuffer, 512/4
> > +    int tileH = 8;
> > +
> > +    if (w*4 != srcLineSize) {
> > +        fprintf(stderr,"DBUG:fbdetile:intelx: w%dxh%d, dL%d, sL%d\n",
> w, h,
> > dstLineSize, srcLineSize);
> > +        fprintf(stderr,"ERRR:fbdetile:intelx: dont support LineSize |
> Pitch
> > going beyond width\n");
> > +    }
> > +    int sO = 0;
> > +    int dX = 0;
> > +    int dY = 0;
> > +    int nTRows = (w*h)/tileW;
> > +    int cTR = 0;
> > +    while (cTR < nTRows) {
> > +        int dO = dY*dstLineSize + dX*4;
> > +#ifdef DEBUG_FBTILE
> > +        fprintf(stderr,"DBUG:fbdetile:intelx: dX%d dY%d, sO%d, dO%d\n", dX, dY, sO, dO);
> > +#endif
> > +        memcpy(dst+dO+0*dstLineSize, src+sO+0*512, 512);
> > +        memcpy(dst+dO+1*dstLineSize, src+sO+1*512, 512);
> > +        memcpy(dst+dO+2*dstLineSize, src+sO+2*512, 512);
> > +        memcpy(dst+dO+3*dstLineSize, src+sO+3*512, 512);
> > +        memcpy(dst+dO+4*dstLineSize, src+sO+4*512, 512);
> > +        memcpy(dst+dO+5*dstLineSize, src+sO+5*512, 512);
> > +        memcpy(dst+dO+6*dstLineSize, src+sO+6*512, 512);
> > +        memcpy(dst+dO+7*dstLineSize, src+sO+7*512, 512);
> > +        dX += tileW;
> > +        if (dX >= w) {
> > +            dX = 0;
> > +            dY += 8;
> > +        }
> > +        sO = sO + 8*512;
> > +        cTR += 8;
> > +    }
> > +}
> > +
> > +/*
> > + * Intel Legacy Tile-Y layout conversion support
> > + *
> > + * currently done in a simple dumb way. Two low hanging optimisations
> > + * that could be readily applied are
> > + *
> > + * a) unrolling the inner for loop
> > + *    --- Given small size memcpy, should help, DONE
> > + *
> > + * b) using simd based 128bit loading and storing along with prefetch
> > + *    hinting.
> > + *
> > + *    TOTHINK|CHECK: Does memcpy already does this and more if situation
> > + *    is right?!
> > + *
> > + *    As code (or even intrinsics) would be specific to each architecture,
> > + *    avoiding for now. Later have to check if vector_size attribute and
> > + *    corresponding implementation by gcc can handle different architectures
> > + *    properly, such that it wont become worse than memcpy provided for that
> > + *    architecture.
> > + *
> > + * Or maybe I could even merge the two intel detiling logics into one, as
> > + * the semantic and flow is almost same for both logics.
> > + *
> > + */
> > +static void detile_intely(AVFilterContext *ctx, int w, int h,
> > +                                uint8_t *dst, int dstLineSize,
> > +                          const uint8_t *src, int srcLineSize)
> > +{
> > +    // Offsets and LineSize are in bytes
> > +    int tileW = 4; // For a 32Bit / Pixel framebuffer, 16/4
> > +    int tileH = 32;
> > +
> > +    if (w*4 != srcLineSize) {
> > +        fprintf(stderr,"DBUG:fbdetile:intely: w%dxh%d, dL%d, sL%d\n", w, h, dstLineSize, srcLineSize);
> > +        fprintf(stderr,"ERRR:fbdetile:intely: dont support LineSize | Pitch going beyond width\n");
> > +    }
> > +    int sO = 0;
> > +    int dX = 0;
> > +    int dY = 0;
> > +    int nTRows = (w*h)/tileW;
> > +    int cTR = 0;
> > +    while (cTR < nTRows) {
> > +        int dO = dY*dstLineSize + dX*4;
> > +#ifdef DEBUG_FBTILE
> > +        fprintf(stderr,"DBUG:fbdetile:intely: dX%d dY%d, sO%d, dO%d\n", dX, dY, sO, dO);
> > +#endif
> > +
> > +        memcpy(dst+dO+0*dstLineSize, src+sO+0*16, 16);
> > +        memcpy(dst+dO+1*dstLineSize, src+sO+1*16, 16);
> > +        memcpy(dst+dO+2*dstLineSize, src+sO+2*16, 16);
> > +        memcpy(dst+dO+3*dstLineSize, src+sO+3*16, 16);
> > +        memcpy(dst+dO+4*dstLineSize, src+sO+4*16, 16);
> > +        memcpy(dst+dO+5*dstLineSize, src+sO+5*16, 16);
> > +        memcpy(dst+dO+6*dstLineSize, src+sO+6*16, 16);
> > +        memcpy(dst+dO+7*dstLineSize, src+sO+7*16, 16);
> > +        memcpy(dst+dO+8*dstLineSize, src+sO+8*16, 16);
> > +        memcpy(dst+dO+9*dstLineSize, src+sO+9*16, 16);
> > +        memcpy(dst+dO+10*dstLineSize, src+sO+10*16, 16);
> > +        memcpy(dst+dO+11*dstLineSize, src+sO+11*16, 16);
> > +        memcpy(dst+dO+12*dstLineSize, src+sO+12*16, 16);
> > +        memcpy(dst+dO+13*dstLineSize, src+sO+13*16, 16);
> > +        memcpy(dst+dO+14*dstLineSize, src+sO+14*16, 16);
> > +        memcpy(dst+dO+15*dstLineSize, src+sO+15*16, 16);
> > +        memcpy(dst+dO+16*dstLineSize, src+sO+16*16, 16);
> > +        memcpy(dst+dO+17*dstLineSize, src+sO+17*16, 16);
> > +        memcpy(dst+dO+18*dstLineSize, src+sO+18*16, 16);
> > +        memcpy(dst+dO+19*dstLineSize, src+sO+19*16, 16);
> > +        memcpy(dst+dO+20*dstLineSize, src+sO+20*16, 16);
> > +        memcpy(dst+dO+21*dstLineSize, src+sO+21*16, 16);
> > +        memcpy(dst+dO+22*dstLineSize, src+sO+22*16, 16);
> > +        memcpy(dst+dO+23*dstLineSize, src+sO+23*16, 16);
> > +        memcpy(dst+dO+24*dstLineSize, src+sO+24*16, 16);
> > +        memcpy(dst+dO+25*dstLineSize, src+sO+25*16, 16);
> > +        memcpy(dst+dO+26*dstLineSize, src+sO+26*16, 16);
> > +        memcpy(dst+dO+27*dstLineSize, src+sO+27*16, 16);
> > +        memcpy(dst+dO+28*dstLineSize, src+sO+28*16, 16);
> > +        memcpy(dst+dO+29*dstLineSize, src+sO+29*16, 16);
> > +        memcpy(dst+dO+30*dstLineSize, src+sO+30*16, 16);
> > +        memcpy(dst+dO+31*dstLineSize, src+sO+31*16, 16);
> > +
> > +        dX += tileW;
> > +        if (dX >= w) {
> > +            dX = 0;
> > +            dY += 32;
> > +        }
> > +        sO = sO + 32*16;
> > +        cTR += 32;
> > +    }
> > +}
> > +
> > +static int filter_frame(AVFilterLink *inlink, AVFrame *in)
> > +{
> > +    AVFilterContext *ctx = inlink->dst;
> > +    FBDetileContext *fbdetile = ctx->priv;
> > +    AVFilterLink *outlink = ctx->outputs[0];
> > +    AVFrame *out;
> > +
> > +    out = ff_get_video_buffer(outlink, outlink->w, outlink->h);
> > +    if (!out) {
> > +        av_frame_free(&in);
> > +        return AVERROR(ENOMEM);
> > +    }
> > +    av_frame_copy_props(out, in);
> > +
> > +    if (fbdetile->type == TYPE_INTELX) {
> > +        detile_intelx(ctx, fbdetile->width, fbdetile->height,
> > +                      out->data[0], out->linesize[0],
> > +                      in->data[0], in->linesize[0]);
> > +    } else if (fbdetile->type == TYPE_INTELY) {
> > +        detile_intely(ctx, fbdetile->width, fbdetile->height,
> > +                      out->data[0], out->linesize[0],
> > +                      in->data[0], in->linesize[0]);
> > +    }
> > +
> > +    av_frame_free(&in);
> > +    return ff_filter_frame(outlink, out);
> > +}
> > +
> > +static av_cold void uninit(AVFilterContext *ctx)
> > +{
> > +
> > +}
> > +
> > +static const AVFilterPad fbdetile_inputs[] = {
> > +    {
> > +        .name         = "default",
> > +        .type         = AVMEDIA_TYPE_VIDEO,
> > +        .config_props = config_props,
> > +        .filter_frame = filter_frame,
> > +    },
> > +    { NULL }
> > +};
> > +
> > +static const AVFilterPad fbdetile_outputs[] = {
> > +    {
> > +        .name = "default",
> > +        .type = AVMEDIA_TYPE_VIDEO,
> > +    },
> > +    { NULL }
> > +};
> > +
> > +AVFilter ff_vf_fbdetile = {
> > +    .name          = "fbdetile",
> > +    .description   = NULL_IF_CONFIG_SMALL("Detile Framebuffer using CPU"),
> > +    .priv_size     = sizeof(FBDetileContext),
> > +    .init          = init,
> > +    .uninit        = uninit,
> > +    .query_formats = query_formats,
> > +    .inputs        = fbdetile_inputs,
> > +    .outputs       = fbdetile_outputs,
> > +    .priv_class    = &fbdetile_class,
> > +};
> > --
> > 2.20.1
> >
> > _______________________________________________
> > ffmpeg-devel mailing list
> > ffmpeg-devel@ffmpeg.org
> > https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
> >
> > To unsubscribe, visit link above, or email
> > ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
>
Paul B Mahol June 28, 2020, 9:25 a.m. UTC | #3
On 6/27/20, C Hanish Menon <hanishkvc@gmail.com> wrote:
> Hi,
>
> It is a new video filter that I created to detile the Intel Tile-X and
> Tile-Y framebuffer layouts into a linear layout, using logic that runs on
> the CPU. It can be used with kmsgrab and hwdownload to capture the screen
> on an Intel GPU based system, so that one gets a proper screen capture.
>
> Without this, kmsgrab will generate an unusable, scrambled capture, because
> the contents will be tiled. I hit this issue a few days back when trying to
> capture the screen under Wayland, so I created this.
>
> In the submitted patch I have added documentation in doc/filters.texi,
> which mentions the same.

The filter is only marginally useful: it runs on the CPU, completely
invalidating any possible gain from using the hw path.

>
>
>
> On Sun, Jun 28, 2020 at 1:30 AM Paul B Mahol <onemda@gmail.com> wrote:
>
>> What is this?
>>
>> Missing documentation.
>> NAK
>>
>> On 6/27/20, hanishkvc <hanishkvc@gmail.com> wrote:
>> > v02-20200627IST2331
>> >
>> > Unrolled Intel Legacy Tile-Y detiling logic.
>> >
>> > Also a consolidated patch file, instead of the previous development
>> > flow based multiple patch files.
>> >
>> > v01-20200627IST1308
>> >
>> > Implemented Intel Legacy Tile-X and Tile-Y detiling logic
>> >
>> > NOTES:
>> >
>> > This video filter allows framebuffers which are tiled to be detiled
>> > using logic running on the cpu, into a linear layout.
>> >
>> > Currently it supports Intel Legacy Tile-X and Tile-Y layout detiling.
>> > THis should help one to work with frames captured (say using kmsgrab)
>> > on laptops having Intel GPU.
>> >
>> > Tile-X conversion logic has been explicitly cross checked, with Tile-X
>> > based frames. However Tile-Y conv logic hasnt been tested with Tile-Y
>> > based frames, but it should potentially do the job, based on my current
>> > understanding of the Tile-Y layout format.
>> >
>> > TODO1: At a later time have to generate Tile-Y based frames, and then
>> > cross check the corresponding logic explicitly.
>> >
>> > TODO2: May be use OpenGL or Vulcan buffer helper routines to do the
>> > layout conversion. But some online discussions from sometime back seem
>> > to indicate that this path is not fully bug free currently.
>> > ---
>> >  Changelog                 |   1 +
>> >  doc/filters.texi          |  62 ++++++++
>> >  libavfilter/Makefile      |   1 +
>> >  libavfilter/allfilters.c  |   1 +
>> >  libavfilter/vf_fbdetile.c | 309 ++++++++++++++++++++++++++++++++++++++
>> >  5 files changed, 374 insertions(+)
>> >  create mode 100644 libavfilter/vf_fbdetile.c
>> >
>> > diff --git a/Changelog b/Changelog
>> > index a60e7d2eb8..0e03491f6a 100644
>> > --- a/Changelog
>> > +++ b/Changelog
>> > @@ -2,6 +2,7 @@ Entries are sorted chronologically from oldest to
>> youngest
>> > within each release,
>> >  releases are sorted from youngest to oldest.
>> >
>> >  version <next>:
>> > +- fbdetile cpu based framebuffer layout detiling video filter
>> >  - AudioToolbox output device
>> >  - MacCaption demuxer
>> >
>> > diff --git a/doc/filters.texi b/doc/filters.texi
>> > index 3c2dd2eb90..73ba21af89 100644
>> > --- a/doc/filters.texi
>> > +++ b/doc/filters.texi
>> > @@ -12210,6 +12210,68 @@ It accepts the following optional parameters:
>> >  The number of the CUDA device to use
>> >  @end table
>> >
>> > +@anchor{fbdetile}
>> > +@section fbdetile
>> > +
>> > +Detiles the Framebuffer tile layout into a linear layout using CPU.
>> > +
>> > +It currently supports conversion from Intel legacy tile-x and tile-y
>> > layouts
>> > +into a linear layout. This is useful if one is using kmsgrab and
>> hwdownload
>> > +to capture a screen which is using one of these non-linear layouts.
>> > +
>> > +Currently it expects the data to be a 32bit RGB based pixel format.
>> However
>> > +the logic doesnt do any pixel format conversion or so. Later will be
>> > enabling
>> > +16bit RGB data also, as the logic is transparent to it at one level.
>> > +
>> > +One could either insert this into the filter chain while capturing
>> itself,
>> > +or else, if it is slowing things down or so, then one could instead
>> insert
>> > +it into the filter chain during playback or transcoding or so.
>> > +
>> > +It supports the following optional parameters
>> > +
>> > +@table @option
>> > +@item type
>> > +Specify which detiling conversion to apply. The supported values are
>> > +@table @var
>> > +@item 0
>> > +intel tile-x to linear conversion (the default)
>> > +@item 1
>> > +intel tile-y to linear conversion.
>> > +@end table
>> > +@end table
>> > +
>> > +If one wants to convert during capture itself, one could do
>> > +@example
>> > +ffmpeg -f kmsgrab -i - -vf "hwdownload, fbdetile" OUTPUT
>> > +@end example
>> > +
>> > +However if one wants to convert after the tiled data has been already
>> > captured
>> > +@example
>> > +ffmpeg -i INPUT -vf "fbdetile" OUTPUT
>> > +@end example
>> > +@example
>> > +ffplay -i INPUT -vf "fbdetile"
>> > +@end example
>> > +
>> > +NOTE: While transcoding a test 1080p h264 stream, with 276 frames,
>> > with
>> two
>> > +runs of each situation, the performance was has given below. However
>> this
>> > +was for the older | initial version of the logic, as well as it was
>> > run
>> on
>> > +the default linux chromebook->vm->container, so the perf values need
>> not be
>> > +proper. But in a relative sense the overhead would be similar.
>> > +@example
>> > +rm out.mp4; time ./ffmpeg -i input.mp4 out.mp4
>> > +rm out.mp4; time ./ffmpeg -i input.mp4 -vf fbdetile=0 out.mp4
>> > +rm out.mp4; time ./ffmpeg -i input.mp4 -vf fbdetile=1 out.mp4
>> > +@end example
>> > +@table @option
>> > +@item with no fbdetile filter
>> > +it took ~7.28 secs,
>> > +@item with fbdetile=0 filter
>> > +it took ~8.69 secs,
>> > +@item with fbdetile=1 filter
>> > +it took ~9.20 secs.
>> > +@end table
>> > +
>> >  @section hqx
>> >
>> >  Apply a high-quality magnification filter designed for pixel art. This
>> > filter
>> > diff --git a/libavfilter/Makefile b/libavfilter/Makefile
>> > index 5123540653..bdb0c379ae 100644
>> > --- a/libavfilter/Makefile
>> > +++ b/libavfilter/Makefile
>> > @@ -280,6 +280,7 @@ OBJS-$(CONFIG_HWDOWNLOAD_FILTER)             +=
>> > vf_hwdownload.o
>> >  OBJS-$(CONFIG_HWMAP_FILTER)                  += vf_hwmap.o
>> >  OBJS-$(CONFIG_HWUPLOAD_CUDA_FILTER)          += vf_hwupload_cuda.o
>> >  OBJS-$(CONFIG_HWUPLOAD_FILTER)               += vf_hwupload.o
>> > +OBJS-$(CONFIG_FBDETILE_FILTER)               += vf_fbdetile.o
>> >  OBJS-$(CONFIG_HYSTERESIS_FILTER)             += vf_hysteresis.o
>> framesync.o
>> >  OBJS-$(CONFIG_IDET_FILTER)                   += vf_idet.o
>> >  OBJS-$(CONFIG_IL_FILTER)                     += vf_il.o
>> > diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c
>> > index 1183e40267..f8dceb2a88 100644
>> > --- a/libavfilter/allfilters.c
>> > +++ b/libavfilter/allfilters.c
>> > @@ -265,6 +265,7 @@ extern AVFilter ff_vf_hwdownload;
>> >  extern AVFilter ff_vf_hwmap;
>> >  extern AVFilter ff_vf_hwupload;
>> >  extern AVFilter ff_vf_hwupload_cuda;
>> > +extern AVFilter ff_vf_fbdetile;
>> >  extern AVFilter ff_vf_hysteresis;
>> >  extern AVFilter ff_vf_idet;
>> >  extern AVFilter ff_vf_il;
>> > diff --git a/libavfilter/vf_fbdetile.c b/libavfilter/vf_fbdetile.c
>> > new file mode 100644
>> > index 0000000000..8b20c96d2c
>> > --- /dev/null
>> > +++ b/libavfilter/vf_fbdetile.c
>> > @@ -0,0 +1,309 @@
>> > +/*
>> > + * Copyright (c) 2020 HanishKVC
>> > + *
>> > + * This file is part of FFmpeg.
>> > + *
>> > + * FFmpeg is free software; you can redistribute it and/or
>> > + * modify it under the terms of the GNU Lesser General Public
>> > + * License as published by the Free Software Foundation; either
>> > + * version 2.1 of the License, or (at your option) any later version.
>> > + *
>> > + * FFmpeg is distributed in the hope that it will be useful,
>> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> > + * Lesser General Public License for more details.
>> > + *
>> > + * You should have received a copy of the GNU Lesser General Public
>> > + * License along with FFmpeg; if not, write to the Free Software
>> > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
>> 02110-1301
>> > USA
>> > + */
>> > +
>> > +/**
>> > + * @file
>> > + * Detile the Frame buffer's tile layout using the cpu
>> > + * Currently it supports the legacy Intel Tile X layout detiling.
>> > + *
>> > + */
>> > +
>> > +/*
>> > + * ToThink|Check: Optimisations
>> > + *
>> > + * Does gcc setting used by ffmpeg allows memcpy | stringops inlining,
>> > + * loop unrolling, better native matching instructions, additional
>> > + * optimisations, ...
>> > + *
>> > + * Does gcc map to optimal memcpy logic, based on the situation it is
>> > + * used in.
>> > + *
>> > + * If not, may be look at vector_size or intrinsics or appropriate
>> > arch
>> > + * and cpu specific inline asm or ...
>> > + *
>> > + */
>> > +
>> > +#include "libavutil/avassert.h"
>> > +#include "libavutil/imgutils.h"
>> > +#include "libavutil/opt.h"
>> > +#include "avfilter.h"
>> > +#include "formats.h"
>> > +#include "internal.h"
>> > +#include "video.h"
>> > +
>> > +enum FilterMode {
>> > +    TYPE_INTELX,
>> > +    TYPE_INTELY,
>> > +    NB_TYPE
>> > +};
>> > +
>> > +typedef struct FBDetileContext {
>> > +    const AVClass *class;
>> > +    int width, height;
>> > +    int type;
>> > +} FBDetileContext;
>> > +
>> > +#define OFFSET(x) offsetof(FBDetileContext, x)
>> > +#define FLAGS AV_OPT_FLAG_FILTERING_PARAM|AV_OPT_FLAG_VIDEO_PARAM
>> > +static const AVOption fbdetile_options[] = {
>> > +    { "type", "set framebuffer format_modifier type", OFFSET(type),
>> > AV_OPT_TYPE_INT, {.i64=TYPE_INTELX}, 0, NB_TYPE-1, FLAGS, "type" },
>> > +        { "intelx", "Intel Tile-X layout", 0, AV_OPT_TYPE_CONST,
>> > {.i64=TYPE_INTELX}, INT_MIN, INT_MAX, FLAGS, "type" },
>> > +        { "intely", "Intel Tile-Y layout", 0, AV_OPT_TYPE_CONST,
>> > {.i64=TYPE_INTELY}, INT_MIN, INT_MAX, FLAGS, "type" },
>> > +    { NULL }
>> > +};
>> > +
>> > +AVFILTER_DEFINE_CLASS(fbdetile);
>> > +
>> > +static av_cold int init(AVFilterContext *ctx)
>> > +{
>> > +    FBDetileContext *fbdetile = ctx->priv;
>> > +
>> > +    if (fbdetile->type == TYPE_INTELX) {
>> > +        fprintf(stderr,"INFO:fbdetile:init: Intel tile-x to
>> > linear\n");
>> > +    } else if (fbdetile->type == TYPE_INTELY) {
>> > +        fprintf(stderr,"INFO:fbdetile:init: Intel tile-y to
>> > linear\n");
>> > +    } else {
>> > +        fprintf(stderr,"DBUG:fbdetile:init: Unknown Tile format
>> specified,
>> > shouldnt reach here\n");
>> > +    }
>> > +    fbdetile->width = 1920;
>> > +    fbdetile->height = 1080;
>> > +    return 0;
>> > +}
>> > +
>> > +static int query_formats(AVFilterContext *ctx)
>> > +{
>> > +    // Currently only RGB based 32bit formats are specified
>> > +    // TODO: Technically the logic is transparent to 16bit RGB formats
>> also
>> > +    static const enum AVPixelFormat pix_fmts[] = {AV_PIX_FMT_RGB0,
>> > AV_PIX_FMT_0RGB, AV_PIX_FMT_BGR0, AV_PIX_FMT_0BGR,
>> > +                                                  AV_PIX_FMT_RGBA,
>> > AV_PIX_FMT_ARGB, AV_PIX_FMT_BGRA, AV_PIX_FMT_ABGR,
>> > +                                                  AV_PIX_FMT_NONE};
>> > +    AVFilterFormats *fmts_list;
>> > +
>> > +    fmts_list = ff_make_format_list(pix_fmts);
>> > +    if (!fmts_list)
>> > +        return AVERROR(ENOMEM);
>> > +    return ff_set_common_formats(ctx, fmts_list);
>> > +}
>> > +
>> > +static int config_props(AVFilterLink *inlink)
>> > +{
>> > +    AVFilterContext *ctx = inlink->dst;
>> > +    FBDetileContext *fbdetile = ctx->priv;
>> > +
>> > +    fbdetile->width = inlink->w;
>> > +    fbdetile->height = inlink->h;
>> > +    fprintf(stderr,"DBUG:fbdetile:config_props: %d x %d\n",
>> > fbdetile->width, fbdetile->height);
>> > +
>> > +    return 0;
>> > +}
>> > +
>> > +static void detile_intelx(AVFilterContext *ctx, int w, int h,
>> > +                                uint8_t *dst, int dstLineSize,
>> > +                          const uint8_t *src, int srcLineSize)
>> > +{
>> > +    // Offsets and LineSize are in bytes
>> > +    int tileW = 128; // For a 32Bit / Pixel framebuffer, 512/4
>> > +    int tileH = 8;
>> > +
>> > +    if (w*4 != srcLineSize) {
>> > +        fprintf(stderr,"DBUG:fbdetile:intelx: w%dxh%d, dL%d, sL%d\n",
>> w, h,
>> > dstLineSize, srcLineSize);
>> > +        fprintf(stderr,"ERRR:fbdetile:intelx: dont support LineSize |
>> Pitch
>> > going beyond width\n");
>> > +    }
>> > +    int sO = 0;
>> > +    int dX = 0;
>> > +    int dY = 0;
>> > +    int nTRows = (w*h)/tileW;
>> > +    int cTR = 0;
>> > +    while (cTR < nTRows) {
>> > +        int dO = dY*dstLineSize + dX*4;
>> > +#ifdef DEBUG_FBTILE
>> > +        fprintf(stderr,"DBUG:fbdetile:intelx: dX%d dY%d, sO%d,
>> > dO%d\n",
>> dX,
>> > dY, sO, dO);
>> > +#endif
>> > +        memcpy(dst+dO+0*dstLineSize, src+sO+0*512, 512);
>> > +        memcpy(dst+dO+1*dstLineSize, src+sO+1*512, 512);
>> > +        memcpy(dst+dO+2*dstLineSize, src+sO+2*512, 512);
>> > +        memcpy(dst+dO+3*dstLineSize, src+sO+3*512, 512);
>> > +        memcpy(dst+dO+4*dstLineSize, src+sO+4*512, 512);
>> > +        memcpy(dst+dO+5*dstLineSize, src+sO+5*512, 512);
>> > +        memcpy(dst+dO+6*dstLineSize, src+sO+6*512, 512);
>> > +        memcpy(dst+dO+7*dstLineSize, src+sO+7*512, 512);
>> > +        dX += tileW;
>> > +        if (dX >= w) {
>> > +            dX = 0;
>> > +            dY += 8;
>> > +        }
>> > +        sO = sO + 8*512;
>> > +        cTR += 8;
>> > +    }
>> > +}
>> > +
>> > +/*
>> > + * Intel Legacy Tile-Y layout conversion support
>> > + *
>> > + * currently done in a simple dumb way. Two low hanging optimisations
>> > + * that could be readily applied are
>> > + *
>> > + * a) unrolling the inner for loop
>> > + *    --- Given small size memcpy, should help, DONE
>> > + *
>> > + * b) using simd based 128bit loading and storing along with prefetch
>> > + *    hinting.
>> > + *
>> > + *    TOTHINK|CHECK: Does memcpy already does this and more if
>> > situation
>> > + *    is right?!
>> > + *
>> > + *    As code (or even intrinsics) would be specific to each
>> architecture,
>> > + *    avoiding for now. Later have to check if vector_size attribute
>> > and
>> > + *    corresponding implementation by gcc can handle different
>> > architectures
>> > + *    properly, such that it wont become worse than memcpy provided
>> > for
>> > that
>> > + *    architecture.
>> > + *
>> > + * Or maybe I could even merge the two intel detiling logics into one,
>> as
>> > + * the semantic and flow is almost same for both logics.
>> > + *
>> > + */
>> > +static void detile_intely(AVFilterContext *ctx, int w, int h,
>> > +                                uint8_t *dst, int dstLineSize,
>> > +                          const uint8_t *src, int srcLineSize)
>> > +{
>> > +    // Offsets and LineSize are in bytes
>> > +    int tileW = 4; // For a 32Bit / Pixel framebuffer, 16/4
>> > +    int tileH = 32;
>> > +
>> > +    if (w*4 != srcLineSize) {
>> > +        fprintf(stderr,"DBUG:fbdetile:intely: w%dxh%d, dL%d, sL%d\n",
>> w, h,
>> > dstLineSize, srcLineSize);
>> > +        fprintf(stderr,"ERRR:fbdetile:intely: dont support LineSize |
>> Pitch
>> > going beyond width\n");
>> > +    }
>> > +    int sO = 0;
>> > +    int dX = 0;
>> > +    int dY = 0;
>> > +    int nTRows = (w*h)/tileW;
>> > +    int cTR = 0;
>> > +    while (cTR < nTRows) {
>> > +        int dO = dY*dstLineSize + dX*4;
>> > +#ifdef DEBUG_FBTILE
>> > +        fprintf(stderr,"DBUG:fbdetile:intely: dX%d dY%d, sO%d,
>> > dO%d\n",
>> dX,
>> > dY, sO, dO);
>> > +#endif
>> > +
>> > +        memcpy(dst+dO+0*dstLineSize, src+sO+0*16, 16);
>> > +        memcpy(dst+dO+1*dstLineSize, src+sO+1*16, 16);
>> > +        memcpy(dst+dO+2*dstLineSize, src+sO+2*16, 16);
>> > +        memcpy(dst+dO+3*dstLineSize, src+sO+3*16, 16);
>> > +        memcpy(dst+dO+4*dstLineSize, src+sO+4*16, 16);
>> > +        memcpy(dst+dO+5*dstLineSize, src+sO+5*16, 16);
>> > +        memcpy(dst+dO+6*dstLineSize, src+sO+6*16, 16);
>> > +        memcpy(dst+dO+7*dstLineSize, src+sO+7*16, 16);
>> > +        memcpy(dst+dO+8*dstLineSize, src+sO+8*16, 16);
>> > +        memcpy(dst+dO+9*dstLineSize, src+sO+9*16, 16);
>> > +        memcpy(dst+dO+10*dstLineSize, src+sO+10*16, 16);
>> > +        memcpy(dst+dO+11*dstLineSize, src+sO+11*16, 16);
>> > +        memcpy(dst+dO+12*dstLineSize, src+sO+12*16, 16);
>> > +        memcpy(dst+dO+13*dstLineSize, src+sO+13*16, 16);
>> > +        memcpy(dst+dO+14*dstLineSize, src+sO+14*16, 16);
>> > +        memcpy(dst+dO+15*dstLineSize, src+sO+15*16, 16);
>> > +        memcpy(dst+dO+16*dstLineSize, src+sO+16*16, 16);
>> > +        memcpy(dst+dO+17*dstLineSize, src+sO+17*16, 16);
>> > +        memcpy(dst+dO+18*dstLineSize, src+sO+18*16, 16);
>> > +        memcpy(dst+dO+19*dstLineSize, src+sO+19*16, 16);
>> > +        memcpy(dst+dO+20*dstLineSize, src+sO+20*16, 16);
>> > +        memcpy(dst+dO+21*dstLineSize, src+sO+21*16, 16);
>> > +        memcpy(dst+dO+22*dstLineSize, src+sO+22*16, 16);
>> > +        memcpy(dst+dO+23*dstLineSize, src+sO+23*16, 16);
>> > +        memcpy(dst+dO+24*dstLineSize, src+sO+24*16, 16);
>> > +        memcpy(dst+dO+25*dstLineSize, src+sO+25*16, 16);
>> > +        memcpy(dst+dO+26*dstLineSize, src+sO+26*16, 16);
>> > +        memcpy(dst+dO+27*dstLineSize, src+sO+27*16, 16);
>> > +        memcpy(dst+dO+28*dstLineSize, src+sO+28*16, 16);
>> > +        memcpy(dst+dO+29*dstLineSize, src+sO+29*16, 16);
>> > +        memcpy(dst+dO+30*dstLineSize, src+sO+30*16, 16);
>> > +        memcpy(dst+dO+31*dstLineSize, src+sO+31*16, 16);
>> > +
>> > +        dX += tileW;
>> > +        if (dX >= w) {
>> > +            dX = 0;
>> > +            dY += 32;
>> > +        }
>> > +        sO = sO + 32*16;
>> > +        cTR += 32;
>> > +    }
>> > +}
>> > +
>> > +static int filter_frame(AVFilterLink *inlink, AVFrame *in)
>> > +{
>> > +    AVFilterContext *ctx = inlink->dst;
>> > +    FBDetileContext *fbdetile = ctx->priv;
>> > +    AVFilterLink *outlink = ctx->outputs[0];
>> > +    AVFrame *out;
>> > +
>> > +    out = ff_get_video_buffer(outlink, outlink->w, outlink->h);
>> > +    if (!out) {
>> > +        av_frame_free(&in);
>> > +        return AVERROR(ENOMEM);
>> > +    }
>> > +    av_frame_copy_props(out, in);
>> > +
>> > +    if (fbdetile->type == TYPE_INTELX) {
>> > +        detile_intelx(ctx, fbdetile->width, fbdetile->height,
>> > +                      out->data[0], out->linesize[0],
>> > +                      in->data[0], in->linesize[0]);
>> > +    } else if (fbdetile->type == TYPE_INTELY) {
>> > +        detile_intely(ctx, fbdetile->width, fbdetile->height,
>> > +                      out->data[0], out->linesize[0],
>> > +                      in->data[0], in->linesize[0]);
>> > +    }
>> > +
>> > +    av_frame_free(&in);
>> > +    return ff_filter_frame(outlink, out);
>> > +}
>> > +
>> > +static av_cold void uninit(AVFilterContext *ctx)
>> > +{
>> > +
>> > +}
>> > +
>> > +static const AVFilterPad fbdetile_inputs[] = {
>> > +    {
>> > +        .name         = "default",
>> > +        .type         = AVMEDIA_TYPE_VIDEO,
>> > +        .config_props = config_props,
>> > +        .filter_frame = filter_frame,
>> > +    },
>> > +    { NULL }
>> > +};
>> > +
>> > +static const AVFilterPad fbdetile_outputs[] = {
>> > +    {
>> > +        .name = "default",
>> > +        .type = AVMEDIA_TYPE_VIDEO,
>> > +    },
>> > +    { NULL }
>> > +};
>> > +
>> > +AVFilter ff_vf_fbdetile = {
>> > +    .name          = "fbdetile",
>> > +    .description   = NULL_IF_CONFIG_SMALL("Detile Framebuffer using
>> CPU"),
>> > +    .priv_size     = sizeof(FBDetileContext),
>> > +    .init          = init,
>> > +    .uninit        = uninit,
>> > +    .query_formats = query_formats,
>> > +    .inputs        = fbdetile_inputs,
>> > +    .outputs       = fbdetile_outputs,
>> > +    .priv_class    = &fbdetile_class,
>> > +};
>> > --
>> > 2.20.1
>> >
>>
>
>
> --
> Keep ;-)
> HanishKVC
>
C Hanish Menon June 28, 2020, 4:10 p.m. UTC | #4
Hi Paul,

At one level I agree that, compared to using the GPU for the detiling,
this will be slower. However, from what I have seen in some online
discussions, the Vulkan and OpenGL extensions for format-modifier-aware
export weren't fully there yet. Also, in the case of Intel, it potentially
requires using a fenced view provided by the GPU to get the linear view,
and these fences are in turn a limited resource.

In parallel, a few days back I wanted to capture my Wayland session at a
good frame rate and at full screen, but was limited by the screen capture
D-Bus service provided by GNOME Shell, which restricted me to a frame rate
of less than 10-14 fps and limited compression. Also, ffmpeg with kmsgrab
currently doesn't let me capture a screen that uses such a tiled layout,
because its output will be the tiles and not the proper linear screen.
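
To make the scrambling concrete, here is a minimal standalone sketch (not
the patch code) of where a pixel ends up in a Tile-X surface. It assumes
the legacy layout as modelled in the patch: 32-bit pixels, tiles of 512
bytes by 8 rows stored one after another, and a pitch that is a multiple
of 512 bytes. The name tilex_offset is hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical helper: byte offset, inside a Tile-X surface, of the
 * 32-bit pixel at linear position (x, y).
 * Assumes pitch_bytes is a multiple of the 512-byte tile width.
 */
static size_t tilex_offset(int x, int y, int pitch_bytes)
{
    const int tile_w = 512;                   /* tile width in bytes */
    const int tile_h = 8;                     /* tile height in rows */
    const int xb = x * 4;                     /* byte column of the pixel */
    const int tiles_per_row = pitch_bytes / tile_w;

    /* Tiles are stored row-major, each one a contiguous 4096-byte block. */
    size_t tile_index = (size_t)(y / tile_h) * tiles_per_row + xb / tile_w;

    return tile_index * (size_t)(tile_w * tile_h)
         + (size_t)(y % tile_h) * tile_w      /* row inside the tile */
         + (size_t)(xb % tile_w);             /* byte inside that row */
}
```

With a 1024-byte pitch, the pixel directly below (0, 0) sits only 512
bytes further on instead of 1024, which is why a capture read as linear
looks scrambled.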

This is what led me to implement this. It allows me to capture the full
screen (1080p) at 24-30 fps without overloading the CPU. Also, even if
someone is capturing, say, 4K or larger screens, they can still capture
the screen without this filter and later run the captured tiled video
through it, and it will convert it into proper video. So I feel it will be
very useful for many use cases that are currently not possible with ffmpeg
and its existing filters (unless I have missed some ffmpeg setup, which is
also possible; I would be interested in knowing of such a path, if any).
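
The comment in the quoted patch notes that the two Intel detiling routines
could be merged, since their flow is almost the same. A rough sketch of
what that might look like follows. It is only an illustration under the
same assumptions the patch makes (32-bit pixels, no source padding), and
detile_generic, blk_bytes and blk_rows are hypothetical names, not part of
the patch.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical merged detiler, not part of the patch.
 * The source is read as consecutive sub-blocks, each holding blk_rows
 * lines of blk_bytes bytes; sub-blocks are placed left to right in the
 * destination, moving down by blk_rows once a row of them is complete.
 * Assumes w*4 is a multiple of blk_bytes and h a multiple of blk_rows.
 * Returns 0 on success, -1 if those assumptions do not hold.
 */
static int detile_generic(int w, int h,
                          uint8_t *dst, int dst_linesize,
                          const uint8_t *src,
                          int blk_bytes, int blk_rows)
{
    const int row_bytes = w * 4;    /* 32-bit pixels */

    if (row_bytes % blk_bytes || h % blk_rows)
        return -1;

    for (int y = 0; y < h; y += blk_rows) {
        for (int xb = 0; xb < row_bytes; xb += blk_bytes) {
            /* Copy one sub-block, line by line, to its linear position. */
            for (int r = 0; r < blk_rows; r++) {
                memcpy(dst + (y + r) * dst_linesize + xb, src, blk_bytes);
                src += blk_bytes;
            }
        }
    }
    return 0;
}
```

With this shape, Tile-X as modelled in the patch would be
detile_generic(w, h, dst, dst_linesize, src, 512, 8) and Tile-Y would be
detile_generic(w, h, dst, dst_linesize, src, 16, 32). Whether the plain
loop is as fast as the unrolled memcpy calls would need measuring.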

Keep ;-)
HanishKVC




On Sun, Jun 28, 2020 at 2:55 PM Paul B Mahol <onemda@gmail.com> wrote:

> On 6/27/20, C Hanish Menon <hanishkvc@gmail.com> wrote:
> > Hi,
> >
> > It is a new video filter that I created to detile the Intel Tile-X and
> > Tile-Y framebuffer layouts into a linear layout, using logic that runs
> > on the CPU. It can be used with kmsgrab and hwdownload to capture the
> > screen on an Intel GPU based system, so that one gets a proper screen
> > capture.
> >
> > Without this, kmsgrab will generate an unusable, scrambled capture,
> > because the contents will be tiled. I hit this issue a few days back
> > when trying to capture the screen under Wayland, so I created this.
> >
> > In the submitted patch I have added documentation in doc/filters.texi,
> > which mentions the same.
>
> The filter is only marginally useful: it runs on the CPU, completely
> invalidating any possible gain from using the hw path.
>
> >
> >
> >
Mark Thompson June 28, 2020, 8:50 p.m. UTC | #5
On 27/06/2020 20:57, hanishkvc wrote:
> v02-20200627IST2331
> 
> Unrolled Intel Legacy Tile-Y detiling logic.
> 
> Also a consolidated patch file, instead of the previous development
> flow based multiple patch files.
> 
> v01-20200627IST1308
> 
> Implemented Intel Legacy Tile-X and Tile-Y detiling logic
> 
> NOTES:
> 
> This video filter allows framebuffers which are tiled to be detiled
> using logic running on the cpu, into a linear layout.
> 
> Currently it supports Intel Legacy Tile-X and Tile-Y layout detiling.
> THis should help one to work with frames captured (say using kmsgrab)
> on laptops having Intel GPU.
> 
> Tile-X conversion logic has been explicitly cross checked, with Tile-X
> based frames. However Tile-Y conv logic hasnt been tested with Tile-Y
> based frames, but it should potentially do the job, based on my current
> understanding of the Tile-Y layout format.
> 
> TODO1: At a later time have to generate Tile-Y based frames, and then
> cross check the corresponding logic explicitly.
> 
> TODO2: May be use OpenGL or Vulcan buffer helper routines to do the
> layout conversion. But some online discussions from sometime back seem
> to indicate that this path is not fully bug free currently.
> ---
>   Changelog                 |   1 +
>   doc/filters.texi          |  62 ++++++++
>   libavfilter/Makefile      |   1 +
>   libavfilter/allfilters.c  |   1 +
>   libavfilter/vf_fbdetile.c | 309 ++++++++++++++++++++++++++++++++++++++
>   5 files changed, 374 insertions(+)
>   create mode 100644 libavfilter/vf_fbdetile.c

For your kmsgrab use-case I think you are doing this in the wrong place.  There is already a copy during the download step (the hwdownload filter before this), and that does know what the tiling mode 
is such that it could detile transparently without a need for an extra filter doing another copy.  See drm_transfer_data_from() in libavutil/hwcontext_drm.c, which currently just does the linear copy 
you observe regardless of the format modifier on the input buffer.
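
As a rough illustration of the point above, a tiling-aware copy for the Tile-X case might look something like the sketch below. This is not the code in libavutil/hwcontext_drm.c; detile_x_copy() and its parameters are hypothetical names. It assumes 4 KiB tiles of 512 bytes x 8 rows (the same geometry the patch uses), no address swizzling, a source pitch that is a multiple of 512 bytes, and a height that is a multiple of 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILEX_ROW_BYTES 512  /* bytes per row inside one Tile-X tile */
#define TILEX_ROWS      8    /* rows per tile: 512 * 8 = 4096 bytes  */

/*
 * Hypothetical helper, not an FFmpeg API: copy a Tile-X surface into a
 * linear buffer in a single pass. Returns 0 on success, or -1 if the
 * geometry does not match the assumptions stated above.
 */
static int detile_x_copy(uint8_t *dst, size_t dst_pitch,
                         const uint8_t *src, size_t src_pitch,
                         size_t width_bytes, size_t height)
{
    size_t tiles_per_row, tx, ty, r;

    assert(dst && src);
    if (src_pitch % TILEX_ROW_BYTES || height % TILEX_ROWS ||
        width_bytes > src_pitch || width_bytes > dst_pitch)
        return -1;

    /* Tiles are stored one after another, row of tiles by row of tiles. */
    tiles_per_row = src_pitch / TILEX_ROW_BYTES;
    for (ty = 0; ty < height / TILEX_ROWS; ty++) {
        for (tx = 0; tx * TILEX_ROW_BYTES < width_bytes; tx++) {
            const uint8_t *tile = src + (ty * tiles_per_row + tx) *
                                        TILEX_ROW_BYTES * TILEX_ROWS;
            size_t x = tx * TILEX_ROW_BYTES;
            /* Clip the last tile column to the visible width. */
            size_t n = width_bytes - x < TILEX_ROW_BYTES ?
                       width_bytes - x : TILEX_ROW_BYTES;

            for (r = 0; r < TILEX_ROWS; r++)
                memcpy(dst + (ty * TILEX_ROWS + r) * dst_pitch + x,
                       tile + r * TILEX_ROW_BYTES, n);
        }
    }
    return 0;
}
```

Unlike the loop in the patch, this takes the source pitch from the buffer rather than assuming it equals width * 4, so padded surfaces would not need a separate error path.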

Unrelated to the previous point, does the dependence of the actual layout of the X and Y tiled formats on the exact model of GPU in use cause any problems here?  If the layout is actually the same on 
everything people might use nowadays then it's probably fine; if that isn't true then maybe it needs some extra check.

- Mark
C Hanish Menon June 28, 2020, 9:40 p.m. UTC | #6
Hi Mark,

**** hwdownload vs separate filter

True, for the kmsgrab use-case one could potentially do this transform as
part of the drm_transfer_data logic (which currently mmaps and does a
linear copy, if I remember correctly).  But as I mentioned in my previous
email, since this is done on the CPU side, if one wants to capture very
large framebuffers (say 4K or 8K at high fps), it could impact performance
to some extent. In such a situation, decoupling the capture from the
detiling allows one to capture the screen at a very high resolution
without worrying about detiling, and then handle the detiling in an
offline / separate pass.

NOTE1: Also, as a side note, I don't think the existing logic currently
fetches the format modifier of the actual framebuffer. I think it gets set
to NONE by default and remains like that unless the user passes the
format_modifier argument. But I could be wrong in this understanding, as I
have only gone through the code flow quickly once, and I am in somewhat
alien territory here.

**** Tile layouts

As it mainly supports Intel tile layouts for now, and as older Intel GPUs
didn't support the Tile-Y format for scan-out, I think most setups
currently set the framebuffer layout to Tile-X for display. So in that
sense the default type of Tile-X used by the filter should be fine for most
cases. However, if one wants, one can change the tile conversion format to
Tile-Y by passing an argument to the filter. Also, as I wasn't very sure
that the format modifier is being picked up by default, I used the most
likely case as the default and in turn provided the option to change the
layout conversion if required.
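
A minimal sketch of how the two ideas could be combined, i.e. trust the format modifier when one is actually reported and otherwise fall back to the user's type option (0 = Tile-X, 1 = Tile-Y, as in the patch). pick_layout(), enum fb_layout and the MOD_* names are hypothetical; the MOD_* values mirror what I understand the kernel's drm_fourcc.h to define for DRM_FORMAT_MOD_LINEAR, I915_FORMAT_MOD_X_TILED and I915_FORMAT_MOD_Y_TILED, and real code would include that header instead of redefining them.

```c
#include <stdint.h>

/* Local stand-ins (assumption: same values as in drm_fourcc.h). */
#define MOD_VENDOR_INTEL  UINT64_C(0x01)
#define MOD_INTEL(val)    ((MOD_VENDOR_INTEL << 56) | UINT64_C(val))
#define MOD_LINEAR        UINT64_C(0)
#define MOD_I915_X_TILED  MOD_INTEL(1)
#define MOD_I915_Y_TILED  MOD_INTEL(2)

enum fb_layout {
    FB_LAYOUT_LINEAR,
    FB_LAYOUT_TILE_X,
    FB_LAYOUT_TILE_Y,
    FB_LAYOUT_UNKNOWN
};

/*
 * Hypothetical helper: have_modifier says whether the capture path really
 * obtained a modifier for the framebuffer; user_type is the filter option.
 */
static enum fb_layout pick_layout(int have_modifier, uint64_t modifier,
                                  int user_type)
{
    if (have_modifier) {
        if (modifier == MOD_LINEAR)
            return FB_LAYOUT_LINEAR;
        if (modifier == MOD_I915_X_TILED)
            return FB_LAYOUT_TILE_X;
        if (modifier == MOD_I915_Y_TILED)
            return FB_LAYOUT_TILE_Y;
        /* A modifier was reported, but not one this sketch can detile. */
        return FB_LAYOUT_UNKNOWN;
    }

    /* No modifier available: use what the user asked for. */
    if (user_type == 0)
        return FB_LAYOUT_TILE_X;
    if (user_type == 1)
        return FB_LAYOUT_TILE_Y;
    return FB_LAYOUT_UNKNOWN;
}
```

With something like this, a reported but unsupported modifier gives an explicit "unknown" result rather than silently applying the Tile-X default.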

NOTE2: The Tile-X being the default is my understanding based on a quick
glance through the Intel GPU documents and potentially some things which I
might have seen online.

NOTE3: I am not much clued in to this domain in general, nor tracking it.
I had an issue with some capturing which I wanted to do, so I went quickly
through the ffmpeg kmsgrab + hwup/down and hwcontext code path, along with
some documents and headers. Then, based on a rough logical understanding,
I wanted to implement a quick and flexible solution to solve my problem as
well as potentially help others who might have a similar issue. And that
is how this filter got done.

Also, I am planning to add an additional generic detile logic later, where
the user can configure the tile format as a list of direction changes and a
few other constraints, and then the same logic can handle either TileX or
TileY or TileYs or TileYf or ... This will be slower (based on some initial
tests, the generic logic seems to be around 50% slower compared to the
current specific targeted conversion logic which I have implemented), but
should allow one to try and detile any (or rather, more correctly, many)
kinds of tile layouts, as the case may be. Again, the idea is to use this
generic path as an offline / second pass.
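
To make the "one logic, configurable layout" idea a bit more concrete, here is a much simpler stand-in than the direction-change scheme described above: a descriptor that only says how wide and how tall each contiguous chunk of the source is. It is enough to cover the Tile-X and Tile-Y cases from the patch with a single loop, but not TileYs/TileYf. struct tile_desc and detile_generic() are hypothetical names, and the sketch assumes the source pitch equals width_bytes and that width_bytes is a multiple of the full tile width (512 bytes for Tile-X, 128 bytes for Tile-Y).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical descriptor: the source is read as consecutive chunks, each
 * holding chunk_rows rows of chunk_bytes contiguous bytes.
 */
struct tile_desc {
    size_t chunk_bytes;
    size_t chunk_rows;
};

static const struct tile_desc TILE_X = { 512, 8 };  /* 512 B x 8 rows  */
static const struct tile_desc TILE_Y = { 16, 32 };  /* 16 B  x 32 rows */

static void detile_generic(const struct tile_desc *d,
                           uint8_t *dst, size_t dst_pitch,
                           const uint8_t *src,
                           size_t width_bytes, size_t height)
{
    size_t x = 0, y = 0, r;

    assert(width_bytes % d->chunk_bytes == 0 && height % d->chunk_rows == 0);
    while (y < height) {
        /* Scatter one chunk: its rows land one below the other in dst. */
        for (r = 0; r < d->chunk_rows; r++)
            memcpy(dst + (y + r) * dst_pitch + x,
                   src + r * d->chunk_bytes, d->chunk_bytes);
        src += d->chunk_bytes * d->chunk_rows;

        /* Chunks advance left to right, then down by one chunk height. */
        x += d->chunk_bytes;
        if (x >= width_bytes) {
            x = 0;
            y += d->chunk_rows;
        }
    }
}
```

A call such as detile_generic(&TILE_Y, dst, dst_pitch, src, w * 4, h) would then replace the unrolled Tile-Y routine, at the cost of the loop overhead the unrolling was meant to avoid.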


C Hanish Menon June 28, 2020, 10:17 p.m. UTC | #7
Hi Mark,

A small additional clarification to my last email, where I have responded
to your queries/thoughts.

The additional flexible generic logic which I am currently experimenting
with allows the more complex Tile-Yf to be detiled with around 50% overhead
compared to the targeted Tile-X or Tile-Y implementation, while it handles
Tile-X with only an additional 3% overhead compared to the targeted Tile-X
implementation. So in that sense the generic logic seems to do well. So
for TileX and TileY one uses the targeted logic, while for the more
intricate tiled layouts one uses the flexible | configurable generic
detile logic.


On Mon, Jun 29, 2020 at 3:10 AM C Hanish Menon <hanishkvc@gmail.com> wrote:

> Hi Mark,
>
> **** hwdownload vs separate filter
>
> True, for kmsgrab use-case one could potentially do this transform as part
> of the drm_transfer_data logic (which currently mmaps and does a linear
> copy, if even I remember correctly).  But like what I had mentioned in my
> previous email, as this is done on the cpu side, if one wants to capture
> very large framebuffers (say 4K or 8K at high fps), it could impact the
> performance to some extent, so in such a situation decoupling the capture
> from detiling, allows one to capture the screen at a very high resolution
> without worrying about detiling and then handle detile in a offline /
> separate pass manner.
>
> NOTE1: Also as a side note, I dont think the existing logic is currently
> fetching the format modifier of the actual frame buffer, I think it gets
> set to NONE type by default and remains like that, unless user passes the
> format_modifier argument, but I could be wrong in this understanding of
> mine, as I have only gone through the code flow quickly once and also as I
> am in alien territory in some sense at one level.
>
> **** Tile layouts
>
> As it mainly supports Intel tile layouts for now, and as older Intel GPUs
> didnt support Tile-Y format for scan out purpose, I think currently most
> set the framebuffer layout to Tile-X for display purpose. So in that sense
> the default type of Tile-X which is used by the filter should be fine for
> most cases. However if one wants, one can change the tile conversion format
> to Tile-Y by passing a argument to the filter. Also as I wasnt very sure
> the format-modifer is being picked up by default, so also used the most
> likely case as the default and inturn provided the option to change the
> layout conversion to use if required.
>
> NOTE2: The Tile-X being the default is my understanding based on a quick
> glance through the Intel GPU documents and potentially some things which I
> might have seen online.
>
> NOTE3: I am not much clued in into this domain in general, nor tracking
> it, but more as I had a issue with some capturing which I wanted to do, I
> went through the ffmpeg kmsgrab + hwup/down and hwcontext code path a bit,
> some documents and headers quickly and then based on a rough logical
> understanding I wanted to implement a quick and flexible solution to solve
> my problem as well as potentially help others who might have a similar
> issue. And that is how this filter got done.
>
> Also I am planning to add a additional generic detile logic later, where
> the user can configure the tile format as a list of direction changes and
> few other constraints and then the same logic can handle either TileX or
> TileY or TileYs or TileYf or ... This will be slower (based on some initial
> tests the generic logic seems to be around 50% slower compared to current
> specific targeted conversion logics which I have implemented), but should
> allow one to try and detile any (or rather more correctly - many) kind of
> tile layouts, as the case may be. Again the idea is to use this generic
> path has a offline / second pass.
>
>
> On Mon, Jun 29, 2020 at 2:28 AM Mark Thompson <sw@jkqxz.net> wrote:
>
>> On 27/06/2020 20:57, hanishkvc wrote:
>> > v02-20200627IST2331
>> >
>> > Unrolled Intel Legacy Tile-Y detiling logic.
>> >
>> > Also a consolidated patch file, instead of the previous development
>> > flow based multiple patch files.
>> >
>> > v01-20200627IST1308
>> >
>> > Implemented Intel Legacy Tile-X and Tile-Y detiling logic
>> >
>> > NOTES:
>> >
>> > This video filter allows framebuffers which are tiled to be detiled
>> > using logic running on the cpu, into a linear layout.
>> >
>> > Currently it supports Intel Legacy Tile-X and Tile-Y layout detiling.
>> > THis should help one to work with frames captured (say using kmsgrab)
>> > on laptops having Intel GPU.
>> >
>> > Tile-X conversion logic has been explicitly cross checked, with Tile-X
>> > based frames. However Tile-Y conv logic hasn't been tested with Tile-Y
>> > based frames, but it should potentially do the job, based on my current
>> > understanding of the Tile-Y layout format.
>> >
>> > TODO1: At a later time have to generate Tile-Y based frames, and then
>> > cross check the corresponding logic explicitly.
>> >
>> > TODO2: Maybe use OpenGL or Vulkan buffer helper routines to do the
>> > layout conversion. But some online discussions from sometime back seem
>> > to indicate that this path is not fully bug free currently.
>> > ---
>> >   Changelog                 |   1 +
>> >   doc/filters.texi          |  62 ++++++++
>> >   libavfilter/Makefile      |   1 +
>> >   libavfilter/allfilters.c  |   1 +
>> >   libavfilter/vf_fbdetile.c | 309 ++++++++++++++++++++++++++++++++++++++
>> >   5 files changed, 374 insertions(+)
>> >   create mode 100644 libavfilter/vf_fbdetile.c
>>
>> For your kmsgrab use-case I think you are doing this in the wrong place.
>> There is already a copy during the download step (the hwdownload filter
>> before this), and that does know what the tiling mode
>> is such that it could detile transparently without a need for an extra
>> filter doing another copy.  See drm_transfer_data_from() in
>> libavutil/hwcontext_drm.c, which currently just does the linear copy
>> you observe regardless of the format modifier on the input buffer.
>>
>> Unrelated to the previous point, does the dependence of the actual layout
>> of the X and Y tiled formats on the exact model of GPU in use cause any
>> problems here?  If the layout is actually the same on
>> everything people might use nowadays then it's probably fine; if that
>> isn't true then maybe it needs some extra check.
>>
>> - Mark
>
>
>
> --
> Keep ;-)
> HanishKVC
>
C Hanish Menon June 29, 2020, 5:40 p.m. UTC | #8
Hi Lynne,

Thanks for your thoughts. My replies are embedded below.

On Mon, Jun 29, 2020 at 6:32 PM Lynne <dev@lynne.ee> wrote:

> Jun 28, 2020, 22:40 by hanishkvc@gmail.com:
>
> > Hi Mark,
> >
> > **** hwdownload vs separate filter
> >
> > True, for kmsgrab use-case one could potentially do this transform as part
> > of the drm_transfer_data logic (which currently mmaps and does a linear
> > copy, if I remember correctly).  But like what I had mentioned in my
> > previous email, as this is done on the cpu side, if one wants to capture
> > very large framebuffers (say 4K or 8K at high fps), it could impact the
> > performance to some extent, so in such a situation decoupling the capture
> > from detiling, allows one to capture the screen at a very high resolution
> > without worrying about detiling and then handle detile in an offline /
> > separate pass manner.
> >
>
> I too think the filter must be done during hwdownload. That's the only place
> where it fits, since the tiling is known, and also the intent to access the
> buffer is known.
> We should not be outputting raw, tiled data in the first place and if speed
> really is necessary the detiling can be SIMD'd to speed it up significantly.
>
>
Do note that if one uses kmsgrab currently it will be outputting raw tiled
data only, if the underlying framebuffer is tiled. So it is not a new
behaviour, but what exists currently.

And as I had mentioned before, if we embed this logic into hwdownload, then
it can be used only when capturing using hardware context, while by keeping
the logic as a separate filter, the end user has the flexibility to use it
as they may find suitable for their situation. Also the overhead added by
the separate filter to the path is minimal.

Isn't the whole idea of the Unix KISS principle and piping, as well as filter
chaining, to have each logically independent / self-contained functionality
on its own, and to give the user the freedom to mix and match things the way
they want for their end use? This definitely follows that.


>
> > NOTE1: Also as a side note, I don't think the existing logic is currently
> > fetching the format modifier of the actual frame buffer, I think it gets
> > set to NONE type by default and remains like that, unless user passes the
> > format_modifier argument, but I could be wrong in this understanding of
> > mine, as I have only gone through the code flow quickly once and also as I
> > am in alien territory in some sense at one level.
> >
>
> kmsgrab might not, but other APIs certainly do.
> Also, we don't top post on this mailing list.

Patch

diff --git a/Changelog b/Changelog
index a60e7d2eb8..0e03491f6a 100644
--- a/Changelog
+++ b/Changelog
@@ -2,6 +2,7 @@  Entries are sorted chronologically from oldest to youngest within each release,
 releases are sorted from youngest to oldest.
 
 version <next>:
+- fbdetile cpu based framebuffer layout detiling video filter
 - AudioToolbox output device
 - MacCaption demuxer
 
diff --git a/doc/filters.texi b/doc/filters.texi
index 3c2dd2eb90..73ba21af89 100644
--- a/doc/filters.texi
+++ b/doc/filters.texi
@@ -12210,6 +12210,68 @@  It accepts the following optional parameters:
 The number of the CUDA device to use
 @end table
 
+@anchor{fbdetile}
+@section fbdetile
+
+Detiles the Framebuffer tile layout into a linear layout using CPU.
+
+It currently supports conversion from Intel legacy tile-x and tile-y layouts
+into a linear layout. This is useful if one is using kmsgrab and hwdownload
+to capture a screen which is using one of these non-linear layouts.
+
+Currently it expects the data to be in a 32bit RGB based pixel format. However
+the logic doesn't do any pixel format conversion. Support for 16bit RGB data
+will be enabled later, as the logic is transparent to it at one level.
+
+One could either insert this into the filter chain while capturing itself,
+or else, if that slows things down, one could instead insert it into the
+filter chain during playback or transcoding.
+
+It supports the following optional parameters
+
+@table @option
+@item type
+Specify which detiling conversion to apply. The supported values are
+@table @var
+@item 0
+intel tile-x to linear conversion (the default)
+@item 1
+intel tile-y to linear conversion.
+@end table
+@end table
+
+If one wants to convert during capture itself, one could do
+@example
+ffmpeg -f kmsgrab -i - -vf "hwdownload, fbdetile" OUTPUT
+@end example
+
+However if one wants to convert after the tiled data has been already captured
+@example
+ffmpeg -i INPUT -vf "fbdetile" OUTPUT
+@end example
+@example
+ffplay -i INPUT -vf "fbdetile"
+@end example
+
+NOTE: While transcoding a test 1080p h264 stream, with 276 frames, with two
+runs of each situation, the performance was as given below. However this
+was for the older / initial version of the logic, and it was run on the
+default linux chromebook->vm->container, so the absolute values need not be
+representative. But in a relative sense the overhead would be similar.
+@example
+rm out.mp4; time ./ffmpeg -i input.mp4 out.mp4
+rm out.mp4; time ./ffmpeg -i input.mp4 -vf fbdetile=0 out.mp4
+rm out.mp4; time ./ffmpeg -i input.mp4 -vf fbdetile=1 out.mp4
+@end example
+@table @option
+@item with no fbdetile filter
+it took ~7.28 secs,
+@item with fbdetile=0 filter
+it took ~8.69 secs,
+@item with fbdetile=1 filter
+it took ~9.20 secs.
+@end table
+
 @section hqx
 
 Apply a high-quality magnification filter designed for pixel art. This filter
diff --git a/libavfilter/Makefile b/libavfilter/Makefile
index 5123540653..bdb0c379ae 100644
--- a/libavfilter/Makefile
+++ b/libavfilter/Makefile
@@ -280,6 +280,7 @@  OBJS-$(CONFIG_HWDOWNLOAD_FILTER)             += vf_hwdownload.o
 OBJS-$(CONFIG_HWMAP_FILTER)                  += vf_hwmap.o
 OBJS-$(CONFIG_HWUPLOAD_CUDA_FILTER)          += vf_hwupload_cuda.o
 OBJS-$(CONFIG_HWUPLOAD_FILTER)               += vf_hwupload.o
+OBJS-$(CONFIG_FBDETILE_FILTER)               += vf_fbdetile.o
 OBJS-$(CONFIG_HYSTERESIS_FILTER)             += vf_hysteresis.o framesync.o
 OBJS-$(CONFIG_IDET_FILTER)                   += vf_idet.o
 OBJS-$(CONFIG_IL_FILTER)                     += vf_il.o
diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c
index 1183e40267..f8dceb2a88 100644
--- a/libavfilter/allfilters.c
+++ b/libavfilter/allfilters.c
@@ -265,6 +265,7 @@  extern AVFilter ff_vf_hwdownload;
 extern AVFilter ff_vf_hwmap;
 extern AVFilter ff_vf_hwupload;
 extern AVFilter ff_vf_hwupload_cuda;
+extern AVFilter ff_vf_fbdetile;
 extern AVFilter ff_vf_hysteresis;
 extern AVFilter ff_vf_idet;
 extern AVFilter ff_vf_il;
diff --git a/libavfilter/vf_fbdetile.c b/libavfilter/vf_fbdetile.c
new file mode 100644
index 0000000000..8b20c96d2c
--- /dev/null
+++ b/libavfilter/vf_fbdetile.c
@@ -0,0 +1,309 @@ 
+/*
+ * Copyright (c) 2020 HanishKVC
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/**
+ * @file
+ * Detile the Frame buffer's tile layout using the cpu
+ * Currently it supports legacy Intel Tile-X and Tile-Y layout detiling.
+ *
+ */
+
+/*
+ * ToThink|Check: Optimisations
+ *
+ * Do the gcc settings used by ffmpeg allow memcpy | stringops inlining,
+ * loop unrolling, better native matching instructions, additional
+ * optimisations, ...
+ *
+ * Does gcc map to optimal memcpy logic, based on the situation it is
+ * used in.
+ *
+ * If not, maybe look at vector_size or intrinsics or appropriate arch
+ * and cpu specific inline asm or ...
+ *
+ */
+
+#include "libavutil/avassert.h"
+#include "libavutil/imgutils.h"
+#include "libavutil/opt.h"
+#include "avfilter.h"
+#include "formats.h"
+#include "internal.h"
+#include "video.h"
+
+enum FilterMode {
+    TYPE_INTELX,
+    TYPE_INTELY,
+    NB_TYPE
+};
+
+typedef struct FBDetileContext {
+    const AVClass *class;
+    int width, height;
+    int type;
+} FBDetileContext;
+
+#define OFFSET(x) offsetof(FBDetileContext, x)
+#define FLAGS AV_OPT_FLAG_FILTERING_PARAM|AV_OPT_FLAG_VIDEO_PARAM
+static const AVOption fbdetile_options[] = {
+    { "type", "set framebuffer format_modifier type", OFFSET(type), AV_OPT_TYPE_INT, {.i64=TYPE_INTELX}, 0, NB_TYPE-1, FLAGS, "type" },
+        { "intelx", "Intel Tile-X layout", 0, AV_OPT_TYPE_CONST, {.i64=TYPE_INTELX}, INT_MIN, INT_MAX, FLAGS, "type" },
+        { "intely", "Intel Tile-Y layout", 0, AV_OPT_TYPE_CONST, {.i64=TYPE_INTELY}, INT_MIN, INT_MAX, FLAGS, "type" },
+    { NULL }
+};
+
+AVFILTER_DEFINE_CLASS(fbdetile);
+
+static av_cold int init(AVFilterContext *ctx)
+{
+    FBDetileContext *fbdetile = ctx->priv;
+
+    if (fbdetile->type == TYPE_INTELX) {
+        av_log(ctx, AV_LOG_INFO, "init: Intel tile-x to linear\n");
+    } else if (fbdetile->type == TYPE_INTELY) {
+        av_log(ctx, AV_LOG_INFO, "init: Intel tile-y to linear\n");
+    } else {
+        av_log(ctx, AV_LOG_DEBUG, "init: Unknown Tile format specified, shouldnt reach here\n");
+    }
+    fbdetile->width = 1920;
+    fbdetile->height = 1080;
+    return 0;
+}
+
+static int query_formats(AVFilterContext *ctx)
+{
+    // Currently only RGB based 32bit formats are specified
+    // TODO: Technically the logic is transparent to 16bit RGB formats also
+    static const enum AVPixelFormat pix_fmts[] = {AV_PIX_FMT_RGB0, AV_PIX_FMT_0RGB, AV_PIX_FMT_BGR0, AV_PIX_FMT_0BGR,
+                                                  AV_PIX_FMT_RGBA, AV_PIX_FMT_ARGB, AV_PIX_FMT_BGRA, AV_PIX_FMT_ABGR,
+                                                  AV_PIX_FMT_NONE};
+    AVFilterFormats *fmts_list;
+
+    fmts_list = ff_make_format_list(pix_fmts);
+    if (!fmts_list)
+        return AVERROR(ENOMEM);
+    return ff_set_common_formats(ctx, fmts_list);
+}
+
+static int config_props(AVFilterLink *inlink)
+{
+    AVFilterContext *ctx = inlink->dst;
+    FBDetileContext *fbdetile = ctx->priv;
+
+    fbdetile->width = inlink->w;
+    fbdetile->height = inlink->h;
+    av_log(ctx, AV_LOG_DEBUG, "config_props: %d x %d\n", fbdetile->width, fbdetile->height);
+
+    return 0;
+}
+
+static void detile_intelx(AVFilterContext *ctx, int w, int h,
+                                uint8_t *dst, int dstLineSize,
+                          const uint8_t *src, int srcLineSize)
+{
+    // Offsets and LineSize are in bytes
+    int tileW = 128; // For a 32Bit / Pixel framebuffer, 512/4
+    int tileH = 8;
+
+    int sO = 0;
+    int dX = 0;
+    int dY = 0;
+    int nTRows = (w*h)/tileW;
+    int cTR = 0;
+    if (w*4 != srcLineSize) {
+        av_log(ctx, AV_LOG_DEBUG, "intelx: w%dxh%d, dL%d, sL%d\n", w, h, dstLineSize, srcLineSize);
+        av_log(ctx, AV_LOG_ERROR, "intelx: dont support LineSize | Pitch going beyond width\n");
+    }
+    while (cTR < nTRows) {
+        int dO = dY*dstLineSize + dX*4;
+#ifdef DEBUG_FBTILE
+        av_log(ctx, AV_LOG_DEBUG, "intelx: dX%d dY%d, sO%d, dO%d\n", dX, dY, sO, dO);
+#endif
+        memcpy(dst+dO+0*dstLineSize, src+sO+0*512, 512);
+        memcpy(dst+dO+1*dstLineSize, src+sO+1*512, 512);
+        memcpy(dst+dO+2*dstLineSize, src+sO+2*512, 512);
+        memcpy(dst+dO+3*dstLineSize, src+sO+3*512, 512);
+        memcpy(dst+dO+4*dstLineSize, src+sO+4*512, 512);
+        memcpy(dst+dO+5*dstLineSize, src+sO+5*512, 512);
+        memcpy(dst+dO+6*dstLineSize, src+sO+6*512, 512);
+        memcpy(dst+dO+7*dstLineSize, src+sO+7*512, 512);
+        dX += tileW;
+        if (dX >= w) {
+            dX = 0;
+            dY += 8;
+        }
+        sO = sO + 8*512;
+        cTR += 8;
+    }
+}
+
+/*
+ * Intel Legacy Tile-Y layout conversion support
+ *
+ * currently done in a simple dumb way. Two low hanging optimisations
+ * that could be readily applied are
+ *
+ * a) unrolling the inner for loop
+ *    --- Given small size memcpy, should help, DONE
+ *
+ * b) using simd based 128bit loading and storing along with prefetch
+ *    hinting.
+ *
+ *    TOTHINK|CHECK: Does memcpy already do this and more if the situation
+ *    is right?!
+ *
+ *    As code (or even intrinsics) would be specific to each architecture,
+ *    avoiding for now. Later have to check if vector_size attribute and
+ *    corresponding implementation by gcc can handle different architectures
+ *    properly, such that it wont become worse than memcpy provided for that
+ *    architecture.
+ *
+ * Or maybe I could even merge the two intel detiling logics into one, as
+ * the semantic and flow is almost same for both logics.
+ *
+ */
+static void detile_intely(AVFilterContext *ctx, int w, int h,
+                                uint8_t *dst, int dstLineSize,
+                          const uint8_t *src, int srcLineSize)
+{
+    // Offsets and LineSize are in bytes
+    int tileW = 4; // For a 32Bit / Pixel framebuffer, 16/4
+    int tileH = 32;
+
+    int sO = 0;
+    int dX = 0;
+    int dY = 0;
+    int nTRows = (w*h)/tileW;
+    int cTR = 0;
+    if (w*4 != srcLineSize) {
+        av_log(ctx, AV_LOG_DEBUG, "intely: w%dxh%d, dL%d, sL%d\n", w, h, dstLineSize, srcLineSize);
+        av_log(ctx, AV_LOG_ERROR, "intely: dont support LineSize | Pitch going beyond width\n");
+    }
+    while (cTR < nTRows) {
+        int dO = dY*dstLineSize + dX*4;
+#ifdef DEBUG_FBTILE
+        av_log(ctx, AV_LOG_DEBUG, "intely: dX%d dY%d, sO%d, dO%d\n", dX, dY, sO, dO);
+#endif
+
+        memcpy(dst+dO+0*dstLineSize, src+sO+0*16, 16);
+        memcpy(dst+dO+1*dstLineSize, src+sO+1*16, 16);
+        memcpy(dst+dO+2*dstLineSize, src+sO+2*16, 16);
+        memcpy(dst+dO+3*dstLineSize, src+sO+3*16, 16);
+        memcpy(dst+dO+4*dstLineSize, src+sO+4*16, 16);
+        memcpy(dst+dO+5*dstLineSize, src+sO+5*16, 16);
+        memcpy(dst+dO+6*dstLineSize, src+sO+6*16, 16);
+        memcpy(dst+dO+7*dstLineSize, src+sO+7*16, 16);
+        memcpy(dst+dO+8*dstLineSize, src+sO+8*16, 16);
+        memcpy(dst+dO+9*dstLineSize, src+sO+9*16, 16);
+        memcpy(dst+dO+10*dstLineSize, src+sO+10*16, 16);
+        memcpy(dst+dO+11*dstLineSize, src+sO+11*16, 16);
+        memcpy(dst+dO+12*dstLineSize, src+sO+12*16, 16);
+        memcpy(dst+dO+13*dstLineSize, src+sO+13*16, 16);
+        memcpy(dst+dO+14*dstLineSize, src+sO+14*16, 16);
+        memcpy(dst+dO+15*dstLineSize, src+sO+15*16, 16);
+        memcpy(dst+dO+16*dstLineSize, src+sO+16*16, 16);
+        memcpy(dst+dO+17*dstLineSize, src+sO+17*16, 16);
+        memcpy(dst+dO+18*dstLineSize, src+sO+18*16, 16);
+        memcpy(dst+dO+19*dstLineSize, src+sO+19*16, 16);
+        memcpy(dst+dO+20*dstLineSize, src+sO+20*16, 16);
+        memcpy(dst+dO+21*dstLineSize, src+sO+21*16, 16);
+        memcpy(dst+dO+22*dstLineSize, src+sO+22*16, 16);
+        memcpy(dst+dO+23*dstLineSize, src+sO+23*16, 16);
+        memcpy(dst+dO+24*dstLineSize, src+sO+24*16, 16);
+        memcpy(dst+dO+25*dstLineSize, src+sO+25*16, 16);
+        memcpy(dst+dO+26*dstLineSize, src+sO+26*16, 16);
+        memcpy(dst+dO+27*dstLineSize, src+sO+27*16, 16);
+        memcpy(dst+dO+28*dstLineSize, src+sO+28*16, 16);
+        memcpy(dst+dO+29*dstLineSize, src+sO+29*16, 16);
+        memcpy(dst+dO+30*dstLineSize, src+sO+30*16, 16);
+        memcpy(dst+dO+31*dstLineSize, src+sO+31*16, 16);
+
+        dX += tileW;
+        if (dX >= w) {
+            dX = 0;
+            dY += 32;
+        }
+        sO = sO + 32*16;
+        cTR += 32;
+    }
+}
+
+static int filter_frame(AVFilterLink *inlink, AVFrame *in)
+{
+    AVFilterContext *ctx = inlink->dst;
+    FBDetileContext *fbdetile = ctx->priv;
+    AVFilterLink *outlink = ctx->outputs[0];
+    AVFrame *out;
+
+    out = ff_get_video_buffer(outlink, outlink->w, outlink->h);
+    if (!out) {
+        av_frame_free(&in);
+        return AVERROR(ENOMEM);
+    }
+    av_frame_copy_props(out, in);
+
+    if (fbdetile->type == TYPE_INTELX) {
+        detile_intelx(ctx, fbdetile->width, fbdetile->height,
+                      out->data[0], out->linesize[0],
+                      in->data[0], in->linesize[0]);
+    } else if (fbdetile->type == TYPE_INTELY) {
+        detile_intely(ctx, fbdetile->width, fbdetile->height,
+                      out->data[0], out->linesize[0],
+                      in->data[0], in->linesize[0]);
+    }
+
+    av_frame_free(&in);
+    return ff_filter_frame(outlink, out);
+}
+
+static av_cold void uninit(AVFilterContext *ctx)
+{
+
+}
+
+static const AVFilterPad fbdetile_inputs[] = {
+    {
+        .name         = "default",
+        .type         = AVMEDIA_TYPE_VIDEO,
+        .config_props = config_props,
+        .filter_frame = filter_frame,
+    },
+    { NULL }
+};
+
+static const AVFilterPad fbdetile_outputs[] = {
+    {
+        .name = "default",
+        .type = AVMEDIA_TYPE_VIDEO,
+    },
+    { NULL }
+};
+
+AVFilter ff_vf_fbdetile = {
+    .name          = "fbdetile",
+    .description   = NULL_IF_CONFIG_SMALL("Detile Framebuffer using CPU"),
+    .priv_size     = sizeof(FBDetileContext),
+    .init          = init,
+    .uninit        = uninit,
+    .query_formats = query_formats,
+    .inputs        = fbdetile_inputs,
+    .outputs       = fbdetile_outputs,
+    .priv_class    = &fbdetile_class,
+};
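
The comment above detile_intely() notes that the two Intel detiling routines
could perhaps be merged, since their flow is almost the same. A minimal
standalone sketch of that idea follows. It is not part of the patch:
detile_blocks() and detile_selftest() are hypothetical names, and the block
sizes (512 bytes x 8 rows for Tile-X, 16 bytes x 32 rows for Tile-Y) are taken
from the way the patch models the two layouts, not from the Intel
documentation. Like the patch, it assumes 32bit pixels and a source pitch
equal to width * 4; unlike the patch, it refuses to run when that does not
hold.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not in the patch): the source is read sequentially
 * as blocks of blk_rows rows of blk_bytes bytes each; the blocks are placed
 * in the linear destination left to right, then top to bottom. This is the
 * same walk as the unrolled detile_intelx() / detile_intely() loops:
 *   Tile-X as modelled in the patch: blk_bytes = 512, blk_rows = 8
 *   Tile-Y as modelled in the patch: blk_bytes = 16,  blk_rows = 32
 * Returns 0 on success, -1 if the geometry is not supported.
 */
static int detile_blocks(uint8_t *dst, int dst_linesize,
                         const uint8_t *src, int src_linesize,
                         int w, int h, int blk_bytes, int blk_rows)
{
    int row_bytes = w * 4; /* assumption: 32bit per pixel */

    assert(dst && src);
    if (src_linesize != row_bytes || row_bytes % blk_bytes || h % blk_rows)
        return -1;

    for (int y = 0; y < h; y += blk_rows)
        for (int x = 0; x < row_bytes; x += blk_bytes)
            for (int r = 0; r < blk_rows; r++) {
                memcpy(dst + (y + r) * dst_linesize + x, src, blk_bytes);
                src += blk_bytes;
            }
    return 0;
}

/* Small self check on a synthetic 256x32 frame; returns 0 if all is well. */
static int detile_selftest(void)
{
    enum { W = 256, H = 32, PITCH = W * 4 };
    static uint8_t tiled[PITCH * H], linear[PITCH * H];

    /* Tile-X model: tag every 512-byte source row with its index. */
    for (int i = 0; i < PITCH * H; i++)
        tiled[i] = (uint8_t)(i / 512);
    if (detile_blocks(linear, PITCH, tiled, PITCH, W, H, 512, 8) < 0)
        return -1;
    /* row 1 of block 0, row 0 of block 1, row 0 of the next block row */
    if (linear[1 * PITCH] != 1 || linear[512] != 8 || linear[8 * PITCH] != 16)
        return -2;

    /* Tile-Y model: tag every 16-byte source row with its index. */
    for (int i = 0; i < PITCH * H; i++)
        tiled[i] = (uint8_t)(i / 16);
    if (detile_blocks(linear, PITCH, tiled, PITCH, W, H, 16, 32) < 0)
        return -3;
    /* row 1 of column 0, row 0 of column 1 */
    if (linear[1 * PITCH] != 1 || linear[16] != 32)
        return -4;

    /* A row size that is not a multiple of the block width is rejected. */
    if (detile_blocks(linear, 400, tiled, 400, 100, 8, 512, 8) != -1)
        return -5;
    return 0;
}
```

Whether a loop like this keeps up with the hand-unrolled memcpy calls in the
patch would have to be measured; the sketch only shows that both layouts, as
modelled here, reduce to one walk with two parameters.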