diff mbox series

[FFmpeg-devel,v04] fbdetile cpu based detiling of framebuffers v04

Message ID 20200701164348.41647-1-hanishkvc@gmail.com
State New

Checks

Context           Check    Description
andriy/default    pending
andriy/make_warn  warning  New warnings during build
andriy/make       success  Make finished
andriy/make_fate  success  Make fate finished

Commit Message

hanishkvc July 1, 2020, 4:43 p.m. UTC
v04-20200701IST2132, fbdetile

Optimised the generic detile logic to detile multiple tiles in parallel,
logically speaking. This reduces cross-check overheads to some extent
and thus improves speed a bit. In a hardware or multicore setup, the
same approach could be used to parallelise the detiling operations in a
true sense.

Added an additional level of subtiling for Tile-Yf, which I had missed
in the dirChangesList based tiling configuration used by my generic
detiling logic. The overhead due to this additional subtiling is
compensated by the speed gains due to parallel detiling.

NOTE: This is a consolidated patch; it also contains the previous versions.

v03-20200629IST2208 fbdetile

Added generic detiling logic, which can be easily configured to
detile many different tiling schemes.

The same is in turn used to detile the Intel Tile-Yf layout.
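
As a rough illustration of the configurable walk described above (a sketch, not the patch's exact code): the destination cursor is advanced after every sub-tile by consulting a small table of direction changes. The names walk_cfg, walk_pos and walk_step are hypothetical; the table values mirror the Tile-Y configuration (tyChanges) in vf_fbdetile.c.

```c
#include <assert.h>

/* One direction change: once the number of sub-tile lines consumed so far
 * is a multiple of pos_offset, move the destination cursor by the deltas. */
struct change_entry { int pos_offset, x_delta, y_delta; };

/* Hypothetical bundle of the per-layout settings. */
struct walk_cfg {
    int sub_tile_height;               /* lines per sub-tile */
    int tile_height;                   /* lines per full tile */
    const struct change_entry *changes;
    int num_changes;
};

/* Destination cursor in pixels, plus sub-tile lines consumed so far. */
struct walk_pos { int x, y, lines_done; };

/* Tile-Y as configured in the patch: 4 pixel wide, 32 line tall sub-tiles. */
static const struct change_entry tile_y_changes[] = { {32, 4, 0}, {256, 4, 0} };
static const struct walk_cfg tile_y_cfg = { 32, 32, tile_y_changes, 2 };

/* Advance the cursor past one sub-tile. Entries are checked from the
 * coarsest (last) to the finest (first); the first match wins. */
static void walk_step(const struct walk_cfg *cfg, int width, struct walk_pos *pos)
{
    assert(cfg->num_changes > 0);
    pos->lines_done += cfg->sub_tile_height;
    for (int i = cfg->num_changes - 1; i >= 0; i--) {
        if (pos->lines_done % cfg->changes[i].pos_offset == 0) {
            pos->x += cfg->changes[i].x_delta;
            pos->y += cfg->changes[i].y_delta;
            break;
        }
    }
    if (pos->x >= width) {             /* end of a row of tiles */
        pos->x = 0;
        pos->y += cfg->tile_height;
    }
}
```

Supporting another layout then mostly means supplying a different table and sub-tile size, which is the sense in which the logic is "easily configured".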

v02-20200627IST2331

Unrolled Intel Legacy Tile-Y detiling logic.

Also switched to a single consolidated patch file, instead of the
earlier multiple patch files that followed the development flow.

v01-20200627IST1308

Implemented Intel Legacy Tile-X and Tile-Y detiling logic

NOTES:

This video filter detiles tiled framebuffers into a linear layout,
using logic running on the CPU.

Currently it supports Intel Legacy Tile-X and Tile-Y layout detiling,
as well as the newer Intel Tile-Yf layouts.
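
For readers unfamiliar with what detiling involves, here is a minimal sketch of the Tile-X case as the patch models it for 32-bit pixels: tiles of 128 pixels (512 bytes) by 8 rows stored back to back, each copied row by row into the linear destination. The function name detile_x32 is hypothetical, and unlike the patch this sketch simply rejects frames that are not a whole number of tiles.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { PIX_BYTES = 4, TILE_W = 128, TILE_H = 8 };   /* 512 bytes x 8 rows */

/* Copy a Tile-X style buffer (tiles stored one after another, each tile
 * holding TILE_H rows of TILE_W pixels) into a linear buffer.
 * Returns 0 on success, -1 if the size is not a whole number of tiles. */
static int detile_x32(uint8_t *dst, int dst_pitch, const uint8_t *src, int w, int h)
{
    const int row_bytes = TILE_W * PIX_BYTES;
    size_t src_off = 0;

    if (w <= 0 || h <= 0 || w % TILE_W || h % TILE_H)
        return -1;
    assert(dst_pitch >= w * PIX_BYTES);
    /* Visit tiles left to right, top to bottom; the source is read sequentially. */
    for (int ty = 0; ty < h; ty += TILE_H)
        for (int tx = 0; tx < w; tx += TILE_W)
            for (int r = 0; r < TILE_H; r++, src_off += row_bytes)
                memcpy(dst + (size_t)(ty + r) * dst_pitch + (size_t)tx * PIX_BYTES,
                       src + src_off, row_bytes);
    return 0;
}
```

The source is consumed strictly in order, so only the destination offset needs computing for each tile row.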

This should help one work with frames captured (say using kmsgrab)
on laptops with an Intel GPU. The detiling can be done live during
capture, or applied later as a separate pass.

The Tile-X conversion logic has been explicitly cross-checked with Tile-X
based frames. The Tile-Y and Tile-Yf conversion logic has not been tested
with Tile-Y or Tile-Yf based frames, but it should in principle get the
job done, based on my current understanding of these layout formats. A
minimal test has been done by checking how a multicolor linear framebuffer
gets converted into a patterned layout, depending on the detile walk.
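
In the absence of real Tile-Y captures, one cheap self-consistency check is a round trip: tile a known linear pattern with the inverse of the walk, detile it, and compare. The sketch below does this for the Tile-Y walk as modelled in the patch (columns of 4 x 32 pixels stored back to back). The names copy_y32 and roundtrip_ok are hypothetical. Note that this only shows the two directions are mutual inverses; it does not show that the walk matches what the hardware produces.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { PIX = 4, COL_W = 4, COL_H = 32, COL_BYTES = COL_W * PIX };

/* Walk the Tile-Y style layout as modelled in the patch (columns of
 * COL_W x COL_H pixels stored back to back) and copy in either direction:
 * to_linear != 0 detiles, to_linear == 0 tiles. */
static void copy_y32(uint8_t *out, const uint8_t *in, int w, int h, int to_linear)
{
    const size_t pitch = (size_t)w * PIX;
    size_t tiled_off = 0;

    assert(w > 0 && h > 0 && w % COL_W == 0 && h % COL_H == 0);
    for (int y0 = 0; y0 < h; y0 += COL_H)
        for (int x0 = 0; x0 < w; x0 += COL_W)
            for (int r = 0; r < COL_H; r++, tiled_off += COL_BYTES) {
                size_t lin_off = (size_t)(y0 + r) * pitch + (size_t)x0 * PIX;
                if (to_linear)
                    memcpy(out + lin_off, in + tiled_off, COL_BYTES);
                else
                    memcpy(out + tiled_off, in + lin_off, COL_BYTES);
            }
}

/* Returns 1 if tile-then-detile reproduces the input, 0 if not,
 * -1 on allocation failure. */
static int roundtrip_ok(int w, int h)
{
    const size_t size = (size_t)w * h * PIX;
    uint8_t *lin = malloc(size), *tiled = malloc(size), *back = malloc(size);
    int ok = -1;

    if (lin && tiled && back) {
        for (size_t i = 0; i < size; i++)
            lin[i] = (uint8_t)(i * 131u + 17u);   /* arbitrary test pattern */
        copy_y32(tiled, lin, w, h, 0);
        copy_y32(back, tiled, w, h, 1);
        ok = memcmp(lin, back, size) == 0;
    }
    free(lin);
    free(tiled);
    free(back);
    return ok;
}
```

A check against frames captured from hardware using Tile-Y or Tile-Yf would still be needed to confirm the layouts themselves.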
---
 Changelog                 |   1 +
 doc/filters.texi          |  74 +++++
 libavfilter/Makefile      |   1 +
 libavfilter/allfilters.c  |   1 +
 libavfilter/vf_fbdetile.c | 568 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 645 insertions(+)
 create mode 100644 libavfilter/vf_fbdetile.c

Comments

Michael Niedermayer July 2, 2020, 12:57 p.m. UTC | #1
On Wed, Jul 01, 2020 at 10:13:48PM +0530, hanishkvc wrote:
> [...]

This breaks build on non x86

src/libavfilter/vf_fbdetile.c:81:10: fatal error: x86intrin.h: No such file or directory
 #include <x86intrin.h>
          ^~~~~~~~~~~~~

[...]
hanishkvc July 2, 2020, 6 p.m. UTC | #2
Hi Michael,

Thanks for the input. I had forgotten to disable/undef DEBUG_PERF, which I
was using to get a rough idea of the time taken by the logic, via the
processor performance counter on x86. The logic doesn't use x86intrin.h for
any other purpose, so disabling DEBUG_PERF will fix it.

I can release an updated patch with the DEBUG_PERF undef'd by default, to
avoid this issue.
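
A sketch of how the timing code could be made safe even when DEBUG_PERF is turned on for a non-x86 build (this is one possible arrangement, not what the posted patch does): keep the macro undefined by default, and include x86intrin.h only when both the macro and an x86 GCC/clang target are present. The helper names perf_now and perf_elapsed are hypothetical, and the clock() fallback is an assumption chosen only because it is standard C.

```c
#include <assert.h>
#include <stdint.h>

/* Off by default; build with -DDEBUG_PERF to enable timing. */
#if defined(DEBUG_PERF)
#  if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#    include <x86intrin.h>
static uint64_t perf_now(void) { return __rdtsc(); }           /* TSC ticks */
#  else
#    include <time.h>
static uint64_t perf_now(void) { return (uint64_t)clock(); }   /* CPU clock ticks */
#  endif
#else
static uint64_t perf_now(void) { return 0; }                   /* timing disabled */
#endif

/* Ticks elapsed between two perf_now() readings. */
static uint64_t perf_elapsed(uint64_t start, uint64_t end)
{
    assert(end >= start);
    return end - start;
}
```

The two enabled branches report different units (TSC ticks versus clock() ticks), so the numbers are only comparable within one build.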



Patch

diff --git a/Changelog b/Changelog
index a60e7d2eb8..0e03491f6a 100644
--- a/Changelog
+++ b/Changelog
@@ -2,6 +2,7 @@  Entries are sorted chronologically from oldest to youngest within each release,
 releases are sorted from youngest to oldest.
 
 version <next>:
+- fbdetile cpu based framebuffer layout detiling video filter
 - AudioToolbox output device
 - MacCaption demuxer
 
diff --git a/doc/filters.texi b/doc/filters.texi
index 67892e0afb..f7bcae1685 100644
--- a/doc/filters.texi
+++ b/doc/filters.texi
@@ -12210,6 +12210,80 @@  It accepts the following optional parameters:
 The number of the CUDA device to use
 @end table
 
+@anchor{fbdetile}
+@section fbdetile
+
+Detile a tiled framebuffer layout into a linear layout using the CPU.
+
+It currently supports conversion from Intel legacy tile-x and tile-y as well
+as the newer Intel tile-yf layouts into a linear layout. This is useful if
+one is using kmsgrab and hwdownload to capture a screen which is using one
+of these non-linear layouts.
+
+NOTE: It also provides generic detiling logic, which can easily be configured
+to detile many other tiling schemes in future, if required. The same logic is
+used to detile the Intel tile-yf layout. Sample configurations that handle
+Intel tile-x and tile-y using the generic detile logic are included in the
+code for reference.
+
+Currently it expects the data to be in a 32-bit RGB based pixel format. The
+logic does not do any pixel format conversion. Support for 16-bit RGB data may
+be enabled later, as the logic is largely transparent to the pixel size.
+
+One could either insert this into the filter chain during capture itself
+or, if that slows things down, insert it into the filter chain later,
+during playback or transcoding.
+
+It supports the following optional parameters:
+
+@table @option
+@item type
+Specify which detiling conversion to apply. The supported values are
+@table @var
+@item 0
+intel tile-x to linear conversion (the default)
+@item 1
+intel tile-y to linear conversion.
+@item 2
+intel tile-yf to linear conversion.
+@end table
+@end table
+
+If one wants to convert during capture itself, one could do
+@example
+ffmpeg -f kmsgrab -i - -vf "hwdownload,format=bgr0,fbdetile" OUTPUT
+@end example
+
+However if one wants to convert after the tiled data has been already captured
+@example
+ffmpeg -i INPUT -vf "fbdetile" OUTPUT
+@end example
+@example
+ffplay -i INPUT -vf "fbdetile"
+@end example
+
+NOTE: While transcoding a test 1080p h264 stream with 276 frames, the average
+times taken by the different detile logics were as follows.
+@example
+rm out.mp4; time ./ffmpeg -i input.mp4 out.mp4
+rm out.mp4; time ./ffmpeg -i input.mp4 -vf fbdetile=0 out.mp4
+rm out.mp4; time ./ffmpeg -i input.mp4 -vf fbdetile=1 out.mp4
+rm out.mp4; time ./ffmpeg -i input.mp4 -vf fbdetile=2 out.mp4
+@end example
+@table @option
+@item with no fbdetile filter
+it took ~7.28 secs, i5-8th Gen
+it took ~10.1 secs, i7-7th Gen
+@item with fbdetile=0 filter, Intel Tile-X
+it took ~8.69 secs, i5-8th Gen
+it took ~13.3 secs, i7-7th Gen
+@item with fbdetile=1 filter, Intel Tile-Y
+it took ~9.20 secs, i5-8th Gen
+it took ~13.5 secs, i7-7th Gen
+@item with fbdetile=2 filter, Intel Tile-Yf
+it took ~13.8 secs, i7-7th Gen
+@end table
+
 @section hqx
 
 Apply a high-quality magnification filter designed for pixel art. This filter
diff --git a/libavfilter/Makefile b/libavfilter/Makefile
index 5123540653..bdb0c379ae 100644
--- a/libavfilter/Makefile
+++ b/libavfilter/Makefile
@@ -280,6 +280,7 @@  OBJS-$(CONFIG_HWDOWNLOAD_FILTER)             += vf_hwdownload.o
 OBJS-$(CONFIG_HWMAP_FILTER)                  += vf_hwmap.o
 OBJS-$(CONFIG_HWUPLOAD_CUDA_FILTER)          += vf_hwupload_cuda.o
 OBJS-$(CONFIG_HWUPLOAD_FILTER)               += vf_hwupload.o
+OBJS-$(CONFIG_FBDETILE_FILTER)               += vf_fbdetile.o
 OBJS-$(CONFIG_HYSTERESIS_FILTER)             += vf_hysteresis.o framesync.o
 OBJS-$(CONFIG_IDET_FILTER)                   += vf_idet.o
 OBJS-$(CONFIG_IL_FILTER)                     += vf_il.o
diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c
index 1183e40267..f8dceb2a88 100644
--- a/libavfilter/allfilters.c
+++ b/libavfilter/allfilters.c
@@ -265,6 +265,7 @@  extern AVFilter ff_vf_hwdownload;
 extern AVFilter ff_vf_hwmap;
 extern AVFilter ff_vf_hwupload;
 extern AVFilter ff_vf_hwupload_cuda;
+extern AVFilter ff_vf_fbdetile;
 extern AVFilter ff_vf_hysteresis;
 extern AVFilter ff_vf_idet;
 extern AVFilter ff_vf_il;
diff --git a/libavfilter/vf_fbdetile.c b/libavfilter/vf_fbdetile.c
new file mode 100644
index 0000000000..f9e13ced18
--- /dev/null
+++ b/libavfilter/vf_fbdetile.c
@@ -0,0 +1,568 @@ 
+/*
+ * Copyright (c) 2020 HanishKVC
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/**
+ * @file
+ * Detile the Frame buffer's tile layout using the cpu
+ * Currently it supports detiling of following layouts
+ *     legacy Intel Tile-X
+ *     legacy Intel Tile-Y
+ *     newer Intel Tile-Yf
+ * More tiling layouts can be easily supported by adding configuration data
+ * for the generic detile logic, wrt the required tiling schemes.
+ *
+ */
+
+/*
+ * ToThink|Check: Optimisations
+ *
+ * Do the gcc settings used by ffmpeg allow memcpy | stringops inlining,
+ * loop unrolling, better native matching instructions, additional
+ * optimisations, ...
+ *
+ * Does gcc map memcpy to optimal logic for the situation it is used in,
+ * i.e. does it pick,
+ *     based on size of transfer, alignment, architecture, etc,
+ *     a suitable combination of inlining and/or rep movsb and/or
+ *     simd load/store and/or unrolling and/or ...
+ *
+ * If not, maybe look at vector_size or intrinsics or appropriate arch
+ * and cpu specific inline asm or ...
+ *
+ */
+
+/*
+ * Performance check results on i7-7500u
+ * TileYf, TileGX, TileGY using detile_generic_opti
+ *     This mainly impacts TileYf, due to its deeper subtiling
+ *     Without opti, its TSCCnt rises to around 11.XYM
+ * Run Type      : Type   : Seconds Max, Min : TSCCnt Min, Max
+ * Non filter run:        :  10.11s, 09.96s  :
+ * fbdetile=0 run: TileX  :  13.45s, 13.20s  :  05.95M, 06.10M
+ * fbdetile=1 run: TileY  :  13.50s, 13.39s  :  06.22M, 06.39M
+ * fbdetile=2 run: TileYf :  13.75s, 13.63s  :  09.82M, 09.90M
+ * fbdetile=3 run: TileGX :  13.70s, 13.32s  :  06.15M, 06.24M
+ * fbdetile=4 run: TileGY :  14.12s, 13.57s  :  08.75M, 09.10M
+ */
+
+#include "libavutil/avassert.h"
+#include "libavutil/imgutils.h"
+#include "libavutil/opt.h"
+#include "avfilter.h"
+#include "formats.h"
+#include "internal.h"
+#include "video.h"
+
+// Use Optimised detile_generic or the Simpler but more fine grained one
+#define DETILE_GENERIC_OPTI 1
+// Enable printing of the tile walk
+#undef DEBUG_FBTILE
+// Print time taken by detile using performance counter
+#define DEBUG_PERF 1
+
+#ifdef DEBUG_PERF
+#include <x86intrin.h>
+uint64_t perfTime = 0;
+int perfCnt = 0;
+#endif
+
+enum FilterMode {
+    TYPE_INTELX,
+    TYPE_INTELY,
+    TYPE_INTELYF,
+    TYPE_INTELGX,
+    TYPE_INTELGY,
+    NB_TYPE
+};
+
+typedef struct FBDetileContext {
+    const AVClass *class;
+    int width, height;
+    int type;
+} FBDetileContext;
+
+#define OFFSET(x) offsetof(FBDetileContext, x)
+#define FLAGS AV_OPT_FLAG_FILTERING_PARAM|AV_OPT_FLAG_VIDEO_PARAM
+static const AVOption fbdetile_options[] = {
+    { "type", "set framebuffer format_modifier type", OFFSET(type), AV_OPT_TYPE_INT, {.i64=TYPE_INTELX}, 0, NB_TYPE-1, FLAGS, "type" },
+        { "intelx", "Intel Tile-X layout", 0, AV_OPT_TYPE_CONST, {.i64=TYPE_INTELX}, INT_MIN, INT_MAX, FLAGS, "type" },
+        { "intely", "Intel Tile-Y layout", 0, AV_OPT_TYPE_CONST, {.i64=TYPE_INTELY}, INT_MIN, INT_MAX, FLAGS, "type" },
+        { "intelyf", "Intel Tile-Yf layout", 0, AV_OPT_TYPE_CONST, {.i64=TYPE_INTELYF}, INT_MIN, INT_MAX, FLAGS, "type" },
+        { "intelgx", "Intel Tile-X layout, GenericDetile", 0, AV_OPT_TYPE_CONST, {.i64=TYPE_INTELGX}, INT_MIN, INT_MAX, FLAGS, "type" },
+        { "intelgy", "Intel Tile-Y layout, GenericDetile", 0, AV_OPT_TYPE_CONST, {.i64=TYPE_INTELGY}, INT_MIN, INT_MAX, FLAGS, "type" },
+    { NULL }
+};
+
+AVFILTER_DEFINE_CLASS(fbdetile);
+
+static av_cold int init(AVFilterContext *ctx)
+{
+    FBDetileContext *fbdetile = ctx->priv;
+
+    if (fbdetile->type == TYPE_INTELX) {
+        fprintf(stderr,"INFO:fbdetile:init: Intel tile-x to linear\n");
+    } else if (fbdetile->type == TYPE_INTELY) {
+        fprintf(stderr,"INFO:fbdetile:init: Intel tile-y to linear\n");
+    } else if (fbdetile->type == TYPE_INTELYF) {
+        fprintf(stderr,"INFO:fbdetile:init: Intel tile-yf to linear\n");
+    } else if (fbdetile->type == TYPE_INTELGX) {
+        fprintf(stderr,"INFO:fbdetile:init: Intel tile-x to linear, using generic detile\n");
+    } else if (fbdetile->type == TYPE_INTELGY) {
+        fprintf(stderr,"INFO:fbdetile:init: Intel tile-y to linear, using generic detile\n");
+    } else {
+        fprintf(stderr,"DBUG:fbdetile:init: Unknown Tile format specified, shouldnt reach here\n");
+    }
+    fbdetile->width = 1920;
+    fbdetile->height = 1080;
+    return 0;
+}
+
+static int query_formats(AVFilterContext *ctx)
+{
+    // Currently only RGB based 32bit formats are specified
+    // TODO: Technically the logic is transparent to 16bit RGB formats also to a great extent
+    static const enum AVPixelFormat pix_fmts[] = {AV_PIX_FMT_RGB0, AV_PIX_FMT_0RGB, AV_PIX_FMT_BGR0, AV_PIX_FMT_0BGR,
+                                                  AV_PIX_FMT_RGBA, AV_PIX_FMT_ARGB, AV_PIX_FMT_BGRA, AV_PIX_FMT_ABGR,
+                                                  AV_PIX_FMT_NONE};
+    AVFilterFormats *fmts_list;
+
+    fmts_list = ff_make_format_list(pix_fmts);
+    if (!fmts_list)
+        return AVERROR(ENOMEM);
+    return ff_set_common_formats(ctx, fmts_list);
+}
+
+static int config_props(AVFilterLink *inlink)
+{
+    AVFilterContext *ctx = inlink->dst;
+    FBDetileContext *fbdetile = ctx->priv;
+
+    fbdetile->width = inlink->w;
+    fbdetile->height = inlink->h;
+    fprintf(stderr,"DBUG:fbdetile:config_props: %d x %d\n", fbdetile->width, fbdetile->height);
+
+    return 0;
+}
+
+static void detile_intelx(AVFilterContext *ctx, int w, int h,
+                                uint8_t *dst, int dstLineSize,
+                          const uint8_t *src, int srcLineSize)
+{
+    // Offsets and LineSize are in bytes
+    const int pixBytes = 4;                     // bytes per pixel
+    const int tileW = 128;                      // tileWidth inPixels, 512/4, For a 32Bits/Pixel framebuffer
+    const int tileH = 8;                        // tileHeight inPixelLines
+    const int tileWBytes = tileW*pixBytes;      // tileWidth inBytes
+
+    if (w*pixBytes != srcLineSize) {
+        fprintf(stderr,"DBUG:fbdetile:intelx: w%dxh%d, dL%d, sL%d\n", w, h, dstLineSize, srcLineSize);
+        fprintf(stderr,"ERRR:fbdetile:intelx: dont support LineSize | Pitch going beyond width\n");
+    }
+    int sO = 0;                 // srcOffset inBytes
+    int dX = 0;                 // destX inPixels
+    int dY = 0;                 // destY inPixels
+    int nTLines = (w*h)/tileW;  // numTileLines; One TileLine = One TileWidth
+    int cTL = 0;                // curTileLine
+    while (cTL < nTLines) {
+        int dO = dY*dstLineSize + dX*pixBytes;
+#ifdef DEBUG_FBTILE
+        fprintf(stderr,"DBUG:fbdetile:intelx: dX%d dY%d, sO%d, dO%d\n", dX, dY, sO, dO);
+#endif
+        memcpy(dst+dO+0*dstLineSize, src+sO+0*tileWBytes, tileWBytes);
+        memcpy(dst+dO+1*dstLineSize, src+sO+1*tileWBytes, tileWBytes);
+        memcpy(dst+dO+2*dstLineSize, src+sO+2*tileWBytes, tileWBytes);
+        memcpy(dst+dO+3*dstLineSize, src+sO+3*tileWBytes, tileWBytes);
+        memcpy(dst+dO+4*dstLineSize, src+sO+4*tileWBytes, tileWBytes);
+        memcpy(dst+dO+5*dstLineSize, src+sO+5*tileWBytes, tileWBytes);
+        memcpy(dst+dO+6*dstLineSize, src+sO+6*tileWBytes, tileWBytes);
+        memcpy(dst+dO+7*dstLineSize, src+sO+7*tileWBytes, tileWBytes);
+        dX += tileW;
+        if (dX >= w) {
+            dX = 0;
+            dY += tileH;
+        }
+        sO = sO + tileW*tileH*pixBytes;
+        cTL += tileH;
+    }
+}
+
+/*
+ * Intel Legacy Tile-Y layout conversion support
+ *
+ * currently done in a simple dumb way. Two low hanging optimisations
+ * that could be readily applied are
+ *
+ * a) unrolling the inner for loop
+ *    --- Given small size memcpy, should help, DONE
+ *
+ * b) using simd based 128bit loading and storing along with prefetch
+ *    hinting.
+ *
+ *    TOTHINK|CHECK: Does memcpy already do this and more if the situation
+ *    is right?!
+ *
+ *    As code (or even intrinsics) would be specific to each architecture,
+ *    avoiding for now. Later have to check if vector_size attribute and
+ *    corresponding implementation by gcc can handle different architectures
+ *    properly, such that it wont become worse than memcpy provided for that
+ *    architecture.
+ *
+ * Or maybe I could even merge the two intel detiling logics into one, as
+ * the semantics and flow are almost the same for both.
+ *
+ */
+static void detile_intely(AVFilterContext *ctx, int w, int h,
+                                uint8_t *dst, int dstLineSize,
+                          const uint8_t *src, int srcLineSize)
+{
+    // Offsets and LineSize are in bytes
+    const int pixBytes = 4;                 // bytesPerPixel
+    // tileW represents subTileWidth here, as it can be repeated to fill a tile
+    const int tileW = 4;                    // tileWidth inPixels, 16/4, For a 32Bits/Pixel framebuffer
+    const int tileH = 32;                   // tileHeight inPixelLines
+    const int tileWBytes = tileW*pixBytes;  // tileWidth inBytes
+
+    if (w*pixBytes != srcLineSize) {
+        fprintf(stderr,"DBUG:fbdetile:intely: w%dxh%d, dL%d, sL%d\n", w, h, dstLineSize, srcLineSize);
+        fprintf(stderr,"ERRR:fbdetile:intely: dont support LineSize | Pitch going beyond width\n");
+    }
+    int sO = 0;
+    int dX = 0;
+    int dY = 0;
+    const int nTLines = (w*h)/tileW;
+    int cTL = 0;
+    while (cTL < nTLines) {
+        int dO = dY*dstLineSize + dX*pixBytes;
+#ifdef DEBUG_FBTILE
+        fprintf(stderr,"DBUG:fbdetile:intely: dX%d dY%d, sO%d, dO%d\n", dX, dY, sO, dO);
+#endif
+
+        memcpy(dst+dO+0*dstLineSize, src+sO+0*tileWBytes, tileWBytes);
+        memcpy(dst+dO+1*dstLineSize, src+sO+1*tileWBytes, tileWBytes);
+        memcpy(dst+dO+2*dstLineSize, src+sO+2*tileWBytes, tileWBytes);
+        memcpy(dst+dO+3*dstLineSize, src+sO+3*tileWBytes, tileWBytes);
+        memcpy(dst+dO+4*dstLineSize, src+sO+4*tileWBytes, tileWBytes);
+        memcpy(dst+dO+5*dstLineSize, src+sO+5*tileWBytes, tileWBytes);
+        memcpy(dst+dO+6*dstLineSize, src+sO+6*tileWBytes, tileWBytes);
+        memcpy(dst+dO+7*dstLineSize, src+sO+7*tileWBytes, tileWBytes);
+        memcpy(dst+dO+8*dstLineSize, src+sO+8*tileWBytes, tileWBytes);
+        memcpy(dst+dO+9*dstLineSize, src+sO+9*tileWBytes, tileWBytes);
+        memcpy(dst+dO+10*dstLineSize, src+sO+10*tileWBytes, tileWBytes);
+        memcpy(dst+dO+11*dstLineSize, src+sO+11*tileWBytes, tileWBytes);
+        memcpy(dst+dO+12*dstLineSize, src+sO+12*tileWBytes, tileWBytes);
+        memcpy(dst+dO+13*dstLineSize, src+sO+13*tileWBytes, tileWBytes);
+        memcpy(dst+dO+14*dstLineSize, src+sO+14*tileWBytes, tileWBytes);
+        memcpy(dst+dO+15*dstLineSize, src+sO+15*tileWBytes, tileWBytes);
+        memcpy(dst+dO+16*dstLineSize, src+sO+16*tileWBytes, tileWBytes);
+        memcpy(dst+dO+17*dstLineSize, src+sO+17*tileWBytes, tileWBytes);
+        memcpy(dst+dO+18*dstLineSize, src+sO+18*tileWBytes, tileWBytes);
+        memcpy(dst+dO+19*dstLineSize, src+sO+19*tileWBytes, tileWBytes);
+        memcpy(dst+dO+20*dstLineSize, src+sO+20*tileWBytes, tileWBytes);
+        memcpy(dst+dO+21*dstLineSize, src+sO+21*tileWBytes, tileWBytes);
+        memcpy(dst+dO+22*dstLineSize, src+sO+22*tileWBytes, tileWBytes);
+        memcpy(dst+dO+23*dstLineSize, src+sO+23*tileWBytes, tileWBytes);
+        memcpy(dst+dO+24*dstLineSize, src+sO+24*tileWBytes, tileWBytes);
+        memcpy(dst+dO+25*dstLineSize, src+sO+25*tileWBytes, tileWBytes);
+        memcpy(dst+dO+26*dstLineSize, src+sO+26*tileWBytes, tileWBytes);
+        memcpy(dst+dO+27*dstLineSize, src+sO+27*tileWBytes, tileWBytes);
+        memcpy(dst+dO+28*dstLineSize, src+sO+28*tileWBytes, tileWBytes);
+        memcpy(dst+dO+29*dstLineSize, src+sO+29*tileWBytes, tileWBytes);
+        memcpy(dst+dO+30*dstLineSize, src+sO+30*tileWBytes, tileWBytes);
+        memcpy(dst+dO+31*dstLineSize, src+sO+31*tileWBytes, tileWBytes);
+
+        dX += tileW;
+        if (dX >= w) {
+            dX = 0;
+            dY += tileH;
+        }
+        sO = sO + tileW*tileH*pixBytes;
+        cTL += tileH;
+    }
+}
+
+/*
+ * Generic detile logic
+ */
+
+struct changeEntry {
+    int posOffset;
+    int xDelta;
+    int yDelta;
+};
+
+// Settings for Intel Tile-Yf framebuffer layout
+// May need to swap the 4 pixel wide subtile, have to check doc bit more
+int yfBytesPerPixel = 4;            // Assumes each pixel is 4 bytes
+int yfSubTileWidth = 4;
+int yfSubTileHeight = 8;
+struct changeEntry yfChanges[] = { {8, 4, 0}, {16, -4, 8}, {32, 4, -8}, {64, -12, 8 }, {128, 4, -24}, {256, 4, -24} };
+int yfNumChanges = 6;
+int yfSubTileWidthBytes = 16;       //subTileWidth*bytesPerPixel
+int yfTileWidth = 32;
+int yfTileHeight = 32;
+// Setting for Intel Tile-X framebuffer layout
+struct changeEntry txChanges[] = { {8, 128, 0} };
+int txBytesPerPixel = 4;            // Assumes each pixel is 4 bytes
+int txSubTileWidth = 128;
+int txSubTileHeight = 8;
+int txSubTileWidthBytes = 512;      //subTileWidth*bytesPerPixel
+int txTileWidth = 128;
+int txTileHeight = 8;
+int txNumChanges = 1;
+// Setting for Intel Tile-Y framebuffer layout
+// Even though the simple generic detiling logic doesn't require the
+// dummy 256 posOffset entry, the pseudo parallel detiling based
+// opti logic needs it to know about the tile boundary.
+struct changeEntry tyChanges[] = { {32, 4, 0}, {256, 4, 0} };
+int tyBytesPerPixel = 4;            // Assumes each pixel is 4 bytes
+int tySubTileWidth = 4;
+int tySubTileHeight = 32;
+int tySubTileWidthBytes = 16;       //subTileWidth*bytesPerPixel
+int tyTileWidth = 32;
+int tyTileHeight = 32;
+int tyNumChanges = 2;
+
+static void detile_generic_simple(AVFilterContext *ctx, int w, int h,
+                                  uint8_t *dst, int dstLineSize,
+                            const uint8_t *src, int srcLineSize,
+                            int bytesPerPixel,
+                            int subTileWidth, int subTileHeight, int subTileWidthBytes,
+                            int tileWidth, int tileHeight,
+                            int numChanges, struct changeEntry *changes)
+{
+
+    if (w*bytesPerPixel != srcLineSize) {
+        fprintf(stderr,"DBUG:fbdetile:generic: w%dxh%d, dL%d, sL%d\n", w, h, dstLineSize, srcLineSize);
+        fprintf(stderr,"ERRR:fbdetile:generic: dont support LineSize | Pitch going beyond width\n");
+    }
+    int sO = 0;
+    int dX = 0;
+    int dY = 0;
+    int nSTLines = (w*h)/subTileWidth;  // numSubTileLines
+    int cSTL = 0;                       // curSubTileLine
+    while (cSTL < nSTLines) {
+        int dO = dY*dstLineSize + dX*bytesPerPixel;
+#ifdef DEBUG_FBTILE
+        fprintf(stderr,"DBUG:fbdetile:generic: dX%d dY%d, sO%d, dO%d\n", dX, dY, sO, dO);
+#endif
+
+        for (int k = 0; k < subTileHeight; k++) {
+            memcpy(dst+dO+k*dstLineSize, src+sO+k*subTileWidthBytes, subTileWidthBytes);
+        }
+        sO = sO + subTileHeight*subTileWidthBytes;
+
+        cSTL += subTileHeight;
+        for (int i=numChanges-1; i>=0; i--) {
+            if ((cSTL%changes[i].posOffset) == 0) {
+                dX += changes[i].xDelta;
+                dY += changes[i].yDelta;
+                break;
+            }
+        }
+        if (dX >= w) {
+            dX = 0;
+            dY += tileHeight;
+        }
+    }
+}
+
+
+static void detile_generic_opti(AVFilterContext *ctx, int w, int h,
+                                  uint8_t *dst, int dstLineSize,
+                            const uint8_t *src, int srcLineSize,
+                            int bytesPerPixel,
+                            int subTileWidth, int subTileHeight, int subTileWidthBytes,
+                            int tileWidth, int tileHeight,
+                            int numChanges, struct changeEntry *changes)
+{
+    int parallel = 1;
+
+    if (w*bytesPerPixel != srcLineSize) {
+        fprintf(stderr,"DBUG:fbdetile:generic: w%dxh%d, dL%d, sL%d\n", w, h, dstLineSize, srcLineSize);
+        fprintf(stderr,"ERRR:fbdetile:generic: dont support LineSize | Pitch going beyond width\n");
+    }
+    if (w%tileWidth != 0) {
+        fprintf(stderr,"DBUG:fbdetile:generic:NotSupported:NonMultWidth: width%d, tileWidth%d\n", w, tileWidth);
+    }
+    int sO = 0;
+    int sOPrev = 0;
+    int dX = 0;
+    int dY = 0;
+    int nSTLines = (w*h)/subTileWidth;
+    //int nSTLinesInATile = (tileWidth*tileHeight)/subTileWidth;
+    int nTilesInARow = w/tileWidth;
+    for (parallel=8; parallel>0; parallel--) {
+        if (nTilesInARow%parallel == 0)
+            break;
+    }
+    int cSTL = 0;
+    int curTileInRow = 0;
+    while (cSTL < nSTLines) {
+        int dO = dY*dstLineSize + dX*bytesPerPixel;
+#ifdef DEBUG_FBTILE
+        fprintf(stderr,"DBUG:fbdetile:generic: dX%d dY%d, sO%d, dO%d\n", dX, dY, sO, dO);
+#endif
+
+        // Most tiling layouts have a minimum subtile of 4x4, if I remember correctly,
+        // so this loop has been unrolled in multiples of 4, to speed things up a bit.
+        // However tiling involving 3x3 or 2x2 can't be handled. Use detile_generic_simple
+        // for such tile layouts.
+        // Detile in parallel to a limited extent. To avoid thrashing due to cache
+        // set-associativity or limited cache size, keep it spatially, and in turn temporally, small.
+        for (int k = 0; k < subTileHeight; k+=4) {
+            for (int p = 0; p < parallel; p++) {
+                int pSrcOffset = p*tileWidth*tileHeight*bytesPerPixel;
+                int pDstOffset = p*tileWidth*bytesPerPixel;
+                memcpy(dst+dO+k*dstLineSize+pDstOffset, src+sO+k*subTileWidthBytes+pSrcOffset, subTileWidthBytes);
+                memcpy(dst+dO+(k+1)*dstLineSize+pDstOffset, src+sO+(k+1)*subTileWidthBytes+pSrcOffset, subTileWidthBytes);
+                memcpy(dst+dO+(k+2)*dstLineSize+pDstOffset, src+sO+(k+2)*subTileWidthBytes+pSrcOffset, subTileWidthBytes);
+                memcpy(dst+dO+(k+3)*dstLineSize+pDstOffset, src+sO+(k+3)*subTileWidthBytes+pSrcOffset, subTileWidthBytes);
+            }
+        }
+        sO = sO + subTileHeight*subTileWidthBytes;
+
+        cSTL += subTileHeight;
+        for (int i=numChanges-1; i>=0; i--) {
+            if ((cSTL%changes[i].posOffset) == 0) {
+                if (i == numChanges-1) {
+                    curTileInRow += parallel;
+                    dX = curTileInRow*tileWidth;
+                    sO = sOPrev + tileWidth*tileHeight*bytesPerPixel*(parallel);
+                    sOPrev = sO;
+                } else {
+                    dX += changes[i].xDelta;
+                }
+                dY += changes[i].yDelta;
+                break;
+            }
+        }
+        if (dX >= w) {
+            dX = 0;
+            curTileInRow = 0;
+            dY += tileHeight;
+            if (dY >= h) {
+                break;
+            }
+        }
+    }
+}
+
+
+#ifdef DETILE_GENERIC_OPTI
+#define detile_generic detile_generic_opti
+#else
+#define detile_generic detile_generic_simple
+#endif
+
+static int filter_frame(AVFilterLink *inlink, AVFrame *in)
+{
+    AVFilterContext *ctx = inlink->dst;
+    FBDetileContext *fbdetile = ctx->priv;
+    AVFilterLink *outlink = ctx->outputs[0];
+    AVFrame *out;
+
+    out = ff_get_video_buffer(outlink, outlink->w, outlink->h);
+    if (!out) {
+        av_frame_free(&in);
+        return AVERROR(ENOMEM);
+    }
+    av_frame_copy_props(out, in);
+
+#ifdef DEBUG_PERF
+    uint64_t perfStart = __rdtsc();
+#endif
+    if (fbdetile->type == TYPE_INTELX) {
+        detile_intelx(ctx, fbdetile->width, fbdetile->height,
+                      out->data[0], out->linesize[0],
+                      in->data[0], in->linesize[0]);
+    } else if (fbdetile->type == TYPE_INTELY) {
+        detile_intely(ctx, fbdetile->width, fbdetile->height,
+                      out->data[0], out->linesize[0],
+                      in->data[0], in->linesize[0]);
+    } else if (fbdetile->type == TYPE_INTELYF) {
+        detile_generic(ctx, fbdetile->width, fbdetile->height,
+                        out->data[0], out->linesize[0],
+                        in->data[0], in->linesize[0],
+                        yfBytesPerPixel, yfSubTileWidth, yfSubTileHeight, yfSubTileWidthBytes,
+                        yfTileWidth, yfTileHeight,
+                        yfNumChanges, yfChanges);
+    } else if (fbdetile->type == TYPE_INTELGX) {
+        detile_generic(ctx, fbdetile->width, fbdetile->height,
+                        out->data[0], out->linesize[0],
+                        in->data[0], in->linesize[0],
+                        txBytesPerPixel, txSubTileWidth, txSubTileHeight, txSubTileWidthBytes,
+                        txTileWidth, txTileHeight,
+                        txNumChanges, txChanges);
+    } else if (fbdetile->type == TYPE_INTELGY) {
+        detile_generic(ctx, fbdetile->width, fbdetile->height,
+                        out->data[0], out->linesize[0],
+                        in->data[0], in->linesize[0],
+                        tyBytesPerPixel, tySubTileWidth, tySubTileHeight, tySubTileWidthBytes,
+                        tyTileWidth, tyTileHeight,
+                        tyNumChanges, tyChanges);
+    }
+#ifdef DEBUG_PERF
+    uint64_t perfEnd = __rdtsc();
+    perfTime += (perfEnd - perfStart);
+    perfCnt += 1;
+#endif
+
+    av_frame_free(&in);
+    return ff_filter_frame(outlink, out);
+}
+
+static av_cold void uninit(AVFilterContext *ctx)
+{
+#ifdef DEBUG_PERF
+    fprintf(stderr, "DBUG:fbdetile:uninit:perf: AvgTSCCnt %llu\n", (unsigned long long)(perfCnt ? perfTime/perfCnt : 0));
+#endif
+}
+
+static const AVFilterPad fbdetile_inputs[] = {
+    {
+        .name         = "default",
+        .type         = AVMEDIA_TYPE_VIDEO,
+        .config_props = config_props,
+        .filter_frame = filter_frame,
+    },
+    { NULL }
+};
+
+static const AVFilterPad fbdetile_outputs[] = {
+    {
+        .name = "default",
+        .type = AVMEDIA_TYPE_VIDEO,
+    },
+    { NULL }
+};
+
+AVFilter ff_vf_fbdetile = {
+    .name          = "fbdetile",
+    .description   = NULL_IF_CONFIG_SMALL("Detile Framebuffer using CPU"),
+    .priv_size     = sizeof(FBDetileContext),
+    .init          = init,
+    .uninit        = uninit,
+    .query_formats = query_formats,
+    .inputs        = fbdetile_inputs,
+    .outputs       = fbdetile_outputs,
+    .priv_class    = &fbdetile_class,
+};
+
+// vim: set expandtab sts=4: //