[FFmpeg-devel] avfilter: add nvidia NPP based transpose filter

Submitted by Timo Rothenpieler on Sept. 8, 2018, 1:49 p.m.

Details

Message ID 20180908134929.5720-1-timo@rothenpieler.org
State New

Commit Message

Timo Rothenpieler Sept. 8, 2018, 1:49 p.m.
From: Roman Arzumanyan <rarzumanyan@nvidia.com>

Signed-off-by: Timo Rothenpieler <timo@rothenpieler.org>
---
I'm not much of a fan of a rotate filter that only supports 90° angles
either.
So here's my modified version of the original transpose filter, which
now behaves exactly the same as the software transpose filter.

Additionally, I removed the format conversion from the filter. That's
the job of the scale filter, and it also saves you from doing a pointless
double format conversion if you scale and transpose NV12 video.
Nvenc accepts yuv420p/444p input anyway, and if you really need to, you
can add another scale_npp afterwards to get back to nv12.

A possible command line for this is:
./ffmpeg.exe -hwaccel cuvid -c:v h264_cuvid -i in.mkv -c copy -c:v h264_nvenc -vf scale_npp=format=yuv420p,transpose_npp=cclock_flip out.mkv


 configure                      |   5 +-
 doc/filters.texi               |  55 ++++
 libavfilter/Makefile           |   1 +
 libavfilter/allfilters.c       |   1 +
 libavfilter/version.h          |   2 +-
 libavfilter/vf_transpose_npp.c | 483 +++++++++++++++++++++++++++++++++
 6 files changed, 544 insertions(+), 3 deletions(-)
 create mode 100644 libavfilter/vf_transpose_npp.c

Comments

Timo Rothenpieler Sept. 8, 2018, 3:34 p.m.
I'll probably remove the interp_algo option from this before committing
and hard-code it to nearest neighbor. I can't see any difference between
the algorithms for perfect 90° angles, except that NN is easily 10 times
faster than the current default, cubic.
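A minimal CPU sketch of why the interpolation mode should not matter for exact 90° rotations: every destination pixel maps to exactly one source pixel, so there is nothing to interpolate. The function name `rotate_cw` and the tightly packed plane layout are assumptions made for illustration; this is not the NPP code path. The `ow - 1` term plays the same role as the shift the patch passes to the rotate call.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rotate a tightly packed w x h 8-bit plane by 90 degrees clockwise.
 * dst must hold h x w pixels (output width h, output height w).
 * Hypothetical helper for illustration only. */
static void rotate_cw(const uint8_t *src, int w, int h, uint8_t *dst)
{
    int ow = h; /* output width */

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            /* Rotation about (0,0) gives column -y, which lies outside
             * the image; shifting by ow - 1 moves it back inside. */
            int ox = (ow - 1) - y;
            int oy = x;
            dst[oy * ow + ox] = src[y * w + x];
        }
    }
}
```

For a 3x2 plane with rows `1 2 3` and `4 5 6`, this yields the 2x3 plane with rows `4 1`, `5 2` and `6 3`. Source and destination coordinates are both integers, so nearest neighbour, linear and cubic sampling would all read the same single source value.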
Paul B Mahol Sept. 8, 2018, 3:38 p.m.
On 9/8/18, Timo Rothenpieler <timo@rothenpieler.org> wrote:
> On 9/8/2018 3:49 PM, Timo Rothenpieler wrote:
> I'll probably remove the interp_algo from this before committing, hard
> coding it to nearest neighbor. I'm unable to see any difference between
> them for perfect 90° angles except that NN is easily 10 times faster
> than the current default Cubic.

Perhaps interpolation is useful for other pixel formats where vertical
and horizontal subsampling are not the same.
Timo Rothenpieler Sept. 8, 2018, 4:16 p.m.
On 9/8/2018 5:38 PM, Paul B Mahol wrote:
>> I'll probably remove the interp_algo from this before committing, hard
>> coding it to nearest neighbor. I'm unable to see any difference between
>> them for perfect 90° angles except that NN is easily 10 times faster
>> than the current default Cubic.
> 
> Perhaps interpolation is useful for other pixel format where vertical
> and horizontal subsampling are not same.

This only supports yuv420p and yuv444p, so I don't think it's an issue.
It's impossible for formats with unequal subsampling to end up in a CUDA
frame to begin with, since nothing supports putting them in there.
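The subsampling point can be checked with a little plane arithmetic. The sketch below derives chroma plane sizes from the luma size with the same `>>` shifts that `init_stage()` uses, and tests whether transposing a chroma plane on its own produces the chroma plane the same format expects for the transposed luma size. The helper names are hypothetical; the shift values (1,1 for yuv420p, 0,0 for yuv444p, 1,0 for yuv422p) are the standard log2 chroma shifts of those formats.

```c
#include <assert.h>

typedef struct PlaneSize {
    int width;
    int height;
} PlaneSize;

/* Chroma plane size from luma size and log2 subsampling shifts. */
static PlaneSize chroma_size(int luma_w, int luma_h, int sw, int sh)
{
    PlaneSize p = { luma_w >> sw, luma_h >> sh };
    return p;
}

/* Returns 1 if the transposed input chroma plane has exactly the size
 * that the same pixel format needs for the transposed luma plane. */
static int transpose_keeps_layout(int luma_w, int luma_h, int sw, int sh)
{
    PlaneSize in  = chroma_size(luma_w, luma_h, sw, sh);
    PlaneSize out = chroma_size(luma_h, luma_w, sw, sh); /* luma swapped */

    return in.width == out.height && in.height == out.width;
}
```

For 1920x1080 input this returns 1 for yuv420p and yuv444p and 0 for yuv422p, where the 960x1080 chroma plane would have to become 540x1920. Only a format of that last kind would need resampling during a transpose.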
Michael Niedermayer Sept. 9, 2018, 12:12 a.m.
breaks build:

HTML	doc/ffmpeg-filters.html
HTML	doc/ffplay-all.html
HTML	doc/ffmpeg-all.html
HTML	doc/ffprobe-all.html
doc/filters.texi:16290: @ref reference to nonexistent node `transpose'
make: *** [doc/ffmpeg-filters.html] Error 1
doc/filters.texi:16290: @ref reference to nonexistent node `transpose'
doc/filters.texi:16290: @ref reference to nonexistent node `transpose'
make: *** [doc/ffplay-all.html] Error 1
make: *** [doc/ffprobe-all.html] Error 1
doc/filters.texi:16290: @ref reference to nonexistent node `transpose'
make: *** [doc/ffmpeg-all.html] Error 1
make: Target `all' not remade because of errors.



[...]
Timo Rothenpieler Sept. 9, 2018, 9:31 a.m.
On 9/9/2018 2:12 AM, Michael Niedermayer wrote:
> breaks build:
> 
> HTML	doc/ffmpeg-filters.html
> HTML	doc/ffplay-all.html
> HTML	doc/ffmpeg-all.html
> HTML	doc/ffprobe-all.html
> doc/filters.texi:16290: @ref reference to nonexistent node `transpose'
> make: *** [doc/ffmpeg-filters.html] Error 1
> doc/filters.texi:16290: @ref reference to nonexistent node `transpose'
> doc/filters.texi:16290: @ref reference to nonexistent node `transpose'
> make: *** [doc/ffplay-all.html] Error 1
> make: *** [doc/ffprobe-all.html] Error 1
> doc/filters.texi:16290: @ref reference to nonexistent node `transpose'
> make: *** [doc/ffmpeg-all.html] Error 1
> make: Target `all' not remade because of errors.
> 

What's the correct way to link to another section? I have seen other 
parts use the @ref syntax.
Michael Niedermayer Sept. 9, 2018, 10:16 a.m.
On Sun, Sep 09, 2018 at 11:31:49AM +0200, Timo Rothenpieler wrote:
> What's the correct way to link to another section? I have seen other parts
> use the @ref syntax.

I think this probably needs an @anchor{transpose} somewhere, but I didn't
read the "manual" for this stuff, so I could be wrong.


[...]

Patch

diff --git a/configure b/configure
index 0d6ee0abfc..e1f229f052 100755
--- a/configure
+++ b/configure
@@ -2923,6 +2923,7 @@  hwupload_cuda_filter_deps="ffnvcodec"
 scale_npp_filter_deps="ffnvcodec libnpp"
 scale_cuda_filter_deps="cuda_sdk"
 thumbnail_cuda_filter_deps="cuda_sdk"
+transpose_npp_filter_deps="ffnvcodec libnpp"
 
 amf_deps_any="libdl LoadLibrary"
 nvenc_deps="ffnvcodec"
@@ -6082,8 +6083,8 @@  enabled libmodplug        && require_pkg_config libmodplug libmodplug libmodplug
 enabled libmp3lame        && require "libmp3lame >= 3.98.3" lame/lame.h lame_set_VBR_quality -lmp3lame $libm_extralibs
 enabled libmysofa         && { check_pkg_config libmysofa libmysofa mysofa.h mysofa_load ||
                                require libmysofa mysofa.h mysofa_load -lmysofa $zlib_extralibs; }
-enabled libnpp            && { check_lib libnpp npp.h nppGetLibVersion -lnppig -lnppicc -lnppc ||
-                               check_lib libnpp npp.h nppGetLibVersion -lnppi -lnppc ||
+enabled libnpp            && { check_lib libnpp npp.h nppGetLibVersion -lnppig -lnppicc -lnppc -lnppidei ||
+                               check_lib libnpp npp.h nppGetLibVersion -lnppi -lnppc -lnppidei ||
                                die "ERROR: libnpp not found"; }
 enabled libopencore_amrnb && require libopencore_amrnb opencore-amrnb/interf_dec.h Decoder_Interface_init -lopencore-amrnb
 enabled libopencore_amrwb && require libopencore_amrwb opencore-amrwb/dec_if.h D_IF_init -lopencore-amrwb
diff --git a/doc/filters.texi b/doc/filters.texi
index 37e79d34e1..5b839b6419 100644
--- a/doc/filters.texi
+++ b/doc/filters.texi
@@ -16284,6 +16284,61 @@  The command above can also be specified as:
 transpose=1:portrait
 @end example
 
+@section transpose_npp
+
+Transpose rows with columns in the input video and optionally flip it.
+For more in-depth examples see the @ref{transpose} video filter, which shares mostly the same options.
+
+It accepts the following parameters:
+
+@table @option
+
+@item dir
+Specify the transposition direction.
+
+Can assume the following values:
+@table @samp
+@item cclock_flip
+Rotate by 90 degrees counterclockwise and vertically flip. (default)
+
+@item clock
+Rotate by 90 degrees clockwise.
+
+@item cclock
+Rotate by 90 degrees counterclockwise.
+
+@item clock_flip
+Rotate by 90 degrees clockwise and vertically flip.
+@end table
+
+@item passthrough
+Do not apply the transposition if the input geometry matches the one
+given by this option. It accepts the following values:
+@table @samp
+@item none
+Always apply transposition. (default)
+@item portrait
+Preserve portrait geometry (when @var{height} >= @var{width}).
+@item landscape
+Preserve landscape geometry (when @var{width} >= @var{height}).
+@end table
+
+@item interp_algo
+The interpolation algorithm used for rotating. One of the following:
+@table @option
+@item nn
+Nearest neighbour
+
+@item linear
+Linear
+
+@item cubic
+Cubic (default)
+
+@end table
+
+@end table
+
 @section trim
 Trim the input so that the output contains one continuous subpart of the input.
 
diff --git a/libavfilter/Makefile b/libavfilter/Makefile
index e412000c8f..cc0cc15fd2 100644
--- a/libavfilter/Makefile
+++ b/libavfilter/Makefile
@@ -374,6 +374,7 @@  OBJS-$(CONFIG_TONEMAP_FILTER)                += vf_tonemap.o colorspace.o
 OBJS-$(CONFIG_TONEMAP_OPENCL_FILTER)         += vf_tonemap_opencl.o colorspace.o opencl.o \
                                                 opencl/tonemap.o opencl/colorspace_common.o
 OBJS-$(CONFIG_TRANSPOSE_FILTER)              += vf_transpose.o
+OBJS-$(CONFIG_TRANSPOSE_NPP_FILTER)          += vf_transpose_npp.o
 OBJS-$(CONFIG_TRIM_FILTER)                   += trim.o
 OBJS-$(CONFIG_UNPREMULTIPLY_FILTER)          += vf_premultiply.o framesync.o
 OBJS-$(CONFIG_UNSHARP_FILTER)                += vf_unsharp.o
diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c
index 2fa9460335..73a5d7e188 100644
--- a/libavfilter/allfilters.c
+++ b/libavfilter/allfilters.c
@@ -356,6 +356,7 @@  extern AVFilter ff_vf_tmix;
 extern AVFilter ff_vf_tonemap;
 extern AVFilter ff_vf_tonemap_opencl;
 extern AVFilter ff_vf_transpose;
+extern AVFilter ff_vf_transpose_npp;
 extern AVFilter ff_vf_trim;
 extern AVFilter ff_vf_unpremultiply;
 extern AVFilter ff_vf_unsharp;
diff --git a/libavfilter/version.h b/libavfilter/version.h
index 2ff2b6a318..ef982339d7 100644
--- a/libavfilter/version.h
+++ b/libavfilter/version.h
@@ -30,7 +30,7 @@ 
 #include "libavutil/version.h"
 
 #define LIBAVFILTER_VERSION_MAJOR   7
-#define LIBAVFILTER_VERSION_MINOR  27
+#define LIBAVFILTER_VERSION_MINOR  28
 #define LIBAVFILTER_VERSION_MICRO 100
 
 #define LIBAVFILTER_VERSION_INT AV_VERSION_INT(LIBAVFILTER_VERSION_MAJOR, \
diff --git a/libavfilter/vf_transpose_npp.c b/libavfilter/vf_transpose_npp.c
new file mode 100644
index 0000000000..5842a25483
--- /dev/null
+++ b/libavfilter/vf_transpose_npp.c
@@ -0,0 +1,483 @@ 
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <nppi.h>
+#include <stdio.h>
+#include <string.h>
+#include "libavutil/avstring.h"
+#include "libavutil/common.h"
+#include "libavutil/eval.h"
+#include "libavutil/hwcontext.h"
+#include "libavutil/hwcontext_cuda_internal.h"
+#include "libavutil/internal.h"
+#include "libavutil/opt.h"
+#include "libavutil/pixdesc.h"
+#include "avfilter.h"
+#include "formats.h"
+#include "internal.h"
+#include "video.h"
+
+static const enum AVPixelFormat supported_formats[] = {
+    AV_PIX_FMT_YUV420P,
+    AV_PIX_FMT_YUV444P
+};
+
+enum TransposeStage {
+    STAGE_ROTATE,
+    STAGE_TRANSPOSE,
+    STAGE_NB
+};
+
+enum Transpose {
+    NPP_TRANSPOSE_CCLOCK_FLIP = 0,
+    NPP_TRANSPOSE_CLOCK = 1,
+    NPP_TRANSPOSE_CCLOCK = 2,
+    NPP_TRANSPOSE_CLOCK_FLIP = 3
+};
+
+enum Passthrough {
+    NPP_TRANSPOSE_PT_TYPE_NONE = 0,
+    NPP_TRANSPOSE_PT_TYPE_LANDSCAPE,
+    NPP_TRANSPOSE_PT_TYPE_PORTRAIT
+};
+
+typedef struct NPPTransposeStageContext {
+    int stage_needed;
+    enum AVPixelFormat in_fmt;
+    enum AVPixelFormat out_fmt;
+    struct {
+        int width;
+        int height;
+    } planes_in[3], planes_out[3];
+    AVBufferRef *frames_ctx;
+    AVFrame     *frame;
+} NPPTransposeStageContext;
+
+typedef struct NPPTransposeContext {
+    const AVClass *class;
+    NPPTransposeStageContext stages[STAGE_NB];
+    AVFrame *tmp_frame;
+
+    int passthrough;    ///< enum Passthrough, geometry for which transposition is skipped
+    int dir;            ///< enum Transpose
+    int interp_algo;
+} NPPTransposeContext;
+
+static int npptranspose_init(AVFilterContext *ctx)
+{
+    NPPTransposeContext *s = ctx->priv;
+    int i;
+
+    for (i = 0; i < FF_ARRAY_ELEMS(s->stages); i++) {
+        s->stages[i].frame = av_frame_alloc();
+        if (!s->stages[i].frame)
+            return AVERROR(ENOMEM);
+    }
+
+    s->tmp_frame = av_frame_alloc();
+    if (!s->tmp_frame)
+        return AVERROR(ENOMEM);
+
+    return 0;
+}
+
+static void npptranspose_uninit(AVFilterContext *ctx)
+{
+    NPPTransposeContext *s = ctx->priv;
+    int i;
+
+    for (i = 0; i < FF_ARRAY_ELEMS(s->stages); i++) {
+        av_frame_free(&s->stages[i].frame);
+        av_buffer_unref(&s->stages[i].frames_ctx);
+    }
+
+    av_frame_free(&s->tmp_frame);
+}
+
+static int npptranspose_query_formats(AVFilterContext *ctx)
+{
+    static const enum AVPixelFormat pixel_formats[] = {
+        AV_PIX_FMT_CUDA, AV_PIX_FMT_NONE,
+    };
+
+    AVFilterFormats *pix_fmts = ff_make_format_list(pixel_formats);
+    return ff_set_common_formats(ctx, pix_fmts);
+}
+
+static int init_stage(NPPTransposeStageContext *stage, AVBufferRef *device_ctx)
+{
+    AVBufferRef *out_ref = NULL;
+    AVHWFramesContext *out_ctx;
+    int in_sw, in_sh, out_sw, out_sh;
+    int ret, i;
+
+    av_pix_fmt_get_chroma_sub_sample(stage->in_fmt,  &in_sw,  &in_sh);
+    av_pix_fmt_get_chroma_sub_sample(stage->out_fmt, &out_sw, &out_sh);
+
+    if (!stage->planes_out[0].width) {
+        stage->planes_out[0].width  = stage->planes_in[0].width;
+        stage->planes_out[0].height = stage->planes_in[0].height;
+    }
+
+    for (i = 1; i < FF_ARRAY_ELEMS(stage->planes_in); i++) {
+        stage->planes_in[i].width   = stage->planes_in[0].width   >> in_sw;
+        stage->planes_in[i].height  = stage->planes_in[0].height  >> in_sh;
+        stage->planes_out[i].width  = stage->planes_out[0].width  >> out_sw;
+        stage->planes_out[i].height = stage->planes_out[0].height >> out_sh;
+    }
+
+    out_ref = av_hwframe_ctx_alloc(device_ctx);
+    if (!out_ref)
+        return AVERROR(ENOMEM);
+    out_ctx = (AVHWFramesContext*)out_ref->data;
+
+    out_ctx->format    = AV_PIX_FMT_CUDA;
+    out_ctx->sw_format = stage->out_fmt;
+    out_ctx->width     = FFALIGN(stage->planes_out[0].width,  32);
+    out_ctx->height    = FFALIGN(stage->planes_out[0].height, 32);
+
+    ret = av_hwframe_ctx_init(out_ref);
+    if (ret < 0)
+        goto fail;
+
+    av_frame_unref(stage->frame);
+    ret = av_hwframe_get_buffer(out_ref, stage->frame, 0);
+    if (ret < 0)
+        goto fail;
+
+    stage->frame->width  = stage->planes_out[0].width;
+    stage->frame->height = stage->planes_out[0].height;
+    av_buffer_unref(&stage->frames_ctx);
+    stage->frames_ctx = out_ref;
+
+    return 0;
+
+fail:
+    av_buffer_unref(&out_ref);
+    return ret;
+}
+
+static int format_is_supported(enum AVPixelFormat fmt)
+{
+    int i;
+
+    for (i = 0; i < FF_ARRAY_ELEMS(supported_formats); i++)
+        if (supported_formats[i] == fmt)
+            return 1;
+
+    return 0;
+}
+
+static int init_processing_chain(AVFilterContext *ctx, int in_width, int in_height,
+                                 int out_width, int out_height)
+{
+    NPPTransposeContext *s = ctx->priv;
+    AVHWFramesContext *in_frames_ctx;
+    enum AVPixelFormat format;
+    int i, ret, last_stage = -1;
+    int rot_width = out_width, rot_height = out_height;
+
+    /* check that we have a hw context */
+    if (!ctx->inputs[0]->hw_frames_ctx) {
+        av_log(ctx, AV_LOG_ERROR, "No hw context provided on input\n");
+        return AVERROR(EINVAL);
+    }
+
+    in_frames_ctx = (AVHWFramesContext*)ctx->inputs[0]->hw_frames_ctx->data;
+    format        = in_frames_ctx->sw_format;
+
+    if (!format_is_supported(format)) {
+        av_log(ctx, AV_LOG_ERROR, "Unsupported input format: %s\n",
+               av_get_pix_fmt_name(format));
+        return AVERROR(ENOSYS);
+    }
+
+    if (s->dir != NPP_TRANSPOSE_CCLOCK_FLIP) {
+        s->stages[STAGE_ROTATE].stage_needed = 1;
+    }
+
+    if (s->dir == NPP_TRANSPOSE_CCLOCK_FLIP || s->dir == NPP_TRANSPOSE_CLOCK_FLIP) {
+        s->stages[STAGE_TRANSPOSE].stage_needed = 1;
+
+        /* Rotating by 180° in case of clock_flip, or not at all for cclock_flip, so width/height unchanged by rotation */
+        rot_width = in_width;
+        rot_height = in_height;
+    }
+
+    s->stages[STAGE_ROTATE].in_fmt               = format;
+    s->stages[STAGE_ROTATE].out_fmt              = format;
+    s->stages[STAGE_ROTATE].planes_in[0].width   = in_width;
+    s->stages[STAGE_ROTATE].planes_in[0].height  = in_height;
+    s->stages[STAGE_ROTATE].planes_out[0].width  = rot_width;
+    s->stages[STAGE_ROTATE].planes_out[0].height = rot_height;
+    s->stages[STAGE_TRANSPOSE].in_fmt               = format;
+    s->stages[STAGE_TRANSPOSE].out_fmt              = format;
+    s->stages[STAGE_TRANSPOSE].planes_in[0].width   = rot_width;
+    s->stages[STAGE_TRANSPOSE].planes_in[0].height  = rot_height;
+    s->stages[STAGE_TRANSPOSE].planes_out[0].width  = out_width;
+    s->stages[STAGE_TRANSPOSE].planes_out[0].height = out_height;
+
+    /* init the hardware contexts */
+    for (i = 0; i < FF_ARRAY_ELEMS(s->stages); i++) {
+        if (!s->stages[i].stage_needed)
+            continue;
+        ret = init_stage(&s->stages[i], in_frames_ctx->device_ref);
+        if (ret < 0)
+            return ret;
+        last_stage = i;
+    }
+
+    if (last_stage >= 0)
+        ctx->outputs[0]->hw_frames_ctx = av_buffer_ref(s->stages[last_stage].frames_ctx);
+    else
+        ctx->outputs[0]->hw_frames_ctx = av_buffer_ref(ctx->inputs[0]->hw_frames_ctx);
+
+    if (!ctx->outputs[0]->hw_frames_ctx)
+        return AVERROR(ENOMEM);
+
+    return 0;
+}
+
+static int npptranspose_config_props(AVFilterLink *outlink)
+{
+    AVFilterContext *ctx = outlink->src;
+    AVFilterLink *inlink = outlink->src->inputs[0];
+    NPPTransposeContext *s = ctx->priv;
+    int ret;
+
+    if ((inlink->w >= inlink->h && s->passthrough == NPP_TRANSPOSE_PT_TYPE_LANDSCAPE) ||
+        (inlink->w <= inlink->h && s->passthrough == NPP_TRANSPOSE_PT_TYPE_PORTRAIT)) {
+        av_log(ctx, AV_LOG_VERBOSE,
+               "w:%d h:%d -> w:%d h:%d (passthrough mode)\n",
+               inlink->w, inlink->h, inlink->w, inlink->h);
+        return 0;
+    } else {
+        s->passthrough = NPP_TRANSPOSE_PT_TYPE_NONE;
+    }
+
+    outlink->w = inlink->h;
+    outlink->h = inlink->w;
+    outlink->sample_aspect_ratio = (AVRational){inlink->sample_aspect_ratio.den, inlink->sample_aspect_ratio.num};
+
+    ret = init_processing_chain(ctx, inlink->w, inlink->h, outlink->w, outlink->h);
+    if (ret < 0)
+        return ret;
+
+    av_log(ctx, AV_LOG_VERBOSE, "w:%d h:%d -transpose-> w:%d h:%d\n",
+           inlink->w, inlink->h, outlink->w, outlink->h);
+
+    return 0;
+}
+
+static int npptranspose_rotate(AVFilterContext *ctx, NPPTransposeStageContext *stage,
+                               AVFrame *out, AVFrame *in)
+{
+    NPPTransposeContext *s = ctx->priv;
+    NppStatus err;
+    int i;
+
+    for (i = 0; i < FF_ARRAY_ELEMS(stage->planes_in) && i < FF_ARRAY_ELEMS(in->data) && in->data[i]; i++) {
+        int iw = stage->planes_in[i].width;
+        int ih = stage->planes_in[i].height;
+        int ow = stage->planes_out[i].width;
+        int oh = stage->planes_out[i].height;
+
+        // nppiRotate uses 0,0 as the rotation point
+        // need to shift the image accordingly after rotation
+        // need to subtract 1 to get the correct coordinates
+        double angle = s->dir == NPP_TRANSPOSE_CLOCK ? -90.0 : s->dir == NPP_TRANSPOSE_CCLOCK ? 90.0 : 180.0;
+        int shiftw = (s->dir == NPP_TRANSPOSE_CLOCK  || s->dir == NPP_TRANSPOSE_CLOCK_FLIP) ? ow - 1 : 0;
+        int shifth = (s->dir == NPP_TRANSPOSE_CCLOCK || s->dir == NPP_TRANSPOSE_CLOCK_FLIP) ? oh - 1 : 0;
+
+        err = nppiRotate_8u_C1R(in->data[i], (NppiSize){ iw, ih },
+                                in->linesize[i], (NppiRect){ 0, 0, iw, ih },
+                                out->data[i], out->linesize[i],
+                                (NppiRect){ 0, 0, ow, oh },
+                                angle, shiftw, shifth, s->interp_algo);
+        if (err != NPP_SUCCESS) {
+            av_log(ctx, AV_LOG_ERROR, "NPP rotate error: %d\n", err);
+            return AVERROR_UNKNOWN;
+        }
+    }
+
+    return 0;
+}
+
+static int npptranspose_transpose(AVFilterContext *ctx, NPPTransposeStageContext *stage,
+                                  AVFrame *out, AVFrame *in)
+{
+    NppStatus err;
+    int i;
+
+    for (i = 0; i < FF_ARRAY_ELEMS(stage->planes_in) && i < FF_ARRAY_ELEMS(in->data) && in->data[i]; i++) {
+        int iw = stage->planes_in[i].width;
+        int ih = stage->planes_in[i].height;
+
+        err = nppiTranspose_8u_C1R(in->data[i], in->linesize[i],
+                                   out->data[i], out->linesize[i],
+                                   (NppiSize){ iw, ih });
+        if (err != NPP_SUCCESS) {
+            av_log(ctx, AV_LOG_ERROR, "NPP transpose error: %d\n", err);
+            return AVERROR_UNKNOWN;
+        }
+    }
+
+    return 0;
+}
+
+static int (*const npptranspose_process[])(AVFilterContext *ctx, NPPTransposeStageContext *stage,
+                                           AVFrame *out, AVFrame *in) = {
+    [STAGE_ROTATE]       = npptranspose_rotate,
+    [STAGE_TRANSPOSE]    = npptranspose_transpose
+};
+
+static int npptranspose_filter(AVFilterContext *ctx, AVFrame *out, AVFrame *in)
+{
+    NPPTransposeContext *s = ctx->priv;
+    AVFrame *src = in;
+    int i, ret, last_stage = -1;
+
+    for (i = 0; i < FF_ARRAY_ELEMS(s->stages); i++) {
+        if (!s->stages[i].stage_needed)
+            continue;
+
+        ret = npptranspose_process[i](ctx, &s->stages[i], s->stages[i].frame, src);
+        if (ret < 0)
+            return ret;
+
+        src        = s->stages[i].frame;
+        last_stage = i;
+    }
+
+    if (last_stage < 0)
+        return AVERROR_BUG;
+
+    ret = av_hwframe_get_buffer(src->hw_frames_ctx, s->tmp_frame, 0);
+    if (ret < 0)
+        return ret;
+
+    av_frame_move_ref(out, src);
+    av_frame_move_ref(src, s->tmp_frame);
+
+    ret = av_frame_copy_props(out, in);
+    if (ret < 0)
+        return ret;
+
+    return 0;
+}
+
+static int npptranspose_filter_frame(AVFilterLink *link, AVFrame *in)
+{
+    AVFilterContext              *ctx = link->dst;
+    NPPTransposeContext            *s = ctx->priv;
+    AVFilterLink             *outlink = ctx->outputs[0];
+    AVHWFramesContext     *frames_ctx = (AVHWFramesContext*)outlink->hw_frames_ctx->data;
+    AVCUDADeviceContext *device_hwctx = frames_ctx->device_ctx->hwctx;
+    AVFrame *out = NULL;
+    CUresult err;
+    CUcontext dummy;
+    int ret = 0;
+
+    if (s->passthrough)
+        return ff_filter_frame(outlink, in);
+
+    out = av_frame_alloc();
+    if (!out) {
+        ret = AVERROR(ENOMEM);
+        goto fail;
+    }
+
+    err = device_hwctx->internal->cuda_dl->cuCtxPushCurrent(device_hwctx->cuda_ctx);
+    if (err != CUDA_SUCCESS) {
+        ret = AVERROR_UNKNOWN;
+        goto fail;
+    }
+
+    ret = npptranspose_filter(ctx, out, in);
+
+    device_hwctx->internal->cuda_dl->cuCtxPopCurrent(&dummy);
+    if (ret < 0)
+        goto fail;
+
+    av_frame_free(&in);
+
+    return ff_filter_frame(outlink, out);
+
+fail:
+    av_frame_free(&in);
+    av_frame_free(&out);
+    return ret;
+}
+
+#define OFFSET(x) offsetof(NPPTransposeContext, x)
+#define FLAGS (AV_OPT_FLAG_FILTERING_PARAM|AV_OPT_FLAG_VIDEO_PARAM)
+
+static const AVOption options[] = {
+    { "dir", "set transpose direction", OFFSET(dir), AV_OPT_TYPE_INT, { .i64 = NPP_TRANSPOSE_CCLOCK_FLIP }, 0, 3, FLAGS, "dir" },
+        { "cclock_flip", "rotate counter-clockwise with vertical flip", 0, AV_OPT_TYPE_CONST, { .i64 = NPP_TRANSPOSE_CCLOCK_FLIP }, 0, 0, FLAGS, "dir" },
+        { "clock",       "rotate clockwise",                            0, AV_OPT_TYPE_CONST, { .i64 = NPP_TRANSPOSE_CLOCK       }, 0, 0, FLAGS, "dir" },
+        { "cclock",      "rotate counter-clockwise",                    0, AV_OPT_TYPE_CONST, { .i64 = NPP_TRANSPOSE_CCLOCK      }, 0, 0, FLAGS, "dir" },
+        { "clock_flip",  "rotate clockwise with vertical flip",         0, AV_OPT_TYPE_CONST, { .i64 = NPP_TRANSPOSE_CLOCK_FLIP  }, 0, 0, FLAGS, "dir" },
+    { "passthrough", "do not apply transposition if the input matches the specified geometry", OFFSET(passthrough), AV_OPT_TYPE_INT, { .i64 = NPP_TRANSPOSE_PT_TYPE_NONE },  0, 2, FLAGS, "passthrough" },
+        { "none",      "always apply transposition",  0, AV_OPT_TYPE_CONST, { .i64 = NPP_TRANSPOSE_PT_TYPE_NONE },      0, 0, FLAGS, "passthrough" },
+        { "landscape", "preserve landscape geometry", 0, AV_OPT_TYPE_CONST, { .i64 = NPP_TRANSPOSE_PT_TYPE_LANDSCAPE }, 0, 0, FLAGS, "passthrough" },
+        { "portrait",  "preserve portrait geometry",  0, AV_OPT_TYPE_CONST, { .i64 = NPP_TRANSPOSE_PT_TYPE_PORTRAIT },  0, 0, FLAGS, "passthrough" },
+    { "interp_algo", "Interpolation algorithm used for rotating", OFFSET(interp_algo), AV_OPT_TYPE_INT, { .i64 = NPPI_INTER_CUBIC }, 0, INT_MAX, FLAGS, "interp_algo" },
+        { "nn",     "nearest neighbour", 0, AV_OPT_TYPE_CONST, { .i64 = NPPI_INTER_NN     }, 0, 0, FLAGS, "interp_algo" },
+        { "linear", "linear",            0, AV_OPT_TYPE_CONST, { .i64 = NPPI_INTER_LINEAR }, 0, 0, FLAGS, "interp_algo" },
+        { "cubic",  "cubic",             0, AV_OPT_TYPE_CONST, { .i64 = NPPI_INTER_CUBIC  }, 0, 0, FLAGS, "interp_algo" },
+    { NULL },
+};
+
+static const AVClass npptranspose_class = {
+    .class_name = "npptranspose",
+    .item_name  = av_default_item_name,
+    .option     = options,
+    .version    = LIBAVUTIL_VERSION_INT,
+};
+
+static const AVFilterPad npptranspose_inputs[] = {
+    {
+        .name         = "default",
+        .type         = AVMEDIA_TYPE_VIDEO,
+        .filter_frame = npptranspose_filter_frame,
+    },
+    { NULL }
+};
+
+static const AVFilterPad npptranspose_outputs[] = {
+    {
+        .name         = "default",
+        .type         = AVMEDIA_TYPE_VIDEO,
+        .config_props = npptranspose_config_props,
+    },
+    { NULL }
+};
+
+AVFilter ff_vf_transpose_npp = {
+    .name           = "transpose_npp",
+    .description    = NULL_IF_CONFIG_SMALL("NVIDIA Performance Primitives video transpose"),
+    .init           = npptranspose_init,
+    .uninit         = npptranspose_uninit,
+    .query_formats  = npptranspose_query_formats,
+    .priv_size      = sizeof(NPPTransposeContext),
+    .priv_class     = &npptranspose_class,
+    .inputs         = npptranspose_inputs,
+    .outputs        = npptranspose_outputs,
+    .flags_internal = FF_FILTER_FLAG_HWFRAME_AWARE,
+};
\ No newline at end of file
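
The stage selection in `init_processing_chain()` treats `cclock_flip` as a plain transpose and `clock_flip` as a 180° rotation followed by a transpose. A small CPU sketch, assuming tightly packed 8-bit planes and hypothetical helper names, can confirm that this decomposition matches "rotate 90° clockwise, then flip vertically". It does not call NPP and is not the patch's implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Swap rows and columns; dst has width h and height w. */
static void transpose_plane(const uint8_t *src, int w, int h, uint8_t *dst)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            dst[x * h + y] = src[y * w + x];
}

/* Rotate by 180 degrees; the plane size is unchanged. */
static void rotate_180(const uint8_t *src, int w, int h, uint8_t *dst)
{
    for (int i = 0; i < w * h; i++)
        dst[w * h - 1 - i] = src[i];
}

/* Reference: rotate 90 degrees clockwise, then flip vertically. */
static void clock_flip_ref(const uint8_t *src, int w, int h, uint8_t *dst)
{
    int ow = h, oh = w;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int rx = (h - 1) - y; /* column after clockwise rotation */
            int ry = x;           /* row after clockwise rotation    */
            dst[((oh - 1) - ry) * ow + rx] = src[y * w + x];
        }
    }
}
```

For a 3x2 plane with rows `1 2 3` and `4 5 6`, both `clock_flip_ref()` and `rotate_180()` followed by `transpose_plane()` give rows `6 3`, `5 2` and `4 1`, while `transpose_plane()` alone gives rows `1 4`, `2 5` and `3 6`, which is the `cclock_flip` result.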