From patchwork Sun May 5 16:05:04 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Paul B Mahol
X-Patchwork-Id: 12994
From: Paul B Mahol
To: ffmpeg-devel@ffmpeg.org
Date: Sun, 5 May 2019 18:05:04 +0200
Message-Id: <20190505160504.5683-1-onemda@gmail.com>
X-Mailer: git-send-email 2.17.1
Subject: [FFmpeg-devel] [PATCH] avfilter: add asr filter
List-Id: FFmpeg development discussions and patches

Signed-off-by: Paul B Mahol
---
 configure                |   4 +
 doc/filters.texi         |  32 +++++++
 libavfilter/Makefile     |   1 +
 libavfilter/af_asr.c     | 177 +++++++++++++++++++++++++++++++++++++++
 libavfilter/allfilters.c |   1 +
 5 files changed, 215 insertions(+)
 create mode 100644 libavfilter/af_asr.c

diff --git a/configure b/configure
index d644a5b1d4..586c293bb9 100755
--- a/configure
+++ b/configure
@@ -307,6 +307,7 @@ External library support:
   --enable-opengl          enable OpenGL rendering [no]
   --enable-openssl         enable openssl, needed for https support
                            if gnutls, libtls or mbedtls is not used [no]
+  --enable-pocketsphinx    enable PocketSphinx, needed for asr filter [no]
   --disable-sndio          disable sndio support [autodetect]
   --disable-schannel       disable SChannel SSP, needed for TLS support on
                            Windows if openssl and gnutls are not used [autodetect]
@@ -1799,6 +1800,7 @@ EXTERNAL_LIBRARY_LIST="
     mediacodec
     openal
     opengl
+    pocketsphinx
     vapoursynth
 "
 
@@ -3401,6 +3403,7 @@ afir_filter_deps="avcodec"
 afir_filter_select="fft"
 amovie_filter_deps="avcodec avformat"
 aresample_filter_deps="swresample"
+asr_filter_deps="pocketsphinx"
 ass_filter_deps="libass"
 atempo_filter_deps="avcodec"
 atempo_filter_select="rdft"
@@ -6299,6 +6302,7 @@ enabled openssl           && { check_pkg_config openssl openssl openssl/ssl.h OP
                                check_lib openssl openssl/ssl.h SSL_library_init -lssl32 -leay32 ||
                                check_lib openssl openssl/ssl.h SSL_library_init -lssl -lcrypto -lws2_32 -lgdi32 ||
                                die "ERROR: openssl not found"; }
+enabled pocketsphinx      && require_pkg_config pocketsphinx pocketsphinx pocketsphinx/pocketsphinx.h ps_init
 enabled rkmpp             && { require_pkg_config rkmpp rockchip_mpp  rockchip/rk_mpi.h mpp_create &&
                                require_pkg_config rockchip_mpp "rockchip_mpp >= 1.3.7" rockchip/rk_mpi.h mpp_create &&
                                { enabled libdrm ||
diff --git a/doc/filters.texi b/doc/filters.texi
index 3c15bb95f4..3f25d12511 100644
--- a/doc/filters.texi
+++ b/doc/filters.texi
@@ -2131,6 +2131,38 @@ It accepts the following values:
 Set additional parameter which controls sigmoid function.
 @end table
 
+@section asr
+Automatic Speech Recognition
+
+This filter uses PocketSphinx for speech recognition. To enable
+compilation of this filter, you need to configure FFmpeg with
+@code{--enable-pocketsphinx}.
+
+It accepts the following options:
+
+@table @option
+@item rate
+Set sampling rate of input audio. Default is @code{16000}.
+This needs to match speech models, otherwise one will get poor results.
+
+@item dict
+Set pronunciation dictionary.
+
+@item lm
+Set language model file.
+
+@item lmctl
+Set language model set.
+
+@item lmname
+Set which language model to use.
+
+@item logfn
+Set output for log messages.
+@end table
+
+The filter exports recognized speech as the frame metadata @code{lavfi.asr.text}.
+
 @anchor{astats}
 @section astats
 
diff --git a/libavfilter/Makefile b/libavfilter/Makefile
index 59d12ce069..cf12365c8d 100644
--- a/libavfilter/Makefile
+++ b/libavfilter/Makefile
@@ -82,6 +82,7 @@ OBJS-$(CONFIG_ASHOWINFO_FILTER)              += af_ashowinfo.o
 OBJS-$(CONFIG_ASIDEDATA_FILTER)              += f_sidedata.o
 OBJS-$(CONFIG_ASOFTCLIP_FILTER)              += af_asoftclip.o
 OBJS-$(CONFIG_ASPLIT_FILTER)                 += split.o
+OBJS-$(CONFIG_ASR_FILTER)                    += af_asr.o
 OBJS-$(CONFIG_ASTATS_FILTER)                 += af_astats.o
 OBJS-$(CONFIG_ASTREAMSELECT_FILTER)          += f_streamselect.o framesync.o
 OBJS-$(CONFIG_ATEMPO_FILTER)                 += af_atempo.o
diff --git a/libavfilter/af_asr.c b/libavfilter/af_asr.c
new file mode 100644
index 0000000000..f14822215c
--- /dev/null
+++ b/libavfilter/af_asr.c
@@ -0,0 +1,177 @@
+/*
+ * Copyright (c) 2019 Paul B Mahol
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <pocketsphinx/pocketsphinx.h>
+
+#include "libavutil/avassert.h"
+#include "libavutil/avstring.h"
+#include "libavutil/channel_layout.h"
+#include "libavutil/opt.h"
+#include "audio.h"
+#include "avfilter.h"
+#include "internal.h"
+
+typedef struct ASRContext {
+    const AVClass *class;
+
+    int rate;
+    char *dict;
+    char *lm;
+    char *lmctl;
+    char *lmname;
+    char *logfn;
+
+    ps_decoder_t *ps;
+    cmd_ln_t *config;
+
+    int utt_started;
+} ASRContext;
+
+#define OFFSET(x) offsetof(ASRContext, x)
+#define FLAGS AV_OPT_FLAG_AUDIO_PARAM | AV_OPT_FLAG_FILTERING_PARAM
+static const AVOption asr_options[] = {
+    { "rate",  "set sampling rate",                OFFSET(rate),   AV_OPT_TYPE_INT,    {.i64=16000},       0, INT_MAX, .flags = FLAGS },
+    { "dict",  "set pronunciation dictionary",     OFFSET(dict),   AV_OPT_TYPE_STRING, {.str=NULL},                    .flags = FLAGS },
+    { "lm",    "set language model file",          OFFSET(lm),     AV_OPT_TYPE_STRING, {.str=NULL},                    .flags = FLAGS },
+    { "lmctl", "set language model set",           OFFSET(lmctl),  AV_OPT_TYPE_STRING, {.str=NULL},                    .flags = FLAGS },
+    { "lmname","set which language model to use",  OFFSET(lmname), AV_OPT_TYPE_STRING, {.str=NULL},                    .flags = FLAGS },
+    { "logfn", "set output for log messages",      OFFSET(logfn),  AV_OPT_TYPE_STRING, {.str="/dev/null"},             .flags = FLAGS },
+    { NULL }
+};
+
+AVFILTER_DEFINE_CLASS(asr);
+
+static int filter_frame(AVFilterLink *inlink, AVFrame *in)
+{
+    AVFilterContext *ctx = inlink->dst;
+    AVDictionary **metadata = &in->metadata;
+    ASRContext *s = ctx->priv;
+    int have_speech;
+    const char *speech;
+
+    ps_process_raw(s->ps, (const int16_t *)in->data[0], in->nb_samples, 0, 0);
+    have_speech = ps_get_in_speech(s->ps);
+    if (have_speech && !s->utt_started)
+        s->utt_started = 1;
+    if (!have_speech && s->utt_started) {
+        ps_end_utt(s->ps);
+        speech = ps_get_hyp(s->ps, NULL);
+        if (speech != NULL)
+            av_dict_set(metadata, "lavfi.asr.text", speech, 0);
+        ps_start_utt(s->ps);
+        s->utt_started = 0;
+    }
+
+    return ff_filter_frame(ctx->outputs[0], in);
+}
+
+static int config_input(AVFilterLink *inlink)
+{
+    AVFilterContext *ctx = inlink->dst;
+    ASRContext *s = ctx->priv;
+
+    ps_start_utt(s->ps);
+
+    return 0;
+}
+
+static av_cold int init(AVFilterContext *ctx)
+{
+    ASRContext *s = ctx->priv;
+    const float frate = s->rate;
+    const char *rate = av_asprintf("%f", frate);
+    const char *argv[] = { "-logfn",    s->logfn,
+                           "-lm",       s->lm,
+                           "-lmctl",    s->lmctl,
+                           "-lmname",   s->lmname,
+                           "-dict",     s->dict,
+                           "-samprate", rate,
+                           NULL };
+
+    s->config = cmd_ln_parse_r(NULL, ps_args(), 12, (char **)argv, TRUE);
+    if (!s->config)
+        return AVERROR(ENOMEM);
+
+    ps_default_search_args(s->config);
+    s->ps = ps_init(s->config);
+    if (!s->ps)
+        return AVERROR(ENOMEM);
+
+    return 0;
+}
+
+static int query_formats(AVFilterContext *ctx)
+{
+    ASRContext *s = ctx->priv;
+    int sample_rates[] = { s->rate, -1 };
+    int ret;
+
+    AVFilterFormats *formats = NULL;
+    AVFilterChannelLayouts *layout = NULL;
+
+    if ((ret = ff_add_format                 (&formats, AV_SAMPLE_FMT_S16                 )) < 0 ||
+        (ret = ff_set_common_formats         (ctx     , formats                           )) < 0 ||
+        (ret = ff_add_channel_layout         (&layout , AV_CH_LAYOUT_MONO                 )) < 0 ||
+        (ret = ff_set_common_channel_layouts (ctx     , layout                            )) < 0 ||
+        (ret = ff_set_common_samplerates     (ctx     , ff_make_format_list(sample_rates) )) < 0)
+        return ret;
+
+    return 0;
+}
+
+static av_cold void uninit(AVFilterContext *ctx)
+{
+    ASRContext *s = ctx->priv;
+
+    ps_free(s->ps);
+    s->ps = NULL;
+    cmd_ln_free_r(s->config);
+    s->config = NULL;
+}
+
+static const AVFilterPad asr_inputs[] = {
+    {
+        .name         = "default",
+        .type         = AVMEDIA_TYPE_AUDIO,
+        .filter_frame = filter_frame,
+        .config_props = config_input,
+    },
+    { NULL }
+};
+
+static const AVFilterPad asr_outputs[] = {
+    {
+        .name = "default",
+        .type = AVMEDIA_TYPE_AUDIO,
+    },
+    { NULL }
+};
+
+AVFilter ff_af_asr = {
+    .name          = "asr",
+    .description   = NULL_IF_CONFIG_SMALL("Automatic Speech Recognition."),
+    .priv_size     = sizeof(ASRContext),
+    .priv_class    = &asr_class,
+    .init          = init,
+    .uninit        = uninit,
+    .query_formats = query_formats,
+    .inputs        = asr_inputs,
+    .outputs       = asr_outputs,
+};
diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c
index ae725cb0e0..fcbf50120b 100644
--- a/libavfilter/allfilters.c
+++ b/libavfilter/allfilters.c
@@ -74,6 +74,7 @@ extern AVFilter ff_af_ashowinfo;
 extern AVFilter ff_af_asidedata;
 extern AVFilter ff_af_asoftclip;
 extern AVFilter ff_af_asplit;
+extern AVFilter ff_af_asr;
 extern AVFilter ff_af_astats;
 extern AVFilter ff_af_astreamselect;
 extern AVFilter ff_af_atempo;
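
Note on the utterance bookkeeping in filter_frame(): the state handling can be read in isolation from PocketSphinx. The following is only a minimal sketch under stated assumptions. A plain boolean per frame stands in for the result of ps_get_in_speech(), and UttTracker, utt_tracker_feed() and count_utterances() are hypothetical names used for illustration; none of them is part of the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the filter state; only utt_started mirrors the patch. */
typedef struct UttTracker {
    bool utt_started;   /* true while speech has been seen and not yet closed */
    int  finished;      /* utterances closed so far (illustration only) */
} UttTracker;

/*
 * Feed the voice-activity flag of one frame.
 * Returns true on the first non-speech frame after speech, which is the
 * point where the patch ends the utterance and fetches the hypothesis.
 */
static bool utt_tracker_feed(UttTracker *t, bool in_speech)
{
    if (in_speech && !t->utt_started)
        t->utt_started = true;
    if (!in_speech && t->utt_started) {
        t->utt_started = false;
        t->finished++;
        return true;
    }
    return false;
}

/* Run the tracker over a sequence of per-frame flags and count closed utterances. */
int count_utterances(const bool *flags, size_t n)
{
    UttTracker t = { false, 0 };

    for (size_t i = 0; i < n; i++)
        utt_tracker_feed(&t, flags[i]);

    return t.finished;
}
```

With the flags { 0, 1, 1, 0, 0, 1, 0, 1 } this counts two utterances: they close on frames 3 and 6, and those are the frames that would carry lavfi.asr.text. The speech that starts on the last frame is never closed, since nothing in this logic flushes an utterance that is still open when the input ends.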