Fast Decoding Algorithm for Automatic Speech Recognition Based on Recurrent Neural Networks

ZHANG Ge, ZHANG Pengyuan, PAN Jielin, YAN Yonghong

Citation: ZHANG Ge, ZHANG Pengyuan, PAN Jielin, YAN Yonghong. Fast Decoding Algorithm for Automatic Speech Recognition Based on Recurrent Neural Networks[J]. Journal of Electronics and Information Technology, 2017, 39(4): 930-937. doi: 10.11999/JEIT160543

doi: 10.11999/JEIT160543
Funds: The National Natural Science Foundation of China (U1536117, 11590770-4), The National Key Research and Development Plan of China (2016YFB0801200, 2016YFB0801203), The Key Science and Technology Project of the Xinjiang Uygur Autonomous Region (2016A03007-1)

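  • Abstract: Recurrent Neural Networks (RNNs) are now widely used for acoustic modeling in Automatic Speech Recognition (ASR). Although they offer substantial advantages over traditional acoustic modeling methods, their relatively high computational complexity limits their use, especially in real-time scenarios. Because the input features of RNNs usually span a long context, the overlapping information can be exploited to reduce the time complexity of both acoustic posterior computation and token passing. This paper introduces a new decoder structure that lowers the computational cost of decoding by regularly discarding overlapped frames. Notably, the method can be applied directly to the original RNN model with only minor changes to the Hidden Markov Model (HMM) structure, which makes it highly flexible. The method is validated using Time Delay Neural Networks (TDNNs) as an example, and it achieves a speedup of 2 to 4 times with a relatively small loss in accuracy.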
  • Publication history
    • Received:  2016-05-26
    • Revised:  2017-01-09
    • Published:  2017-04-19
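
The frame-dropping idea summarized in the abstract can be illustrated with a small sketch. This is not the paper's decoder: the stand-in acoustic model `toy_posteriors`, the helper `decode_with_frame_skip`, the skip factor, and the choice to compensate for dropped frames by raising the HMM transition matrix to the `skip`-th power are all assumptions made for illustration. Posteriors are used directly as emission scores to keep the example short.

```python
import numpy as np

def toy_posteriors(frames, weights):
    """Stand-in acoustic model (hypothetical): softmax over a linear layer."""
    logits = frames @ weights
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def viterbi(log_post, log_trans, log_init):
    """Best state path; log_trans[i, j] is the score of moving from i to j."""
    n_frames, n_states = log_post.shape
    score = log_init + log_post[0]
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + log_trans      # cand[i, j]: arrive in j from i
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_post[t]
    path = [int(score.argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def decode_with_frame_skip(features, weights, trans, init, skip):
    """Score only every `skip`-th frame and decode on the shortened sequence."""
    kept = features[::skip]                    # regularly discard frames
    post = toy_posteriors(kept, weights)       # acoustic model sees kept frames only
    # Assumption: one decoding step now spans `skip` original frames,
    # so the transition matrix is replaced by its `skip`-th power.
    trans_k = np.linalg.matrix_power(trans, skip)
    with np.errstate(divide="ignore"):         # log(0) = -inf is intended here
        path = viterbi(np.log(post), np.log(trans_k), np.log(init))
    return path, len(kept)

# Toy data: 12 frames, 4 per state, with a left-to-right 3-state HMM.
features = np.eye(3)[[0] * 4 + [1] * 4 + [2] * 4]
weights = 5.0 * np.eye(3)
trans = np.array([[0.8, 0.2, 0.0],
                  [0.0, 0.8, 0.2],
                  [0.0, 0.0, 1.0]])
init = np.array([1.0, 0.0, 0.0])

path, n_scored = decode_with_frame_skip(features, weights, trans, init, skip=2)
print(path, n_scored)
```

With `skip=2` the acoustic model and the Viterbi recursion both run on half as many frames, which is where the saving in posterior computation and token passing would come from in this simplified setting.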
