A Multiscale Feature Extraction Method for Text-independent Speaker Recognition

Zhigao CHEN, Peng LI, Runqiu XIAO, Ta LI, Wenchao WANG

Citation: Zhigao CHEN, Peng LI, Runqiu XIAO, Ta LI, Wenchao WANG. A Multiscale Feature Extraction Method for Text-independent Speaker Recognition[J]. Journal of Electronics and Information Technology. doi: 10.11999/JEIT200917


doi: 10.11999/JEIT200917
Funds: The National Natural Science Foundation of China (11590772, 11590774, 11590770)
Details
    Author biographies:

    CHEN Zhigao: male, born in 1994, Ph.D. candidate; research interests include speaker recognition, speech signal processing, and language identification

    LI Peng: male, born in 1983, senior engineer; research interests include network and information security

    XIAO Runqiu: male, born in 1995, Ph.D. candidate; research interests include robust speaker recognition and speech signal processing

    LI Ta: male, born in 1983, research fellow; research interests include speech signal processing, large-vocabulary spontaneous speech recognition, and keyword spotting

    WANG Wenchao: male, born in 1991, assistant research fellow; research interests include speech signal processing, speaker recognition, and language identification

    Corresponding author:

    WANG Wenchao, wangwenchao@hccl.ioa.ac.cn

  • Abstract: In recent years, various model structures based on Convolutional Neural Networks (CNNs) have shown increasingly strong multiscale feature representation ability and delivered steady performance gains across speaker recognition tasks. However, most current methods can only improve performance by making the network deeper and wider. This paper introduces a more efficient multiscale speaker-feature extraction framework, Res2Net, and improves its module structure. By operating in a finer-grained manner, it obtains combinations of multiple receptive fields and thus feature representations at multiple scales. Experiments show that, with an almost unchanged parameter count, the method achieves a 20% relative reduction in Equal Error Rate (EER) compared with ResNet, and yields stable gains across recording conditions and recognition tasks including VoxCeleb and Speakers In The Wild (SITW), demonstrating its efficiency and robustness. The improved fully connected module structure exploits the training information more thoroughly, with clear gains when data are abundant and the task is complex. Code is available at https://github.com/czg0326/Res2Net-Speaker-Recognition.
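The abstract describes the Res2Net idea: split a block's channels into groups and chain them so that each later group also sees the previous group's output, enlarging the effective receptive field group by group. A minimal NumPy sketch of that hierarchical connection pattern (the `conv` stand-in, function name, and shapes are illustrative assumptions, not the paper's exact layers):

```python
import numpy as np

def res2net_block(x, scale=4, conv=lambda g: g):
    """Hierarchical multiscale processing in the Res2Net style (sketch).

    x: feature map of shape (channels, frames); channels divisible by `scale`.
    conv: stand-in for the per-group 3x3 convolution (identity here).
    """
    groups = np.split(x, scale, axis=0)   # split channels into `scale` groups
    outputs = [groups[0]]                 # first group passes through unchanged
    prev = None
    for g in groups[1:]:
        # each later group receives the previous group's output,
        # so its effective receptive field grows with each step
        inp = g if prev is None else g + prev
        prev = conv(inp)
        outputs.append(prev)
    return np.concatenate(outputs, axis=0)
```

With an identity `conv`, the output makes the cascade visible: later channel groups accumulate contributions from the earlier ones, which is the "combination of multiple receptive fields" the text refers to.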
  • Figure 1  Residual block of a deep residual network

    Figure 2  Simplified Res2Net module

    Figure 3  Fully connected Res2Net module

    Table 1  Performance of each system on the VoxCeleb1 test set (training set: VoxCeleb1)

    System           EER(%)   minDCF (P=0.1)   minDCF (P=0.01)   minDCF (P=0.001)
    x-vector         4.189    0.212            0.391             0.512
    ResNet-50        3.955    0.212            0.404             0.483
    Res2Net-50-sim   3.484    0.194            0.370             0.481
    Res2Net-50-full  3.633    0.201            0.373             0.477
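The tables report the Equal Error Rate (EER), the operating point where the false-accept rate (FAR) on impostor trials equals the false-reject rate (FRR) on genuine trials. A simple threshold-sweep sketch (function name and interface are illustrative, not from the paper):

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Sweep every observed score as a threshold and return the point
    where FAR (impostors accepted) and FRR (genuine rejected) meet."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best_gap, eer = np.inf, None
    for t in thresholds:
        far = np.mean(nontarget_scores >= t)  # impostor trials accepted
        frr = np.mean(target_scores < t)      # genuine trials rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

On real trial lists the curves rarely cross exactly, so the midpoint of the closest FAR/FRR pair is reported, as above.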

    Table 2  Performance of each system on the VoxCeleb1 test set (training set: VoxCeleb2)

    System           EER(%)   minDCF (P=0.1)   minDCF (P=0.01)   minDCF (P=0.001)
    x-vector         2.985    0.179            0.336             0.465
    ResNet-50        2.243    0.158            0.299             0.391
    Res2Net-50-sim   1.729    0.143            0.271             0.405
    Res2Net-50-full  1.403    0.136            0.259             0.364
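The minDCF columns give the minimum of the normalized detection cost function at a stated target-trial prior P. A sketch under the common unit-cost assumption (C_miss = C_fa = 1; the function name and defaults are illustrative):

```python
import numpy as np

def min_dcf(target_scores, nontarget_scores, p_target, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all candidate thresholds."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # normalizer: cost of the better trivial policy (accept all / reject all)
    norm = min(c_miss * p_target, c_fa * (1.0 - p_target))
    best = np.inf
    for t in thresholds:
        p_miss = np.mean(target_scores < t)    # genuine trials rejected
        p_fa = np.mean(nontarget_scores >= t)  # impostor trials accepted
        cost = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
        best = min(best, cost / norm)
    return best
```

Because the cost weights misses by P and false alarms by 1 − P, a small P (such as 0.001 in the tables) penalizes false alarms far more heavily than misses.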

    Table 3  Performance of systems on the VoxCeleb test set

    System           Training set   EER(%)
    Nagrani[5]       VoxCeleb1      7.80
    Okabe[17]        VoxCeleb1      3.85
    Heo[12]          VoxCeleb1      5.50
    Chung[13]        VoxCeleb2      3.95
    Heo[12]          VoxCeleb2      2.66
    Zeinali[18]      VoxCeleb2      1.31
    Proposed system  VoxCeleb1      3.266
    Proposed system  VoxCeleb2      1.403

    Table 4  Performance of each system under the four SITW test conditions (EER(%))

    System           Training set   Core-core   Core-multi   Assist-core   Assist-multi
    x-vector         VoxCeleb1      6.698       8.661        8.476         9.920
    ResNet-50        VoxCeleb1      7.217       9.358        9.282         10.972
    Res2Net-50-sim   VoxCeleb1      6.483       8.520        8.306         9.740
    Res2Net-50-full  VoxCeleb1      6.603       8.575        8.297         9.516
    Res2Net-50-sim   VoxCeleb2      3.258       4.765        4.613         5.706
    Res2Net-50-full  VoxCeleb2      2.952       4.201        3.931         4.833

    Table 5  Res2Net-50 performance on VoxCeleb for different width and scale settings

    Setting   EER(%)   minDCF (P=0.1)   minDCF (P=0.01)   minDCF (P=0.001)
    7w4s      3.484    0.194            0.370             0.481
    16w4s     3.446    0.186            0.357             0.491
    7w8s      3.266    0.188            0.347             0.475

    Table 6  Res2Net-50 performance on SITW for different width and scale settings (EER(%))

    Setting   Core-core   Core-multi   Assist-core   Assist-multi
    7w4s      6.483       8.520        8.306         9.740
    16w4s     6.370       8.382        8.601         9.411
    7w8s      5.549       7.726        7.699         9.122
  • [1] GUO Wu, DAI Lirong, and WANG Renhua. Speaker verification based on factor analysis and SVM[J]. Journal of Electronics & Information Technology, 2009, 31(2): 302–305. doi: 10.3724/SP.J.1146.2007.01289
    [2] VARIANI E, LEI Xin, MCDERMOTT E, et al. Deep neural networks for small footprint text-dependent speaker verification[C]. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, Florence, Italy, 2014: 4052–4056.
    [3] SNYDER D, GARCIA-ROMERO D, POVEY D, et al. Deep neural network embeddings for text-independent speaker verification[C]. Interspeech 2017, Stockholm, Sweden, 2017: 999–1003.
    [4] WANG Wenchao and LI Ta. Research on deep speaker embeddings extraction based on multiple temporal scales[J]. Journal of Network New Media, 2019, 8(5): 21–26.
    [5] NAGRANI A, CHUNG J S, and ZISSERMAN A. VoxCeleb: A large-scale speaker identification dataset[EB/OL]. https://arxiv.org/abs/1706.08612, 2017.
    [6] HUANG Zili, WANG Shuai, and YU Kai. Angular softmax for short-duration text-independent speaker verification[C]. Interspeech 2018, Hyderabad, India, 2018: 3623–3627.
    [7] YADAV S and RAI A. Learning discriminative features for speaker identification and verification[C]. Interspeech 2018, Hyderabad, India, 2018: 2237–2241.
    [8] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778.
    [9] GAO Shanghua, CHENG Mingming, ZHAO Kai, et al. Res2Net: A new multi-scale backbone architecture[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 43(2): 652–662.
    [10] LIU Changyuan, WANG Qi, and BI Xiaojun. Research on rain removal method for single image based on multi-channel and multi-scale CNN[J]. Journal of Electronics & Information Technology, 2020, 42(9): 2285–2292. doi: 10.11999/JEIT190755
    [11] CAI Weicheng, CHEN Jinkun, and LI Ming. Exploring the encoding layer and loss function in end-to-end speaker and language recognition system[EB/OL]. https://arxiv.org/abs/1804.05160, 2018.
    [12] HEO H S, JUNG J W, YANG I H, et al. End-to-end losses based on speaker basis vectors and all-speaker hard negative mining for speaker verification[EB/OL]. https://arxiv.org/abs/1902.02455, 2019.
    [13] CHUNG J S, NAGRANI A, and ZISSERMAN A. VoxCeleb2: Deep speaker recognition[EB/OL]. https://arxiv.org/abs/1806.05622, 2018.
    [14] ZAGORUYKO S and KOMODAKIS N. Wide residual networks[EB/OL]. https://arxiv.org/abs/1605.07146, 2016.
    [15] XIE Saining, GIRSHICK R, DOLLÁR P, et al. Aggregated residual transformations for deep neural networks[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 1492–1500.
    [16] MCLAREN M, FERRER L, CASTAN D, et al. The speakers in the wild (SITW) speaker recognition database[C]. Interspeech 2016, San Francisco, USA, 2016: 818–822.
    [17] OKABE K, KOSHINAKA T, and SHINODA K. Attentive statistics pooling for deep speaker embedding[EB/OL]. https://arxiv.org/abs/1803.10963, 2018.
    [18] ZEINALI H, WANG Shuai, SILNOVA A, et al. BUT system description to VoxCeleb speaker recognition challenge 2019[EB/OL]. https://arxiv.org/abs/1910.12592, 2019.
Figures (3) / Tables (6)
Publication history
  • Received: 2020-10-26
  • Revised: 2021-03-13
  • Published online: 2021-03-25
