In addition to reviewing commonly used speaker recognition techniques, this paper presents a speaker recognition method based on long-term time-frequency features. The input speech is first processed with voice activity detection (VAD) to obtain clean speech, from which basic time-frequency features are extracted. Within each speech unit, the trajectories of time-frequency features such as the fundamental frequency, formants, and harmonics are fitted with Legendre polynomials, and the main fitting coefficients are retained as features. Heteroscedastic linear discriminant analysis (HLDA) is then applied for dimensionality reduction, and the mean supervector of a Gaussian mixture model (GMM) is used to represent the statistics of the time-frequency features of each utterance. On the NIST 2006 1side-1side speaker recognition test set, the system achieves an equal error rate (EER) of 18.7%; when fused with a conventional MFCC-based speaker recognition system, the EER drops from 4.9% to 4.6%, a relative reduction of about 6%.
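The central long-term feature extraction step is fitting the per-unit trajectory of a contour (for example, the F0 track) with a Legendre polynomial and keeping the low-order coefficients. The sketch below is a minimal illustration under assumptions, not the authors' implementation: the `fit_trajectory` helper, the polynomial order, and the synthetic pitch contour are illustrative choices.

```python
import numpy as np
from numpy.polynomial import legendre as L


def fit_trajectory(contour, order=5):
    """Fit one feature trajectory (e.g., the F0 or a formant track of a
    speech unit) with Legendre polynomials and return the coefficients.

    The time axis is rescaled to [-1, 1], the natural domain of the
    Legendre polynomials, so that coefficients from speech units of
    different lengths are comparable.
    """
    contour = np.asarray(contour, dtype=float)
    t = np.linspace(-1.0, 1.0, num=len(contour))
    # legfit returns the (order + 1) coefficients of the Legendre series
    return L.legfit(t, contour, deg=order)


# Example: a synthetic rising pitch contour over one speech unit
f0_track = 120.0 + 30.0 * np.linspace(0.0, 1.0, 50) + np.random.randn(50)
feature_vector = fit_trajectory(f0_track, order=5)  # 6 coefficients
print(feature_vector)
```

Coefficients obtained this way for the different trajectories of a unit (F0, formants, harmonics) would then be concatenated before the HLDA dimensionality reduction described in the abstract.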
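The per-utterance statistics are summarized as a GMM mean supervector: a universal background model (UBM) is MAP-adapted to the utterance's feature vectors and the adapted component means are concatenated. Below is a minimal sketch of that standard construction, assuming scikit-learn's `GaussianMixture` as the UBM; the `relevance_factor` value and the `gmm_mean_supervector` helper are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def gmm_mean_supervector(ubm: GaussianMixture, features: np.ndarray,
                         relevance_factor: float = 16.0) -> np.ndarray:
    """MAP-adapt the UBM means to one utterance and concatenate them.

    features: (n_frames, dim) array of per-frame feature vectors
    returns:  (n_components * dim,) mean supervector
    """
    # Posterior probability of each component for each frame
    gamma = ubm.predict_proba(features)            # (n_frames, n_components)
    n_c = gamma.sum(axis=0)                        # soft counts per component
    # Normalized first-order statistics per component
    f_c = gamma.T @ features                       # (n_components, dim)
    e_c = f_c / np.maximum(n_c, 1e-10)[:, None]
    # Relevance-factor MAP adaptation of the means only
    alpha = (n_c / (n_c + relevance_factor))[:, None]
    adapted_means = alpha * e_c + (1.0 - alpha) * ubm.means_
    return adapted_means.ravel()


# Usage sketch: train the UBM on background data, then adapt per utterance
# ubm = GaussianMixture(n_components=512, covariance_type="diag").fit(bg_feats)
# supervector = gmm_mean_supervector(ubm, utterance_feats)
```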