ISCSLP 2022 | 8 papers from NPU-ASLP laboratory accepted

2022-11-24 21:32:10 voice home

ISCSLP 2022 (International Symposium on Chinese Spoken Language Processing), a flagship international conference in the field of speech processing technology, will be held in Singapore from December 11 to 14.

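The NPU Audio, Speech and Language Processing Group (NPU-ASLP) and its partners will present 8 papers at this conference. The papers cover a number of research directions in intelligent speech processing, including speech recognition, speaker diarization, speech synthesis, and voice conversion. Partner organizations on the papers include Tencent, Meituan, Transsion Holdings, and Mashang Consumer Finance. In addition, at this conference the lab successfully organized the Intelligent Cockpit Speech Recognition Challenge (ICSRC) together with AISHELL, Tianjin University, Nanyang Technological University, the WeNet open-source community, and Li Auto. It is also worth noting that the lab team won second place in the Code-Switching ASR Challenge (CSASR), and that the lab, in cooperation with Transsion Holdings, took third place in the Conversational Short-phrase Speaker Diarization Challenge (CSSD). Information on the papers to be presented follows.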


AccentSpeech: Learning Accent from Crowd-sourced Data for Target Speaker TTS with Accents

Authors: Zhang Yongmao, Wang Zhichao, Yang Peiji, Sun Hongshen, Wang Zhisheng, Xie Lei


Abstract: Learning an accent from crowd-sourced accented data and transferring it to a target speaker is a feasible way to build an accented speech synthesis system. To achieve this, two challenging problems must be solved. First, if low-quality crowd-sourced accent data and high-quality target-speaker data are used directly to train an accent transfer model, the quality of the synthesized speech is clearly lower than that of the original target-speaker data. To alleviate this problem, we adopt a synthesis scheme that uses neural-network bottleneck (BN) features as the intermediate representation. The acoustic model is split into a Text-to-BN (T2BN) module and a BN-to-Mel (BN2Mel) module, which model the accent and the target speaker's timbre respectively, and the BN features extracted by the neural network are robust to noise. Second, training this two-stage model directly on crowd-sourced data gives the target speaker's accented speech poor prosody, because the crowd-sourced data is recorded by ordinary, non-professional speakers. To solve this problem, we upgrade the two-stage model to a three-stage model: the T2BN and BN2Mel modules are trained on high-quality target-speaker data, and a BN-to-BN (BN2BN) module is inserted between them to perform accent transfer. We use data augmentation to generate parallel unaccented and accented BN data for training the BN2BN module. In the end, the proposed three-stage model synthesizes accented speech in the target speaker's voice. Because the prosody of the synthesized speech is learned from the target speaker's professionally recorded data, the speech has good prosody. The effectiveness of the proposed AccentSpeech is verified on a Mandarin accent transfer task.




End-to-End Voice Conversion with Information Perturbation

Authors: Xie Qicong, Yang Shan, Lei Yi, Xie Lei, Su Dan


Abstract: The goal of voice conversion is to convert the timbre of the source speech into that of the target speaker while keeping the content of the source speech unchanged. However, current methods fall short in speaker similarity and prosody, and the feature mismatch between the acoustic model and the vocoder degrades the quality of the converted speech. This paper uses information perturbation and proposes a fully end-to-end approach to high-quality voice conversion. Information perturbation is first used to remove speaker-related information from the source speech, decoupling speaker timbre from linguistic content. To better transfer the prosody of the source speech to the target speech, the paper introduces a speaker-related prosody encoder that preserves the prosodic pattern of the source speaker. Modeling the speech waveform directly improves quality and avoids the acoustic model and vocoder feature mismatch that arises when the mel-spectrogram is used as an intermediate representation. Finally, by modeling a continuous speaker space, the model can perform zero-shot voice conversion. Experimental results show that the proposed end-to-end method outperforms the comparison models in intelligibility, naturalness, and speaker similarity.




Multi-speaker Multi-style Text-to-speech Synthesis with Single-speaker Single-style Training Data Scenarios

Authors: Xie Qicong, Li Tao, Wang Xinsheng, Wang Zhichao, Xie Lei, Yu Guoqiao, Wan Guanglu


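Abstract: Style transfer in speech synthesis aims to have a speaker synthesize speech in a style that the speaker does not originally have, for example having an ordinary speaker synthesize speech in story, news, broadcast, or read-aloud styles. For the synthesis system to learn style information, the corpora used in previous studies required a single speaker to provide recordings in multiple styles, which places high demands on the speaker. To address this problem, this paper designs a style transfer scheme for the single-speaker single-style scenario, in which the training corpus of each speaker needs only one style. The paper also applies fine-grained prosody control at the phoneme level, which makes it easy to control the intensity of the style.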




Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS

Authors: Song Kun, Cong Jian, Wang Xinsheng, Zhang Yongmao, Xie Lei, Jiang Ning, Wu Haiying


Abstract: In the current mainstream two-stage TTS framework, the ideal is a universal vocoder that is trained only once, needs no fine-tuning on target data, and is robust to the mel-spectrograms generated by the acoustic model. With this goal, we improve on multi-band MelGAN and propose Robust MelGAN, which alleviates the metallic sound that multi-band MelGAN produces when it is fed mel-spectrograms generated by an acoustic model, and improves its generalization ability. First, we introduce a fine-grained network dropout strategy in the generator: the periodic and aperiodic components of the speech signal are separated, and network dropout is applied to the aperiodic component. This avoids the metallic sound while keeping timbre similarity stable. To improve the generalization ability of the model, we use several data augmentation methods to expand the fake data seen by the discriminator, including harmonic shift, harmonic noise, and phase noise. Experiments show that Robust MelGAN, as a universal vocoder, can be paired with acoustic models trained on a variety of data while maintaining good quality.




AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation

Authors: Song Kun, Xue Heyang, Wang Xinsheng, Cong Jian, Zhang Yongmao, Xie Lei, Yang Bing, Zhang Xiong, Su Dan


Abstract: Speaker adaptation aims to adapt a pre-trained TTS model with a small amount of target-speaker data to obtain a TTS system for the target speaker. There has been a lot of work on this task, but very few lightweight speaker adaptation models target scenarios with low computing resources. This paper proposes AdaVITS, a lightweight speaker adaptation model based on VITS. To effectively reduce the parameters and computation of the VITS model, we first propose a decoder based on the inverse short-time Fourier transform (iSTFT) to replace the original upsampling-network decoder, which is computationally expensive. Second, we introduce the shared density estimation flow module of NanoFlow to replace the original flow module, which reduces the number of parameters. In addition, we introduce a linear attention mechanism into the text encoder to replace the original dot-product attention, which reduces computation. To improve the stability of the VITS model, we use PPG features as an intermediate representation to supervise the learning process from text linguistic features to spectral features. Experiments show that on the speaker adaptation task, AdaVITS generates stable and natural speech with only 8.97M parameters and 0.72 GFLOPs of computation.
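To make the saving from swapping dot-product attention for linear attention concrete, here is a minimal pure-Python sketch. The abstract does not say which linear attention variant AdaVITS uses, so the code assumes one widely used formulation in which the softmax is replaced by a positive feature map, here elu(x) + 1. All function names are hypothetical. The key-value summary is accumulated once and reused for every query, so the cost grows linearly with sequence length rather than quadratically.

```python
import math


def feature_map(x):
    """elu(x) + 1: a positive feature map (assumed here, not taken from the paper)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Kernel attention in O(n): build the key-value summary once, reuse it per query."""
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]  # sum over j of phi(k_j) v_j^T
    k_sum = [0.0] * d_k                     # sum over j of phi(k_j)
    for k, v in zip(keys, values):
        phi_k = feature_map(k)
        for a in range(d_k):
            k_sum[a] += phi_k[a]
            for b in range(d_v):
                kv[a][b] += phi_k[a] * v[b]
    outputs = []
    for q in queries:
        phi_q = feature_map(q)
        norm = sum(phi_q[a] * k_sum[a] for a in range(d_k))
        outputs.append(
            [sum(phi_q[a] * kv[a][b] for a in range(d_k)) / norm for b in range(d_v)]
        )
    return outputs


def quadratic_reference(queries, keys, values):
    """The same kernel attention computed pair by pair in O(n^2), used only for checking."""
    d_v = len(values[0])
    outputs = []
    for q in queries:
        phi_q = feature_map(q)
        weights = [sum(a * b for a, b in zip(phi_q, feature_map(k))) for k in keys]
        total = sum(weights)
        outputs.append(
            [sum(w * v[b] for w, v in zip(weights, values)) / total for b in range(d_v)]
        )
    return outputs
```

Both functions return the same values up to floating-point rounding. The difference is that `linear_attention` never forms the full query-by-key weight matrix.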




The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Swiching ASR Challenge

Authors: Liang Yuhao, Chen Peikun, Yu Fan, Zhu Xinfa, Xu Tianyi, Xie Lei

Abstract: This paper describes the system that the NPU-ASLP lab submitted to the ISCSLP 2022 Code-Switching ASR Challenge. In this challenge, we first explored several ASR model structures and training strategies, including bi-encoder, language-aware encoder (LAE), and mixture of experts (MoE). To strengthen the system's language modeling ability, we further tried an internal language model (ILM) and a long context language model (LCLM). In addition, we used multiple data augmentation methods, including speed perturbation, pitch shifting, audio codec augmentation, and speech synthesis, to overcome the scarcity of challenge data. Finally, we used ROVER to fuse the recognition results of the different models. Our submitted system ranked second on the test set, achieving an MER of 16.87%.
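The abstract does not define MER. As a rough illustration, the sketch below assumes the convention commonly used for Mandarin-English code-switching ASR: each Chinese character and each English word counts as one token, and the error rate is the token-level edit distance divided by the number of reference tokens. The function names and the tokenization rule are assumptions made for this example and are not the challenge's official scoring script.

```python
import re

# Assumption: one token = one CJK character, or one run of Latin letters/apostrophes.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z']+")


def tokenize(text):
    """Split mixed Mandarin-English text into characters and lower-cased words."""
    return [t.lower() for t in _TOKEN.findall(text)]


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference token
                    curr[j - 1] + 1,     # insertion of a hypothesis token
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def mixture_error_rate(ref_text, hyp_text):
    """Token-level edit distance divided by the reference length."""
    ref, hyp = tokenize(ref_text), tokenize(hyp_text)
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)
```

For example, with the reference `我想听 Taylor Swift 的歌` (seven tokens) and the hypothesis `我想听 Taylor 的歌`, one word is deleted, so this sketch returns 1/7.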




TSUP Speaker Diarization System for Conversational Short-phrase Speaker Diarization Challenge

Authors: Pang Bowen, Zhao Huan, Zhang Gaosheng, Yang Xiaoyue, Sun Yang, Zhang Li, Wang Qing, Xie Lei


Abstract: This paper describes the system used by the joint team of NPU-ASLP and Transsion Holdings in the ISCSLP 2022 Conversational Short-phrase Speaker Diarization (CSSD) challenge. The challenge focuses on short-utterance conversational scenarios and adopts a new evaluation metric, CDER. In this challenge, we explored three classical speaker diarization approaches: a system based on spectral clustering (SC), a system based on target-speaker voice activity detection (TS-VAD), and an end-to-end system. Our main conclusions are as follows. First, under the new CDER metric, the traditional spectral clustering method outperforms the other two methods. Second, for all three types of speaker diarization system, hyperparameter tuning is crucial for CDER. For example, CDER becomes smaller when the sub-segment length is set longer. Finally, fusing multiple systems with DOVER-LAP did not produce better results. Our submitted system ranked third in the challenge.




The ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge (ICSRC): Dataset, Tracks, Baseline and Results

Authors: Zhang Ao, Yu Fan, Huang Kaixun, Xie Lei, Wang Longbiao, Eng Siong Chng, Bu Hui, Zhang Binbin, Chen Wei, Xu Xin

Partner organizations: Tianjin University, Nanyang Technological University, AISHELL, Li Auto, WeNet community

Abstract: This paper summarizes the outcomes of the ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge (ICSRC). We first explain why the challenge is needed and introduce its dataset. The dataset was recorded in new energy vehicles and covers the acoustic and linguistic characteristics of voice interaction in an intelligent cockpit. We then introduce the track settings: the challenge has two tracks, one with limited model size and one with unlimited model size, corresponding to on-vehicle and cloud-side speech recognition scenarios respectively. Finally, we summarize the results of the challenge and the main methods adopted by the submitted systems.




This article was written by [voice home]. Please include a link to the original when reposting. Thank you.