ISCSLP 2022 | 8 papers from the NPU-ASLP lab accepted
2022-11-24 21:32:10【voice home】
ISCSLP 2022 (International Symposium on Chinese Spoken Language Processing), a flagship international conference in speech processing, will be held in Singapore on December 11-14.
The Audio, Speech and Language Processing Group at Northwestern Polytechnical University (NPU-ASLP) will present 8 papers at this year's conference together with its partners. The papers cover many research directions in intelligent speech processing, including speech recognition, speaker diarization, speech synthesis and voice conversion. Partner organizations include Tencent, Meituan, Transsion Holdings and Mashang Consumer Finance. In addition, at this conference the lab, together with AISHELL, Tianjin University, Nanyang Technological University, the WeNet open-source community, Li Auto and other organizations, successfully organized the Intelligent Cockpit Speech Recognition Challenge (ICSRC). The lab's team also won second place in the Code-Switching ASR Challenge (CSASR), and the lab, in cooperation with Transsion Holdings, took third place in the Conversational Short-phrase Speaker Diarization Challenge (CSSD). Details of the accepted papers follow.
AccentSpeech: Learning Accent from Crowd-sourced Data for Target Speaker TTS with Accents
Authors: Zhang Yongmao, Wang Zhichao, Yang Peiji, Sun Hong gentlemen and, Wang Zhisheng, Xie Lei
Abstract: Learning accent from crowd-sourced data is a feasible way to build a speech synthesis system for a target speaker with accents. Two challenging problems must be solved to achieve this. First, if the low-quality crowd-sourced accent data and the high-quality target speaker data are used directly to train an accent transfer model, the synthesis quality is clearly lower than that of the target speaker's original data. To alleviate this, we adopt neural-network bottleneck (BN) features as the intermediate representation and split the acoustic model into Text-to-BN (T2BN) and BN-to-Mel (BN2Mel) stages, which model the accent and the target speaker's timbre respectively; BN features extracted by a neural network are also robust to noise. Second, training these two stages directly on crowd-sourced data gives the target speaker poor prosody, because the crowd-sourced data are recorded by ordinary speakers rather than professional voice talents. To solve this, we extend the two-stage model to a three-stage model: the T2BN and BN2Mel modules are trained on the target speaker's high-quality data, and a BN-to-BN (BN2BN) module is inserted between them for the accent transfer task. We use data augmentation to generate parallel non-accented and accented BN data for training the BN2BN module. The resulting three-stage model synthesizes accented speech in the target speaker's voice, and because the prosody is learned from the target speaker's professionally recorded data, the synthesized speech has good prosody. The effectiveness of the proposed AccentSpeech is verified on a Chinese accent transfer task.
End-to-End Voice Conversion with Information Perturbation
Abstract: Voice conversion aims to convert the timbre of the source speech to that of a target speaker while keeping the content of the source speech unchanged. However, current methods fall short in speaker similarity and prosody, and the feature mismatch between the acoustic model and the vocoder degrades the quality of the converted speech. Using information perturbation, this paper proposes a fully end-to-end approach to high-quality voice conversion. First, information perturbation removes speaker-related information from the source speech, disentangling the speaker's timbre from the linguistic content. To better transfer the prosody of the source speech to the converted speech, the paper introduces a speaker-related prosody encoder that preserves the prosody pattern of the source speaker. Modeling the speech samples directly improves quality and avoids the acoustic-model/vocoder feature mismatch that arises when a mel spectrogram is used as the intermediate representation. Finally, by modeling a continuous speaker space, the model can perform zero-shot voice conversion. Experiments show that the proposed end-to-end method outperforms the compared models in intelligibility, naturalness and speaker similarity.
Multi-speaker Multi-style Text-to-speech Synthesis with Single-speaker Single-style Training Data Scenarios
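Authors: Xie Qicong, Li Tao, Wang Xinsheng, Wang Zhichao, Xie Lei, Let the bridge, Wan Guanglu
Abstract: Style transfer in speech synthesis lets a speaker synthesize speech in styles that the speaker does not originally have, for example letting an ordinary speaker synthesize speech in storytelling, news, broadcast or read-aloud styles. To let the synthesis system learn style information, previous studies used corpora in which a single speaker covers multiple styles, which places high demands on the speaker. To address this, this paper designs a style transfer scheme for the single-speaker single-style scenario, in which each speaker in the training corpus needs only one style. The paper also introduces phoneme-level fine-grained prosody control, which makes it easy to control style intensity.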
Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS
Authors: Song Kun, From established, Wang Xinsheng, Zhang Yongmao, Xie Lei, Jiang Ning, Wu Haiying
Abstract: In the current mainstream two-stage TTS framework, the ideal is a universal vocoder that is trained only once, needs no fine-tuning on target data, and is robust to the mel spectrograms generated by the acoustic model. To this end, we improve on multi-band MelGAN and propose Robust MelGAN, which alleviates the metallic ("electric") sound that multi-band MelGAN produces when it is fed mel spectrograms generated by an acoustic model, and improves its generalization ability. First, we introduce a fine-grained network dropout strategy in the generator: the periodic and aperiodic components of the speech signal are separated and network dropout is applied to the aperiodic component, which avoids the metallic sound while keeping timbre similarity stable. To improve generalization, we use several data augmentation methods to expand the fake data seen by the discriminator, including harmonic shift, harmonic noise and phase noise. Experiments show that Robust MelGAN, as a universal vocoder, works with acoustic models trained on a variety of data while maintaining good quality.
AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation
Authors: Song Kun, Xue Heyang, Wang Xinsheng, From established, Zhang Yongmao, Xie Lei, Yang Bing, Zhang Xiong, Su Dan
Abstract: Speaker adaptation aims to adapt a pre-trained TTS model with a small amount of target speaker data to obtain a TTS system for that speaker. There has been a lot of work on this task, but few lightweight speaker adaptation models target scenarios with low computing resources. This paper proposes AdaVITS, a lightweight speaker adaptation model based on VITS. To effectively reduce the parameters and computation of VITS, we first propose a decoder based on the inverse short-time Fourier transform (iSTFT) to replace the original upsampling-network decoder, which is computationally expensive. Second, we introduce NanoFlow's flow module with a shared density estimator to replace the original flow module, reducing the number of parameters. In addition, we introduce a linear attention mechanism in the text encoder to replace the original dot-product attention, reducing computation. To improve the stability of the VITS model, we use PPG features as an intermediate representation to supervise the learning process from linguistic text features to spectral features. Experiments show that on the speaker adaptation task AdaVITS generates stable, natural speech with only 8.97M model parameters and 0.72 GFlops of computation.
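The computational saving from replacing dot-product attention with linear attention can be illustrated with a minimal sketch. The abstract does not say which linear attention variant AdaVITS uses, so the feature map below (elu(x) + 1) and all function names are assumptions made for illustration only. The point is that the key-value summary is computed once and reused for every query, so cost grows linearly with sequence length, while the result equals the pairwise (quadratic) computation with the same kernel.

```python
import math

def feature_map(x):
    # Assumed positive feature map phi(x) = elu(x) + 1 (not taken from the paper).
    return x + 1.0 if x > 0 else math.exp(x)

def linear_attention(Q, K, V):
    """Kernel attention in linear time. Q: n x d, K: m x d, V: m x e (lists of lists)."""
    d, e, m = len(K[0]), len(V[0]), len(K)
    phi_K = [[feature_map(x) for x in row] for row in K]
    # Summary computed once: KV[a][b] = sum_j phi(K_j)[a] * V_j[b]
    KV = [[sum(phi_K[j][a] * V[j][b] for j in range(m)) for b in range(e)]
          for a in range(d)]
    # Normalizer summary: K_sum[a] = sum_j phi(K_j)[a]
    K_sum = [sum(phi_K[j][a] for j in range(m)) for a in range(d)]
    out = []
    for q in Q:
        phi_q = [feature_map(x) for x in q]
        norm = sum(phi_q[a] * K_sum[a] for a in range(d))
        out.append([sum(phi_q[a] * KV[a][b] for a in range(d)) / norm
                    for b in range(e)])
    return out

def pairwise_attention(Q, K, V):
    """Same kernel attention computed query-by-key (quadratic), for comparison."""
    out = []
    for q in Q:
        phi_q = [feature_map(x) for x in q]
        weights = [sum(pq * feature_map(k) for pq, k in zip(phi_q, row)) for row in K]
        total = sum(weights)
        out.append([sum(w * v[b] for w, v in zip(weights, V)) / total
                    for b in range(len(V[0]))])
    return out

# Toy inputs (illustrative values only).
Q = [[0.5, -1.0], [1.5, 0.2]]
K = [[0.1, 0.3], [-0.7, 1.2], [2.0, -0.4]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

fast = linear_attention(Q, K, V)
slow = pairwise_attention(Q, K, V)
print(fast)
```

Both functions return the same values up to floating-point rounding; only the order of the multiplications differs.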
The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Switching ASR Challenge
Authors: Liang Yuhao, Chen Peikun, Yu Fan, Zhu Xinfa, Xu Tianyi, Xie Lei
Abstract: This paper describes the system that the NPU-ASLP lab submitted to the ISCSLP 2022 code-switching speech recognition challenge. In this challenge, we first explored several ASR model structures and training strategies, including bi-encoder, language-aware encoder (LAE) and mixture of experts (MoE). To strengthen the system's language modeling ability, we further tried an internal language model (ILM) and a long context language model (LCLM). In addition, we used multiple data augmentation methods, including speed perturbation, pitch shifting, audio codec augmentation and speech synthesis, to overcome the scarcity of challenge data. Finally, we used ROVER to fuse the recognition results of different models. Our submitted system ranked second on the test set, achieving an MER of 16.87%.
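For readers unfamiliar with the metric, the sketch below shows one common way an error rate over mixed Chinese-English text is computed: Chinese characters and English words are both treated as tokens, and the edit distance between reference and hypothesis is divided by the reference length. The challenge's exact scoring rules (text normalization, handling of punctuation and so on) are not given here, so the tokenizer, the function names and the example sentences are assumptions for illustration.

```python
import re

def tokenize(text):
    # Assumption: each Chinese character is one token, each English word is one token.
    return re.findall(r"[\u4e00-\u9fff]|[a-z']+", text.lower())

def edit_distance(ref, hyp):
    # Standard Levenshtein distance over token lists, keeping one rolling row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def mixed_error_rate(ref_text, hyp_text):
    ref, hyp = tokenize(ref_text), tokenize(hyp_text)
    return edit_distance(ref, hyp) / len(ref)

# Made-up example: one wrong English word and one wrong Chinese character.
reference = "我想听 hello world 这首歌"
hypothesis = "我想听 hello word 这首个"
mer = mixed_error_rate(reference, hypothesis)
print(f"MER = {mer:.2%}")
```

Here the reference has 8 tokens and the hypothesis contains 2 substitutions, so the sketch reports 2/8.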
TSUP Speaker Diarization System for Conversational Short-phrase Speaker Diarization Challenge
Authors: Pang Bowen, Zhao Huan, Zhang Gaosheng, Yang Xiaoyue, Sun Yang, Zhang Li, Wang Qing, Xie Lei
Abstract: This paper describes the system used by the joint NPU and Transsion Holdings team in the ISCSLP 2022 Conversational Short-phrase Speaker Diarization (CSSD) challenge. The challenge focuses on short conversational scenarios and adopts a new evaluation metric, CDER. We explored three classical speaker diarization solutions: a spectral clustering (SC) based system, a target-speaker voice activity detection (TS-VAD) based system, and an end-to-end system. Our main conclusions are as follows. First, under the new CDER metric, the traditional spectral clustering method performs better than the other two methods. Second, for all three types of speaker diarization solution, tuning the hyperparameters is crucial to the CDER metric. For example, CDER becomes smaller when the sub-segment length is set longer. Finally, fusing multiple systems with DOVER-LAP did not give better results. Our submitted system ranked third in the challenge.
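The sub-segment length mentioned above is the window used to cut each detected speech region into pieces before speaker embeddings are extracted and clustered. A minimal sketch of such uniform sub-segmentation follows; the function name, window and shift values are illustrative assumptions, not settings reported by the authors.

```python
def split_region(start, end, window, shift):
    """Cut the speech region [start, end) into sub-segments of at most
    `window` seconds, advancing by `shift` seconds each step."""
    segments = []
    i = 0
    while True:
        seg_start = start + i * shift
        if seg_start >= end:
            break
        seg_end = min(seg_start + window, end)
        segments.append((seg_start, seg_end))
        if seg_end >= end:  # the region is fully covered
            break
        i += 1
    return segments

# A 4-second speech region cut with a short and a long window (illustrative values).
short_segments = split_region(0.0, 4.0, window=1.5, shift=0.75)
long_segments = split_region(0.0, 4.0, window=3.0, shift=1.5)
print(len(short_segments), len(long_segments))
```

A longer window yields fewer, longer sub-segments per region, which is the hyperparameter the abstract says should be tuned for CDER.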
The ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge (ICSRC): Dataset, Tracks, Baseline and Results
Authors: Zhang Ao, Yu Fan, Huang Kaixun, Xie Lei, Wang Longbiao, Eng Siong Chng, Bu Hui, Zhang Binbin, Chen Wei, Xu Xin
Abstract: This paper summarizes the outcomes of the ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge (ICSRC). We first explain why the challenge is needed and introduce its dataset. The dataset was recorded in new energy vehicles and covers the acoustic and linguistic characteristics of voice interaction in an intelligent cockpit. We then describe the track settings: the challenge has two tracks, one with limited model size and one with unlimited model size, corresponding to on-vehicle and cloud speech recognition scenarios respectively. Finally, we summarize the challenge results and the main methods adopted by the submitted systems.