The 2023 International Conference on Acoustics, Speech, and Signal Processing (ICASSP), the most influential international conference in the field of voice technology, recently released its paper review results and announced that 15 papers from the iFLYTEK Research Institute Intelligent Speech Team and its joint laboratories have been accepted.

ICASSP is the flagship academic conference of the IEEE Signal Processing Society and the largest, most comprehensive international conference on acoustics, speech, and signal processing and their applications.
Below is a summary of seven of the accepted iFLYTEK papers. The remaining eight papers will be summarized in a subsequent article.
1. Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses
This paper presents a novel speech phase prediction model that predicts wrapped phase spectra directly from amplitude spectra using a neural network. The proposed model consists of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is made up of two parallel linear convolutional layers and a phase calculation formula; it imitates the process of computing phase spectra from the real and imaginary parts of complex spectra and restricts the predicted phase values to the principal value interval.
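As a rough illustration of the parallel estimation architecture described above (class and parameter names here are hypothetical, not taken from the paper), the sketch below shows two parallel 1x1 convolutions predicting pseudo real and imaginary parts, with atan2 restricting the output phase to the principal value interval:

```python
import torch
import torch.nn as nn

class ParallelEstimationHead(nn.Module):
    """Hypothetical sketch: two parallel linear (1x1) convolutions predict
    pseudo real/imaginary parts; atan2 maps them to wrapped phase in (-pi, pi]."""
    def __init__(self, channels: int, freq_bins: int):
        super().__init__()
        self.real_conv = nn.Conv1d(channels, freq_bins, kernel_size=1)
        self.imag_conv = nn.Conv1d(channels, freq_bins, kernel_size=1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, channels, frames) from the residual convolutional backbone
        r = self.real_conv(features)   # pseudo real part
        i = self.imag_conv(features)   # pseudo imaginary part
        # atan2 keeps the predicted phase inside the principal value interval
        return torch.atan2(i, r)
```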

To avoid the error-expansion issue caused by phase wrapping, the researchers designed anti-wrapping training losses between the predicted wrapped phase spectra and the natural ones, activating the instantaneous phase error, group delay error, and instantaneous angular frequency error with an anti-wrapping function. Experimental results show that the proposed neural speech phase prediction model outperforms the iterative Griffin-Lim algorithm and other neural network-based methods in both reconstructed speech quality and generation speed.
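The anti-wrapping idea can be rendered as a minimal sketch, assuming the anti-wrapping function simply maps any phase error back onto its principal value (function names below are hypothetical):

```python
import math
import torch

def anti_wrapping(x: torch.Tensor) -> torch.Tensor:
    # Map any phase error onto its principal value, so a small true error
    # near the +/- pi boundary is not penalized as a large one.
    return torch.abs(x - 2 * math.pi * torch.round(x / (2 * math.pi)))

def instantaneous_phase_loss(pred_phase: torch.Tensor, true_phase: torch.Tensor) -> torch.Tensor:
    # Instantaneous-phase term; the group-delay and instantaneous-angular-frequency
    # terms would apply the same activation to differences taken along the
    # frequency and time axes, respectively.
    return anti_wrapping(pred_phase - true_phase).mean()
```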
arXiv preprint download: https://arxiv.org/abs/2211.15974
Sound demos: https://yangai520.github.io/NSPP
Open-source code download: https://github.com/yangai520/NSPP
2. Speech Reconstruction from Silent Tongue and Lip Articulation by Pseudo Target Generation and Domain Adversarial Training
This paper studies speech reconstruction from ultrasound tongue images and optical lip videos recorded in a silent speaking mode, in which users move their intra-oral and extra-oral articulators without producing sound. The researchers proposed a method built on pseudo target generation and domain adversarial training, with an iterative training strategy, to improve the intelligibility and smoothness of the speech recovered from silent tongue and lip articulation.
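Domain adversarial training is commonly realized with a gradient reversal layer; the sketch below shows that generic building block (not the paper's exact architecture), which would push a shared encoder toward features that do not reveal whether a frame came from silent or vocalized articulation:

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Gradient reversal layer: identity in the forward pass,
    negated (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)

# A domain classifier placed behind grad_reverse is trained to tell the two
# domains apart, while the reversed gradient trains the encoder to make them
# indistinguishable, which is one standard way to align silent and vocalized data.
```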

Experiments show that the proposed method significantly improves the intelligibility and smoothness of the reconstructed speech in silent speaking mode compared with the baseline TaLNet model. When an automatic speech recognition (ASR) model is used to measure intelligibility, the word error rate (WER) of the proposed method decreased by over 15% compared to the baseline. The proposed method also outperformed the baseline on the intelligibility of speech reconstructed in vocalized articulating mode, reducing the WER by 10%.
Sound demos: https://zhengrachel.github.io/ImprovedTaLNet-demo/
3. Zero-shot Personalized Lip-to-Speech Synthesis with Face Image based Voice Control
Lip-to-Speech (Lip2Speech) synthesis refers to synthesizing speech from a silent video of a speaker's face. Although face videos carry both linguistic content and speaker information, the voice characteristics produced by current Lip2Speech methods for speakers outside the training set often fail to match the speaker's identity. In this paper, the researchers proposed a zero-shot personalized Lip2Speech synthesis method in which face images control speaker identity: a variational autoencoder is adopted to disentangle the speaker identity and linguistic content representations, enabling face-based speaker embeddings (FSE) to control the voice characteristics of the synthetic speech for unseen speakers.
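A minimal sketch of the VAE-style disentanglement described above is shown below; all class names and dimensions are hypothetical and only illustrate how a face-derived speaker embedding can be sampled separately from the content stream:

```python
import torch
import torch.nn as nn

class FaceSpeakerVAE(nn.Module):
    """Illustrative sketch (names and sizes assumed): a VAE-style encoder that maps
    a face-image feature vector to a speaker embedding, kept separate from the
    linguistic-content stream so the two factors can be disentangled."""
    def __init__(self, in_dim: int = 512, z_dim: int = 64):
        super().__init__()
        self.mu = nn.Linear(in_dim, z_dim)
        self.logvar = nn.Linear(in_dim, z_dim)

    def forward(self, face_feat: torch.Tensor):
        mu, logvar = self.mu(face_feat), self.logvar(face_feat)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)          # reparameterization
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())    # prior regularizer
        return z, kl  # z would act as the face-based speaker embedding (FSE)
```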

Furthermore, considering the scarcity of training data, the researchers proposed associated cross-modal representation learning to strengthen the ability of the face-based speaker embeddings to control voice characteristics. Extensive experiments verify the effectiveness of the proposed method, whose synthetic utterances are more natural and better match the personality of the input video than those of the compared methods.
Sound demos: https://levent9.github.io/Lip2Speech-demo/
4. A Multi-scale Feature Aggregation based Lightweight Network for Audio-visual Speech Enhancement
Audio-visual speech enhancement (AVSE) has been shown to outperform audio-only speech enhancement (AOSE) in improving speech quality.
However, most current AVSE models are heavyweight, and their deployment and application are impeded by their large number of parameters. The researchers proposed a lightweight AVSE model (M3Net) that combines multi-modal, multi-scale, and multi-branch strategies.
For the video and audio branches, the researchers designed three multi-scale methods: multi-scale average pooling (MSAP), a multi-scale residual network (MSResNet), and a multi-scale short-time Fourier transform (MSSTFT).
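Of the three, multi-scale average pooling is the easiest to illustrate; the sketch below is a hypothetical rendering (scales and names assumed, not taken from the paper) of pooling the same features at several window sizes and concatenating them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAvgPool(nn.Module):
    """Hypothetical multi-scale average pooling (MSAP) block: the same feature map
    is average-pooled at several window sizes and the results are concatenated,
    giving a lightweight network context at different time scales."""
    def __init__(self, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        outs = []
        for s in self.scales:
            pooled = F.avg_pool1d(x, kernel_size=s, stride=s) if s > 1 else x
            # upsample back to the original frame rate so the scales can be concatenated
            outs.append(F.interpolate(pooled, size=x.shape[-1], mode="nearest"))
        return torch.cat(outs, dim=1)
```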

Additionally, four skip-connection methods were designed for audio and video feature aggregation, complementing the three multi-scale methods above. Experimental results show that these methods can be flexibly combined with existing ones. More importantly, compared with the heavyweight network, the lightweight network delivers the same performance with a smaller model size.
5. Robust Data2vec: Noise-robust Speech Representation Learning for ASR by Combining Regression and Improved Contrastive Learning
Self-supervised pre-training methods based on contrastive learning or regression tasks can exploit large amounts of unlabeled data to improve the performance of automatic speech recognition (ASR). However, how combining the two pre-training tasks, and how constructing different negative samples for contrastive learning, affects noise robustness remains unclear.
In this paper, the researchers proposed a noise-robust data2vec for self-supervised speech representation learning that jointly optimizes contrastive learning and regression tasks in the pre-training stage. They also presented two improvements to the contrastive learning. First, they constructed patch-based non-semantic negative samples, obtained by dividing the features into patches of different sizes, to boost the noise robustness of the pre-trained model. Second, by analyzing the distribution of positive and negative samples, they proposed removing easily distinguishable negative samples to improve the discriminative capacity of the pre-trained model.
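The joint objective can be sketched as a weighted sum of a regression term and a contrastive term; the snippet below is an illustrative approximation with hypothetical tensor shapes, weighting, and negative-sample handling, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def joint_pretraining_loss(student, teacher, negatives, alpha: float = 1.0, temperature: float = 0.1):
    """student, teacher: (batch, frames, dim); negatives: (batch, frames, n_neg, dim),
    e.g. patch-based feature chunks used as negative samples."""
    # data2vec-style regression: predict the teacher's target representations
    reg = F.smooth_l1_loss(student, teacher)

    # contrastive term: the matching teacher frame is the positive,
    # the provided patches serve as negatives
    pos = F.cosine_similarity(student, teacher, dim=-1).unsqueeze(-1)        # (B, T, 1)
    neg = F.cosine_similarity(student.unsqueeze(2), negatives, dim=-1)       # (B, T, N)
    logits = torch.cat([pos, neg], dim=-1) / temperature
    targets = torch.zeros(logits.shape[:-1], dtype=torch.long, device=logits.device)
    ctr = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

    return reg + alpha * ctr
```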

Experimental results on the CHiME-4 dataset show that the method improves the performance of the pre-trained model in noisy scenarios. The research team also found that joint training of the contrastive learning and regression tasks can, to some extent, avoid the model collapse observed when training only the regression task.
arXiv preprint download: https://arxiv.org/abs/2210.15324
6. Incorporating Lip Features into Audio-visual Multi-speaker DOA Estimation by Gated Fusion
In this paper, the researchers proposed a new audio-visual multi-speaker direction-of-arrival (DOA) estimation network, which incorporates the lip features of multiple speakers to handle complex scenes with overlapping speakers and background noise. First, the researchers encoded the multi-channel audio features, the multi-speaker reference angles, and the lip regions of interest (RoIs) obtained from the video. The encoded audio features, reference-angle features, and lip features are then fused by a three-modality gated fusion module that balances their contributions to the final output angle. The fused features are fed to the back-end network, and accurate DOA estimates are obtained by combining the per-speaker angle vectors and activity probabilities predicted by the network.
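The gated fusion step might look roughly like the sketch below, in which a learned sigmoid gate re-weights the three modality embeddings before they are passed to the back-end; the module and dimension names are hypothetical:

```python
import torch
import torch.nn as nn

class ThreeModalityGatedFusion(nn.Module):
    """Hypothetical gated fusion: a sigmoid gate computed from all three modality
    embeddings weights each modality's contribution to the fused feature."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(3 * dim, 3), nn.Sigmoid())
        self.proj = nn.Linear(3 * dim, dim)

    def forward(self, audio, angle, lip):
        # audio, angle, lip: (batch, dim) embeddings of the three modalities
        stacked = torch.stack([audio, angle, lip], dim=1)        # (batch, 3, dim)
        g = self.gate(torch.cat([audio, angle, lip], dim=-1))    # (batch, 3) modality gates
        fused = stacked * g.unsqueeze(-1)                        # re-weight each modality
        return self.proj(fused.flatten(1))                       # (batch, dim) fused feature
```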

Experimental results on the dataset of the Multimodal Information Based Speech Processing (MISP) Challenge 2021 show that, compared with previous work, this method reduces the localization error by 73.48% and relatively improves the localization accuracy by 86.95%. The high accuracy and stability of the localization results verify the robustness of the proposed model in multi-speaker scenes.
7. Quantum Transfer Learning using the Large-scale Unsupervised Pre-trained Model WavLM-Large for Synthetic Speech Detection
Quantum machine learning offers potential advantages over traditional deep learning and may yield new models for supervised classification datasets. This paper proposed a classical-quantum transfer learning system that uses a large-scale unsupervised pre-trained model to demonstrate the competitiveness of quantum transfer learning in synthetic speech detection. The researchers used the pre-trained WavLM-Large model to extract feature maps from speech signals, obtained a low-dimensional embedding vector through classical network components, and then applied a variational quantum circuit (VQC), jointly fine-tuning the pre-trained model and the classical network components.
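A classical-quantum pipeline of this kind can be sketched with PennyLane's PyTorch integration; the snippet below is an illustrative layout (the qubit count, circuit template, and the 1024-dimensional WavLM-Large embedding size are assumptions), not the paper's exact system:

```python
import torch.nn as nn
import pennylane as qml

n_qubits, n_layers = 4, 2
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev, interface="torch")
def vqc(inputs, weights):
    # Encode the low-dimensional classical embedding into qubit rotations,
    # apply trainable entangling layers, then read out Pauli-Z expectations.
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

quantum_layer = qml.qnn.TorchLayer(vqc, {"weights": (n_layers, n_qubits, 3)})

# Hypothetical head: features from a pre-trained extractor such as WavLM-Large
# (assumed 1024-dimensional) pass through a classical projection, the VQC, and
# a final bonafide/spoof classifier; the whole stack is fine-tuned jointly.
classifier_head = nn.Sequential(
    nn.Linear(1024, n_qubits),
    nn.Tanh(),        # bound the values used as rotation angles
    quantum_layer,
    nn.Linear(n_qubits, 2),
)
```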

The researchers evaluated their system on the ASVspoof 2021 Speech Deepfake (DF) task. Experiments using simulated quantum circuits show that quantum transfer learning can improve on the classical transfer learning baseline for this task.