iFLYTEK has led technological breakthroughs in speech recognition, speech synthesis, machine translation, and other areas.

Part II: 15 Papers from Intelligent Speech Team and Joint Laboratory of iFLYTEK Research Institute Accepted by ICASSP 2023

The 2023 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2023), widely regarded as the most influential international conference in the field of speech technology, recently announced its paper review results: 15 papers from the iFLYTEK Research Institute Intelligent Speech Team and Joint Laboratory were accepted.

ICASSP is the official academic conference of the IEEE Signal Processing Society and the largest and most comprehensive international conference on acoustics, speech, signal processing, and their applications. Below is a summary of eight of the accepted iFLYTEK papers (numbers 8 to 15). The other seven papers were summarized in an earlier article.

8. Super Dilated Nested Arrays with Ideal Critical Weights and Increased Degrees of Freedom

This paper presents two extensions of the recently introduced dilated nested array (DNA), which has the same virtual uniform linear array (ULA) as the nested array but contains two dense physical ULAs with critical spacing (2×λ/2). In the first extension, with the same number of degrees of freedom (DOFs) as the parent array, the first dense ULA can be rearranged Q_f times. As a result, the spacing between all sensor pairs, including the critical ones, is fully addressed in a specified Q-th order extended nested array, where 2≤Q≤Q_f+1. In the second extension, the super dilated nested array (SDNA), the second dense ULA of the Q-th order DNA is rearranged so that its weights are fixed, as in homogeneous arrays. Numerical examples verify the performance of these arrays.
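For readers unfamiliar with the virtual ULA mentioned above, the following minimal sketch builds a standard two-level nested array (the parent structure, not the dilated variants proposed in the paper) and computes its difference coarray and weight function. Positions are in units of λ/2. The function names and the sizes N1 = N2 = 3 are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def nested_array(n1, n2):
    """Sensor positions (units of lambda/2) of a standard two-level nested array."""
    inner = list(range(1, n1 + 1))                      # dense ULA, spacing 1
    outer = [(n1 + 1) * m for m in range(1, n2 + 1)]    # sparse ULA, spacing n1 + 1
    return inner + outer

def coarray_weights(positions):
    """Map each lag to the number of ordered sensor pairs (a, b) with a - b == lag."""
    return Counter(a - b for a in positions for b in positions)

n1, n2 = 3, 3                          # illustrative sizes (assumption)
positions = nested_array(n1, n2)
weights = coarray_weights(positions)

# The virtual ULA of a nested array spans lags -(n2*(n1+1)-1) .. +(n2*(n1+1)-1).
max_lag = n2 * (n1 + 1) - 1
hole_free = all(weights[lag] > 0 for lag in range(-max_lag, max_lag + 1))

print("positions:", positions)
print("hole-free virtual ULA:", hole_free)
print("w(1), w(2), w(3):", weights[1], weights[2], weights[3])
```

The weights at small lags count closely spaced sensor pairs, which is why array designs of this kind try to keep them low while leaving the virtual ULA unchanged.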

9. Incorporating Visual Information Reconstruction into Progressive Learning for Optimizing Audio-Visual Speech Enhancement

Traditional audio-visual speech enhancement networks take noisy speech and the corresponding video as input and learn to predict clean speech directly. To reduce the gap in signal-to-noise ratio (SNR) between the input and the learning target, we proposed AVPL, a mask-based progressive audio-visual speech enhancement framework that improves the SNR gradually, and combined it with visual information reconstruction (VIR). Each stage of AVPL takes a pre-trained visual embedding (VE) and specific audio features as input and predicts the mask corresponding to a certain SNR improvement. To extract more visual information, the AVPL-VIR model also reconstructs the input VE at each stage.

Experiments on the TCD-TIMIT dataset show that both audio-only and audio-visual progressive learning significantly outperform traditional single-step learning. In addition, AVPL-VIR improves further on AVPL because it extracts more complete visual information.
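The idea of progressive targets can be sketched as follows. This is only an illustration of how intermediate learning targets with gradually higher SNR can be constructed, assuming time-domain signals and fixed 10 dB steps; the paper itself predicts masks, and its stage gains and features are not given in this summary. All names and numbers here are hypothetical.

```python
import math

def snr_db(clean, noise):
    """Signal-to-noise ratio in dB for equal-length sample lists."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum(n * n for n in noise)
    return 10.0 * math.log10(signal_power / noise_power)

def progressive_targets(clean, noise, gains_db):
    """One target per stage: the mixture with its noise attenuated by each
    cumulative SNR gain, followed by the clean signal as the final target."""
    targets = []
    for gain in gains_db:
        scale = 10.0 ** (-gain / 20.0)   # amplitude factor that raises SNR by `gain` dB
        targets.append([c + scale * n for c, n in zip(clean, noise)])
    targets.append(list(clean))
    return targets

clean = [1.0, -1.0, 0.5, -0.5]           # toy signals (assumption)
noise = [0.8, 0.6, -0.7, 0.9]
targets = progressive_targets(clean, noise, gains_db=[10.0, 20.0])

# Residual noise left in the first intermediate target.
residual = [t - c for t, c in zip(targets[0], clean)]
print("input SNR (dB):", round(snr_db(clean, noise), 2))
print("stage-1 target SNR (dB):", round(snr_db(clean, residual), 2))
```

Each stage then only has to close a small SNR gap rather than map the noisy input straight to clean speech.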

10. An Experimental Study on Sound Event Localization and Detection under Realistic Testing Conditions

This paper explores four data augmentation methods and two model structures for sound event localization and detection (SELD) under realistic testing conditions. In the SELD task, realistic data is more difficult to handle than simulated data because of room reverberation and overlapping sound events.

Researchers first compared the four data augmentation methods on the real DCASE 2022 dataset using the ResNet-Conformer structure. Experiments show that, because of the mismatch between the simulated test set and the real test set, three of the four methods that work on the simulated dataset fail to perform satisfactorily on the real test set. The exception is audio channel swapping (ACS).

In addition, with ACS applied, the improved ResNet-Conformer further boosts SELD performance. By combining these two techniques, the final system won first place in the DCASE 2022 Challenge.
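The general idea behind channel-swapping augmentation can be sketched as follows: a spatial transformation is applied to the multichannel audio and the matching transformation is applied to the direction label. The example assumes ideal first-order Ambisonics (FOA) with channel order W, Y, Z, X and shows two illustrative transformations. It is not the exact set of transformations used in the paper, and all function names are hypothetical.

```python
import math

def encode_foa(signal, azimuth_deg, elevation_deg):
    """Encode a mono signal as ideal FOA (channel order W, Y, Z, X)."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    gains = (1.0,
             math.sin(az) * math.cos(el),
             math.sin(el),
             math.cos(az) * math.cos(el))
    return [[g * s for s in signal] for g in gains]

def wrap_deg(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def flip_y(foa, azimuth_deg, elevation_deg):
    """Mirror the scene left-right: negate Y, negate the azimuth label."""
    w, y, z, x = foa
    return [w, [-s for s in y], z, x], wrap_deg(-azimuth_deg), elevation_deg

def swap_xy(foa, azimuth_deg, elevation_deg):
    """Swap X and Y: the azimuth label becomes 90 degrees minus azimuth."""
    w, y, z, x = foa
    return [w, x, z, y], wrap_deg(90.0 - azimuth_deg), elevation_deg

def close(a, b, tol=1e-9):
    """True if two multichannel signals agree sample by sample."""
    return all(abs(p - q) <= tol for ca, cb in zip(a, b) for p, q in zip(ca, cb))

signal = [0.5, -1.0, 0.25]                       # toy mono source (assumption)
foa = encode_foa(signal, 30.0, 10.0)
flipped, az1, el1 = flip_y(foa, 30.0, 10.0)
swapped, az2, el2 = swap_xy(foa, 30.0, 10.0)

# The augmented audio matches a source re-encoded at the transformed label.
print(close(flipped, encode_foa(signal, az1, el1)))
print(close(swapped, encode_foa(signal, az2, el2)))
```

Because the transformed audio is physically consistent with the transformed label, this kind of augmentation adds new source directions without simulating new recordings.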

11. Loss Function Design for DNN-Based Sound Event Localization and Detection on Low-Resource Realistic Data

This paper focuses on loss function design for a two-branch deep neural network (DNN) model used for sound event localization and detection (SELD) with low-resource realistic data.

Researchers proposed an auxiliary audio classification network that provides global event information to the main network, making SELD predictions more robust. They also applied a momentum strategy to direction-of-arrival (DOA) estimation, exploiting the strong temporal coherence of sound events, which effectively reduces the localization error.

Researchers also added a regularization term to the loss function to mitigate overfitting on small datasets. On the Task 3 dataset of the DCASE (Detection and Classification of Acoustic Scenes and Events) 2022 Challenge, all three methods consistently improve SELD performance. Compared with the baseline system, the proposed loss function significantly improves localization and detection accuracy on realistic data.
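One plausible reading of a momentum strategy for DOA estimation is exponential smoothing of per-frame direction vectors, sketched below. The summary does not give the paper's exact formulation, so the update rule, the momentum value of 0.9, and the function name are all assumptions made for illustration.

```python
import math

def smooth_doa(frames, momentum=0.9):
    """Exponentially smooth a sequence of DOA unit vectors (x, y, z).

    Each output mixes the previous smoothed direction with the current raw
    estimate, then renormalizes to unit length. Inputs are assumed nonzero.
    """
    smoothed = []
    prev = None
    for vec in frames:
        if prev is None:
            mixed = tuple(vec)
        else:
            mixed = tuple(momentum * p + (1.0 - momentum) * v
                          for p, v in zip(prev, vec))
        norm = math.sqrt(sum(c * c for c in mixed)) or 1.0   # avoid division by zero
        prev = tuple(c / norm for c in mixed)
        smoothed.append(prev)
    return smoothed

# A source fixed at +x with one outlier frame pointing at +y.
raw = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 0.0, 0.0)]
smoothed = smooth_doa(raw)
for vec in smoothed:
    print(tuple(round(c, 3) for c in vec))
```

Because a sound event rarely jumps between directions from one frame to the next, the smoothed track stays close to the true direction even when a single frame is badly estimated.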

12. The Multimodal Information-Based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition

The Multimodal Information-Based Speech Processing (MISP) Challenge aims to extend the application of signal processing in specific scenarios by promoting research on technologies such as wake word spotting, speaker diarization, and speech recognition. The MISP 2022 Challenge has two tracks. Track 1 is Audio-Visual Speaker Diarization (AVSD), which addresses the “who spoke when” problem using audio-visual data. Track 2 is a new Audio-Visual Diarization and Recognition (AVDR) task, which addresses the “who said what and when” problem using audio-visual speaker diarization.

Both tracks focus on Chinese and use far-field audio and video recorded in real home TV scenarios (2 to 6 people talking with a TV as background noise). This paper describes the dataset, track setup, and baselines for the MISP 2022 Challenge. Analysis of the experiments and examples shows that although the AVDR baseline system delivers good performance, difficulties remain because of far-field video quality, background TV noise, and hard-to-distinguish speakers.

Open-source code download: https://github.com/mispchallenge/misp2022_baseline

13. AST-SED: An Effective Sound Event Detection Method Based on Audio Spectrogram Transformer

The Audio Spectrogram Transformer (AST) model pre-trained on large-scale data performs well on the audio tagging (AT) task, but directly using the output features of the AST is not optimal for the sound event detection (SED) task. To address this problem, this paper proposed an encoder-decoder downstream task module for efficiently fine-tuning the AST model. In the frequency-wise transformer encoder (FTE), a frequency-wise multi-head self-attention mechanism enables the model to better detect multiple sound events in an audio clip.

In the local GRU decoder (LGD), nearest neighbor interpolation (NNI) and a gated recurrent unit (GRU) are combined to decode features with high temporal resolution along the time axis for the detection task. Results on the DCASE 2022 Task 4 development set show that the proposed downstream task module significantly improves the performance of AST on detection tasks without redesigning the AST structure.
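Nearest neighbor interpolation along the time axis amounts to repeating each low-resolution frame a fixed number of times, as the minimal sketch below shows. The upsampling factor and the function name are assumptions for illustration, and the GRU that follows the interpolation in the decoder is not shown.

```python
def nearest_neighbor_upsample(frames, factor):
    """Upsample a sequence of feature frames in time by repeating each frame.

    `frames` is a list of feature vectors (one per time step); the result has
    `factor` times as many time steps.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [list(frame) for frame in frames for _ in range(factor)]

# Two coarse frames with two features each, upsampled by a factor of 3.
coarse = [[0.1, 0.9], [0.7, 0.2]]
fine = nearest_neighbor_upsample(coarse, factor=3)
print(len(coarse), "->", len(fine), "time steps")
print(fine)
```

The repeated frames restore the frame rate needed for event boundaries, and a recurrent layer applied afterwards can then refine them using local temporal context.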

14. Self-Supervised Audio-Visual Speech Representations Learning by Multimodal Self-Distillation

This paper proposed an AV2vec model, which adopts a multimodal self-distillation approach for audio-visual speech representation learning. The AV2vec model employs a teacher network and a student network.

The student model is trained on a masked hidden-layer feature regression task, and the target features it learns from are generated online by the teacher network. The parameters of the teacher network are an exponential moving average of those of the student network.
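The teacher update described above can be sketched in a few lines. Parameters are represented here as a plain dictionary of floats, and the decay value is an assumption chosen for illustration; the summary does not state the value used in the paper.

```python
def ema_update(teacher, student, decay=0.999):
    """Return new teacher parameters as an exponential moving average.

    teacher_new = decay * teacher_old + (1 - decay) * student
    """
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}

# Toy parameters (assumption): the teacher drifts slowly toward the student.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
for step in range(3):
    teacher = ema_update(teacher, student, decay=0.9)
    print(step, {k: round(v, 4) for k, v in teacher.items()})
```

Because the teacher changes slowly, the targets it produces stay stable while the student is still training, which is what allows them to be generated online.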

Because the target features of AV2vec are generated online, the model does not require iterative training as AV-HuBERT does, which reduces its training time to one fifth of that of AV-HuBERT. The paper also puts forward AV2vec-MLM, which extends AV2vec with a loss function in the style of a masked language model (MLM).

Experimental results show that AV2vec performs comparably to the AV-HuBERT baseline. With the masked-language-model-style loss added, AV2vec-MLM delivers the best performance on the downstream tasks of lip-reading, speech recognition, and multimodal speech recognition.

15. Reducing the Gap Between Streaming and Non-Streaming Transducer-Based ASR by Adaptive Two-Stage Knowledge Distillation

The Transducer is one of the dominant frameworks for streaming speech recognition. Because streaming and non-streaming models have access to different amounts of context, there is a significant performance gap between them. One way to reduce this gap is to make the distributions of their hidden and output layers as consistent as possible, which is usually achieved through hierarchical knowledge distillation. However, since the output distribution depends on the hidden layer, it is difficult to keep both the hidden-layer and output distributions of the streaming model consistent with those of the non-streaming model at the same time.

This paper proposed an adaptive two-stage knowledge distillation method consisting of hidden layer learning and output layer learning. In the first stage, the streaming model learns a hidden representation of the complete context through a mean squared error (MSE) loss. In the second stage, an adaptive smoothing method based on power transformation is used to learn a stable output distribution. Compared with the original streaming Transducer, the method achieves a 19% relative reduction in word error rate (WER) and a faster first-word response on the LibriSpeech dataset.
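The two ingredients named above can be sketched as follows: an MSE loss between hidden representations, and power-transformation smoothing of a probability distribution. The exponent gamma is fixed here for simplicity; how the paper adapts the smoothing is not described in this summary, so the fixed value, the toy numbers, and the function names are assumptions.

```python
def mse(student_hidden, teacher_hidden):
    """Mean squared error between two equal-length hidden vectors."""
    diffs = [(s - t) ** 2 for s, t in zip(student_hidden, teacher_hidden)]
    return sum(diffs) / len(diffs)

def power_smooth(probs, gamma):
    """Smooth a distribution by raising each probability to `gamma` and
    renormalizing. With 0 < gamma < 1 the distribution becomes flatter;
    gamma == 1 leaves it unchanged."""
    powered = [p ** gamma for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

# Stage 1 (toy example): match the hidden representation of the full-context model.
print("hidden MSE:", mse([0.2, 0.4, 0.1], [0.0, 0.5, 0.3]))

# Stage 2 (toy example): soften a sharply peaked output distribution.
teacher_probs = [0.90, 0.07, 0.03]
smoothed = power_smooth(teacher_probs, gamma=0.5)
print("smoothed:", [round(p, 3) for p in smoothed])
```

A flatter target keeps the ranking of output symbols but lowers the peak, which can make the output distribution easier for a streaming model with limited context to match.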
