At the High-Tech Expo of the recent 2022 iFLYTEK 1024 Global Developer Festival, the AI Tech Pavilion displayed the latest technologies and achievements from the iFLYTEK Research Institute.
iFLYTEK’s multi-modal speech recognition was on full display. Conference attendees could test the technology by speaking to it, and the display screen accurately transcribed their speech despite the hubbub of the expo hall. This capability is among the latest achievements of the iFLYTEK Research Institute in the field of speech and voice recognition.
The “Multi-modal Speech Recognition” presented by the iFLYTEK Research Institute at the 2022 iFLYTEK 1024 Global Developer Festival is based on a new framework for multi-modal speech enhancement and recognition that builds on earlier iFLYTEK breakthroughs. The base technology came from a collaboration between iFLYTEK and the National Engineering Research Center of Speech and Language Information Processing at the University of Science and Technology of China (USTC-NELSLIP), which designed an iterative mask estimation algorithm based on synchronous sound perception of the space and the target speaker. Further improvements followed in 2021, when iFLYTEK released TFMA (Temporary Feedback End-End Multi-Channel ASR), a speech recognition framework integrated with a microphone-array front end, which greatly improved recognition accuracy in complex scenarios.
In addition to the audio information used in traditional speech recognition, the new technology uses lip shape to better identify the target speaker and separate that speaker’s speech from other voices. Vision and hearing account for about 80% and 10%, respectively, of all the information acquired through our five senses, so integrating the target speaker’s video and audio sequences amounts to “lip reading” to some extent and helps the system pick out the target speaker. The result is a technology that can reproduce the ‘cocktail party effect’, in which people are able to hold a conversation in a crowded room despite overlapping background noise, and that has achieved a multi-modal speech recognition accuracy of more than 63%.
At present, iFLYTEK’s multi-modal speech recognition framework, which combines lip shape and voice, is pioneering applications in cars, conferences, subway ticket purchases, and hospital registrations.
Speech recognition, as the starting point of human-computer interaction in the era of the Internet of Everything, is a vital part of iFLYTEK’s Super Brain 2030 initiative to make robot assistants available to every household. Combining “seeing” and “hearing” allows machines to perceive and understand the world in multiple ways, just as humans do, so that machines will soon be able to interact with us more naturally and intelligently in a variety of real-life settings.