Learning to Disentangle Speech Information

Speaker:

Yang Zhang (MIT-IBM Watson AI Lab)

Time & Room:

2023/05/31 (Wed.) 10:00-11:00 AM (UTC+8), Tencent Meeting ID: 337-287-336

Abstract:

Speech contains rich information, which can roughly be divided into four categories: content, pitch, rhythm, and timbre. Many speech analysis tasks involve only one or a few of these categories, so a desirable speech processing system should be able to disentangle the four components. Recently, self-supervised learning (SSL) in speech has emerged as a promising approach to building a versatile speech processing system without much transcribed text data. SSL models are typically pre-trained on a large, untranscribed speech corpus to learn the structures in speech, so that they can be adapted to different speech processing tasks with relatively little labeled data. However, despite their success, existing SSL models still cannot distinguish among the different information components in speech. In this talk, I will explore two research questions. First, are there ways to disentangle different types of speech information without relying on text supervision? Second, would disentanglement lead to improved performance for SSL models in speech processing tasks? I will present a line of work that disentangles speech information and successfully applies the techniques to SSL model training. We hope that our findings can contribute to more powerful, aspect-specific SSL models, and to resolving the emerging textless NLP challenges.

Bio:

Yang Zhang is a research scientist at MIT-IBM Watson AI Lab. His research focuses on deep learning for speech, natural language, and other time-series processing. Recently, he has been working on disentanglement techniques for speech and their application to low-resource languages, as well as improving NLP model interpretability via rationalization. Before joining MIT-IBM Watson AI Lab, Yang was a researcher at IBM Research Yorktown. Yang obtained his Ph.D. degree from the University of Illinois at Urbana-Champaign (UIUC).