Graph-based Semi-Supervised Learning in Acoustic Modeling for Automatic Speech Recognition
Acoustic models require large amounts of training data, and annotating that data for automatic speech recognition (ASR) demands substantial manual effort. Moreover, the performance of an acoustic model can degrade at test time when the test conditions differ from the training conditions in speaker characteristics, channel, or recording environment. To compensate for this mismatch between training and test conditions, we investigate a graph-based semi-supervised learning approach to acoustic modeling for ASR.

Graph-based semi-supervised learning (SSL) is a widely used SSL method in which labeled and unlabeled data are jointly represented as a weighted graph, and information is propagated from the labeled samples to the unlabeled ones. The key assumption in graph-based SSL is that data samples lie on a low-dimensional manifold, on which samples close to each other are expected to share the same class label. Importantly, by exploiting the relationship between training and test samples, graph-based SSL implicitly adapts to the test data.

In this thesis, we address several key challenges in applying graph-based SSL to acoustic modeling. We first investigate and compare several state-of-the-art graph-based SSL algorithms on a benchmark dataset. In addition, we propose novel graph construction methods that allow graph-based SSL to handle variable-length input features. We then investigate the efficacy of graph-based SSL in the context of a fully fledged DNN-based ASR system, comparing two integration frameworks for graph-based learning. First, we propose a lattice-based late integration framework that combines graph-based SSL with DNN-based acoustic modeling and evaluate it on continuous word recognition tasks.
Second, we propose an early integration framework based on neural graph embeddings and compare two neural graph embedding features that capture manifold information at different levels. The embedding features serve as input to a DNN system and are shown to outperform conventional acoustic feature inputs on several medium-to-large-vocabulary conversational speech recognition tasks.
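To make the propagation idea concrete: the abstract describes representing labeled and unlabeled samples as a weighted graph and spreading label information along its edges. The following is a minimal, illustrative sketch of Gaussian-kernel label propagation on a toy dataset; the kernel choice, variable names, and parameters are assumptions for illustration, not the specific algorithms developed in the thesis.

```python
import numpy as np

def label_propagation(X, y, n_labeled, sigma=1.0, n_iter=100):
    """Propagate labels from the first n_labeled rows of X to the rest.

    X: (n, d) feature matrix, labeled samples first.
    y: (n_labeled,) integer class labels for the labeled samples.
    """
    n = X.shape[0]
    n_classes = int(y.max()) + 1
    # Build a fully connected similarity graph with a Gaussian kernel.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Row-normalize to obtain a transition matrix over the graph.
    P = W / W.sum(axis=1, keepdims=True)
    # Label distributions: one-hot for labeled, uniform for unlabeled samples.
    F = np.full((n, n_classes), 1.0 / n_classes)
    F[:n_labeled] = np.eye(n_classes)[y]
    for _ in range(n_iter):
        F = P @ F                             # diffuse label mass along edges
        F[:n_labeled] = np.eye(n_classes)[y]  # clamp the known labels
    return F.argmax(axis=1)

# Toy example: two well-separated clusters, one labeled sample per cluster.
X = np.array([[0.0], [5.0], [0.1], [0.2], [5.1], [5.2]])
y = np.array([0, 1])
pred = label_propagation(X, y, n_labeled=2)
print(pred)  # the unlabeled points inherit the label of their cluster
```

The manifold assumption does the work here: because within-cluster edge weights dwarf the cross-cluster ones, each unlabeled point ends up with the label of its nearby labeled neighbor.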