Scalable Learning in Latent State Sequence Models
In this dissertation, we develop scalable learning methods for sequential data models with latent (hidden) states. State space models (SSMs) and recurrent neural networks (RNNs) are popular models for sequential data using latent states. By augmenting an observed sequence with a latent state sequence, SSMs and RNNs can capture complex temporal dynamics with a simpler, smaller parametrization. Unfortunately, learning the parameters of these latent state sequence models requires processing the latent states along the entire sequence, which scales poorly for both long and high-dimensional sequential data. For long sequential data, we develop scalable training methods that use stochastic gradients based on processing subsequences. Unlike stochastic gradients for independent data, stochastic gradients for sequential data break temporal dependencies and, as a result, are biased. We develop theory to analyze the effect of this bias on learning and develop efficient estimators to control this bias. For SSMs, we use buffered stochastic gradient estimates, which reduce the bias by passing additional messages in a buffer around each subsequence. For RNNs, we adaptively truncate backpropagation to save computation and memory when possible. We find that these methods provide significant speed-ups on both synthetic and real data sets with millions of time points (i.e., ion-channel recordings, canine electroencephalogram recordings, historical weather data, financial exchange rate data, and text corpus data), while maintaining accuracy similar to the computationally prohibitive batch approaches.
For high-dimensional sequential data, we focus on the computational challenge of marginalizing latent variables of many time series when clustering. Existing Bayesian methods for learning clusters of time series either mix slowly (naive Gibbs) or scale cubically in the number of dimensions (collapsed Gibbs). We propose an approximate collapsed Gibbs sampling scheme that improves mixing by approximately collapsing out parameters using expectation propagation, while scaling linearly instead of cubically. We show empirically on synthetic data, robust clustering of MNIST, and housing price prediction that our approximate sampler has similar performance to a collapsed Gibbs sampler at a fraction of the runtime.
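The buffering idea mentioned above for SSMs can be illustrated with a small sketch. This is not the dissertation's implementation: it assumes a toy two-state discrete hidden Markov model with made-up parameters, and it looks only at the forward (filtering) message entering a subsequence rather than at a full gradient estimate. The function names `forward_message` and `buffered_message` are hypothetical. The sketch compares the message computed from the entire prefix with the message computed from only the last few observations before the subsequence (the buffer).

```python
import numpy as np

def forward_message(obs, trans, emis, init):
    """Normalized forward (filtering) message after processing `obs`."""
    alpha = init.copy()
    for y in obs:
        alpha = (alpha @ trans) * emis[:, y]  # predict, then weight by likelihood
        alpha /= alpha.sum()
    return alpha

def buffered_message(obs, start, buffer, trans, emis, init):
    """Approximate the message entering obs[start:] using only the
    `buffer` observations immediately before `start`."""
    left = max(0, start - buffer)
    return forward_message(obs[left:start], trans, emis, init)

# Toy model (all values are illustrative assumptions).
trans = np.array([[0.9, 0.1], [0.2, 0.8]])  # trans[i, j] = p(z_t = j | z_{t-1} = i)
emis = np.array([[0.7, 0.3], [0.4, 0.6]])   # emis[i, y] = p(y_t = y | z_t = i)
init = np.array([0.5, 0.5])

# Simulate a long observation sequence from the toy model.
rng = np.random.default_rng(0)
T = 1000
obs = np.zeros(T, dtype=int)
z = rng.choice(2, p=init)
for t in range(T):
    z = rng.choice(2, p=trans[z])
    obs[t] = rng.choice(2, p=emis[z])

start = 500
exact = forward_message(obs[:start], trans, emis, init)  # uses the full prefix
for buffer in (0, 5, 20, 100):
    approx = buffered_message(obs, start, buffer, trans, emis, init)
    print(buffer, np.abs(approx - exact).max())
```

With `buffer=0` the subsequence starts from the prior `init`, which ignores everything before it. Larger buffers move the approximate message toward the exact one at a cost that depends on the buffer length rather than on the length of the full sequence.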
- Statistics