A Sequentialization of Features Approach to Complex Event Sequence Prediction
Abstract
Sequence-based prediction takes an ordered list of events as input and predicts the next event. Most existing work on sequence-based prediction assumes that the sequences are simple, i.e., that they consist of symbols drawn from a small alphabet (like a DNA sequence) or of numbers (like a time series). In some applications, the events are far more complex. In medical applications, for instance, data often comes in the form of a longitudinal sequence of patient records, each of which contains hundreds of features of various data types. Most existing work on predicting the next event in complex event sequences is event-based, meaning that only the most recent event in the sequence is used to make a prediction about the upcoming event. In this thesis we propose a new technique for sequence-based prediction that is domain independent and that takes the order of occurrence of events into account when making predictions. The key idea is to dissect each sequence of k feature vectors of size m into a set of m simple sequences of length k, train m × k models using well-established machine learning techniques such as decision trees or support vector machines, and group the m × k trained models into an ensemble that makes the final prediction. We evaluate the predictive ability of the new technique by measuring the AUC for predicting risk of 30-day readmission, cost, and length of stay, using hospital discharge records of hundreds of thousands of congestive heart failure patients. Our experiments show that combining our sequence-based method with an event-based method gives better results than either method on its own.
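The dissection step described above can be illustrated with a minimal sketch. It covers only the reshaping of one complex sequence (k feature vectors of size m) into m simple sequences of length k, plus an enumeration of the m × k model slots. The function names `dissect` and `model_slots`, the example values, and the choice to index models by a (feature, position) pair are assumptions made for illustration; the abstract does not specify how each of the m × k models is trained or how the ensemble combines them, so those parts are deliberately left out.

```python
from typing import List, Sequence, Tuple


def dissect(sequence: Sequence[Sequence[float]]) -> List[List[float]]:
    """Turn k feature vectors of size m into m simple sequences of length k.

    Hypothetical helper: the name and signature are not from the thesis.
    """
    if not sequence:
        return []
    m = len(sequence[0])
    if any(len(event) != m for event in sequence):
        raise ValueError("every event must have the same number of features")
    # zip(*sequence) transposes the k x m layout into m x k,
    # preserving the order of events inside each simple sequence.
    return [list(column) for column in zip(*sequence)]


def model_slots(m: int, k: int) -> List[Tuple[int, int]]:
    """List the m * k (feature index, position index) pairs.

    Assumption: one model per pair; the abstract only states the count m x k.
    """
    return [(feature, position) for feature in range(m) for position in range(k)]


# Made-up example: k = 3 events, each with m = 2 numeric features.
events = [
    [70, 4.0],
    [71, 2.0],
    [71, 6.0],
]
simple_sequences = dissect(events)
slots = model_slots(m=2, k=3)
```

Here `simple_sequences` holds one list per feature, each of length k, and `slots` has m × k entries. In a fuller implementation, each slot would presumably hold a trained decision tree or support vector machine whose output feeds the ensemble.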