Secure Training of Random Forest Classifiers over Continuous Data
Existing Secure Multi-Party Computation (MPC) protocols for privacy-preserving training of decision trees over distributed data assume that the attributes are categorical. In real-life applications, attributes are often numerical. The standard "in the clear" algorithm for growing decision trees on data with continuous values sorts the training examples for each attribute in each node in order to find an optimal cut-point in the range of attribute values. Sorting is a prohibitively expensive operation in MPC, so secure protocols that mimic the traditional decision tree training algorithm are very inefficient. In this thesis, we propose an alternative, more efficient strategy for secure training of decision tree-based models on data with continuous attributes: secure discretization of the data, followed by secure training of a random forest classifier over the discretized data. In addition to mathematically proving that the proposed approach is correct and secure, we experimentally evaluate it in terms of classification accuracy and runtime on a variety of benchmark data sets. To the best of our knowledge, our approach is the first to privately train decision tree-based models with continuous attributes where the overall complexity depends only linearly on the size of the entire training data set, in contrast to existing sorting-based solutions.
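To illustrate why discretization avoids sorting, the following is a minimal plaintext sketch, not the secure protocol from the thesis. It assumes equal-width binning, which the abstract does not specify; the function name `equal_width_bins` and the sample values are hypothetical. Each continuous value is mapped to a bin index using only a minimum, a maximum, and one pass over the data, so the cost grows linearly with the number of examples.

```python
from typing import List


def equal_width_bins(values: List[float], num_bins: int) -> List[int]:
    """Map each value to a bin index in [0, num_bins - 1] without sorting.

    Assumption: equal-width binning is used purely for illustration.
    """
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    if not values:
        return []
    lo, hi = min(values), max(values)  # two linear scans, no sort
    if hi == lo:
        # All values are identical, so everything falls into one bin.
        return [0] * len(values)
    width = (hi - lo) / num_bins
    # Clamp so that the maximum value lands in the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


# Hypothetical continuous attribute (e.g. ages) split into 4 categories.
ages = [23.0, 35.5, 47.0, 59.0, 71.0]
print(equal_width_bins(ages, 4))  # [0, 1, 2, 3, 3]
```

The resulting bin indices can be treated as categorical attribute values, which is the input form that tree-training protocols for categorical data expect.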