Automatic and adaptive preprocessing for the development of predictive models.
In recent years, there has been an increasing interest in extracting valuable information from large amounts of data. This information can be useful for making predictions about the future or inferring unknown values. There exists a multitude of predictive models for the most common tasks of classification and regression. However, researchers often assume that data is clean and far too little attention has been paid to data pre-processing. Despite the fact that there are a number of methods for accomplishing individual pre-processing tasks (e.g. outlier detection or feature selection), the effort of performing comprehensive data preparation and cleaning can take between 60% and 80% of the whole data mining process time. One of the goals of this research is to speed up this process and make it more efficient. To this end, an approach for automating the selection and optimisation of multiple preprocessing methods and predictors has been proposed.
The combination of multiple data mining methods forming a workflow is known as Multi-Component Predictive System (MCPS). There are multiple software platforms like Weka and RapidMiner to create and run MCPSs including a large variety of pre-processing methods and predictors. There is, however, no common mathematical representation of MCPSs. An objective of this thesis is to establish a common representation framework of MCPSs. This will allow validating workflows before beginning the implementation phase with any particular platform. The validation of workflows becomes even more relevant when considering the automatic generation of MCPSs.
In order to automate the composition and optimisation of MCPSs, a search space is defined consisting of a number of preprocessing methods, predictive models and their hyperparameters. Then, the space is explored using a Bayesian optimisation strategy within a given time or computational budget. As a result, a parametrised sequence of methods is returned which after training form a complete predictive system. The whole process is data-driven and does not require human intervention once it has been started.
The generated predictive system can then be used to make predictions in an online scenario. However, it is possible that the nature of the input data changes over time. As a result, predictive models may need to be updated to capture the new characteristics of the data in order to reduce the loss of predictive performance. Similarly, preprocessing methods may have to be adapted as well. A novel hybrid strategy combining Bayesian optimisation and common adaptive techniques is proposed to automatically adapt MCPSs. This approach performs a global adaptation of the MCPS. However, in some situations, it could be costly to update the whole predictive system when maybe just a little adjustment is needed. The consequences of adapting a single component can, however, be significant. This thesis also analyses the impact of adapting individual components in an MCPS and proposes an approach to propagate changes through the system.
This thesis was initiated due to a joint research project with a chemical production company, which has provided several datasets with common raw data issues in the process industry. The final part of this thesis evaluates the feasibility of applying such automatic techniques for building and maintaining predictive models for real chemical production processes.