Exploring discrepancies in findings obtained with the KDD Cup '99 data set

Authors: Engen, V., Vincent, J. and Phalp, K.T.

Journal: Intelligent Data Analysis

ISSN: 1088-467X

The KDD Cup '99 data set has been widely used to evaluate intrusion detection prototypes, most based on machine learning techniques, for nearly a decade. The data set served well in the KDD Cup '99 competition to demonstrate that machine learning can be useful in intrusion detection systems.

However, there are discrepancies in the findings reported in the literature. Further, some researchers have published criticisms of the data (and the DARPA data from which the KDD Cup '99 data has been derived), questioning the validity of results obtained with this data. Despite the criticisms, researchers continue to use the data due to a lack of better publicly available alternatives. Hence, it is important to identify the value of the data set and the findings from the extensive body of research based on it, which has largely been ignored by the existing critiques. This paper reports on an empirical investigation, demonstrating the impact of several methodological differences in the publicly available subsets, which uncovers several underlying causes of the discrepancy in the results reported in the literature. These findings allow us to better interpret the current body of research, and inform recommendations for future use of the data set.

This data was imported from DBLP:

Authors: Engen, V., Vincent, J. and Phalp, K.

Journal: Intell. Data Anal.

Volume: 15

Pages: 251-276

This data was imported from Scopus:

Authors: Engen, V., Vincent, J. and Phalp, K.

Journal: Intelligent Data Analysis

Volume: 15

Issue: 2

Pages: 251-276

eISSN: 1571-4128

ISSN: 1088-467X

DOI: 10.3233/IDA-2010-0466
