Data sets and data quality in software engineering: Eight years on

Authors: Liebchen, G. and Shepperd, M.

Start date: 7 September 2016

Journal: Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering

ISBN: 978-1-4503-4772-3

DOI: 10.1145/2972958.2972967

This data was imported from Scopus:

Journal: ACM International Conference Proceeding Series

ISBN: 9781450347723

© 2016 ACM. Context: We revisit our review of data quality within the context of empirical software engineering eight years on from our PROMISE 2008 article. Objective: To assess the extent and types of techniques used to manage quality within data sets. We consider this a particularly interesting question in the context of initiatives to promote sharing and secondary analysis of data sets. Method: We update the 2008 mapping study through four subsequently published reviews and a snowballing exercise. Results: The original study located only 23 articles explicitly considering data quality. This picture has changed substantially, as our updated review now finds 283 articles; however, we estimate that this still represents perhaps 1% of the total empirical software engineering literature. Conclusions: It appears the community is now taking the issue of data quality more seriously, and there is more work exploring techniques to automatically detect (and sometimes repair) noise problems. However, there is still little systematic work to evaluate the various data sets that are widely used for secondary analysis; addressing this would be of considerable benefit. It should also be a priority to work collaboratively with practitioners to add new, higher quality data to the existing corpora.

The data on this page was last updated at 04:57 on May 21, 2019.