An empirical investigation of virtual screening
Journal: Proceedings - 2013 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2013
Drug discovery relies heavily on data processing. Virtual screening (VS) is a common drug discovery method that examines chemical structures (molecules) to identify those that are likely to bind to a particular drug target. VS can be cast as either a matching problem or a classification problem, and in both cases the quality of the data matters greatly. Two general problems of the chemical datasets used in VS are the number of features (and their properties) and data imbalance. This paper investigates how to deal with these two problems to improve the accuracy of VS and, specifically, to reduce the false positive rate. To balance the data, we use the synthetic minority oversampling technique (SMOTE); to address the choice of features, we investigate different molecular descriptors and fingerprints. A classification approach is used to assess the performance of four chosen classifiers, first individually and then in combination. As an alternative, an instance-based approach is employed to observe its effect on accuracy. Results from the classification approach show that higher accuracy and a lower false positive rate can be achieved by first balancing the datasets with SMOTE and then classifying them. The effects of descriptors and fingerprints on accuracy and false positive rate can only be discussed for each dataset separately. Combining distance matrices of different structural fingerprints does not cause active and similar compounds to appear at the top of the dissimilarity ranking. © 2013 IEEE.
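
The balancing step described in the abstract can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic minority sample is an interpolation between a minority sample and one of its k nearest minority neighbours. This is not the authors' implementation. The function names, the Euclidean distance, the value k=5, and the toy "active" and "inactive" feature vectors are all assumptions made for illustration, and the false positive rate helper simply computes FP / (FP + TN).

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority neighbours (Euclidean distance assumed)."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def false_positive_rate(y_true, y_pred):
    """FP / (FP + TN), with label 1 = active and label 0 = inactive."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) else 0.0


# Hypothetical imbalanced toy data: 10 actives versus 90 inactives,
# each described by 4 made-up numeric features.
data_rng = random.Random(1)
actives = [[data_rng.gauss(2.0, 1.0) for _ in range(4)] for _ in range(10)]
inactives = [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(90)]

# Oversample the actives until both classes have the same size.
synthetic = smote(actives, n_synthetic=len(inactives) - len(actives))
balanced_X = inactives + actives + synthetic
balanced_y = [0] * len(inactives) + [1] * (len(actives) + len(synthetic))
```

The balanced set would then be passed to a classifier, and the false positive rate compared against the rate obtained on the unbalanced data. In practice an established implementation would normally be preferred over a hand-written one like this.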