
Current Medicinal Chemistry


ISSN (Print): 0929-8673
ISSN (Online): 1875-533X

Review Article

VSPrep: A KNIME Workflow for the Preparation of Molecular Databases for Virtual Screening

Author(s): José-Manuel Gally, Stéphane Bourg, Jade Fogha, Quoc-Tuan Do, Samia Aci-Sèche and Pascal Bonnet*

Volume 27, Issue 38, 2020

Pages: 6480-6494 (15 pages)

DOI: 10.2174/0929867326666190614160451


Drug discovery is a challenging and expensive field. Hence, novel in silico tools have been developed for the early discovery stage to identify and prioritize novel molecules with suitable physicochemical properties. In many in silico drug design projects, molecular databases are screened with virtual screening tools to search for potentially bioactive molecules. The preparation of the molecules is therefore a key step in the success of well-established techniques such as docking, similarity searching or pharmacophore searching. We review here several toolkits used at different steps of the cleaning of molecular databases, integrated within a KNIME workflow. During the first step of the automatic workflow, salts are removed and mixtures are split to obtain one compound per entry. Compounds with unwanted features are then filtered out. Duplicated entries are then deleted, taking stereochemistry into account. As a compromise between exhaustiveness and computational time, only the most populated tautomers at physiological pH are computed. Additionally, various flags are applied to molecules using classical molecular descriptors, similarity searches against known libraries or substructure search rules. Moreover, stereoisomers are enumerated for unassigned chiral centers. Then, three-dimensional coordinates, and optionally conformers, are generated. This workflow has already been applied to several drug design projects and can be used for molecular database preparation upon request.
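The first steps described above (salt removal, mixture splitting and stereo-aware deduplication) can be illustrated with a minimal, standard-library Python sketch. This is not the VSPrep implementation, which is a KNIME workflow built on cheminformatics toolkits. The counter-ion list `COUNTER_IONS` and the function names are hypothetical, and the sketch assumes the input is already written as canonical isomeric SMILES, so that identical strings mean identical structures and stereoisomers have different strings.

```python
from typing import Iterable, List

# Hypothetical, deliberately tiny counter-ion list (an assumption for this
# sketch); a real workflow would rely on a curated salt dictionary.
COUNTER_IONS = {"[Na+]", "[K+]", "[Cl-]", "[Br-]", "Cl", "Br"}


def split_and_desalt(smiles: str) -> List[str]:
    """Split a dot-separated SMILES into its components and drop counter-ions.

    In SMILES, '.' separates disconnected components, so a salt or a mixture
    is written as several fragments joined by dots.
    """
    return [part for part in smiles.split(".") if part and part not in COUNTER_IONS]


def deduplicate(smiles_list: Iterable[str]) -> List[str]:
    """Drop exact duplicates while keeping the first occurrence.

    Because the input is assumed to be canonical isomeric SMILES, two
    enantiomers have different strings and are both kept.
    """
    seen = set()
    unique = []
    for smi in smiles_list:
        if smi not in seen:
            seen.add(smi)
            unique.append(smi)
    return unique


def prepare(entries: Iterable[str]) -> List[str]:
    """Return one compound per entry: desalted, split, then deduplicated."""
    components = []
    for entry in entries:
        components.extend(split_and_desalt(entry))
    return deduplicate(components)


if __name__ == "__main__":
    database = [
        "CC(=O)[O-].[Na+]",    # salt: the sodium counter-ion is removed
        "C[C@H](N)C(=O)O",     # one enantiomer of alanine
        "C[C@@H](N)C(=O)O",    # the other enantiomer, kept as a distinct entry
        "C[C@H](N)C(=O)O",     # exact duplicate, removed
        "CCO.CCN",             # mixture, split into two entries
    ]
    for smi in prepare(database):
        print(smi)
```

The sketch leaves out everything that needs real chemical perception: charge neutralization, canonicalization, tautomer and stereoisomer enumeration, descriptor-based flags and three-dimensional coordinate generation all require a cheminformatics toolkit.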

Keywords: VSPrep, chemoinformatics, molecular databases, preparation, workflow, virtual screening, KNIME.

