NCATS-sol

Dataset: Download it here.

Dataset description: 2,453 compounds with binary labels indicating whether each compound has low solubility (class = 1 for low, class = 0 for moderate/high).

Dataset preprocessing

| PUBCHEM_CID | PUBCHEM_EXT_DATASOURCE_SMILES | PUBCHEM_ACTIVITY_OUTCOME | Phenotype | Analysis Comment |
| --- | --- | --- | --- | --- |
| 661788 | CC(=O)NC1=CC=C(C=C1)OCC2=C(C=CC(=C2)C3NC4=CC=CC=C4C(=O)N3CC5=CC=CO5)OC | Active | Moderate/High | class = 0 |
| 661788 | CC(=O)NC1=CC=C(C=C1)OCC2=C(C=CC(=C2)C3NC4=CC=CC=C4C(=O)N3CC5=CC=CO5)OC | Inactive | Low | class = 1 |

| PUBCHEM_CID | PUBCHEM_EXT_DATASOURCE_SMILES | PUBCHEM_ACTIVITY_OUTCOME | Phenotype | Analysis Comment |
| --- | --- | --- | --- | --- |
| 135422895 | CN=C1CN=C(C2=C(N1)C=CC(=C2)Cl)C3=CC=CN3 | Active | Moderate/High | class = 0 |
| 135422895 | CN=C1CN=C(C2=C(N1)C=CC(=C2)Cl)C3=CC=CN3 | Active | Moderate/High | class = 0 |
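
In the two excerpts above, the same PUBCHEM_CID appears twice: once with conflicting labels (CID 661788) and once with identical labels (CID 135422895). The following is a minimal sketch of how such rows could be detected, using only the Python standard library. The handling rule is an assumption, not something this page states: the sketch sets aside CIDs with conflicting labels and keeps a single copy of exact duplicates. The names `label_of` and `split_duplicates` are hypothetical helpers introduced for illustration.

```python
from collections import defaultdict

# Rows copied from the excerpts above: (PUBCHEM_CID, SMILES, Phenotype).
ROWS = [
    (661788, "CC(=O)NC1=CC=C(C=C1)OCC2=C(C=CC(=C2)C3NC4=CC=CC=C4C(=O)N3CC5=CC=CO5)OC", "Moderate/High"),
    (661788, "CC(=O)NC1=CC=C(C=C1)OCC2=C(C=CC(=C2)C3NC4=CC=CC=C4C(=O)N3CC5=CC=CO5)OC", "Low"),
    (135422895, "CN=C1CN=C(C2=C(N1)C=CC(=C2)Cl)C3=CC=CN3", "Moderate/High"),
    (135422895, "CN=C1CN=C(C2=C(N1)C=CC(=C2)Cl)C3=CC=CN3", "Moderate/High"),
]


def label_of(phenotype: str) -> int:
    """Map the Phenotype column to the binary class (1 = low solubility)."""
    return 1 if phenotype == "Low" else 0


def split_duplicates(rows):
    """Group rows by CID and separate consistent from conflicting entries.

    Returns (clean, conflicting):
      clean       -- {cid: (smiles, label)} for CIDs whose rows all agree
      conflicting -- sorted list of CIDs that carry both labels
    """
    labels = defaultdict(set)
    smiles = {}
    for cid, smi, phenotype in rows:
        labels[cid].add(label_of(phenotype))
        smiles.setdefault(cid, smi)  # keep the first SMILES seen per CID

    clean = {
        cid: (smiles[cid], next(iter(seen)))
        for cid, seen in labels.items()
        if len(seen) == 1
    }
    conflicting = sorted(cid for cid, seen in labels.items() if len(seen) > 1)
    return clean, conflicting


clean, conflicting = split_duplicates(ROWS)
print(conflicting)  # [661788]
print(clean)        # {135422895: ('CN=C1CN=C(C2=C(N1)C=CC(=C2)Cl)C3=CC=CN3', 0)}
```

Whether conflicting compounds should be dropped, relabeled, or resolved some other way depends on the preprocessing actually used for this dataset; the sketch only shows how to find them.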

Reference

  1. H. Sun, P. Shah, K. Nguyen, K. R. Yu, E. Kerns, M. Kabir, Y. Wang, and X. Xu, Predictive models of aqueous solubility of organic compounds built on a large dataset of high integrity, Bioorganic & Medicinal Chemistry 27, 3110 (2019).