NCATS-sol
Dataset: Download it here.
Dataset description: 2,453 compounds with binary labels indicating whether each compound has low solubility.
Dataset preprocessing
- Download the original dataset from here, which contains 2,532 records;
- Drop one compound (2 rows) with inconsistent outcomes;
| PUBCHEM_CID | PUBCHEM_EXT_DATASOURCE_SMILES | PUBCHEM_ACTIVITY_OUTCOME | Phenotype | Analysis Comment |
|---|---|---|---|---|
| 661788 | CC(=O)NC1=CC=C(C=C1)OCC2=C(C=CC(=C2)C3NC4=CC=CC=C4C(=O)N3CC5=CC=CO5)OC | Active | Moderate/High | class = 0 |
| 661788 | CC(=O)NC1=CC=C(C=C1)OCC2=C(C=CC(=C2)C3NC4=CC=CC=C4C(=O)N3CC5=CC=CO5)OC | Inactive | Low | class = 1 |
- Drop one duplicated row;
| PUBCHEM_CID | PUBCHEM_EXT_DATASOURCE_SMILES | PUBCHEM_ACTIVITY_OUTCOME | Phenotype | Analysis Comment |
|---|---|---|---|---|
| 135422895 | CN=C1CN=C(C2=C(N1)C=CC(=C2)Cl)C3=CC=CN3 | Active | Moderate/High | class = 0 |
| 135422895 | CN=C1CN=C(C2=C(N1)C=CC(=C2)Cl)C3=CC=CN3 | Active | Moderate/High | class = 0 |
- Drop 76 compounds (rows) with an inconclusive outcome;
- Generate a new column `low_solubility` based on the `Analysis Comment` column: the `Low` phenotype is mapped to the positive class, and `Moderate/High` to the negative class;
- Use RDKit to transform the SMILES to their canonical forms.
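The preprocessing steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the script actually used to build the dataset. It assumes that each record is a dict keyed by the column names shown in the tables, that inconclusive records carry the value `Inconclusive` in `PUBCHEM_ACTIVITY_OUTCOME`, and that the label can be read from the `Phenotype` value (which matches `class = 1` for `Low` in the example rows). The function names `preprocess` and `canonicalize` and the output key `canonical_smiles` are hypothetical.

```python
from collections import defaultdict

# Column names as they appear in the tables above.
CID = "PUBCHEM_CID"
SMILES = "PUBCHEM_EXT_DATASOURCE_SMILES"
OUTCOME = "PUBCHEM_ACTIVITY_OUTCOME"
PHENOTYPE = "Phenotype"


def canonicalize(smiles):
    """Return the RDKit canonical SMILES for a SMILES string."""
    from rdkit import Chem  # imported lazily; requires RDKit to be installed

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)  # MolToSmiles emits canonical SMILES by default


def preprocess(rows, canonicalizer=canonicalize):
    """Apply the cleaning steps to a list of record dicts (hypothetical helper)."""
    # 1. Drop compounds whose rows disagree on the activity outcome.
    outcomes = defaultdict(set)
    for row in rows:
        outcomes[row[CID]].add(row[OUTCOME])
    rows = [r for r in rows if len(outcomes[r[CID]]) == 1]

    # 2. Drop exact duplicate rows, keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)

    # 3. Drop rows with an inconclusive outcome
    #    (assumption: the value is spelled "Inconclusive").
    kept = [r for r in unique if r[OUTCOME] != "Inconclusive"]

    # 4. Add the binary label: Low -> 1 (positive), Moderate/High -> 0.
    # 5. Add the canonical SMILES under a hypothetical new key.
    return [
        {
            **r,
            "low_solubility": int(r[PHENOTYPE] == "Low"),
            "canonical_smiles": canonicalizer(r[SMILES]),
        }
        for r in kept
    ]
```

The record counts in the list are consistent with these steps: 2,532 − 2 − 1 − 76 = 2,453, which matches the dataset description. The `canonicalizer` argument is only there so that the filtering logic can be exercised without RDKit installed, for example by passing `lambda s: s`.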
Reference
- H. Sun, P. Shah, K. Nguyen, K. R. Yu, E. Kerns, M. Kabir, Y. Wang, and X. Xu, Predictive models of aqueous solubility of organic compounds built on a large dataset of high integrity, Bioorganic & Medicinal Chemistry 27, 3110 (2019).