Datasets
Note:
- All SMILES strings are canonical.
- The input (“x”) column in every dataset is “canonical_smiles”.
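The two notes above can be sketched as loading code. This is a minimal illustration that assumes each dataset is distributed as a CSV file with a header row. The `load_xy` helper and the demo rows are hypothetical. Only the `canonical_smiles` column name and the `logS` target name come from the tables in this document.

```python
import csv
import io


def load_xy(csv_file, y_column, x_column="canonical_smiles"):
    """Return (SMILES list, target list) from an open CSV file.

    Hypothetical helper for illustration; not part of the datasets' tooling.
    """
    smiles, targets = [], []
    for row in csv.DictReader(csv_file):
        x, y = row[x_column], row[y_column]
        if not x or not y:
            continue  # skip rows with a missing structure or label
        smiles.append(x)
        targets.append(float(y))
    return smiles, targets


# In-memory stand-in for a real dataset file (made-up values).
demo_csv = io.StringIO(
    "canonical_smiles,logS\n"
    "CCO,0.5\n"
    "c1ccccc1,-1.6\n"
    "CCCCCC,\n"
)
smiles, targets = load_xy(demo_csv, y_column="logS")
print(smiles, targets)
```

For a file on disk, the same helper would be called with `open(path, newline="")` in place of the `StringIO` object. Binary targets such as `low_solubility` would load the same way if they are stored as 0/1, which is an assumption here.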
ADME
Aqueous solubility
| Task | \(N\) | y | Dataset | Preprocessing |
|---|---|---|---|---|
| ESOL | 1,084 | logS | here | here |
| EPA-sol | 10,093 | logS | here | here |
| AZ-sol | 1,763 | logS | here | here |
| Biogen-sol | 2,173 | logS | here | here |
| NCATS-sol | 2,453 | low_solubility | here | here |
Lipophilicity
| Task | \(N\) | y | Dataset | Preprocessing |
|---|---|---|---|---|
| AZ-lipo | 4,195 | logD7.4 | here | here |
Permeability
| Task | \(N\) | y | Dataset | Preprocessing |
|---|---|---|---|---|
| CSU-Caco2 | 1,018 | logPapp | here | here |
| USTL-Caco2 | 1,780 | logPapp | here | here |
| Biogen-MDCK | 2,642 | “LOG MDR1-MDCK ER (B-A/A-B)” | here | here |
| NCATS-PAMPA-pH7.4 | 2,033 | low_moderate_permeability | here | |
| NCATS-PAMPA-pH5 | 486 | low_permeability | here | |
Note:
Plasma protein binding (PPB)
| Task | \(N\) | y | Dataset | Preprocessing |
|---|---|---|---|---|
| AZ-rPPB | 717 | log_pct_unbound | here | |
| Biogen-rPPB | 168 | “LOG PLASMA PROTEIN BINDING (RAT) (% unbound)” | here | here |
| AZ-dPPB | 244 | log_pct_unbound | here | |
| AZ-mPPB | 162 | log_pct_unbound | here | |
| AZ-hPPB | 1,614 | log_pct_unbound | here | here |
| Biogen-hPPB | 194 | “LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound)” | here | here |
Hepatocyte stability
| Task | \(N\) | y | Dataset | Preprocessing |
|---|---|---|---|---|
| AZ-rH | 837 | “LOG RH_CLint (uL/min/1E6 cells)” | here | |
| AZ-hH | 407 | “LOG HH_CLint (uL/min/1E6 cells)” | here | |
Liver microsomal stability
| Task | \(N\) | y | Dataset | Preprocessing |
|---|---|---|---|---|
| NCATS-rLM | 2,528 | unstable | here | |
| Biogen-rLM | 3,054 | “LOG RLM_CLint (mL/min/kg)” | here | here |
| AZ-hLM | 1,102 | “LOG HLM_CLint (mL/min/g)” | here | |
| Biogen-HLM | 3,087 | “LOG HLM_CLint (mL/min/kg)” | here | here |
CYP450 interactions
| Task | \(N\) | y | Dataset |
|---|---|---|---|
| CYP1A2_CHEMBL1741322 | 9,600 | pchembl_value | here |
| CYP2C9_CHEMBL1614027 | 2,898 | pchembl_value | here |
| CYP2C9_CHEMBL1741325 | 7,220 | pchembl_value | here |
| CYP2C19_CHEMBL1613777 | 3,518 | pchembl_value | here |
| CYP2C19_CHEMBL1741323 | 8,850 | pchembl_value | here |
| CYP2D6_CHEMBL1614110 | 3,343 | pchembl_value | here |
| CYP2D6_CHEMBL1741321 | 5,461 | pchembl_value | here |
| CYP3A4_CHEMBL1613886 | 6,471 | pchembl_value | here |
| CYP3A4_CHEMBL1614108 | 6,471 | pchembl_value | here |
| CYP3A4_CHEMBL1741324 | 8,628 | pchembl_value | here |
Note:
- See here for the data extraction steps;
- What is pChEMBL?
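For orientation, ChEMBL defines pChEMBL as the negative base-10 logarithm of a molar half-maximal response, potency, or affinity value (IC50, XC50, EC50, AC50, Ki, Kd, or Potency). The sketch below shows the conversion for a value given in nanomolar. The function names are hypothetical and are not taken from the data extraction steps.

```python
import math


def pchembl_from_nm(value_nm):
    """Convert a potency in nanomolar to the -log10(molar) scale."""
    if value_nm <= 0:
        raise ValueError("potency must be positive")
    # 1 nM = 1e-9 M, so -log10(value_nm * 1e-9) = 9 - log10(value_nm).
    return 9.0 - math.log10(value_nm)


def nm_from_pchembl(pchembl):
    """Invert the conversion: return the potency in nanomolar."""
    return 10.0 ** (9.0 - pchembl)


ic50_nm = 100.0  # made-up example value
p = pchembl_from_nm(ic50_nm)
print(p, nm_from_pchembl(p))
```

On this scale a larger `pchembl_value` means a more potent compound, and a one-unit increase corresponds to a tenfold lower concentration.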
Acute toxicity
NCATS-LD50
| Task | \(N\) | y | Dataset |
|---|---|---|---|
| rat-SC | 1,886 | “rat_subcutaneous_LD50_(−log(mol/kg))” | here |
| rat-IV | 2,464 | “rat_intravenous_LD50_(−log(mol/kg))” | here |
| rat-IP | 5,001 | “rat_intraperitoneal_LD50_(−log(mol/kg))” | here |
| rat-oral | 10,151 | “rat_oral_LD50_(−log(mol/kg))” | here |
| mouse-SC | 6,754 | “mouse_subcutaneous_LD50_(−log(mol/kg))” | here |
| mouse-IV | 16,967 | “mouse_intravenous_LD50_(−log(mol/kg))” | here |
| mouse-IP | 36,267 | “mouse_intraperitoneal_LD50_(−log(mol/kg))_(−log(mol/kg))” | here |
| mouse-oral | 23,350 | “mouse_oral_LD50_(−log(mol/kg))” | here |
Note:
- See here for the data extraction steps;
- SC: subcutaneous; IV: intravenous; IP: intraperitoneal
- LD50: median lethal dose
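To relate the LD50 target columns to the more familiar mg/kg dose, here is a minimal sketch. It assumes the targets are the negative base-10 logarithm of the dose in mol/kg; the log base is not stated in this document. The function name and the example numbers are hypothetical.

```python
import math


def ld50_to_neg_log_mol_per_kg(ld50_mg_per_kg, mol_weight_g_per_mol):
    """Convert an LD50 in mg/kg to -log10(mol/kg).

    Assumes a base-10 logarithm; the source tables only say "log".
    """
    if ld50_mg_per_kg <= 0 or mol_weight_g_per_mol <= 0:
        raise ValueError("dose and molecular weight must be positive")
    # mg/kg -> g/kg -> mol/kg
    mol_per_kg = ld50_mg_per_kg / 1000.0 / mol_weight_g_per_mol
    return -math.log10(mol_per_kg)


# Made-up compound: LD50 of 200 mg/kg, molecular weight of 100 g/mol.
example = ld50_to_neg_log_mol_per_kg(200.0, 100.0)
print(round(example, 3))
```

Under this convention a larger value corresponds to a smaller lethal dose, that is, a more toxic compound.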