USTL (University of Science and Technology Liaoning)-Caco2
Dataset: Download it here.
Dataset description: A curated dataset containing 1,780 compounds and their experimental \(\log P_{\text{app}}\) values.
Dataset preprocessing
- Download the original dataset, which contains 1,827 compounds, from here;
- Drop 2 duplicate rows (one from each of the pairs below):
| logPapp | SMILES |
|---|---|
| -5.521434 | ONC(=O)\C=C\c1ccc2c(c1)nc(CCc3ccccc3)n2CCN4CCCC4 |
| -5.521434 | ONC(=O)\C=C\c1ccc2c(c1)nc(CCc3ccccc3)n2CCN4CCCC4 |
| -5.100000 | O1c2c(C(=O)C=C1c1cc(OC)c(OC)c(OC)c1)c(O)cc(O)c2 |
| -5.100000 | O1c2c(C(=O)C=C1c1cc(OC)c(OC)c(OC)c1)c(O)cc(O)c2 |
- Use RDKit to transform the SMILES to their canonical forms;
- For the 28 compounds that each have two different \(\log P_{\text{app}}\) values, drop the 17 compounds (34 rows) whose two values differ by more than 0.1:
| logPapp | canonical_smiles |
|---|---|
| -4.590067 | CC(C(=O)O)c1cccc(C(=O)c2ccccc2)c1 |
| -4.707191 | CC(C(=O)O)c1cccc(C(=O)c2ccccc2)c1 |
| -5.657577 | CCN(CC)CCCC(C)Nc1ccnc2cc(Cl)ccc12 |
| -4.550000 | CCN(CC)CCCC(C)Nc1ccnc2cc(Cl)ccc12 |
| -5.522879 | CC[C@H]1CN2CCc3cc(OC)c(OC)cc3[C@@H]2C[C@@H]1C[C@H]1NCCc2cc(OC)c(OC)cc21 |
| -5.690000 | CC[C@H]1CN2CCc3cc(OC)c(OC)cc3[C@@H]2C[C@@H]1C[C@H]1NCCc2cc(OC)c(OC)cc21 |
| -3.900665 | CN(C)CCC=C1c2ccccc2CCc2ccccc21 |
| -4.260000 | CN(C)CCC=C1c2ccccc2CCc2ccccc21 |
| -4.724059 | CN1C2CC(OC(=O)C(CO)c3ccccc3)CC1C1OC12 |
| -4.928118 | CN1C2CC(OC(=O)C(CO)c3ccccc3)CC1C1OC12 |
| -5.610834 | CNCCC(Oc1ccc(C(F)(F)F)cc1)c1ccccc1 |
| -4.929962 | CNCCC(Oc1ccc(C(F)(F)F)cc1)c1ccccc1 |
| -4.535436 | COCCc1ccc(OCC(O)CNC(C)C)cc1 |
| -4.751046 | COCCc1ccc(OCC(O)CNC(C)C)cc1 |
| -4.610000 | COc1cc2nc(N3CCN(C(=O)C4CCCO4)CC3)nc(N)c2cc1OC |
| -5.090444 | COc1cc2nc(N3CCN(C(=O)C4CCCO4)CC3)nc(N)c2cc1OC |
| -4.644357 | COc1ccc(CCN(C)CCCC(C#N)(c2ccc(OC)c(OC)c2)C(C)C)cc1OC |
| -5.013228 | COc1ccc(CCN(C)CCCC(C#N)(c2ccc(OC)c(OC)c2)C(C)C)cc1OC |
| -6.031517 | CC@@H[C@@H]1NC(=O)C@HNC(=O)C@@HNC(=O)C@HNC(=O)[C@@H]2CCCN2C(=O)C@HNC1=O |
| -6.220924 | CC@@H[C@@H]1NC(=O)C@HNC(=O)C@@HNC(=O)C@HNC(=O)[C@@H]2CCCN2C(=O)C@HNC1=O |
| -4.720000 | C[C@]12C=CC(=O)C=C1CC[C@@H]1[C@@H]2C@@HC[C@@]2(C)[C@H]1CC[C@]2(O)C(=O)CO |
| -5.124939 | C[C@]12C=CC(=O)C=C1CC[C@@H]1[C@@H]2C@@HC[C@@]2(C)[C@H]1CC[C@]2(O)C(=O)CO |
| -5.468521 | Cc1c(O)cccc1C(=O)NC@@HC@HCN1C[C@H]2CCCC[C@H]2C[C@H]1C(=O)NC(C)(C)C |
| -6.124939 | Cc1c(O)cccc1C(=O)NC@@HC@HCN1C[C@H]2CCCC[C@H]2C[C@H]1C(=O)NC(C)(C)C |
| -5.279840 | Cc1ccccc1N1C(=O)c2cc(S(N)(=O)=O)c(Cl)cc2NC1C |
| -5.540608 | Cc1ccccc1N1C(=O)c2cc(S(N)(=O)=O)c(Cl)cc2NC1C |
| -4.572189 | Cn1c(=O)c2c(ncn2C)n(C)c1=O |
| -4.345198 | Cn1c(=O)c2c(ncn2C)n(C)c1=O |
| -4.680000 | NC(N)=N/N=C/c1c(Cl)cccc1Cl |
| -4.363412 | NC(N)=N/N=C/c1c(Cl)cccc1Cl |
| -6.059484 | O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O |
| -6.474235 | O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O |
| -7.327902 | OC[C@H]1OC@HC@HC@@H[C@H]1O |
| -7.620000 | OC[C@H]1OC@HC@HC@@H[C@H]1O |
- For the remaining 11 compounds, replace the two \(\log P_{\text{app}}\) values with their average:
| logPapp | canonical_smiles |
|---|---|
| -4.667562 | C=CCN1CC[C@]23c4c5ccc(O)c4O[C@H]2C(=O)CC[C@@]3(O)[C@H]1C5 |
| -4.670000 | C=CCN1CC[C@]23c4c5ccc(O)c4O[C@H]2C(=O)CC[C@@]3(O)[C@H]1C5 |
| -3.775545 | CC(CN1c2ccccc2Sc2ccccc21)N(C)C |
| -3.777772 | CC(CN1c2ccccc2Sc2ccccc21)N(C)C |
| -5.468521 | CC[C@H]1OC(=O)C@HC@@HC@HC@@HC@(OC)CC@@HC(=O)C@HC@@H[C@]1(C)O |
| -5.470000 | CC[C@H]1OC(=O)C@HC@@HC@HC@@HC@(OC)CC@@HC(=O)C@HC@@H[C@]1(C)O |
| -4.170696 | COc1ccc2[nH]c(S(=O)Cc3ncc(C)c(OC)c3C)nc2c1 |
| -4.170000 | COc1ccc2[nH]c(S(=O)Cc3ncc(C)c(OC)c3C)nc2c1 |
| -4.906578 | Cc1ccc(-c2cc(C(F)(F)F)nn2-c2ccc(S(N)(=O)=O)cc2)cc1 |
| -4.860121 | Cc1ccc(-c2cc(C(F)(F)F)nn2-c2ccc(S(N)(=O)=O)cc2)cc1 |
| -5.890000 | NS(=O)(=O)c1cc2c(cc1C(F)(F)F)NC(Cc1ccccc1)NS2(=O)=O |
| -5.886056 | NS(=O)(=O)c1cc2c(cc1C(F)(F)F)NC(Cc1ccccc1)NS2(=O)=O |
| -6.055517 | Nc1nc(O)c2ncn([C@@H]3CC@H[C@H]3CO)c2n1 |
| -6.055517 | Nc1nc(O)c2ncn([C@@H]3CC@H[C@H]3CO)c2n1 |
| -5.722753 | O=C(c1ccc(OCCN2CCCCC2)cc1)c1c(-c2ccc(O)cc2)sc2cc(O)ccc12 |
| -5.645507 | O=C(c1ccc(OCCN2CCCCC2)cc1)c1c(-c2ccc(O)cc2)sc2cc(O)ccc12 |
| -6.368266 | O=c1cc(-c2ccc(O)c(O)c2)oc2cc(O[C@@H]3OC@HC@@HC@H[C@H]3O)cc(O)c12 |
| -6.370000 | O=c1cc(-c2ccc(O)c(O)c2)oc2cc(O[C@@H]3OC@HC@@HC@H[C@H]3O)cc(O)c12 |
| -6.545757 | OC[C@H]1OC@@HC@HC@@H[C@H]1O |
| -6.568636 | OC[C@H]1OC@@HC@HC@@H[C@H]1O |
| -5.102373 | Oc1ccc(-c2cnc(-c3cccc(O)c3)o2)cc1 |
| -5.101186 | Oc1ccc(-c2cnc(-c3cccc(O)c3)o2)cc1 |
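
The preprocessing steps above can be sketched in Python as follows. This is a minimal illustration, not the code used to build the dataset: the function names `preprocess` and `rdkit_canonical`, the `(logPapp, smiles)` row layout, and the `tol` parameter are assumptions made for the example. The 0.1 threshold comes from the step description, and canonicalization uses RDKit's `Chem.MolFromSmiles` and `Chem.MolToSmiles`.

```python
from collections import defaultdict
from statistics import mean


def rdkit_canonical(smiles):
    """Return the RDKit canonical SMILES for one input SMILES string."""
    from rdkit import Chem  # imported lazily; requires RDKit to be installed

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)


def preprocess(rows, canonicalize=rdkit_canonical, tol=0.1):
    """Clean an iterable of (logPapp, smiles) pairs (hypothetical layout).

    Returns a list of (logPapp, canonical_smiles) pairs, one per compound.
    """
    # Step 1: drop exact duplicate rows, keeping the first occurrence.
    unique_rows = list(dict.fromkeys((float(v), s) for v, s in rows))

    # Step 2: group the measured values by canonical SMILES.
    groups = defaultdict(list)
    for value, smiles in unique_rows:
        groups[canonicalize(smiles)].append(value)

    cleaned = []
    for smiles, values in groups.items():
        # Step 3: drop compounds whose values disagree by more than tol.
        if max(values) - min(values) > tol:
            continue
        # Step 4: average the remaining values (single values pass through).
        cleaned.append((mean(values), smiles))
    return cleaned
```

With RDKit installed, `preprocess(rows)` uses the default canonicalizer; passing `canonicalize=lambda s: s` skips canonicalization for a quick check without RDKit. The row counts in the steps are consistent with the final dataset size: 1,827 − 2 − 34 − 11 = 1,780.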
Reference
- Y. Wang and X. Chen, QSPR model for Caco-2 cell permeability prediction using a combination of HQPSO and dual-RBF neural network, RSC Advances 10, 42938 (2020).