Biogen-sol
Dataset: Download it here.
Dataset description: 2,173 compounds with measured solubility, reported as \(\log S\) (with \(S\) in mol/L), released by Biogen.
Dataset preprocessing
- Download the original dataset from here; it contains 2,173 compounds with solubility values in \(\log \mu \text{g}/\text{mL}\);
- Use RDKit to convert the SMILES strings to their canonical forms (most are already canonical);
- Convert the unit from \(\log \mu \text{g}/\text{mL}\) to \(\log S\) (with \(S\) in mol/L) using \(\log S = \log_{10} \left( \frac{10^x}{\text{mw}} \cdot 1000 \cdot 10^{-6} \right)\), where \(x\) is the original value in \(\log \mu \text{g}/\text{mL}\) and mw is the molecular weight (g/mol) calculated using RDKit. Equivalently, \(\log S = x - \log_{10}(\text{mw}) - 3\).
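The preprocessing steps above can be sketched in Python as follows. This is a minimal illustration, not the original preprocessing script: the function names are hypothetical, and the use of `Descriptors.MolWt` (average molecular weight) is an assumption, since the description only says the molecular weight is calculated with RDKit. The conversion uses the closed form of the formula above: multiplying by \(10^{-6}\) turns µg into g, multiplying by 1000 turns per-mL into per-L, and dividing by mw turns g/L into mol/L, which leaves \(x - \log_{10}(\text{mw}) - 3\).

```python
import math


def log_ug_per_ml_to_log_molar(x: float, mw: float) -> float:
    """Convert log10(solubility in ug/mL) to log10(solubility in mol/L).

    x  : original value, log10 of the solubility in ug/mL
    mw : molecular weight in g/mol
    """
    if mw <= 0:
        raise ValueError("molecular weight must be positive")
    # log10(10**x / mw * 1000 * 1e-6) simplifies to x - log10(mw) - 3.
    return x - math.log10(mw) - 3.0


def canonicalize_and_weigh(smiles: str):
    """Return (canonical SMILES, molecular weight) computed with RDKit.

    Hypothetical helper. RDKit is imported inside the function so the
    unit conversion above can be used without RDKit installed.
    """
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    # MolToSmiles returns a canonical SMILES string by default.
    # Assumption: average molecular weight (MolWt), not exact mass.
    return Chem.MolToSmiles(mol), Descriptors.MolWt(mol)
```

As a sanity check, a compound with mw = 100 g/mol and an original value of x = 2 (that is, 100 µg/mL, or 0.1 g/L) corresponds to \(10^{-3}\) mol/L, so `log_ug_per_ml_to_log_molar(2.0, 100.0)` gives \(-3\).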
Reference
- C. Fang, Y. Wang, R. Grater, S. Kapadnis, C. Black, P. Trapa, and S. Sciabola, Prospective validation of machine learning algorithms for absorption, distribution, metabolism, and excretion prediction: An industrial perspective, Journal of Chemical Information and Modeling 63, 3263 (2023).