Authors: Steve O’Hagan, Douglas B. Kell
Keywords:
Abstract: Armed with the digital availability of two natural products libraries, amounting to some 195 885 molecular entities, we ask the question of how we can best sample from them to maximize their "representativeness" in smaller and more usable libraries of 96, 384, 1152, and 1920 molecules. The term "representativeness" is intended to include diversity, but for numerical reasons (and the likelihood of being able to perform a QSAR) it is necessary to focus on areas of chemical space that are highly populated. Encoding chemical structures as fingerprints using the RDKit "patterned" algorithm, we first assess the granularity of the natural products space using a simple clustering algorithm, showing that there are major regions of "denseness" but also a great many very sparsely populated areas. We then apply a "hybrid" hierarchical K-means clustering algorithm to the data to produce more statistically robust clusters from which representative and appropriate numbers of samples may be chosen. There is necessarily again a trade-off between cluster size and cluster number, but within these constraints, libraries containing 384 or 1152 molecules can be found that come from clusters that represent some 18 and 30% of the whole chemical space, with cluster sizes of, respectively, 50 and 27 or above, just about sufficient to perform a QSAR. By using the online availability of molecules via the Molport system (www.molport.com), we are also able to construct (and, for the first time, provide the contents of) a small virtual library of available molecules that provided effective coverage of the chemical space described. Consistent with this, the average molecular similarities of the contents of the libraries developed are considerably smaller than those of the original libraries. The suggested libraries may have use in molecular or phenotypic screening, including for determining possible transporter substrates.
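The selection idea in the abstract (cluster the fingerprints, keep only clusters large enough to support a QSAR, take a representative from each, then compare average similarity before and after) can be illustrated with a minimal sketch. This is not the authors' pipeline: the fingerprints are hypothetical toy sets of on-bit indices rather than RDKit output, the cluster assignments are given by hand instead of coming from the hybrid hierarchical K-means step, and choosing the medoid as the representative is an assumption made for illustration. All function and variable names are invented here.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def mean_pairwise_similarity(fingerprints):
    """Average Tanimoto similarity over all unordered pairs."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def medoid(members, fps):
    """Member with the highest summed similarity to the rest of its cluster."""
    return max(
        members,
        key=lambda m: sum(tanimoto(fps[m], fps[o]) for o in members if o != m),
    )


def pick_representatives(clusters, fps, min_size):
    """One medoid per cluster, skipping clusters smaller than min_size."""
    return [medoid(c, fps) for c in clusters if len(c) >= min_size]


def coverage(clusters, min_size):
    """Fraction of all molecules that sit in clusters of at least min_size."""
    total = sum(len(c) for c in clusters)
    kept = sum(len(c) for c in clusters if len(c) >= min_size)
    return kept / total if total else 0.0


# Hypothetical toy data: two dense regions and one sparse singleton.
fps = {
    "A1": frozenset({1, 2, 3, 4}),
    "A2": frozenset({1, 2, 3, 5}),
    "A3": frozenset({1, 2, 3, 4, 5}),
    "B1": frozenset({10, 11, 12}),
    "B2": frozenset({10, 11, 13}),
    "B3": frozenset({10, 11, 12, 13}),
    "C1": frozenset({20, 21}),
}
# Assumed cluster assignments (in practice these would come from clustering).
clusters = [["A1", "A2", "A3"], ["B1", "B2", "B3"], ["C1"]]

picked = pick_representatives(clusters, fps, min_size=2)
library_similarity = mean_pairwise_similarity(list(fps.values()))
picked_similarity = mean_pairwise_similarity([fps[name] for name in picked])
covered = coverage(clusters, min_size=2)
```

In this toy case the minimum cluster size plays the role of the 50 or 27 molecule thresholds mentioned in the abstract: raising it yields fewer, denser clusters and lower coverage, which is the trade-off between cluster size and cluster number. With real structures, the bit sets would presumably be derived from RDKit pattern fingerprints rather than written by hand.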