An Evaluation of Text Retrieval Methods for Similarity Search of Multi-dimensional NMR-Spectra

Authors: Alexander Hinneburg, Andrea Porzel, Karina Wolfram

DOI: 10.1007/978-3-540-71233-6_33

Keywords: NMR spectra database, Mathematics, Vector space model, Task (computing), Function (mathematics), Nearest neighbor search, Information retrieval, Probabilistic latent semantic analysis, Similarity (geometry), Simple (abstract algebra)

Abstract: Searching and mining nuclear magnetic resonance (NMR) spectra of naturally occurring substances is an important task in investigating new, potentially useful chemical compounds. Multi-dimensional NMR spectra are relational objects like documents, but consist of continuous multi-dimensional points called peaks instead of words. We develop several mappings from continuous NMR spectra to discrete text-like data. With the help of those mappings, any text retrieval method can be applied. We evaluate the performance of two retrieval methods, namely the standard vector space model and probabilistic latent semantic indexing (PLSI). PLSI learns hidden topics in the data, which in the case of 2D-NMR data is interesting in its own right. Additionally, we develop and evaluate a simple direct similarity function, which can detect duplicates of NMR spectra. Our experiments show that the vector space model as well as PLSI, both designed for text data created by humans, can effectively handle the mapped NMR data originating from natural products. Additionally, PLSI is able to find meaningful "topics" in the NMR data.
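The abstract's central idea, turning continuous peaks into discrete "words" so that a text retrieval model can compare spectra, can be illustrated with a minimal sketch. It is not the authors' mapping: the fixed grid, the cell widths `c_step` and `h_step`, the token format, and the function names `peaks_to_words` and `cosine_similarity` are all assumptions made for illustration. Each 2D peak is taken to be a (carbon shift, hydrogen shift) pair in ppm.

```python
import math
from collections import Counter


def peaks_to_words(peaks, c_step=4.0, h_step=0.4):
    """Map each (carbon_ppm, hydrogen_ppm) peak to a grid-cell token.

    The cell widths are illustrative assumptions, not values from the paper.
    """
    return [
        f"c{math.floor(c / c_step)}_h{math.floor(h / h_step)}"
        for c, h in peaks
    ]


def cosine_similarity(words_a, words_b):
    """Cosine similarity of two bags of words (raw term-frequency vectors)."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two made-up spectra that share two of their three grid cells.
spectrum_a = [(20.1, 1.25), (55.3, 3.70), (128.4, 7.30)]
spectrum_b = [(20.9, 1.30), (55.9, 3.75), (77.0, 4.10)]

words_a = peaks_to_words(spectrum_a)
words_b = peaks_to_words(spectrum_b)
similarity = cosine_similarity(words_a, words_b)
print(words_a, words_b, round(similarity, 3))
```

A hard grid like this puts two nearly identical peaks into different cells whenever they fall on opposite sides of a cell boundary, which is one reason to consider more than one mapping. Term weighting such as tf-idf could replace the raw counts without changing the overall structure.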
