Authors: Saqib Mir, Steffen Staab, Isabel Rojas
DOI: 10.1007/978-3-642-02879-3_9
Keywords:
Abstract: We present a novel approach to automatic information extraction from Deep Web Life Science databases using wrapper induction. Traditional wrapper induction techniques focus on learning wrappers based on examples from one class of Web pages, i.e. from Web pages that are all similar in structure and content. Thereby, traditional wrapper induction targets the understanding of Web pages generated from a database using the same generation template as observed in the example set. However, Life Science Web sites typically contain structurally diverse Web pages from multiple classes, making the problem more challenging. Furthermore, we observed that such Life Science Web sites do not just provide mere data, but they also tend to provide schema information in terms of data labels, giving further cues for solving the Web site wrapping task. Our solution to this novel challenge of Site-Wide wrapper induction consists of a sequence of steps: 1. classification of similar Web pages into classes, 2. discovery of these classes, and 3. wrapper induction for each class. Our approach thus allows us to perform unsupervised information retrieval from across an entire Web site. We test our algorithm against three real-world biochemical deep Web sources and report our preliminary results, which are very promising.
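
The first step named in the abstract, grouping structurally similar pages into classes, can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes that a page's structure can be approximated by its set of root-to-element tag paths and that pages whose path sets have a Jaccard similarity above a threshold belong to the same class. The names `tag_paths`, `jaccard`, `group_pages`, and the threshold `0.7` are all hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in one page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            self.paths.add("/".join(self.stack + [tag]))
            return
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural signature (assumed here: tag paths) of a page."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_pages(pages, threshold=0.7):
    """Greedy grouping: a page joins the first class whose representative
    signature is similar enough, otherwise it starts a new class."""
    classes = []  # list of (representative signature, member page names)
    for name, html in pages.items():
        signature = tag_paths(html)
        for representative, members in classes:
            if jaccard(representative, signature) >= threshold:
                members.append(name)
                break
        else:
            classes.append((signature, [name]))
    return [members for _, members in classes]
```

Calling `group_pages` on a dictionary that maps page names to HTML strings returns a list of page-name lists, one per discovered class. A per-class wrapper (step 3 in the abstract) would then be learned from the members of each list; that step is outside the scope of this sketch.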