Authors: Steffen Staab, Saqib Mir, Isabel Rojas
DOI:
Keywords:
Abstract: Wrapper induction techniques traditionally focus on learning wrappers based on examples from one class of Web pages, i.e. pages that are all similar in structure and content. Thereby, traditional wrapper induction targets the understanding of Web pages generated from a database using the same generation template as observed in the example set. Applying such techniques to the Web sites of biological databases, however, we found that there is a need for wrapping structurally diverse Web pages from multiple classes, making the problem more challenging. Furthermore, such scientific Web sites do not just provide mere data, but they also tend to provide schema information in terms of data labels, giving further cues for solving the Web site wrapping task. In this paper we present a novel approach to automatic information extraction from whole Web sites that considers the challenge of multiple page classes and takes advantage of the additional clues commonly available in scientific deep Web databases. The solution consists of a sequence of steps: 1. classification of similar Web pages into classes, 2. discovery of these classes, and 3. wrapper induction for each class. Our approach thus allows us to perform unsupervised information retrieval across an entire Web site. We test our algorithm against three real-world biochemical deep Web sources and report our preliminary results, which are very promising.
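The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not say how pages are classified or how wrappers are induced, so the sketch swaps in two simple heuristics as assumptions. Pages are grouped by Jaccard similarity of their tag-path sets, and any text that recurs at the same tag path on every page of a class is treated as a data label. All names (`classify_pages`, `induce_wrapper`, `extract`, the `threshold` value) are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Records a (tag path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # collected (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 1.0


def classify_pages(pages, threshold=0.6):
    """Step 1 (assumed heuristic): greedily group pages whose tag-path
    sets are similar to a class's first member."""
    classes = []  # list of (representative path set, member page names)
    for name, html in pages.items():
        paths = {path for path, _ in text_nodes(html)}
        for representative, members in classes:
            if jaccard(representative, paths) >= threshold:
                members.append(name)
                break
        else:
            classes.append((paths, [name]))
    return [members for _, members in classes]


def induce_wrapper(pages, members):
    """Steps 2-3 (assumed heuristic): (path, text) pairs shared by every
    page of the class are taken to be labels; everything else is data."""
    per_page = [set(text_nodes(pages[m])) for m in members]
    return set.intersection(*per_page)


def extract(html, labels):
    """Pair each label with the first non-label text that follows it."""
    record, current = {}, None
    for node in text_nodes(html):
        if node in labels:
            current = node[1]
        elif current is not None:
            record[current] = node[1]
            current = None
    return record
```

To use it, pass a dict of page name to HTML string to `classify_pages`, call `induce_wrapper` once per resulting class, and apply `extract` to each page with that class's labels. A class with a single page gives the label heuristic nothing to compare against, so every text on it is treated as a label. A real system would need a more robust signal for such classes.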