Authors: Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo
Keywords: Web page, Computer science, Security token, Information retrieval, Roadrunner, Iterative and incremental development, Data extraction
Abstract: Data extraction from HTML pages is performed by software modules, usually called wrappers. Roughly speaking, a wrapper identifies and extracts relevant pieces of text inside a web page, and reorganizes them in a more structured format. In the literature there are a number of systems to (semi-)automatically generate wrappers for HTML pages [1]. We have recently investigated original approaches that aim at pushing further the level of automation of the wrapper generation process. Our main intuition is that, in a data-intensive web site, pages can be classified in a small number of classes, such that pages belonging to the same class share a rather tight structure. Based on this observation, we have studied a novel technique, which we call the matching technique [2], that automatically generates a common wrapper by exploiting similarities and differences among pages of the same class. In addition, in order to deal with the complexity and the heterogeneities of real-life web sites, we have also studied several complementary techniques that greatly enhance the effectiveness of matching. Our demonstration presents RoadRunner, our prototype that implements matching and its companion techniques. We have conducted several experiments on pages from real-life web sites; these experiences have shown the effectiveness of the approach, as well as the efficiency of the system [2]. The matching technique for wrapper inference [2] is based on an iterative process; at every step, matching works on two objects at a time: (i) an input page, which is represented as a list of tokens (each token is either a tag or a text field), and (ii) a wrapper, expressed as a regular expression. The process starts by taking one input page as an initial version of the wrapper; then, the wrapper is matched against the sample and it is progressively refined trying to solve mismatches: a mismatch happens when some token in the input sample does not comply with the grammar specified by the wrapper. Mismatches can be solved by generalizing the wrapper. The matching process succeeds if a common wrapper can be generated by solving all mismatches encountered.
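The iterative refinement described in the abstract can be illustrated with a deliberately reduced sketch. It is not the RoadRunner implementation: it assumes pages with identical tag structure and solves only mismatches between two text tokens, by generalizing them to a field. Mismatches that involve tags are reported as a failure here. The names `tokenize`, `refine`, `infer_wrapper`, and the `#PCDATA` field marker are hypothetical choices made for this example.

```python
import re

FIELD = "#PCDATA"  # assumed marker for a generalized text field


def tokenize(html):
    """Split HTML into a list of tokens; each token is a tag or a text field."""
    parts = re.findall(r"<[^>]+>|[^<]+", html)
    return [p.strip() for p in parts if p.strip()]


def is_tag(token):
    return token.startswith("<") and token.endswith(">")


def refine(wrapper, page_tokens):
    """Match the wrapper against one page and generalize text mismatches.

    Returns the refined wrapper, or None when a mismatch involves a tag
    (this sketch does not attempt to solve those).
    """
    if len(wrapper) != len(page_tokens):
        return None  # differing structure implies a tag mismatch
    refined = []
    for w, t in zip(wrapper, page_tokens):
        if w == t:
            refined.append(w)        # tokens agree, keep as is
        elif not is_tag(w) and not is_tag(t):
            refined.append(FIELD)    # text mismatch: generalize to a field
        else:
            return None              # tag mismatch: unsolved in this sketch
    return refined


def infer_wrapper(pages):
    """Use the first page as the initial wrapper, then refine it page by page."""
    wrapper = tokenize(pages[0])
    for page in pages[1:]:
        wrapper = refine(wrapper, tokenize(page))
        if wrapper is None:
            return None
    return wrapper


pages = [
    "<html><b>Title:</b> Dune <i>1965</i></html>",
    "<html><b>Title:</b> Emma <i>1815</i></html>",
]
print(infer_wrapper(pages))
```

In this example the shared label `Title:` survives as a constant, while the two positions where the pages differ become fields. A fuller treatment would also have to generalize the wrapper when tags mismatch, which this sketch leaves out.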