Authors: Sree Harsha Ramesh, Arnab Dhar, Raveena R Kumar, Anjaly V, Sarath KS
DOI: 10.1007/978-3-319-46565-4_21
Keywords:
Abstract: In this paper, we describe a pipeline that automatically converts a journal article in PDF format to an XML document that conforms to the NLM JATS DTD. First, the text and typographical features are extracted from the document using character-level information. Then, we use a trickle-down, multi-level classifier based on conditional random fields (CRFs): at each level, a pre-trained CRF model classifies a given line of text into one of the tags of the DTD at a particular depth and feeds the resulting tag to the next level as a feature. After identifying the tags up to level three, we make use of separate supervised models for parsing authors, affiliations, references, and citations. We employ heuristic methods for matching authors to affiliations and citations to references. The XML thus generated is converted to an RDF document. SPARQL queries are run on the RDF to address Task 2 of the Semantic Publishing Challenge.
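
The trickle-down step in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three level models below are hypothetical rule-based stand-ins for the pre-trained CRF models, the feature set is a toy one (no typographical features), and all function names, example lines, and tag choices are assumptions made for illustration. What the sketch does show is the cascade itself, where the tag predicted at one depth is written into each line's features before the next depth is classified.

```python
from typing import Any, Callable, Dict, List

Features = Dict[str, Any]
# A level model maps the per-line feature dicts of a document to one tag per line.
LevelModel = Callable[[List[Features]], List[str]]


def line_features(line: str) -> Features:
    """Toy text-only features for one line (assumed; not the paper's feature set)."""
    stripped = line.strip()
    return {
        "lower": stripped.lower(),
        "n_tokens": len(stripped.split()),
        "has_digit": any(ch.isdigit() for ch in stripped),
    }


def trickle_down(lines: List[str], models: List[LevelModel]) -> List[str]:
    """Run one model per depth, feeding each level's tag to the next as a feature."""
    feats = [line_features(line) for line in lines]
    paths: List[List[str]] = [[] for _ in lines]
    for depth, model in enumerate(models, start=1):
        tags = model(feats)
        for f, path, tag in zip(feats, paths, tags):
            path.append(tag)
            f[f"tag_level_{depth}"] = tag  # visible to the next level's model
    return ["/".join(path) for path in paths]


# Hypothetical stand-ins for the pre-trained CRF models (simple rules only).
def level1(feats: List[Features]) -> List[str]:
    tags = []
    for f in feats:
        if f["lower"].startswith("["):
            tags.append("back")
        elif f["lower"].endswith("."):
            tags.append("body")
        else:
            tags.append("front")
    return tags


def level2(feats: List[Features]) -> List[str]:
    child = {"front": "article-meta", "body": "sec", "back": "ref-list"}
    return [child[f["tag_level_1"]] for f in feats]


def level3(feats: List[Features]) -> List[str]:
    tags = []
    for f in feats:
        parent = f["tag_level_2"]
        if parent == "article-meta":
            tags.append("aff" if "university" in f["lower"] else "title-group")
        elif parent == "sec":
            tags.append("p")
        else:
            tags.append("ref")
    return tags


# Made-up example lines, not taken from any real article.
EXAMPLE_LINES = [
    "Converting PDF Articles to JATS XML",
    "Department of Computer Science, Example University",
    "We describe a pipeline for document conversion.",
    "[1] A. Author. A cited work. 2015.",
]

if __name__ == "__main__":
    for line, path in zip(EXAMPLE_LINES, trickle_down(EXAMPLE_LINES, [level1, level2, level3])):
        print(f"{path:32s} {line}")
```

In a real system each stand-in would be replaced by a trained sequence model, and the per-level tag feature would sit alongside the typographical features extracted from the PDF.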