Automatically Identify and Label Sections in Scientific Journals Using Conditional Random Fields

Authors: Sree Harsha Ramesh, Arnab Dhar, Raveena R Kumar, Anjaly V, Sarath KS

DOI: 10.1007/978-3-319-46565-4_21

Abstract: In this paper, we describe a pipeline that automatically converts a journal article in PDF format to an XML document that conforms to the NLM JATS DTD. First, text and typographical features are extracted from the document using character-level information. Then, we use a trickle-down, multi-level classifier based on conditional random fields (CRFs): at each level, a pre-trained CRF model classifies a given line of text into one of the tags of the NLM JATS DTD at that particular depth and feeds the resulting tag to the next level as a feature. After identifying the tags up to level three, we use separate supervised models for parsing authors, affiliations, references, and citations. We employ heuristic methods for matching authors to affiliations and citations to references. The XML thus generated is converted to an RDF document. SPARQL queries are run on the RDF to address Task 2 of the Semantic Publishing Challenge.
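
The trickle-down step described in the abstract can be sketched as follows. This is a minimal illustration of the feature-passing idea only, not the authors' implementation: the pre-trained CRF models are replaced by hypothetical rule-based stand-ins (`level1`, `level2`, `level3`), and the feature names (`font_size`, `bold`, `page`) and thresholds are assumptions chosen for the example. Only the tag names are taken from JATS.

```python
from typing import Callable, Dict, List

Features = Dict[str, object]
# A level model maps the feature dicts of all lines to one tag per line.
# In the paper each level is a pre-trained CRF; here it is any callable.
LevelModel = Callable[[List[Features]], List[str]]


def trickle_down(lines: List[Features], models: List[LevelModel]) -> List[List[str]]:
    """Run the level models in order, feeding each level's tag to the next."""
    feats = [dict(f) for f in lines]  # copy so the caller's dicts stay untouched
    paths: List[List[str]] = [[] for _ in lines]
    for depth, model in enumerate(models, start=1):
        tags = model(feats)
        for f, tag, path in zip(feats, tags, paths):
            f[f"level{depth}_tag"] = tag  # becomes a feature for the next level
            path.append(tag)
    return paths


# Hypothetical rule-based stand-ins for the CRF models (illustration only).
def level1(feats: List[Features]) -> List[str]:
    return ["front" if f["page"] == 1 and f["font_size"] >= 12 else "body"
            for f in feats]


def level2(feats: List[Features]) -> List[str]:
    # Uses the tag produced by level 1 as an input feature.
    return ["article-meta" if f["level1_tag"] == "front" else "sec"
            for f in feats]


def level3(feats: List[Features]) -> List[str]:
    # Uses the tag produced by level 2 as an input feature.
    tags = []
    for f in feats:
        if f["level2_tag"] == "article-meta":
            tags.append("title-group" if f["font_size"] >= 16 else "contrib-group")
        else:
            tags.append("title" if f["bold"] else "p")
    return tags


# Made-up typographical features for four lines of a first page.
lines = [
    {"text": "Automatically Identify ...", "font_size": 18, "bold": True, "page": 1},
    {"text": "Sree Harsha Ramesh, ...", "font_size": 12, "bold": False, "page": 1},
    {"text": "1 Introduction", "font_size": 11, "bold": True, "page": 1},
    {"text": "In this paper ...", "font_size": 10, "bold": False, "page": 1},
]
paths = trickle_down(lines, [level1, level2, level3])
for line, path in zip(lines, paths):
    print("/".join(path), "<-", line["text"])
```

Each line ends up with a path of tags, one per depth, which mirrors how a line is placed in the nested XML. Swapping a stand-in for a real sequence model would only require that it accept the list of feature dicts and return one tag per line.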
