Authors: A Gangemi, A Meloni, A Nuzzolese, V Presutti, D Reforgiato Recupero
DOI:
Keywords:
Abstract: The MusicBO Knowledge Graph (KG), which documents Bologna’s pivotal role in European musical heritage, is created by applying an automated text-to-Knowledge-Graph pipeline to a multilingual and diachronic corpus. The corpus contains 137 documents in English, French, Italian, and Spanish, with publication dates ranging from 1700 to the present. It covers a wide variety of textual genres, such as historical dissertations, critical analyses, correspondence, and business-oriented documents. The KG was created automatically with text2AMR2FRED [2], a revised architecture of FRED’s text-to-KG pipeline [3]. To create the MusicBO KG, we applied this pipeline to a subset of the corpus consisting of 47 documents in English and 51 documents in Italian. The documents were originally in .pdf, image, or .docx formats. We extracted plain text from them with ad hoc Optical Character Recognition (OCR) technologies. We then performed co-reference resolution, minimal rule-based post-OCR correction, and sentence splitting on the extracted plain text. We submitted the pre-processed sentences to State-of-the-Art (SotA) neural models for text2AMR parsing: SPRING [1] for sentences in English and USeA [5] for sentences in Italian. AMR graphs, grounded in PropBank frames, serve as an intermediate event-centric representation that is well suited to retrieving ‘who-did-what-to-whom’ in a text. The produced AMR graphs are filtered based on automatic metrics. The remaining AMR graphs undergo AMR-to-FRED translation, which exploits the AMR2FRED tool [4] to …
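The pre-processing and language-routing steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the cleanup rule, the regex-based sentence splitter, and the names `minimal_post_ocr_fix`, `split_sentences`, `route_to_parser`, and `stub_parser` are all hypothetical. The stub stands in for a real text2AMR model (SPRING for English, USeA for Italian), whose actual interfaces are not shown here. Co-reference resolution and the metric-based filtering of AMR graphs are omitted.

```python
import re
from typing import Callable, Dict, List, Tuple


def minimal_post_ocr_fix(text: str) -> str:
    """Illustrative rule-based cleanup (an assumption, not the paper's rules):
    rejoin words hyphenated across line breaks, then collapse whitespace."""
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def stub_parser(sentence: str) -> str:
    """Hypothetical placeholder for a text2AMR model; returns a dummy graph."""
    return "(s / sentence)"


def route_to_parser(
    sentences: List[str],
    lang: str,
    parsers: Dict[str, Callable[[str], str]],
) -> List[Tuple[str, str]]:
    """Send each sentence to the parser registered for the document language."""
    parser = parsers[lang]  # raises KeyError for languages with no parser
    return [(s, parser(s)) for s in sentences]


# Hypothetical registry: one parser per language handled by the pipeline.
PARSERS: Dict[str, Callable[[str], str]] = {"en": stub_parser, "it": stub_parser}

raw = "The com-\nposer lived in Bologna.  He taught there."
sentences = split_sentences(minimal_post_ocr_fix(raw))
pairs = route_to_parser(sentences, "en", PARSERS)
for sentence, graph in pairs:
    print(sentence, "->", graph)
```

Keeping the parsers in a dictionary keyed by language code mirrors the abstract's per-language model choice and makes an unsupported language fail loudly instead of being parsed by the wrong model.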