Authors: Renata Vieira, Daniel Martins, Lucelene Lopes, Guilherme Fedrizzi, Paulo Fernandes
DOI:
Keywords:
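Abstract: This paper presents a software tool to extract relevant terms from Portuguese texts. EχATOLP extracts the most frequent noun phrases in an annotated corpus. The annotation is provided by the PALAVRAS parser. The tool offers different options to improve the quality of the extraction, ranging from post-treatment of the parser annotation to the application of linguistic and statistical criteria. It also offers some additional features to compare the extracted terms with reference lists, to compute numerical efficiency indexes, and to search for terms. Term extraction from corpora is usually the basis of many Natural Language Processing (NLP) tasks, such as automatic glossary construction [7], text categorization [4], and even ontology learning [3]. Term extraction, like other NLP applications, can benefit from both linguistic and statistical approaches, and the combination of these two approaches often gives better results than each approach separately. The tool [6] thus uses both to select domain-significant terms. From the linguistic point of view, the extraction is based on the syntactic annotation performed by the parser [2]. The candidate terms are then treated according to an extra set of discard and transformation rules. From the statistical point of view, those terms are subject to frequency analysis, i.e., they are ordered in order to select the more frequent ones. Figure 1(a) graphically presents the tool's architecture. The basic input is a set of .xml files, which contain the annotated texts. The extraction process considers discard and transformation rules to, respectively, drop terms that may be unwanted, e.g., numerals, or adapt terms to the extraction purpose, e.g., remove articles. The user chooses which sets of rules are applied. Figure 1(b) shows the upper part of a screenshot of the interface, where the user can choose all options. Once the terms are extracted, their frequencies in the corpus are computed. Then, at the user's choice, terms are selected according to the chosen criteria, e.g., keeping only 10%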
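The steps described in the abstract (discard rules, transformation rules, frequency counting, and a cut-off that keeps only a share of the ranked terms) can be illustrated with a minimal Python sketch. This is not the tool's implementation: the function names, the article list, the numeral test, and the `keep_ratio` parameter are all assumptions introduced here for illustration, and the input is assumed to be a plain list of noun phrases already taken from the parser annotation.

```python
from collections import Counter
import math

# Hypothetical rule data: the tool's actual rule sets are not given in the text.
PT_ARTICLES = {"o", "a", "os", "as", "um", "uma", "uns", "umas"}


def is_discarded(phrase):
    """Illustrative discard rule: reject phrases that contain a numeral."""
    return any(ch.isdigit() for ch in phrase)


def transform(phrase):
    """Illustrative transformation rule: lowercase and strip a leading article."""
    tokens = phrase.lower().split()
    if tokens and tokens[0] in PT_ARTICLES:
        tokens = tokens[1:]
    return " ".join(tokens)


def extract_terms(noun_phrases, keep_ratio=0.10):
    """Apply the rules, count frequencies, and keep the top share of distinct terms."""
    candidates = [transform(p) for p in noun_phrases if not is_discarded(p)]
    counts = Counter(c for c in candidates if c)  # skip phrases emptied by the rules
    ranked = counts.most_common()                 # most frequent first
    cutoff = max(1, math.ceil(len(ranked) * keep_ratio)) if ranked else 0
    return ranked[:cutoff]


phrases = ["a rede neural", "rede neural", "uma rede neural", "os 3 testes", "o modelo"]
print(extract_terms(phrases, keep_ratio=0.5))  # → [('rede neural', 3)]
```

In this sketch the cut-off is applied to the number of distinct terms after the rules have run; whether the tool applies its 10% threshold in the same way is not stated in the abstract.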