Authors: Annette Hautli, Sebastian Sulger, Tina Bögel, Tafseer Ahmed, Miriam Butt
DOI:
Keywords: Urdu, Noun, Process (engineering), Computer science, Artificial intelligence, Hindi, Bigram, Range (mathematics), Natural language processing, Component (UML)
Abstract: A problem that crops up repeatedly in shallow and deep syntactic parsing approaches to South Asian languages like Urdu/Hindi is the proper treatment of complex predications (CPs). Problems for the NLP of complex predications are posed by their productiveness and the ill-understood nature of the range of their combinatorial possibilities. This paper presents an investigation into whether fine-grained information about the distributional properties of nouns in N+V CPs can be identified by the comparatively simple process of extracting bigrams from a large “raw” corpus of Urdu. In gathering the relevant properties, we were aided by visual analytics, in that we coupled our computational data analysis with interactive visual components in the analysis of the large data sets. The visualization component proved to be an essential part of our data analysis, in particular for the easy visual identification of outliers and false positives. Another essential component turned out to be our language-particular knowledge and access to existing language-particular resources. Overall, we were indeed able to identify high-frequency N-V complex predications as well as pick out combinations we had not been aware of before. However, a manual inspection of our results also pointed to a problem of data sparsity, despite the use of a large corpus.
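The bigram extraction step mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The light-verb list, the Latin transliteration, the whitespace tokenisation and the toy sentences are all assumptions made for illustration. The sketch counts adjacent word pairs whose second token is a candidate light verb, then reports how each first word distributes over those verbs.

```python
from collections import Counter

# Hypothetical light-verb forms in Latin transliteration (illustrative only,
# not the verb set used in the paper).
LIGHT_VERBS = {"kar", "ho", "rakh"}


def extract_bigrams(sentences, second_words=LIGHT_VERBS):
    """Count adjacent (word, verb) pairs whose second token is in second_words."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # assumption: whitespace tokenisation
        for first, second in zip(tokens, tokens[1:]):
            if second in second_words:
                counts[(first, second)] += 1
    return counts


def relative_frequencies(counts):
    """For each first word, give the share of its bigrams taken by each verb."""
    totals = Counter()
    for (first, _), n in counts.items():
        totals[first] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}


# Toy stand-in for a raw, untagged corpus (invented sentences).
toy_corpus = [
    "us ne yaad kar",
    "mujhe yaad ho",
    "us ne yaad kar",
    "us ne intizar kar",
]

counts = extract_bigrams(toy_corpus)
rel = relative_frequencies(counts)
```

Because a raw corpus carries no part-of-speech tags, nothing here guarantees that the first word of a pair is a noun. Such a count would include false positives of the kind the abstract says needed visual and manual inspection.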