Identifying Urdu Complex Predication via Bigram Extraction

Authors: Annette Hautli, Sebastian Sulger, Tina Bögel, Tafseer Ahmed, Miriam Butt

DOI:

Keywords: Urdu, Noun, Process (engineering), Computer science, Artificial intelligence, Hindi, Bigram, Range (mathematics), Natural language processing, Component (UML)

Abstract: A problem that crops up repeatedly in shallow and deep syntactic parsing approaches to South Asian languages like Urdu/Hindi is the proper treatment of complex predications. Problems for the NLP of complex predications are posed by their productiveness and the ill-understood nature of the range of their combinatorial possibilities. This paper presents an investigation into whether fine-grained information about the distributional properties of nouns in N+V CPs can be identified by the comparatively simple process of extracting bigrams from a large “raw” corpus of Urdu. In gathering the relevant properties, we were aided by visual analytics, in that we coupled our computational data analysis with interactive visual components in the analysis of the large data sets. The visualization component proved to be an essential part of our data analysis, in particular for the easy visual identification of outliers and false positives. Another essential component turned out to be our language-particular knowledge and access to existing language-particular resources. Overall, we were indeed able to identify high-frequency N-V complex predications as well as pick out combinations we had not been aware of before. However, a manual inspection of our results also pointed to a problem of data sparsity, despite the use of a large corpus.
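
The bigram extraction step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a whitespace-tokenized, transliterated corpus, and it counts every token that directly precedes a light verb. The light-verb set, the function name extract_nv_bigrams, and the toy token sequences are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical set of transliterated light-verb stems (an assumption for
# illustration; the abstract does not say which verbs were used).
LIGHT_VERBS = {"kar", "ho", "de"}


def extract_nv_bigrams(sentences, light_verbs=LIGHT_VERBS):
    """Count (token, light verb) bigrams in whitespace-tokenized sentences.

    The corpus is "raw" (no part-of-speech tags), so the first element of
    each bigram is only a candidate noun and may be a false positive.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # Pair every token with its immediate right neighbour.
        for first, second in zip(tokens, tokens[1:]):
            if second in light_verbs:
                counts[(first, second)] += 1
    return counts


# Toy token sequences, not real corpus data.
toy_corpus = [
    "us ne yaad kar",
    "mujhe yaad ho",
    "us ne kaam kar",
    "us ne yaad kar",
]

counts = extract_nv_bigrams(toy_corpus)
for (candidate, verb), freq in counts.most_common():
    print(candidate, verb, freq)
```

Because nothing in this sketch checks that the first token is a noun, its output would still need the kind of manual or visual inspection for outliers and false positives that the abstract describes.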

References (36)
Tina Bögel. Urdu-Roman Transliteration via Finite State Transducers. Finite State Methods and Natural Language Processing, pp. 25-29 (2012).
Miriam Butt, Tracy Holloway King. The Status of Case. Springer, Dordrecht, pp. 153-198 (2004). DOI: 10.1007/978-1-4020-2719-2_6
Annette Hautli, Sebastian Sulger, Miriam Butt. Adding an Annotation Layer to the Hindi/Urdu Treebank. Linguistic Issues in Language Technology, vol. 7 (2012).
Christian Rohrdantz, Frans Plank, Thomas Mayer, Daniel A. Keim, Miriam Butt. Visualizing vowel harmony. Linguistic Issues in Language Technology, vol. 4, pp. 1-33 (2011).
Tafseer Ahmed, Miriam Butt. Discovering semantic classes for Urdu N-V complex predicates. IWCS '11: Proceedings of the Ninth International Conference on Computational Semantics, pp. 305-309 (2011).
Florian Mansmann, Jörn Kohlhammer, Daniel Keim, Geoffrey Ellis. Mastering the information age: solving problems with visual analytics. Goslar: Eurographics Association (2010). DOI: 10.2312/14803
Ben Shneiderman, Stuart K. Card, Jock Mackinlay. Readings in Information Visualization: Using Vision to Think (1999).
Harald Hammarström, Muhammad Humayoun, Aarne Ranta. Urdu Morphology, Orthography and Lexicon Extraction. CAASL-2: The Second Workshop on Computational Approaches to Arabic Script-based Languages, July 21-22, 2007, LSA 2007 Linguistic Institute, Stanford University (2007).