Processing highly variant language using incremental model selection

Authors: Paul Rodrigues, Sandra Kuebler

DOI:

Keywords:

Abstract: This dissertation demonstrates a framework for incremental model selection and processing of highly variant speech transcripts and user-generated text. The system reduces natural language processing (NLP) ambiguity by segmenting text by domain, allowing domain-specific downstream processes to analyze each segment independently. A tokenized input stream is received by the system. At every word, an Indicator Function calculates a quantitative feature signal we call the Value Signal, which runs in a parallel stream. The Value Signal is monitored for domain changes by an event controller, which segments the input stream into chunks. The event controller can activate slowly over large spans of text, or rapidly intrasententially. As the event controller indicates a change with an event signal, the pipelines assigned to specific indicator function values are executed to process the segment and add additional signals to the signal stack. At the end of the pipeline, the signals are unified to produce a single annotated output stream.

To exemplify the framework, this dissertation makes three contributions. The first is a novel short-string language identification system for our Value Signal. The second is a machine transliteration system to convert the Arabizi chat alphabet to Arabic script. The third is a modular part-of-speech tagger for multilingual code-mixing.

The language identification system extracts n-grams and selects the closest language out of 373 reference languages using a Support Vector Machine (SVM) classifier trained on a matrix of measurements. The system learns patterns of similarity and divergence of a language's tokens across all languages, leading to high accuracy on in-domain n-grams from a legal corpus as well as on an out-of-domain English-Egyptian code-mixing microblog corpus. The machine transliteration system converts Arabizi, a Latinized script, to Arabic script in order to utilize existing NLP tools. A parallel, word-aligned corpus was collected from a dozen speakers. From this corpus, we induced a probabilistic mapping of cross-dialect characters to Arabic script, which was used to train an accurate transducer. The part-of-speech tagger demonstrates the modularity of the framework. We find that language identification before tagging, then applying single-language homogeneous models, is competitive with heterogeneous tagging models. We compare the two approaches on a transcript of English-Spanish code-mixing.

In addition to language identification, we consider a range of alternative indicator functions, such as genre, entropy, and gender, that could [...] adaptation [...] ability [...] on top of [...] systems to provide a boost in performance for variational processing. To summarize, this dissertation provides an architecture that allows better handling of complicated variation. To demonstrate the model, we introduce [...] state-of-the-art accuracy, [...] research [...] alphabet, [...]
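The segmentation loop described in the abstract can be illustrated with a minimal Python sketch. This is not the dissertation's implementation. The names `Segment`, `segment_stream`, `process`, and `toy_language_indicator` are hypothetical, the indicator is a toy word-list lookup rather than the SVM-based language identifier, and the event controller here fires on any change in the indicator value, which is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Annotated = Tuple[str, str]  # (token, annotation)


@dataclass
class Segment:
    value: str         # indicator value shared by all tokens in the segment
    tokens: List[str]


def segment_stream(tokens: Iterable[str],
                   indicator: Callable[[str], str]) -> Iterator[Segment]:
    """Toy event controller: start a new segment whenever the value signal changes."""
    current = None
    for token in tokens:
        value = indicator(token)          # value signal computed at every word
        if current is None:
            current = Segment(value, [token])
        elif value == current.value:
            current.tokens.append(token)
        else:                             # change event: close the chunk
            yield current
            current = Segment(value, [token])
    if current is not None:
        yield current


def process(tokens: Iterable[str],
            indicator: Callable[[str], str],
            pipelines: Dict[str, Callable[[List[str]], List[Annotated]]]
            ) -> List[Annotated]:
    """Run the pipeline assigned to each segment's value and unify the output."""
    output: List[Annotated] = []
    for segment in segment_stream(tokens, indicator):
        output.extend(pipelines[segment.value](segment.tokens))
    return output


# Hypothetical stand-in for a language identifier (assumption, not the SVM system).
SPANISH_WORDS = {"la", "casa", "es", "muy", "bonita"}


def toy_language_indicator(token: str) -> str:
    return "es" if token.lower() in SPANISH_WORDS else "en"


# Hypothetical per-language pipelines; real ones would be taggers or parsers.
TOY_PIPELINES = {
    "en": lambda toks: [(t, "en") for t in toks],
    "es": lambda toks: [(t, "es") for t in toks],
}
```

Calling `process("I think la casa es nice".split(), toy_language_indicator, TOY_PIPELINES)` runs each homogeneous chunk through its own pipeline and returns one annotated token list. Swapping in a different indicator function (for example genre or entropy) only requires a new value-to-pipeline mapping.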
