Authors: Paul Rodrigues, Sandra Kuebler
DOI:
Keywords:
Abstract: This dissertation demonstrates a framework for incremental model selection and processing of highly variant speech transcripts and user-generated text. The system reduces natural language processing (NLP) ambiguity by segmenting text by domain, allowing domain-specific downstream processes to analyze each segment independently. A tokenized input stream is received by the system. At every word, an Indicator Function calculates a quantitative feature signal, which we call a Value Signal, that runs in parallel to the text stream. The Value Signal is monitored for domain changes by an event controller, which segments the signal into domain chunks. The controller can activate slowly over large spans of text, or rapidly and intrasententially. As the controller indicates a domain change with the signal, a pipeline assigned to specific indicator function values is executed to process the segment and add additional signals to a signal stack. At the end of the pipeline, the signals are unified to produce a single annotated output. To exemplify the framework, this dissertation makes three contributions. The first is a novel short-string language identification system for our Value Signal. The second is a machine transliteration system to convert the Arabizi chat alphabet to Arabic script. The third is a modular part of speech tagger for multilingual code-mixing. The language identification system extracts an n-gram and selects the closest language out of 373 reference languages using a Support Vector Machine (SVM) classifier trained on a matrix of measurements. The system learns the patterns of similarity and divergence of a language's tokens across all languages, leading to high accuracy for in-domain n-grams from a legal corpus as well as for an out-of-domain English-Egyptian Arabic code-mixing microblog corpus. The machine transliteration system converts Arabizi, a Latinized Arabic script, to Arabic script in order to utilize existing Arabic NLP tools. A parallel, word-aligned corpus was collected from a dozen speakers. From this corpus was induced a probabilistic mapping of cross-dialect Arabizi characters to Arabic script, which was used to train an accurate transducer. The part of speech tagger demonstrates the modularity of the framework. We find that language identification before tagging, then applying single-language homogeneous models, is competitive with heterogeneous tagging models. We compare two approaches on a transcript of English-Spanish code-mixing. In addition to language identification, we consider a range of alternative indicator functions, such as genre identification, entropy, and gender identification, that could be used for adaptation. The ability to stack signals on top of existing systems could provide a boost in performance for variational language processing. To summarize, this dissertation provides an architecture that allows better handling of complicated language variation. To demonstrate the model, we introduce language identification with state of the art accuracy, research on the Arabizi chat alphabet, ...
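The processing loop described in the abstract (indicator function, parallel value signal, event controller, per-domain pipeline) can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in rather than the dissertation's implementation: `toy_indicator` is a word-list lookup, not the SVM language identifier, `segment` plays the role of the event controller in its simplest form, opening a new chunk at every change in the signal (the rapid, intrasentential case), and `PIPELINES` holds placeholder taggers that only attach the domain label to each token.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy lexicon; a real indicator function would be a trained classifier.
SPANISH_WORDS = {"el", "la", "que", "pero", "gracias", "mañana"}


def toy_indicator(token: str) -> str:
    """Toy indicator function: returns 'es' for listed Spanish words, else 'en'."""
    return "es" if token.lower() in SPANISH_WORDS else "en"


def value_signal(tokens: List[str], indicator: Callable[[str], str]) -> List[str]:
    """Compute one signal value per token, running in parallel to the token stream."""
    return [indicator(tok) for tok in tokens]


def segment(tokens: List[str], signal: List[str]) -> List[Tuple[str, List[str]]]:
    """Stand-in event controller: start a new chunk whenever the signal value changes."""
    chunks: List[Tuple[str, List[str]]] = []
    for tok, val in zip(tokens, signal):
        if chunks and chunks[-1][0] == val:
            chunks[-1][1].append(tok)  # same domain: extend the current chunk
        else:
            chunks.append((val, [tok]))  # domain change: open a new chunk
    return chunks


def make_tagger(label: str) -> Callable[[List[str]], List[Tuple[str, str]]]:
    """Build a placeholder domain-specific pipeline that pairs each token with a label."""
    def tag(tokens: List[str]) -> List[Tuple[str, str]]:
        return [(tok, label) for tok in tokens]
    return tag


# Hypothetical mapping from indicator values to domain-specific pipelines.
PIPELINES: Dict[str, Callable[[List[str]], List[Tuple[str, str]]]] = {
    "en": make_tagger("en"),
    "es": make_tagger("es"),
}


def run_pipelines(chunks, pipelines):
    """Process each chunk with its domain's pipeline and unify into one output list."""
    output = []
    for val, toks in chunks:
        output.extend(pipelines[val](toks))
    return output


if __name__ == "__main__":
    tokens = "I said gracias pero she left".split()
    chunks = segment(tokens, value_signal(tokens, toy_indicator))
    print(run_pipelines(chunks, PIPELINES))
```

A controller that activates slowly over large spans of text could be approximated by smoothing the signal over a window before calling `segment`; that variant is left out to keep the sketch short.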