Authors: Ludovic Tanguy, Josiane Mothe
DOI:
Keywords:
Abstract: Query difficulty can be linked to a number of causes. Some of these causes are related to the query expression itself, and can therefore be detected through linguistic analysis of the query text. Using 16 different features, automatically computed on TREC queries, we looked for significant correlations between these features and the average recall and precision scores obtained by the participating systems. Each feature is viewed as a clue to a linguistically-specific characteristic of the query, either morphological, syntactical or semantic. Two of them (syntactic links span and polysemy value) are shown to have an impact on the results of previous ad hoc campaigns. Although the correlation values are not very high, they indicate a promising link between some linguistic characteristics and query difficulty.

1. CONTEXT

This study has been conducted in the context of the ARIEL research project, which investigates linguistic processing for IR. The ultimate objective is to build an adaptive system, in which several natural language processing (NLP) techniques are available but selectively used for a given query, depending on the predicted efficiency of each technique.

2. OBJECTIVE

The contribution of linguistics and NLP solutions to IR is, for the overall performance of systems, doubtful at best. From fine-grained morphological analysis to expansion based on semantic word classes, the use of linguistically-sound resources has often proven less efficient than other, cruder techniques [5] [8]. In this paper, we consider linguistic features as a way to predict query difficulty, rather than as a means to model the IR process itself.

3. RELATED WORK

A closely-related approach was performed by [7] on CLEF topics. Their intent was to discover whether topic characteristics could be correlated with system performance, and thus reveal a kind of bias in the evaluation campaign, and further to build a fusion-based engine. The topic features they describe are mostly concerned with syntactic forms and aspects, and were calculated by hand. They measure correlation with precision, their only significant result being a 0.4 correlation between the number of proper nouns and precision. Further studies led the authors to retain named entities as a useful feature, and they were able to propose a system that improved its results after a classification of topics according to named entities. The increase obtained using this feature varied from 0 to 10% across tasks (mono- and multi-lingual). Our study deals with more features, especially in order to deal with query complexity. In addition, we use automatic methods and NLP techniques. Focusing on documents instead of queries, [6] also seeks to characterize collections. His main point is the notion of relevance, testing whether the stylistic genre of a document plays a role in its selection as relevant. The clarity score of [3] depends on both the query and the target collection. Both approaches need exhaustive information about the collection, while we decided to focus on the queries only, which applies to a wider range of situations. [2] propose classes of failures, but these are drawn manually, with no elements on how to assign a query to a category.

4. METHOD

We selected the following data: TREC 3, 5, 6 and 7 results for the ad hoc task, which corresponds to a total of 200 queries (50 per year). The query collections were analysed and described through variables, each corresponding to a specific linguistic feature. We considered the title part of each topic, as its length and format are closest to a real user's query. Because the TREC web site makes participants' runs available (i.e. the lists of documents retrieved for each query), it is possible to compute recall and precision for each run (using the trec_eval utility), and then to average them over the runs for each query. Finally, we computed the correlations between the linguistic variables and the score variables, and tested them for statistical significance.

As a first result, simple variables dealing with the size of the query (number of words) or the presence of certain parts of speech do not have clear consequences on a query's difficulty, whereas more sophisticated variables give interesting results. Globally, syntactic complexity has a negative effect on the scores, as does ambiguity; a little less significantly, an effect on recall can also be observed.
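The text does not specify the tooling used for this correlation step. As an illustration only, the sketch below shows one way to average per-run trec_eval scores by query and test a linguistic feature for correlation, assuming Pearson correlation via SciPy; the file names and formats (scores.tsv, features.tsv) are hypothetical.

```python
# Minimal sketch: correlate a per-query linguistic feature with the
# retrieval score averaged over all participants' runs.
# Hypothetical tab-separated inputs:
#   scores.tsv   -> query_id <TAB> run_id <TAB> score (e.g. average precision)
#   features.tsv -> query_id <TAB> feature value
from collections import defaultdict
from scipy.stats import pearsonr

def mean_score_per_query(path):
    """Average a per-run, per-query score over all runs for each query."""
    totals, counts = defaultdict(float), defaultdict(int)
    with open(path) as f:
        for line in f:
            qid, _run, score = line.rstrip("\n").split("\t")
            totals[qid] += float(score)
            counts[qid] += 1
    return {qid: totals[qid] / counts[qid] for qid in totals}

def correlate(feature_by_query, score_by_query):
    """Pearson correlation between a linguistic feature and the mean score."""
    qids = sorted(set(feature_by_query) & set(score_by_query))
    xs = [feature_by_query[q] for q in qids]
    ys = [score_by_query[q] for q in qids]
    r, p_value = pearsonr(xs, ys)  # p_value is used to test significance
    return r, p_value
```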
4.1. Linguistic Features

NLP is a well-known research field. It has been thoroughly applied to a variety of tasks, ranging from morphological to semantic analysis. Its principles are quite simple: the text (queries in our case) is processed by generic parsing tools (e.g. tagging, chunking, parsing); based on the tagged data, specific programs then extract the desired information. We used:

- TreeTagger (1) for part-of-speech tagging and lemmatisation: this tool attributes a single morphosyntactic category to each word of the input text, based on a general lexicon and a language model;
- Syntex [4] for shallow parsing (syntactic relation detection): this analyser identifies the relations between the words of a sentence, based on grammatical rules;
- the WordNet 1.6 semantic network for ambiguity: this database provides, among other information, the number of meanings of a word;
- CELEX for derivational morphology: this resource gives the morphological decomposition of a word.

(1) TreeTagger, H. Schmidt; www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/

In accordance with our final objective, all features are computed without any human intervention, and are as such prone to errors. They are listed in Table 1, categorized into three subsets according to their level of analysis.

Table 1: List of linguistic features

Morphological:
  Number of words (NBWORDS)
  Word length (LENGTH)
  Number of morphemes (MORPH)
  Suffixed tokens (SUFFIX)
  Proper nouns (PN)
  Acronyms (ACRO)
  Numerals (dates, quantities, etc.) (NUM)
  Unknown tokens (UNKNOWN)

Syntactical:
  Conjunctions (CONJ)
  Prepositions (PREP)
  Personal pronouns (PP)
  Syntactic depth (SYNTDEPTH)
  Syntactic links span (SYNTDIST)
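As an illustration of how such features can be derived from a query title, the sketch below computes rough counterparts of a few Table 1 features (NBWORDS, LENGTH, PN, PREP, CONJ, plus an average polysemy value). It uses NLTK and its WordNet interface as stand-ins for TreeTagger, Syntex and WordNet 1.6, so the values only approximate the paper's features; the function name and example query are hypothetical.

```python
# Minimal sketch of a few Table 1 features, using NLTK as a stand-in for the
# tools named above (TreeTagger/Syntex/WordNet 1.6 are not reproduced here).
import nltk
from nltk.corpus import wordnet as wn

# The required NLTK data packages are assumed to be installed.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("wordnet", quiet=True)

def query_features(query_title):
    """Approximate a subset of the linguistic features for one query title."""
    tokens = nltk.word_tokenize(query_title)
    tagged = nltk.pos_tag(tokens)  # coarse substitute for TreeTagger tags
    nbwords = len(tokens)
    length = sum(len(t) for t in tokens) / nbwords if nbwords else 0.0
    pn = sum(1 for _, tag in tagged if tag in ("NNP", "NNPS"))  # proper nouns
    prep = sum(1 for _, tag in tagged if tag == "IN")           # prepositions
    conj = sum(1 for _, tag in tagged if tag == "CC")           # conjunctions
    # Average polysemy: mean number of WordNet senses over words found in WordNet.
    senses = [len(wn.synsets(t)) for t in tokens if wn.synsets(t)]
    polysemy = sum(senses) / len(senses) if senses else 0.0
    return {"NBWORDS": nbwords, "LENGTH": length, "PN": pn,
            "PREP": prep, "CONJ": conj, "POLYSEMY": polysemy}

print(query_features("organized crime across international borders"))
```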