Concise and accurate data summaries for fast approximate query answering

作者: Kenneth C. Sevcik , Hai Wang

DOI:

关键词:

摘要: Many techniques have been proposed to support fast approximate query answering using summarized information of the data. Among them, histogram and wavelet are two popular types that extensively studied. In this thesis, we investigate trade-off between space used accuracy various techniques. We also examine their construction costs time. The major contributions thesis as follows. First, present a general model for in many database applications. This unifies different scenarios so can be systematically evaluated compared. Second, thorough experimental evaluation previously Third, new family histograms, Hierarchical Model Fitting (HMF) based on Minimum Description Length (MDL) principle, which has widely selection statistics machine learning. one-dimensional HMF is applicable data, multi-dimensional histograms constructed either seek highest possible within given budget, or most concise representation leads specified tolerance. show capable providing more accurate approximations than real synthetic data sets across variety workloads. Fourth, Information Theory, quantitatively assess gain due each both individually combination. Based theoretical evidence, suggest effective heuristics allocating utilize information. type histogram, called Values & Intervals (VI) just one scan through All other require much larger VI they seldom practice high costs. Through set experiments, currently commercial management systems, including IBM DB2, Oracle Database, Microsoft SQL Server, with similar Finally, identify characteristics perform poorly excellently. an algorithm, Majorization Ranking Test (MRT) quickly determine technique use (if any). MRT algorithm allows us decide whether techniques, Space Efficient Wavelet (SEW) improve by utilizing efficient way. SEW dominate cases.

参考文章(0)