COCA Filters: Co-occurrence Aware Bloom Filters

作者: Kamran Tirdad , Pedram Ghodsnia , J. Ian Munro , Alejandro López-Ortiz

DOI: 10.1007/978-3-642-24583-1_31

关键词: Index (publishing)CocaSignature (logic)AlgorithmData structureBloom filterSearch engine indexingCo-occurrenceComputer science

摘要: We propose an indexing data structure based on a novel variation of Bloom filters. Signature files have been proposed in the past as method to index large text databases though they suffer from high false positive error problem. In this paper we introduce COCA Filters, new type filters which exploits co-occurrence probability words documents reduce error. show experimentally that by using technique can up 21.6 times for same size. Furthermore be replaced wherever any two members universe is identifiable.

参考文章(30)
Ben Carterette, Fazli Can, Comparing inverted files and signature files for searching a large lexicon Information Processing and Management. ,vol. 41, pp. 613- 633 ,(2005) , 10.1016/J.IPM.2003.12.003
Ronitt Rubinfeld, Bernard Chazelle, Joe Kilian, Ayellet Tal, The Bloomier filter: an efficient data structure for static support lookup tables symposium on discrete algorithms. pp. 30- 39 ,(2004) , 10.5555/982792.982797
S. Sadagopan, WWW Ubiquity. ,vol. 2002, pp. 7- ,(2002) , 10.1145/504689.504690
Rasmus Pagh, S. Srinivasa Rao, Anna Pagh, An optimal Bloom filter replacement symposium on discrete algorithms. pp. 823- 829 ,(2005) , 10.5555/1070432.1070548
James K. Mullin, Daniel J. Margoliash, A tale of three spelling checkers Software - Practice and Experience. ,vol. 20, pp. 625- 630 ,(1990) , 10.1002/SPE.4380200607
Jinyang Li, Boon Thau Loo, Joseph M Hellerstein, M Frans Kaashoek, David R Karger, Robert Morris, None, On the feasibility of peer-to-peer web indexing and search international workshop on peer-to-peer systems. pp. 207- 215 ,(2003) , 10.1007/978-3-540-45172-3_19
Georgescu, Shimshoni, Meer, Mean shift based clustering in high dimensions: a texture classification example international conference on computer vision. ,vol. 2, pp. 456- 463 ,(2003) , 10.1109/ICCV.2003.1238382
Burton H. Bloom, Space/time trade-offs in hash coding with allowable errors Communications of the ACM. ,vol. 13, pp. 422- 426 ,(1970) , 10.1145/362686.362692
Li Fan, Pei Cao, J. Almeida, A.Z. Broder, Summary cache: a scalable wide-area web cache sharing protocol IEEE ACM Transactions on Networking. ,vol. 8, pp. 281- 293 ,(2000) , 10.1109/90.851975
Ashish Goel, Pankaj Gupta, Small subset queries and bloom filters using ternary associative memories, with applications measurement and modeling of computer systems. ,vol. 38, pp. 143- 154 ,(2010) , 10.1145/1811039.1811056