The Graph Structure in the Web – Analyzed on Different Aggregation Levels

Authors: Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, Christian Bizer

DOI: 10.1561/106.00000003

Abstract: Knowledge about the general graph structure of the World Wide Web is important for understanding the social mechanisms that govern its growth, for designing ranking methods, for devising better crawling algorithms, and for creating accurate models of its structure. In this paper, we analyze a large web graph. The graph was extracted from a large publicly accessible web crawl that was gathered by the Common Crawl Foundation in 2012. The graph covers over 3.5 billion web pages and 128.7 billion hyperlinks. We analyze and compare, among other features, degree distributions, connectivity, average distances, and the structure of weakly/strongly connected components. We conduct our analysis on three different levels of aggregation: page, host, and pay-level domain (PLD) (one “dot level” above public suffixes). Our analysis shows that, as evidenced by previous research (Serrano et al., 2007), some of the features previously observed by Broder et al. (2000) are very dependent on artifacts of the crawling process, whereas others appear to be more structural. We confirm the existence of a giant strongly connected component; however, we find, as other researchers have (Donato et al., 2005; Boldi et al., 2002; Baeza-Yates and Poblete, 2003), very different proportions of nodes that can reach or that can be reached from the giant component, suggesting that the “bow-tie structure” as described by Broder et al. is strongly dependent on the crawling process and, to the best of our current knowledge, is not a structural property of the Web. More importantly, statistical testing and visual inspection of size-rank plots show that the distributions of indegree, outdegree, and sizes of strongly connected components of the page and host graphs are not power laws, contrary to what was previously reported for much smaller crawls, although they might be heavy tailed. If we aggregate at the pay-level domain, however, a power law emerges. We also provide, for the first time, accurate measurement of distance-based features, using recently introduced algorithms that scale to the size of our crawl (Boldi and Vigna, 2013).

References (25)
Stefano Millozzi, Stefano Leonardi, Debora Donato, Panayiotis Tsaparas. Mining the inner structure of the Web graph. International Workshop on the Web and Databases, pp. 145-150 (2005).
Walter Willinger, David Alderson, John C. Doyle. Mathematics and the Internet: A Source of Enormous Confusion and Great Potential. American Mathematical Society (2009).
Yu Hirate, Shin Kato, Hayato Yamana. Web Structure in 2005. Workshop on Algorithms and Models for the Web-Graph, pp. 36-46 (2007). DOI: 10.1007/978-3-540-78808-9_4
Christian Bizer, Kai Eckert, Robert Meusel, Hannes Mühleisen, Michael Schuhmacher, Johanna Völker. Deployment of RDFa, Microdata, and Microformats on the Web: A Quantitative Analysis. International Semantic Web Conference, pp. 17-32 (2013). DOI: 10.1007/978-3-642-41338-4_2
Rajeev Motwani, Terry Winograd, Lawrence Page, Sergey Brin. The PageRank Citation Ranking: Bringing Order to the Web. The Web Conference, vol. 98, pp. 161-172 (1999).
M. Ángeles Serrano, Ana Maguitman, Marián Boguñá, Santo Fortunato, Alessandro Vespignani. Decoding the structure of the WWW. ACM Transactions on the Web, vol. 1, pp. 10- (2007). DOI: 10.1145/1255438.1255442
Oliver Lehmberg, Robert Meusel, Christian Bizer. Graph structure in the web. Proceedings of the 2014 ACM Conference on Web Science (WebSci '14), pp. 119-128 (2014). DOI: 10.1145/2615569.2615674
Dennis Fetterly, Mark Manasse, Marc Najork. Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. International Workshop on the Web and Databases, pp. 1-6 (2004). DOI: 10.1145/1017074.1017077
P. Boldi, S. Vigna. The webgraph framework I. Proceedings of the 13th Conference on World Wide Web (WWW '04), pp. 595-602 (2004). DOI: 10.1145/988672.988752
Aaron Clauset, Cosma Rohilla Shalizi, M. E. J. Newman. Power-Law Distributions in Empirical Data. SIAM Review, vol. 51, pp. 661-703 (2009). DOI: 10.1137/070710111