Authors: Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, Christian Bizer
DOI: 10.1561/106.00000003
Keywords:
Abstract: Knowledge about the general graph structure of the World Wide Web is important for understanding the social mechanisms that govern its growth, for designing ranking methods, for devising better crawling algorithms, and for creating accurate models of its structure. In this paper, we analyze a large web graph. The graph was extracted from a large publicly accessible web crawl that was gathered by the Common Crawl Foundation in 2012. The graph covers over 3.5 billion web pages and 128.7 billion hyperlinks. We analyze and compare, among other features, degree distributions, connectivity, average distances, and the structure of weakly/strongly connected components. We conduct our analysis on three different levels of aggregation: page, host, and pay-level domain (PLD) (one “dot level” above public suffixes). Our analysis shows that, as evidenced by previous research (Serrano et al., 2007), some of the features previously observed by Broder et al. (2000) are very dependent on artifacts of the crawling process, whereas others appear to be more structural. We confirm the existence of a giant strongly connected component; however, we find, as have other researchers (Donato et al., 2005; Boldi et al., 2002; Baeza-Yates and Poblete, 2003), very different proportions of nodes that can reach or that can be reached from the giant component, suggesting that the “bow-tie structure” as described by Broder et al. is strongly dependent on the crawling process and, to the best of our current knowledge, is not a structural property of the Web. More importantly, statistical testing and visual inspection of size-rank plots show that the distributions of indegree, outdegree, and sizes of strongly connected components of the page and host graphs are not power laws, contrary to what was previously reported for much smaller crawls, although they might be heavy tailed. If we aggregate at the pay-level domain, however, a power law emerges. We also provide, for the first time, accurate measurements of distance-based features, using recently introduced algorithms that scale to the size of our crawl (Boldi and Vigna, 2013).
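
As an illustration of the “bow-tie” terminology used in the abstract, the following minimal Python sketch partitions a toy directed graph into the largest strongly connected component (SCC), the nodes that can reach it (IN), the nodes reachable from it (OUT), and everything else. The edge list, the function names `reachable` and `bow_tie`, and the quadratic SCC search are assumptions made for this example only; they are not the authors' method, and a crawl with billions of pages would need far more scalable algorithms.

```python
from collections import deque

def reachable(start, adj):
    """Return the set of nodes reachable from `start` by BFS over `adj`."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bow_tie(edges):
    """Split a small directed graph into SCC / IN / OUT / OTHER node sets."""
    fwd, bwd, nodes = {}, {}, set()
    for u, v in edges:
        fwd.setdefault(u, set()).add(v)   # forward adjacency (u links to v)
        bwd.setdefault(v, set()).add(u)   # reversed adjacency (v is linked from u)
        nodes.update((u, v))

    # Toy-scale SCC search: the SCC of n is the intersection of what n can
    # reach and what can reach n. Quadratic in the worst case.
    largest, assigned = set(), set()
    for n in sorted(nodes):
        if n in assigned:
            continue
        scc = reachable(n, fwd) & reachable(n, bwd)
        assigned |= scc
        if len(scc) > len(largest):
            largest = scc

    root = next(iter(largest))            # any SCC member gives the same sets
    downstream = reachable(root, fwd)     # SCC plus OUT
    upstream = reachable(root, bwd)       # SCC plus IN
    return {
        "SCC": largest,
        "IN": upstream - largest,
        "OUT": downstream - largest,
        "OTHER": nodes - upstream - downstream,
    }

# Hypothetical toy graph: a 3-cycle core, one upstream node, one downstream
# node, and a disconnected pair.
EDGES = [("a", "b"), ("b", "c"), ("c", "a"), ("i", "a"), ("c", "o"), ("x", "y")]
parts = bow_tie(EDGES)
for name in ("SCC", "IN", "OUT", "OTHER"):
    print(name, sorted(parts[name]))
```

The relative sizes of the IN and OUT sets are the “proportions of nodes that can reach or that can be reached from the giant component” that the abstract reports as varying strongly between crawls.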