WebGraph is a framework to study the web graph. It provides simple ways to manage very large graphs, exploiting modern compression techniques. More precisely, it is currently made of:

  1. A set of simple codes, called ζ codes, which are particularly suitable for storing web graphs (or, in general, integers with a power-law distribution in a certain exponent range).
  2. Algorithms for compressing web graphs that exploit gap compression and differential compression (à la LINK), intervalisation and ζ codes to provide a high compression ratio: for instance, the WebBase graph is compressed at 3.08 bits per link, and a snapshot of about 18,500,000 pages of the .uk domain gathered by UbiCrawler is compressed at 2.22 bits per link (the corresponding figures for the transposed graphs are 2.89 bits per link and 1.98 bits per link). The algorithms are controlled by several parameters, which provide different tradeoffs between access speed and compression ratio.
  3. Algorithms for accessing a compressed graph without actually decompressing it, using lazy techniques that delay the decompression until it is actually necessary.
  4. This package, providing a complete, documented implementation of the algorithms above in Java. It is free software distributed under the GNU General Public License.
  5. Data sets for very large graphs (e.g., a billion links). These are either gathered from public sources (such as WebBase) or produced by UbiCrawler.
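To make gap compression and ζ codes more concrete, here is a minimal, self-contained sketch. It is not the code of this package: the class and method names (ZetaSketch, zeta, gaps, appendBits) are hypothetical, the gap scheme is deliberately simplified (every successor is stored as its distance from the previous one, with no differential compression or intervalisation), and the ζ code is written out as a string of characters rather than as real bit-level output. The ζ_k code below follows the definition in the WebGraph papers as far as it can be summarised here: for x in [2^(hk), 2^((h+1)k) - 1], write h+1 in unary, then x - 2^(hk) in minimal binary over that interval.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of the WebGraph API.
class ZetaSketch {

    /** Appends the lowest width bits of value, most significant bit first. */
    static void appendBits(StringBuilder sb, long value, int width) {
        for (int i = width - 1; i >= 0; i--) sb.append((value >>> i) & 1L);
    }

    /** Returns the zeta_k code of x (x >= 1, 1 <= k <= 32) as a string of '0' and '1'. */
    static String zeta(int x, int k) {
        if (x < 1 || k < 1 || k > 32) throw new IllegalArgumentException("need x >= 1 and 1 <= k <= 32");
        int msb = 31 - Integer.numberOfLeadingZeros(x);   // floor(log2 x)
        int h = msb / k;                                  // x lies in [2^(hk), 2^((h+1)k) - 1]
        long left = 1L << (h * k);                        // left end of that interval
        long size = (1L << ((h + 1) * k)) - left;         // number of values in the interval
        int s = 64 - Long.numberOfLeadingZeros(size - 1); // ceil(log2 size)
        long shortCodes = (1L << s) - size;               // how many offsets get s - 1 bits
        long offset = x - left;

        StringBuilder sb = new StringBuilder();
        appendBits(sb, 1L, h + 1);                        // h + 1 in unary: h zeros, then a one
        if (offset < shortCodes) appendBits(sb, offset, s - 1);
        else appendBits(sb, offset + shortCodes, s);      // minimal binary, long form
        return sb.toString();
    }

    /**
     * Simplified gap transform (an assumption of this sketch): each successor is
     * replaced by its distance from the previous one, starting from -1 so that
     * every gap of a strictly increasing list is at least 1.
     */
    static List<Integer> gaps(int[] sortedSuccessors) {
        List<Integer> out = new ArrayList<>();
        int prev = -1;
        for (int succ : sortedSuccessors) {
            out.add(succ - prev);
            prev = succ;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] successors = {13, 15, 16, 17, 18, 19, 23, 24, 203};
        int totalBits = 0;
        for (int gap : gaps(successors)) {
            String code = zeta(gap, 3);
            totalBits += code.length();
            System.out.println(gap + " -> " + code);
        }
        System.out.println("zeta_3 bits: " + totalBits + ", fixed 32-bit ints: " + 32 * successors.length);
    }
}
```

Small gaps, which dominate when successors are clustered, receive short codewords. The parameter k sets how quickly codeword length grows with the value, which is why the choice of k matters for power-law distributed gaps.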

In the end, with WebGraph you can access and analyse a very large web graph, even on a PC with as little as 256 MiB RAM. Using WebGraph is as easy as installing a few jar files and downloading a data set. This makes it very easy to study phenomena such as PageRank or the distribution of graph properties of the web graph.

For in-depth information on the WebGraph framework, see its home page, where you can find papers on the compression techniques it uses.

The classes of interest for the casual WebGraph user are {@link it.unimi.dsi.webgraph.ImmutableGraph}, which specifies the access methods for an immutable graph; {@link it.unimi.dsi.webgraph.BVGraph}, which allows you to retrieve or recompress a graph stored in the format described in The WebGraph Framework I: Compression Techniques, by Paolo Boldi and Sebastiano Vigna, in Proc. of the Thirteenth World-Wide Web Conference, ACM Press; and {@link it.unimi.dsi.webgraph.TransposeBVGraph}, which allows you to transpose a {@link it.unimi.dsi.webgraph.BVGraph}.

The package {@link it.unimi.dsi.webgraph.examples} contains useful examples that show how to access an immutable graph both sequentially and randomly.

Dependencies

This package relies on fastutil for a type-specific, high-performance collections framework, on MG4J for bit-level I/O, on the COLT distribution for ready-to-use, efficient algorithms, and on GNU getopt for command-line parsing.