Cornell University

The Web Laboratory: Database Sizes Statistics

The Web Lab database contains metadata about web collections. Some of the collections are crawls that have been downloaded from the Internet Archive; these are identified by two-letter codes, e.g., DJ. Other collections are special collections used by research projects.

This page gives summary data about several of the principal collections.  See the Status page for more detailed profiles of some of these collections.


Crawl Name  Database Size        # Pages          # Links          # URLs      # Hosts
Amazon      0.56 TB           39,017,248    2,954,146,695      34,884,739          356
Cornell     0.005 TB             793,140       11,889,778         756,341       40,964
DJ          2.7 TB         1,140,839,475   26,244,734,149     904,946,380   16,089,901
DP          6.5 TB         1,785,298,634   45,740,376,329   1,390,553,968   39,884,497
DV          17.7 TB        2,638,752,713  111,772,592,303   2,448,549,442   80,154,600
EB          20 TB          2,851,741,704  129,591,958,950  20,147,845,829  380,188,095
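
As a rough illustration of how the figures in the table might be compared across collections, the sketch below computes the average number of links per page. The counts are copied from the table; the names COLLECTIONS and links_per_page are hypothetical and are not part of the Web Lab software, and the links-per-page ratio is a derived figure that this page does not itself report.

```python
# Counts copied from the table above, as (pages, links, urls, hosts).
# COLLECTIONS and links_per_page are illustrative names only (assumptions),
# not part of any Web Lab tool.
COLLECTIONS = {
    "Amazon":  (39_017_248, 2_954_146_695, 34_884_739, 356),
    "Cornell": (793_140, 11_889_778, 756_341, 40_964),
    "DJ":      (1_140_839_475, 26_244_734_149, 904_946_380, 16_089_901),
    "DP":      (1_785_298_634, 45_740_376_329, 1_390_553_968, 39_884_497),
    "DV":      (2_638_752_713, 111_772_592_303, 2_448_549_442, 80_154_600),
    "EB":      (2_851_741_704, 129_591_958_950, 20_147_845_829, 380_188_095),
}


def links_per_page(name):
    """Return the average number of links per page for one collection."""
    pages, links, _urls, _hosts = COLLECTIONS[name]
    return links / pages


if __name__ == "__main__":
    # Print one line per collection, in table order.
    for name in COLLECTIONS:
        print(f"{name:8s} {links_per_page(name):6.1f} links per page")
```

The same pattern could be applied to other pairs of columns, such as pages per host, if a different comparison is more useful.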