Cornell University

The Web Laboratory: Documentation

Database schema

  • Database schema. The main database and the smaller collections (such as the Amazon collection) use the same schema.

Database access tools

  • GetPages. This tool retrieves a subset of the data in a Web Lab database. You specify which fields to include in the output and the restrictions used to select which records to retrieve.
  • GetCrawls. This tool shows the contents of the database's 'Crawl' table.

Internet Archive Crawler

  • WebBack Crawler script. This tool can crawl both the live web and the Internet Archive. After crawling, the script can analyze different aspects of the crawl and output metadata to a CSV file.

Analysis tools

  • HeritrixWebLab. This is a customized version of the open-source Heritrix web crawler.
  • VizTool. A simple tool for interactively exploring the web graph of a selected set of seed pages.

File formats

  • ARC/DAT. Raw data as received from the Internet Archive is in the ARC and DAT formats.
  • TSV files. The WebLab TSV (Tab-Separated Values) file format is used by the tools that save some subset of the data from the WebLab database to a file.
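
As a rough illustration of working with tab-separated output, the sketch below reads TSV text with Python's standard csv module. It is only a sketch: this page does not describe the WebLab TSV layout, so the header row and the column names (url, crawl_id) are assumptions made for the example, and read_tsv is a hypothetical helper, not part of the Web Lab tools.

```python
import csv
import io


def read_tsv(stream, has_header=True):
    """Parse tab-separated text into rows.

    Returns a list of dicts keyed by the header row when has_header is True,
    otherwise a list of lists. Whether real WebLab TSV files carry a header
    row is an assumption here.
    """
    # QUOTE_NONE treats quote characters as ordinary data, so only tabs
    # and newlines act as separators.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)
    if has_header and rows:
        header, body = rows[0], rows[1:]
        return [dict(zip(header, row)) for row in body]
    return rows


# Hypothetical sample data; the column names are illustrative only.
sample = "url\tcrawl_id\nhttp://example.com/\t1\n"
records = read_tsv(io.StringIO(sample))
for record in records:
    print(record["url"], record["crawl_id"])
```

To read from disk instead, pass a file opened with open(path, newline="", encoding="utf-8") in place of the io.StringIO object.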