The Internet Archive Wayback Machine (WM) is the largest extant, openly accessible archive of non-current content from the surface web. Content is harvested by frequent, periodic crawls of open web content that is not protected by robots.txt exclusions. Harvesting is guided in part by an international network of advisors, who identify topics and sites for harvesting, and by Internet Archive client libraries that subscribe to the Archive-It service to archive websites in particular domains or regions, or on particular subjects.
Wayback Machine content is confined largely to materials from the surface web. "Deep web" content and most content behind paywalls, such as commercial databases and subscription materials, are not included. A summary of the strengths and limitations of the Wayback Machine appeared in Jill Lepore's "Annals of Technology: The Cobweb", The New Yorker, January 26, 2015.
Coverage of The New York Times online in the Wayback Machine
The New York Times website first came online in December 1996. The Wayback Machine's snapshot of the December 20, 1996 landing page of the online edition of The Times is at https://web.archive.org/web/19961220073509/http://www.nytimes.com/. The Internet Archive (IA) reports that it created “20,653 captures” or snapshots of The Times site between December 1996 and November 2013. It is not clear, however, that this number, reported in the histogram, is reliable: only 3,205 captures are reported between 2000 and November 23, 2013, the period in which the bulk of IA collecting to date occurred. That would leave 17,448 captures for the years before 2000, when crawls of the site were sparse.
Many of the earlier captures of nytimes.com content have broken links, or links that redirect the user to content apparently harvested in later IA crawls. In addition, there are weeks and even months when no crawls seem to have occurred, particularly in the years 1996 (7 captures) through 2004 (21 captures). The rate had increased considerably by 2012, when captures of nytimes.com were averaging about 17 per day.
On November 23, 2013, CRL compared archived content on the nytimes.com site from November 17, 2013, with content from the same date available in the WM. The IA crawl of the Times site on that date produced 17 snapshots of the site at various times throughout the 24-hour period. (The landing page URL in the WM was: https://web.archive.org/web/20131117150605/http://www.nytimes.com/.) CRL sampling indicated that landing page links to some headline articles returned the message “Wayback Machine doesn't have that page archived” rather than the content. One headline with a broken link to the corresponding article, for example, was “Growing Clamor about Inequities of Climate Crisis.”
Comparison of another article from November 17, “Addiction Treatment with a Dark Side” by Deborah Sontag, revealed that the WM version included the complete text of the article but lacked significant materials that accompanied the text on nytimes.com. Specifically, the WM version lacked the following materials:
- 5 photographs by Leslye Davis for The New York Times
- links to 5 accompanying videos (totaling over six minutes) by Leslye Davis for The New York Times
- 3 statistical graphics
- 379 comments posted by Times readers.
The Times content was accessed at the following URL: http://www.nytimes.com/2013/11/17/health/in-demand-in-clinics-and-on-the-street-bupe-can-be-savior-or-menace.html.
WM content from the Times site was accessed at the following URL: https://web.archive.org/web/20131117095824/http://www.nytimes.com/2013/11/17/health/in-demand-in-clinics-and-on-the-street-bupe-can-be-savior-or-menace.html
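The Wayback Machine URLs cited in this report share a visible pattern: a 14-digit capture timestamp (year, month, day, hour, minute, second) followed by the original URL. The following Python sketch separates the two parts. The function name `parse_wayback_url` is hypothetical, and the treatment of the timestamp as UTC is an assumption based on common usage rather than anything stated in this report.

```python
import re
from datetime import datetime, timezone

# Replay URLs cited in this report have the form
# https://web.archive.org/web/<14-digit timestamp>/<original URL>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_wayback_url(url):
    """Split a Wayback Machine URL into (capture time, original URL)."""
    match = WAYBACK_RE.match(url.strip())
    if match is None:
        raise ValueError(f"not a recognized Wayback Machine URL: {url!r}")
    stamp, original = match.groups()
    # Assumption: the timestamp is expressed in UTC.
    captured = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return captured, original


# Example: the December 20, 1996 landing page snapshot cited in this report.
captured, original = parse_wayback_url(
    "https://web.archive.org/web/19961220073509/http://www.nytimes.com/"
)
print(captured.isoformat(), original)
```

Parsing the URL this way makes it possible to tell exactly when a given page was captured, which matters when comparing pages that are presented together in the WM.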
The Internet Archive capture of web content has both advantages and limitations. For The New York Times, and for the many other news sites it archives, the WM provides valuable information about the news cycle, and the relative prominence and emphasis the publisher gives to individual stories during the course of that cycle. It preserves the configuration of the main landing page and the landing pages for individual sections ("World", "Politics", "Opinion", etc.) at the time of the crawl. These pages do not appear in the archived content on nytimes.com itself. (The value of this is somewhat undermined, though, by the erratic timing of the snapshots, which seem to be sporadic and are not consistent from one day to the next.) The WM also includes a significant amount, but by no means all, of the text of the Times articles.
On the other hand, the WM cannot be considered a comprehensive or authoritative record of the content of nytimes.com to date. IA crawls miss some important, openly available content, including feature articles, videos, photographs, databases, graphics, reader comments, and advertisements. Content in the WM, moreover, is searchable only by date of crawl and URL. One must go to www.nytimes.com to identify articles and features by subject, writer, and so forth. In addition, there is no assurance that a given website is captured in its entirety as it existed at any one moment. Because the crawl of a complex site may take several days to complete, some of the pages included reflect the site as it existed at a date different from the time when the landing page was captured. Therefore, there is no guarantee of the integrity of website content as a "snapshot" of a site.
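The timing gap between captures can be illustrated with the two November 17, 2013 WM URLs cited in this report, one for the landing page and one for the Sontag article. The sketch below, in Python, computes the difference between their capture timestamps. The function name `capture_skew` is hypothetical, and the example shows only that these two cited captures were taken at different times; it does not establish which article capture a given landing page snapshot actually links to.

```python
from datetime import datetime


def capture_skew(stamp_a, stamp_b):
    """Return the absolute time difference between two 14-digit capture timestamps."""
    fmt = "%Y%m%d%H%M%S"
    return abs(datetime.strptime(stamp_a, fmt) - datetime.strptime(stamp_b, fmt))


landing = "20131117150605"  # landing page capture cited in this report
article = "20131117095824"  # article capture cited in this report

print(capture_skew(landing, article))  # prints 5:07:41
```

Here the two captures are a little over five hours apart on the same day. For a crawl that runs over several days, the same calculation would show correspondingly larger gaps.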
In practical terms, the archived content on nytimes.com itself offers users more functionality, and is likely to prove even more useful to researchers as the amount of dynamic content in the Times grows. This content, however, is for the most part available only to Times subscribers.