Wayback Machine

    Overview

    The Internet Archive Wayback Machine (WM), created and maintained by the Internet Archive, is an open access online archive of website content, derived from periodic crawls of the open web and data donations from Alexa Internet and others.  

    Provider
    Apr 25, 2017 9:09pm
    Details
    Collection Content

    The Internet Archive Wayback Machine (WM) is the largest extant, openly accessible archive of non-current content from the surface web.  Content is harvested by frequent, periodic web crawls of open web content that is not protected by robots.txt exclusions. Harvesting is guided in part by an international network of advisors, who identify topics and sites for harvesting, and by Internet Archive client libraries that subscribe to the Archive-It service to archive websites in particular domains, regions, or on particular subjects.

    Wayback Machine content is confined largely to materials from the surface web.  "Deep web" content and most content behind paywalls, such as commercial databases and subscription materials, are not included.  A summary of the strengths and limitations of the Wayback Machine appeared in Jill Lepore's "Annals of Technology: The Cobweb", The New Yorker, January 26, 2015.

    Coverage of The New York Times online in the Wayback Machine

    The New York Times website first came online in December 1996. The Wayback Machine snapshot of the December 20, 1996 landing page of the online edition of The Times is at https://web.archive.org/web/19961220073509/http://www.nytimes.com/. Between December 1996 and November 2013, the Internet Archive (IA) reports that it created “20,653 captures” or snapshots of The Times site. It is not clear, however, that this number, reported in the histogram, is reliable, since only 3,205 captures are reported between 2000 and November 23, 2013, the period in which the bulk of IA collecting to date occurred.

    Many of the earlier captures of nytimes.com content have broken links, or links that redirect the user to content apparently harvested in later IA crawls. In addition, there are weeks and even months when no crawls seem to have occurred, particularly in the years 1996 (7 captures) through 2004 (21 captures). By 2012 the rate had increased considerably, with captures of nytimes.com averaging about 17 per day.
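    The gaps and yearly counts described above can be checked mechanically once a list of capture timestamps is in hand. The following is a minimal sketch, assuming the timestamps use the 14-digit YYYYMMDDhhmmss form that appears in Wayback Machine URLs. The function names are hypothetical, and every sample timestamp except the first (taken from the URL cited above) is invented for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

def parse_ts(ts):
    """Convert a 14-digit capture timestamp (YYYYMMDDhhmmss) to a datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

def captures_per_year(timestamps):
    """Count captures by the four-digit year at the start of each timestamp."""
    return dict(Counter(ts[:4] for ts in timestamps))

def find_gaps(timestamps, min_days=7):
    """Return (start, end) pairs of consecutive captures at least min_days apart."""
    times = sorted(parse_ts(ts) for ts in timestamps)
    threshold = timedelta(days=min_days)
    return [(a, b) for a, b in zip(times, times[1:]) if b - a >= threshold]

# Sample data: only the first value comes from the text; the rest are hypothetical.
sample = [
    "19961220073509",  # capture cited in the text
    "19961227120000",  # hypothetical
    "19970301120000",  # hypothetical
    "19970302120000",  # hypothetical
]

if __name__ == "__main__":
    print(captures_per_year(sample))
    for start, end in find_gaps(sample, min_days=30):
        print("no captures between", start, "and", end)
```

    Running the same functions over a site's full capture list would show where the weeks-long and months-long gaps fall, although the sketch says nothing about why a gap occurred.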

    On November 23, 2013, CRL compared archived content on the nytimes.com site from November 17, 2013, with content from the same date available in the WM.  The IA crawl of the Times site on that date produced 17 snapshots of the site from various times throughout the 24-hour period.  (The landing page URL in the WM was: https://web.archive.org/web/20131117150605/http://www.nytimes.com/. ) CRL sampling indicated that landing page links to some headline articles returned the message “Wayback Machine doesn't have that page archived” rather than the content. One headline with a broken link to the corresponding article, for example, was “Growing Clamor about Inequities of Climate Crisis.”

    Comparison of another article from November 17, “Addiction Treatment with a Dark Side” by Deborah Sontag, revealed that the WM version included the complete text of the article, but lacked significant materials that accompanied the text on nytimes.com.  Specifically, the WM version lacked the following materials:

    • 5 photographs by Leslye Davis for The New York Times
    • links to 5 accompanying videos (totaling over six minutes) by Leslye Davis for The New York Times
    • 3 statistical graphics
    • 379 comments posted by Times readers.

    The Times content was accessed at the following URL: http://www.nytimes.com/2013/11/17/health/in-demand-in-clinics-and-on-the-street-bupe-can-be-savior-or-menace.html

    WM content from the Times site was accessed at the following URL:  https://web.archive.org/web/20131117095824/http://www.nytimes.com/2013/11/17/health/in-demand-in-clinics-and-on-the-street-bupe-can-be-savior-or-menace.html
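    The two URLs above differ only in the Wayback Machine prefix, which carries the capture timestamp ahead of the original address. As a minimal sketch, the following splits a WM URL of that shape into its capture time and original URL. The function name and the regular expression are assumptions made for illustration; the sketch handles only the 14-digit timestamp form seen in this review, and other URL variants are not covered.

```python
import re
from datetime import datetime

# Matches the URL shape used in this review:
# https://web.archive.org/web/<14-digit timestamp>/<original URL>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")

def split_wayback_url(url):
    """Return (capture_datetime, original_url) for a Wayback Machine URL."""
    match = WAYBACK_RE.match(url)
    if match is None:
        raise ValueError("not a recognized Wayback Machine URL: " + url)
    timestamp, original = match.groups()
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S"), original

if __name__ == "__main__":
    wm_url = (
        "https://web.archive.org/web/20131117095824/"
        "http://www.nytimes.com/2013/11/17/health/"
        "in-demand-in-clinics-and-on-the-street-bupe-can-be-savior-or-menace.html"
    )
    captured, original = split_wayback_url(wm_url)
    print(captured.isoformat(), original)
```

    Splitting the URL this way makes it straightforward to pair each archived page with its live counterpart when comparing the two versions.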

    General Observations

    The Internet Archive capture of web content has both advantages and limitations. For The New York Times, and for the many other news sites it archives, the WM provides valuable information about the news cycle, and the relative prominence and emphasis the publisher gives to individual stories during the course of that cycle. It preserves the configuration of the main landing page and the landing pages for individual sections ("World", "Politics", "Opinion", etc.) at the time of the crawl. These pages do not appear in the archived content on nytimes.com itself. (The value of this is somewhat undermined, though, by the erratic timing of the snapshots, which seem to be sporadic and are not consistent from one day to the next.)  The WM also includes a significant amount, but by no means all, of the text of the Times articles. 

    On the other hand, the WM cannot be considered a comprehensive or authoritative record of the content of nytimes.com to date. IA crawls miss some important, openly available content, including feature articles, videos, photographs, databases, graphics, reader comments, and advertisements. Content in the WM, moreover, is searchable only by date of crawl and URL. One must go to www.nytimes.com to identify articles and features by subject, writer, and so forth. In addition, it cannot be assured that a given website is captured in its entirety as it existed at any one moment. Because the crawl of a complex site may take several days to complete, some of the pages included reflect the site as it existed at a date different from when the landing page was captured. Therefore, there is no guarantee of the integrity of website content as a "snapshot" of a site.

    An Internet Archive FAQ describes another limitation of the WM: “When a dynamic page renders standard html, the archive works beautifully. When a dynamic page contains forms, JavaScript, or other elements that require interaction with the originating host, the archive will not contain the original site's functionality.”  This means that interactives and live data feeds that rely on service from web domains outside nytimes.com, for example, are often not preserved.

    In practical terms the archived content on nytimes.com itself provides users more functionality, and is likely to prove even more useful to researchers as the amount of Times dynamic content grows.  This content, however, is for the most part only available to Times subscribers. 

    Delivery

    The Wayback Machine is openly accessible via the Web, using standard browsers.

    Terms

    The Internet Archive Wayback Machine is openly accessible via the Web. Internet Archive terms of use and privacy policy, originally posted on 10 March 2001, are available at:  http://archive.org/about/terms.php.

    Strengths and Weaknesses

    See analysis in "Collection Content" above.  

    Wayback Machine content is searchable by source website URL and date only, and is not indexed or searchable through standard search engines.  Moreover, the content, in place, is not subject to analysis by standard text and data mining tools. In addition, the number of websites archived, and the frequency of harvesting of a given site, like nytimes.com, can vary widely over time.   

    For news content, the WM provides valuable information about the news cycle and patterns of coverage. It preserves the configuration of main landing pages at a particular moment in time. On the other hand, IA crawls miss important, openly available content rendered in special formats that require interaction with the originating host.

    A 2015 New Yorker profile of the Internet Archive (Jill Lepore, "Annals of Technology: The Cobweb," January 26, 2015) reported that the Internet Archive does not clear copyright for the website content it archives, and hence does not normally obtain the rights and permissions necessary to provide long-term public access to the proprietary content it harvests.  Lepore noted that the Wayback Machine "will honor [a 'robots.txt'] file and not crawl that site, and it will also, when it comes across a robots.txt, remove all past versions of that site."
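    To illustrate what a robots.txt exclusion of the kind Lepore describes looks like in practice, the following is a minimal sketch using Python's standard urllib.robotparser module. The sample robots.txt content, the example.com URL, and the use of "ia_archiver" as the crawler's user-agent name are assumptions made for illustration and are not confirmed by this review. The sketch shows only how a rule is evaluated, not how the Internet Archive itself applies such rules.

```python
from urllib.robotparser import RobotFileParser

def is_crawl_allowed(robots_txt, user_agent, url):
    """Evaluate robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt that excludes one crawler from the whole site.
# The user-agent name "ia_archiver" is an assumption for this example.
SAMPLE_ROBOTS = """\
User-agent: ia_archiver
Disallow: /
"""

if __name__ == "__main__":
    page = "http://example.com/news/story.html"
    print("ia_archiver allowed:", is_crawl_allowed(SAMPLE_ROBOTS, "ia_archiver", page))
    print("otherbot allowed:", is_crawl_allowed(SAMPLE_ROBOTS, "otherbot", page))
```

    Under a policy like the one Lepore reports, a site owner who added such a rule would block future crawls and would also cause earlier captures of the site to be withdrawn from public view.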

    An in-depth analysis of Wayback Machine content published in Forbes in 2015 found that the methodology of the Internet Archive web crawls has been inconsistent, erratic, and not well documented over time, compromising the integrity of the materials archived. The report's author concluded that, "Taken together, these findings suggest that far greater understanding of the Internet Archive’s Wayback Machine is required before it can be used for robust reliable scholarly research on the evolution of the web."

    However, the Wayback Machine remains the most extensive existing archive of the open web. It is also the only source for much historical web content.

    Additional Reviews in Other Sources

    Jill Lepore, "Annals of Technology: The Cobweb", The New Yorker, January 26, 2015.
