Web Archiving at Columbia University
Columbia University Libraries and Information Services (CUL) has expanded the scope of its collection development activities to include curated archival collections of freely available Internet resources. The development of this program, self-funded by CUL as of 2013, was made possible by generous support from the Andrew W. Mellon Foundation.
Columbia University Libraries' commitment to integrating web archiving into ongoing collection development and preservation best practice is informed by collaboration with other research libraries and the broader web archiving community. In 2012 CUL became a member of the International Internet Preservation Consortium (IIPC) and hosted a summit meeting for practitioners on Web Archiving Policies and Practices in the US. Columbia has also recently received new support from the Andrew W. Mellon Foundation for the explicit goal of fostering web archiving collaboration. On June 4-5, 2015, Columbia University Libraries hosted a conference called Web Archiving Collaboration: New Tools and Models.
Supporting Grant Projects
The Andrew W. Mellon Foundation has provided CUL funding for three grant projects in the area of web archiving:
Web Resources Archiving Collaboration (2013-2015)
Objective: To extend the effectiveness of Columbia’s web resource collecting program, and of the collective web archiving work within the US, by developing and testing models of collaboration with other research libraries, with other web archiving programs, with web content producers, and with scholars.
Web Resources Collection Program Development (2009-2012)
Objective: To put into production procedures for selecting, acquiring, describing, preserving, and providing access to freely available web content, starting with the subject area of human rights and expanding into other thematic and Columbia-related content.
Collection Building for Web Resources (2008-2009)
Objective: A joint project with the University of Maryland Libraries to develop and test coherent, holistic models for incorporating web content into research library collections.
The Web Resources Collection Program archives selected websites in thematic areas corresponding to existing CUL collection strengths, websites produced by affiliates of Columbia University, and websites from organizations or individuals whose papers or records are held in CUL's physical archives.
Specific collections of archived websites include:
Selection of Websites for Archiving
Subject specialists at Columbia University Libraries work with the program's Web Resources Collection Coordinator to identify websites for archiving. For thematic collections we also invite website nominations from researchers and website owners. A variety of criteria drive our selection process, including the relevance of the subject matter to current research, teaching, and advocacy; the perceived risk that a website will disappear; and how well a website complements existing print collections held at Columbia University Libraries. Websites affiliated with Columbia University, and those of organizations whose print archives are held at Columbia, are also high priorities for archiving.
The Web Resources Collection Program follows principles and techniques of non-intrusive harvesting. We attempt to notify all organizations and/or individuals whose websites are selected for archiving. We refrain from archiving websites whose owners do not wish to be included in this project, and we will remove harvested content from the archive upon request by the website owner(s). More information for website owners is available on our FAQ page.
Websites selected for our collections are harvested using the Archive-It service from the Internet Archive, which incorporates a version of the open-source crawling software Heritrix. Depending on collection guidelines and the nature of individual websites, websites may be recaptured at regularly scheduled intervals, such as semi-annually or quarterly.
Description and Access
Archived websites will remain freely available to the public via CUL's Archive-It partner page, where website-level metadata is added to allow browsing and full-text search. Additionally, for some collections, archived websites receive individual catalog records in CLIO (the online library catalog for Columbia University Libraries) and in OCLC's WorldCat database, with links to both the live websites and the archived content.
CUL has also developed an experimental local access portal for the Human Rights collection, the Human Rights Web Archive. This portal allows enhanced browsing and full-text search for archived human rights websites, and will be further developed to allow some searching of other human rights resources at Columbia and other archived human rights websites.
For general program information, copyright inquiries, or technical questions:
Alex Thurman, Web Resources Collection Coordinator