Human Rights Web Archive-Archived FAQ
1. What is the Human Rights Web Archive?
The Human Rights Web Archive (HRWA) is a collection of freely available websites featuring substantive resources related to human rights, mainly by and about non-governmental organizations dedicated to human rights issues. The HRWA is an initiative of Columbia University Libraries and Information Services and its Center for Human Rights Documentation and Research. Our goal is to build an ongoing mechanism for preserving web-based human rights information, adaptable to a variety of fields and subjects. The development of this program is supported by a grant from the Andrew W. Mellon Foundation.
2. How are websites selected for inclusion in the archive?
Subject specialists at Columbia University Libraries with regional and language expertise select websites for inclusion in the archive. We also invite suggestions from researchers, students, scholars and human rights advocates. Criteria for selection include the relevance of the site to current research, teaching, and advocacy; the perceived risk of the site disappearing; and the likelihood that the site will not be archived or preserved by other means. Organizations whose paper archives are held at Columbia are another priority for web archiving.
3. Can I suggest a site for archiving in this project?
4. What is your copyright and permissions policy for archiving websites?
We follow principles and techniques of non-intrusive harvesting. We attempt to notify all organizations and/or site owners of our interest in archiving their sites. We will refrain from harvesting sites whose owners do not wish to participate in this project. Some sites may contain material that is produced by other parties who may claim copyright ownership of such materials. The HRWA reserves the right to remove any material that in our reasonable opinion may violate copyright or other intellectual property rights. Third-party copyright holders who believe their rights have been infringed by inclusion of their content in our archive may contact us at culhrweb@library.columbia.edu.
5. How do you collect and store websites?
We use the open-source web crawler Heritrix to create archival-quality copies of the websites. We currently manage Heritrix through the Internet Archive's Archive-It service. All data created using the Archive-It service is hosted and stored by the Internet Archive. Eventually we will also store the data in Columbia University Libraries' digital repository.
6. Do site owners have to change or alter websites to be included in the crawls?
No, site owners do not have to change the content, structure, or appearance of their websites to be included in the crawls.
7. Will your crawling interfere with access to our site? Who do I contact if your crawler causes problems?
We crawl websites at a polite rate so as not to interfere with access to your website. For actively updated sites, crawls generally run quarterly or semi-annually, and each crawl lasts a few days. Once a crawl is complete, the crawler no longer interacts with your server. If you encounter any issues or have any additional questions, please contact us at culhrweb-tech@library.columbia.edu.
8. Are you able to capture media such as audio and video files?
Yes, downloadable media, audio and video files can usually be captured, although YouTube videos are challenging. Our crawler follows links in order to discover and capture content, so links to content must exist on a website in order for that content to be included in the archive. We can’t capture files that are not linked from a page and can only be retrieved from a database via a user query (for example, a publications database that requires one to execute a search in order to access publications). Streaming audio and video can’t be captured at all by the current generation of web crawlers.
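As a rough illustration of why unlinked content is missed, the Python sketch below collects the link targets found in a page, which is the only way a link-following crawler learns what to fetch next. It is a simplified example and not Heritrix's actual implementation; the names LinkCollector and discover_links, the example.org address, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the targets of href and src attributes found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Only explicit links count; a form's "action" is not followed.
            if name in ("href", "src") and value:
                self.links.append(urljoin(self.base_url, value))


def discover_links(base_url, html):
    """Return the absolute URLs that a link-following crawler could visit."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page: one directly linked PDF, plus a search form in front of
# a publications database.
PAGE = """
<a href="/reports/annual.pdf">Annual report</a>
<form action="/search" method="get"><input name="q"></form>
"""

found = discover_links("https://example.org/", PAGE)
print(found)
```

In this sketch the linked PDF is discovered, while anything reachable only by submitting the search form never appears in the list, so a crawler working this way would not capture it.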
9. How can I view the sites that have been archived?
Archived sites are publicly available and can be browsed, searched and viewed on our Archive-It partner page. We are actively developing alternative means of accessing the archived sites.
10. Why do some sites appear to be incomplete?
There are several reasons why an archived site may appear to be incomplete. Some types of content are challenging or impossible to capture and/or reproduce, including JavaScript-driven navigation menus, streaming audio and video, and dynamic form-driven and database-driven content. We can’t capture files that are not linked from a page and can only be retrieved from a database via a user query (for example, a publications database that requires one to execute a search in order to access publications). Also, portions of a website may be restricted or password-protected. We collect only public content, so password-protected material will not be crawled.
11. I would like my organization’s site to be removed from the HRWA. Who do I contact?
We will honor requests to remove archived content. Please contact culhrweb-tech@library.columbia.edu.
12. How can I learn more about your project?
- Visit the HRWA page and the Web Resources Collection Program site.
- Contact culhrweb@library.columbia.edu for more information.