The Human Rights Web Archive (HRWA) is a collection of freely available websites featuring substantive resources related to human rights, mainly by and about non-governmental organizations dedicated to human rights issues. The HRWA is an initiative of Columbia University Libraries and Information Services and its Center for Human Rights Documentation and Research. Our goal is to build an ongoing mechanism for preserving web-based human rights information, adaptable to a variety of fields and subjects. The development of this program is supported by a grant from the Andrew W. Mellon Foundation.
Human Rights Web Archive FAQ
1. What is the Human Rights Web Archive?
The Human Rights Web Archive (HRWA) is a collection of freely available websites that offer substantive human rights resources, mainly by and about non-governmental organizations dedicated to human rights issues. It is an initiative of Columbia University Libraries and Information Services and its Center for Human Rights Documentation and Research, supported by a grant from the Andrew W. Mellon Foundation.
2. How are websites selected for inclusion in the archive?
Subject specialists at Columbia University Libraries with regional and language expertise select websites for inclusion in the archive. We also invite suggestions from researchers, students, scholars and human rights advocates. Criteria for selection include relevance of the site to current research, teaching, and advocacy, perceived risk of a site disappearing, and likelihood that a site will not be archived or preserved by other means. Organizations whose paper archives are held at Columbia will be another priority for web archiving.
3. Can I suggest a site for archiving in this project?
Yes. We invite suggestions from researchers, students, scholars, and human rights advocates.
4. What is your copyright and permissions policy for archiving websites?
We follow principles and techniques of non-intrusive harvesting. We attempt to notify all organizations and/or site owners of our interest in archiving their sites. We will refrain from harvesting sites whose owners do not wish to participate in this project. Some sites may contain material that is produced by other parties who may claim copyright ownership of such materials. The HRWA reserves the right to remove any material that in our reasonable opinion may violate copyright or other intellectual property rights. Third-party copyright holders who believe their rights have been infringed by inclusion of their content in our archive may contact us at culhrweb@libraries.cul.columbia.edu.
5. How do you collect and store websites?
We use the open-source web crawler Heritrix to create archival-quality copies of the websites. Currently we manage Heritrix through the Internet Archive's Archive-It service. All data created using the Archive-It service is hosted and stored by the Internet Archive. Eventually we will also store the data in Columbia University Libraries' digital repository.
6. Do site owners have to change or alter websites to be included in the crawls?
No, site owners do not have to change the content, structure, or appearance of their websites to be included in the crawls.
7. Will your crawling interfere with access to our site? Who do I contact if your crawler causes problems?
We crawl websites at a polite rate so as not to interfere with access to your website. Crawls will generally be run quarterly or semi-annually for actively updated sites, and last for a few days. Once a crawl is complete, the crawler no longer interacts with your server. If you encounter any issues or have any additional questions, please contact us at culhrweb-tech@libraries.cul.columbia.edu.
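For site owners curious about what a "polite rate" means in practice, the following is a minimal sketch of the general idea: the crawler waits a fixed interval between successive requests to the same server. It is not the Heritrix or Archive-It configuration. The function name `polite_schedule` and the two-second delay are assumptions made purely for illustration.

```python
import time
from typing import Callable, Iterable, Iterator


def polite_schedule(
    urls: Iterable[str],
    delay_seconds: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield URLs one at a time, leaving at least delay_seconds between them.

    Hypothetical helper: the caller fetches each URL as it is yielded.
    The clock and sleep functions can be swapped out for testing.
    """
    last_yield = None
    for url in urls:
        if last_yield is not None:
            # Sleep only for whatever part of the delay has not yet elapsed.
            remaining = delay_seconds - (clock() - last_yield)
            if remaining > 0:
                sleep(remaining)
        last_yield = clock()
        yield url


if __name__ == "__main__":
    # Assumed delay of 2 seconds; a real crawl's settings may differ.
    for page in polite_schedule(["https://example.org/", "https://example.org/about"], 2.0):
        print("would fetch", page)
```

Spacing requests this way keeps the load on a server small and steady, which is why a crawl of a large site can take a few days.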
8. Are you able to capture media such as audio and video files?
Yes, as long as there are links to the raw content. We cannot capture content that is retrieved through a database query. For example, we cannot capture files retrieved from a database after a user enters a search query.
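The distinction can be illustrated with a small sketch, which is not how Heritrix itself is implemented. A link-following crawler discovers files through URLs that appear in a page's markup, so a directly linked audio or video file is found, while results that exist only after a search form is submitted are not. The class name, the sample page, and the list of file extensions below are all hypothetical.

```python
from html.parser import HTMLParser

# Illustrative list of media extensions; a real crawler is not limited to these.
MEDIA_EXTENSIONS = (".mp3", ".mp4", ".wav", ".pdf")


class MediaLinkCollector(HTMLParser):
    """Collect href/src values that point directly at media files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value and value.lower().endswith(MEDIA_EXTENSIONS):
                self.links.append(value)


# Hypothetical page: two direct media links and one search form.
PAGE = """
<a href="/audio/testimony.mp3">Listen</a>
<video src="/video/briefing.mp4"></video>
<form action="/search"><input name="q"></form>
"""

collector = MediaLinkCollector()
collector.feed(PAGE)
print(collector.links)  # the form's results never appear as links
```

Only the two directly linked files are collected. Whatever the search form would return stays invisible to the crawler because no URL for it appears in the page.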
9. How can I view the sites that have been archived?
Archived sites will be publicly available. Initially you can view sites by date of capture in our Internet Archive Human Rights Web Archive collection. Additional means of viewing archived sites will be explored by program staff.
10. Why do some sites appear to be incomplete?
There are several reasons why a site may appear to be incomplete. First, parts of the site may be excluded from crawls by robots.txt. We respect all robots.txt exclusions. Second, some types of content are very challenging, and sometimes impossible, to archive from a technical standpoint. Examples include JavaScript, streaming media, and form- or database-driven content. Finally, portions of a website may be restricted or password-protected. We collect only public content, so password-protected material will not be crawled.
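To show how robots.txt exclusions lead to gaps, here is a minimal sketch using Python's standard `urllib.robotparser` module. It is not the crawler's actual code. The robots.txt rules, the user-agent string `example-crawler`, and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""


def allowed_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return only the URLs that the robots.txt rules permit for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


candidates = [
    "https://example.org/reports/2010.pdf",
    "https://example.org/private/members.html",
    "https://example.org/drafts/statement.html",
]
print(allowed_urls(ROBOTS_TXT, "example-crawler", candidates))
```

In this example only the report under `/reports/` would be collected. The pages under `/private/` and `/drafts/` are skipped, so they would be missing from the archived copy.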
11. I would like my organization’s site to be removed from the HRWA. Who do I contact?
We will honor requests to remove archived content. Please contact culhrweb-tech@libraries.cul.columbia.edu.
12. How can I learn more about your project?
- Visit the HRWA page and the Web Resources Collection Program site.
- Contact culhrweb@libraries.cul.columbia.edu for more information.