Site Owner FAQ

How are websites selected for the Web Resources Collection?

Subject specialists and web collection curators at Columbia University Libraries select websites to be archived. Criteria for selection include the relevance of the website to current research, teaching, and advocacy; the perceived risk of the website disappearing; and the likelihood that the website will not be archived or preserved by other means. Websites affiliated with Columbia University and with organizations whose paper archives are held at Columbia are additional priorities for web archiving.

What is your copyright and permissions policy for archiving websites?

We follow principles and techniques of non-intrusive harvesting. We attempt to notify all organizations and website owners of our interest in archiving their websites, and we will refrain from harvesting websites whose owners do not wish to participate in this project. Some websites may contain material produced by other parties who may claim copyright ownership of that material. Columbia University Libraries (CUL) reserves the right to remove any material that in our reasonable opinion may violate copyright or other intellectual property rights. Third-party copyright holders who believe their rights have been infringed by inclusion of their content in our archive may contact us at culhrweb@libraries.cul.columbia.edu.

How do you collect and store websites?

We use the open-source web crawler "Heritrix" to create archival copies of the websites. Currently we manage Heritrix through the Internet Archive's "Archive-It" service. All data created using the Archive-It service is hosted and stored by the Internet Archive. Eventually we will also store the data in Columbia University Libraries' digital repository.

Do website owners have to change or alter websites to be included in the crawls?

No, website owners do not have to change the content, structure, or appearance of their websites to be included in the crawls.

Will your crawling interfere with access to our website?

We crawl websites at a polite rate so as not to interfere with access to your website. For actively updated websites, crawls generally run quarterly or semi-annually and last a few days. Once a crawl is complete, the crawler no longer interacts with your server. If you encounter any issues or have any additional questions, please contact us at culhrweb-tech@libraries.cul.columbia.edu.

Are you able to capture media, audio and video files?

Yes, downloadable media files, including audio and video, can usually be captured, although YouTube videos are challenging. Our crawler follows links in order to discover and capture content, so links to content must exist on a website in order for that content to be included in the archive. We can’t capture files that are not linked and must instead be retrieved from a database via a user query (for example, a publications database that requires one to run a search in order to access publications). Streaming audio and video can’t be captured at all by the current generation of web crawlers.
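To illustrate why linked files are discovered while content behind a search form is not, here is a minimal sketch of link discovery in Python. It is not Heritrix's actual implementation; the names LinkCollector and discover_links, the sample page, and the example.org address are all hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect URLs a simple link-following crawler could discover on a page.

    Illustrative only: a real crawler such as Heritrix extracts links from
    many more places than this sketch does.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Ordinary hyperlinks, including links to downloadable files.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag in ("audio", "video", "source") and attrs.get("src"):
            # Media files referenced directly in the page markup.
            self.links.append(urljoin(self.base_url, attrs["src"]))
        # A <form> is deliberately ignored: its results exist only after a
        # user submits a query, so there is no link to follow.


def discover_links(html, base_url):
    """Return the absolute URLs found in the given HTML, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page: one linked PDF, one embedded audio file, one search form.
PAGE = """
<a href="/reports/annual-2011.pdf">Annual report</a>
<audio src="media/interview.mp3"></audio>
<form action="/publications/search" method="get">
  <input name="q"><input type="submit">
</form>
"""

found = discover_links(PAGE, "https://example.org/")
for url in found:
    print(url)
```

In this sketch the PDF and the audio file are found because the page links to them directly, while the publications behind the search form are never discovered. Adding a plain browsable list of links to such publications is one way to make them reachable by a crawler.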

How can I view the websites that have been archived? Will access always be free?

Archived websites will remain freely accessible to the public. Websites can be viewed by date of capture via our Internet Archive partner page. Program staff will explore additional means of viewing archived websites.

Why do the archived versions of some websites appear to be incomplete?

There are several reasons why an archived website may appear to be incomplete. Some types of content are challenging or impossible to capture or reproduce, including JavaScript-driven navigation menus, streaming audio and video, and dynamic form-driven and database-driven content. We can’t capture files that are not linked and must instead be retrieved from a database via a user query (for example, a publications database that requires one to run a search in order to access publications). Also, portions of a website may be restricted or password-protected. We collect only public content, so password-protected material will not be crawled. Owners wishing to optimize their site design to allow full archiving may be interested in these useful guidelines from the Portuguese Web Archive.

I would like my organization’s website to be removed from the Web Resources Collection. Who do I contact?

We will honor requests to remove archived content. Please contact culhrweb-tech@libraries.cul.columbia.edu.

How can I learn more about your project?