Guidelines for Preservable Websites

Internet search engines and website archiving efforts both rely on crawlers (robots) that visit a website and follow its internal links to discover the content there. Following these guidelines helps ensure that a website is both well indexed by search engines and preservable. The guidelines are adapted from the sources listed below; follow those links for more detailed information.

PROVIDE A STANDARD LINK TO ALL WEBSITE CONTENT (INCLUDING PAGES, IMAGES, VIDEOS, DOCUMENTS)

To be visible to crawlers, links should be written as standard HTML/XHTML links rather than embedded in JavaScript or Flash.
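The difference can be illustrated with a rough sketch of how a simple crawler discovers links. The sample page, its paths, and the names LinkCollector and discover_links are hypothetical, and real crawlers are more sophisticated than this; the point is only that a link written as a plain anchor tag is found, while a destination that exists solely inside script code is not.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, roughly as a simple crawler would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def discover_links(html):
    """Return every anchor href found in the given HTML text."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical page: one standard link, two destinations reachable only via script.
SAMPLE_PAGE = """
<html><body>
<a href="/reports/2015.pdf">Annual report</a>
<span onclick="window.location='/hidden/page.html'">Hidden page</span>
<script>window.location.href = "/scripted.html";</script>
</body></html>
"""

print(discover_links(SAMPLE_PAGE))
```

Only the PDF link is reported; the two script-driven destinations would need a standard link somewhere on the site to be discovered this way.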

AVOID PROPRIETARY FORMATS FOR IMPORTANT CONTENT, ESPECIALLY THE HOME PAGE

Avoid home pages that rely heavily on images or animations such as Flash. If you do create such pages, also provide alternative text-only HTML versions.

INCLUDE A USER AND/OR XML SITEMAP

A sitemap that links to all content in a website helps ensure that crawlers will find that content.
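As a minimal sketch, an XML sitemap in the sitemaps.org format can be generated from a list of page addresses. The function name build_sitemap and the example.org URLs are assumptions made for illustration; a real site would gather the list from its content management system or file tree.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a UTF-8 encoded XML sitemap (bytes) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" when serializing.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Hypothetical URLs covering pages, documents, and images.
sitemap = build_sitemap([
    "https://www.example.org/",
    "https://www.example.org/reports/2015.pdf",
    "https://www.example.org/images/building.jpg",
])
print(sitemap.decode("utf-8"))
```

The output would typically be saved as sitemap.xml at the root of the site.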

OMIT ROBOTS.TXT EXCLUSIONS OR LIMIT THEM TO AREAS NOT NEEDED FOR ARCHIVING

Unlike search engines, which need to index text only, successful archiving requires access to all files needed to render the website (including stylesheets, images, etc.). Check your robots.txt file to be sure that directories containing stylesheets and images are not restricted. By contrast, some content (such as calendar functions, databases, and shopping baskets) can slow down or trap the crawler and is not needed in archived copies; preventing access to these areas via robots.txt is optional but can improve preservability. To provide full access to our crawler specifically, add the following two lines to your robots.txt file:

User-agent: archive.org_bot
Disallow:
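One way to check the effect of such rules before publishing them is to test them with Python's standard urllib.robotparser module. This is a sketch only: it assumes the record uses the hyphenated User-agent field name, the /calendar/ and /cart/ paths and the example.org URLs are hypothetical, and individual crawlers may interpret robots.txt somewhat differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: full access for the archive crawler,
# crawler traps closed to everyone else.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow:

User-agent: *
Disallow: /calendar/
Disallow: /cart/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("archive.org_bot", "SomeOtherBot"):
    for path in ("/css/site.css", "/calendar/2015"):
        url = "https://www.example.org" + path
        print(agent, path, allowed(ROBOTS_TXT, agent, url))
```

Under these rules the stylesheet stays reachable for every crawler, which is what rendering an archived copy requires.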

MAINTAIN STABLE URLS AND REDIRECT WHEN NECESSARY

Keeping the URLs for particular content consistent over time minimizes "link rot," both within your site and on external sites that link to your content. It also allows the archives to show the evolution of a page over time. If the URL structure of content on your website must change, be sure to redirect visitors from each changed old URL to the corresponding new URL.
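The redirect rule can be sketched as a lookup table from each retired URL to its replacement. The paths and the names REDIRECTS and resolve are hypothetical, and the use of HTTP status 301 (permanent redirect) is an assumption about a common choice rather than something these guidelines prescribe; in practice the mapping usually lives in the web server's configuration.

```python
# Hypothetical mapping from retired paths to their replacements.
REDIRECTS = {
    "/news/2014/launch.html": "/about/history/launch/",
    "/staff.php": "/people/",
}


def resolve(path):
    """Return (status_code, location) for a requested path.

    A path listed in REDIRECTS is answered with 301 and its new location;
    any other path is served as-is with 200.
    """
    if path in REDIRECTS:
        return 301, REDIRECTS[path]
    return 200, path


print(resolve("/staff.php"))
print(resolve("/people/"))
```

Mapping each old URL individually, rather than sending every retired address to the home page, keeps inbound links pointing at the content they were meant to reach.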

CORRECTLY IDENTIFY CHARACTER SET ENCODING

For the archived copy to be captured and rendered successfully, the Content-Type field in your web server's HTTP header must correctly identify the character set encoding. The Content-Type meta tag in the source code of a page can also identify the character set; if present, it must be consistent with the character set given in the HTTP header.
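A consistency check along these lines can be sketched as follows. The function names and sample values are hypothetical, and the regular expression is a simplification that looks for a charset declaration inside a meta tag rather than fully parsing the page.

```python
import re
from email.message import Message

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = re.compile(
    r'<meta[^>]*?charset\s*=\s*["\']?\s*([\w.:-]+)', re.IGNORECASE
)


def header_charset(content_type):
    """Return the lowercased charset from a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def meta_charset(html):
    """Return the lowercased charset declared in a meta tag, or None."""
    match = META_CHARSET.search(html)
    return match.group(1).lower() if match else None


def charsets_agree(content_type, html):
    """True if the header names a charset and the page does not contradict it."""
    declared = header_charset(content_type)
    in_page = meta_charset(html)
    # A page with no meta declaration cannot contradict the header.
    return declared is not None and in_page in (None, declared)


page = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
print(charsets_agree("text/html; charset=UTF-8", page))
print(charsets_agree("text/html; charset=ISO-8859-1", page))
```

The check compares declarations only; it does not verify that the bytes of the page are actually encoded in the declared character set.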

Sources

Library of Congress. Library of Congress Guide to Creating Preservable Websites

National Archives (UK). The UK Government Web Archive: guidance for digital and records management teams

Portuguese Web Archive. 2015. "Recommendations for authors to enable web archiving"

Smithsonian Institution Archives. August 2, 2011. Robin C. Davis. "Five Tips for Designing Preservable Websites"

Stanford University Libraries. Archivability

UK Web Archive. Technical Information