Guidelines for Preservable Websites
Internet search engines and website archiving efforts both rely on crawlers (robots) that visit a website and follow its internal links to discover the content there. Following these guidelines can help ensure both that a website is well indexed by search engines and that it is preservable. The guidelines are adapted from the sources listed below; follow those links for more detailed information.
Provide a standard link to all website content (including pages, images, videos, documents)
Avoid proprietary formats for important content, especially the home page
Avoid home pages that rely heavily on images or animations such as Flash. If you do create such pages, also provide alternative text-only HTML versions.
Include a user and/or XML sitemap
A sitemap that links to all content on a website ensures that crawlers will find that content.
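As an illustration of the XML variety, the following minimal Python sketch builds a sitemap from a list of URLs. The function name build_sitemap and the example.org URLs are hypothetical and not taken from the sources; the namespace is the one defined by the public sitemaps.org protocol. Real sites would typically generate the URL list from their content management system.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages, images, and documents on a hypothetical site.
pages = [
    "https://www.example.org/",
    "https://www.example.org/about.html",
    "https://www.example.org/docs/report.pdf",
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

The resulting string would normally be saved as sitemap.xml at the root of the site so that crawlers can find it.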
Omit robots.txt exclusions or limit them to areas not needed for archiving
Search engines need to index only text, but successful archiving requires access to all files needed to render the website (including stylesheets, images, etc.). Check your robots.txt file to be sure that directories containing stylesheets and images are not restricted. By contrast, some content (such as calendar functions, databases, and shopping baskets) can slow down or trap the crawler and is not needed in archived copies; optionally preventing access to these areas via robots.txt can improve preservability. To provide full access to our crawler specifically, add the following two lines to your robots.txt file.
User-agent: archive.org_bot
Disallow:
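One way to check what a robots.txt file actually permits is Python's standard urllib.robotparser module. The sketch below is only an illustration: the sample robots.txt content, the can_crawl helper, the "otherbot" agent name, and the example.org paths are all assumptions. The sample grants archive.org_bot full access with an empty Disallow rule and keeps other crawlers out of calendar and shopping-cart areas.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: full access for the archive crawler,
# crawler traps excluded for everyone else.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow:

User-agent: *
Disallow: /calendar/
Disallow: /cart/
"""


def can_crawl(robots_txt, agent, url):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The archive crawler may fetch everything, including excluded areas.
print(can_crawl(ROBOTS_TXT, "archive.org_bot", "https://www.example.org/cart/"))
# Other crawlers are kept out of the calendar but can still reach stylesheets.
print(can_crawl(ROBOTS_TXT, "otherbot", "https://www.example.org/calendar/2012"))
print(can_crawl(ROBOTS_TXT, "otherbot", "https://www.example.org/css/site.css"))
```

Running the same helper against your own robots.txt text, with the paths of your stylesheet and image directories, is a quick way to confirm they are not restricted.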
Maintain stable URLs and redirect when necessary
Keeping the URLs for particular content consistent over time minimizes "link rot," both within your site and on external sites that link to your content, and allows the archives to show the evolution of a page over time. If the URL structure of content on your website must change, be sure to redirect visitors from each changed old URL to the corresponding new URL.
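The redirect rule can be thought of as a lookup table from old paths to new ones. The following sketch assumes a hypothetical REDIRECTS table and resolve function, and assumes that a permanent (HTTP 301) redirect is wanted; the guideline itself does not name a status code. In practice this mapping usually lives in the web server's configuration rather than in application code.

```python
# Hypothetical mapping from retired paths to their replacements.
REDIRECTS = {
    "/news/2011/archive.html": "/news/archive/2011/",
    "/staff.htm": "/about/staff/",
}


def resolve(path):
    """Return (status, location) for a requested path.

    Moved content yields 301 and the new path; anything else is
    served in place with 200 and the unchanged path.
    """
    target = REDIRECTS.get(path)
    if target is not None:
        return 301, target
    return 200, path


print(resolve("/staff.htm"))   # (301, '/about/staff/')
print(resolve("/index.html"))  # (200, '/index.html')
```

Keeping one entry per changed URL, rather than sending every old address to the home page, is what lets visitors and crawlers reach the corresponding new page.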
Correctly identify character set encoding
The Content-Type field in your web server's HTTP header must correctly identify the character set encoding for the archived copy to be captured and rendered successfully. The Content-Type meta tag in the source code of a page can also identify the character set, and it must be consistent with the character set cited in the HTTP header.
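The consistency requirement can be checked mechanically. The sketch below uses only the Python standard library; the helper names (header_charset, meta_charset, charsets_agree) are hypothetical, and the regular expression is a simplification that looks for the first charset declaration in a meta tag rather than fully parsing the HTML.

```python
import re
from email.message import Message

# Simplified pattern: first "charset=" found inside a <meta ...> tag.
META_CHARSET = re.compile(r'''<meta[^>]+charset\s*=\s*["']?([\w-]+)''', re.IGNORECASE)


def header_charset(content_type):
    """Extract the lowercased charset from a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def meta_charset(html):
    """Extract the lowercased charset declared in a meta tag, or None."""
    match = META_CHARSET.search(html)
    return match.group(1).lower() if match else None


def charsets_agree(content_type, html):
    """True if the header names a charset and any meta tag charset matches it."""
    declared = header_charset(content_type)
    in_page = meta_charset(html)
    return declared is not None and (in_page is None or in_page == declared)


page = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
print(charsets_agree("text/html; charset=UTF-8", page))       # True
print(charsets_agree("text/html; charset=ISO-8859-1", page))  # False
```

The check treats a missing charset in the HTTP header as a failure, since the header is the declaration the guideline says must be correct.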
Library of Congress. The Signal. February 6, 2012. Nicholas Taylor. "Designing preservable websites, redux."
Portuguese Web Archive. 2010. "Recommendations for authors to enable web archiving."
Smithsonian Institution Archives. August 2, 2011. Robin C. Davis. "Five Tips for Designing Preservable Websites."
Stanford University Libraries. Archivability
UK Web Archive. Technical Information FAQ #2. "Making Your Website Crawler-Friendly."