Forum:WPWeb:Web archives

As you should be aware by now, web archives are the last lifeline for us to access deleted websites, but they are also much more than that. Here is everything you should know about web archives.

What is a web archive?

A web archive is an online library of saved web pages, often holding several copies of the same page made at different points in time. Web archives preserve a large part of the internet and are the most efficient way to fight link rot: the loss of access to web content. When a web page is saved, its code (CSS, HTML, etc.) is copied by bots operated by the archiving service and hosted on the archive in a condition similar to that of the original site. This means that you can navigate between saved pages through hyperlinks the same way you would on a live website.

Note that some websites are very difficult to save, and some even actively fight archiving bots. You might encounter difficulties mostly with social media platforms such as Facebook, Flickr, Instagram and LinkedIn. However, don't assume that a page can't be saved: always try to produce an archive.

Internet Archive

Founded in 1996 and hosted on archive.org, the Internet Archive is a non-profit organization dedicated to the preservation of documents in digital format in order to provide "universal access to all knowledge". Its archive includes books, videos, music, software, etc., and of course over 600 billion web pages. Yes, billion. These are managed by a service called the Wayback Machine (hosted on web.archive.org), and if you want to take part in any indexing project, it's imperative to learn how to use it.

Please note that up to the early 2010s, images were often not saved properly, so you may have difficulty finding them.

Template integration

Since we rely heavily on the Internet Archive for our archive links, we've moved to an integrated model for our internet citation templates. In practice, this means that the archive's URL is hard-coded in most of our templates, and only the "archivedate" parameter (see below) needs to be filled in.

Finding a page

Nothing complicated here. Either on the main page or on the Wayback Machine page, enter a URL in the search engine. This will take you to the "Calendar" function of the Wayback Machine, which displays every archived instance of the page you're searching for. Blue indicates that the capture can be accessed without issue, green that it was redirected, and yellow that it is an error. You just need to select a year and a day, and click on one of the links.

When filling in the url parameter of {{WebCite}} with a page saved on the Internet Archive, always make sure you copy the URL used by the archive itself and not the one you used for searching, as it might have gone through a redirect. If the address in the url parameter isn't the same as the one associated with the archivedate parameter, the archive link might generate an error.

Archiving a page

If a page doesn't exist on the Wayback Machine, it's very easy to create an archive:

Go to the web.archive.org page and paste the URL of the page you want to save into the "Save Page Now" function (bottom right).
You're not done yet, as you're taken to a second page (web.archive.org/save) with several options. Don't click on anything except "Save Page" and let the Wayback Machine work. It can take from a few seconds to a few minutes.
You'll then be taken to a page displaying "A snapshot was captured. Visit page:" with a link. Click on it to access the newly made archive.
You should now have access to the archive and can safely copy the link or just the archivedate (as that is all we use in most templates). Sometimes the page doesn't display and you're sent back to the /save page, which still displays "A snapshot was captured", looping back and forth. This is a known issue that happens from time to time. Just wait 15 minutes and the page should display properly.

Navigate the Wayback Machine

There are several ways to explore the Wayback Machine. Remember that an archived website works the same way as a live one, so you can go from page to page, exploring as you wish, as long as the hyperlinks lead to archived pages. Understanding the path (.com/path/subpath/etc) of a website can also help you greatly. Once you've understood its structure, you might be able to experiment with it, maybe even guess page URLs.

Here are a few tips on how to use the archive.org URL to get the best out of it:

Archived pages are located this way: https://web.archive.org/web/archivedate/websiteurl
- What we call the "archivedate" (ex: 20090413154534) is a 14-digit timestamp indicating the year (Y), month (M), day (D), hour (H), minute (m) and second (S): YYYYMMDDHHmmSS.
An asterisk ("*") in the archivedate can be used for quick access:
- https://web.archive.org/web/*/websiteurl will link to the Calendar.
- https://web.archive.org/web/YYYY*/websiteurl will link to the specific year in the Calendar. It can also be done with YYYYMM*, YYYYMMDD*, etc., but those don't really have any practical application.
https://web.archive.org/web/0/websiteurl will open the first archived version of a URL.
https://web.archive.org/web/2/websiteurl will open the last archived version of a URL.
Adding "if_" or "fw_" to the URL, right after the archivedate, will hide the Wayback widget at the top of the screen, which is better for taking screenshots, for example. Do not include this in the URL saved on Wookieepedia, though.

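The URL patterns above can be sketched in a few lines of Python. This is only an illustration, not an official tool: the function names (wayback_url, parse_archivedate) and the example.com address are made up for the example, and the code only builds and reads links without contacting the archive.

```python
from datetime import datetime

WAYBACK_PREFIX = "https://web.archive.org/web/"


def wayback_url(archivedate: str, url: str, modifier: str = "") -> str:
    """Build a Wayback Machine link.

    archivedate: a 14-digit timestamp or a shortcut such as "*", "2009*", "0" or "2".
    modifier: optional suffix placed right after the archivedate (e.g. "fw_").
    """
    return f"{WAYBACK_PREFIX}{archivedate}{modifier}/{url}"


def parse_archivedate(archivedate: str) -> datetime:
    """Read a 14-digit archivedate (YYYYMMDDHHmmSS) as a date and time."""
    if len(archivedate) != 14 or not (archivedate.isascii() and archivedate.isdigit()):
        raise ValueError("expected 14 digits, e.g. 20090413154534")
    return datetime.strptime(archivedate, "%Y%m%d%H%M%S")


page = "http://www.example.com/news/article"  # hypothetical saved URL

print(wayback_url("20090413154534", page))         # a specific capture
print(wayback_url("*", page))                      # the Calendar
print(wayback_url("2009*", page))                  # the year 2009 in the Calendar
print(wayback_url("0", page))                      # first archived version
print(wayback_url("2", page))                      # last archived version
print(wayback_url("20090413154534", page, "fw_"))  # widget hidden, for screenshots only
print(parse_archivedate("20090413154534"))         # 2009-04-13 15:45:34
```

Reading the timestamp as a date is a quick way to check that an archivedate copied into a template is complete (14 digits) and plausible.
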
Other Wayback Machine tools

In total, the Wayback Machine offers six tools, some of them still in beta:

Calendar, described earlier.
Collections (beta), which details the provenance of the archived data. Of little interest to us.
Changes (beta), which compares two archived versions. Can be very useful if we want to know exactly what was changed on a page over time.
Summary, which provides various data regarding the website. Can be useful to assess the scope of a website.
Site Map, which provides a visualization of the website's paths. Can be useful to help navigate the website.
URLs, which provides a list of "all" URLs of a website. The most powerful tool on the Wayback Machine. Can be overwhelming, but very useful once mastered. However, it seems to be incomplete for older content.

Extensions

The Internet Archive has created very useful browser tools:

A browser extension provides quick access to the Wayback search engine, as well as direct access to the first and last archives, the Calendar, and the Save tool for the page you're on.

More

For more information about the Internet Archive, please consult the Internet Archive Help Center.

Archive.today

Whenever the Wayback Machine runs into an issue and can't produce an archive link for us to use, our fallback solution is archive.today. It is generally less attractive (we have no idea how reliable it will remain in the future compared to the Internet Archive) and slower, but it can sometimes work for websites with which the Wayback Machine struggles. It can also provide surprisingly interesting archives unavailable on archive.org, such as those of the defunct StarWars.com Message Boards.

It's quite easy to use. To archive a link, enter it in the red area and click on "save". To look through its archives, use the blue-grey area and click on "search". For more questions about archive.today, please consult its FAQ. Note that it also offers a dedicated (Chrome-only) browser extension.

Troubleshooting

How to archive when everything else fails

You tried to save a page on both archive.org and archive.today to no avail? There is always the good old (and tedious) solution of the screenshot. There are even browser extensions that can screenshot a whole page (top to bottom, without being limited to the part displayed on your screen). You can then upload the screenshot to Wookieepedia and copy the file link to use it with the archivefile parameter. Images created this way must be added manually to the gallery on either Wookieepedia:Social media screenshots or Wookieepedia:Website screenshots.

How to find a page when there is no archive and the page can't be accessed

Don't despair just yet: maybe the page wasn't really deleted but only moved from one URL to another, which is hopefully still live. Use the elements at your disposal, such as the name of the website, the title of the article or a quote (you can google an exact quote by putting it between quotation marks, " "), to search for it on the web. In the case of a systematic URL transfer, you might even be able to guess the new URL with some effort.

The issue of Flash

The Flash format was put to rest in 2020, and browsers stopped supporting it soon after. This is a major issue for users of web archives, as numerous websites (of interest to us: lucasarts.com and some of starwars.com) used the format. To bypass the problem, one possible solution is Ruffle, a Flash Player emulator available as a browser extension. It is used by the Internet Archive for its Flash games collection. However, it is still in development and only properly supports ActionScript 1 and 2 at the moment, making it usable for content from before 2006, but not for content made between 2006 and 2020, which most likely uses ActionScript 3.

Further notes

URL

Definition

A URL (Uniform Resource Locator) is the address of a website or web page.

Taking this page's URL as an example:

https://starwars.fandom.com/wiki/Forum:WPWeb:Web_archives

A URL is composed of several elements organized within a strict syntax:

scheme://authority/path/filename?query#fragment

"scheme": the protocol used to reach the page (scheme://), generally http or https.
"authority": the main element of the address, limited to the host (subdomain.domain:port) for the purpose of Wookieepedia citations; the host is composed of a subdomain (ex: "starwars", but in general it's "www") and a domain (ex: "fandom.com"); rarely, you'll encounter a port affixed to the domain (ex: ":80").
"path": how the address fits within the website (similar to a file path through a computer's folders, though it may not follow such a rigid structure) (ex: "/wiki/"); a path may be composed of several segments (ex: /1/2/3/etc/).
"filename": the page itself, often an HTML document (ex: "Forum:WPWeb:Web_archives").
"query": affixed after a separator ("?"), providing values that define a particular state of the page, like a timestamp on YouTube, or whether the page was accessed from social media (ex: "?action=edit" at the end of this page's URL means that it's currently being edited); note that some websites use queries in addition to the path system to identify the final page (ex: "?page2").
"fragment": affixed after a hash ("#"), indicating a particular element (like a header) within a page (ex: "#URL" at the end of this page's URL will lead to this section).

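If you want to see these elements pulled apart automatically, here is a small sketch using Python's standard urllib.parse module on this page's URL, with the "?action=edit" query and "#URL" fragment from the examples above added. Note that urlsplit does not separate the "path" from the "filename" the way this guide does; splitting on the last slash is an assumption made here to match the guide's vocabulary.

```python
from urllib.parse import urlsplit

parts = urlsplit(
    "https://starwars.fandom.com/wiki/Forum:WPWeb:Web_archives?action=edit#URL"
)

# urlsplit returns the whole path in one piece; cut it at the last "/"
# to get the guide's "path" and "filename" (our own convention).
path, _, filename = parts.path.rpartition("/")

print(parts.scheme)    # https
print(parts.netloc)    # starwars.fandom.com  (the authority)
print(parts.port)      # None (no port affixed to the domain)
print(path + "/")      # /wiki/
print(filename)        # Forum:WPWeb:Web_archives
print(parts.query)     # action=edit
print(parts.fragment)  # URL
```
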
Proper use

Unless they provide additional identification of the page (ex: "?page2"), remove queries and fragments, as they are often unnecessary for linking to and citing a page and will confuse the search engine within the Wayback Machine. Just make sure the page still loads correctly after you've removed those elements, and if not, revert the URL to its original state (Ctrl+Z).

We tend to cut the trailing slash ("/") at the end of the path in URLs used on Wookieepedia. For example: scheme://authority/path/filename/ will be reduced to scheme://authority/path/filename.
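
The two cleanup habits described in this section (dropping unneeded queries and fragments, and cutting the final slash) can be sketched as follows. The clean_url function and its keep_query switch are hypothetical names invented for this example, and the example.com addresses are placeholders. It can't tell you whether the page still loads afterwards, so that check stays manual.

```python
from urllib.parse import urlsplit, urlunsplit


def clean_url(url: str, keep_query: bool = False) -> str:
    """Drop the fragment, optionally the query, and any slash ending the path."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    # Keep the query only when it identifies the page itself (ex: "?page2").
    query = parts.query if keep_query else ""
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))


print(clean_url("https://www.example.com/news/article/?utm_source=x#comments"))
# https://www.example.com/news/article
print(clean_url("https://www.example.com/story/?page2#top", keep_query=True))
# https://www.example.com/story?page2
```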

Some websites also have unique ways of using the URLs of their pages. Such is the case with YouTube, with URLs like "https://www.youtube.com/watch?v=videocode" (a videocode is something like dQw4w9WgXcQ); it may automatically add the channel name to the URL when opened (ex: https://www.youtube.com/watch?v=videocode&ab_channel=StarWars), using "&" to append a further query parameter. Please be mindful to remove such parameters, like the timecode (&t=xxx or ?t=xxx) and channel (&ab_channel=), when saving or searching for a YouTube page on archive.org, as the archive is sensitive to them and won't display the same archive calendar depending on whether they are present or not.
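
As a rough illustration of the YouTube case, the sketch below removes only the two parameters named above (t and ab_channel) and leaves everything else, including the v= videocode, untouched. The function name and the DROPPED set are assumptions made for this example; other tracking parameters may exist and would need to be added by hand.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters named in this guide as safe to remove (assumed list, extend as needed).
DROPPED = {"t", "ab_channel"}


def clean_youtube_url(url: str) -> str:
    """Remove the timecode and channel parameters from a YouTube URL."""
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query) if key not in DROPPED]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(clean_youtube_url(
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ&ab_channel=StarWars&t=42s"
))
# https://www.youtube.com/watch?v=dQw4w9WgXcQ
```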

Title

Most of the time, web page titles should be formatted on Wookieepedia as they appear in the HTML code. That's why we don't use the all-uppercase display used for titles on StarWars.com, for example. To find the title, you'll need to access the source code of the web page, either by using Ctrl+U or F12, or by right-clicking on the page and selecting "View page source" (the exact wording varies by browser), and find the <title></title> line (just Ctrl+F "title") in the head of the document. Replace any HTML artifacts with their special character equivalents if needed, and format the title as displayed in the article (ex: with italics). If the web page doesn't have a specific title defined, it's up to the editor's discretion to best title the page (ex: Website homepage).
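
For those who prefer not to dig through the source by hand, here is a minimal sketch that reads the <title> line from a page's HTML with Python's standard html.parser module and converts HTML artifacts such as &amp; into their characters. The TitleParser class, the page_title function and the sample page are all made up for this example; italics and other Wookieepedia formatting still have to be applied manually.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the first <title> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into &, etc.
        self.in_title = False
        self.done = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def page_title(html_source: str) -> str:
    """Return the page title with whitespace tidied, or "" if none is defined."""
    parser = TitleParser()
    parser.feed(html_source)
    parser.close()
    return " ".join("".join(parser.title_parts).split())


# Hypothetical page: the heading is displayed in uppercase, the <title> is not.
sample = (
    "<html><head><title>Rebels &amp; Rogues | Example Site</title></head>"
    "<body><h1>REBELS &amp; ROGUES</h1></body></html>"
)
print(page_title(sample))  # Rebels & Rogues | Example Site
```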

Converting archiveurl to archivedate

Due to past changes and the improved integration of archive links in our web citation templates, we've come to rely mostly on the archivedate parameter (the archive.org timestamp), leading to the conversion of most archiveurl parameters to the simpler archivedate. This means that instead of having to use a complete archive.org link (https://web.archive.org/web/[timestamp]/[saved url]), we only need to use the timestamp (along with the url= parameter). A large part of the conversion was done by bots; however, you might find some citations still using an archiveurl (sometimes alongside an archivedate). Those were ignored by the bots because of a discrepancy (often caused by link rot, even something as simple as an http to https change) between the saved URL in the url= and archiveurl= parameters, meaning the archivedate might not be compatible with the url parameter. These citations still need to be converted, but it needs to be done manually to ensure that the saved URL is the correct one for the timestamp provided. Thus a template using "url=[url version 1]" and "archiveurl=https://web.archive.org/web/[timestamp]/[url version 2]" needs to be converted so that the parameters are as follows: "url=[url version 2]" and "archivedate=[timestamp]". Keep in mind that web citation templates other than {{WebCite}} don't use the full URL, only the path after the initial "/" that follows the authority (see the URL section above, which explains the anatomy of a URL). When in doubt, always refer to the template's documentation.
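
The manual conversion can be pictured with the short sketch below. It is not the code used by the bots: the function names and the example.com addresses are hypothetical, and the pattern only handles plain archive links with a 14-digit timestamp. It splits an archiveurl into its timestamp and saved URL, then flags whether the saved URL differs from the one currently in url=, which is the discrepancy described above.

```python
import re

# Matches https://web.archive.org/web/[timestamp]/[saved url] (14-digit timestamp only).
ARCHIVEURL_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def split_archiveurl(archiveurl: str) -> tuple[str, str]:
    """Return (timestamp, saved_url) taken from a full archive.org link."""
    match = ARCHIVEURL_RE.match(archiveurl)
    if match is None:
        raise ValueError("not a plain Wayback Machine link: " + archiveurl)
    return match.group(1), match.group(2)


def to_archivedate(url_param: str, archiveurl: str) -> dict:
    """Build the new parameters: url= takes the saved URL, archivedate= the timestamp."""
    timestamp, saved_url = split_archiveurl(archiveurl)
    return {
        "url": saved_url,
        "archivedate": timestamp,
        "url_changed": saved_url != url_param,  # True means url= had to be replaced
    }


print(to_archivedate(
    "https://www.example.com/news",  # url version 1
    "https://web.archive.org/web/20090413154534/http://www.example.com/news",
))
# {'url': 'http://www.example.com/news', 'archivedate': '20090413154534', 'url_changed': True}
```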

Further questions

If you have a question not covered by this guide, please leave a message on NanoLuuke's personal talk page.