User:GreenC/WaybackMedic 2.5

From Wikimedia Commons, the free media repository
WaybackMedic
by GreenC

WaybackMedic 2.5 is a bot that adds and maintains links to the known web archive services in use on Wikimedia sites.

Edits made after 2018-12-04 are by version 2.5.

The bot operator is User:GreenC. The bot account is User:GreenC bot. The bot (software) is "WaybackMedic".

Note: Some of the functions and links below are relevant to Enwiki only.


WaybackMedic Fixes
Fix number Function name Example edit Description Notes Date added
1 fixthespuriousone Example Remove spurious |1= in cite templates. August 2016
2 fixmissingprotocol Example 1. Add https if protocol missing from the archive.org URL.
2. Convert existing protocol http to https.
3. Add the web subdomain if missing (archive.org/web/ → web.archive.org/web/)
4. Add the /web/ path (web.archive.org/2016/ → web.archive.org/web/2016/). In some URLs adding /web/ breaks the link, so those are tested for.
HTTPS per RFC August 2016
3 fixemptyarchive Example 1. If |archiveurl= is empty or missing but |archivedate= has content, attempt to find a working archive URL based on the archive date, otherwise add {{dead link}} if appropriate.
2. If |archivedate= is empty or missing but |archiveurl= has content, generate date value based on timestamp in the archive URL.
3. If |archiveurl= and |archivedate= are empty, remove both and leave a {{dead link}} if appropriate.
August 2016
4 fixbadstatus Example Check all Wayback Machine URLs for response code errors (anything but 200s). If an error code is returned, try for a better URL via the Wayback API, first using accessdate, then using the earliest date available. If none is found there, check the WebCite API, then try the Memento API, which checks a few dozen other archives. Other techniques are undocumented. If still none is found, remove |archiveurl= and |archivedate= and add {{dead link}}. August 2016
5 Retired
6 fixemptywayback Example The wayback template is mangled in a certain way. Action: re-assemble. It won't delete multiple instances if they exist in the same ref (as in the Example). August 2016
7 fixencodedurl Example The URL was incorrectly encoded. Fully decode URL and re-encode. August 2016
8 fixdatemismatch Example 1. Ensure |archivedate= matches the snapshot date in the URL
2. Ensure date format matches dmy or mdy if set (retain ymd if in use)
August 2016
9 fixwebcitlong Example
Example
Convert WebCite URLs from short-form to long-form
Convert Freezepage.com URLs from short-form to long-form
WebCite Usage January 2017
10 fixstraydt Example Remove stray {{dead link}} template when an archive exists for the link January 2017
11 fixwam Example Merge {{wayback}} and {{webcite}} --> {{webarchive}}
Merge completed February 5, 2017
Webarchive TfM January 2017
12 fixiats Example Repair malformed parameter name (archive url -> |archive-url) January 2017
13 fixswitchurl Example Move an archive.org URL from |url= to |archiveurl= and add |archivedate= if missing. January 2017
14 Retired
15 fixembway Example
Example
1. A {{wayback}} is embedded in a CS template.
2. A {{dead link}} is embedded in a CS template.
January 2017
16 <various> Example Timestamp and/or |archivedate= is 19700101 and/or out-of-bounds. January 2017
17 fixdoubleurl Example archive.org URLs are doubled, tripled, etc. January 2017
18 fixemptywebarchive Example {{webarchive}} |date= is missing or empty. January 2017
19 fixdoublewebarchive Example Remove duplicate {{webarchive}} instances. January 2017
20 fixembwebarchive Example A {{cite web}} is embedded in a {{webarchive}} January 2017
21 fixarchiveis Example
Example
1. Convert Archive.is URLs from short-form to long-form
2. Fix URL encoding of broken links
Archive.is Usage January 2017
22 fixitems Example Change "/items/" URLs that are using machine IDs BRFA January 2017
23 encodemag Example Convert MediaWiki encoding to URL encoding in URLs (i.e. {{!}} and {{=}}) RFC3986 January 2017
24 decodespace Example Convert %20 to +, + to %20, etc. in URLs that can be repaired this way See also June 2017
25 waytree_trailgarb Example
Example
Example
Remove typical garbage characters found at the end of URLs: .,;:-"l(%XX)('') February 2018
26 fixcommentarchive Example Open up commented-out archives and add a |deadurl= "yes" or "no" February 2018
27 waytree_x2encoding Example Repair double URL-encoding, e.g. %253A -> %3A February 2018
28 fixencodebug Example Repair missed URL-encoding of square brackets February 2018
29 fixiats Example
Example
Restore truncated Wayback URL February 2018
30 fixiats Example Convert |title={title} -> |title=Archived copy September 2018
31 urlchanger Example Move broken URL to a new working URL and undo previous archives. BOTREQ November 2018
32 cosmetic
Example
Example
Example
Example
Example
Edits that might be cosmetic. Made only alongside other edits.
1. Delete trailing # in URLs
2. Delete empty archive fields
3. archive.is --> archive.today
4. Fix double fragments
5. Convert protocol-relative URLs
w:WP:PRURL, T214855, Archive.today January 2019
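
Two of the fixes above (fix 2, fixmissingprotocol, and step 2 of fix 3, generating a date from the archive URL timestamp) can be illustrated with a minimal Python sketch. This is not the bot's code, which is written in Nim and Awk; the function names, the regular expression and the format labels are assumptions made for illustration. The sketch also does not test for the URLs where adding /web/ breaks the link, which the bot does.

```python
import re
from datetime import datetime
from typing import Optional

# Hypothetical pattern: optional protocol (or protocol-relative //),
# optional "web." subdomain, optional "/web/" path, then a 4-14 digit timestamp.
_WAYBACK = re.compile(
    r"^(?:https?://|//)?(?:web\.)?archive\.org/(?:web/)?(\d{4,14}.*)$",
    re.IGNORECASE,
)

_MONTHS = ("January", "February", "March", "April", "May", "June", "July",
           "August", "September", "October", "November", "December")


def normalize_wayback_url(url: str) -> str:
    """Return the URL in https://web.archive.org/web/... form.

    URLs that do not look like Wayback snapshot URLs are returned unchanged.
    """
    match = _WAYBACK.match(url.strip())
    if match is None:
        return url
    return "https://web.archive.org/web/" + match.group(1)


def archive_date_from_url(url: str, style: str = "ymd") -> Optional[str]:
    """Derive an archive date from the first 8 timestamp digits of a Wayback URL.

    style is "dmy", "mdy" or "ymd" (assumed labels). Returns None when no
    timestamp is present or the digits do not form a valid calendar date.
    """
    match = re.search(r"/web/(\d{8})", url)
    if match is None:
        return None
    try:
        day = datetime.strptime(match.group(1), "%Y%m%d")
    except ValueError:
        return None
    month = _MONTHS[day.month - 1]
    if style == "dmy":
        return f"{day.day} {month} {day.year}"
    if style == "mdy":
        return f"{month} {day.day}, {day.year}"
    return day.strftime("%Y-%m-%d")
```

For example, normalize_wayback_url("archive.org/web/20160804123456/http://example.com/") adds both the protocol and the web subdomain, and archive_date_from_url on the result with style "dmy" yields a date suitable for |archivedate=.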
BotWikiAwk
Technical details
  • Changes to URLs are checked against the remote site to ensure they are working
  • Real-time link checks, no link database. However, links are checked over a 24hr period before final upload of diff.
  • Supports many APIs including Internet Archive, Memento, WebCite and "Timemap" APIs at individual services
  • Multiple HTTP header status code checks at the application (WaybackMedic) layer
  • Additional time-outs and retries are built into the web transfer libraries.
  • Additional operating-procedure level checks against network and other errors - bot is semi-supervised in known trouble areas.
  • Multiple redundant checks of the APIs using multiple dates to ensure a page really is unavailable
  • Accepts API results but then verifies by looking at page headers and/or contents
  • The bot is primarily written in Nim (which compiles to C source), with support utilities in Awk. Its libraries were custom made and include a string primitives library for regex, a wiki template parsing library, an OAuth library (in Awk), a MediaWiki API interface library, and a soft404 detector.
  • Due to the nature of the task, running the bot includes a fair amount of supervisory overhead so it requires operator training, though the steps are documented in the source package.
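
The redundant-check idea in the list above (query the API with several dates, retry on network errors, and verify each API result independently before accepting it) can be sketched in Python as follows. This is an illustration under stated assumptions, not the bot's implementation: find_working_snapshot, lookup and verify are hypothetical names, and the caller supplies the functions that actually talk to an archive API and inspect page headers or contents.

```python
from typing import Callable, Iterable, Optional


def find_working_snapshot(
    url: str,
    dates: Iterable[str],
    lookup: Callable[[str, str], Optional[str]],
    verify: Callable[[str], bool],
    retries: int = 2,
) -> Optional[str]:
    """Try each candidate date in order and return the first verified snapshot.

    lookup(url, date) is a caller-supplied (hypothetical) API query returning
    a snapshot URL or None; it may raise OSError on network trouble.
    verify(snapshot) is a caller-supplied check of the page itself, so an
    API answer is never trusted on its own.
    """
    for date in dates:
        for _ in range(retries + 1):
            try:
                candidate = lookup(url, date)
            except OSError:
                continue  # transient network error: retry the same date
            if candidate is not None and verify(candidate):
                return candidate
            break  # definite answer for this date: move on to the next date
    return None  # every date was tried; treat the page as unavailable
```

Passing the lookup and verify steps in as functions keeps the retry and fallback logic separate from any particular archive service, so the same loop could be pointed at different APIs.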

Notes
