User:GreenC/WaybackMedic 2.5
WaybackMedic
by GreenC
WaybackMedic 2.5 is a bot that adds and maintains archive links from the list of known web archive services in use on Wikimedia sites.
Edits made after 2018-12-04 are by version 2.5.
The bot operator is User:GreenC. The bot account is User:GreenC bot. The bot (software) is "WaybackMedic".
Note: Some of the below functions and links are relevant to Enwiki only.
Fix number | Function name | Example edit | Description | Notes | Date added |
---|---|---|---|---|---|
1 | fixthespuriousone | Example | Remove spurious |1= in cite templates. | | August 2016 |
2 | fixmissingprotocol | Example | 1. Add https if the protocol is missing from the archive.org URL. 2. Convert an existing http protocol to https. 3. Add the web subdomain if missing (archive.org/web/ → web.archive.org/web/). 4. Add the /web/ path if missing (web.archive.org/2016/ → web.archive.org/web/2016/). For some URLs adding /web/ breaks the link, so the bot tests for those. | HTTPS per RFC | August 2016 |
3 | fixemptyarchive | Example | 1. If |archiveurl= is empty or missing but |archivedate= has content, attempt to find a working archive URL based on the archive date; otherwise add {{dead link}} if appropriate. 2. If |archivedate= is empty or missing but |archiveurl= has content, generate the date value from the timestamp in the archive URL. 3. If |archiveurl= and |archivedate= are both empty, remove both and leave a {{dead link}} if appropriate. | | August 2016 |
4 | fixbadstatus | Example | Check all Wayback Machine URLs for response code errors (any status outside the 200s). On an error, look for a better URL via the Wayback API, first using accessdate and then the earliest date available. If none is found there, check the WebCite API, then the Memento API, which checks a few dozen other archives. Other techniques are undocumented. If still none is found, remove |archiveurl= and |archivedate= and add {{dead link}}. | | August 2016 |
5 | Retired | ||||
6 | fixemptywayback | Example | The {{wayback}} template is mangled in a certain way. Action: re-assemble it. Multiple instances in the same ref are not deleted (as in the Example). | | August 2016 |
7 | fixencodedurl | Example | The URL was incorrectly encoded. Fully decode the URL and re-encode it. | | August 2016 |
8 | fixdatemismatch | Example | 1. Ensure |archivedate= matches the snapshot date in the URL. 2. Ensure the date format matches dmy or mdy if set (retain ymd if in use). | | August 2016 |
9 | fixwebcitlong | Example Example | 1. Convert WebCite URLs from short form to long form. 2. Convert Freezepage.com URLs from short form to long form. | WebCite Usage | January 2017 |
10 | fixstraydt | Example | Remove a stray {{dead link}} template when an archive exists for the link. | | January 2017 |
11 | fixwam | Example | Merge {{wayback}} and {{webcite}} into {{webarchive}}. Merge completed February 5, 2017. | Webarchive TfM | January 2017 |
12 | fixiats | Example | Correct the parameter name (archive url → |archive-url). | | January 2017 |
13 | fixswitchurl | Example | Move an archive.org URL from |url= to |archiveurl= and add |archivedate= if missing. | | January 2017 |
14 | Retired | ||||
15 | fixembway | Example Example | 1. A {{wayback}} is embedded in a CS template. 2. A {{dead link}} is embedded in a CS template. | | January 2017 |
16 | <various> | Example | Timestamp and/or |archivedate= is 19700101 and/or out of bounds. | | January 2017 |
17 | fixdoubleurl | Example | archive.org URLs are doubled, tripled, etc. | | January 2017 |
18 | fixemptywebarchive | Example | {{webarchive}} |date= is missing or has an empty value. | | January 2017 |
19 | fixdoublewebarchive | Example | Remove duplicate {{webarchive}} instances. | | January 2017 |
20 | fixembwebarchive | Example | A {{cite web}} is embedded in a {{webarchive}}. | | January 2017 |
21 | fixarchiveis | Example Example | 1. Convert Archive.is URLs from short form to long form. 2. Fix URL encoding of broken links. | Archive.is Usage | January 2017 |
22 | fixitems | Example | Change "/items/" URLs that are using machine IDs | BRFA | January 2017 |
23 | encodemag | Example | Convert MediaWiki encoding to URL encoding in URLs (i.e. {{!}} and {{=}}) | RFC3986 | January 2017 |
24 | decodespace | Example | Convert %20 to +, + to %20, etc. in URLs that can be repaired this way | See also | June 2017 |
25 | waytree_trailgarb | Example Example Example | Remove typical garbage characters found at the end of URLs: .,;:-"l(%XX)('') | | February 2018 |
26 | fixcommentarchive | Example | Open up commented-out archives and add |deadurl= "yes" or "no". | | February 2018 |
27 | waytree_x2encoding | Example | Repair double URL-encoding, e.g. %253A → %3A. | | February 2018 |
28 | fixencodebug | Example | Repair missed URL-encoding of square brackets. | | February 2018 |
29 | fixiats | Example Example | Restore truncated Wayback URL. | | February 2018 |
30 | fixiats | Example | Convert |title={title } -> |title=Archived copy | | September 2018 |
31 | urlchanger | Example | Move broken URL to a new working URL and undo previous archives. | BOTREQ | November 2018 |
32 | cosmetic | Example Example Example Example Example | Edits that might be cosmetic; made only alongside other edits. 1. Delete trailing # in URLs. 2. Delete empty archive fields. 3. archive.is → archive.today. 4. Fix double fragments. 5. Convert protocol-relative URLs. | w:WP:PRURL, T214855, Archive.today | January 2019 |
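
Two of the fixes in the table, normalizing a Wayback URL (fix 2) and deriving the archive date from the snapshot timestamp (fixes 3 and 8), can be illustrated with a minimal sketch. This is Python written for illustration only, not the bot's Nim source; the function names, the regular expression, and the `style` argument are assumptions. It also skips the live check the bot makes for URLs where adding /web/ breaks the link.

```python
import re
from datetime import date
from typing import Optional

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")

# Illustrative pattern (an assumption, not the bot's own): optional scheme,
# optional "web." host prefix, optional "/web" segment, a 4-14 digit
# timestamp with an optional modifier such as "id_", then the original URL.
WAYBACK_RE = re.compile(
    r"^(?:https?:)?(?://)?(?:web\.)?archive\.org(?:/web)?/(\d{4,14})([a-z_]*)/(.+)$",
    re.IGNORECASE,
)


def normalize_wayback_url(url: str) -> str:
    """Return the URL in https://web.archive.org/web/<timestamp>/<target> form."""
    match = WAYBACK_RE.match(url.strip())
    if match is None:
        return url  # not a recognizable Wayback snapshot URL; leave untouched
    timestamp, modifier, target = match.groups()
    return f"https://web.archive.org/web/{timestamp}{modifier}/{target}"


def archive_date_from_url(url: str, style: str = "ymd") -> Optional[str]:
    """Build an archive date from the first 8 digits of the snapshot timestamp."""
    match = re.search(r"archive\.org/(?:web/)?(\d{4})(\d{2})(\d{2})", url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        snapshot = date(year, month, day)
    except ValueError:
        return None  # impossible calendar date, e.g. month 13
    month_name = MONTHS[snapshot.month - 1]
    if style == "dmy":
        return f"{snapshot.day} {month_name} {snapshot.year}"
    if style == "mdy":
        return f"{month_name} {snapshot.day}, {snapshot.year}"
    return snapshot.isoformat()
```

For example, `normalize_wayback_url("archive.org/web/2016/http://example.com")` returns `https://web.archive.org/web/2016/http://example.com`, and URLs that do not look like Wayback snapshots are returned unchanged.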
Technical details
- Changes to URLs are checked against the remote site to ensure they are working
- Link checks are done in real time, with no link database. However, links are checked over a 24-hour period before the final upload of the diff.
- Supports many APIs including Internet Archive, Memento, WebCite and "Timemap" APIs at individual services
- Multiple HTTP header status code checks at the application (WaybackMedic) layer
- Additional time-outs and retries built into the web transfer libraries.
- Additional operating-procedure-level checks against network and other errors; the bot is semi-supervised in known trouble areas.
- Multiple redundant checks of the APIs using multiple dates to ensure a page really is unavailable
- Accepts API results but then verifies by looking at page headers and/or contents
- The bot is written primarily in Nim (which compiles to C source) with support utilities in Awk. Custom-made libraries include a string primitives library for regex, a wiki template parsing library, an OAuth library (in Awk), a MediaWiki API interface library, and a soft-404 detector.
- Due to the nature of the task, running the bot includes a fair amount of supervisory overhead so it requires operator training, though the steps are documented in the source package.
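
The idea of checking an API redundantly with multiple dates can be sketched roughly as follows. This is illustrative Python, not the bot's Nim code: it queries the Internet Archive's public Wayback Availability endpoint once per candidate timestamp and accepts the first snapshot reported as available with status 200. The function names and the injectable `fetch` parameter are assumptions made so the logic can be exercised without network access. The follow-up verification of page headers and contents described above is not shown.

```python
import json
from typing import Callable, Iterable, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_API = "https://archive.org/wayback/available"


def http_get(url: str) -> str:
    """Default fetcher: a plain GET with a timeout (hypothetical helper)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def closest_snapshot(
    target: str,
    timestamps: Iterable[str],
    fetch: Callable[[str], str] = http_get,
) -> Optional[str]:
    """Try each timestamp in turn; return the first usable snapshot URL."""
    for timestamp in timestamps:
        query = urlencode({"url": target, "timestamp": timestamp})
        try:
            data = json.loads(fetch(f"{AVAILABILITY_API}?{query}"))
        except (OSError, ValueError):
            continue  # network failure or malformed JSON: try the next date
        if not isinstance(data, dict):
            continue
        closest = (data.get("archived_snapshots") or {}).get("closest") or {}
        if closest.get("available") and closest.get("status") == "200":
            return closest.get("url")
    return None  # no snapshot found for any of the dates tried
```

A caller might pass the citation's access date first and an early date second, for example `closest_snapshot(url, ["20150312", "19960101"])`, mirroring the "access date, then earliest date" order described for fixbadstatus. A result from this function would still need to be verified against the live snapshot before use.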