User:Fæ/Project list/PAS

From Wikimedia Commons, the free media repository

Shortcut: COM:Portable Antiquities Scheme project

Corieltauvi Stater, between 50 BC and 20 BC. Findspot in North Lincolnshire.

This is the project page for the batch upload project populating Category:Portable Antiquities Scheme adding photographs of archaeological interest to Wikimedia Commons.

As of January 2021 there were over 614,980 files in this project, with a total filesize of 701 gigabytes (query/15896). The uploads are done by Fæ; feel free to ask questions about the project in general at User talk:Fæ.

Reports

Dog brooch, Roman 1st-2nd century, found in Lincolnshire.

Scope


The Portable Antiquities Scheme (PAS) publishes photographs and descriptions of archaeological finds registered in England and Wales on its website, https://finds.org.uk, under a free CC-BY license. Each artefact has a main "record" and may have one or several photographs and drawings associated with it. In the majority of cases where there are photographs from different angles, these have been merged into one photograph; for example, most coin photographs show the obverse and reverse in the same photograph.

The images vary significantly in quality, with some being older amateur photographs and others high resolution recent research quality images taken by institutions such as the British Museum.

Category:Portable Antiquities Scheme is the "bucket category" for all images. This contains all the images uploaded within the batch upload project, but also images from other sources, such as volunteer photographs uploaded directly and images taken from PAS events and publications. Before the batch upload project started there were 1,200 photographs in this category and subfolders. All images in the batch upload project use the template {{Portable Antiquities Scheme}} as a credit line, with the benefit that this may be easily amended should PAS wish to be credited differently.

Advice for reuse


Searching

Figurine of Harpocrates, 1st-2nd century, found in Hertfordshire. Photograph shows views from four directions.

Useful example searches:

Categorizing


During the batch upload there has been no automatic categorization. The reasoning for this approach is that with such a large upload, including many thousands of coin images, category flooding would be highly likely; for example, on the PAS database a search for "farthing" returns over 5,000 photographs. The aim is that once the batch upload is complete, volunteers will use searches or surf through larger categories to find images of interest and add detailed categories to the images which best illustrate the topic. In this way there is a default type of curation over time, with less educationally valuable images ending up with the fewest matches in searches.

As an exception, one sub-category is added to images: a category of the type "Portable Antiquities Scheme, <institution>", where the abbreviation for the identifying institution is substituted. This makes it easier to visually surf through images related by the region of their findspot, as the identifying institution is normally approached only to identify local finds. For example "CORN" covers the Cornwall region, with the institution being the Royal Institution of Cornwall. Note that abbreviations beginning "FA" are associated with the named assessor rather than a region, so "FAIL" is finds identified by Ian Leins, who specializes in ancient coin identification nationally; hence Category:Portable Antiquities Scheme, FAIL is populated with photographs of ancient coins.
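The sub-category rule can be sketched as a small helper. This is an illustration only: the function name and the returned tuple are hypothetical and are not taken from the upload script.

```python
def institution_category(institution):
    """Return (category title, is_assessor) for a PAS institution code."""
    code = institution.strip()
    # Codes beginning "FA" identify a named assessor rather than a region.
    is_assessor = code.startswith("FA")
    return "Category:Portable Antiquities Scheme, " + code, is_assessor


print(institution_category("CORN"))
print(institution_category("FAIL"))
```

The second value is only a hint for anyone browsing by region; the category name itself is built the same way in both cases.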

The quickest way to categorize a few hundred images at a time is to run a relevant search, such as this search for Louis XIV silver artefacts, then use the Cat-a-lot tool to quickly select which images you would like in a specific category, such as Category:Coins of Louis XIV of France. If there are more matches than images to exclude, you can do a 'select all', then deselect the non-matches. Take the time to examine the category: it may have better sub-categories to apply to the images, or already be flooded and in need of more diffusion or sub-categorization. In practice, when categories are reaching thousands of images, they should be considered over-full unless they are an intentional 'bucket category'.

Crops

Roman gold intaglio, 100-200 AD, Essex. Photograph cropped to remove measurement marker.
  • Rulers - the majority of photographs of artefacts have measurement rulers or marks in the photograph and some have identity labels. These can be safely cropped off the photograph to provide a better focus on the artefact as the original remains in the file history and can be clicked on within the Commons image page, by anyone wanting to refer to it. The built-in Commons CropTool can provide a lossless version.
  • Detail crops - if a section of the photograph would be useful, such as the front main view of a figurine where the original photograph has several views, then a crop can be created as a new file for reusers. Again the standard CropTool can do this, you just need to select the option to create as a new file. For coin images, it is normally best to keep both the obverse (front) and reverse (back) in the same image rather than separating them.

Batch upload project

17th century rowel (of a horse rider's spur), found in Cornwall.

For the original discussion that started this upload project refer to User talk:Fæ/2017#Portable antiquities. The upload was suggested by Andy Mabbett and executed by Fæ with the benefit of reusing upload techniques from other projects.

The upload relies on the PAS database being available as JSON records, which are visible as links on every catalogue page. The batch upload runs through the entire set of images using URLs like https://finds.org.uk/database/images/index/page/8, which as a default returns 18 images per page. Each unique image ID is then used to pull the JSON metadata for the image. The image metadata has title and label fields (which appear to be consistently identical), but no long description. Where a main record can be found using the findID, there is an attempt to improve the description field by using the main record. One disadvantage is that the main record description may refer to several objects within one find, and so many images may end up sharing this same description; however, this appears to be a relatively rare exception.

From 2017-01-23, the finds site has been whitelisted for upload-from-url, phab:T155844. This avoids having to upload from a local client version relying on home bandwidth and the upload rate can be significantly faster.


Licenses are found against images under the metadata tag "license" and any attribution is defined under "imagerights". Tested for licenses are:

  1. Attribution-ShareAlike License -> {{cc-by-sa-2.0}} or {{cc-by-sa-4.0}}
  2. Attribution License -> {{cc-by-2.0}}

From 2020-11-11 the version number of the license is tested before upload.

The CC version numbers are as given in the database links. Any licenses not matching the precise text above are rejected and the image skipped for upload, such as "All Rights Reserved" or "Attribution-NonCommercial-ShareAlike License". Where attributions do not exist, a default of "The Portable Antiquities Scheme/The Trustees of the British Museum" is used. Of the files uploaded to Commons, around ¼ are cc-by and ¾ are cc-by-sa.
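A minimal sketch of the license test described above, assuming the image metadata has already been parsed into a dictionary and that the version string has been read separately from the image web page. The names `LICENSE_TEMPLATES` and `commons_license` are hypothetical, not the names used in the upload script.

```python
DEFAULT_ATTRIBUTION = "The Portable Antiquities Scheme/The Trustees of the British Museum"

# Exact PAS license text mapped to the Commons template for each accepted version.
LICENSE_TEMPLATES = {
    "Attribution-ShareAlike License": {"2.0": "{{cc-by-sa-2.0}}", "4.0": "{{cc-by-sa-4.0}}"},
    "Attribution License": {"2.0": "{{cc-by-2.0}}"},
}


def commons_license(image_meta, version):
    """Return (template, attribution), or None if the image must be skipped."""
    template = LICENSE_TEMPLATES.get(image_meta.get("license"), {}).get(version)
    if template is None:
        # Anything not matching the precise text is rejected,
        # e.g. "All Rights Reserved" or a NonCommercial license.
        return None
    attribution = image_meta.get("imagerights") or DEFAULT_ATTRIBUTION
    return template, attribution


print(commons_license({"license": "Attribution License", "imagerights": None}, "2.0"))
print(commons_license({"license": "All Rights Reserved"}, "2.0"))
```

Using an exact-match lookup rather than a substring test means "Attribution-NonCommercial-ShareAlike License" cannot be mistaken for an accepted license.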

Since the initial uploads the general license on PAS has changed to version 4.0 licensing. However:

  • The v4 license stated on the artefact record may not apply to the image(s) which display a generic icon that can link to v2 licenses.
  • The license version is not visible on the PAS JSON records, which only state a license in the form "license":"Attribution License". However the web page display of the image may include a link to the v2 license.
  • Confusingly the web site footer includes a cc-by license v3, which may be there by accident rather than design.
  • The pattern of changes appears to show that all cc-by-sa licenses on the specific PAS image pages are version 4.0, and all cc-by licenses are version 2.0. This may be an oversight rather than a deliberate choice.

Together this means that to update the license in a technically correct manner, we have to access the image display web page and check the icon being used under "Image meta data" for the specific "Creative commons license". For example, an image page may give cc-by v2 while the artefact page shows the image with a "CC License" of cc-by v2 but a generic "Image use policy" which links to cc-by v4 with the caveat of "unless stated otherwise".

In addition to using webarchive links for the image page, a housekeeping task is planned to update the 'permissions' parameter to state both the license and its version with a date of that verification. Where the license template differs it can be changed at the same time. Example 1 Example 2. Files which have yet to have a verification check shown can be listed with incategory:Portable_Antiquities_Scheme -insource:/\(verified/ -hastemplate:ISOdate

Why this should reassure license reviewers
The automated task is checking the PAS displayed website image for precisely the source text of "Creative commons license:" followed by "by-sa/4.0" or "by/2.0" in the hyperlink. It is then compared to the existing uploaded Commons image page to see if the licenses match or not, and adds a verified date and if needed the corrected template. As explained above, the pattern appears that only the two options of by-sa 4.0 or by 2.0 are in use at the time of running this housekeeping task. Though PAS may change the default license in the future, the verification date and edit history should give a sufficient credible audit trail to satisfy the curious.
Code snippet, where 'html' is the image web page from PAS and 'p' is the matching Commons ImagePage
	# Assumes 'import re' and 'from colorama import Fore'; 'date', 'proj', 'count' and 'title' come from the surrounding loop.
	# License path taken from the hyperlink after the label, e.g. 'by-sa/4.0'
	lic = html.split('Creative commons license: ')[1][:200].split('/licenses/')[1].split('/"')[0]
	# Full license URL (not used further in this snippet)
	llic = html.split('Creative commons license: ')[1].split('"')[1].split('"')[0]
	if lic in ['by-sa/4.0', 'by/2.0']:
		phtml = p.get()
		# Current permission line (not used further in this snippet)
		line = phtml.split("permission = ")[1].split('\n')[0]
		if lic == 'by-sa/4.0':
			# Only touch pages that still carry the v2.0 template
			if not re.search(r"\{cc-by-sa-4\.0", phtml, flags=re.I) and re.search(r"c-by-sa-2\.0", phtml):
				phtml = re.sub(r"c-by-sa-2\.0", "c-by-sa-4.0", phtml)
				phtml = re.sub(r"(permission = ).*\n", r"\1" + "Attribution-ShareAlike License version 4.0 (verified " + date + ")\n", phtml)
				action = proj + "Update license cc-by-sa-2.0 to cc-by-sa-4.0 and verified date"
				print(Fore.CYAN, count, Fore.GREEN + title, Fore.WHITE)
				p.put(phtml, summary=action)
		elif lic == 'by/2.0':
			# Template is already correct; add the verification date only once
			if not re.search(r"2\.0 \(verified", phtml) and re.search(r"\{cc-by-2\.0", phtml, flags=re.I):
				phtml = re.sub(r"(permission = ).*\n", r"\1" + "Attribution License version 2.0 (verified " + date + ")\n", phtml)
				action = proj + "Add license cc-by-2.0 verified date"
				print(Fore.CYAN, count, Fore.GREEN + title, Fore.WHITE)
				p.put(phtml, summary=action)

JSON mapping

Where a geolocation is included (about half of all images uploaded in this project), an automatic link uses the WikiMap tool to show related finds. Brooch (FindID 90253) is highlighted among a cluster of 396 photos.

This table shows how PAS JSON metadata is mapped to parameters in the standard Commons templates {{Photograph}} and, when relevant, {{Object location}}.

results (i.e. image gallery)
 - broadperiod → date (default)
 - title       → title
 - findID      → accession number AND (filename = <title> + (FindID <findID>))
 # post-2018 'findIdentifier' is used
 - county      → depicted place
 - old_findID  → accession number
 - filename    → accession number
 - institution → [[Category:Portable Antiquities Scheme, <institution>]]
image
 - id          → imageID (1)
 - filename    → accession number
 - label       → description (default)
 - imagerights → author
 - mimetype    → (internal type checks)
 - fullname    → author
 - license     → permission
record
 - description → description
 - daterange   → date
 - numdate1    → date
 - numdate2    → date
 - centreLat   → object location - lat
 - centreLon   → object location - lon

1. imageID used only where titles are non-unique for multiple photographs of the artefact.

Where a field may be 'null', there is a fallback to default values. If 'daterange' exists, this overrides the 'numdate' fields. Where numdate2 does not exist or equals numdate1, the date falls back to numdate1 on the presumption that the specific year is identified. If centreLat does not exist then {{Object location}} is skipped; note that for some finds the location is suppressed from public view on the PAS database and so will not be used on Commons.
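The fallback rules can be sketched as follows, assuming a record has been parsed into a dictionary. The function names are hypothetical, and the " to " format used when numdate1 and numdate2 differ is an assumption, as the output format for a range is not stated here.

```python
def resolve_date(record, default=None):
    """Choose a date string: daterange first, then numdate1/numdate2, then a default."""
    daterange = record.get("daterange")
    if daterange:
        return str(daterange)  # daterange overrides the numdate fields
    first, second = record.get("numdate1"), record.get("numdate2")
    if first is None:
        return default  # e.g. the broadperiod from the gallery record
    if second is None or second == first:
        return str(first)  # presumed to be a specific year
    return "{} to {}".format(first, second)  # assumed range format


def has_location(record):
    """{{Object location}} is skipped when centreLat is missing or suppressed."""
    return record.get("centreLat") not in (None, "", "None")


print(resolve_date({"numdate1": 1200, "numdate2": 1800}))
print(resolve_date({}, default="MEDIEVAL"))
```

Treating the literal string "None" as missing reflects the unexpected centreLat values noted under known bugs and features.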

Galleries for multiple views

Taunton civil war hoard, c.1645

As of 2020, gallery creation needs to be updated to match JSON changes at finds.org.uk, so is not currently working. The revised process of navigating from a list of artefacts to images does provide a list of associated images, so this list as pure text is added to the other_versions parameter to make later housekeeping easier. Example File:2018T930 (FindID 1014607-1120516).jpg

Where there are multiple images for an artefact, a housekeeping process is adding galleries to each file so that reusers can find and navigate between views of the artefact. This process is deliberately lagging the batch upload process and may be many days later than the original upload.

An example is File:Socketed gouge (profile) (FindID 661462).jpg which has two other views, "convex face" and "concave face", added as the following gallery:

In the case of File:Taunton civil war hoard (FindID 643649).jpg, there are photographs of each coin in the hoard shown in the gallery, making it easy to surf through the collection.

You can find other multiple view galleries with this search.

Where the titles for images are not unique, a separate upload procedure will be applied to generate alternative file names and upload the missing files. Where there is a mix of some repeated titles with others being unique for the artefact, the gallery will be incomplete (there's a limit on how much volunteer time is worth it for rare edge cases!), however they should still be findable through navigating the gallery and checking the what-links-here list. Where apparent duplicates have been uploaded, these exist as separate images on the PAS database and have been uploaded because they are not digitally identical, possibly due to minor image enhancement or changes in the EXIF data. As part of later 'housekeeping', later uploaded missing files include a search link above the gallery which will show all possible matches, including those missed in the other versions gallery, for example File:Roman headstud brooch, close up of decoration (FindID 438056-324489).jpg.

The alternative naming scheme for multiple files with duplicated titles for an artefact is:

File:<title> (FindID <findID>-<imageID>).jpg

Example: File:Post-medieval crotal bell (FindID 660227-501403).jpg
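Both naming schemes can be expressed in one helper. This is a sketch with a hypothetical function name; it does not handle characters that are not permitted in Commons titles.

```python
def commons_filename(title, find_id, image_id=None, extension="jpg"):
    """Build a Commons file name; image_id is added only for duplicated titles."""
    if image_id is None:
        suffix = str(find_id)
    else:
        suffix = "{}-{}".format(find_id, image_id)
    return "File:{} (FindID {}).{}".format(title, suffix, extension)


print(commons_filename("Taunton civil war hoard", 643649))
print(commons_filename("Post-medieval crotal bell", 660227, 501403))
```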

The gallery process also identifies missing files and attempts to upload them; a number of drawings in TIFF with titles matching main photographs were identified this way, see search.

Slurped photo of Roman brooch fragment. Recorded on the Finds database on 2020-10-26, automatically uploaded to Commons 5 days later.

Slurping


By using a database search using the advanced search feature of updatedAfter, a slurping, or refreshing, task looks back over the previous week to see if new photographs can be uploaded to the collection. On the Finds website the most recent date of change is displayed in the "Audit data" section of an artefact record. This is unrelated to the date the photograph was taken or added to the Finds database.

The slurp is limited to looking for photographs new to Commons, which may include extra photographs or sketches of artefacts already on Commons. Updates to the curator's text are not applied to existing files, though new photographs will have the latest artefact information. Records may show dates significantly earlier than a week ago, which may occur for photographs newly available to the public after being held privately as drafts, or for older records that were missed at the time but have recent updates to the text. As a benchmark, the previous week at the time of first run showed 902 updated records to check.
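A sketch of how the one-week look-back could be turned into a search URL, using the updatedAfter and thumbnail/1 URL pattern given in the design notes. The function name and the `days_back` parameter are hypothetical.

```python
from datetime import date, timedelta

SEARCH_BASE = "https://finds.org.uk/database/search/results"


def slurp_url(today, days_back=7, page=1):
    """URL listing artefact records with images updated in the look-back window."""
    since = today - timedelta(days=days_back)
    return "{}/updatedAfter/{}/thumbnail/1/page/{}/format/json".format(
        SEARCH_BASE, since.isoformat(), page
    )


print(slurp_url(date(2020, 11, 6)))
```

Passing the current date in as an argument, rather than reading the clock inside the function, keeps the helper easy to rerun for a missed week.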

Wayback Machine


Initially as a retrospective housekeeping task, now as a pre-upload job, archive links to the Internet Archive Wayback Machine are automatically added to the image page. Where not already archived, the archive pages are created at the Internet Archive before the upload to Commons. This batch upload is the first where archive links have been added to support verification and avoid reliance on manual license reviews.

There is potential for the attribution text or license to change on the Finds database, or for records to be removed from public view. It is also possible that Finds itself may have a limited life and go offline or be replaced by another system. Having a record at the Wayback Machine gives a much greater chance of long-term verification, given that the Internet Archive has a 100-year plan for staying available.
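As an illustration of what the archive links look like, the sketch below builds the two kinds of Wayback Machine URL involved: one that requests a capture and one that points at a capture taken at a given time. The URL shapes follow the publicly visible Wayback Machine conventions as understood here; the function names are hypothetical and this is not the code used by the upload.

```python
from datetime import datetime

WAYBACK = "https://web.archive.org"


def wayback_save_url(source_url):
    """URL that asks the Wayback Machine to capture source_url."""
    return "{}/save/{}".format(WAYBACK, source_url)


def wayback_snapshot_url(source_url, captured):
    """URL of a capture, addressed by a 14-digit YYYYMMDDhhmmss timestamp."""
    return "{}/web/{}/{}".format(WAYBACK, captured.strftime("%Y%m%d%H%M%S"), source_url)


image_page = "https://finds.org.uk/database/images/image/id/502677"
print(wayback_save_url(image_page))
print(wayback_snapshot_url(image_page, datetime(2020, 10, 31, 12, 0, 0)))
```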

Known bugs and features

X-ray of 4th century Roman bowl from Irchester. Metadata of title and description not available for photograph, so replacements inherited from the find record.

These were discovered when pulling records from the PAS database. Some have caused the batch upload to Commons process to fall over, and other users of the external database may benefit from planning to address these bugs and features. It is worth keeping in mind that contributors to the database are not in a central institution and include records created by non-professionals, consequently inconsistency in usage is to be expected.

Access issues
  • Category:PAS images that no longer display at source contains images which may have been deleted or renamed at source. An example is given in the gallery below, which no longer exists under its image number at PAS, but an alternative has been used where it is the same image but with brightness increased. Presumably it was easier to create a new image and delete the old one rather than overwrite it.
  • Category:Uploads with finds.org.uk access error contains images where the source record appears to have been hidden from view for unknown reasons. In many cases the access error has been generated because the naming requirements of the image link changed at PAS to add "/recordtype/artefacts" to the URL and this change has apparently not been backwards compatible.
Bugs
  1. Wrong format - https://finds.org.uk/database/artefacts/record/id/67545/format/json returns XML rather than JSON. It is unknown how many records may be affected, the presumption is that this is limited to a few early records on the Finds database.
  2. Missing (or TIFFs and PNGs?) - https://finds.org.uk/database/artefacts/record/id/52689 Mysteriously referenced but apparently missing images. There are quite a few of these, though probably significantly fewer than 0.5% of the database. TIFF and PNG drawings do not seem to be viewable on the PAS website, but do appear as thumbnails on record views; this may also result in unexpected errors. It seems likely that the website has been designed to cope with viewing jpeg files but is inconsistent for displaying other formats even though these are on the database. For example https://finds.org.uk/database/ajax/download/id/659837 should download a PNG file, but results in an inconclusive text error message. For links to missing TIFFs see Petscan report for image pages with gallery cross-links to missing images.
  3. Images wrongly linked to artefacts - https://finds.org.uk/database/artefacts/record/id/515346, example showing medieval manuscript images linked to a record for a thimble. This is hopefully a rare error and probably down to human error, but there seems to be no automatic way for the database to ensure that incorrect record numbers are not applied to images.
  4. Blank image records, possibly where images have been deleted from the database. Example
Features or Known Errors
  • Drawings are not indicated in the PAS records. Given the range of ways the images may be scanned, it is not possible to automatically or reliably identify which images are photographs versus drawings. Because many artefacts are photographed on a pure white background, the conventional ways of testing for whitespace or single colour regions are especially unreliable.
  • BC glitches - Some BC dates have been entered in the PAS database as positive values in numdate1, numdate2 (they should be entered as negatives). If greater than 2016 these are being logically corrected as being presumed to be negative, but some false dates may be used if between 2016 BC and 1 BC.
  • Broadperiod missing - A few entries are found with no 'broadperiod' set in the gallery record, these are being filled in if discovered in the image metadata.
  • Access failures - Some records are unavailable in the public view, even though referenced in the image list, this easily causes unexpected read failures. Example https://finds.org.uk/database/artefacts/record/id/621448.
  • numdate2 missing - In some cases the artefact has been identified with a date range, but though numdate1 exists, numdate2 is missing. If daterange is also missing, this may give a false impression of accuracy.
  • centreLat unexpected values - some entries have been typed in as "None" rather than being left blank (i.e. null values). Though these have been trapped, it appears that these fields are not restricted to appropriate values.
  • Images may have duplicate entries - this is probably a rare error, example https://finds.org.uk/database/ajax/download/id/502678 is duplicated as https://finds.org.uk/database/images/image/id/502677.
  • Blacklisted links - Short links like bit.ly are automatically trapped by the Commons API and rejected. Though there is a work-around for bit.ly, there may be other blacklisted sites used in the descriptions which have stopped uploads. E.g. File:LEIC-734E08_late_medieval_gold_and_diamond_finger_ring_(FindID_564277).jpg
  • Photo updates - it is unknown how often image files, such as sketches, might be changed. There is no obvious way of finding these without testing each file, which would probably be overly expensive in processing and programming time.
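The BC correction described in the list above amounts to a single comparison. In this sketch the function name and the constant name are hypothetical; the 2016 cut-off is the one stated above.

```python
CUTOFF_YEAR = 2016


def correct_bc(numdate):
    """Flip implausibly large positive years to negative (BC) values."""
    if numdate is not None and numdate > CUTOFF_YEAR:
        return -numdate
    return numdate


print(correct_bc(2500))
print(correct_bc(50))
```

The second call shows the known limitation: a value such as 50 is left unchanged, so a date wrongly entered as positive between 2016 BC and 1 BC cannot be detected this way.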

Design notes


In 2018 the upload process broke, and the shortage of volunteer time to look at it again meant a hiatus of around 2 years. The cause of the fatal breakage was the apparent abandonment of the finds database image index, so the primary way that images were being uploaded ceased working in March 2018. For example https://finds.org.uk/database/images/index/createdAfter/2018-03-10/format/json returns no results, but will do for dates just a few days earlier. Without this index it was no longer possible to find prospective images to upload and then deduce the finds records associated with them; instead a major rewrite was needed to go via finds records, then work out which images apply. The new system indicates a single 'thumbnail', with images then abstracted from that image record.

The JSON structure at finds.org.uk had changed, and as there is no obvious set of change notes for reusers, this means a poke-it-and-see approach, which can be time-consuming and may also lead to misunderstandings about the structure and to poor quality metadata in uploads on Commons.

Here are some breadcrumbs, for the tired hobbyist programmer.

A simple JSON based list of artefacts with images is found using https://finds.org.uk/database/search/results/thumbnail/1/page/1/format/json where page/1 can be incremented to churn through everything on the database and thumbnail/1 flags that only artefact records with images are listed. Additional qualifiers can be added if list subsets based on date, location or type are needed. This may include advanced searches like https://finds.org.uk/database/search/results/updatedAfter/2020-10-30/createdBefore/2020-10-01/thumbnail/1/page/1/format/json which (at the time of writing) shows that 27 entries that were created on the database before 2020-10-01 were updated after that date.

Parsing this data requires a robust approach. The data is entered manually, or pasted from finds spreadsheets and reports, by various types of user across the country, and there appears to be little validation against a set schema, so human 'errors' are likely. These can include dates entered inconsistently, text with HTML code embedded, and fields that are haphazardly blank. In particular, finding text on which to base a Commons filename for the image may be problematic, not least because 'title' seems to have stopped being used as a field in the 2018 changes, possibly replaced by 'label'.
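As an example of the defensive parsing this calls for, the sketch below tidies a description that may contain embedded HTML tags and entities such as &nbsp; or &#39;, and tolerates blank fields. The function name is hypothetical, and a regular expression is only a rough way to strip tags; a real HTML parser would be more robust.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")


def clean_text(raw):
    """Strip tags, decode entities and normalise spaces; blank input gives ''."""
    if not raw:
        return ""
    text = TAG.sub("", raw)             # drop embedded tags first
    text = html.unescape(text)          # &nbsp; becomes U+00A0, &#39; becomes '
    text = text.replace("\u00a0", " ")  # non-breaking space to plain space
    return re.sub(r"[ \t]+", " ", text).strip()


print(clean_text("The ring is&nbsp;sub-hexagonal&nbsp;in cross-section."))
```

Line breaks are deliberately left alone so that paragraph structure in the curator's text survives.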

The list looks like:

{
	"meta":{
		"currentPage":1,
		"totalResults":27,
		"resultsPerPage":20
	},
	"results":[ #list of maximum 20 entries here
	]
}

# Example entry

		{
			"findIdentifier":"finds-1010030",
			"id":1010030,
			"old_findID":"WILT-D27A74",
			"objecttype":"RING",
			"broadperiod":"MEDIEVAL",
			"description":"A complete copper alloy Medieval to Post-Medieval ring, dating to the period c.AD 1200-1800. The ring is&nbsp;sub-hexagonal&nbsp;in cross-section. It has a short, slightly protruding flattened outward facing section on the outer edge&nbsp;of the hoop.&nbsp;It has a sharp inner edge, which may be a casting seam, with chamferring above and below this seam, and slight indications of the same on the outer edge. There is a dark green-brown patina on the surface.\n\nCf. BERK-45ACFB on the PAS database which states:&nbsp;Geake (2001, Finds Recording Guide, p.74) suggests that the broadly hexagonal cross-section which this ring possesses probably indicates a medieval date, and that they are often suggested to be harness rings.\n\nDimensions: 19mm in diameter; &#39;bezel&#39; 8mm in length.&nbsp;Weight 3.3g.",
			"periodFrom":36,
			"periodTo":29,
			"fromdate":1200,
			"todate":1800,
			"workflow":4,
			"institution":"WILT",
			"created":"2020-08-31T17:39:03Z",
			"updated":"2020-10-30T11:08:17Z",
			"weight":3.3,
			"secuid":"PAS5F4D27A700142F",
			"diameter":19,
			"quantity":1,
			"material":7,
			"subsequentAction":"1",
			"completeness":4,
			"manufacture":1,
			"discovery":1,
			"recorderID":"PAS544E4F6B0011B0",
			"identifierID":"PAS544E4F6B0011B0",
			"createdBy":27440,
			"regionID":41427,
			"countyID":43925,
			"parishID":44142,
			"districtID":43925,
			"county":"Wiltshire",
			"district":"Wiltshire",
			"parish":"Salisbury",
			"fourFigure":"SU1332",
			"fourFigureLat":51.08717029,
			"fourFigureLon":-1.81576919,
			"what3words":"firming.warms.union",
			"precision":10,
			"findspotcode":"WILT-D281ED",
			"creator":"Jane Hanbidge",
			"updatedBy":"Denise Wilding",
			"materialTerm":"Copper alloy",
			"primaryMaterialBM":10627,
			"manufactureTerm":"Cast",
			"completenessTerm":"Complete",
			"periodFromName":"POST MEDIEVAL",
			"periodFromBM":"x41047",
			"periodToName":"MEDIEVAL",
			"periodToBM":"x14221",
			"broadperiodBM":"x14221",
			"discoveryMethod":"Metal detector",
			"subsequentActionTerm":"Returned to finder",
			"filename":"WILTD27A74.jpg",
			"thumbnail":1120500,
			"imagedir":"images\/denisewilding\/",
			"recorder":"Jane Hanbidge",
			"timestamp":"2020-10-30T11:08:17.418Z"
		},
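A sketch of reading one page of this list, assuming the response has the "meta" and "results" structure shown above. It works out how many pages need to be fetched and pulls out the fields that matter most for Commons, using .get() throughout because fields may be missing. The function names and the cut-down sample are hypothetical.

```python
import json
import math


def page_count(meta):
    """Number of result pages implied by the meta block."""
    per_page = meta.get("resultsPerPage") or 20
    return math.ceil(meta.get("totalResults", 0) / per_page)


def summarise(entry):
    """Keep only the fields needed to locate the record and its images."""
    keys = ("id", "findIdentifier", "old_findID", "institution", "thumbnail")
    return {key: entry.get(key) for key in keys}


sample = json.loads(
    '{"meta": {"currentPage": 1, "totalResults": 27, "resultsPerPage": 20},'
    ' "results": [{"findIdentifier": "finds-1010030", "id": 1010030,'
    ' "old_findID": "WILT-D27A74", "institution": "WILT", "thumbnail": 1120500}]}'
)
print(page_count(sample["meta"]))
print([summarise(entry) for entry in sample["results"]])
```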

The key field of high impact for Commons is findIdentifier, previously shown as findID and apparently identical to id apart from a post-2018 change that adds "finds-" to the identifier.

Also of interest is the geolocation data, which is used when available to add a {{location}}. The use of what3words is interesting, but it's not known how this is generated. workflow is how the processing or release of records is tracked, with draft stages (or records potentially withheld for security reasons) hidden from public view. Confusingly, photographs may be public even when artefact records are behind this internal firewall; such records show a redirect page of the format https://finds.org.uk/database/artefacts/unavailable/id/877320. If a better value for title cannot be deduced, the objecttype may be the default short title for the filename. As a design choice, the many additional optional fields like weight and diameter are not used; instead Commons is highly reliant on the description. The purpose of uploading the images to Commons is to encourage reuse, especially on Wikipedia projects; however, anyone wanting to use the data for serious research should always go to the finds database itself, where there may be recent amendments, additional images and better correlation with related artefacts.

In the detailed example above we can see the artefact record by using the number in the findIdentifier to create https://finds.org.uk/database/artefacts/record/id/1010030 and the JSON is shown by adding format/json to the URL.
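Building these URLs can be sketched as below. The helper names are hypothetical; the "finds-" prefix is stripped so that both the older numeric form and the post-2018 form of the identifier are accepted.

```python
RECORD_BASE = "https://finds.org.uk/database/artefacts/record/id/"


def record_id(find_identifier):
    """Numeric record id from either 1010030 or 'finds-1010030'."""
    text = str(find_identifier)
    if text.startswith("finds-"):
        text = text[len("finds-"):]
    return int(text)


def record_url(find_identifier, as_json=False):
    """Artefact record URL, optionally the JSON view."""
    url = RECORD_BASE + str(record_id(find_identifier))
    return url + "/format/json" if as_json else url


print(record_url("finds-1010030"))
print(record_url("finds-1010030", as_json=True))
```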

The critical breaking change is that there is no apparent way to navigate from a FindID to a list of multiple images just using JSON (or XML) records. Consequently this means scraping the HTML page to find all image links visible in a find record, then using this list to generate the relevant JSON records.
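A minimal sketch of the scraping step, assuming the find record page links to its images with paths of the form /database/images/image/id/<number>, as in the image URLs quoted earlier on this page. The real page markup may differ, and the function names and sample HTML here are hypothetical.

```python
import re

# Assumed link pattern; check it against the live record page before relying on it.
IMAGE_LINK = re.compile(r"/database/images/image/id/(\d+)")


def image_ids(record_html):
    """Unique image ids linked from a find record page, in order of appearance."""
    seen = []
    for match in IMAGE_LINK.finditer(record_html):
        image_id = int(match.group(1))
        if image_id not in seen:
            seen.append(image_id)
    return seen


def image_page_url(image_id):
    """Image page URL for one id."""
    return "https://finds.org.uk/database/images/image/id/{}".format(image_id)


sample_html = (
    '<a href="/database/images/image/id/1120500">view</a>'
    '<a href="/database/images/image/id/1120516">view</a>'
    '<a href="/database/images/image/id/1120500">zoom</a>'
)
print([image_page_url(i) for i in image_ids(sample_html)])
```

Deduplicating while preserving order matters because the same image is often linked more than once on a page, for example from a thumbnail and from a caption.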