User:Fæ/Project list/Fleuron
This batch upload project by Fæ populates Category:Fleurons by year. Questions can be raised on the project talk page.
Fleuron is from Old French for 'flower', and describes printers' decorative ornaments in documents, often indicating breaks in the text; the term also covers decorative flower-style designs that form part of architectural details. Fleurons in common use by 18th century printers were the precursor to dingbats and other ornamental icons such as the floral heart ❦.
Introduction
Highlighted by Pigsonthewing, the Fleuron database, https://fleuron.lib.cam.ac.uk/about, is an extensive set of automatically identified images of Fleurons in 18th century books. The database references books using the ESTCID or English Short Title Catalogue identity, a system supported by the British Library and other major institutions for books published before 1800. Refer to http://estc.ucr.edu/.
The original source of the book scans is ECCO - http://gale.cengage.co.uk/product-highlights/history/eighteenth-century-collections-online.aspx - which has 136,000+ books in its catalogue. Unfortunately, though all the books are public domain, the database itself is not generally available to the public and is behind paywalls. The contract with Gale would stop any direct data mining of images for the public benefit, for example to release the original book scans.
The Fleuron database has no paywall, or other restrictions, and is a project run by Fitzwilliam College and centralized technology departments within the University of Cambridge. The images were created by taking the original full page ECCO scans, and using image recognition methods to identify potential Fleurons on a page, isolate these from surrounding text, and enhance the image to remove the background page colour or markings.
Batch upload
The Commons batch upload sets an arbitrary cut-off on image size at 600px x 100px. This is still a very large number of images, but avoids the many simple patterned horizontal lines, which would probably be too repetitive to be realistically useful for Commons.
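As a rough sketch of the size filter, the check below reads the cut-off as a minimum width of 600px and a minimum height of 100px. Both that reading and the function name are assumptions for illustration; they are not taken from the upload script.

```python
MIN_WIDTH = 600   # pixels; from the cut-off stated above
MIN_HEIGHT = 100  # pixels; assumed here to be a minimum height


def passes_cutoff(width: int, height: int) -> bool:
    """Return True if an image is large enough to be uploaded."""
    return width >= MIN_WIDTH and height >= MIN_HEIGHT
```

Under this reading, a thin 800px x 20px ruled line would be skipped, while an 800px x 150px ornament would be kept.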
Automated categorization
There is no bucket category; however, all files from books with an identified date should be added to "<year> Fleurons" (see Category:Fleurons by year). Images with no date, or a more complex date, will be added to Category:Fleurons, though after a run of around 1,000 images, none was found without a simple date.
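The category rule can be sketched as follows. What counts as a "simple" date is not defined on this page, so the four-digit test here is an assumption, as is the function name.

```python
import re
from typing import Optional


def fleuron_category(date_text: Optional[str]) -> str:
    """Pick the Commons category for an image from its book's date field."""
    cleaned = (date_text or "").strip()
    # Assumption: a "simple" date is exactly four digits, e.g. "1748".
    if re.fullmatch(r"\d{4}", cleaned):
        return f"Category:{cleaned} Fleurons"
    # No date, or a more complex one such as "1710-1712".
    return "Category:Fleurons"
```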
Manual categorization
Categories for fleurons are, at the time of initial batch upload, quite limited. This project will probably encourage significant growth in category choices for different fleuron types, such as floral embellishments, leaf ornaments, repeating border patterns, long decorative scrolls, etc.
How to do it
As the Fleuron uploads have the English Short Title Catalogue number in the filename, it is straightforward to find all uploaded images from a given book. If the book has its own category, it can be mass added to all the files using Help:Cat-a-lot, or files can be removed from the Fleurons category and added to a book category using VisualFileChange.
- Speedy deleting
In the example of "A continuation of the Practical register, in two parts", 1710, source, all images were of fragments of text rather than any fleurons or other useful small images. A search for the ESTCID of Fleuron T133310 produced a list of 17 files, which Cat-a-lot can then help with; if switched on in user preferences, it should be in the bottom right of any Commons search view. Selecting all, then adding to Category:Uploads by Fæ needing speedy deletion was an easy way to mark them all for deletion.
- Moving
In the example of "A complete history of drugs" by Pomet, this was a book published in 1748 with many illustrations of animals:
- There was no pre-existing category for this book, but searching for the ESTCID of T031045 gave me a full list of 38 images.
- Then using Cat-a-lot on that same search list, I added the yet-to-be-created category of Category:A complete history of drugs (1748) to all the files as a red-link.
- Keeping the search list in my browser, but clicking on Perform batch task in Tools, put me into VisualFileChange.
- Using a Regex search of
/\n\[\[Category:1748 Fleurons\]\]/
it was possible to replace the Fleuron category with nothing, i.e. removing the images from that category. An easier visual way of doing the same thing is to go back to the Fleuron category and use the option in Cat-a-lot to 'remove from this category', though this does mean clicking on each image to be removed.
- Picking one of the images, I then opened the new category for the book, and added some parent categories in order to create it.
- Of the 38 images, one was actually fleuron-like, so I re-added it to Category:1748 Fleurons.
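For anyone wanting to test the regex replacement from the steps above outside VisualFileChange, the same pattern can be tried in Python. The sample page text is hypothetical and only there to show the effect.

```python
import re

# Same pattern as used in VisualFileChange, written as a Python raw string.
CATEGORY_LINE = re.compile(r"\n\[\[Category:1748 Fleurons\]\]")


def remove_fleuron_category(wikitext: str) -> str:
    """Remove the 1748 Fleurons category line from a file page's wikitext."""
    return CATEGORY_LINE.sub("", wikitext)


# Hypothetical page text, for illustration only.
sample = (
    "== Summary ==\n"
    "Hypothetical page text\n"
    "[[Category:1748 Fleurons]]\n"
    "[[Category:A complete history of drugs (1748)]]"
)
print(remove_fleuron_category(sample))
```

Because the pattern starts with a newline, the whole category line is removed rather than leaving a blank line behind.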
Taking the example a step further, a scan of the full book was found on archive.org. This was downloaded locally, then uploaded to Commons as a single full PDF version (58 MB, taking about 10 minutes to upload) and itself added to the newly created category for the book. File:A_complete_history_of_drugs_T031045.pdf
Naming
The naming scheme is:
<shortened book title> Fleuron <ESTCID>-<image number>.png
- The shortened book title uses the book title and if over 100 characters attempts to intelligently chop it down in size by looking at punctuation.
- ESTCID provides the easiest way of tracking an image back to the Fleuron database. The full image identifier could be used instead, as it is unique, but these identifiers are very long strings of digits and would be ungainly in a filename. The full image identifier is given in the source link back to the image on the Commons image page.
- Image number is the sequence number of the image on the Fleuron listing page of images from a book. These may be on multiple results pages, but the list is pre-assembled and sorted, effectively putting the images in order by the full image identifier. As not all images are uploaded, due to the cut-off size, these are not a full sequence on Commons, but are unique when used with the ESTCID.
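The naming scheme can be sketched as below. The exact rule used to shorten titles is not given here, so cutting at the last punctuation mark within 100 characters is an assumption, as are the function names and the lack of zero-padding on the image number.

```python
import re

MAX_TITLE_LENGTH = 100  # characters; from the naming notes above


def shorten_title(title: str, limit: int = MAX_TITLE_LENGTH) -> str:
    """Cut a long title back to the last punctuation mark within the limit."""
    title = title.strip()
    if len(title) <= limit:
        return title
    head = title[:limit]
    # Assumed rule: break at the last comma, semicolon, colon or full stop.
    breaks = [m.start() for m in re.finditer(r"[,;:.]", head)]
    if breaks and breaks[-1] > 0:
        return head[:breaks[-1]].rstrip()
    # No usable punctuation: fall back to the last whole word.
    return head.rsplit(" ", 1)[0]


def fleuron_filename(title: str, estcid: str, image_number: int) -> str:
    """Build '<shortened book title> Fleuron <ESTCID>-<image number>.png'."""
    return f"{shorten_title(title)} Fleuron {estcid}-{image_number}.png"
```

For example, `fleuron_filename("A complete history of drugs", "T031045", 12)` gives `A complete history of drugs Fleuron T031045-12.png`.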
Quality
There are issues with quality that will need to be managed. Sticking to a cut-off size helps to avoid flooding the Fleuron categories with small snippets from the text that are unlikely to be interesting. There are many misidentified images counted as Fleurons, including some parts of odd text or titles. These appear to be a small minority, but should be speedy deleted when found, as they are unlikely to be the full image or useful full texts from a page. Some images are technically not Fleurons, but small woodcut illustrations or capital illuminations; in the longer term these should be re-categorized.
A particular problem is books of music scores. The score lines are detected as fleurons, resulting in many false matches from the same document. These are being filtered based on book title matches, starting with the pattern "psalm| ayres| chanting|[-\s]tunes| songs| choir"; the pattern is likely to be added to based on experience as the uploads continue.
The image processing produces high contrast images which, depending on the clarity of the original scan, may be heavily pixelated, though at 600px+ width this is unlikely to make the image unusable for research, categorization or illustration.
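A minimal sketch of the title filter for music books, using the pattern quoted above unchanged. Matching case-insensitively is an assumption, as is the function name.

```python
import re

# Pattern copied from the text above; the leading spaces are part of it.
MUSIC_TITLE = re.compile(
    r"psalm| ayres| chanting|[-\s]tunes| songs| choir",
    re.IGNORECASE,  # assumption: titles are matched without regard to case
)


def looks_like_music_book(title: str) -> bool:
    """Return True if a book title matches the music-score filter."""
    return MUSIC_TITLE.search(title) is not None
```

The leading space or `[-\s]` before most of the words keeps the filter from firing inside longer words, so "Fortunes of war" is not caught by "tunes".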
Due to scanned pages being mostly in books bound on the left of the image, wider images tend to slant upwards from left to right, with the left side being compressed, presumably as the page nearest the binding was at an angle to the camera. These may be digitally corrected at a later date.
Image casebook of non-fleurons which are in-scope
-
This is an illuminated capital "T", in-scope as useful to illustrate capitals and in this case lion and unicorn royal figures.
-
1720 woodcut, in-scope for illustrating woodcuts of this type, as well as the specific biblical scene.
-
In-scope as an illustration of 18th C. crests, specifically for the U. of London.
-
Illustration of otter in Vermont, arguably in-scope but the quality is marginal, potentially delete if it can be replaced by a better scan. [A]
- [A] T086104 is available on IA at https://archive.org/details/adescriptiveske00grahgoog. The scan quality is still poor, but a crop, from original page 142, could provide a softer grayscaled and less pixelated version.
Upload runs
After an initial test run on 7 May 2017, a full run is being executed which runs through books in alphabetical order by title. The initial test included a run through all books starting "Y", which happens to include a few in Welsh. Initial runs did not include "Appearing on Page" values.
Uploads are client-side as the site is not whitelisted. There is no plan to whitelist the site for direct URL uploading as the images are relatively small in file size.
Housekeeping and Image hashing
For other examples of image hash experiments, refer to User:Fæ/Imagehash.
Imagehash tests
This is at an experimental stage, but may be turned into housekeeping such as adding galleries of similar images to each relevant image page. As finding similar images would only be fully useful after categories are populated, this type of housekeeping will be deferred until the batch uploads are finished.
An imagehash is a code created based on visual characteristics of an image. In Python there are some established open source modules for scientific image analysis that can help, though this is fairly processing intensive. The code used for testing hashing on this upload project is ImageHash. This offers four different types of image hash. Based on the images of the same Fleuron being potentially different crops, slightly distorted and slightly rotated, the hashes attempted are the perception hash and the wavelet hash. Processing a category of Fleurons is helped by the images being relatively small in file size, taking around 4 minutes for a 500 image category. Images which tend to be missed as matches include those with extra marks and lines, such as page edges, and those with more than a slight angle of rotation; unfortunately compensating for these differences would result in significant increases of false matches.
Testing variations of how close hashes for different images in this collection need to be to get fairly reliable matches gave some useful heuristics. In the following example the image hash differences are "d1" for wHash and "d2" for pHash, and the filter is:
if (d1<13 and d2<14) or (d1<25 and d2<8) or (d1<5 and d2<19) or (d1<3 and d2<25)
Each image can only be matched in a set of matches once.
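The filter above can be written as a small function. The sketch also includes a helper that counts differing bits between two hashes stored as hexadecimal strings; storing hashes as hex strings and both function names are assumptions for illustration.

```python
def hamming_distance(hex_a: str, hex_b: str) -> int:
    """Number of differing bits between two equal-length hex hash strings."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def is_match(d1: int, d2: int) -> bool:
    """Apply the filter above: d1 is the wHash difference, d2 the pHash one."""
    return (
        (d1 < 13 and d2 < 14)
        or (d1 < 25 and d2 < 8)
        or (d1 < 5 and d2 < 19)
        or (d1 < 3 and d2 < 25)
    )
```

The four clauses trade one difference against the other: a pair that is very close on one hash is allowed to be further apart on the other.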
Image hash experiment for 1714 Fleurons
Images in the gallery below are listed by the primary image with book short code, followed by matches with their wavelet hash and perception hash differences from the primary image. The accuracy appears to be over 95%, and the matching works far better on Fleurons than on other types of image.
Lessons from the experiment:
- An 'adequacy' limit has to be determined. If an error rate of ~3% is not acceptable, the number of useful matches will rapidly diminish, and matching will fail to find images that are fairly obviously similar to the human eye.
- Matches are across books, so matching in the parent year category makes sense. Processing at an even higher level may be possible, though the number of cross-checks grows quadratically with the number of images. If the hash values were created once per image and stored, such as within the Commons image page text, then the processing cost becomes far more manageable and reusable for other tests.
- Example Fleuron match across books:
-
Collection of poems
-
Book on fish and fish-ponds
- Attempting matching provides an element of serendipity, providing an insight into the designs. In the following case the leaves and framing are identical, with the central bird being adapted. This gives an insight to the evolution of the printer's fleurons.
-
Raptor with head right and down.
-
Raptor with head left and up.
Thoughts on future experiments:
- The first experiment presumed there was no value in retesting any previously matched image, on the basis that we are creating an effective Hamming space, where sets of matches act like they are clustered at a node, and the "distance" between points is symmetric: testing A against B is the same as B against A. In addition, if A is in list1 then once B matches A, B can only be in list1 and nowhere else. That sounds good, but in practice when testing a list {A, B, C} it is useful to find out whether image A is closer to B than to C, as this is useful to present to the potential image reuser or categorizer. In practical terms, this may end up returning a list ordered by closeness in image properties to the primary image, which can then be displayed as a gallery of nearest matches on the Commons image page. Given a large set of returns, such as the 8 matches to N053201, a Royal Crest, we may even want to show just the 6 closest matches to the image being displayed from within the set.
- Though it may be worth adding image hashes to the image page text, this means that for larger numbers of comparisons, such as >10,000 rather than <500, we may start having very large numbers of page accesses for both the image and the image page text. It makes sense to save the hash results locally, which also makes running the quadratically growing number of cross-checks less of a processing burden. This means a routine would run through all the Fleuron project images linearly to generate hashes and save them locally; we can then play around with clustering hashes by 'distance' to create (sortable) match lists, and use those to add galleries or categorize, depending on what suits users best.
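As a sketch of the "nearest matches" idea, the function below orders a set of matches by closeness to the primary image and keeps the first six. Using the sum of the wHash and pHash differences as the measure of closeness is an assumption rather than a tested heuristic, and the function name is illustrative only.

```python
from typing import Dict, List, Tuple


def nearest_matches(
    differences: Dict[str, Tuple[int, int]], limit: int = 6
) -> List[str]:
    """Order matched files by closeness to the primary image.

    `differences` maps a file name to its (wHash, pHash) differences
    from the primary image. Closeness is taken as the sum of the two
    differences (an assumption); ties are broken by file name.
    """
    ranked = sorted(differences, key=lambda name: (sum(differences[name]), name))
    return ranked[:limit]
```

The returned list could then be used directly as the order of a gallery on the image page.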
Hamming database
Design notes
- Time
Based on other experiments, if generating image hashes locally (i.e. not on WMF Labs), then we can estimate 2 seconds per hash, regardless of whether this includes more than one type of hash. As the most significant component of the time will be the internet access to download the file, parallel processing may help speed this up, so long as internet connections were via a suitable pool. Past parallel processing jobs have shown that over 2 or 3 parallel connections would be unlikely to make much difference to processing time so long as only one machine is being used.
If the estimate of 2 seconds per image remains valid, then processing 200,000 images would take about 400,000 seconds, which is between four and five days of continuous processing.
- Hamming database
When comparing one image to the rest of the set, rather than comparing one at a time, it is better to think of an associated Hamming space where images are "pre-clustered" using the image hash difference values. This can be very simplistic if all we want to do is see whether the difference between one image and others is zero, but far less obvious if we want to measure other difference values.
For the zero case, we could save the hashes in sets where a selected few bits are the same. As the resulting sets are probably always going to be fairly uniform in size, the choice of which bits in the hash produce the clustering can be arbitrary. Consequently this can be as simple as creating lists of hashes as text files where the first two hexadecimal characters (one byte) match, so the resulting lists are very manageable at 1/256th of the total; a list size that should be convenient to load into memory for any complex analysis rather than relying on multiple file accesses. This means that the task of comparing one hash to all others in our total set is hugely reduced. Similarly, within each 'small' list we can numerically sort the hashes so that extracting matching values becomes computationally trivial.
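A minimal sketch of the zero-difference lookup, keeping the lists in memory rather than in text files. It reads the prefix as the first two hexadecimal characters of the hash string, which is what gives 256 lists; that reading, the lower-case fixed-length hex strings and the function names are all assumptions.

```python
import bisect
from collections import defaultdict
from typing import Dict, Iterable, List


def build_buckets(hashes: Iterable[str], prefix_len: int = 2) -> Dict[str, List[str]]:
    """Group hex hash strings by their first characters, sorting each list."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for h in hashes:
        buckets[h[:prefix_len]].append(h)
    for bucket in buckets.values():
        # For fixed-length lower-case hex, string order equals numeric order.
        bucket.sort()
    return dict(buckets)


def count_exact_matches(
    buckets: Dict[str, List[str]], query: str, prefix_len: int = 2
) -> int:
    """Count stored hashes identical to `query` using binary search."""
    bucket = buckets.get(query[:prefix_len], [])
    lo = bisect.bisect_left(bucket, query)
    hi = bisect.bisect_right(bucket, query)
    return hi - lo
```

Only the one list sharing the query's prefix is searched, and the binary search inside it avoids scanning the whole list.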
For non-zero cases, which in practice may mean finding differences up to 12 bits, a different solution will be needed.