Commons:Batch uploading/World Digital Library
Images from World Digital Library
New site with PD images: http://www.wdl.org. It contains 1170 items. --Butko (talk) 06:52, 22 April 2009 (UTC)
- User:Sj has shown interest in working on this upload. It looks like a very nice collection. Some points:
- The items have an id (http://www.wdl.org/en/item/100/), so easy to loop over
- The description of the items is available in a lot of languages, we should use that
- Lots of metadata is available; this should make categorization easier
- One item can contain multiple files. We should be aware of that
- Files are available in the tiff file format. We should either have tiff thumbnails or upload tiff and a jpg version (transcoding!)
- Experience and code gained with the usgov uploads should be (re)used
- Some items have curator videos, which might be fun to upload too
- Multichill (talk) 14:13, 8 November 2009 (UTC)
- Aside: There's a lot of interest in using data from how these images are used in encyclopedia articles, and how traffic is driven to the original archives, to inspire more libraries to take part in WDL. +sj + 14:14, 8 November 2009 (UTC)
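Because each item is addressed by a numeric id, a crawl could start from a plain list of URLs. A minimal sketch, assuming the other language editions use the same path pattern as the English URL quoted above (only the `/en/` form appears in this discussion); `item_url` and `item_urls` are hypothetical helper names, and ids may well have gaps that a real crawler would need to tolerate:

```python
BASE = "http://www.wdl.org"  # site address as given in the opening post


def item_url(item_id, lang="en"):
    """Build the page URL for one item, e.g. http://www.wdl.org/en/item/100/."""
    return f"{BASE}/{lang}/item/{item_id}/"


def item_urls(first_id, last_id, lang="en"):
    """Yield item URLs for every id in the inclusive range first_id..last_id."""
    for item_id in range(first_id, last_id + 1):
        yield item_url(item_id, lang)
```

For example, `list(item_urls(100, 102))` gives the three URLs for items 100 to 102; whether each of those pages exists is not something this sketch checks.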
Any progress? -- RE rillke questions? 18:29, 4 June 2012 (UTC)
- Thanks for the reminder. They've done a batch of updates recently; I'll see if I can get a dump next week before finding a suitable scraper. --SJ+ 06:52, 21 June 2012 (UTC)
- Hi there. I'm the Wikipedian in Residence at the World Digital Library. Some content is already found here: Category:Images from the World Digital Library. Whatever you decide to do is what you decide to do, but, WDL asked that facilitating mass uploads and encouraging extensive uploading not be a part of my scope this year. This is at the request of the majority of their partners. But, I can't control what others do, of course, if something is in the public domain, and I do upload occasional images. Do note: not all content on WDL is public domain - there is content from post-1923 (per US law) on the site, and much of that content was not created by federal/government entities. So make sure you check each page accordingly. (This includes content from the Florida State Library, for example.) Sarah (talk) 04:06, 13 July 2013 (UTC)
2014 Uploads
@Sj, SarahStierch, and Rillke: I am revisiting these based on an (independent) email request earlier this week from 維基小霸王. Progress below. It may take 5 years but we get there eventually. --Fæ (talk) 11:16, 28 February 2014 (UTC)
- Thank you, Fæ.--維基小霸王 (talk) 10:40, 3 April 2014 (UTC)
Technical, comments
- Format
File:<title[<=200 characters]> WDL<id>.{png, pdf, jpg}
Where <id> is the WDL database number. The title may be trimmed to at most 200 characters by dropping trailing sentences; some titles are long only because additional sentences add detail. The title is taken from the English version of the page but may include non-English characters, such as the under-dot and over-bar in "Ṭahmāsp", so the Commons file name relies on UTF-8 rather than being limited to ASCII.
First (and possibly only) run is limited to photographs which are single items unless in pdf format. Other formats such as mp3 files exist on the site and would require pre-processing; whether this is worth the time to batch automate will depend on volumes and interest.
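The naming rule above can be sketched in a few lines. This is an illustration only, not the bot's actual code: `trim_title` and `commons_filename` are hypothetical names, the sentence splitting is a simple guess at what "breaking off sentences" means, and characters that MediaWiki disallows in titles are not handled.

```python
import re

MAX_TITLE = 200  # character limit from the format line above


def trim_title(title, limit=MAX_TITLE):
    """Drop trailing sentences until the title fits within `limit` characters."""
    title = title.strip()
    if len(title) <= limit:
        return title
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", title)
    kept = sentences[0]
    for sentence in sentences[1:]:
        candidate = kept + " " + sentence
        if len(candidate) > limit:
            break
        kept = candidate
    # Assumed fallback: hard cut if even the first sentence is too long.
    return kept[:limit].rstrip()


def commons_filename(title, wdl_id, ext):
    """Return 'File:<title> WDL<id>.<ext>' for one of the three listed formats."""
    if ext not in {"png", "pdf", "jpg"}:
        raise ValueError(f"unsupported extension: {ext}")
    return f"File:{trim_title(title)} WDL{wdl_id}.{ext}"
```

Python strings are Unicode, so a title such as "Ismāʻīl, the Persian Ambassador of Ṭahmāsp" passes through unchanged, and `commons_filename("Treatise on Geometry", 7107, "pdf")` reproduces the file name cited later on this page.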
- Samples
- Turkestan ethnographic album (87 pages, 1865-1872)
- Man playing a Karnay (1865-1872)
- Theatre design (1649-1685)
- Khoi people in a storm (1700-1730)
- Ismāʻīl, the Persian Ambassador of Ṭahmāsp (woodcut, 1557-1562)
- La cena de le Ceneri (146 pages, 1584)
- Copyright issues
- File:Bombed Copy of “Defensor pacis” WDL11254.png (WDL) was impossible to filter automatically. There is no date for the photograph, nor any detail about the photographer, only details of the object. The same photograph (in a different tint and size) appears at kb.dk (The Royal Library of Denmark) and may be CC-BY-NC-ND under the website terms. However, the absence of a copyright status or back-link when released through WDL may indicate that this general term was not intended to apply to this individual photo.
- Completeness issues
- Books like File:On Aristotle’s “On the Heavens” WDL7106.jpg are incomplete: only the first page was uploaded.--維基小霸王 (talk) 04:59, 3 March 2014 (UTC)
- I think the easiest solution is to collect these in a backlog list and then create the complete files by hand, reusing the existing text, with the original then deleted as an inferior duplicate. This example turns into:
- On Aristotle’s “On the Heavens” WDL7106 V1.pdf & On Aristotle’s “On the Heavens” WDL7106 V2.pdf
- Others fixed
- File:Journey to Ethiopia, Eastern Sudan, and Nigritia WDL2550.pdf
- File:Abstracts on the Physiognomy of Horses WDL4668 V1.pdf
- File:Abstracts on the Physiognomy of Horses WDL4668 V2.pdf
- Please add any more to be processed to Images uploaded by Fæ (check needed) as they will then appear in this live list.
- Rotation issues
Book scans like File:Treatise on Geometry WDL7107.pdf require rotation.--維基小霸王 (talk) 09:42, 3 March 2014 (UTC)
- I have raised this on Commons:Bots/Work_requests#Rotating_books as someone may have created a tool to do this already. --Fæ (talk) 09:52, 3 March 2014 (UTC)
- Resolution issues
The resolution of book scans like File:Theater of Instruments and Machines WDL4305.pdf is too low to make out the text. It can be improved by rebuilding the PDFs from the page images, although the result is still not very clear.--維基小霸王 (talk) 01:40, 4 March 2014 (UTC)
- I agree that a resolution of 675px is disappointingly small, but it is just about usable. I am not sure it would be worth the effort of restitching a new book unless the scanned pages were significantly improved, and the available WDL png versions in this case are only 836px wide. This may be a case for finding better scans elsewhere and uploading those as new files rather than bothering with the existing WDL copy.
- Checking all the books in Books from the World Digital Library, 93%+ have a width of over 800px; in fact, a large proportion are over 2,000px. List for analysis below.
- If any are definitely unusable, I encourage raising a deletion request under COM:SCOPE and we can discuss if there are possible replacements or if repairs are worth the effort. --Fæ (talk) 09:48, 4 March 2014 (UTC)
- I ran a little experiment with File:Theater of Instruments and Machines WDL4305.pdf by bot-downloading all the png versions of the pages, converting to very high quality jpegs and compiling them into a new pdf. The result was changing from an 8MB file to a 252MB file, so a bit more compression would probably be okay, though a reader can now see the pages at 2,000px across rather than 675px. This is all expensive in volunteer time and a bit too complex to automate, so I would only expect to do this work for a handful of desirable books. --Fæ (talk) 20:12, 9 March 2014 (UTC)
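The trade-off in that experiment can be put in numbers. A small sketch using only the figures quoted above (675px and 2,000px, 8MB and 252MB); treating the page aspect ratio as unchanged, so that pixel count scales with the square of the width, is an assumption of this sketch and is not stated in the thread:

```python
old_width_px, new_width_px = 675, 2000   # page widths quoted in the comment
old_size_mb, new_size_mb = 8, 252        # PDF sizes quoted in the comment

width_ratio = new_width_px / old_width_px   # linear scale factor
pixel_ratio = width_ratio ** 2              # assumes unchanged aspect ratio
size_ratio = new_size_mb / old_size_mb      # growth in file size
size_per_pixel_growth = size_ratio / pixel_ratio

print(f"width x{width_ratio:.2f}, pixels x{pixel_ratio:.1f}, "
      f"file size x{size_ratio:.1f}, size per pixel x{size_per_pixel_growth:.1f}")
```

Under that assumption the file grew roughly 3.6 times faster than the pixel count, which fits the remark that more compression would probably be acceptable.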
Progress
Assigned to, task | Progress | Category | ||
---|---|---|---|---|
Fæ, mapping to {{Artwork}} using BeautifulSoup in Python, including capturing all available languages ('en', 'fr', 'ru', 'ar', 'es', 'pt', 'zh'). I was particularly pleased to get both Chinese and Arabic working on the image pages. | Status: Done | - | ||
Fæ, intelligent categorization filter based on WDL location and keywords. (Considered adding a grandparent/parent/child check on the category list, but this could be a later Faebot project to apply to many batch uploads and need not be a dependency here as the categories appear "reasonable".) | Status: Done | - | ||
Fæ, intelligent licensing to skip post-1923 works.
May miss some files due to WDL layout inconsistencies. Filter is based on date and creator fields that match
The licenses being used are not great; however, with the ample metadata included in the Artwork template, they might be easy to refine if there are suggested improvements. |
Status: Done | - | ||
Fæ, upload first batch photographs (~1,500 done)
During the first 400 or so I found a number of bugs/improvements (such as improved categorization). A couple of uploads may end up getting deleted due to uncertain copyright; however, the date filter should now be adequate for the rest of the run. Upload the second batch, books (max 877), in parallel. |
Status: In progress | Catscan report {Images from the World Digital Library & Images uploaded by Fæ} The WDL category started with 250 images already uploaded. Second category specifically for books (pdf) Books from the World Digital Library | ||
Fæ, decide what to do about uploads over 100MB, for example 10630 which is over 180MB in size.
These have to be handled manually as there is no readily available batch process, list below.
|
Status: Backlog | - | ||
Fæ, re-work 'zoomified' objects such as File:From Tobol'sk to Obdorsk WDL181.jpg where there are multiple images. Currently only the first has been taken, however these appear to be a small proportion of the total. In this particular example the WDL provided no pdf version. I assembled one by downloading the 33 png files used by the "zoomify" tool and repackaged them into a pdf. See File:From Tobol'sk to Obdorsk WDL181.pdf.
195 files to be checked identified. Same issue as raised in discussion. Backlog created here. |
Status: 1% done | - | ||
Fæ, many of the images have links to an original library source (in particular images provided by the US Library of Congress). It should be possible to go through these and check if an original very high resolution TIFF is available and can be uploaded as an alternative version. Based on the previous NARA batch upload, the Commons Community preferred to have both a high resolution jpeg/png which is easily used elsewhere, as well as a larger or sometimes extremely large tiff, with the categorization on the usable file with a pointer in the other_versions parameter to the on-wiki tiff.
Where a tiff is available and does not already appear on Commons, it would be ideal to upload it as an alternative in this way. Note that in some cases the original link no longer works; this is the case for the Biblioteca Nacional Digital of Brazil, and it may be impossible to find an original scan.
I will ponder the path to take here. Where a source exists in higher resolution online, it will invariably be in a library collection which itself may be worth creating as its own batch upload project, rather than handling it piecemeal through the happenstance of being published on the WDL. |
Status: On backlog | - |
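Two of the rows above, the post-1923 licensing filter and the 100MB size limit, amount to a pre-upload triage step. A minimal sketch of that idea, not Fæ's actual filter: it looks only at the date text (the real filter also used creator fields), treats 1923 itself as excluded, and takes 100MB to mean 100 × 1024 × 1024 bytes. All three are assumptions, and `latest_year` and `triage` are hypothetical names.

```python
import re

CUTOFF_YEAR = 1923                   # assumption: works dated 1923 or later are skipped
MAX_BATCH_BYTES = 100 * 1024 * 1024  # assumption: "100MB" read as 100 MiB


def latest_year(date_text):
    """Return the latest four-digit year found in a date string, or None."""
    years = [int(y) for y in re.findall(r"\b\d{4}\b", date_text)]
    return max(years) if years else None


def triage(date_text, size_bytes):
    """Classify an item as 'upload', 'skip', 'backlog' or 'manual-check'."""
    year = latest_year(date_text)
    if year is None:
        return "manual-check"        # no usable date, so a human has to look
    if year >= CUTOFF_YEAR:
        return "skip"                # too recent for the date-based filter
    if size_bytes > MAX_BATCH_BYTES:
        return "backlog"             # too large for the batch process
    return "upload"
```

With a date range such as "1865-1872" the later year decides, so `triage("1865-1872", 180 * 1024 * 1024)` lands on the backlog because of size alone, much like the 180MB example in the table.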