Commons:Batch uploading/LSH

From Wikimedia Commons, the free media repository
  • Livrustkammaren och Skoklosters slott med Stiftelsen Hallwylska museet (COM:LSH):
    • Each image carries a unique identifier which may be linked to a URL (although these aren't live yet)
    • No API
    • They are donating the images and together with these the associated metadata. They've also done some preliminary matching between keywords and Commons categories as well as between artist/events/depicted people and (Swedish) Wikipedia pages.
    • It's a collaboration so yes!
  • Describe the works to be uploaded in detail (audio files, images by …):

This is a collection of approx. 12,000 (originally estimated at 20,000) high-resolution photographs in TIFF format of the objects held in the collections of these three museums. All files are smaller than 500 MB, though some may be larger than 100 MB; resolution is below 25 megapixels (relevant with respect to Commons:Maximum file size).

The number of images changed because more manual work is needed for the last 8,000 in order to generate sufficiently descriptive image pages.

  • Which license tag(s) should be applied?

All of the depicted objects are owned by the museum and PD-old. The photographs themselves are either old enough to be PD-Sweden-photo or released as either CC0 or CC-BY-SA (don't know which version yet). {{LSH license}} has been prepared for this purpose.

  • Is there a template that could be used on the file description pages? Do you think a special template should be created?

{{LSH artwork}}

Opinions


Having looked around a bit, it looks as though chunked uploads might not be integrated into the pywikipediabot framework. Does anyone know anything more about this, or whether there is a practical workaround? I can ask them to downsample the images, but that seems a waste when they've offered us high-res. /André Costa (WMSE) (talk) 09:32, 7 February 2013 (UTC)

Seems it is not integrated indeed :-/ source. Jean-Fred (talk) 12:23, 7 February 2013 (UTC)[reply]
Yes I saw that one. Since it was close to a year ago though I was hoping that the situation had changed since then and that I had somehow missed the follow-up e-mail. /André Costa (WMSE) (talk) 08:17, 8 February 2013 (UTC)[reply]

If you want to start at the low level, you can construct your XMLHttpRequests yourself. A sample of how a chunked upload works, and what the XHR should look like, is at mw:API:Upload. On Windows, I used Fiddler2 to check that everything worked as it should. If you like I can supply my VB(A) classes, but I guess you are on Linux. -- Rillke(q?) 18:18, 22 March 2013 (UTC)

The chunked uploads issue has been resolved thanks to this py-script by Smallman12q. /Lokal_Profil 13:10, 6 May 2013 (UTC)
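For readers unfamiliar with the approach discussed above, here is a minimal sketch of how a chunked upload can be planned. It is not the script referred to in the thread. It only prepares the byte ranges and the form fields for each request, using the `action=upload` parameters `stash`, `filesize`, `offset` and `filekey` as described at mw:API:Upload; the function names are hypothetical, and the actual HTTP calls (with the bytes sent as the multipart field `chunk`) are left out.

```python
from typing import Dict, Iterator, Optional, Tuple


def plan_chunks(filesize: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs that together cover a file of `filesize` bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    offset = 0
    while offset < filesize:
        length = min(chunk_size, filesize - offset)
        yield offset, length
        offset += length


def chunk_params(filename: str, filesize: int, offset: int, token: str,
                 filekey: Optional[str] = None) -> Dict[str, str]:
    """Form fields for one stashed chunk; the bytes themselves go in the 'chunk' file field."""
    params = {
        "action": "upload",
        "format": "json",
        "stash": "1",
        "filename": filename,
        "filesize": str(filesize),
        "offset": str(offset),
        "token": token,
    }
    # The first chunk has no filekey; later chunks reuse the key returned by the API.
    if filekey is not None:
        params["filekey"] = filekey
    return params


def commit_params(filename: str, filekey: str, token: str, text: str) -> Dict[str, str]:
    """Form fields for the final request that publishes the stashed file."""
    return {
        "action": "upload",
        "format": "json",
        "filename": filename,
        "filekey": filekey,
        "token": token,
        "text": text,
    }
```

Each response to a chunk request is expected to carry the `filekey` to pass along with the next chunk; error handling and retries are deliberately omitted here.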

Finally got a proper grip on the metadata preparations and the first example information templates can now be seen at /Examples. There are a few more bits of information to include but most of the final product should be there. To make connections to commons categories and templates I largely rely on the following lists:

  • /Keywords (used to match the keywords used for the images with categories on Commons)
  • /People (used to identify artists, depicted people and provenance/owners)
  • /Events (identifies categories related to historical events)
  • /Places (identifies places and links these using city/country-templates. Also identifies institutions which are linked to page/institution-template)
  • /Materials (identifies materials/techniques and links these using the technique template)
  • /ObjKeywords [provisional] (used to match keywords used for object descriptions with categories on Commons, to be merged with earlier information here and be sorted by frequency)

Feel free to help out in expanding these (although I expect a basic knowledge of Swedish is required). As the last few fields are tidied up there should be one or two more such lists appearing. I might also try some automated matching for a few of these.
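As an illustration of how such mapping lists can be applied, here is a minimal sketch. It assumes that a list such as /Keywords has already been parsed into a dictionary from lower-cased keyword to Commons category names; the function name, the sample entry and the parsing step are assumptions, not the code used for this upload.

```python
from typing import Dict, Iterable, List, Tuple


def categories_for(keywords: Iterable[str],
                   mapping: Dict[str, List[str]],
                   unmatched_category: str = "unmatched keyword"
                   ) -> Tuple[List[str], List[str]]:
    """Return (content categories, maintenance categories) for one image."""
    categories: List[str] = []
    has_unmatched = False
    for keyword in keywords:
        matches = mapping.get(keyword.strip().lower())
        if matches:
            for category in matches:
                if category not in categories:  # keep order, avoid duplicates
                    categories.append(category)
        else:
            has_unmatched = True
    # One maintenance category flags the page if any keyword had no match.
    maintenance = [unmatched_category] if has_unmatched else []
    return categories, maintenance


# Hypothetical mapping entry, for illustration only.
sample_mapping = {"pistol": ["Pistols"]}
result = categories_for(["Pistol", "okänd"], sample_mapping)
```

The same pattern would apply to the other lists (people, events, places, materials), each with its own maintenance category for unmatched entries.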

The filenames will likely take the form <description> - <museum> - <identifier/museum system filename>.tif. The descriptions used for the filenames can be viewed at /Filenames. I've strived to keep the descriptions shorter than 100 characters with a hard limit of 128 characters. These limits as well as other filters for the descriptions can easily be changed based on feedback.
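A minimal sketch of the filename pattern described above. The pattern and the 100/128 character limits come from the text; the way an over-long description is shortened (cut at the last comma or full stop before the soft limit, otherwise truncate at the hard limit) is an assumption made for illustration, as are the function names.

```python
def shorten_description(description: str, soft_limit: int = 100,
                        hard_limit: int = 128) -> str:
    """Shorten a description, preferring a natural break before the soft limit."""
    desc = " ".join(description.split())  # normalise whitespace
    if len(desc) <= soft_limit:
        return desc
    head = desc[:soft_limit]
    # Assumption: a comma or full stop followed by a space is a natural break.
    cut = max(head.rfind(", "), head.rfind(". "))
    if cut > 0:
        return head[:cut]
    # No natural break found: fall back to the hard limit.
    return desc[:hard_limit].rstrip()


def build_filename(description: str, museum: str, identifier: str,
                   extension: str = "tif") -> str:
    """Assemble '<description> - <museum> - <identifier>.<extension>'."""
    return f"{shorten_description(description)} - {museum} - {identifier}.{extension}"
```

For example, `build_filename("Pistoler", "Livrustkammaren", "68782-negative")` gives `Pistoler - Livrustkammaren - 68782-negative.tif`, matching the form of the filenames quoted elsewhere on this page.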

Having had a closer look at the data, one or two parameters in the artwork template may also change. /Lokal_Profil 06:45, 13 May 2013 (UTC)[reply]

An update
Most of the code preparations are now done with the main remaining step being translations/mappings of the above lists. The filename pattern was changed slightly and an "event" parameter was added to the Artwork template. All of these changes can be seen at /Examples. I'm going to try and do the most frequent translations/mappings and will then do a few test uploads (unless someone suggests otherwise). /Lokal_Profil 22:39, 19 May 2013 (UTC)[reply]
First two actual files are now up at
Still lots of unmatched parameters kicking around thus the lack of categories. Feel free to help out with any of the subpages above. /Lokal_Profil 14:46, 24 May 2013 (UTC)[reply]

After a longer than expected hiatus this project is now on track again. The files for the first larger test upload are being prepared as I write this. Meanwhile, some statistics on categories. Out of the 13055 files:

2541 (832) have no categories;
1.36 (1.92) is the average number of categories per file;
1 (2) is the median number of categories per file;
11 (7) is the maximum number of categories per file.

Here the first number counts the image-specific categories and the bracketed number counts the maintenance categories. The "images without categories" figure does not include images with a photographer category but no subject-specific category. As can be seen, most images still have multiple maintenance categories. The most frequent are:

  • dim-with multiple size templates 5481
  • unmatched creator 2946
  • unmatched keyword 2927
  • unmatched objKeyword 2881
  • no objects 2665
  • unmatched material 2553
  • without any categories 2526
  • malformated year 1778
  • unmatched event 918
  • unmatched place 180
  • dim-with weird units 128

/Lokal_Profil 09:22, 16 August 2013 (UTC)[reply]
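Summary figures of the kind listed above (files without categories, average, median and maximum per file) can be computed from a list holding the number of categories on each file. This is a minimal sketch using Python's standard `statistics` module; the function name and the sample counts are made up for illustration and are not the upload data.

```python
from statistics import mean, median
from typing import Dict, List, Union


def category_stats(counts: List[int]) -> Dict[str, Union[int, float]]:
    """Summarise per-file category counts."""
    return {
        "no_categories": sum(1 for c in counts if c == 0),
        "average": round(mean(counts), 2),
        "median": median(counts),
        "maximum": max(counts),
    }


# Illustrative input only: five files with 0, 1, 1, 2 and 3 categories.
stats = category_stats([0, 1, 1, 2, 3])
```

Running it once on the image-specific counts and once on the maintenance counts would give the paired "number (bracketed number)" figures.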

Have now started a first test upload; see Special:ListFiles/LSHuploadBot. All in all it should end up as ~150 files. Any feedback is welcome. /Lokal_Profil 11:34, 17 August 2013 (UTC)
So about 200 files were uploaded. Two things I thought about afterwards:
  • Filenames: For this batch many of the filenames still included objectIDs which could have been removed (e.g. Pistoler, 9362, 24412, 4612 - Livrustkammaren - 68782-negative.tif -> Pistoler - Livrustkammaren - 68782-negative.tif). The problem is setting up something which removes these numbers without removing valid numbers such as e.g. years. Looking at /Filenames this isn't a very common problem though so maybe not worth fixing.
  • Negatives: The negatives today include exactly the same information/categories as the positives. It may be better to remove the "normal" categories.
/Lokal_Profil 08:19, 18 August 2013 (UTC)[reply]
Filenames: Have figured out a scheme which removes most of the object ids. Will move the current files to the new filenames as soon as LSHuploadBot gets a bot-flag. At that point I'll also update the /Filenames page.
Negatives. These are no longer categorised in the normal category tree but are instead put in the Category:Negative images from Livrustkammaren och Skoklosters slott med Stiftelsen Hallwylska museet tree.
After files have been moved I'll update the image descriptions and continue with some more test uploads of HWY and SKO images. /Lokal_Profil 10:10, 22 August 2013 (UTC)[reply]
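The scheme actually used to remove object ids is not described here. As one possible heuristic, the sketch below drops comma-separated items that consist only of digits, while keeping four-digit items that look like plausible years; the year range and the function name are assumptions.

```python
def strip_object_ids(description: str, min_year: int = 1000,
                     max_year: int = 2099) -> str:
    """Remove digit-only, comma-separated items unless they look like years."""
    kept = []
    for item in description.split(","):
        item = item.strip()
        if not item:
            continue
        if item.isdigit():
            # Assumption: a bare four-digit number in this range is a year.
            if len(item) == 4 and min_year <= int(item) <= max_year:
                kept.append(item)
            continue  # any other bare number is treated as an object id
        kept.append(item)
    return ", ".join(kept)
```

With this heuristic `Pistoler, 9362, 24412, 4612` becomes `Pistoler`, while `Pressbild, utställning Made in Sweden, år 2003` is left unchanged because the year is part of a longer item. An object id that happens to fall inside the year range would be kept, which is the kind of edge case mentioned in the discussion.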
Had a very quick look − looks very good! I am impressed by your command of advanced templates, especially the handling of {{Size}} & {{Technique}} in case of multiple entries. The other versions is also very good.
On the minor points:
  • I think we have some wrapper template somewhere to have a better layout for {{LSH license}} − maybe {{Copyright information}} ? − but this can be fixed later with no worries.
  • {{LSH negative}} looks a bit scary to me − we might want to use a less bulky thing than {{Original}}.
  • Conversely, {{LSH positive}} takes too big a place IMO − what about putting after the rest of the metadata or making it less bulky? − reworking from the original negative sounds to me like a fringe usecase compared to looking up the author or some other piece of metadata. What do you think?
From what I can see, an awesome job overall, thanks a lot for your work (and sorry for the lack of feedback you got :-()
Jean-Fred (talk) 12:15, 22 August 2013 (UTC)[reply]
Thanks for the feedback!
  • I've implemented the {{Title}} template.
  • The description comes from two sources. Either a field called description in which case it is in Swedish (and I wrap it in {{Sv}}) or from the original description field in which case it need not be in Swedish. This is the case for File:Pressbild, utställning Made in Sweden, år 2003 - Livrustkammaren - 76413.tif.
  • I looked at using {{Original caption}} but remember that this caused some problems (think it messed up some formatting) which is why I used the current solution. The current setup still uses the header from {{Original caption}} so the internationalization support is maintained. Apart from this it uses {{Note:}} and a general warning that it might be inaccurate (found at {{LSH artwork/i18n}}).
  • I've switched to {{Copyright information}} inside {{LSH license}}. There are a couple of other wrappers around that I'd found but wasn't happy with but this one works. This will also include the {{PermissionOTRS}} once I get a number.
  • {{LSH negative}}: I wanted something which explained why the image shouldn't be deleted and {{Original}} gives a lot of the internationalization for this. It's a ridiculously expensive template though which checks for existence of the target file. I'm open to a better suggestion.
  • {{LSH positive}}: You are definitely right. It should be moved to below the information template (now implemented) and made smaller (changed this a bit).
    • In general though these could be tweaked after upload by changing the templates.
Appreciate the feedback. I know that the slow progress has meant that this batch upload has been slightly forgotten. /Lokal_Profil 19:33, 24 August 2013 (UTC)
Sorry for the delay in responding.
  • {{Title}}, {{LSH license}}, {{LSH positive}} moved: Excellent, thanks.
  • description, {{Original caption}}: Fair enough, your explanations are fine to me.
  • Yeah, templates tweaking can always be done after the upload. This is not a blocker in any way − but again, there is hardly anything blocking here :)
Thanks for considering my remarks! Jean-Fred (talk) 12:34, 13 September 2013 (UTC)[reply]
Updated /Filenames. Still waiting for flag to move the files and update the information though. /Lokal_Profil 21:58, 24 August 2013 (UTC)[reply]
Batch 1 files moved and info pages updated. Will prep the rest and prepare the drive for shipping unless anyone else has some more input. /Lokal_Profil 23:19, 12 September 2013 (UTC)[reply]
I checked the edits you made to the already uploaded files. − looks good to me.
I noticed on File:Pressbild, utställning Made in Sweden, år 2003 - Livrustkammaren - 7924.tif that Creator:Göran Schmidt was linked but not transcluded − I am curious about the reasoning for that? (We do have the highly experimental {{Author information}} for similar cases, but I am not sure I would advise to use it.)
In any case, please ship whenever you want. You could consider making it in batches in case there are comments to be made on future uploads, but I do not think one would find much to say there. :) Thanks a lot for your work! Jean-Fred (talk) 12:34, 13 September 2013 (UTC)[reply]
Thanks. I'll invert the last 500 negative images and get the details for shipping the disc to the US. With regard to the photographer being linked but not transcluded, I ended up doing it this way since the same link is also used in the byline for the CC-BY-SA template (something I'd completely forgotten when creating the Creator templates). I figured that by linking to it at least I hadn't wasted the information.
I've also spotted that two (1, 2) of the uploads failed due to problems with the tiffinfo command. Can't really figure out why though since an inverted version worked in both cases (1, 2). Seems to be an anomaly though so these and any future cases can hopefully be solved manually. /Lokal_Profil 16:45, 16 September 2013 (UTC)[reply]
Considering a last change before finalizing. Namely moving the positive/negative parameter into {{LSH artwork}}. Apart from tidying up the individual image descriptions this, combined with moving the categorisation from {{LSH license}} to {{LSH artwork}}, would also allow negatives to be categorised straight into "negative images from x" categories. See User:Lokal Profil/Template:LSH artwork for the new template. /Lokal_Profil 07:11, 18 September 2013 (UTC)[reply]
Implemented. Also fixed a minor bug in the date formatting. /Lokal_Profil 07:08, 20 September 2013 (UTC)[reply]
Yep, good idea to move this parameter to the template (since we have a custom, better to take full advantage of that) − gives us more flexibility later. And we can always subst: stuff if needed. Jean-Fred (talk) 10:02, 20 September 2013 (UTC)[reply]
Images are now uploaded except for a few strays with exif issues so unless something pops up this page can be archived. /Lokal_Profil 08:06, 21 October 2013 (UTC)[reply]

Artificial post-upload break


Thanks for the heads-up. Going through the collection right now, there are some very cool pictures there :) A good addition to Commons. I would like to thank you again for your work Lokal Profil! This has been a long and challenging process (with requests to include even more crazy templates and so ;) and you have delivered an excellent upload to Commons. Kudos. Jean-Fred (talk) 12:47, 21 October 2013 (UTC)
By the way, it seems that 20K pictures were uploaded in the end? Things with the last 8K were resolved? Jean-Fred (talk) 12:59, 21 October 2013 (UTC)

Is the uploading finished now? I still have about 1k images with missing other versions in Category:Files with broken file links --Denniss (talk) 16:24, 28 October 2013 (UTC)[reply]
@Jean-Fred: Thanks =). The difference in numbers stems from the negatives not being included in the initial count. So the last 8k images are still missing (did not have enough metadata to generate a filename/description).
@Denniss: Thanks for the heads up. This should most definitely not be the case. There are a few (roughly 15) images which are still missing due to a combination of EXIF-data problems and filesize limitations, but they of course do not account for all of the broken links. I found that the error primarily stems from some images being JPGs when I'd assumed that all were TIFs. I'll need to generate a list of the affected mismatches and then I'll update the links. It's not impossible that there are other issues as well, but I won't be able to identify those until afterwards.
@World:I also noticed that several of the images have ended up in Category:Pages where expansion depth is exceeded. I'm guessing this is the normal problem with nested {{Other date}}. /Lokal_Profil 11:17, 1 November 2013 (UTC)[reply]
@Denniss: Fixed the jpg-tif ones. Seems to be some other issue as well, possibly due to some files missing from the hard drive. Will look into this further and return with an update. /Lokal_Profil 21:36, 6 November 2013 (UTC)
@Denniss: 346 missing images are causing the 1k broken links, out of which 211 belong to this batch upload. Will look for these specifically on my drives and the museum's to see if they can be found. If not I can tell the bot to comment them out using the same script as the one which fixed the jpg/tif issue. /Lokal_Profil 22:34, 6 November 2013 (UTC)
Sorry for the lack of updates. There seem to be roughly 1000 images missing from the upload due to an early bug when I filtered through the images. I'm working on getting new copies of these from the museum. Once I have uploaded them the broken links will be fixed. Due to personal circumstances there might be a delay in getting the images though. /Lokal_Profil 19:55, 30 November 2013 (UTC)
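The jpg/tif fix described in this thread can be sketched as a lookup: for each linked filename, check whether it exists, then whether the same name exists with the other extension, and otherwise report it as missing so that the link can be commented out. This is a minimal sketch under those assumptions, not the script that was actually run; `resolve_link` and the sample names are hypothetical.

```python
import os
from typing import Optional, Set

# Extensions treated as interchangeable when repairing a link.
ALTERNATES = {".tif": (".jpg",), ".jpg": (".tif",)}


def resolve_link(linked_name: str, existing: Set[str]) -> Optional[str]:
    """Return the existing filename a link should point to, or None if missing."""
    if linked_name in existing:
        return linked_name
    base, ext = os.path.splitext(linked_name)
    for alternate in ALTERNATES.get(ext.lower(), ()):
        candidate = base + alternate
        if candidate in existing:
            return candidate
    return None  # caller may comment the link out


# Illustrative data only.
uploaded = {"Sabel - Livrustkammaren - 1.jpg", "Sabel - Livrustkammaren - 2.tif"}
fixed = resolve_link("Sabel - Livrustkammaren - 1.tif", uploaded)
missing = resolve_link("Sabel - Livrustkammaren - 3.tif", uploaded)
```

Here `fixed` points at the JPG that was actually uploaded, and `missing` is `None`, marking a link that cannot be repaired until the file is found.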


Assigned to: Lokal_Profil
Progress: Done
Bot name: LSHuploadBot
Category: Category:Images from Livrustkammaren och Skoklosters slott med Stiftelsen Hallwylska museet


Wow, and a question


I just came across this set of images and wow, it is fantastic! Thank you for all your hard work. We will be enjoying these images for years to come.

Are there any plans to create and batch-upload JPG versions of all these TIFFs? What is the current policy regarding using large TIFFs (or rather, their thumbnails) in wikipedia articles? Specifically, these templates {{JPEG version of TIF}} {{LargeTIFF}} state that TIFFs are for archiving and JPGs are for using. Are they obsolete? Or do users need to create and upload a JPG version of any file they want to use piecemeal? Thanks! Laura1822 (talk) 21:33, 30 July 2014 (UTC)[reply]

I have been trying to get my questions about TIFFs answered elsewhere. My opinion is that no JPGs are necessary, but I want to get it clarified so that when editing I can do things "correctly."
But I have a couple of other questions about the project. (1) Are there any plans to add the missing creator templates by bot? Or should these be done manually by editors like me as I find them? For example, Alexander Roslin has a creator template, but the paintings in this set (only a few of them) are missing the template. (2) The paintings are missing {{Technique}} information. Most are probably {{Oil on canvas}}. Are there any plans to add these by bot? Really great work you've done here! Laura1822 (talk) 18:36, 31 July 2014 (UTC)[reply]
@Laura1822: . Glad that you have found the images and have enjoyed them =)
There are currently no plans to mass-create JPGs. It is my understanding that most "non-huge" TIFFs correctly generate JPG thumbnails. For very large images thumbnailing is an issue (see e.g. File:Umeå_1923_-_Generalstabskartan.tif), and for these specific cases we use the {{Archival version}}/{{Compressed version}} combination.
I initially tried to identify creators (and depicted people) over at /People together with the museum staff. This was the list used to match museum entries to image page descriptions. As you can see though, there are lots of unmatched people. If you find any matches I'd appreciate it if you updated both that page (example) and the actual image page(s), since future uploads (the next is planned for the end of the year) can then make use of that matching. The museum's own database also maintains links to Wikipedia articles (not yet Wikidata though), so additional matches can be used to enrich their data.
The {{Technique}} template was added based on the matches made in /Materials. This is limited by which materials were listed in the museum's database though. Since most of the images in the upload are of objects, rather than paintings, there is no plan to add the technique parameters by bot where these are missing in the original database. Still, adding it to the image pages (or putting together a list of all pages which should have {{Oil on canvas}} added to them, which I can then add by bot) would be valuable both for Commons and hopefully for the museum itself. /André Costa (WMSE) (talk) 09:07, 4 August 2014 (UTC)
The cut-off point for thumbnail creation is 50 megapixels in resolution. Sometimes this can affect files of modest filesize which happen to be very high resolution. The general consensus is that it is only worth making JPG versions for TIFFs over 50 MP, as the WMF is working on improving how thumbnails are created for sub-50 MP TIFFs, so they should be of better usable quality than the blurry versions we tended to get previously. As an example, I have thousands of non-rendering TIFFs with alternatives being created in Uploads by Fæ (over 50 MP). -- (talk) 10:01, 4 August 2014 (UTC)
Thank you both. @André Costa (WMSE): , if all you need is, say, a category that includes all the paintings which need {{Oil on canvas}}, I can do that pretty easily with Cat-a-lot. I'll have to figure out how to make it a hidden category, though. I will look into your example and see if I can update the missing creators I find. Thanks, and again, great work! Laura1822 (talk) 20:16, 6 August 2014 (UTC)[reply]
Voila, a hidden category. Ping me once you want me to run through it. /André Costa (WMSE) (talk) 08:58, 7 August 2014 (UTC)[reply]
@Laura1822: Did you have time to put images into the category? /André Costa (WMSE) (talk) 12:14, 16 September 2014 (UTC)[reply]
Thanks for the nudge. I just added about 200 images to the category, using searches and cat-a-lot. Laura1822 (talk) 15:51, 16 September 2014 (UTC)[reply]
@Laura1822: Sorry for not confirming before that I dealt with most of these. Keep adding any new ones you find to the category and ping me whenever you want me to go through it again. Cheers André Costa (WMSE) (talk) 13:01, 11 November 2014 (UTC)[reply]
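The 50 megapixel cut-off mentioned in this thread is easy to check from the pixel dimensions alone. A minimal sketch, with the limit as a parameter since the value has changed over time and may change again; the function names are hypothetical.

```python
def megapixels(width: int, height: int) -> float:
    """Image resolution in megapixels (millions of pixels)."""
    return width * height / 1_000_000


def needs_jpeg_alternative(width: int, height: int, limit_mp: float = 50.0) -> bool:
    """True if a TIFF exceeds the thumbnailing limit and needs a JPG version."""
    return megapixels(width, height) > limit_mp
```

For example, an 8000 × 7000 pixel scan is 56 MP and would need an alternative, while a 6000 × 4000 pixel image (24 MP) would not, regardless of either file's size in bytes.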

Clothing of Gustavus Adolphus


I am working on adding English translations for the files for clothing of Gustavus Adolphus. The shirts will take some time, as the descriptions have lots of technical terminology - I have asked some native Swedish speakers with knowledge of period textiles for assistance.

I am also adding categories (and creating them where necessary). I have not removed any existing categories, although some should be removed. For example, "Sweater" is a bad Google translation - in this context "tröja" should be translated to the category "Doublet (clothing)". Do you want me to delete these incorrect categories?

Also, we can apply refined categories in place of the high-level categories like "Textiles", "Clothing" - let me know if that would be a good idea. I don't want to cause problems with keyword matching for the good folks at LSH. - PKM (talk) 01:10, 29 August 2014 (UTC)[reply]

2nd upload


Just a heads up that there will be an additional 6000 images uploaded during the coming week. Same procedure but with most of the post-upload comments worked in. Source code is available at github. /André Costa (WMSE) (talk) 13:08, 11 November 2014 (UTC)[reply]

Yay \o/ Jean-Fred (talk) 14:29, 11 November 2014 (UTC)[reply]