User:Fæ/Projects/Xeno-canto
Xeno-canto birdsong (background discussion copied from Village Pump)
The website xeno-canto is a repository of good-quality bird song recordings, almost 400 of which [1] are CC-BY-SA licensed. It also allows contributors to remove their contributions at any time. Do we, or could we, have a tool or bot for importing files from there? Andy Mabbett (talk) 13:04, 7 January 2013 (UTC)
- Hi Andy, these are quite delightful. I don't know of an easy tool to use for this. I am taking a look, but as a back-burner task, fiddling around with my Python muscles to create some modules that I might reuse for another custom upload. I might stutter due to the pressure of other stuff going on, but I'll contact you if I run into a brick wall.
- After reviewing the website a little, it would be (relatively) straightforward to scrape the data, but loading these to Commons seems to be the sort of re-use their project is set up for. Rather than just going on a hack raid, I'll try writing to the project first to check whether they have an API I can use; this might help if we want to refresh the collection in the future.
- Update: they have a CSV service, but it appears to be strictly limited to queries by species, which is not quite what is needed here.
- Oh, and of course, Commons cannot take mp3 files, so these would need to be batch re-encoded to ogg format.
- I have created an initial upload example which only uses the data from the file page on xeno-canto.org, and so could easily be created automatically. See File:Nothoprocta pentlandii - Andean Tinamou - XC112728.ogg. The Latin names are particularly useful for automatic categorization, so the recording can nicely be played from within Category:Nothoprocta pentlandi, where the user can look at photographs of the bird at the same time. --Fæ (talk) 12:53, 16 January 2013 (UTC)
Project brief
- Capture the list of source file references/links from xeno-canto with CC-BY-SA licenses
- Download the mp3 files locally
- Batch reencode to ogg format
- Scrape the file information from xeno-canto for each audio file
- Batch upload the ogg files with file information to Wikimedia Commons
- Design a report to check which files released in the future on xeno-canto are not already on Commons
- Notes
- I have added a thread about this project on the xeno-canto.org discussion forum to encourage feedback.
- A credit template for files is maintained at /credit.
Step 1 - Capture the list
I have a bit of Python that will now churn through the CC-BY-SA query at any time and dump a new file list. It's quick: it takes about 10 seconds to run. I have captured the current result below.
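As an illustration of this step only (not the actual script, which is linked under Source code below), the listing boils down to filtering recording metadata for the CC-BY-SA licence. The `id` and `lic` field names follow the JSON records shown further down this page:

```python
# Illustrative sketch of the listing step: filter recording metadata down to
# the CC-BY-SA entries. The 'id' and 'lic' field names follow the JSON
# records shown further down this page.

def cc_by_sa_ids(records):
    """Return the XC id numbers of records released under CC-BY-SA."""
    return [r["id"] for r in records
            if "creativecommons.org/licenses/by-sa" in r.get("lic", "")]

sample = [
    {"id": "112728", "lic": "http://creativecommons.org/licenses/by-sa/3.0/"},
    {"id": "99999", "lic": "http://creativecommons.org/licenses/by-nc-nd/2.5/"},
]
print(cc_by_sa_ids(sample))  # → ['112728']
```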
Step 2 - Download
A slight tweak to the listing script and now I have all the files downloaded as local mp3 files. It was quick again, about 2 seconds per file or less. The total downloaded was 246 MB.
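A sketch of the download step. The URL pattern comes from the `file` field of the JSON records shown later on this page; the local directory name is an assumption of this sketch, and skipping files that already exist makes a re-run resumable:

```python
import os
import urllib.request

# Sketch of the download step. The URL pattern comes from the 'file' field
# of the JSON records shown later on this page; the local directory name is
# an assumption of this sketch.

def download_url(xc_id):
    return "http://www.xeno-canto.org/download.php?XC=%s" % xc_id

def fetch_all(xc_ids, dest="xc_mp3"):
    os.makedirs(dest, exist_ok=True)
    for xc_id in xc_ids:
        target = os.path.join(dest, "XC%s.mp3" % xc_id)
        if not os.path.exists(target):  # skip files already fetched,
            urllib.request.urlretrieve(  # so a re-run resumes cleanly
                download_url(xc_id), target)
```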
Step 3 - encode mp3 -> ogg
This was more of a problem on my old Mac mini than I expected, and I failed to get batch processing to work after several attempts. Fortunately wolfgang42 offered to lend a hand with batch processing the files (using Dropbox and Wikisend to share them), and these are now ready for upload.
Second round of encoding
Unfortunately around half of the files had an apparent glitch in the ogg encoding, making them play back online at double speed (possibly an issue with the built-in player on Commons). A new encoding of the mp3 files using ffmpeg2theora at high quality, followed by a fresh batch upload of all those identified in Category:Xeno-canto (check needed), is required.
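The second-round encode used ffmpeg2theora; as a sketch of the same idea only, here is a batch driver built around plain ffmpeg with libvorbis at its highest quality setting. The exact command line is an assumption, not the one actually used for these uploads:

```python
import subprocess
from pathlib import Path

# Sketch of a batch re-encode. The project used ffmpeg2theora; this builds
# an equivalent command with plain ffmpeg and libvorbis at its highest
# quality setting, which is an assumption, not the command actually used.

def encode_cmd(mp3_path, ogg_path):
    return ["ffmpeg", "-y", "-i", str(mp3_path),
            "-codec:a", "libvorbis", "-qscale:a", "10", str(ogg_path)]

def batch_encode(src_dir="xc_mp3", dst_dir="xc_ogg"):
    Path(dst_dir).mkdir(exist_ok=True)
    for mp3 in sorted(Path(src_dir).glob("*.mp3")):
        subprocess.run(encode_cmd(mp3, Path(dst_dir) / (mp3.stem + ".ogg")),
                       check=True)  # stop the batch on any encoder failure
```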
Step 4 & 5 - scraping the file information, uploading
Using the id numbers, grab the metadata to populate the information box, location data, categories and set the best title.
This seems to be running well, apart from the sort of teething problems one might expect, as I have never before done a batch upload using my own Python scripts rather than other bots. Based on Andy's edit I am using {{Species-inline}} for the scientific and vernacular names, and I decided against repeating the latitude and longitude when I am also adding {{Location dec}}.
As I am uploading from a Mac mini, I ran into the file connection problem and had to (temporarily) adjust my user-config.py file to avoid connection timeouts. I also could not work out how to set the edit summary on upload, so the default is being used (wikipedia.setAction() seemed to make no difference to upload.UploadRobot()). Being cautious, I am uploading at a maximum rate of 2 files per minute.
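The cautious rate limit above amounts to sleeping at least 30 seconds between upload calls. A minimal sketch, where `do_upload` is a stand-in for whatever routine (e.g. pywikibot's upload robot) actually pushes a file to Commons:

```python
import time

# Sketch of the cautious throttle mentioned above: a maximum of 2 uploads
# per minute means waiting at least 30 seconds between upload calls.
# do_upload is a stand-in for the real upload routine.

def throttled_upload(files, do_upload, interval=30):
    for path in files:
        do_upload(path)
        time.sleep(interval)  # stay under the self-imposed rate limit
```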
I had assumed that all files would have coordinates, but it turns out some do not. This generates a bit of naff code - {{Location dec|<span class='unspecified'>Not specified</span>|<span class='unspecified'>Not specified</span>}} - which I will hook out with VisualFileChange when the upload is finished, as I expect the affected files to be relatively few. I have set an error trap in the script in case it is used for future uploads.
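A sketch of the kind of guard that avoids the naff output: only emit {{Location dec}} when both values parse as numbers, otherwise omit the template entirely instead of writing "Not specified" placeholders:

```python
# Sketch of a guard for the missing-coordinate problem: only emit
# {{Location dec}} when both values parse as numbers, otherwise omit the
# template entirely instead of writing "Not specified" placeholders.

def location_template(lat, lng):
    try:
        float(lat), float(lng)
    except (TypeError, ValueError):
        return ""  # no usable coordinates: leave the template out
    return "{{Location dec|%s|%s}}" % (lat, lng)

print(location_template("-25.2001", "-65.8167"))  # → {{Location dec|-25.2001|-65.8167}}
print(repr(location_template("Not specified", "Not specified")))  # → ''
```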
A second problem showed itself in the location data for Paraná State, where the non-ASCII text caused the script to halt. I had to add an encoding step for locations, which may display poorly for some files, though the encoded text will give a clue as to the characters intended; see for example this diff.
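For reference, the underlying point of the Paraná fix: keep text as Unicode throughout and encode only at the output boundary, so non-ASCII place names round-trip intact (this is the default behaviour of Python 3 strings, where the halting problem above would not arise):

```python
# Keep text as Unicode throughout and encode only at the output boundary,
# so non-ASCII place names round-trip intact (Python 3 string behaviour).

loc = "Paran\u00e1 State"
assert loc.encode("utf-8").decode("utf-8") == loc  # lossless round trip
print(loc)  # → Paraná State
```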
Step 6 - maintenance
I will defer putting this together until there is a fresh batch of CC-BY-SA-licensed files on xeno-canto to worry about. I now have the routines to pull the live list of relevant id numbers from XC and from the category on Commons, so creating a script to show the difference should not be too big a deal.
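The difference report described above is essentially a set difference over the two id lists. A minimal sketch:

```python
# Sketch of the maintenance report: given the live list of CC-BY-SA ids
# from XC and the ids already represented in the Commons category, report
# the ids still missing from Commons, in numeric order.

def missing_ids(xc_ids, commons_ids):
    return sorted(set(xc_ids) - set(commons_ids), key=int)

print(missing_ids(["112728", "99546", "107690"], ["112728"]))  # → ['99546', '107690']
```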
- By the way, there are now about 250 more BY-SA files available on XC, so it might be worth doing another import? Jnthnjng (talk) 16:40, 18 July 2013 (UTC)
- Thanks for the prompt. I did take a brief look at this a month ago. I know a lot more about how to go about this now, having worked on several other batch upload projects. I would like to create a more stable routine that I can re-run to upload new files without any re-programming. I'll put it on my backlog, so I'll probably re-visit this in a month or so and have a program that can run every month without much hassle. --Fæ (talk) 17:25, 18 July 2013 (UTC)
- I have decided to rewrite this from scratch. It's a bit of hassle for a few hundred more files, but repeatability, and the ability to re-encode to ogg on the fly, seem worth it to me. Rather than scraping the website, I will pull in the JSON array from the API (hopefully it will stay available) and then do existence/duplicate checks for each file based on the XC number rather than any filesize or hash, since the ogg re-encoding makes those unreliable. For anyone reviewing this, the JSON records contain info in the form:
{ u'cnt': u'Argentina', u'en': u'Andean Tinamou', u'file': u'http://www.xeno-canto.org/download.php?XC=112728', u'gen': u'Nothoprocta', u'id': u'112728', u'lat': u'-25.2001', u'lic': u'http://creativecommons.org/licenses/by-sa/3.0/', u'lng': u'-65.8167', u'loc': u'Finca El Candado, Salta', u'rec': u'Niels Krabbe', u'sp': u'pentlandii', u'type': u'song', u'url': u'http://xeno-canto.org/112728'}
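A sketch of how one such record maps to a Commons file title, matching the example file named earlier on this page (File:Nothoprocta pentlandii - Andean Tinamou - XC112728.ogg):

```python
# Sketch of turning one JSON record (format shown above) into the Commons
# file title pattern used for these uploads.

def commons_title(rec):
    return "%s %s - %s - XC%s.ogg" % (rec["gen"], rec["sp"], rec["en"], rec["id"])

rec = {"gen": "Nothoprocta", "sp": "pentlandii",
       "en": "Andean Tinamou", "id": "112728"}
print(commons_title(rec))  # → Nothoprocta pentlandii - Andean Tinamou - XC112728.ogg
```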
- Done I managed to fit this in before taking a break. 290 files uploaded. The new routine can be re-run at any time for automatic updates and left to suck in the JSON array, filter out existing files and upload with ogg conversions on the fly. I am now converting at the maximum quality setting for the ogg files (q=10), meaning they are slightly larger than the mp3 "originals". I noticed 2 files where the identification had changed at source, so the "duplicate" test now only identifies a match if genus, species and XC number are all the same (see Commons:Deletion requests/File:Butorides virescens - Green Heron - XC107690.ogg and File:Carpodacus mexicanus frontalis - House Finch - XC99546.ogg). This way, if the species identification changes on XC, there will be a new upload (which we could later pick out by searching for duplicate id numbers). I was slightly worried that we might get false matches where XC12345 is a substring of XC123456; checking the genus and species makes this far less likely to be a problem. --Fæ (talk) 07:07, 24 July 2013 (UTC)
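The duplicate test described above can be sketched as follows (not the production code; note that matching "XC&lt;id&gt;.ogg" as a title suffix also rules out the XC12345-inside-XC123456 substring trap):

```python
# Sketch of the duplicate test described above: a record only counts as
# already uploaded if genus, species AND the full XC number all match an
# existing Commons title. Requiring the "XC<id>.ogg" suffix pins the whole
# number, avoiding substring false matches.

def is_duplicate(rec, existing_titles):
    binomial = "%s %s" % (rec["gen"], rec["sp"])
    suffix = "XC%s.ogg" % rec["id"]
    return any(binomial in title and title.endswith(suffix)
               for title in existing_titles)

titles = ["Nothoprocta pentlandii - Andean Tinamou - XC112728.ogg"]
print(is_duplicate({"gen": "Nothoprocta", "sp": "pentlandii", "id": "112728"}, titles))  # → True
print(is_duplicate({"gen": "Butorides", "sp": "virescens", "id": "112728"}, titles))  # → False
```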
- By the way, the XC API is now finalized and documented. Unfortunately the output format has also changed (but only very slightly). You can find full documentation here. Sorry for the inconvenience. Jnthnjng (talk) 14:33, 9 August 2013 (UTC)
- Nice write-up, I'll make sure to check it over carefully on the next big update. I had reason to run a small update this week and spotted some changes, which I corrected for on the fly. By the way, the most recently added files have extra goodness added as comments embedded within the ogg file header, so context (which includes a link back to xeno-canto) should never be completely lost, even if a downloader forgets they originally grabbed an audio file from Commons. --Fæ (talk) 14:53, 9 August 2013 (UTC)
Discussion
Categories
(per email) Recordings with subspecies seem to be getting assigned to invalid categories. For instance File:Vireo gilvus gilvus - Warbling_Vireo - XC102969.ogg is assigned to Category:Vireo gilvus gilvus rather than just Category:Vireo gilvus.
If the scientific name has two components, there is no problem. If it has three, then there is a subspecies. The appropriate logic would be:
- If there are three parts in the scientific name, check for a subspecies category
- If the subspecies category exists, use it
- Else, use the species category.
If there are only a few examples, I'm happy to mop up manually. Andy Mabbett (talk) 18:24, 17 January 2013 (UTC)
- I think there are only a few examples. As the duff categories can easily be mass-changed using HotCat, I suggest that's the way to go. If at a future time there are a few hundred more files to upload from xeno-canto, then I'll think about tweaking the category logic. Checking whether a category exists and reacting accordingly is the sort of thing worth fiddling with for 10k+ files, but not for under 1k.
- The other issue, of file encoding problems, I might address by comparing file play lengths on XC and WC, but I still have to try out the WC API to see how easy that is. --Fæ (talk) 19:08, 17 January 2013 (UTC)
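For the record, the category fallback Andy outlines above could be sketched as follows, where `category_exists` is a stand-in for a real Commons category lookup:

```python
# Sketch of the category fallback outlined above: try the trinomial
# (subspecies) category first and fall back to the binomial (species) one.
# category_exists is a stand-in for a real Commons category lookup.

def pick_category(sci_name, category_exists):
    parts = sci_name.split()
    if len(parts) == 3 and category_exists(sci_name):
        return sci_name            # subspecies category exists: use it
    return " ".join(parts[:2])     # otherwise fall back to the species

existing = {"Vireo gilvus"}        # pretend only this category exists
print(pick_category("Vireo gilvus gilvus", lambda c: c in existing))  # → Vireo gilvus
```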
Source code
Refer to https://github.com/faebug/batchuploads/blob/master/batchXenocanto.py