Commons:Bots/Requests/Smallbot 9
Operator: Smallman12q
Bot's tasks for which permission is being sought: To fulfill Commons:Batch_uploading#VOA_pronunciation_sound_files by uploading ~6500 pronunciation files from http://names.voa.gov
Automatic or manually assisted: Automatic
Edit type (e.g. Continuous, daily, one time run): One initial run, followed by monthly runs
Maximum edit rate (e.g. edits per minute): 10-15, or as fast as it uploads
Bot flag requested: (Y/N): No
Programming language(s): Python 3.2 with requests and beautifulsoup4; ffmpeg for conversion (the script calls the avconv binary).
Source:
#!/usr/bin/env python3.2
# -*- coding: utf-8 -*-
#For uploading files from names.voa.gov to commons
from bs4 import BeautifulSoup
import requests
from subprocess import call
import os.path
import traceback
from PyRWiki import Wiki #Requests based wrapper for api
from p import p
DEBUG=False
#http://stackoverflow.com/questions/1752662/beautifulsoup-easy-way-to-to-obtain-html-free-contents
#http://stackoverflow.com/questions/10993612/python-removing-xa0-from-string
#http://stackoverflow.com/questions/2077897/substitute-multiple-whitespace-with-single-whitespace-in-python
def textOf(soup):
    return ' '.join(''.join(soup.findAll(text=True)).replace('\xa0', ' ').strip().split())
#make first letter of word Upper if after space or "-", only single space/-
def fixname(oldname):
    oldname=oldname.strip()
    fixed=""
    lastwasspace=True#Make first letter upper
    for i in oldname:
        if lastwasspace:
            fixed += i.upper()
            lastwasspace=False
        else:
            if i == " " or i == "-":
                lastwasspace = True
            else:
                lastwasspace = False
            fixed += i.lower()
    return fixed
def log(stuff):
    print(stuff)
#Log in
commons = Wiki("https://commons.wikimedia.org/w/api.php","Smallbot")
commons.login('Smallbot',p.bP)
commons.setEditToken()
counter= 2 #starts at 2
log('Checking for last id.')
if os.path.isfile('last.txt'):
    with open('last.txt', 'r') as content_file:
        counter = int(content_file.read())
    log('Last id found: ' + str(counter))
else:
    log('No prior id found. Starting at 2.')
session=requests.session()
if DEBUG:
    session.proxies = {'http': 'http://localhost:8888'}
session.headers = {'Referer': 'https://commons.wikimedia.org/wiki/Commons:Batch_uploading/VOA_pronunciation_sound_files'}
lastsuccess=counter
reached404=0
try:
    while reached404 < 25: # up to 25 can be skipped
        r = session.get('http://names.voa.gov/modal.phrasedetail.php?id=' + str(counter))
        #if r.status_code == 404:
        if "Cannot find the requested name" in r.text:
            reached404 += 1
            log('404 reached for ' + str(counter))
        else:
            reached404=0#reset 404 counter
            lastsuccess=counter
            soup=BeautifulSoup(r.content)
            soupbody=soup.select('div.modal-body')[0]
            if textOf(soupbody) != "How do you say ?":
                name=textOf(soupbody.select("h2")[0])[15:-1] # remove "How do you say" and '?'
                name=fixname(name)
                pronounce= textOf(soupbody.select('p')[0])
                region=textOf(soup.select('h4')[0].findNext('p'))#('h4 + p')[0]) #Adjacent sibling selector
                if textOf(soup.select('h4')[0]) != 'Region':
                    region=''
                r=session.get('http://names.voa.gov/sounds/' + str(counter) + '.mp3')
                r.raise_for_status() #should be no errors
                log('---------------------------------')
                log('ID: ' + str(counter))
                log('Name: ' + name)
                log('Pronounce: ' + repr(pronounce))
                log('Region: ' + region)
                log(str(len(r.content)) + ' bytes')
                with open('data.mp3','wb') as voamp3:
                    voamp3.write(r.content)
                filedesc="{{Information\n" +\
                    "|description= {{VOA pronunciation|term=" + name + "|region=" + region + "|transliteration=" + pronounce + "}}\n" +\
                    "|date= 2013\n" +\
                    "|source= VOA pronunciation guide: [http://names.voa.gov/modal.phrasedetail.php?id=" + str(counter) + " " + name + "]\n" +\
                    "|author= Jim Tedder\n" +\
                    "|permission= {{PD-USGov-VOA}}\n" +\
                    "|other_versions=\n" +\
                    "}}\n"
                if os.path.exists('data.ogg'):
                    os.remove('data.ogg')
                #call(['avconv', '-i', 'data.mp3', '-acodec', 'libvorbis', '-aq', '7', 'data.webm'])
                call(['avconv', '-i', 'data.mp3', '-acodec', 'libvorbis', '-aq', '7', 'data.ogg']) #use .ogg instead
                if region != '':
                    region = ' from ' + region
                commons.upload(title="En-us-" + name + region + ' pronunciation (Voice of America).ogg',
                    filelocation='data.ogg',
                    text=filedesc,
                    comment='[[Commons:Bots/Requests/Smallbot 9]]: Uploading Voice of America pronunciation files from http://names.voa.gov',
                    uploadifduplicate=False)
                #TODO-upload data.webm as file
            else:
                log('Empty at ' + str(counter))
        counter += 1
except:
    traceback.print_exc()
finally: #
    with open('last.txt','w') as lastfp:
        lastfp.write(str(lastsuccess))
log('Done.')
The script also needs a 'last.txt' file containing the value 6937.
Smallman12q (talk) 20:45, 2 May 2013 (UTC)
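The resume-and-stop behaviour of the script above (start from the ID stored in 'last.txt', stop after 25 consecutive missing IDs) can be sketched on its own, without any network access. This is only an illustrative sketch: the function names and the `exists` callback are hypothetical stand-ins for the page fetch, not part of the bot's source.

```python
import os


def read_checkpoint(path="last.txt", default=2):
    """Return the ID stored in `path`, or `default` when the file is absent."""
    if not os.path.isfile(path):
        return default
    with open(path, "r") as fp:
        return int(fp.read().strip())


def write_checkpoint(last_id, path="last.txt"):
    """Persist the last successfully processed ID for the next run."""
    with open(path, "w") as fp:
        fp.write(str(last_id))


def scan_ids(start_id, exists, max_misses=25):
    """Walk IDs upward from start_id until max_misses consecutive IDs are missing.

    `exists` is a hypothetical callback standing in for the HTTP request;
    it returns True when the entry for an ID is present.
    Returns (list of found IDs, last successful ID).
    """
    found = []
    last_success = start_id
    misses = 0
    current = start_id
    while misses < max_misses:
        if exists(current):
            misses = 0  # a hit resets the consecutive-miss counter
            last_success = current
            found.append(current)
        else:
            misses += 1
        current += 1
    return found, last_success
```

Because a hit resets the miss counter, gaps shorter than `max_misses` are skipped rather than ending the run, which is presumably why the script tolerates up to 25 missing IDs in a row.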
Discussion
What should the file description be? Should I use {{Pronunciation}}? Smallman12q (talk) 20:45, 2 May 2013 (UTC)
- Looks like this template is not popular, but it's a good idea to standardize descriptions for classes of media files. BTW, is this source so unique that Commons doesn't already have such pronunciations? :-) --EugeneZelenko (talk) 14:39, 3 May 2013 (UTC)
- I don't believe Commons has these pronunciations. Is there some standard pronunciation template? I'll probably make one for the VOA files. Smallman12q (talk) 03:09, 4 May 2013 (UTC)
- The template would read:
Voice of America pronunciation of <term> from the region of <region>. Transliteration: <transliteration>
- Is that fine? It'll also auto-categorize by region and by the first letter of the first name, so "AL-HALQI, WAEL" would become "WAEL AL-HALQI" and be categorized under W. Is the letter/region categorization needed? Smallman12q (talk) 19:51, 4 May 2013 (UTC)
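The name reordering and first-letter categorization described above could look roughly like the following. This is a sketch under assumptions: `reorder_name` and `category_keys` are hypothetical helpers, and the posted script does not contain this step.

```python
def reorder_name(raw):
    """Turn 'LAST, FIRST' into 'FIRST LAST'; names without a comma are only trimmed."""
    last, sep, first = raw.partition(",")
    if not sep:
        return raw.strip()
    return (first.strip() + " " + last.strip()).strip()


def category_keys(raw_name, region=""):
    """Return the reordered name, its sort letter, and the trimmed region.

    The sort letter is the first character of the reordered name, i.e. the
    first letter of the first name when the input is 'LAST, FIRST'.
    """
    name = reorder_name(raw_name)
    letter = name[0].upper() if name else ""
    return {"name": name, "letter": letter, "region": region.strip()}
```

For the example in the comment above, `reorder_name("AL-HALQI, WAEL")` gives `"WAEL AL-HALQI"`, and the sort letter is `"W"`.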
Well... this should clearly be marked as an American pronunciation recommendation. At least for the few German names I have checked, this is certainly not the gold standard for pronunciation (Erik Honnecker, Frantz Muntefering, and many more). --Dschwen (talk) 16:49, 3 May 2013 (UTC)
- There is contact info; I'm sure they would be open to your feedback [1] (or refer them to our local Goethe institute). The value is that it is a currently maintained public domain source of pronunciations. Slowking4⇔ †@1₭ 13:03, 4 May 2013 (UTC)
I've uploaded a few to Category:Terms from Voice of America pronunciation guide. Is it good to go? Smallman12q (talk) 14:03, 8 May 2013 (UTC)
- It would be a good idea to include these files in some pronunciation categories. --EugeneZelenko (talk) 14:27, 8 May 2013 (UTC)
- I could add them to Category:English pronunciation and also prepend the names with En-us so it'd be "File:En-us Abadilla from Philippines pronunciation (Voice of America).webm"? Would that be all? Smallman12q (talk) 23:17, 8 May 2013 (UTC)
- Adding a language code prefix is definitely a good idea. BTW, why not upload in Ogg format? The majority of pronunciations, at least, use this format. --EugeneZelenko (talk) 14:37, 9 May 2013 (UTC)
- I've asked at w:Wikipedia:Village_pump_(technical)#Preferred_format_for_pronunciations whether it should be .webm or .ogg. Is there a reason you prefer one over the other? I can do either; it's only a one-line change. Smallman12q (talk) 17:56, 9 May 2013 (UTC)
- The bot is now uploading all files as .ogg. Could you delete:
- File:Egil Aarvik from Norway pronunciation (Voice of America).webm
- File:Sani Abacha from Nigeria pronunciation (Voice of America).webm
- File:Jorge Abadia from Panama pronunciation (Voice of America).webm
- File:Abadilla from Philippines pronunciation (Voice of America).webm
- File:Leonid Abalkin from Russia pronunciation (Voice of America).webm
- File:Domingo Iturbe Abasolo from Spain pronunciation (Voice of America).webm
Smallman12q (talk) 00:36, 10 May 2013 (UTC)
- You could just add {{Superseded}} or {{Delete}} to the files.
If there are no other objections, I think the task should be approved. --EugeneZelenko (talk) 14:31, 10 May 2013 (UTC)
- Initial run is done. Will run monthly or so in the future. Smallman12q (talk) 23:20, 10 May 2013 (UTC)
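The naming and format decisions reached in this discussion (an En-us prefix, an optional region, and .ogg rather than .webm) can be sketched as two small helpers. The helper names and parameters here are hypothetical; the title pattern and the avconv arguments are copied from the script in the request, and the sketch only builds the command list without running it.

```python
def build_title(name, region="", lang_prefix="En-us", ext="ogg"):
    """Build the upload title in the pattern used by the script above."""
    suffix = " from " + region if region else ""
    return "{0}-{1}{2} pronunciation (Voice of America).{3}".format(
        lang_prefix, name, suffix, ext)


def build_convert_command(src, dst, quality=7, tool="avconv"):
    """Return the argument list for converting src to Vorbis audio at dst.

    The flags mirror the call in the script; the output container is chosen
    by the extension of dst, so switching formats only changes that one name.
    """
    return [tool, "-i", src, "-acodec", "libvorbis", "-aq", str(quality), dst]
```

With these, `build_title("Abadilla", "Philippines")` yields `"En-us-Abadilla from Philippines pronunciation (Voice of America).ogg"`, and passing `ext="webm"` reproduces the earlier naming with only the extension changed.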