Commons:Bots/Requests/Andrebot (2)

From Wikimedia Commons, the free media repository

Operator: Andrei S. Talk

Bot's tasks for which permission is being sought: Mass upload of files from the Romanian-language Wikipedia, in particular CC-licensed files uploaded to ro.wp by this user.

Automatic or manually assisted: Automatic, supervised by the operator.

Edit type (e.g. Continuous, daily, one time run): Intermittent run (manual start for each set of 10-50 images with similar names)

Maximum edit rate (e.g. edits per minute): 1-2 uploads per minute

Bot flag requested: (Y/N): Yes

Programming language(s): Python, script using the Pywikipediabot framework. Code follows:

Code
# -*- coding: utf-8 -*-
import os, sys, re, datetime
import wikipedia, upload  # Pywikipediabot modules

site = wikipedia.getSite()
cmsite = wikipedia.getSite("commons", "commons")
pattern = sys.argv[1]
nfpattern = sys.argv[2]
categ = unicode(sys.argv[3], 'utf-8')
startno = int(sys.argv[4])
endno = int(sys.argv[5])
extranamefile = sys.argv[6]

crtfileindex = 0
extranames = []
if os.path.exists(extranamefile):
	extranames = open(extranamefile).readlines()

for i in range(startno, endno + 1): 
	origfilename = pattern.replace("_number_", repr(i))
	filename = u"Fişier:" + origfilename
	wikipedia.output(filename)
	page = wikipedia.ImagePage(site, filename)
	if (not page.exists()):
		wikipedia.output('page not found')
		continue

	url = page.fileUrl()
	#uo = wikipedia.MyURLopener()
	#datasrc = uo.open(url)
	pagetext = page.get()
	ex = re.compile(u"\{\{([\w \-\:]+[\s]*)((\|(([\w \-]+[\s]*=)?[\s]*[\w \-\{\}\[\]\,\.\?]*[\s]*)[\s]*)*)\}\}", re.U)
	res = re.findall(ex, pagetext)
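	if not res:
		wikipedia.output('no template found on the description page')
		continue
	print repr(res)
	# Reset the per-file fields so values from a previous file are never reused.
	origdescr = origsrc = origdate = origauthor = None
	if res[0][0].startswith(u"Informaţii"):
		# Plain str.split is used here: re.split(pattern, text, re.U) passed
		# re.U as the maxsplit argument, not as a flag.
		params = res[0][1].split(u"|")
		for param in params:
			# Split on the first "=" only, so the value stays in one piece.
			pelems = param.split(u"=", 1)
			print repr(pelems)
			if len(pelems) < 2:
				continue
			if pelems[0].startswith("Descriere"):
				origdescr = pelems[1].strip()
			elif pelems[0].startswith("Sursa"):
				origsrc = pelems[1].strip()
			elif pelems[0].startswith("Data"):
				origdate = pelems[1].strip()
			elif pelems[0].startswith("Autor"):
				origauthor = pelems[1].strip()
	if None in (origdescr, origdate, origauthor):
		wikipedia.output('description, date or author missing, skipping')
		continue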
	versionhistory = page.getVersionHistory(reverseOrder=True, revCount=1)
	origuploader = versionhistory[0][2]

	months = [u'ianuarie', u'februarie', u'martie', u'aprilie',  u'mai',  u'iunie',  u'iulie',  u'august',  u'septembrie',  u'octombrie',  u'noiembrie',  u'decembrie']
	dateparts = origdate.split()
	year = int(dateparts.pop())
	month = months.index(dateparts.pop()) + 1
	if (len(dateparts) > 0):
		day = int(dateparts.pop())  # datetime.date() requires integers
		actualdate = datetime.date(year, month, day)
		newdate = actualdate.strftime('%Y-%m-%d')
	else:
		actualdate = datetime.date(year, month, 1)
		newdate = actualdate.strftime('%Y-%m')
	
	description = u"=={{int:filedesc}}==\n"
	description += u"{{Information\n"
	description += u"|Description={{ro|" + origdescr + u"}}\n"
	description += u"|Date=" + newdate + u"\n"
	if origauthor == origuploader:
		description += u"|Source=Uploaded to ro.wikipedia by the author, as [[:ro:Fişier:" + origfilename + u"|]]\n"
	else:
		description += u"|Source=Uploaded to ro.wikipedia by [[:ro:Utilizator:" + origuploader + u"|]], as [[:ro:Fişier:" + origfilename + u"|]]\n"
	if origauthor == origuploader:
		description += u"|Author=[[:ro:Utilizator:" + origuploader + u"|]] at ro.wikipedia\n"
	else:
		description += u"|Author=" + origauthor + u", uploaded by [[:ro:Utilizator:" + origuploader + "|]] at ro.wikipedia\n"
	description += u"|Permission={{self|cc-by-sa-2.5|author=" + origauthor + u"}}\n"
	description += u"}}\n"
	description += u"==Original history==\n" + page.getFileVersionHistoryTable()
	description += u"[[Category:" + categ + "]]"

	wikipedia.output(description)

	if crtfileindex < len(extranames):
		alteredpattern = nfpattern.replace("_number_", "_" + extranames[crtfileindex].strip() + "__number_")
	else:
		alteredpattern = nfpattern
	newname = alteredpattern.replace("_number_", repr(i))
	newname = newname.replace(" ","_").replace("__", "_")
	crtfileindex = crtfileindex + 1
	uploader = upload.UploadRobot(useFilename = newname, description = description, targetSite = cmsite, url = url)
	uploader.run()

	pagetext = pagetext + u"{{NowCommons|File:" + newname + u"}}"
	page.put(pagetext, comment = u"mutat la Commons")

Andrei S. Talk 07:33, 21 January 2010 (UTC)

Discussion

I guess you took a look at imagecopy.py, right? Looks good at first sight. Could you apply some logic to the author field? If origauthor and the uploader are the same, produce something nicer. Also, it wouldn't hurt to have {{Self|Cc-by-sa-2.5|author=the author}} to stress the author (this will also give a nice attribution field). Multichill (talk) 09:12, 21 January 2010 (UTC)

You're right, I did look into the Pywikipediabot source code, to reuse as many parts of it as possible. I'll look into your suggestion tonight and I'll try to improve the author logic as well as the CC template. --Andrei S. Talk 10:13, 21 January 2010 (UTC)
Ok great. Please include a link to the original file somewhere. Multichill (talk) 10:23, 21 January 2010 (UTC)
I updated the code with added improvements. This is a test-file uploaded by the bot running the above code.
Image review is done by manually analyzing the file description page at ro.wikipedia. The purpose of this bot request is, in fact, to transfer the files uploaded by Utilizator:Țetcu Mircea Rareș, who appears to copy and paste the description page across those batches. All his files are licensed Cc-by-sa-2.5; if there are any exceptions, they won't be uploaded. To be sure, I'll add a bit of code to check for the Cc-by-sa-2.5 template and automatically skip the files whose description page doesn't contain it. For any other files at ro.wp, I think Commonshelper2 would do, but Mr Țetcu's thousands of files are too many to handle one by one in a reasonable amount of time and with a reasonable amount of effort.
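The licence check described above could look roughly like the following. This is only a sketch: the helper name `has_license_template` and its default template name are assumptions for illustration and are not part of the posted script, and the name is matched case-insensitively for simplicity (MediaWiki itself only ignores the case of the first letter).

```python
import re


def has_license_template(pagetext, template=u"Cc-by-sa-2.5"):
    """Return True if pagetext transcludes the given template.

    Hypothetical helper: the template name must be followed by "|" or "}}",
    so longer names that merely start with it are not accepted.
    """
    pattern = r"\{\{\s*" + re.escape(template) + r"\s*(\||\}\})"
    return re.search(pattern, pagetext, re.I | re.U) is not None
```

In the bot's loop this would be called right after `page.get()`, skipping the file with `continue` when it returns False.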
I think it would also be a good idea to ask the authors of images that are thumbnail-sized, lack EXIF data, or are single uploads that look too professional (and may have been taken from the Internet) to upload a full-resolution version. --EugeneZelenko (talk) 16:23, 23 January 2010 (UTC)
The script takes an argument (the third one), which is the name of a Commons category to be added to all the files. As the script will be run manually on the command line for each batch of images, all the images in a batch will be from the same place and can therefore be categorized as such. The category will be added manually to the tree, as was the case with last night's test upload. This is all that the bot can do. Later, using HotCat, I can manually review the files and add those that qualify to other categories, such as those in the Wooden churches in Romania subtree.
I admit that these pictures are rather poorly described, rather poorly titled and most of them are only of passable quality. We tried to talk to the uploader, but he has proven to be very touchy when it comes to any hint of a request to improve his uploads. That said, the pictures are still valuable, because most of them are taken in remote villages of Romania, where people with a digital camera rarely set foot with the intent of visually documenting the place, and also most of the photos document old churches that are historical monuments. I am renaming his original filenames to a template like: RO XX Yyyyyy number.jpg, where RO means Romania, XX is the designation of the administrative division where the village is located, and Yyyyyy is the name of the village. It isn't very descriptive, but I think it's the best I can do; I am open to any ideas, though. --Andrei S. Talk 15:41, 22 January 2010 (UTC)
Maybe you could create a list of better file names during the review and give it to the bot? I think acronyms should be replaced with full names. It would also be a good idea to replace the numbers with the subject (building, church, event, elephant (in the case of African images), etc.). --EugeneZelenko (talk) 16:23, 23 January 2010 (UTC)
Sorry for the delayed answer. This enhancement needs a bit of extra coding and I don't have time at the moment. This weekend I'll be able to work on this. --Andrei S. Talk 19:33, 28 January 2010 (UTC)
OK, I put up the updated code. An example of a recent run is the set of files uploaded to Category:Urisiu de Sus. The script now reads a file containing extra wording, one entry per image, and inserts it into the file name just before the number. --Andrei S. Talk 19:51, 7 February 2010 (UTC)
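The renaming step can be illustrated in isolation as follows. This is a sketch that mirrors the `_number_` placeholder handling in the posted script; the function name `build_commons_name` and the example pattern are hypothetical.

```python
def build_commons_name(nfpattern, number, extra=None):
    """Fill the "_number_" placeholder in a new-name pattern.

    If extra wording is given (for example one line read from the
    extra-names file), it is inserted just before the number.
    """
    if extra:
        nfpattern = nfpattern.replace("_number_", "_" + extra.strip() + "__number_")
    name = nfpattern.replace("_number_", str(number))
    # Commons file names use underscores; collapse doubled ones afterwards.
    return name.replace(" ", "_").replace("__", "_")


# Hypothetical pattern in the "RO XX Yyyyyy number.jpg" style.
print(build_commons_name("RO XX Urisiu de Sus _number_.jpg", 3))
print(build_commons_name("RO XX Urisiu de Sus _number_.jpg", 3, "wooden church\n"))
```

Because the pattern is passed in as an argument, spelling out acronyms or changing the word order only requires a different pattern string, not a change to the function.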
Thank you! Could you also replace acronyms with full names (RO -> Romania, etc.)? --EugeneZelenko (talk) 15:29, 9 February 2010 (UTC)
Yes, I'll keep that in mind. The script doesn't need to be changed for this; the new name template is provided as a command-line argument. --Andrei S. Talk 17:39, 9 February 2010 (UTC)