Commons:Bots/Requests/SchlurcherBot9
SchlurcherBot (talk · contribs) 9 (Update to Request 8)
Operator: Schlurcher (talk · contributions · Statistics · Recent activity · block log · User rights log · uploads · Global account information)
Bot's tasks for which permission is being sought: Add structured data based on information provided on file description page according to Commons:Structured data/Modeling
Automatic or manually assisted: Automatic
Edit type (e.g. Continuous, daily, one time run): One time run at increased speed.
Maximum edit rate (e.g. edits per minute): 50, then 100, then 200 if feasible (previously 30)
Bot flag requested: (Y/N): N (Bot has flag already)
Programming language(s): Bash + QuickStatements, later Pywikibot + Python API Calls (operated locally and on en:Microsoft Azure with Commons:IP block exemption)
Schlurcher (talk) 14:22, 6 January 2020 (UTC)
Discussion
This request is motivated by a discussion at Commons_talk:Structured_data#Structured_copyright_and_licensing_for_search_indexing. Summary:
- The development team is trying to use structured copyright and licensing information to improve the search experience
- Currently only a few bots add structured data to files, and at their pace completing the task will take years
- I have been asked whether it is possible to drastically increase the edit speed
- Database administrators are okay with slowly ramping up from the operations side whenever the community is ready
This request will use the same code as my request 8. Previous request for reference:
Commons:Bots/Requests/SchlurcherBot8
===SchlurcherBot (talk · contribs) ===
Bot's tasks for which permission is being sought: Add structured data based on information provided on file description page according to Commons:Structured data/Modeling
Automatic or manually assisted: Automatic
Edit type (e.g. Continuous, daily, one time run): Batches based on prepared lists
Maximum edit rate (e.g. edits per minute): 30
Bot flag requested: (Y/N): N (Bot has flag already)
Programming language(s): Bash + QuickStatements, later Pywikibot
Schlurcher (talk) 14:22, 6 January 2020 (UTC)
Discussion
This request is motivated by a discussion on User_talk:JarektBot#original_creation_by_uploader. The intended task is to add structured data based on information provided on file description page. The first task envisioned is to add P7482 / Q66458942 (own work by original uploader) to all files that fulfill the following requirements:
Currently, I will use a bash script that queries the Commons API to check for all 3 conditions. Edits will be added to a list that can be processed through QuickStatements batch runs. A first batch run was performed under my username: Bot Test Run. Moving forward, the edits are planned under the bot username (with flag). Further structured data statements are expected to be added. The task is similar to the recent actions of BotMultichill (talk · contribs) and BotMultichillT (talk · contribs). However, Multichill (talk · contribs) works on files that use both {{Own}} and {{Self}} (without checking for condition number 3 above). So some broader coverage is expected with this task. The task is expected to be broadened over time to include structured data derived from the author, source, date and license information. Once structured data on Commons is properly implemented in Pywikibot, the bot might switch to this framework (as used for the other tasks of the bot). --Schlurcher (talk) 14:22, 6 January 2020 (UTC)
@EugeneZelenko: I have performed a test run on the bot account. Results are here: [1] --Schlurcher (talk) 19:21, 8 January 2020 (UTC)
If there are no objections, I think task should be approved. --EugeneZelenko (talk) 16:17, 15 January 2020 (UTC)
The change is that I plan to run additional versions of the same script in parallel as well as making use of cloud resources (en:Microsoft Azure). Updates are marked in bold above. A staggered approach as follows is envisioned:
- 50 edits per minute (which can be achieved by my personal infrastructure)
- 100 edits per minute (personal infrastructure + cloud infrastructure)
- 200 edits per minute (personal infrastructure + additional cloud infrastructure)
- Additional clone of this bot (not covered by this request and likely would need to be done by a separate volunteer)
Use of cloud resources is required starting from stage 2. The use of cloud resources requires Commons:IP block exemption, which has been granted by Taivo per standard process. Stage 3 would go beyond the maximum bot activity recorded on Commons and would likely require closer monitoring by the infrastructure team. Stage 4 will be part of a separate discussion and request. This request is to get community support for increasing the maximum edit rate as compared to my previous approved request. --Schlurcher (talk) 21:03, 6 August 2020 (UTC)
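As an illustration of the staged rates listed above, here is a minimal Python sketch of how a combined target rate might be split across scripts running in parallel. It is not the bot's actual code: the function names `per_worker_delay` and `throttled`, the `STAGE_RATES` table, and the worker counts in the usage note are assumptions made for this example only.

```python
import time

# Target combined edit rates per stage, in edits per minute (from the request).
STAGE_RATES = {1: 50, 2: 100, 3: 200}


def per_worker_delay(target_edits_per_minute: float, workers: int) -> float:
    """Seconds each worker waits between edits so the combined rate
    of all workers stays at or below the target."""
    if target_edits_per_minute <= 0 or workers <= 0:
        raise ValueError("rate and worker count must be positive")
    return 60.0 * workers / target_edits_per_minute


def throttled(items, delay_seconds, sleep=time.sleep, clock=time.monotonic):
    """Yield items no faster than one per delay_seconds.

    sleep and clock are injectable so the pacing can be tested
    without actually waiting.
    """
    next_slot = clock()
    for item in items:
        now = clock()
        if now < next_slot:
            sleep(next_slot - now)
        # Schedule the next slot relative to when this one was due.
        next_slot = max(now, next_slot) + delay_seconds
        yield item
```

For example, with a hypothetical two workers at stage 2, `per_worker_delay(STAGE_RATES[2], 2)` gives 1.2 seconds between edits per worker, and each worker would wrap its prepared list of edits in `throttled(...)`.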
- @Keegan (WMF), Bjh21, Jarekt, Tacsipacsi, Multichill, Mmullie (WMF), Gestumblindi, and Taivo: You have been involved in the discussions that led to this proposal. Please share your thoughts. --Schlurcher (talk) 21:03, 6 August 2020 (UTC)
- If this is just about being allowed to edit with higher speed then I Support it. I agree that it is a good idea to start slow and monitor when going to the next step. --MGA73 (talk) 21:54, 6 August 2020 (UTC)
- Thanks, MGA73. Note that the staggered increase in edit rate is not only to monitor system performance, but also to monitor the community reaction. Although all edits are marked as bot edits, as far as I know it is not possible to mark this kind of edit as minor. So people who show bot edits on their watchlists have more difficulty filtering these edits out. There is a risk that users get flooded with these edits. --Schlurcher (talk) 11:38, 7 August 2020 (UTC)
- @Schlurcher: Thanks for the info. I have 52k pages on my watchlist so I see your bot all the time :-) But I prefer flooding over having a nice watchlist. But if we want we can probably code our commons.js so it ignores changes made by your bot. --MGA73 (talk) 11:45, 7 August 2020 (UTC)
- I found this User_talk:INaturalistReviewBot#Watchlist but I have not tested it. --MGA73 (talk) 11:47, 7 August 2020 (UTC)
- Support This is something that we asked Schlurcher to do and we are grateful that he agreed to look into it. As a reference, 100 edits/min means ~1M edits/week and Commons has 63M files. I am not sure how many files meet Schlurcher's bot conditions, but at stage 2 it would take about a year of continuous running. Also as a reference, I often use the widely used d:Help:QuickStatements (QS) tool for much simpler tasks of adding single SDC statements. With QS you do not have control over speed since the tool running on toolserver paces itself based on load on the servers. Its speed can vary a lot, but in some cases I clocked it at over 180 edits per minute. However, that pace is usually short-lived, as QS works with individually loaded batches of images and any batch bigger than 25K times out during job loading. --Jarekt (talk) 04:10, 7 August 2020 (UTC)
- I don't get it. If you modify the Pywikibot settings a bit, one stupid old laptop can do 100s of edits per minute. I understand you don't want to leave your laptop on 24x7. Why use Azure and not Toolforge? The whole replag thing is related to Wikidata and the fact that the query service can't keep up. We don't have that problem here as our local query service is only updated about once a week. Multichill (talk) 07:46, 7 August 2020 (UTC)
- Hi Multichill, there are a couple of reasons. 1) I'm not using a laptop but a dedicated Raspberry Pi to run this operation. This is mainly due to power consumption considerations. 2) Azure offers a free 12 month trial for two vCPUs that I can use to run this 24x7. 3) I have zero experience with Toolforge and limited motivation to learn it. On the other hand, I have been interested in learning Azure for a long time, and when the discussion started, I thought this was finally a good opportunity to do so. 4) My motivation to run this bot is also to learn system administration, system monitoring and process management and to improve my Linux/cloud skills. I like to have full control over the environment. --Schlurcher (talk) 08:19, 7 August 2020 (UTC)
- Seems OK for me. --EugeneZelenko (talk) 13:21, 7 August 2020 (UTC)
- Support as the person who brought up the topic on the SDC talk page, thanks Schlurcher. Keegan (WMF) (talk) 18:27, 10 August 2020 (UTC)
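The throughput figures quoted in the discussion above (100 edits/min, about 1M edits/week, 63M files) can be checked with a short calculation. The Python sketch below uses illustrative function names and assumes every file needs exactly one edit, so the result is an upper bound: the discussion notes that the number of files actually meeting the bot's conditions is not known.

```python
MINUTES_PER_WEEK = 60 * 24 * 7  # 10,080 minutes


def edits_per_week(edits_per_minute: float) -> float:
    """Edits made in one week of continuous running at a constant rate."""
    return edits_per_minute * MINUTES_PER_WEEK


def weeks_to_cover(files: int, edits_per_minute: float) -> float:
    """Weeks needed to edit every file once at a constant rate."""
    return files / edits_per_week(edits_per_minute)


if __name__ == "__main__":
    total_files = 63_000_000  # file count cited in the discussion
    for rate in (50, 100, 200):
        print(f"{rate} edits/min: {edits_per_week(rate):,.0f} edits/week, "
              f"{weeks_to_cover(total_files, rate):.1f} weeks for all files")
```

At 100 edits per minute this gives 1,008,000 edits per week and 62.5 weeks for 63M files, consistent with the "about a year" estimate for stage 2.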
Info: Based on the support seen so far, I have increased the edit rate to approximately 50 edits per minute. Please let me know if there are any concerns. --Schlurcher (talk) 18:38, 13 August 2020 (UTC)
Approved. --Krd 05:55, 14 August 2020 (UTC)