Commons:Bots/Requests/Embedded Data Bot
- See also: AN thread on this issue
Operator: Zhuyifei1999
Bot's tasks for which permission is being sought: Attempts to detect files with embedded data and adds {{Speedy}} to the file description pages.
Automatic or manually assisted: Automatic unsupervised
Edit type (e.g. Continuous, daily, one time run): Continuous
Maximum edit rate (e.g. edits per minute): 6 edits/min if this stays on tool labs
Bot flag requested: (Y/N): Y
Programming language(s): python: pywikibot (source)
Zhuyifei1999 (talk) 19:53, 29 December 2016 (UTC)
Discussion
- Currently undergoing testing. --Zhuyifei1999 (talk) 19:53, 29 December 2016 (UTC)
- Edits are currently not saved; I'll let it edit once the false positive / false negative rate is reduced and the program's performance is optimized. --Zhuyifei1999 (talk) 06:20, 30 December 2016 (UTC)
- Now test running with {{Speedy}} hidden in <!-- -->. --Zhuyifei1999 (talk) 11:25, 30 December 2016 (UTC)
- It has been observed that a few files, such as File:Eha_emetẹ_dance_from_Esaba.jpg, contain an embedded jpg right after the visible jpg (this one is at byte 1838329), at a lower resolution, with no visible differences from the visible part of the jpg. A more detailed machine-generated description of the embedded file is 'JPEG image data, Exif standard: [TIFF image data, little-endian, direntries=1], baseline, precision 8, 1440x1080, frames 3'. Shall the bot just ignore the file if the embedded file is found to be an image file, or overwrite the file? I personally prefer the latter, as it will prevent three-part embedded archives (jpg-jpg-zip): if set to the former, my bot would get the first part out, and the concatenated second and third parts would be identified as jpg and get ignored. --Zhuyifei1999 (talk) 07:21, 30 December 2016 (UTC)
- I'd say overwrite is reasonable. --Krd 09:49, 30 December 2016 (UTC)
- OK, set to overwrite when both conditions are satisfied: the exact file ending is found (i.e. not detected by the ffmpeg hack, which is used with video/audio files), and the MIME of the second (or later) part equals that of the first part. --Zhuyifei1999 (talk) 11:25, 30 December 2016 (UTC)
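The two overwrite conditions stated above can be illustrated with a minimal sketch. This is not the bot's actual code: the names `MAGIC`, `sniff_mime`, and `should_overwrite` are hypothetical, the magic-number table is a tiny stand-in for a real file-type detector, and the caller is assumed to already know the byte offset where the first part ends.

```python
from typing import Optional

# Hypothetical, deliberately tiny signature table for illustration only.
MAGIC = (
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"PK\x03\x04", "application/zip"),
)


def sniff_mime(data: bytes) -> Optional[str]:
    """Guess a MIME type from the leading bytes, or None if unknown."""
    for magic, mime in MAGIC:
        if data.startswith(magic):
            return mime
    return None


def should_overwrite(data: bytes, end_offset: int, exact_end_found: bool) -> bool:
    """Return True only when both conditions from the discussion hold:
    the exact end of the first part is known, and the trailing part has
    the same MIME type as the first part."""
    first, rest = data[:end_offset], data[end_offset:]
    if not rest or not exact_end_found:
        return False
    first_mime = sniff_mime(first)
    return first_mime is not None and first_mime == sniff_mime(rest)
```

With a jpg-jpg-zip concatenation, the remainder after the first jpg starts with a JPEG signature, so this sketch returns True and the trailing data (including the archive) would be stripped by the overwrite. What happens when it returns False is not covered here.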
- Is it possible just to remove the data and overwrite the file with a valid image? --EugeneZelenko (talk) 15:10, 30 December 2016 (UTC)
- Yes, I actually prefer it this way. But a few people didn't like that. Not sure if this is still the case. --Zhuyifei1999 (talk) 16:59, 30 December 2016 (UTC)
- Analyzing some not-so-obvious files, some of them seem really weird: File:Postkarte_Hosterlitz_Ortsansichten_1915.jpg contains two end-of-file markers (0xFFD9), but only one start-of-file marker (0xFFD8). So the "embedded data" seems to be half of a valid jpeg in some way. The truncated file has been uploaded, and I can't see any visible difference between the original and the truncated version. --Zhuyifei1999 (talk) 16:59, 30 December 2016 (UTC)
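The marker mismatch described above can be checked with a rough sketch like the following. It is a naive byte scan, not the bot's parser: `marker_counts` and `bytes_after_first_eoi` are hypothetical helper names, and the counts are only a heuristic because metadata segments (for example an Exif thumbnail) can legitimately contain their own 0xFFD8 / 0xFFD9 pairs.

```python
from typing import Optional, Tuple

SOI = b"\xff\xd8"  # JPEG start marker
EOI = b"\xff\xd9"  # JPEG end marker


def marker_counts(data: bytes) -> Tuple[int, int]:
    """Count raw occurrences of the start and end markers."""
    return data.count(SOI), data.count(EOI)


def bytes_after_first_eoi(data: bytes) -> Optional[int]:
    """Number of bytes following the first end marker, or None if absent."""
    pos = data.find(EOI)
    if pos == -1:
        return None
    return len(data) - (pos + len(EOI))
```

A file reporting one start marker, two end markers, and a nonzero number of trailing bytes would match the "half of a valid jpeg" case; deciding where the image really ends still needs a proper segment-aware parser.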
Comment Bot flag applied for test phase, keeping discussion open for 7 days. --Krd 08:30, 31 December 2016 (UTC)
- For reference, custom parsers have been added to accurately determine the endings of ogg and webm files. I'm currently writing one for flac.
- Some files tagged by the AF are found to be very suspicious, but are false positives, such as the one by User:Yethuhtet771. It's possible that the pirates are testing our system --Zhuyifei1999 (talk) 07:00, 1 January 2017 (UTC)
- (Same for File:Chemical process217.webm, an almost two-hour long video.) --Zhuyifei1999 (talk) 07:35, 1 January 2017 (UTC)
- With improved precision, {{Speedy}} is no longer commented out --Zhuyifei1999 (talk) 09:01, 1 January 2017 (UTC)
- Per the suggestion from @Srittau, the bot is now using {{Embedded data}} to tag speedy deletions --Zhuyifei1999 (talk) 08:29, 3 January 2017 (UTC)
- Could you please repeat the test run? I see file overwrites, but no added templates. --EugeneZelenko (talk) 15:22, 3 January 2017 (UTC)
- @EugeneZelenko: The files tagged are supposed to be deleted fast, hence the transclusions of {{Speedy}} or {{Embedded data}} (which transcludes {{Speedy}} anyway) rather than a full-fledged DR; see Special:DeletedContributions/Embedded_Data_Bot. The older edits that added tags, visible in Special:Contributions/Embedded_Data_Bot, are more like noise and were usually not added in a malicious manner, and my bot is currently coded to ignore them. --Zhuyifei1999 (talk) 17:03, 3 January 2017 (UTC)
- Regarding the full algorithm, feel free to refer to the code (published on GitHub, linked above). I don't feel comfortable revealing the whole thing in plain English in public, as that may ease workarounds or other abuse.
- Also, the bot operates by watching the Recent Changes. There is one RecentChanges-watching process that feeds all the uploads and reuploads into a FIFO (first in, first out) queue in Tool Labs, and two worker processes that pull from that queue and check each file for embedded data. Basically: the bot run is continuous, much like SignBot. --Zhuyifei1999 (talk) 17:10, 3 January 2017 (UTC)
- Probably I misunderstood the big picture. In the case of safe data, the file will just be overwritten, and in the case of unsafe data it will be nominated for deletion? --EugeneZelenko (talk) 15:13, 4 January 2017 (UTC)
- I'm not entirely sure what you mean by safe & unsafe. The overwritten ones are only those that seem to be redundant (i.e. MIME of removed part = MIME of kept part); overwriting is to prevent possible abusive workarounds of this bot by the pirates. The ones tagged for speedy deletion are tagged in compliance with COM:CSD#F9, to discourage Zero pirates' usage of Commons as an arbitrary file-sharing piracy site. These are not security issues (I mean it doesn't trigger a modern browser to run a script or whatever).
- Perhaps the task description isn't clear about the recent Zero abuse, so I've added a link on the top to the relevant AN discussion --Zhuyifei1999 (talk) 15:41, 4 January 2017 (UTC)
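The watcher-plus-workers arrangement described earlier in this thread (one process feeding uploads into a FIFO queue, two workers consuming it) can be sketched roughly as follows. This is an illustration under assumptions, not the bot's code: it uses threads and the standard-library `queue.Queue` instead of separate processes on Tool Labs, the event fields `log_type` and `title` are hypothetical, and `check` stands in for the real embedded-data detection.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker to exit


def watcher(events, q, n_workers):
    """Feed upload events into the FIFO queue, then signal the workers to stop."""
    for event in events:
        if event.get("log_type") == "upload":  # hypothetical field name
            q.put(event["title"])
    for _ in range(n_workers):
        q.put(_STOP)


def worker(q, check, results, lock):
    """Pull titles from the queue in FIFO order and record the check result."""
    while True:
        title = q.get()
        if title is _STOP:
            break
        verdict = check(title)
        with lock:
            results[title] = verdict


def run(events, check, n_workers=2):
    """Run one watcher and n_workers workers over a finite list of events."""
    q = queue.Queue()
    results, lock = {}, threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(q, check, results, lock))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    watcher(events, q, n_workers)
    for t in threads:
        t.join()
    return results
```

A continuous bot would replace the finite `events` list with a live change feed and never send the stop sentinel; the finite version here just makes the flow easy to follow.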
Test running well, ongoing development, no known problems. Approved. --Krd 11:59, 8 January 2017 (UTC)