User:Dispenser/Absurd overhead

From Wikimedia Commons, the free media repository
Jump to navigation Jump to search

Do some images seem to have unreasonably huge file size for their low resolution? Did you know you can easily hide Zip files in JPEGs because of SFX support? This program calculates a hypothetical uncompressed equivalent image and flags the source if it is larger. Then it downloads all flagged images, analyzes them by looking for various magic numbers, extracts metadata, and attempts with OptiPNG, pngout, and jpegtran to optimize the image.

We need help identifying what's in these files, deleting malicious contents, and categories files with extra data (e.g. Category:Fireworks PNG files). Columns with percentage indicate the file size reduction with those operations. Using the following commands: BMP: If only image data was kept uncompressed 3*img_width*img_height. Zip: gzip -9. If high, file is "zero padded" Trim: jpegtran -copy all for JPEGs, pngout -ks -s4 for PNGs Opti: jpegtran -copy all -optimize for JPEG, optipng -fix then pngout

Abuse filters: JPEG, PNG, GIF, and large newbie uploads; See also: GIF check, User:Embedded Data Bot, and Dec 2016 discussion

/* Absurd Overhead (approximate)
 * Author: Dispenser
 * License: Public Domain
 * Run time: 40 minutes <SLOW_OK>
 */
SELECT img_name, img_size, img_timestamp, img_user_text, img_sha1,
  IF(img_minor_mime="jpeg", 3, /* Workaround for [[phab:T132986]] */
    IF(img_metadata LIKE '%s:16:"truecolour-alpha"%', 4,
      IF(img_bits<8 OR img_metadata LIKE '%s:14:"index-coloured"%' OR img_metadata LIKE '%s:9:"greyscale"%', 1, 3)
    ) * img_bits / 8
  ) * img_width * img_height + IFNULL(fo_size, 50*1024) AS est_size
FROM image
JOIN page ON page_namespace=6 AND page_title=img_name
LEFT JOIN u2815__.file_overhead ON fo_page=page_id /* magic overhead */
LEFT JOIN categorylinks ON cl_from=page_id AND cl_to IN ("Animated_PNG", "Fireworks_PNG_files", "Picture_It!_files")
WHERE img_width>0 AND img_height>0
AND img_media_type="BITMAP" AND img_major_mime="image"
AND img_minor_mime IN ("jpeg", "png")
AND img_size > 250 * 1024
AND cl_from IS NULL
HAVING img_size > est_size
/* Worst offender by absolute size; works better than percentages */
ORDER BY cast(img_size as signed) - est_size DESC;