User:Multichill/Scaled-down duplicates
People sometimes mistakenly transfer a thumbnail from a Wikipedia to Commons instead of the original image. This page describes a bot that spots and marks these kinds of mistakes.
Process
Find pairs to work on
First we need pairs of images to work on. These pairs can be found in several ways:
- On Commons we have an image and on some Wikipedia we have an image with the same name, but a different hash
- On Commons we have an image with a name of the form <number>px-<name>.<extension>, where an image <name>.<extension> exists on some Wikipedia or on Commons
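The second way of finding pairs relies on recognising the thumbnail naming pattern. A minimal sketch of that check follows; the function name `split_thumbnail_name` and the regular expression are illustrative assumptions, not taken from the bot itself.

```python
import re

# Matches thumbnail-style names such as "120px-Example.jpg":
# digits, the literal "px-", a name, a dot and an extension.
THUMB_NAME = re.compile(r"^(?P<width>\d+)px-(?P<name>.+)\.(?P<ext>[^.]+)$")


def split_thumbnail_name(filename):
    """Return (width, candidate_original_name), or None if the name
    does not follow the <number>px-<name>.<extension> pattern."""
    match = THUMB_NAME.match(filename)
    if match is None:
        return None
    original = "{}.{}".format(match.group("name"), match.group("ext"))
    return int(match.group("width")), original
```

The returned candidate name would then still have to be looked up on Commons or a Wikipedia to confirm that a file of that name exists.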
We should probably divide the work:
- Batch runs to find old duplicates
- A daily run to find yesterday's duplicates
Match duplicates
We work on pairs of images to perform the matching.
Size
One of the images should be smaller in size. This is the image that may be marked in the end.
Aspect ratio
The images should have about the same aspect ratio. For example, with a 20% margin: 80% < (height of image A / width of image A) / (height of image B / width of image B) * 100% < 120%
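The aspect ratio check can be sketched as a small function. The function name and the `margin` parameter are assumptions made for illustration; the 20% default is only the example value used in the text.

```python
def aspect_ratios_match(width_a, height_a, width_b, height_b, margin=0.20):
    """Return True if the two aspect ratios differ by less than `margin`.

    The ratio of ratios is compared against (1 - margin, 1 + margin),
    which is the 80% .. 120% window for the default margin.
    """
    ratio_a = height_a / width_a
    ratio_b = height_b / width_b
    relative = ratio_a / ratio_b
    return (1 - margin) < relative < (1 + margin)
```

For instance, an 800x600 image and a 400x300 image pass the check, while an 800x600 image and a 400x400 image do not.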
Histogram
Histograms are the core of the matching. First the larger image has to be scaled down to the same size as the other image. It's probably best to make several histograms:
- Whole images
- Top left part of the images
- Top right part of the images
- Bottom left part of the images
- Bottom right part of the images
- Central part of the images
The histograms of the two images will match to a certain percentage. If this percentage is above a certain threshold, we have a match.
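The following is a minimal sketch of the regional histogram comparison, not the bot's actual implementation. It assumes both images have already been scaled to the same size, are at least 2x2 pixels, and are given as lists of rows of grayscale values (0-255). The bin count, the use of histogram intersection as the match measure, the choice of the middle half as the "central part", and the 0.9 threshold are all illustrative assumptions.

```python
def histogram(pixels, bins=16):
    """Normalised histogram of grayscale values in the range 0-255."""
    counts = [0] * bins
    for value in pixels:
        counts[value * bins // 256] += 1
    total = len(pixels)
    return [count / total for count in counts]


def region(image, top, bottom, left, right):
    """Flatten the pixels of a rectangular part of the image."""
    return [value for row in image[top:bottom] for value in row[left:right]]


def regions(image):
    """Whole image, four quadrants and the central part."""
    height, width = len(image), len(image[0])
    mid_y, mid_x = height // 2, width // 2
    q_y, q_x = height // 4, width // 4
    return {
        "whole": region(image, 0, height, 0, width),
        "top_left": region(image, 0, mid_y, 0, mid_x),
        "top_right": region(image, 0, mid_y, mid_x, width),
        "bottom_left": region(image, mid_y, height, 0, mid_x),
        "bottom_right": region(image, mid_y, height, mid_x, width),
        "centre": region(image, q_y, height - q_y, q_x, width - q_x),
    }


def histogram_overlap(hist_a, hist_b):
    """Histogram intersection: 1.0 for identical, 0.0 for disjoint."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


def match_scores(image_a, image_b, bins=16):
    """Overlap score per region for two equally sized images."""
    regions_a, regions_b = regions(image_a), regions(image_b)
    return {
        name: histogram_overlap(histogram(regions_a[name], bins),
                                histogram(regions_b[name], bins))
        for name in regions_a
    }


def is_match(image_a, image_b, threshold=0.9):
    """Declare a match only if every region reaches the threshold."""
    return min(match_scores(image_a, image_b).values()) >= threshold
```

Requiring every region to pass is one possible design choice: two images with similar overall colour distribution but different layout would then fail on the quadrant histograms. Averaging the regional scores would be a more lenient alternative.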
Mark duplicates
The lower-quality image of the match should be marked with a template containing:
- The location of the higher quality image
- The size of this image and the other image
- The height of this image and the other image
- The width of this image and the other image
- Maybe aspect ratio
- The results of the histogram calculations
- The match percentage
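As an illustration of what such a marker could look like, the sketch below assembles the listed fields into wikitext. The template name `Scaled-down duplicate` and all parameter names are hypothetical; the page does not name the actual template or its parameters.

```python
def build_duplicate_template(original_location, this_image, other_image,
                             histogram_scores, match_percentage):
    """Return wikitext for a hypothetical marker template.

    `this_image` and `other_image` are dicts with the keys "size",
    "height" and "width"; `histogram_scores` maps a region name to a
    score between 0 and 1.
    """
    fields = [
        ("original", original_location),
        ("size", this_image["size"]),
        ("original_size", other_image["size"]),
        ("height", this_image["height"]),
        ("original_height", other_image["height"]),
        ("width", this_image["width"]),
        ("original_width", other_image["width"]),
    ]
    # One parameter per histogram result, in a stable order.
    for name, score in sorted(histogram_scores.items()):
        fields.append(("histogram_" + name, "{:.1f}".format(score * 100)))
    fields.append(("match", "{:.1f}".format(match_percentage)))

    lines = ["{{Scaled-down duplicate"]  # hypothetical template name
    lines.extend("|{}={}".format(key, value) for key, value in fields)
    lines.append("}}")
    return "\n".join(lines)
```

The optional aspect ratio is left out here; it could be added as one more field in the same way.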
Implementation
The first implementation is available in the pywikipedia package and is called match_images.py (source).