Check 1000s of Images for duplicates quickly

I recently had over 4000 images and needed to find and remove any duplicates. The images came in all sizes and resolutions, so two files could show the same picture at different sizes or resolutions.

I might also add that I’m not exactly ‘Dr Maths’, nor do I have any clue about image transformations - but I needed to get rid of the dupes in a hurry.

The method I came up with in the end was pretty simple, and while not foolproof, I haven’t seen it make a mistake yet. It was also quick enough for my needs, removing all the duplicates in around 20 minutes.

The Screenshot

If you want to compare 2 directories, then fill out both directory boxes. Any duplicates you choose to delete will be removed from directory 2. If all the files are in one directory, then leave the directory 2 textbox empty.

The Downloads

CompareImages.exe - You can run this
CompareImages.zip - Full code

The Solution:

1. Build 50×50 copies of all images

Under the root directory of the images I add another directory called ‘small’. I then make 50×50 copies of all the images and put them in this directory. I do this for two reasons:

  • It normalises the size of all the images - I couldn’t compare 200×200 to 300×300 images before, but now they are both 50×50.
  • With pixel-by-pixel comparison, 50×50 means far fewer comparisons than 300×300 (2,500 vs 90,000) and is still accurate enough to detect duplicates.
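
The resizing step could look roughly like this. It is a minimal sketch, not the code from CompareImages.zip: the BuildSmallCopies name, the *.jpg filter and saving the copies as PNG are assumptions made for illustration, and it assumes System.Drawing is available (the .NET Framework on Windows).

```vbnet
Imports System.Drawing
Imports System.Drawing.Imaging
Imports System.IO

Module ThumbnailSketch
    ' Hypothetical helper: writes a 50×50 copy of every .jpg directly under
    ' rootDir into rootDir\small. The name, filter and output format are assumptions.
    Sub BuildSmallCopies(rootDir As String)
        Dim smallDir As String = Path.Combine(rootDir, "small")
        Directory.CreateDirectory(smallDir)

        ' GetFiles only looks at the top level, so the 'small' directory is skipped
        For Each sFile As String In Directory.GetFiles(rootDir, "*.jpg")
            Using bmOrig As New Bitmap(sFile)
                ' Bitmap(Image, width, height) scales the source to the given size,
                ' ignoring aspect ratio, so every copy is exactly 50×50
                Using bmSmall As New Bitmap(bmOrig, 50, 50)
                    Dim sName As String = Path.GetFileNameWithoutExtension(sFile) & ".png"
                    bmSmall.Save(Path.Combine(smallDir, sName), ImageFormat.Png)
                End Using
            End Using
        Next
    End Sub
End Module
```

Saving the copies in a lossless format such as PNG avoids adding a second round of JPEG artefacts before the comparison, although the tolerance used in step 2 would probably absorb those anyway.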

2. Compare the 50×50 images with each other

Here is where the people who know something about images will most likely shake their heads (if they haven’t started already). For the rest:

For each pair of images, I compare every pixel to see whether the two have ‘almost the same’ RGB value, meaning that each of the R, G and B channels differs by no more than 20. If the values are not similar, I increment a difference count. Once the count exceeds 50, the images are not the same and I stop comparing (which saves a lot of time).


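iDiff = 20        'Largest per-channel difference still treated as almost the same
iDiffCount = 0    'Number of differing pixels found so far

'bmOrig and bmNew are the two 50×50 copies; iWidth and iHeight are their dimensions
For iY = 0 To iHeight - 1
    For iX = 0 To iWidth - 1
        oOrigPixel = bmOrig.GetPixel(iX, iY)
        oNewPixel = bmNew.GetPixel(iX, iY)

        'CInt widens the Byte channel values so the subtraction cannot overflow
        If Math.Abs(CInt(oOrigPixel.R) - CInt(oNewPixel.R)) > iDiff OrElse _
           Math.Abs(CInt(oOrigPixel.G) - CInt(oNewPixel.G)) > iDiff OrElse _
           Math.Abs(CInt(oOrigPixel.B) - CInt(oNewPixel.B)) > iDiff Then
            iDiffCount = iDiffCount + 1
            If iDiffCount > 50 Then
                'Images are not duplicates
                Exit Function
            End If
        End If
    Next
Next

'Images ARE duplicates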
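
To cover a whole directory, the check above has to run on every pair of small images. The sketch below shows one way to arrange that. The FindDuplicates name and the isDuplicate delegate are hypothetical, standing in for whatever function wraps the pixel loop, and this is not necessarily how CompareImages.exe organises it.

```vbnet
Imports System
Imports System.Collections.Generic

Module PairwiseSketch
    ' Hypothetical helper: runs isDuplicate over every unordered pair of files
    ' and returns the later file of each matching pair.
    Function FindDuplicates(files As IList(Of String),
                            isDuplicate As Func(Of String, String, Boolean)) As HashSet(Of String)
        Dim dupes As New HashSet(Of String)

        For i As Integer = 0 To files.Count - 1
            ' A file already flagged as a duplicate needs no further comparisons
            If dupes.Contains(files(i)) Then Continue For

            ' Start at i + 1 so each pair is compared only once
            For j As Integer = i + 1 To files.Count - 1
                If dupes.Contains(files(j)) Then Continue For
                If isDuplicate(files(i), files(j)) Then dupes.Add(files(j))
            Next
        Next

        Return dupes
    End Function
End Module
```

With 4000 images there are 4000 × 3999 / 2 = 7,998,000 pairs in the worst case, which is why stopping each comparison as soon as the difference count passes 50 matters so much.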



3 comments

  1. Jason Nov 29

    Hey Dr Maths, I know nothing about images but I’m still shaking my head. Nice hack all the same. Jason.

  2. Shannon Dec 7

    How accurate is this? I’m shaking my head at the algorithm just because I can see several ways this could go wrong, but if it’s accurate enough, then it’s probably a great solution!

  3. Gath Dec 11

    Hey Jas - good to see you last night in Melbourne!

    Shannon - It is definitely *possible* to miss duplicates, but in practice I haven’t missed any.