
Check 1000s of Images for duplicates quickly

I came up against a problem recently where I had over 4000 images and needed to find and remove any duplicates. The images were all different sizes/resolutions, and a duplicate could be the same picture saved at a different size or resolution, so comparing the files byte for byte was not an option.

I might also add that I’m not exactly ‘Dr Maths’, nor do I have any clue about image transformations, but I needed to get rid of the dupes in a hurry.

The method I came up with in the end was pretty simple, and while not foolproof, I haven’t seen it make a mistake yet. It was also quick enough for my needs, removing all the duplicates in around 20 minutes.

The Screenshot

If you want to compare 2 directories, then fill out both directory boxes. Any duplicates you choose to delete will be removed from directory 2. If all the files are in one directory, then leave the directory 2 textbox empty.

The Downloads

CompareImages.exe - You can run this - Full code

The Solution:

1. Build 50×50 copies of all images

Under the root directory of the images I add another directory called ‘small’. I then make 50×50 copies of all the images and put them in this directory. I do this for two reasons.

  • It normalises the size of all the images. I couldn’t compare a 200×200 image to a 300×300 one pixel by pixel, but now they are both 50×50.
  • With pixel-by-pixel comparison, 50×50 needs far fewer comparisons than 300×300 (2,500 vs 90,000) and is still accurate enough to detect duplicates.
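
The thumbnail step could be sketched roughly as follows. This is only an illustration, not the code from the download: the helper name BuildThumbnails, the list of file extensions and the choice to save the copies as PNG are all assumptions. It relies on the System.Drawing Bitmap constructor that takes an image and a size, which scales the source to that size.

```vbnet
Imports System.Drawing
Imports System.Drawing.Imaging
Imports System.IO

Module ThumbnailSketch
    ' Hypothetical helper: writes a 50x50 copy of every image found directly
    ' under sRoot into sRoot\small. Extension filter and PNG output are assumptions.
    Sub BuildThumbnails(sRoot As String)
        Dim sSmall As String = Path.Combine(sRoot, "small")
        Directory.CreateDirectory(sSmall)

        For Each sFile As String In Directory.GetFiles(sRoot)
            Dim sExt As String = Path.GetExtension(sFile).ToLowerInvariant()
            If sExt <> ".jpg" AndAlso sExt <> ".png" AndAlso _
               sExt <> ".gif" AndAlso sExt <> ".bmp" Then Continue For

            Using bmOrig As New Bitmap(sFile)
                ' Scales to exactly 50x50 and ignores the aspect ratio, so
                ' every thumbnail has the same dimensions.
                Using bmSmall As New Bitmap(bmOrig, New Size(50, 50))
                    ' Keep the full original file name so a.jpg and a.png
                    ' do not overwrite each other.
                    Dim sOut As String = Path.Combine(sSmall, Path.GetFileName(sFile) & ".png")
                    bmSmall.Save(sOut, ImageFormat.Png)
                End Using
            End Using
        Next
    End Sub
End Module
```

Squashing everything to a square distorts the pictures, but every copy of the same picture is distorted in the same way, so the thumbnails stay comparable.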

2. Compare the 50×50 images with each other

Here is where the people who know something about images will most likely shake their heads (if they haven’t started already). For the rest:

I compare the two images pixel by pixel to see if they have ‘almost the same’ RGB value, meaning each of the R, G and B channels differs by no more than 20. If any channel differs by more than that, I increment a difference count. Once the number of differences exceeds 50, the images are not the same and I stop comparing (which saves a lot of time).

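' Requires: Imports System.Drawing
' AreDuplicates is a placeholder name; both bitmaps must be the same size.
Function AreDuplicates(ByVal bmOrig As Bitmap, ByVal bmNew As Bitmap) As Boolean
    Dim iDiff As Integer = 20        ' allowed difference per colour channel
    Dim iDiffCount As Integer = 0    ' pixels that differ by more than iDiff
    Dim iWidth As Integer = bmOrig.Width
    Dim iHeight As Integer = bmOrig.Height

    For iY As Integer = 0 To iHeight - 1
        For iX As Integer = 0 To iWidth - 1
            Dim oOrigPixel As Color = bmOrig.GetPixel(iX, iY)
            Dim oNewPixel As Color = bmNew.GetPixel(iX, iY)

            ' Convert to Integer first: subtracting Bytes directly can overflow.
            If Math.Abs(CInt(oOrigPixel.R) - CInt(oNewPixel.R)) > iDiff OrElse _
               Math.Abs(CInt(oOrigPixel.G) - CInt(oNewPixel.G)) > iDiff OrElse _
               Math.Abs(CInt(oOrigPixel.B) - CInt(oNewPixel.B)) > iDiff Then
                iDiffCount = iDiffCount + 1
                If iDiffCount > 50 Then
                    'Images are not duplicates
                    Return False
                End If
            End If
        Next
    Next

    'Images ARE duplicates
    Return True
End Function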
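
To compare the thumbnails "with each other", every pair has to be checked once. A minimal sketch of that outer loop is below. The helper name FindDuplicatePairs and the fnSame parameter are hypothetical: fnSame is assumed to wrap the pixel loop above and return True for duplicates, and the directory is assumed to contain nothing but the thumbnails.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Drawing
Imports System.IO

Module PairwiseSketch
    ' Hypothetical helper: returns {pathA, pathB} for every pair of thumbnails
    ' that fnSame reports as duplicates.
    Function FindDuplicatePairs(sSmallDir As String, fnSame As Func(Of Bitmap, Bitmap, Boolean)) As List(Of String())
        Dim oPairs As New List(Of String())
        Dim sFiles As String() = Directory.GetFiles(sSmallDir)
        Dim oThumbs As New List(Of Bitmap)

        Try
            ' Load each thumbnail once; they are small enough to keep in memory.
            For Each sFile As String In sFiles
                oThumbs.Add(New Bitmap(sFile))
            Next

            For i As Integer = 0 To oThumbs.Count - 2
                ' Start at i + 1 so each pair is compared only once.
                For j As Integer = i + 1 To oThumbs.Count - 1
                    If fnSame(oThumbs(i), oThumbs(j)) Then
                        oPairs.Add(New String() {sFiles(i), sFiles(j)})
                    End If
                Next
            Next
        Finally
            ' Release the GDI+ handles even if loading or comparing fails.
            For Each bm As Bitmap In oThumbs
                bm.Dispose()
            Next
        End Try

        Return oPairs
    End Function
End Module
```

With n images this makes n(n-1)/2 comparisons, which is about 8 million for 4,000 images, so stopping as soon as a pair is clearly different matters a great deal.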



  1. Jason Nov 29

    Hey Dr Maths, I know nothing about images but I’m still shaking my head. Nice hack all the same. Jason.

  2. Shannon Dec 7

    How accurate is this? I’m shaking my head at the algorithm just because I can see several ways this could go wrong, but if it’s accurate enough, then it’s probably a great solution!

  3. Gath Dec 11

    Hey Jas - good to see you last night in Melbourne!

    Shannon - It is definitely *possible* to miss duplicates, but in practice I haven’t missed any.