Check 1000s of Images for duplicates quickly

I recently ran into a problem: I had over 4000 images and needed to find and remove any duplicates. The images were all different sizes and resolutions, and two duplicates could show the same picture at different sizes or resolutions.

I might also add that I’m not exactly ‘Dr Maths’, nor do I have any clue about image transformations - but I needed to get rid of the dupes in a hurry.

The method I came up with in the end was pretty simple, and while not foolproof, I haven’t seen it make a mistake yet. It was also quick enough for my needs, removing all the duplicates in around 20 minutes.

The Screenshot

If you want to compare 2 directories, then fill out both directory boxes. Any duplicates you choose to delete will be removed from directory 2. If all the files are in one directory, then leave the directory 2 textbox empty.

The Downloads

CompareImages.exe - You can run this - Full code

The Solution:

1. Build 50×50 copies of all images

Under the root directory of the images I add another directory called ’small’. I then make 50×50 copies of all the images and put them in this directory. I do this for 2 reasons.

  • It normalises the size of all the images - I couldn’t compare 200×200 to 300×300 images before, but now they are both 50×50.
  • Using pixel-by-pixel comparison, 50×50 needs far fewer comparisons than 300×300 (2,500 vs 90,000) and is still accurate enough to detect duplicates.
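The resizing step could be sketched roughly as follows. This is not the code from the download, just a minimal illustration using System.Drawing; the module name, the BuildSmallCopies helper, and the choice to save the copies as PNG are all assumptions made for the example.

```vbnet
Imports System
Imports System.Drawing
Imports System.Drawing.Imaging
Imports System.IO

Module ThumbnailSketch
    'Hypothetical helper: writes a 50×50 copy of every image in sRootDir
    'into a "small" subdirectory underneath it.
    Sub BuildSmallCopies(sRootDir As String)
        Dim sSmallDir As String = Path.Combine(sRootDir, "small")
        Directory.CreateDirectory(sSmallDir)

        For Each sFile As String In Directory.GetFiles(sRootDir)
            Try
                Using bmOrig As New Bitmap(sFile)
                    'This Bitmap constructor scales the source to the given size,
                    'ignoring aspect ratio, so every copy is exactly 50×50.
                    Using bmSmall As New Bitmap(bmOrig, 50, 50)
                        'Assumption: save as PNG so the copy adds no compression noise
                        Dim sTarget As String = Path.Combine(sSmallDir, Path.GetFileName(sFile) & ".png")
                        bmSmall.Save(sTarget, ImageFormat.Png)
                    End Using
                End Using
            Catch ex As Exception
                'Assumption: files that cannot be loaded as images are simply skipped
            End Try
        Next
    End Sub
End Module
```

Because every copy is stretched to the same 50×50 square, two versions of the same picture at different resolutions end up with pixels in matching positions, which is what makes the simple pixel-by-pixel comparison possible.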

2. Compare the 50×50 images with each other

Here is where the people that know something about images will most likely shake their heads (if they haven’t started already). For the rest:

For each pair of images, I compare every pixel to see if the two have ‘almost the same’ RGB value, meaning each colour channel is within 20 of the other. If the RGB values are not similar, I increment a difference count. Once the number of differences exceeds 50, the images are not the same and I stop comparing (which saves a lot of time).

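'Needs Imports System.Drawing. The function name is illustrative.
Function AreDuplicates(bmOrig As Bitmap, bmNew As Bitmap) As Boolean
    Dim iDiff As Integer = 20        'Max per-channel difference still counted as "almost the same"
    Dim iDiffCount As Integer = 0
    Dim iWidth As Integer = bmOrig.Width    'Both bitmaps are the 50×50 copies
    Dim iHeight As Integer = bmOrig.Height

    For iY As Integer = 0 To iHeight - 1
        For iX As Integer = 0 To iWidth - 1
            Dim oOrigPixel As Color = bmOrig.GetPixel(iX, iY)
            Dim oNewPixel As Color = bmNew.GetPixel(iX, iY)

            'CInt avoids Byte overflow when the subtraction goes negative
            If Math.Abs(CInt(oOrigPixel.R) - CInt(oNewPixel.R)) > iDiff OrElse _
               Math.Abs(CInt(oOrigPixel.G) - CInt(oNewPixel.G)) > iDiff OrElse _
               Math.Abs(CInt(oOrigPixel.B) - CInt(oNewPixel.B)) > iDiff Then
                iDiffCount = iDiffCount + 1
                If iDiffCount > 50 Then
                    'Images are not duplicates
                    Return False
                End If
            End If
        Next
    Next

    'Images ARE duplicates
    Return True
End Function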
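To compare the 50×50 images "with each other", something has to walk over every pair. The sketch below is one way that outer loop might look; it is an assumption about the structure, not the code from the download. FindDuplicatePairs and its fnIsDuplicate parameter are hypothetical names, with fnIsDuplicate standing in for whatever loads two thumbnails and runs the pixel comparison on them.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

Module PairwiseSketch
    'Hypothetical driver: returns every pair of thumbnail paths that the
    'supplied comparison reports as duplicates.
    Function FindDuplicatePairs(sSmallDir As String, _
                                fnIsDuplicate As Func(Of String, String, Boolean)) _
                                As List(Of Tuple(Of String, String))
        Dim aFiles As String() = Directory.GetFiles(sSmallDir)
        Dim oPairs As New List(Of Tuple(Of String, String))

        For i As Integer = 0 To aFiles.Length - 2
            'Start at i + 1 so each pair is checked once and never against itself
            For j As Integer = i + 1 To aFiles.Length - 1
                If fnIsDuplicate(aFiles(i), aFiles(j)) Then
                    oPairs.Add(Tuple.Create(aFiles(i), aFiles(j)))
                End If
            Next
        Next

        Return oPairs
    End Function
End Module
```

With n images this makes n(n-1)/2 comparisons, so 4000 images means 7,998,000 pairs. That is why stopping as soon as the difference count passes 50 matters so much: most pairs are not duplicates and get rejected after only a few dozen pixels.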

Need more traffic? Blog MatchUp is coming!

I’ve been a bit quiet lately, but that’s just because I’ve been working on my latest project - and it’s almost ready.

Blog MatchUp allows you to compete with other blogs for bragging rights over who has the most traffic. But as everyone saw from the ShoeMoney vs John Chow competition, everyone is a winner with more traffic.

At the moment Blog MatchUp is taking registrations, but should be ready to launch beta testing within a week.