Improving the speed of finding duplicates
Hi Christian,
When searching for duplicates (same size and content) I assume the search works like this:
1. Create a file list
2. Check which files have the same size
3. Calculate their checksums and compare.
I suggest adding another step:
1. Create a file list
2. Check which files have the same size
3. Read a short (1 kB? 32 kB?) block from the middle of the file and compare it, or its checksum, to those of the other files of the same size (see the sketch after this list).
4. Calculate their checksums and compare.
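A minimal sketch of what the proposed pipeline could look like (the helper names, the 32 kB block size, and the MD5 choice are my illustrations for this post, not anything TC actually does):

```python
import hashlib
import os
from collections import defaultdict

BLOCK_SIZE = 32 * 1024  # the suggested "short block"; anywhere in 1-32 kB

def middle_block(path, size):
    """Step 3: read a short block from the middle of the file."""
    with open(path, 'rb') as f:
        f.seek(max(0, size // 2 - BLOCK_SIZE // 2))
        return f.read(BLOCK_SIZE)

def full_checksum(path):
    """Step 4: checksum of the whole file (MD5 chosen arbitrarily here)."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(paths):
    # Steps 1+2: group the file list by size.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    duplicates = []
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        # Step 3: cheap prefilter - only files whose middle block
        # matches can still be duplicates of each other.
        by_block = defaultdict(list)
        for p in group:
            by_block[middle_block(p, size)].append(p)
        # Step 4: full checksums only for files that survived step 3.
        for candidates in by_block.values():
            if len(candidates) < 2:
                continue
            by_sum = defaultdict(list)
            for p in candidates:
                by_sum[full_checksum(p)].append(p)
            duplicates += [g for g in by_sum.values() if len(g) > 1]
    return duplicates
```

The point of step 3 is that same-size files which differ are usually rejected after one short read each, so the expensive full-file checksums in step 4 are only computed for files that are very likely duplicates.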
What do you think?
Roman
Just suppose you press Ctrl+F and select the FTP connection (with the stored password), but instead of clicking Connect, you drop dead.
Something like this was requested before, but unfortunately there was no response from Ghisler.
- ghisler(Author)
- Site Admin
TC only uses checksums when there are more than 2 files with the same size. Otherwise it compares the two directly.
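In other words, per group of same-size files the dispatch could look roughly like this (a sketch of the described behaviour, not TC's actual code; whether TC really uses MD5 is an assumption discussed further down):

```python
import hashlib

def files_identical(path_a, path_b, chunk=1024 * 1024):
    """Direct byte-by-byte comparison of two files already known
    to have the same size."""
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            a, b = fa.read(chunk), fb.read(chunk)
            if a != b:
                return False
            if not a:  # both files exhausted at the same point
                return True

def md5sum(path, chunk=1024 * 1024):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk), b''):
            h.update(block)
    return h.hexdigest()

def compare_same_size_group(group):
    """Exactly two files: compare them directly.
    Three or more: group them by full-file checksum."""
    if len(group) == 2:
        return [group] if files_identical(*group) else []
    sums = {}
    for p in group:
        sums.setdefault(md5sum(p), []).append(p)
    return [g for g in sums.values() if len(g) > 1]
```

The direct comparison wins for a pair because it can abort at the first differing byte, whereas checksums always read both files to the end; with three or more files, checksums avoid comparing every pair.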
Author of Total Commander
https://www.ghisler.com
ghisler(Author) wrote: TC only uses checksums when there are more than 2 files with the same size. Otherwise it compares the two directly.

My question is still: "Is it possible to use the same technique in the Compare By Contents and Synchronize directories tools?"

Edit:
I'm pretty sure the suggested algorithm is somehow already used in Find Duplicate Files. To verify that, I did the following test:
1. Create a folder F:\test
2. Create 5 subfolders
3. Copy a file of size 1.2 GB into each subfolder
4. Search for duplicate files in F:\test ("same size" and "same contents" checked)
5. TC takes about 2 minutes to list the five files
How come? The answer I imagine is that TC doesn't compare the files by contents but instead creates an MD5 checksum for each file and considers files with identical checksums to be duplicates. Calculating the MD5 of 1.2 GB takes about 25 seconds, so 5 * 25 = 125 seconds (~2 minutes).
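The arithmetic behind that estimate, spelled out (the ~49 MB/s read throughput is derived from the measurement, not stated anywhere):

```python
# Back-of-the-envelope check of the measurement above; all inputs
# come from the post, the throughput is inferred.
file_size_mb = 1.2 * 1024                      # 1.2 GB per file
seconds_per_file = 25                          # observed MD5 time per file
throughput = file_size_mb / seconds_per_file   # ~49 MB/s read speed
total_seconds = 5 * seconds_per_file           # 5 files -> 125 s (~2 min)
print(f"{throughput:.0f} MB/s, {total_seconds} s in total")
```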
2Ghisler
Am I right?
Is it possible to use the same technique in the Compare By Contents and Synchronize directories tools?
MVV wrote: As I see in the first post, Hacker suggests the opposite, i.e. not to use checksums while comparing.

Are you sure? Please read it again carefully. He said "4. Calculate their checksums and compare." "Compare" here means comparing the files' checksums, not the files' contents.
Christian,

ghisler(Author) wrote: TC only uses checksums when there are more than 2 files with the same size. Otherwise it compares the two directly.

OK, in that case my suggestion holds for the case when there are more than two files of the same size.
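So the refined proposal would only insert the middle-block test on the checksum path, e.g. (a sketch reusing the hypothetical middle_block, files_identical and compare_same_size_group helpers from the earlier examples):

```python
def compare_group(group, size):
    """Handle one group of same-size files."""
    # Two files: TC already compares them directly, no change needed.
    if len(group) == 2:
        return [group] if files_identical(*group) else []
    # Three or more: run the cheap middle-block test (step 3) first,
    # so full checksums (step 4) are only computed for files that
    # still look like duplicates.
    by_block = {}
    for p in group:
        by_block.setdefault(middle_block(p, size), []).append(p)
    result = []
    for candidates in by_block.values():
        if len(candidates) > 1:
            result += compare_same_size_group(candidates)
    return result
```

For a two-file group this is exactly the current behaviour; for larger groups the extra cost is one short read per file, which is cheap compared to even a single full-file checksum.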
Roman
Just suppose you press Ctrl+F and select the FTP connection (with the stored password), but instead of clicking Connect, you drop dead.