Improving the speed of finding duplicates
Posted: 2012-10-18, 10:40 UTC
by Hacker
Hi Christian,
When searching for duplicates (same size and content), I assume the search works like this:
1. Create a file list
2. Check which files have the same size
3. Calculate their checksums and compare.
I suggest adding another step (sketched below):
1. Create a file list
2. Check which files have the same size
3. Read a short (1kB? 32kB?) block from the middle of the file and compare it, or its checksum, to those of the other files with the same size.
4. Calculate their checksums and compare.
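A rough sketch of the proposed pipeline in Python follows; the helper names, the 32 kB probe size and the use of MD5 are only illustrative assumptions for this post, not Total Commander's actual implementation.
Code:
import hashlib
import os
from collections import defaultdict

PROBE_SIZE = 32 * 1024  # short block read from the middle of each file (assumed size)


def middle_block(path, size):
    # Step 3: read a short block from the middle of the file.
    with open(path, 'rb') as f:
        f.seek(max(0, size // 2 - PROBE_SIZE // 2))
        return f.read(PROBE_SIZE)


def full_checksum(path):
    # Step 4: checksum of the whole file, read in 1 MB chunks.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            md5.update(chunk)
    return md5.hexdigest()


def find_duplicates(paths):
    # Steps 1-2: list the files and group them by size.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        # Step 3: a cheap middle-block probe splits the group further,
        # so most non-duplicates never get a full checksum.
        by_probe = defaultdict(list)
        for path in group:
            by_probe[middle_block(path, size)].append(path)
        for candidates in by_probe.values():
            if len(candidates) < 2:
                continue
            # Step 4: full checksums only for the remaining candidates.
            by_sum = defaultdict(list)
            for path in candidates:
                by_sum[full_checksum(path)].append(path)
            duplicates.extend(g for g in by_sum.values() if len(g) > 1)
    return duplicates

The probe only pays off when same-sized files actually differ; truly identical files still end up fully checksummed.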
What do you think?
Roman
Posted: 2012-10-18, 11:03 UTC
by ts4242
Something like this was requested before but unfortunately there was no response from Ghisler.
Posted: 2012-10-18, 11:52 UTC
by Hacker
Hello ts4242,
Your suggestion concerns something quite different, though useful too.
Roman
Posted: 2012-10-18, 12:06 UTC
by ts4242
Hacker wrote: Your suggestion concerns something quite different
But it's the same concept, that is, using a checksum instead of a direct compare by contents.
Posted: 2012-10-18, 13:05 UTC
by MVV
ts4242 wrote: But it's the same concept, that is, using a checksum instead of a direct compare by contents.
As I see in the first post, Hacker suggests the opposite: do not use checksums while comparing.

Posted: 2012-10-18, 13:17 UTC
by ghisler(Author)
TC only uses checksums when there are more than 2 files with the same size. Otherwise it compares the two directly.
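A minimal Python sketch of that branching, assuming the candidates are already grouped by size; the helpers shown (filecmp for the direct compare, MD5 for the checksum) are illustrative choices, not TC's internals.
Code:
import hashlib
from collections import defaultdict
from filecmp import cmp as files_equal  # byte comparison when shallow=False


def full_checksum(path):
    # Checksum of the whole file, read in 1 MB chunks.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            md5.update(chunk)
    return md5.hexdigest()


def compare_group(group):
    # 'group' holds files that already have the same size.
    if len(group) == 2:
        # Exactly two candidates: compare them directly, so the read
        # can stop as soon as a differing block is found.
        return [group] if files_equal(group[0], group[1], shallow=False) else []
    # More than two candidates: checksum each file once and report
    # the files that share a checksum as duplicates.
    by_sum = defaultdict(list)
    for path in group:
        by_sum[full_checksum(path)].append(path)
    return [g for g in by_sum.values() if len(g) > 1]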
Posted: 2012-10-18, 15:04 UTC
by ts4242
ghisler(Author) wrote: TC only uses checksums when there are more than 2 files with the same size. Otherwise it compares the two directly.

This is not new to me, but it confirms the conclusion which I posted in this topic:
http://ghisler.ch/board/viewtopic.php?t=24390
Edit:
I'm pretty sure that the suggested algorithm is somehow already used in the Find Duplicate Files search. To show that, I did the following test:
1- create folder F:\test
2- create 5 subfolders
3- copy a file of size 1.2 GB to each subfolder
4- search for duplicate files in F:\test (same size, same contents are checked)
5- TC takes about 2 minutes to list the five files
How can that be? The answer I imagine is that TC doesn't compare the files by contents but instead creates an MD5 checksum for each file and considers files with identical checksums to be duplicates.
Calculating the MD5 of a 1.2 GB file takes about 25 seconds, so 5*25 = 125 seconds (~2 minutes).
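A back-of-the-envelope check of that estimate; the 25-second figure is the one quoted above, not a new measurement.
Code:
# Numbers quoted in the post above, for one 1.2 GB file per subfolder.
seconds_per_md5 = 25   # time to hash a single 1.2 GB file
file_count = 5
print(file_count * seconds_per_md5, "seconds")  # 125 seconds, roughly 2 minutes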
2Ghisler
Am I right?
Is it possible to use the same technique in Compare By Contents and the Synchronize directories tool?
MVV wrote: As I see in the first post, Hacker suggests the opposite: do not use checksums while comparing.

Are you sure? Please read it again carefully; he said "4. Calculate their checksums and compare." "Compare" here means comparing the files' checksums, not the files' contents.
Posted: 2012-10-18, 15:09 UTC
by MVV
ts4242 wrote: Are you sure?
I mean that the topic starter concentrates our attention on comparing small data blocks (the suggested 3rd step) and not on comparing checksums (step 4, shown only for clarity).
Posted: 2012-10-18, 15:42 UTC
by ts4242
@MVV
Again, are you sure?
3. Read a short (1kB? 32kB?) block from the middle of the file and compare it, or its checksum, to those of the other files with the same size.
Posted: 2012-10-18, 18:03 UTC
by MVV
Yeah, sure. "Compare a small block (or maybe its checksum) and not the whole file, as an intermediate operation" != "Compare checksums of whole files".

Posted: 2012-10-18, 21:21 UTC
by Hacker
Christian,
ghisler(Author) wrote: TC only uses checksums when there are more than 2 files with the same size. Otherwise it compares the two directly.
OK, in that case my suggestion holds for the case when there are more than two files of the same size.
Roman