Improving the speed of finding duplicates
Posted: 2012-10-18, 10:40 UTC
by Hacker
Hi Christian,
When searching for duplicates (same size and content), I assume the search works like this:
1. Create a file list
2. Check which files have the same size
3. Calculate their checksums and compare.
I suggest adding another step (sketched below):
1. Create a file list
2. Check which files have the same size
3. Read a short (1kB? 32kB?) block from the middle of the file and compare it, or its checksum, to those of the other files with the same size.
4. Calculate their checksums and compare.
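A rough sketch of the proposed pipeline in Python follows; the helper names, the 32 kB probe size and the use of MD5 are only illustrative assumptions for this post, not Total Commander's actual implementation.
Code:
import hashlib
import os
from collections import defaultdict

PROBE_SIZE = 32 * 1024  # short block read from the middle of each file (assumed size)


def middle_block(path, size):
    # Step 3: read a short block from the middle of the file.
    with open(path, 'rb') as f:
        f.seek(max(0, size // 2 - PROBE_SIZE // 2))
        return f.read(PROBE_SIZE)


def full_checksum(path):
    # Step 4: checksum of the whole file, read in 1 MB chunks.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            md5.update(chunk)
    return md5.hexdigest()


def find_duplicates(paths):
    # Steps 1-2: list the files and group them by size.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        # Step 3: a cheap middle-block probe splits the group further,
        # so most non-duplicates never get a full checksum.
        by_probe = defaultdict(list)
        for path in group:
            by_probe[middle_block(path, size)].append(path)
        for candidates in by_probe.values():
            if len(candidates) < 2:
                continue
            # Step 4: full checksums only for the remaining candidates.
            by_sum = defaultdict(list)
            for path in candidates:
                by_sum[full_checksum(path)].append(path)
            duplicates.extend(g for g in by_sum.values() if len(g) > 1)
    return duplicates

The probe only pays off when same-sized files actually differ; truly identical files still end up fully checksummed.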
What do you think?
Roman
Posted: 2012-10-18, 11:03 UTC
by ts4242
Something like this was requested before but unfortunately there was no response from Ghisler.
Posted: 2012-10-18, 11:52 UTC
by Hacker
Hello ts4242,
Your suggestion concerns something quite different, though useful too.
Roman
Posted: 2012-10-18, 12:06 UTC
by ts4242
Hacker wrote: Your suggestion concerns something quite different
But it's the same concept, that is, using a checksum instead of a direct compare by contents.
Posted: 2012-10-18, 13:05 UTC
by MVV
ts4242 wrote: But it's the same concept, that is, using a checksum instead of a direct compare by contents.
As I see in the first post, Hacker suggests the opposite: do not use checksums while comparing.

Posted: 2012-10-18, 13:17 UTC
by ghisler(Author)
TC only uses checksums when there are more than 2 files with the same size. Otherwise it compares the two directly.
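A minimal Python sketch of that branching, assuming the candidates are already grouped by size; the helpers shown (filecmp for the direct compare, MD5 for the checksum) are illustrative choices, not TC's internals.
Code:
import hashlib
from collections import defaultdict
from filecmp import cmp as files_equal  # byte comparison when shallow=False


def full_checksum(path):
    # Checksum of the whole file, read in 1 MB chunks.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            md5.update(chunk)
    return md5.hexdigest()


def compare_group(group):
    # 'group' holds files that already have the same size.
    if len(group) == 2:
        # Exactly two candidates: compare them directly, so the read
        # can stop as soon as a differing block is found.
        return [group] if files_equal(group[0], group[1], shallow=False) else []
    # More than two candidates: checksum each file once and report
    # the files that share a checksum as duplicates.
    by_sum = defaultdict(list)
    for path in group:
        by_sum[full_checksum(path)].append(path)
    return [g for g in by_sum.values() if len(g) > 1]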
Posted: 2012-10-18, 15:04 UTC
by ts4242
ghisler(Author) wrote: TC only uses checksums when there are more than 2 files with the same size. Otherwise it compares the two directly.

This is not new to me, but it confirms the conclusion which I posted in this topic:
http://ghisler.ch/board/viewtopic.php?t=24390
Edit:
I'm pretty sure that the suggested algorithm is somehow already used in the Find Duplicate Files search. To show that, I did the following test:
1- create folder F:\test
2- create 5 subfolders
3- copy a file of size 1.2 GB to each subfolder
4- search for duplicate files in F:\test (same size, same contents are checked)
5- TC takes about 2 minutes to list the five files
How can that be? The answer I imagine is that TC doesn't compare the files by contents but instead creates an MD5 checksum for each file and considers files with identical checksums to be duplicates.
Calculating the MD5 of a 1.2 GB file takes about 25 seconds, so 5*25 = 125 seconds (~2 minutes).
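A back-of-the-envelope check of that estimate; the 25-second figure is the one quoted above, not a new measurement.
Code:
# Numbers quoted in the post above, for one 1.2 GB file per subfolder.
seconds_per_md5 = 25   # time to hash a single 1.2 GB file
file_count = 5
print(file_count * seconds_per_md5, "seconds")  # 125 seconds, roughly 2 minutes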
2Ghisler
Am I right?
Is it possible to use the same technique in Compare By Contents and the Synchronize directories tool?
MVV wrote: As I see in the first post, Hacker suggests the opposite: do not use checksums while comparing.

Are you sure? Please read it again carefully; he said "4. Calculate their checksums and compare." "Compare" here means comparing the files' checksums, not the files' contents.
Posted: 2012-10-18, 15:09 UTC
by MVV
ts4242 wrote: Are you sure?
I mean that the topic starter concentrates our attention on comparing small data blocks (the suggested 3rd step) and not on comparing checksums (step 4, shown only for clarity).
Posted: 2012-10-18, 15:42 UTC
by ts4242
@MVV
Again, are you sure?
3. Read a short (1kB? 32kB?) block from the middle of the file and compare it, or its checksum, to those of the other files with the same size.
Posted: 2012-10-18, 18:03 UTC
by MVV
Yeah, sure. "Compare a small block (or maybe its checksum) and not the whole file, as an intermediate operation" != "Compare checksums of whole files".

Posted: 2012-10-18, 21:21 UTC
by Hacker
Christian,
ghisler(Author) wrote: TC only uses checksums when there are more than 2 files with the same size. Otherwise it compares the two directly.
OK, in that case my suggestion holds for the case when there are more than two files of the same size.
Roman