Improving the speed of finding duplicates
Hi Christian,
When searching for duplicates (same size and content) I assume the search works like this:
1. Create a file list
2. Check which files have the same size
3. Calculate their checksums and compare.
I suggest adding another step:
1. Create a file list
2. Check which files have the same size
3. Read a short (1 kB? 32 kB?) block from the middle of the file and compare it, or its checksum, to those of the other files of the same size (see the sketch after this list).
4. Calculate their checksums and compare.
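A minimal sketch of what the proposed pipeline could look like (the helper names, the 32 kB block size, and the MD5 choice are my illustrations for this post, not anything TC actually does):

```python
import hashlib
import os
from collections import defaultdict

BLOCK_SIZE = 32 * 1024  # the suggested "short block"; anywhere in 1-32 kB

def middle_block(path, size):
    """Step 3: read a short block from the middle of the file."""
    with open(path, 'rb') as f:
        f.seek(max(0, size // 2 - BLOCK_SIZE // 2))
        return f.read(BLOCK_SIZE)

def full_checksum(path):
    """Step 4: checksum of the whole file (MD5 chosen arbitrarily here)."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(paths):
    # Steps 1+2: group the file list by size.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    duplicates = []
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        # Step 3: cheap prefilter - only files whose middle block
        # matches can still be duplicates of each other.
        by_block = defaultdict(list)
        for p in group:
            by_block[middle_block(p, size)].append(p)
        # Step 4: full checksums only for files that survived step 3.
        for candidates in by_block.values():
            if len(candidates) < 2:
                continue
            by_sum = defaultdict(list)
            for p in candidates:
                by_sum[full_checksum(p)].append(p)
            duplicates += [g for g in by_sum.values() if len(g) > 1]
    return duplicates
```

The point of step 3 is that same-size files which differ are usually rejected after one short read each, so the expensive full-file checksums in step 4 are only computed for files that are very likely duplicates.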
What do you think?
Roman
Just suppose you press Ctrl+F and select the FTP connection (with the stored password), but instead of clicking Connect, you drop dead.
Something like this was requested before, but unfortunately there was no response from Ghisler.
- ghisler(Author)
- Site Admin
TC only uses checksums when there are more than 2 files with the same size. Otherwise it compares the two directly.
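In other words, per group of same-size files the dispatch could look roughly like this (a sketch of the described behaviour, not TC's actual code; whether TC really uses MD5 is an assumption discussed further down):

```python
import hashlib

def files_identical(path_a, path_b, chunk=1024 * 1024):
    """Direct byte-by-byte comparison of two files already known
    to have the same size."""
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            a, b = fa.read(chunk), fb.read(chunk)
            if a != b:
                return False
            if not a:  # both files exhausted at the same point
                return True

def md5sum(path, chunk=1024 * 1024):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk), b''):
            h.update(block)
    return h.hexdigest()

def compare_same_size_group(group):
    """Exactly two files: compare them directly.
    Three or more: group them by full-file checksum."""
    if len(group) == 2:
        return [group] if files_identical(*group) else []
    sums = {}
    for p in group:
        sums.setdefault(md5sum(p), []).append(p)
    return [g for g in sums.values() if len(g) > 1]
```

The direct comparison wins for a pair because it can abort at the first differing byte, whereas checksums always read both files to the end; with three or more files, checksums avoid comparing every pair.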
Author of Total Commander
https://www.ghisler.com
ghisler(Author) wrote: TC only uses checksums when there are more than 2 files with the same size. Otherwise it compares the two directly.

My question is still: "Is it possible to use the same technique in the Compare By Contents and Synchronize directories tools?"

Edit:
I'm pretty sure the suggested algorithm is somehow already used in Find Duplicate Files. To verify that, I did the following test:
1. Create a folder F:\test
2. Create 5 subfolders
3. Copy a file of size 1.2 GB into each subfolder
4. Search for duplicate files in F:\test ("same size" and "same contents" checked)
5. TC takes about 2 minutes to list the five files
How come? The answer I imagine is that TC doesn't compare the files by contents but instead creates an MD5 checksum for each file and considers files with identical checksums to be duplicates. Calculating the MD5 of 1.2 GB takes about 25 seconds, so 5 * 25 = 125 seconds (~2 minutes).
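The arithmetic behind that estimate, spelled out (the ~49 MB/s read throughput is derived from the measurement, not stated anywhere):

```python
# Back-of-the-envelope check of the measurement above; all inputs
# come from the post, the throughput is inferred.
file_size_mb = 1.2 * 1024                      # 1.2 GB per file
seconds_per_file = 25                          # observed MD5 time per file
throughput = file_size_mb / seconds_per_file   # ~49 MB/s read speed
total_seconds = 5 * seconds_per_file           # 5 files -> 125 s (~2 min)
print(f"{throughput:.0f} MB/s, {total_seconds} s in total")
```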
2Ghisler
Am I right?
Is it possible to use the same technique in the Compare By Contents and Synchronize directories tools?
MVV wrote: As I see in the first post, Hacker suggests the opposite, i.e. not to use checksums while comparing.

Are you sure? Please read it again carefully. He said "4. Calculate their checksums and compare." "Compare" here means comparing the files' checksums, not the files' contents.
Christian,

ghisler(Author) wrote: TC only uses checksums when there are more than 2 files with the same size. Otherwise it compares the two directly.

OK, in that case my suggestion holds for the case when there are more than two files of the same size.
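So the refined proposal would only insert the middle-block test on the checksum path, e.g. (a sketch reusing the hypothetical middle_block, files_identical and compare_same_size_group helpers from the earlier examples):

```python
def compare_group(group, size):
    """Handle one group of same-size files."""
    # Two files: TC already compares them directly, no change needed.
    if len(group) == 2:
        return [group] if files_identical(*group) else []
    # Three or more: run the cheap middle-block test (step 3) first,
    # so full checksums (step 4) are only computed for files that
    # still look like duplicates.
    by_block = {}
    for p in group:
        by_block.setdefault(middle_block(p, size), []).append(p)
    result = []
    for candidates in by_block.values():
        if len(candidates) > 1:
            result += compare_same_size_group(candidates)
    return result
```

For a two-file group this is exactly the current behaviour; for larger groups the extra cost is one short read per file, which is cheap compared to even a single full-file checksum.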
Roman
Just suppose you press Ctrl+F and select the FTP connection (with the stored password), but instead of clicking Connect, you drop dead.