Improving the speed of finding duplicates

Hacker
Moderator
Posts: 13144
Joined: 2003-02-06, 14:56 UTC
Location: Bratislava, Slovakia

Post by Hacker

Hi Christian,
When searching for duplicates (same size and content) I assume the search works like this:
1. Create a file list
2. Check which files have the same size
4. Calculate their checksums and compare.

I suggest adding another step:
1. Create a file list
2. Check which files have the same size
3. Read a short (1kB? 32kB?) block from the middle of each file and compare it, or its checksum, to those of the other files of the same size.
4. Calculate their checksums and compare.
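
A minimal sketch of the four steps above, in Python. This is only an illustration of the proposal, not how TC works: the function names, the 32 kB sample size and the choice of MD5 are all assumptions. Step 1 (building the file list) is left to the caller.

```python
import hashlib
import os
from collections import defaultdict

SAMPLE_SIZE = 32 * 1024  # assumed; the post suggests somewhere from 1 kB to 32 kB


def middle_block(path, size, sample_size=SAMPLE_SIZE):
    """Read up to sample_size bytes centred on the middle of the file."""
    start = max(0, size // 2 - sample_size // 2)
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(sample_size)


def full_checksum(path, chunk_size=1024 * 1024):
    """MD5 of the whole file (assumed algorithm), read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def split_by(paths, key):
    """Group paths by key(path) and drop groups with a single member."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]


def find_duplicates(paths):
    """Return lists of paths whose size, middle block and checksum all match."""
    duplicates = []
    # Step 2: keep only files that share their size with another file.
    for same_size in split_by(paths, os.path.getsize):
        size = os.path.getsize(same_size[0])
        # Step 3: cheap pre-filter on a short block from the middle.
        for same_sample in split_by(same_size, lambda p: middle_block(p, size)):
            # Step 4: full checksum only for files that survived step 3.
            duplicates.extend(split_by(same_sample, full_checksum))
    return duplicates
```

The point of step 3 is that files which differ in the sampled block are dropped after one short read, so the expensive full read in step 4 only happens for the remaining candidates.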

What do you think?

Roman
Suppose you press Ctrl+F, select the FTP connection (with a saved password), but instead of clicking Connect, you drop dead.
ts4242
Power Member
Posts: 2081
Joined: 2004-02-02, 20:08 UTC

Post by ts4242

Something like this was requested before but unfortunately there was no response from Ghisler.
Hacker
Moderator
Posts: 13144
Joined: 2003-02-06, 14:56 UTC
Location: Bratislava, Slovakia

Post by Hacker

Hello ts4242,
Your suggestion concerns something quite different, though it is useful, too.

Roman
ts4242
Power Member
Posts: 2081
Joined: 2004-02-02, 20:08 UTC

Post by ts4242

Hacker wrote:Your suggestion concerns something quite different
But it is the same concept: using a checksum instead of a direct comparison by content.
MVV
Power Member
Posts: 8711
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by MVV

ts4242 wrote:But the same concept, that is, using checksum instead of direct compare by contents.
As I read the first post, Hacker suggests the opposite: not using checksums while comparing. :D
ghisler(Author)
Site Admin
Posts: 50873
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by ghisler(Author)

TC only uses checksums when there are more than 2 files with the same size. Otherwise it compares the two directly.
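
A hedged sketch of the behaviour described above, for one group of same-size files: a direct comparison when there are exactly two files, checksums when there are more. This is not TC's code. The function names are made up for illustration, and MD5 is an assumption, since the post does not say which checksum is used.

```python
import filecmp
import hashlib
from collections import defaultdict


def md5_of(path, chunk_size=1024 * 1024):
    """MD5 of the whole file (assumed algorithm), read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicates_in_size_group(paths):
    """paths: files already known to have the same size."""
    if len(paths) < 2:
        return []
    if len(paths) == 2:
        # Exactly two candidates: compare the contents directly.
        # shallow=False forces a content comparison, which can stop
        # at the first block that differs.
        same = filecmp.cmp(paths[0], paths[1], shallow=False)
        return [list(paths)] if same else []
    # More than two candidates: one checksum per file, then group by it.
    by_digest = defaultdict(list)
    for p in paths:
        by_digest[md5_of(p)].append(p)
    return [group for group in by_digest.values() if len(group) > 1]
```

With n files in a group, the checksum path reads each file once, whereas pairwise direct comparison could need up to n*(n-1)/2 comparisons.
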
Author of Total Commander
https://www.ghisler.com
ts4242
Power Member
Posts: 2081
Joined: 2004-02-02, 20:08 UTC

Post by ts4242

ghisler(Author) wrote:TC only uses checksums when there are more than 2 files with the same size. Otherwise it compares the two directly.
:lol: This is not new to me, but it confirms the conclusion I posted in this topic: http://ghisler.ch/board/viewtopic.php?t=24390
Edit:
I'm pretty sure that the suggested algorithm is already used in some form when finding duplicate files. To check, I ran the following test:
1- Create the folder F:\test
2- Create 5 subfolders
3- Copy a 1.2 GB file into each subfolder
4- Search for duplicate files in F:\test (with "same size" and "same contents" checked)
5- TC takes about 2 minutes to list the five files

How can this be? My guess is that TC doesn't compare the files by content but instead creates an MD5 checksum for each file and treats files with identical checksums as duplicates.
Calculating MD5 for 1.2 GB takes 25 seconds, so 5 * 25 = 125 seconds (~2 minutes).
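
The estimate above can be written out as a small calculation. The numbers are the ones from the test; the function names are only for illustration, and 1.2 GB is taken as 1200 MB.

```python
def checksum_pass_seconds(file_count, seconds_per_file):
    """Total time if every file is read and hashed exactly once."""
    return file_count * seconds_per_file


def implied_throughput_mb_s(size_mb, seconds):
    """Read-and-hash speed implied by one measured file."""
    return size_mb / seconds


total = checksum_pass_seconds(5, 25)            # 5 files at 25 s each
throughput = implied_throughput_mb_s(1200, 25)  # 1.2 GB taken as 1200 MB

print(total)       # 125
print(throughput)  # 48.0
```

125 seconds is close to the observed 2 minutes, which is what makes the one-checksum-per-file explanation plausible.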

2Ghisler
Am I right?
Is it possible to use the same technique in the Compare By Contents and Synchronize Directories tools?
My question is still: "Is it possible to use the same technique in the Compare By Contents and Synchronize Directories tools?"

MVV wrote: As I see in first post, Hacker suggest the opposite, do not use checksums while comparing. :D
Are you sure? Please read it again carefully. He wrote "4. Calculate their checksums and compare." Here, "compare" means comparing the files' checksums, not the files' contents.
MVV
Power Member
Posts: 8711
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by MVV

ts4242 wrote:Are you sure?
I mean that the topic starter is focusing on comparing small data blocks (the suggested step 3), not on comparing checksums (step 4, shown only for clarity).
ts4242
Power Member
Posts: 2081
Joined: 2004-02-02, 20:08 UTC

Post by ts4242

@MVV

Again, are you sure? ;)
Hacker wrote:3. Read a short (1kB? 32kB?) block from the middle of the file and compare it or its checksum to those of the other files' with the same sizes.
MVV
Power Member
Posts: 8711
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by MVV

Yeah, sure. "Compare small block (or maybe its checksum) and not whole file, as intermediate operation" != "Compare checksums of whole files". :)
Hacker
Moderator
Posts: 13144
Joined: 2003-02-06, 14:56 UTC
Location: Bratislava, Slovakia

Post by Hacker

Christian,
ghisler(Author) wrote:TC only uses checksums when there are more than 2 files with the same size. Otherwise it compares the two directly.
OK, in that case my suggestion holds for the case when there are more than two files of the same size.

Roman