Search for duplicates by same content is very, very slow

maverick1999
Junior Member
Posts: 8
Joined: 2015-11-19, 10:35 UTC

Post by *maverick1999 »

The main issue is with 2 files which ARE identical, and you search for them in order to delete the duplicate.

In that case (using sync, search or a pure compare), a checksum is the fastest method, followed by the larger-buffer method (see the sketch below).
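Just to illustrate what the larger-buffer method boils down to, a minimal sketch (Python, for illustration only, not TC's actual code; the 4 MB buffer size is an arbitrary assumption, and I assume the files were already matched by size, as a duplicate search would do first):

Code: Select all

# Sketch of a buffered binary compare with a configurable (large) buffer.
# Assumption: both files were already matched by size before comparing contents.
BUF_SIZE = 4 * 1024 * 1024  # 4 MB, arbitrary choice

def files_identical(path_a, path_b, buf_size=BUF_SIZE):
    """Compare two files byte for byte, stopping at the first difference."""
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            block_a = fa.read(buf_size)
            block_b = fb.read(buf_size)
            if block_a != block_b:   # early exit on the first mismatch
                return False
            if not block_a:          # both files ended together: identical
                return True

With a larger buffer the drive spends less time seeking back and forth between the two files, which is where a mechanical disk loses most of its speed when both files are on the same drive.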
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

MVV wrote:But as you can see from my quick test, even memory compare may be done in different ways, and two methods may be quite different in speed (up to 10 times).
It is well known that the low-level functions (memcmp/memmove/memcpy) come close to the theoretical memory bandwidth.
(see e.g. http://nadeausoftware.com/articles/2012/05/c_c_tip_how_copy_memory_quickly )


Using the checksum method isn't such a bad idea.
The problem is that you need a good hash function (SHA-224 and above),
but those functions have a calculation performance of only ~100-200 MB/s in a single thread,
and you effectively have to halve even that, since you need to hash both files.
TC already uses SHA-1 for comparing files from archives and for the mentioned case of 3 or more files of the same size,
but the calculation performance also limits such comparisons in the case of fast SSDs.

In any case, making the CBC compare operations configurable can't hurt, and it would help in cases like the OP's (see the sketch below).
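To make the halving point concrete, a rough sketch of a whole-file hash compare (Python's hashlib with SHA-256 as an example; TC's own implementation is of course different, and the 1 MB chunk size is an arbitrary choice):

Code: Select all

import hashlib

def file_digest(path, algo='sha256', buf_size=1024 * 1024):
    """Hash a whole file in 1 MB chunks."""
    h = hashlib.new(algo)
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(buf_size), b''):
            h.update(chunk)
    return h.hexdigest()

def same_content_by_hash(path_a, path_b):
    # Both files are hashed in full, even if they differ in the very first
    # byte - unlike a direct binary compare, which could stop immediately.
    return file_digest(path_a) == file_digest(path_b)

The win is that a digest can be cached and compared against many candidates, which is why it pays off for the 3-or-more-files case, but not so much for a single pair.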
TC plugins: PCREsearch and RegXtract
MVV
Power Member
Posts: 8702
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by *MVV »

I would try a hash-per-block method for comparing large files by content, just to be able to detect a difference earlier than after reading both files entirely.

When you search for duplicates, you can already use any checksum compare method via the corresponding WDX plugin (e.g. http://totalcmd.net/plugring/wdhash.html).
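A minimal sketch of the hash-per-block idea above (Python; the 1 MB block size is arbitrary and MD5 is used only as a cheap per-block digest, not a proposal for TC's internals):

Code: Select all

import hashlib

def same_content_blockwise(path_a, path_b, block_size=1024 * 1024):
    """Compare two files via per-block digests, stopping at the first differing block."""
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if hashlib.md5(block_a).digest() != hashlib.md5(block_b).digest():
                return False   # difference found, no need to read the rest
            if not block_a:
                return True    # both files exhausted, all blocks matched

For just two files this saves nothing over comparing the blocks directly; the per-block digests only pay off when one file is checked against many candidates and the digests can be cached.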
maverick1999
Junior Member
Posts: 8
Joined: 2015-11-19, 10:35 UTC

Post by *maverick1999 »

@milo1012
On a laptop with 2 identical files located on the same drive, the current speed never goes above 7 MB/s (the drive has a 32 MB buffer).

SHA-512, however, is limited only by the disk read speed (~100 MB/s), so the bottleneck is not the hash function.
MVV
Power Member
Posts: 8702
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by *MVV »

Have you tried searching for dups in the case of 3+ identical files? TC should use hashes there, so you can compare the speed. Don't forget about the file system cache, though, if you repeat your test with the same files.
maverick1999
Junior Member
Posts: 8
Joined: 2015-11-19, 10:35 UTC

Post by *maverick1999 »

I think the WDX plugin is no longer compatible with the 64-bit version of TC, and most of the time the large duplicates come as 2 files, not 3 or more.

Hopefully the checksum and larger-buffer features will make it into TC.
MVV
Power Member
Posts: 8702
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by *MVV »

I only suggest comparing the speed of compare-by-contents versus compare-by-hashes; you can do that already (see the sketch below).

There are other plugins that have 64-bit versions, just use the search, e.g. LotsOfHashes.
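If you want to try that comparison without any plugin, a rough timing sketch (Python; the two paths are hypothetical placeholders, SHA-1 is used because that is what TC reportedly uses, and the file system cache caveat above applies - repeat runs mostly read from RAM):

Code: Select all

import hashlib, time

BUF = 4 * 1024 * 1024  # arbitrary buffer size

def binary_compare(a, b):
    with open(a, 'rb') as fa, open(b, 'rb') as fb:
        while True:
            ba, bb = fa.read(BUF), fb.read(BUF)
            if ba != bb:
                return False
            if not ba:
                return True

def sha1_of(path):
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(BUF), b''):
            h.update(chunk)
    return h.digest()

def timed(label, func):
    t0 = time.perf_counter()
    result = func()
    print('%-15s %6.2f s  -> %s' % (label, time.perf_counter() - t0, result))

# Hypothetical example paths - replace with two large identical files.
A, B = r'D:\test\copy1.bin', r'D:\test\copy2.bin'
timed('binary compare', lambda: binary_compare(A, B))
timed('hash compare', lambda: sha1_of(A) == sha1_of(B))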
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

maverick1999 wrote:On a laptop with 2 identical files located on the same drive the current speed never goes above 7MB/s. (the drive has a 32MB buffer).

SHA-512 however is limited only by disk read speed (~100MB/s) so the bottleneck is not the hash function.
It's understandable for a (mechanical) HDD.
And like I said: I'm all for a configurable buffer, and/or using a checksum/hash compare.

But in the case of a fast SSD, the bottleneck will be the hash function.
Just try to compare files from solid rar archives with TC's "Synchronize dirs" function, where TC uses SHA-1.
I barely get more than ~140 MB/s on my test system, even though the SSD would easily reach ~400 MB/s.
So when looking for duplicates of large files on today's fast (SSD) drives, using a hash is probably not the best solution,
even if TC's SHA-1 implementation got a slight speed boost.

I'm not sure if you can read drive properties from the WinAPI, i.e. identify SSD drives in an unambiguous way.
But if you can, TC could use checksum/hash compare for mechanical HDDs, and for SSDs the buffered binary compare that we have now.
And of course it would be best to still let the user configure which method he wants to use, including the mentioned buffer size.

Update: there actually are some ways to detect non-mechanical drives (see the sketch after the links):
http://stackoverflow.com/questions/9273373/tell-if-a-path-refers-to-a-solid-state-drive-with-winapi
http://stackoverflow.com/questions/23363115/detecting-ssd-in-windows
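For reference, a minimal sketch of the seek-penalty query from the first link (Python/ctypes, Windows only; physical drive number 0 is just an example, mapping a volume/path to its physical drive and the TRIM/bus-type checks from the second link are left out, and opening \\.\PhysicalDriveN may need sufficient rights on some systems):

Code: Select all

import ctypes
from ctypes import wintypes

# Constants from the Windows SDK (see the linked StackOverflow answers).
IOCTL_STORAGE_QUERY_PROPERTY = 0x002D1400
StorageDeviceSeekPenaltyProperty = 7
PropertyStandardQuery = 0
INVALID_HANDLE_VALUE = ctypes.c_void_p(-1).value

class STORAGE_PROPERTY_QUERY(ctypes.Structure):
    _fields_ = [('PropertyId', wintypes.DWORD),
                ('QueryType', wintypes.DWORD),
                ('AdditionalParameters', ctypes.c_ubyte * 1)]

class DEVICE_SEEK_PENALTY_DESCRIPTOR(ctypes.Structure):
    _fields_ = [('Version', wintypes.DWORD),
                ('Size', wintypes.DWORD),
                ('IncursSeekPenalty', ctypes.c_ubyte)]

def incurs_seek_penalty(drive_number=0):
    """True = likely a mechanical HDD, False = likely an SSD (no seek penalty)."""
    k32 = ctypes.WinDLL('kernel32', use_last_error=True)
    k32.CreateFileW.restype = ctypes.c_void_p
    handle = k32.CreateFileW(r'\\.\PhysicalDrive%d' % drive_number,
                             0, 3, None, 3, 0, None)  # no access, share r/w, OPEN_EXISTING
    if handle is None or handle == INVALID_HANDLE_VALUE:
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        query = STORAGE_PROPERTY_QUERY(StorageDeviceSeekPenaltyProperty,
                                       PropertyStandardQuery)
        desc = DEVICE_SEEK_PENALTY_DESCRIPTOR()
        returned = wintypes.DWORD(0)
        ok = k32.DeviceIoControl(ctypes.c_void_p(handle), IOCTL_STORAGE_QUERY_PROPERTY,
                                 ctypes.byref(query), ctypes.sizeof(query),
                                 ctypes.byref(desc), ctypes.sizeof(desc),
                                 ctypes.byref(returned), None)
        if not ok:
            raise ctypes.WinError(ctypes.get_last_error())
        return bool(desc.IncursSeekPenalty)
    finally:
        k32.CloseHandle(ctypes.c_void_p(handle))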
TC plugins: PCREsearch and RegXtract
MVV
Power Member
Posts: 8702
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by *MVV »

milo1012 wrote:Just try to compare files from solid rar archives with TC's "Synchronize dirs" function, where TC uses SHA-1.
I barely get more than ~140 MB/s on my test system, even though the SSD would easily reach ~400 MB/s.
I think that the bottleneck in such a case is the RAR decompressor. :)
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

MVV wrote:I think that the bottleneck in such case is RAR decompressor
No. The first file is decompressed, and the hash is calculated after that (you can see it in the lower status bar).
Then the second file is decompressed, and its hash is calculated.
I repeatedly measured ~15 seconds for a 2 GB file, just for the hash calculation, not counting the decompression.

It's the typical calculation performance for SHA-1, as you can also see in the article I linked.
I doubt that you would ever get much faster than ~200 MB/s on current machines, even with optimized code.

And yes, this is the same hash function that is optionally used for the duplicate search.
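For anyone who wants to check the raw SHA-1 throughput of their own machine, independent of disk and archive overhead, a quick sketch (Python; note that hashlib uses OpenSSL's optimized code, so the result is rather an upper bound for what a file manager's own implementation would reach):

Code: Select all

import hashlib, time

# Hash 1 GB of in-memory data to measure pure SHA-1 speed, no disk I/O involved.
chunk = b'\x55' * (64 * 1024 * 1024)   # 64 MB, fed 16 times = 1 GB in total
h = hashlib.sha1()
start = time.perf_counter()
for _ in range(16):
    h.update(chunk)
elapsed = time.perf_counter() - start
print('SHA-1 throughput: %.0f MB/s' % (1024 / elapsed))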
TC plugins: PCREsearch and RegXtract
maverick1999
Junior Member
Posts: 8
Joined: 2015-11-19, 10:35 UTC

Post by *maverick1999 »

For an SSD the limiting factor could be the hash function, but I could hardly call it a bottleneck.
At most a slowdown, and you could just use a faster hash.

For a mechanical HDD the issue stands.
I can only hope the checksum and larger-buffer features make it into the next TC.