Search for duplicates by same content is very, very slow

maverick1999
Junior Member
Posts: 8
Joined: 2015-11-19, 10:35 UTC

Post by *maverick1999 »

The main issue is with 2 files which ARE identical, and you search for them in order to delete the duplicate.

In that case (using sync, search or a pure compare), a checksum is the fastest method, followed by the larger-buffer method (see the sketch below).
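Just to illustrate what the larger-buffer method boils down to, a minimal sketch (Python, for illustration only, not TC's actual code; the 4 MB buffer size is an arbitrary assumption, and I assume the files were already matched by size, as a duplicate search would do first):

Code: Select all

# Sketch of a buffered binary compare with a configurable (large) buffer.
# Assumption: both files were already matched by size before comparing contents.
BUF_SIZE = 4 * 1024 * 1024  # 4 MB, arbitrary choice

def files_identical(path_a, path_b, buf_size=BUF_SIZE):
    """Compare two files byte for byte, stopping at the first difference."""
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            block_a = fa.read(buf_size)
            block_b = fb.read(buf_size)
            if block_a != block_b:   # early exit on the first mismatch
                return False
            if not block_a:          # both files ended together: identical
                return True

With a larger buffer the drive spends less time seeking back and forth between the two files, which is where a mechanical disk loses most of its speed when both files are on the same drive.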
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

MVV wrote:But as you can see from my quick test, even memory compare may be done in different ways, and two methods may be quite different in speed (up to 10 times).
It is well known that the low-level functions (memcmp/memmove/memcpy) come close to the theoretical memory bandwidth.
(see e.g. http://nadeausoftware.com/articles/2012/05/c_c_tip_how_copy_memory_quickly )


Using the checksum method isn't such a bad idea.
The problem is that you need a good hash function (SHA-224 and above),
but those functions have a calculation performance of only ~100-200 MB/s in a single thread,
and you effectively have to halve even that, since you need to hash both files.
TC already uses SHA-1 for comparing files from archives and for the mentioned case of 3 or more files of the same size,
but the calculation performance also limits such comparisons in the case of fast SSDs.

In any case, making the CBC compare operations configurable can't hurt, and it would help in cases like the OP's (see the sketch below).
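To make the halving point concrete, a rough sketch of a whole-file hash compare (Python's hashlib with SHA-256 as an example; TC's own implementation is of course different, and the 1 MB chunk size is an arbitrary choice):

Code: Select all

import hashlib

def file_digest(path, algo='sha256', buf_size=1024 * 1024):
    """Hash a whole file in 1 MB chunks."""
    h = hashlib.new(algo)
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(buf_size), b''):
            h.update(chunk)
    return h.hexdigest()

def same_content_by_hash(path_a, path_b):
    # Both files are hashed in full, even if they differ in the very first
    # byte - unlike a direct binary compare, which could stop immediately.
    return file_digest(path_a) == file_digest(path_b)

The win is that a digest can be cached and compared against many candidates, which is why it pays off for the 3-or-more-files case, but not so much for a single pair.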
TC plugins: PCREsearch and RegXtract
MVV
Power Member
Posts: 8702
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by *MVV »

I would try a hash-per-block method for comparing large files by content, just to be able to detect a difference earlier than after reading both files entirely.

When you search for duplicates, you can already use any checksum compare method via the corresponding WDX plugin (e.g. http://totalcmd.net/plugring/wdhash.html).
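A minimal sketch of the hash-per-block idea above (Python; the 1 MB block size is arbitrary and MD5 is used only as a cheap per-block digest, not a proposal for TC's internals):

Code: Select all

import hashlib

def same_content_blockwise(path_a, path_b, block_size=1024 * 1024):
    """Compare two files via per-block digests, stopping at the first differing block."""
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if hashlib.md5(block_a).digest() != hashlib.md5(block_b).digest():
                return False   # difference found, no need to read the rest
            if not block_a:
                return True    # both files exhausted, all blocks matched

For just two files this saves nothing over comparing the blocks directly; the per-block digests only pay off when one file is checked against many candidates and the digests can be cached.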
maverick1999
Junior Member
Posts: 8
Joined: 2015-11-19, 10:35 UTC

Post by *maverick1999 »

@milo1012
On a laptop with 2 identical files located on the same drive, the current speed never goes above 7 MB/s (the drive has a 32 MB buffer).

SHA-512, however, is limited only by the disk read speed (~100 MB/s), so the bottleneck is not the hash function.
MVV
Power Member
Posts: 8702
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by *MVV »

Have you tried searching for dups in the case of 3+ identical files? TC should use hashes there, so you can compare the speed. Don't forget about the file system cache, though, if you repeat your test with the same files.
maverick1999
Junior Member
Posts: 8
Joined: 2015-11-19, 10:35 UTC

Post by *maverick1999 »

I think the WDX plugin is no longer compatible with the 64-bit version of TC, and most of the time the large duplicates come as 2 files, not 3 or more.

Hopefully the checksum and larger-buffer features will make it into TC.
MVV
Power Member
Posts: 8702
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by *MVV »

I only suggest comparing the speed of compare-by-contents versus compare-by-hashes; you can do that already (see the sketch below).

There are other plugins that have 64-bit versions, just use the search, e.g. LotsOfHashes.
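If you want to try that comparison without any plugin, a rough timing sketch (Python; the two paths are hypothetical placeholders, SHA-1 is used because that is what TC reportedly uses, and the file system cache caveat above applies - repeat runs mostly read from RAM):

Code: Select all

import hashlib, time

BUF = 4 * 1024 * 1024  # arbitrary buffer size

def binary_compare(a, b):
    with open(a, 'rb') as fa, open(b, 'rb') as fb:
        while True:
            ba, bb = fa.read(BUF), fb.read(BUF)
            if ba != bb:
                return False
            if not ba:
                return True

def sha1_of(path):
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(BUF), b''):
            h.update(chunk)
    return h.digest()

def timed(label, func):
    t0 = time.perf_counter()
    result = func()
    print('%-15s %6.2f s  -> %s' % (label, time.perf_counter() - t0, result))

# Hypothetical example paths - replace with two large identical files.
A, B = r'D:\test\copy1.bin', r'D:\test\copy2.bin'
timed('binary compare', lambda: binary_compare(A, B))
timed('hash compare', lambda: sha1_of(A) == sha1_of(B))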
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

maverick1999 wrote:On a laptop with 2 identical files located on the same drive the current speed never goes above 7MB/s. (the drive has a 32MB buffer).

SHA-512 however is limited only by disk read speed (~100MB/s) so the bottleneck is not the hash function.
It's understandable for a (mechanical) HDD.
And like I said: I'm all for a configurable buffer, and/or using a checksum/hash compare.

But in the case of a fast SSD, the bottleneck will be the hash function.
Just try to compare files from solid rar archives with TC's "Synchronize dirs" function, where TC uses SHA-1.
I barely get more than ~140 MB/s on my test system, even though the SSD would easily reach ~400 MB/s.
So when looking for duplicates of large files on today's fast (SSD) drives, using a hash is probably not the best solution,
even if TC's SHA-1 implementation got a slight speed boost.

I'm not sure if you can read drive properties from the WinAPI, i.e. identify SSD drives in an unambiguous way.
But if you can, TC could use checksum/hash compare for mechanical HDDs, and for SSDs the buffered binary compare that we have now.
And of course it would be best to still let the user configure which method he wants to use, including the mentioned buffer size.

Update: there actually are some ways to detect non-mechanical drives (see the sketch after the links):
http://stackoverflow.com/questions/9273373/tell-if-a-path-refers-to-a-solid-state-drive-with-winapi
http://stackoverflow.com/questions/23363115/detecting-ssd-in-windows
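For reference, a minimal sketch of the seek-penalty query from the first link (Python/ctypes, Windows only; physical drive number 0 is just an example, mapping a volume/path to its physical drive and the TRIM/bus-type checks from the second link are left out, and opening \\.\PhysicalDriveN may need sufficient rights on some systems):

Code: Select all

import ctypes
from ctypes import wintypes

# Constants from the Windows SDK (see the linked StackOverflow answers).
IOCTL_STORAGE_QUERY_PROPERTY = 0x002D1400
StorageDeviceSeekPenaltyProperty = 7
PropertyStandardQuery = 0
INVALID_HANDLE_VALUE = ctypes.c_void_p(-1).value

class STORAGE_PROPERTY_QUERY(ctypes.Structure):
    _fields_ = [('PropertyId', wintypes.DWORD),
                ('QueryType', wintypes.DWORD),
                ('AdditionalParameters', ctypes.c_ubyte * 1)]

class DEVICE_SEEK_PENALTY_DESCRIPTOR(ctypes.Structure):
    _fields_ = [('Version', wintypes.DWORD),
                ('Size', wintypes.DWORD),
                ('IncursSeekPenalty', ctypes.c_ubyte)]

def incurs_seek_penalty(drive_number=0):
    """True = likely a mechanical HDD, False = likely an SSD (no seek penalty)."""
    k32 = ctypes.WinDLL('kernel32', use_last_error=True)
    k32.CreateFileW.restype = ctypes.c_void_p
    handle = k32.CreateFileW(r'\\.\PhysicalDrive%d' % drive_number,
                             0, 3, None, 3, 0, None)  # no access, share r/w, OPEN_EXISTING
    if handle is None or handle == INVALID_HANDLE_VALUE:
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        query = STORAGE_PROPERTY_QUERY(StorageDeviceSeekPenaltyProperty,
                                       PropertyStandardQuery)
        desc = DEVICE_SEEK_PENALTY_DESCRIPTOR()
        returned = wintypes.DWORD(0)
        ok = k32.DeviceIoControl(ctypes.c_void_p(handle), IOCTL_STORAGE_QUERY_PROPERTY,
                                 ctypes.byref(query), ctypes.sizeof(query),
                                 ctypes.byref(desc), ctypes.sizeof(desc),
                                 ctypes.byref(returned), None)
        if not ok:
            raise ctypes.WinError(ctypes.get_last_error())
        return bool(desc.IncursSeekPenalty)
    finally:
        k32.CloseHandle(ctypes.c_void_p(handle))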
TC plugins: PCREsearch and RegXtract
MVV
Power Member
Posts: 8702
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by *MVV »

milo1012 wrote:Just try to compare files from solid rar archives with TC's "Synchronize dirs" function, where TC uses SHA-1.
I barely get more than ~140 MB/s on my test system, even though the SSD would easily reach ~400 MB/s.
I think that the bottleneck in such a case is the RAR decompressor. :)
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

MVV wrote:I think that the bottleneck in such case is RAR decompressor
No. The first file is decompressed, and the hash is calculated after that (you can see it in the lower status bar).
Then the second file is decompressed, and its hash is calculated.
I repeatedly measured ~15 seconds for a 2 GB file, just for the hash calculation, not counting the decompression.

It's the typical calculation performance for SHA-1, as you can also see in the article I linked.
I doubt that you would ever get much faster than ~200 MB/s on current machines, even with optimized code.

And yes, this is the same hash function that is optionally used for the duplicate search.
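For anyone who wants to check the raw SHA-1 throughput of their own machine, independent of disk and archive overhead, a quick sketch (Python; note that hashlib uses OpenSSL's optimized code, so the result is rather an upper bound for what a file manager's own implementation would reach):

Code: Select all

import hashlib, time

# Hash 1 GB of in-memory data to measure pure SHA-1 speed, no disk I/O involved.
chunk = b'\x55' * (64 * 1024 * 1024)   # 64 MB, fed 16 times = 1 GB in total
h = hashlib.sha1()
start = time.perf_counter()
for _ in range(16):
    h.update(chunk)
elapsed = time.perf_counter() - start
print('SHA-1 throughput: %.0f MB/s' % (1024 / elapsed))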
TC plugins: PCREsearch and RegXtract
maverick1999
Junior Member
Posts: 8
Joined: 2015-11-19, 10:35 UTC

Post by *maverick1999 »

For an SSD the limiting factor could be the hash function, but I could hardly call it a bottleneck.
At most a slowdown, and you could just use a faster hash.

For a mechanical HDD the issue stands.
I can only hope the checksum and larger-buffer features make it into the next TC.