Could you add xxHash (xxh3, the fastest hash algorithm)?

Here you can propose new features, make suggestions etc.

Moderators: white, Hacker, petermad, Stefan2

User avatar
ghisler(Author)
Site Admin
Site Admin
Posts: 48021
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland
Contact:

Re: Could you add xxHash (xxh3, the fastest hash algorithm)?

Post by *ghisler(Author) »

I have now added BLAKE3 to Total Commander 10.50 beta 1. Unfortunately the C version is not multi-threading, but it's already much faster than all the other hash algorithms. The multi-threaded version is written in Rust, a language I don't have any experience with. I don't even know whether it's possible to create DLLs with it.

Regarding xxHash, I have read some more on it, and the relatively short hash of 32 or 64 bit is meant for a hash table, not for verifying the integrity of a file. Why? In a hash table, collisions (same hash for different inputs) are expected and handled, either by using the next free slot in the hash table, or by chaining (hash table linking to multiple entries). xxHash3 also seems to support a 128 bit hash, but it's relatively new - I don't know the exact internals.

Here is an interesting analysis of xxHash:
https://encode.su/threads/2556-Improving-xxHash

I don't know if any of these improvements have been included in xxHash2 or xxHash3, but if not, then xxHash isn't suited for verifying the integrity of files. At least xxhash64bit seems to use 4 64-bit accumulators now, which would be a big improvement:
https://github.com/Cyan4973/xxHash/blob/dev/doc/xxhash_spec.md

Unfortunately the document still doesn't explain the inner workings of xxhash128, although it was requested 3 months ago. Does anyone here have details on the xxhash128 algorithm?
Author of Total Commander
https://www.ghisler.com
User avatar
nsp
Power Member
Power Member
Posts: 1803
Joined: 2005-12-04, 08:39 UTC
Location: Lyon (FRANCE)
Contact:

Re: Could you add xxHash (xxh3, the fastest hash algorithm)?

Post by *nsp »

if you want to get a good comparison benchmark, you can give a look at : SMHasher.

For xxh128, algorithm you should give a look to function XXH3_hashLong_128b inside reference implementation in C. This is a reference implementation and does not have performance optimization of the official release 0.8.1.
For a textual definition of xxhash you can read documentation to what i have read, this is so general that it should also apply to xxh128.
User avatar
ghisler(Author)
Site Admin
Site Admin
Posts: 48021
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland
Contact:

Re: Could you add xxHash (xxh3, the fastest hash algorithm)?

Post by *ghisler(Author) »

Unfortunately the documentation doesn't cover xxhash 128 bit yet. This was also mentioned in the github bug reports.

From the source, it seems to use two 64-bit accumulators if I understand it correctly.
Author of Total Commander
https://www.ghisler.com
quantum
Junior Member
Junior Member
Posts: 49
Joined: 2004-02-29, 01:42 UTC

Re: Could you add xxHash (xxh3, the fastest hash algorithm)?

Post by *quantum »

From the table at https://github.com/rurban/smhasher
I see:
xxh3low, xxh128low, xxh128, xxHash64. These have no quality issues listed. The first is fastest, last is slowest.
Also CityCrc128 is a good pick.
Any cryptographic hash like Blake will be slower than non-crypto hashes. They have a lot more cpu cycles. Multi-threading matters of course. It sort-of seems xxh has multi-threading, but it's not clear and I don't see any command line setting (like Blake3).
User avatar
ghisler(Author)
Site Admin
Site Admin
Posts: 48021
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland
Contact:

Re: Could you add xxHash (xxh3, the fastest hash algorithm)?

Post by *ghisler(Author) »

I have read a lot about these hashes before deciding to add Blake.

As I understand it, xxHash32 and xxHash64 are not suited for file hashes because their room is too small. They are meant as a traditional hash function, where you have to re-hash in case of a hash collision. There is no such workaround when there is a hash collision with two different files. xxHash128 may be suited for file hashing, but I cannot find any detailed description of the used hash algorithm. If it just uses two xxHash64 hashes, then it is definitely NOT suited for file hashing! The limiting factor isn't the output size, it's the size of the accumulators within the hash algorithm. They need to be at least 128 bit wide too, otherwise you lose entropy.
Author of Total Commander
https://www.ghisler.com
Fla$her
Power Member
Power Member
Posts: 2244
Joined: 2020-01-18, 04:03 UTC

Re: Could you add xxHash (xxh3, the fastest hash algorithm)?

Post by *Fla$her »

2ghisler(Author)
And what about prvhash64s and Discohash that I mentioned earlier? They are even faster. It would be great to use them for comparison and synchronization.
Overquoting is evil! 👎
User avatar
ghisler(Author)
Site Admin
Site Admin
Posts: 48021
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland
Contact:

Re: Could you add xxHash (xxh3, the fastest hash algorithm)?

Post by *ghisler(Author) »

PRVHASH is patented, at least in Russia, according to the linked document. The document describes that it's suited as a pseudo random number generator, but does not mention file hashing.

Discohash looks interesting, but it's very new (2020) and unproven, and the 128-bit hash seems to be created from two 64-bit hashes, so it's questionable whether the internal entropy is really 128 bit, or just 64-bit.
Author of Total Commander
https://www.ghisler.com
Fla$her
Power Member
Power Member
Posts: 2244
Joined: 2020-01-18, 04:03 UTC

Re: Could you add xxHash (xxh3, the fastest hash algorithm)?

Post by *Fla$her »

ghisler(Author) wrote: 2022-05-31, 09:25 UTCand unproven
Tested, only the results are so-so.

According to SMHasher tests, these functions are the fastest of the reliable ones:
Hash functionMiB/seccycl./hashcycl./mapsizeQuality problems
xxh3low16462.3632.77199.79 (2)756
halftime_hash12813478.2397.79252.14 (2)2462
halftime_hash25611620.2898.44252.60 (2)2622
halftime_hash5127681.62125.81274.01 (3)3550
wyhash32low12911.0929.59205.43 (2)4742 bad seeds
komihash10444.5339.55176.50 (1)728
ahash649862.6227.32181.68 (1)412rust
Spooky329747.1362.24196.96 (4)2221UB
Spooky649747.4762.20191.71 (2)2221UB
Spooky1289751.1463.84192.47 (2)2221UB
t1ha2_atonce9237.1238.94194.32 (2)541Zeroes low3
t1ha2_atonce1288350.9955.65203.53 (4)613LongNeighbors
pengyhash8744.4885.31222.45 (4)421
nmhash327850.0156.74207.59 (1)2445
mx36146.0252.48173.09 (3)734UB


Maybe take a look at some of them? HalftimeHash, komihash, Spooky for example?
Overquoting is evil! 👎
User avatar
ghisler(Author)
Site Admin
Site Admin
Posts: 48021
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland
Contact:

Re: Could you add xxHash (xxh3, the fastest hash algorithm)?

Post by *ghisler(Author) »

Another problem is that 99.999% of people have never heared of any of these hashes, except for maybe xxhash.
The problem is that the only variant interesting for file hashing, xxhash128, is still not documented:
https://github.com/Cyan4973/xxHash/issues/589

This makes its quality very questionable. Even the stone-age hash md5 creates 128 bit hashes and works internally with a 128 bit accumulator (4x32 bit) with good entropy. It is broken cryptographically, but that doesn't matter if you only want to verify integrity (that the file wasn't damaged).
Author of Total Commander
https://www.ghisler.com
User avatar
Hacker
Moderator
Moderator
Posts: 13052
Joined: 2003-02-06, 14:56 UTC
Location: Bratislava, Slovakia

Re: Could you add xxHash (xxh3, the fastest hash algorithm)?

Post by *Hacker »

I have to agree here that commonly known and independently assessed hash functions should be preferred. Remember, these would be used in TC for a long time, it's not like we find a better hash so we replace the previous one the next month.

Roman
Mal angenommen, du drückst Strg+F, wählst die FTP-Verbindung (mit gespeichertem Passwort), klickst aber nicht auf Verbinden, sondern fällst tot um.
User avatar
nsp
Power Member
Power Member
Posts: 1803
Joined: 2005-12-04, 08:39 UTC
Location: Lyon (FRANCE)
Contact:

Re: Could you add xxHash (xxh3, the fastest hash algorithm)?

Post by *nsp »

Roman, Christian, the hash functions are not only used for hash calculation from the user but also internally.
As hash for content compare or to verify copy is used internally (to my knowledge, it is md5). This should be as efficient as possible and not be exposed outside to calculate hash files if you think that it is not generally relevant.
tc.hash_<Name> wdx like column could be nice.

fclones which i use on linux is much faster than TC on same external drive to find duplicates. To me, this is probably due to efficiency of the hash used (murmurhash vs md5).
User avatar
Hacker
Moderator
Moderator
Posts: 13052
Joined: 2003-02-06, 14:56 UTC
Location: Bratislava, Slovakia

Re: Could you add xxHash (xxh3, the fastest hash algorithm)?

Post by *Hacker »

nsp,
fclones which i use on linux is much faster than TC on same external drive to find duplicates. To me, this is probably due to efficiency of the hash used (murmurhash vs md5).
I don't believe a hash can be slower than reading from an external drive. If it is, I agree, it should not be used.

Roman
Mal angenommen, du drückst Strg+F, wählst die FTP-Verbindung (mit gespeichertem Passwort), klickst aber nicht auf Verbinden, sondern fällst tot um.
Fla$her
Power Member
Power Member
Posts: 2244
Joined: 2020-01-18, 04:03 UTC

Re: Could you add xxHash (xxh3, the fastest hash algorithm)?

Post by *Fla$her »

ghisler(Author) wrote: 2022-06-01, 07:06 UTCAnother problem is that 99.999% of people have never heared of any of these hashes
This is not a problem for the tasks I have indicated with a significant TC acceleration during comparison and synchronization. If, in addition, computational options for constructing hash sums appear, then this will stick only to the choice of users, and the prevalence of these hashes, as well as once popular ones, will be associated exclusively with their use in services and popular applications like Total Commander. No hash would have become popular if it had not become widespread, thanks to well-known software. I think there's no point in waiting for others to do it sooner. ;)

OFF: Please answer in this topic.
Overquoting is evil! 👎
User avatar
nsp
Power Member
Power Member
Posts: 1803
Joined: 2005-12-04, 08:39 UTC
Location: Lyon (FRANCE)
Contact:

Re: Could you add xxHash (xxh3, the fastest hash algorithm)?

Post by *nsp »

Hacker wrote: 2022-06-01, 11:34 UTC nsp,
fclones which i use on linux is much faster than TC on same external drive to find duplicates. To me, this is probably due to efficiency of the hash used (murmurhash vs md5).
I don't believe a hash can be slower than reading from an external drive. If it is, I agree, it should not be used.

Roman
We use usb3 or ThunderBolt nvme SSD external drives (faster than the internal HD of my work laptop :()
You can compare yourself time used to calculate md5sum with TC on a big file and time used by FCHash for xxhash3.
User avatar
Hacker
Moderator
Moderator
Posts: 13052
Joined: 2003-02-06, 14:56 UTC
Location: Bratislava, Slovakia

Re: Could you add xxHash (xxh3, the fastest hash algorithm)?

Post by *Hacker »

Fla$her,
This is not a problem for the tasks I have indicated with a significant TC acceleration during comparison and synchronization.
The hashes still have to be reliable, i.e. well documented, tested and at least a bit time-proven.

nsp,
You can compare yourself time used to calculate md5sum with TC on a big file and time used by FCHash for xxhash3.
FcHash xxhash3 - 12 sec
TC Blake3 - 11 sec
TC MD5 - 29 sec

So it seems the best solution at least for now would be to switch to Blake3 for TC's internal functions?

Roman
Mal angenommen, du drückst Strg+F, wählst die FTP-Verbindung (mit gespeichertem Passwort), klickst aber nicht auf Verbinden, sondern fällst tot um.
Post Reply