4 years ago by boulos
I feel like youâd want something a bit safer than âwe donât store the keys and just rely on the hash to be really goodâ [1], putting âplease do not use this for serious tasksâ in a comment embedded in the header file isnât a clear enough warning.
Itâs not clear to me that that probability of collision assumptions hold. Itâs basically assuming that the hashing is perfect and distributes any inputs to the full 64-bit space with uniform probability. Thatâs the usual hash map / randomized algorithm hope, but does BigCrush or similar avalanche testing really prove that? (Presumably not, otherwise there wouldnât be image attacks for things like md5).
[1] https://github.com/wangyi-fudan/wyhash/blob/d2a305811972f391...
4 years ago by CodesInChaos
> âwe donât store the keys and just rely on the hash to be really goodâ
I think that's fine if you actually use a really good hash (e.g. a 128-bit cryptographic PRF). But I wouldn't be comfortable with hashes shorter than that or which aren't crypto quality.
People already make such assumptions when assuming uniqueness v4 UUIDs.
4 years ago by graderjs
I agree that not storing keys is riskier, and it's not a risk that's mitigated even if it is a perfect uniform distribution because you can still have collisions. Maybe I don't understand the reasons for that design but personally I don't think it's a good design for a hash table. But even so the hash underlying it is a very good hash.
Big crush and smhasher give a good indication of uniform distribution, but nothing can guarantee no collisions because you're always going to have collisions it's always a possibility. Even if you have a perfect permutation of the 64 bit space, the minute you go beyond 64 bits of keys you're going to collide within 64 bits of hash.
By all the tests they run md5 is a much poorer hash than many others. But it's a or it was a cryptographic hash. It's different.
4 years ago by stingraycharles
To be fair 2^64 keys is a lot, to the point that it cannot possibly fit into memory with todayâs hardware.
4 years ago by hansvm
We're getting close. There's way more RAM than that in the world, and a quick search turned up a single machine from a few years ago that was only off by 160x or so.
Even without relying on the pigeonhole principle though you still have the birthday paradox. As a ballpark estimate, if you have N keys then you'd expect a 50/50 chance of at least one collision with sqrt(N) hashes -- 2^32 for this problem.
4 years ago by Bootvis
According to the Birthday paradox you can expect a collision after approx 2^32 keys. Whether this is small or big probably depends on your use case.
4 years ago by undefined
4 years ago by rurban
I do a similar thing with constant keys. Check the result with a simple memcmp/strcmp of the key for false positives. But only usable for constant perfect hashes.
64 bit is not good enough for that claim. 128 would be good, 256 perfect for production use. Regardless of the hash function quality. wyhash is a very good hash, the best and fastest portable one.
4 years ago by undefined
4 years ago by zucker42
Based on the comment in the source, the hash table doesn't handle hash collisions. And furthermore, I don't see the code which benchmarks this versus other hash tables, just the claim that it is 2-3x faster. Specifically, comparisons for different tables sizes using the same hash function would be good.
The API this presents is not really inspiring either.
4 years ago by undefined
4 years ago by laserbeam
Meow hash claims 3-4x faster hashing over this, still passes smhasher, and is a few years old. https://mollyrocket.com/meowhash
And there are a bunch of other good suggestions here in the comments looking at around the same 50-60gb/sec speed
4 years ago by wangyi_fudan
moew is not good at short keys. portability is also a concern
4 years ago by injinj
The meow 0.4 was faster at short keys, but failed at the smhasher "LongNeighborTest" [1]. However, doubling the AES rounds makes it pass that test. Two rounds is enough for full diffusion in AES [2]. I recently looked at computing 4 Meow keys per hash function [3], and found the speedup to be almost 2x in a microbench. That puts it in rare territory for hash speed.
[1] https://github.com/injinj/smhasher/
[2] Section 5.4 of Introduction to Cryptography by Trappe and Washington -- It can be shown that two rounds are sufficient to obtain full diffusion, namely, each of the 128 output bits depends on each of the 128 input bits.
[3] https://github.com/raitechnology/raikv/blob/3ce2b23e0d9853fe...
4 years ago by laserbeam
Indeed, regarding portability moew was definitely made with gamedev in mind, without concern for portability.
4 years ago by kouteiheika
According to this ahash is faster:
https://github.com/tkaitchuck/aHash/blob/master/compare/read...
Did anything change?
4 years ago by wangyi_fudan
I am also wondering whether ahash is faster. ahash is developed with rust. We need a head to head comparsion. However, it is not available in SMHasher package.
4 years ago by kouteiheika
Apparently there is a patch for the SMHasher here which adds support for ahash:
https://github.com/tkaitchuck/aHash/tree/master/smhasher
There are also ahash's own benchmarks here:
https://github.com/tkaitchuck/aHash/blob/master/compare/test...
They use the wyhash Rust crate, so if wyhash itself was updated doing a head to head comparison would boil down to updating the wyhash crate and rerunning ahash's benchmark suite.
4 years ago by rurban
Interestingly this patch never was submitted to me. Just saw it now. Will add it asap.
Very interesting is his claim to create wyhash collisions at will. Even with bad keys, not bad seeds!
4 years ago by wangyi_fudan
wyhash crate is not wyhash itself. It depends on the skill of the translator and the wyhash version. wyhash has been improved significantly version by version. Also ahash should submit a PR to smhasher and play with ~100 other hash functions there.
4 years ago by dietrichepp
I see my name in the "thanks" section, it's nice, but I don't know why I'm there.
4 years ago by soedirgo
Probably for your PR from way back when: https://news.ycombinator.com/item?id=19359870
4 years ago by icsa
With respect to wyrand, it seems that all prime numbers are not created equal. I implemented wyrand() using the two primes just below 2^64. The upper 53 bits of each 64-bit random deviate was used to generate a uniform [0,1) floating point deviate. The expected value of the sum of the uniform floating point deviates is 0.5 * #deviates. When using the two primes numbers above, the resulting value was 0.6430236 * #deviates - indicating significant bias in the random deviates generated by wyhash for those particular prime numbers.
4 years ago by wangyi_fudan
you sure make popcount(prime)==32
4 years ago by icsa
Thanks! Is that requirement in the documentation?
4 years ago by pansa2
See also âNew fastest portable hash: wyhashâ from 2019:
Daily digest email
Get a daily email with the the top stories from Hacker News. No spam, unsubscribe at any time.