Hacker News

4 years ago by boulos

I feel like you’d want something a bit safer than “we don’t store the keys and just rely on the hash to be really good” [1], putting “please do not use this for serious tasks” in a comment embedded in the header file isn’t a clear enough warning.

It’s not clear to me that that probability of collision assumptions hold. It’s basically assuming that the hashing is perfect and distributes any inputs to the full 64-bit space with uniform probability. That’s the usual hash map / randomized algorithm hope, but does BigCrush or similar avalanche testing really prove that? (Presumably not, otherwise there wouldn’t be image attacks for things like md5).

[1] https://github.com/wangyi-fudan/wyhash/blob/d2a305811972f391...

4 years ago by CodesInChaos

> “we don’t store the keys and just rely on the hash to be really good”

I think that's fine if you actually use a really good hash (e.g. a 128-bit cryptographic PRF). But I wouldn't be comfortable with hashes shorter than that or which aren't crypto quality.

People already make such assumptions when assuming uniqueness v4 UUIDs.

4 years ago by graderjs

I agree that not storing keys is riskier, and it's not a risk that's mitigated even if it is a perfect uniform distribution because you can still have collisions. Maybe I don't understand the reasons for that design but personally I don't think it's a good design for a hash table. But even so the hash underlying it is a very good hash.

Big crush and smhasher give a good indication of uniform distribution, but nothing can guarantee no collisions because you're always going to have collisions it's always a possibility. Even if you have a perfect permutation of the 64 bit space, the minute you go beyond 64 bits of keys you're going to collide within 64 bits of hash.

By all the tests they run md5 is a much poorer hash than many others. But it's a or it was a cryptographic hash. It's different.

4 years ago by stingraycharles

To be fair 2^64 keys is a lot, to the point that it cannot possibly fit into memory with today’s hardware.

4 years ago by hansvm

We're getting close. There's way more RAM than that in the world, and a quick search turned up a single machine from a few years ago that was only off by 160x or so.

Even without relying on the pigeonhole principle though you still have the birthday paradox. As a ballpark estimate, if you have N keys then you'd expect a 50/50 chance of at least one collision with sqrt(N) hashes -- 2^32 for this problem.

4 years ago by Bootvis

According to the Birthday paradox you can expect a collision after approx 2^32 keys. Whether this is small or big probably depends on your use case.

4 years ago by undefined

[deleted]

4 years ago by rurban

I do a similar thing with constant keys. Check the result with a simple memcmp/strcmp of the key for false positives. But only usable for constant perfect hashes.

64 bit is not good enough for that claim. 128 would be good, 256 perfect for production use. Regardless of the hash function quality. wyhash is a very good hash, the best and fastest portable one.

4 years ago by undefined

[deleted]

4 years ago by zucker42

Based on the comment in the source, the hash table doesn't handle hash collisions. And furthermore, I don't see the code which benchmarks this versus other hash tables, just the claim that it is 2-3x faster. Specifically, comparisons for different tables sizes using the same hash function would be good.

The API this presents is not really inspiring either.

4 years ago by undefined

[deleted]

4 years ago by laserbeam

Meow hash claims 3-4x faster hashing over this, still passes smhasher, and is a few years old. https://mollyrocket.com/meowhash

And there are a bunch of other good suggestions here in the comments looking at around the same 50-60gb/sec speed

4 years ago by wangyi_fudan

moew is not good at short keys. portability is also a concern

4 years ago by injinj

The meow 0.4 was faster at short keys, but failed at the smhasher "LongNeighborTest" [1]. However, doubling the AES rounds makes it pass that test. Two rounds is enough for full diffusion in AES [2]. I recently looked at computing 4 Meow keys per hash function [3], and found the speedup to be almost 2x in a microbench. That puts it in rare territory for hash speed.

[1] https://github.com/injinj/smhasher/

[2] Section 5.4 of Introduction to Cryptography by Trappe and Washington -- It can be shown that two rounds are sufficient to obtain full diffusion, namely, each of the 128 output bits depends on each of the 128 input bits.

[3] https://github.com/raitechnology/raikv/blob/3ce2b23e0d9853fe...

4 years ago by laserbeam

Indeed, regarding portability moew was definitely made with gamedev in mind, without concern for portability.

4 years ago by wangyi_fudan

I am also wondering whether ahash is faster. ahash is developed with rust. We need a head to head comparsion. However, it is not available in SMHasher package.

4 years ago by kouteiheika

Apparently there is a patch for the SMHasher here which adds support for ahash:

https://github.com/tkaitchuck/aHash/tree/master/smhasher

There are also ahash's own benchmarks here:

https://github.com/tkaitchuck/aHash/blob/master/compare/test...

They use the wyhash Rust crate, so if wyhash itself was updated doing a head to head comparison would boil down to updating the wyhash crate and rerunning ahash's benchmark suite.

4 years ago by rurban

Interestingly this patch never was submitted to me. Just saw it now. Will add it asap.

Very interesting is his claim to create wyhash collisions at will. Even with bad keys, not bad seeds!

4 years ago by wangyi_fudan

wyhash crate is not wyhash itself. It depends on the skill of the translator and the wyhash version. wyhash has been improved significantly version by version. Also ahash should submit a PR to smhasher and play with ~100 other hash functions there.

4 years ago by dietrichepp

I see my name in the "thanks" section, it's nice, but I don't know why I'm there.

4 years ago by soedirgo

Probably for your PR from way back when: https://news.ycombinator.com/item?id=19359870

4 years ago by icsa

With respect to wyrand, it seems that all prime numbers are not created equal. I implemented wyrand() using the two primes just below 2^64. The upper 53 bits of each 64-bit random deviate was used to generate a uniform [0,1) floating point deviate. The expected value of the sum of the uniform floating point deviates is 0.5 * #deviates. When using the two primes numbers above, the resulting value was 0.6430236 * #deviates - indicating significant bias in the random deviates generated by wyhash for those particular prime numbers.

4 years ago by wangyi_fudan

you sure make popcount(prime)==32

4 years ago by icsa

Thanks! Is that requirement in the documentation?

4 years ago by pansa2

See also “New fastest portable hash: wyhash” from 2019:

https://news.ycombinator.com/item?id=19357895

Daily digest email

Get a daily email with the the top stories from Hacker News. No spam, unsubscribe at any time.