## to hash or not to hash ?!

30 posts / 0 new
Author
Message

Hi freaks,

As anytime I have a question. This time about Hash functionallity.

Basically I want to calculate a hash value with 2 Bytes out of a 10 Byte value. Basically that would mean I will have a lot of collisions in my hash table if I would have all possible 10 Byte.
But I will only have 1000 of the 10 Byte values to hash. But how can I calculate how big the chance is that I will have 1000 same hash values or how big is the chance that I will have a lot less collisions? This cracks my brain.

Than there is the next question: I'm thinking about using the DJB Hash function: http://www.cse.yorku.ca/~oz/hash...
But this function is optimized to give back a 32 Bit Hash, while no one can explain why the 33 is working that good). As mentioned I would use a 16 Bit Hash. That means a cast to 16 Bit should do the job (or maybe better a modulo with a big prime (< 65535)), but will this break the well working algorithm?

Regards,
Baldrian
*brain is smoking*

it's a complex question to a complex problem.

if the hash function provides an even distribution on it's output for any input (or given subset thereof) your chance of collision (on a given input generating the same output) is simple division. Mathematically proving this can be quite difficult.

Second part (assuming your hash has an even distribution as in the first part) your chances for (an inttial)collision within a table is simply the range of the key (your hash) divided by the number of elements in the table. (assuming simple modulus selection) If your hash does not have an even distribution, then your chances of collision may increase, it will never decrease. If you aren't using simple modulus selection then your chances of collision will be different and will depend on the resultant distribution generated by your function for the given table size.

Also remember ot keep your table size prime, or at least "relatively" prime, or your search could end up in a loop. Personally I like 1021 for 1K tables, as it sits just below the 1024 binary boundary. For 1000, I'd select 997, or 1009 as they are the closet primes. By keeping the table size prime, you can guarantee that after [table-size] searches you have visited every possible entry in the table. IF the size is not prime, and your selector, or step function, are factors of the size you can end up in a loop always visiting the same entries in the table, missing all the others.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

Last Edited: Wed. Oct 28, 2009 - 05:16 PM

Baldrian wrote:
But I will only have 1000 of the 10 Byte values to hash. But how can I calculate how big the chance is that I will have 1000 same hash values or how big is the chance that I will have a lot less collisions? This cracks my brain.

Simply calculate the hash of your choice of all 1000 possible inputs, and compare them.

JW

What is sending these 10-byte arrays and what do they contain, and why only 1000 of them are in use?

If you know that, you may find out that for all 1000 different inputs you have no collisions.

Jepael wrote:
What is sending these 10-byte arrays and what do they contain, and why only 1000 of them are in use?

If you know that, you may find out that for all 1000 different inputs you have no collisions.

The 1000 inputs must be unique. The challenge is deciding where in the 1000 entry table the entry should be. ie, How does one map a 10byte (80bit) value to a 10bit address space?

For this to happen, some form of reduction must be done. If you can generate a function that will generate 1000 unique values [in the range of 0-999] for the 1000 10-byte arrays, then you're all set and there is no worry. But chances are that finding such a function will be extremely difficult.

Furthermore, I'm not convinced that there will only be 1000 unique inputs, only that at any given time only 1000 out of the 2^80 possible combinations needs to be computed on.

consider this simple example:
we have 1000 objects [ie players in a game, guests to a wedding etc]. Each object is identified with a 10byte handle (username, nickname, given name).

Now we need to somehow track some information about each of these objects.

Firstly, We could do our searches by brute force in O(n) time (assuming an unordered list), by scanning through each entry in the table until a match is found. Here insertions happen in O(1) time, but deletions could be costly, as the entries need to be re-grouped. Alternatively deletions could happen ion O(1) time, but then insertions require a search of up-to O(n) time to find an empty slot.

We could maintain an ordered list, thus allowing for improved searching O(log(n)) time. But the cost of insertion and deletion is heavy.

Or we can do it in approx O(1) time [for all operations] by using a hash, where the calculation based on the handle gives us the index of the entry in the table. If the identified location is not the correct entry, a secondary index step is used. Typically the correct entry is found very quickly, unless the table is very full, and the hash function generates a lot of collisions. Hash tables do work best when sparsely populated. So you will be giving up memory requirements for speed.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

glitch wrote:
it's a complex question to a complex problem.

Indeed indeed :-)

Thank you for all the answers. To make thinks a little bit more clear:
The 1000 elements of 10 Byte are containing keys. There will be 2^80 possible keys, but the software should only store 1000 allowed keys.
I want to calculate a 16-Bit hash, which means I have a hash table of n=65536, right?

But how big is the chance to get a best case (no collision) when having 1000 random keys out of 2^80 possible values and calculating a 2 Byte hash (n = 65535)?

Last Edited: Wed. Oct 28, 2009 - 09:51 PM

well given that you have a 2^80 space mapped into a 2^10 space your chance of collision is extremely high. To put it another way, you have only 1 in 2^70 chance of not colliding.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

glitch wrote:
well given that you have a 2^80 space mapped into a 2^10 space your chance of collision is extremely high. To put it another way, you have only 1 in 2^70 chance of not colliding.

Hmm. *thinking*

Last Edited: Wed. Oct 28, 2009 - 08:56 PM

we might be able to improve your odds a bit. What is the nature of the key? is it a true 8-bit/byte key, or is it encoded somehow (say ascii, bin-hex, base64 etc) If it's encoded we can decode it down to it's true size, reducing the size of the space you are mapping from.

If it is a true 8/bit per byte data stream, then your hashing might as well be a simple XOR of the bytes together, in 16bit pairs. As that result will be just as distributed as any other algorithm. The only difference being that a single bit in the input, will only change the output by a single bit. Whereas a true hash would alter many bits on the output for any single bit change on the input.

Also while your odds of not colliding are 1 in 2^70, the odds of being assigned 2 colliding keys is also about the same. Assuming a true random distribution.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

glitch wrote:

is it a true 8-bit/byte key,

Yes.

glitch wrote:

Also while your odds of not colliding are 1 in 2^70, the odds of being assigned 2 colliding keys is also about the same. Assuming a true random distribution.

Ok. You mean, taking two keys out of 2^80 will have a small chance to be a collision. But what about 1000 keys? There, it will be a big chance that two will have the same hash value. But how big is the chance that all 1000 will have the same hash value?

I think I will study again the chapter about calculus of probability in my math books. It is 10 years ago that I last time used it :lol:

forget the calculus, this is all about statistics and probability.

and I should have said that the odds of not being assigned 2 colliding keys is abut the same.

Also my math was a bit screwed up. Stepping back
your odds of getting a unique 16-bit hash from the 10byte key, is 1 in 2^64 [2^80 / 2^16 = 2^(80-16)]

The odds of getting a collision in your table using a 16bit unique key, in a 1000 entry table is about 1 in 65.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

Where is the 2^10 from (in your older posts)?

JW

 OK I see it was a mistake. Ignore please.[/edit]

That part wasn't a mistake, only the way I broke it down was.

The 2^10 is the 1000 (1024) size of the table. Combining the two probabilities gives the 1 in 2^70 I stated before.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

A simple technique is to EOR the 4 16bit parts of your key. Whatever function you choose, code up a PC simulation and evaluate the performance.

You will probably have to handle collisions regardless of the rate of their occurance.

C: i = "told you so";

yes collisions are unavoidable, it's just how often do they occur. In this case there's a 2^70 chance that a collision will occur, given any two keys, due to the small table space, and the huge key space.

I agree with the EOR solution (I suggested it earlier) I also agree with coding it out on a PC for testing.

Perhaps if we knew more about the actual end application we could offer some alternative solutions to solve the problem.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

Quote:
In this case there's a 2^70 chance
Since the hash is 2 bytes, the chance of collision, which is by definition between 0 and 1, would be 1/65536. But, given a hash table of size 1000, the probability is increased to 1/1000, assuming a well behaved (i.e. random) hash function.

Ideally, a hash table would be perhaps 10%-%25 filled, maximum, but given limited memory on a microcontroller, a 10,000 entry table is probably not an option.

Let's say you have the luxury that incorrectly deeming a key as already included is not expensive, then the limited capacity is simply a matter of somewhat reduced performance. Otherwise, a linear or binary search may be more appropriate.

C: i = "told you so";

Well, calculating a little bit gives me the result, that having a collision when having 1000 values out of 65536 is about 99.95% (See http://en.wikipedia.org/wiki/Bir... ). But there will be much lower chance to have i.e. 100 out of 1000 which have the same hash. This is good, because it will be still fast enough when there is a hash 100 times in my list.

Last Edited: Thu. Oct 29, 2009 - 10:57 AM

Why are you so worried about collisions?
It's quite normal that they happen and you
need some re-hashing. But if your table has 65536
entries of which only 1000 are used the hashing will
perform very well.

Baldrian wrote:
Well, calculating a little bit gives me the result, that having a collision when having 1000 values out of 65536 is about 99.95%

I have no idea how did you calculate that number.

I'd say, in agreement with glitch, it's approx. 1/65 which is around 1.5%. Quite a bit of difference.

JW

wek wrote:
Baldrian wrote:
Well, calculating a little bit gives me the result, that having a collision when having 1000 values out of 65536 is about 99.95%

I have no idea how did you calculate that number.

I'd say, in agreement with glitch, it's approx. 1/65 which is around 1.5%. Quite a bit of difference.

JW

1/65 is the chance that 1 of that 1000 has a specific value of the 65536. But having 2 of 1000 which can have the same value (a collision) out of 65536 is really high. I took the formula from the birthday problem:

If you have 23 people, how big is the chance that 2 have their birthday on the same day?
1 - 365!/((365-23)!*365^23)= 0.5 = 50%

Now in my case:
If you have 1000 values, how big is the chance that 2 have the same value in a range of 65536?
1 - 65536!/((65536-1000)!*65536^1000)= 0.9995 = 99.95%

But there is a really low chance that I will have 100 out of 1000 which all have the same hash. And now as this is clear to me and the rest of the concept is ready, I will have some nice days of coding to change the complete data management in the project. :-)

XOR hashing as mentioned before, lacks of good distibution. So I will go for an other hash algorithm.

Baldrian wrote:

1 - 65536!/((65536-1000)!*65536^1000)= 0.9995 = 99.95%

How did you calculate that?

JW

Jan,

(I remember first reading about the Birthday Problem in a book my Martin Gardner - my twin brother and I always screwed the odds for that one in classes we were in!)

wek" wrote:

How did you calculate that?

JW

Using Speed crunch (my TI-92 is at home) :D

clawson wrote:
Jan,

I know, but the raw math is wrong there, IMHO. I would like to know the method of calculating that figure.

JW

Baldrian wrote:
wek" wrote:

How did you calculate that?

JW

Using Speed crunch (my TI-92 is at home) :D

So, you hoped that a general-purpose calculator calculates 65536! accurately enough?

Let's see: http://speedcrunch.org/en_US/ind... says, "up to 50 decimal precisions". Do you think 65536! < 10^50?

JW

PS. Just found this: http://zayantecreek.biz/factoria... - more than 250000 digits... ;-)

PS2. [flame on] thinking about it, this is a good example of why using float on microcontrollers should be discouraged - it's (ab)used mostly by people who don't WANT to THINK, so it cannot be expected they will think about error cumulation etc.[/flame off]

---

Last Edited: Thu. Oct 29, 2009 - 03:31 PM

cpluscon wrote:
Quote:
In this case there's a 2^70 chance
Since the hash is 2 bytes, the chance of collision, which is by definition between 0 and 1, would be 1/65536. But, given a hash table of size 1000, the probability is increased to 1/1000, assuming a well behaved (i.e. random) hash function.

Ideally, a hash table would be perhaps 10%-%25 filled, maximum, but given limited memory on a microcontroller, a 10,000 entry table is probably not an option.

Let's say you have the luxury that incorrectly deeming a key as already included is not expensive, then the limited capacity is simply a matter of somewhat reduced performance. Otherwise, a linear or binary search may be more appropriate.

I think you'll find that my math works out. I used 1024 instead of 1000 to keep the calculations simpler As then everything is based on powers of two.

The input is 2^80
the output is 2^10
In the middle we have a 2^16 stage, but it can be ignored in an end-end calculation. (as the 6 bits lost going to the last stage, need to be added onto the result of first stage, to find the end-end probability)

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

wek wrote:

Let's see: http://speedcrunch.org/en_US/ind... says, "up to 50 decimal precisions". Do you think 65536! < 10^50?

JW

PS. Just found this: http://zayantecreek.biz/factoria... - more than 250000 digits... ;-)

So you mean: 51629485230975091650002279432724017478766918750846 and the rest of the digits which are calculated by maxima is not approximate the same as the result of speedcrunch which says:
65536! = 5,162948523097509165e287193
?
And 5.1629 E 287193 which is calculated by
http://www.luschny.de/math/facto...
is also not approximate the same as speedcrunch result of
5,162948523097509165e287193

hmm. OK.
:wink:

65536! is easily computed by adding all the logarithms from 1 to 65536, base 10. The result is 287193.712898, so taking the exponential results in 5.162949E+287193.

Further:

x = 65536!/((65536-1000)!*65536^1000)

logx = log(65536!) - log((65536-1000)!*65536^1000)

= log(65536!) - log((65536-1000)!) - log(65536^1000)

= log(65536!) - log((65536-1000)!) - 1000*log(65536)

= 287193.712898 - 282380.560015 - 4816.480

= -3.327117

So x = 4.708505e-004 and 1-x = 0.999529

C: i = "told you so";