Collision Handling

What is a Collision?

Since a hash function gets us a small number for a key which is a big integer or string, there is a possibility that two different keys result in the same hashed value. The situation where a newly inserted key maps to an already occupied slot in the hash table is called a collision and must be handled using some collision handling technique.

What are the chances of collisions with a large table?

Collisions are very likely even if we have a big table to store keys. An important observation regarding this is the Birthday Paradox. With only 23 persons in a room, the probability that two people share the exact same birthday is 50%. This demonstrates how quickly collisions can occur even in a relatively large sample space (365 days).

Collision Handling Techniques

Since a hash function gets us a small number for a big key, there is a possibility that two keys result in the same value. Following are the two main ways to handle collisions:

1. Chaining

The idea of Chaining is to make each cell of the hash table point to a linked list of records that have the same hash function value.

Pros: Chaining is simple to implement and the hash table never fills up completely (we can always add more elements to the linked list).
Cons: It requires additional memory outside the table to store the pointers/nodes of the linked list.

2. Open Addressing

In Open Addressing, all elements are stored in the hash table itself. Each table entry contains either a record or NIL. When searching for an element, we examine the table slots one by one until the desired element is found or it is clear that the element is not present in the table.

Pros: It does not use external data structures or pointers, saving space and improving cache performance.
Cons: The hash table can fill up, and as it gets fuller, the performance degrades heavily due to clustering.

(Note: Common probing techniques used in Open Addressing include Linear Probing, Quadratic Probing, and Double Hashing).

Chaining vs Open Addressing

S.No.	Separate Chaining	Open Addressing
1.	Chaining is simpler to implement.	Open Addressing requires more computation.
2.	In chaining, the hash table never fills up, we can always add more elements to the chain.	In open addressing, the table may become full.
3.	Chaining is less sensitive to the hash function or load factors.	Open addressing requires extra care to avoid clustering and load factor.
4.	Chaining is mostly used when it is unknown how many and how frequently keys may be inserted or deleted.	Open addressing is used when the frequency and number of keys is known.
5.	Cache performance of chaining is not good as keys are stored using a linked list.	Open addressing provides better cache performance as everything is stored in the same table.
6.	Wastage of Space (Some parts of the hash table in chaining are never used).	In Open addressing, a slot can be used even if an input doesn't map to it.
7.	Chaining uses extra space for links.	No links in Open addressing.

Note on Cache Performance: Cache performance of chaining is not good because when we traverse a Linked List, we are basically jumping from one node to another, all across the computer's memory. For this reason, the CPU cannot cache the nodes which aren't visited yet. But with Open Addressing, data isn't spread out, so if the CPU detects that a segment of memory is constantly being accessed, it gets cached for quick access.

Hashing Functions Implementation of Chaining