Hashing
\boxed{
\text{In a Nutshell: } \text{Key} \to \text{Hash Code} \to \text{Hash Index}
}
Why?: Hashing lets us turn search keys into integer indices that can be accessed at O(1)
- Hashing is a data-dependent problem, there’s no one “right” way to do it.
Hash Function: Takes a search key and produces the integer index of an element in the hash table.
- Search key is mapped—or hashed—to the index.
- The same key must have the same hash during execution.
- Collision: When two or more keys have the same hash.
Remember: Override the Object class’s default hashcode when necessary!
- The default
hashcode function just returns an object’s memory address. - Example: The
String and Integer class already implement their own hashcode function.
Real-world Example: CRC
- Checking if a message has been modified by its CRC value.
- e.g., JPL uses CRC to determine whether messages sent by Pioneer are garbled.
- Error-correction codes in file systems.
Hash Code and Hash Index
Typical Hashing:
- Convert search key to integer (hash code)
- Compress code into range of indices for hash table.
Hash Code v.s. Hash Index
- Hash Code: The integer result of the hash function.
- Hash Index: The hash code % Table Length
- (This is what we use as the search key.)
- aka: Compressing
Good Hash Functions
- Minimize Collisions
- Are fast to compute.
Note: Data Structure for Hash Tables
- The only data structure compatible with hash tables are arrays, because random accesses are O(1)
Example: Ideal Hashing
Function add(key,value) {
index=h(key);
hashTable[index]=value;
}
- Supposing that we have perfect hashing (no collision).
Computing Hash Codes
Example: Hashing a Unicode string
- Naive Approach: Sum every character’s Unicode integer value.
- Another Approach: Multiply each character’s integer value by its position in the String and sum the values.
u_0 g^{n-1} +
u_1 g^{n-2} +
u^{n-2} g +
u_{n-1}
int hash = 0;
int n = s.length();
for (int i = 0; i < n; i++) {
hash = g * hash + s.charAt(i);
}
- (The g constant was arrived at statistically)
Hash Code for Primitives:
- Use the key itself if data type is
int Byte, short, char: Cast to int- Other primitives: Manipulate internal binary representations.
Compressing a Hash Code (Scaling Method)
After getting a hash code, we need to compress it into a usable hash index.
- One way to do this is to mod the hash code by the length of the hash table.
\text{Scaling Method: Hash Code} \% n
- Best to use an odd number for n
- Prime number often gives good distribution of hash values.
private int getHashIndex(K key) {
int hashIndex = key.hashCode() % hashTable.length;
if (hashIndex < 0) {
hashIndex = hashIndex + hashTable.length;
}
return hashIndex;
}
- Note how we handle negative hash codes!
Important: Resizing a Hash Table
- Because the hash index is dependent on the size of the hash table, resizing the hash table isn’t a trivial copy-by-value to be bigger array—you need to re-hash the values.
Handling Collisions
Reducing Collisions:
- Improve the hash function
- e.g., distributing entries uniformly throughout the hash table.
- (usually requires statistical analysis)
- Increase table size.
Handling Collisions:
- Use another location in the hash table
- Change the structure of the hash tale so that each location can represent more than one value.
- (this introduces searching and a whole slew of problems like deleting entries)
Three Kinds of Locations in a Hash Table:
- Occupied
- Location references an entry in the dictionary
- Empty
- Location contains
null and always has been.
- Available
- Location was previously occupied, but is now available.
A. Linear Probing
Linear Probing: Resolves collision during hashing by examining consecutive locations in hash table (looping back around if we reach the end)
- Basically just probing the next location until we find an empty location.
- If we make it back to the starting location, we know that the table is full.
- (We must differentiate between empty and available locations)
Note: This resolves collision during additions, but complicates removals.
Clustering: Collisions resolved with linear probing cause groups of consecutive locations in hash table to be occupied, called clusters.
- Bigger clusters mean longer search times following collision.
B. Open Addressing with Quadratic Probing
Quadratic Probing: Considers the locations at indices k + j^j, rather than going consecutively from k.
- e.g., probing k, k+1, k+4, etc.
C. Open Addressing with Double Hashing
Double Hashing: Use a second hash function to compute increments.
D. Separate Chaining
Separate Chaining: Track collisions with linked list.
- Bucket: Each location is called a bucket, it can represent more than one value.
More on Implementation
- Successful retrieval/removal has same efficiency as successful search.
- Unsuccessful retrieval/removal has same efficiency as unsuccessful search.
- Successful addition has same efficiency as unsuccessful search.
- Unsuccessful addition has same efficiency as successful search.
Load Factor (\lambda)
Load Factor: Measure of cost of collision resolution
\lambda = \frac{\text{Num. of entries in dictionary}}{\text{Num. of locations in hash table}}
- Never negative.
- For open addressing, 1 \ge \lambda
- For separate chaining, \lambda has no maximum.
- Reducing \lambda improves performance.
Maintaining Performance:
- < 0.5 for open addressing
- < 1.0 for separate chaining
- If load factor exceeds these bounds, increase the size of the hash table.
Rehashing:
- Compute new size
- Double present size
- Increase result to next prime number
- Use add to add current entries in dictionary to new table.
More Advanced Entries
Example: Another private inner-class to encapsulate
key, value pairs along with state.
private static class TableEntry<K,V> {
private K key;
private V value;
private States state;
private enum States { CURRENT, REMOVED }
private TableEntry(K key, V value) {
this.key = key;
this.value = value;
this.state = States.CURRENT;
}
}
- Not necessary if we’re doing linear probing.
Java Class Library: The Class HashMap
- Variety of constructors provided.
- Default maximum load factor of 0.75
- Increases size when limit exceeded.
- Possible to avoid rehashing by setting number of buckets larger.
Java Class Library: The Class HashSet
- Implements
java.util.Set interface - Uses an instance of
HashMap