The Magic of Hash Tables

Ryan Lai
12 min read · Jul 10, 2023


So you’ve spent a good few days indulging in arrays and linked lists through Medium articles & YouTube videos. You feel like you are slowly starting to grasp the foundational concepts of data structures. Well, allow me to take you one level higher into the world of hash tables and dictionaries. But first of all, it’s important to clarify the distinctions between these two terms.

Hash Tables vs Dictionaries

There is a prevailing misconception that the terms “hash table” and “dictionary” are interchangeable. This notion is fundamentally inaccurate, at least in the world of computer science.

A dictionary is a general concept for data structures that maps keys to values. A hash table, however, is a specific implementation of a dictionary.

In essence, dictionaries play the role of an overarching ‘cookie cutter’, shaping the abstract concept. Hash tables are simply a practical realization of that concept. Now that we’ve got that out of the way, let’s delve into the mechanics of a hash table and how it operates.

Basics of Hash Tables

Hash tables are one of the most widely used data structures in the programming world. In fact, you’ve probably used hash tables if you’ve done some basic programming. For example, in Python, the built-in “dictionary” data structure (not to be confused with the previously mentioned “conceptual” definition of a dictionary), actually uses hash tables under the hood. Here’s some code using these Python dictionaries.

# Create a dictionary of student grades
student_grades = {
    "John": 85,
    "Emily": 92,
    "Michael": 78,
    "Jessica": 95,
    "Ryan": 88
}

# Updating a value in the Python dictionary
student_grades["Michael"] = 82

# Adding a new entry to the Python dictionary
student_grades["Sarah"] = 90

# Removing an entry from the Python dictionary
del student_grades["Ryan"]

At a high level, a hash table is a data structure that implements a key & value look-up system. Since each value is associated with a key, we can perform data operations, such as insert or delete, on a hash table. It’s good to note that the key and value can basically be any data type. Although a string is usually used, pretty much anything will work, as long as you have a hash function.

Wait, what’s a hash function?

Let’s take a little detour to understand what a hash function is. To do this, we’ll explore a similar type of data structure that also stores a collection of keys and values: direct-access tables.

Direct-Access Tables

Similar to a hash table, a direct-access table also uses key-value pairs. Each key uniquely points to the index of a slot within a range of elements, which may or may not contain data.

Uh, wait… that kinda sounds exactly like an array…

And you would be pretty much correct! A direct-access table is simply a specific implementation of an array. As a matter of fact, hash tables are also another specific implementation of arrays: associative arrays, to be precise. Hold on, let me guess what you’re thinking…

Too. Many. Terms. I’m experiencing a term-tastrophe of epic proportions! My poor brain is melting at an average rate of 5cm³/second and threatening to go on strike. Save me from this jargon-induced calamity!

Okay, okay, I get it. Let me help you visualize the relationships between everything we’ve just talked about.

Note: If you want to be extremely categorical, there may be minor differences between the textbook definition of dictionaries and associative arrays. However, for the most part, you can consider them analogous.

In a direct-access table, the keys represent the indices associated with each slot. To truly understand this, let’s imagine that all the keys live in a singular “key universe”.

Each key in the “key universe” uniquely points to one slot in the array. This means that the size of the array is equal to the total number of keys in the “key universe”.

With direct-access tables, insert, delete, and search are all constant-time operations, or O(1). In other words, the time required to perform these operations will remain constant, regardless of the input size.
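To make the constant-time behavior concrete, here is a minimal sketch of a direct-access table. The class name and method names are my own for illustration; the key assumption (as in the text) is that keys are integers drawn from a fixed “key universe” of known size, so each key is used directly as an array index.

```python
class DirectAccessTable:
    """A minimal direct-access table: one slot per possible key.

    Assumes keys are integers in the range [0, universe_size).
    """

    def __init__(self, universe_size):
        # Allocate a slot for every key in the "key universe",
        # even though most slots may never hold data.
        self.slots = [None] * universe_size

    def insert(self, key, value):   # O(1): direct indexing
        self.slots[key] = value

    def search(self, key):          # O(1)
        return self.slots[key]

    def delete(self, key):          # O(1)
        self.slots[key] = None


table = DirectAccessTable(100)
table.insert(42, "maths")
assert table.search(42) == "maths"
```

Note how every operation is a single array access, with no looping: that is where the O(1) guarantee comes from.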

So why exactly do we need hash tables if direct-access tables are already so amazing?

Let’s take a look at some reasons why.

#1 Memory Usage

Remember how I previously mentioned that the slots in the array may or may not hold data? This means that the slots don’t necessarily need to contain a value and can instead be empty.

Remember, the keys point to the slots’ indices, not the values of the slots. This means slots don’t necessarily need to contain data.

Now why is this property important?

Consider a scenario where the “key universe” is vast. For demonstration purposes, let’s say it contains 10 billion keys. Since each key is linked to one slot in the array, the size of the array would also be 10 billion. Or in other words, memory would have to be consumed to accommodate 10 billion slots.

However, what if only 10 of those slots are actually being used? In this case, the array would have so many redundant slots that more than 99.9% of it would be empty, resulting in wasted memory and extreme inefficiency.

#2 Insufficient Memory

Let’s now imagine that we are allocated 10 units of memory for the creation of our direct-access table.

In computers, memory is also known as RAM. But let’s stick to “units of memory” for simplicity’s sake.

Due to the intrinsic nature of arrays, each slot takes up a certain amount of memory. For demonstration purposes, let’s say each slot takes up 2 units of memory. With just 5 keys in the “key universe”, we would already consume the entire memory budget:

2 units × 5 slots = 10 units

A sixth key would push us past the limit, and a direct-access table would no longer be a feasible option for data storage: there simply wouldn’t be enough memory to create the array with all the necessary slots. In other words, if the size of the “key universe” is larger than the maximum size of the array, the direct-access table cannot be used. So, how can we resolve this?

Enter hash tables.

An Introduction to Hash Tables (finally…)

With hash tables, both these problems are solved by the introduction of what’s known as a hash function. Think of a hash function as an intermediary between the keys and the slots in the table. Instead of directly using the keys as indices, the hash function transforms each key into a value called a “hashed output”, which determines the specific slot in which the key-value pair will be stored.

Note: In a hash table, keys can be of any data type, as long as they are ‘hashable’.

The hashed output is used as the index to determine where in the array the corresponding value should be stored.
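In Python, this key-to-slot mapping can be sketched in a single line. The function name `slot_index` is my own; it leans on Python’s built-in `hash()` and reduces the hashed output into the array’s index range with the modulo operator, which is a common (though not the only) way to do it.

```python
def slot_index(key, table_size):
    """Map any hashable key to a slot index in [0, table_size).

    Python's built-in hash() produces the "hashed output"; the
    modulo brings it into the range of valid array indices.
    """
    return hash(key) % table_size


# The same key always maps to the same slot within a program run...
assert slot_index("Emily", 8) == slot_index("Emily", 8)
# ...and the result is always a valid index.
assert 0 <= slot_index("Emily", 8) < 8
```

This determinism is essential: if the same key could land in different slots on different look-ups, we could never find our data again.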

Dynamic Memory Usage

Unlike a direct-access table where the array has a pre-defined size, a hash table can instead adapt its size based on the number of keys being stored. This dynamic allocation optimizes memory usage, allowing the hash table to handle larger ranges of keys without wasting unnecessary memory.

Collisions

Since the “key universe” now has no pre-defined limit to how large it can become, it has the potential to contain an infinite number of different keys. However, this openness introduces the possibility of multiple keys generating the same “hashed output,” causing different keys to end up being assigned to the same slot in the array. This is what’s known as a collision.

The red arrows indicate that two keys have been hashed to the same slot position.

Collisions are generally considered unfavorable for numerous reasons, including degraded performance, increased search time, and an uneven distribution of keys. In a few seconds, we’ll talk about how we can deal with these scenarios.

Good Hash Function

Selecting or designing a well-crafted hash function is crucial for the creation of an efficient hash table. We won’t get into all the nitty-gritty details here, but a good hash function tends to possess two key characteristics:

1. Fast computational runtime

Hash functions should be designed to generate the hash value for a given key quickly. By minimizing the time required to compute the hash value, the efficiency of insertion, deletion, and retrieval operations in the hash table can be significantly improved. This is particularly important when dealing with large datasets or time-sensitive applications.

2. Minimizes collisions by maximizing randomness

Another vital characteristic is the ability to generate distinct hash values for different keys as much as possible, minimizing the occurrence of collisions. A good hash function should be sensitive to even slight changes in the input key, resulting in a significant change in the computed hash value.

This property, known as the avalanche effect, ensures that small variations in the key produce vastly different hash outputs. By doing this, the likelihood of multiple keys generating the same hash output is significantly reduced.
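The avalanche effect is easy to observe with a cryptographic hash like SHA-256 (a much stronger hash than a typical hash-table function, but a vivid demonstration). Changing a single character of the input produces a completely different digest:

```python
import hashlib

# One changed character ("cat" -> "cab") completely changes the
# SHA-256 digest: the avalanche effect in action.
digest_cat = hashlib.sha256(b"cat").hexdigest()
digest_cab = hashlib.sha256(b"cab").hexdigest()

assert digest_cat != digest_cab
# Both digests are the same fixed length (64 hex chars = 256 bits),
# regardless of how long the input is.
assert len(digest_cat) == len(digest_cab) == 64
```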

Methods To Handle Collisions

Chaining

Chaining is a widely adopted method for handling collisions in hash tables. It involves the use of linked lists to transform each array slot into a chain of values.

Think of every ‘chain link’ as a distinct key-value pair, forming one connected chain that hangs from an array slot. In this analogy, the chain represents the linked list, while the individual ‘chain links’ symbolize the key-value pairs. When a collision occurs, the new key-value pair is simply appended to the chain at that slot, allowing for an efficient resolution for collisions.

Each ‘chain’ represents a linked list and each ‘chain link’ represents a node in the linked list.

Every node in the linked list contains a key-value pair. To retrieve a value from a chained slot, the hash function is used to compute the index, and then the linked list associated with that index is traversed until the desired key-value pair is found. The chaining approach ensures that even if multiple values are stored at the same index, they can be accessed individually by following the linked list structure.
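Here is a compact sketch of chaining. The class and method names are my own, and for brevity each “chain” is a Python list of (key, value) pairs rather than a hand-rolled linked list; the collision-handling logic is the same either way.

```python
class ChainedHashTable:
    """Hash table with separate chaining.

    Each slot (bucket) holds a chain of (key, value) pairs; a Python
    list stands in for the linked list described in the text.
    """

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                      # key already present: update it
                bucket[i] = (key, value)
                return
        bucket.append((key, value))           # new key (or collision): extend the chain

    def get(self, key):
        # Hash to the right bucket, then walk the chain for the key.
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)
```

For example, a table with only 2 buckets and 3 keys is guaranteed a collision, yet every value remains individually retrievable by walking its chain.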

Open Addressing

Open addressing is an alternative method used to handle collisions in hash tables. Unlike chaining, which utilizes linked lists, open addressing aims to store all key-value pairs directly within the array itself.

So how is this achieved?

In a nutshell, when two keys are hashed to the same array index, the algorithm navigates through the array to find the next available or “open” slot to store the key-value pair. This is done by applying a specific probing sequence, which determines the order in which the algorithm examines the array slots.

While all forms of open addressing avoid the use of linked lists, they can lead to clustering, where consecutive slots in the array become occupied, potentially causing longer probe sequences and decreased performance. We will see in just a second how this clustering effect can be alleviated.

In this diagram, the left red arrow represents a key that has been hashed to the same slot as an existing key-value pair “cat”. To resolve this collision by using open addressing, we move through the array until an empty slot is found.

The diagram above is an example of a type of open addressing called linear probing. Other notable probing sequences include quadratic probing and double hashing. Before we delve into the last two, let’s take a look at linear probing first.

1. Linear Probing

As we have already seen with linear probing, if a collision occurs at a particular slot, the algorithm checks every consecutive slot in the array and continues until an empty slot is encountered or the entire array has been traversed.

In the top example, a vacant slot is found after 3 steps to the right. In the bottom example, the entire array is traversed, starting from where the collision occurs at the ‘WJ931’ slot until it reaches back to the same spot, but in the end, no empty slots are found.

In the examples above, I began using integers instead of strings to re-emphasize the fact that the keys and values in a hash table can be any data type, as long as they are hashable.
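The linear probing steps described above can be sketched as a single insert function. The function name is my own; the table is a plain list where `None` marks an empty slot, and the probe wraps around the end of the array just like in the diagram.

```python
def linear_probe_insert(table, key, value):
    """Insert into an open-addressed table using linear probing.

    `table` is a fixed-size list of (key, value) pairs, with None
    marking an empty slot. Raises RuntimeError if the table is full.
    """
    size = len(table)
    start = hash(key) % size
    for step in range(size):              # visit each slot at most once
        i = (start + step) % size         # consecutive slots, wrapping around
        if table[i] is None or table[i][0] == key:
            table[i] = (key, value)       # empty slot found (or same key: update)
            return i
    # Entire array traversed, no empty slot: mirrors the bottom example above
    raise RuntimeError("table is full")
```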

2. Quadratic Probing

Similar to linear probing, quadratic probing is another method used to resolve collisions in open addressing. When a collision occurs at a specific slot, the algorithm does not simply check the consecutive slots. Instead, it follows a quadratic sequence to determine the next slot to be checked.

Instead of checking the consecutive slots as in linear probing, the algorithm follows a quadratic sequence. In this case, the algorithm successfully finds an empty slot after a few steps, resolving the collision.

By using this quadratic sequence, the algorithm explores slots that are further apart as the probe continues. This helps to distribute the values more evenly throughout the hash table, reducing the clustering effect that was mentioned previously.
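A quadratic probe sequence can be sketched as a generator. The function name is my own, and I use the simplest common form, (h(k) + i²) mod m, for the i-th probe; textbooks also use the more general (h(k) + c₁i + c₂i²) mod m.

```python
def quadratic_probe_sequence(key, table_size, max_probes):
    """Yield the slot indices examined under quadratic probing."""
    base = hash(key) % table_size            # the original hashed slot
    for i in range(max_probes):
        yield (base + i * i) % table_size    # offsets grow: 0, 1, 4, 9, ...
```

Because the offsets 0, 1, 4, 9, … grow quadratically, consecutive probes land progressively further apart, which is exactly what spreads the values out and reduces clustering.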

3. Double Hashing

In addition to linear and quadratic probing, another advanced open-addressing technique is double hashing. Double hashing introduces a secondary hash function. This secondary function is applied to the key itself, and its output determines the step size for the probe sequence.

On each probe, this step size is added to the current slot index, and the algorithm continues probing until an empty slot is found.

Observe how the collisions in examples 1 and 2 are resolved with different step sizes. This is because the secondary hash function calculates a different step size for each collision, therefore allowing it to find empty slots without extensive clustering.

The use of double hashing helps to distribute the keys more evenly across the hash table, once again reducing the likelihood of clustering. By incorporating a different hash function for determining the step size, double hashing provides a more varied probing sequence.
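The standard double-hashing probe sequence is indexᵢ = (h₁(k) + i·h₂(k)) mod m. Here is a sketch; the function name and the particular secondary hash are my own inventions for illustration. The one hard requirement is that h₂ never returns 0, or the probe would stall on a single slot.

```python
def double_hash_sequence(key, table_size, max_probes):
    """Yield slot indices under double hashing:
    index_i = (h1(k) + i * h2(k)) mod m."""
    h1 = hash(key) % table_size
    # A hypothetical secondary hash, derived from the key and
    # forced into [1, table_size - 1] so it is never zero.
    h2 = 1 + (hash(key) // table_size) % (table_size - 1)
    for i in range(max_probes):
        yield (h1 + i * h2) % table_size
```

Because the step size h₂(k) differs from key to key, two keys that collide at the same starting slot will generally walk different paths through the array, which is why double hashing resists clustering better than linear probing.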

Rehashing

At this point, we have explored various collision mitigation techniques including high-quality hash functions, chaining, and open addressing. The last one we will be discussing is rehashing.

When the hash table becomes too full or the number of collisions surpasses a certain threshold, rehashing is triggered. This process involves creating a new, larger hash table and recalculating the hash values for each element based on the new size.

When the original 20-slot hash table becomes too full, a ‘resize and rehash’ event is triggered. Specifically, when 12 or more slots are filled up, the elements are rehashed into a new 40-slot table. By rehashing, the elements are distributed more evenly, reducing the number of collisions and improving overall performance.

It’s important to note that rehashing can be a resource-intensive operation, as it requires allocating a new hash table, rehashing all the elements, and copying them to the new table. However, it is a necessary step in keeping a well-balanced hash table and maintaining the efficiency of retrieval and insertion operations.
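A rehash can be sketched as follows. The function name is my own; it rebuilds an open-addressed table (here resolving collisions with linear probing) into a larger array. Every entry must be re-inserted, because each element’s slot depends on the table size.

```python
def rehash(old_table, new_size):
    """Rebuild an open-addressed table into a larger array.

    `old_table` is a list of (key, value) pairs or None. All entries
    are re-inserted: slot positions change when the size changes.
    """
    new_table = [None] * new_size
    for entry in old_table:
        if entry is None:
            continue
        key, value = entry
        i = hash(key) % new_size
        while new_table[i] is not None:   # linear probing on collision
            i = (i + 1) % new_size
        new_table[i] = (key, value)
    return new_table
```

In the 20-slot example above, the trigger of “12 or more slots filled” corresponds to a load factor of 12/20 = 0.6; many real implementations rehash at a similar threshold.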

Applications Of Hashing

1. Message Digest

A message digest is a fixed-size numeric representation of the contents of a message. These ‘digested’ messages are basically hash outputs and are computed by a cryptographic hash function that only goes one way. This algorithm is designed to protect the integrity of transmitted messages.

One example of this can be found when downloading open-source software from the internet. Typically, along with the software, you’ll also find a companion file known as the signature. This signature represents the hash value of the original file, or in other words, the message digest. It plays a crucial role by allowing users to independently calculate the hash of the downloaded file and compare it with the provided signature. This process ensures that the downloaded file hasn’t been tampered with.
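The verification step works like this in practice, using Python’s standard `hashlib` (the helper function name and the sample bytes are my own; a real check would hash the downloaded file’s contents and compare against the publisher’s posted digest):

```python
import hashlib

def sha256_digest(data: bytes) -> str:
    """Compute the SHA-256 message digest of some data."""
    return hashlib.sha256(data).hexdigest()

# The publisher posts the digest of the genuine file...
published = sha256_digest(b"open-source installer bytes")

# ...and the user recomputes it on the downloaded copy.
downloaded = b"open-source installer bytes"
assert sha256_digest(downloaded) == published   # file is intact

# Any tampering, however small, yields a different digest.
assert sha256_digest(b"open-source installer bytez") != published
```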

2. Hashing Passwords

Another significant application of hash functions is in password storage. Have you ever wondered why, when you forget your password on a website and try to recover it, the site usually tells you to set a new password instead of simply reminding you of your original one? The reason behind this approach is that well-designed websites don’t store your actual password; instead, they store its hash value.

This security measure prevents unauthorized individuals from obtaining your password if they gain access to the website’s database. Since hash functions are typically one-way functions, it is nearly impossible to retrieve the original password from its hash value. Thus, the use of hashes provides an added layer of protection for user passwords.
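The one-way check can be sketched with Python’s standard library. The function names are my own; I use PBKDF2-HMAC-SHA256 with a random salt, which is one standard approach (production systems often prefer dedicated password-hashing libraries such as bcrypt or argon2):

```python
import hashlib
import os

def hash_password(password, salt=None):
    """One-way password hashing with a random salt."""
    if salt is None:
        salt = os.urandom(16)              # fresh random salt per password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode(), salt, 100_000
    )
    return salt, digest

def verify_password(password, salt, expected_digest):
    # Recompute the hash with the stored salt and compare digests;
    # the original password is never stored anywhere.
    return hash_password(password, salt)[1] == expected_digest
```

The site keeps only the salt and digest. At login, it re-hashes whatever you typed and compares; if the database leaks, attackers get digests, not passwords.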

To Summarize…

While a lot of data structures that are learned theoretically are never actually put into use much in real life, the same cannot be said for hash tables. With their seemingly magical efficiency, dynamic memory allocation capabilities, and versatility to various data types, hash tables take center stage among the tools every programmer and computer scientist should have under their belts. Understanding hash tables and gaining an appreciation for their advantages and limitations are pivotal milestones on your journey to mastering algorithms & data structures.

“So what’s the next step then?” you may ask.

Here’s a fun take-home challenge for you:

Take on the task of implementing your very own hash table in a programming language of your choice, without relying on any pre-made data structure libraries.

Good luck. Fare well. And until we meet again, may your programming journey continue to be filled with adventure and enlightenment.
