Introduction
Hashing is a fundamental technique in computer science, widely used in data structures, cryptography, and algorithm optimization. Among various hashing methods, polynomial hashing stands out for its efficiency and effectiveness in handling strings and sequences. This technique leverages polynomial functions to generate hash values, making it particularly useful in problems like substring searching, rolling hash implementations, and fingerprinting.
In this article, we will explore the principles behind polynomial hashes, their mathematical foundations, and how they compare to standard hashing approaches. We will also discuss practical applications, potential pitfalls, and best practices to ensure robustness and minimal collisions in real-world scenarios.
Definition
A hash function is any function that can be used to map data of arbitrary size to fixed-size values, though some hash functions support variable-length output (Wikipedia).
The values returned by a hash function are called hash values. Hash functions are used in data storage and retrieval applications to access data in a small and nearly constant time per request. In the worst case, they require O(n) operations, similar to searching for an element in a linked list.
Overview
A hash function can be considered to perform three operations (Wikipedia):
- Converts variable-length keys into fixed-length values (usually no larger than a machine word) by combining them in parts using operations like addition (ADD) or bitwise XOR.
- Scrambles the bits of the key to ensure the resulting hash values are evenly distributed across the key space.
- Maps the hash values to a range that fits within the size of the hash table.
A good hash function satisfies two basic properties: it should be fast to compute and should minimize duplication of output values, known as collisions.
When the set K of stored keys in a dictionary is much smaller than the universe U of all possible keys, a hash table requires significantly less space than a direct-address table. More precisely, the space requirement can be reduced to Θ(|K|) while maintaining O(1) average lookup time.
To achieve this, we use a hash function h. The function h maps the universe of keys U to the cells of the hash table T[0..m − 1]:
h : U → {0, 1, …, m − 1}
We say that an element with key k is hashed into the cell h(k), where h(k) is called the hash value of the key k. The purpose of the hash function is to reduce the working range of array indices, allowing us to use an array of size m instead of the full range |U|.
To minimize collisions, the function h should be designed to behave as randomly as possible. However, h must be deterministic, meaning that for the same key k it should always produce the same hash value h(k). Since |U| > m, there must be at least two keys that share the same hash value.
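As a minimal sketch, a hash function over integer keys can be as simple as reduction modulo the table size m (the choice of modular reduction here is purely illustrative):

```python
# A toy hash function mapping keys from a large universe to m table cells.
def h(key: int, m: int) -> int:
    """Map an integer key to one of m table cells via modular reduction."""
    return key % m

m = 10
print(h(1234, m))  # 4
print(h(5674, m))  # 4 -- a collision: two distinct keys share a hash value
```

Because there are far more possible keys than cells, collisions like the one above are unavoidable; the next section covers how to resolve them.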
Collision Resolution
Let’s consider different types of collision resolution in hash tables. The two most common approaches are chaining and open addressing.
Chaining
Each cell of the array is a pointer to a linked list of key-value pairs that correspond to the same hash value of the key. Collisions simply result in lists containing more than one element.
Searching for or deleting an element requires traversing all elements in the corresponding list to find the one with the given key. To add an element, it is inserted either at the beginning or the end of the corresponding list. If the load factor becomes too high, the size of the array is increased and the table is rebuilt.
Assuming that each element can be placed in any position of the table with equal probability, independently of where any other element is placed, the average time complexity of a search operation is Θ(1 + α), where α = n/m is the load factor of the table.
Chaining handles collisions by using linked lists, which can grow large if the hash function leads to many collisions. However, it does not require resizing the hash table as frequently.
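The chaining scheme described above can be sketched as follows. This is an illustrative toy implementation, not production code: the table size is fixed (no rebuilding), and Python's built-in hash is an assumed stand-in for the hash function.

```python
# A minimal chaining hash table: each cell holds a list ("chain") of
# key-value pairs whose keys hash to that cell.
class ChainedHashTable:
    def __init__(self, m=8):
        self.m = m
        self.buckets = [[] for _ in range(m)]  # one chain per cell

    def _h(self, key):
        return hash(key) % self.m

    def insert(self, key, value):
        bucket = self.buckets[self._h(key)]
        for pair in bucket:
            if pair[0] == key:        # key already present: update in place
                pair[1] = value
                return
        bucket.append([key, value])   # otherwise append to the chain

    def search(self, key):
        for k, v in self.buckets[self._h(key)]:
            if k == key:
                return v
        return None                   # not found

    def delete(self, key):
        bucket = self.buckets[self._h(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False
```

A collision simply means two pairs end up in the same bucket list; correctness is unaffected, only the chain traversal gets longer.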
Open Addressing
In the array T, the key-value pairs are stored directly. The insertion algorithm checks the cells of T in a certain order until it finds the first available slot, where the new element is then placed. This order is computed on the fly, saving memory by eliminating the pointers required by hash tables with chaining.
The search algorithm examines the hash table cells in the same order as during insertion until it either finds the element with the desired key or encounters an empty cell, indicating that the element is not present in the hash table.
Deletion in this scheme is somewhat more complicated. A common approach is to introduce a boolean flag for each cell, indicating whether the element has been deleted. Deleting an element then consists of setting this flag for the corresponding hash table cell. However, this requires modifying the search and insertion procedures:
- The search procedure must treat deleted cells as occupied to ensure that elements inserted after a deleted cell can still be found.
- The insertion procedure must treat deleted cells as available and reset the deletion flag when adding a new element.
Thus, in the open addressing method, the hash table can become completely full, making it impossible to insert new elements. As a consequence, the load factor α cannot exceed 1.
The sequence of probed cells depends on the key being inserted into the table. To determine the probed cells, we extend the hash function by including the probe number as a second argument. The hash function then takes the following form:
h : U × {0, 1, …, m − 1} → {0, 1, …, m − 1}
Open addressing, on the other hand, avoids the extra pointer overhead but requires careful handling of collisions, especially with high load factors.
Probe Sequences in Open Addressing
- Linear Probing.
Simple and efficient for low load factors. Causes primary clustering, where groups of occupied cells form, slowing down insertions and searches.
- Quadratic Probing.
Reduces primary clustering. Can still suffer from secondary clustering and may not probe all table slots unless the parameters c1, c2, and m are carefully chosen.
- Double Hashing.
Minimizes clustering and behaves like random probing. Requires two independent hash functions and careful parameter selection to ensure all slots can be probed.
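The open-addressing operations described above, including deletion via a "deleted" flag (a tombstone), can be sketched with linear probing. This is an illustrative toy implementation: the table size is fixed, there is no resizing, and Python's built-in hash stands in for the hash function.

```python
# Open addressing with linear probing and tombstones.
DELETED = object()  # sentinel marking a deleted slot

class OpenAddressingTable:
    def __init__(self, m=8):
        self.m = m
        self.slots = [None] * m  # None = never used, DELETED = tombstone

    def _probe(self, key, i):
        # h(k, i): the i-th probe for key k (linear probing)
        return (hash(key) + i) % self.m

    def insert(self, key, value):
        first_deleted = None
        for i in range(self.m):
            j = self._probe(key, i)
            slot = self.slots[j]
            if slot is None:
                # reuse an earlier tombstone if we passed one
                if first_deleted is not None:
                    j = first_deleted
                self.slots[j] = (key, value)
                return
            if slot is DELETED:
                if first_deleted is None:
                    first_deleted = j  # deleted cells count as available
            elif slot[0] == key:
                self.slots[j] = (key, value)  # update existing key
                return
        if first_deleted is not None:
            self.slots[first_deleted] = (key, value)
            return
        raise RuntimeError("hash table is full")  # load factor reached 1

    def search(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            slot = self.slots[j]
            if slot is None:
                return None  # empty cell: key is absent
            # deleted cells are treated as occupied and probed past
            if slot is not DELETED and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            slot = self.slots[j]
            if slot is None:
                return False
            if slot is not DELETED and slot[0] == key:
                self.slots[j] = DELETED  # keep the probe chain intact
                return True
        return False
```

Note how search probes past tombstones rather than stopping at them, while insertion may reuse them; this is exactly the flag discipline described above.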
Polynomial Hashes
A polynomial hash of a string s is a number of the form:
hash(s) = s[0] · p^(n−1) + s[1] · p^(n−2) + … + s[n−2] · p + s[n−1]
where p is a natural number and s[0], s[1], …, s[n−1] are the characters (or their corresponding numerical values) of the string s. Here n is the length of the string, and each character is multiplied by a decreasing power of p, starting from p^(n−1).
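A direct computation of this polynomial might look as follows. The base p = 31 and modulus 10^9 + 7 are common illustrative choices, not values prescribed above:

```python
# Direct polynomial hash: sum of character codes times decreasing powers of p.
def poly_hash(s: str, p: int = 31, mod: int = 10**9 + 7) -> int:
    n = len(s)
    h = 0
    for i, ch in enumerate(s):
        # s[i] is multiplied by p^(n-1-i); pow(..., mod) keeps values small
        h = (h + ord(ch) * pow(p, n - 1 - i, mod)) % mod
    return h

print(poly_hash("ab"))  # ord('a')*31 + ord('b') = 97*31 + 98 = 3105
```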
Rolling Hash
A rolling hash function can efficiently compute the hash of each substring of a string by reusing previously computed values. Instead of recalculating the hash of the entire window (or substring) from scratch, the rolling hash eliminates the need for redundant computations by adjusting the hash when the window slides. This is done by removing the first character of the previous window from the hash calculation and adding the new character that enters the window.
H_new = (H_old − s[i] · p^(n−1)) · p + s[i+n]
where:
- H_old is the previous hash,
- s[i] is the previous character (leaving the window),
- s[i+n] is the new character (entering the window).
Rolling hashes are especially useful in string matching problems, such as in the Rabin-Karp algorithm, where the goal is to find a substring within a larger string efficiently. By reusing the previously computed hash, rolling hashes allow us to compute the hash of each substring in constant time, making it an essential technique in efficient substring search algorithms.
By applying a rolling hash function, we can compute the hash of each sliding substring in linear time, rather than recalculating it from scratch for each new substring.
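A rolling hash over all length-n windows of a text can be sketched as follows (p and the modulus are illustrative choices, and the modulus is applied at every step):

```python
# Rolling hash: compute the hash of every length-n window of t in linear time.
def rolling_hashes(t: str, n: int, p: int = 31, mod: int = 10**9 + 7):
    """Yield the hash of each length-n window of t, reusing previous values."""
    if len(t) < n:
        return
    p_top = pow(p, n - 1, mod)  # p^(n-1), weight of the character that leaves
    h = 0
    for ch in t[:n]:            # hash of the first window
        h = (h * p + ord(ch)) % mod
    yield h
    for i in range(len(t) - n):
        h = (h - ord(t[i]) * p_top) % mod   # remove the leaving character
        h = (h * p + ord(t[i + n])) % mod   # shift and add the entering one
        yield h
```

For example, in "abcabc" with n = 3 the first and last windows are both "abc", so their hashes coincide; each window after the first costs only O(1) to update.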
The result of all calculations should be taken modulo M to avoid large hash values and integer overflows. This comes at the cost of increased hash collisions, often referred to as false positives.
For the value of M, a prime number is typically chosen, as large as possible without compromising arithmetic performance. The smaller the value of M, the higher the probability of false positives.
The hash value of the entire string can be represented as:
hash(s) = (s[0] · p^(n−1) + s[1] · p^(n−2) + … + s[n−1]) mod M
In practice, we apply the modulo operation at each step to prevent overflow. Without it, the values can grow excessively large, leading to inefficiency and possible overflow issues.
However, within the loop, this formula requires two multiplications and one addition per character, which is inefficient. It also doesn't prevent integer overflow, because large terms are summed before the modulus operation is applied. To address this problem, we can use Horner's method.
Horner’s Method
Horner’s method optimizes polynomial evaluation by reordering terms into a nested form, reducing the evaluation to a single multiplication and a single addition per term.
For example, instead of evaluating the polynomial in the form:
s[0] · p^(n−1) + s[1] · p^(n−2) + … + s[n−1]
we can rewrite it as:
(…((s[0] · p + s[1]) · p + s[2]) · p + …) · p + s[n−1]
By applying Horner’s method, we reduce the number of multiplications and additions. As a result, at each step of the loop, there is only one multiplication and one addition, which makes the computation more efficient and helps prevent overflow by keeping intermediate values within manageable bounds.
The improved polynomial hash using Horner’s method for each new sliding window can be computed as:
H_new = ((H_old − s[i] · p^(n−1)) · p + s[i+n]) mod M
This approach ensures that the modulus operation is applied after each step, keeping the intermediate values small and manageable, reducing the chances of overflow and minimizing the overall computation cost.
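The Horner-style evaluation of a single window's hash can be sketched as follows (p and the modulus are the same illustrative choices as before):

```python
# Horner's method: one multiplication and one addition per character,
# with the modulus applied at each step to keep intermediate values small.
def horner_hash(s: str, p: int = 31, mod: int = 10**9 + 7) -> int:
    h = 0
    for ch in s:
        # nested evaluation: ((s[0]*p + s[1])*p + s[2])*p + ...
        h = (h * p + ord(ch)) % mod
    return h

print(horner_hash("ab"))  # (0*31 + 97)*31 + 98 = 3105
```

This produces exactly the same value as the direct polynomial sum, but never forms the large powers of p explicitly.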
By applying Horner’s method, we optimize the polynomial evaluation process, reducing the number of operations and ensuring that we can efficiently calculate hash values for each sliding window in string matching tasks.
Final thoughts
In conclusion, polynomial hashes are a powerful tool for handling strings and sequences in various algorithms. By leveraging efficient techniques such as rolling hashes and Horner’s method, we can optimize string matching and data structure operations. Despite the potential for hash collisions, careful design of hash functions and collision resolution strategies can minimize these issues, making polynomial hashes a reliable choice in many applications.
Thanks for reading
If you enjoyed this post, be sure to follow me on Twitter to keep up with the new content.