
Polynomial Hashes

Published at 06:38 PM

Introduction

Hashing is a fundamental technique in computer science, widely used in data structures, cryptography, and algorithm optimization. Among various hashing methods, polynomial hashing stands out for its efficiency and effectiveness in handling strings and sequences. This technique leverages polynomial functions to generate hash values, making it particularly useful in problems like substring searching, rolling hash implementations, and fingerprinting.

In this article, we will explore the principles behind polynomial hashes, their mathematical foundations, and how they compare to standard hashing approaches. We will also discuss practical applications, potential pitfalls, and best practices to ensure robustness and minimal collisions in real-world scenarios.


Definition

A hash function is any function that can be used to map data of arbitrary size to fixed-size values, though there are some hash functions that support variable-length output (Wikipedia).

The values returned by a hash function are called hash values. Hash functions are used in data storage and retrieval applications to access data in a small and nearly constant time per request. In the worst case, they require O(n) operations, similar to searching for an element in a linked list.

Overview

A hash function can be considered to perform three operations (Wikipedia): converting variable-length keys into fixed-length values, scrambling the bits of the key so the results are uniformly distributed, and mapping the result onto the range of table indices.

A good hash function satisfies two basic properties: it should be fast to compute and should minimize duplication of output values, known as collisions.

When the set of stored keys K in a dictionary is much smaller than the universe of possible keys U, a hash table requires significantly less space than a direct-address table. More precisely, the space requirement can be reduced to Θ(|K|) while maintaining an average-case O(1) lookup time.

To achieve this, we use a hash function h: U → {0, 1, ..., m-1}. The function h maps the universe of keys U to the cells of the hash table T[0...m-1].

We say that an element with key k hashes into the cell h(k), where h(k) is called the hash value of the key k. The purpose of the hash function is to reduce the working range of array indices, allowing us to use an array of size m instead of the full range |U|.

To minimize collisions, the function h should be designed to behave as randomly as possible. However, h must be deterministic, meaning that for the same key k it must always produce the same hash value h(k). Since |U| > m, by the pigeonhole principle there must be at least two keys that share the same hash value.

Collision Resolution

Let’s consider different types of collision resolution in hash tables. The two most common approaches are chaining and open addressing.

Chaining

Each cell of the array H is a pointer to a linked list of key-value pairs whose keys share the same hash value. Collisions simply result in lists containing more than one element.

Searching for or deleting an element requires traversing all elements in the corresponding list to find the one with the given key. To add an element, it is inserted either at the beginning or the end of the corresponding list. If the load factor becomes too high, the size of the array H is increased, and the table is rebuilt.

Assuming that each element can be placed in any position of the table H with equal probability, independently of where any other element is placed, the average time complexity of a search operation is Θ(1 + α), where α is the load factor of the table.

Chaining handles collisions by using linked lists, which can grow large if the hash function leads to many collisions. However, it does not require resizing the hash table as frequently.
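
As a rough sketch, chaining can be implemented with an array of buckets, where each bucket is a Python list of key-value pairs. The class and method names here are illustrative, not a standard API:

```python
class ChainedHashTable:
    def __init__(self, m=8):
        self.m = m
        self.buckets = [[] for _ in range(m)]  # each cell holds a list of (key, value) pairs

    def _h(self, key):
        return hash(key) % self.m  # map the key into the index range 0..m-1

    def put(self, key, value):
        bucket = self.buckets[self._h(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # a collision simply grows the list

    def get(self, key):
        # traverse the bucket's list to find the pair with the given key
        for k, v in self.buckets[self._h(key)]:
            if k == key:
                return v
        return None
```

Even with only two buckets, lookups remain correct; collisions just make the per-bucket lists longer.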

Open Addressing

In the array H, the key-value pairs are stored directly. The insertion algorithm checks the cells of H in a certain order until it finds the first available slot, where the new element is then placed. This order is computed on the fly, saving memory by eliminating the pointers required by hash tables with chaining.

The search algorithm examines the hash table cells in the same order as during insertion until it either finds the element with the desired key or encounters an empty cell, indicating that the element is not present in the hash table.

Deletion in this scheme is somewhat more complicated. A common approach is to introduce a boolean flag for each cell, indicating whether the element has been deleted. Deleting an element then consists of setting this flag for the corresponding hash table cell. However, this requires modifying the search and insertion procedures: search must treat a deleted cell as occupied and continue probing past it, while insertion may treat it as free and reuse it.

Thus, in the open addressing method, the hash table can become completely full, making it impossible to insert new elements. As a consequence, the load factor α cannot exceed 1.

The sequence of probed cells depends on the key being inserted into the table. To determine the probed cells, we extend the hash function by including the probe number as a second argument. The hash function then takes the following form:

h: U × {0, 1, ..., m-1} → {0, 1, ..., m-1}

Open addressing, on the other hand, avoids the extra pointer overhead but requires careful handling of collisions, especially with high load factors.

Probe Sequences in Open Addressing

  1. Linear Probing. h(k, i) = (h(k) + i) mod m

Simple and efficient at a low load factor. Causes primary clustering, where runs of occupied cells form, slowing down insertions and searches.

  2. Quadratic Probing. h(k, i) = (h(k) + c₁i² + c₂i) mod m

Reduces primary clustering. Can still suffer from secondary clustering and may not probe all table slots unless the parameters c₁, c₂, and m are carefully chosen.

  3. Double Hashing. h(k, i) = (h₁(k) + i·h₂(k)) mod m, where h₁(k) = k mod m and h₂(k) = 1 + (k mod m')

Minimizes clustering and behaves like random probing. Requires two independent hash functions and careful parameter selection to ensure all slots can be probed.
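
A minimal sketch of open addressing with linear probing, including the deletion flag (tombstone) described above. The names are illustrative, not a standard API:

```python
DELETED = object()  # tombstone marker for deleted cells

class LinearProbingTable:
    def __init__(self, m=8):
        self.m = m
        self.slots = [None] * m  # None = never used, DELETED = tombstone

    def _probe(self, key):
        # linear probing: h(k, i) = (h(k) + i) mod m
        h = hash(key) % self.m
        for i in range(self.m):
            yield (h + i) % self.m

    def insert(self, key, value):
        first_tombstone = None
        for j in self._probe(key):
            slot = self.slots[j]
            if slot is None:
                # reuse an earlier tombstone if we passed one
                self.slots[j if first_tombstone is None else first_tombstone] = (key, value)
                return
            if slot is DELETED:
                if first_tombstone is None:
                    first_tombstone = j  # insertion may treat tombstones as free
            elif slot[0] == key:
                self.slots[j] = (key, value)  # overwrite existing key
                return
        if first_tombstone is not None:
            self.slots[first_tombstone] = (key, value)
            return
        raise RuntimeError("table is full")  # load factor reached 1

    def search(self, key):
        for j in self._probe(key):
            slot = self.slots[j]
            if slot is None:  # an empty cell means the key is absent
                return None
            if slot is not DELETED and slot[0] == key:
                return slot[1]  # search skips tombstones but keeps probing
        return None

    def delete(self, key):
        for j in self._probe(key):
            slot = self.slots[j]
            if slot is None:
                return
            if slot is not DELETED and slot[0] == key:
                self.slots[j] = DELETED  # set the flag; keep the probe chain intact
                return
```

Note how deleting an element in the middle of a probe chain must leave a tombstone; otherwise later elements in the same chain would become unreachable.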

Polynomial Hashes

A polynomial hash of a string s is a number h of the form:

h = c₁b^(m-1) + c₂b^(m-2) + … + cₘb^0

where b is a natural number, and c₁, c₂, …, cₘ are the characters (or their corresponding numerical values) of the string s. Here m is the length of the string, and each character is multiplied by a decreasing power of b, starting from b^(m-1).
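
For concreteness, a direct (unoptimized, modulus-free) computation of this sum might look like the following; the base b = 31 is an arbitrary choice for illustration:

```python
def poly_hash(s, b=31):
    # h = c_1*b^(m-1) + c_2*b^(m-2) + ... + c_m*b^0,
    # where c_i = ord(s[i-1]); values grow quickly without a modulus
    m = len(s)
    return sum(ord(c) * b ** (m - 1 - i) for i, c in enumerate(s))
```

For example, poly_hash("ab") is ord('a')·31 + ord('b') = 97·31 + 98 = 3105.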

Rolling Hash

A rolling hash function can efficiently compute the hash of each substring of a string by reusing previously computed values. Instead of recalculating the hash of the entire window (or substring) from scratch, the rolling hash eliminates the need for redundant computations by adjusting the hash when the window slides. This is done by removing the first character of the previous window from the hash calculation and adding the new character that enters the window.

h = (hₚ - Cₚ·b^(m-1))·b + Cₙ

where hₚ is the hash of the previous window, Cₚ is the character leaving the window, Cₙ is the character entering it, b is the base, and m is the window length.

Rolling hashes are especially useful in string matching problems, such as in the Rabin-Karp algorithm, where the goal is to find a substring within a larger string efficiently. By reusing the previously computed hash, rolling hashes allow us to compute the hash of each substring in constant time, making it an essential technique in efficient substring search algorithms.

By applying a rolling hash function, we can compute the hash of each sliding substring in linear time, rather than recalculating it from scratch for each new substring.

The result of all calculations should be taken modulo Q to avoid large hash values h and integer overflows. This is achieved at the cost of increased hash collisions, often referred to as false positives.

For the value of Q, a prime number is typically chosen, as large as possible without compromising arithmetic performance. The smaller the value of Q, the higher the probability of false positives.

The hash H of the entire string can be represented as:

H = (c₁b^(m-1) + c₂b^(m-2) + c₃b^(m-3) + … + cₘb^0) mod Q

In practice, we apply the modulo QQ operation at each step to prevent overflow. Without it, the values can grow excessively large, leading to inefficiency and possible overflow issues.
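
A sketch of this step-by-step reduction, with Q = 10^9 + 7 as an illustrative prime modulus:

```python
def poly_hash_mod(s, b=31, Q=10**9 + 7):
    m = len(s)
    h = 0
    for i, c in enumerate(s):
        # reduce modulo Q at every step so intermediate values stay small;
        # pow(b, e, Q) computes b^e mod Q without building huge integers
        h = (h + ord(c) * pow(b, m - 1 - i, Q)) % Q
    return h
```

With a small modulus such as Q = 100, poly_hash_mod("ab") yields 3105 mod 100 = 5, showing how information (and uniqueness) is lost as Q shrinks.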

However, within the loop, this formula requires two multiplications and one addition per character, which is inefficient. And if the modulus is applied only at the end, integer overflow is not prevented either, because large powers of b are summed before the reduction. To address both problems, we can use Horner's method.

Horner’s Method

Horner's method optimizes polynomial evaluation by reordering terms into a nested form, reducing the work to a single multiplication and addition per coefficient.

For example, instead of evaluating the polynomial in the form:

h = c₀ + c₁b + c₂b² + c₃b³

we can rewrite it as:

h = c₀ + b(c₁ + b(c₂ + bc₃))

By applying Horner’s method, we reduce the number of multiplications and additions. As a result, at each step of the loop, there is only one multiplication and one addition, which makes the computation more efficient and helps prevent overflow by keeping intermediate values within manageable bounds.
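
In the nested form, the loop body shrinks to one multiply and one add, and the string hash defined earlier can be computed like this (same illustrative b and Q as before):

```python
def horner_hash(s, b=31, Q=10**9 + 7):
    h = 0
    for c in s:
        # one multiplication and one addition per character, reduced mod Q;
        # the first character ends up multiplied by the highest power of b
        h = (h * b + ord(c)) % Q
    return h
```

This produces the same value as the power-by-power sum: horner_hash("ab") is again (97·31 + 98) mod Q = 3105.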

The improved polynomial hash using Horner’s method for each new sliding window can be computed as:

h = (b·(hₚ - Cₚ·b^(m-1)) + Cₙ) mod Q

This approach ensures that the modulus operation is applied after each step, keeping the intermediate values small and manageable, reducing the chances of overflow and minimizing the overall computation cost.

By applying Horner’s method, we optimize the polynomial evaluation process, reducing the number of operations and ensuring that we can efficiently calculate hash values for each sliding window in string matching tasks.
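
Putting the pieces together, here is a sketch of Rabin-Karp substring search built on the Horner-form rolling update; the function name and parameters are illustrative:

```python
def find_substring(text, pattern, b=31, Q=10**9 + 7):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return 0 if m == 0 else -1
    bm = pow(b, m - 1, Q)  # b^(m-1) mod Q, precomputed once

    def h(s):
        v = 0
        for c in s:
            v = (v * b + ord(c)) % Q  # Horner's method
        return v

    target, cur = h(pattern), h(text[:m])
    for i in range(n - m + 1):
        # equal hashes may still be false positives, so verify by direct comparison
        if cur == target and text[i:i + m] == pattern:
            return i
        if i + m < n:
            # slide the window: drop text[i], append text[i + m]
            cur = ((cur - ord(text[i]) * bm) * b + ord(text[i + m])) % Q
    return -1
```

The explicit comparison on a hash match is what turns probable matches into certain ones while keeping the expected running time linear.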

Final thoughts

In conclusion, polynomial hashes are a powerful tool for handling strings and sequences in various algorithms. By leveraging efficient techniques such as rolling hashes and Horner’s method, we can optimize string matching and data structure operations. Despite the potential for hash collisions, careful design of hash functions and collision resolution strategies can minimize these issues, making polynomial hashes a reliable choice in many applications.

Thanks for reading

If you enjoyed this post, be sure to follow me on Twitter to keep up with new content.



Nikita Vasilev

A software engineer with over 8 years of experience in the industry. Writes this blog and builds open source frameworks.

