Hash Tables

Problems

13.1

A palindrome is a string that reads the same forwards and backwards, e.g., “level”, “rotator”, and “foobaroof”.

Write a program to test whether the letters forming a string can be permuted to form a palindrome. For example, “edified” can be permuted to form “deified”.

Hint: Find a simple characterization of strings that can be permuted to form a palindrome.

13.2

Write a program which takes text for an anonymous letter and text for a magazine and determines if it is possible to write the anonymous letter using the magazine. The anonymous letter can be written using the magazine if for each character in the anonymous letter, the number of times it appears in the anonymous letter is no more than the number of times it appears in the magazine.

Hint: Count the number of distinct characters appearing in the letter.

13.3

The International Standard Book Number (ISBN) is a unique commercial book identifier. It is a string of length 10. The first 9 characters are digits; the last character is a check character. The check character is the sum of the first 9 digits, modulo 11, with 10 represented by X. (Modern ISBNs use 13 digits, and the check digit is taken modulo 10; this problem is concerned with 10-digit ISBNs.)

Create a cache for looking up prices of books identified by their ISBN. You implement lookup, insert, and remove methods. Use the Least Recently Used (LRU) policy for cache eviction. If an ISBN is already present, insert should not change the price, but it should update that entry to be the most recently used entry. lookup should also update that entry to be the most recently used entry.

Hint: Amortize the cost of deletion. Alternatively, use an auxiliary data structure.

13.4

Problem 10.4 is concerned with computing the LCA in a binary tree with parent pointers in time proportional to the height of the tree. The algorithm presented there entails traversing all the way to the root even if the nodes whose LCA is being computed are very close to their LCA.

Design an algorithm for computing the LCA of two nodes in a binary tree. The algorithm’s time complexity should depend only on the distance from the nodes to the LCA.

Hint: Focus on the extreme case described in the problem introduction.

13.5

You are given a log file containing search queries. Each query is a string, and queries are separated by newlines. Diverse applications, such as autocompletion and trend analysis, require computing the most frequent queries. In the abstract, you are to solve the following problem.

You are given an array of strings. Compute the k strings that appear most frequently in the array.

Hint: Consider extreme values for k, as well as scenarios where there are a relatively small number of distinct strings.

13.6

People do not like reading text in which a word is used multiple times in a short paragraph. You are to write a program which helps identify such a problem.

Write a program which takes as input an array and finds the distance between a closest pair of equal entries. For example, if s = ⟨"All", "work", "and", "no", "play", "makes", "for", "no", "work", "no", "fun", "and", "no", "results"⟩, then the second and third occurrences of "no" are the closest pair.

Hint: Each entry in the array is a candidate.

13.7

When you type keywords in a search engine, the search engine will return results, and each result contains a digest of the web page, i.e., a highlighting within that page of the keywords that you searched for.

Write a program which takes an array of strings and a set of strings, and returns the indices of the starting and ending index of a shortest subarray of the given array that “covers” the set, i.e., contains all strings in the set.

Hint: What is the maximum number of minimal subarrays that can cover the query?

Variant: Given an array A, find a shortest subarray A[i:j] such that each distinct value present in A is also present in the subarray.

Variant: Given an array A, rearrange the elements so that the shortest subarray containing all the distinct values in A has maximum possible length.

Variant: Given an array A and a positive integer k, rearrange the elements so that no two equal elements are k or less apart.

13.8

In Problem 13.7 we did not differentiate between the order in which keywords appeared. If the digest has to include the keywords in the order in which they appear in the search textbox, we may get a different digest. For example, for the search keywords “Union” and “save”, in that order, the digest would be “Union, and is not either to save”.

Write a program which takes two arrays of strings, and returns the indices of the starting and ending index of a shortest subarray of the first array (the “paragraph” array) that “sequentially covers”, i.e., contains all the strings in the second array (the “keywords” array), in the order in which they appear in the keywords array. You can assume all keywords are distinct. For example, let the paragraph array be ⟨apple, banana, cat, apple⟩, and the keywords array be ⟨banana, apple⟩. The paragraph subarray starting at index 0 and ending at index 1 does not fulfill the specification, even though it contains all the keywords, since they do not appear in the specified order. On the other hand, the subarray starting at index 1 and ending at index 3 does fulfill the specification.

Hint: For each index in the paragraph array, compute the shortest subarray ending at that index which fulfills the specification.

13.9

Write a program which takes an array and returns the length of a longest subarray with the property that all its elements are distinct. For example, if the array is ⟨f,s,f,e,t,w,e,n,w,e⟩ then a longest subarray all of whose elements are distinct is ⟨s,f,e,t,w⟩.

Hint: What should you do if the subarray from indices i to j satisfies the property, but the subarray from i to j + 1 does not?

13.10

Write a program which takes as input a set of integers represented by an array, and returns the size of a largest subset of integers in the array having the property that if two integers are in the subset, then so are all integers between them. For example, if the input is ⟨3,-2,7,9,8,1,2,0,-1,5,8⟩, the largest such subset is {-2,-1,0,1,2,3}, so you should return 6.

Hint: Do you really need a total ordering on the input?

13.11

Student test scores are recorded in a file. Each line consists of a student ID, which is an alphanumeric string, and an integer between 0 and 100, inclusive.

Write a program which takes as input a file containing test scores and returns the student who has the maximum score averaged across his or her top three tests. If the student has fewer than three test scores, ignore that student.

Hint: Generalize to computing the top k scores.

13.12

This problem is concerned with taking a string (the “sentence” string) and a set of strings (the “words”), and finding the substrings of the sentence which are the concatenation of all the words (in any order). For example, if the sentence string is "amanaplanacanal" and the set of words is {"can", "apl", "ana"}, "aplanacan" is a substring of the sentence that is the concatenation of all words.

Write a program which takes as input a string (the “sentence”) and an array of strings (the “words”), and returns the starting indices of substrings of the sentence string which are the concatenation of all the strings in the words array. Each string must appear exactly once, and their ordering is immaterial. Assume all strings in the words array have equal length. It is possible for the words array to contain duplicates.

Hint: Exploit the fact that the words have the same length.

13.13

The Collatz conjecture is the following: take any natural number. If it is odd, triple it and add one; if it is even, halve it. Repeat the process indefinitely. No matter what number you begin with, you will eventually arrive at 1.

As an example, if we start with 11 we get the sequence 11,34,17,52,26,13,40,20,10,5,16,8,4,2,1. Despite intense efforts, the Collatz conjecture has not been proved or disproved.

Suppose you were given the task of checking the Collatz conjecture for the first billion integers. A direct approach would be to compute the convergence sequence for each number in this set.

Test the Collatz conjecture for the first n positive integers.

Hint: How would you efficiently check the conjecture for n assuming it holds for all m < n?

13.14

The state of a game of chess is determined by what piece is present on each square. Each square may be empty, or have one of six classes of pieces; each piece may be black or white. Thus ⌈lg(1 + 6 × 2)⌉ = 4 bits suffice per square, which means that a total of 64 × 4 = 256 bits can represent the state of the chessboard. (The actual state of the game is slightly more complex, as it needs to capture which side is to move, castling rights, en passant, etc., but we will use the simpler model for this question.)

Chess-playing computers need to store sets of states, e.g., to determine if a particular state has been evaluated before, or is known to be a winning state. To reduce storage, it is natural to apply a hash function to the 256 bits of state, and ignore collisions. The hash code can be computed by a conventional hash function for strings. However, since the computer repeatedly explores nearby states, it is advantageous to consider hash functions that can be efficiently computed based on incremental changes to the board.

Design a hash function for chess game states. Your function should take a state and the hash code for that state, and a move, and efficiently compute the hash code for the updated state.

Hint: XOR is associative, commutative, and fast to compute. Additionally, a ⊕ a = 0.

Variant: How can you include castling rights and en passant information in the state?

Answers

13.1

Count character frequencies and check how many have odd counts. A permutation can form a palindrome iff at most one character has odd frequency. This gives $O (n)$ time and $O (c)$ space, where $c$ is the number of distinct characters.

13.2

Count the characters needed by the letter in a hash table, then scan the magazine and decrement counts when matching characters appear. Delete entries whose counts drop to zero; if the table becomes empty, the letter is constructible. This runs in $O (m + n)$ time.

13.3

Use a hash table from ISBN to price plus a handle into a doubly linked list that stores usage order. lookup and insert move touched entries to the front of the list, and eviction removes the tail node and its hash entry. All operations are $O (1)$ .

13.4

Walk upward from both nodes in tandem, storing visited ancestors in a hash set. The first repeated node is the LCA. This bounds the work by the distances from the nodes to the LCA instead of the full tree height.

13.5

First build a frequency table. If you only need the top k, either keep a min-heap of size k keyed by frequency for $O (n + m lo g k)$ time, or run quickselect on the distinct strings by frequency for expected linear time after counting. Here m is the number of distinct strings.

13.6

Track the most recent index of each value in a hash table while scanning left to right. When a value reappears, compare the new gap with the best gap seen so far, then update its last-seen index. This is $O (n)$ time and $O (d)$ space.

13.7

Use a sliding window with a hash table of remaining keyword counts. Expand the right end until all keywords are covered, then shrink the left end while coverage is preserved, recording the best interval seen. Both pointers move at most n times, so the algorithm is $O (n)$ .

13.8

Map each keyword to its position in the keyword order. As you scan the paragraph, maintain the latest occurrence of each keyword and the length of the shortest subarray ending here that sequentially covers the first j keywords. When you see the last keyword, update the best answer. This is $O (n)$ time and $O (m)$ space.

13.9

Maintain a hash table from value to its most recent index and track the left boundary of the current duplicate-free window. When a repeated value appears inside the window, move the left boundary just past its previous occurrence. Update the maximum window length throughout. This gives $O (n)$ time.

13.10

Insert all numbers into a hash set. Repeatedly pick an unprocessed value, expand downward and upward while consecutive neighbors exist, and remove every visited element from the set so no interval is processed twice. The largest expansion length is the answer, and the overall time is $O (n)$ .

13.11

For each student, keep only the top three scores seen so far, e.g., in a size-3 min-heap. After processing the file, compute each eligible student’s top-three sum and return the student with the maximum average. Since each update is constant-time for fixed heap size 3, the total time is $O (n)$ .

13.12

Because all words have equal length, test each possible start index by chopping the candidate substring into fixed-size pieces and comparing their multiplicities against the target word multiset. Use a hash table for the required counts and another for counts seen in the candidate. This yields $O (N nm)$ time in the book’s notation.

13.13

Check numbers from 3 to n, skipping even starts. For each start, follow the Collatz sequence, using a hash set to detect loops and another set of previously verified odd numbers to stop early once the trajectory reaches a known-good state. Guard against overflow on 3x + 1. This is mainly a heuristic optimization on top of the obvious iterative test.

13.14

Use Zobrist hashing: precompute a random code for each (square, piece-state) pair, and define the board hash as the XOR of the codes for all occupied or empty square states. After a move, XOR out the old square states and XOR in the new ones, so an update takes constant time. Extra game state like castling rights or en passant can be handled by assigning them their own random codes and XORing them in as well.

Takashi's Notes

Explorer

Hash Tables

Hash Tables

Problems

13.1

13.2

13.3

13.4

13.5

13.6

13.7

13.8

13.9

13.10

13.11

13.12

13.13

13.14

Answers

13.1

13.2

13.3

13.4

13.5

13.6

13.7

13.8

13.9

13.10

13.11

13.12

13.13

13.14

Graph View

Table of Contents

Backlinks