Input Enhancement11. Often entails trading storage space for quicker execution times later on.: Pre-process the input to store some info to be used later in solving the problem.
Restructuring: Pre-process the input to make accessing its elements easier.
Idea: Count values in table.
ALGORITHM ComparisonCountingSort(A[0..n-1])// Sorts an array by comparison counting// Input: Array A[0..n-1] of orderable values// Output: Array S[0..n-1] of A's elements sortedfor i \leftarrow 0 to n-1 do Count[i] \leftarrow 0for i \leftarrow 0 to n-2 dofor j \leftarrow i+1 to n-1 doif A[i] \lt A[j]Count[j] \leftarrow Count[j] \text{$+$} 1elseCount[i] \leftarrow Count[i] \text{$+$} 1for i \leftarrow 0 to n-1 do S[Count[i]] \leftarrow A[i]return S
For each item: Count how many items are less than that item.
We don’t have to look backwards and re-check old work.
Analysis — C(n) = \frac{n(n-1)}{2}
- Same number of key comparisons as selection sort \Theta(n^2).
- Linear amount of extra space O(n).
- Minimum number of key moves possible.
Idea:
There are only three unique values. If we compute frequency/distribution value we only end up doing work for each unique value.
e.g.,
| Values | Frequency | Distribution Value (summation of smaller freq) |
|---|---|---|
| 11 | 1 | 1 |
| 12 | 3 | 4 |
| 13 | 2 | 6 |
Thus, we can construct the sorted array:
11, 12, 12, 12, 13, 13
ALGORITHM DistributionCountingSort(A[0..n-1], l, u)// Sorts an array of integers from l to u// Input: Array A[0..n-1], bounds l and u// Output: Sorted array S[0..n-1]for j \leftarrow 0 to u \text{$-$} l do D[j] \leftarrow 0for i \leftarrow 0 to n \text{$-$} 1 do D[A[i] \text{$-$} l] \leftarrow D[A[i] \text{$-$} l] \text{$+$} 1for j \leftarrow 1 to u \text{$-$} l do D[j] \leftarrow D[j \text{$-$} 1] \text{$+$} D[j]for i \leftarrow n \text{$-$} 1 down to 0 doj \leftarrow A[i] \text{$-$} lS[D[j] \text{$-$} 1] \leftarrow A[i]D[j] \leftarrow D[j] \text{$-$} 1return S
Analysis — Time complexity is \Theta(n + (u-l)). Space complexity is O(n + (u-l)).
Recall — String searching by brute force requires comparing the pattern at every shift.
Knuth-Morris-Pratt: Preprocess pattern left-to-right to get info useful for later searching.
Boyer-Moore: Preprocess pattern right-to-left and store some info into twow tables.
Horspool’s: Simplified version of Boyer-Moore that uses just one table.
Note — For this class, we’ll only focus on Boyer-Moore and Horspool’s.
Preprocess pattern to generate a shift table that determines how much to shift the pattern when a mismatch occurs.
Shift Table: Stores numbers of characters to shift depending on the text character aligned with the rightmost pattern character.
How: If the character is not in the pattern, shift by m (length of pattern). If it is in the pattern, shift by the distance from its rightmost occurrence (excluding the very last character of the pattern itself) to the end of the pattern.
ALGORITHM ShiftTable(P[0..m-1])// Fills the shift table used by Horspool's algorithm// Input: Pattern P[0..m-1] and an alphabet// Output: Table[0..size-1] initialized with shift sizesfor i \leftarrow 0 to size \text{$-$} 1 do Table[i] \leftarrow mfor j \leftarrow 0 to m \text{$-$} 2 do Table[P[j]] \leftarrow m \text{$-$} 1 \text{$-$} jreturn TableALGORITHM HorspoolMatching(P[0..m-1], T[0..n-1])// Implements Horspool's algorithm// Input: Pattern P[0..m-1] and Text T[0..n-1]// Output: Index of first matching character, or -1Table \leftarrow ShiftTable(P[0..m-1])i \leftarrow m \text{$-$} 1while i \le n \text{$-$} 1 dok \leftarrow 0while k \le m \text{$-$} 1 and P[m \text{$-$} 1 \text{$-$} k] = T[i \text{$-$} k] dok \leftarrow k \text{$+$} 1if k = mreturn i \text{$-$} m \text{$+$} 1elsei \leftarrow i \text{$+$} Table[T[i]]return -1
Note — REMEMBER WE CHECK RIGHT TO LEFT, USE THE RIGHTMOST TEXT CHARACTER.
Analysis — Time efficiency is O(n) average-case, O(nm) worst-case. Space complexity is O(|Alphabet|).
How do we know we’ve found it?
Preprocesses shift sizes in two tables:
If the rightmost character doesn’t match, BM acts like Horspool’s, using the bad symbol table, except it uses the text character that actually caused the mismatch.
If the rightmost character does match, BM compares preceding characters. We are trying to find the character that causes the mismatch and use that rather than just the rightmost character.
Successful matches are also useful for shifts.
Good suffix shift value depends on finding the rightmost occurrence of the matched suffix (of length k) in the pattern that is not preceded by the same character.
How much BM shifts is d=\max(d_1,d_2), where d_1 is the bad-symbol shift and d_2 is the good-symbol shift.
Analysis — Time efficiency is O(n/m) best-case, O(n+m) worst-case. Space complexity depends on alphabet size and pattern length.
Bad Symbol Shift Table and Good-Suffix Table allow us to quickly skip characters and align the pattern.
Very efficient way of implementing a dictionary.
Note — Dictionary: Find, insert, and delete.
- Basically, we spend extra space to make these operations faster.
Hash Function: Maps keys of a given file of size n int to a table of size m.
h: K \to \text{location (cell) in the hash table}
Note — Generally, as hash function should be:
- Easy to compute, and
- Distribute keys evenly.
- Handle collisions
h(K) = K \% m (where m is typically prime).
Recall — Collisions are when two different keys have the same address.
Good hash functions have fewer collisions, but some collisions should be expected.
Two principal schemes to handle collisions:
Keys are stored in a linked-list outside a hash table.
Note — Because we use linked-list, open hashing means we can store more information (at the cost of performance).
If a hash function distributes keys uniformly, average length of linked list will be \alpha = n /m
Av.g number of pointers inspected in successful and unsuccessful searches is then:
\begin{aligned} S &\approx 1 + \alpha/2 \\ U &= \alpha \end{aligned}
Kept small.
All keys are stored in the hash table itself.
One way to resolve collisions is to just traverse linearly and look for the next open cell if there is a collision.
Issues:
Load factor22. The ratio of n / m, indicating how full the hash table is. is same, now called hash table density.
As the table gets filled (\alpha approaches 1, linear probing increases drastically).
Problem: Access a huge data set, organized by key values.
Note — This is used a lot in system’s programming.
Idea: Use a multi-way search tree to minimize disk accesses by making the tree very wide and shallow.
A B-tree of order m is a m-way tree (each node can have up to m children), in which:
Note — Important: m must be odd!
- If it was even, the interval wouldn’t be odd.
In other words, B-tree is an extension of the 2-3 tree where:
Pick m so that an m-block fits exactly into a disk page.
A B-tree of m \ge 2 must satisfy: