Input Enhancement: Pre-process the input to store some info to be used later in solving the problem.
Restructuring: Pre-process the input to make accessing its elements easier.
Idea: Count values in table.
TODO Pseudocode
For each item: Count how many items are less than that item.
We don’t have to look backwards and re-check old work.
Analysis:
C(n) = TODO = \frac{n(n-1)}{2}
Idea:
Example:
TODO
There are only three unique values. If we compute frequency/distribution value (???) we only end up doing work for each unique value?
e.g.,
| Values | Frequency | Distribution Value (summation of smaller freq?) |
|---|---|---|
| 11 | 1 | 1 |
| 12 | 3 | 4 |
| 13 | 2 | 6 |
Thus, we can construct the sorted array:
11,12,12,12,13,13
TODO Pseudocode
Relevant Notes — String searching by brute force TODO
Knuth-Morris-Pratt: Preprocess pattern left-to-right to get info useful for later searching.
Boyer-Moore: Preprocess pattern right-to-left and store some info into twow tables.
Horspool’s: Simplified version of Boyer-Moore that uses just one table.
For this class, we’ll only focus on Boyer-Moore and Horspool’s.
Preprocess pattern to generate a shift table that determines how much to shift the pattern when a mismatch occurs.
Example: Reasoning about Shifting
Suppose you are searching for the pattern “BAOBAB”.
If you get a character and…
.....C...
BAOBAB
Then we want to shift until we get past this character entirely.
.....O...
BAOBAB
.....A...
BAOBAB
Then ???
.....B...
BAOBAB
Then ???
Shift Table: Stores numbers of characters to shift depending on first character compard.
How: TODO I don’t understand the reasoning, so I can’t explain how it actually works. I can only understand shifting by m (where m is pattern length) to get past characters not in the pattern, but I don’t understand the m-1 computation???
TODO Pseudocode
TODO NOTE TO SELF REMEMBER WE CHECK RIGHT TO LEFT, USE THE RIGHTMOST CHARACTER.
TODO Example: BARBER and BAOBAO
TODO How do we know we’ve found it?
Preprocesses shift sizes in two tables:
If the rightmost character doesn’t match, BM acts like Horspool’s, using the bad symbol table.
If the rightmsot character does match, BM compares preceding ??? We are trying to find the character that causes the mismatch and use that rather than just the rightmost character.
TODO
Horspool: Using last character to shift.
SER
BARBER
BARBER
<.pre>
Boyer-Moore: Using character that caused mismatch.
SER
BARBER
BARBER
<.pre>
Successful matches are also useful for shifts.
Good suffix shift value depends on k’s value (value of distance between match suffix of size k and its rightmost occurrence in the pattern that is not preceded by the same character as… TODO ??????
TODO I don’t understand really how this works? We apply this after k \in [0,m] last characters were matched???
How much BM shifts is d=max(d_1,d_2), where d_1 is the bad-symbol shift and d_2 is the good-symbol shift.
TODO Obviously, something about using good characters/matches can sometimes let us know how much to shift in a way that can be larger than only bad-symbol shifts. When this exactly is, I don’t get yet. But basically, both have their downsides, so we need to use both methods. (See AER to BARBER example? (below SER))
TODO When there is no suffix matching, d_2 is just m.
In more descriptive steps: TODO
Example: Complete Boyer-Moore Application
Bad Symbol Shift Table: TODO
BESS_KNEW_ABOUT_BAOBAB
TODO Scratchwork
Good-Suffix Table:
| k | pattern | d_2 |
|---|---|---|
| 1 | BAOBAB | 2 |
| 2 | BAOBAB | 5 |
| 3 | BAOBAB | 5 |
| 4 | BAOBAB | 5 |
| 5 | BAOBAB | 5 |
Example: Write the shift tables
Q: For the pattern TATGTG in the alphabet ATGC, find the bad shift table.
A:
Bad: “When we mismatch, what do we have shift by to make it align with the next occurrence?”
| Letter | Shift |
|---|---|
| A | 4 |
| T | 1 |
| G | 2 |
| C | 6 |
GCAATGCCTATGTGACC
TATGTG
TATGTG
TATGTG
Q: Now, write the good suffix table and do Boyer Moore
A:
Good-Suffix
| k | Pattern | Shift |
|---|---|---|
| 1 | TATGT\underline{G} | 2 |
| 2 | TATG\underline{TG} | 2 |
| 3 | TAT\underline{GTG} | 6 |
| 4 | TA\underline{TGTG} | 6 |
| 5 | T\underline{ATGTG} | 6 |
Application:Once we hit a 6 (max), we can skip the check and set the max for all remaining strings.
GCAATGCCTATGTGACC
TATGTG
TATGTG
TATGTG
Very efficient way of implementing a dictionary.
Dictionary — Find, insert, and delete.
- Basically, we spend extra space to make these operations faster.
Relevant Notes — CS2400: Hashing
Hash Function: Maps keys of a given file of size n int to a table of size m.
h: K \to \text{location (cell) in the hash table}
Note — Generally, as hash function should be:
- Easy to compute, and
- Distribute keys evenly.
- Handle collisions
Example: Simple Hash Function TODO (hash with mod and primes)
Recall — Collisions are when two different keys have the same address.
Good hash functions have fewer collisions, but some collisions should be expected.
Two principal schemes to handle collisions:
Note — Efficient implementations reduce memory usage by only storing addresses. Double hashing somehow plays into this. TODO
Keys are stored in a linked-list outside a hash table.
Note — Because we use linked-list, open hashing means we can store more information (at the cost of performance).
Aside — If we’re doing simple mod hashing, the maximum is whatever we’re modding by (e.g., MOD 13 only needs 13 memory cells TODO)
| Key | A | FOOL | AND | HIS | MONEY | ARE | SOON | PARTED |
|---|---|---|---|---|---|---|---|---|
| h(K) | 0x1 | 0x9 | 0x6 | 0x10 | 0x7 | 0x11 | 0x11 | 0x12 |
If a hash function distributes keys uniformly, average length of linked list will be \alpha = n /m
Av.g number of pointers inspected in successful and unsuccessful searches is then:
S \approx 1 + \alpha/2 \\ U = \alpha
Kept small.
All keys are stored in the hash table itself.
One way to resolve collisions is to just traverse linearly and look for the next open cell if there is a collision.
TODO resolve the previous collision in “A simple hash table” with linear probing.
Issues:
Load factor is same, now called hash table density.
Rates are different: TODO copy S and U (professor says we don’t need to memorize)
As the table gets filled (approaches 1, linear probing increases)
Problem: Access a huge data set, organized by key values.
Note — This is used a lot in system’s programming.
Idea: TODO
A B-tree of order m is a m-way tree (each node can have up to m children), in which: TODO Copy five points of definition.
Important — m must be odd!
- If it was even, the interval wouldn’t be odd.
In other words, B-tree is an extension of the 2-3 tree where:
TODO Example
Pick m so that an m-block fits exactly into a disk page.
A B-tree of m \ge 2 must satisfy: