Space and Time Trade-offs

Space for Time

Input Enhancement: Pre-process the input to store some info to be used later in solving the problem.

e.g., improved string searching

Restructuring: Pre-process the input to make accessing its elements easier.

e.g., hashing

Input Enhancement

Example: Sorting by Counting

Idea: Count values in table.

Use counts to determine sorted positions.

Comparison Counting Sort

TODO Pseudocode

For each item: Count how many items are less than that item.

This yields the item’s final position (how? I can see the work-through, but I don’t know how it actually worked)

We don’t have to look backwards and re-check old work.

Analysis:

C(n) = TODO = \frac{n(n-1)}{2}

Same number of key comparisons as selection sort
Linear amount of extra space.
Minimum number of key moves possible.

Distribution Counting Sort

Idea:

Count number of items of each value.
Compute number of items <= each value
Move all items to their exact final position.

Example:

TODO

There are only three unique values. If we compute frequency/distribution value (???) we only end up doing work for each unique value?

I guess this is: Get merely the count and unique numbers and sort only unique numbers?

e.g.,

Values	Frequency	Distribution Value (summation of smaller freq?)
11	1	1
12	3	4
13	2	6

Thus, we can construct the sorted array:

11,12,12,12,13,13

TODO Pseudocode

String Searching by Preprocessing

Relevant Notes — String searching by brute force TODO

Knuth-Morris-Pratt: Preprocess pattern left-to-right to get info useful for later searching.

Boyer-Moore: Preprocess pattern right-to-left and store some info into twow tables.

Horspool’s: Simplified version of Boyer-Moore that uses just one table.

For this class, we’ll only focus on Boyer-Moore and Horspool’s.

Horspool’s Algorithm

Preprocess pattern to generate a shift table that determines how much to shift the pattern when a mismatch occurs.

When a mismatch happens, we use the shift value of the previous character, not the one which caused the mismatch (???)

Example: Reasoning about Shifting

Suppose you are searching for the pattern “BAOBAB”.

If you get a character and…

It isn’t in the pattern

    .....C...
    BAOBAB

Then we want to shift until we get past this character entirely.

It’s in the pattern (but no the rightmost)

    .....O...
    BAOBAB

    .....A...
    BAOBAB

Then ???

It matches the rightmost character

    .....B...
    BAOBAB

Then ???

This B cannot be in the right position right now because on the left everything is wrong? Thus we must shift until we hit the next B?

Shift

Shift Table: Stores numbers of characters to shift depending on first character compard.

How: TODO I don’t understand the reasoning, so I can’t explain how it actually works. I can only understand shifting by m (where m is pattern length) to get past characters not in the pattern, but I don’t understand the m-1 computation???

TODO Pseudocode

TODO NOTE TO SELF REMEMBER WE CHECK RIGHT TO LEFT, USE THE RIGHTMOST CHARACTER.

TODO Example: BARBER and BAOBAO

TODO How do we know we’ve found it?

I can follow the shifts, but then how do you know you’ve hit it?
I think, you begin checking each time (right to left), but then shift upon mismatch. We use the right-most character because the shifting is based on that??? (Almost clicking)

Boyer-Moore

Preprocesses shift sizes in two tables:

Bad Symbol Table: How much to shift based on character cuasing a mismatch.
Good Suffix Table: Indicates how much to shift based on match part (suffix) of the pattern.

Bad-Symbol Shift

If the rightmost character doesn’t match, BM acts like Horspool’s, using the bad symbol table.

If the rightmsot character does match, BM compares preceding ??? We are trying to find the character that causes the mismatch and use that rather than just the rightmost character.

In other words, rather than basing off just the right-most character, we base out shift off the bad symbol which caused the mismatch!

Example: Horspool v.s. Boyer-Moore

TODO

Horspool: Using last character to shift.

    SER
    BARBER
    BARBER
    <.pre>
    
Boyer-Moore: Using character that caused mismatch.
    
    SER
    BARBER
    BARBER
    <.pre>

Good-Suffix Shift

Successful matches are also useful for shifts.

Good suffix shift value depends on k’s value (value of distance between match suffix of size k and its rightmost occurrence in the pattern that is not preceded by the same character as… TODO ??????

TODO I don’t understand really how this works? We apply this after k \in [0,m] last characters were matched???

How much BM shifts is d=max(d_1,d_2), where d_1 is the bad-symbol shift and d_2 is the good-symbol shift.

In other words, we want to shift by whatever the maximum amount is to cull as much of the problem and avoid doing useless work.

TODO Obviously, something about using good characters/matches can sometimes let us know how much to shift in a way that can be larger than only bad-symbol shifts. When this exactly is, I don’t get yet. But basically, both have their downsides, so we need to use both methods. (See AER to BARBER example? (below SER))

TODO When there is no suffix matching, d_2 is just m.

Algorithm

In more descriptive steps: TODO

Example: Complete Boyer-Moore Application

Bad Symbol Shift Table: TODO

    BESS_KNEW_ABOUT_BAOBAB
    TODO Scratchwork

Good-Suffix Table:

k	pattern	d_2
1	BAOBAB	2
2	BAOBAB	5
3	BAOBAB	5
4	BAOBAB	5
5	BAOBAB	5

Example: Write the shift tables

Q: For the pattern TATGTG in the alphabet ATGC, find the bad shift table.

Bad: “When we mismatch, what do we have shift by to make it align with the next occurrence?”

Letter	Shift
A	4
T	1
G	2
C	6

Application:

    GCAATGCCTATGTGACC
    TATGTG
      TATGTG
            TATGTG

Q: Now, write the good suffix table and do Boyer Moore

Good-Suffix

k	Pattern	Shift
1	TATGT\underline{G}	2
2	TATG\underline{TG}	2
3	TAT\underline{GTG}	6
4	TA\underline{TGTG}	6
5	T\underline{ATGTG}	6

Once we hit a 6 (max), we can skip the check and set the max for all remaining strings.

Application:

    GCAATGCCTATGTGACC
    TATGTG
      TATGTG
            TATGTG

Hashing

Very efficient way of implementing a dictionary.

Dictionary — Find, insert, and delete.
Basically, we spend extra space to make these operations faster.

Hash Tables and Hash Functions

Relevant Notes — CS2400: Hashing

Hash Function: Maps keys of a given file of size n int to a table of size m.

h: K \to \text{location (cell) in the hash table}

Note — Generally, as hash function should be:
Easy to compute, and
Distribute keys evenly.
Handle collisions

Example: Simple Hash Function TODO (hash with mod and primes)

Collisions

Recall — Collisions are when two different keys have the same address.

Good hash functions have fewer collisions, but some collisions should be expected.

Two principal schemes to handle collisions:

Open Hashing: Allows multiples keys to share a cell.
- e.g., the cell itself is a header of the linked list of all keys attached to it.
Closed Hashing: Enforces one key-per-cell.
- e.g., doing linear probing, double-hashing, etc. to find open cells.

Note — Efficient implementations reduce memory usage by only storing addresses. Double hashing somehow plays into this. TODO

Open Hashing (Separate Chaining)

Keys are stored in a linked-list outside a hash table.

Hash table cells just stores pointers to linked lists.

Note — Because we use linked-list, open hashing means we can store more information (at the cost of performance).

Aside — If we’re doing simple mod hashing, the maximum is whatever we’re modding by (e.g., MOD 13 only needs 13 memory cells TODO)

Example: A simple hash table

Key	A	FOOL	AND	HIS	MONEY	ARE	SOON	PARTED
h(K)	0x1	0x9	0x6	0x10	0x7	0x11	0x11	0x12

Performance (Load Factor)

If a hash function distributes keys uniformly, average length of linked list will be \alpha = n /m

Av.g number of pointers inspected in successful and unsuccessful searches is then:

S \approx 1 + \alpha/2 \\ U = \alpha

Kept small.

Closed Hashing

All keys are stored in the hash table itself.

Thus, table size m must be at least as big as n.

Linear Probing

One way to resolve collisions is to just traverse linearly and look for the next open cell if there is a collision.

We wrap-around to do the search, only stopping if we make it back to the start.

TODO resolve the previous collision in “A simple hash table” with linear probing.

Performance

Issues:

Deletions aren’t straightforward.
Does not work if n > m

Load factor is same, now called hash table density.

Rates are different: TODO copy S and U (professor says we don’t need to memorize)

As the table gets filled (approaches 1, linear probing increases)

B-Tree Motivation

Problem: Access a huge data set, organized by key values.

Too big to store all data in memory, but disk is slow.

Note — This is used a lot in system’s programming.

Idea: TODO

B-Tree Definition

A B-tree of order m is a m-way tree (each node can have up to m children), in which: TODO Copy five points of definition.

Important — m must be odd!
If it was even, the interval wouldn’t be odd.

In other words, B-tree is an extension of the 2-3 tree where:

All data (pointers) are stored in the leaves, and
each internal node has n-1 ordered keys and n child pointers.

Constructing a B-tree

TODO Example

Concept

Pick m so that an m-block fits exactly into a disk page.

We typically have to read an entire TODO
TODO
TODO

More Properties

A B-tree of m \ge 2 must satisfy:

Root is either a leaf, or has between 2 and m children.
Every other internal node has between m/2 and m childeren
Perectly balanced