Space and Time Trade-offs

Space for Time

Input Enhancement: Pre-process the input to store some info to be used later in solving the problem.

Restructuring: Pre-process the input to make accessing its elements easier.

Input Enhancement

Example: Sorting by Counting

Idea: Count values in table.

Comparison Counting Sort

TODO Pseudocode

For each item: Count how many items are less than that item.

We don’t have to look backwards and re-check old work.

Analysis:

C(n) = TODO = \frac{n(n-1)}{2}

Distribution Counting Sort

Idea:

  1. Count number of items of each value.
  2. Compute number of items <= each value
  3. Move all items to their exact final position.

Example:

TODO

There are only three unique values. If we compute frequency/distribution value (???) we only end up doing work for each unique value?

e.g.,

ValuesFrequencyDistribution Value (summation of smaller freq?)
1111
1234
1326

Thus, we can construct the sorted array:

11,12,12,12,13,13

TODO Pseudocode

String Searching by Preprocessing

Relevant Notes — String searching by brute force TODO

Knuth-Morris-Pratt: Preprocess pattern left-to-right to get info useful for later searching.

Boyer-Moore: Preprocess pattern right-to-left and store some info into twow tables.

Horspool’s: Simplified version of Boyer-Moore that uses just one table.

For this class, we’ll only focus on Boyer-Moore and Horspool’s.

Horspool’s Algorithm

Preprocess pattern to generate a shift table that determines how much to shift the pattern when a mismatch occurs.

Example: Reasoning about Shifting

Suppose you are searching for the pattern “BAOBAB”.

If you get a character and…

  1. It isn’t in the pattern
    .....C...
    BAOBAB
    

Then we want to shift until we get past this character entirely.

  1. It’s in the pattern (but no the rightmost)
    .....O...
    BAOBAB
    
    .....A...
    BAOBAB
    

Then ???

  1. It matches the rightmost character
    .....B...
    BAOBAB
    

Then ???

Shift

Shift Table: Stores numbers of characters to shift depending on first character compard.

How: TODO I don’t understand the reasoning, so I can’t explain how it actually works. I can only understand shifting by m (where m is pattern length) to get past characters not in the pattern, but I don’t understand the m-1 computation???

TODO Pseudocode

TODO NOTE TO SELF REMEMBER WE CHECK RIGHT TO LEFT, USE THE RIGHTMOST CHARACTER.

TODO Example: BARBER and BAOBAO

TODO How do we know we’ve found it?

Boyer-Moore

Preprocesses shift sizes in two tables:

  1. Bad Symbol Table: How much to shift based on character cuasing a mismatch.
  2. Good Suffix Table: Indicates how much to shift based on match part (suffix) of the pattern.

Bad-Symbol Shift

If the rightmost character doesn’t match, BM acts like Horspool’s, using the bad symbol table.

If the rightmsot character does match, BM compares preceding ??? We are trying to find the character that causes the mismatch and use that rather than just the rightmost character.

Example: Horspool v.s. Boyer-Moore

TODO

Horspool: Using last character to shift.

    

SER BARBER BARBER <.pre>

Boyer-Moore: Using character that caused mismatch.

    

SER BARBER BARBER <.pre>

Good-Suffix Shift

Successful matches are also useful for shifts.

Good suffix shift value depends on k’s value (value of distance between match suffix of size k and its rightmost occurrence in the pattern that is not preceded by the same character as… TODO ??????

TODO I don’t understand really how this works? We apply this after k \in [0,m] last characters were matched???

How much BM shifts is d=max(d_1,d_2), where d_1 is the bad-symbol shift and d_2 is the good-symbol shift.

TODO Obviously, something about using good characters/matches can sometimes let us know how much to shift in a way that can be larger than only bad-symbol shifts. When this exactly is, I don’t get yet. But basically, both have their downsides, so we need to use both methods. (See AER to BARBER example? (below SER))

TODO When there is no suffix matching, d_2 is just m.

Algorithm

In more descriptive steps: TODO

Example: Complete Boyer-Moore Application

Bad Symbol Shift Table: TODO

    BESS_KNEW_ABOUT_BAOBAB
    TODO Scratchwork
    

Good-Suffix Table:

kpatternd_2
1BAOBAB2
2BAOBAB5
3BAOBAB5
4BAOBAB5
5BAOBAB5

Example: Write the shift tables

Q: For the pattern TATGTG in the alphabet ATGC, find the bad shift table.

A:

Bad: “When we mismatch, what do we have shift by to make it align with the next occurrence?”

LetterShift
A4
T1
G2
C6
Application:
    GCAATGCCTATGTGACC
    TATGTG
      TATGTG
            TATGTG
    

Q: Now, write the good suffix table and do Boyer Moore

A:

Good-Suffix

kPatternShift
1TATGT\underline{G}2
2TATG\underline{TG}2
3TAT\underline{GTG}6
4TA\underline{TGTG}6
5T\underline{ATGTG}6

Once we hit a 6 (max), we can skip the check and set the max for all remaining strings.

Application:
    GCAATGCCTATGTGACC
    TATGTG
      TATGTG
            TATGTG
    

Hashing

Very efficient way of implementing a dictionary.

Dictionary — Find, insert, and delete.

Hash Tables and Hash Functions

Relevant NotesCS2400: Hashing

Hash Function: Maps keys of a given file of size n int to a table of size m.

h: K \to \text{location (cell) in the hash table}

Note — Generally, as hash function should be:

Example: Simple Hash Function TODO (hash with mod and primes)

Collisions

Recall — Collisions are when two different keys have the same address.

Good hash functions have fewer collisions, but some collisions should be expected.

Two principal schemes to handle collisions:

  1. Open Hashing: Allows multiples keys to share a cell.
  2. Closed Hashing: Enforces one key-per-cell.

Note — Efficient implementations reduce memory usage by only storing addresses. Double hashing somehow plays into this. TODO

Open Hashing (Separate Chaining)

Keys are stored in a linked-list outside a hash table.

Note — Because we use linked-list, open hashing means we can store more information (at the cost of performance).

Aside — If we’re doing simple mod hashing, the maximum is whatever we’re modding by (e.g., MOD 13 only needs 13 memory cells TODO)

Example: A simple hash table
KeyAFOOLANDHISMONEYARESOONPARTED
h(K)0x10x90x60x100x70x110x110x12

Performance (Load Factor)

If a hash function distributes keys uniformly, average length of linked list will be \alpha = n /m

Av.g number of pointers inspected in successful and unsuccessful searches is then:

S \approx 1 + \alpha/2 \\ U = \alpha

Kept small.

Closed Hashing

All keys are stored in the hash table itself.

Linear Probing

One way to resolve collisions is to just traverse linearly and look for the next open cell if there is a collision.

TODO resolve the previous collision in “A simple hash table” with linear probing.

Performance

Issues:

Load factor is same, now called hash table density.

Rates are different: TODO copy S and U (professor says we don’t need to memorize)

As the table gets filled (approaches 1, linear probing increases)

B-Tree Motivation

Problem: Access a huge data set, organized by key values.

Note — This is used a lot in system’s programming.

Idea: TODO

B-Tree Definition

A B-tree of order m is a m-way tree (each node can have up to m children), in which: TODO Copy five points of definition.

Importantm must be odd!

In other words, B-tree is an extension of the 2-3 tree where:

Constructing a B-tree

TODO Example

Concept

Pick m so that an m-block fits exactly into a disk page.

More Properties

A B-tree of m \ge 2 must satisfy:

  1. Root is either a leaf, or has between 2 and m children.
  2. Every other internal node has between m/2 and m childeren
  3. Perectly balanced