Space and Time Trade-offs

Space for Time

Input Enhancement11. Often entails trading storage space for quicker execution times later on.: Pre-process the input to store some info to be used later in solving the problem.

Restructuring: Pre-process the input to make accessing its elements easier.

Input Enhancement

Example: Sorting by Counting

Idea: Count values in table.

Comparison Counting Sort

ALGORITHM ComparisonCountingSort(A[0..n-1])
// Sorts an array by comparison counting
// Input: Array A[0..n-1] of orderable values
// Output: Array S[0..n-1] of A's elements sorted
    for i \leftarrow 0 to n-1 do Count[i] \leftarrow 0
    for i \leftarrow 0 to n-2 do
        for j \leftarrow i+1 to n-1 do
            if A[i] \lt A[j]
                Count[j] \leftarrow Count[j] \text{$+$} 1
            else
                Count[i] \leftarrow Count[i] \text{$+$} 1
    for i \leftarrow 0 to n-1 do S[Count[i]] \leftarrow A[i]
    return S

For each item: Count how many items are less than that item.

We don’t have to look backwards and re-check old work.

Analysis C(n) = \frac{n(n-1)}{2}

Distribution Counting Sort

Idea:

  1. Count number of items of each value.
  2. Compute number of items <= each value
  3. Move all items to their exact final position.
Example: Distribution Counting Process

There are only three unique values. If we compute frequency/distribution value we only end up doing work for each unique value.

e.g.,

Values Frequency Distribution Value (summation of smaller freq)
11 1 1
12 3 4
13 2 6

Thus, we can construct the sorted array:

11, 12, 12, 12, 13, 13

ALGORITHM DistributionCountingSort(A[0..n-1], l, u)
// Sorts an array of integers from l to u
// Input: Array A[0..n-1], bounds l and u
// Output: Sorted array S[0..n-1]
    for j \leftarrow 0 to u \text{$-$} l do D[j] \leftarrow 0
    for i \leftarrow 0 to n \text{$-$} 1 do D[A[i] \text{$-$} l] \leftarrow D[A[i] \text{$-$} l] \text{$+$} 1
    for j \leftarrow 1 to u \text{$-$} l do D[j] \leftarrow D[j \text{$-$} 1] \text{$+$} D[j]
    for i \leftarrow n \text{$-$} 1 down to 0 do
        j \leftarrow A[i] \text{$-$} l
        S[D[j] \text{$-$} 1] \leftarrow A[i]
        D[j] \leftarrow D[j] \text{$-$} 1
    return S

Analysis — Time complexity is \Theta(n + (u-l)). Space complexity is O(n + (u-l)).

String Searching by Preprocessing

Recall — String searching by brute force requires comparing the pattern at every shift.

Knuth-Morris-Pratt: Preprocess pattern left-to-right to get info useful for later searching.

Boyer-Moore: Preprocess pattern right-to-left and store some info into twow tables.

Horspool’s: Simplified version of Boyer-Moore that uses just one table.

Note — For this class, we’ll only focus on Boyer-Moore and Horspool’s.

Horspool’s Algorithm

Preprocess pattern to generate a shift table that determines how much to shift the pattern when a mismatch occurs.

Shift

Shift Table: Stores numbers of characters to shift depending on the text character aligned with the rightmost pattern character.

How: If the character is not in the pattern, shift by m (length of pattern). If it is in the pattern, shift by the distance from its rightmost occurrence (excluding the very last character of the pattern itself) to the end of the pattern.

ALGORITHM ShiftTable(P[0..m-1])
// Fills the shift table used by Horspool's algorithm
// Input: Pattern P[0..m-1] and an alphabet
// Output: Table[0..size-1] initialized with shift sizes
    for i \leftarrow 0 to size \text{$-$} 1 do Table[i] \leftarrow m
    for j \leftarrow 0 to m \text{$-$} 2 do Table[P[j]] \leftarrow m \text{$-$} 1 \text{$-$} j
    return Table

ALGORITHM HorspoolMatching(P[0..m-1], T[0..n-1])
// Implements Horspool's algorithm
// Input: Pattern P[0..m-1] and Text T[0..n-1]
// Output: Index of first matching character, or -1
    Table \leftarrow ShiftTable(P[0..m-1])
    i \leftarrow m \text{$-$} 1
    while i \le n \text{$-$} 1 do
        k \leftarrow 0
        while k \le m \text{$-$} 1 and P[m \text{$-$} 1 \text{$-$} k] = T[i \text{$-$} k] do
            k \leftarrow k \text{$+$} 1
        if k = m
            return i \text{$-$} m \text{$+$} 1
        else
            i \leftarrow i \text{$+$} Table[T[i]]
    return -1

Note — REMEMBER WE CHECK RIGHT TO LEFT, USE THE RIGHTMOST TEXT CHARACTER.

Analysis — Time efficiency is O(n) average-case, O(nm) worst-case. Space complexity is O(|Alphabet|).

How do we know we’ve found it?

Boyer-Moore

Preprocesses shift sizes in two tables:

  1. Bad Symbol Table: How much to shift based on character causing a mismatch.
  2. Good Suffix Table: Indicates how much to shift based on match part (suffix) of the pattern.

Bad-Symbol Shift

If the rightmost character doesn’t match, BM acts like Horspool’s, using the bad symbol table, except it uses the text character that actually caused the mismatch.

If the rightmost character does match, BM compares preceding characters. We are trying to find the character that causes the mismatch and use that rather than just the rightmost character.

Good-Suffix Shift

Successful matches are also useful for shifts.

Good suffix shift value depends on finding the rightmost occurrence of the matched suffix (of length k) in the pattern that is not preceded by the same character.

How much BM shifts is d=\max(d_1,d_2), where d_1 is the bad-symbol shift and d_2 is the good-symbol shift.

Analysis — Time efficiency is O(n/m) best-case, O(n+m) worst-case. Space complexity depends on alphabet size and pattern length.

Example: Complete Boyer-Moore Application

Bad Symbol Shift Table and Good-Suffix Table allow us to quickly skip characters and align the pattern.

Hashing

Very efficient way of implementing a dictionary.

Note — Dictionary: Find, insert, and delete.

Hash Tables and Hash Functions

Hash Function: Maps keys of a given file of size n int to a table of size m.

h: K \to \text{location (cell) in the hash table}

Note — Generally, as hash function should be:

Example: Simple Hash Function

h(K) = K \% m (where m is typically prime).

Collisions

Recall — Collisions are when two different keys have the same address.

Good hash functions have fewer collisions, but some collisions should be expected.

Two principal schemes to handle collisions:

  1. Open Hashing: Allows multiples keys to share a cell.
  2. Closed Hashing: Enforces one key-per-cell.

Open Hashing (Separate Chaining)

Keys are stored in a linked-list outside a hash table.

Note — Because we use linked-list, open hashing means we can store more information (at the cost of performance).

Performance (Load Factor)

If a hash function distributes keys uniformly, average length of linked list will be \alpha = n /m

Av.g number of pointers inspected in successful and unsuccessful searches is then:

\begin{aligned} S &\approx 1 + \alpha/2 \\ U &= \alpha \end{aligned}

Kept small.

Closed Hashing

All keys are stored in the hash table itself.

Linear Probing

One way to resolve collisions is to just traverse linearly and look for the next open cell if there is a collision.

Performance

Issues:

Load factor22. The ratio of n / m, indicating how full the hash table is. is same, now called hash table density.

As the table gets filled (\alpha approaches 1, linear probing increases drastically).

B-Tree Motivation

Problem: Access a huge data set, organized by key values.

Note — This is used a lot in system’s programming.

Idea: Use a multi-way search tree to minimize disk accesses by making the tree very wide and shallow.

B-Tree Definition

A B-tree of order m is a m-way tree (each node can have up to m children), in which:

  1. All leaves are on the same level.
  2. Every node except the root has at least \lceil m/2 \rceil children.
  3. The root has at least 2 children if it is not a leaf.
  4. A node with k children contains exactly k-1 keys.
  5. Keys are stored in nondecreasing order within the node.

Note — Important: m must be odd!

In other words, B-tree is an extension of the 2-3 tree where:

Concept

Pick m so that an m-block fits exactly into a disk page.

More Properties

A B-tree of m \ge 2 must satisfy:

  1. Root is either a leaf, or has between 2 and m children.
  2. Every other internal node has between \lceil m/2 \rceil and m childeren
  3. Perectly balanced

  1. Often entails trading storage space for quicker execution times later on.↩︎

  2. The ratio of n / m, indicating how full the hash table is.↩︎