Hash Tables

Fast Data Retrieval with O(1) Average Time Complexity

An introduction to hash tables and their importance in computer science.

Why Hash Tables?

Hash tables allow very fast retrieval of data regardless of dataset size.

Applications:

  • Database indexing
  • Caching systems
  • Program compilation
  • Error checking

Problem solved: Linear search takes O(n) time, which is slow for large arrays.

Array Basics - Direct Access

If you know the index of an element in an array, lookup is O(1).

[0]
[1]
[2]
[3]
[4]
[5]
X
[6]
[7]
[8]
[9]
[10]

Time to access array[5] is the same whether the array has 100 or 1,000,000 items.

Access time is independent of array size and element position.

Problem: How do you know which index contains your desired value?

The Key Insight

Solution: Calculate the index using the value itself.

The index is derived from the data you're looking for.

If you apply the same calculation to a key, you get the same index.

This is the fundamental principle of hashing.

Hash Function Introduction

A hash function transforms a key into an array index.

For numeric keys:

address = key mod n

(where n is the array size)

For alphanumeric keys:

  1. Sum the ASCII codes of each character
  2. Divide by array size
  3. Take the remainder (modulo operation)

Other methods: Folding (group digits, add, then mod), folding for phone numbers, etc.

Example: ASCII Code Hashing with Cartoon Characters

Let's use famous cartoon characters from Looney Tunes and Scooby Doo!

Array size = 11

FRED (from Scooby Doo):

F=70, r=114, e=101, d=100 → Sum = 385 → 385 mod 11 = 0 → Index 0

SCRAPPY: → Index 1

TWEETY: T=84, w=119, e=101, e=101, t=116, y=121 → Sum = 642 → 642 mod 11 = 4 → Index 4

BUGS: B=66, u=117, g=103, s=115 → Sum = 401 → 401 mod 11 = 5 → Index 5

DAFFY: D=68, a=97, f=102, f=102, y=121 → Sum = 490 → 490 mod 11 = 6 → Index 6

SCOOBY: S=83, c=99, o=111, o=111, b=98, y=121 → Sum = 623 → 623 mod 11 = 7 → Index 7

DAPHNE: D=68, a=97, p=112, h=104, n=110, e=101 → Sum = 592 → 592 mod 11 = 9 → Index 9

YOSEMITE: Y=121, o=111, s=115, e=101, m=109, i=105, t=116, e=101 → Sum = 879 → 879 mod 11 = 10 → Index 10

[0]
Fred
[1]
Scrappy
[2]
[3]
[4]
Tweety
[5]
Bugs
[6]
Daffy
[7]
Scooby
[8]
[9]
Daphne
[10]
Yosemite

Final array: [Fred, Scrappy, _, _, Tweety, Bugs, Daffy, Scooby, _, Daphne, Yosemite]

No collisions with this particular set of characters! Indices 2, 3, and 8 remain empty.

Key-Value Pairs & Hash Maps

Hash tables typically store key-value pairs.

Key: Used to calculate the index (e.g., character name)

Value: The data associated with the key (e.g., voice actor, cartoon series, debut year)

Example: "Scooby" → {name: "Scooby", series: "Scooby Doo", voiceActor: "Don Messick", ...}

In object-oriented programming:

Each position stores an object instance with many properties.
Only the key property is used to calculate the index.

This allows efficient storage of related data about each character.

The Collision Problem

Collision: When two different keys hash to the same index.

Example: If we add "Porky" (another Looney Tunes character):

P=80, o=111, r=114, k=107, y=121 → Sum = 533 → 533 mod 11 = 5

This hashes to index 5, but "Bugs" is already there!

⚠ Both items cannot occupy the same space!

[0]
Fred
[1]
Scrappy
[2]
[3]
[4]
Tweety
[5]
Bugs
Porky?
[6]
Daffy
[7]
Scooby
[8]
[9]
Daphne
[10]
Yosemite

Solution needed: Collision resolution strategy.

Open Addressing - Linear Probing with Cartoons

Strategy: If the calculated position is occupied, find the next available slot.

Starting fresh with our cartoon characters (array size 11):

  • Fred → 0 ✓
  • Scrappy → 1 ✓
  • Tweety → 4 ✓
  • Bugs → 5 ✓
  • Daffy → 6 ✓
  • Scooby → 7 ✓
  • Daphne → 9 ✓

Now add PORKY:

Calculated position: 5 (collision with Bugs!)

→ Check index 6: occupied (Daffy)

→ Check index 7: occupied (Scooby)

→ Check index 8: empty! Place Porky here ✓

Now add VELMA:

V=86, e=101, l=108, m=109, a=97 → Sum = 501 → 501 mod 11 = 6

Calculated position: 6 (collision with Daffy!)

→ Check index 7: occupied (Scooby)

→ Check index 8: occupied (Porky)

→ Check index 9: occupied (Daphne)

→ Check index 10: empty! Place Velma here ✓

[0]
Fred
[1]
Scrappy
[2]
[3]
[4]
Tweety
[5]
Bugs
[6]
Daffy
[7]
Scooby
[8]
Porky
[9]
Daphne
[10]
Velma

Final array: [Fred, Scrappy, _, _, Tweety, Bugs, Daffy, Scooby, Porky, Daphne, Velma]

Both collisions resolved through linear probing! Even though Porky and Velma had hash collisions, linear probing found them available slots.

Linear Probing - Primary Clustering

Problem with linear probing: Primary clustering

Items bunch together in clusters, while large portions of the array remain empty.

[0]
Fred
[1]
Scrappy
[2]
[3]
[4]
Tweety
[5]
Bugs
[6]
Daffy
[7]
Scooby
[8]
Porky
[9]
Daphne
[10]
Velma

← Cluster of items (some displaced by collisions)

Result:

  • Reduces lookup efficiency (no longer guaranteed O(1))
  • Many displaced items require searching
  • As table fills, clustering worsens

Solution: Use alternative probing methods or different collision strategy.

Alternatives to Linear Probing

Quadratic Probing:

Use quadratic distances (i², 2i², 3i²...) from collision point.
Spacing grows rapidly, reducing clustering.

First collision: check position + 1²

Second collision: check position + 2²

Third collision: check position + 3²

Plus-3 Rehash:

Check every 3rd position instead of every position.

Double Hashing:

Apply a second hash function when collision occurs.
Second hash result determines step size.

stepSize = hash2(key)

nextPosition = (currentPosition + stepSize) mod n

All methods aim to spread displaced items more evenly.

Chaining (Closed Addressing) with Cartoons

Strategy: Each array position has a linked list for colliding items.

Using the same cartoon character dataset with collisions:

Index 0: Fred
Index 1: Scrappy
Index 4: Tweety
Index 5: Bugs → Porky (linked list - Porky chains after Bugs)
Index 6: Daffy → Velma (linked list - Velma chains after Daffy)
Index 7: Scooby
Index 9: Daphne

With chaining:

  • More items stored in their "correct" calculated positions
  • Lookup requires traversing linked list only for that index
  • No need to search through many array slots like in linear probing
  • Better lookup performance than linear probing in many cases

Cost: Must traverse linked list, but usually shorter than linear probe distance.

Load Factor

Load Factor: ratio of items stored to array size.

Example: 9 cartoon characters in array of size 11 = 0.82 (82% load factor)

Load Factor = Number of Items / Array Size

Best practice: Keep load factor ≤ 0.7 (70% occupancy)

Dynamic resizing: Many implementations auto-expand array when load factor exceeds threshold, reducing collision probability.

If load factor > 0.7:

1. Create new array (typically 2× size)

2. Rehash all items into new array

3. Replace old array with new array

Trade-off: More space used, but fewer collisions and better performance.

Time Complexity

Best Case: O(1)

No collisions, direct array access.

Average Case: O(1)

With good hash function and load factor < 0.7, collisions are rare.

Worst Case: O(n)

Many collisions (poor hash function or high load factor).

Must search through many colliding items.

Performance depends on:

  • Quality of hash function
  • Load factor maintained
  • Collision resolution method used

Hash Function Objectives

When designing a hash function, consider:

1. Minimize collisions

Less collision resolution needed → faster retrieval

2. Uniform distribution

Data spread evenly across array positions

3. Easy to calculate

Hash computation must be O(1)

4. Suitable collision resolution

Choose method appropriate for your use case

Good hash function characteristics:

• Deterministic (same input → same output)

• Fast to compute

• Distributes keys uniformly

• Uses all available address space

Perfect Hash Functions

Perfect hash function: Produces unique index for each key.

Requirements: All keys must be known in advance.

Advantages:

  • Zero collisions
  • Can use all array space efficiently
  • Optimal performance

Limitations:

  • Not practical for dynamic datasets (keys not known in advance)
  • Expensive to compute in some cases
  • Most real-world applications use approximate hash functions

Use cases for perfect hashing:

• Compiler keyword tables

• Fixed configuration mappings

• Static lookup tables

Real-World Applications

Hash tables are fundamental to modern computing:

  • Database indexing: Fast record lookup
  • Compiler symbol tables: Variable/function name resolution
  • Caching systems: Fast memory access
  • Password authentication: Secure hashing
  • Dictionary/Map implementations: Language features
  • Set implementations: Unique element storage
  • File systems: Inode lookups
  • Networking: ARP tables, DNS caching

Examples in programming languages:

• Java: HashMap, HashSet

• Python: dict, set

• JavaScript: Object, Map, Set

• C++: unordered_map, unordered_set

Summary

Key takeaways:

  • Hash tables enable O(1) average-case retrieval
  • Hash functions transform keys into array indices
  • Collisions are inevitable; must be resolved
  • Two main strategies:
    • Open addressing: linear probing, quadratic probing, double hashing
    • Chaining (closed addressing): linked lists at each position
  • Load factor and hash function quality are critical for performance
  • Widely used in databases, compilers, caching, authentication, and more
  • Perfect hash functions exist but require knowing all keys in advance

Remember: Hash tables trade space for time to achieve fast lookups!