Fast Data Retrieval with O(1) Average Time Complexity
An introduction to hash tables and their importance in computer science.
Hash tables allow very fast retrieval of data regardless of dataset size.
Applications:
Problem solved: Linear search takes O(n) time, which is slow for large arrays.
If you know the index of an element in an array, lookup is O(1).
Time to access array[5] is the same whether the array has 100 or 1,000,000 items.
Access time is independent of array size and element position.
Problem: How do you know which index contains your desired value?
Solution: Calculate the index using the value itself.
The index is derived from the data you're looking for.
If you apply the same calculation to a key, you get the same index.
This is the fundamental principle of hashing.
A hash function transforms a key into an array index.
For numeric keys:
address = key mod n
(where n is the array size)
For alphanumeric keys:
Other methods: Folding (group digits, add, then mod), folding for phone numbers, etc.
Let's use famous cartoon characters from Looney Tunes and Scooby Doo!
Array size = 11
FRED (from Scooby Doo):
F=70, r=114, e=101, d=100 → Sum = 385 → 385 mod 11 = 0 → Index 0
SCRAPPY: → Index 1
TWEETY: T=84, w=119, e=101, e=101, t=116, y=121 → Sum = 642 → 642 mod 11 = 4 → Index 4
BUGS: B=66, u=117, g=103, s=115 → Sum = 401 → 401 mod 11 = 5 → Index 5
DAFFY: D=68, a=97, f=102, f=102, y=121 → Sum = 490 → 490 mod 11 = 6 → Index 6
SCOOBY: S=83, c=99, o=111, o=111, b=98, y=121 → Sum = 623 → 623 mod 11 = 7 → Index 7
DAPHNE: D=68, a=97, p=112, h=104, n=110, e=101 → Sum = 592 → 592 mod 11 = 9 → Index 9
YOSEMITE: Y=121, o=111, s=115, e=101, m=109, i=105, t=116, e=101 → Sum = 879 → 879 mod 11 = 10 → Index 10
Final array: [Fred, Scrappy, _, _, Tweety, Bugs, Daffy, Scooby, _, Daphne, Yosemite]
No collisions with this particular set of characters! Indices 2, 3, and 8 remain empty.
Hash tables typically store key-value pairs.
Key: Used to calculate the index (e.g., character name)
Value: The data associated with the key (e.g., voice actor, cartoon series, debut year)
Example: "Scooby" → {name: "Scooby", series: "Scooby Doo", voiceActor: "Don Messick", ...}
In object-oriented programming:
Each position stores an object instance with many properties.
Only the key property is used to calculate the index.
This allows efficient storage of related data about each character.
Collision: When two different keys hash to the same index.
Example: If we add "Porky" (another Looney Tunes character):
P=80, o=111, r=114, k=107, y=121 → Sum = 533 → 533 mod 11 = 5
This hashes to index 5, but "Bugs" is already there!
⚠ Both items cannot occupy the same space!
Solution needed: Collision resolution strategy.
Strategy: If the calculated position is occupied, find the next available slot.
Starting fresh with our cartoon characters (array size 11):
Now add PORKY:
Calculated position: 5 (collision with Bugs!)
→ Check index 6: occupied (Daffy)
→ Check index 7: occupied (Scooby)
→ Check index 8: empty! Place Porky here ✓
Now add VELMA:
V=86, e=101, l=108, m=109, a=97 → Sum = 501 → 501 mod 11 = 6
Calculated position: 6 (collision with Daffy!)
→ Check index 7: occupied (Scooby)
→ Check index 8: occupied (Porky)
→ Check index 9: occupied (Daphne)
→ Check index 10: empty! Place Velma here ✓
Final array: [Fred, Scrappy, _, _, Tweety, Bugs, Daffy, Scooby, Porky, Daphne, Velma]
Both collisions resolved through linear probing! Even though Porky and Velma had hash collisions, linear probing found them available slots.
Problem with linear probing: Primary clustering
Items bunch together in clusters, while large portions of the array remain empty.
← Cluster of items (some displaced by collisions)
Result:
Solution: Use alternative probing methods or different collision strategy.
Quadratic Probing:
Use quadratic distances (i², 2i², 3i²...) from collision point.
Spacing grows rapidly, reducing clustering.
First collision: check position + 1²
Second collision: check position + 2²
Third collision: check position + 3²
Plus-3 Rehash:
Check every 3rd position instead of every position.
Double Hashing:
Apply a second hash function when collision occurs.
Second hash result determines step size.
stepSize = hash2(key)
nextPosition = (currentPosition + stepSize) mod n
All methods aim to spread displaced items more evenly.
Strategy: Each array position has a linked list for colliding items.
Using the same cartoon character dataset with collisions:
With chaining:
Cost: Must traverse linked list, but usually shorter than linear probe distance.
Load Factor: ratio of items stored to array size.
Example: 9 cartoon characters in array of size 11 = 0.82 (82% load factor)
Load Factor = Number of Items / Array Size
Best practice: Keep load factor ≤ 0.7 (70% occupancy)
Dynamic resizing: Many implementations auto-expand array when load factor exceeds threshold, reducing collision probability.
If load factor > 0.7:
1. Create new array (typically 2× size)
2. Rehash all items into new array
3. Replace old array with new array
Trade-off: More space used, but fewer collisions and better performance.
Best Case: O(1)
No collisions, direct array access.
Average Case: O(1)
With good hash function and load factor < 0.7, collisions are rare.
Worst Case: O(n)
Many collisions (poor hash function or high load factor).
Must search through many colliding items.
Performance depends on:
When designing a hash function, consider:
1. Minimize collisions
Less collision resolution needed → faster retrieval
2. Uniform distribution
Data spread evenly across array positions
3. Easy to calculate
Hash computation must be O(1)
4. Suitable collision resolution
Choose method appropriate for your use case
Good hash function characteristics:
• Deterministic (same input → same output)
• Fast to compute
• Distributes keys uniformly
• Uses all available address space
Perfect hash function: Produces unique index for each key.
Requirements: All keys must be known in advance.
Advantages:
Limitations:
Use cases for perfect hashing:
• Compiler keyword tables
• Fixed configuration mappings
• Static lookup tables
Hash tables are fundamental to modern computing:
Examples in programming languages:
• Java: HashMap, HashSet
• Python: dict, set
• JavaScript: Object, Map, Set
• C++: unordered_map, unordered_set
Key takeaways:
Remember: Hash tables trade space for time to achieve fast lookups!