Step-by-Step Tutorial: Huffman Coding¶
Note that this tutorial has been created by ChatGPT after prompting it to generate a step-by-step implementation guide for the given implementation of huffman.py which is also presented in the lecture. This tutorial might help to better understand the code and is just an additional learning material.
Hereโs a step-by-step tutorial that builds up to your final Huffman coding script. This will walk through concepts, building code incrementally, and explaining how each part contributes to the whole.
๐ง Step 1: Understand Huffman Coding¶
Huffman Coding is a lossless data compression algorithm:
- It assigns shorter binary codes to more frequent characters.
- Constructs a binary tree based on character frequencies.
๐ฆ Step 2: Count Character Frequencies¶
We need to count how often each character appears in a string. Python's collections.Counter
is perfect for this.
from collections import Counter
def get_character_counts(text):
return Counter(text)
โ Test it:
print(get_character_counts("AAAABCCD"))
๐ฒ Step 3: Create the Huffman Tree Node¶
We need a tree structure. Each node has:
- A character (or None for internal nodes)
- Frequency
- Left and right child nodes
class HuffmanNode:
def __init__(self, char, freq):
self.char = char
self.freq = freq
self.left = None
self.right = None
def __lt__(self, other):
return self.freq < other.freq
โ
__lt__
is used for sorting nodes in a priority queue. Note that lt
stands for less than.
๐๏ธ Step 4: Build the Huffman Tree¶
Using a priority queue (heapq
), combine nodes until only one remainsโthe root of the Huffman Tree.
import heapq
def build_huffman_tree(frequencies):
heap = [HuffmanNode(char, freq) for char, freq in frequencies.items()]
heapq.heapify(heap)
while len(heap) > 1:
smallest = heapq.heappop(heap)
next_smallest = heapq.heappop(heap)
merged = HuffmanNode(None, smallest.freq + next_smallest.freq)
merged.left = smallest
merged.right = next_smallest
heapq.heappush(heap, merged)
return heap[0]
๐งพ Step 5: Generate Huffman Codes¶
Walk the tree:
- Left = โ0โ
- Right = โ1โ
def generate_huffman_codes(node, prefix="", codebook={}):
if node:
if node.char is not None:
codebook[node.char] = prefix
generate_huffman_codes(node.left, prefix + "0", codebook)
generate_huffman_codes(node.right, prefix + "1", codebook)
return codebook
๐ฌ Step 6: Encode and Decode Messages¶
Encoding just replaces characters with Huffman codes.
def encode(message, huffman_dict):
return ''.join(huffman_dict[char] for char in message)
Decoding walks bit-by-bit and matches codes.
def decode(encoded_message, huffman_dict):
reverse_dict = {v: k for k, v in huffman_dict.items()}
current_code = ''
decoded_message = ''
for bit in encoded_message:
current_code += bit
if current_code in reverse_dict:
decoded_message += reverse_dict[current_code]
current_code = ''
return decoded_message
๐ Step 7: Display the Tree (Optional Visual)¶
Using a clever method from StackOverflow (credited in your code), the display()
method lets us print the tree structure.
Paste this into your HuffmanNode
class:
# Full display() and _display_aux() methods (as in your code)
๐ Step 8: Load and Clean Data¶
import os
import requests
def load_text_file(file_path):
with open(file_path, 'r') as f:
return f.read()
# Download data if not present
input_file_path = os.path.join(os.path.dirname(__file__), 'input.txt')
if not os.path.exists(input_file_path):
url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
with open(input_file_path, 'w', encoding='utf-8') as f:
f.write(requests.get(url).text)
text = load_text_file(input_file_path)
text = ''.join(text.split()) # Remove whitespace
text = ''.join(c for c in text if c.isalnum()) # Remove special characters
text = text.upper()
๐งฎ Step 9: Calculate Entropy & Mean Code Length¶
import math
frequencies = get_character_counts(text)
total_chars = sum(frequencies.values())
entropy = -sum((freq / total_chars) * math.log2(freq / total_chars) for freq in frequencies.values())
print("Entropy:", entropy)
huffman_tree = build_huffman_tree(frequencies)
huffman_tree.display()
codes = generate_huffman_codes(huffman_tree)
print("Huffman Codes:", codes)
mean_code_length = sum(len(codes[char]) * freq for char, freq in frequencies.items()) / total_chars
print("Mean Code Length:", mean_code_length)
๐งช Step 10: Test Your Compressor¶
message = "ABC"
encoded = encode(message, codes)
print("Encoded:", encoded)
print("Decoded:", decode(encoded, codes))
๐ Bonus: Visualize the Huffman Tree (Optional)¶
This part is only in the complete code that you can find here: huffman.py