Step-by-Step Tutorial: Huffman Coding¶

Note that this tutorial has been created by ChatGPT after prompting it to generate a step-by-step implementation guide for the given implementation of huffman.py which is also presented in the lecture. This tutorial might help to better understand the code and is just an additional learning material.

Here’s a step-by-step tutorial that builds up to your final Huffman coding script. This will walk through concepts, building code incrementally, and explaining how each part contributes to the whole.

🧠 Step 1: Understand Huffman Coding¶

Huffman Coding is a lossless data compression algorithm:

It assigns shorter binary codes to more frequent characters.
Constructs a binary tree based on character frequencies.

📦 Step 2: Count Character Frequencies¶

We need to count how often each character appears in a string. Python's collections.Counter is perfect for this.

from collections import Counter

def get_character_counts(text):
    return Counter(text)

✅ Test it:

print(get_character_counts("AAAABCCD"))

🌲 Step 3: Create the Huffman Tree Node¶

We need a tree structure. Each node has:

A character (or None for internal nodes)
Frequency
Left and right child nodes

class HuffmanNode:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

    def __lt__(self, other):
        return self.freq < other.freq

✅ __lt__ is used for sorting nodes in a priority queue. Note that lt stands for less than.

🏗️ Step 4: Build the Huffman Tree¶

Using a priority queue (heapq), combine nodes until only one remains—the root of the Huffman Tree.

import heapq

def build_huffman_tree(frequencies):
    heap = [HuffmanNode(char, freq) for char, freq in frequencies.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        smallest = heapq.heappop(heap)
        next_smallest = heapq.heappop(heap)

        merged = HuffmanNode(None, smallest.freq + next_smallest.freq)
        merged.left = smallest
        merged.right = next_smallest

        heapq.heappush(heap, merged)

    return heap[0]

🧾 Step 5: Generate Huffman Codes¶

Walk the tree:

Left = “0”
Right = “1”

def generate_huffman_codes(node, prefix="", codebook={}):
    if node:
        if node.char is not None:
            codebook[node.char] = prefix
        generate_huffman_codes(node.left, prefix + "0", codebook)
        generate_huffman_codes(node.right, prefix + "1", codebook)
    return codebook

💬 Step 6: Encode and Decode Messages¶

Encoding just replaces characters with Huffman codes.

def encode(message, huffman_dict):
    return ''.join(huffman_dict[char] for char in message)

Decoding walks bit-by-bit and matches codes.

def decode(encoded_message, huffman_dict):
    reverse_dict = {v: k for k, v in huffman_dict.items()}
    current_code = ''
    decoded_message = ''
    for bit in encoded_message:
        current_code += bit
        if current_code in reverse_dict:
            decoded_message += reverse_dict[current_code]
            current_code = ''
    return decoded_message

🔍 Step 7: Display the Tree (Optional Visual)¶

Using a clever method from StackOverflow (credited in your code), the display() method lets us print the tree structure.

Paste this into your HuffmanNode class:

# Full display() and _display_aux() methods (as in your code)

📚 Step 8: Load and Clean Data¶

import os
import requests

def load_text_file(file_path):
    with open(file_path, 'r') as f:
        return f.read()

# Download data if not present
input_file_path = os.path.join(os.path.dirname(__file__), 'input.txt')
if not os.path.exists(input_file_path):
    url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
    with open(input_file_path, 'w', encoding='utf-8') as f:
        f.write(requests.get(url).text)

text = load_text_file(input_file_path)
text = ''.join(text.split())  # Remove whitespace
text = ''.join(c for c in text if c.isalnum())  # Remove special characters
text = text.upper()

🧮 Step 9: Calculate Entropy & Mean Code Length¶

import math

frequencies = get_character_counts(text)
total_chars = sum(frequencies.values())

entropy = -sum((freq / total_chars) * math.log2(freq / total_chars) for freq in frequencies.values())
print("Entropy:", entropy)

huffman_tree = build_huffman_tree(frequencies)
huffman_tree.display()

codes = generate_huffman_codes(huffman_tree)
print("Huffman Codes:", codes)

mean_code_length = sum(len(codes[char]) * freq for char, freq in frequencies.items()) / total_chars
print("Mean Code Length:", mean_code_length)

🧪 Step 10: Test Your Compressor¶

message = "ABC"
encoded = encode(message, codes)
print("Encoded:", encoded)
print("Decoded:", decode(encoded, codes))

📌 Bonus: Visualize the Huffman Tree (Optional)¶

This part is only in the complete code that you can find here: huffman.py