Skip to content

Step-by-Step Tutorial: Huffman Coding

Note that this tutorial has been created by ChatGPT after prompting it to generate a step-by-step implementation guide for the given implementation of huffman.py which is also presented in the lecture. This tutorial might help to better understand the code and is just an additional learning material.

Hereโ€™s a step-by-step tutorial that builds up to your final Huffman coding script. This will walk through concepts, building code incrementally, and explaining how each part contributes to the whole.


๐Ÿง  Step 1: Understand Huffman Coding

Huffman Coding is a lossless data compression algorithm:

  • It assigns shorter binary codes to more frequent characters.
  • Constructs a binary tree based on character frequencies.

๐Ÿ“ฆ Step 2: Count Character Frequencies

We need to count how often each character appears in a string. Python's collections.Counter is perfect for this.

from collections import Counter

def get_character_counts(text):
    return Counter(text)

โœ… Test it:

print(get_character_counts("AAAABCCD"))

๐ŸŒฒ Step 3: Create the Huffman Tree Node

We need a tree structure. Each node has:

  • A character (or None for internal nodes)
  • Frequency
  • Left and right child nodes
class HuffmanNode:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

    def __lt__(self, other):
        return self.freq < other.freq

โœ… __lt__ is used for sorting nodes in a priority queue. Note that lt stands for less than.


๐Ÿ—๏ธ Step 4: Build the Huffman Tree

Using a priority queue (heapq), combine nodes until only one remainsโ€”the root of the Huffman Tree.

import heapq

def build_huffman_tree(frequencies):
    heap = [HuffmanNode(char, freq) for char, freq in frequencies.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        smallest = heapq.heappop(heap)
        next_smallest = heapq.heappop(heap)

        merged = HuffmanNode(None, smallest.freq + next_smallest.freq)
        merged.left = smallest
        merged.right = next_smallest

        heapq.heappush(heap, merged)

    return heap[0]

๐Ÿงพ Step 5: Generate Huffman Codes

Walk the tree:

  • Left = โ€œ0โ€
  • Right = โ€œ1โ€
def generate_huffman_codes(node, prefix="", codebook={}):
    if node:
        if node.char is not None:
            codebook[node.char] = prefix
        generate_huffman_codes(node.left, prefix + "0", codebook)
        generate_huffman_codes(node.right, prefix + "1", codebook)
    return codebook

๐Ÿ’ฌ Step 6: Encode and Decode Messages

Encoding just replaces characters with Huffman codes.

def encode(message, huffman_dict):
    return ''.join(huffman_dict[char] for char in message)

Decoding walks bit-by-bit and matches codes.

def decode(encoded_message, huffman_dict):
    reverse_dict = {v: k for k, v in huffman_dict.items()}
    current_code = ''
    decoded_message = ''
    for bit in encoded_message:
        current_code += bit
        if current_code in reverse_dict:
            decoded_message += reverse_dict[current_code]
            current_code = ''
    return decoded_message

๐Ÿ” Step 7: Display the Tree (Optional Visual)

Using a clever method from StackOverflow (credited in your code), the display() method lets us print the tree structure.

Paste this into your HuffmanNode class:

# Full display() and _display_aux() methods (as in your code)

๐Ÿ“š Step 8: Load and Clean Data

import os
import requests

def load_text_file(file_path):
    with open(file_path, 'r') as f:
        return f.read()

# Download data if not present
input_file_path = os.path.join(os.path.dirname(__file__), 'input.txt')
if not os.path.exists(input_file_path):
    url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
    with open(input_file_path, 'w', encoding='utf-8') as f:
        f.write(requests.get(url).text)

text = load_text_file(input_file_path)
text = ''.join(text.split())  # Remove whitespace
text = ''.join(c for c in text if c.isalnum())  # Remove special characters
text = text.upper()

๐Ÿงฎ Step 9: Calculate Entropy & Mean Code Length

import math

frequencies = get_character_counts(text)
total_chars = sum(frequencies.values())

entropy = -sum((freq / total_chars) * math.log2(freq / total_chars) for freq in frequencies.values())
print("Entropy:", entropy)

huffman_tree = build_huffman_tree(frequencies)
huffman_tree.display()

codes = generate_huffman_codes(huffman_tree)
print("Huffman Codes:", codes)

mean_code_length = sum(len(codes[char]) * freq for char, freq in frequencies.items()) / total_chars
print("Mean Code Length:", mean_code_length)

๐Ÿงช Step 10: Test Your Compressor

message = "ABC"
encoded = encode(message, codes)
print("Encoded:", encoded)
print("Decoded:", decode(encoded, codes))

๐Ÿ“Œ Bonus: Visualize the Huffman Tree (Optional)

This part is only in the complete code that you can find here: huffman.py