Step-by-Step Tutorial: Shannon-Fano Coding¶
Note that this tutorial has been created by ChatGPT after prompting it to generate a step-by-step implementation guide for the given implementation of fano.py which is also presented in the lecture. This tutorial might help to better understand the code and is just an additional learning material.
General information¶
Shannon-Fano coding is a lossless data compression algorithm that assigns binary codes to characters based on their frequencies in a given dataset. It works by dividing characters into groups with similar cumulative probabilities and recursively assigning binary codes to those groups. Below is a step-by-step guide to understanding and implementing Shannon-Fano coding, culminating in the provided Python code.
Step 1: Introduction to Shannon-Fano Coding¶
Shannon-Fano coding is based on the principle of assigning shorter codes to more frequent characters and longer codes to less frequent ones. It is an entropy-based coding technique, meaning that the binary codes generated aim to minimize the average number of bits needed to encode a symbol.
Step 2: How Shannon-Fano Works¶
- Calculate Frequencies: Count how many times each character occurs in the dataset.
- Sort Characters by Frequency: Arrange characters in descending order of their occurrence frequencies.
- Split into Groups: Divide the sorted list into two groups such that the total frequency of each group is as close as possible.
- Assign Binary Digits: Assign a
0
to characters in one group and a1
to characters in the other. - Repeat Recursively: For each group, repeat the process until each character is uniquely assigned a binary code.
Step 3: Breaking Down the Code¶
Imports and Dependencies¶
from collections import Counter
from itertools import islice
import math
import os
import requests
- Counter: Used to count occurrences of each character efficiently.
- math: For calculating entropy and logarithmic operations.
- os & requests: To handle file operations and download data from the internet.
Step 4: Node Class for the Shannon-Fano Tree¶
The ShannonFanoNode
class represents a node in the binary tree used for encoding. Each node stores:
- A character (
char
) and its frequency (freq
). - References to its left and right children (
left
,right
).
The class also includes a display
method to visualize the tree structure.
Step 5: Splitting Frequencies¶
The function split_frequencies
divides the sorted character frequencies into two groups with similar cumulative frequencies. This ensures balanced binary codes.
def split_frequencies(frequencies):
total_frequency = sum(frequencies.values())
cumulative_frequency = 0
split_index = 0
index = 0
for char, freq in (frequencies.items()):
cumulative_frequency += freq
index += 1
if cumulative_frequency >= total_frequency / 2:
split_index = index
break
left_symbols = dict(list(frequencies.items())[:split_index])
right_symbols = dict(list(frequencies.items())[split_index:])
return total_frequency, left_symbols, right_symbols
Step 6: Building the Shannon-Fano Tree¶
The recursive function build_tree
constructs the tree by splitting the frequencies and assigning the left and right children to a parent node.
def build_tree(frequencies):
if len(frequencies) == 1:
temp_list = list(frequencies.items())
return ShannonFanoNode(temp_list[0][0], temp_list[0][1])
total_frequency, left_symbols, right_symbols = split_frequencies(frequencies)
node = ShannonFanoNode(None, total_frequency)
node.left = build_tree(left_symbols)
node.right = build_tree(right_symbols)
return node
Step 7: Generating Binary Codes¶
The generate_codes
function traverses the binary tree and assigns a binary prefix (0
or 1
) to generate codes for each character.
def generate_codes(node, prefix="", codebook={}):
if node is not None:
if node.char is not None:
codebook[node.char] = prefix
generate_codes(node.left, prefix + "0", codebook)
generate_codes(node.right, prefix + "1", codebook)
return codebook
Step 8: Calculating Character Frequencies¶
The function get_character_counts
uses Python's Counter
to count the occurrences of each character in the text.
def get_character_counts(text):
return Counter(text)
Step 9: Putting It All Together¶
Here’s how the script works:
- Download the Dataset: The script downloads a sample text file (the Tiny Shakespeare dataset) if it doesn't already exist.
- Preprocess the Text: It removes special characters, whitespace, and converts text to uppercase.
- Calculate Frequencies: The character frequencies are computed.
- Sort Frequencies: Frequencies are sorted in descending order.
- Build the Tree: The Shannon-Fano tree is constructed from the sorted frequencies.
- Generate Codes: Binary codes are generated for each character.
- Calculate Entropy and Mean Code Length: The script calculates the entropy (theoretical minimum average code length) and compares it to the mean code length of the generated codes.
Step 10: Complete Python Code¶
Here is the full implementation of Shannon-Fano coding:
Step 11: Conclusion¶
Shannon-Fano coding is a foundational concept in data compression. While it has been largely replaced by Huffman coding in practical applications, studying Shannon-Fano provides valuable insights into entropy-based compression techniques. This tutorial walked you through implementing Shannon-Fano coding from scratch and provided the tools to understand its inner workings.