Skip to content

Instantly share code, notes, and snippets.

@psiborg
Last active January 7, 2025 02:59
Show Gist options
  • Save psiborg/a881ee799eb04e182378004fbe38892c to your computer and use it in GitHub Desktop.
Save psiborg/a881ee799eb04e182378004fbe38892c to your computer and use it in GitHub Desktop.
Huffman compression

Huffman Coding (Variable-Length Encoding)

Huffman coding is a lossless data compression algorithm that assigns shorter binary codes to more frequent characters and longer codes to less frequent characters. This minimizes the overall size of encoded data while ensuring perfect decompression.

How Huffman Coding Works

  1. Analyze Character Frequencies
    • Count how often each character appears in the text.
  2. Build a Huffman Tree
    • Create a binary tree where characters with lower frequency are placed deeper.
  3. Generate Unique Binary Codes
    • Assign shorter binary codes to more common characters.
  4. Encode the Text
    • Replace each character with its binary code.
  5. Decode the Text
    • Use the Huffman tree to reconstruct the original text.

Example

Given the text: ABRACADABRA

Step 1: Character Frequencies A: 5, B: 2, R: 2, C: 1, D: 1

Step 2: Build Huffman Tree The algorithm constructs a binary tree where frequently occurring characters get shorter codes.

Step 3: Assign Huffman Codes A → 0 B → 10 R → 110 C → 1110 D → 1111

Step 4: Encode the Text Original: ABRACADABRA Compressed: 010110011101111010110

Since no code is a prefix of another, decoding is straightforward.

Why is Huffman Coding Good?

  • It significantly reduces file sizes (especially with repetitive text).
  • It's lossless, meaning the original data can be fully reconstructed.

Usage

Generate a sample text file.

utf8_sample.py
Generated file: ./utf8_sample.txt, Size: 2.30 MB

Compress the text file.

huffman_compress.py utf8_sample.txt utf8_sample.compressed
Original File Size: 2358.25 KB
Compressed File Size: 1287.76 KB
Compression Ratio: 54.61%
Space Saved: 1070.49 KB
Compression successful: 'utf8_sample.txt' → 'utf8_sample.compressed'

Decompress the compressed file.

huffman_decompress.py utf8_sample.compressed utf8_sample.decompressed
Decompression successful: 'utf8_sample.compressed' → 'utf8_sample.decompressed'

Compare the original text file with the the decompressed file.

diff utf8_sample.txt utf8_sample.decompressed

If there's no difference, the output will be blank.

#!/usr/bin/env python3
import heapq
class HuffmanNode:
"""
Represents a node in the Huffman Tree.
Each node contains a character, its frequency, and pointers to left and right child nodes.
"""
def __init__(self, char, freq):
self.char = char # Character stored in the node (None for internal nodes)
self.freq = freq # Frequency of the character
self.left = None # Left child node
self.right = None # Right child node
def __lt__(self, other):
"""
Defines the comparison operator for priority queue (heap).
Nodes with lower frequency are given higher priority.
"""
return self.freq < other.freq
def build_frequency_table(text):
"""
Creates a frequency table from the input text.
:param text: Input string
:return: Dictionary mapping characters to their frequency counts
"""
frequency = {}
for char in text:
frequency[char] = frequency.get(char, 0) + 1
return frequency
def build_huffman_tree(frequency):
"""
Builds the Huffman Tree from the frequency table.
:param frequency: Dictionary mapping characters to their frequency counts
:return: Root node of the Huffman Tree
"""
heap = [HuffmanNode(char, freq) for char, freq in frequency.items()]
heapq.heapify(heap) # Convert list into a min-heap
while len(heap) > 1:
left = heapq.heappop(heap) # Node with smallest frequency
right = heapq.heappop(heap) # Node with second smallest frequency
merged = HuffmanNode(None, left.freq + right.freq) # Create internal node
merged.left = left
merged.right = right
heapq.heappush(heap, merged) # Push merged node back into the heap
return heap[0] # Root of the Huffman Tree
def generate_huffman_codes(node, prefix="", huffman_codes={}):
"""
Recursively generates Huffman codes for each character.
:param node: Current node in the Huffman Tree
:param prefix: Accumulated binary string (default: empty)
:param huffman_codes: Dictionary to store Huffman codes
:return: Dictionary mapping characters to their Huffman codes
"""
if node:
if node.char is not None: # Leaf node (actual character)
huffman_codes[node.char] = prefix
generate_huffman_codes(node.left, prefix + "0", huffman_codes)
generate_huffman_codes(node.right, prefix + "1", huffman_codes)
return huffman_codes
def compress_text(text, huffman_codes):
"""
Encodes the input text using Huffman codes and converts it into a byte array.
:param text: Input string
:param huffman_codes: Dictionary mapping characters to Huffman codes
:return: Compressed byte array
"""
binary_string = ''.join(huffman_codes[char] for char in text) # Encode text into binary
padding = 8 - (len(binary_string) % 8) # Calculate padding needed to make it byte-aligned
binary_string += "0" * padding # Append padding bits
padded_info = f"{padding:08b}" # Store padding information as an 8-bit binary string
binary_data = padded_info + binary_string
byte_array = bytearray(int(binary_data[i:i+8], 2) for i in range(0, len(binary_data), 8)) # Convert to bytes
return byte_array
def decode_binary_data(binary_data, huffman_tree):
"""
Decodes the binary data using the Huffman Tree.
:param binary_data: Encoded binary string (with padding info included)
:param huffman_tree: Root node of the Huffman Tree
:return: Decoded text
"""
padding = int(binary_data[:8], 2) # Extract padding info from the first 8 bits
binary_data = binary_data[8:-padding] # Remove padding bits
decoded_text = []
node = huffman_tree # Start from the root of the tree
for bit in binary_data:
node = node.left if bit == "0" else node.right # Traverse the tree
if node.char is not None: # Leaf node reached (character found)
decoded_text.append(node.char)
node = huffman_tree # Reset to root for the next character
return ''.join(decoded_text)
#!/usr/bin/env python3
import os
import pickle
import argparse
import huffman
def huffman_compress(input_file, output_file):
with open(input_file, 'r', encoding='utf-8') as file:
text = file.read()
frequency = huffman.build_frequency_table(text)
huffman_tree = huffman.build_huffman_tree(frequency)
huffman_codes = huffman.generate_huffman_codes(huffman_tree)
compressed_data = huffman.compress_text(text, huffman_codes)
# Save compressed data to the output file
with open(output_file, 'wb') as output:
pickle.dump((compressed_data, huffman_tree), output)
# File size information
original_size = os.path.getsize(input_file)
compressed_size = len(compressed_data)
# Compression ratio calculation
compression_ratio = (compressed_size / original_size) * 100
space_saved = original_size - compressed_size
# Display summary of savings
print(f"Original File Size: {original_size / 1024:.2f} KB")
print(f"Compressed File Size: {compressed_size / 1024:.2f} KB")
print(f"Compression Ratio: {compression_ratio:.2f}%")
print(f"Space Saved: {space_saved / 1024:.2f} KB")
print(f"Compression successful: '{input_file}' → '{output_file}'")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Compress a text file using Huffman coding.")
parser.add_argument("input_file", help="The text file to compress.")
parser.add_argument("output_file", nargs="?", default="compressed.huff", help="Output compressed file (default: compressed.huff).")
args = parser.parse_args()
huffman_compress(args.input_file, args.output_file)
#!/usr/bin/env python3
import pickle
import argparse
import huffman
def decompress_huffman(input_file, output_file):
with open(input_file, 'rb') as file:
compressed_data, huffman_tree = pickle.load(file)
binary_data = ''.join(f"{byte:08b}" for byte in compressed_data)
original_text = huffman.decode_binary_data(binary_data, huffman_tree)
with open(output_file, 'w', encoding='utf-8') as file:
file.write(original_text)
print(f"Decompression successful: '{input_file}' → '{output_file}'")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Decompress a Huffman compressed file.")
parser.add_argument("input_file", help="The compressed file to decompress.")
parser.add_argument("output_file", nargs="?", default="decompressed.txt", help="Output decompressed text file (default: decompressed.txt).")
args = parser.parse_args()
decompress_huffman(args.input_file, args.output_file)
#!/usr/bin/env python3
import os
import sys
# Define a range of Unicode characters to include in the sample file
unicode_blocks = [
(0x0020, 0x007F), # Basic Latin
(0x00A0, 0x00FF), # Latin-1 Supplement
(0x0100, 0x017F), # Latin Extended-A
(0x0370, 0x03FF), # Greek and Coptic
(0x0400, 0x04FF), # Cyrillic
(0x2200, 0x22FF), # Mathematical Operators
(0x1F600, 0x1F64F), # Emoticons (Emoji)
]
# Collect characters from the defined blocks
characters = [chr(code) for block in unicode_blocks for code in range(block[0], block[1] + 1)]
# Default size is 1MB if no argument is provided
default_size_mb = 1
def parse_size_arg(arg):
try:
size_mb = int(arg.rstrip('MBmb'))
return max(size_mb, 1) # Ensure at least 1MB
except ValueError:
return default_size_mb
# Get file size from command line argument, default to 1MB
size_mb = parse_size_arg(sys.argv[1]) if len(sys.argv) > 1 else default_size_mb
target_size = size_mb * 1024 * 1024 # Convert MB to bytes
# Create a sample text by repeating the characters until reaching the target size
sample_text = ''.join(characters)
sample_text = (sample_text * (target_size // len(sample_text) + 1))[:target_size]
# Save to a file
file_path = "./utf8_sample.txt"
with open(file_path, "w", encoding="utf-8") as f:
f.write(sample_text)
# Output file size
file_size = os.path.getsize(file_path)
print(f"Generated file: {file_path}, Size: {file_size / (1024 * 1024):.2f} MB")
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſͰͱͲͳʹ͵Ͷͷ͸͹ͺͻͼͽ;Ϳ΀΁΂΃΄΅Ά·ΈΉΊ΋Ό΍ΎΏΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡ΢ΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώϏϐϑϒϓϔϕϖϗϘϙϚϛϜϝϞϟϠϡϢϣϤϥϦϧϨϩϪϫϬϭϮϯϰϱϲϳϴϵ϶ϷϸϹϺϻϼϽϾϿЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяѐёђѓєѕіїјљњћќѝўџѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿҀҁ҂҃҄҅҆҇҈҉ҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿӀӁӂӃӄӅӆӇӈӉӊӋӌӍӎӏӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱӲӳӴӵӶӷӸӹӺӻӼӽӾӿ∀∁∂∃∄∅∆∇∈∉∊∋∌∍∎∏∐∑−∓∔∕∖∗∘∙√∛∜∝∞∟∠∡∢∣∤∥∦∧∨∩∪∫∬∭∮∯∰∱∲∳∴∵∶∷∸∹∺∻∼∽∾∿≀≁≂≃≄≅≆≇≈≉≊≋≌≍≎≏≐≑≒≓≔≕≖≗≘≙≚≛≜≝≞≟≠≡≢≣≤≥≦≧≨≩≪≫≬≭≮≯≰≱≲≳≴≵≶≷≸≹≺≻≼≽≾≿⊀⊁⊂⊃⊄⊅⊆⊇⊈⊉⊊⊋⊌⊍⊎⊏⊐⊑⊒⊓⊔⊕⊖⊗⊘⊙⊚⊛⊜⊝⊞⊟⊠⊡⊢⊣⊤⊥⊦⊧⊨⊩⊪⊫⊬⊭⊮⊯⊰⊱⊲⊳⊴⊵⊶⊷⊸⊹⊺⊻⊼⊽⊾⊿⋀⋁⋂⋃⋄⋅⋆⋇⋈⋉⋊⋋⋌⋍⋎⋏⋐⋑⋒⋓⋔⋕⋖⋗⋘⋙⋚⋛⋜⋝⋞⋟⋠⋡⋢⋣⋤⋥⋦⋧⋨⋩⋪⋫⋬⋭⋮⋯⋰⋱⋲⋳⋴⋵⋶⋷⋸⋹⋺⋻⋼⋽⋾⋿😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃🙄🙅🙆🙇🙈🙉🙊🙋🙌🙍🙎🙏
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſͰͱͲͳʹ͵Ͷͷ͸͹ͺͻͼͽ;Ϳ΀΁΂΃΄΅Ά·ΈΉΊ΋Ό΍ΎΏΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡ΢ΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώϏϐϑϒϓϔϕϖϗϘϙϚϛϜϝϞϟϠϡϢϣϤϥϦϧϨϩϪϫϬϭϮϯϰϱϲϳϴϵ϶ϷϸϹϺϻϼϽϾϿЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяѐёђѓєѕіїјљњћќѝўџѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿҀҁ҂҃҄҅҆҇҈҉ҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿӀӁӂӃӄӅӆӇӈӉӊӋӌӍӎӏӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱӲӳӴӵӶӷӸӹӺӻӼӽӾӿ∀∁∂∃∄∅∆∇∈∉∊∋∌∍∎∏∐∑−∓∔∕∖∗∘∙√∛∜∝∞∟∠∡∢∣∤∥∦∧∨∩∪∫∬∭∮∯∰∱∲∳∴∵∶∷∸∹∺∻∼∽∾∿≀≁≂≃≄≅≆≇≈≉≊≋≌≍≎≏≐≑≒≓≔≕≖≗≘≙≚≛≜≝≞≟≠≡≢≣≤≥≦≧≨≩≪≫≬≭≮≯≰≱≲≳≴≵≶≷≸≹≺≻≼≽≾≿⊀⊁⊂⊃⊄⊅⊆⊇⊈⊉⊊⊋⊌⊍⊎⊏⊐⊑⊒⊓⊔⊕⊖⊗⊘⊙⊚⊛⊜⊝⊞⊟⊠⊡⊢⊣⊤⊥⊦⊧⊨⊩⊪⊫⊬⊭⊮⊯⊰⊱⊲⊳⊴⊵⊶⊷⊸⊹⊺⊻⊼⊽⊾⊿⋀⋁⋂⋃⋄⋅⋆⋇⋈⋉⋊⋋⋌⋍⋎⋏⋐⋑⋒⋓⋔⋕⋖⋗⋘⋙⋚⋛⋜⋝⋞⋟⋠⋡⋢⋣⋤⋥⋦⋧⋨⋩⋪⋫⋬⋭⋮⋯⋰⋱⋲⋳⋴⋵⋶⋷⋸⋹⋺⋻⋼⋽⋾⋿😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃🙄🙅🙆🙇🙈🙉🙊🙋🙌🙍🙎🙏
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſͰͱͲͳʹ͵Ͷͷ͸͹ͺͻͼͽ;Ϳ΀΁΂΃΄΅Ά·ΈΉΊ΋Ό΍ΎΏΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡ΢ΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώϏϐϑϒϓϔϕϖϗϘϙϚϛϜϝϞϟϠϡϢϣϤϥϦϧϨϩϪϫϬϭϮϯϰϱϲϳϴϵ϶ϷϸϹϺϻϼϽϾϿЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяѐёђѓєѕіїјљњћќѝўџѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿҀҁ҂҃҄҅҆҇҈҉ҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿӀӁӂӃӄӅӆӇӈӉӊӋӌӍӎӏӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱӲӳӴӵӶӷӸӹӺӻӼӽӾӿ∀∁∂∃∄∅∆∇∈∉∊∋∌∍∎∏∐∑−∓∔∕∖∗∘∙√∛∜∝∞∟∠∡∢∣∤∥∦∧∨∩∪∫∬∭∮∯∰∱∲∳∴∵∶∷∸∹∺∻∼∽∾∿≀≁≂≃≄≅≆≇≈≉≊≋≌≍≎≏≐≑≒≓≔≕≖≗≘≙≚≛≜≝≞≟≠≡≢≣≤≥≦≧≨≩≪≫≬≭≮≯≰≱≲≳≴≵≶≷≸≹≺≻≼≽≾≿⊀⊁⊂⊃⊄⊅⊆⊇⊈⊉⊊⊋⊌⊍⊎⊏⊐⊑⊒⊓⊔⊕⊖⊗⊘⊙⊚⊛⊜⊝⊞⊟⊠⊡⊢⊣⊤⊥⊦⊧⊨⊩⊪⊫⊬⊭⊮⊯⊰⊱⊲⊳⊴⊵⊶⊷⊸⊹⊺⊻⊼⊽⊾⊿⋀⋁⋂⋃⋄⋅⋆⋇⋈⋉⋊⋋⋌⋍⋎⋏⋐⋑⋒⋓⋔⋕⋖⋗⋘⋙⋚⋛⋜⋝⋞⋟⋠⋡⋢⋣⋤⋥⋦⋧⋨⋩⋪⋫⋬⋭⋮⋯⋰⋱⋲⋳⋴⋵⋶⋷⋸⋹⋺⋻⋼⋽⋾⋿😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃🙄🙅🙆🙇🙈🙉🙊🙋🙌🙍🙎🙏
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment