Implementing Huffman coding algorithm

Huffman coding is a compression technique used to reduce the number of bits required to send or store a message or string. It is based on the idea that frequently appearing characters will have shorter bit representation, and less frequent characters will have longer bit representation. If we consider the Huffman coding as a tree structure, the less frequent characters or items will be at the top part of the tree and more frequent items will be at the bottom of the tree or in the leaf. Huffman encoding relies a lot on the priority queue. Huffman encoding can be computed by first creating a tree of nodes.

Process to create a tree of nodes:

  1. We have to create a leaf node for each symbol and add it to the priority queue.
  2. While there is more than one node in the queue, do the following:

1. Remove the node of highest priority (lowest probability/frequency) twice to get two nodes.

2. Create a new internal node with these two nodes as children and with probability/frequency equal to the sum of the two nodes' probabilities/frequencies.

3. Add the new node to the queue.

  1. The remaining node is the root node, and the tree is complete.

Then, we have to traverse the constructed binary tree from root to leaves assigning and accumulating a "0" for one branch and a "1" for the other at each node. The accumulated zeros and ones at each leaf constitute a Huffman encoding for those symbols and weights. Here is an implementation of Huffman encoding algorithm using the SPL priority queue:

function huffmanEncode(array $symbols): array { 
$heap = new SplPriorityQueue;
$heap->setExtractFlags(SplPriorityQueue::EXTR_BOTH);
foreach ($symbols as $symbol => $weight) {
$heap->insert(array($symbol => ''), -$weight);
}

while ($heap->count() > 1) {
$low = $heap->extract();
$high = $heap->extract();
foreach ($low['data'] as &$x)
$x = '0' . $x;
foreach ($high['data'] as &$x)
$x = '1' . $x;
$heap->insert($low['data'] + $high['data'],
$low['priority'] + $high['priority']);
}
$result = $heap->extract();
return $result['data'];
}

Here, we are building a min heap for each of the symbols and using their weight to set the priority. Once the heap is constructed, we extract two nodes one after another and combining their data and priority to add them back to the heap. This continues unless only one node exists, which is the root node. Now, let's run the following code to generate the Huffman code:

$txt = 'PHP 7 Data structures and Algorithms'; 
$symbols = array_count_values(str_split($txt));
$codes = huffmanEncode($symbols);

echo "Symbol Weight Huffman Code ";
foreach ($codes as $sym => $code) {
echo "$sym $symbols[$sym] $code ";
}

Here, we are using str_split to break the string into an array and then using array count values to convert it into an associative array where the character will be the key and its number of appearance in the string will be the value. The preceding code will produce the following output:

Symbol          Weight          Huffman Code
i 1 00000
D 1 00001
d 1 00010
A 1 00011
t 4 001
H 1 01000
m 1 01001
P 2 0101
g 1 01100
o 1 01101
e 1 01110
n 1 01111
7 1 10000
l 1 10001
u 2 1001
5 101
h 1 11000
c 1 11001
a 3 1101
r 3 1110
s 3 1111

There are many other practical usages of greedy algorithms. We will solve a job-scheduling problem with greedy algorithms. Let's consider an example of a team of agile software developers who are working in a two-week iteration or sprint. They have some user stories to complete with some deadlines (by date) for the tasks and velocity (size of the story) attached to the story. The target for the team is to gain maximum velocity for the sprint within the given deadline. Let's consider the following tasks with deadline and velocity:

Index

1

2

3

4

5

6

Story

S1

S2

S3

S4

S5

S6

Deadline

2

1

2

1

3

4

Velocity

95

32

47

42

28

64

As we can see from the preceding table, we have six user stories, and they have four different deadlines from 1 to 4. We have to finish the user story S2 or S4 for slot 1 since the deadline for the task is 1. The same goes for story S1 and S3, and they have to be finished on or before slot 2. However, since we have S3 and the velocity of S3 is bigger than S2 and S4, S3 will be chosen for slot 1 by the greedy approach. Let's write the greedy code for our velocity calculation:

function velocityMagnifier(array $jobs) { 

$n = count($jobs);

usort($jobs, function($opt1, $opt2) {
return $opt1['velocity'] < $opt2['velocity'];
});

$dMax = max(array_column($jobs, "deadline"));

$slot = array_fill(1, $dMax, -1);
$filledTimeSlot = 0;

for ($i = 0; $i < $n; $i++) {
$k = min($dMax, $jobs[$i]['deadline']);
while ($k >= 1) {
if ($slot[$k] == -1) {
$slot[$k] = $i;
$filledTimeSlot++;
break;
}
$k--;
}

if ($filledTimeSlot == $dMax) {
break;
}
}

echo("Stories to Complete: ");
for ($i = 1; $i <= $dMax; $i++) {
echo $jobs[$slot[$i]]['id'];

if ($i < $dMax) {
echo " ";
}
}

$maxVelocity = 0;
for ($i = 1; $i <= $dMax; $i++) {
$maxVelocity += $jobs[$slot[$i]]['velocity'];
}
echo " Max Velocity: " . $maxVelocity;
}

Here, we are getting the list of jobs (user story ID, deadline, and velocity) that we will use to find the maximum velocity and their respective user story ID. First, we sort the jobs array with custom user sort function usort and sort the array in the descending order based on their velocity. After that, we calculate the maximum number of slots available from the deadline column. We are then initializing the slot array to -1 to keep a flag of used slots. The next code block is to traverse through each of the user stories and find a proper slot for the user story. If the available timeslots are filled, we don't continue further. Now, let's run this code using the following code block:

$jobs = [ 
["id" => "S1", "deadline" => 2, "velocity" => 95],
["id" => "S2", "deadline" => 1, "velocity" => 32],
["id" => "S3", "deadline" => 2, "velocity" => 47],
["id" => "S4", "deadline" => 1, "velocity" => 42],
["id" => "S5", "deadline" => 3, "velocity" => 28],
["id" => "S6", "deadline" => 4, "velocity" => 64]
];

velocityMagnifier($jobs);

This will produce the following output in command line:

Stories to Complete: S3    S1    S5    S6
Max Velocity: 234

Greedy algorithms can be helpful in solving locally optimized problems such as job scheduling, network traffic control, graph algorithm, among other things. However, to get a globally optimized solution, we need to focus on another aspect of algorithms, which is known as dynamic programming.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset