Analysis and Design of Algorithms Programming Assignment 2
Huffman's Algorithm
In this part, it is required to implement Huffman's algorithm that we discussed. Your implementation should allow compressing and decompressing arbitrary files. As discussed in class, the implementation should collect statistics from the input file first, then apply the compression algorithm. Note that you will need to store a representation of the codewords in the compressed file, so that you can decompress the file back. Your program should have the capability of considering more than one byte. For example, instead of just collecting the frequencies and finding codewords for single bytes. The same can be done assuming the basic unit is n bytes, where n is an integer.
Specifications
-
You will submit a single runnable jar that will be used for both compression and decompression. Your jar should be named as huffman.jar. The jar must include the source code files.
-
To use it for compressing an input file, the following will be called: java -jar huffman.jar c absolute_path_to_input_file n
-
c means compressing the file.
-
n is the number of bytes that will be considered together.
-
To use it for decompressing an input file, the following be called: java -jar huffman.jar d absolute_path_to_input_file
-
If the user chooses to compress a file with the name abc.exe, the compressed file should have the name compressed..abc.exe.hc , and should be replaced by n (the number of bytes per group). The compressed file should appear in the same directory of the input file. The program should print the compression ratio and the compression time.
-
If the user chooses to decompress a file with name abc.exe.hc, the output file should be named extracted.abc.exe. This should appear in the same directory of the input file. The program should print the decompression time in this case.
Analysis Requirements
- Submit a report that shows the compression ratio when running your implementation on the following files for different values of n = 1, 2, 3, 4, 5.
- File 1: This file is from the NIH genetic sequence database. Note that the file is already compressed. Extract the file first, then run the experiments on the uncompressed file (gbbct10.seq). Note that the experiments here could take time, as the file is large. You should test your program and make sure of correctness on small inputs first. Also, try to implement your code efficiently to save time.
- Compare the compression ratio you got in both cases with the compression ratio of 7-zip: https://www.7-zip.org/. Try to explain the observations. Note: The compression ratio can be defined using multiple ways. You can calculate it in the same way as 7-zip does, which is the size of the compressed file divided by the size of the original file.