
COMPUTING PRACTICES

Edgar H. Sibley
Panel Editor

Both static and dynamic Huffman coding techniques are applied to test data
consisting of 530 source programs in four different languages. The results
indicate that, for small files, a savings of 22-91 percent in compression can
be achieved by using the static instead of dynamic techniques.

DATA COMPRESSION USING STATIC
HUFFMAN CODE-DECODE TABLES

DAVID R. MCINTYRE and MICHAEL A. PECHURA

In the past few years, the proliferation of software and the increasing use of word processors for text preparation have resulted in an increasing demand for larger and faster secondary-storage devices. In addition, the growing use of microcomputers has forced users to make do with limited amounts of direct-access storage. Thus, the age-old need for more storage capacity, both for the archival storage of files on removable volumes and for the storage of executable programs or data files on line, is still with us.

More storage capacity, in terms of additional hardware, may not be cost justifiable in an interactive system that stores data on on-line direct-access devices. An alternative is text-compression techniques, which can often save significant storage space [4] and in addition greatly increase the effective capacity of the communications channel on which the data are transmitted.

Pioneering work in the construction of minimum-redundancy codes was developed by Huffman in 1952 [2]. Because of its simplicity, Huffman coding has been implemented on small systems with encouraging results [3]. To compress a file using the standard Huffman encoding-decoding technique:

1. Read the file (text) to determine the frequency of each character.

2. Construct the Huffman encode-decode tables by assigning variable-length codes to each distinct character so as to minimize the size of the original file. Generally, this results in the assignment of short codes to characters that occur most frequently.

3. Reread the file, substituting the assigned codes for the original characters. The Huffman decode table plus the compressed file are written to secondary storage.

4. At any future time, the file may be reconstructed using the stored prefixed Huffman decode table and the encoded file.
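The sketch below, added here for illustration, walks through steps 1-3 in Python; it is not the authors' implementation (their experiments used the PL/I Optimizing Compiler on an IBM 370/158), and the one-line sample text is a made-up stand-in.

import heapq
from collections import Counter

def build_huffman_code(freq):
    # Build {character: bit string} from a {character: count} table (step 2).
    heap = [(count, i, ch) for i, (ch, count) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate single-character file
        return {heap[0][2]: "0"}
    tie = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)    # the two least frequent subtrees
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, tie, (left, right)))
        tie += 1
    code = {}
    def assign(node, prefix):
        if isinstance(node, tuple):
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:
            code[node] = prefix
    assign(heap[0][2], "")
    return code

text = "       IDENTIFICATION DIVISION.     "   # stand-in for one source record
freq = Counter(text)                             # step 1: character frequencies
code = build_huffman_code(freq)                  # step 2: encode table
bits = "".join(code[ch] for ch in text)          # step 3: substitute the codes
print(len(text) * 8, "bits before,", len(bits), "bits after")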

In the Huffman scheme, the encoding is tailored specifically to each file, and therefore a decode table (tree) must be prefixed to the encoding in storage. Depending upon the number of distinct characters in the file and the size of the compressed file, this additional overhead can be quite significant (Figure 1).

To overcome this storage drawback, it is possible to remove the individual decode tables, corresponding to a class of files, from in front of each compressed file and replace them with only one fixed static decode table for the entire set of compressed files. In this paper, we investigate the use of such fixed static decode tables for classes of COBOL, FORTRAN, Pascal, and PL/I source programs by comparing the results with those produced by the dynamic Huffman method.

THE DATA
The test data consisted of 530 student source programs: 265 COBOL, 169 FORTRAN, 78 Pascal, and 18 PL/I. Program size, including trailing blanks (fixed-length, 80-byte records), varied from 960 to 56,560 characters with an average size of 14,827 characters. Program sizes for the same set of programs, excluding trailing blanks, varied from 218 to 48,240 characters with an average size of approximately 9,635 characters.




The number of different characters in any program ranged from 24 to 58 with an average of 45.5.

THE EXPERIMENTS
The experiments were run on the IBM 370/158 under OS/VS1 release 7.0 with the PL/I Optimizing Compiler Version 1 release 3.0. Each of the 530 source programs was scanned, and the following statistics gathered for each source program:

- program type (language),
- table of character frequencies,
- total number of characters,
- number of distinct characters,
- number of trailing blanks, and
- encoding time.

The same statistics were generated for five general program categories consisting of

- the concatenation of all COBOL programs,
- the concatenation of all FORTRAN programs,
- the concatenation of all Pascal programs,
- the concatenation of all PL/I programs, and
- the concatenation of all programs.

Using the tables of character frequencies and number of trailing blanks for each of the five general program categories, 10 static Huffman encoding tables were generated corresponding to each of the five general program categories, with and without trailing blanks.

Files with and without trailing blanks were investigated to study the effectiveness of Huffman encodings both without and with simple preprocessing. As might be expected, static Huffman codes for general program categories of the same language, with and without trailing blanks, were identical. This is because blanks were by far the most frequent character for general programs without trailing blanks. Hereafter, therefore, we refer simply to the 5 static Huffman encoding tables.
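The article does not show how the class-wide tables were built in code; since each general program category is defined as the concatenation of its programs, its character-frequency table is simply the sum of the per-file tables. The sketch below, with hypothetical stand-in program texts, illustrates that step.

from collections import Counter

def class_frequency_table(file_texts):
    # Sum the per-file character counts of one program category
    # into a single class-wide frequency table.
    total = Counter()
    for text in file_texts:
        total += Counter(text)
    return total

# Hypothetical stand-ins for the 265 COBOL, 169 FORTRAN, ... source programs.
cobol_files = ["       IDENTIFICATION DIVISION.", "       PROCEDURE DIVISION."]
fortran_files = ["      DO 10 I = 1, N", "   10 CONTINUE"]

static_freqs = {
    "COB": class_frequency_table(cobol_files),
    "FOR": class_frequency_table(fortran_files),
    "ALL": class_frequency_table(cobol_files + fortran_files),
}
# One static Huffman code per category would now be built from these summed
# tables, exactly as a dynamic code is built from a single file's table.
print(static_freqs["ALL"].most_common(3))    # the blank dominates, as in the text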
Our principal interest in these experiments is to investigate and compare the use of static Huffman encoding (and decoding) tables with standardly used dynamic Huffman tables generated from the table of character frequencies of the actual source program.

Using the statistics gathered for each of the 530 source programs, we generated for each source program (both with and without trailing blanks)

- the size of the compressed code plus decode table using the dynamic Huffman code (DYN), and
- the size of each of the compressed codes using each of the five static Huffman codes (COB, FOR, PAS, PL/I, ALL).

The decode table requires 2N - 1 entries, where N is the number of distinct characters, and each entry consists of two pointer fields [3]. Two bytes of storage per field were used, yielding four bytes per entry.
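Reference [3] and the byte counts above fix the size of this table but not its exact record layout. The sketch below shows one plausible array form, 2N - 1 entries of two small fields each, together with the bit-by-bit decoding loop that walks it; it is an illustration, not necessarily the authors' format.

def build_decode_table(tree):
    # Flatten a Huffman tree of nested (left, right) pairs and leaf characters
    # into a list of (field0, field1) entries: an internal entry holds the two
    # child indices; a leaf entry holds -1 and the character code.
    table = []
    def add(node):
        index = len(table)
        table.append(None)                  # reserve this entry
        if isinstance(node, tuple):
            table[index] = (add(node[0]), add(node[1]))
        else:
            table[index] = (-1, ord(node))
        return index
    add(tree)
    return table

def decode(bits, table):
    out, i = [], 0
    for b in bits:
        left, right = table[i]
        i = left if b == "0" else right
        if table[i][0] == -1:               # reached a leaf
            out.append(chr(table[i][1]))
            i = 0                           # restart at the root
    return "".join(out)

# Tiny example: codes 'A' -> 0, 'B' -> 10, ' ' -> 11, so N = 3 characters.
table = build_decode_table(("A", ("B", " ")))
print(len(table))                           # 2 * 3 - 1 = 5 entries
print(decode("010110", table))              # "AB A"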

[FIGURE 1. Percent Overhead due to Storage of Decode Table Expressed as a Function of Compressed File Size. The vertical axis plots percentage of overhead, defined as (table size / compressed size) x 100; the horizontal axis plots compressed file size in bytes from 1K to 8K, with separate curves for 20, 30, 40, and 50 distinct source file characters.]
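The curves of Figure 1 can be recomputed from the quantities defined in the text, assuming the four-byte table entries described above; the exact values plotted in the original figure are not recoverable from this copy.

def overhead_percent(distinct_chars, compressed_bytes, bytes_per_entry=4):
    # Decode-table overhead as defined for Figure 1:
    # (table size / compressed file size) x 100.
    table_bytes = (2 * distinct_chars - 1) * bytes_per_entry
    return 100.0 * table_bytes / compressed_bytes

for n in (20, 30, 40, 50):
    row = [round(overhead_percent(n, k * 1024), 1) for k in (1, 2, 4, 8)]
    print(n, "distinct characters:", row, "percent at 1K, 2K, 4K, 8K")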



RAW RESULTS
In Tables I-III, we adopt the following abbreviations: COB for COBOL, FOR for FORTRAN, and PAS for Pascal. ALL as a table heading refers to the static Huffman table formed from the concatenation of all programs.

Table I lists the compression statistics achieved using static and dynamic tables for a randomly selected sample of 14 of the 530 programs. The variation in the percentage file size (from DYN) for the set of programs shown in Table I is given in Table II. The percentage size is defined to be (encoded size)/(original size) x 100.

Overall statistics on the difference between the results achieved using the dynamic and static tables are given in Table III. A sample of the encoding times (time to scan a card image source file to determine the character frequencies and the Huffman character codes) is given in Table IV.
TABLE Ia. File Size after Compression Expressed as a Percentage of the Original File Size for Files with Trailing Blanks (Random Sample of 14 Files)

Program  Size     Different   Compressed size (%)
type     (bytes)  characters  DYN   COB*  FOR*  PAS*  PL/I*  ALL*
COB      18,400   45          30.2  28.7  30.1  29.9  30.1   28.8
COB      17,680   50          29.7  28.9  30.0  30.2  29.9   29.1
FOR      27,520   45          24.0  23.7  23.1  23.5  23.4   23.2
COB      23,120   46          27.1  25.9  27.2  27.4  27.4   26.0
COB      20,960   46          26.6  25.3  26.2  26.0  26.2   25.2
COB      32,400   45          30.9  30.5  31.5  31.7  31.4   30.4
COB      56,560   46          26.1  26.0  26.7  26.6  26.7   25.9
PL/I     32,640   52          28.1  30.6  28.8  27.8  27.1   27.9
FOR      17,600   52          34.0  34.8  32.5  34.3  33.1   33.4
FOR      26,320   44          30.1  30.7  29.0  30.7  30.1   29.8
FOR      36,080   41          29.1  30.1  28.5  30.1  29.5   29.3
PL/I     32,560   50          27.9  30.6  28.7  27.7  27.0   27.9
PAS      40,560   53          27.1  28.1  27.8  26.2  26.8   26.7
PAS      37,840   51          29.9  31.5  30.7  29.0  29.8   29.7

* Static codes.
TABLE Ib. File Size after Compression Expressed as a Percentage of the Original File Size for Files in Table Ia without Trailing Blanks

Program  Size     Different   Compressed size (%)
type     (bytes)  characters  DYN   COB*  FOR*  PAS*  PL/I*  ALL*
COB       9,657   45          45.9  43.1  45.7  45.3  45.6   43.3
COB       7,830   50          51.1  49.2  51.6  52.2  51.5   49.6
FOR       6,618   45          57.7  58.5  55.8  57.5  57.2   56.4
COB      10,701   46          43.7  41.1  44.0  44.3  44.3   41.4
COB       8,712   46          46.0  42.8  45.0  44.7  45.1   42.8
COB      18,501   45          44.5  43.7  45.5  45.9  45.4   43.5
COB      24,651   46          43.3  43.0  44.7  45.0  44.7   42.9
PL/I     13,016   52          51.1  57.6  52.9  50.5  48.7   50.8
FOR       7,771   52          59.1  62.1  57.5  61.6  58.9   59.5
FOR       9,091   44          60.1  64.7  59.8  64.8  63.1   62.1
FOR      13,112   41          56.0  60.6  56.2  60.6  59.0   58.2
PL/I     12,943   50          51.0  57.6  52.9  50.5  48.6   50.7
PAS      16,557   53          47.8  50.4  49.5  45.6  47.1   47.0
PAS      17,494   51          49.9  53.3  51.6  47.9  49.6   49.3

* Static codes.

OBSERVATIONS AND ANALYSIS
A review of the compression statistics given in Table I shows that, for source files with trailing blanks, the compressed files using static and dynamic tables were all in the "30-percent-of-the-original-size" range; for files without trailing blanks, the results were in the "50-percent-of-the-original-size" range.

From a sample of programs given in Table II, it is clear that all static tables closely approximate the dynamic compression with a worst-case variation of only 2.7 percent and 6.6 percent, respectively, for files with and without trailing blanks. As expected, the best static table was almost always the same type language as the source program, although in a few cases ALL was better. Also, there was always a static table that compressed better than the dynamic table. The ALL table was almost always better than DYN, with a worst case of only 0.2 percent worse.

Some interesting statistics were obtained for the set of 530 source files using static and dynamic Huffman encodings:

- 62.5 percent (60.9 percent) of the time, all five static Huffman encodings yielded smaller compressed files than the dynamic Huffman encodings for files with and without trailing blanks, respectively;
- 98.7 percent (95.7 percent) of the time, static Huffman encoding of the same language type as the file to be compressed yielded smaller compressed files than the dynamic Huffman encoding for files with and without trailing blanks;
- 94.7 percent (89.2 percent) of the time, static Huffman encoding using all programs yielded smaller compressed files than the dynamic Huffman encoding for files with and without trailing blanks.

Table III shows that, on average, all five static Huffman codes produce more compression than the dynamic Huffman code: on average, 0-3 percent and 0-6 percent better, respectively, for files with and without trailing blanks. On average, the compression is 3 percent and 6 percent better using the static Huffman encoding of the same language type as the source file (with and without trailing blanks). On average, the compression is 3 percent and 5 percent better using the static Huffman ALL encoding of the source file, with and without trailing blanks, respectively. In the worst case, the compression produced by static codes is acceptably close (within 4 percent) to that produced by the dynamic codes.

One important advantage of static over dynamic Huffman codes is the significant savings in compressed file size for small files. The figures under MX (maximum) in Table III show a savings difference of 22 and 91, respectively, in percent compression for small files with and without trailing blanks. Such savings are not unexpected, due to the significance of the decode table, as presented in Figure 1.



TABLE IV. Encoding Times for Initial Scan and Huffman Code Generation on the IBM 370/158

File size  Distinct     Encoding     Time per card
(cards)    characters   time (sec)   (msec)
 12        28           0.08         6.67
 50        43           0.20         4.09
101        46           0.29         2.87
150        48           0.40         2.67
200        47           0.52         2.60
250        46           0.60         2.40
299        48           0.70         2.34
351        46           0.78         2.22
400        52           0.93         2.33
451        41           0.99         2.20
502        52           1.14         2.27
538        50           1.23         2.29
598        52           1.33         2.22
622        46           1.37         2.20
707        46           1.53         2.16

Table IV indicates that encoding time appears to grow linearly with the size of the source file. Owing to the time complexity of the Huffman algorithm, encoding time should theoretically grow linearly with the size of the source file plus the order of NDCHAR log NDCHAR (NDCHAR being the number of distinct characters). However, since the number of distinct characters for each source programming language is on average 45.5 (and, in fact, bounded by a constant), the theoretical encoding time is strictly linear.
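As an added illustration of the linearity observed in Table IV, the (cards, seconds) pairs from the table can be fitted with an ordinary least-squares line; the fit gives roughly 2 msec per card plus a small fixed overhead.

# (file size in cards, encoding time in seconds) taken from Table IV.
samples = [(12, 0.08), (50, 0.20), (101, 0.29), (150, 0.40), (200, 0.52),
           (250, 0.60), (299, 0.70), (351, 0.78), (400, 0.93), (451, 0.99),
           (502, 1.14), (538, 1.23), (598, 1.33), (622, 1.37), (707, 1.53)]

n = len(samples)
sx = sum(x for x, _ in samples)
sy = sum(y for _, y in samples)
sxx = sum(x * x for x, _ in samples)
sxy = sum(x * y for x, y in samples)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)      # seconds per card
intercept = (sy - slope * sx) / n                       # fixed per-file overhead
print(round(slope * 1000, 2), "msec per card,", round(intercept, 2), "sec overhead")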
There are several reasons why Huffman codes perform well in compressing computer source codes. First, frequent use of common keywords in languages like COBOL, FORTRAN, Pascal, and PL/I causes the frequencies of characters to vary widely, and Huffman codes are most effective when there is a small number of characters with widely varying frequencies [1]. Second, most computer systems use a coding scheme that allows for either 128 different characters (e.g., 7-bit ASCII) or 256 characters (e.g., 8-bit EBCDIC). In fact, most text or computer codes use a relatively small percentage of all possible characters (the number of distinct characters for each source language in our test averaged only 45.5).
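To make the point about widely varying frequencies concrete, the zero-order entropy of the character distribution, a lower bound on the average Huffman code length (against 8 bits per character uncompressed), can be computed; the two sample strings below are illustrative only.

import math
from collections import Counter

def entropy_bits_per_char(text):
    # Zero-order entropy: a lower bound on average Huffman code length.
    counts = Counter(text)
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

skewed = "AAAAAAAAAA AAAA  BBB C"     # few characters, very uneven counts
flat = "ABCDEFGHIJKLMNOPQRSTUV"       # same length, every character once
print(round(entropy_bits_per_char(skewed), 2))   # about 1.5 bits per character
print(round(entropy_bits_per_char(flat), 2))     # about 4.5 bits per character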
CONCLUSIONS
Using the static Huffman techniques we have described results in a compression of source programs that is often better than the tailored dynamic Huffman coding and decoding scheme by 0-3 percent and 0-6 percent, respectively, for files with and without trailing blanks. The compression produced by a static table of the same language type as the source file was almost always at least as good as that produced by the dynamic Huffman table (99 and 96 percent for files with and without trailing blanks, respectively). On average, compression produced by the static Huffman encoding of the same language type as the source file was 3-6 percent smaller than that produced by the dynamic Huffman encoding. The compression produced by the static ALL table, although not quite as good as the static table of the same type as the source file, was often better than that produced by the dynamic Huffman table (95 and 89 percent of the time for files with and without trailing blanks, respectively). On average, the compression produced by the static ALL table was 3-5 percent smaller than that produced by the dynamic Huffman code. For small files, a savings of 22 to 91 percent in the amount of compression can be gained by the use of static in place of dynamic Huffman techniques. Finally, it is important to remember that the file compression gained using the static tables is quite significant: The compressed files for files with and without trailing blanks were uniformly within 30 and 50 percent of the original file size, respectively.

The static Huffman encode-decode scheme can be implemented in two ways: by storing a static encode and decode table globally for each source language; or by storing one static encode and decode table (our ALL table), which is determined by sampling a reasonably large number of programs from each type of source language. Either technique appears to yield better compression than the best dynamic Huffman encoding with decode table.
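Neither deployment option is shown in code in the article; the sketch below illustrates what either amounts to at compression time: look up a prebuilt table (per-language, or the single ALL table) and encode in one pass, with no initial frequency scan and no decode table prefixed to the stored file. The tiny code tables are hypothetical stand-ins, not the actual COB/FOR/PAS/PL/I/ALL tables.

# Hypothetical stand-in tables; a real static table covers every character
# that can occur in the class of files.
STATIC_CODES = {
    "COB": {" ": "0", "I": "10", "D": "110", ".": "111"},
    "ALL": {" ": "0", "E": "10", "I": "110", "T": "111"},
}

def compress_with_static(text, language=None):
    # Option 1: one table per source language; option 2: fall back to ALL.
    code = STATIC_CODES.get(language, STATIC_CODES["ALL"])
    return "".join(code[ch] for ch in text)      # single pass, no initial scan

print(compress_with_static("I DID ID.", "COB"))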
Another advantage of static as opposed to dynamic Huffman codes lies in time savings. Since the source code need not be scanned initially to determine character frequencies, and the Huffman procedure is not called to determine the encode and decode tables, the percentage time to encode is more than halved. Although this encoding time for static Huffman codes is small (approximately 1 sec. on the IBM 370/158 for a file of 500 cards), it must be expended each time a file is stored. On a microcomputer, of course, this encoding time would be several times longer.

In terms of data integrity, backups for static Huffman encode and decode tables are easily maintained. In this way, data perturbation is confined to the encoded file, where it may affect several bytes in the file but will not destroy the entire file. When using dynamic Huffman tables, the decode table must be stored with the file and is susceptible to data perturbation with possibly drastic losses of information.

REFERENCES
1. Gotlieb, C.C., and Gotlieb, L.R. Data Types and Structures. Prentice-Hall, Englewood Cliffs, N.J., 1978, pp. 86-90. Good, brief introduction to data compression.
2. Huffman, D.A. A method for the construction of minimum-redundancy codes. Proc. IRE 40, 9 (Sept. 1952), 1098-1101. The original paper describing the Huffman coding scheme along with an explanation of how it leads to minimum variable-length codes.
3. Pechura, M. File archival techniques using data compression. Commun. ACM 25, 9 (Sept. 1982), 605-609. Discusses Huffman data-compression techniques for small computer systems.
4. Snyderman, M., and Hunt, B. The myriad virtues of text compaction. Datamation 16, 16 (Dec. 1970), 36-40. Describes a method of data compression for EBCDIC-coded files in which certain pairs of characters are encoded into a single byte.

CR Categories and Subject Descriptors: E.4 [Coding and Information Theory]: data compaction

General Terms: Algorithms, Performance
Additional Key Words and Phrases: coding theory, file compression, Huffman coding, text compaction

Received 9/84; accepted 1/85

Authors' Present Address: David R. McIntyre and Michael A. Pechura, Dept. of Computer and Information Science, Cleveland State University, University Center 482, Cleveland, OH 44115.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.



TABLE IIa. Variation in Percent Compression for Files in Table Ia with Trailing Blanks

                              Dynamic - Static
Program  Size     Different   Best case          Worst case
type     (bytes)  characters  Variation  Type    Variation  Type        ALL
COB      18,400   45           1.5       COB      0.1       FOR, PL/I    1.4
COB      17,680   50           0.8       COB     -0.5       PAS          0.6
FOR      27,520   45           0.9       FOR      0.3       COB          0.8
COB      23,120   46           1.2       COB     -0.3       PL/I, PAS    1.1
COB      20,960   46           1.4       ALL      0.4       PL/I         1.4
COB      32,400   45           0.5       ALL     -0.8       PAS          0.5
COB      56,560   46           0.2       ALL     -0.7       PAS          0.2
PL/I     32,640   52           1.0       PL/I    -2.5       COB          0.2
FOR      17,600   52           1.5       FOR     -0.6       COB          0.6
FOR      26,320   44           1.1       FOR     -0.6       COB, PAS     0.3
FOR      36,080   41           0.6       FOR     -1.0       COB         -0.2
PL/I     32,560   50           0.9       PL/I    -2.7       COB          0.0
PAS      40,560   53           0.9       PAS     -1.0       COB          0.4
PAS      37,840   51           0.9       PAS     -1.6       COB          0.2

TABLE IIb. Variation in Percent Compression for Files in Table Ib without Trailing Blanks

                              Dynamic - Static
Program  Size     Different   Best case           Worst case
type     (bytes)  characters  Variation  Type     Variation  Type        ALL
COB       9,657   45           2.8       COB       0.2       FOR          2.6
COB       7,830   50           1.9       COB      -1.1       PAS          1.5
FOR       6,618   45           1.9       FOR      -0.8       COB          1.3
COB      10,701   46           2.6       COB      -0.6       PAS, PL/I    2.3
COB       8,712   46           3.2       COB, ALL  0.9       PL/I         3.2
COB      18,501   45           1.0       ALL      -1.4       PAS          1.0
COB      24,651   46           0.4       ALL      -1.7       PAS          0.4
PL/I     13,016   52           2.4       PL/I     -6.5       COB          0.3
FOR       7,771   52           1.6       FOR      -3.0       COB         -0.4
FOR       9,091   44           0.3       FOR      -4.7       PAS         -2.0
FOR      13,112   41          -0.2       FOR      -4.6       COB, PAS    -2.2
PL/I     12,943   50           2.4       PL/I     -6.6       COB          0.3
PAS      16,557   53           2.2       PAS      -2.6       COB          0.8
PAS      17,494   51           2.0       PAS      -3.4       COB          0.6

TABLE III. Difference in Percent Compression between Dynamic and Static Tables (Dynamic - Static) for All 530 Source Programs
(AV = Average, MX = Maximum Difference, and MN = Minimum Difference)

                         Matching type   COB          FOR          PAS          PL/I         ALL
                         AV  MX  MN      AV  MX  MN   AV  MX  MN   AV  MX  MN   AV  MX  MN   AV  MX  MN
File with trailing        3  22  -2       1  13  -2    1  22  -1    1  18   0    0   8  -1    3  22  -2
blanks
File without trailing     6  91  -3       2  22  -3    2  91  -2    2  53  -1    0  14  -1    5  91  -4
blanks

