When Speed Is King Redux
Today is a rainy Saturday. I thought I would revisit my previous snippet, When Speed Is King: C vs Lisp. I've got the winning source code of the Apress Fractal contest. I decided to try it out on vega.
My first steps involved looking at the README file and building the program:
mgen: Mandelbrot Fractal Generator Program
Apress Summer 2005 Programming Contest
By David Coakley (coakley@nwlink.com)
The program is written in C and requires the Independent JPEG Group's free
JPEG compression library (libjpeg). The supplied Makefile should build the
program in any environment where gcc and libjpeg are available.
For Red Hat Linux systems, you may need the "libjpeg-devel" package to get
the required libjpeg header file.
The program uses a straightforward method for computing the Mandelbrot set,
iterating over each point to compute its level set. The level set is used
to select a color from a predefined table. One advantage of this method is
that it only requires a buffer large enough for one line in the image.
References:
The Science of Fractal Images
edited by Heinz-Otto Peitgen and Dietmar Saupe; Michael Barnsley,
Robert L. Devaney, Benoit B. Mandelbrot, Richard F. Voss, Yuval Fisher,
and Michael McGuire
New York: Springer-Verlag, Inc. ISBN 0-387-96608-0.
Chaos and Fractals: New Frontiers of Science.
Heinz-Otto Pietgen, Hartmut Jürgens, and Dietmar Saupe.
New York: Springer-Verlag, Inc. ISBN 0-387-97903-4.
david@vega:~/apress-fractal/mgen $ sudo apt-get install libjpeg62 libjpeg62-dev Reading Package Lists... Done Building Dependency Tree... Done libjpeg62 is already the newest version. The following NEW packages will be installed: libjpeg62-dev 0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded. Need to get 183kB of archives. After unpacking 426kB of additional disk space will be used. Get:1 http://ftp.us.debian.org sarge/main libjpeg62-dev 6b-10 [183kB] Fetched 183kB in 1s (113kB/s) Selecting previously deselected package libjpeg62-dev. (Reading database ... 47874 files and directories currently installed.) Unpacking libjpeg62-dev (from .../libjpeg62-dev_6b-10_i386.deb) ... Setting up libjpeg62-dev (6b-10) ... david@vega:~/apress-fractal/mgen $ make gcc -ansi -O3 -ljpeg -o mgen src/mgen.c ; strip mgen
The source code for the program is deliciously simple. It really demonstrates the application of the KISS principle, not to mention a literal interpretation of the rules. Well, literal provided the minimum number of colors is equated with the minimum number of iterations. The code is so simple I will post it inline:
/* mgen: Mandelbrot Fractal Generator Program
* Apress Summer 2005 Programming Contest
*
* By David Coakley
*/
#include <stdio.h>
#include <jpeglib.h>
#include <sys/stat.h>
#define THRESHOLD 32.0
#define IMAGE_WIDTH 800
#define IMAGE_HEIGHT 600
#define MAX_ITERATIONS 33
/*
* Define some interesting colors in parallel arrays.
*/
const unsigned char RED[MAX_ITERATIONS + 1] = {
156, 156, 150, 144, 136, 128, 128, 128, 128,
144, 160, 176, 192, 208, 224, 240, 255,
255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 0 };
const unsigned char GREEN[MAX_ITERATIONS + 1] = {
156, 156, 150, 144, 136, 96, 64, 32, 16,
0, 0, 0, 0, 0, 0, 0, 0,
16, 32, 48, 64, 80, 96, 112, 128,
144, 160, 176, 192, 208, 224, 240, 255, 0 };
const unsigned char BLUE[MAX_ITERATIONS + 1] = {
255, 252, 244, 240, 200, 144, 96, 64, 32,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0 };
/* Return the level set of a point. */
int mlevel(double cx, double cy)
{
double x, y, x2, y2;
int j;
x = cx;
y = cy;
for (j = 0; j < MAX_ITERATIONS; j++)
{
x2 = x * x; y2 = y * y;
if (x2 + y2 > THRESHOLD)
{
return j;
}
y = 2 * x * y + cy;
x = x2 - y2 + cx;
}
return j;
}
/* Write a JPEG compressed image of the Mandelbrot set to the specified file. */
void mgen(char *filename, double xmin, double xmax, double ymin, double ymax)
{
struct jpeg_compress_struct cinfo;
struct jpeg_error_mgr jerr;
JSAMPROW row_pointer[1]; /* pointer to one row of samples */
JSAMPLE mrow[IMAGE_WIDTH * 3]; /* one row (each pixel is three samples) */
FILE *outfile = fopen(filename, "wb");
cinfo.err = jpeg_std_error(&jerr);
jpeg_create_compress(&cinfo);
jpeg_stdio_dest(&cinfo, outfile);
cinfo.image_width = IMAGE_WIDTH; /* image width in pixels */
cinfo.image_height = IMAGE_HEIGHT; /* image height in pixels */
cinfo.input_components = 3; /* # of color components per pixel */
cinfo.in_color_space = JCS_RGB; /* colorspace of input image */
jpeg_set_defaults(&cinfo);
jpeg_start_compress(&cinfo, TRUE);
while (cinfo.next_scanline < cinfo.image_height)
{
double cy = ymin + cinfo.next_scanline * (ymax - ymin) / cinfo.image_height;
int x;
for (x = 0; x < cinfo.image_width; x++)
{
double cx = xmin + x * (xmax - xmin) / cinfo.image_width;
int level = mlevel(cx, cy);
mrow[x * 3] = RED[level];
mrow[x * 3 + 1] = GREEN[level];
mrow[x * 3 + 2] = BLUE[level];
}
row_pointer[0] = mrow;
jpeg_write_scanlines(&cinfo, row_pointer, 1);
}
jpeg_finish_compress(&cinfo);
fclose(outfile);
jpeg_destroy_compress(&cinfo);
}
int main()
{
const char *outdir = "outputpic";
FILE *infile = fopen("input.txt", "r");
double xmin, xmax, ymin, ymax;
int i = 1;
/* Create output directory. */
close(mkdir(outdir, S_IRWXU | S_IRWXG | S_IRWXO));
/* Read image boundary coordinates from input file. */
while (fscanf(infile, "%lf %lf %lf %lf", &xmin, &xmax, &ymin, &ymax) != EOF)
{
char filename[32];
/* Write each image to a unique filename. */
sprintf(filename, "%s/m%d.jpg", outdir, i++);
mgen(filename, xmin, xmax, ymin, ymax);
}
fclose(infile);
return 0;
}
I think I can see the secret of the speed. The max iterations is set to 33. I was using 128 (intially 255). So how fast does this code run on vega? I did five consecutive timings:
david@vega:~/apress-fractal/mgen $ linuxinfo Linux vega 2.6.8-2-k7-smp #1 SMP Thu May 19 18:14:00 JST 2005 Two AMD Athlon MP 1533MHz processors, 6094.84 total bogomips, 511M RAM System library 2.3.2 david@vega:~/apress-fractal/mgen $ cpuid eax in eax ebx ecx edx 00000000 00000001 68747541 444d4163 69746e65 00000001 00000662 00000000 00000000 0383fbff 80000000 80000008 68747541 444d4163 69746e65 80000001 00000762 00000000 00000000 c1cbfbff 80000002 20444d41 6c687441 54286e6f 4d20294d 80000003 38312050 002b3030 00000000 00000000 80000004 00000000 00000000 00000000 00000000 80000005 0408ff08 ff20ff10 40020140 40020140 80000006 00000000 41004100 01008140 00000000 80000007 00000000 00000000 00000000 00000001 80000008 00002022 00000000 00000000 00000000 Vendor ID: "AuthenticAMD"; CPUID level 1 AMD-specific functions Version 00000662: Family: 6 Model: 6 [Athlon MP/Mobile Athlon model 6] Standard feature flags 0383fbff: Floating Point Unit Virtual Mode Extensions Debugging Extensions Page Size Extensions Time Stamp Counter (with RDTSC and CR4 disable bit) Model Specific Registers with RDMSR & WRMSR PAE - Page Address Extensions Machine Check Exception COMPXCHG8B Instruction APIC SYSCALL/SYSRET or SYSENTER/SYSEXIT instructions MTRR - Memory Type Range Registers Global paging extension Machine Check Architecture Conditional Move Instruction PAT - Page Attribute Table PSE-36 - Page Size Extensions MMX instructions FXSAVE/FXRSTOR 25 - reserved Generation: 7 Model: 6 Extended feature flags c1cbfbff: Floating Point Unit Virtual Mode Extensions Debugging Extensions Page Size Extensions Time Stamp Counter (with RDTSC and CR4 disable bit) Model Specific Registers with RDMSR & WRMSR PAE - Page Address Extensions Machine Check Exception COMPXCHG8B Instruction APIC SYSCALL/SYSRET or SYSENTER/SYSEXIT instructions MTRR - Memory Type Range Registers Global paging extension Machine Check Architecture Conditional Move Instruction PAT - Page Attribute Table PSE-36 - Page Size Extensions 19 - reserved AMD MMX Instruction Extensions MMX instructions FXSAVE/FXRSTOR 3DNow! Instruction Extensions 3DNow instructions Processor name string: AMD Athlon(TM) MP 1800+ L1 Cache Information: 2/4-MB Pages: Data TLB: associativity 4-way #entries 8 Instruction TLB: associativity 255-way #entries 8 4-KB Pages: Data TLB: associativity 255-way #entries 32 Instruction TLB: associativity 255-way #entries 16 L1 Data cache: size 64 KB associativity 2-way lines per tag 1 line size 64 L1 Instruction cache: size 64 KB associativity 2-way lines per tag 1 line size 64 L2 Cache Information: 2/4-MB Pages: Data TLB: associativity L2 off #entries 0 Instruction TLB: associativity L2 off #entries 0 4-KB Pages: Data TLB: associativity Direct mapped #entries 0 Instruction TLB: associativity Direct mapped #entries 0 size 1 KB associativity L2 off lines per tag 129 line size 64 Advanced Power Management Feature Flags Has temperature sensing diode Maximum linear address: 32; maximum phys address 34 david@vega:~/apress-fractal/mgen $ time ./mgen real 0m1.753s user 0m1.751s sys 0m0.003s david@vega:~/apress-fractal/mgen $ time ./mgen real 0m1.751s user 0m1.745s sys 0m0.005s david@vega:~/apress-fractal/mgen $ time ./mgen real 0m1.761s user 0m1.758s sys 0m0.004s david@vega:~/apress-fractal/mgen $ time ./mgen real 0m1.761s user 0m1.760s sys 0m0.001s david@vega:~/apress-fractal/mgen $ time ./mgen real 0m1.751s user 0m1.743s sys 0m0.008s
I'm calling 1.751s for mgen to run. Now for the $64 USD question. Can I modify my Lisp code to run as fast? I've got periodicity detection which is a liability rather than an improvement at only 33 iterations. I'm going up to 128 iterations. My palette mapping is logarithmic, although that is precomputed. I don't want to make any truly drastic changes though. For example, my JPEG encoder is Lisp and I want to keep it that way. I'm using a tiling algorithm that I would like to keep, even though it is more useful at higher iterations than lower ones. In short, I want to make just a few minor changes to my code and see how it fairs against the winning entry.
david@vega:~/apress-fractal/lisp $ sbcl This is SBCL 0.9.5, an implementation of ANSI Common Lisp. More information about SBCL is available at <http://www.sbcl.org/>. SBCL is free software, provided as is, with absolutely no warranty. It is mostly in the public domain; some portions are provided under BSD-style licenses. See the CREDITS and COPYING files in the distribution for more information. * (load (compile-file "apress-fractal-2.lisp")) * (time (generate-fractals-from-input-file "../input.txt")) Evaluation took: 3.182 seconds of real time 3.170518 seconds of user run time 0.011999 seconds of system run time 0 page faults and 5,340,536 bytes consed. NIL * (time (generate-fractals-from-input-file "../input.txt")) Evaluation took: 3.156 seconds of real time 3.140523 seconds of user run time 0.015997 seconds of system run time 0 page faults and 4,037,824 bytes consed. NIL * (time (generate-fractals-from-input-file "../input.txt")) Evaluation took: 3.09 seconds of real time 3.076532 seconds of user run time 0.012998 seconds of system run time 0 page faults and 4,042,664 bytes consed. NIL * (time (generate-fractals-from-input-file "../input.txt")) Evaluation took: 3.097 seconds of real time 3.083532 seconds of user run time 0.012998 seconds of system run time 0 page faults and 4,046,768 bytes consed. NIL * (time (generate-fractals-from-input-file "../input.txt")) Evaluation took: 3.112 seconds of real time 3.09653 seconds of user run time 0.015997 seconds of system run time 0 page faults and 4,040,992 bytes consed. NIL * (quit)
Well, it looks like I just can't keep up. Calling my time at 3.09s, the C program comes in at 1.76 times faster than my Lisp program. This is less than twice as fast, but still quite a bit faster.
Final Words
This is still an apples to oranges comparison. My Lisp code is using a different algorithm than the C code. Other factors such as the fact that one of the CPUs on vega is at 100% load recomputing Xeno's Xoom at 1280x720 resolution may or may not have been a factor. In spite of all that, it may be possible to extrapolate my perfomance vs other entries. Based on the winning time published by Apress (1.08s), Apress' test box runs the entries 1.62 times faster than my box. The winning C# program took 1.80s to run. Multiply that by 1.62 and you get 2.92s for the C# program on my machine if it could run it. I'm disappointed that I don't beat the C#, but I am within .2s. Likewise, the fastest Java program seems to edge me out by .01s. I can run Java, but I haven't tried a direct comparison on my machine. If I do, I will update this section.
One thing that certainly does not help my performance is the use of CLOS. That was a hold over from older code I have written. It is likely that I could gain a tiny bit of time by getting rid of the CLOS code. That is something else to consider doing in the future.
Another thing that I did that probably didn't help was raise my escape threshold from 4.0 to 32.0 to match the C code. It's a small difference in time, but I scored a 3.06s run when I set it back to 4.0.
Here is the code I used with the resulting JPEG images. 355,341 Bytes.