Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!cs.utexas.edu!csc.ti.com!ti-csl!tilde.csc.ti.com!osage.csc.ti.com!bmk
From: bmk@csc.ti.com (Brian M Kennedy)
Newsgroups: comp.arch
Subject: Re: benchmark for evaluating extended precision
Keywords: extended precision,multiply,benchmark,arithmetic
Message-ID: <1990Sep12.223253.9574@csc.ti.com>
Date: 12 Sep 90 22:32:53 GMT
References: <3989@bingvaxu.cc.binghamton.edu>
Sender: usenet@csc.ti.com (USENET News System)
Distribution: usa
Organization: TI Computer Science Center, Dallas
Lines: 129
Xref: dummy dummy:1
X-OldUsenet-Modified: added Xref
=>It has been claimed that a lack of 32x32->64 multiplication
=>makes a factor of 10 difference in the running time of
=>typical extended precision arithmetic routines. Although it
=>obviously makes _a_ difference in run time I do not measure
=>an order of magnitude difference.
I cannot comment directly about the performance gains possible
for general multi-precision arithmetic, but I can comment on
a special case, 64-bit arithmetic implemented via 32-bit
arithmetic.
I have a highly optimized simulator that must do 64-bit
arithmetic. For portability, the simulator is written
in C++/ANSI C. ANSI C provides no support for multi-precision
arithmetic. It only guarantees that long is *at least* 32-bit.
Therefore, I have written a C++ type, doubleLong, which
guarantees at least 64-bit. The operation doubleLong*doubleLong
is implemented, as you might guess, with multiple long*long
operations.
IF ANSI C supported a library function
void mult(long Op1, long Op2, long Product[2])
which multiplied Op1 times Op2 and placed the result in the
two longs making up the array Product,
AND my hardware provided support for 32*32->64, such that the
library function was implemented using that support,
THEN, I could implement doubleLong multiplication much more
efficiently.
The question in this thread has been "How much more efficient?"
Speed-up from 2.5 to 10 has been claimed. Here's my measurements
using doubleLong:
Ideally, I would measure a program full of multiplications of
doubleLong's, first with the current 32*32->32 implementation,
and then with 32*32->64 implementation. Unfortunately, I have
no hardware support for the latter. Instead I will measure
an upper-bound on the performance increase by comparing:
64*64->64 via 32*32->32 vs. 32*32->32
I assert that the speed of (64*64->64 via 32*32->64) calculations
is faster than (64*64->64 via 32*32->32) but slower than just
plain 32*32->32.
The Environment:
---------------
Mips R3000 running RISC/os 4.50, medium load
C++ code compiled with ATT C++ 2.0
C code compiled with Mips 4.5 C compiler with -O1
The Results:
-----------
The program was run a number of times. The min and max times are
shown.
32*32->32 64*64->64 via 32*32->32
--------- -----------------------
MIN: %multiSpeed %multiSpeed
Final product = 246447234 Final product = 99_999_970_000_002
CPU Time = 7650 ms CPU Time = 83730 ms
Real Time = 17187 ms Real Time = 86343 ms
MAX: %multiSpeed %multiSpeed
Final product = 246447234 Final product = 99_999_970_000_002
CPU Time = 7820 ms CPU Time = 85960 ms
Real Time = 19191 ms Real Time = 137034 ms
From these timings, I get an approximate ratio of -- 11 --.
This is about what I expected, knowing the code.
Therefore, the speedup I would see from having a 32*32->64
operation available to me through ANSI C would be less than 11.
Knowing the code, I would guess the speed-up would be about 4.
That speedup, of course, is on a program that does little more
than 64-bit multiplies. On my simulator that uses doubleLong,
the gain would be much less than five percent.
Finally, general multi-precision arithmetic may get more gains
than indicated here. Also the gains possible when coding in
assembly are probably higher than in C++.
The program (in C++):
--------------------
// Switch the definition of NUMBER between long and doubleLong
// to obtain the alternate versions of this program.
#include "doubleLong.h" // contains definition of doubleLong
#include "timer.h" // for software timings
#define NUMBER doubleLong // for 64*64->64 calculations,
// implemented with 32*32->32 calculations
//#define NUMBER long // for 32*32->32 calculations
#define COUNT 10000000
main ()
{NUMBER Product;
timer Timer;
Timer.start();
for(int I = 0; I < COUNT; I++)
{Product = NUMBER(I) * NUMBER(I-1);}
Timer.split();
cout << "Final product = " << Product << endl << endl;
cout << " CPU Time = " << Timer.cpu() << " ms" << endl;
cout << " Real Time = " << Timer.real() << " ms" << endl;
}
---------------------------------
Brian M. Kennedy
Computer Systems Laboratory
Computer Science Center
Texas Instruments Incorporated