Path: utzoo!utgpu!news-server.csri.toronto.edu!clyde.concordia.ca!mcgill-vision!snorkelwacker!apple!julius.cs.uiuc.edu!zaphod.mps.ohio-state.edu!rpi!leah!bingvaxu!kym
From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell)
Newsgroups: comp.arch
Subject: Re: benchmark for evaluating extended precision
Keywords: extended precision,multiply,benchmark,arithmetic
Message-ID: <3994@bingvaxu.cc.binghamton.edu>
Date: 12 Sep 90 20:10:24 GMT
References: <3989@bingvaxu.cc.binghamton.edu> <119858@linus.mitre.org>
Reply-To: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell)
Organization: SUNY Binghamton, NY
Lines: 78
In article <119858@linus.mitre.org> bs@linus.UUCP (Robert D. Silverman) writes:
\\\
>You are totally confused. Nowhere in your benchmark did I see any assembler
>code to support 32 x 32 -- > 64. You are STILL working in radix 2^15
>[or maybe radix 2^16; I didn't check closely], whereas allowing 32 x 32 --> 64
>would allow you to work in radix 2^30 or 2^31 [depending on how careful
>you want to be about sign bit and 1's complement/2's complement portability
>problems]. Change of radix ALONE would account for a factor of 4 FEWER
>multiplies. Secondly, you are working with tiny examples. Working in a
>larger radix in a REAL WORLD PROGRAM means less storage required for each
>multi-precise variable, less loop overhead, fewer subroutine calls, fewer
>cache misses, etc.
\\\
Funny, I don't _feel_ confused. Some of the points you make _are_
valid in some sense, but we're diverging from the topic.
My whole point is that 32x32->64 is not _required_ in order to support
extended precision arithmetic. There are various definitions of
_required_ here: ``logically required'', ``convenience'', etc.
Obviously the operation isn't _logically_ required, and as I've
pointed out, its introduction poses some logical _problems_.
With this benchmark, I'm also trying to show that _convenience_
is not an issue (esp. in the light of the ``bottom line'',
which is what (almost?) everything boils down to, whether we
like it or not).
The fact that I'm doing the multiply naively is in _favour_ of
your argument that
``[without 32x32->64 high prec arithmetic will] run 10 times slower''.
If I had written this code using _better algorithms_ the difference
between using 16x16->16 and 32x32->32 would be _even less_,
thereby weakening your claim further.
Consider: we all know of a fairly simple divide-and-conquer
high-prec multiply with complexity of O(n^lg(3)). (Actually we
_can_ do them in almost linear time, but the cost isn't worth
it yet).
Instead of using 32x32->64 arithmetic we _could_ use
32x32->32 arithmetic and only use the low-order bits
for multiply.
If we do this we must perform 4 times more ``small''
multiplications to perform the ``big'' one; but each of these
``small'' multiplies is performed on, effectively, 16-bit
operands and should therefore be several times faster.
(Checks for effectively multiplying by zero are _almost_ universal).
In the limit (see above) this works out at
	each multiply 2^lg(3) = 3 times faster / 4 times more multiplies
	= 3/4 of ``nxn->2n'' speed
That is, *asymptotically* you only lose 25% in performance by
using ``proper'' 32-bit arithmetic (to *simulate* 16x16->32)
over having the ``convenience'' operation of 32x32->64 multiply.
A 25% reduction in speed is not significant enough, in my
accounting, to justify any _extra_ costs required to
maintain the 32x32->64 crock in a HLL (_despite_ the fact that,
historically, ``that's the way the hardware does it anyway'').
However, one could argue that 32 bits isn't close enough to the
asymptote to justify such claims -- hence a little benchmark.
Considering the amount of extended precision arithmetic
that is performed anyway vis-a-vis fixed-point or
the normal floating-point hackery, the _theoretical_
(if you like) slowdown of 25% shown above, and in the light
of _actual measurements_ of factors of 2-3 using naive
algorithms, I remain unconvinced that 32x32->64 is required;
the folks who have dropped it from their list of things to
implement are, I think, justified.
-Kym Horsell