Path: utzoo!attcan!uunet!snorkelwacker!apple!agate!linus!linus!bs
From: bs@linus.mitre.org (Robert D. Silverman)
Newsgroups: comp.arch
Subject: Re: int x int -> long for * (or is it 32x32->64)
Keywords: arithmetic,arbitrary precision,benchmark,modular arithmetic
Message-ID: <119977@linus.mitre.org>
Date: 13 Sep 90 13:51:59 GMT
References: <3984@bingvaxu.cc.binghamton.edu> <41425@mips.mips.COM> <353@kaos.MATH.UCLA.EDU>
Organization: The MITRE Corporation, Bedford MA
Lines: 76
Xref: dummy dummy:1
X-OldUsenet-Modified: added Xref
In article <353@kaos.MATH.UCLA.EDU> pmontgom@euphemia.math.ucla.edu (Peter Montgomery) writes:
>In article <41425@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>I'm confused about the following exchange:
>>
>>In article <3984@bingvaxu.cc.binghamton.edu> vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) writes:
>>>In article <119733@linus.mitre.org> bs@gauss.UUCP (Robert D. Silverman) writes:
stuff deleted.
>>How about posting the benchmark, so we can understand what's happening?
>>It might especially be good to post a sample of the disassembled code that's
>>supposed to be doing 32x32->64, if that is truly supposed to be happening.
>
> I do many of the same types of computations as Silverman
>(and, like he, some of my processes use a month of CPU time or more).
>But I dispute his 10X figure. Here is a benchmark which tries
>to compute the remainders i*j mod p where 0 <= i, j < p, using three
>algorithms expressible in C and also an assembly code
>algorithm (which, on the SUN 3, uses the 64-bit multiply and divide
>instructions). Those instructions perform five times as fast as
Even Peter reports a 5 fold speed increase. One difference between his
code and mine that would tend to exaggerate the difference is that he
codes the control loops directly in assembler [for his multiply routines]
whereas I write my in C with low level calls to assembler routines for
the 64 bit arithmetic. When I go from 32 to 64 bits, I cut my calls to
these routines by a factor of 4. For me to reduce to 32 bit arithmetic
is therefore more expensive. This applies especially to the VAX because
the 'calls' instruction generated by the C compiler is a real pig.
Secondly, he does his modular multiplication by a method that avoids
the division instruction entirely (at the cost of some additional
multiplies). I do mine by multiplication followed by long division
[slower, but slightly more general]. Since division is slower than
multiplication, the effect of going from 64/32 --> 32 to 32/16 --> 16
enhances the speed difference.
My claim of 10x varies from machine to machine. On some machines
[a microVAX] It has been as high as a factor of 13. On others as low
as about 6. On one machine, [An Alliant] I observed a 20% speed difference
just from cache effects. Storing in radix 2^30 versus 2^15 halves the
storage requirements.
I agree that my measurements DO NOT measure JUST the raw speed difference
in arithmetic that one obtains from going from 32 to 64 bit multiplies
and divides. I also agree that if I were to inline my calls to the
low level assembler routines, the effect would be lessened.
However, there can't be any doubt that there is a SIGNIFICANT difference
between 32 and 64 bit arithmetic when doing multiple precision. The
Sparc shows little difference because it has NO multiply or divide
instruction for even 32 bits.
It comes down to what we are measuring. I am measuring the overall
performance benefit for large programs with lots of data including
caching, subroutine calls etc. Peter's benchmarks tend to measure
more closely just the speed difference in arithmetic. And in that
respect, I agree with his assessments.
If one had higher level language support for a 'long long' temporary
data type so one could compute AB/C, AB mod C, (A << 30 + B)/C,
(AB + C) mod 2^30, etc. directly in C, my claim for a 10x speed
improvement would come down quite a bit.
One thing is clear throughout ALL of this. Machines like the SPARC
which provide NO integer multiply/divide are not worth much for
this type of work. This is why I question ALUs like the i860
which also has no integer multiply/divide, yet does have a fancy
graphics unit.
--
Bob Silverman
#include
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"