Path: utzoo!attcan!utgpu!news-server.csri.toronto.edu!rutgers!sun-barr!cs.utexas.edu!samsung!usc!snorkelwacker!bloom-beacon!eru!hagbard!sunic!mcsun!ukc!acorn!abccam!pete
From: pete@abccam.abcl.co.uk (Peter Cockerell)
Newsgroups: comp.arch
Subject: Re: benchmark for evaluating extended precision
Keywords: extended precision,multiply,benchmark,arithmetic
Message-ID: <513@abccam.abcl.co.uk>
Date: 14 Sep 90 15:03:26 GMT
References: <3989@bingvaxu.cc.binghamton.edu>
Organization: Active Book Company Limited, Cambridge, UK
Lines: 58
Xref: dummy dummy:1
X-OldUsenet-Modified: added Xref
In article <3989@bingvaxu.cc.binghamton.edu>, vu0310@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>
> After a number of private communications, I've managed to render
> one of the little benchmarks I have presentable enough to post,
> along with some performance figures from different machines
> (basically, its put up or shut up time).
[load of stuff deleted]
> P.S. The benchmark was _deliberately_ kept rather simple; I
> wanted to measure the performance of _multiply_ in
> the context of basic extended precision arithmetic, not
> the memory or i/o subsystems.
The results of the 'benchmark' when run on my ARM (Acorn Risc Machine)
system running BSD 4.3 are:
	DOUBLE defined		3.6
	DOUBLE not defined	8.9
	Ratio			2.5
All this seems to be telling me is that the conversions required to use the
ARM's 32*32->32 multiply instruction to perform 8*8->16 arithmetic are more
onerous (and so slower) than those required to do 16*16->32.
Short<->int conversion on the ARM requires masking and/or shifting; there are
no explicit conversion instructions, so a C int=short*short compiles to
	MOV R0, R0, LSL #16	;Sign extend LHS
	MOV R0, R0, ASR #16
	MOV R1, R1, LSL #16	;Sign extend RHS
	MOV R1, R1, ASR #16
	MUL R2, R0, R1		;Do the multiply
Similarly, a short=char*char is
	MOV R0, R0, LSL #24	;Sign extend LHS
	MOV R0, R0, ASR #24
	MOV R1, R1, LSL #24	;Sign extend RHS
	MOV R1, R1, ASR #24
	MUL R2, R0, R1		;Do the multiply
	MOV R2, R2, LSL #16	;Convert to short
	MOV R2, R2, ASR #16
In comparison, int=int*int compiles to
	MUL R2, R0, R1
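
For anyone who wants to compare what their own compiler generates, here is
a minimal C sketch of the three cases. The function names are mine (purely
hypothetical, not from the benchmark), and I am assuming the fixed-width
types from <stdint.h> as stand-ins for char/short/int on a 32-bit machine.
sign_extend16 is written portably, but is intended to give the same result
as the LSL #16 / ASR #16 pair.

```c
#include <stdint.h>

/* Hypothetical helper: sign extend the low 16 bits of a 32-bit register
   value. Same result as LSL #16 followed by ASR #16, written without
   relying on implementation-defined shifts of negative values. */
int32_t sign_extend16(uint32_t r)
{
    return (int32_t)((r & 0xFFFFu) ^ 0x8000u) - 0x8000;
}

/* int = short * short: both operands are widened to int before the
   multiply, so the compiler must sign extend each one first. */
int32_t mul_short_to_int(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}

/* short = char * char: operands are widened, and the product is then
   narrowed back to 16 bits. The largest magnitude product, 128*128,
   still fits in a short. */
int16_t mul_char_to_short(int8_t a, int8_t b)
{
    return (int16_t)((int32_t)a * (int32_t)b);
}

/* int = int * int: no conversions at all, just the multiply.
   The caller must keep the product within the range of int32_t. */
int32_t mul_int(int32_t a, int32_t b)
{
    return a * b;
}
```

Compiling this to assembler (e.g. with cc -S) and counting the instructions
around each multiply should show whether your compiler pays the same
conversion cost as the listings above.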
The benchmark time for the case when LONG and SHORT are both defined
to be int (i.e. the natural word length for the processor) is 0.4s!
Or am I missing something...?