Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!snorkelwacker!apple!olivea!tymix!cirrusl!sunstorm!douglas
From: douglas%cirrusl@oliveb.ATC.olivetti.com (Douglas Lee)
Newsgroups: comp.arch
Subject: Re: Workstation Data Integrity
Message-ID: <2361@cirrusl.UUCP>
Date: 5 Sep 90 18:21:08 GMT
References: <6797.26d6edce@vax1.tcd.ie> <56qmo1w162w@zl2tnm.gp.govt.nz> <19875@crg5.UUCP> <19208@dime.cs.umass.edu> <2201@lectroid.sw.stratus.com> <68362@sgi.sgi.com> <1990Sep4.163619.24726@zoo.toronto.edu> <68505@sgi.sgi.com> <2483@crdos1.crd.ge.COM>
Sender: news@cirrusl.UUCP
Organization: Cirrus Logic Inc.
Lines: 89
Xref: dummy dummy:1
X-OldUsenet-Modified: added Xref
DRAM manufacturers express reliability in terms of FITs (Failures In
Time). One FIT represents one error in one billion (10 ^ 9) hours of
operation. Toshiba claims a FIT rate of 252 for 1 Mb DRAMs.
Clearpoint, which makes add-in memory boards, claims the actual rate is
1000 FITs. The FIT rate steadily decreased with each successive
generation of DRAMs until the 4 Mb. The FIT rate for 4 Mb DRAMs is
higher than for 1 Mb.
From the FIT rate you can calculate the MTBF of any memory system. The
MTBF in hours for one DRAM is calculated as 10 ^ 9 / FIT. The MTBF for
a system is just the MTBF of each DRAM divided by the total number of
DRAMs in the system. We are only looking at single bit errors here.
Assuming a FIT rate of 252:
    # of DRAMs    Memory Size    MTBF
        32            4 MB       14.1 years
        96           12 MB        4.7 years
       160           20 MB        2.8 years
Assuming a FIT rate of 1000:
    # of DRAMs    Memory Size    MTBF
        32            4 MB        3.6 years
        96           12 MB        1.2 years
       160           20 MB        260 days
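
The arithmetic behind these two tables can be sketched in a few lines of
Python. This is only an illustration of the formula given above; the
function name and the 365-day year are my assumptions, and the results
agree with the tables to within a tenth of a year.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: 365-day year

def system_mtbf_hours(fit_per_dram, num_drams):
    """MTBF in hours for a memory built from num_drams identical DRAMs.

    One DRAM has an MTBF of 10^9 / FIT hours; the system MTBF is that
    figure divided by the number of DRAMs.
    """
    return 1e9 / fit_per_dram / num_drams

for fit in (252, 1000):
    for drams in (32, 96, 160):
        hours = system_mtbf_hours(fit, drams)
        print(f"{fit:5d} FITs  {drams:4d} DRAMs  "
              f"{hours / HOURS_PER_YEAR:5.1f} years  ({hours / 24:6.0f} days)")
```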
For most PCs (memories < 12 MB), a single-bit soft error should occur
only rarely. The FIT rate really only measures errors due to
alpha particle radiation. Other soft errors, caused by power supply
spikes, dropouts, etc., have not been accounted for here. These will
push the effective FIT rate up, reducing the MTBF. The thing to
realize here is that parity will actually make the MTBF go down,
because more parts are added and so more things can fail.
Parity does allow you to detect these errors, however.
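
As a rough illustration of the parity penalty, here is the same MTBF
formula applied to a 4 MB memory with and without parity. The part
count with parity is my assumption (one extra DRAM per parity bit, 4
parity bits per 32-bit word, so 36 parts instead of 32); a real board
may be organized differently.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: 365-day year

def system_mtbf_years(fit_per_dram, num_parts):
    """System MTBF in years: (10^9 / FIT) hours per part, divided by part count."""
    return 1e9 / (fit_per_dram * num_parts) / HOURS_PER_YEAR

data_only = system_mtbf_years(252, 32)    # 4 MB from 32 DRAMs, no parity
with_parity = system_mtbf_years(252, 36)  # assumed: 4 more DRAMs for parity bits

print(f"no parity:   {data_only:.1f} years")
print(f"with parity: {with_parity:.1f} years")
```

The MTBF drops by the ratio 32/36, but the failures that do occur are
now detected instead of silently corrupting data.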
Error detection and correction (EDAC) has been mentioned as an
alternative, and it is used in many workstations (e.g. Sun). One of
the most popular parts is the Am29C660 and its predecessor, the Am2960.
This part uses a modified Hamming code to detect and correct single
bit errors and to detect double bit errors. It will in fact detect
many multi-bit errors and catastrophic failures such as all 0's or all
1's. The part appends 7 bits to a 32 bit word and 8 bits to a 64 bit
word (two parts are cascaded). For 32 bits the overhead is greater
than parity, 7 vs. 4, but at 64 bits you break even. Similar parts are
made by IDT, and many workstation manufacturers implement the same
function in gate arrays. The advantage of this scheme is that all
single bit errors are corrected. Also, during refresh cycles, the EDAC
can scrub memory. This is done by reading one memory location and
correcting any single bit error during each refresh cycle. By
appropriately partitioning memory, the entire memory can be scrubbed in
a short time, preventing the accumulation of double-bit errors.
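
The 7 and 8 check-bit figures follow from the usual Hamming bound for a
single-error-correcting, double-error-detecting code. The sketch below
uses that textbook bound, not the actual code matrix of the Am29C660,
and compares it with one parity bit per byte; the function names are
mine.

```python
def secded_check_bits(data_bits):
    """Check bits needed to correct single and detect double bit errors."""
    r = 0
    # Single-error correction needs 2^r >= data_bits + r + 1, so that the
    # syndrome can name every bit position plus the "no error" case.
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One additional overall parity bit gives double-error detection.
    return r + 1

def byte_parity_bits(data_bits):
    """Check bits needed for simple parity, one bit per 8-bit byte."""
    return data_bits // 8

for width in (32, 64):
    print(f"{width}-bit word: EDAC {secded_check_bits(width)} bits, "
          f"parity {byte_parity_bits(width)} bits")
```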
To calculate the probability of a two-bit error occurring, the birthday
paradox is used. This gives the probability of two single bit
errors occurring in the same memory word. Assuming 32 bit words and 252
FITs:
    # of DRAMs    Memory Size    MTBF
        39            4 MB       14,907 years
       117           12 MB        8,607 years
       195           20 MB        6,667 years
For 1000 FITs:
    # of DRAMs    Memory Size    MTBF
        39            4 MB        3,757 years
       117           12 MB        2,168 years
       195           20 MB        1,680 years
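
One way to arrive at figures like these is sketched below. It is a
reconstruction, not necessarily the exact method used for the tables:
it takes the standard birthday-paradox estimate that about
sqrt(pi * N / 2) random errors are needed before two land in the same
one of N words, and multiplies that by the single-bit MTBF of the
system. With a 365-day year and no scrubbing, it comes out within about
a year of every entry above. The names are mine.

```python
import math

HOURS_PER_YEAR = 24 * 365  # assumption: 365-day year

def double_bit_mtbf_years(fit_per_dram, num_drams, num_words):
    """Estimated years until two single bit errors hit the same word."""
    # Mean time between single bit errors anywhere in the memory.
    single_bit_mtbf = 1e9 / (fit_per_dram * num_drams) / HOURS_PER_YEAR
    # Birthday-paradox estimate: expected number of uniformly placed
    # errors before the first repeat among num_words words.
    errors_until_collision = math.sqrt(math.pi * num_words / 2)
    return single_bit_mtbf * errors_until_collision

WORDS_PER_MB = 2 ** 20 // 4  # 32 bit (4 byte) words

for fit in (252, 1000):
    for drams, megabytes in ((39, 4), (117, 12), (195, 20)):
        years = double_bit_mtbf_years(fit, drams, megabytes * WORDS_PER_MB)
        print(f"{fit:5d} FITs  {drams:4d} DRAMs  {megabytes:3d} MB  {years:8.0f} years")
```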
This increase is overstated, since you have added extra circuitry and
devices that can cause other failures to occur. The expected total
system MTBF is 50 to 60 times that of the non-EDAC system. If
scrubbing is used, then this will be even higher.
What this also neglects is that many single bit errors can occur in
memory locations that are not used, or are not read before they are
written again. Therefore, the system may not detect all the parity
errors that occur.
I would expect that most 64 bit memories will have EDC circuits,
especially memories using DRAMs > 1Mb. Some PC companies have looked
at EDC, but found it too expensive to justify putting in the box.
I now must say that I worked for Advanced Micro Devices supporting the
Am29C660. I no longer am affiliated with them.
I hope this answers some of the questions about memory reliability.
Douglas Lee