Lecture 34       7th August 1946

Reliability and Checking in Digital Computing Systems

S.B. Williams    (Formerly of Bell Telephone Laboratories)

The lecture describes the checking mechanisms in a computing system, developed by Bell Laboratories, which uses electro-mechanical telephone relays for number storage, and teleprinter tape for input-output. The discussion is not very relevant to electronic computers. For example, the main checking mechanisms are based on the ability to check electronically that a setting of a relay has been successfully achieved within some appropriate time, and on stopping the machine if it has not.


Lecture 35       7th August 1946

Reliability and Checking

J.P. Eckert    (Electronic Control Company)

This lecture discusses reliability and checking from the standpoint of experience with the ENIAC. This is obviously more applicable to electronic stored-program computers than experience with electro-mechanical machines like the Bell Laboratory machine discussed in Lecture 34 and the Harvard Mark I.

The complexity and large scale of computation possible with electronic computers require far higher standards of reliability from components (e.g. vacuum tubes and capacitors) than most other industrial applications do. In some cases reliability is not a particular issue for manufacturers, and statistics on reliability may not be available. A further problem is that the uses the components are put to in computers are very different from the uses they were designed for.

It is important to draw a clear distinction between reliability and checking. Checking is no substitute for reliability. Good reliability is essential. But even with good reliability, good checking is essential.

Machine errors can be divided into two classes: "Permanent Failure" of a component and "Transient Failure". A permanent failure tends to have obvious consequences and is relatively easy to find. A transient failure is generally more difficult to find, particularly as standard operator tests may give inconsistent results. Transient errors are in fact a bigger problem in electro-mechanical machines, where the commonest failure is temporary malfunction of a relay due to a bit of dirt which soon gets dislodged.

In the early life of a computer failures are most likely to be permanent failures, due to imperfections in machine construction and especially vacuum tube failure. Later on transient failures become more common as vacuum tube performance degenerates over time. A partial solution to this latter problem is to periodically run tests on vacuum tubes, e.g. with reduced voltages, to detect tubes close to the end of their useful life.

Checking mechanisms can be divided into high-level and low-level checks. Low-level checks are those made at the level of the circuitry, independent of the problems being solved, for example in electro-mechanical machines checking the performance of relays (see Lecture 34), and in Mercury Acoustic Delay Line machines checking that values are circulating in tanks correctly. High-level checks are ones which involve running programs and checking their results. Most obviously a program can be run which has known results, or for which the results can be easily checked independently, e.g. a program which finds all the prime factors of a number and then multiplies them all together again.
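A high-level check of the kind just described can be sketched in a few lines of modern Python (the function names are mine; the original would of course have been coded directly for the machine): factorise a number, then multiply the factors back together and compare.

```python
def prime_factors(n):
    """Find the prime factors of n by simple trial division."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def self_check(n):
    """Multiply the factors back together; on a correctly working
    machine the product must equal the original number."""
    product = 1
    for f in prime_factors(n):
        product *= f
    return product == n
```

In software the check trivially passes; the point on real hardware is that a transient arithmetic fault in either the factorisation or the remultiplication would make the two results disagree.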

On ENIAC operators normally apply four techniques for discovering errors:

  1. Running various high-level checking programs. A good source of more complex self-checking programs (more complex than, say, a prime factors program) is to run a complex problem for which the solutions for special cases can be worked out in theory or by a simpler independent computation. The operator runs such programs at intervals varying with the current perceived overall reliability of the machine, up to a maximum of say 3 or 4 hours if the machine is running well. If the test fails, the work done since the last set of tests is rerun.

    When a test fails on a complex run, the program already has a large number of break points built in, and the operator can quickly narrow down the part of the program where an error has occurred using binary chop techniques. Hopefully this will highlight the error. Obviously this method works much better with permanent errors than transient. The former can typically take around 15 minutes to isolate; the latter has been known to take 10 to 15 hours.

    This method was probably peculiarly suitable to the ENIAC, with its complicated combination of function tables, accumulators, and program-specific wiring; individual components were much more likely to be used only in certain parts of the program than in a general-purpose stored-program computer. On the other hand, it is not clear how easy running of test programs squares with the difficulty of changing programs on ENIAC. Part of the explanation may be that 'operator' refers to a person associated with the program currently being run rather than any ENIAC 'computing service', and that the tests referred to would be closely related to the program being run.
     

  2. If transient errors are suspected or feared, a run can be repeated, possibly more than once, and the outputs compared (easily done with punched card output and IBM checking machines).
     
  3. Checks can be provided (especially by the program itself) by running checks on the consistency of partial results being generated by the program. A sophisticated example is where a program is generating values from a complex real-life problem known to be a continuous process; difference techniques, maybe up to 5 or 6 orders, could be used to check that there are no apparent discontinuities.
     
  4. A suite of tests is provided which tests specific parts of the machine in detail. This suite is typically run at the beginning and end of each day.

Unfortunately these checking mechanisms rely on the operators for correct application. Most occurrences of disastrously inaccurate runs have been traced to failures of an operator!
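The binary-chop isolation described under technique 1 can be sketched as follows (a hypothetical helper, not ENIAC practice in detail): `run_to(b)` stands for rerunning the program up to break point b and checking the intermediate results there.

```python
def first_failing_breakpoint(breakpoints, run_to):
    """Binary chop over the program's built-in break points: run_to(b)
    reruns up to break point b and returns True if the intermediate
    results there are still correct. Returns the first failing point."""
    lo, hi = 0, len(breakpoints) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if run_to(breakpoints[mid]):
            lo = mid + 1   # error lies beyond this break point
        else:
            hi = mid       # error at or before this break point
    return breakpoints[lo]
```

With 16 break points, at most four reruns locate the faulty region of the program, which helps explain why a permanent error could be pinned down in minutes.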
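The difference technique of point 3 can also be illustrated (my own sketch): for a smoothly tabulated function the high-order differences are small, so a single corrupted value stands out sharply.

```python
def differences(values, order):
    """Take adjacent differences repeatedly, 'order' times."""
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values

def looks_continuous(values, order=5, tolerance=1e-9):
    """True if the order-th differences stay small, as they should for
    values sampled from a continuous, smooth process."""
    return all(abs(d) <= tolerance for d in differences(values, order))
```

Tabulating x^3 over a range of integers, the 5th differences are exactly zero; corrupting a single entry by one unit makes several of them jump (by binomial-coefficient multiples of the error) and the check fails.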

The question of reliability and checking should be borne in mind at the design stage of a computer. Mechanisms for checking circuits can be built in, though care must be taken that tests for ageing components (e.g. by varying voltages) are not too harsh. Reliability can be addressed by giving consideration to e.g. appropriate power supplies and signal levels, and shielding and isolation of circuits.

In general reliability should be much better in the much simpler design expected in the new computers (i.e. compared with ENIAC). Also 'serial' computers (see Lecture 45) should be more reliable than the more complex multi-channel 'parallel' computers.

One possible solution to checking reliability is to produce two identical machines with no internal checking mechanisms. Instead there are mechanisms to compare partial results to check that the machines are running consistently. Care must however be taken to avoid the occurrence of duplicated errors, e.g. due to using a common power supply with sufficient variation to cause simultaneous identical intermittent errors. Again duplicated input would need to be independently checked to be identical. An extra advantage of this solution is that for programs where checking is not a particular problem, e.g. self checking programs, the two machines could be operated independently and hence do twice the work.
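The duplicated-machine scheme amounts to a lockstep comparison, which can be modelled as a toy sketch (the names and interface here are my own illustration):

```python
def run_lockstep(step_a, step_b, state, steps, compare_every):
    """Drive two supposedly identical machines through the same
    computation from the same input, comparing partial results at
    intervals; stop at the first disagreement."""
    a = b = state
    for i in range(1, steps + 1):
        a, b = step_a(a), step_b(b)
        if i % compare_every == 0 and a != b:
            raise RuntimeError(f"machines disagree after step {i}")
    return a
```

Note the caveat in the lecture: if a shared fault (e.g. a common power-supply dip) perturbs both machines identically, the comparison passes and the error goes undetected.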


Lecture 44      26th August 1946

Discussion of Ideas for the Naval Ordnance Laboratory Computing Machine

C.N. Mooers    (Naval Ordnance Laboratory)

In this lecture Calvin Mooers discusses three problems in relation to preliminary ideas for the design of a computer for the Naval Ordnance Laboratory's 'EDVAC-like computing machine' (i.e. stored-program electronic computer). Mooers warns that very little experimental electronic work has been done, so the ideas presented are speculative. In fact the project was abandoned within a year.

The machine envisaged has a store capacity of about 8,000 50-bit words. Each word in store is immediately accessible. It uses binary, and provides floating-point numbers and operations. However fixed-point operation is also provided, as it is necessary for altering instruction fields (e.g. for general store access) and providing for multi-length arithmetic. The possible provision of 'stop-order tags' (see Lecture 39) is highly speculative.

The Electrostatic Storage Tube

Electrostatic storage (i.e. rather than Mercury Acoustic Delay Line storage) is of course crucial to the design of the machine. An early idea was to base it on the iconoscope tube used for television, which distinguishes 500 * 500 different picture elements. If each picture element could be adapted to store just a single bit, then this would provide a memory of 5,000 50-bit words. However it was feared that with this much information it would only be possible to access a particular word by sweeping the whole screen -- too slow. So a tube was considered that had 50 rows each with 50 bits, with the ability to select and scan an individual row directly. However it could be clearly seen from papers on the iconoscope tube that there would still be problems of image deterioration due to redistribution of electrons when the beam hits the screen -- "secondary emissions".

A hopeful solution for controlling the secondary emissions was seen in the special tube being developed by RCA for radar use, which had a mesh in contact with the dielectric screen. It was decided to place a copper mesh just inside the dielectric surface. This, plus a conducting surface on the outside of the dielectric, could be used to control, set, and reset the potentials at the bit positions. A bit position would be held at one of two potentials, and would be read by observing the current in the outer conducting surface at the moment the scanning beam was firing at it. The Naval Ordnance Laboratory have already arranged with DuMont to manufacture three experimental tubes, slightly varying the position of the copper mesh. So far only one has been produced, but there were faults in its construction.

A further problem with the proposed electrostatic tube is the need to refresh the storage on a frequent basis. A simple solution would be to copy the values on the screen to a companion tube, and then copy them back again. This of course would double the number of tubes required for a given number of words. They had considered reducing the number of redundant tubes by forming the storage tubes into a logical circle which contained only one redundant tube, and refreshing each in turn, by copying from the preceding tube into the redundant tube, and then declaring the preceding tube as the redundant tube. This of course would mean that the block of words associated with a particular tube-full of addresses would migrate continuously round the circle.
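The rotating-refresh idea can be modelled in a few lines (a sketch of the scheme as I read it, using a hypothetical four data tubes plus one spare):

```python
def refresh_step(slots, spare):
    """Copy the tube preceding the spare into the spare; the copied
    tube then becomes the new spare. Returns the new spare index."""
    prev = (spare - 1) % len(slots)
    slots[spare] = slots[prev]
    return prev

# four data blocks A..D in tubes 0..3, with tube 4 the initial spare
slots = ["A", "B", "C", "D", None]
spare = 4
for _ in range(4):                  # one full refresh pass
    spare = refresh_step(slots, spare)
# every block has now been rewritten once and has migrated one tube
# round the circle; slot 0 (holding a stale copy) is the new spare
```

This makes the migration visible: after each full pass the block of words associated with a given tube-full of addresses has moved one position round the circle, exactly as the summary describes.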

At the time of the lecture the Laboratory is not trying to solve the refresh problem until the basic tubes are working.

Automatic Control of Significant Digits

The problem being addressed here is one that has received relatively little attention. When using floating-point notation it has been well discussed that it is necessary to keep values in a standardised form (i.e. with the most significant bit at the most significant end of the mantissa) to avoid the problem of losing significant digits (most obviously in multiplication). The question of rounding has also been well discussed: an approximation has to be made where significant bits have to be lost (now obviously in addition and subtraction as well). Rounding is a problem both in machine design (which rounding strategy to employ) and in program design (knowing the strategy being used, what effect will continual rounding have on the program's computation). But what of the storage of insignificant bits? In practice any particular number could have fewer than the fixed number of significant bits, for example deriving from the significance of the numbers input to the program, and/or the accumulation of errors due to rounding in a long program.

One possible idea is to hold numbers in a destandardised form, so that only the significant bits of a number are held, in the least significant bits of the mantissa, with zeroes filling the space between the binary point and the first significant bit. At the start of an arithmetic operation the number of significant bits in each number could be worked out (e.g. by counting the number of zeroes after the binary point, possibly as part of the process of standardising the number, e.g. before multiplication). Then the result could be formed and destandardised according to the appropriate resultant significance (e.g. that of the shorter factor). A better idea (added on revising the lecture) might be to have a third field in a floating-point number, to hold the significance of the number directly.
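The third-field idea can be caricatured in a few lines of Python (my own toy model, not Mooers's proposal in detail): carry an explicit significance count alongside each value, and propagate the shorter significance through an operation such as multiplication.

```python
class SigNumber:
    """A value tagged with how many of its mantissa bits are meaningful."""
    def __init__(self, value, sig_bits):
        self.value = value
        self.sig_bits = sig_bits

    def __mul__(self, other):
        # a product is no more significant than its less significant factor
        return SigNumber(self.value * other.value,
                         min(self.sig_bits, other.sig_bits))
```

So multiplying a value carrying 17 significant bits by one carrying only 8 yields a product tagged with 8, making the loss of significance explicit rather than hidden in trailing noise bits.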

This method does not solve all the problems, for example taking account of the accumulation of rounding errors in a long calculation, which can build up the number of approximated bits at the end of a number, and hence reduce its significance. In the end no solution is suggested that satisfactorily deals with the problem, but that does not mean that the problem is one that should be ignored.

High Speed Transfer between Magnetic Storage and Electronic Storage

If a program requires access to a large number of words, much larger than the electronic memory (say 1,000,000), then the program must use the magnetic storage attached to the machine. If this access is not simple, then there is clearly a major problem in the waiting time required to reposition the magnetic equipment before transferring information. The speed of transfer of successive words may only be a couple of orders of magnitude slower than internal transfers, but the positioning time can add many further orders of magnitude, which is intolerable.

Two solutions are suggested, each requiring the availability of a large number of magnetic channels, say 100. (Mooers must surely be thinking specifically of magnetic wire as the medium being discussed, where the individual bits of a word are held in order down the wire, followed by the bits of the next word. Also let us assume that words are being transferred into the machine store under machine control in suitably sized blocks, e.g. 50 words or 200.)

One suggestion is to store the bits for a particular pair of words across all 100 channels, in the same position on each channel. Then, assuming the channels are in synchrony, at each bit position a bit from each channel will be transferred to two special holding registers, from which it will then be transferred to a pair of words in store. In this way both the waiting time required for a transfer and the transfer time itself are reduced by a factor of 100. To solve problems with synchronisation, it is suggested that the channels are laid across a single wide strip of magnetic material.

The other suggestion is that the 100 channels operate independently, with the bits of a word in successive positions down the channel. The data is now spread serially down the set of 100 channels, with e.g. up to 10,000 words on each channel. The program can request repositioning of any channel to any position at any time. The waiting time for randomly positioned data is now reduced by a factor of 100, but not the transfer time. However a possible refinement is to permit the simultaneous reading of the next bit from all 100 channels, into a holding register, and to continuously transfer the bits in this register into electronic store. However there is a problem here in that the pairs of words being transferred in succession to electronic store each contain 100 individual bits from widely scattered words in the data on magnetic store. The suggested solution is to have a special electrostatic tube to hold a full transfer (of 200 words, say, i.e. 100 * 100 bits), which can be scanned both by row and by column. Then bits can be transferred from the holding registers scanning by row and copied to standard store scanning by column.
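The row/column tube amounts to a transpose, which can be sketched as follows (my illustration, with each channel modelled as a list of bits): bits arrive one row at a time, one bit from each channel, and are read out by column as whole words.

```python
def gather_by_row_read_by_column(channels):
    """Fill the tube by row (at each tick, the next bit from every
    channel), then read it out by column, which reassembles the
    consecutive bits of each channel, i.e. whole words."""
    nbits = len(channels[0])
    rows = [[ch[t] for ch in channels] for t in range(nbits)]       # fill by row
    return [[rows[t][c] for t in range(nbits)]                      # read by column
            for c in range(len(channels))]
```

The round trip returns exactly the per-channel words, which is the point of being able to scan the tube both ways.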

Of course it would be hard for the programmer to arrange to utilise all this tube-full of information scattered as it is about the magnetic storage. But there are certainly cases where at least some of the power could be used, e.g. in multidimensional problems where cycles may need to iterate across store in large strides, at its simplest in matrix multiplication.

These ideas on magnetic storage appear somewhat 'half-baked'. Of course with full cooking we could end up with magnetic tape and even the magnetic drum!


Lecture 45      26th August 1946

A Parallel Channel Computing Machine

J.P. Eckert    (Electronic Control Company)

Parallel operations are possible in an electronic computer; they were provided in ENIAC. However they make life more difficult for the programmer. It is better to leave the programmer with a simple serial model of computer operation and consider using parallelism within the detail of instruction execution, most obviously using parallel access in the transmission of words and inside arithmetic. Provision of parallel mechanisms is more expensive, but provides much faster execution.

We can define a "serial" machine as one where a word (of say 40 bits) is transmitted by 40 pulses in series down one line, and a "parallel" machine as one where a 40-bit word is transmitted by single pulses down 40 lines. In a serial machine adding and subtracting can be carried out serially, at the same time as transmission (assuming the pulses are sent in order with least significant pulse first). Multiplication is naturally carried out serially (in a set of partial sums), but it is possible to build in a varying degree of parallelism. For a parallel machine addition and subtraction can be partially carried out in parallel at the same time as transmission, but an element of serialism is likely, to deal with carries.
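The least-significant-bit-first trick is easy to see in a sketch (my notation: a word is a list of bits, index 0 least significant):

```python
def serial_add(a_bits, b_bits):
    """Serial addition overlapped with transmission: as each bit pair
    arrives (least significant first) it is combined with the running
    carry, and the sum bit is sent straight on."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s = a + b + carry
        out.append(s & 1)
        carry = s >> 1
    return out   # any carry out of the top bit is lost, as in a fixed word
```

For example, 5 + 6 as 4-bit LSB-first words: [1,0,1,0] + [0,1,1,0] gives [1,1,0,1], i.e. 11; because the carry only ever propagates towards bits that have not yet arrived, no bit need wait for the whole word.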

However it is possible to have a compromise between serial and parallel operation even in the transmission of words. For example a 40-bit word could be transmitted using 10 channels, with 4 bits per channel. Transmission of the 10 channels could be in parallel, but the transmission of the four bits in each channel could be serial. This arrangement would be particularly appropriate in a machine which works with decimal digits, where each digit is represented by 4 bits.

The lecture goes on to give a long and detailed description of ways of carrying out (fixed-point) addition and multiplication in a parallel environment, discussing both binary and decimal representations. Eckert uses the number of vacuum tubes required to implement a method as an approximate measure of its cost. The cost versus speed considerations must be carried out carefully and in relation to the expected frequency of the different operations in programs in practice. It had been observed that an average of 2 additions and 11 store accesses (including the fetching of instructions) are typically associated with each multiplication.

Some time is spent on describing how control can be organised in a machine which can store two instructions per word of store.

The lecture finishes with a suggested order code for the machine, giving 13 orders but excluding orders for input and output. The order code is one address, using an accumulator plus an auxiliary register for use in multiplication and division. It contains the expected basic set of arithmetic operations (but including division), loading and storing, and a conditional and unconditional jump. Also included are a simple shift and a general masking/logical instruction (clear all bits of the accumulator in positions where there is a 1 in the operand).
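The masking/logical order at the end of the list has simple semantics, sketched here (the word length and names are my assumptions, taking the 40-bit word mentioned earlier in the lecture):

```python
WORD_MASK = (1 << 40) - 1   # assuming a 40-bit word, as suggested above

def mask_clear(accumulator, operand):
    """Clear every accumulator bit in a position where the operand has
    a 1 -- i.e. AND the accumulator with the operand's complement,
    truncated to the word length."""
    return accumulator & ~operand & WORD_MASK
```

So mask_clear(0b1111, 0b0101) leaves 0b1010: combined with shifts, this one order supports general field extraction and logical manipulation.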


Lecture 46      27th August 1946

A Four-Channel Coded-Decimal Electrostatic Machine

C.B. Sheppard   

This is a two-page summary of the lecture, by T.K. Sharpless (see Lecture 47), with a set of block and circuit diagrams. The lecture discussed a possible machine with decimal words being transmitted with the digits in series, but the four bits of a digit in parallel. (This is the other way round to the possibility mentioned in the previous lecture).

The machine is assumed to have a high speed memory of 1000 words, each word comprising 11 decimal digits. A number is stored with one digit for the sign and 10 digits for the number itself. The order code is 3-address. The structure of the numbers and instructions is therefore the same as referred to in Lecture 37, and Lecture 39 without the stop-order tags. The memory uses four CRT memory tubes, with the 4 bits of each digit held in the same position of each of the 4 tubes. The machine provides fixed-point arithmetic only, with multiplication, but no division.


Lecture 47       29th August 1946

Description of Serial Acoustic Binary EDVAC

T.K. Sharpless    (University of Pennsylvania)

This lecture describes the operation of a basic (and incomplete) EDVAC machine based on a Mercury Acoustic Delay Line store.

The overall control is described, using a detailed diagram of the machine's control circuits and a short piece of code. The lecture finishes with a shorter discussion of an arithmetic unit providing floating point addition, subtraction and 'modified whiffle tree' multiplication, again using a detailed diagram.

Unfortunately the description of the machine operation, and of the short piece of program provided to illustrate it, is closely tied to the detail of the overall control diagram, so it is difficult to follow without getting fully immersed in the diagram. Also the arithmetic unit described at the end uses a different word length, and although the introduction and detailed diagram imply floating-point operation, the description in the lecture reads as if it uses fixed-point.

The multiplier is said to take 70 basic unit instruction times (i.e. 70 'minor cycles') for a 36-bit word length, i.e. 2 units for each partial sum.

The following information hopefully gives some feel for the instruction format and the fragment of code given.

The store of the machine comprises a Main Store of 64 "Long Tanks" each holding 32 words each of 32 bits, and a set of 32 registers in "Short Tanks" each holding one word of 32 bits. An instruction can refer to two addresses: one of the registers, and either a register or a word in Main Store.

Of course the big problem with a Mercury Acoustic Delay Line store is that it is only possible to access a word in a Long Tank when it is its turn to be visible. There are two crucial timing units in the machine, a "Minor Cycle", which is the time it takes for the 32 bits of a word to cycle round a Short Tank, and a "Major Cycle", which is the time it takes for the 1024 bits of the 32 words to cycle round a Long Tank, i.e. 32 minor cycles. Therefore access to a random word in the Main Store causes an average delay of around 16 minor cycles, compared with the time to execute a basic instruction of one minor cycle (e.g. addition between two numbers in Short Tanks). In particular therefore the access of two successive words in Main Store in two successive instructions will tend to cause a delay of one major cycle minus the instruction time of the first instruction.
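The quoted average of around 16 minor cycles is just the mean wait for a uniformly random target position in a 32-word tank, which is easily checked (my arithmetic, not from the lecture):

```python
# mean of (target - now) mod 32 over all combinations of current tank
# position and target word -- the expected wait, in minor cycles,
# before a randomly chosen Main Store word comes round
waits = [(target - now) % 32 for now in range(32) for target in range(32)]
average_wait = sum(waits) / len(waits)
# average_wait comes to 15.5 minor cycles, i.e. around 16
```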

This potential major cycle delay between instructions is particularly embarrassing when you remember that an instruction itself demands a Main Store access. And although you can execute a simple instruction in one minor cycle, you have still got to decode the instruction before you can do this. There is an instruction buffer system which, I think, means that while the current instruction is being executed the instruction following it can be read and partly decoded. This means that where there is a succession of instructions referring to Short Tanks only, and only requiring a minor cycle to execute, they can mostly be executed in successive minor cycles. A second device is to allow the same instruction to be executed a number of times in succession before moving on to the next instruction, using a field in the instruction, field (6) below, which specifies the number of repetitions, usually set at 0.

The instruction format as specified in the lecture uses 6 fields (with some bits not used). They are detailed below, with the size in bits of the field given in square brackets:

  1. [5] The instruction code
  2. [6] A Short Tank address (6 bits not 5 in case it was decided to use more Short Tanks)
  3. [1] A bit distinguishing whether field (4) contains a Long or Short Tank address
  4. [6] A Long or Short Tank address
  5. [5] The "Place", the position of a required word in the Long Tank of field (4) if appropriate
  6. [4] The Repeat factor

The machine is illustrated with a six-instruction sequence which is part of a sequence to find a reciprocal by iteration using the approximation x_{n+1} = x_n (2 - A x_n).

The only instructions defined are those which are used (plus addition).

The sequence assumes that the instructions are held in Long Tank 1 ("LT1"), starting from first place, and the variables/values used, i.e. A, x, 2 and A (again), are held in the first 4 places of LT3.

The first two instructions are magic! The first appears to set up the second explicitly in an instruction buffer, and the second copies the first 4 words in LT3 to ST1, ST2, ST3 and ST4, using the Repeat field (field 6 above).

We now have:

ST1 = A
ST2 = x
ST3 = 2
ST4 = A

The next 4 instructions are fairly straightforward 2-address operations between the 4 Short Tank registers:
ST1 = ST1 * ST2    So ST1 now holds Ax
ST1 = ST3 - ST1    So ST1 holds 2 - Ax (an odd way round to define subtraction!)
ST2 = ST2 * ST1    So ST2 holds x(2 - Ax), the next approximation
ST1 = ST4          So ST1 holds A again, ready for the next iteration

The sequence peters out at this point, because there is no capability in the machine specified by the diagram to transfer control!

Note that the holding of two copies of A in Main Store looks unnecessary. It looks neater to only store A, x and 2, copy them into ST1, ST2 and ST3, and use ST4 instead of ST1 as the working register; so the body of the loop is now ST4 = ST1; ST4 = ST4 * ST2; ST4 = ST3 - ST4; ST2 = ST2 * ST4.
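As a quick numerical check (in Python, obviously not EDVAC code), the loop body the fragment implements really does converge to the reciprocal:

```python
def reciprocal(A, x0, iterations=6):
    """Iterate x := x(2 - Ax), the Newton-Raphson step for 1/A that
    the six-instruction fragment implements."""
    x = x0
    for _ in range(iterations):
        x = x * (2 - A * x)   # Ax, then 2 - Ax, then x(2 - Ax)
    return x
```

Starting from a rough guess such as x0 = 0.3 for A = 3, a handful of iterations agree with 1/3 to full machine precision; convergence is quadratic, the error roughly squaring on each pass, which is why so short a loop suffices.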

Note that the subtract and store instructions will only take one minor cycle and so each should save waiting (most of) a major cycle between instructions. Also the four repetitions of the second 'magic' instruction should take place in successive minor cycles. Multiplication of course takes longer than a minor cycle.


Copyright The University of Manchester 1999