


1. Multiple processor systems

Computer systems must become faster and smaller to meet the requirements of the industry. Unfortunately, we are beginning to hit fundamental physical limits on clock speed. Since no electrical signal can travel faster than the speed of light (about 30 cm/nsec in vacuum and about 20 cm/nsec in copper wire or optical fiber), the size of a processor is limited: at a 10 GHz clock, for example, a signal can travel at most about 2 cm in copper during one clock cycle, so the whole processor must fit within that distance.

We also face another huge problem: heat dissipation. One way to sidestep these problems and still achieve a large improvement in speed is to build parallel computers (or, more speculatively, biological computers).

1.1 Parallel computers

Parallel computers are built from conventional hardware (processors) connected together by a bus or other interconnection system. All communication between the nodes (processors or computers) is realized using messages.

There are three general types of multiprocessor systems:

Shared-memory multiprocessor.

In this system the communication is invisible to the programmer; any message passing happens under the covers, in hardware. The implementation is not easy, and there are limits on how far it scales.

Accessing a memory word takes about 0.002-0.01 µsec.

Message-passing multicomputer.

Here each processor is paired with its own memory, and the CPU-memory pairs are connected by a high-speed interconnect over which they exchange messages explicitly, hence the name message-passing multicomputer.

The implementation is much simpler than that of a shared-memory multiprocessor.

Accessing a memory word takes about 10-50 µsec.

Wide area distributed system.

This system connects complete computer systems over a wide-area network such as the Internet, forming a distributed system. Since remote access is rather slow, the systems are loosely coupled.

Accessing a memory word takes about 10,000-100,000 µsec.

1.2 Multiprocessors

A computer system which contains more than one processor with access to a common RAM is called a multiprocessor.

Regular operating systems can be used, extended with some additional features such as the scheduling and synchronization mechanisms discussed below.

There are two major groups of multiprocessors: UMA (Uniform Memory Access) and NUMA (Non-Uniform Memory Access) machines.

1.2.1 Bus-based architecture for UMA Multiprocessors

This is the simplest approach: all processors and one or more memory modules use the same bus for communication. To read a memory word, a CPU proceeds as follows:

1. If the bus is busy, the CPU waits until the bus becomes free.
2. Once the bus is free, the CPU puts the address of the desired word on the bus.
3. The CPU then waits until the memory puts the requested word on the bus.

The major problem is the waiting in step 1: the whole system is limited by the bandwidth of the single bus. To get better performance, each processor is given its own cache, so that many reads can be satisfied from the cache. This greatly reduces the load on the bus.
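
The following is a minimal, self-contained C sketch of this read path; names such as bus_busy and struct cache are invented for the example. It only illustrates that a cache hit avoids the shared bus entirely, not how real bus arbitration or cache hardware works.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define MEM_WORDS 16
static uint32_t memory[MEM_WORDS];      /* the shared memory module           */
static bool     bus_busy = false;       /* the single shared bus              */

/* one-entry "cache" per CPU, just enough to show that a hit avoids the bus */
struct cache { bool valid; uint32_t addr, word; };

static uint32_t read_word(struct cache *c, uint32_t addr)
{
    if (c->valid && c->addr == addr)    /* cache hit: no bus traffic at all   */
        return c->word;

    while (bus_busy)                    /* step 1: wait until the bus is free */
        ;                               /* (real hardware uses bus arbitration) */
    bus_busy = true;
    uint32_t word = memory[addr];       /* steps 2-3: address out, word back  */
    bus_busy = false;

    c->valid = true; c->addr = addr; c->word = word;
    return word;
}

int main(void)
{
    memory[3] = 42;
    struct cache c0 = { false, 0, 0 };
    printf("%u\n", (unsigned)read_word(&c0, 3));  /* miss: goes over the bus          */
    printf("%u\n", (unsigned)read_word(&c0, 3));  /* hit: served from the local cache */
    return 0;
}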

The cache handling must, however, be extended to keep the caches coherent.

If a CPU wants to write a word which is present in more than one cache (at least one remote cache), the bus hardware detects the write and puts a signal on the bus informing all caches of it. If one of the caches holds a "dirty" (already modified) copy of the word, that cache must write the modified copy back to memory before the writing CPU is allowed to read and modify the word.

Even with the best caching, a bus-based system is limited to a maximum of about 32 CPUs.

1.2.2 Using Crossbar Switches

Using crossbar switches, up to about 100 CPUs can be connected. A crossbar switch connects n CPUs to k memories. For more CPUs the network becomes too complicated and very expensive, since n*k crosspoint switches are needed.

A crosspoint can be open or closed. If crosspoint (i,j) is closed, the i-th CPU is connected to the j-th memory module.

With this architecture many CPUs can access memory at the same time (in parallel); the crossbar switch is a non-blocking network. Only if CPU A tries to access a memory module that is already being accessed by another CPU B does CPU A have to wait until CPU B finishes its access.

One of the biggest disadvantages is that the number of crosspoints grows as O(n^2) when n CPUs are connected to n memories.
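
As a rough illustration (not a hardware model), the C sketch below represents the crosspoints as an n x k boolean matrix: two CPUs can connect to different memory modules in parallel, while a CPU aiming at a module that is already in use has to wait. All names are invented for the example.

#include <stdio.h>
#include <stdbool.h>

#define N 4   /* CPUs           */
#define K 4   /* memory modules */

static bool crosspoint[N][K];      /* n*k switches: the O(n^2) cost for n = k */

/* Try to connect CPU i to memory module j.  The request fails only if
   another CPU already holds module j; the crossbar itself never blocks. */
static bool connect(int i, int j)
{
    for (int other = 0; other < N; other++)
        if (crosspoint[other][j])
            return false;          /* module j busy: CPU i must wait */
    crosspoint[i][j] = true;       /* close crosspoint (i, j)        */
    return true;
}

static void disconnect(int i, int j) { crosspoint[i][j] = false; }

int main(void)
{
    printf("CPU 0 -> M2: %s\n", connect(0, 2) ? "connected" : "wait");
    printf("CPU 1 -> M3: %s\n", connect(1, 3) ? "connected" : "wait"); /* in parallel */
    printf("CPU 2 -> M2: %s\n", connect(2, 2) ? "connected" : "wait"); /* conflict    */
    disconnect(0, 2);
    return 0;
}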

1.2.3 Using Multistage Switching Networks

A different multiprocessor architecture can be achieved using the humble 2x2 switch.

A message arrives on input A or B and is routed, according to header information, to output X or Y. The message header is composed of four fields: Module (which memory module to use), Address (the address within that module), Opcode (the operation, e.g. READ or WRITE) and Value (an optional operand, e.g. the word to be written).

Each switch looks at the Module field and decides whether the message should be sent to output X or output Y.
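
Purely as an illustration, such a header could be represented as a small record; the field names and widths below are assumptions, not a fixed format.

#include <stdint.h>

/* Illustrative layout of the message header described above. */
struct switch_msg {
    uint8_t  module;    /* which memory module to use               */
    uint32_t address;   /* address within that module               */
    uint8_t  opcode;    /* requested operation, e.g. READ or WRITE  */
    uint32_t value;     /* optional operand, e.g. the word to write */
};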

Using the humble 2x2 switch, larger networks can be built. One possibility is the omega network (also called a perfect shuffle). An omega network connecting n CPUs to n memories needs log2(n) stages, with n/2 switches per stage. Instead of the n^2 switches of a crossbar network, only (n/2)*log2(n) switches are needed; for n = 8 this means 3 stages of 4 switches each, i.e. 12 switches instead of 64 crosspoints.

The figure illustrates how two CPUs, A (001) and B (011), access two different memory modules, M (001) and N (110).

CPU A puts 001, the number of memory module M, into the Module field of its message. The first-stage switch 1B looks at the first bit, which is 0, and activates output line 1B.X. The second-stage switch 2C examines the second bit, which is also 0, and activates output line 2C.X. The last bit is examined by the last stage: switch 3A activates output 3A.Y, since the last bit is 1.

If another CPU wants to access memory module 001 at the same time, it has to wait. The omega network is therefore a blocking network.
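
The routing rule (one destination bit per stage, most significant bit first) can be sketched in a few lines of C; the value 3 for log2(8) and the example module numbers are specific to the 8-CPU network described above.

#include <stdio.h>

#define STAGES 3                 /* log2(8): 8 CPUs, 8 memory modules */

/* Print the path taken by a request for the given memory module:
   each stage looks at one bit, 0 -> upper output (X), 1 -> lower output (Y). */
static void route(unsigned module)
{
    printf("to module %u%u%u: ",
           (module >> 2) & 1, (module >> 1) & 1, module & 1);
    for (int stage = 0; stage < STAGES; stage++) {
        int bit = (module >> (STAGES - 1 - stage)) & 1;   /* MSB first */
        printf("stage %d -> %c  ", stage + 1, bit ? 'Y' : 'X');
    }
    printf("\n");
}

int main(void)
{
    route(1);   /* 001: X, X, Y -- CPU A's request to module M from the text */
    route(6);   /* 110: Y, Y, X -- CPU B's request to module N               */
    /* total switches: (n/2)*log2(n) = 12 for n = 8, versus n^2 = 64 crosspoints */
    return 0;
}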

1.2.4 NUMA Multiprocessors

To connect more than 100 CPUs, another approach is needed. Like UMA architectures, NUMA machines present a single address space across all CPUs, but the memory is divided into local and remote memory. Accessing local memory is therefore much faster than accessing remote memory.

There are two types of NUMA architectures: NC-NUMA (no-cache NUMA) and CC-NUMA (cache-coherent NUMA). The most popular is CC-NUMA with a directory-based architecture. The idea is to maintain a database (the directory) telling where each cache line currently is and what status it has.
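
A minimal sketch of what such a directory entry might contain is shown below; the states and field names are assumptions for illustration, not a specific machine's format.

#include <stdint.h>

enum line_state { UNCACHED, SHARED, MODIFIED };

/* One entry per cache line of local memory, kept by that line's home node. */
struct dir_entry {
    enum line_state state;    /* what status the line has                   */
    uint64_t        sharers;  /* bitmap of nodes holding a (shared) copy    */
    uint16_t        owner;    /* node holding the line if state == MODIFIED */
};

/* e.g. 2^16 cache lines of local memory -> 2^16 directory entries */
static struct dir_entry directory[1 << 16];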

1.3 Operating System Types

There are different ways in which an operating system can handle a multiprocessor, from giving each CPU its own private operating system up to running one symmetric operating system (SMP) shared by all CPUs.

1.4 Multiprocessor synchronization

On a single-CPU machine, when a system call needs to access a kernel table containing the critical-section locks, it is sufficient to disable interrupt handling before accessing the table.

On a multiprocessor system this simple mechanism does not work: disabling interrupts affects only the CPU that does it, and the other processors can still access the table. Therefore some other synchronization is needed: a proper mutex protocol must be used and respected by all CPUs to guarantee that mutual exclusion works.

Note: TSL Instruction

The TSL (Test and Set Lock) instruction reads the contents of a memory word into a register and then stores a nonzero value at that memory address. The operation is indivisible: the instruction locks the bus until it finishes, so no other CPU can access memory during its execution.


enter:  TSL R0, lock     ; copy lock into R0 and atomically set lock to 1
        CMP R0, #0       ; was the lock free (0) before?
        JNE enter        ; no: another CPU holds it, so try again
        ...              ; critical section: do some work here
        MOVE lock, #0    ; release the lock by storing 0
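
For comparison, a minimal sketch of the same idea in C11 is shown below; atomic_flag_test_and_set plays the role of the TSL instruction, and the function names are chosen only for this example.

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

static void enter_region(void)
{
    /* atomic_flag_test_and_set is an indivisible test-and-set:
       spin until it reports that the flag was previously clear. */
    while (atomic_flag_test_and_set(&lock))
        ;
}

static void leave_region(void)
{
    atomic_flag_clear(&lock);   /* store 0: release the lock */
}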