Index of Multiple Processor

1. Multiple processor systems

Computer systems must become faster and smaller, to meet the requirements of the industry. But unfortunately we are beginning to hit some fundamental physical limits on clock speed. Since no electrical signal can travel faster than the speed of light (30cm/nsec) in vacuum and about 20cm/nsec in copper wire of optical fiber the size of a processor is limited:

  • for 10 GHz processor the signal can not travel more than 2cm
  • for 100 GHz processor the signal can not travel more than 2mm
  • for 1 THz processor the processor must be smaller than 100 microns

We face an other huge problem. The heat dissipation. To get rid of those problems and still have a great improvement in speed is to build parallel computers or biological computers.

1.1 Parallel computers

Parallel computers are built with conventional hardware (processors) connected together using bus system. All communication between the cores (processors or computers) are realized using messages.

There are 3 general types of multi processor systems:

Shared-memory multiprocessor.

This system is invisible to the programmer. The message passing is done under cover. The implementation is not very easy, and there are some limits as well.

Accessing a memory word takes about 0.002-0.01 µsec.

Message-passing multicomputer.

The processor and memory pairs are connected to a fast high speed interconnect. The system is also called message passing multicomputer.

Implementation is much simpler than the memory shared multiprocessor.

Accessing a memory word takes about 10-50 µsec.

Wide area distributed system.

This system connects complete computer systems over a wide area network, such as Internet. It forms a distributed system. Since the accessing is rather slow the systems are loosely coupled.

Accessing a memory word takes about 10'000-100'000 µsec.

1.2 Multiprocessors

A computer system witch contains more than one processor accessing to a common RAM is called Multiprocessor.

Regular operation systems can be used with some extended features such as

  • process synchronization
  • resource management
  • scheduling

There are two major groups of multiprocessors:

  • UMA (Uniform memory access). Here every memory word can read as fast as any other memory word.
  • NUMA (Nonuniform memory access) does not have this property

1.2.1 Bus based architecture for UMA Multiprocessors

This is the simplest way. All processors and one or more memory module are using the same bus for communication.

1.The CPU waits until the bus is ready
2.Put the address of the word on the bus and waits until the memory puts the requested memory word on the bus
3.If the bus is busy the CPU has to wait.

The major problem is step 3. The system will be totally limited by the bandwidth of the bus. To get a better performance each processor gets its own cache. Some of the reads can be satisfied over the cache. This reduces the usage of the bus.

The cache handling must be extended.

  • The cache block is marked as read only if it is present in multiple caches
  • The cache block is marked as read-write if it is not present in any other caches

If a CPU wants to write a word witch is in more than one cache (at least one remote cache), the bus hardware detects the write command and puts a signal to the bus informing all caches of the write. If one of the caches has a "dirty" entry (already modified) the cache must write the modified copy back to the memory before the current cache can read and modify the word.

Even with the best caching the numbers of CPU are limited to max. 32 CPU's.

1.2.2 Using Crossbar switches

Using crossbar switches up to 100 CPU's can be connected. The crossbar switches connects n CPU's to k memories. For more CPU's the network becomes to complicated and very expensive, since n*k switches are needed.

A crosspoint can be opened or closed. If the crosspoint (i,j) is closed the i-th CPU is connected to the j-th memory block.

Using this architecture many CPU's can access the memory at the same time (parallel). The crossbar switch is a non-blocking network. If a CPU A tries to access a memory block already accessed by an other CPU B, the CPU A has to wait until CPU B finishes the access.

One of the biggest disadvantages is that the network grows as O(n^2).

1.2.3 Using Multistage Switching Networks

A different multiprocessor architecture can be achieved using the humble 2x2 switch.

The message arrives either on input A or B and is routed according to some header information to the output X or Y. The message header is composed by 4 fields:

  • Module: Tells witch memory to use
  • Address: Specifies an address within the module
  • OpCode: Operation (read or write)

The switch looks at the Module-field and decides if the message should be sent to output X or Y.

Using the 2x2 humble switch larger networks can be built. One possibility is to build an omega network (also called perfect shuffle). The omega network using n CPU's and n memories log2n stages are needed, with n/2 switches per stage. Instead of n2 swiches (for crossbar switched network) only (n/2)log2n switches are needed to build the omega network.

The image illustrates how 2 CPU's A (001) and B (011) accesses two different memory blocks M (001) and N (110).

The CPU A puts 001 for memory module 001 onto the Module-Field of the message. The first stage looks at the first bit witch is 0 and activates the output line 1B.X. The second stage 2C analyses the 2nd bit witch is also 0 and activates the output line 2C.X. The last bit is analyzed by the last stage. The 3A switch activates the 3A.Y output since the last bit is 1.

If an other CPU wants to access the memory 001 it has to wait. Therefore the omega network is a blocking network.

1.2.4 NUMA Multiprocessors

To connect more than 100 CPU's together an other approach is needed. Like UMA architectures the NUMA uses also a single address space across all CPU's but the memory is departed into local and remote memory. Accessing to local memory is therefore much faster.

There are two types of NUMA architectures: The NC-NUMA (no-cache NUMA) and the CC-NUMA (cache-coherent NUMA). Most popular is the CC-NUMA using a directory based architecture. The idea is to maintain a database telling where each cache line is and what status it has.

1.3 Operation System Types

There are different possibilities to handle multiprocessors by a operation system.

  • Each CPU has its own OS. This is the simplest way.
    • A system call of an application is automatically handled by the correct OS of the same CPU
    • The scheduling is done at each processor itself.
    • Each CPU has its own memory (static assigned)
    • IF the OS maintains buffer caches (of recently used disk blocks) the buffer might be inconsistent, since other CPU's might change content on the disk. Buffer caches should not be used.
  • Master-Slave Multiprocessors. The OS runs on the master and all client processes are running on slave processors. If a processor is idle it can ask the master for a new process to run.
    • Can not happen that one slave is overloaded and one is idle.
    • Buffer cache is maintained only by the master processor (OS)
    • All system calls are handled by the master processor. The master processor becomes a bottle neck for more than 10 CPU's.
  • Symmetric Multiprocessors. The OS is in memory and every processor can run it.
    • OS must be redesigned and independent sections must be locked by a critical section

1.4 Multiprocessor synchronization

On a single CPU machine after calling a system call, the system has to access a table containing the critical section locks. To do that it was sufficient to disable the interrupt handling before accessing the table.

On a multiprocessor system this simple mechanism will not work, since other processors can still access the table. Therefore some other synchronization is needed. A proper mutex protocol must be used and respected by all CPU's to guarantee that the mutual exclusion works.

  • Synchronization using TSL (Test and Set Lock). Needs 2 bus cycles and is therefore not a atomic operation. In worst case more than one CPU can read and set the lock. The mutual exclusion fails.
  • Synchronization using TSL and bus lock. First the TSL instruction locks the bus, reads the word, compares and might write back a non zero value. At the end the bus is unlocked. This prevents other CPUs using the bus at the same time. Since each CPU is looping over this lock (spin lock) a massive load is put on the CPU and the bus.
    • To optimize the bus traffic a preread can be done, where as the CPU checks first if the lock is free before using the TSL.
    • Also possible is a waiting algorithm (like Ethernet does). First it waits one instruction, the two, four etc. up to a maximum.
    • FIFO of waiting CPUs looping on a private lock. The first CPU in list is the lockholder. If it finishes it releases the lock of the next CPU. This concept is efficient and starvation free.

Note: TSL Instruction

TSL (Test and Set Lock) instruction reads the content of a memory word into a register and then stores a nonzero value at the memory address. The operation is indivisible. It locks the bus until it finishes so no other CPU can access the bus at its execution time.

TSL R0, lock    ;
...             ; Do some work here
MOVE lock, #0   ; Release lock