Accessing main memory, or remote memory in a multiprocessor system, causes latencies. A hierarchy of cache memories strongly reduces these latencies on average, and prefetching data in software or hardware improves matters further. A second source of latency is that the target address of a branch often cannot be computed until a late pipeline stage. Branch prediction techniques usually hide this latency, but after a wrong prediction the branch must be undone and some processor cycles are wasted.
In decoupled architectures the sequential instruction stream is divided into two specialised instruction streams: one contains only computation instructions, the other only load/store instructions. Two dedicated units connected by FIFO buffers execute these instruction streams. The development of a "good" compiler for such architectures is still ongoing.
Multithreaded processors bridge the latencies by switching quickly between several instruction streams. With the block interleaving technique, the instructions of one thread of control are executed consecutively until an event that causes latency is detected; the processor then performs a context switch. For the Sparcle processor this event is a cache miss or a failed synchronisation. The disadvantage of the switch-on-cache-miss technique is that cache misses can only be detected late in the pipeline.
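The block interleaving idea can be illustrated with a small software model. This is only a sketch under stated assumptions, not a description of any real pipeline: the function name `block_interleave` and the marker string `"MISS"` (standing in for a latency event such as a cache miss or a failed synchronisation) are hypothetical.

```python
from collections import deque


def block_interleave(threads):
    """Run each thread until it hits a latency event, then switch.

    `threads` maps a thread name to a list of instruction strings.
    The string "MISS" is a hypothetical marker for a long-latency event.
    Returns the order in which (thread, instruction) pairs were executed.
    """
    ready = deque((name, deque(instrs)) for name, instrs in threads.items())
    trace = []
    while ready:
        name, instrs = ready.popleft()
        while instrs:
            instr = instrs.popleft()
            trace.append((name, instr))
            if instr == "MISS":
                break  # latency event detected: end this block
        if instrs:
            # Context switch: the thread resumes after the others had a turn.
            ready.append((name, instrs))
    return trace
```

With two threads `A = ["add", "MISS", "sub"]` and `B = ["mul", "MISS"]`, thread A runs until its miss, thread B takes over, and A finishes afterwards.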
The Rhamma processor bridges latencies by a fast context switch, combining the block interleaving technique with the decoupled architecture. As in a decoupled architecture, the instructions are divided into two classes, which are executed by decoupled units. In contrast to the decoupled architecture, several register sets are physically present on the chip, and the units execute instructions of different threads of control. A context switch takes place whenever a unit encounters an instruction of the class it does not execute. The compiler therefore no longer needs to separate the instruction streams.
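The hand-over of threads between the two units can be sketched as follows. This is a sequential toy model, not the Rhamma hardware: the mnemonics `lw` and `sw`, the unit names `"execute"` and `"memory"`, and the functions `instruction_class` and `run` are assumptions made for illustration, and the real units work concurrently.

```python
from collections import deque

LOAD_STORE = {"lw", "sw"}  # hypothetical mnemonics of the load/store class


def instruction_class(mnemonic):
    """Classify an instruction as belonging to the memory or execute unit."""
    return "memory" if mnemonic in LOAD_STORE else "execute"


def run(threads):
    """Return, per unit, the (thread, instruction) pairs it executed."""
    queues = {"execute": deque(), "memory": deque()}
    for name, instrs in threads.items():
        if instrs:
            pending = deque(instrs)
            queues[instruction_class(pending[0])].append((name, pending))
    executed = {"execute": [], "memory": []}
    while queues["execute"] or queues["memory"]:
        for unit in ("execute", "memory"):
            if not queues[unit]:
                continue
            name, pending = queues[unit].popleft()
            # Stay on this thread while its instructions match the unit's class.
            while pending and instruction_class(pending[0]) == unit:
                executed[unit].append((name, pending.popleft()))
            if pending:
                # Wrong class next: context switch, hand the thread over.
                other = "memory" if unit == "execute" else "execute"
                queues[other].append((name, pending))
    return executed
```

No compiler pass splits the streams here; the classification happens per instruction at run time, which mirrors the point made in the text.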
In contrast to similar multithreaded processors with block interleaving, the instruction fetch stage can already detect whether a context switch is required (switch-on-load strategy). For this purpose one bit of the instruction opcode is reserved to indicate the context switch. This output bit of the instruction fetch stage can be used directly as a control input for the instruction fetch, which reduces the time overhead of a context switch to at most one processor cycle. However, context switches occur more often.
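Testing a reserved opcode bit is a single mask operation, as the following sketch shows. The bit position (bit 31 of a 32-bit instruction word) and the names `SWITCH_BIT`, `switch_requested` and `mark_load_store` are assumptions chosen for illustration; the text does not state which bit Rhamma reserves.

```python
SWITCH_BIT = 1 << 31  # assumed position of the reserved bit in a 32-bit word


def switch_requested(instruction_word):
    """True if the reserved bit marks this instruction as a switch point."""
    return bool(instruction_word & SWITCH_BIT)


def mark_load_store(instruction_word):
    """Set the reserved bit, as an assembler might do for load/store opcodes."""
    return instruction_word | SWITCH_BIT
```

Because the decision depends on one bit only, no full decode is needed before the fetch stage knows that a switch is due.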
To avoid the loss of a cycle when the same instruction sequence is executed repeatedly, a context switch buffer was developed that predicts the instruction class. It is a hardware buffer that collects the addresses of previously executed load/store instructions. Before an instruction is fetched from the instruction cache, the context switch buffer checks whether the address of the instruction is already stored in it. If so, the context is switched and an instruction of another thread is fetched in the next cycle. If the address is not found in the context switch buffer, the next instruction of the current thread of control is fetched and handed over to the decode stage. If this instruction is a load/store instruction, its address is written to the context switch buffer. Software simulations indicate that 32 to 128 entries suffice.
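The lookup-then-record behaviour described above can be modelled in a few lines. This is a minimal sketch, not the Rhamma design: the class and function names, the result strings, and the first-in-first-out replacement policy are assumptions, since the text does not say how entries are replaced.

```python
from collections import OrderedDict


class ContextSwitchBuffer:
    """Remembers addresses of previously seen load/store instructions."""

    def __init__(self, entries=32):
        self.entries = entries
        self._addresses = OrderedDict()

    def predicts_switch(self, address):
        return address in self._addresses

    def record(self, address):
        if address in self._addresses:
            return
        if len(self._addresses) >= self.entries:
            self._addresses.popitem(last=False)  # evict oldest entry (assumed FIFO)
        self._addresses[address] = True


def fetch(buffer, address, is_load_store):
    """Model one fetch of the instruction at `address`.

    Returns "predicted_switch" on a buffer hit, "recorded" when a load/store
    instruction was fetched without a hit and its address was stored, and
    "continue" for any other instruction.
    """
    if buffer.predicts_switch(address):
        return "predicted_switch"
    if is_load_store:
        buffer.record(address)
        return "recorded"
    return "continue"
```

The first pass over a loop body yields `"recorded"` for each load/store address; from the second pass on, the same addresses yield `"predicted_switch"` as long as they have not been evicted.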

Rhamma's microarchitecture
Software simulations of a tightly coupled multiprocessor system with Rhamma processors as nodes demonstrate that the presented multithreaded processor is able to bridge long memory access times, provided enough workload is present. A secondary-level cache is therefore unnecessary.
The correct behaviour of the processor has been tested by hardware simulation. Hardware synthesis for an ASIC using a 1.0 µm library yields a clock frequency of 20 MHz, i.e. a processor cycle time of 50 ns.
The Rhamma processor shows that a switch-on-load technique leads to an efficient processor when a special instruction encoding is combined with a predecoding mechanism and a context switch buffer.
| Reference | PostScript | PDF |
|---|---|---|
| K. Bittnar, W. Grünewald, T. Ungerer: Entwurf einer vielfädigen Prozessorarchitektur zum Einsatz in Distributed-Shared-Memory-Systemen. PARS-Workshop Potsdam, Sept. 1994, pages 85-94 | Pars94.ps.gz (621 kByte) | Pars94.pdf.gz (36 kByte) |
| D. Riekert: Compileroptimierungen zum Vorabladen von Registern. Master thesis, Dept. of Computer Science, University of Karlsruhe, 1994 | | |
| W. Grünewald, T. Ungerer: Simulation einer vielfädigen Prozessorarchitektur. GI/ITG Workshop PARS 1995, Stuttgart, Oct. 1995, pages 219-226 | Pars95.ps.gz (536 kByte) | Pars95.pdf.gz (47 kByte) |
| S. Sandmann: Codeerzeugung für eine vielfädige Prozessorarchitektur. Seminar thesis, Dept. of Computer Science, University of Karlsruhe, 1995 | | |
| B. Grünewald: Ein Modula-2 Compiler für eine erweiterte DLX-Architektur. Master thesis, Mathematisch-Naturwissenschaftliche Fakultät, Universität Augsburg, 1995 | | |
| U. Garbe: Modellierung einer vielfädigen Prozessorarchitektur in VHDL. Master thesis, Dept. of Computer Science, University of Karlsruhe, 1995 | | |
| J. Eisenbiegler: Konsistenzmodelle aus Sicht des Programmierers. Master thesis, Dept. of Computer Science, University of Karlsruhe, 1995 | | |
| W. Grünewald, T. Ungerer: Towards Extremely Fast Context Switching in a Blockmultithreaded Processor. Proceedings of the 22nd Euromicro Conference, Prague, Sept. 1996, pages 592-599 | EuroMicro96.ps.gz (894 kByte) | EuroMicro96.pdf.gz (426 kByte) |
| B. Köcke: Ein Cache für eine vielfädige Prozessorarchitektur. Seminar thesis, Dept. of Computer Science, University of Karlsruhe, 1995 | | |
| J. Kreuzinger: Entwurf und Synthese einer Prozessor-Pipeline mit schnellem Kontextwechsel. Master thesis, Dept. of Computer Science, University of Karlsruhe, 1996 | | |
| W. Grünewald, T. Ungerer: A Multithreaded Processor Designed for Distributed Shared Memory Systems. International Conference on Advances in Parallel and Distributed Computing, Shanghai, March 1997, pages 206-213 | APDC97.ps.gz (526 kByte) | APDC97.pdf.gz (154 kByte) |
| W. Grünewald, T. Ungerer: Die mehrfädige Prozessorarchitektur Rhamma. ITG/GI-Fachtagung Architektur von Rechensystemen, Rostock, Sept. 1997, pages 171-180 | Arcs97.ps.gz (472 kByte) | Arcs97.pdf.gz (36 kByte) |
| W. Grünewald: Rhamma - Eine entkoppelte mehrfädige Prozessorarchitektur. PhD thesis, Dept. of Computer Science, University of Karlsruhe, 1998. To be published by: Shaker-Verlag, 1998 | | |
| W. Grünewald, K. Schneider: Modeling and Verifying Abstract Multithreaded Systems. Gemeinsamer Workshop der GI/ITG/GME Fachgruppen Methoden des Entwurfs und der Verifikation digitaler Schaltungen und Systeme, Beschreibungssprachen und Modellierung von Schaltungen und Systemen, Paderborn, March 1998, pages 85-94 | GI98.ps.gz (59 kByte) | GI98.pdf.gz (113 kByte) |

| Unpublished | PostScript | PDF |
|---|---|---|
| W. Grünewald: Slides for the talk to get the PhD degree | Vortrag.ps.gz (171 kByte) | Vortrag.pdf.gz (42 kByte) |
| J. Kreuzinger: VHDL sources of a Rhamma model | jkr.ps.gz (32 kByte) | jkr.pdf.gz (76 kByte) |

Winfried Grünewald, 20 March 1998