The article describes the 'Siberia' project, which is being developed at the Computing Centre of the Siberian Division of the Russian Academy of Sciences. This project has a pragmatic and a research objective. The pragmatic objective is the creation of a high-performance computer system (HCS) intended for the operation of the supercomputing centre for global simulation and large-scale computation. The research objective is the investigation of ways to develop open general-purpose parallel systems of special-purpose parallel components. The latter should be a basis for operational design, development or adaptation of scalable problem-oriented systems. Tile concepts of large-block system design, sustaining multiple hardware paradigms, multiple software models, as well as existing languages and environments, are considered. The architecture and software of the HCS are presented. Topics for possible research co-operation are emphasised.
At present, high-performance computer systems, large-scale computations, global simulations and parallel programming are very important for research, engineering and education. Complex mathematical models replace or supplement traditional experimental methods, where those come up against principal or technological limits. The simulation may not only provide more accurate information more quickly, but also it will be cheaper than the experiment. For example, here it is possible to mention aerodynamics applications, energy research problems or fire spread simulation for complex constructions. Moreover, high- performance computer systems of today will become the ordinary facilities of tomorrow and will begin to be used by manufacturing enterprises. However, many Soviet research laboratories, universities and enterprises have no adequate computer technologies and no opportunity to take an active part in the development and use of the new information technologies and operations, such as integrated manufacturing. Serious problems for us are the high cost and various restrictions on IT as well as the isolation of local universities, and research institutions, lack of networks etc.
In an attempt to cope with this situation, the Computing Centre of Novosibirsk proposed the 'Siberia' project. The objectives of the project are the following:
To meet these objectives we used the concept of large-block system design and developed the architecture of and software for the high-performance 'Siberia' computer system to support large-scale computations in geophysics, continuum mechanic etc. We began the respective digital experiments and a number of research projects to sustain the premise: 'any module should be changeable, any unit may be ‘scalable'. These projects include the investigation of the idea of multiple hardware paradigms (vector, vertical. VLIW architectures etc.) and multiple software models (communicating sequential processes, MVS operating system, massive parallelism etc.) within one computer system. They support existing languages and software environments. In addition, in order to work out the requirements on the generic set of building blocks, we are researching problems of the parallel interpretation of very-highlevel languages, the creation of a processor that does not introduce rounding errors, the parallelisation of input/output operations, the design of processors with massive parallelism etc.
Computers have always been I designed as multiblock systems. However, the number, type and function of the 'building’ blocks as well as the intermodular interface have been subject to continuous changes. Also, functional capabilities of single blocks have been increasing. The problem of selecting an appropriate set of modules for computing systems has not yet been solved. Evidently, on the one hand, the objectives, in particular the actual properties of intended applications must be considered, and, on the other hand, it is necessary to be realistic and estimate the existing situation. Analysing our aims and possibilities we realise the necessity for application-driven machines in the form of parallel systems of large-scale hardware-software modules whose composition and features are in continuous development in keeping with the latest achievements of electronics and mathematics. The concept of large-block system design is based on the approach in which not chips but large modules, represented by special-purpose processors including those with mass parallelism, work stations, multiport memory units and even mainframes, act as building blocks. This approach makes it possible:
There are many examples of using this concept. One of them is the systems LCAP-N developed by IBM and FPS ( see 1, 2 ). Unlike the systems described in References 1 and 2, in order to use the advantages of various kinds of parallelism we consider more heterogeneous complexes and types of intermodule interfaces. Our 'generalisation’ (3, 4) enables one to focus attention on the following questions: what large-scale modulq should constitute the complex, what the prospective forms of controlling processes and resources in multiblock parallel systems are, what program modules are to support this control and where they must be located etc.
Surely, there can be no single answer to these questions. So, we suggest a concept of a general- purpose kernel surrounded by a set of special-purpose subsystems. Each of these subsystems may have a hierarchical structure of its own. i.e. it may consist of a kernel and satellites, or it may have one-level homogeneous structure. Within this hypothetical architecture the characteristics of the general-purpose kernel can be varied as well as the set of the problem-oriented subsystems, number of levels in the hierarchy and the degree of homogeneity (Fig. I). It must be emphasised that selecting special-purpose subsystems we can achieve practically any degree of universality of our computing system as a whole.
The system software should include, in addition to standard software of general- and special-purpose processors, new language means and monitoring systems intended for organisation of the computation processes based on the resources of the whole system. To ensure an efficient utilisation of the system resources under parallel processing it is important to have different controlling strategies and the opportunity of choosing the system environment preferred by the user.
'Siberia' system consists at present of large modules assembled into a single installation by the principle of extensibility. The modules of ‘Siberia' are grouped into several subsystems (Fig. 2). The central part is a multimachine subsystem including three Soviet ES-1066 mainframes (ES is an IBM-type series of mainframes combined with a set of special- purpose processors: ES-1066 is an IBM-370 compatible computer). In addition to its main function of general-purpose data processing, this subsystem acts as a host computer for vector-pipelined, vector-parallel. and associative subsystems. At present, the peak performance of this central subsystem is 36 MIPS.
The vector-pipelined subsystem is a set of eight Bulgarian processors ES-27065 (ES-2706 is an AP-l 90L compatible processor). This system enables pipelined, macro-pipelined and parallel data processing. The peak performance of this subsystem is 96 MFLOPS.
The vector-parallel sulysystem consists of Soviet PS-computers (PS is a series of Soviet array computers (see 6)). Each PS-computer includes a parallel processor with between eight and 64 processing elements (PE) together with a monitor machine and an external menrofy system. At present. the peak performance of the vector- parallel subsystem is 200 MIPS.
The asociative subsystem is a STARAN-like computer which enables the use of various operations on vertical bit slices of data. At present, the system has four associative memory unite (AMU) of 4096*256 bits each and 1024 one-bit processors. Its peak performance is 1010 one-bit operations per second.
As an extensible system, 'Siberia' may include other types of special- punpose processors. We would like to include Bulgarian array processors of the type ES-2709 with a peak perfonmance of 72 MFLOPS each (ES-2709 is a S-5000 compatible processor of FPS). In Fig. 2 they are inside the bottom-line rectangle.
The parallel programming system 'lnya' has been developed to describe large-scale and complicated applications as communicating processes over all 'Siberia' subsystems.7 This system supports the distribution of processes over several hierarchical levels. The upper level is occupied by the master process generating slave processes at the lower levels. Each slave process in tum may act as a master process. Any slave process may communicate with its master process as well as with the processes at its own level in the hieramchy. Hierarchical multi-level ordering of processes generally predetermines a hierarchical style of parallel programming. First, elementary processes of the lowest level are programmed and the corresponding load modules are being prepared. Then begins the creation of the master processes. A master process can generate various slave processes on the basis of one and the same set of elementary processes. To describe these slave and elementary processes, one can use any available programming systems of special-purpose processors. Hierarchical style of programming provides the best use of various kinds of parallelism for solving one problem.
The number of processes generated during the execution of a parallel program may exceed the number of available processors. But 'lnya' can also provide an organisation of computation where the amount of processes being generated automatically equals the number of available processors, and the data are dynamically distributed among the generated processes. In other words, the parallel programs may be adapted to the number of resources available before run time and provide highly efficient load balancing. To describe the slave processes of the lower level we developed several special parallel programming facilities.
In order to program these processes for the associative subsytem a set of routines intended for mufti pleprecision paraltel arithmetic has been created.8
The set is intended for effective programming of applications with a high proportion of vector operations. From the user’s point of view, this complex is a programmable vector processor which will be referred to as the SPARTH-processor (Super- precision Parallel ARiTHmetic processor). Vectors serve as operands of the SPARTH-processor The programmer can control the accuracy of the results, setting the operand length (e.g. 32, 64,256. 512 bits) which helps to achieve an acceptable balance between the rate of processing, volume of data being processed and desired accuracy. Moreover, during the run time of a program, the precision can be changed.
For programming the slave processes on the vector-parallel subsystem, the FORTRAN-PS compiler has been developed. This language is a FORTRAN extension providing the possibility, on the one hand, to take into account the peculiarities of the subsystem's architecture and, on the other hand. to combine the concepts of APL and languages of communicating processes. The extension includes a special dimension statement, an activation vector variable and a number of subroutines. Under the FORTRAN-PS programming system a sequential FORTRAN program can process simultaneously a large set of different source data. In other wands, these facilities support the execution of a system of identical communicating processes in the SIMD mode.
In order to program the slave processes on the vector-pipelined subsystem we have developed our own AP-FORTRAN compiler for the EC-2706. It extracts parallelism from loops and uses compaction-based parallelisation and program pipelining.9
To improve the programmer's environment during preparation and running of the parallel programs on a single multiprocessor or array processor we developed an interactive service system 'Bija'. It functions in the ISPF/PDF environment under TSO of the MVS operating system. This system essentially improves the user’s efficiency when preparing, debugging and running the programs on separate special-purpose processors and subsystems, in testing the processors, tracing availability of resources etc.
To control the resources of special-purpose subsystems we developed a monitor 'Ob' .9 It allows a single task to get from 1 up to N special-purpose processors and organise parallel computations on them. It is not necessary to create N subtasks to get N processors. A given number of processors, say, three, can be requested or any number from, say, two to five of processors can be specified. There is a possibility to access specific or arbitrary processors. Using the monitor improves the flexibility of the application program, permits its automatic adaptation to the available computing resources of the complex, and reduces the overhead of the computation. Note that the monitor controls special-purpose processors of different types with different interfaces. The user can request both the resources and the preferable monitor component to control them. The monitor tests processors, accumulates statistics etc.
There are many commercial programming systems with automatic extraction of parallelism. Nevertheless, the efficiency analysis of high-performance computers shows that there is a wide gap between their peak performance and their performance when solving real applications. So, taking into account the features of many digital experiments, we suggest making use of integration before parallelisation.10
The gist of the integration is not the parallelisation of a sequential program but its integration with other programs. In other words, we suggest parallel programming using multiple function multiple data (MFMD) procedures, i.e. procedures each of which by one code executes a number of functions over several data sets in an integrated. interrelated and even shuffled mode. Either the functions being computed or the data sets may coincide. After integration the new integrated program may be submitted for parallclisation or for direct parallel execution. The MFMD procedure is intended for realisation on any type of parallel computer, not only for MIMD machines or vector processors.
Some reasons why the MFMD procedure increases the efficiency of parallel computer systems are:
To support the idea of MFMD procedures we suggest that special-purpose software tools should be developed. We call them 'sequential program integrator'. Integrators execute transformations inverse to those of parallelisers. We distinguish several types of integrators,10 each of which in turn can be subdivided into a few variants according to SIMD, MIMD, VLIW or mix architectures of target computers as well as according to their compiler and parallelisation systems.
The first example of an integrator is the FORTRAN-PS system mentioned above. This system generates automatically a vector-parallel program on the basis of the integration of identical sequential programs. Another type of integrator is the 'lnya' system, which is also mentioned above. Unlike the first type of integrator, 'lnya' requires an additional mastering program which treats the sequential programs to be integrated as elementary processes. A subset of source elementary processes is integrated into one slave process to be executed on one processor. The reduction of the actions for the process creation as well as the reduction of number of data accesses in the main memory of the 'Siberia' kernel and the local memory of attached processors considerably increase the efficiency. Now we are developing another type of integrator and integration control language. New tools will allow programmers to perform the integration before or after application of the AP-FORTRAN compiler optimiser. Matching the results will give the best variant. So, our new motto is integration before and after parallelisation.
The approach described here may help to solve the problem of program portability and the effective use of universal parallel architectures. It will be especially successful for applications based on complexes of programs, multi-variant computations etc.
There applications for which input/output operations are essential. In addition, any computer system with a strongly scalable architecture should have a scalable file system, database etc. At the hardware level, such features may be supported by parallel interactions of a number of processor elements (PE) either with their individual disc storage devices or with different tracks of one disc. In any case, there are questions associated with the file distribution over different discs or tracks, with the movement of portions of data during one input/output operation and, finally, with chaotic or co-ordinated input/output actions of PE. One of the ways to investigate these questions is to solve the 'file rendezvous' problem.11
Assume that we have two large files, A and B, in the secondary computer memory. The sizes of A and B are assumed to exceed the volume of the accessible main memory. The problem is in arranging the 'meeting' of each element of B with each element of A in the main memory. If we abstract from the nature of files and operations executed during the rendezvous, the sequential algorithm of the problem realisation is as follows. The first block of file B is read and all blocks of file A are run one after the other 'to meet’ it; then the second block of file B is input and all blocks of file A are also run to meet it, and so on, up to the last block of file B. In other words, this algorithm has two loops: one is for the input of blocks of B, the other is for the input of blocks of A. The latter is nested in the former.
File rendezvous is a basic scheme of data processing for many applications. It could be applied to the analysis of certain file sorting procedures, retrieval commands, set operations etc.
Assume that we have a multiprocessor of l PE and each PE has its own local memory and disc device. Now suppose that the files A and B are distributed, i.e. their blocks are uniformly distributed over these disc devices, the l processes working in parallel. Then the rendezvous speedup is equal to k x /, where k > l. If the duration of a word exchange among PE is much shorter that the duration of a word read from disc then k approaches l and the speedup approaches l2. This nonlinear factor is very important. In addition, there are a number of optimisation aspects. In particular, we must allocate a greater part of buffer memory for a file input in the external loop of the algorithm than for a file input in the internal loop, the larger file is more effective for input in the internal loop etc.
We may run many applications if we generalise the concept of file rendezvous, such that it is possible to uniformly analyse the organisation of data swapping between main memory and bulk memory.
Very-high-level languages have rich expressive features. Therefore they can be used as convenient facilities for program specification. It is not evident, however, that these languages can be regarded as facilities for efficient programming. This is related on the one hand, to their expressive features and, on the other hand, often to their sequential machine realisation. But it is well known that parallel machines like associative computers allow the efficient manipulation of composite objects such as vectors, matrices, sets and others. So, we attempted to design a parallel computing system for which very-high-level (VHL) languages may be used as the programming means.12 In order to realise this idea we are undertaking a three-step plan:
At present we are working at the second step of the plan.
PARIS consists of a kernel and different problem-oriented shells around it. The PARIS kernel has been defined to allow an efficient interpretation of well-known VHL languages. In order to achieve this, we have considered the languages LISP, APL, PROLOG. SEOUEL-2. SETL and selected a number of constructs from them. We have also added some new features. Therefore PARIS can be considered as an intermediate language for the interpretation of other VHL languages. To underline the basic role of data types and constructs of SETL we named our language PARIS following the abbreviation of PARallel Interpretation of SETL. It is essential to note that PARIS constructs are designed to express opposite and dialectical language properties: procedureness and declarativeness, typed and untyped data, explicit and implicit parallelism, dynamic and static objects etc.
It is well known that an interpreter is often rather slow. In spite of that we expect an efficient realisation for the following reasons:
An analysis of scalable and evolving architectures tells us that it is high time to supplement computing systems with a special subsystem allowing high- acuracy computations by the elimination of rounding errors. The necessity of rounding is attributed to the fixed and relatively low wort length of operands in most computers. We propose to design a scientific computing subsystem in which an effective decrease of the impact of rounding errors is achieved by the joint use of the following means:
We call such subsystems SCORE- subsystems (Scientific Computer for Overcoming Rounding Errors). Crucial for a SCORE-subsystem design is the development of a programming environment and supporting hardware in conjunction. First, we developed the SPARTH-processor on the basis of the STARAN-like computer (see earlier). The decrease of rounding errors is achieved by means of processing operands of a very large word length (up to 512 bits). In general, the purpose of these investigations was to develop a prototype of the SCORE- subsystem and to arrive at recommendations for the selection of the components for new generation computers.
A disadvantage of existing parallel MIMD systems is the duplication of both operating system and application programs in the memory of each PE. Compared with SIMD systems this duplication can lead to an n-fold size of main memory (what if n=1000?). To decrease memory redundancy for parallel program storing, architectures and operational algorithms for the MIMD system have been considered. A program is divided into segments. At any time, one and the same segment can tie executed by several (possibly all) PEs of the system. We propose to allocate different segments of the program into different PE and not to store the whole program in the memory of each PE. When needed, segments (or their copies) are transmitted for execution into the corresponding PEs. We propose three varieties of architecture depending on the segment allocation method: (i) a static architecture, if the segment’s store places do not change during run time; (ii) a dynamic architecture, if the segment's store places change according to the segment’s use frequency; (iii) a regular architecture, in which segments circulate following given routes. We propose to carry out analytical and modelling investigations and to consider the realisation of the proposed approach on a multitransputer system.
Another disadvantage of MIMD systems, as compared with SIMD systems, is the problem of deadlock- freedom of programs and uniqueness of parallel programs results. To overcome it we developed a compromise approach which allows polynomial and even linear algorithms for a static analysis of parallel program correctness.13,14
To support the idea of future general-purpose parallel systems of special-purpose parallel components we are investigating the problem of designing processors with mass parallelism. This includes systolic, cellular, vertical as well as mixed processing.15-18
In particular, a formal method for the synthesis and analysis of systole algorithms and structures has been developed by S. G. Sedukhin. Based on an initial specification of the algorithm which is given as a system of linear recurrent equations, this method allows the systematic synthesis of all equivalent systolic structures admissible for a VLSI implementation with certain constraints. This method is used as the basis for an interactive automated design system. Based on the method and the system, a number of systolic structures for a VLSI implementation were obtained. These structures are optimal for solving problems in linear algebra (multiplication of matrices of different structures, triangle decomposition and inversion of matrices, solution of systems of linear equations by direct and iterative methods etc.), digital signal processing (convolution, deconvolution. 1D and 2D discrete Fourier transform), graph theory (transitive closure, shortest distances in a weighted graph) etc.
Our experience with the Liberia' system shows that the concept of large-block system design is very important. It enables us to create cost-effective high-performance systems based on 'off-the-shelf equipment’. 'Siberia' is a pragmatic answer to many of our current problems and seems at the same time to be very promising. It has proved to be a valuable tool in solving computation-intensive research problems, as well as a super-testbed for various forms of parallel processing. But it has to be mentioned that all system components are 'supersized' units of old technology. To maintain the system is very time and effort consuming. We hope to replace these units step by step by new technology. Right now, our opportunities to realise this dream are few. So, all our hopes are in international projects and cooperation. The topics of the second half of the article (and others) may be considered for possible joit work.
The author wishes to thank Profs. R. Perrott, D. Wilson and Ch. Lengauer for their help in preparing this article.
©1997-98 Ëàáîðàòîðèÿ Êîíñòðóèðîâàíèÿ è Îïòèìèçàöèè Ïðîãðàìì
Èíñòèòóò Ñèñòåì Èíôîðìàòèêè CO PAH
Web master webmst@pcosrv.iis.nsk.su