\documentclass[twoside]{article}
\pagestyle{myheadings}
\markboth{\hfil Implementing parallel elliptic solver \hfil}%
{\hfil Marcin Paprzycki, Svetozara Petrova, \& Julian Sanchez \hfil}
\begin{document} \setcounter{page}{75}
\title{\vspace{-1in}\parbox{\linewidth}{\footnotesize\noindent
{\sc 15th Annual Conference of Applied Mathematics, Univ. of Central Oklahoma},
\newline
Electronic Journal of Differential Equations, Conference~02, 1999, pp. 75--85. 
\newline
ISSN: 1072-6691. URL: http://ejde.math.swt.edu or http://ejde.math.unt.edu
\newline ftp  ejde.math.swt.edu (login: ftp)}
\vspace{\bigskipamount} \\
Implementing parallel elliptic solver \\ on a Beowulf cluster 
\thanks{ {\em 1991 Mathematics Subject Classifications:} 65F10, 65N20, 65N30.
\hfil\break\indent
{\em Key words and phrases:} elliptic solvers, separation of variables,
 parallel computing, \hfil\break\indent problems with sparse right-hand side.
\hfil\break\indent
\copyright 1999 Southwest Texas State University  and University of
North Texas. \hfil\break\indent
Published November 29, 1999. \hfil\break\indent
The second author was partially supported by the Alexander 
von Humboldt Foundation and \break\indent
the Bulgarian Ministry for Education,
Science and Technology under Grant MM--98\#801.} }

\date{}
\author{Marcin Paprzycki, Svetozara Petrova, \& Julian Sanchez}
\maketitle

\begin{abstract} 
In a recent paper \cite{zara} a parallel direct solver for the linear
systems arising from elliptic partial differential equations has been
proposed. The aim of this note is to present the initial evaluation of
the performance characteristics of this algorithm on Beowulf-type
cluster. In this context the performance of PVM and MPI based
implementations is compared.
\end{abstract}

\newcommand{\x}{{\underline {x}}}
\newcommand{\y}{{\underline {y}}}
\newcommand{\f}{{\underline {f}}}
\newcommand{\q}{{\underline {q}}}
\newcommand{\et}{{\underline {\eta }}}
\newcommand{\bet}{{\underline {\beta }}}
\newcommand{\0}{{\underline {0}}}
\renewcommand{\theequation}{\thesection.\arabic{equation}}

\section{Introduction}\label {sec1}

Recently a new parallel algorithm for the solution of separable second
order elliptic PDE's on rectangular domains has been presented by
Petrova \cite{zara}. The method is based on the sequential algorithm
proposed by Vassilevski \cite{V1,V2}. The algorithm consists of odd-even
block elimination combined with discrete separation of variables.  It
was established that the proposed solver has good numerical properties
for both 2D and 3D problems \cite{zara,V1,V2,VP}. The parallel algorithm
by Petrova was implemented using PVM to facilitate parallelism and the
initial performance evaluation has been reported in \cite{HMPzara} (for
a brief descriptions of PVM and MPI programming environments, see
below). The performance study has run into technical problems. For
obvious reasons, one should use parallel computers to solve large
problems. On the computers we had access to (Silicon Graphics machines
at NCSA in Urbana), this meant that using the batch-queues was required
(there is an imposed limit on the size and time allowed for interactive
jobs). We have found that for some reason the PVM environment, when
invoked from the NQS (NCSA batch job submission system), was relatively
unstable (approximately two of every three jobs hang up when the PVM
daemons died). In the mean time, while running MPI-based jobs (in the
same environment) we have not encountered such problems (see also
\cite{LiMaPa}). We have therefore re-implemented the algorithm using the
MPI environment to facilitate parallelism and experimented with it on a
Beowulf cluster. Due to the fact that we had two versions of the code we
were able to run both of them for problems of the same size and compare
their performance. The aim of this note is to report on the results of
our experiments.

Parallel Virtual Machine (PVM) is a programming environment (developed
at Oak Ridge National Laboratory) which allows a heterogeneous
collection of workstations and/or supercomputers to function as a single
high-performance parallel machine. PVM was designed to link computing
resources and provide users with a parallel platform for running their
computing applications, irrespective of the number of different
computers they use and where the computers are located. It was created
primarily as an interactive-type tool, where a console is started on one
of the machines and the interactions with other computers are handled
from this console. Over the last two years its role has been slowly
descreasing. In the meantime, the popularity of the Message Passing
Interface (MPI) is increasing. MPI is a message passing library
facilitating parallel programming (primarily for codes written in C and
Fortran). It was designed in Argonne National Laboratory and released in
1994. In contrast to PVM, MPI is just a specification of a library
without an attempt to build an interactive parallel environment. Both
PVM and MPI became standards and are supported by most vendors of
parallel computers.

We proceed as follows. In Section~2 the method of discrete separation of
variables is presented. Section~3 contains a brief summary of the Fast
Algorithm for Separation of Variables (FASV) (for more details of both
phases of the algorithm, see \cite {zara,V1}). In Section~4 we discuss a
number of issues related to the parallel implementation and execution.  
Finally, Section~5 presents the results of experiments performed on a
Beowulf cluster. We conclude with the description of future research.

\section{Discrete separtion of variables} \label {sec2}
\setcounter{equation}{0}

Consider a separable second order elliptic equation of the form
$$ -\sum _{s=1}^d \frac {\partial} {\partial x_s} a_s(x_s) \frac
{\partial u} {\partial x_s} = f(x), \ x \in \Omega ,~~ d = 2~ \mbox {
or} ~3 $$
$$ u_{| \partial \Omega} = 0. $$
For the purpose of this paper we assume that $\Omega $ is a rectangle
$(d=2)$, but the method described below can be generalized to 3D problems.
Using, for example, a finite difference method to discretize the equation,
we obtain the linear system of equations
\begin{equation} \label {11}
A \x = \f,
\end{equation}
where
$$A  =  I_m \otimes T + B \otimes I_n.$$ 
\noindent
Here $I_n$ is the identity $n \times n$ matrix and $\otimes$ is the {\it
Kronecker product}. Matrices $T=(t_{ij})_{i,j=1}^n $ and
$B=(b_{kl})_{k,l=1}^m $ are tridiagonal s.p.d.\ arising from the finite
difference approximation of the one-dimensional operators $-\frac
{\partial} {\partial x_s} a_s(x_s) \frac {\partial } {\partial x_s}(.),$
for $s=1,2,$ respectively. Vector $\x$ and the right-hand side $\f$ of
(\ref {11}) are composed using the lexicographic ordering on horizontal
lines.

To describe the method of separation of variables we will use vectors
$\x'$ and $\f'$ reordered using vertical lexicographic ordering. Consider
the following eigenvalue problem
\begin{equation} \label {12}
B \q_k = \lambda _k \q_k,
\end{equation}
where $\left \{\lambda _k,\,\q_k \right \}_{k=1}^m$ are the eigenpairs of
$B_ {m \times m}$. The matrix $B$ is assumed s.p.d.\ and hence the
eigenvalues $\lambda _k>0$ and the eigenvectors satisfy $\q_k^T\,\q_r =
\delta _{kr}$ ($\delta _{kr}$ is the Kronecker symbol) for $k,r=1, 2,
\ldots ,m$. Using the basis $\{\q_k \}_{k=1}^m$ the vectors $ \x_i'$ and $
\f_i'$ can be expanded as follows:
\begin{equation} \label {13}
\x_i' = \sum _{k=1}^m \eta _{ki}\q_k, \qquad 
\f_i' = \sum _{k=1}^m \beta _{ki}\q_k, ~~~ i=1,2,\ldots ,n,
\end{equation}
where the Fourier coefficients of $\f_i'$ are computed by
\begin{equation} \label{15}
\beta_{ki} = \q_k^T \f_i'.
\end{equation}
Consider the column vectors $\et_k = \left (\eta_{ki} \right )_{i=1}^n$
and $\bet_k = \left (\beta_{ki}\right )_{i=1}^n, \ k=1,2, \ldots, m.$
Substituting expressions (\ref {13}) in (\ref {11}) results in the
following system of equations for the discrete Fourier coefficients
$\et_k$ of $\x_i', \ i=1,2, \ldots ,n$
\begin{equation} \label{16}
(\lambda_kI+T) \et_k =  \bet_k, ~~  k=1,2, \ldots ,m.
\end{equation}
Equations (\ref {16}) represent $m$ systems of linear equations with
$n \times n$ matrices. They can be solved independently of each other. The
algorithm for the separation of variables (SV) has thus the following
form:

\bigskip 

\noindent
\underline {\large {Algorithm SV}}
\medskip

\noindent
(1) {\bf determine} the eigenpairs $\left \{ \lambda_k, \,\q_k \right
\}_{k=1}^m$ of the tridiagonal matrix $B$ from (\ref {12});

\noindent
(2) {\bf compute} the Fourier coefficients $\beta_{ki}$ of $\f_i'$
using (\ref {15});

\noindent
(3) {\bf solve} $m$ $n \times n$ tridiagonal systems of equations of
the form (\ref {16}) to determine $\{\eta_{ki}\}$ -- the Fourier
coefficients of $\x_i', \ i=1,2, \ldots, n$;

\noindent
(4) {\bf recover} the solution of our original system (\ref {11}) on
the basis of the Fourier coefficients $\{\eta_{ki}\}$ of $\x_i'$.
There are two possibilities:
\begin{eqnarray} \label {17}
{\mbox {vertical recovering: }} ~~&\x_i' = &\sum _{k=1}^m
\eta_{ki}\q_k, ~~ i=1,2\ldots ,n \nonumber \\ \vspace{6pt}
{\mbox {horizontal recovering: }}~~&\x_j = &\sum _{k=1}^m
q_{jk}\et_k, ~~ j=1,2\ldots ,m.
\end{eqnarray}

\medskip

Let us now assume that the system of the form (\ref {11}) has a sparse
right-hand side (SHRS) (for more details of origins of such problems see
for instance Banegas \cite{B}, Proskurowski \cite{P} and Kuznetsov
\cite{K}). More precisely, assume that the right-hand side $\f$ has only
$d \ll m$ nonzero block components
$$ \f^T = \left [ \0, \ldots, \f_{j_1}, \0, \ldots, \f_{j_d}, \0, \ldots,
\0\right]^T, ~~ {\mbox {where }} \f_{j_s} \in {\cal R} ^n, \
s=1,2 \ldots , d.$$
Then, for the reordered right-hand side, each vector $\f_i' \   
(i=1,2,\ldots
,n)$ has only $d$ nonzero scalar components $f_{ij_s}, \ s = 1,2 \ldots
,d$, i.e.
$$ \f_i'^T = \left [ 0, \ldots, f_{ij_1}, 0, \ldots, f_{ij_d}, 0, \ldots, 
0
\right]^T, ~~ i=1,2,\ldots ,n.$$
Assume that only $r \ll m$ block components of the solution are needed.  
Denote by $\x_{j_1'},\x_{j_2'}, \ldots ,\x_{j_r'}$ the sought vectors. To
find the solution of such a problem with a sparse right-hand side we apply
the Algorithm~SV as follows:

%\newpage
\bigskip

\noindent
\underline {\large {Algorithm SRHS}}
\medskip

\noindent
(1) {\bf compute} the Fourier coefficients $\beta_{ki}$ of $\f_i'$
from (\ref {15}),

$$\beta_{ki}= \q_k^T \f_i' = \sum _{s=1}^d q_{kj_s}f_{ij_s} , ~~ k=1,2,
\ldots,m, ~~ i=1,2,\ldots,n;$$

\noindent
(2) {\bf solve} systems of the form (\ref{16});

\noindent
(3) {\bf recover} the solution per lines using (\ref {17}).  We need only
$ \x_j = \sum _{k=1}^m q_{jk} \et_k$,  for $j=j_1',
j_2', \ldots, j_r'$.

\section {Fast algorithm for separation of variables} \label {sec3}
\setcounter{equation}{0}

Consider now the algorithm FASV proposed by Vassilevski \cite {V1,V2} for
solving problems of type (\ref {11}).  For simplicity we assume that
$m=2^l-1$. The algorithm consists of two steps -- forward and backward
recurrence. For the forward step we need matrix $A$ in the following block
form:
\begin{equation} \label{32}
A = \left [ \begin {array} {ccccc}
           A^{(k,1)} & A_{12} &  &  & 0 \\
             A_{21}  & T+b_{2^k2^k}I_n & A_{23} &  & \\
                     & A_{32}  & A^{(k,2)} & A_{34}  & \\
                     &  & \ddots & \ddots & \ddots  \\
               0       &         &  & A_{2^{l-k+1}-1,2^{l-k+1}-2} & 
A^{(k,2^{l-k})}
          \end{array} \right ] ,
\end{equation}
where
$$A^{(k,s)} = I_{2^k-1} \otimes T+B_k^{(s)} \otimes I_n,
~~s=1,2,\ldots ,2^{l-k},$$
i.e.  each matrix $ A^{(k,s)},\ 1\le k\le l,\ 1\le s \le 2^{l-k}$ has
$2^k-1$ blocks of order $n.$

Above $B_k^{(s)}$ of order $2^k-1$ is the principal submatrix of $B$ and
hence, it is also a tridiagonal s.p.d.\ matrix of the form
$B_k^{(s)}=(b_{ij}),\ i,j=1,\ldots s_k+2^k-1,$ where $s_k=(s-1)2^k.$

\bigskip 
\noindent
(1) {\bf Forward step} of FASV
\medskip

For $k=1,2, \ldots , l-1$ solve the problem:
\begin{equation} \label{34}
A^{(k,s)} \x^{(k,s)} = \f^{(k,s)} , ~~ s=1,2, \ldots , 2^{l-k},
\end{equation}
where
$$\x^{(k,s)} = \left [ \begin {array} {c}
 \x_{s_k+1}^{(k)} \\ \vspace {4pt}
 \x_{s_k+2}^{(k)} \\
 \vdots \\
 \x_{s_k+2^k-1}^{(k)} \end {array} \right ] $$
and $\f^{(k,s)}$ will be defined below.

After solving these problems and setting $\x_{s2^k}^{(k)}=0$ for
$s=1,2, \ldots , 2^{l-k}-1$ we denote by $\x^{(k)}$ and $\f^{(k)}$
the following vectors
\begin{equation} \label{*}
\x^{(k)} = \left [ \begin {array} {c}
 \x^{(k,1)} \\
 \0 \\ \vspace {7pt}
\x^{(k,2)} \\
 \0 \\
 \vdots \\
\x^{(k,2^{l-k})} \end {array} \right ] \begin {array} {l}
                         \}  2^k-1 {\mbox { blocks }} \\  \vspace {4pt}
                         \}  1 {\mbox { block }} \\  \vspace {4pt}
                         \}  2^k-1 {\mbox { blocks }} \\  \vspace {4pt}
                         \}  1 {\mbox { block }} \\  \vspace {4pt}
                                                 \\
                         \}  2^k-1 {\mbox { blocks }} \end {array}
~~~{\mbox { and }} \f^{(k)} = \left [ \begin {array} {c}
 \f^{(k,1)} \\ \vspace {5pt}
 \f_{2^k} \\  \vspace {5pt}
 \f^{(k,2)} \\  \vspace {5pt}
 \vdots \\
\f^{(k,2^{l-k})} \end {array} \right ] .
\end{equation}

Let the residual vector be the right-hand side for the next $k+1$st step
\begin{equation} \label {35}
\f^{(k+1)} = \f^{(k)}-A\x^{(k)}=
\left [ \begin {array} {c}
\f^{(k+1,1)} \\
\f_{2^{k+1}}\\ \vspace {4pt}
\f^{(k+1,2)} \\
\vdots \\
\f^{(k+1,2^{l-k-1})} \end {array}  \right ] ,
\end{equation}
where
$$\f^{(k+1,s')} = \left [ \begin {array} {c}
\0 \vspace {3pt} \\ \vspace {6pt}
\f_{s'.2^{k+1}}^{(k+1)}\\ \vspace {6pt}
\0 \end {array}  \right ] \begin {array} {l}
                         \}  2^k-1 {\mbox { blocks }} \\  \vspace {4pt}
                         \}  1 {\mbox { block }} \\  \vspace {4pt}
                         \}  2^k-1 {\mbox { blocks }}  \vspace {4pt}
                         \end {array},
~~  s'=1,2, \ldots , 2^{l-k-1}.$$
 From (\ref {35}) and $s=(2s'-1)$ we have
\begin{eqnarray} \label {36}
\f_{s'.2^{k+1}}^{(k+1)} & =&
\f_{s.2^k}^{(k)}-A_{2s,2s-1}\,\x^{(k,s)}-A_{2s,2s+1}\,\x^{(k,s+1)}
\nonumber \\
& =&\f_{s.2^k}^{(k)}-b_{s.2^k,s.2^k-1}\,\x_{2^k-1}^{(k,s)}-
b_{s.2^k,s.2^k+1}\,\x_1^{(k,s+1)}.
\end{eqnarray}

The new right-hand side $\f^{(k+1,s')}$ has only one nonzero block
component and hence, by induction, the right-hand sides $\f^{(k,s)}$ have
the following sparsity pattern
$$\f^{(k,s)} = \left [ \begin {array} {c}
\0 \vspace {3pt}\\ \vspace {3pt}
\ast \vspace {3pt}\\ \vspace {3pt}
\0 \end {array} \right ] \begin {array} {l}
                \}  2^{k-1}-1 {\mbox { blocks }} \vspace {3pt} \\ \vspace {3pt}
                \}  1 {\mbox { block }} \vspace {3pt} \\ \vspace {3pt}
                \}  2^{k-1}-1 {\mbox { blocks }}  \vspace {3pt}
                \end {array},  ~~ s=1,2, \ldots , 2^{l-k} .$$
Therefore, the problems (\ref {34}) have a sparse right-hand side. The
matrices $A^{(k,s)}$ allow a separation of variables as submatrices of
$A$ and hence, Algorithm SV from Section \ref {sec2} can be applied with
$d=1$ (the number of nonzero block components of the right-hand side), and
$r=3$ (the number of the sought block components of the solution:
$\x_1^{(k,s)}, \ \x_{2^{k-1}}^{(k,s)}$ and $\x_{2^k-1}^{(k,s)}).$

\bigskip  
\noindent
(2) {\bf {Backward step}} of FASV
\medskip

For $k=l,l-1,\ldots ,1$, our purpose is to determine $\x_{(2s-1)2^{k-1}}$ for
$s=1,2, \ldots ,2^{l-k}$. First, when $k=l,$ we solve
\begin{equation} \label {38}
A\x^{(l)} = \f^{(l)},
\end{equation}
where $A^{(l,s)}\equiv A, \ \x^{(l)} = \x^{(l,1)} $ and $\f^{(l)} =
\f^{(l,1)}$. The right-hand side $\f^{(l)}$ is found at the last step of the
forward recurrence and from (\ref {36}) it has a sparse right-hand side.  The
problem (\ref {38}) is solved incompletely finding only one block component
$ \x_{2^{l-1}} ^{(l)}.$

We have also $\x_{2^{l-1}}=\x_{2^{l-1}} ^{(l)}$ which corresponds to the
midblock component of the solution $\x$ of (\ref {11}). The remaining
block components of $\x$ are recovered by induction as follows:

Starting with $k=l$ we have $\x_{2^{l-1}}=\x_{2^{l-1}} ^{(l)}$. Assume that for
some given $k,\ 1 \le k \le l-1,$ we have found the vectors $\x_{s.2^k}$ for
$s=1,2, \ldots , 2^{l-k}-1$ in the previous steps $k+1, \ldots , l.$

Then at the $k$th step we can find $\x_{s.2^{k-1}}$ for $s=1,2, \ldots ,
2^{l-k+1}-1$.  More precisely, by construction we have $\f^{(k'+1)} = \f^{(k')}
- A\x^{(k')}$ and after summation of both sides of these equalities for $k'=k,
k+1, \ldots, l,$ we get 
\begin{equation} \label {39}
\f^{(k)} = A \left ( \sum _{k'=k}^l \x^{(k')} \right ).
\end{equation}
Let
\begin{equation} \label {310}
\y = \sum _{k'=k}^l \x^{(k')}
\end{equation}
and using $A$ from (\ref {32}) we have the following equivalent form for
(\ref {39})
$$ \displaylines{
\left [ \begin {array} {ccccc}
           A^{(k,1)} & A_{12} &  &  & 0 \\
             A_{21}  & T+b_{2^k2^k}I_n & A_{23} &  &   \\
                     & A_{32}  & A^{(k,2)} & A_{34}   &   \\
                     &  & \ddots & \ddots & \ddots  \\
              0 & & & A_{2^{l-k+1}-1,2^{l-k+1}-2} & A^{(k,2^{l-k})}
          \end{array} \right ]
\left [ \begin {array} {c}
\y^{(k,1)}\\
\y_{2^k} \\ \vspace {3pt}
\y^{(k,2)}\\
\vdots \\ \vspace {3pt}
\y^{(k,2^{l-k})}
\end{array} \right ] \cr
=\left [ \begin{array} {c}
\f^{(k,1)}\\
\f_{2^k} \\ \vspace {3pt}
\f^{(k,2)}\\
\vdots \\ \vspace {3pt}
\f^{(k,2^{l-k})}
\end{array} \right ]
}$$
where each block $\y^{(k,s)}, \ s=1,2, \ldots , 2^{l-k}$ has $2^k-1$
blocks of order $n$. From this system we obtain the following
equation for $\y^{(k,s)}:$
\begin{equation} \label {311}
A_{2s-1,2s-2} \y_{(s-1).2^k} + A^{(k,s)} \y^{(k,s)} + A_{2s-1,2s}
\y_{s.2^k} = \f^{(k,s)}.
\end{equation}
Hence, when $s_k=(s-1)2^k$:
\begin{equation} \label {312}
A^{(k,s)} \y^{(k,s)} = \f^{(k,s)}-A_{2s-1,2s-2} \y_{s_k} -
A_{2s-1,2s} \y_{s.2^k}.
\end{equation}

Using (\ref {310}) we have $\y_{s_k} = \x_{s_k}$ and $\y_{s.2^k} = \x_{s.2^k}.$
Recall that from the induction described above, the vectors
$\x_{s_k} = \x_{(s-1).2^k}$ and $\x_{s.2^k}$ are already found. Thus, 
from (\ref {312}) we get the following system
\begin{equation} \label {314}
A^{(k,s)} \y^{(k,s)}  = \left [ \begin{array} {c}
-b_{s_k+1,s_k} \x_{s_k} \\
\0 \\
\f_{2^{k-1}}^{(k,s)} \\
\0  \\
-b_{s.2^k-1,s.2^k} \x_{s.2^k}  \end {array} \right ]
 \begin{array} {l}
\} 1 {\mbox { st block}} \\
 \\
\} 2^{k-1} {\mbox { th block} }\\
 \\
\}  2^{k}-1 {\mbox { st block}} \end {array} .
\end{equation}
The nonzero block components of the right-hand side of (\ref {314}) can be
found using the solution computed at the $k+1$st step. It is a problem
with a sparse right-hand side and only one $(r=1)$ block component of the
solution, namely $\y_{2^{k-1}}^{(k,s)}$ is needed. By (\ref {310}) one
finds
$$\y_{2^{k-1}}^{(k,s)} = \y_{(2s-1).2^{k-1}} = \sum _{k'=k}^l
\x_{(2s-1).2^{k-1}}^{(k')} = \x_{(2s-1).2^{k-1}}.$$

Therefore, at the $k$th step from the backward recurrence $(k=l,\, l-1,
\ldots, 1)$ we can determine $\x_{(2s-1)2^{k-1}}, \ s=1,2, \ldots, 2^{l-k}.$

\section{Parallel implementation}

When a 2D problem is solved the rectangular domain is decomposed into
horizontal strips. We then assign a processor to each strip. In each
forward step of FASV a number of solves with matrices $A^{(k,s)}$ are
involved. These are problems with sparse right hand sides and to find
the components of the solution algorithm SRHS is applied. In general,
the forward sweep of FASV requires $O(log (m))$ steps. The backward
sweep of FASV is a reverse of the forward step and results in the
solution of the original system (2.1).

In \cite{zara} it was shown that the total arithmetical complexity of
the algorithm is $(28n^2 - 9n^2/(l-1))(l-3-log P +2P)/P$ and the
speed-up $S= (l-1)P/(l-3-logP +2P)$. Therefore, in the two limiting
cases, for large $l$ the optimal speed-up $P$ is obtained, while for a
fixed $l$ and a large $P$ the speed-up is limited by $l/2$.

As mentioned above, we have used two versions of the code. The only
difference between them was that in one the PVM environment was used to
facilitate interprocessor communication, while the other was based on
the MPI library. While re-implementing the code we have fixed the way
that the time was measured. In the original code an average of processor
times was reported. We have decided that this may be slightly misleading
as it hides possible workload imbalances. In the new version of the code
we measured the time spent by each individual processor. Since the
workload differs from processor to processor, in each run we recorded
the longest time (all remaining processors have to wait for the slowest
one to complete its job before the problem is solved). After performing
several runs, we kept and reported the shortest time (of the longest
times). We used the function $mclock()$ to measure time in the PVM
version while in the MPI version we used the $MPI\_Wtime()$ function.

Our experiments have been performed on a Beowulf cluster.  The Beowulf
cluster architecture was designed in 1994 by Thomas Sterling and Donald
Beck at the center of Excellence in Space Data and Information Sciences
(CESDIS). The purpose of the design was to combine together COTS
(commodity off the shelf) base systems in a network and emulate a costly
high performance supercomputer.

The Beowulf cluster at the University of Southern Mississippi consists
of 16 PC's connected with a Fast Ethernet switch. Each PC is a 233 MHz
Pentium II with 256 Mbytes of RAM. The proposed algorithm has been
implemented in Fortran (77). We used the Portland Group compiler and
invoked maximum compiler optimization ($-O2$ $-tp$ $p6$ $-Mvect$
$-Munroll=n:16$). Most runs have been done on an empty machine.

\section{Experimental results} \label {sec4}

To study the performance of the codes we have used a standard model
problem. We have solved the Poisson equation on the unit square with
Dirichlet boundary conditions. The mesh size was fixed at $h = 1/(m+1)$,
where $m = 2^l$. In Table $1$ we present the times obtained for both PVM
and MPI codes for $l = 9, 10, 11$ and for $P = 1, 2, \dots, 11$
processors (unfortunately, due to the technical difficulties 5 nodes
were down for hardware maintenance). Problem of size $l = 11$ was the
largest that we were able to run on a single processor system (it
required more than 90Mbytes of memory). 
\begin{table} \begin{center}
\begin {tabular}{| c || c | c || c | c || c | c || c | c ||} \hline P &
\multicolumn{2}{c||}{$L=9$} & \multicolumn{2}{c||}{$L=10$} &
    \multicolumn{2}{c||}{$L=11$} \\ 
\hline
  & MPI & PVM    & MPI & PVM    & MPI & PVM \\
\hline
1 &  1.16 & 1.30 & 6.32 & 6.71 &32.19 &34.42 \\
2 &  0.69 & 0.75 & 3.50 & 3.91 &17.50 &19.20 \\
3 &  0.52 & 0.58 & 2.54 & 2.78 &13.39 &13.50 \\
4 &  0.42 & 0.46 & 2.01 & 2.30 & 9.68 &10.92 \\
5 &  0.40 & 0.40 & 1.81 & 1.95 & 8.63 & 9.23 \\
6 &  0.38 & 0.41 & 1.68 & 1.91 & 7.77 & 8.84 \\
7 &  0.36 & 0.33 & 1.53 & 1.72 & 6.94 & 7.88 \\
8 &  0.32 & 0.32 & 1.36 & 1.54 & 6.15 & 7.00 \\
9 &  0.34 & 0.30 & 1.35 & 1.43 & 6.02 & 6.05 \\
10&  0.35 & 0.32 & 1.32 & 1.48 & 5.78 & 5.53 \\
11&  0.34 & 0.30 & 1.28 & 1.38 & 5.48 & 5.77 \\
\hline
\end {tabular}
\end{center}
\caption{Times comparison}
\end{table}

It can be observed that for the larger problems the MPI based code
slightly outperformed the PVM based implementation. This result is
consistent with our expectations. One needs to remember that MPI is a
newer development than PVM and performance was more of a goal in its
creation (in case of PVM the main goal was rather ability of
interactively creating a heterogeneous parallel computer system).
Additionally, the Beowulf implementation of MPI, which uses the LAM
communication fabrics, supports sending and receiving messages directly
from one process to another, overriding the communication daemons. LAM
is a parallel processing environment and development system for a
network of independent computers created and maintained by the Ohio
Supercomputer Center. It features the MPI programming standard,
supported by extensive monitoring and debugging tools.

In Table $2$ we compare the speed-up values of both implementations for
the largest problem size. In addition we present the theoretical
speed-up value calculated using the speed-up formula above.
\begin{table}
\begin{center}
\begin {tabular}{| c || c | c | c ||}
\hline
P & \multicolumn{3}{c||}{$L=11$} \\
\hline
  & MPI & PVM & THEORY \\
\hline
1 & 1.00 & 1.00 & 1.00 \\
2 & 1.84 & 1.79 & 1.67 \\
3 & 2.40 & 2.55 & 2.14 \\
4 & 3.33 & 3.15 & 2.50 \\
5 & 3.73 & 3.73 & 2.78 \\
6 & 4.14 & 3.89 & 3.00 \\
7 & 4.64 & 4.37 & 3.18 \\
8 & 5.23 & 4.92 & 3.33 \\
9 & 5.35 & 5.69 & 3.46 \\
10& 5.57 & 6.22 & 3.57 \\
11& 5.87 & 5.96 & 3.67 \\
\hline
\end {tabular}
\end {center}
\caption{Speedup comparison}
\end{table}

Interestingly, both MPI and PVM speedup values are far superior to the
theoretical ones. This shows a slight weakness of the theoretical
arithmetical complexity analysis which typically takes into
consideration only the size of the problem (sometimes supplemented by a
rudimentary data communication analysis). In the experimental
environment, one deals with machines with hierarchical memory structure
and its management as well as networks and network protocols which
influence the performance of the program. For instance, as we increase
the number of processors for a given fixed problem size the amount of
memory used per-processor decreases (since the problem is divided into
smaller pieces) and it is much easier to manage the data movement in the
hierarchical environment making the multiprocessor runs faster than the
predicted ones. This can be viewed also from the reverse side. For very
large problems the memory management becomes rather complicated. It is
well known that most of the existing cache management algorithms are far
from optimal. This increases the single processor execution time and
thus increases the obtainable speed-up.

In this context we can also observe that in many cases the PVM based
code shows slightly faster speed-up values than MPI. The reason for that
behavior is that speed-up formula involves the time spent by a single
processor run. Inspecting the values in Table $1$ we can see that the
single processor PVM version is slower than the MPI version. Thus any
speedup value will have a greater numerator and therefore a greater
value.

\section{Concluding remarks}\label {sec5}

In this note we have presented the experimental results illustrating the
parallel performance characteristics of the fast direct elliptic solver
on a Beowulf cluster. The algorithm and both its implementations turned
out to be relatively efficient. We have also found that, for large
problems, memory management capacity is an important factor influencing
performance. In the near future we plan to, first, investigate the code
to see if there is a possibility of any additional optimization of the
way that the code is currently implemented. Second, we plan to complete
the performance study across a number of modern parallel programming
architectures. We will also look into extending the proposed approach to
parallel three dimensional solvers.


\begin{thebibliography}{10}

\bibitem {B} A. Banegas, Fast Poisson solvers for problems with
sparsity, {\em Math.Comp.}, {\bf 32}(1978), 441-446.

\bibitem{HMPzara} H. Hope, M. Paprzycki and S. Petrova, Parallel
Performance of a Direct Elliptic Solver, in: M. Griebel et. al. (eds.),
{\em Large Scale Scientific Computations of Engineering and Environmental
Problems}, Vieweg, Wisbaden, (1998), 310-318 

\bibitem {K} Y. Kuznetsov, Block relaxation methods in subspaces,
their optimization and application, {\em Soviet.J.Numer.Anal.  and Math.
Modelling}, {\bf 4}(1989), 433-452.

\bibitem{LiMaPa} I. Lirkov, S. Margenov and M. Paprzycki, Benchmarking
performance of parallel computers using a 2D elliptic solver, {\em
Proceedings of the $4^{th}$ International Conference on Numerical
Methods and Applications}, Sofia, Bulgaria, August, 1998, to appear.

\bibitem {zara} S. Petrova, Parallel implementation of fast
elliptic solver, {\em Parallel Computing}, {\bf23}(1997), 1113-1128.

\bibitem {P} W. Proskurowski, Numerical solution of Helmholtz
equation by implicit capacitance matrix method, {\em ACM Trans.  Math.
Software}, {\bf 5}(1979), 36-49.

\bibitem {V1} P. Vassilevski, Fast algorithm for solving a
linear algebraic problem with separation of variables, {\em Comptes
Rendus de l'Academie Bulgare des Sciences}, {\bf 37}(1984), No.3, 305-308.

\bibitem {V2} P. Vassilevski, Fast algorithm for solving
discrete Poisson equation in a rectangle, {\em Comptes Rendus de
l'Academie Bulgare des Sciences}, {\bf 38}(1985), No.10, 1311-1314.

\bibitem {VP} P. Vassilevski and S. Petrova, A note on
construction of preconditioners in solving 3D elliptic problems by
substructuring, {\em Comptes Rendus de l'Academie Bulgare des Sciences},
{\bf 41}(1988), No.7, 33-36.


\end{thebibliography} \medskip

\noindent{\sc  Marcin Paprzycki} (e-mail: marcin.paprzycki@usm.edu)\\
{\sc Julian Sanchez} (e-mail: julian.sanchez@ieee.org) \\
Department of Computer Science and Statistics \\
University of Southern Mississippi \\
Hattiesburg, MS 39406, USA
 \medskip

\noindent{\sc Svetozara Petrova} \\
Central Laboratory of Parallel Processing \\
Bulgarian Academy of Sciences, Acad. G.Bonchev Str.\\
Block 25A, 1113 Sofia, Bulgaria \\ 
e-mail: zara@cantor.bas.bg


\end{document}