MPI-PHYLIP: Parallelizing Computationally Intensive Phylogenetic Analysis Routines for the Analysis of Large Protein Families

Authors:
Alexander J. Ropelewski, Hugh B. Nicholas, Jr.
Pittsburgh Supercomputing Center, Carnegie Mellon University, 300 S. Craig Street, Pittsburgh, PA, USA 15213

Ricardo R. Gonzalez Mendez
Department of Radiological Sciences, University of Puerto Rico School of Medicine, PO Box 365067, San Juan, PR, USA 00936-5067.


ABSTRACT

Background: Phylogenetic study of protein sequences provides unique and valuable insights into the molecular and genetic basis of important medical and epidemiological problems as well as insights about the origins and development of physiological features in present day organisms. Consensus phylogenies based on the bootstrap and other resampling methods play a crucial part in analyzing the robustness of the trees produced for these analyses.

Methodology: Our focus was to increase the number of bootstrap replications that can be performed on large protein datasets using the maximum parsimony, distance matrix, and maximum likelihood methods. We have modified the PHYLIP package using MPI to enable large-scale phylogenetic study of protein sequences, using a statistically robust number of bootstrapped datasets, to be performed in a moderate amount of time. This paper discusses the methodology used to parallelize the PHYLIP programs and reports the performance of the parallel PHYLIP programs that are relevant to the study of protein evolution on several protein datasets.

Conclusions: Calculations that currently take a few days on a state-of-the-art desktop workstation are reduced to calculations that can be performed over lunchtime on a modern parallel computer. Of the three protein methods tested, the maximum likelihood method scales the best, followed by the distance method, and then the maximum parsimony method. However, the maximum likelihood method requires significant memory resources, which limits its application to more moderately sized protein datasets.

SUPPLEMENTAL FILES

Tar file containing the source code for the parallelized PHYLIP routines and the test runs described in this paper (MPIsrc.tar.gz)

Tar file containing ONLY the source code for the parallelized PHYLIP routines described in this paper (MPIsrcOnly.tar.gz)

Tar files containing the test runs described in the paper:

The serial version of the PHYLIP suite is available from Joseph Felsenstein's website: http://evolution.genetics.washington.edu/phylip/

INSTALLATION INSTRUCTIONS

To install the distribution, first download the compressed tar file (MPIsrc.tar.gz). Next uncompress the file using the gunzip command (gunzip MPIsrc.tar.gz) and unpack it with the tar command (tar xvf MPIsrc.tar).
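The unpacking steps above can be collected into a small shell function. This is a minimal sketch and not part of the distribution: the function name unpack_mpi_phylip is hypothetical, and it assumes the archive has already been downloaded and that gunzip and tar are available on the PATH.

```shell
# Hypothetical helper (not part of the distribution): unpack the archive
# with the same two commands described in the text.
unpack_mpi_phylip() {
    archive="${1:-MPIsrc.tar.gz}"          # compressed tar file, already downloaded
    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 1
    fi
    gunzip "$archive" || return 1          # replaces MPIsrc.tar.gz with MPIsrc.tar
    tar xvf "${archive%.gz}"               # unpacks the contents of the tar file
}

# Usage (assumes MPIsrc.tar.gz is in the current directory):
#   unpack_mpi_phylip MPIsrc.tar.gz
```

Checking that the archive exists before calling gunzip gives a clearer error message than the one gunzip would print on its own.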

To compile and link the software on platforms where the standard MPI compiler wrappers (e.g., mpicc) have been installed, change your working directory to the MPIsrc/src subdirectory (cd MPIsrc/src). Next, copy the Makefile.mpicc file to a file called Makefile (cp Makefile.mpicc Makefile). Then use the make command to compile and link the code with the appropriate MPI library (make all). In general, we have found that only minor modifications to the compile and link lines in the Makefile are required to run the code on parallel platforms that do not use the standard MPI wrappers, including the Altix platform used to run the test cases. To compile on the Altix platform, copy the Makefile.Altix file to a file called Makefile (cp Makefile.Altix Makefile), then run make all as before.
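The build steps for either platform can be sketched as one shell function. The function name build_mpi_phylip and its two arguments are hypothetical and not part of the distribution; the sketch assumes the distribution has been unpacked, that make is installed, and that the chosen Makefile variant (mpicc or Altix) matches the MPI installation on the machine.

```shell
# Hypothetical helper (not part of the distribution): select a Makefile
# variant and build, following the steps described in the text.
build_mpi_phylip() {
    src_dir="${1:-MPIsrc/src}"             # source subdirectory of the unpacked distribution
    variant="${2:-mpicc}"                  # "mpicc" or "Altix", the two Makefiles named in the text
    (
        # Run in a subshell so the caller's working directory is unchanged.
        cd "$src_dir" || exit 1
        if [ ! -f "Makefile.$variant" ]; then
            echo "no Makefile.$variant in $src_dir" >&2
            exit 1
        fi
        cp "Makefile.$variant" Makefile    # make looks for a file named Makefile by default
        make all                           # compile and link against the MPI library
    )
}

# Usage:
#   build_mpi_phylip MPIsrc/src mpicc      # platforms with the standard MPI wrappers
#   build_mpi_phylip MPIsrc/src Altix      # the Altix platform
```

On other platforms without the standard wrappers, the compile and link lines in the copied Makefile would still need to be edited by hand before running make all.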

The test directory of the distribution contains several input files that can be used to test the installation. Files ending in .boot are bootstrapped input files, files ending in .stdin are the program input files, and files ending in .job are UNIX (PBS) script files that can be used to run the parallel versions of the programs.
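As a quick check of what the test directory provides, the three suffixes can be mapped to their roles with a short shell function. This is only a sketch: the function name list_test_files is hypothetical, and the default path MPIsrc/test is an assumption about where the test directory sits in the unpacked distribution.

```shell
# Hypothetical helper (not part of the distribution): report the role of
# each file in the test directory, based on the suffixes described in the text.
list_test_files() {
    test_dir="${1:-MPIsrc/test}"           # assumed location of the test directory
    for f in "$test_dir"/*; do
        [ -e "$f" ] || continue            # skip the unexpanded pattern if the directory is empty
        case "$f" in
            *.boot)  echo "bootstrapped input: $f" ;;
            *.stdin) echo "program input: $f" ;;
            *.job)   echo "PBS job script: $f" ;;
        esac
    done
}

# Usage:
#   list_test_files MPIsrc/test
```

On a system running PBS, a .job script is normally submitted with the qsub command; queue names and resource requests inside the scripts may need to be adjusted for the local site.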