Cray XT5 (Einstein)
User Guide

1. Introduction

1.1. Document Scope and Assumptions

This document provides an overview and introduction to the use of the Cray XT5 (Einstein) located at the Navy DSRC, along with a description of the specific computing environment on Einstein. The intent of this guide is to provide information that will enable the average user to perform computational tasks on the system. To receive the most benefit from the information provided here, you should be proficient in the following areas:

  • Use of the UNIX operating system
  • Use of an editor (e.g., vi or emacs)
  • Remote usage of computer systems via network or modem access
  • A selected programming language and its related tools and libraries

1.2. Policies to Review

Users are expected to be aware of the following policies for working on Einstein:

1.2.1. Login Node Abuse Policy

The login nodes provide login access for Einstein and support such activities as compiling, editing and general interactive use by all users. Consequently, memory or CPU intensive programs running on the login nodes can significantly affect all users of the system. Therefore, only small serial applications requiring less than 15 minutes of compute time and less than 16 GBytes of memory are allowed on the login nodes. Any jobs running on the login nodes that exceed these limits will be terminated.

1.2.2. Workspace Purge Policy

Close management of space in the /scr file system is a high priority. Files in the /scr file system that have not been accessed in 30 days are subject to the purge cycle. If available space becomes critically low, a manual purge may be run, and all files in /scr are eligible for deletion. Using the touch command (or similar commands) to prevent files from being purged is prohibited. Users are expected to keep up with file archival and removal within the normal purge cycles.
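
To see which of your files are approaching the purge threshold, something like the following may help. This is a minimal sketch, not a site-provided tool: the function name list_purge_candidates is hypothetical, and it assumes the find command on the login nodes supports the -atime test. Note that "-atime +30" matches files whose last access is more than 30 full days old, which approximates, but may not exactly match, the criteria used by the purge cycle.

```shell
#!/bin/sh
# Hypothetical helper (not a site-provided tool): list regular files under a
# directory whose last access time is more than 30 days old.
list_purge_candidates() {
    dir=${1:-$WORKDIR}
    find "$dir" -type f -atime +30 -print
}

# Example: review candidates in your work directory, if $WORKDIR is set.
if [ -n "${WORKDIR:-}" ] && [ -d "$WORKDIR" ]; then
    list_purge_candidates "$WORKDIR"
fi
```

Files reported by this sketch are good candidates for archival or removal.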

1.3. Obtaining an Account

The process of getting an account on the HPC systems at any of the DSRCs begins with getting an account on the HPCMP Portal to the Information Environment, commonly called a "pIE User Account". If you do not yet have a pIE User Account, please visit the Consolidated Customer Assistance Center (CCAC) Accounts page and follow the instructions there. Once you have an active pIE User Account, visit the Navy DSRC Accounts page for instructions on how to request accounts on the Navy DSRC HPC systems. If you need assistance with any part of this process, please contact CCAC at accounts@ccac.hpc.mil.

1.4. Requesting Assistance

The Consolidated Customer Assistance Center (CCAC) is available to help users with unclassified problems, issues, or questions. Analysts are on duty 8:00 a.m. - 11:00 p.m. Eastern, Monday - Friday (excluding Federal holidays).

You can contact the Navy DSRC Help Desk directly in any of the following ways for issues related to classified and non-HPCMP resources.

  • E-mail: dsrchelp@navo.hpc.mil
  • Phone: 1-800-993-7677 or 228-688-7677
  • Fax: 228-688-4356
  • U.S. Mail:
    Navy DoD Supercomputing Resource Center
    1002 Balch Blvd
    Stennis Space Center, MS 39522-5001

For more detailed contact information, please see our Contact Page.

2. System Configuration

2.1. System Summary

Einstein is a Cray XT5. Both the login nodes and the compute nodes are populated with 2.4-GHz AMD Opteron quad-core processors. Einstein uses a dedicated SeaStar2+ communications network for MPI messages and I/O traffic. Einstein uses Lustre to manage its parallel file system, which targets its LSI arrays. Einstein has 1592 compute nodes that share memory only on the node; memory is not shared across the nodes. Each compute node has two quad-core processors (8 cores) and runs its own Compute Node Linux (CNL) operating system, sharing 16 GBytes of DDR2 memory, with no user-accessible swap space. Einstein is rated at 123 peak TFLOPS and has 518 TBytes (formatted) of disk storage.

Einstein is intended to be used as a batch scheduled HPC system. Its login nodes are not to be used for large computational (memory, IO, long executions) work. All executions that require large amounts of system resources must be sent to the compute nodes by batch job submission.

Node Configuration
                        Login Nodes         Compute Nodes
Total Nodes             4                   1592
Operating System        SUSE Linux          Compute Node Linux (CNL)
Cores/Node              16                  8
Core Type               AMD Opteron 64-bit  AMD Opteron 64-bit
Core Speed              2.4 GHz             2.4 GHz
Memory/Node             128 GBytes          1584 nodes - 16 GBytes; 8 nodes - 32 GBytes
Accessible Memory/Node  16 GBytes           1584 nodes - 14 GBytes; 8 nodes - 30 GBytes
Memory Model            Shared on node      Shared on node; distributed across cluster
Interconnect Type       Ethernet            SeaStar2+

File Systems on Einstein
Path     Capacity    Type
/scr     516 TBytes  Lustre
/u/home  22 TBytes   Lustre

2.2. Processors

Einstein uses AMD Opteron 64-bit processors on its login and compute nodes. The login nodes run at 2.4 GHz and have four processors per node, each with four cores, for a total of 16 cores per login node. These processors have 64 KBytes of L1 instruction cache, 64 KBytes L1 data cache, 512 KBytes of L2 cache and 6 MBytes of L3 cache.

The compute nodes run at 2.4 GHz and have 2 processors per node, each with four cores, for a total of 8 cores per compute node. These processors have 64 KBytes of L1 instruction cache, 64 KBytes L1 data cache, 512 KBytes of L2 cache and 6 MBytes of L3 cache.

2.3. Memory

Einstein uses both shared and distributed memory models. Memory is shared among all the cores on a node, but is not shared among the nodes across the cluster.

Each login node contains 128 GBytes of main memory. All memory and cores on the node are shared among all users who are logged in. Therefore, users should not use more than 16 GBytes of memory at any one time. Memory is not distributed across the login nodes.

Each compute node contains 14 GBytes of user-accessible shared memory. Eight of these nodes are configured with 30 GBytes of user-accessible shared memory to support larger memory applications. These high-memory nodes are accessible by jobs submitted to the bigmem queue.

2.4. Operating System

The operating system on Einstein's login nodes is SUSE Linux. The compute nodes use a Cray-developed Compute Node Linux (CNL) kernel. The combination of these two operating systems is known as the Cray Linux Environment (CLE). The compute node kernel has been stripped of most Linux modules that are not essential for computation and provides a very limited number of system commands. By default, executables running on the compute nodes cannot access dynamic shared libraries. However, if you set the $CRAY_ROOTFS environment variable to "DSL" in your PBS batch script or interactive batch session, the compute nodes provide access to dynamic shared libraries and most of the common Linux commands. With this variable set, applications can also run on the compute nodes with stack arrays larger than 2 GBytes.
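
In a batch script, the setting described above might look like the following sketch. It uses sh/bash syntax; csh and tcsh users would use setenv instead. The value "DSL" is taken from the text above, and the echo line is only there to record the setting in the job output.

```shell
#!/bin/sh
# Enable dynamic shared libraries (and most common Linux commands) on the
# compute nodes. Set this before any aprun command in the batch script.
CRAY_ROOTFS=DSL
export CRAY_ROOTFS
echo "CRAY_ROOTFS is set to ${CRAY_ROOTFS}"
```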

2.5. File Systems

Einstein has the following file systems available for user storage:

2.5.1. /u/home/

This is a locally mounted Lustre file system. It has a formatted capacity of 22 TBytes. All users have a home directory located on this file system which can be referenced by the environment variable $HOME. This file system is not backed up. Users are responsible for making backups of their files to the archive server, Newton, or to some other local system.

2.5.2. /scr/

This is a locally mounted Lustre file system that is tuned for high-speed I/O performance. It has a formatted capacity of 516 TBytes. All users have a work directory located on this file system which can be referenced by the environment variable $WORKDIR. This file system is not backed up. Users are responsible for making backups of their files to the archive server, Newton, or to some other local system.

2.5.3. RAID/Striping Concerns for Large Files

It is important to note that /scr is a parallel, striped file system. This means that as files are written, they are automatically divided into chunks and written across multiple RAID 5 disk sets, or "OSTs," simultaneously. This process, called "striping," plays a vital role in running very large jobs because it significantly improves file I/O speed, thereby reducing the time required to read or write a file. Without parallel striping, large jobs, many of which require hundreds of GBytes of disk space, would spend much of their time just reading from and writing to disk.

The default stripe size for /scr is 1 MByte, and the default stripe count is two stripes. Increasing the stripe count is advisable when creating files on /scr that are larger than 40 GBytes. For an explanation of how to do this, see the document "Increasing Performance on Lustre File Systems."

Go to $SAMPLES_HOME/OST_Stripes for an example of how to increase stripe counts in a batch job.
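
As a rough illustration of the advice above, the following sketch sets a larger stripe count on a directory so that new files created in it inherit the setting. It is not the sample from $SAMPLES_HOME: the function names are hypothetical, and the stripe count of 8 for large files is only an illustrative assumption. It relies on the standard Lustre "lfs setstripe -c" and "lfs getstripe" commands and falls back to printing the command where lfs is not installed.

```shell
#!/bin/sh
# Hypothetical helper: pick a stripe count from the expected file size (GBytes).
# The 40-GByte threshold and default count of 2 come from the text above;
# the larger count of 8 is only an illustrative assumption.
suggest_stripe_count() {
    if [ "$1" -gt 40 ]; then
        echo 8
    else
        echo 2
    fi
}

# Apply the stripe count to a directory so that new files created in it
# inherit the setting. Only prints the command where lfs is unavailable.
set_dir_stripes() {
    dir=$1
    count=$(suggest_stripe_count "$2")
    if command -v lfs >/dev/null 2>&1; then
        lfs setstripe -c "$count" "$dir" && lfs getstripe "$dir"
    else
        echo "lfs not found; would run: lfs setstripe -c $count $dir"
    fi
}
```

For example, "set_dir_stripes ${WORKDIR}/bigrun 100" would prepare a directory (bigrun is a placeholder name) for files of roughly 100 GBytes.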

2.6. Peak Performance

Einstein is rated at 123 peak TFLOPS or 9.6 GFLOPS per core.

3. Accessing the System

3.1. Kerberos

A Kerberos client kit must be installed on your desktop to enable you to get a Kerberos ticket. Kerberos is a network authentication tool that provides secure communication by using secret cryptographic keys. Only users with a valid HPCMP Kerberos authentication can gain access to Einstein. More information about installing Kerberos clients on your desktop can be found at the CCAC Support page.

3.2. Logging In

  • Kerberized SSH
    The recommended method is to use dynamic assignment, as follows:
    local > ssh einstein.navo.hpc.mil
    Alternatively, you can manually specify a particular login node, as follows:
    local > ssh einstein#.navo.hpc.mil (# = 1 - 4)
  • Kerberized rlogin and telnet are also allowed.

3.3. File Transfers

File transfers to DSRC systems must be performed using the following Kerberized tools: kftp, krcp, sftp, scp, and mpscp.

4. User Environment

4.1. User Directories

The following user directories are provided for all users on Einstein.

4.1.1. Home Directory

When you log on to Einstein, you will be placed in your home directory, /u/home/username. The environment variable $HOME is automatically set for you and refers to this directory. $HOME is visible to both the login and compute nodes, and may be used to store small user files. It has an initial quota of 1 GByte and is not backed up and therefore should not be used for long term storage.

To access $HOME from Einstein's compute nodes, the environment variable $CRAY_ROOTFS must be set to "DSL". It is strongly suggested that you not use $CRAY_ROOTFS to run MPI executables from your $HOME directory.

4.1.2. Work Directory

Einstein has one large file system (/scr) for the temporary storage of data files needed for executing programs. You may access your personal working directory under /scr by using the $WORKDIR environment variable, which is set for you upon login. Your $WORKDIR directory has no disk quotas, and files stored there do not affect your permanent file quota usage. Because of high usage, the /scr file system tends to fill up frequently. Please review the Purge Policy and be mindful of your disk usage.

REMEMBER: /scr is a "scratch" file system and is not backed up. You are responsible for managing files in your $WORKDIR by backing up files to the archive system and deleting unneeded files when your jobs end. See the section below on Archive Usage for details.

All of your jobs should execute from your $WORKDIR directory, not $HOME. While not technically forbidden, jobs that are run from $HOME are subject to disk space quotas and have a much greater chance of failing if problems occur with that resource.

To avoid unusual errors that can arise from two jobs using the same scratch directory, a common technique is to create a unique subdirectory for each batch job by including the following lines in your batch script:

TMPD=${WORKDIR}/${PBS_JOBID}
mkdir -p ${TMPD}

4.2. Shells

The following shells are available on Einstein: csh, bash, ksh, tcsh, and sh. To request a change of your default shell, contact the Consolidated Customer Assistance Center (CCAC).

4.3. Environment Variables

4.3.1. Login Environment Variables

A number of environment variables are provided by default on all HPCMP HPC systems. We encourage you to use these variables in your scripts where possible. Doing so will help to simplify your scripts and reduce portability issues if you ever need to run those scripts on other systems. The following environment variables are automatically set in your login environment:

Common Environment Variables
Variable Purpose
$ARCHIVE_HOME Your directory on the archival system
$ARCHIVE_HOST The host name of the archival system
$CSI_HOME The path to the directory for the following list of heavily used application packages: ABAQUS, Accelrys, ANSYS, CFD++, Cobalt, EnSight, Fluent, GASP, Gaussian, LS-DYNA, MATLAB, and TotalView, formerly known as the Consolidated Software Initiative (CSI) list. Other application software may also be installed here by our staff.
$DAAC_HOME The path to the directory containing the ezVIZ visualization software
$HOME Your home directory on the system
$JAVA_HOME The path to the directory containing the default installation of JAVA
$PET_HOME The path to the directory containing the tools installed by the PET CE staff. The supported software includes a variety of open-source math libraries (see BC policy FY06-01) and open-source performance and profiling tools (see BC policy FY07-02).
$SAMPLES_HOME The path to the Sample Code Repository. This collection of sample scripts and codes is provided and maintained by our staff to help users learn to write their own scripts. There are a number of ready-to-use scripts for a variety of applications.
$WORKDIR Your work directory on the local temporary file system (i.e., local high-speed disk).

4.3.2. Batch-Only Environment Variables

In addition to the variables listed above, the following variables are automatically set only in your batch environment. That is, your batch scripts will be able to see them when they run. These variables are supplied for your convenience and are intended for use inside your batch scripts.

Batch-Only Environment Variables
Variable Purpose
$BC_CORES_PER_NODE The number of cores per node for the compute node on which a job is running.
$BC_MEM_PER_NODE The approximate maximum user-accessible memory per node (in integer MBytes) for the compute node on which a job is running.
$BC_MPI_TASKS_ALLOC The number of MPI tasks allocated for a job.
$BC_NODE_ALLOC The number of nodes allocated for a job.
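
As an illustration of how these variables might be used inside a batch script, the sketch below derives the total core count and the approximate memory per core for a job. The function names are hypothetical, and the fallback values (1 node, 8 cores, 14336 MBytes) are assumptions included only so that the sketch also runs outside a batch job.

```shell
#!/bin/sh
# Hypothetical helpers built on the batch-only variables. The fallback values
# after ":-" are assumptions used only when the variables are not set,
# for example when trying this outside a batch job.
job_cores() {
    echo $(( ${BC_NODE_ALLOC:-1} * ${BC_CORES_PER_NODE:-8} ))
}

job_mem_per_core_mb() {
    echo $(( ${BC_MEM_PER_NODE:-14336} / ${BC_CORES_PER_NODE:-8} ))
}

echo "Total cores in this job:     $(job_cores)"
echo "Approx. memory per core (MB): $(job_mem_per_core_mb)"
```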

4.4. Modules

Software modules are a very convenient way to set needed environment variables and include necessary directories in your path so that commands for particular applications can be found. All accounts have a set of modules loaded on login. These modules add compilers to the PATH, etc. For additional information on using modules, please refer to the Modules User Guide.

4.5. Archive Usage

All of our HPC systems have access to an online archival mass storage system that provides long-term storage for users' files. This petascale archival system resides on a robotic tape library. A 60-TByte disk cache front-ends the tape file system and temporarily holds files while they are being transferred to or from tape.

Tape file systems have very slow access times. The tapes must be robotically pulled from the tape library, mounted in one of the limited number of tape drives, and wound into position for file archival or retrieval. For this reason, users should always tar up their small files in a large tarball when archiving a significant number of files. A good maximum target size for tarballs is about 200 GBytes or less. At that size, the time required for file transfer and tape I/O is reasonable. Files larger than 1 TByte will span more than one tape, which will greatly increase the time required for both archival and retrieval.

The environment variables $ARCHIVE_HOST and $ARCHIVE_HOME are automatically set for you. $ARCHIVE_HOST can be used to reference the archive server, and $ARCHIVE_HOME can be used to reference your archive directory on the server. These can be used when transferring files to/from archive.
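
The advice to bundle small files before archiving can be sketched as follows. The function name bundle_dir and the directory name proj1 are placeholders, not site-provided names. The transfer itself is shown only as a comment, using the rcp form from the Archival Command Synopsis.

```shell
#!/bin/sh
# Hypothetical helper: bundle a directory of small files into one tarball
# next to it, and print the tarball's path on success.
bundle_dir() {
    src=$1
    parent=$(dirname "$src")
    name=$(basename "$src")
    tar -cf "${parent}/${name}.tar" -C "$parent" "$name" \
        && echo "${parent}/${name}.tar"
}

# Example use inside a batch script (proj1 is a placeholder directory name):
#   tarball=$(bundle_dir "${WORKDIR}/proj1")
#   rcp "${tarball}" "${ARCHIVE_HOST}:${ARCHIVE_HOME}/"
```

Keeping each tarball at or below the suggested 200-GByte target may mean bundling subdirectories separately.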

4.5.1. Archival Command Synopsis

A synopsis of the main archival utilities is listed below. For information on additional capabilities, see the Archive User Guide or read the online man pages that are available on each system. These commands are non-Kerberized and can be used in batch submission scripts if desired.

  • Copy one or more files from the archive system
    rcp ${ARCHIVE_HOST}:${ARCHIVE_HOME}/file_name ${WORKDIR}/proj1

  • List files and directory contents on the archive system
    rsh ${ARCHIVE_HOST} ls [lsopts] [file/dir ...]

  • Create directories on the archive system
    rsh ${ARCHIVE_HOST} mkdir [-p] [-s] dir1 [dir2 ...]

  • Copy one or more files to the archive system
    rcp ${WORKDIR}/proj1/file_name ${ARCHIVE_HOST}:${ARCHIVE_HOME}/proj1

5. Program Development

5.1. Programming Models

Einstein supports three base programming models: Message Passing Interface (MPI), Shared-MEMory (SHMEM), and Open Multi-Processing (OpenMP). A Hybrid MPI/OpenMP programming model is also supported. MPI and SHMEM are examples of the message- or data-passing models, while OpenMP uses only shared memory on a node by spawning threads.

5.1.1. Message Passing Interface (MPI)

The MPI package on Einstein is derived from MPICH-2 and implements the MPI-2 standard except for spawn support. It also implements the MPI 1.2 standard, as documented by the MPI Forum in the spring 1997 release of "MPI: A Message Passing Interface Standard."

For more information on included MPI-2 features, see the Cray XT Series Programming Environment User's Guide, available online from Cray.

When creating an MPI program on Einstein, ensure the following:

  • That the Message Passing Toolkit (module xt-mpt) is loaded. MPT should be loaded in the default programming environment. To check this, run the "module list" command. If xt-mpt is not listed, use the following command:

    module load xt-mpt

    Additional information on modules can be found in the Modules User Guide.

  • That the source code includes one of the following lines:
    #include <mpi.h>        ## for C, or
    INCLUDE "mpif.h"        ## for Fortran

To compile an MPI program, use one of the following examples:

cc -o mpiprog.exe mpi_prog.c          ## for C, or
ftn -o mpiprog.exe mpi_prog.f         ## for Fortran

To run an MPI program within a batch script, use the following command, which is the same for all three programming environments:

aprun -n N $WORKDIR/mympidirectory/mpiprog.exe

The aprun utility executes across a specified number of compute nodes. The "-n N" option specifies the number of cores to start. Please note the aprun utility only works in a Lustre-mounted file system. Users should ensure that all files needed by the MPI job are located in $WORKDIR. Additional information on the aprun utility can be found in the online man pages.

5.1.2. SHared MEMory (SHMEM)

These logically shared, distributed-memory access routines provide high-performance, high-bandwidth communication for use in highly parallelized scalable programs. The SHMEM data-passing library routines are similar to the MPI library routines: they pass data between cooperating parallel processes. The SHMEM data-passing routines can be used in programs that perform computations in separate address spaces and that explicitly pass data to and from different processes in the program.

The SHMEM routines minimize the overhead associated with data-passing requests, maximize bandwidth, and minimize data latency. Data latency is the length of time between a process initiating a transfer of data and that data becoming available for use at its destination.

SHMEM routines support remote data transfer through put operations that transfer data to a different process and get operations that transfer data from a different process. Other supported operations are work-shared broadcast and reduction, barrier synchronization, and atomic memory updates. An atomic memory operation is an atomic read and update operation, such as a fetch and increment, on a remote or local data object. The value read is guaranteed to be the value of the data object just prior to the update. See "man intro_shmem" for details on the SHMEM library.

For more information on the performance and use of SHMEM calls, see the Cray XT Series Programming Environment User's Guide, available online from Cray.

When creating a SHMEM program on Einstein, ensure the following:

  • That the Message Passing Toolkit (MPT) (module xt-mpt) is loaded. MPT should be loaded in the default programming environment. To check this, run the "module list" command. If xt-mpt is not listed, run the command "module load xt-mpt" to load it. Additional information on modules can be found in the Modules User Guide.

  • That the source code includes one of the following lines:

    #include <mpp/shmem.h>  ## for C, or
    INCLUDE 'mpp/shmem.fh'  ## for Fortran

  • That the compile command includes an option to reference the SHMEM library, "-lsma". The SHMEM library is included in the standard Cray compilers, but can be specified on the compile line.

To compile a SHMEM program, use the following examples:

cc -lsma -o shmemprog.exe shmem_prog.c     ## for C, or
ftn -lsma -o shmemprog.exe shmem_prog.f    ## for Fortran

To run a SHMEM program within a batch script, use the following command, which is the same for all three programming environments:

aprun -n N $WORKDIR/myshmem/shmemprog.exe

The aprun utility executes across a specified number of compute nodes. The "-n N" option specifies the number of cores to start. Please note the aprun utility only works in a Lustre-mounted file system. Users should ensure that all files needed by the SHMEM job are located in $WORKDIR. Additional information on the aprun utility can be found in the online man pages.

5.1.3. Open Multi-Processing (OpenMP)

OpenMP is an application programming interface (API) that supports multi-platform shared memory multiprocessing programming in C, C++ and Fortran. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior. OpenMP is a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications.

When creating an OpenMP program on Einstein, ensure the following:

  • That the Message Passing Toolkit (MPT) (module xt-mpt) is loaded. MPT should be loaded in the default programming environment. To check this, run the "module list" command. If xt-mpt is not listed, run the command "module load xt-mpt" to load it. Additional information on modules can be found in the Modules User Guide.

  • That the source code includes one of the following:

    #include <omp.h>        ## for C, or
    INCLUDE 'omp_lib.h'     ## for Fortran

  • That the compile command includes an option to reference the OpenMP library. The PGI, PathScale, and GNU compilers support OpenMP, and each one uses a different option.

To compile an OpenMP program, use the following examples:

PGI Programming Environment:

cc -o openmpprog.exe -mp=nonuma openmp_prog.c   ## for C, or 
ftn -o openmpprog.exe -mp=nonuma openmp_prog.f  ## for Fortran

PathScale Programming Environment:

cc -o openmpprog.exe -mp openmp_prog.c     ## for C, or
ftn -o openmpprog.exe -mp openmp_prog.f    ## for Fortran

GNU Programming Environment:

cc -o openmpprog.exe -fopenmp openmp_prog.c    ## for C, or
ftn -o openmpprog.exe -fopenmp openmp_prog.f   ## for Fortran

To run an OpenMP program within a batch script, use the following command, which is the same for all three programming environments:

aprun -n 1 -d 4 openmpprog.exe

In this example, the aprun utility executes across one compute node with four threads. The environment variable $OMP_NUM_THREADS also needs to be set to the number of threads. As in the previous programming models, aprun only works in a Lustre-mounted file system. Users should ensure that all files needed by the OpenMP job are located in $WORKDIR. Additional information on the aprun utility can be found in the online man pages.
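
One way to keep $OMP_NUM_THREADS and the aprun "-d" value in step is to derive both from a single shell variable, as in the sketch below. The variable names threads and launch are illustrative, and the script only prints the command when aprun is not available (for example, on a desktop).

```shell
#!/bin/sh
# Keep the thread count in one place so that OMP_NUM_THREADS and the
# aprun depth (-d) cannot drift apart.
threads=4
OMP_NUM_THREADS=$threads
export OMP_NUM_THREADS

launch="aprun -n 1 -d ${threads} ./openmpprog.exe"
if command -v aprun >/dev/null 2>&1; then
    # Run from the Lustre-mounted work directory, as required by aprun.
    cd "${WORKDIR}" && $launch
else
    echo "aprun not found; would run: ${launch}"
fi
```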

5.1.4. Hybrid Processing (MPI/OpenMP)

An application built with the hybrid model of parallel programming can run on Einstein using both OpenMP and Message Passing Interface (MPI). In hybrid applications, OpenMP threads can be spawned by MPI processes, but MPI calls should not be issued from OpenMP parallel regions or by an OpenMP thread.

When creating a hybrid (MPI/OpenMP) program on Einstein, follow the instructions in the MPI and OpenMP sections above for creating your program. Then use the compilation instructions for OpenMP.

Before running a hybrid MPI/OpenMP program, you need to set the $OMP_NUM_THREADS environment variable to the number of threads.

export OMP_NUM_THREADS=8

To run your program within a batch script, use the launch command that corresponds to the MPI library that you compiled with.

aprun -n mpi_procs -d threads_per_mpi_proc mpiprogram.exe

In the following example, we want to run 8 MPI processes, and each process needs at least half the memory available on a node. We can therefore place only 2 MPI processes on each node, so we request 4 nodes, or 32 cores (4 nodes x 8 cores). We also want each MPI process to launch 4 OpenMP threads, so we set the environment variable accordingly and assign 4 cores per process in the aprun command.

####  MPI/OpenMP on 4 nodes, 8 MPI processes with 4 threads each
##  request 4 nodes, each has 8 cores
#PBS -l mppwidth=32
##  request 4 threads per MPI process
#PBS -l mppdepth=4
##  request 8 cores per node
#PBS -l mppnppn=8
export OMP_NUM_THREADS=4 ##  create 4 threads per MPI process
aprun -n 8 -d 4 ./xthi.x ##  assigns 2 MPI_processes/node (1 MPI_process/CPU on each node)
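
The sizing arithmetic for a hybrid job can be sketched as follows. The function name nodes_needed is hypothetical, and the calculation assumes that every OpenMP thread gets its own core. It does not account for memory limits, which may force fewer MPI processes per node than the core count allows.

```shell
#!/bin/sh
# Hypothetical helper: number of nodes needed for a hybrid job, rounding up.
#   $1 = MPI processes, $2 = OpenMP threads per process, $3 = cores per node
nodes_needed() {
    cores=$(( $1 * $2 ))
    echo $(( (cores + $3 - 1) / $3 ))
}

mpi_procs=8
threads=4
cores_per_node=8
echo "cores: $((mpi_procs * threads)), nodes: $(nodes_needed $mpi_procs $threads $cores_per_node)"
```

With 8 MPI processes of 4 threads each on 8-core nodes, this reports 32 cores on 4 nodes.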

5.2. Available Compilers

There are three compiler suites available on Einstein. Each of the compiler suites may be accessed by a corresponding Programming Environment.

  • Portland Group (PGI)
  • PathScale
  • GNU

Compiling for the Login Nodes

Codes compiled to run on the login or esLogin nodes must be serial, while those for the compute nodes may be either serial or parallel. To compile for the login nodes, use the following compile commands:

Serial-Only Compiler Commands
Language    PGI    PathScale  GNU       Serial/Parallel
C           pgcc   pathcc     gcc       Serial
C++         pgCC   pathCC     g++       Serial
Fortran 77  pgf77  pathf90    gfortran  Serial
Fortran 90  pgf90  pathf90    gfortran  Serial

To compile codes for execution on the compute nodes, the same compile commands are available in all programming environment suites as shown in the following table:

Compute Node Compiler Commands
Language    PGI  PathScale  GNU  Serial/Parallel
C           cc   cc         cc   Serial/Parallel
C++         CC   CC         CC   Serial/Parallel
Fortran 77  f77  f77        f77  Serial/Parallel
Fortran 90  ftn  ftn        ftn  Serial/Parallel

The PGI programming environment is loaded for you by default. To use a different suite, you will need to swap modules. See Relevant Modules (below) to learn how.

5.2.1. Portland Group (PGI) Programming Environment

The PGI Programming Environment provides a large number of options that are the same for all compilers in the suite. The following table lists some of the more common options that you may use:

Commonly Used PGI Compiler Options (Fortran and C/C++)
Option           Purpose
-c               Generate intermediate object file but do not attempt to link.
-I directory     Search in directory for include or module files.
-L directory     Search in directory for libraries.
-o outfile       Name executable "outfile" rather than the default "a.out".
-Olevel          Set the optimization level. For more information on optimization, see the section on Profiling and Optimization.
-mcmodel=medium  Enable medium-model code generation for 64-bit targets; useful when the data space of the program exceeds 4 GBytes.
-Mfree           Process Fortran codes using free form.
-i8, -r8         Treat integer and real variables as 64-bit.
-Mbyteswapio     Big-endian files; the default is for little-endian.
-g               Generate symbolic debug information.
-Mbounds         Add array bound checking.
-Minfo=all       Report detailed information about code optimizations to stdout as compile proceeds.
-Mlist           Generate a file containing the compiler flags used and a line-numbered listing of the source code.
-mp=nonuma       Recognize OpenMP directives.
-Bdynamic        Compile using shared objects; requires CCM mode for execution on compute nodes.

Detailed information about these and other compiler options is available in the PGI compiler (pgf90, pgcc, and pgCC) man pages on Einstein.

5.2.2. PathScale Programming Environment

The PathScale Programming Environment provides a large number of options that are the same for all compilers in the suite. The following table lists some of the more common options that you may use:

Commonly Used PathScale Compiler Options
Option          Purpose
-c              Generate intermediate object file but do not attempt to link.
-I directory    Search in directory for include or module files.
-L directory    Search in directory for libraries.
-o outfile      Name executable "outfile" rather than the default "a.out".
-Olevel         Set the optimization level. For more information on optimization, see the section on Profiling and Optimization.
-show-defaults  List default compiler options for the compiler and exit.
-g              Add information for debugging to the object file and/or executable.
-i8, -r8        (Fortran only) Treat integer and real variables as 64-bit.
-default64      (Fortran only) Pass the -i8 and -r8 options to the compiler.
-cpp            (Fortran only) Preprocess files with the C preprocessor. Enabled by default for files ending in .F, .F90, or .F95.
-ftpp           (Fortran only) Preprocess files with the Fortran preprocessor. Useful when portions of the Fortran code could be misinterpreted as C preprocessor directives (e.g., "//").
-intrinsic=PGI  (Fortran only) Enable intrinsic functions that are available in the PGI compiler but are not ANSI standard (e.g., rand).
-mp             Enable parallelization via OpenMP directives.

Detailed information about these and other compiler options is available in the PathScale compiler (pathf90, pathcc, pathCC) man pages on Einstein.

5.2.3. GNU Programming Environment

The GNU Programming Environment provides a large number of options that are the same for all compilers in the suite. The following table lists some of the more common options that you may use:

Useful GNU Compiler Options
Option                Purpose
-c                    Generate intermediate object file but do not attempt to link.
-I directory          Search in directory for include or module files.
-L directory          Search in directory for libraries.
-o outfile            Name executable "outfile" rather than the default "a.out".
-Olevel               Set the optimization level. For more information on optimization, see the section on Profiling and Optimization.
-g                    Generate symbolic debug information.
-fconvert=big-endian  Big-endian files; the default is for little-endian.
-Wall, -Wextra        Turn on increased error reporting.

Detailed information about these and other compiler options is available in the GNU compiler (gfortran, gcc, and g++) man pages on Einstein.

5.3. Relevant Modules

By default, Einstein loads the PGI programming environment for you. The PathScale and GNU environments are also available. To use either of these, the PGI module must be unloaded and replaced with the one you wish to use. To do this, use the "module swap" command as follows:

module swap PrgEnv-pgi PrgEnv-pathscale     ## To switch to PathScale
module swap PrgEnv-pgi PrgEnv-gnu           ## To switch to GNU
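
If you are not sure which programming environment is currently loaded, a small helper along the following lines can work out the swap command for you. This is a sketch under assumptions: the function name swap_command_for is hypothetical, and it relies on $LOADEDMODULES, the colon-separated list of loaded modules that the module system normally maintains. When no PrgEnv module is found, it assumes the documented default, PrgEnv-pgi.

```shell
#!/bin/sh
# Hypothetical helper: print the "module swap" command needed to move from the
# currently loaded programming environment to the requested one.
# Relies on $LOADEDMODULES, the colon-separated list kept by the module system.
swap_command_for() {
    target=$1
    current=$(printf '%s\n' "${LOADEDMODULES:-}" | tr ':' '\n' \
        | grep '^PrgEnv-' | head -n 1)
    current=${current%%/*}          # drop any version suffix
    current=${current:-PrgEnv-pgi}  # assume the documented default if none found
    if [ "$current" = "$target" ]; then
        echo "# $target is already loaded"
    else
        echo "module swap $current $target"
    fi
}

swap_command_for PrgEnv-gnu
```

The helper only prints the command; run the printed line yourself to perform the swap.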

In addition to the compiler suites, all of these modules also load the MPT and LibSci modules. The MPT module initializes MPI. The LibSci module includes solvers and single-processor and parallel routines that have been tuned for optimal performance on Cray XT systems (BLAS, LAPACK, ScaLAPACK, etc.). For additional information on the MPT and LibSci modules, see the intro_mpi and intro_libsci man pages on Einstein.

The table below shows the naming convention for various programming environment modules.

Programming Environment Modules
Compiler Suite Module Name
PGI PrgEnv-pgi
PathScale PrgEnv-pathscale
GNU PrgEnv-gnu

For more information on using modules, see the Modules User Guide.
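
The "module swap" command unloads its first argument and loads its second, so the first argument should be whichever environment is currently loaded. As a minimal sketch of that rule, the following prints the swap command for any pair of environments in the table above. The function names prgenv_module and swap_command are hypothetical helpers, not part of the modules package, and the script only prints the command rather than running it.

```shell
#!/bin/sh
# prgenv_module SUITE: map a compiler suite to its module name,
# following the naming convention in the table above.
prgenv_module() {
    case "$1" in
        pgi|pathscale|gnu) echo "PrgEnv-$1" ;;
        *) echo "unknown suite: $1" >&2; return 1 ;;
    esac
}

# swap_command FROM TO: print the command that replaces the currently
# loaded environment (FROM) with the desired one (TO).
swap_command() {
    from=$(prgenv_module "$1") || return 1
    to=$(prgenv_module "$2") || return 1
    echo "module swap $from $to"
}

swap_command pgi gnu          # starting from the default environment
swap_command pathscale gnu    # after an earlier switch to PathScale
```

Running "module list" shows which PrgEnv module is loaded at the moment.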

5.4. Libraries

Cray's LibSci and AMD's Core Math Library (ACML) are both available on Einstein. In addition, an extensive suite of math and science libraries is available in the $PET_HOME directory.

5.4.1. Cray LibSci

Einstein provides Cray's LibSci library as part of the modules that are loaded by default. This library is a collection of single processor and parallel numerical routines that have been tuned for optimal performance on Cray XT systems. The LibSci library contains optimized versions of many of the BLAS math routines as well as Cray versions of most of the ACML routines. Users should call the LibSci versions instead of the public domain or user written versions, to optimize application performance on Einstein.

The routines in LibSci are automatically included when using the ftn, cc, or CC commands. You do not need to use the "-l sci" flag in your compile command line.

Cray LibSci includes the following:

  • Basic Linear Algebra Subroutines (BLAS) - Levels 1, 2, and 3
  • Linear Algebra Package (LAPACK)
  • Scalable LAPACK (ScaLAPACK) (distributed-memory parallel set of LAPACK routines)
  • Basic Linear Algebra Communication Subprograms (BLACS)
  • Iterative Refinement Toolkit (IRT)
  • SuperLU (for large, sparse nonsymmetrical systems of linear equations)

5.4.2. AMD Core Math Library (ACML)

In addition to LibSci, Einstein also provides the AMD Core Math Library (ACML). ACML is a set of numerical routines tuned specifically for AMD64 platform processors. The routines, which are available via both FORTRAN and C interfaces, include the following:

  • Basic Linear Algebra Subroutines (BLAS) - Levels 1, 2, and 3
  • Linear Algebra Package (LAPACK)
  • Fast Fourier Transform (FFT) routines for single-precision, double-precision, single-precision complex, and double-precision complex data types
  • Random Number Generator
  • Fast Math and Fast Vector Library

The routines in the ACML can be accessed by including the library reference on your compile command line. For example:

ftn -l acml fort.f90

The library is not loaded by default and can be loaded by running the following module command:

module load acml

Once ACML is loaded into your default programming environment, it will take precedence over the LibSci routines.

5.4.3. Additional PETTT Libraries

There is also an extensive set of Math libraries available in the $PET_HOME/MATH directory on Einstein. A list of PETTT and other commonly supplied tools, such as HDF5, CMake, and NetCDF can be found at http://www.pettt-ace.com.

5.5. Debuggers

Einstein provides the TotalView debugger and the GNU Project Debugger (gdb) to assist users in debugging their code.

5.5.1. TotalView

TotalView is a debugger that supports threads, MPI, OpenMP, C/C++, Fortran, and mixed-language codes. It offers advanced features such as on-demand memory leak detection, other heap allocation debugging features, and the Standard Template Library Viewer (STLView). Unique features like dive, a wide variety of breakpoints, the Message Queue Graph/Visualizer, powerful data analysis, and control at the thread level are also available.

Follow these steps to use TotalView on Einstein via a UNIX X-Windows interface:

  1. Ensure that an X server is running on your local system. Linux users will likely have this by default, but MS Windows users will need to install a third party X Windows solution. There are various options available. Currently, we recommend Cygwin.
  2. For Linux users, connect to Einstein using "ssh -Y". Windows users will need to use PuTTY with X11 forwarding enabled (Connection->SSH->X11->Enable X11 forwarding).
  3. Compile your program on Einstein with the "-g" option.
  4. Submit an interactive job:

    qsub -l mppwidth=4 -A Project_ID -l walltime=00:30:00 -q debug -v DISPLAY -I

    After a short while the following message will appear:

    qsub: waiting for job NNNNNNN to start
    qsub: job NNNNNNN ready

    You are now logged into an interactive batch session.

  5. Load the TotalView module:

    module load totalview

  6. Start program execution:

    totalview aprun -a -n 4 ./my_mpi_prog.exe arg1 arg2 ...

  7. After a short delay, the TotalView windows will pop up. Click "GO" to start program execution.

For more information on using TotalView, see the TotalView Documentation page.

5.5.2. gdb

The GNU Project Debugger (gdb) is a source-level debugger that can be invoked with a program to execute, with a program and a core file, or with the process ID of a running process. To load your program into gdb together with a core file that it produced, use the following:

gdb a.out corefile

To attach gdb to a program that is already executing on this node, use the following command:

gdb a.out pid

For more information, the GDB manual can be found at http://sourceware.org/gdb/current/onlinedocs/gdb.

5.6. Code Profiling and Optimization

Profiling is the process of analyzing the execution flow and characteristics of your program to identify sections of code that are likely candidates for optimization. Optimization, in turn, increases the performance of a program by modifying certain aspects of it for greater efficiency.

We provide CrayPat to assist you in the profiling process. In addition, a basic overview of optimization methods with information about how they may improve the performance of your code can be found in Performance Optimization Methods (below).

5.6.1. CrayPat

CrayPat is an optional performance analysis tool used to evaluate program behavior on Cray supercomputer systems. CrayPat consists of the following major components: pat_build, pat_report, and pat_help. The data produced by CrayPat also can be used with Cray Apprentice2, an analysis tool that is used to visualize and explore the performance data captured during program execution.

Man pages are available for pat_build, pat_report, pat_help, and Apprentice2. Additional information can be found in the document "Using Cray Performance Analysis Tools."

The following steps should get you started using CrayPat:

  1. Load the "xt-craypat" module

    module load xt-craypat

  2. Recompile the code as you normally would to generate an executable.

    ftn mycode.f90 -o mycode

  3. Use the pat_build command to generate an instrumented executable.

    pat_build -g mpi -u mycode

    This generates an instrumented executable called mycode+pat. Here the "-g" option enables the "mpi" tracegroup. See "man pat_build" for available tracegroups.

  4. Run the instrumented executable with aprun via PBS.

    aprun -n 4 ./mycode+pat

    This generates an instrumented output file (e.g., mycode+pat+2007-12tdt.xf).

  5. Use pat_report to display the statistics from the output file

    pat_report mycode+pat+2007-12tdt.xf > mycode.pat_report

Additional profiling options are available. See "man pat_build" for additional instrumentation options.
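
The first four steps above can be collected into a small script. The following is a minimal sketch under stated assumptions, not a supported tool: the run and craypat_steps helpers are hypothetical, the source file is assumed to be a Fortran file ending in .f90, and by default the script only prints each command so that it can be reviewed first (set DRY_RUN=0 to execute, which must be done inside a PBS job because of the aprun step). Step 5 is left out because the name of the .xf file is only known after the run.

```shell
#!/bin/sh
# run CMD...: print the command, or execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# craypat_steps SOURCE NPROCS: steps 1-4 for a Fortran source file.
craypat_steps() {
    exe=${1%.f90}                       # mycode.f90 -> mycode
    run module load xt-craypat          # step 1
    run ftn "$1" -o "$exe"              # step 2
    run pat_build -g mpi -u "$exe"      # step 3: creates <exe>+pat
    run aprun -n "$2" "./$exe+pat"      # step 4: writes the .xf file
}

craypat_steps mycode.f90 4
```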

5.6.2. Additional Profiling Tools

There is also a set of profiling tools available in the $PET_HOME/pkgs directory on Einstein. Information about these tools may be found on the Baseline Configuration Web site at BC policy FY07-02. A list of commonly provided profiling tools can also be found at http://www.pettt-ace.com.

5.6.3. Program Development Reminders

If an application is not programmed for distributed memory, then only the cores on a single node can be used. This is limited to 8 cores on Einstein.

Keep the system architecture in mind during code development. For instance, if your program requires more memory than is available on a single node, then you will need to parallelize your code so that it can function across multiple nodes.
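
As a rough illustration of the arithmetic, the following sketch computes how many nodes a job occupies when it uses every core on each node. It assumes the 8 cores per node stated above; the function name nodes_needed is hypothetical.

```shell
#!/bin/sh
# Cores per node on Einstein, as stated in the text above.
CORES_PER_NODE=8

# nodes_needed CORES: print the number of whole nodes needed to supply
# CORES cores, rounding up.
nodes_needed() {
    echo $(( ($1 + CORES_PER_NODE - 1) / CORES_PER_NODE ))
}

nodes_needed 8      # prints 1: still fits on a single node
nodes_needed 9      # prints 2: requires a distributed-memory code
nodes_needed 128    # prints 16
```

Any result greater than 1 means the application must be written for distributed memory (for example, with MPI).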

5.6.4. Compiler Optimization

The "-Olevel" option enables code optimization when compiling. The level that you choose (0-4) will determine how aggressive the optimization will be. Increasing levels of optimization may increase performance significantly, but you should note that a loss of precision may also occur. There are also additional options that may enable further optimizations. The following table contains the most commonly used options.

Compiler Optimization Flags
Option Description Compiler Suite
-O0 No Optimization. (default in GNU) All
-O1 Scheduling within extended basic blocks is performed. Some register allocation is performed. No global optimization. All
-O2 Level 1 plus traditional scalar optimizations such as induction recognition and loop invariant motion are performed by the global optimizer. Generally safe and beneficial. (default in PGI & PathScale) All
-O3 Levels 1 and 2 plus more aggressive code hoisting and scalar replacement optimizations that may or may not be profitable. Generally beneficial. All
-O4 Levels 1, 2, and 3 plus hoisting of guarded invariant floating point expressions is enabled. PGI
-Ofast Levels 1, 2, and 3 plus IPA, -OPT:Ofast, and -fno-math-errno. More aggressive. PathScale
-fast, -fastsse Chooses generally optimal flags for the target platform. Includes: -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz. PGI
-OPT:Ofast (-OPT:ro=2:Olimit=0:div_split=ON:alias=typed) Generally safe, but may impact floating point precision. PathScale
-Mipa=fast,inline Performs interprocedural analysis (IPA) with generally optimal IPA flags for the target platform, and inlining. IPA can be very time consuming. Flag must be used in both compilation and linking steps. PGI
-Minline=levels:n Number of levels of inlining (default: n = 1) PGI
-fipa-* The GNU compilers automatically enable IPA at various -O levels. To set these manually, see the options beginning with -fipa in the gcc man page. GNU
-ipa Tells the compiler to perform interprocedural analysis. Can be very time consuming to perform. This flag should also be used in both compilation and linking steps. Not recommended for programs over 100,000 lines for the current compiler release. PathScale
-Mlist Creates a listing file with optimization info PGI
-Mneginfo Info on why certain optimizations are not performed PGI
-Mconcur Instructs the compiler to enable auto-concurrentization of loops. If specified, the compiler uses multiple processors to execute loops that it determines to be parallelizable; thus, loop iterations are split to execute optimally in a multithreaded execution context. PGI
-Munroll Invokes the loop unroller to unroll loops, executing multiple instances of the loop during each iteration. This also sets the optimization level to 2 if the level is set to less than 2, or if no -O or -g options are supplied. PGI
-M[no]vect Enables/Disables the code vectorizer. PGI
-apo Enables autoparallelization. PathScale

5.6.5. Performance Optimization Methods

Optimization generally increases compilation time and executable size, and may make debugging difficult. However, it usually produces code that runs significantly faster. The optimizations that you can use will vary depending on your code and the system on which you are running.

Note: Before considering optimization, you should always ensure that your code runs correctly and produces valid output.

In general, there are four main categories of optimization:

  • Global Optimization
  • Loop Optimization
  • Interprocedural Analysis and Optimization (IPA)
  • Function Inlining

Global Optimization

A technique that looks at the program as a whole, across all of its basic blocks. It performs control-flow and data-flow analysis for the entire program, detects all loops (including those formed by IF and GOTO statements), and may apply any of the following optimizations:

  • Constant propagation
  • Copy propagation
  • Dead store elimination
  • Global register allocation
  • Invariant code motion
  • Induction variable elimination

Loop Optimization

A technique that focuses on loops (for, while, etc.) in your code and looks for ways to reduce loop iterations or parallelize the loop operations. The following types of actions may be performed:

  • Vectorization - rewrites loops to improve memory access performance. Some compilers may also support automatic loop vectorization by converting loops to utilize low-level hardware instructions and registers if they meet certain criteria.
  • Loop unrolling - (also known as "unwinding") replicates the body of loops to reduce loop branching overhead and provide better opportunities for local optimization.
  • Parallelization - divides loop operations over multiple processors where possible.

Interprocedural Analysis and Optimization (IPA)

A technique that allows the use of information across function call boundaries to perform optimizations that would otherwise be unavailable.

Function Inlining

A technique that seeks to reduce function call and return overhead.

  • Used with functions that are called numerous times from relatively few locations.
  • Allows a function call to be replaced by a copy of the body of that function.
  • May create opportunities for other types of optimization
  • May not be beneficial. Improper use may increase code size and actually result in less efficient code.

6. Batch Scheduling

6.1. Scheduler

The Portable Batch System (PBS) is currently running on Einstein. It schedules jobs and manages resources and job queues, and can be accessed through the interactive batch environment or by submitting a batch request. PBS is able to manage both single-processor and multiprocessor jobs.

6.2. Queue Information

The following table describes the PBS queues available on Einstein:

Queue Descriptions and Limits
Priority  Queue Name  Job Class   Max Wall Clock Time  Max Cores Per Job  Comments
Highest   urgent      Urgent      TBD                  TBD                Designated Urgent Project by DoD HPCMP
↓         high        High        168 Hours            4096               Designated High-Priority Projects by DoD HPCMP
↓         challenge   Challenge   168 Hours            4096               Challenge Projects Only
↓         special     N/A         168 Hours            4096               Access Available by Request
↓         debug       Debug       30 Minutes           512                User Diagnostic Jobs
↓         standard    Standard    168 Hours            2048               Normal Priority Jobs
↓         bigmem      N/A         24 Hours             56                 Large Memory Jobs
↓         transfer    N/A         12 Hours             1                  Data Transfer Jobs
↓         analysis    N/A         8 Hours              1                  Serial Jobs
Lowest    background  Background  4 Hours              512                User jobs that will not be charged against the project allocation
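
Before submitting, it can help to check a request against these limits. The following is a minimal sketch that encodes the table values by hand; the function names queue_limits and fits_queue are hypothetical, and the numbers would need to be updated if the queue configuration changes (the show_queues utility reports the current limits). The urgent queue is omitted because its limits are listed as TBD.

```shell
#!/bin/sh
# queue_limits QUEUE: print "MAX_MINUTES MAX_CORES" for a queue,
# using the values from the table above.
queue_limits() {
    case "$1" in
        high|challenge|special) echo "10080 4096" ;;  # 168 hours
        standard)               echo "10080 2048" ;;  # 168 hours
        debug)                  echo "30 512" ;;
        bigmem)                 echo "1440 56" ;;     # 24 hours
        transfer)               echo "720 1" ;;       # 12 hours
        analysis)               echo "480 1" ;;       # 8 hours
        background)             echo "240 512" ;;     # 4 hours
        *)                      return 1 ;;           # unknown queue
    esac
}

# fits_queue QUEUE CORES MINUTES: succeed if the request is within
# the queue's core and wall-clock limits.
fits_queue() {
    limits=$(queue_limits "$1") || return 1
    max_minutes=${limits% *}
    max_cores=${limits#* }
    [ "$2" -le "$max_cores" ] && [ "$3" -le "$max_minutes" ]
}

fits_queue debug 4 30 && echo "debug: request fits"
fits_queue standard 4096 60 || echo "standard: request exceeds limits"
```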

6.3. Interactive Logins

When you log in to Einstein, you will be running in an interactive shell on a login node. The login nodes provide login access for Einstein and support such activities as compiling, editing, and general interactive use by all users. Please note the Login Node Abuse policy. The preferred method to run resource intensive executions is to use an interactive batch session.

6.4. Interactive Batch Sessions

An interactive session on a compute node is possible using the PBS qsub command with the "-I" option from a login node. Once PBS has scheduled your request to the specified queue, you will be directly logged into a compute node, and this session can last as long as your requested wall time. For example:

qsub -l walltime=HHH:MM:SS -l mppwidth=max_cores -A Project_ID -q queue_name -I

Your batch shell request will be placed in the specified queue and scheduled for execution. Depending on system load, this may take a few minutes. Once your shell starts, you can run or debug interactive applications, execute job scripts or start an execution on the compute nodes via the aprun command.

6.5. Batch Request Submission

PBS batch jobs are submitted via the qsub command. The format of this command is:

qsub [ options ] batch_script_file

qsub options may be specified on the command line or embedded in the batch script file by lines beginning with "#PBS".

For a more thorough discussion of PBS Batch Submission on Einstein, see the Einstein PBS Guide.

6.6. Batch Resource Directives

Batch resource directives allow you to specify to PBS how your batch jobs should be run and what resources your job requires. Although PBS has many directives, you only need to know a few to run most jobs.

The basic syntax of PBS directives is as follows:

#PBS option[[=]value]

where some options may require values to be included. For example, to set the number of cores for your job, you might specify the following:

#PBS -l mppwidth=8

The following directives are required for all jobs:

Required Directives
Directive Value Description
-A Project_ID Name of the project
-q queue_name Name of the queue
-l mppwidth=# Number of cores
-l walltime=HHH:MM:SS Maximum wall time

A more complete listing of batch Resource Directives is available in the Einstein PBS Guide.
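
To make the four required directives concrete, the following sketch prints them as they would appear at the top of a batch script. The pbs_header function is a hypothetical helper, and Project_ID is a placeholder for your own project.

```shell
#!/bin/sh
# pbs_header PROJECT QUEUE CORES WALLTIME: print the four directives
# that every job on Einstein must supply.
pbs_header() {
    printf '#PBS -A %s\n' "$1"
    printf '#PBS -q %s\n' "$2"
    printf '#PBS -l mppwidth=%s\n' "$3"
    printf '#PBS -l walltime=%s\n' "$4"
}

# An 8-core, one-hour job in the standard queue.
pbs_header Project_ID standard 8 01:00:00
```

The same four values can instead be given as options on the qsub command line.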

6.7. Launch Command

On Einstein the PBS batch scripts and the PBS interactive login session run on service nodes instead of compute nodes. The aprun command is used to send your executable to the compute nodes. The following example command line can be used in a batch script or in an interactive session, sending the executable $WORKDIR/a.out to 128 compute cores:

aprun -n 128 $WORKDIR/a.out

The aprun command can be used to launch MPI, SHMEM, or OpenMP executables. For examples of each, see MPI, SHMEM, and OpenMP (above). Examples can also be found in the $SAMPLES_HOME directory on Einstein. For more information about aprun, see the aprun man page.

6.8. Sample Scripts

The following example is a good starting template for a batch script to run a serial job for one hour:

#!/bin/bash -l
## The first line (above) specifies your shell
# Specify name of the job
#PBS -N serialjob
#
# Append std output to file serialjob.out
#PBS -o serialjob.out
#
# Append std error to file serialjob.err
#PBS -e serialjob.err
#
# Specify Project ID to be charged (Required)
#PBS -A Project_ID
#
# Request wall clock time of 1 hour (Required)
#PBS -l walltime="01:00:00"
#
# Specify queue name (Required)
#PBS -q standard
#
# Specify the number of cores (Required)
#PBS -l mppwidth=1
#
#PBS -S /bin/bash
# Change to the specified directory
cd $WORKDIR
#
# Execute the serial executable on 1 core
aprun ./serial_fort.exe
# End of batch job

The first few lines tell PBS to save the standard output and error output to the given files and give the job a name. Skipping ahead, we estimate the run-time to be about one hour and know that this is acceptable for the standard batch queue. We need one core in total, so we request one core.

The following example is a good starting template for a batch script to run a parallel (MPI) job for 12 hours:

#!/bin/bash -l
## The first line (above) specifies the shell to use for parsing the
## remaining lines of the batch script.
#
## Required PBS Directives --------------------------------------
#PBS -A Project_ID
#PBS -q standard
#PBS -l mppwidth=256
#PBS -l walltime=12:00:00
#
## Optional PBS Directives --------------------------------------
#PBS -l mppnppn=4
#PBS -N Test_Run_1
#PBS -j oe
#PBS -V
#PBS -S /bin/bash
#
## Execution Block ----------------------------------------------
# Environment Setup
# cd to your personal directory in the scratch file system
cd $WORKDIR
#
# create a job-specific subdirectory based on JOBID and cd to it
JOBID=`echo $PBS_JOBID | cut -d '.' -f 1`
if [ ! -d $JOBID ]; then
  mkdir -p $JOBID
fi
cd $JOBID
#
# Launching
# copy executable from $HOME and launch it on all 256 requested cores,
# 4 per node (-n matches mppwidth, -N matches mppnppn)
cp $HOME/my_prog.exe .
aprun -N 4 -n 256 ./my_prog.exe > my_prog.out
#
# Clean up
# archive your results
# Using the "here document" syntax, create a job script
# for archiving your data.
cd $WORKDIR
rm -f archive_job
cat > archive_job << END
#!/bin/bash
#PBS -l walltime=6:00:00
#PBS -q transfer
#PBS -A Project_ID
#PBS -j oe
#PBS -S /bin/bash
cd $WORKDIR
rsh $ARCHIVE_HOST mkdir $ARCHIVE_HOME/$JOBID
rcp -r $JOBID $ARCHIVE_HOST:$ARCHIVE_HOME/
rsh $ARCHIVE_HOST ls -l $ARCHIVE_HOME/$JOBID
# Remove scratch directory from the file system.
cd $WORKDIR
rm -rf $JOBID
END
#
# Submit the archive job script.
qsub archive_job
# End of Batch job

The required directives come first: the project to charge, the queue, the total number of cores, and the wall time. We estimate the run time to be about 12 hours and know that this is acceptable for the standard batch queue. The optional directives then set the number of cores per node, name the job, and merge standard error into standard output. This job requests 256 total cores at 4 cores per node, so it runs on 64 nodes. The default number of cores per node is 8.
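
The relationship between the PBS request and the aprun options can be sketched as follows. For aprun, -n is the total number of processes and -N is the number of processes per node. The variable names are hypothetical; because "#PBS" lines are comments, they cannot read shell variables, so this only keeps the aprun line and the node count consistent with values you also enter in the directives.

```shell
#!/bin/sh
# Layout of the sample job above.
MPPWIDTH=256   # total cores     (#PBS -l mppwidth=256)
MPPNPPN=4      # cores per node  (#PBS -l mppnppn=4)

# Number of nodes the job occupies.
NODES=$(( MPPWIDTH / MPPNPPN ))

# Launch line that uses every requested core.
APRUN_CMD="aprun -N $MPPNPPN -n $MPPWIDTH ./my_prog.exe"

echo "nodes: $NODES"
echo "$APRUN_CMD"
```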

Additional examples are available in the Sample Code Repository ($SAMPLES_HOME) on Einstein.
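
The parallel script above names its working directory after the numeric part of the job ID. The following sketch shows that step in isolation. The value assigned to PBS_JOBID is a made-up example of the usual "number.server" form; inside a real job, PBS sets this variable for you.

```shell
#!/bin/sh
# Hypothetical job ID; PBS sets PBS_JOBID automatically inside a job.
PBS_JOBID="1234567.sdb"

# Keep only the text before the first '.', as in the sample script.
JOBID=$(echo "$PBS_JOBID" | cut -d '.' -f 1)

echo "$JOBID"    # prints 1234567
```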

6.9. PBS Commands

Common PBS Commands
Command Description
qsub job_script Submit a job_script (Once submitted, PBS assigns it a unique jobID number)
qstat -a Check all jobs under PBS
qstat jobID Check one job
qstat -u username Check one user's jobs
qstat -f jobID Obtain detailed status on a job
qdel jobID Cancel a running/queued Job
qstat -Q List all PBS Queues

A more complete list of PBS commands is available in the Einstein PBS Guide.

6.10. Advance Reservations

An Advance Reservation Service (ARS) is available on Einstein for reserving up to 2688 cores for use, starting at a specific date/time, and lasting for a specific number of hours. The ARS is accessible via most modern web browsers at https://reservation.hpc.mil. Authenticated access is required. An ARS User's Guide is available online once you have logged in.

7. Software Resources

7.1. Application Software

All Commercial Off The Shelf (COTS) software packages can be found in the $CSI_HOME (/usr/local/CSI) directory. A complete listing of software on Einstein with installed versions can be found on our software page. The general rule for all COTS software packages is that the two latest versions will be maintained on our systems. For convenience, modules are also available for most COTS software packages.

7.2. Useful Utilities

The following utilities are available on Einstein:

Useful Utilities
Command Description
mpscp High-performance remote file copy
show_queues Report current batch queue status, usage, and limits
show_quota Display disk usage and number of files
show_usage Display CPU allocation and usage by subproject

7.3. Sample Code Repository

The Sample Code Repository is a directory that contains examples for COTS batch scripts, building and using serial and parallel programs, data management, and accessing and using serial and parallel math libraries. The $SAMPLES_HOME environment variable contains the path to this area, and is automatically defined in your login environment. Users should look in the $SAMPLES_HOME directory for simple examples. Examples and useful tips are periodically added to this directory.

Sample Code Repository on Einstein
Sample Description
Job Submission
pbs_scripts/serial.pbs Details how to submit a serial PBS job
pbs_scripts/parallel_mpi.pbs Details how to submit a parallel MPI PBS job
pbs_scripts/parallel_openmp.pbs Details how to submit a parallel OpenMP PBS job
pbs_scripts/transfer.pbs Details how to submit a serial transfer PBS job
Code Compilation
compilation/compile_info.txt Details basic compilation options for the PGI Fortran and C compilers

8. Links to Vendor Documentation

Cray Home: http://docs.cray.com/
Cray Application Developer's Environment User's Guide
http://docs.cray.com/books/S-2396-50/S-2396-50.pdf

Novell Home: http://www.novell.com/linux/
Novell SUSE Linux Enterprise Server: http://www.novell.com/products/server/

GNU Home: http://www.gnu.org
GNU Compiler: http://gcc.gnu.org/onlinedocs/
Portland Group Resources Page: http://www.pgroup.com/resources/