Benchmarks

In addition to the Crossroads benchmarks, an ASC Simulation Code Suite representing the three NNSA laboratories will be used to judge performance at the time of acceptance (Mercury from Lawrence Livermore, PARTISN from Los Alamos, and SPARC from Sandia). NNSA mission requirements forecast the need for a 6X or greater improvement over the ASC Trinity system (Haswell partition) for the code suite, measured using SSI. Final acceptance performance targets will be negotiated after a final system configuration is defined. Source code will be provided to the Offeror, subject to compliance with export control laws and no-cost licensing agreements.

Note: Each code will require special handling. Refer to section 3.5.4 of the Crossroads 2021 Technical Specs (pdf)

Mercury: Lawrence Livermore National Laboratory
- For details on how to obtain the code and the relevant paperwork, vendors should contact Dave Richards.
PARTISN: Los Alamos National Laboratory
- For details on how to obtain the code and the relevant paperwork, vendors should contact Jim Lujan.
SPARC: Sandia National Laboratories
- Download Militarily Critical Technical Data Agreement DD2345 (pdf), follow instructions.
- Download Participant Data Sheet (doc), follow instructions.
- Both forms are required before a license can be granted to use Sandia National Laboratories SPARC codes or accompanying input problems.
  Note: If your institution already has a DLA Logistics Information Service approved DD2345 License, please send a copy to Sandia National Laboratories with your completed Participant Data Sheet.
  Questions? Contact Simon Hammond or Jim Laros

Scalable System Improvement (SSI): An Application Performance Benchmarking Metric for HPC

Scalable System Improvement (SSI) provides a means to measure relative application performance between two high-performance computing (HPC) platforms. SSI was designed as a single metric that measures performance improvement across a wide variety of application and platform characteristics, for example capability, throughput, strong scaling, weak scaling, and system size. It also provides parameters that allow architecture teams and benchmark analysts to define the workload characteristics and to weight benchmarks independently, which is useful in procurements that represent more than one organization or varied workloads.

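Given two platforms, with one used as the reference, SSI is defined as a weighted geometric mean using the following equation.

$$\mathrm{SSI} = \left( \prod_{i=1}^{M} \left( c_i U_i S_i \right)^{w_i} \right)^{1 / \sum_{i=1}^{M} w_i}$$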

Where:

M - total number of applications,
c - capability scaling factor,
U - utilization factor = (n_ref / n) x (N / N_ref),
- n is the total number of nodes used for the application,
- N is the total number of nodes in the respective platform,
- ref refers to the reference system,
S - application speedup = (t_ref / t) or (FOM / FOM_ref),
w - weighting factor.

The capability factor allows the design team to define weak scaled problems. For example, if for a given application the problem size (or some other metric of complexity) is four times the size of the problem run on the reference system, c_i would be 4 for that application.

The utilization factor is the ratio of the platform utilizations used in obtaining the reported time or figure of merit (FOM). The utilization factor rewards using fewer nodes (n) to achieve a given speedup, and it also rewards providing more nodes in aggregate (N).
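As a minimal sketch of the definition U = (n_ref / n) x (N / N_ref), the following Python computes the utilization factor for GTC using the node counts from the Hopper/Edison example in this document. The function and argument names are illustrative assumptions, not part of any SSI tooling.

```python
def utilization_factor(n_ref, n, total_ref, total):
    """U = (n_ref / n) * (N / N_ref).

    n_ref, n         -- nodes used by the application on the reference / new platform
    total_ref, total -- total nodes in the reference / new platform (N_ref, N)
    """
    return (n_ref / n) * (total / total_ref)


# GTC: 1200 of 6,384 nodes on Hopper (reference), 400 of 5,576 nodes on Edison.
u_gtc = utilization_factor(n_ref=1200, n=400, total_ref=6384, total=5576)

# Using twice as many nodes on the new platform halves U.
u_gtc_double = utilization_factor(n_ref=1200, n=800, total_ref=6384, total=5576)

print(f"U(GTC) = {u_gtc:.2f}")
print(f"U(GTC, 2x nodes) = {u_gtc_double:.2f}")
```

Doubling n halves U, so the extra nodes only pay off in SSI if they more than double the speedup.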

Speedup is calculated using an application specific figure of merit. Nominally, speedup is defined as the ratio of the execution times. Some applications define a different FOM such as: a dimensionless number, time per iteration for a key code segment, grind time, floating-point operations per second, etc. Speedup rewards a faster time, or a higher FOM.

A necessary condition of the SSI calculation is that speedup (S) must be >= 1.0. One reason is that a user expects turnaround time to be no worse than on the previous-generation machine. Another is that, without this condition, one could run a given benchmark on an unreasonably small number of nodes on the target system in order to minimize node-hours (and avoid scaling effects, for example) and hence inflate SSI.
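The speedup definition and the S >= 1.0 condition can be sketched as a small helper. This is an illustration only: the function name, the `higher_is_better` flag, and the choice to raise an exception on violation are assumptions, not something the SSI definition prescribes.

```python
def speedup(ref_value, new_value, higher_is_better=False):
    """Return S for one application.

    For execution times (lower is better):  S = t_ref / t.
    For rate-like FOMs (higher is better):  S = FOM / FOM_ref.
    """
    if higher_is_better:
        s = new_value / ref_value
    else:
        s = ref_value / new_value
    if s < 1.0:
        # SSI requires that the new platform is at least as fast as the reference.
        raise ValueError(f"speedup {s:.3f} is below 1.0; result is not valid for SSI")
    return s


# FLASH times from the Hopper/Edison example in this document.
s_flash = speedup(331.62, 142.89)
print(f"S(FLASH) = {s_flash:.2f}")

# A hypothetical run that is slower on the new platform is rejected.
try:
    speedup(100.0, 120.0)
    rejected = False
except ValueError:
    rejected = True
```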

The weighting factor allows an architecture team or benchmark analyst to weight some applications more heavily than others. If all applications have equal weight, the weighted geometric mean is equivalent to the geometric mean.

Analyzing the SSI calculation, it can be observed that SSI is maximized by minimizing (n x t) or (n / FOM).
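This observation follows from substituting the definitions of U and S into the per-application term. The expansion below is a sketch for the execution-time case; the per-application subscript i on n and t is added here for clarity.

```latex
c_i U_i S_i
  = c_i \cdot \frac{n_{\mathrm{ref},i}}{n_i} \cdot \frac{N}{N_{\mathrm{ref}}} \cdot \frac{t_{\mathrm{ref},i}}{t_i}
  = c_i \cdot \frac{N}{N_{\mathrm{ref}}} \cdot \frac{n_{\mathrm{ref},i}\, t_{\mathrm{ref},i}}{n_i\, t_i}
```

For a fixed platform and fixed reference data, the only quantity that varies with how the benchmark is run is the product n_i t_i in the denominator, so each term (and hence SSI) grows as n x t shrinks. With a rate-like figure of merit, S_i = FOM_i / FOM_ref,i and the same substitution leaves n_i / FOM_i in the denominator.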

SSI is best illustrated with an example. This example uses data obtained from a workshop publication comparing NERSC’s Hopper (Cray XE6) and Edison (Cray XC30) platforms.[1] Application names, node counts, and timings are summarized in the following table.


Application	Hopper (6,384 nodes)		Edison (5,576 nodes)
	# Nodes	Time (sec)	# Nodes	Time (sec)
FLASH	512	331.62	512	142.89
GTC	1200	344.10	400	266.21
MILC	512	1227.22	1024	261.10
UMT	512	270.10	1024	59.90
MiniFE	512	45.20	2048	5.10

The weighted geometric mean can be easily calculated in a spreadsheet using the following form.

$$\mathrm{SSI} = \exp\left( \frac{\sum_{i=1}^{M} w_i \ln x_i}{\sum_{i=1}^{M} w_i} \right)$$

Where: x = cUS.
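The same log-based calculation can be sketched in Python. The input values below are hypothetical and chosen only so the results are easy to check by hand; the sketch also confirms that the log form agrees with the direct product form and that equal weights reduce to the plain geometric mean.

```python
import math


def weighted_geometric_mean(x, w):
    """exp( sum(w_i * ln(x_i)) / sum(w_i) ), the spreadsheet-friendly form."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return math.exp(sum(wi * math.log(xi) for xi, wi in zip(x, w)) / sum(w))


def weighted_geometric_mean_product(x, w):
    """Direct product form: (prod x_i ** w_i) ** (1 / sum(w_i))."""
    return math.prod(xi ** wi for xi, wi in zip(x, w)) ** (1.0 / sum(w))


# Hypothetical x = cUS values for two applications.
x = [2.0, 8.0]

equal = weighted_geometric_mean(x, [1, 1])     # sqrt(2 * 8)
skewed = weighted_geometric_mean(x, [3, 1])    # (2**3 * 8) ** (1/4)

print(f"equal weights:  {equal:.3f}")
print(f"weights 3 and 1: {skewed:.3f}")
```

The log form is preferable for long benchmark lists because it avoids overflow or underflow in the intermediate product.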

While the original study was a strong scaling analysis, for illustrative purposes we’re going to assume that the UMT and MiniFE benchmarks were run at four times the problem size on Edison and hence c=4. The weights are assigned arbitrarily, again for illustrative purposes.


SSI					3.61
	w	c	U	S	cUS
FLASH	1	1	0.87	2.32	2.03
GTC	4	1	2.62	1.29	3.39
MILC	4	1	0.44	4.70	2.05
UMT	2	4	0.44	4.51	7.88
MiniFE	2	4	0.22	8.86	7.74
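The table above can be reproduced from the raw node counts and times. The following Python is a sketch under the assumptions stated in this example (c = 4 for UMT and MiniFE, arbitrary weights); the data layout and function name are illustrative choices.

```python
import math

HOPPER_NODES = 6384   # N_ref, reference platform
EDISON_NODES = 5576   # N, new platform

# name: (w, c, n_ref, t_ref, n, t), taken from the two tables in this example
APPS = {
    "FLASH":  (1, 1, 512,  331.62,  512,  142.89),
    "GTC":    (4, 1, 1200, 344.10,  400,  266.21),
    "MILC":   (4, 1, 512,  1227.22, 1024, 261.10),
    "UMT":    (2, 4, 512,  270.10,  1024, 59.90),
    "MiniFE": (2, 4, 512,  45.20,   2048, 5.10),
}


def compute_ssi(apps, total_ref, total):
    """Weighted geometric mean of c*U*S over all applications."""
    log_sum = 0.0
    weight_sum = 0.0
    rows = {}
    for name, (w, c, n_ref, t_ref, n, t) in apps.items():
        u = (n_ref / n) * (total / total_ref)   # utilization factor
        s = t_ref / t                           # speedup from execution times
        if s < 1.0:
            raise ValueError(f"{name}: speedup {s:.3f} is below 1.0")
        x = c * u * s
        rows[name] = (u, s, x)
        log_sum += w * math.log(x)
        weight_sum += w
    return math.exp(log_sum / weight_sum), rows


ssi, rows = compute_ssi(APPS, HOPPER_NODES, EDISON_NODES)
for name, (u, s, x) in rows.items():
    print(f"{name:7s} U={u:.2f} S={s:.2f} cUS={x:.2f}")
print(f"SSI = {ssi:.2f}")
```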

Appendix: Which Mean to Use

There are a few excellent references on which Pythagorean mean to use when benchmarking systems.[2,3] Fleming states that the arithmetic mean should NOT be used to average normalized numbers and to use the geometric mean instead. Smith summarizes that “If performance is to be normalized with respect to a specific machine, an aggregate performance measure such as total time or harmonic mean rate should be calculated before any normalizing is done. That is, benchmarks should not be individually normalized first.” However, the SSI metric normalizes each benchmark first and then calculates the geometric mean for the following reasons.

- The geometric mean is best when comparing different figures of merit. One might think that the use of speedup is a single FOM, but for SSI each application’s FOM is independent. Hence we cannot add results together to calculate total time, total work, or total rate, as is recommended by Smith and as would be needed for correctness in the arithmetic and harmonic means.
- The geometric mean normalizes the ranges being averaged so that no single application result dominates the resultant mean. The central tendency of the geometric mean emphasizes this further in that it is always less than or equal to the arithmetic mean.
- The geometric mean is the only mean with the property that the geometric mean of (Xi/Yi) = geometric mean of (Xi) / geometric mean of (Yi). Hence the resultant ranking is independent of which platform is used for normalization when calculating speedup.
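The ratio property can be checked numerically. The sketch below uses made-up times for two hypothetical platforms A and B. It shows that the geometric mean of the ratios equals the ratio of the geometric means, and that the arithmetic mean of normalized results gives a contradictory answer depending on which platform is the reference.

```python
import math


def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))


def arithmetic_mean(values):
    return sum(values) / len(values)


# Hypothetical execution times (seconds) for three benchmarks.
times_a = [10.0, 200.0, 3.0]
times_b = [20.0, 100.0, 3.0]

a_over_b = [a / b for a, b in zip(times_a, times_b)]   # normalized to B
b_over_a = [b / a for a, b in zip(times_a, times_b)]   # normalized to A

gm_ab = geometric_mean(a_over_b)
gm_ba = geometric_mean(b_over_a)
am_ab = arithmetic_mean(a_over_b)
am_ba = arithmetic_mean(b_over_a)

# Geometric mean: consistent regardless of the reference platform.
print(f"GM(A/B) = {gm_ab:.3f}, GM(B/A) = {gm_ba:.3f}")
# Arithmetic mean: both normalizations exceed 1, so each platform
# appears slower than the other.
print(f"AM(A/B) = {am_ab:.3f}, AM(B/A) = {am_ba:.3f}")
```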

References

[1] Cordery, M.J.; B. Austin, H. J. Wasserman, C. S. Daley, N. J. Wright, S. D. Hammond, D. Doerfler, "Analysis of Cray XC30 Performance using Trinity-NERSC-8 benchmarks and comparison with Cray XE6 and IBM BG/Q", PMBS2013: Sixth International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems, November 11, 2013.
[2] Fleming, Philip J.; John J. Wallace, "How not to lie with statistics: the correct way to summarize benchmark results". Communications of the ACM 29 (3): 218–221, 1986.
[3] Smith, James E., "Characterizing computer performance with a single number". Communications of the ACM 31 (10): 1202–1206, 1988.

SNAP
A proxy for the performance of a modern discrete ordinates neutral particle transport application.
HPCG
High Performance Conjugate Gradient benchmark.
PENNANT
A mini-application for 2D unstructured finite element meshes with arbitrary polygons.
MiniPIC
A Particle-In-Cell proxy application that solves the discrete Boltzmann equation in an electrostatic field in an arbitrary domain with reflective walls.
UMT
A proxy application that performs three-dimensional, non-linear, radiation transport calculations using deterministic (Sn) methods.
VPIC
A 3D relativistic, electromagnetic Particle-In-Cell plasma simulation code.
*NOTE: VPIC source is split into 6 files that must be reassembled into a single xz-compressed tar file.
To reassemble: cat vpic_crossroads.tar.xz.* > vpic_crossroads.tar.xz
Branson
A proxy application for the Implicit Monte Carlo method, which models the exchange of radiation with material at high temperatures.

Micro-Benchmarks

The following microbenchmarks will be used in support of specific requirements in the RFP.

DGEMM
The DGEMM benchmark measures the sustained floating-point rate of a single node.
IOR
IOR is used for testing performance of parallel file systems using various interfaces and access patterns.
Mdtest
A metadata benchmark that performs open/stat/close operations on files and directories.
STREAM
The STREAM benchmark measures sustainable memory bandwidth using four simple vector kernels.
MPI Benchmarks
