# Papers

## Applications of Flexible Spatial and Temporal Discretization Techniques to a Numerical Weather Prediction Model

Numerical weather prediction is carried out by solving a set of governing partial differential equations (PDEs) based on fluid dynamics and thermodynamics principles, together with parameterizations of unresolved important physical processes. The Model for Prediction Across Scales - Atmosphere (MPAS-A) uses Spherical Centroidal Voronoi Tessellations (SCVT) to discretize the atmosphere spatially and an horizontally explicit finite volume method to time-integrate the PDEs. The use of a variable-resolution SCVT mesh, such as the 60km-to-3km mesh with circular region of refinement, can achieve good results in numerical weather prediction. However, MPAS-A uses a globally uniform time-step that needs to satisfy the Courant-Friedrichs-Lewy condition for the smallest mesh cells, and the computational cost would be impractically large if the finest resolution is pushed further down like 1km. We modified the MPAS-A source code to support the use of hierarchical time steps for regions with different grid spacing ranges. This enables small regions of very fine resolution to be incorporated in a global mesh while saving computational cost for the rest of the globe. Moreover, we implemented a software tool for the generation of SCVT meshes with arbitrarily-shaped, hierarchical refinement regions. Both features enable using a global atmospheric model for regional or local high-resolution weather forecasting with affordable computational resources.

Author(s): Chi-Chiu Cheung (ClusterTech Limited), Chi-Yung Tam (Earth System Science Programme, The Chinese University of Hong Kong), Wai-Nang Leung (ClusterTech Limited), Ka-Ki Ng (ClusterTech Limited), and Wai-Pang Sze (ClusterTech Limited)

Domain: Climate, Weather and Earth Sciences

## ChASE - A Distributed Hybrid CPU-GPU Eigensolver for Large-Scale Hermitian Eigenvalue Problems

As modern massively parallel clusters are getting larger with beefier compute nodes, traditional parallel eigensolvers, such as direct solvers, struggle keeping the pace with the hardware evolution and being able to scale efficiently due to additional layers of communication and synchronization. This difficulty is especially important when porting traditional libraries to heterogeneous computing architectures equipped with accelerators, such as Graphics Processing Unit (GPU). Recently, there have been significant scientific contributions to the development of filter-based subspace eigensolver to compute partial eigenspectrum. The simpler structure of these type of algorithms makes for them easier to avoid the communication and synchronization bottlenecks typical of direct solvers. The Chebyshev Accelerated Subspace Eigensolver (ChASE) is a modern subspace eigensolver to compute partial extremal eigenpairs of large-scale Hermitian eigenproblems with the acceleration of a filter based on Chebyshev polynomials. In this work, we extend our previous work on ChASE by adding support for distributed hybrid CPU-multi-GPU computing architectures. Out tests show that ChASE achieves very good scaling performance up to 144 nodes with 526 NVIDIA A100 GPUs in total on dense eigenproblems of size up to 360k.

Author(s): Xinzhe Wu (Jülich Supercomputing Centre), Davor Davidović (Ruđer Bošković Institute), Sebastian Achilles (Jülich Supercomputing Centre), and Edoardo Di Napoli (Jülich Supercomputing Centre)

Domain: Computer Science and Applied Mathematics

## Communication Bounds for Convolutional Neural Networks

Convolutional neural networks (CNNs) are important in a wide variety of machine learning tasks and applications, so optimizing their performance is essential. Moving words of data between levels of a memory hierarchy or between processors on a network is much more expensive than the cost of arithmetic, so minimizing communication is critical to optimizing performance. In this paper, we present new precise lower bounds on data movement for convolutions in both single-processor and parallel distributed memory models, as well as algorithms that outperform current implementations such as Im2Col. We obtain performance figures using GEMMINI, a machine learning accelerator, where our tiling provides improvements between 13% and 150% over a vendor supplied algorithm.

Author(s): Anthony Chen (University of Michigan), James Demmel (UC Berkeley), Grace Dinh (UC Berkeley), Mason Haberle (New York University), and Olga Holtz (UC Berkeley)

Domain: Computer Science and Applied Mathematics

## Distributed-Memory Simulations of Turbulent Flows on Modern GPU Systems using an Adaptive Pencil Decomposition Library

This paper presents a performance analysis of pencil domain decomposition methodologies for three-dimensional Computational Fluid Dynamics (CFD) codes for turbulence simulations, on several large GPU-accelerated clusters.

The performance was assessed for the numerical solution of the Navier-Stokes equations in two codes which require the calculation of Fast-Fourier Transforms (FFT): a tri-periodic pseudo-spectral solver for isotropic turbulence, and a finite-difference solver for canonical turbulent flows, where the FFTs are used in its Poisson solver. Both codes use a newly developed transpose library that automatically determines the optimal domain decomposition and communication backend on each system.

We compared the performance across systems with very different node topologies and available network bandwidth, to show how these characteristics impact decomposition selection for best performance. Additionally, we assessed the performance of several communication libraries available on these systems, such as OpenMPI, IBM Spectrum MPI, Cray MPI, the NVIDIA Collective Communication Library (NCCL), and NVSHMEM. Our results show that the optimal combination of communication backend and domain decompositon is highly system-dependent, and that the adaptive decomposition library is key in ensuring efficient resource usage.

Author(s): Joshua Romero (NVIDIA Corporation), Pedro Costa (University of Iceland), and Massimiliano Fatica (NVIDIA Inc.)

Domain: Physics

## An Effective Physics Simulation Methodology Based on a Data-Driven Learning Algorithm

A methodology of multi-dimensional physics simulations for engineering and scientific applications is investigated based on a data-driven learning algorithm derived from proper orthogonal decomposition (POD). The approach utilizes numerical simulation tools to collect solution data for the problems of interest subjected to parametric variations that may include interior excitations and/or boundary conditions influenced by exterior environments. The POD is applied to process the data and to generate a finite set of basis functions. The problem is then projected from the physical domain onto a mathematical space constituted by its basis functions. The effectiveness of the POD methodology thus depends on the data quality, which relies on the numerical settings implemented in the data collection (or the *training*). The simulation methodology is developed and demonstrated in a dynamic heat transfer problem for an entire CPU and in a quantum eigenvalue problem for a quantum-dot structure. Encouraging findings are observed for the POD simulation methodology in this investigation, including its extreme efficiency, high accuracy and great adaptability. The models constructed by the POD basis functions are even capable of predicting the solution of the problem beyond the conditions implemented in the training with a good accuracy.

Author(s): Lin Jiang (Clarkson University), Martin Veresko (Clarkson University), Yu Liu (Clarkson University), and Ming-Cheng Cheng (Clarkson University)

Domain: Physics

## FIST-HOSVD: Fused In-Place Sequentially Truncated Higher Order Singular Value Decomposition

In this paper, several novel methods of improving the memory locality of the Sequentially Truncated Higher Order Singular Value Decomposition (ST-HOSVD) algorithm for computing the Tucker decomposition are presented. We show how the two primary computational kernels of the ST-HOSVD can be fused together into a single kernel to significantly improve memory locality. We then extend matrix tiling techniques to tensors to further improve cache utilization. This block-based approach is then coupled with a novel in-place transpose algorithm to drastically reduce the memory requirements of the algorithm by overwriting the original tensor with the result. Our approach's effectiveness is demonstrated by comparing the multi-threaded performance of our optimized ST-HOSVD algorithm to TuckerMPI, a state-of-the-art ST-HOSVD implementation. We demonstrate up to ~135x reduction in memory consumption thereby increasing the problem size that can be computed for a given memory allocation by up to ~3x, whilst maintaining comparable runtime performance.

Author(s): Benjamin Cobb (Georgia Institute of Technology), Hemanth Kolla (Sandia National Laboratories), Eric Phipps (Sandia National Laboratories), and Ümit Çatalyürek (Georgia Institute of Technology, Amazon)

Domain: Computer Science and Applied Mathematics

## Flow Field Prediction on Large Variable Sized 2D Point Clouds with Graph Convolution

Computational Fluid Dynamics (CFD) solvers are an important tool to predict the behaviour of fluid flows in many industrial sectors. Thereby, accelerating the solution process without compromisingon geometric accuracy is a major goal in the development of numerical flow solvers and design optimization frameworks. For tackling the issue of complex geometries, this paper utilizes a graph-based machine learning approach for solving the regression problem of stationary fluid flow field prediction. Concretely, a graph convolutional network (GCN) architecture is applied and successfully predicts flow fields around geometrical different objects. As the GCN model operates on the numerical mesh directly, the exact geometry of the object as well as all other properties of the mesh are preserved. It turned out that a GCN is able to predict fluid flow fields of two-dimensional bluff-bodies and NACA airfoils, even considering predictions based on extrapolation. Furthermore, this approach is able to handle different sizes of point clouds up to 50k points. Finally, using the predicted flow field as an initial flow distribution for a CFD simulation, showed a decreased solver runtime in some cases.

Author(s): Sebastian Strönisch (TU Dresden), Marcus Meyer (Rolls-Royce Deutschland), and Christoph Lehmann (TU Dresden)

Domain: Engineering

## MINT: A Library that Conserves Lateral Fluxes when Interpolating Vector Fields

A C++/Python interpolation and regridding package (MINT) is presented, which is suited for vector fields with staggered components. MINT implements an interpolation method that conserves line and flux integrals in two dimensions. Mathematical identities such as curl grad = div curl = 0 are preserved to near machine accuracy after regridding.

The method is immune to grid singularities, supports partially valid cells and the grid can be an unstructured collection of cells of distorted quadrilateral cells. MINT is being developed as part of the next generation weather and climate forecasting, exascale LFRic effort, to assist scientists with the computation of fluxes and flows across arbitrary large, lateral areas.

Author(s): Alexander Pletzer (NIWA, NeSI), Erik Behrens (NIWA), and Bill Little (UK Met Office)

Domain: Climate, Weather and Earth Sciences

## On the Construction of AMG Prolongation through Energy Minimization

Algebraic Multigrid (AMG) is a very popular iterative method used in several applications. This wide spread is due to its effectiveness in solving linear systems arising from PDEs discretization. The key feature of AMG is its optimality, i.e., the ability to guarantee a con- vergence rate independent of the mesh size for different problems. This is obtained through a good interplay between the smoother and the interpolation. Unfortunately, for difficult problems, such as those arising from structural mechanics or diffusion problems with large jumps in the coefficients, standard smoothers and inter- polation techniques are not enough to ensure a fast convergence. In these cases, an improved prolongation operator is required to enhance the AMG effectiveness. In this talk, we present an updated prolongation according to an energy minimization criterion, and show how this minimization can be seen as a constrained mini- mization problem. In details, we have that the constraint is twofold: the prolongation must be sparse, and its range must represent the operator near-kernel. To solve this problem, we propose two strategies: a restricted Krylov subspace iterative procedure and the null-space method. Both approaches can be preconditioned to speed up the setup time. Finally, thanks to some numerical experiments, we demonstrate how the convergence rate can be significantly increased at a reasonable setup cost.

Author(s): Carlo Janna (M3E s.r.l., University of Padova), Giovanni Isotton (M3E s.r.l.), and Andrea Franceschini (University of Padova)

Domain: Computer Science and Applied Mathematics

## Parallel Memory-Efficient Computation of Symmetric Higher-Order Joint Moment Tensors

The decomposition of higher-order joint cumulant tensors of spatio-temporal data sets is useful in analyzing multi-variate non-Gaussian statistics with a wide variety of applications (e.g. anomaly detection, independent component analysis, dimensionality reduction). Computing the cumulant tensor often requires computing the joint moment tensor of the input data first, which is very expensive using a naïve algorithm. The current state-of-the-art algorithm takes advantage of the symmetric nature of a moment tensor by dividing it into smaller cubic tensor blocks and only computing the blocks with unique values and thus reducing computation. We propose a refactoring of this algorithm by posing its computation as matrix operations, specifically Khatri-Rao products and standard matrix multiplications. An analysis of the computational and cache complexity indicates significant performance savings due to the refactoring. Implementations of our refactored algorithm in Julia show speedups up to 10x over the reference algorithm in single processor experiments. We describe multiple levels of hierarchical parallelism inherent in the refactored algorithm, and present an implementation using an advanced programming model that shows similar speedups in experiments run on an NVIDIA GPU.

Author(s): Zitong Li (Wake Forest University), Hemanth Kolla (Sandia National Laboratories), and Eric Phipps (Sandia National Laboratories)

Domain: Computer Science and Applied Mathematics

## Parallel Space-Time Likelihood Optimization for Air Pollution Prediction on Large-Scale Systems

Gaussian geostatistical space-time modeling is an effective tool for performing statistical inference of data evolving in space and time, generalizing spatial modeling alone at the cost of the greater complexity of operations and storage, and pushing geostatistical modeling even further into the arms of high-performance computing. It makes inferences for missing data by leveraging space-time measurements of one or more fields. We propose a high-performance implementation of a space-time model for large-scale systems using a two-level parallelization technique. At the inner level, we rely on parallel linear algebra libraries and runtime systems to perform complex matrix operations required to evaluate the maximum likelihood estimation (MLE). At the outer level, we parallelize the optimization process using a distributed implementation of the particle swarm optimization (PSO) algorithm. At this level, parallelization is accomplished using MPI sub-communicators, such that the nodes in each sub-communicator perform a single MLE iteration at a time. We assess the accuracy of the implemented space-time model on large-scale synthetic space-time datasets. Moreover, we use the proposed implementation to model two air pollution datasets from the Middle East and US regions with 550 spatial locations X 730 time slots and 945 spatial locations X 500 time slots, respectively. The evaluation shows that the proposed implemntation satisfies high prediction accuracy on both synthetic and real particulate matter (PM) datasets in the context of the air pollution problem. We achieve up to 757.16 TFLOPS/s using 1024 nodes (75% of the peak performance) using 490K geospatial locations on a Cray XC40 system.

Author(s): Mary Lai Salvana (King Abdullah University of Science and Technology), Sameh Abdulah (King Abdullah University of Science and Technology), Hatem Ltaief (King Abdullah University of Science and Technology), Ying Sun (King Abdullah University of Science and Technology), Marc Genton (King Abdullah University of Science and Technology), and David Keyes (King Abdullah University of Science and Technology)

Domain: Climate, Weather and Earth Sciences

## Porting Uintah to Heterogeneous Systems

The Uintah Computational Framework is being prepared to make portable use of forthcoming exascale systems, initially the DOE Aurora system through the Aurora Early Science Program. This paper describes the evolution of Uintah to be ready for such architectures. A key part of this preparation has been the adoption of the Kokkos performance portability layer in Uintah. The sheer size of the Uintah codebase has made it imperative to have a representative benchmark. The design of this benchmark and the use of Kokkos within it is discussed. This paper extends recent work with new scaling studies run 24x further than earlier studies. Results are shown for two benchmarks executing workloads representative of typical Uintah applications. These results demonstrate single-source portability across the DOE Summit and NSF Frontera systems with good strong-scaling characteristics. The challenge of extending this approach to anticipated exascale systems is also considered.

Author(s): John Holmen (University of Utah), Martin Berzins (University of Utah), and Damodar Sahasrabudhe (University of Utah)

Domain: Engineering

## Reducing Communication in the Conjugate Gradient Method: A Case Study on High-Order Finite Elements

Currently, a major bottleneck for several scientific computations is communication, both communication between different processors, so-called horizontal communication, and vertical communication between different levels of the memory hierarchy. With this bottleneck in mind, we target a notoriously communication-bound solver at the core of many high-performance applications, namely the conjugate gradient method (CG). To reduce the communication we present lower bounds on the vertical data movement in CG and go on to make a CG solver with reduced data movement. Using our theoretical analysis we apply our CG solver on a high-performance discretization used in practice, the spectral element method (SEM). Guided by our analysis, we show that for the Poisson equation on modern GPUs we can improve the performance by 30% by both rematerializing the discrete system and by reformulating the system to work on unique degrees of freedom. In order to investigate how horizontal communication can be reduced, we compare CG to two communication-reducing techniques, namely communication-avoiding and pipelined CG. We strong scale up to 4096 CPU cores and showcase performance improvements of upwards of 70% for pipelined CG compared to standard CG when applied on SEM at scale. We show that in addition to improving the scaling capabilities of the solver, initial measurements indicate that the convergence of SEM is largely unaffected by pipelined CG.

Author(s): Martin Karp (KTH Royal Institute of Technology), Niclas Jansson (KTH Royal Institute of Technology), Artur Podobas (KTH Royal Institute of Technology), Philipp Schlatter (KTH Royal Institute of Technology), and Stefano Markidis (KTH Royal Institute of Technology)

Domain: Computer Science and Applied Mathematics

## Scalable Low-Rank Factorization using a Task-Based Runtime System with Distributed Memory

We present a new parallel task-based algorithm for randomized low-rank factorizations of a matrix and its application to fast hierarchical solvers. TaskTorrent is a lightweight, distributed, task-based runtime in C++. We explain how the randomized factorization can be expressed as a task graph in TaskTorrent. We compare the key steps in the factorization with similar implementations using StarPU, PaRSEC, and in ScaLAPACK. The overall randomized algorithm in TaskTorrent outperforms the column pivoted QR routine from ScaLAPACK by far. The weak and strong scaling experiments demonstrate excellent scalability up to our largest available configuration with 1,024 cores. Finally, we incorporate our distributed low-rank factorization into spaND, a fast hierarchical solver for solving large sparse linear systems. We demonstrate gains in performance and scalability of the spaND solver.

Author(s): Abeynaya Gnanasekaran (Stanford University), and Eric Darve (Stanford University)

Domain: Computer Science and Applied Mathematics

## A Task Programming Implementation for the Particle in Cell Code Smilei

An implementation of the electromagnetic Particle in Cell loop in the code Smilei using task programming is presented. Through OpenMP, the macro-particles operations are formulated in terms of tasks. This formulation allows asynchronous execution respect- ing the data dependencies of the macro-particle operations, the most time-consuming part of the code in simulations of interest for plasma physics. Through some benchmarks it is shown that this formulation can help mitigating the load imbalance of these oper- ations at the OpenMP thread level. The improvements in strong scaling for load-imbalanced physical cases are discussed.

Author(s): Francesco Massimo (Maison de la Simulation), Mathieu Lobet (Maison de la Simulation), Julien Derouillat (LSCE), Arnaud Beck (Laboratoire Leprince-Ringuet), Guillaume Bouchard (Laboratoire Leprince-Ringuet), Fréderic Pérez (LULI), Mickael Grech (LULI), and Tommaso Vinci (LULI)

Domain: Physics

## Toward a Big Data Analysis System for Historical Newspaper Collections Research

The availability and generation of digitized newspaper collections have provided researchers in several domains with a powerful tool to advance their research. More specifically, digitized historical newspapers give us a magnifying glass into the past. In this paper, we propose a scalable and customizable big data analysis system that enables researchers to study complex questions about our society as depicted in news media for the past few centuries by applying cutting-edge text analysis tools to large historical newspaper collections. We discuss our experience with building a preliminary version of such a system, including how we have addressed the following challenges: processing millions of digitized newspaper pages from various publications worldwide, which amount to hundreds of terabytes of data; applying article segmentation and Optical Character Recognition (OCR) to historical newspapers, which vary between and within publications over time; retrieving relevant information to answer research questions from such data collections by applying human-in-the-loop machine learning; and enabling users to analyze topic evolution and semantic dynamics with multiple compatible analysis operators. We also present some preliminary results of using the proposed system to study the social construction of juvenile delinquency in the United States and discuss important remaining challenges to be tackled in the future.

Author(s): Sandeep Puthanveetil Satheesan (University of Illinois Urbana-Champaign), Bhavya Bhavya (University of Illinois Urbana-Champaign), Adam Davies (University of Illinois Urbana-Champaign), Alan B. Craig (Extreme Science and Engineering Discovery Environment), Yu Zhang (California State University, Fresno), and ChengXiang Zhai (University of Illinois Urbana-Champaign)

Domain: Humanities and Social Sciences

## Towards Data-Driven Inference of Stencils for Discrete Differential Operators

Finite element, finite difference or similar methods allow to numerically solve ordinary (ODEs) and partial differential equations (PDEs).

The backbone of all these discrete approaches are matrices with non-zero coefficients for close-by degrees of freedom -the so called stencil- which approximate the underlying differential operators.

We apply comprehensive regression techniques, to infer stencil coefficients for linear, one- and two-dimensional PDEs. Starting with the 1D case, we discuss how mathematically meaningful stencil coefficients can be obtained via an ordinary least-squares (OLS) linear regression if a full-rank matrix of explanatory variables is given. We continue by demonstrating how regularization techniques allow for the recovery of the correct stencil coefficients from data even for singular matrices.

We discuss the impact of noise on the data in both explanatory and response variables for various noise levels and compare different errors-in-variables approaches to mitigate the inherent consequences of noisy independent variables in OLS regression. We extend our considerations to the two-dimensional case and higher-order expansions of the differential operator and compare the stencils obtained from our regression approach with their analytic solutions. Finally, we demonstrate both the relevance as well as the applicability of the presented technique by applying it to two scenarios from physics and hydrogeology.

Author(s): Yannis Schumann (Helmut Schmidt Universität), and Philipp Neumann (Helmut Schmidt Universität)

Domain: Computer Science and Applied Mathematics