Project Summaries

These summaries describe projects completed by members of our technical staff. They span a wide range of work, from whole-program development and performance tuning to libraries, performance tool development, and training. Our staff is happy to discuss how we can help solve your problems.

 

Problem: Client needed radically improved performance on a ray tracing and solver code.

Solution: Modified the workload distribution algorithm to enable parallel I/O processing of ray trace data. The performance improvement was an order of magnitude in the I/O code section, and the overall run time improvement of the ray trace section was substantial. Subsequently parallelized ray trace file reads in the solver, significantly reducing the I/O time. The FILL portion of the tomography application was reduced from 20 hours to 18 minutes on 60 nodes. The solver was reduced from 6 hours on 60 nodes to 2 hours on 90 nodes.

 

Problem: Client needs to replace custom coded Image Processing primitives with portable SIMD codes for next generation systems.

Solution: Re-imagined, optimized, and implemented image processing algorithms in AltiVec C++ for PowerPC G4 and G5 systems. All codes match or surpass the previous implementations, which were hand-coded in assembly.

 

Problem: The general scaling behavior of client's global ocean model advection scheme is poor.

Solution: The algorithm was coded with model variables in separate arrays, leading to numerous MPI communications for the halo updates (i.e., one call to the communication library for each variable). Developed an intermediate software layer that enabled high levels of message aggregation. The overall scalability of the model improved by almost 2x.
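
The aggregation idea can be sketched roughly as follows. This is an illustration only, not the client's code: the names pack_halos and unpack_halos are hypothetical, plain Python lists stand in for the model arrays, and the single bytes buffer stands in for the one message per neighbor that would be handed to the communication library in place of one message per variable.

```python
import struct

def pack_halos(halos):
    """Pack the halo strips of several variables into one buffer.

    halos: dict mapping variable name -> list of floats (that variable's halo).
    Returns (buffer, layout); layout records each variable's length, in order,
    so the receiver can split the buffer again.
    """
    layout = [(name, len(values)) for name, values in halos.items()]
    flat = [v for values in halos.values() for v in values]
    # "<Nd" = N little-endian doubles with no padding.
    buffer = struct.pack(f"<{len(flat)}d", *flat)
    return buffer, layout

def unpack_halos(buffer, layout):
    """Split one aggregated buffer back into per-variable halo strips."""
    total = sum(n for _, n in layout)
    flat = struct.unpack(f"<{total}d", buffer)
    out, pos = {}, 0
    for name, n in layout:
        out[name] = list(flat[pos:pos + n])
        pos += n
    return out

# Three variables travel in one buffer instead of three separate messages.
halos = {"temperature": [1.0, 2.0], "salinity": [35.0], "u": [0.1, 0.2, 0.3]}
buffer, layout = pack_halos(halos)
restored = unpack_halos(buffer, layout)
```

The trade-off in a sketch like this is an extra copy into and out of the buffer in exchange for fewer, larger messages, which tends to help when per-message latency dominates.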

 

Problem: Client needs to understand intermittent application slowdowns and avoid them.

Solution: Identified linker-dependent cache thrashing issues and specified rules for graphics DSO construction to allow run-time instruction cache optimization.

 

Problem: Performance issues in NAMD resulted in severe runtime degradation.

Solution: Used performance tools, such as pfmon and HPCToolkit, to track a performance bug causing 90% degradation in runtime of NAMD. The issue involved intermittent hangs, and was the result of a wrap-around bug in the fast path kernel timer.
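
The details of the kernel bug are not given here, but the general class of wrap-around error can be sketched as follows. The 32-bit counter width and the function names are assumptions made for illustration only.

```python
COUNTER_BITS = 32                 # assumed width of the hardware counter
MASK = (1 << COUNTER_BITS) - 1    # largest value the counter can hold

def elapsed_naive(start, now):
    """Plain subtraction: goes negative once the counter has wrapped."""
    return now - start

def elapsed_wrap_safe(start, now):
    """Modular subtraction: correct as long as fewer than 2**32 ticks passed."""
    return (now - start) & MASK

# The counter was 5 ticks below its maximum at the start and has since
# wrapped around to 10.
start = MASK - 5
now = 10
naive = elapsed_naive(start, now)
safe = elapsed_wrap_safe(start, now)
```

Code that waits until the naive elapsed value exceeds a timeout can stall for a very long time after a wrap, which is one way a bug of this kind can show up as an intermittent hang.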

 

Problem: Client needs to accelerate a wide variety of numerical applications while staying loyal to open source philosophy and minimizing porting requirements.

Solution: Created a fast, accurate system mathematical library of single- and double-precision elementary functions for MIPS cores, achieving a manyfold performance increase over the system library with better cache behavior. The library is drop-in compatible with the existing namespace, so no source code changes are required, even across multiple language bindings.

 

Problem: Client needs to accelerate applications with significant data movement or string manipulation.

Solution: Created a fast string functions library for MIPS cores, achieving a manyfold performance increase over the system library. Wrote fast memory movement functions exploiting specific architectural features. Ported the functions to kernel space for operating system optimization.

 

Problem: Client needs to demonstrate peak theoretical efficiency on physical science kernels.

Solution: Analyzed and developed Quantum Chromodynamics kernels to run at peak processor efficiency.

 

Problem: Client needs to achieve maximum closed phase loop performance to minimize power and heat requirements for radar system. Client wants to amortize investment across multiple same-platform projects.

Solution: Designed and implemented a generic library of optimized PowerPC fundamental vector functions to meet project specifications. All codes achieve theoretical maximum performance.

 

Problem: Large airframe manufacturer needs to understand scaling performance of complex CFD multigrid application.

Solution: Using the open source performance tools suite and performance database developed by founders of STG, demonstrated conclusively that computational load imbalance played a key role in scaling behavior. Determined problem decompositions with 30-60% better performance at comparable core counts as compared with decompositions supplied by client.
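
One common way to quantify computational load imbalance is the ratio of the slowest rank's compute time to the mean across ranks. The sketch below uses that metric to compare decompositions. It is an assumption-laden illustration, not the tool suite mentioned above: the function names and the timing numbers are hypothetical.

```python
from statistics import mean

def imbalance_factor(times):
    """Slowest rank's time over the mean time; 1.0 means perfectly balanced."""
    return max(times) / mean(times)

def best_decomposition(candidates):
    """Pick the decomposition whose slowest rank finishes first.

    candidates: dict mapping decomposition name -> per-rank compute times.
    With a barrier at the end of each step, the slowest rank sets the pace.
    """
    return min(candidates, key=lambda name: max(candidates[name]))

# Hypothetical per-rank compute times (seconds) for the same total work.
candidates = {
    "client_supplied": [1.0, 1.0, 1.0, 3.0],
    "rebalanced": [1.5, 1.5, 1.5, 1.5],
}
factors = {name: imbalance_factor(t) for name, t in candidates.items()}
choice = best_decomposition(candidates)
```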

 

Problem: Developer of a climate model needs to understand the trade-off between "multi-purpose" programming and performance for key subroutines in a complex ocean dynamics model.

Solution: Streamlined key performance related subroutine as proof of concept. Demonstrated an almost 2x increase in the performance of the core functionality of the subroutine leading to over 10% overall improvement in the test case.

 

Problem: Government client has a parallel implementation for a global ocean model. The memory use characteristics within the complex boundary layer exchange algorithm limit the scalability.

Solution: Existing algorithm "pinned" the exchange decomposition to one side of the boundary layer (i.e. the exchange "grid" decomposition was limited to the decomposition associated with a completely unrelated model component). Re-writing the exchange grid so that it had a completely independent decomposition introduced additional communication, but allowed for load balancing of the exchange grid points. The load balancing greatly reduced the individual process spikes in memory requirements.

 

Problem: Client must retrofit legacy vector platform global ocean model for MPI and RISC processors.

Solution: Worked closely with scientist developers as they re-developed the code. Key contributions included recoding of subroutines to optimize scalar performance as well as crucial debugging of numerous complex problems associated with the parallel programming environment and legacy coding issues.

 

Problem: Industry client wishes to have target customer modeling infrastructure take advantage of CoArray Fortran in an application transparent way.

Solution: Target infrastructure provides MPI communication through a customer developed API. Replaced MPI communication in customer library with CoArray data structures and synchronization semantics. This approach made replacement of MPI with CoArray Fortran completely transparent to the user.

 

Problem: Client needs to improve performance of generalized computational manufacturing tool for high-fidelity numerical algorithms for discrete finite volume solutions of nonlinear systems of PDEs on 3D, unstructured-mesh computational domains.

Solution: Provided a number of coding style optimizations for RISC platforms. Also recoded Krylov solver infrastructure to greatly reduce redundant calculations. Achieved over 2x performance improvement in Krylov solver.

 

Problem: Client wishes platform independent improvements in cache behavior of climate application recently recoded from Fortran77 to Fortran90.

Solution: Performed detailed analysis of the generated assembly code and of runtime hardware counters. Determined that while some cache behavior improvements were possible on some platforms, no approach would provide improvements across platforms, given the differences in memory/cache mapping strategies among vendors.

 

Problem: Client needs an accurate methodology to project application performance from the existing generation of MPP to follow-on generations.

Solution: Developed methodology which measured computation, communication and I/O time associated with application components. Introduced key notion of characterizing runtime components associated with "load imbalance". These components get measured within the context of "communication" due to the explicit and implicit barriers. Correct modeling of the load imbalance component as driven by processor performance rather than communication performance led to superior projection accuracy.
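
The effect of attributing load imbalance to the right component can be sketched with a simple model. The linear scaling form, the class and function names, and all numbers below are assumptions chosen for illustration; the methodology itself is described here only in outline.

```python
from dataclasses import dataclass

@dataclass
class Measured:
    compute: float        # seconds spent computing
    communication: float  # seconds spent actually moving data
    io: float             # seconds spent in I/O
    imbalance: float      # seconds spent waiting at barriers for slower ranks

def project(m, cpu_speedup, net_speedup, io_speedup):
    """Scale barrier wait time with the processor: the wait shrinks only
    as fast as the slow ranks finish their computation."""
    return (m.compute / cpu_speedup
            + m.communication / net_speedup
            + m.io / io_speedup
            + m.imbalance / cpu_speedup)

def project_naive(m, cpu_speedup, net_speedup, io_speedup):
    """Treat barrier wait time as communication, as raw timers report it."""
    return (m.compute / cpu_speedup
            + (m.communication + m.imbalance) / net_speedup
            + m.io / io_speedup)

# Hypothetical measurements and hypothetical next-generation speedups.
m = Measured(compute=100.0, communication=20.0, io=10.0, imbalance=30.0)
projected = project(m, cpu_speedup=2.0, net_speedup=4.0, io_speedup=1.0)
naive = project_naive(m, cpu_speedup=2.0, net_speedup=4.0, io_speedup=1.0)
```

In this made-up case the naive model is too optimistic, because it assumes the faster network will shrink time that is really spent waiting on slower processors.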

 

Problem: Client needs to establish system characteristics in order to understand application performance.

Solution: Performed system qualifications to determine maximum bandwidths and throughputs.

 

Problem: Client needs to leverage compiler technologies in order to provide better performance on underpowered hardware.

Solution: Consulted extensively on the compiler back end for optimal code generation. Identified enhancements for software pipelining, indexed addressing, recurrence breaking, branch bubble elimination, improved streaming, interprocedural analysis (IPA), invariant load hoisting, processor pipeline load balancing, and prefetch management.

 

Problem: Client needs to improve application base for then novel RISC processor based MPI platform.

Solution: Parallelized numerous applications from a wide range of native platforms including vector and various alternate flavors of RISC processor. During development process, implemented various optimizations for client's particular RISC variant. Most of the time, the optimizations provided general performance improvements across the RISC processor set available in the market.

 

Problem: Client needs to improve performance for their vector platform across a range of applications.

Solution: Optimized numerous CFD applications for the client's vector platform often achieving better price / performance and throughput compared to the competition.

 

Problem: Client needs to differentiate its FFT performance from the competition's using standard packages.

Solution: Optimized FFTW internals for better multiply/add balancing, resulting in a 20% multicore 3D FFT performance increase.

 

Problem: Client needs to demonstrate maximum system performance scaling over many cores.

Solution: Developed anisotropic 2D and 3D convolutions scaling at near peak performance over thousands of cores.

 

Problem: Client needs to speed up slow graphics middleware package.

Solution: Optimized run time PHIGS+ traversal to achieve an eightfold performance increase.

 

Problem: Client needs to highlight superior workstation graphics performance.

Solution: Optimized rendering modes to group OpenGL graphics primitives for best use of hardware pipelines.

 

Problem: Client needs help porting a large MCAD application, Dassault Systèmes' CATIA (over 100 million lines, mixed languages).

Solution: Provided the technical lead for the AE team, identifying compiler and graphics subsystem bugs, analyzing and optimizing performance, and charting a path to make SGI the performance leader. The project generated over $500 million in sales for SGI.

 

Problem: Client needs industry's best core mathematical functions.

Solution: Wrote vector libraries of hundreds of functions for various PowerPC processors, achieving “industry's best” status.

 

Problem: At various times, client needs numerical kernels fully optimized for specific applications on various platforms.

Solution: Provided assembly-coded kernels for codes such as Stolt migration, image processing primitives, FFT, custom medical imaging, and convolutions.

 

Problem: Client's customers often see less than optimal performance because of TLB or cache set thrashing.

Solution: Designed and implemented super-malloc function that achieves a 2X performance gain for multiple large buffers by avoiding TLB and cache set conflicts. Designed and implemented run-time cache set conflict checking that picks temporary vectors and vector registers to avoid thrashing low associativity caches.
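
The cache set side of the problem can be sketched as follows. The design of the actual super-malloc is not described here, so this is only one plausible approach: the cache geometry (64-byte lines, 512 sets), the 1 MiB alignment, and the function names are all assumptions. Buffers that start on large power-of-two boundaries all begin in the same cache set, and staggering each start address by a different number of cache lines spreads them out.

```python
def cache_set(addr, line_size=64, num_sets=512):
    """Index of the cache set an address maps to (assumed geometry)."""
    return (addr // line_size) % num_sets

def staggered_bases(n_buffers, buffer_size, line_size=64, num_sets=512,
                    alignment=1 << 20):
    """Return start addresses for n_buffers so each begins in a different set.

    Each buffer is first placed on an aligned boundary (which maps to set 0
    when the alignment is a multiple of line_size * num_sets), then shifted
    by a per-buffer number of cache lines.
    """
    stride_sets = max(1, num_sets // n_buffers)
    bases = []
    next_free = 0
    for i in range(n_buffers):
        aligned = -(-next_free // alignment) * alignment  # round up
        base = aligned + ((i * stride_sets) % num_sets) * line_size
        bases.append(base)
        next_free = base + buffer_size
    return bases

# Four 1 MiB buffers: naive aligned placement versus staggered placement.
naive_sets = [cache_set(i * (1 << 20)) for i in range(4)]
spread_sets = [cache_set(b) for b in staggered_bases(4, 1 << 20)]
```

With a low-associativity cache, the naive placement makes element i of every buffer compete for the same set, while the staggered placement keeps them apart at the cost of a little padding.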

 

Problem: Client needed performance improvements on the SpecHPC benchmark suite.

Solution: Tuned the suite for the IBM SP and IBM SP2 POWER2-based systems. Designed a software instrumentation and hardware monitoring tool for the benchmark team to improve turnaround time on high-pressure procurements.

 

Problem: Different performance tools used different formats and couldn’t interact.

Solution: Invented PAPI, which won an R&D 100 award in 2000. It is the industry standard for hardware performance monitoring and is now used worldwide.

 

Problem: Client wanted a massively parallel distributed time and event based simulation framework, scalable from dozens to hundreds of machines.

Solution: Implemented the entire framework in PVM and tested it on heterogeneous networks of workstations.

 

Problem: Client wanted to dynamically analyze the performance of large parallel codes.

Solution: Designed and implemented a dynamic instrumentation and performance analysis system called DynaProf.

 

Problem: Client was building a massively parallel Level3 router based on PowerPC hardware and needed an instrumentation system to improve the performance.

Solution: Implemented performance monitoring infrastructure including user level tools and support in the real time kernel.

 

Problem: Client needed improved performance of the Parallel Ocean Program on ASCI Blue.

Solution: Worked with the primary developers of POP at LANL, with consulting from GFDL. We more than doubled the performance, from 50 MF/pp to 130 MF/pp, on the SGI Origins. Further modifications with an estimated gain of more than 200 MF were identified, but they were too large to be accepted by the client. Changes made in POP 1 became standard code in POP 2.

 

Problem: Client needed improved performance on Gaussian.

Solution: Worked with a source license for Gaussian to improve its performance on the SGI Origin series. Achieved a 30% general speedup through compilation changes and library replacement.

 

Problem: HPC OEM required tuning of benchmark suite for a procurement related to a government lab engaged in numerical weather simulation.

Solution: Tuned a closed set of benchmarks for a large Alpha ES40 cluster targeted for the lab. Much of the work involved restructuring compute kernels in the applications to avoid denormals and redundant divides, and to delay and cache larger, more complex arithmetic. The procurement was successful and the system was installed.

 

Problem: National lab client needed improved application performance.

Solution: Worked on numerous applications to improve the performance, including but not limited to computational chemistry, high energy physics, computational electromagnetics and atmospheric simulations.

 

Problem: Performance of a closed computational electromagnetics code was highly inconsistent across a range of problem sizes on the client's target platform: 4-way Opteron systems.

Solution: Diagnosed the issue and decreased performance variance to less than 2% across the problem space. Further tuning improved the performance of the core solvers by up to 60% for problem sizes that fit in cache.