Case Studies

The following case studies cover some of our offerings and tuning exercises.

  • Math Library Performance

    The Samara Technology Group math library (libstm) is a fast and accurate drop-in replacement for GNU libm. It comprises the most frequently used elementary functions along with fast versions of ubiquitous functions such as floor. It aliases the existing libm functions, so a simple addition to the link line is all that is necessary for increased performance; no changes to source code are required.

    This case study examines the performance relationships for some, but not all, math functions on the MIPS 5kf core platform. If the performance tools show a bottleneck in a math function, you can predict your application speedup by using these plots. (Note that math library performance may be data-dependent. For this reason, the plots cover input ranges spanning different powers of two.)
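
    One standard way to turn a per-function speedup read from such a plot into a whole-application estimate is Amdahl's law. The sketch below is an illustration only, not part of libstm; the function name predicted_speedup and its parameters are hypothetical.

```c
#include <assert.h>

/*
 * Amdahl's law: overall application speedup when a fraction `f` of the
 * original run time (0 <= f <= 1) is spent in a function that becomes
 * `s` times faster (s > 0). The name and signature are illustrative.
 */
double predicted_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}
```

    For example, if a profile attributed a hypothetical 30% of run time to a function that the plots show running 4.5 times faster, predicted_speedup(0.30, 4.5) gives an overall speedup of about 1.30.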

  • String Library Performance Speedup

    The Samara Technology Group string library, libststr, replaces many of the most commonly used string functions with highly optimized versions. In some cases, such as strcpy, performance depends on the bit pattern of the characters themselves, with the best performance achieved for Western-style text in which every character has its high bit clear. As shown in this case study, the string library delivers at best more than five and a half times the performance of glibc, and in no case less than two and a half times.

  • Timing Small-Granularity Events and Optimizing expf

    It can be difficult to measure accurately very fine-grained events that complete in tens of cycles, because the latency of reading a clock or making a system call exceeds the duration of the event being measured. Interrupts and context switches on multitasking (i.e., non-real-time) operating systems further muddy the picture.

    This study describes a Samara Technology benchmarking harness that uses a large sample space to strip off latencies and eliminate bad data. It then performs statistical analysis to validate the quality of the remaining data and produce very accurate measurements. We use the harness to explore the performance of the Samara Technology Group expf function on MIPS and show a speedup of 4.5X or 13.7X, depending on the input exponent.
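
    The general idea of sampling many times, subtracting clock-read latency, and discarding disturbed samples can be sketched as follows. This is a minimal illustration assuming a POSIX clock_gettime, not the Samara Technology harness itself; the names now_ns, time_event_ns, and tiny_event are hypothetical, and the median stands in for whatever statistical filtering the real harness applies.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Read a monotonic clock in nanoseconds. */
static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

static int cmp_i64(const void *a, const void *b)
{
    int64_t x = *(const int64_t *)a, y = *(const int64_t *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts the array in place. */
static int64_t median_ns(int64_t *s, size_t n)
{
    qsort(s, n, sizeof *s, cmp_i64);
    return s[n / 2];
}

/*
 * Estimate the cost in nanoseconds of one call to fn(arg).
 * Each sample times `batch` back-to-back calls, and also times two
 * adjacent clock reads to estimate the clock-read latency. Taking the
 * median of each series discards samples inflated by interrupts or
 * context switches. Loop and call overhead remain in the result.
 * Returns -1.0 on invalid arguments or allocation failure.
 */
double time_event_ns(void (*fn)(void *), void *arg,
                     size_t samples, size_t batch)
{
    assert(fn != NULL);
    if (samples == 0 || batch == 0)
        return -1.0;

    int64_t *event = malloc(samples * sizeof *event);
    int64_t *empty = malloc(samples * sizeof *empty);
    if (!event || !empty) {
        free(event);
        free(empty);
        return -1.0;
    }

    for (size_t i = 0; i < samples; i++) {
        int64_t t0 = now_ns();
        int64_t t1 = now_ns();          /* clock-read latency only */
        for (size_t j = 0; j < batch; j++)
            fn(arg);
        int64_t t2 = now_ns();
        empty[i] = t1 - t0;
        event[i] = t2 - t1;
    }

    double net = (double)(median_ns(event, samples) -
                          median_ns(empty, samples));
    free(event);
    free(empty);
    return net < 0.0 ? 0.0 : net / (double)batch;
}

/* Hypothetical stand-in workload: one volatile increment. */
void tiny_event(void *arg)
{
    volatile unsigned *counter = arg;
    *counter += 1;
}
```

    Batching several calls per sample amortizes the clock-read cost, which matters when the clock resolution is coarser than the event. A call such as time_event_ns(tiny_event, &counter, 10001, 64) would return an estimate in nanoseconds; converting to cycles requires the core clock frequency.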

  • Improving the Performance of Convolutions

    Many algorithms exhibit poor data locality when translated directly from the underlying formulation, so naive programming, without a detailed understanding of the underlying processor architecture, yields unsatisfactory results. Careful reorganization, however, can produce computations that perform at near peak.

    We illustrate this with a case study of 2D and 3D anisotropic convolutions. These important algorithms run at less than 20% of peak when compiled with either the gcc or PathScale compilers. Rethinking the data usage to enhance locality, combined with careful tiling of the data arrays, produces 3D performance at nearly 90% of peak on an SGI Octane. This performance has been shown to scale over thousands of cores on a SiCortex 5832.
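
    As a rough illustration of what tiling means here, the sketch below computes a 2D convolution over rectangular output tiles so that the input rows touched by one tile can stay in cache while they are reused. It is not the reorganization used in the case study; the function name conv2d_tiled and the tile sizes are assumptions, and whether tiling helps depends on the cache sizes and array dimensions involved.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative tile sizes; real values must be tuned per platform. */
enum { TILE_Y = 32, TILE_X = 64 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/*
 * "Valid" 2D convolution in correlation form, all arrays row-major:
 *   in  is h x w
 *   k   is kh x kw (kh and kw may differ, as in an anisotropic stencil)
 *   out is (h - kh + 1) x (w - kw + 1)
 * The output is traversed tile by tile rather than row by row.
 */
void conv2d_tiled(const float *in, size_t h, size_t w,
                  const float *k, size_t kh, size_t kw, float *out)
{
    assert(in != NULL && k != NULL && out != NULL);
    if (kh == 0 || kw == 0 || h < kh || w < kw)
        return;

    size_t oh = h - kh + 1, ow = w - kw + 1;

    for (size_t ty = 0; ty < oh; ty += TILE_Y) {
        size_t y_end = min_sz(ty + TILE_Y, oh);
        for (size_t tx = 0; tx < ow; tx += TILE_X) {
            size_t x_end = min_sz(tx + TILE_X, ow);
            /* All work for one output tile is done before moving on. */
            for (size_t y = ty; y < y_end; y++) {
                for (size_t x = tx; x < x_end; x++) {
                    float acc = 0.0f;
                    for (size_t i = 0; i < kh; i++)
                        for (size_t j = 0; j < kw; j++)
                            acc += k[i * kw + j] * in[(y + i) * w + (x + j)];
                    out[y * ow + x] = acc;
                }
            }
        }
    }
}
```

    The tiled loop nest produces the same results as the untiled one; only the traversal order changes. The same blocking idea extends to 3D by adding a third tile loop.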

  • Boosting FFTW Performance

    Because of the prohibitive effort and cost associated with developing proprietary FFT implementations, many companies now choose the excellent open-source FFTW package (www.fftw.org). Yet even FFTW may sometimes produce code that does not make full use of micro-architectural resources. This case study shows how applying work from the Vienna University of Technology to produce a more balanced multiply/add ratio improves FFTW performance on the SiCortex platform.

  • Using Pfmon to Find Performance Bottlenecks

    This case study describes how pfmon, from the Samara Technology Group Performance Tools Platform, was used to find cold performance spots in anisotropic convolutions.