ERGO Code Analysis
Ergo is a quantum chemistry program for large-scale self-consistent field (SCF) calculations using the Hartree-Fock and Kohn-Sham density functional theory methods. The code is parallelised with OpenMP having poor performance results, so the target for the analysis was to identify the main bottlenecks limiting the application's performance. All the traces were obtained on KTH machines selecting runs with 8, 16 and 24 threads. The analysis started looking at the 8 threads runs. ERGO code has many different phases being hard to identify a repetitive structure. Figure 1 shows a global view of the ERGO application. The colouring of the image reports the different parallel functions that are executed. Light blue corresponds to sequential regions where only the master thread is working and the slaves are on an idle loop.

Figure 1: ERGO code structure
At this level the first conclusion is that there is an important part of the code not parallelised. Furthermore, zooming into different areas of the timeline we can see that the situation is even worst as we can see in some of the images from Figure 2.

Figure 2: zooming into the ERGO code structure
It can be clearly seen that some phases are really dominated by sequential parts (light blue) strongly limiting the scalability that can be achieved. Other phases are highly parallelised and we can identify different granularities . In order to compute the real weight of the sequential regions we have to use the statistics view. The next figure measures the weight of each parallel function with respect to the duration of the application run.

Figure 3: Profile of the parallel functions
This figure shows the percentage of time that each thread dedicates to each parallel function. The bottom line displaying the total value reports that only 30% of the execution is dedicated to parallel regions, so by Amdahl's law, no mater on which machine we run ERGO, the maximum speed-up will be always poor, as close to a 70% will have no reduction when increasing the number of threads. On the other hand, 22% of the 30% is concentrated on 3 parallel functions so they should be the target for the improvement as it will have the higher impact on the application performance. Another fact provided by this view is that loops seem well balanced between threads at least at a global level (they represent a similar percentage on the different threads).,
As we have seen there are areas with a good/high percentage of time parallelised while others are dominated by sequential computations.
The next step on the analysis is to look at one of each such areas to measure with the statistic its behaviour. Next figure shows the timeline of the parallel function and its statistics on the runtime states for a highly parallelised phase as well as for a phase dominated by sequential computations.

Figure 4: Analysis of a well parellised and a poor parallelised phases
We can see that on the good case the timeline has some light blue holes that did not appear on the previous view. These holes correspond to the OpenMP runtime. Using the statistics we can measure that 91% of the time is doing parallel computation, a 2% is dedicated to barriers synchronization and 1.5% is overhead due to dynamic balancing, so may be a good idea to try with bigger chuncks or to use guided scheduling to improve the performance. The measurements on the poorly parallelised phase confirmed us that only a very small percentage of time is real parallel computation (around 13%) and the threads spent 85% of its time on the idle loop.
Once we identify the main bottlenecks causing a poor performance with 8 threads, we compared the execution while increasing the number of threads. Figure 5 displays the duration of the computing phases with a Paraver gradient that goes from light green (low values) to dark blue (high values) colouring with orange the areas larger than the scale maximum. The 3 timelines are synchronised on the time scale, so we can see that when we increase the number of threads the duration of the application not only it is not reduced but even increases. We have used a dotted yellow line to distinguish two behaviours during the execution. The region at the left of the dotted line has the same duration for the 3 executions while the region at the right increases its duration. With the gradient colours we can see that most of the computations for this phase are enlarged when the number of threads is increased.

Figure 5: looking at the scalability with the duration of the computation phases (with 8, 16 and 24 threads)
As the duration of the sequential code would keep constant for any number of threads we should look at the degradation on the regions with a high percentage of time in parallel loops. Looking at a highly parallelised region from the right of the dotted line we can obtain the next figure where the timelines are synchronised on the time scale.

Figure 6: looking at the scalability from the parallel functions (with 8, 16 and 24 threads)
We can see that the parallel loops take much longer when the number of threads is increased. Also we see that while in the 8 threads run there was a uniform behaviour of the parallel loops (each colour keeps a very similar duration over the whole timeline), the runs with 16 and 24 threads have two modes for the red and green parallel functions. One of them with a similar enough duration than the 8 threads case, the other one much more larger . These observations should give hints to the code developers or when looking at the sources.
The last figure compares the behaviour of two iterations with respect to the number of instructions. In this case the timelines are not synchronised on the time scale, just on the gradient scale so the same colour in the instruction timelines correspond to areas with a similar number of instructions. Left views correspond to the parallel functions, right views to the instructions executed on the computation regions

Figure 7: looking at the scalability with respect to the number of instructions (with 8, 16 and 24 threads)
We can see that the green parallel function in the second iteration (right side) increases significantly the number of instructions when it is executed with more threads. The function coloured in pink seems to keep a similar number of instructions per chunk, or even reduce the number of instructions. These observations indicate that while the green function may be doing some code replication, the pink one may obtain a lower IPC when the number of threads is increased due to the sharing of node resources like the cache memory.These results were presented to one of Ergo's main developers. He was really interested in understanding why the code performs bad within some specific functions involving matrix operations. Thus, further analyses will be performed focusing in those parts of the application.


