ABAP Memory Latency & Hardware Prefetching

In another post I showed how using boxed types for certain use cases can reduce your memory bandwidth consumption and improve performance. But I also pointed out that this technique is to be used with caution and only if you access these boxed types rather rarely. Today I want to show you why I said that – I want to show you the effects of memory latency and hardware prefetching.

Let us start off with a simple example setup:
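A minimal sketch of what such a setup could look like (names, field layout and sizes are assumptions for illustration, not the original code):

" Flat variant: the payload is an ordinary substructure of the line type.
TYPES: BEGIN OF ty_payload,
         amount   TYPE p LENGTH 15 DECIMALS 2,
         quantity TYPE i,
         text     TYPE c LENGTH 40,
       END OF ty_payload.

TYPES: BEGIN OF ty_line,
         id      TYPE i,
         payload TYPE ty_payload,   " flat component, lives inside the line
       END OF ty_line.

DATA lt_data TYPE STANDARD TABLE OF ty_line WITH EMPTY KEY.

" Fill 1 million lines.
DO 1000000 TIMES.
  APPEND VALUE #( id      = sy-index
                  payload = VALUE #( amount = sy-index quantity = sy-index ) )
         TO lt_data.
ENDDO.

" Traverse the table and modify the payload of every line.
LOOP AT lt_data ASSIGNING FIELD-SYMBOL(<ls_line>).
  <ls_line>-payload-amount   = <ls_line>-payload-amount + 1.
  <ls_line>-payload-quantity = <ls_line>-payload-quantity * 2.
ENDLOOP.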

We traverse an internal table with 1 million lines. Each line has a payload structure which we access in the loop to modify its values. Initial runtime:

So far nothing exciting. Now let us try what I advised not to do: Turn the payload structure into a boxed one.
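In the sketch above, the only change would be the BOXED addition on the payload component (again an assumption about what the original looked like):

" Boxed variant: same payload type, but declared as a static box.
TYPES: BEGIN OF ty_line_boxed,
         id      TYPE i,
         payload TYPE ty_payload BOXED,   " now stored behind an internal reference
       END OF ty_line_boxed.

" The loop body stays syntactically identical:
"   <ls_line>-payload-amount = <ls_line>-payload-amount + 1.
" but every access now goes through the reference of the box.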

And now we get this:

Now THAT hurts! A runtime increase of 551.39%! But what causes this? Why is it so much slower?

Because we now access the payload through a reference. The very thing that enables the lower memory bandwidth consumption of a boxed type (which is essentially the same flat type, just placed behind a reference) can also be its weak point.

Your hardware is trying its best to help your code run fast, and one of the things it does is called prefetching: it tries to predict which data you will access next and loads that data into your cache ahead of time. Because memory is organized and read in a contiguous fashion, the prefetcher works along those lines as well. This helps a lot when you traverse an internal table with a flat line structure, because the access pattern aligns with the memory layout.

But references are a problem here, because they are not very predictable. You know where the pointers themselves are located (they are part of the line type of the internal table), but you do not know what they point at. Only after a reference has been read (fetched from RAM, transferred into the L1 cache) do you know where its target lives. Only then can you access the data the reference points to (again fetched from RAM, transferred into the L1 cache). This is slow not only because the data cannot be prefetched effectively, but also because you are jumping around in RAM. An access to RAM costs roughly 100 times more than an access to the L1 cache, and you pay that price at least twice for a single reference. Some classes are designed in a way where you are mostly busy jumping around in RAM from pointer to pointer to pointer; this can produce absurdly slow software. Another feature you get on top is software that barely scales.

Why? Because your cores are doing nothing while you jump around in RAM (see the sketch below). Your processor stalls and there is nothing you can do about that except a redesign. If you throw more hardware at the problem, performance will hardly increase: processing power was never the issue here, so more of it will not solve anything.
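To make the pointer chasing concrete, here is an illustrative sketch using plain data references (not boxed components, but the access pattern is the same in spirit):

" Each line only stores a reference; the target can live anywhere in RAM.
TYPES: BEGIN OF ty_node,
         id      TYPE i,
         payload TYPE REF TO data,
       END OF ty_node.

DATA lt_nodes TYPE STANDARD TABLE OF ty_node WITH EMPTY KEY.

LOOP AT lt_nodes ASSIGNING FIELD-SYMBOL(<ls_node>).
  " 1st memory access: read the reference itself (part of the table line).
  " 2nd memory access: follow it to wherever the payload was allocated.
  ASSIGN <ls_node>-payload->* TO FIELD-SYMBOL(<lv_payload>).
  " ... work with <lv_payload> ...
ENDLOOP.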

Take care,

Dmitrii

Reducing memory bandwidth consumption

Last time I talked about how reducing memory bandwidth consumption can be very beneficial to your ABAP performance.

I recommended designing your structures as lean as reasonably possible, but sometimes removing business-relevant information is not an option.

And though an idealistic approach may sound good, just knowing about the drag I produce with wide structures does not improve my situation. What I need is a way to still carry a payload without producing too much drag.

And there actually is one. Let me show you an example:

We perform scans on the first 3 fields in order to find out which version is actually relevant. If we find a hit, we add the information to our output table.
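A minimal sketch of what this could look like (all names, types and the exact relevance check are assumptions for illustration):

" Wide input line: three scan fields followed by a payload that is rarely needed.
TYPES: BEGIN OF ty_input,
         object_id TYPE i,
         version   TYPE n LENGTH 4,
         valid_to  TYPE d,
         payload   TYPE c LENGTH 200,   " business information we must keep
       END OF ty_input.

DATA lt_input  TYPE STANDARD TABLE OF ty_input WITH EMPTY KEY.
DATA lt_output TYPE STANDARD TABLE OF ty_input WITH EMPTY KEY.
DATA lv_object_id   TYPE i VALUE 42.
DATA lv_min_version TYPE n LENGTH 4 VALUE '0003'.

LOOP AT lt_input ASSIGNING FIELD-SYMBOL(<ls_input>).
  " Scan only the first three fields to decide whether this version is relevant.
  IF <ls_input>-object_id = lv_object_id
     AND <ls_input>-version >= lv_min_version
     AND <ls_input>-valid_to >= sy-datum.
    " Hit: take the information over into the output table.
    APPEND <ls_input> TO lt_output.
  ENDIF.
ENDLOOP.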

Runtime sequential:

Runtime parallel (8 cores, 8 WP):

Now from the business side of things, there is nothing here we can simply leave out. And we can’t just go and shrink data types that are used system-wide. We could try to move the extra info (not used for the scans) into a separate internal table and read it only in case we have a hit. Unfortunately this is impractical, because in a real-world scenario we are often operating in a bigger ecosystem: we often have fixed inputs (APIs or DB table designs) and some form of fixed outputs (again APIs or DB tables).

So this would require us to first loop over the main table (producing exactly the drag we wanted to avoid) in order to split the information between the two tables, and to then loop over the lean version, scan it and read the other table whenever we have a hit; see the sketch below. This does not sound very effective or fast. And it isn’t.
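For completeness, a sketch of this rejected approach, reusing lt_input and the names from the sketch above (again my own illustration):

" Lean scan table plus a separate hashed table for the extra information.
TYPES: BEGIN OF ty_scan,
         object_id TYPE i,
         version   TYPE n LENGTH 4,
         valid_to  TYPE d,
       END OF ty_scan.
TYPES: BEGIN OF ty_extra,
         object_id TYPE i,
         version   TYPE n LENGTH 4,
         payload   TYPE c LENGTH 200,
       END OF ty_extra.

DATA lt_scan  TYPE STANDARD TABLE OF ty_scan WITH EMPTY KEY.
DATA lt_extra TYPE HASHED TABLE OF ty_extra WITH UNIQUE KEY object_id version.

" First pass: split the wide input table (this alone produces the drag we
" wanted to avoid, because every wide line is touched once).
LOOP AT lt_input ASSIGNING FIELD-SYMBOL(<ls_in>).
  APPEND VALUE #( object_id = <ls_in>-object_id
                  version   = <ls_in>-version
                  valid_to  = <ls_in>-valid_to ) TO lt_scan.
  INSERT VALUE #( object_id = <ls_in>-object_id
                  version   = <ls_in>-version
                  payload   = <ls_in>-payload ) INTO TABLE lt_extra.
ENDLOOP.

" Second pass: scan the lean table, read the extra info only on a hit
" (relevance check shortened here).
LOOP AT lt_scan ASSIGNING FIELD-SYMBOL(<ls_scan>) WHERE valid_to >= sy-datum.
  READ TABLE lt_extra ASSIGNING FIELD-SYMBOL(<ls_extra>)
       WITH TABLE KEY object_id = <ls_scan>-object_id
                      version   = <ls_scan>-version.
  IF sy-subrc = 0.
    " ... build the output line from <ls_scan> and <ls_extra> ...
  ENDIF.
ENDLOOP.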

Please also consider that even read statements with constant access time behaviour (index or hash access) do not come for free. They too have a non-negligible runtime. So this solution does not work. Is there another one?

Well yes, try to redesign your input structure (if you are allowed to 😉):

As you can observe, I changed the structure layout and implemented a split. The 3 main fields required for the initial scan are still there, but instead of having the payload right behind them, I outsourced it (good outsourcing, who would have thought…) into a separate structure. Now I have to slightly change the way the algorithm accesses the payload, but this change is very local.
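A sketch of the redesigned line type, again with illustrative names (the payload becomes a boxed substructure):

" The three scan fields stay flat; the payload is outsourced into a boxed
" substructure and is only loaded into cache when accessed directly.
TYPES: BEGIN OF ty_payload_box,
         payload TYPE c LENGTH 200,
       END OF ty_payload_box.

TYPES: BEGIN OF ty_input_split,
         object_id TYPE i,
         version   TYPE n LENGTH 4,
         valid_to  TYPE d,
         extra     TYPE ty_payload_box BOXED,
       END OF ty_input_split.

" The only local change in the algorithm: the payload sits one level deeper.
" Old access:  <ls_input>-payload
" New access:  <ls_input>-extra-payload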

So did this improve runtime?

Yep, it did. But why?

Remember my last post? Remember what gets loaded into cache when you access a structure? And more importantly, what does not?

Because my payload structure is boxed into the main structure, it does not get loaded into cache when I access other parts of the structure. Only when I access it directly does it get loaded into cache. And because the payload structure contains nothing but flat data, everything in it gets loaded at that point.

Now this is only useful if you do not actually access the payload of most table entries, because an individual access of the payload in this manner is far slower than if it were a plain part of the main structure. This is because of the memory latency you pay when accessing data through a reference (this will be discussed in a separate post).

But for this type of use case it comes in pretty handy: we can reduce drag without actually excluding relevant information. Our main concern in this situation is simply a meaningful usage of boxed types to control what we want to load and when.

But regardless of how well we try to reduce memory bandwidth usage, we are fighting an uphill battle:

This is the code after the change in parallel execution, still 8 cores and 8 WP. The individual execution time increased by 128%! We are still way faster than before the change (in both modes), but the more we push the hardware to its limits, the less scalable our code becomes.

So when you push for bigger hardware, more cores and more WP, keep in mind that you often scale quite poorly when it comes to individual WP performance. Reducing memory bandwidth consumption through changes in code or data structures may delay this point, but eventually it will catch up.

 

Take care,

Dmitrii

 

 

Improving performance before the squeeze

Improving performance only comes up as a relevant topic when things have gotten really bad.

When organisations are in a squeeze situation, that one thing which was disregarded in the past suddenly becomes topic number one.

The first squeeze

Unfortunately a squeeze situation puts you in a very weak negotiating position. And when reality catches up, everybody realizes that performance tuning is not something you do on a weekend right before go-live. Sometimes these situations are “solved” by throwing a huge amount of expensive hardware at the problem. Combined with collecting the quick wins (which are in essence developer-created fuck-ups, because you cannot analyze and fix a complex design-driven problem in 2 days), the situation can sometimes be brought to a point where the pain is just below the critical point of collapse. This is then called a “rough” and “exciting” or “engaging” go-live.

The resurrection

Years later these applications (breathing imitations of Frankenstein’s monster) rise up again to cause yet another “engaging” experience. This is typically the point where everyone has adapted to everything; the run department knows that the world is not perfect but has accepted the situation.

But now the application is so slow in production that people call for help and solutions. Unfortunately this is often again a squeeze situation (if it weren’t considered urgent by management it would not have been brought up), so the drama develops again. Experts are invited to fix it “as fast as you can” and new hardware is being ordered (despite nobody having previously analyzed that application in depth). But now things are different, actually worse. You do not have any low-hanging fruit; those were collected by the initial go-live fire department. So we have an application with almost exclusively design-driven performance problems which require an in-depth analysis, without the time to do one. In addition, you do not have a fresh application anymore: years in production have added a ton of changes, workarounds and features. Summed up: piece of cake!

Change your perception of performance

This does not have to be the way things develop, really! I know my view on performance tuning differs from that of my colleagues, but what I also try to communicate is that performance is a currency. A currency you can stack up when it is cheap to do so.

You can save up for unfavourable developments in hardware prices (anyone checked RAM prices recently?), for new features you would like to design and implement, or simply to have a buffer you can use on the operational level to give your organisation the ability to recover from unexpected errors without appearing in the news. After all, people do not often complain about the horsepower reserves in their car, even if they do not use them every day.

Now really good performance tuning results do require effort and the right people, but since you will probably have to do it at some point in time, do it before you actually get into a squeeze situation. Quality work takes time, and you do not have that luxury in a squeeze.

 

Take care,

Dmitrii

ABAP Performance – Memory Bandwidth

Today I would like to show you an effect that is often underrated or even ignored when trying to improve application performance: memory bandwidth.

Just like what gets measured gets managed, what is not measured will not be managed. Or did you recently discover a KPI in your SAT trace named “Memory Bandwidth – Time consumed”?

Let me start this with a simple example:

And this is the function we call:

So we basically traverse an internal table with 100k lines and do just one simple addition per line. I inserted logic to push the relevant internal table out of cache before traversing it; I want to show the effects of memory bandwidth, not cache behaviour…
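A minimal sketch of such a measurement (names, sizes and the cache eviction logic are assumptions for illustration, not the original program and function module):

TYPES: BEGIN OF ty_line,
         key1   TYPE i,
         value1 TYPE i,
       END OF ty_line.

DATA lt_table TYPE STANDARD TABLE OF ty_line WITH EMPTY KEY.
DATA lt_evict TYPE STANDARD TABLE OF ty_line WITH EMPTY KEY.
DATA lv_sum   TYPE i.
DATA lv_start TYPE i.
DATA lv_end   TYPE i.

" Fill the table with 100k lines.
DO 100000 TIMES.
  APPEND VALUE #( key1 = sy-index value1 = sy-index ) TO lt_table.
ENDDO.

" Push lt_table out of the CPU caches by touching a much larger dummy table.
DO 2000000 TIMES.
  APPEND VALUE #( key1 = sy-index value1 = sy-index ) TO lt_evict.
ENDDO.
LOOP AT lt_evict ASSIGNING FIELD-SYMBOL(<ls_evict>).
  <ls_evict>-value1 = <ls_evict>-value1 + 1.
ENDLOOP.

" Measured part: one simple addition per line.
GET RUN TIME FIELD lv_start.
LOOP AT lt_table ASSIGNING FIELD-SYMBOL(<ls_line>).
  lv_sum = lv_sum + <ls_line>-value1.
ENDLOOP.
GET RUN TIME FIELD lv_end.
DATA(lv_runtime) = lv_end - lv_start.   " runtime in microseconds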

So this provides us with an average execution time of:

Now I change the code. Well, actually I only change the data structure of the internal table we traverse.

I added a new field named info. It simulates the usual additional payload you may not always use but still keep in your structure, just in case you need to access it. The rest has not changed: same internal table, still 100k lines and just one addition.
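In the sketch above, the only change would be the additional component in the line type (the length of info is an assumption):

" Widened line type: the info field is never touched in the loop,
" but it still travels over the memory bus with every line.
TYPES: BEGIN OF ty_line_wide,
         key1   TYPE i,
         value1 TYPE i,
         info   TYPE c LENGTH 100,
       END OF ty_line_wide.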

But now we get this:

Execution time increased by 136.15%. And we have not even added fancy new routines to operate on that new component…

So now let me provide some very basic background:

When you access a part of a structure, you load all of its components from RAM into your L1 cache (referenced components are an exception: for those you only load the pointer). Touch one, get all.

But why does this even make a difference?

 

Memory systems in general have become the bottleneck of modern multi-core and multi-socket systems. You either suffer from bandwidth restrictions (like in our example above) or you run into problems with memory latency (this will be a separate post).

And now take into account that this was just sequential – with exclusive machine access.

Let us therefore test this under more realistic circumstances, 12 cores with 12 WP:
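One way to run the same measurement in several work processes at once is asynchronous RFC; a sketch, assuming the measured loop is wrapped in a hypothetical RFC-enabled function module Z_MEASURE_BANDWIDTH (exception handling omitted for brevity):

DATA lv_task TYPE c LENGTH 8.

" Start one task per work process; each task runs the measurement loop.
DO 12 TIMES.
  lv_task = |TASK{ sy-index }|.
  CALL FUNCTION 'Z_MEASURE_BANDWIDTH'
    STARTING NEW TASK lv_task
    DESTINATION IN GROUP DEFAULT.
ENDDO.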

Runtime per work process – before the modification

 

Runtime per work process – after the modification

 

Now the relative runtime increase stays the same, but we are still wasting a lot of runtime here. Scale this up with more data, more business logic and more hardware and this will start to really hurt.

The lesson I learned from this is to keep my data structures as lean as I reasonably can. Often there is no need to run around searching for single bytes to save (except maybe in really hot and tight loops), but big structures cost you performance even if you are just traversing through them.

Take care,

Dmitrii