Reducing memory bandwidth consumption

Last time I talked about how reducing memory bandwidth consumption can be very beneficial to your ABAP performance.

I recommended designing your structures as lean as reasonably possible, but sometimes removing business-relevant information is not an option.

And though an idealistic approach may sound good, just knowing about the drag I produce with wide structures does not improve my situation. What I need is a way to still carry a payload without producing too much drag.

And there actually is one. Let me show you an example:

We perform scans on the first 3 fields in order to find out which version is actually relevant. If we find a hit, we add information to our output table.
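A minimal sketch of the setup (the structure, field names, and types are illustrative assumptions; the point is the wide payload travelling along in every row):

TYPES: BEGIN OF ty_record,
         " the 3 fields used for the scan
         object_id TYPE n LENGTH 10,
         version   TYPE n LENGTH 4,
         status    TYPE c LENGTH 1,
         " wide business payload, carried along in every row
         payload   TYPE c LENGTH 1000,
       END OF ty_record,
       ty_records TYPE STANDARD TABLE OF ty_record WITH EMPTY KEY.

DATA lt_records   TYPE ty_records.
DATA lt_output    TYPE ty_records.
DATA lv_object_id TYPE n LENGTH 10.
DATA lv_version   TYPE n LENGTH 4.

LOOP AT lt_records ASSIGNING FIELD-SYMBOL(<ls_record>)
     WHERE object_id = lv_object_id
       AND version   = lv_version
       AND status    = 'A'.
  " hit: add the relevant information to the output table
  APPEND <ls_record> TO lt_output.
ENDLOOP.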

Runtime sequential:

Runtime parallel (8 cores, 8 WP):

Now from the business side of things, there is nothing here we can simply leave out. And we can't just go and shrink data types that are used system-wide. We could try to move the extra information (not used for the scans) into a separate internal table and read it only when we have a hit. Unfortunately, this is impractical, because in a real-world scenario we are often operating in a bigger ecosystem: we often have fixed inputs (APIs or DB table designs) and some form of fixed outputs (again APIs or DB tables).

So this would require us to first loop over the main table (producing the very drag we wanted to avoid) in order to split the information between the two tables, and then loop over the lean version again, scan it, and read the other table whenever we have a hit. This does not sound very effective or fast. And it isn't.
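Sketched out under the same assumptions as above, the two-table detour would look roughly like this:

" lean scan table plus a separate payload table, linked by row index
TYPES: BEGIN OF ty_lean,
         object_id TYPE n LENGTH 10,
         version   TYPE n LENGTH 4,
         status    TYPE c LENGTH 1,
         idx       TYPE i,                     " link into lt_payload
       END OF ty_lean.
TYPES ty_payload TYPE c LENGTH 1000.

DATA lt_lean    TYPE STANDARD TABLE OF ty_lean    WITH EMPTY KEY.
DATA lt_payload TYPE STANDARD TABLE OF ty_payload WITH EMPTY KEY.

" first loop: exactly the drag we wanted to avoid
LOOP AT lt_records ASSIGNING FIELD-SYMBOL(<ls_rec>).
  APPEND VALUE ty_lean( object_id = <ls_rec>-object_id
                        version   = <ls_rec>-version
                        status    = <ls_rec>-status
                        idx       = sy-tabix ) TO lt_lean.
  APPEND <ls_rec>-payload TO lt_payload.
ENDLOOP.

" second loop: scan the lean table, read the payload only on a hit
LOOP AT lt_lean ASSIGNING FIELD-SYMBOL(<ls_lean>)
     WHERE object_id = lv_object_id
       AND version   = lv_version
       AND status    = 'A'.
  " index access: constant time, but not free
  READ TABLE lt_payload INDEX <ls_lean>-idx ASSIGNING FIELD-SYMBOL(<lv_payload>).
  " ... build the output from <ls_lean> and <lv_payload> ...
ENDLOOP.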

Please also consider that even read statements with constant access time behavior (index, hash) do not come for free. They too have a non-negligible runtime. So this solution doesn't work. Is there another one?

Well yes: try to redesign your input structure (if you are allowed to šŸ˜‰):
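A sketch of what that redesign could look like, using the same assumed names as above. As far as I know, the BOXED addition is only allowed inside class (or interface) definitions and requires a flat substructure, so the types are wrapped in a local class here:

CLASS lcl_data DEFINITION.
  PUBLIC SECTION.
    " payload outsourced into a separate flat structure
    TYPES: BEGIN OF ty_payload_s,
             data TYPE c LENGTH 1000,          " one flat data type
           END OF ty_payload_s.
    " main structure: the 3 scan fields stay inline,
    " the payload becomes a boxed component
    TYPES: BEGIN OF ty_record,
             object_id TYPE n LENGTH 10,
             version   TYPE n LENGTH 4,
             status    TYPE c LENGTH 1,
             payload   TYPE ty_payload_s BOXED,
           END OF ty_record.
ENDCLASS.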

As you can see, I changed the structure layout and implemented a split. The 3 main fields required for the initial scan are still there, but instead of having the payload right behind them, I outsourced it (good outsourcing, who would have thought…) into a separate structure. Now I have to slightly change how the algorithm accesses the data, but this change is actually very local:
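Roughly like this: the access path to the payload just gets one level deeper, everything else stays as it was (still a sketch with assumed names):

DATA lt_records TYPE STANDARD TABLE OF lcl_data=>ty_record WITH EMPTY KEY.

LOOP AT lt_records ASSIGNING FIELD-SYMBOL(<ls_record>)
     WHERE object_id = lv_object_id
       AND version   = lv_version
       AND status    = 'A'.
  " only on a hit do we touch the boxed payload:
  " <ls_record>-payload-data instead of <ls_record>-payload
  DATA(lv_payload) = <ls_record>-payload-data.
  " ... add lv_payload and the key fields to the output ...
ENDLOOP.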

So did this improve runtime?

Yep, it did. But why?

Remember my last post? Remember what gets loaded into cache if you access a structure? And more importantly, what does not?

Because my payload structure is boxed into the main structure, it does not get loaded into cache when I access other parts of the structure. Only if I access it directly is it loaded into cache. And because the payload structure consists of only one flat data type, everything gets loaded then.

Now this is only useful if you do not actually access the payload of most table entries, because an individual access of the payload in this manner is much slower than if it were a simple part of the main structure. This is because of the memory latency you incur by accessing it through a reference (this will be discussed in a separate post).

But for this type of use case it comes in pretty handy: we can reduce drag without actually excluding relevant information. Our main concern in this situation is a meaningful usage of boxed types to control what we want to load, and when.

But regardless of how well we try to reduce memory bandwidth usage, we are fighting an uphill battle:

This is the code after the change in parallel execution, still with 8 cores and 8 WP. The individual execution time increased by 128% compared to the sequential run! We are still way faster than before the change (in both modes), but the more we push the hardware to its limits, the less scalable our code becomes.

So if you push for bigger hardware, more cores, and more WPs, keep in mind that individual WP performance often scales quite poorly. Reducing memory bandwidth consumption through changes in code or data structures may delay this, but eventually it will catch up.


Take care,

Dmitrii