Tuning of ABAP DELETE performance

Today I would like to discuss how you can improve your ABAP DELETE performance. As there are different kinds of solutions to this problem, I will work with one example setup throughout the whole post, improving the performance step by step.

Initial setup

Our initial setup deals with deleting all of the non-essential datasets from our data stream. In order for a dataset to be relevant and not be deleted, it has to be validated.

And this is what the data structure looks like:
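
The original listing is not reproduced here, but a minimal sketch of how such a dataset structure and its internal table could be declared might look like this (field names and lengths are my assumptions, not the original definition):

TYPES: BEGIN OF ty_dataset,
         id      TYPE i,            " running number of the dataset
         payload TYPE c LENGTH 200, " free-text payload that is scanned for the pattern
       END OF ty_dataset.

TYPES ty_datasets TYPE STANDARD TABLE OF ty_dataset WITH DEFAULT KEY.

DATA gt_datasets TYPE ty_datasets.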

So we just search for a pattern in the payload. If we find it, our task is to not deliver that dataset back, which we achieve by deleting it.
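
As a sketch, the initial variant could then look roughly like this; the pattern literal 'REMOVE_ME' and the variable names are placeholders I made up, and the DELETE with a generic WHERE condition is the part this post is about:

DATA ls_dataset TYPE ty_dataset.

LOOP AT gt_datasets INTO ls_dataset.
  IF ls_dataset-payload CS 'REMOVE_ME'.
    " generic WHERE condition: the kernel has to search the table for the target line
    DELETE gt_datasets WHERE id = ls_dataset-id.
  ENDIF.
ENDLOOP.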

Before we start changing the solution, let us measure the runtime.

1 WP, 1k lines

8 WP, 1k lines

1 WP, 10k lines

8 WP, 10k lines

First thoughts

I initially wanted to go beyond 100k lines, but I was honestly afraid that this thing would never come back to me. The non-linear runtime curve scares me, and it will be our main target to get rid of. A non-linear response forced by the generic WHERE condition…

… forces an internal search through the table for the right target(s). Even if there is just one target, the runtime curve still stays non-linear. We have to change this as fast as possible.

The scaling also concerns me a bit. Per work process, we take almost double the time in full parallel mode compared to the single process. Unfortunately, I cannot spot anything I could do to reduce the load on the memory system: I do not use deep data types in my data structure, and I also have no possibility to reduce the width of my table structure. 🙁

Stage 1

Stage 1 has to get rid of the non-linear response to table size. The solution here is to replace the generic deletion target with an index value.

Why an index access? Because its performance does not depend on the amount of data in the table. It behaves like a hashed table access in this regard, with a similar or mostly even lower constant access time.
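
Continuing the sketch from the initial setup, the Stage 1 variant could look like this. The loop already knows the line number of the current entry via sy-tabix, so we can hand it straight to DELETE ... INDEX (names are again my placeholders):

LOOP AT gt_datasets INTO ls_dataset.
  IF ls_dataset-payload CS 'REMOVE_ME'.
    " direct index access: no search through the table, independent of its size
    DELETE gt_datasets INDEX sy-tabix.
  ENDIF.
ENDLOOP.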

So let us look at the runtime:

1 WP, 1k lines

8 WP, 1k lines

1 WP, 10k lines

8 WP, 10k lines

Much better!

Judging Stage 1

Sequential performance improved by over 630% at 1k lines and over 610,000% at 10k lines. Removing the non-linear search really pays off here.

The scaling behaviour has also improved slightly, although that was not intended. This is probably because we no longer grind through the entire table every time we find something interesting.

But there is still one aspect of this solution that I am not happy with: the DELETE statement itself.

Table index rebuild

When we delete an entry from a table, we change its composition. With that action comes the necessity for the table to update its own administrative information; after all, we expect the table to know which lines it still has. That means that through our deletion process we force a rebuild of the table's index, because it has to update its own status in order to stay consistent. This operation takes time, and I want to know how much performance there is to gain if we avoid that rebuild.

Stage 2

I changed the mechanic by removing the DELETE statement and turning the whole thing into an append-only mechanic: if something does not fit our criteria, it simply does not get appended. After the loop, I replace the initial table with my result.

This removes the index rebuild as a performance drain.
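
Continuing the sketch from above, the Stage 2 mechanic could look roughly like this (names remain my placeholders):

DATA lt_result TYPE ty_datasets.

LOOP AT gt_datasets INTO ls_dataset.
  IF ls_dataset-payload NS 'REMOVE_ME'.   " NS = contains no string: the line stays relevant
    APPEND ls_dataset TO lt_result.
  ENDIF.
ENDLOOP.

" replace the original table with the filtered result, no index rebuilds along the way
gt_datasets = lt_result.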

1 WP, 1k lines

8 WP, 1k lines

1 WP, 10k lines

8 WP, 10k lines

Judging Stage 2

Sequential performance improved by over 340% at 1k lines and by over 370% at 10k lines compared to Stage 1. So the rebuild of the table index does have a significant impact on the performance of a deletion operation.

Compared to the initial solution, we achieved a runtime improvement of over 282,000% at 1k lines and of over 2,897,000% at 10k lines.

As a side note, I must admit that the last performance improvement figure does look ridiculous… but it proves a point: never build a solution whose runtime responds non-linearly to its data input.

Take care,

Dmitrii

 

Reducing memory bandwidth consumption

Last time I talked about how reducing memory bandwidth consumption can be very beneficial to your ABAP performance.

I recommended designing your structures as lean as reasonably possible, but sometimes removing business-relevant information is not an option.

And though an idealistic approach may sound good, just knowing about the drag I produce with wide structures does not improve my situation. What I need is a way to still carry a payload without producing too much drag.

And there actually is one. Let me show you an example:

We perform scans on the first three fields in order to find out which version is actually relevant. If we find a hit, we add the information to our output table.
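
The original listing is not shown here, but the setup could be sketched roughly like this; the field names, lengths and the hit condition are my assumptions:

TYPES: BEGIN OF ty_record,
         object  TYPE c LENGTH 10,   " the three fields used for the scan
         version TYPE i,
         status  TYPE c LENGTH 1,
         payload TYPE c LENGTH 500,  " wide business payload, dragged along on every access
       END OF ty_record.

TYPES ty_records TYPE STANDARD TABLE OF ty_record WITH DEFAULT KEY.

DATA: gt_records TYPE ty_records,
      gt_output  TYPE ty_records.

FIELD-SYMBOLS <ls_record> TYPE ty_record.

LOOP AT gt_records ASSIGNING <ls_record>.
  " scan only the first three fields to decide whether this version is relevant
  IF <ls_record>-status = 'A' AND <ls_record>-version > 1.
    APPEND <ls_record> TO gt_output.
  ENDIF.
ENDLOOP.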

Runtime sequential:

Runtime parallel (8 cores, 8 WP):

From the business side of things, there is nothing here we can simply leave out, and we can't just go and shrink data types that are used system-wide. We could try to move the extra information (not used for the scans) into a separate internal table and read it only when we have a hit. Unfortunately, this is impractical, because in a real-world scenario we are often operating in a bigger ecosystem: we often have fixed inputs (APIs or DB table designs) and some form of fixed outputs (again APIs or DB tables).

This would require us to first loop over the main table (producing exactly the massive drag we wanted to avoid) in order to split the information between the two tables, and to then loop over the lean version again, scan it, and read the other table whenever we have a hit. This does not sound very efficient or fast. And it isn't.

Please also consider that even read statements with constant access time (index, hash) do not come for free; they too have a non-negligible runtime. So this solution doesn't work. Is there another one?

Well yes: try to redesign your input structure (if you are allowed to 😉):

As you can observe, I changed the structure layout and implemented a split. The three main fields required for the initial scan are still there, but instead of having the payload right behind them, I outsourced it (good outsourcing, who would have thought…) into a separate structure. I have to change the access in the algorithm slightly, but this change is very local.
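
A sketch of what such a split could look like, using a boxed component for the payload; the names and the exact layout are again my assumptions, not the original code:

TYPES: BEGIN OF ty_payload,
         payload TYPE c LENGTH 500,   " the wide, rarely used business data
       END OF ty_payload.

TYPES: BEGIN OF ty_record_split,
         object  TYPE c LENGTH 10,    " the three scan fields stay in the main structure
         version TYPE i,
         status  TYPE c LENGTH 1,
         data    TYPE ty_payload BOXED,  " boxed component: only loaded when accessed directly
       END OF ty_record_split.

DATA: gt_records_split TYPE STANDARD TABLE OF ty_record_split WITH DEFAULT KEY,
      gt_output_split  TYPE STANDARD TABLE OF ty_record_split WITH DEFAULT KEY.

FIELD-SYMBOLS <ls_split> TYPE ty_record_split.

LOOP AT gt_records_split ASSIGNING <ls_split>.
  " the scan never touches <ls_split>-data, so the payload is not pulled through the cache
  IF <ls_split>-status = 'A' AND <ls_split>-version > 1.
    APPEND <ls_split> TO gt_output_split.
  ENDIF.
ENDLOOP.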

So did this improve runtime?

Yep, it did. But why?

Remember my last post? Remember what gets loaded into the cache when you access a structure? And, more importantly, what does not?

Because my payload structure is boxed into the main structure, it does not get loaded into the cache when I access other parts of the structure. Only when I access it directly is it loaded into the cache. And because the payload structure contains only one flat data type, everything in it gets loaded in that case.

Now this is only useful if you do not actually access the payload of most table entries, because an individual access of the payload in this manner is way slower than if it were a plain part of the main structure. This is due to the memory latency you pay when accessing data through a reference (which will be discussed in a separate post).

But for this type of use case it comes in pretty handy: we can reduce drag without actually excluding relevant information. Our main concern in this situation is a meaningful usage of boxed types to control what we want to load, and when.

But regardless of how well we try to reduce memory bandwidth usage, we are fighting an uphill battle:

This is the changed code in parallel execution, still with 8 cores and 8 WP. The individual execution time increased by 128%! We are still way faster than before the change (in both modes), but the more we push the hardware to its limits, the less scalable our code becomes.

So when you push for bigger hardware, more cores and more WP, keep in mind that individual WP performance often scales quite poorly. Reducing memory bandwidth consumption through changes in code or data structures may delay this effect, but eventually it will catch up with you.

 

Take care,

Dmitrii

 

 

ABAP Performance – Memory Bandwidth

Today I would like to show you an effect that is mostly underrated or even ignored when trying to improve application performance: memory bandwidth.

Just like what gets measured gets managed, what is not measured will not be managed. Or did you recently discover a KPI in your SAT trace named “Memory Bandwidth – Time consumed”?

Let me start this with a simple example:

And this is the function we call:

So we basically traverse an internal table with 100k lines and just do one simple addition per line. I inserted logic to push the relevant internal table out of the cache before traversing it, because I want to show the effects of memory bandwidth and not of cache behaviour…
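
The original listings (the caller and the called function) are not shown here, so this is only a condensed sketch of the idea; the names are placeholders and the cache-eviction logic mentioned above is not reproduced:

TYPES: BEGIN OF ty_line,
         value TYPE i,
       END OF ty_line.

DATA: gt_lines TYPE STANDARD TABLE OF ty_line WITH DEFAULT KEY,
      ls_line  TYPE ty_line,
      gv_sum   TYPE i.

" fill the table with 100k lines
DO 100000 TIMES.
  ls_line-value = sy-index.
  APPEND ls_line TO gt_lines.
ENDDO.

" the measured part: traverse the table and do one addition per line
LOOP AT gt_lines INTO ls_line.
  gv_sum = gv_sum + ls_line-value.
ENDLOOP.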

So this provides us with an average execution time of:

Now I change the code. Well, actually I only change the data structure of the internal table we traverse.

I added a new field named info. It simulates the typical additional payload you may not always use but still keep in your structure, just in case you need to access it. The rest has not changed: same internal table, still 100k lines and just one addition per line.
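
In terms of the sketch above, the change could look like this; the field name info comes from the post, its type and length are my assumptions, and the loop itself stays untouched:

TYPES: BEGIN OF ty_line,
         value TYPE i,
         info  TYPE c LENGTH 200,  " additional payload, never touched inside the loop
       END OF ty_line.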

But now we get this:

Execution time increased by 136.15%. And we have not even added fancy new routines that operate on the new component…

So now let me provide some very basic background:

When you access a part of a structure, you load all of its components from RAM into your L1 cache (referenced components are an exception: there you only load the pointer). Touch one, get all.

But why does this even make a difference?

 

Memory systems in general have become the bottleneck of modern multi-core and multi-socket systems. You either suffer from bandwidth restrictions (as in our example above) or you run into problems with memory latency (this will be a separate post).

And now take into account that this was just sequential – with exclusive machine access.

Let us therefore test this under more realistic circumstances, 12 cores with 12 WP:

Runtime per work process – before modification

Runtime per work process – after modification

Now the relative runtime increase stays the same, but we are still wasting a lot of runtime here. Scale this up with more data, more business logic and more hardware and this will start to really hurt.

The lesson I learned from this is to keep my data structures as lean as I reasonably can. Often there is no need to run around searching for single bytes to save (except maybe in really hot and tight loops), but big structures cost you performance even if you are just traversing through them.

Take care,

Dmitrii