Tuning ABAP with Hardware mechanics #sitHVR

This year I had the privilege to attend the SAP Inside Track in Hannover as a speaker. This post links to my presentation and my demos.

SAP Inside Track Hannover 2018 Wiki

Looking forward to the abap code retreat on April 14th in Hannover!

Power Point: Tuning ABAP with hardware mechanics

Demos for Presentation

Demo 1 – Everything fine – initial example

runtime: 48.185 Microseconds


Demo 2 – Adding an additional structure component causes performance degradation

runtime: 146.787 Microseconds

Demo 3 – Outsourcing the new component behind a reference using boxed -> back to initial performance

runtime: 48.319 Microseconds


Demo 4 – Religious approach, everything has to be a reference

runtime: 390.354 Microseconds



Take care,



Tuning abap parallel cursor

Today I want to show how tuning an ABAP parallel cursor works, and when different data distributions between the outer and the inner table should make you change your algorithm.

This post picks up where an earlier one stopped. I highly recommend reading that one first, because here I take its setup and technique for granted.

different data distribution -> different performance

Here we have a parallel cursor implementation with a read table statement:
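The original listing is not reproduced here, so the following is only a minimal sketch of what a parallel cursor with a read statement typically looks like. The structures `ty_head` and `ty_item`, the table names and the tiny demo data set are assumptions for illustration, not the setup used for the measurements.

```abap
REPORT z_pc_with_read.

" Hypothetical line types; the structures of the original post are not shown.
TYPES: BEGIN OF ty_head,
         head_id TYPE i,
       END OF ty_head,
       BEGIN OF ty_item,
         head_id TYPE i,
         item_id TYPE i,
       END OF ty_item.

DATA: lt_head  TYPE STANDARD TABLE OF ty_head WITH EMPTY KEY,
      lt_item  TYPE STANDARD TABLE OF ty_item WITH EMPTY KEY,
      lv_count TYPE i.

" Tiny demo data set: 3 heads with 2 items each.
DO 3 TIMES.
  DATA(lv_head_id) = sy-index.
  APPEND VALUE #( head_id = lv_head_id ) TO lt_head.
  DO 2 TIMES.
    APPEND VALUE #( head_id = lv_head_id item_id = sy-index ) TO lt_item.
  ENDDO.
ENDDO.

" Both tables have to be sorted by the join field.
SORT lt_head BY head_id.
SORT lt_item BY head_id.

LOOP AT lt_head ASSIGNING FIELD-SYMBOL(<ls_head>).
  " Cursor switch: find the first item of this head.
  READ TABLE lt_item TRANSPORTING NO FIELDS
       WITH KEY head_id = <ls_head>-head_id BINARY SEARCH.
  IF sy-subrc <> 0.
    CONTINUE. " head without items
  ENDIF.
  DATA(lv_from) = sy-tabix.

  LOOP AT lt_item ASSIGNING FIELD-SYMBOL(<ls_item>) FROM lv_from.
    IF <ls_item>-head_id <> <ls_head>-head_id.
      EXIT. " sorted table: no further matches can follow
    ENDIF.
    lv_count = lv_count + 1. " real item processing would go here
  ENDLOOP.
ENDLOOP.

WRITE: / 'Items processed:', lv_count. " 6 for the demo data above
```

The read is executed once per head, which is why its cost matters more the fewer items each head has.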

And this is our data generation logic:

1k Head datasets and 2,5 Mio Item datasets. In the following measurements I am also including the necessary sort of both tables.

Time taken:          151.398 Microseconds

Items / microsecond: 16,78

This is our baseline. Now I will change the distribution of datasets between the head and the items a bit:

Now we have 100k head datasets and 2,5 Mio item datasets. Each head now has only 25 item datasets.

Time taken:          324.033 Microseconds

Items / microsecond: 7,72

Runtime more than doubled, an increase of about 114%. This is a problem.

Why is it slowing down?

As I changed the ratio of items<->head, I also increased the number of cursor switches in the algorithm. Now I have to read 100 times more often than I did before. This impacts my performance notably, as the read operation is still costly.

If I want to improve this, I have to get rid of my read mechanic somehow…

no read statement

There actually is a parallel cursor implementation without a read involved:
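As the original listing is missing, here is a minimal sketch of the idea under my own assumptions: instead of searching for the first item of each head, a cursor variable simply remembers where the previous head stopped. The types, names and demo data are hypothetical.

```abap
REPORT z_pc_without_read.

" Hypothetical line types; the structures of the original post are not shown.
TYPES: BEGIN OF ty_head,
         head_id TYPE i,
       END OF ty_head,
       BEGIN OF ty_item,
         head_id TYPE i,
         item_id TYPE i,
       END OF ty_item.

DATA: lt_head  TYPE STANDARD TABLE OF ty_head WITH EMPTY KEY,
      lt_item  TYPE STANDARD TABLE OF ty_item WITH EMPTY KEY,
      lv_count TYPE i.

" Tiny demo data set: 3 heads with 2 items each.
DO 3 TIMES.
  DATA(lv_head_id) = sy-index.
  APPEND VALUE #( head_id = lv_head_id ) TO lt_head.
  DO 2 TIMES.
    APPEND VALUE #( head_id = lv_head_id item_id = sy-index ) TO lt_item.
  ENDDO.
ENDDO.

SORT lt_head BY head_id.
SORT lt_item BY head_id.

" The cursor points to the first item that has not been consumed yet.
DATA(lv_cursor) = 1.

LOOP AT lt_head ASSIGNING FIELD-SYMBOL(<ls_head>).
  LOOP AT lt_item ASSIGNING FIELD-SYMBOL(<ls_item>) FROM lv_cursor.
    IF <ls_item>-head_id > <ls_head>-head_id.
      EXIT. " first item of a later head: the cursor already points here
    ENDIF.
    lv_cursor = sy-tabix + 1. " this item is consumed, move the cursor on
    IF <ls_item>-head_id = <ls_head>-head_id.
      lv_count = lv_count + 1. " real item processing would go here
    ENDIF.
    " Items with a smaller head_id have no head and are skipped.
  ENDLOOP.
ENDLOOP.

WRITE: / 'Items processed:', lv_count. " 6 for the demo data above
```

Switching to the next head now costs one comparison and one exit instead of a binary search.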

I ran it without the read statement:

Time taken:          215.838 Microseconds

Items / microsecond: 11,58

This implementation has improved my performance by 50%.

As we removed the read statement as a switch mechanic in our algorithm, we now do not receive such a big performance penalty when we switch to a new head dataset.

Is it always faster?

No, not always. Let’s look at our first data pattern:

Time taken:          149.364 Microseconds

Items / microsecond: 16,74

With the usage of our read statement we processed 16,78 Items per microsecond.

So with many items per head, both algorithms deliver practically the same performance. But the fewer items a head dataset has, the more an implementation without a read statement makes sense.

I tested both implementations with a lot of different data distributions and never saw the read statement implementation clearly beat the implementation without it. So if you want to be on the safe side, implement a parallel cursor without a read statement.

Take care,


Parallel cursor vs. secondary key

This one bugged me for a while. Whenever I implement an inner loop which depends on the outer loop, I start to think about which solution to implement. So I decided to answer the question parallel cursor vs. secondary key for myself – and make a post about it in the process 🙂

Initial Setup

Let me introduce my mighty initial setup…

… a simple head <-> item loop logic. And this is my data generation for this:

Before we start the comparison, I measured the initial solution:






So now we know what we are tuning for.


Parallel cursor – Take 1

So I wanted to implement the parallel cursor solution first, although this was not my preferred solution. I am a creature of some habit and after I picked one up it takes some work to change it. Remember that in order for this to work the inner table has to be sorted before this:

This is the implementation:

This gave me a slight boost (Sort was also measured).







But I was actually very disappointed, I expected way more. That was only an improvement by not even a factor of 2. And I did remember it being a very potent technique.

Secondary key

So now my hopes were with the secondary key of the table – which I made up after the subroutine was called:
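The listing itself is not part of this text. As a rough sketch of the technique, a non-unique sorted secondary key can be declared on the item table and used in the inner loop as shown below. The key name `by_head`, the structures and the demo data are assumptions.

```abap
REPORT z_secondary_key_loop.

" Hypothetical line types; the structures of the original post are not shown.
TYPES: BEGIN OF ty_head,
         head_id TYPE i,
       END OF ty_head,
       BEGIN OF ty_item,
         head_id TYPE i,
         item_id TYPE i,
       END OF ty_item.

DATA: lt_head  TYPE STANDARD TABLE OF ty_head WITH EMPTY KEY,
      " The secondary key is maintained by the runtime, no explicit SORT needed.
      lt_item  TYPE STANDARD TABLE OF ty_item
               WITH EMPTY KEY
               WITH NON-UNIQUE SORTED KEY by_head COMPONENTS head_id,
      lv_count TYPE i.

" Tiny demo data set: 3 heads with 2 items each.
DO 3 TIMES.
  DATA(lv_head_id) = sy-index.
  APPEND VALUE #( head_id = lv_head_id ) TO lt_head.
  DO 2 TIMES.
    APPEND VALUE #( head_id = lv_head_id item_id = sy-index ) TO lt_item.
  ENDDO.
ENDDO.

LOOP AT lt_head ASSIGNING FIELD-SYMBOL(<ls_head>).
  " USING KEY lets the WHERE condition be evaluated via the sorted key
  " instead of scanning the whole item table.
  LOOP AT lt_item ASSIGNING FIELD-SYMBOL(<ls_item>)
       USING KEY by_head
       WHERE head_id = <ls_head>-head_id.
    lv_count = lv_count + 1. " real item processing would go here
  ENDLOOP.
ENDLOOP.

WRITE: / 'Items processed:', lv_count. " 6 for the demo data above
```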

Implementing my secondary key was very simple and straightforward. Now I was excited to see the runtime.






Now this seems right. This result is way more desirable. We got a performance improvement by 6.271 %. Scaling also improved. While the difference in runtime between the sequential workprocess and the individual parallel workprocess in the initial solution was + 95%, we have improved to “just” 73%.

But this seemed way too big a difference. I was getting second thoughts… this seemed too good to be true. My favorite technique won. But it won by such a large margin that it also seemed ridiculous to me.

And in my experience, something that seems too good to be true usually isn’t true. So my conclusion was that I had to have messed something up. So I searched – and I found.

Parallel cursor – Take 2

I did discover what I messed up. I even found it interesting enough to share, because I imagine that more people could fall into this trap without knowing how much potential they are leaving unused.

Let me highlight the magical spot:

Take a look at my where condition. Found something unusual? Let me help.

Tell me, what happens when we are done with looping over the 25 matching datasets in the item table? Quit? Really?

We actually do not quit, because this thing does not know that we are done after the first 25 sets. It needs to make sure that the table does not contain matches which are in another part of that table. And this is a problem.

Remember our initial code?

This code was slow. And it was slow for a reason. Searching with a generic key forces a linear scan of the item table for every head, so the total runtime grows quadratically. And this is exactly what I forced in my first parallel cursor implementation:

I just gave it another starting point. But the search type stays the same and so does the runtime curve – which makes this implementation stupid.

I then made a little change and made the code rely on my previous sort:

I deleted the where clause in the inner loop and replaced it with the if statement. You should place it right in the beginning in order to avoid bugs.

This worked very nicely:






This solution beat my secondary key solution by 398%. And it is 31.654% faster than the initial solution.

So if you have a similar situation, I would recommend a parallel cursor solution.

And I have to change my habit, after all this blows my previous favorite out of the water…


Take care


Tuning of abap delete performance

Today I would like to discuss how you can improve your abap delete performance. As there are different kinds of solutions to this problem, I would like to work with one example setup through the whole post – improving the performance step by step.

Initial setup

Our initial setup deals with deleting all of the non essential datasets from our datastream. In order for the datasets to be relevant and not be deleted, they have to be validated.

And this is what the data structure looks like:

So we just search for a pattern in the payload. If we find it, our task is to not deliver it back. Which we achieve by deleting.

Before we start changing the solution, let us measure the runtime.

1 WP 1k lines

8 WP 1k lines





1WP 10k lines

8WP 10k lines




First thoughts

Now I initially wanted to go beyond 100k, but I honestly was afraid that this thing would never come back to me. The non-linear runtime curve scares me, and getting rid of it will be our main target. A non-linear response forced by the generic where condition…

… forces an internal search in the table for the right target(s). Even if there is just one, every deletion still scans the table, so the total runtime grows quadratically with the table size. We have to change this as fast as possible.

Also the scaling concerns me a bit. We take almost double the time in full parallel mode per workprocess compared to our single process. Unfortunately, I can not spot anything I could do to reduce the load on the memory system. I do not use deep data types in my data structure and also do not have the possibility to reduce the width of my table structure. 🙁

Stage 1

Stage 1 has to get rid of the non-linear response to table size. The solution here is to replace the target format for the deletion with an index value.

Why an index access? Because the access performance does not depend on the amount of data in the table. It behaves like a hashed table in this regard, and it has a similar or mostly lower constant access time.
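The Stage 1 listing is not included in this text. A minimal sketch of deleting by index inside the loop could look like this; the structure `ty_data`, the search pattern `DROP` and the demo data are assumptions for illustration.

```abap
REPORT z_delete_by_index.

" Hypothetical line type; the structure of the original post is not shown.
TYPES: BEGIN OF ty_data,
         id      TYPE i,
         payload TYPE c LENGTH 20,
       END OF ty_data.

DATA lt_data TYPE STANDARD TABLE OF ty_data WITH EMPTY KEY.

lt_data = VALUE #( ( id = 1 payload = 'keep me' )
                   ( id = 2 payload = 'please DROP me' )
                   ( id = 3 payload = 'keep me too' ) ).

LOOP AT lt_data ASSIGNING FIELD-SYMBOL(<ls_data>).
  DATA(lv_index) = sy-tabix. " index of the current line
  IF <ls_data>-payload CS 'DROP'.
    " Slow variant for comparison: DELETE lt_data WHERE id = <ls_data>-id.
    " That would scan the whole table for every hit.
    DELETE lt_data INDEX lv_index.
  ENDIF.
ENDLOOP.

DATA(lv_lines) = lines( lt_data ).
WRITE: / 'Remaining lines:', lv_lines. " 2 for the demo data above
```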

So let us look at the runtime:

1WP 1k lines

8WP 1k lines 




1 WP 10k lines


8WP 10k lines




Much better!

Judging Stage 1

Sequential performance improved by over 630% at 1k lines and over 610.000 % at 10k lines. Removing the non-linear search does pay off really well here.

Scaling attributes have also improved slightly, although not intended. Probably because we do not grind through the entire table every time we find something interesting.

But there is still one aspect of this solution, which I am not happy with. That is the delete statement itself.

Table Index rebuild

When we delete an entry from a table, we change its composition. With that action comes a necessity for the table to update its own composition status. After all, we do expect the table to know what nodes it still has. That means that through our deletion process we force a rebuild of the table's index. It has to update its own status in order to stay consistent. This operation does take time, and I want to know how much performance there is to gain if we avoid that rebuild.

Stage 2

I changed the mechanic by removing the delete statement and turning it into an append only mechanic. If something does not fit our criteria, it just does not get appended. After the loop I replace the initial table with my result.

This removes the index rebuild as a performance drain.
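Since the Stage 2 listing is missing here, the following is a minimal sketch of the append-only idea. The structure, the pattern `DROP` and the demo data are again assumptions.

```abap
REPORT z_append_only_filter.

" Hypothetical line type; the structure of the original post is not shown.
TYPES: BEGIN OF ty_data,
         id      TYPE i,
         payload TYPE c LENGTH 20,
       END OF ty_data.

DATA: lt_data   TYPE STANDARD TABLE OF ty_data WITH EMPTY KEY,
      lt_result TYPE STANDARD TABLE OF ty_data WITH EMPTY KEY.

lt_data = VALUE #( ( id = 1 payload = 'keep me' )
                   ( id = 2 payload = 'please DROP me' )
                   ( id = 3 payload = 'keep me too' ) ).

LOOP AT lt_data ASSIGNING FIELD-SYMBOL(<ls_data>).
  " Only lines that do not contain the pattern are carried over.
  IF <ls_data>-payload NS 'DROP'.
    APPEND <ls_data> TO lt_result.
  ENDIF.
ENDLOOP.

" Replace the input with the filtered result after the loop.
lt_data = lt_result.

DATA(lv_lines) = lines( lt_data ).
WRITE: / 'Remaining lines:', lv_lines. " 2 for the demo data above
```

The source table is never changed inside the loop, so no index maintenance happens while it is being traversed.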

1WP 1k lines

8WP 1k lines




1WP 10k lines

8WP 10k lines




Judging Stage 2

Sequential performance improved by over 340% at 1k lines and by over 370% at 10k lines compared to Stage 1. So the rebuild of the table index does have a significant impact on the performance of a deletion operation.

Compared to the initial solution, we achieved a performance improvement of over 282.000% at 1k lines and an improvement of over 2.897.000 % at 10k lines.

As a side note I must admit that the last performance improvement figure does look ridiculous… but it proves a point. The point being not to ever build a solution whose runtime grows non-linearly (here quadratically) with the data input.

Take care,



DD03L remap performance

Today I would like to show you how I improve the performance of a DD03L-driven remap of table content into a csv format. I will start with a simple setup and tune it step by step, thereby showing you different tuning possibilities.

Starting setup

Our starting setup is pretty basic. We have a lcl_file_service class which performs the reformat operation for us. We can feed it any table we want, as long as it is contained in DD03L.

So this is where we start. I feed it from the outside with the request to remap the whole content of the DD02L table.

Initial runtime:

Stage 1

There are a few things that catch my attention here right away:

  • Looping at the it_dd03l into a workarea, that is a common sight – but really slow. I want to change this to a field symbol
  • Assigning component with the fieldname, instead of the position. This statement can also be used by working with the position of the target field. This is usually faster.
  • We are inserting our result instead of appending it. If inserting is not obligatory, we should not do it. Switching to append here.
  • We have a branch, which checks if the end of the file line was reached in order to skip the last separator. I do not like branches in hot loops, I really don’t. If you have any possibility to pull branches outside of your hot loops, do it. Although your processor is likely to be able to predict the outcome of this branch most of the time, I would not bet on it. Therefore I will only loop until the next-to-last line and read the last line separately. This way I avoid having the branch.

So let us then implement these conclusions.
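The Stage 1 listing is not part of this text. The sketch below only illustrates the four points from the list on a made-up three-field structure: field symbol instead of workarea, assign component by position, append instead of insert, and the last field handled outside the inner loop so that no separator branch is needed. All names and the demo data are assumptions, not the code of lcl_file_service.

```abap
REPORT z_csv_line_sketch.

" Hypothetical row type with three character-like fields.
TYPES: BEGIN OF ty_row,
         tabname   TYPE c LENGTH 30,
         fieldname TYPE c LENGTH 30,
         position  TYPE n LENGTH 4,
       END OF ty_row.

CONSTANTS lc_field_count TYPE i VALUE 3.

DATA: lt_rows  TYPE STANDARD TABLE OF ty_row WITH EMPTY KEY,
      lt_lines TYPE STANDARD TABLE OF string WITH EMPTY KEY,
      lv_line  TYPE string.

lt_rows = VALUE #( ( tabname = 'DD03L' fieldname = 'TABNAME'   position = '0001' )
                   ( tabname = 'DD03L' fieldname = 'FIELDNAME' position = '0002' ) ).

" Number of fields that are followed by a separator.
DATA(lv_before_last) = lc_field_count - 1.

LOOP AT lt_rows ASSIGNING FIELD-SYMBOL(<ls_row>).   " field symbol, no copy
  CLEAR lv_line.

  " All fields except the last one: value plus separator, no branch needed.
  DO lv_before_last TIMES.
    ASSIGN COMPONENT sy-index OF STRUCTURE <ls_row>  " access by position
           TO FIELD-SYMBOL(<lv_value>).
    lv_line = |{ lv_line }{ <lv_value> };|.
  ENDDO.

  " The last field is handled separately and gets no separator.
  ASSIGN COMPONENT lc_field_count OF STRUCTURE <ls_row> TO <lv_value>.
  lv_line = |{ lv_line }{ <lv_value> }|.

  APPEND lv_line TO lt_lines.                        " append, not insert
ENDLOOP.

LOOP AT lt_lines ASSIGNING FIELD-SYMBOL(<lv_line>).
  WRITE: / <lv_line>.
ENDLOOP.
```

A real implementation would also check sy-subrc after each ASSIGN and deal with non character-like fields, which this sketch leaves out.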

Now let us check our performance.

Stage 1 runtime:

We improved our performance by 125%.


But I am still not happy. I actually do not like the whole approach of the solution. Why do I have to do things over and over again?

For example the mapping of the individual fields: I do an assign component every time I touch a field in a data row.

And every time I have to check if my palms are in trouble.

Then I have to go through that buffer variable, so I do not blow myself up when I touch datatypes which are not compatible to a simple concatenate operation.

And if that was not enough I have to keep track of where I am in the file line, so I do not destroy the output format. 🙁

Stage 2

I do not want to do all of those things above over and over again. If I really have to do them, I want to do them only once.

In order to reach that goal we have to change our approach. Let me present my ideal remap solution:

Now I know that is quite abstract, but it is essentially what I am willing to invest. I want to enter a magical loop, where everything has already been taken care of. The input is just waiting to be mapped to the correct output. Both are in one line, so I do not have to jump around like a fool. And also every necessary input check has to be taken care of. After all everything I would need for those checks is contained in the dd03l table and one single row of input data.

I also have no more stupid branches to worry about and the formatting has also been taken care of.

Unfortunately the real solution is not that simple or lean, but it still follows the same approach:

As you see we have a new method involved – build_remap_customizing – which takes care of all that work I do not want to repeat doing.

The result table has a structure with only 4 components:

  • source_offset type i
  • source_width type i
  • target_offset type i
  • target_width type i

Nothing more is needed.

This is what our remap loop looks like now. The whole assigning and input to output matching has already been taken care of.
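The loop itself is not reproduced in this text. As a minimal sketch of the idea, assuming a character-like input buffer, an output buffer with pre-computed separators and a hand-filled customizing table, it could look like this. The buffer lengths, the variable names and the three demo fields are assumptions; in the post the table is produced by build_remap_customizing.

```abap
REPORT z_remap_loop_sketch.

" The four components named in the post.
TYPES: BEGIN OF ty_remap,
         source_offset TYPE i,
         source_width  TYPE i,
         target_offset TYPE i,
         target_width  TYPE i,
       END OF ty_remap.

DATA: lt_remap TYPE STANDARD TABLE OF ty_remap WITH EMPTY KEY,
      lv_in    TYPE c LENGTH 20,   " one data row as a long char field
      lv_out   TYPE c LENGTH 22,   " output line: 20 characters + 2 separators
      lv_s_off TYPE i,
      lv_s_len TYPE i,
      lv_t_off TYPE i,
      lv_t_len TYPE i.

" Hand-filled customizing for three fields of width 5, 5 and 10.
lt_remap = VALUE #(
  ( source_offset = 0  source_width = 5  target_offset = 0  target_width = 5 )
  ( source_offset = 5  source_width = 5  target_offset = 6  target_width = 5 )
  ( source_offset = 10 source_width = 10 target_offset = 12 target_width = 10 ) ).

" Separators are written once, before any row is mapped.
lv_out+5(1)  = ';'.
lv_out+11(1) = ';'.

lv_in = 'AAAAABBBBBCCCCCCCCCC'.

" The remap loop: pure offset/length copies, no assign component, no branch.
LOOP AT lt_remap ASSIGNING FIELD-SYMBOL(<ls_map>).
  lv_s_off = <ls_map>-source_offset.
  lv_s_len = <ls_map>-source_width.
  lv_t_off = <ls_map>-target_offset.
  lv_t_len = <ls_map>-target_width.
  lv_out+lv_t_off(lv_t_len) = lv_in+lv_s_off(lv_s_len).
ENDLOOP.

WRITE: / lv_out. " AAAAA;BBBBB;CCCCCCCCCC
```

The helper variables for offset and length are only there to keep the substring access easy to read.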

Now maybe you have noticed that I do not work on the input and output directly, but through buffer variables. The reason for that is that I want my source and target to be alphanumerical, so I can work with parts of them. When I copy the data row into my input structure (well, it is a long char field actually…), I make sure that every field is right where I expect it to be. The same applies to the output field – here the separators have also been precomputed.

So, what’s the performance?

We have increased performance by 98% compared to stage 1 and 350% compared to the initial solution.


Basic improvements do help and can provide you with solid performance improvements. But sometimes we need a little change of perspective to come up with different and better solutions.

Do take care of the basics, but also try to pre compute and optimize your whole solution approach from time to time – it pays off.

Take care,


ABAP PP-Framework documentation

I would like to introduce you to the ABAP PP-Framework – a parallel processing framework, which I use for all of my batch applications.

It was developed in the year 2002 by SAP. The main author is Thomas Bollmeier.

This framework was recommended to me by my mentor back when I started my education – and I still use it. It is old, but gold.

If you do not require recursive parallelization, this framework does the job pretty well. Not very fancy, but very reliable. Because of its availability in every Netweaver system it has become a standard choice for parallelizing SAP applications.

After some of my colleagues asked me for the documentation, I decided to just put it online.

As a side note I have to say that the documentation is in German.

Take care



Framework for parallel processing (Framework für die Parallelverarbeitung)

Introduction to the parallel processing tool (Einführung in das Parallelverarbeitungstool)


ABAP – Why parallelization through DIA processes is a bad idea

Today I would like to talk about how your program parallelization fits into the bigger concept of an ABAP system design & spec.

For this I assume you know that there are multiple work process types available for parallelization on your ABAP system.

Whenever I see programs that can function in a parallel fashion, I look at what kind of processes those programs use. Often, I see a simple function call with “Starting new task” added – the report is pumping out dialog processes to distribute the workload. And this is precisely what I do not want to see as a solution. The reason is not so much the code itself, but the consequences it has on the overall system behavior and the spec driven strategy of the ABAP system.

Let us take a step back and imagine for a moment that we are not developers – let us change our perspective. You now have the task of configuring your abap system to meet a certain business requirement. This business requirement is simple:

“Up to X Users per Minute must be able to use the reporting functions, apart from that we have these jobs which need to run every hour. Those jobs must run. If they do not, we lose money and clients. Budget is X€ per month.”

Now let us assume that the sizing of the machine has already been taken care of and that the budget limit was respected. Your job now is to make sure that those jobs can indeed run every hour with a minimum chance of failure. How do you do that?

Your only reliable solution is to adapt your ABAP server memory configuration. Your ABAP instance has multiple types of memory, but for now we just want to focus on two:


HEAP Memory is mostly allocated by BTCH processes. BTCH processes start with allocating HEAP until they either hit their individual limit (parameter value) or the global HEAP limit (parameter value). If your BTCH process hits either of those limits, it starts allocating EM memory (skipped the roll memory, but it’s insignificant for this case) until it hits one of the set limits (individual or global).



EM Memory is mostly allocated by DIA processes. DIA processes start with allocating EM until they either hit their individual limit (parameter value) or the global EM limit (parameter value). If your DIA process hits either of those limits, it starts allocating HEAP memory (again skipped the roll memory) until it hits one of the set limits (individual or global). If a DIA process is allocating HEAP memory, it goes into the so-called private mode. In that mode the process is not freed when the program is finished. In order to be able to hold the memory allocated in the HEAP, it has to keep occupying the workprocess. If you have too much of this, you do not have any spare DIA workprocesses left and your system becomes unusable.


Those are the basics in a very simplified form. Your DIA and your BTCH processes each have a memory area where they feel at home. Apart from the split having an interesting technical background (for example: EM Memory can handle fast context switches better. Therefore better equipped to handle masses of short workloads), they also give the possibility of sealing off user driven workloads from batch driven workloads by restricting them to their starting memory type. In that case your DIA processes would be restricted from allocating more than a minimal amount of heap and your BTCH processes could not allocate EM past a technical minimum.

This is what enables you to make sure that those jobs are never short on memory and would always be in the position to start. Even if your DIA Users start to demand an extreme amount of memory, your BTCH processes are shielded. Although you can’t compensate for an undersized machine, you can protect your important business processes.

After having our core processes dump in the night because a user was trying to allocate 40GB of memory, I am a big fan of sealing off those workloads against each other.

And this is precisely the point where parallelization of programs with DIA processes becomes a big problem. If I as a “run oriented” person start BTCH jobs, I expect them to run in batch – not to pump out DIA processes. You can’t seal that off, and you will therefore fail to deliver. Your batch workload has to be immune to dialog user driven actions. When it gets serious, I would rather have a few dialog users receive a memory dump than see my core jobs fail.


Take care




How to wreck ABAP Performance

Read statement with generic key

My all-time favorite. Really. This construct blows up so often and with such force that you need a special name for it.

If both tables reach or surpass a line count of 10k, the death spiral starts. If you go beyond 100k your runtime can go from a few minutes to multiple hours. After all, you are using a search whose total runtime grows quadratically with the table size - enjoy the ride!

One special characteristic of this construct is that it does not surface until you put real pressure on the application. This leads to an interesting sequence of events:

  1. Developers develop software driven by the business case.
  2. Test with a small amount of test data.
  3. If scope is complete or budget has been consumed, rollout begins.
  4. Customer installs and tries the application with a small test set, in order to confirm that the application is doing the task right.
  5. Application enters production.
  6. Just when the workload is high, pressure on the team is high, the timing really bad and the possible margin for errors in production is next to zero - it blows up.

This setup is really easy to spot, but it should be changed before it tears you apart.


Modify/Delete Statement with generic key

This follows the same pattern as our first setup. The problem here is that this delete operation requires a search in order to find out what is supposed to be deleted. And this search intrinsically has the same runtime curve as our generic read statement from the first setup. A non-linear one. As a bonus, you receive the effect of your table index (something has to hold the internal table together) constantly rebuilding by force. This does not make it faster.

And of course, you get the same pleasant pattern of surfacing only when the workload is high - perfect for occasions like end of year processing and other mission critical processes.

Looping into a workarea

This is pretty common. Although this is nothing that obliterates your production machine, it does hurt and should be avoided.

Always replace this with a field symbol or a reference (there are differences in performance, but only if you like to exploit your CPU Cache). Fixing this is cheap and can improve your performance significantly.
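A minimal sketch of the three variants, assuming a hypothetical two-field structure; all three loops do the same work, only the access mechanism differs.

```abap
REPORT z_loop_variants.

" Hypothetical line type for illustration.
TYPES: BEGIN OF ty_data,
         id     TYPE i,
         amount TYPE i,
       END OF ty_data.

DATA lt_data TYPE STANDARD TABLE OF ty_data WITH EMPTY KEY.

lt_data = VALUE #( ( id = 1 amount = 10 )
                   ( id = 2 amount = 20 ) ).

" Workarea: every line is copied out and has to be written back.
LOOP AT lt_data INTO DATA(ls_data).
  ls_data-amount = ls_data-amount + 1.
  MODIFY lt_data FROM ls_data.
ENDLOOP.

" Field symbol: the line is changed in place, no copy and no MODIFY.
LOOP AT lt_data ASSIGNING FIELD-SYMBOL(<ls_data>).
  <ls_data>-amount = <ls_data>-amount + 1.
ENDLOOP.

" Reference: also in place, accessed through a data reference.
LOOP AT lt_data REFERENCE INTO DATA(lr_data).
  lr_data->amount = lr_data->amount + 1.
ENDLOOP.

LOOP AT lt_data ASSIGNING <ls_data>.
  WRITE: / <ls_data>-id, <ls_data>-amount. " amounts are now 13 and 23
ENDLOOP.
```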


Select loops

SELECT ... ENDSELECT loops are useful, but should be avoided like the plague.

If there is a possibility to use a standard bulk sql statement for it - use it. Everything is better than this.

You either fetch your whole package in one take or you push it down into your DB, but do not use this solution in anything even remotely dependent on performance.

Select single in loops

If you want to trash your database with a high management cost to result ratio, you should use this as often as you possibly can.

Great for slowing your application down for no reason. Avoid the "for all entries" function of open sql, because it might speed you up and reduce your lunch break significantly.
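Sarcasm aside, here is a minimal sketch of what replacing a select single in a loop with "for all entries" can look like. It reads field metadata from DD03L for a list of table names in one database round trip; the chosen fields and the filter on active versions are assumptions for the example.

```abap
REPORT z_for_all_entries_sketch.

DATA lt_tabnames TYPE STANDARD TABLE OF dd03l-tabname WITH EMPTY KEY.

lt_tabnames = VALUE #( ( 'DD02L' ) ( 'DD03L' ) ).

" Slow variant for comparison (one database round trip per table name):
"   LOOP AT lt_tabnames ... SELECT SINGLE ... FROM dd03l ... ENDLOOP.

" Guard: with an empty driver table, FOR ALL ENTRIES would ignore the
" WHERE condition on it and select far more than intended.
IF lt_tabnames IS NOT INITIAL.
  SELECT tabname, fieldname, position
    FROM dd03l
    FOR ALL ENTRIES IN @lt_tabnames
    WHERE tabname  = @lt_tabnames-table_line
      AND as4local = 'A'
    INTO TABLE @DATA(lt_fields).
ENDIF.

DATA(lv_lines) = lines( lt_fields ).
WRITE: / 'Fields read:', lv_lines.
```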


Not using mass update or insert

Some people tend to micromanage things. But there are folks that take it way too far.

I assure you that the compiler and the database can manage things well themselves. You can even go a step further and trust them with inserting or updating a database table from an internal table in just one command! Really. I promise you, they will not lose anything important. There is a reason for a database not having a lost property office...


Committing too often

An excellent strategy for preservation of the turning hourglass is committing after every interaction with the database.

It is important to take your time...


Take care





ABAP Memory Latency & Hardware Prefetching

In another post I showed how using boxed types for certain use cases can reduce your memory bandwidth consumption and improve performance. But I also pointed out that this technique is to be used with caution and only if your access of these boxed types is rather rare. Today I want to show you why I said that – I want to show you the effects of memory latency and hardware prefetching.

Let us start off with a simple example setup:

We traverse through an internal table with 1 Mio lines. Each line has a payload structure which we access in the loop to modify the values. Initial runtime:

So far nothing exciting. Now let us try what I advised not to do: Turn the payload structure into a boxed one.

And now we get this:

Now THAT hurts! A runtime increase of 551,39%! But what causes this? Why is it so much slower?

Because we access through a reference. What enables the initial lower memory bandwidth consumption of a boxed type (which is essentially just the same raw type, but behind a reference pointer) can also be a weak point.

Your hardware is trying its best to help your code run fast. One thing that it does is called prefetching: it tries to predict what data you will access in the future and loads that data into your cache ahead of time. As memory is organized in a contiguous way, your prefetcher also works that way. This helps a lot when you are traversing an internal table with a flat structure, because the access pattern aligns with the memory structure.

But references have a problem with this, because they are not very predictable. Though you know where the pointers are located (part of the line type of the internal table), you do not know what they are pointing at. Only after that reference is read (accessed from RAM, transferred into L1 Cache) do you know what the content of that reference is. Now this is the point in time where you access the data the reference is pointing to (and again access from RAM, transferred into L1 Cache). The reason this is slow is not only that you can not prefetch this data effectively, you are also jumping around in RAM. An access to RAM costs roughly 100 times more than an access to L1 Cache. And you do that at least twice for a simple reference. Some classes are designed in a way where you are mostly engaged in jumping around in RAM from pointer, to pointer, to pointer – this can produce absurdly slow software. Another feature you obtain is almost non scalable software.

Why? Well your Cores are doing nothing while you jump around in RAM. Your processor stalls and there is nothing you can do about that – except a redesign. If you throw more hardware at your problem, performance will hardly increase. Because processing power was never an issue here, so more will not solve anything.

Take care,


Reducing memory bandwidth consumption

Last time I talked about how reducing memory bandwidth consumption could be very beneficial to your ABAP performance.

I recommended to design your structures as lean as reasonably possible, but sometimes removing business relevant information is not an option.

And though an idealistic approach may sound good, just knowing about the drag I produce with wide structures does not improve my situation. What I would need is a solution to be able to still carry a payload without producing too much drag.

And there actually is one. Let me show you an example:

We perform scans on the first 3 fields, in order to find out what version is actually relevant. If we find a hit, we add information to our output table.

Runtime sequential:

Runtime parallel(8 cores, 8WP):

Now from the business side of things, there is nothing here we can simply leave out. And we can’t just go and shrink datatypes, which are used systemwide. Now we could try to move the extra info (not used for scans) into a separate internal table, and read just in case we have a hit. Unfortunately this is impractical, because in a real world scenario we are often operating in a bigger eco system. We often have fixed inputs( APIs or DB Table designs) and we often have some form of fixed outputs (again APIs or DB Tables).

So this would require to first loop over the main table (producing massive drag we actually wanted to avoid), in order to split the info between the two tables -> and to then again loop over the lean version, scan and read the other one if we have a hit. This does not sound very effective and/or fast. And it isn’t.

Please also consider that even read statements with a constant access time (Index, Hash) do not come for free. They too have a non-negligible runtime. So this solution doesn’t work. Is there another one?

Well yes, try to redesign your input structure (if you are allowed to 😉 ):

As you can observe I changed the structure layout and implemented a split. The 3 main fields required for the initial scan are still there, but instead of having the payload right behind it, I outsourced it (Good outsourcing, who would have thought…) into a separate structure. Now I have to slightly change the access of the algorithm, but this change is actually very local.
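The changed structure is not shown in this text. A minimal sketch of such a split with a boxed component might look like the following; the field names, the widths and the scan condition are assumptions for illustration.

```abap
REPORT z_boxed_payload_sketch.

" Hypothetical payload: wide fields that are only needed after a hit.
TYPES: BEGIN OF ty_payload,
         description TYPE c LENGTH 200,
         remark      TYPE c LENGTH 200,
       END OF ty_payload.

" Lean main structure: three scan fields plus the boxed payload.
TYPES: BEGIN OF ty_version,
         object_id TYPE i,
         version   TYPE i,
         active    TYPE c LENGTH 1,
         payload   TYPE ty_payload BOXED,  " stored behind an internal reference
       END OF ty_version.

DATA: lt_versions TYPE STANDARD TABLE OF ty_version WITH EMPTY KEY,
      lt_output   TYPE STANDARD TABLE OF ty_payload WITH EMPTY KEY.

lt_versions = VALUE #(
  ( object_id = 1 version = 1 active = ' '
    payload = VALUE #( description = 'old version' ) )
  ( object_id = 1 version = 2 active = 'X'
    payload = VALUE #( description = 'current version' ) ) ).

LOOP AT lt_versions ASSIGNING FIELD-SYMBOL(<ls_version>).
  " The scan only touches the three lean fields.
  IF <ls_version>-active = 'X'.
    " The payload is accessed only in case of a hit.
    APPEND <ls_version>-payload TO lt_output.
  ENDIF.
ENDLOOP.

DATA(lv_hits) = lines( lt_output ).
WRITE: / 'Hits:', lv_hits. " 1 for the demo data above
```

The access path to the payload stays the same as for a normal substructure, which is why the change to the algorithm remains local.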

So did this improve runtime?

Yep, it did. But why?

Remember my last post? Remember what gets loaded into cache if you access a structure? And more importantly, what does not?

Because my payload structure is boxed into the main structure, it does not get loaded into cache when I access other parts of the structure. Only if I access it directly, it is being loaded into cache. And because the payload structure has only one flat datatype, everything gets loaded.

Now this is only useful if you do not actually access the payload of most of the table entries, because an individual access of the payload in this manner is way slower than if it were a simple part of the main structure. This is because of the memory latency you get by accessing through a reference (will be discussed in a separate post).

But for this type of use case it comes in pretty handy – we can reduce drag without actually excluding relevant information. Our main concern in this situation would only be a meaningful usage of boxed types to control what we want to load and when.

But regardless of how well we try to reduce memory bandwidth usage, we are fighting an uphill battle:

This is the code after the change in parallel execution, still 8 cores and 8 WP. The individual execution time increased by 128%! Now we still are way faster than before the change (in both modes), but the more we push the hardware to its limits, the less scalable our code becomes.

So the more you push for bigger hardware, more cores and more WP – keep in mind that you often scale quite poorly when it comes to individual WP performance. Reducing memory bandwidth consumption through changes in code or data structure may delay this, but eventually it will catch up.


Take care,