ABAP PP-Framework documentation

I would like to introduce you to the ABAP PP-Framework – a parallel processing framework, which I use for all of my batch applications.

It was developed in the year 2002 by SAP. The main author is Thomas Bollmeier.

This framework was recommended to me by my mentor back when I started my education – and I still use it. It is old, but gold.

If you do not require recursive parallelization, this framework does the job pretty well. Not very fancy, but very reliable. Because of its availability in every NetWeaver system, it has become a standard choice for parallelizing SAP applications.

After some of my colleagues asked me for the documentation, I decided to just put it online.

As a side note, I have to mention that the documentation is in German.

Take care

Dmitrii

 

Framework für die Parallelverarbeitung (Framework for Parallel Processing)

Einführung in das Parallelverarbeitungstool (Introduction to the Parallel Processing Tool)

 

ABAP – Why parallelization through DIA processes is a bad idea

Today I would like to talk about how your program parallelization fits into the bigger concept of ABAP system design & spec.

Now for this I assume you know that you have multiple work process types available for parallelization on your ABAP system.

Whenever I see programs that can function in a parallel fashion, I look at what kind of processes those programs use. Often I see a simple function call with STARTING NEW TASK added – the report is pumping out dialog processes to distribute the workload. And this is precisely what I do not want to see as a solution. The reason is not so much the code itself, but the consequences it has on the overall system behavior and the spec-driven strategy of the ABAP system.
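For reference, a minimal sketch of the pattern I mean – the function module, its parameter, and the callback are all made up for illustration:

" A report fanning its workload out to DIA processes via asynchronous RFC.
" 'Z_PROCESS_PACKAGE' is a hypothetical RFC-enabled function module.
DATA lv_task TYPE c LENGTH 16.

DO 8 TIMES.
  lv_task = |PKG_{ sy-index }|.
  CALL FUNCTION 'Z_PROCESS_PACKAGE'
    STARTING NEW TASK lv_task                " each call occupies a DIA work process
    DESTINATION IN GROUP DEFAULT             " DEFAULT = any application server
    PERFORMING collect_result ON END OF TASK
    EXPORTING
      iv_package = sy-index.
ENDDO.

FORM collect_result USING p_task TYPE clike.
  " RECEIVE RESULTS FROM FUNCTION 'Z_PROCESS_PACKAGE' would collect the output here.
ENDFORM.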

Let us take a step back and imagine for a moment that we are not developers – let us change our perspective. You now have the task of configuring your ABAP system to meet a certain business requirement. This business requirement is simple:

“Up to X users per minute must be able to use the reporting functions; apart from that, we have these jobs which need to run every hour. Those jobs must run. If they do not, we lose money and clients. Budget is X€ per month.”

Now let us assume that the sizing of the machine has already been taken care of and the budget limit was respected. Your job now is to make sure that those jobs can indeed run every hour with a minimal chance of failure. How do you do that?

Your only reliable solution is to adapt your ABAP server memory configuration. Your ABAP instance has multiple types of memory, but for now we want to focus on just two:

HEAP

HEAP memory is mostly allocated by BTCH processes. BTCH processes start by allocating HEAP until they either hit their individual limit (parameter value) or the global HEAP limit (parameter value). If your BTCH process hits either of those limits, it starts allocating EM memory (I skip the roll memory here, it is insignificant for this case) until it hits one of the set limits (individual or global).

 

EM

EM memory is mostly allocated by DIA processes. DIA processes start by allocating EM until they either hit their individual limit (parameter value) or the global EM limit (parameter value). If your DIA process hits either of those limits, it starts allocating HEAP memory (again skipping the roll memory) until it hits one of the set limits (individual or global). If a DIA process allocates HEAP memory, it goes into the so-called private mode. In that mode, the work process is not freed when the program finishes: in order to hold on to the memory allocated in the HEAP, the user context has to keep occupying the work process. If you have too much of this, you do not have any spare DIA work processes left and your system becomes unusable.

 

Those are the basics in a very simplified form. Your DIA and your BTCH processes each have a memory area where they feel at home. Apart from the split having an interesting technical background (for example, EM memory handles fast context switches better and is therefore better equipped for masses of short workloads), it also gives you the possibility of sealing off user-driven workloads from batch-driven workloads by restricting each to its starting memory type. In that case your DIA processes would be restricted from allocating more than a minimal amount of HEAP, and your BTCH processes could not allocate EM past a technical minimum.
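As a sketch, such a seal is built with the standard memory management profile parameters. These are the usual suspects (availability of the DIA/non-DIA variants depends on the kernel release); the values below are purely illustrative and not recommendations:

# DIA: live in EM, allow only a minimal amount of HEAP
ztta/roll_extension_dia    = 4000000000    # EM quota per dialog user context (bytes)
abap/heap_area_dia         = 200000000     # small HEAP cap per DIA work process
# BTCH: live in HEAP, allow only a technical minimum of EM
ztta/roll_extension_nondia = 200000000     # small EM quota per batch user context
abap/heap_area_nondia      = 4000000000    # HEAP cap per BTCH work process
abap/heap_area_total       = 16000000000   # global HEAP ceiling across all work processes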

This is what enables you to make sure that those jobs are never short on memory and are always in a position to start. Even if your DIA users start to demand an extreme amount of memory, your BTCH processes are shielded. Although you cannot compensate for an undersized machine, you can protect your important business processes.

After having our core processes dump in the night because a user tried to allocate 40 GB of memory, I am a big fan of sealing off those workloads from each other.

And this is precisely the point where parallelization of programs with DIA processes becomes a big problem. If I, as a “run oriented” person, start BTCH jobs, I expect them to run in batch – not to pump out DIA processes. You cannot seal that off, and you will therefore fail to deliver. Your batch workload has to be immune to dialog-user-driven actions. When it gets serious, I would rather have a few dialog users receive a memory dump than my core jobs failing.

 

Take care

Dmitrii

 

 

Reducing memory bandwidth consumption

Last time I talked about how reducing memory bandwidth consumption can be very beneficial to your ABAP performance.

I recommended designing your structures as lean as reasonably possible, but sometimes removing business-relevant information is not an option.

And though an idealistic approach may sound good, just knowing about the drag I produce with wide structures does not improve my situation. What I need is a way to still carry a payload without producing too much drag.

And there actually is one. Let me show you an example:

We perform scans on the first three fields in order to find out which version is actually relevant. If we find a hit, we add information to our output table.
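A rough sketch of the kind of code in question – field names, sizes, and scan values are made up:

" Wide structure: three scan fields plus a fat payload that is dragged through the cache.
TYPES: BEGIN OF ty_rec,
         field1  TYPE i,
         field2  TYPE i,
         field3  TYPE i,
         payload TYPE c LENGTH 200,
       END OF ty_rec.

DATA lt_data TYPE STANDARD TABLE OF ty_rec WITH EMPTY KEY.
DATA lt_hits LIKE lt_data.

LOOP AT lt_data ASSIGNING FIELD-SYMBOL(<ls_rec>).
  " Scan the first three fields to find the relevant version.
  IF <ls_rec>-field1 = 1 AND <ls_rec>-field2 = 2 AND <ls_rec>-field3 = 3.
    APPEND <ls_rec> TO lt_hits.   " hit: take the information over into the output
  ENDIF.
ENDLOOP.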

Runtime sequential:

Runtime parallel (8 cores, 8 WP):

Now from the business side of things, there is nothing here we can simply leave out. And we cannot just go and shrink datatypes that are used system-wide. We could try to move the extra info (not used for the scans) into a separate internal table and read it only in case we have a hit. Unfortunately this is impractical, because in a real-world scenario we are often operating in a bigger ecosystem: we often have fixed inputs (APIs or DB table designs) and some form of fixed outputs (again, APIs or DB tables).

So this would require first looping over the main table (producing exactly the massive drag we wanted to avoid) in order to split the info between the two tables, and then looping again over the lean version, scanning it, and reading the other table whenever we have a hit. This does not sound very effective or fast. And it isn't.
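To make the overhead visible, here is a sketch of that two-table variant – reusing lt_data and ty_rec from the sketch above, with all other names made up:

TYPES: BEGIN OF ty_lean,
         id     TYPE i,
         field1 TYPE i,
         field2 TYPE i,
         field3 TYPE i,
       END OF ty_lean,
       BEGIN OF ty_pay,
         id      TYPE i,
         payload TYPE c LENGTH 200,
       END OF ty_pay.

DATA lt_lean TYPE STANDARD TABLE OF ty_lean WITH EMPTY KEY.
DATA lt_pay  TYPE HASHED TABLE OF ty_pay WITH UNIQUE KEY id.

" Step 1: the split itself already drags the full payload through the cache once.
LOOP AT lt_data ASSIGNING FIELD-SYMBOL(<ls_wide>).
  APPEND VALUE #( id = sy-tabix field1 = <ls_wide>-field1
                  field2 = <ls_wide>-field2 field3 = <ls_wide>-field3 ) TO lt_lean.
  INSERT VALUE #( id = sy-tabix payload = <ls_wide>-payload ) INTO TABLE lt_pay.
ENDLOOP.

" Step 2: scan the lean table, then pay an extra hashed READ on every hit.
LOOP AT lt_lean ASSIGNING FIELD-SYMBOL(<ls_lean>).
  IF <ls_lean>-field1 = 1 AND <ls_lean>-field2 = 2 AND <ls_lean>-field3 = 3.
    READ TABLE lt_pay WITH TABLE KEY id = <ls_lean>-id
         ASSIGNING FIELD-SYMBOL(<ls_pay>).
    " <ls_pay>-payload would go into the output here.
  ENDIF.
ENDLOOP.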

Please also consider that even read statements with constant access time (index, hash) do not come for free. They too have a non-negligible runtime. So this solution does not work – is there another one?

Well yes, try to redesign your input structure (if you are allowed to 😉 ):
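A minimal sketch of the redesigned layout – names are again made up, and note that the BOXED addition for static boxes is only allowed inside class and interface declarations:

CLASS lcl_types DEFINITION.
  PUBLIC SECTION.
    TYPES: BEGIN OF ty_payload,                " flat payload, outsourced
             info TYPE c LENGTH 200,
           END OF ty_payload.
    TYPES: BEGIN OF ty_rec,
             field1  TYPE i,                   " the three scan fields stay in front
             field2  TYPE i,
             field3  TYPE i,
             payload TYPE ty_payload BOXED,    " static box: held via reference, loaded only on direct access
           END OF ty_rec.
ENDCLASS.

The scan loop stays as before; only the payload access changes from <ls_rec>-payload to <ls_rec>-payload-info, which dereferences the box.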

As you can observe, I changed the structure layout and implemented a split. The three main fields required for the initial scan are still there, but instead of having the payload right behind them, I outsourced it (good outsourcing, who would have thought…) into a separate structure. Now I have to slightly change the accesses in the algorithm, but this change is actually very local.

So did this improve runtime?

Yep, it did. But why?

Remember my last post? Remember what gets loaded into cache when you access a structure? And, more importantly, what does not?

Because my payload structure is boxed into the main structure, it does not get loaded into cache when I access other parts of the structure. Only when I access it directly is it loaded into cache. And because the payload structure consists of only one flat datatype, all of it gets loaded at once.

Now this is only useful if you do not actually access most of the table entries' payload, because an individual access of the payload in this manner is way slower than if it were a simple part of the main structure. This is because of the memory latency you get by accessing through a reference (this will be discussed in a separate post).

But for this type of use case it comes in pretty handy – we can reduce drag without actually excluding relevant information. Our main concern in this situation is the meaningful usage of boxed types to control what we want to load, and when.

But regardless of how well we try to reduce memory bandwidth usage, we are fighting an uphill battle:

This is the code after the change in parallel execution, still with 8 cores and 8 WP. The individual execution time increased by 128%! We are still way faster than before the change (in both modes), but the more we push the hardware to its limits, the less scalable our code becomes.

So the more you push for bigger hardware, more cores, and more WP, keep in mind that individual WP performance often scales quite poorly. Reducing memory bandwidth consumption through changes in code or data structures may delay this, but eventually it will catch up.

 

Take care,

Dmitrii