Today I would like to show you an effect that is often underrated, or even ignored, when trying to improve application performance: memory bandwidth.
Just as what gets measured gets managed, what is not measured will not be managed. Or did you recently discover a KPI in your SAT Trace named “Memory Bandwidth – Time consumed”?
Let me start this with a simple example:
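The original code is only shown as a screenshot in the source, so here is a minimal ABAP sketch of such a setup. All names and the exact line type are my assumptions, not the author’s code:

```abap
" Hypothetical lean line type: one numeric field only
TYPES: BEGIN OF ty_line,
         num TYPE i,
       END OF ty_line,
       ty_tab TYPE STANDARD TABLE OF ty_line WITH EMPTY KEY.

" Fill the table with 100k lines
DATA(lt_data) = VALUE ty_tab( FOR i = 1 WHILE i <= 100000 ( num = i ) ).
```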
And this is the function we call:
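The called routine is also not reproduced in the source; a sketch of what such a traversal could look like (hypothetical names, assuming `lt_data` is accessible to the method):

```abap
" Hypothetical traversal routine: one addition per line, nothing else
METHOD traverse.
  DATA(lv_sum) = 0.
  LOOP AT lt_data ASSIGNING FIELD-SYMBOL(<ls_line>).
    lv_sum = lv_sum + <ls_line>-num.
  ENDLOOP.
ENDMETHOD.
```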
So we basically traverse an internal table with 100k lines and perform just one simple addition per line. I inserted logic to push the relevant internal table out of the cache before traversing it – I want to show the effects of memory bandwidth, not the cache behaviour…
So this provides us with an average execution time of:
Now I change the code – well, actually I only change the data structure of the internal table we traverse.
I added a new field named info. This simulates the typical additional payload you may not always use but still keep in your structure – just in case you need to access it later. The rest has not changed: same internal table, still 100k lines, and just one addition per line.
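A sketch of such a widened line type, with a hypothetical payload size (the source does not state the actual width of the added field):

```abap
" Same logic as before, but the line type now carries an extra payload field
TYPES: BEGIN OF ty_line_wide,
         num  TYPE i,
         info TYPE c LENGTH 200,  " payload that is never touched in the loop
       END OF ty_line_wide,
       ty_tab_wide TYPE STANDARD TABLE OF ty_line_wide WITH EMPTY KEY.
```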
But now we get this:
Execution time increased by 136.15%. And we have not even added fancy new routines that operate on the new component…
So now let me provide some very basic background:
When you access one component of a structure, you load all of its components from RAM into your L1 cache (referenced components are an exception: only the pointer is loaded). Touch one, get all.
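To illustrate the exception for referenced components: in ABAP, deep components such as strings or nested internal tables are stored outside the row, with only a reference inside it. A sketch (my own illustrative type, not from the source):

```abap
TYPES: BEGIN OF ty_mixed,
         num   TYPE i,       " flat component: stored inline, loaded with the row
         name  TYPE string,  " deep component: only a reference sits in the row
         items TYPE STANDARD TABLE OF i WITH EMPTY KEY,
                             " likewise only referenced from the row
       END OF ty_mixed.
```

So a row of `ty_mixed` stays small regardless of how long `name` grows; only flat components widen the row itself.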
But why does this even make a difference?
Memory systems in general have become the bottleneck of modern multi-core and multi-socket systems. You either suffer from bandwidth restrictions (as in our example above) or from memory latency problems (that will be a separate post).
And keep in mind that this was just sequential execution – with exclusive access to the machine.
Let us therefore test this under more realistic circumstances: 12 cores with 12 work processes (WPs):
The relative runtime increase stays the same, but we are still wasting a lot of runtime here. Scale this up with more data, more business logic, and more hardware, and it will start to really hurt.
The lesson I learned from this is to keep my data structures as lean as I reasonably can. Often there is no need to hunt for single bytes to save (except maybe in really hot, tight loops), but big structures cost you performance even if you are just traversing them.
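One practical pattern, sketched here with hypothetical names, is to project only the fields a hot loop actually needs into a lean work table before the loop, so the traversal pulls far fewer bytes through the memory system:

```abap
TYPES: BEGIN OF ty_lean,
         num TYPE i,
       END OF ty_lean,
       ty_lean_tab TYPE STANDARD TABLE OF ty_lean WITH EMPTY KEY.

" Copy only the matching components out of the wide table lt_wide
DATA(lt_lean) = CORRESPONDING ty_lean_tab( lt_wide ).

DATA(lv_sum) = 0.
LOOP AT lt_lean ASSIGNING FIELD-SYMBOL(<ls_lean>).
  lv_sum = lv_sum + <ls_lean>-num.
ENDLOOP.
```

Whether the one-time copy pays off depends on how often the lean table is traversed afterwards; for a single pass, the projection itself also has to read the wide rows once.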