In another post I showed how boxed types can reduce your memory bandwidth consumption and improve performance in certain use cases. But I also pointed out that this technique should be used with caution, and only if you access these boxed types rarely. Today I want to show you why I said that: I want to show you the effects of memory latency and hardware prefetching.
Let us start off with a simple example setup:
We loop over an internal table with 1 million lines. Each line has a payload structure, which we access in the loop to modify its values. Initial runtime:
So far nothing exciting. Now let us try what I advised not to do: Turn the payload structure into a boxed one.
And now we get this:
Now THAT hurts! A runtime increase of 551.39%! But what causes this? Why is it so much slower?
Because we now access the payload through a reference. The very thing that gives a boxed type its lower initial memory bandwidth consumption (it is essentially the same raw type, but behind a reference) can also be its weak point.
Your hardware is trying its best to help your code run fast, and one of the things it does is called prefetching. It tries to predict which data you will access next and loads that data into your cache ahead of time. As memory is organized in a contiguous way, your prefetcher also works that way. This helps a lot when you are traversing an internal table with a flat structure, because the access pattern aligns with the memory layout.
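To make the prefetching idea a bit more concrete, here is a toy model in Python (not ABAP, and certainly not how a real CPU is built). It assumes 64-byte cache lines, a cache that never evicts anything, and a simple next-line prefetcher. The function `count_misses`, the 16-byte row size and the row count are made up purely for illustration.

```python
LINE_SIZE = 64  # bytes per cache line (assumed; a common size)


def count_misses(addresses, prefetch=True):
    """Count cache misses for a sequence of byte addresses.

    Toy model: the cache never evicts anything, and the optional
    prefetcher loads line n+1 whenever line n is touched.
    """
    cached = set()
    misses = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        if line not in cached:
            misses += 1
            cached.add(line)
        if prefetch:
            cached.add(line + 1)  # next-line prefetch
    return misses


# A flat table: 1000 rows of 16 bytes each, stored back to back.
flat_rows = [i * 16 for i in range(1000)]

print(count_misses(flat_rows, prefetch=False))  # 250: one miss per cache line
print(count_misses(flat_rows, prefetch=True))   # 1: only the very first line misses
```

Real prefetchers are smarter and real caches are finite, but the pattern is the same: a sequential walk gives the hardware something it can predict.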
But references do not play well with this, because they are not very predictable. You know where the pointers are located (they are part of the line type of the internal table), but you do not know what they point at. Only after the reference has been read (fetched from RAM and transferred into the L1 cache) do you know its content. Only then can you access the data the reference points to (again fetched from RAM and transferred into the L1 cache). This is slow not only because the data cannot be prefetched effectively, but also because you are jumping around in RAM. An access to RAM costs on the order of 100 times more than an access to the L1 cache, and you do that at least twice for a simple reference. Some classes are designed in a way that keeps you mostly busy jumping around in RAM from pointer, to pointer, to pointer, and this can produce absurdly slow software. As a bonus, you get software that barely scales.
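The arithmetic behind this can be sketched with a back-of-the-envelope model in Python. The cost units are relative and simply reuse the 100x figure from above. `row_cost` and its parameters are hypothetical names, and the model deliberately ignores everything a real CPU does to overlap memory accesses.

```python
L1_COST = 1     # relative cost of an L1 cache hit
RAM_COST = 100  # relative cost of a RAM access (the ~100x figure from the text)


def row_cost(dependent_hops, hit_rate=0.0):
    """Estimated cost of reaching one payload through a chain of references.

    Each hop depends on the result of the previous one, so the costs add up
    instead of overlapping. hit_rate is the assumed fraction of hops that
    are served from L1.
    """
    per_hop = hit_rate * L1_COST + (1.0 - hit_rate) * RAM_COST
    return dependent_hops * per_hop


print(row_cost(1, hit_rate=1.0))  # 1.0: flat field, already prefetched into L1
print(row_cost(2))                # 200.0: read the reference, then the target
print(row_cost(4))                # 400.0: pointer, to pointer, to pointer, to data
```

Treat these numbers as an illustration, not a prediction. The slowdown measured above is far smaller than 200x, presumably because caches and prefetching still catch a good part of the accesses.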
Why? Well, your cores are doing nothing while you jump around in RAM. Your processor stalls and there is nothing you can do about that, except a redesign. If you throw more hardware at your problem, performance will hardly increase: processing power was never the issue here, so more of it will not solve anything.
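A small calculation shows why. The sketch below assumes that "more hardware" means faster cores, and that a pointer-chasing loop spends 10 time units computing and 90 units stalled on memory. Both the split and the `runtime` helper are invented for illustration.

```python
def runtime(compute, memory_stall, core_speedup):
    """Total time when only the compute part benefits from faster cores."""
    return compute / core_speedup + memory_stall


# Assumed split: 10 units of computing, 90 units of waiting for RAM.
base = runtime(10, 90, core_speedup=1)

for speedup in (1, 2, 4, 8):
    overall = base / runtime(10, 90, core_speedup=speedup)
    print(f"cores {speedup}x faster -> program {overall:.3f}x faster")
```

Under these assumptions the program can never get more than about 1.11x faster (100 / 90), no matter how fast the cores become, because the stall time does not shrink.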