Note: If you are using Guava < 13 and looking to fix a serious cache performance issue, you can jump directly to the conclusion. This article describes how I rediscovered this three-year-old issue, but you probably don’t care.
I have spent the last few months bootstrapping a couple of big data pipelines. I will not go into details, but basically the goal is to crunch almost all the HTTP hits issued from French mobile phones to publish the official French web audience. After focusing on business code for a while, I have finally been able to start chasing down performance issues.
The first section of these pipelines is made of several enrichment steps that add various metadata to the raw HTTP hits. Some of these steps are fast (nanoseconds); others are quite expensive (milliseconds). When a slow enrichment step is a pure function, an in-process cache is put in front of it to amortize its cost (the distribution of the input data is almost always exponential).
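To make the pattern concrete, here is a minimal sketch of a bounded in-process cache sitting in front of a slow pure function. It deliberately uses only `java.util` (an access-ordered `LinkedHashMap` acting as an LRU) rather than Guava, and it is not the code from our pipelines: `CachedStep`, `slowStep` and `maxEntries` are hypothetical names chosen for the illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustration only: wraps a slow, pure enrichment function with a small LRU cache.
// All names here are hypothetical; this is a stand-in for a real cache library.
final class CachedStep<K, V> implements Function<K, V> {
    private final Function<K, V> slowStep;
    private final Map<K, V> cache;
    private long misses = 0;

    CachedStep(Function<K, V> slowStep, final int maxEntries) {
        this.slowStep = slowStep;
        // accessOrder = true: iteration order goes from least to most recently used.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Called after each put: evict the least recently used entry when over capacity.
                return size() > maxEntries;
            }
        };
    }

    // Coarse locking keeps the sketch simple; a null result is never cached
    // and would be recomputed on every call.
    @Override
    public synchronized V apply(K key) {
        V value = cache.get(key);
        if (value == null) {
            misses++;
            value = slowStep.apply(key); // the expensive call, paid only on a miss
            cache.put(key, value);
        }
        return value;
    }

    synchronized long misses() {
        return misses;
    }
}
```

Because the wrapped step is a pure function, a hit can safely return the stored value. With a heavily skewed input distribution, a small number of hot keys absorbs most lookups, which is why even a modest cache pays off.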
During the initial phase, we decided to use Guava caches without any decent tuning. They are usually good enough and it is smarter to wait to have the full pipeline in place before fiddling with knobs or writing your own cache.
Last week, I started playing with the cache configuration to get a glimpse of the performance model of one of the pipelines. Our caches were configured to be quite small, tens of megabytes, and my goal was to observe what …