Tagged with jmh

Using exceptions for flow control is slow, even for DateTimeFormatter

Last week, I reviewed a seemingly simple patch to an Apache Crunch pipeline. Functional details don't matter, but the pipeline processes loads of hits from HTTP proxies. Each hit has a timestamp, and the patch aims to enrich every hit with some metadata indexed by user and business date (i.e. we query a repository with a composite key, part of which is the date formatted as a yyyyMMdd string).

Functional requirements are clear, the code is simple, and the repository is more than fast enough not to affect the performance profile of the pipeline.

However, while QA'ing the patch I noticed an unexpected slowdown. Since our end-to-end tests are configured to always profile a few map & reduce tasks with HPROF, it was easy to generate a CPU Flame Graph, which showed that the following line accounted for ~25% of the on-CPU time:

      String date = localDate.format(DateTimeFormatter.BASIC_ISO_DATE);

According to its Javadoc, BASIC_ISO_DATE is exactly what we should use to format a LocalDate as a yyyyMMdd string:

The ISO date formatter that formats or parses a date without an offset, such as '20111203'.

This returns an immutable formatter capable of formatting and parsing the ISO-8601 basic local date format. The format consists of:

  • Four digits for the year. Only years in the range 0000 to 9999 are supported.
  • Two digits for the month-of-year. This is pre-padded by zero to ensure two digits.
  • Two digits for the day-of-month. This is pre-padded by zero to ensure two digits.
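To make the call under discussion concrete, here is a minimal sketch that formats the same LocalDate twice: once with BASIC_ISO_DATE, as in the patch, and once with a pattern-based formatter. The class name, the method names and the pattern-based formatter are assumptions introduced for illustration only; they are not taken from the patch, and the sketch makes no claim about which variant is faster. It only shows that both produce the same yyyyMMdd key, which makes the second one a candidate worth benchmarking.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class BusinessDateKey {

    // Hypothetical alternative: a pattern-based formatter, built once and reused
    // (DateTimeFormatter instances are immutable and thread-safe).
    private static final DateTimeFormatter YYYYMMDD = DateTimeFormatter.ofPattern("yyyyMMdd");

    // The call that showed up in the flame graph.
    static String withBasicIsoDate(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    // Same output for a plain LocalDate, produced by the assumed alternative.
    static String withPattern(LocalDate date) {
        return date.format(YYYYMMDD);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2011, 12, 3);
        System.out.println(withBasicIsoDate(date)); // 20111203
        System.out.println(withPattern(date));      // 20111203
    }
}
```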

HPROF is ...

read full article

Chasing down Guava cache slowness

Note: If you are using Guava < 13 and looking to fix a serious cache performance issue, you can jump directly to the conclusion. This article describes how I rediscovered this three-year-old issue, but you probably don’t care.

I have spent the last few months bootstrapping a couple of big data pipelines. I will not go into details, but basically the goal is to crunch almost all the HTTP hits issued from French mobile phones to publish the official French web audience. After having focused on business code for a while, I have finally been able to start chasing down performance issues.

The first section of these pipelines is made of several enrichment steps adding various metadata to the raw HTTP hits. Some of these steps are fast (nanoseconds), while others are quite expensive (milliseconds). When a slow enrichment step is a pure function, an in-process cache is put in front of it to amortize its cost (the distribution of the input data is almost always exponential).

During the initial phase, we decided to use Guava caches without any decent tuning. They are usually good enough and it is smarter to wait to have the full pipeline in place before fiddling with knobs or writing your own cache.
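The idea of putting a cache in front of a pure enrichment step can be sketched as follows. This is not the Guava configuration used in the pipelines: to stay self-contained it relies on a plain JDK LinkedHashMap in access order as a small LRU stand-in, and the MemoizedStep class, its constructor parameters and the miss counter are all hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical wrapper: memoizes a slow, pure function behind a bounded LRU map.
public class MemoizedStep<K, V> implements Function<K, V> {

    private final Function<K, V> slowStep;
    private final Map<K, V> cache;
    private int misses;

    public MemoizedStep(Function<K, V> slowStep, int maxEntries) {
        this.slowStep = slowStep;
        // accessOrder = true keeps the least recently used entry first,
        // so removeEldestEntry evicts it once the bound is exceeded.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Coarse locking keeps the sketch simple; null results are recomputed on every call.
    @Override
    public synchronized V apply(K key) {
        V value = cache.get(key);
        if (value == null) {
            misses++;
            value = slowStep.apply(key);
            cache.put(key, value);
        }
        return value;
    }

    public synchronized int misses() {
        return misses;
    }
}
```

Because the wrapped step is a pure function, serving a cached value is indistinguishable from recomputing it, which is why only such steps are safe to wrap this way. The miss counter is there so that the effect of the size bound can be observed.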

Last week, I started playing with the cache configuration to get a glimpse of the performance model of one of the pipelines. Our caches were configured to be quite small, tens of megabytes, and my goal was to observe what ...

read full article