Tagged with java

Using exceptions for flow control is slow, even for DateTimeFormatter

Last week, I reviewed a seemingly simple patch to an Apache Crunch pipeline. The functional details don't matter, but the pipeline processes loads of hits from HTTP proxies. Each hit has a timestamp, and the patch aims to enrich every hit with some metadata indexed by user and business date (i.e. we query a repository with a composite key, part of which is the date formatted as a yyyyMMdd string).

The functional requirements are clear, the code is simple, and the repository is more than fast enough not to affect the performance profile of the pipeline.

However, while QA'ing the patch I noticed an unexpected slowdown. Since our end-to-end tests are configured to always profile a few map and reduce tasks with HPROF, it was easy to generate a CPU flame graph, which showed that the following line accounts for ~25% of the on-CPU time:

      String date = localDate.format(DateTimeFormatter.BASIC_ISO_DATE);

According to its Javadoc, BASIC_ISO_DATE is exactly what we should use to format a LocalDate as a yyyyMMdd string:

The ISO date formatter that formats or parses a date without an offset, such as '20111203'.

This returns an immutable formatter capable of formatting and parsing the ISO-8601 basic local date format. The format consists of:

  • Four digits for the year. Only years in the range 0000 to 9999 are supported.
  • Two digits for the month-of-year. This is pre-padded by zero to ensure two digits.
  • Two digits for the day-of-month. This is pre-padded by zero to ensure two digits.
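
The call discussed above can be exercised in isolation. The sketch below formats the same LocalDate with BASIC_ISO_DATE and with a pattern-based formatter. The `uuuuMMdd` pattern, the class name and the method names are assumptions made for illustration and are not taken from the patch. The sketch only shows that both calls produce the same string for an ordinary date; it says nothing about their relative cost.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Illustrative only: names and the alternative pattern are hypothetical.
final class BasicIsoDateDemo {

    // Pattern-based formatter built once and reused (DateTimeFormatter is immutable).
    static final DateTimeFormatter YYYYMMDD = DateTimeFormatter.ofPattern("uuuuMMdd");

    // The call shown in the article.
    static String viaBasicIsoDate(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    // Same output for a LocalDate, produced from an explicit pattern.
    static String viaPattern(LocalDate date) {
        return date.format(YYYYMMDD);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2011, 12, 3);
        System.out.println(viaBasicIsoDate(date)); // prints 20111203
        System.out.println(viaPattern(date));      // prints 20111203
    }
}
```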

HPROF is ...


Chasing down Guava cache slowness

Note: If you are using Guava < 13 and looking to fix a serious cache performance issue, you can jump directly to the conclusion. This article describes how I rediscovered this three-year-old issue, but you probably don’t care.

I have spent the last few months bootstrapping a couple of big data pipelines. I will not go into details, but basically the goal is to crunch almost all the HTTP hits issued from French mobile phones to publish the official French web audience. After focusing on business code for a while, I was finally able to start chasing down performance issues.

The first section of these pipelines is made of several enrichment steps that add various metadata to the raw HTTP hits. Some of these steps are fast (nanoseconds); others are quite expensive (milliseconds). When a slow enrichment step is a pure function, an in-process cache is put in front of it to amortize its cost (the distribution of the input data is almost always exponential).
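
To make the idea of a cache in front of a pure function concrete, here is a minimal sketch. It deliberately uses only the JDK (a LinkedHashMap in access order acting as a small LRU) rather than the Guava cache the pipelines relied on. The class name `MemoizedStep` and the `maxEntries` parameter are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: a bounded LRU memoizer, not the Guava cache used in the pipelines.
final class MemoizedStep<K, V> implements Function<K, V> {

    private final Function<K, V> slowStep; // must be a pure function of its key
    private final Map<K, V> cache;

    MemoizedStep(Function<K, V> slowStep, int maxEntries) {
        this.slowStep = slowStep;
        // accessOrder = true: the eldest entry is the least recently accessed one.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Synchronized for simplicity; null results are not cached.
    @Override
    public synchronized V apply(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = slowStep.apply(key); // pay the expensive cost only on a miss
            cache.put(key, value);
        }
        return value;
    }
}
```

With a skewed key distribution, most calls hit the few hot entries, so even a small `maxEntries` can absorb a large share of the expensive calls.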

During the initial phase, we decided to use Guava caches without any decent tuning. They are usually good enough and it is smarter to wait to have the full pipeline in place before fiddling with knobs or writing your own cache.

Last week, I started playing with the cache configuration to get a glimpse of the performance model of one of the pipelines. Our caches were configured to be quite small, tens of megabytes, and my goal was to observe what ...


Presentation, Sample 'em all

Update 2015-12-15: Since 2014, a lot of good material has been written about flamegraphs and Java profilers. You might want to watch Java Profiling from the Ground Up! by Nitsan Wakart. You might also want to know that since JDK 8 update 60, an unpatched JDK can be used to generate mixed-mode flame graphs. See Brendan Gregg's Java Mixed-Mode Flame Graphs at Netflix, JavaOne 2015. And finally, please be aware that hprof will be removed from JDK 9. It's time to update your skills and tools!

For a few months now, I have been working with Mediametrie to help them finish a project that aims to publish the audiences and usage of all French Internet sites.

I joined the team well after the project kick-off. Most of the code was already laid down, but some of the modules exhibited poor performance. Unable to troubleshoot the issues, the team was struggling to achieve performance requirements. Performance analysis is a scarce skill, and Hadoop pipelines usually make things much harder.

I introduced the team to flamegraphs and to profiling tools such as hprof, Flight Recorder and perf, then explained how to use them in a Hadoop context and described some performance analysis methods. After a few weeks, most of the issues were ...


OpenJDK JEP 180: HashMap vs collisions

Note: This article was first written in French for linuxfr, a French-speaking website. I decided to publish it here too; the excerpt below is translated from the original French.

In this article, I will talk about JEP 180 of OpenJDK 8, which proposes an interesting solution to the complexity attacks that hash tables are exposed to.

This topic has already been discussed here several times. I will nevertheless quickly restate the problem and how the discussions evolved. Readers already familiar with the subject can skip straight to the last paragraph to see what JEP 180 proposes.

Introduction to hash tables

A hash table is an implementation of the associative array abstract data type. An associative array maps a key to one or more values; it is sometimes also called a dictionary. Along with lists, it is one of the most widely used abstract data types.

A hash table is one particular implementation of an associative array. It is also the most common one. Basically, it is an array whose slots contain a pointer to nil, to a single element, or to a list of elements. The slot to use is determined by applying a hash function to the key. Ideally, each slot points to a single element only. In that case, insertion, lookup and deletion operations ...
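
The structure described above (an array of slots, a hash function choosing the slot, and a list per slot for keys that collide) can be sketched as follows. This is a teaching sketch with a fixed number of slots and hypothetical names; it is not how java.util.HashMap is implemented.

```java
import java.util.Iterator;
import java.util.LinkedList;

// Illustrative only: a fixed-size hash table with separate chaining.
final class ChainedHashTable<K, V> {

    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    // Each slot is null (empty) or a list of the entries that hashed to it.
    private final LinkedList<Entry<K, V>>[] slots;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int slotCount) {
        slots = new LinkedList[slotCount];
    }

    // The hash function picks the slot; floorMod keeps the index non-negative.
    private int slotFor(K key) {
        return Math.floorMod(key.hashCode(), slots.length);
    }

    void put(K key, V value) {
        int i = slotFor(key);
        if (slots[i] == null) slots[i] = new LinkedList<>();
        for (Entry<K, V> e : slots[i]) {
            if (e.key.equals(key)) { e.value = value; return; } // overwrite existing key
        }
        slots[i].add(new Entry<>(key, value));
    }

    V get(K key) {
        LinkedList<Entry<K, V>> chain = slots[slotFor(key)];
        if (chain == null) return null;
        for (Entry<K, V> e : chain) {   // linear scan of the colliding entries
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    boolean remove(K key) {
        LinkedList<Entry<K, V>> chain = slots[slotFor(key)];
        if (chain == null) return false;
        for (Iterator<Entry<K, V>> it = chain.iterator(); it.hasNext(); ) {
            if (it.next().key.equals(key)) { it.remove(); return true; }
        }
        return false;
    }
}
```

When every key lands in its own slot, each operation touches a single entry. When many keys share one slot, each operation degrades to a scan of that slot's list, which is the weakness a complexity attack exploits.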
