Last week, I reviewed a seemingly simple patch to an Apache Crunch pipeline. Functional details don't matter, but the pipeline processes loads of hits from HTTP proxies. Hits have a timestamp, and the patch aims to enrich every hit with some metadata indexed by user and business date (i.e. we query a repository with a composite key, part of which is the date formatted as a yyyyMMdd string).
Functional requirements are clear, the code is simple, and the repository is more than fast enough not to affect the performance profile of the pipeline.
However, while QA'ing the patch I noticed an unexpected slowdown. Since our end-to-end tests are configured to always profile a few map & reduce tasks with HPROF, it was easy to generate a CPU Flame Graph, which showed that the following line accounted for ~25% of the on-CPU time:
String date = localDate.format(DateTimeFormatter.BASIC_ISO_DATE);
According to its Javadoc, BASIC_ISO_DATE is exactly what we should use to format a LocalDate as a yyyyMMdd string:
The ISO date formatter that formats or parses a date without an offset, such as '20111203'.
This returns an immutable formatter capable of formatting and parsing the ISO-8601 basic local date format. The format consists of:
- Four digits for the year. Only years in the range 0000 to 9999 are supported.
- Two digits for the month-of-year. This is pre-padded by zero to ensure two digits.
- Two digits for the day-of-month. This is pre-padded by zero to ensure two digits.
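To make the Javadoc excerpt concrete, here is a minimal sketch of the formatter applied to the Javadoc's own sample date. The class name BasicIsoDateExample and the helper toBusinessDate are hypothetical names used only for illustration; they are not taken from the pipeline's code.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class BasicIsoDateExample {

    // Hypothetical helper: formats a date as a compact yyyyMMdd string
    // using the built-in BASIC_ISO_DATE formatter.
    static String toBusinessDate(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static void main(String[] args) {
        // Same date as the '20111203' example in the Javadoc.
        LocalDate date = LocalDate.of(2011, 12, 3);
        System.out.println(toBusinessDate(date)); // prints 20111203
    }
}
```

A LocalDate carries no offset, so the output is just the eight digits.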
HPROF is far from …