Last week, I reviewed a seemingly simple patch to an Apache Crunch pipeline. Functional details don’t matter, but the pipeline processes loads of hits from HTTP proxies. Hits have a timestamp and the patch aims to enrich every hit with some metadata indexed by user and business date (ie. we query a repository with a composite key, part of it is the date formatted as a yyyyMMdd
string).
Functional requirements are clear, the code is simple and the repository is more than fast enough to not affect the performance profile of the pipeline
However, while QA’ing the patch I noticed an unexpected slowdown. Since our end-to-end tests are configured to always profile a few map & reduce tasks with HPROF, it was easy to generate a CPU Flame Graph which showed that the following line accounts for ~25% of the on CPU time:
String date = localDate.format(DateTimeFormatter.BASIC_ISO_DATE);
According to its Javadoc BASIC_ISO_DATE
is exactly what we should use to format a LocalDate
as a yyyyMMdd
string:
The ISO date formatter that formats or parses a date without an offset, such as ‘20111203’.
This returns an immutable formatter capable of formatting and parsing the ISO-8601 basic local date format. The format consists of:
- Four digits for the year. Only years in the range 0000 to 9999 are supported.
- Two digits for the month-of-year. This is pre-padded by zero to ensure two digits.
- Two digits for the day-of-month. This is pre-padded by zero to ensure two digits.
HPROF is far from perfect and has many flaws, see Evaluating the Accuracy of Java Profilers or Profilers are lying Hobbits from Nitsan Wakart. However, HPROF support is built-in in Hadoop map reduce and it would be stupid to not use it if you haven’t invested in better tooling yet. When using HPROF I don’t expect to get an accurate view of the system. Here, it is very likely that the weight of the format
call is overestimated due to sampling bias(es). However, it tells me that something is likely wrong and requires investigation.
Before digging into code & benchmarks let’s start with a quick back-of-the-envelope calculation. HPROF is configured to take a sample every 9ms and reports 3912 samples in LocalDate.format
. That’s ~35 seconds. This mapper processes ~500k records. It means that each call to format
cost ~70μs. We also know that the same job from the master branch takes ~90 seconds to process ~500k records, that’s ~180μs per record. We will use these ballpark figures to check our benchmarks later.
Now let’s observe the relevant part of the Flame Graph:
According to the Flame Graph, a significant amount of time is spent filling in stack traces. I did not expect this code path to throw exceptions but digging into JSR 310 source code learned me that DateTimeFormatter.BASIC_ISO_DATE
is defined as follows:
public static final DateTimeFormatter BASIC_ISO_DATE;
static {
BASIC_ISO_DATE = new DateTimeFormatterBuilder()
.parseCaseInsensitive()
.appendValue(YEAR, 4)
.appendValue(MONTH_OF_YEAR, 2)
.appendValue(DAY_OF_MONTH, 2)
.optionalStart()
.appendOffset("+HHMMss", "Z")
.toFormatter(ResolverStyle.STRICT, IsoChronology.INSTANCE);
}
While we use this formatter to format a LocalDate
which has no ZoneOffset
information, it would also print the ZoneOffset
if available (ie. formatting a ZonedDateTime
). Digging further into the code shown me that this optional segment is responsible for an exception being thrown on each invocation:
// From DateTimeFormatterBuilder
static final class OffsetIdPrinterParser implements DateTimePrinterParser {
@Override
public boolean format(DateTimePrintContext context, StringBuilder buf) {
Long offsetSecs = context.getValue(OFFSET_SECONDS);
if (offsetSecs == null) {
return false;
}
int totalSecs = Math.toIntExact(offsetSecs);
[...]
// From DateTimePrintContext
Long getValue(TemporalField field) {
try {
return temporal.getLong(field);
} catch (DateTimeException ex) {
if (optional > 0) {
return null;
}
throw ex;
}
}
Basically, DateTimeFormatter
uses exceptions for flow control! This is bad because exceptions are slow when… nonexceptional. In our case, the exceptional branch is always taken. It is even worse because exception performance depends on stack depth and Apache Crunch often involve deep stack traces. Because DoFn
s are chained using method calls, each DoFn
adds three frames to the stack. It is not uncommon to see 128+ frames deep stacks.
With the help of a JMH benchmark, we will now try to discover the real cost of using BASIC_ISO_DATE
to format a LocalDate
:
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class BasicIsoDateBenchmark {
@Param(value = {"0", "1", "2", "4", "8", "16", "32", "64", "128", "256"})
private int stackDepth;
@Param(value = {"1", "100", "10000"})
private int batchSize;
@Benchmark
public LocalDate baseline() {
return call(stackDepth, LocalDate::now);
}
@Benchmark
public String basicIsoDate() {
return call(
stackDepth,
() -> LocalDate.now().format(DateTimeFormatter.BASIC_ISO_DATE));
}
private <T> T call(int depth, Supplier<T> supplier) {
if (depth == 0) {
T t = null;
for (int i = 0; i < batchSize; i++) {
t = supplier.get();
}
return t;
}
return recursive(depth - 1, supplier);
}
}
- Because we expect the stack depth to play a key role in the performance profile, we will run each benchmark at various stack depths.
baseline
will tell us the cost of doing nothing, which is defined here by going down to a specific stack depth and invokingLocalDate.now()
. It will allow us to control that our measurements are valid.basicIsoDate
does the same thing as thebaseline
, but also formats theLocalDate
usingBASIC_ISO_DATE
. Subtracting the result of thebaseline
from it will tell us the cost of formatting ofLocalDate
.- Because getting at the desired stack depth could be expensive compared to the cost of formatting a date,
batchSize
allows amortizing the cost of the recursion. To avoid nasty surprises, looping should usually be left to the benchmarking framework rather than being part of the benchmark. However, here we don’t have a better option than to do it ourselves (and to be aware that HotSpot might optimize this loop away). Hopefully, we won’t have to use this parameter.
Following graph compare the median cost per operation of baseline
to basicIsoDate
at various stackDepth and batchSize = 1:
baseline
is always faster than 1μs/op.basicIsoDate
is several times slower thanbaseline
.- The difference between
basicIsoDate
andbaseline
gives us the cost of formatting a date. It ranges from ~3μs at stackDepth = 1 to ~18μs at stackDepth = 256.
The previous graph showed the median cost per operation. As a safety check, following graph shows the same at p50.0, p90.0, and p99.0:
Nothing fancy here. We can now affirm that, on average, BASIC_ISO_DATE
costs more than 10μs to format a LocalDate
as soon as the stack is deeper than ~128 frames. However, it is not yet possible to say that exception being used for flow control is the cause of the slow down. Using a good profiler like perfasm
, we could analyze where time is spent. Or we could write benchmarks to discover the real cost of exception creation and handling. This is tough work. Luckily, Aleksey Shipilёv already did that in The Exceptional Performance of Lil’ Exception. As usual, Aleksey did a great job to explain what is going on under the hood and how to craft valid benchmarks. If you haven’t read it yet, GO READ IT NOW ! And then read it again.
In his article, Aleksey states that instantiating an exception takes on average ~4μs when the stack is 32 frames deep, ~6μs at 64, ~11μs at 128 and ~21μs at 256. His numbers are quite close to what we observed for basicIsoDate
.
While 10μs may seem a short amount of time for a human, computers are insanely fast. We should be able to format way more than 100k LocalDate per second. Even when the stack is more than 128 frames deep. Since the exception occurs in the optional part of BASIC_ISO_DATE
, we can hope that if we define our own format without the optional part no exception will be thrown (and get our 10μs back):
private static final DateTimeFormatter formatter =
DateTimeFormatter.ofPattern("yyyyMMdd");
@Benchmark
public String ofPattern() {
return recursive(stackDepth, () -> LocalDate.now().format(formatter));
}
ofPattern
is much faster than basicIsoDate
and is seems quite close to the baseline
. We are now at least an order of magnitude faster.
Let’s zoom in and compare the cost of ofPattern
VS baseline
. Since costs look quite close and that call
might be more expensive than format
, we will also display the graphs at different batch sizes:
The cost of formatting a LocalDate
with a custom pattern at any stack depth is a few tens of nanoseconds, few hundreds at worst. That’s one to three orders of magnitude faster than BASIC_ISO_DATE
.
I could refine my graphs or provide the raw JMH output to get more accurate numbers but don’t really need to. With these few simple steps I learned that:
BASIC_ISO_DATE
is slow because it uses exceptions for flow control and defining your own pattern without an optional part can be ways faster.- Depending on the stack depth, I can expect
BASIC_ISO_DATE
to take 4..30μs to format aLocalDate
- While HPROF overestimated the issue, it allowed me to spot it. Initial back-of-the-envelope computation based on HPROF’s numbers estimated the cost of
format
to be ~70μs with a ~100 frames deep stack. JMH benchmarks say that the real cost is closer to ~10μs. This ballpark figure is confirmed by the elapsed time (ie. job is now 5..10% faster, not 40%). - Design or clean code considerations set apart, using exceptions for flow control is still a really bad idea.
- Aleksey Shipilev writes awesome blog posts, but I already knew that ;)