Introduction

In the last post, we saw a very naive method to benchmark Java code (Measuring code execution time – The wrong way). It consisted of running the code in a tight loop, measuring how long each invocation took, and then averaging the time across all the invocations. As we saw, one of the primary issues with such an approach is that the runtime numbers fluctuate, and we include all of them when averaging them out to arrive at a single number (say, in milliseconds).

We also noted that the runtime numbers tend to settle down after some initial number of iterations. Let us take that into account and improve the earlier method to arrive at a better (but still naive) approach to benchmarking in Java. We will also see why this is still not good enough.

Warmup and invocation

Apart from the fluctuation, we noted that the numbers stabilized towards the end of the iterations, reporting nearly identical runtimes. We will now ignore the numbers reported by the first few iterations.

Our earlier benchmark ran n times (say 10,000). Now, we divide the benchmarking into two steps - warmup and actual run. Each step runs a predetermined number of times. Let us say the warmup runs k times and the actual run m times (say k = 500 and m = 10,000). Thus, the total number of iterations (in which our code is executed) will be k + m. We ignore the runtime numbers from the first k iterations as those are for warmup. As the name suggests, the warmup lets the JVM do all its optimizations, and we expect to arrive at a state where each run takes almost the same amount of time thereafter.

Our new benchmarking framework

Here’s our new benchmarking framework code to run any piece of code.

public static void bench(String name, long runMillis, int loop,
                       int warmupIterations, int actualIterations, Runnable runnable) {
    System.out.println("Running benchmark: " + name);
    int totalIterations = actualIterations + warmupIterations;
    long average = 0L;
    long totalThroughput = 0L;
    for (int i = 0; i < totalIterations; i++) {
        long count = 0;
        long duration = 0L;
        long start = System.currentTimeMillis();
        while (duration < runMillis) {
            for (int j = 0; j < loop; j++) {
                runnable.run();
                count++;
            }
            duration = System.currentTimeMillis() - start;
        }
        long throughput = count / duration;
        boolean actualRun = i >= warmupIterations;
        if (actualRun) {
            totalThroughput = totalThroughput + throughput;
        }
        System.out.print(throughput + " ops/ms" + (!actualRun ? " (warmup) | " : " | "));
    }
    average = totalThroughput / actualIterations;
    System.out.println();
    System.out.println("[ ~" + average + " ops/ms ]");
}

It takes six parameters. The name is a string identifying what we are benchmarking; it can be anything and is there just so we can recognize the results printed on the screen. The last parameter, a Runnable, is the actual code we want to benchmark: the code to run is wrapped as a Runnable and passed to this function.

The runMillis parameter is the time in milliseconds to run each iteration (both warmup and actual). The loop parameter is the number of invocations done in each batch before the elapsed time is re-checked. warmupIterations and actualIterations represent the number of warmup and actual runs (our k and m parameters seen earlier).

Explanation of the benchmarking framework code

There is an outer loop that runs warmupIterations + actualIterations times (k + m). Inside each iteration, we check whether we are still below the runMillis bound. If yes, we run the code loop more times; otherwise we exit the inner while loop.

We track the number of times the code is executed in the variable count. After an iteration (lasting roughly runMillis) is over, we measure the throughput. Throughput is the amount of work done in a given period of time. Here, by dividing the total count by the duration, we get the number of times our code was executed per millisecond. If the run was part of the actual run, i.e., if the warmup iterations are over, we add the throughput to the variable totalThroughput.
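For example (with illustrative numbers), if the code was executed 30,000,000 times within a measured duration of 1,000 ms, the reported throughput would be 30,000,000 / 1,000 = 30,000 ops/ms.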

After printing the throughput for each iteration, we find the average by dividing totalThroughput by actualIterations. This gives the average throughput of our code. Thus, a throughput figure is printed for every iteration of runMillis duration, but only the numbers from the actual run contribute to the average.
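For example, if the five measured iterations report 28,000, 30,000, 31,000, 29,000 and 32,000 ops/ms, then totalThroughput is 150,000 and the reported average is 150,000 / 5 = 30,000 ops/ms.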

Code to benchmark

Let us write three pieces of code to benchmark. The distance method calculates the Euclidean distance between two points (x1, y1) and (x2, y2). The constant method takes the same parameters but simply returns a constant. We also include dummy code that does nothing, expressed as an empty Runnable.

static double distance(
            double x1, double y1,
            double x2, double y2) {
    double dx = x2 - x1;
    double dy = y2 - y1;
    return Math.sqrt((dx * dx) + (dy * dy));
}
static double constant(
        double x1, double y1,
        double x2, double y2) {
    return 0.0d;
}

We have the following constants:

static final long RUN_MILLIS = 1000;
static final int REPEAT = 5;
static final int WARMUP = 2;
static final int LOOP = 10_000;

Each benchmark thus runs 7 times in total, with the first two iterations as warmup. runMillis is 1 second and the inner loop executes 10,000 times per batch.

Running the benchmark

Let us benchmark the three pieces of code, viz., distance, constant and nothing.

bench("distance", RUN_MILLIS, LOOP, WARMUP, REPEAT, () ->
        distance(0.0d, 0.0d, 10.0d, 10.0d));
bench("constant", RUN_MILLIS, LOOP, WARMUP, REPEAT, () ->
        constant(0.0d, 0.0d, 10.0d, 10.0d));
bench("nothing", RUN_MILLIS, LOOP, WARMUP, REPEAT, () -> {});

A random execution output:

Running benchmark: distance
33183570 ops/ms (warmup) | 31459880 ops/ms (warmup) | 28700850 ops/ms | 25950620 ops/ms | 33279970 ops/ms | 32907970 ops/ms | 26520830 ops/ms | 
[ ~29472048 ops/ms ]
Running benchmark: constant
1002890 ops/ms (warmup) | 519420 ops/ms (warmup) | 531430 ops/ms | 544130 ops/ms | 513980 ops/ms | 524960 ops/ms | 546530 ops/ms | 
[ ~532206 ops/ms ]
Running benchmark: nothing
344450 ops/ms (warmup) | 331300 ops/ms (warmup) | 324600 ops/ms | 347570 ops/ms | 347390 ops/ms | 357280 ops/ms | 311780 ops/ms | 
[ ~337724 ops/ms ]

Looking at this, we see that nothing has the worst throughput and constant still has a lower throughput than distance. We would expect it to be the other way around.

Changing the benchmark order

Let us change the order of the runs, running constant, then distance, and then nothing.

bench("constant", RUN_MILLIS, LOOP, WARMUP, REPEAT, () ->
        constant(0.0d, 0.0d, 10.0d, 10.0d));
bench("distance", RUN_MILLIS, LOOP, WARMUP, REPEAT, () ->
        distance(0.0d, 0.0d, 10.0d, 10.0d));
bench("nothing", RUN_MILLIS, LOOP, WARMUP, REPEAT, () -> {});

The sample numbers reported are:

Running benchmark: constant
15935220 ops/ms (warmup) | 18967240 ops/ms (warmup) | 29676300 ops/ms | 33200080 ops/ms | 29036380 ops/ms | 33229000 ops/ms | 30466650 ops/ms | 
[ ~31121682 ops/ms ]
Running benchmark: distance
1028400 ops/ms (warmup) | 517860 ops/ms (warmup) | 543780 ops/ms | 551770 ops/ms | 542870 ops/ms | 517010 ops/ms | 536530 ops/ms | 
[ ~538392 ops/ms ]
Running benchmark: nothing
331910 ops/ms (warmup) | 348830 ops/ms (warmup) | 358050 ops/ms | 350520 ops/ms | 353090 ops/ms | 340910 ops/ms | 343210 ops/ms | 
[ ~349156 ops/ms ]

We continue to see the throughput drop from the first benchmark to the second to the third (first > second > third), regardless of the order in which we run them.

Think about what would have happened if we had executed nothing, constant and distance in that order. We would have been satisfied with the results and concluded that this approach to benchmarking actually works!

Problem 1 - Isolating runs

The above numbers indicate that running multiple benchmarks in the same JVM is problematic. The culprit here is the Runnable. We have been using three different Runnable instances, all invoked through the same runnable.run() call site inside bench. While the first benchmark runs, the JVM speculates that runnable.run() always dispatches to the same target class and optimizes the call site accordingly. When we start the second benchmark, the JVM has to do considerable work to deoptimize the generated code and produce new code that can dispatch to either of the two runnables. This contributes to the slowness of the second benchmark, irrespective of which method it runs.

A similar thing happens when we start the third benchmark. But the HotSpot VM can generate such specialized native code for only up to two different receiver types (two runnables here), so it then falls back to a generic dispatch mechanism.
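To see why sharing a single call site matters, here is a minimal, self-contained sketch (the class and variable names are illustrative and not from the original post). Each lambda compiles to its own Runnable class, so the single r.run() call site inside callSite sees three different receiver types as the program progresses:

public class DispatchSketch {
    static void callSite(Runnable r) {
        r.run(); //the JIT profiles the concrete Runnable types seen at this call site
    }

    public static void main(String[] args) {
        Runnable first = () -> Math.sqrt(200.0d);           //stands in for the distance benchmark
        Runnable second = () -> { double unused = 0.0d; };  //stands in for constant
        Runnable third = () -> {};                          //stands in for nothing

        //Only one type seen so far: the call site is monomorphic and run() can be inlined
        for (int i = 0; i < 1_000_000; i++) callSite(first);
        //A second type appears: the JVM deoptimizes and recompiles with a two-way type check
        for (int i = 0; i < 1_000_000; i++) callSite(second);
        //A third type exceeds what HotSpot specializes for: generic dispatch from here on
        for (int i = 0; i < 1_000_000; i++) callSite(third);
    }
}

Giving each benchmark its own dedicated method and call site, as we do next, avoids polluting the call-site profile in this way.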

Separate benchmark methods

For the sake of understanding and verification, let us remove the runnable parameter and move the code to benchmark into the bench method itself. Keeping the problem of duplication aside for now, I have created three benchmark methods, one for each piece of code we are benchmarking.

Constant:


public static void benchConstant(long runMillis, int loop,
                         int warmupIterations, int actualIterations) {
    System.out.println("Running benchmark constant:");
    int totalIterations = actualIterations + warmupIterations;
    long average = 0L;
    long totalThroughput = 0L;
    for (int i = 0; i < totalIterations; i++) {
        long count = 0;
        long duration = 0L;
        long start = System.currentTimeMillis();
        while (duration < runMillis) {
            for (int j = 0; j < loop; j++) {
                constant(0.0d, 0.0d, 10.0d, 10.0d); //Inlined method call
                count++;
            }
            duration = System.currentTimeMillis() - start;
        }
        long throughput = count / duration;
        boolean actualRun = i >= warmupIterations;
        if (actualRun) {
            totalThroughput = totalThroughput + throughput;
        }
        System.out.print(throughput + " ops/ms" + (!actualRun ? " (warmup) | " : " | "));
    }
    average = totalThroughput / actualIterations;
    System.out.println();
    System.out.println("[ ~" + average + " ops/ms ]");
}

To keep things brief, only the relevant inner-loop part is shown for the other two:

Distance:

for (int j = 0; j < loop; j++) {
    distance(0.0d, 0.0d, 10.0d, 10.0d);
    count++;
}

Nothing:

for (int j = 0; j < loop; j++) {
    //do nothing
    count++;
}

Running the benchmarks in the following order

benchConstant(RUN_MILLIS, LOOP, WARMUP, REPEAT);
benchDistance(RUN_MILLIS, LOOP, WARMUP, REPEAT);
benchNothing(RUN_MILLIS, LOOP, WARMUP, REPEAT);

Now, the numbers look like

Running benchmark constant:
29620480 ops/ms (warmup) | 36919880 ops/ms (warmup) | 42523500 ops/ms | 42998690 ops/ms | 42580150 ops/ms | 42870090 ops/ms | 39735140 ops/ms | 
[ ~42141514 ops/ms ]
Running benchmark distance:
35833800 ops/ms (warmup) | 36378820 ops/ms (warmup) | 30224375 ops/ms | 37577800 ops/ms | 40702880 ops/ms | 41429500 ops/ms | 41688580 ops/ms | 
[ ~38324627 ops/ms ]
Running benchmark nothing:
42325980 ops/ms (warmup) | 42107400 ops/ms (warmup) | 39762090 ops/ms | 41850260 ops/ms | 42648280 ops/ms | 42373150 ops/ms | 40666430 ops/ms | 
[ ~41460042 ops/ms ]

Still, constant and nothing are only slightly better than distance, and the throughput of constant is only a little higher than that of nothing.

Problem #2 - Dead code

The flaw in the earlier code is that the return values from distance and constant weren't being used. Since the methods also have no side effects, like printing or updating a variable, the JVM could perform dead-code elimination and remove the calls altogether. Let us modify the code to capture the return value and assign it to a static field (to verify this assumption).

//Introduce a couple of static variables
private static double distanceResult;
private static double constantResult;

Assign the method return values to these variables:

distanceResult = distance(0.0d, 0.0d, 10.0d, 10.0d);

constantResult = constant(0.0d, 0.0d, 10.0d, 10.0d);

Running produces results that look like:

Running benchmark constant:
26243140 ops/ms (warmup) | 31346830 ops/ms (warmup) | 31713300 ops/ms | 31833370 ops/ms | 30791640 ops/ms | 26156690 ops/ms | 29175830 ops/ms | 
[ ~29934166 ops/ms ]
Running benchmark distance:
32684780 ops/ms (warmup) | 32769140 ops/ms (warmup) | 30256250 ops/ms | 31462930 ops/ms | 32045460 ops/ms | 32808940 ops/ms | 32855230 ops/ms | 
[ ~31885762 ops/ms ]
Running benchmark nothing:
41600420 ops/ms (warmup) | 40866500 ops/ms (warmup) | 40338580 ops/ms | 40310600 ops/ms | 40955160 ops/ms | 41445580 ops/ms | 42577570 ops/ms | 
[ ~41125498 ops/ms ]

nothing now has the highest throughput, followed by distance and then constant.

How is constant still appearing to be slower than distance?

Problem #3 - Constant folding

The arguments we pass and the value returned by the constant method are compile-time constants and hence don't change between invocations. The virtual machine can therefore perform constant folding, and the method call may not actually happen every time.
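Conceptually (a simplified illustration, not actual JIT output), after inlining and constant folding the benchmarked statement can collapse to something like this:

//Before: the call with literal constant arguments
constantResult = constant(0.0d, 0.0d, 10.0d, 10.0d);
//After inlining and constant folding, effectively just a constant store
constantResult = 0.0d;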

How do we prevent this? Instead of passing literal constants, let us call distance and constant with arguments read from static variables (like distanceResult = distance(a, b, c, d);), so the JIT cannot fold them away.
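A minimal sketch of that change (the original post does not show this code, so the field names a, b, c and d are illustrative):

//Non-final static fields - the JIT cannot assume these hold compile-time constants
private static double a = 0.0d;
private static double b = 0.0d;
private static double c = 10.0d;
private static double d = 10.0d;

//Inside the benchmark loops, the arguments now come from the fields
distanceResult = distance(a, b, c, d);
constantResult = constant(a, b, c, d);

With the arguments read from static fields, the numbers change to something like the following (but not always):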

Running benchmark constant:
22834090 ops/ms (warmup) | 27116440 ops/ms (warmup) | 27574240 ops/ms | 30435040 ops/ms | 32922930 ops/ms | 29373350 ops/ms | 30817380 ops/ms | 
[ ~30224588 ops/ms ]
Running benchmark distance:
31483640 ops/ms (warmup) | 31958920 ops/ms (warmup) | 31439330 ops/ms | 30290550 ops/ms | 26243510 ops/ms | 21762890 ops/ms | 30465080 ops/ms | 
[ ~28040272 ops/ms ]
Running benchmark nothing:
40128090 ops/ms (warmup) | 41957190 ops/ms (warmup) | 31279280 ops/ms | 40729830 ops/ms | 37859870 ops/ms | 39615330 ops/ms | 40597810 ops/ms | 
[ ~38016424 ops/ms ]

This gives relative throughput numbers in the expected order (distance < constant < nothing). But the results still vary if you change the order of the benchmark calls, or even from run to run.

Summary

In this post, we built on the very naive benchmarking approach from the previous post to create a better (but still naive) utility for benchmarking in Java. We saw three problems (isolating runs, dead-code elimination and constant folding) and how they affect the benchmark numbers. The last version tried to avoid these but is still not a fool-proof benchmarking methodology.

In the next post, we will look at the Java Microbenchmark Harness (JMH), a framework for building, running and analysing benchmarks in Java.

Reference

This post was based on Avoiding Benchmarking Pitfalls on the JVM from Oracle.