Benchmarking is hard

I’ve seen a lot of people create misleading benchmarks, and sometimes I even create “biased” benchmarks myself. In this post I will try to explain how measuring things the wrong way can make you believe that a certain technique is faster, when in a real-life application it may have the opposite effect, or far less significant gains.

Common Pitfalls

  • Always measuring with the same data.
  • Performing operations that have no side effects (and are therefore easy for the engine to optimize away).
  • Testing operations per second even when the operation will be executed only once or at large intervals (e.g. code that will never be called inside a loop).
  • Measuring the wrong thing.
  • Testing code outside of a real application.
  • Using a different kind or amount of data than a real application will handle (ordered/unordered items, small/large amounts of data, mixed/equal values, etc.).
  • Treating assumptions as valid without verifying that they really are.
  • Running tests only a few times and on a single environment.


Do the opposite of the pitfalls.

Always test with different kinds of data, and try to avoid the smart optimizations done automatically by the browser/VM/engine. If you reuse the same data or execute the same operation multiple times, the JS engine may be heavily caching the data, or not even performing the operation at all (since it can detect that the operation will return the same value every time, or that it has no side effects).

If the JIT is doing some trick to improve performance, you may not be testing the performance of your algorithm, but only how smart the JIT is at skipping steps that cause no side effects. In a real application the results may be very different (if not the opposite). I’ve seen it happen multiple times, and this kind of error can be very hard to spot.

Try to prove that your results are accurate and that nothing else could be skewing them. One widely adopted technique is to discard the best and worst results and average the remaining runs, which gives a better picture of how the algorithm usually performs.
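The trimming step described above can be sketched like this (the function name `trimmedMean` is mine, not from any library):

```javascript
// drop the single best and worst sample, then average the rest;
// assumes the array holds at least 3 timing samples
function trimmedMean(samples) {
  var sorted = samples.slice().sort(function (a, b) { return a - b; });
  var trimmed = sorted.slice(1, sorted.length - 1);
  var sum = trimmed.reduce(function (acc, v) { return acc + v; }, 0);
  return sum / trimmed.length;
}
```

For example, `trimmedMean([12, 9, 11, 10, 58])` ignores the outliers 9 and 58 and averages the remaining three samples, so a single anomalous run doesn’t distort the result.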

Basic example

“Bad” test:

var n = 1000;
var result;
for (var i = 0; i < n; i++) {
  // Math.floor will always return the same value;
  // if the engine is smart enough it can simply pre-calculate the
  // Math.floor result and even skip the loop altogether
  result = Math.floor(2.34567);
}

“Good” test:

// preparation code ====
var n = 1000;
var items = [];
// we generate an array with different values to avoid an "overly smart" JIT
for (var i = 0; i < n; i++) {
  items.push( Math.random() * 100 );
}
n = items.length;
// result stored on the global scope, so it won't be garbage collected
// and the value is forced to be calculated; if it were inside a closure
// where no code could reach it, the JIT could simply ignore it
var result = 0;

// we only benchmark this loop ===
// using different inputs and accumulating the result ensures that the engine
// can't simply bypass the loop and just use the value of the last iteration
for (var i = 0; i < n; i++) {
  result += Math.floor(items[i]);
}
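To actually time a loop like the one above across multiple runs, a minimal harness could look like the sketch below (the names `runBenchmark` and `globalResult` are mine; I use `Date.now()` for portability, though `performance.now()` offers higher resolution in browsers):

```javascript
// minimal sketch of a timing harness: run the measured code several times,
// record each run's duration, and keep the result reachable so the engine
// can't discard the work as dead code
var globalResult;

function runBenchmark(fn, runs) {
  var durations = [];
  for (var r = 0; r < runs; r++) {
    var start = Date.now();
    globalResult = fn(); // stored globally so the call has a side effect
    durations.push(Date.now() - start);
  }
  return durations;
}

var timings = runBenchmark(function () {
  var total = 0;
  for (var i = 0; i < 100000; i++) {
    total += Math.floor(Math.random() * 100);
  }
  return total;
}, 10);
```

The `timings` array can then be fed to whatever aggregation you prefer, such as the trimmed average discussed earlier.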

This is the benchmark that motivated me to finally write this post. (see old revisions for completely skewed data)

And always remember:

“Your performance improvements are only as good as your benchmarks.” - unknown

Further reading