How not to make benchmarks?
A couple of days ago Fabien Potencier, the lead developer of the Symfony framework, published two posts on his blog about template engines, introducing the Twig template engine. They contain a benchmark in which the author claims that Twig is the fastest and least memory-hungry template engine. When I first saw it, I even believed it… until I looked at the test procedure. This led me to write a note on why we should check benchmarks twice before we believe them, and what tricks can be used to prove absolutely anything.
What was wrong with Fabien's test procedure?
I will start with a case study. Fabien's benchmark contains just one test, which uses a few language features, primarily template inheritance and the inclusion of external files. The measured quantity was the template execution time, with the execution repeated 10,000 times. PHP completed this benchmark in 2.4 s, Twig in 3 s, and Smarty in 12.6 s. As I created a template engine benchmark a couple of months ago, I decided to check Twig there, too. I rewrote Fabien's test case for it and compared the results to my OPT:
| Test | OPT | Twig |
|---|---|---|
| Initialization time | 0.0045 s | 0.002 s |
| Total test time | 0.0055 s | 0.005 s |
| Test time without initialization | 0.001 s | 0.003 s |
It was quite good. Although the initialization takes noticeably longer in OPT, the templates themselves are executed faster. I was very surprised when my brother came to me and said that he had downloaded Fabien's benchmark, added OPT to it, and found it to be 6 times slower than Twig. It was obvious that something was wrong, because exactly the same test case should not give such different results in two benchmarks.
Both benchmarks execute the test several times (mine 10 times, Fabien's 10,000 times). However, when I looked at Fabien's code, I found the reason:
for ($i = 0; $i < 10000; $i++)
{
    $template->render($params);
}
This is how Fabien did the test repetition. And this is why Twig was so good:
<?php
/* test7_base.tpl */
class __TwigTemplate_769d3687ca96c97ebe0b2b228d02b6a5 extends Twig_Template
{
    public function display($context)
    {
        /// etc...
Of course, PHP cannot load the same class 10,000 times; it loads it once and simply reuses it later. OPT, by contrast, compiles templates to plain PHP files, which must be physically included in each loop iteration:
Hello, <?php echo $this->_data['name']; ?>
<?php $_sectarray_vals = &$this->_data['array']; if(is_array($_sectarray_vals) && ($_sectarray_cnt = sizeof($_sectarray_vals)) > 0){ for($_sect1_i = 0; $_sect1_i < $_sectarray_cnt; $_sect1_i++){ ?>
* <?php echo htmlspecialchars($_sectarray_vals[$_sect1_i]); ?>
<?php } } ?>
The difference between the benchmarks was that mine provided process isolation: each of the 10 iterations was a separate HTTP request, and the total time was the average of the collected times. What Fabien's benchmark actually proves is that Twig can render the same template 10,000 times much faster than most template engines. But have you ever seen a website that needed to include the same file 10,000 times in a single HTTP request? I haven't.
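The effect can be reproduced in miniature. The following is only a sketch with two throwaway files standing in for compiled templates; `DemoCompiledTemplate` and the `loads` counter are hypothetical names, and neither file is real Twig or OPT output. It counts how many times each file is actually executed inside one process:

```php
<?php
// Counter of how often each "compiled template" file is executed.
$GLOBALS['loads'] = ['plain' => 0, 'class' => 0];

// Plain PHP template (illustrative stand-in for a file-based engine).
$plainFile = tempnam(sys_get_temp_dir(), 'tpl');
file_put_contents(
    $plainFile,
    '<?php $GLOBALS["loads"]["plain"]++; ?>Hello, <?php echo $name; ?>'
);

// Class-based template (illustrative stand-in for a class-based engine).
$classFile = tempnam(sys_get_temp_dir(), 'tpl');
file_put_contents($classFile, '<?php $GLOBALS["loads"]["class"]++;
class DemoCompiledTemplate {
    public function render(array $context) {
        return "Hello, " . $context["name"] . "\n";
    }
}');

$name = 'World';
for ($i = 0; $i < 3; $i++) {
    // The plain file has to be included, and so executed, on every iteration.
    ob_start();
    include $plainFile;
    $plainOutput = ob_get_clean();

    // The class file is loaded on the first iteration only; afterwards
    // the already declared class is reused from memory.
    require_once $classFile;
    $template = new DemoCompiledTemplate();
    $classOutput = $template->render(['name' => $name]);
}

printf(
    "plain file executed %d times, class file executed %d time(s)\n",
    $GLOBALS['loads']['plain'],
    $GLOBALS['loads']['class']
);

unlink($plainFile);
unlink($classFile);
```

Within a single process the class file is read once no matter how many iterations follow, so a loop of this kind measures file loading for one engine and almost none for the other.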
The real nature of benchmarks
The biggest misunderstanding about benchmarks is treating them as an oracle. It would be very nice to have a benchmark that lets us say "this piece of software is faster than that one", but it is impossible. Take a look at template engines. They are very complex libraries that use different programming techniques, design patterns and algorithms to perform their task. The idea that a single test case can measure their total performance is a myth. In fact, benchmarks usually test a small part of reality, and nothing guarantees that the tested part is useful from a practical point of view. We have to watch out, because someone who wants to cheat can invent the most improbable scenario in which the featured product turns out to be faster, and manipulate the public in this way.
In my benchmark, I always warn that it tests only a few scenarios, and I try to select only scenarios that often occur in real applications. Moreover, I never describe the performance with just one word, faster or slower, but rather attempt to describe the nature of the software. Compare the sentences:
- Twig is faster than OPT.
- OPT is faster than Twig.
Neither of them can be proved, because one can find counterexamples for each. I prefer something like this:
Twig initialization is very fast, but the library produces quite complex output code for templates, which may affect the performance in more complex scenarios. In OPT, we have a quite demanding initialization procedure due to the size of the base file and the autoloader configuration, but the library attempts to produce very simple output code and to solve as many issues as possible during compilation.
Which description do you consider more useful to a real programmer from the performance point of view?
Benchmarking procedure
Good and realistic test cases are not enough. The second important thing is the benchmarking procedure. We cannot execute the test just once, because many factors affect the execution. If we repeat the test, we get a slightly different time, depending on the current system load, disk access times etc. This leads us to the conclusion that each test should be executed several times. The final result can be defined in many ways, e.g.:
- It is the sum of the elementary times
- It is the average of the elementary times
I personally like the statistical definition, as we can also calculate the standard deviation or, in other words, the variability of the results. A large standard deviation means that the test case is very sensitive to external factors, making the general performance harder to predict.
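As a minimal sketch of that statistical summary, the helper below computes the sum, the mean and the population standard deviation of a list of timings. The function name `summarizeTimings` and the sample values are made up for illustration; they are not taken from either benchmark.

```php
<?php
// Hypothetical helper: summarize per-run timings given in seconds.
function summarizeTimings(array $times): array
{
    $n = count($times);
    if ($n === 0) {
        throw new InvalidArgumentException('At least one timing is required.');
    }

    $sum = array_sum($times);
    $mean = $sum / $n;

    // Population standard deviation: square root of the mean squared
    // deviation from the mean.
    $squaredDiffs = 0.0;
    foreach ($times as $t) {
        $squaredDiffs += ($t - $mean) ** 2;
    }
    $stdDev = sqrt($squaredDiffs / $n);

    return ['sum' => $sum, 'mean' => $mean, 'stddev' => $stdDev];
}

// Made-up sample timings, for illustration only.
$summary = summarizeTimings([0.0050, 0.0055, 0.0060, 0.0052, 0.0058]);
printf(
    "sum: %.4f s, mean: %.4f s, stddev: %.5f s\n",
    $summary['sum'],
    $summary['mean'],
    $summary['stddev']
);
```

Reporting the standard deviation next to the mean makes it visible when two results differ by less than the noise of the measurement.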
However, as the case study at the beginning of the post shows, we also need to know how to repeat the tests. The most important rule is that we must provide reliable isolation between test iterations. In particular, the tested code must not be able to benefit from caching that does not occur in reality. Let's take a look. Template engines precompile the template code into a PHP script and use it later, bypassing the compilation stage. This is intended behaviour. Real-world applications also use the precompiled templates during normal operation, so nothing is wrong with allowing it in benchmarks. In fact, because the feature is so commonly used, we should allow it. On the other hand, we have situations like the one in Fabien's benchmark. Normally, a template engine cannot preload a class into memory and reuse it across the next 10,000 requests, but the benchmark allows it, thus making cheating possible. It should be obvious that template engines that can exploit this lack of isolation will win.
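One way to get this kind of isolation can be sketched as follows. This is an assumption-laden stand-in, not the procedure of my benchmark: it starts a fresh PHP command-line process per iteration instead of issuing a separate HTTP request, and the child script contains only a placeholder workload where the engine initialization and template rendering would go.

```php
<?php
// Child script: one isolated benchmark run that prints its own duration.
$child = tempnam(sys_get_temp_dir(), 'bench');
file_put_contents($child, '<?php
$start = microtime(true);
// ... initialize the template engine and render the template here ...
usleep(1000); // placeholder workload, for illustration only
printf("%.6F", microtime(true) - $start);
');

$times = [];
for ($i = 0; $i < 10; $i++) {
    // Every iteration gets a brand new PHP process, so no loaded class or
    // included file survives from the previous iteration.
    $output = [];
    $exitCode = 0;
    exec(escapeshellarg(PHP_BINARY) . ' ' . escapeshellarg($child), $output, $exitCode);
    if ($exitCode !== 0 || !isset($output[0])) {
        throw new RuntimeException('Benchmark run failed.');
    }
    $times[] = (float) $output[0];
}
unlink($child);

printf(
    "%d isolated runs, mean time: %.6f s\n",
    count($times),
    array_sum($times) / count($times)
);
```

The sketch assumes it is run from the command line, where `PHP_BINARY` points at the PHP executable. Precompiled templates stored on disk would still be reused between runs, which matches real-world behaviour, while anything held only in memory would not.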
Conclusion
As we can see, benchmarks can be very tricky. Even the smallest detail can change the final results drastically, and it is quite easy to cheat. This is why we should be wary of them, especially if they are not accompanied by a detailed description and the source code that lets us verify the procedures and check everything ourselves. And by the way, my benchmark can be found here.
October 21st, 2009 - 14:11
I was really curious of your opinion about Fabien Potencier’s advertisement of Twig.
Great and argumentative counter-post.
October 21st, 2009 - 19:09
Hmmm, that’s a complex look at that issue. It’s obvious that Fabien Potencier made a big mistake, but I think that he wanted to make a little bit of noise around this project. Maybe you could mail him or something to prove him wrong and make him put rectification on his site? I believe that Symfony won’t go in that way, because it’s really good framework. Only pros of it is that this framework doesn’t support OPT
I mean yet