Chris Mason: Application performance: regressions, controlling preemption

May 28, 2014

Participants: Andy Lutomirski, Davidlohr Bueso, Greg KH, Jan Kara, Josh Triplett, and Li Zefan.

People tagged: Fengguang Wu, Ingo Molnar, Jens Axboe, Jiri Kosina, Josef Bacik, Khalid Aziz, Mel Gorman, and Peter Zijlstra.

Chris Mason is in the middle of upgrading a number of systems from 2.6.38 and 3.2 to 3.10 and higher. Previous upgrades of this sort have uncovered multiple performance regressions, with slowdowns of 10%-30%. Chris expects to be done with the current effort in August, and would like to report results to Kernel Summit. To reduce the high context-switch rate, Chris is also interested in controlling preemption from user applications without having to move to the -rt kernel, and plans to experiment with improved preemption controls and userspace RCU.

Davidlohr Bueso expressed interest in this topic and a willingness to present his experience with various performance issues, as did Jan Kara. Greg KH disagreed, arguing that this discussion had taken place at many Kernel Summits over the years, but without any useful effect. Chris Mason countered that Intel had in fact listened and had a huge positive impact.

Chris also found it interesting that even with large performance improvements in many areas, regressions were still the order of the day when upgrading large workloads. In some cases, small .config changes took care of things, while in other cases improvements in one area partially mask regressions in other areas. Chris also noted that while he could not make any promises, he hoped to be able to tease out new benchmarks that could be run regularly. Jan Kara also argued that it is valuable to learn what in particular regressed this time. In addition, Jan said that SUSE was looking at doing more continuous testing, so learning what others are testing would be useful.

Josh Triplett asked if the regression-triggering workloads could be automated as useful benchmarks, then added to automated patch checkers such as Fengguang Wu's 0day setup. Davidlohr Bueso liked the idea of adding automated tests to 0day, proposing some from perf-bench, but noted that some regressions are triggered by proprietary software, by unusual hardware, and by workloads that are difficult to convert to benchmarks.

Li Zefan wondered if the regressions fixed by small .config changes were suppressing new features, which led him to ask if they were really regressions. Chris agreed, except in the case where the default .config choice slowed things down.

Andy Lutomirski wondered how much of the context-switch overhead was due to the actual context switch (and any associated interrupts) and how much was due to caching effects. Andy recommended a simple benchmark for the former case.
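
The summary does not say which benchmark Andy had in mind. As an illustration only, one common way to estimate the direct cost of a context switch is a one-byte ping-pong between two processes over a pair of pipes, with both processes pinned to the same CPU so that every hand-off forces a switch. The sketch below assumes Linux and Python 3.7 or later; the function name pingpong_round_trip_ns and the default round count are hypothetical, not taken from the discussion. Each round trip includes two context switches plus four system calls and interpreter overhead, so the result is an upper bound on switch cost rather than a precise figure. Because the working set is tiny, caching effects should contribute little.

```python
import os
import time


def pingpong_round_trip_ns(rounds: int = 10000) -> float:
    """Mean round-trip time in nanoseconds for a one-byte ping-pong
    between a parent and a forked child over two pipes (Linux only).

    Hypothetical helper for illustration; not the benchmark from the
    discussion.
    """
    # Pin to a single allowed CPU so parent and child must take turns.
    # The child inherits this affinity across fork().
    original_affinity = os.sched_getaffinity(0)
    os.sched_setaffinity(0, {min(original_affinity)})

    p2c_r, p2c_w = os.pipe()  # parent -> child
    c2p_r, c2p_w = os.pipe()  # child -> parent

    pid = os.fork()
    if pid == 0:
        # Child: echo each byte back until the parent closes its write end.
        try:
            os.close(p2c_w)
            os.close(c2p_r)
            while True:
                data = os.read(p2c_r, 1)
                if not data:  # EOF: parent is done
                    break
                os.write(c2p_w, data)
        finally:
            os._exit(0)  # never return into the caller's code in the child

    # Parent: close the ends it does not use, then time the ping-pong.
    os.close(p2c_r)
    os.close(c2p_w)
    try:
        start = time.perf_counter_ns()
        for _ in range(rounds):
            os.write(p2c_w, b"x")  # wake the child
            os.read(c2p_r, 1)      # block until the child answers
        elapsed = time.perf_counter_ns() - start
    finally:
        os.close(p2c_w)            # EOF tells the child to exit
        os.close(c2p_r)
        os.waitpid(pid, 0)
        os.sched_setaffinity(0, original_affinity)

    return elapsed / rounds


if __name__ == "__main__":
    per_round = pingpong_round_trip_ns()
    # Two switches per round trip, so halve for a rough per-switch bound.
    print(f"round trip: {per_round:.0f} ns, "
          f"per switch (upper bound): {per_round / 2:.0f} ns")
```

Running the same measurement with a larger per-process working set touched between hand-offs would give a rough view of the caching component, since the difference between the two runs is mostly cache refill cost.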