Ottawa Linux Symposium (OLS) Papers for 2009:


Autotest - Testing the Untestable - John Admanski, Steve Howard

Increased automated testing has been one of the most popular and beneficial trends in software engineering. Yet low-level systems such as the kernel and hardware have proven extremely difficult to test effectively, and as a result much kernel testing has taken place in a manual and relatively ad-hoc manner. Most existing test frameworks are designed to test higher-level software isolated from the underlying platform, which is assumed to be stable and reliable. Testing the underlying platform itself requires a completely new set of assumptions and these must be reflected in the framework's design from the ground up. The design must incorporate the machine under test as an important component of the system and must anticipate failures at any level within the kernel and hardware. Furthermore, the system must be capable of scaling to hundreds or even thousands of machines under test, enabling the simultaneous testing of many different development kernels each on a variety of hardware platforms. The system must therefore facilitate efficient sharing of machine resources among developers and handle automatic upkeep of the fleet. Finally, the system must achieve end-to-end automation to make it simple for developers to perform basic testing and incorporate their own tests with minimal effort and no knowledge of the framework's internals. At the same time, it must accommodate complex cluster-level tests and diverse, specialized testing environments within the same scheduling, execution and reporting framework.

Autotest is an open-source project that overcomes these challenges to enable large-scale, fully automated testing of low-level systems and detection of rare bugs and subtle performance regressions. Using Autotest at Google, kernel developers get per-checkin testing on a pool of hundreds of machines, and hardware test engineers can qualify thousands of new machines in a short time frame. This paper will cover the above challenges and present some of the solutions successfully employed in Autotest. It will focus on the layered system architecture and how that enables the distribution of not only the test execution environment but the entire test control system, as well as the leveraging of Python to provide simple but infinitely extensible job control and test harnesses, and the automatic system health monitoring and machine repairs used to isolate users from the management of the test bed.

Full paper (PDF)


Increasing memory density by using KSM - Andrea Arcangeli

With virtualization usage growing, the amount of RAM duplicated within the same host across different virtual machines, which may be running the same software or handling the same data, is growing at a fast pace too. KSM is a Linux kernel module that makes it possible to share identical anonymous memory across different processes and, in turn, across different KVM virtual machines. Thanks to the KVM design and the mmu notifier feature, KVM virtual machines are no different from any other process from the point of view of the Linux virtual memory subsystem, and all guest physical memory is allocated as regular Linux anonymous memory mappings. But KSM isn't just for virtual machines.
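
As a rough illustration of that last point, the sketch below shows how an ordinary process might opt an anonymous buffer into KSM merging. It assumes the madvise-based opt-in (MADV_MERGEABLE) and the sysfs knob mentioned in the comments, which are not described in this abstract.

    /* Minimal sketch: opt an anonymous buffer into KSM merging via madvise().
     * Assumes a kernel built with KSM and the MADV_MERGEABLE advice value. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 64 * 1024 * 1024;              /* 64 MB of anonymous memory */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        memset(buf, 0x42, len);                      /* many identical pages */

    #ifdef MADV_MERGEABLE
        /* Ask KSM to scan this range and merge identical pages. */
        if (madvise(buf, len, MADV_MERGEABLE) != 0)
            perror("madvise(MADV_MERGEABLE)");
    #endif

        /* The KSM scanner (if enabled, e.g. via /sys/kernel/mm/ksm/run) can now
         * collapse identical pages in this region into shared, write-protected
         * copies; a later write triggers copy-on-write as usual. */
        pause();
        return 0;
    }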

KSM's main task is to find identical pages in the system. To do that it uses two trees: the stable tree and the unstable tree. The stable tree contains only KSM-generated pages that are already shared and no longer change. The unstable tree contains pages that are not shared yet but are tracked by KSM.

The content of the pages inserted into the two trees serves as the tree index, but we do not want to write-protect all the pagetables that point to the pages in the unstable tree. So we allow the content of the pages (and hence the tree index) to change under KSM, without the knowledge of the tree balancing code. Thanks to the property of red-black trees that rebalancing never needs to compare node index values, even if the tree becomes unusable for lookups it still remains balanced, and worst-case insertion/deletion remains O(log(N)); this guarantees that the ksm-tree algorithm does not degenerate in corner cases.

To reduce the number of false negatives from unstable tree lookups, a checksum is used so that only pages whose checksum has not changed recently are inserted into the unstable tree; in the future the checksum could be replaced by checking the dirty bit of the pagetables and shadow pagetables (though not with current EPT). After a full scan of all pages tracked by KSM, the unstable tree is rebuilt from scratch to reset all the lookup errors introduced by pages changing content during the scan.

Whenever KSM finds a match in the stable or unstable tree, it write-protects the pagetables that mapped the old, unshared anonymous page and makes them map the new shared KSM page read-only. If any KVM shadow pagetable was mapping the page, it is updated and write-protected through the mmu notifier mechanism with a newly introduced change_pte method.

Full paper (PDF)


Sandboxer: Light-Weight Application Isolation in Mobile Internet Devices - R. Banginwar, M. Leibowitz, T. Tanaka

In this paper, we introduce sandboxer, an application isolation mechanism for Moblin-based Mobile Internet Devices (MIDs). MIDs are expected to support the "open but secure" device model, in which end users download applications from potentially malicious sources. Sandboxer allows us to safely construct a system that is similar to the conventional *NIX desktop, but with the assumption that applications are malicious. Sandboxer uses a combination of filesystem namespace isolation, which provides a secure chroot-like jail; UID/GID separation, which provides IPC isolation; and cgroup-based resource controllers, which provide access control to devices as well as dynamic limits on resources. By combining these facilities, we are able to protect the user and the system from both subverted and outright malicious applications while maintaining an environment very similar to the traditional *NIX desktop. The mechanism also provides a facility for applications to hide their local data from the rest of the applications running in their own sandboxes.
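
Sandboxer's own code is not shown in this abstract; the fragment below is only a generic sketch of the combination it describes (private filesystem namespace, chroot-like jail, per-application UID/GID) built from standard Linux calls, with made-up paths and IDs.

    /* Illustrative sketch only: a private mount namespace, a chroot-style jail
     * and a per-application UID/GID, in the spirit of the isolation described
     * above. Paths and IDs are hypothetical. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Detach from the parent's mount namespace so the jail's mounts
         * are invisible to the rest of the system. */
        if (unshare(CLONE_NEWNS) != 0) {
            perror("unshare(CLONE_NEWNS)");
            return 1;
        }

        /* Confine the filesystem view to the application's sandbox root. */
        if (chroot("/sandboxes/app-1234") != 0 || chdir("/") != 0) {
            perror("chroot");
            return 1;
        }

        /* Drop to a per-application UID/GID so IPC and signals from other
         * sandboxes are out of reach. */
        if (setgid(34567) != 0 || setuid(34567) != 0) {
            perror("setgid/setuid");
            return 1;
        }

        execl("/bin/app", "app", (char *)NULL);     /* path inside the jail */
        perror("execl");
        return 1;
    }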

Full paper (PDF)


Dynamic Debug - Jason Baron

The kernel is sprinkled with debug statements that are only available by individually re-compiling the various subsystems of the kernel. In addition, each subsystem has its own rules and methods for expressing these debug statements - dprintk, DEBUGP, pr_debug, etc. Dynamic debug, introduced in 2.6.28, organizes these debug statements and makes them available at run time. Statements can be selected individually or at higher levels of organization, such as per module. Dynamic debug can be thought of as a verbose mode for the kernel. We explore the design, usage, and performance impact of this new feature. We also highlight issues that have been debugged with this methodology and discuss its future uses.
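
For orientation, a hypothetical driver-side call site that dynamic debug could toggle at run time might look like the sketch below; the module name and messages are invented, while pr_debug() and the debugfs control file are the stock pieces the abstract refers to.

    /* Hypothetical driver snippet: with CONFIG_DYNAMIC_DEBUG, each pr_debug()
     * call site is registered and can be switched on individually at run time,
     * e.g.:  echo 'module mydrv +p' > /sys/kernel/debug/dynamic_debug/control */
    #include <linux/module.h>
    #include <linux/printk.h>

    static int __init mydrv_init(void)
    {
        pr_debug("mydrv: probing device, irq=%d\n", 17);
        return 0;
    }

    static void __exit mydrv_exit(void)
    {
        pr_debug("mydrv: unloading\n");
    }

    module_init(mydrv_init);
    module_exit(mydrv_exit);
    MODULE_LICENSE("GPL");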

Full paper (PDF)


Measuring Function Duration with Ftrace - Tim Bird

Ftrace is a relatively new kernel tool for tracing function execution in the Linux kernel. Recently, Ftrace added the ability to trace function exit in addition to function entry. This allows measurement of function duration, providing an incredibly powerful tool for finding time-consuming areas of kernel execution. In this paper, the current state of the art for measuring function duration with Ftrace is described. This includes recent work to add a new capability to filter the trace data by function duration, and tools for analyzing kernel function call graphs and kernel boot-time execution.
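
As a rough usage sketch (not taken from the paper), duration filtering can be driven from user space by writing to the tracing control files; the mount point, file names, and threshold below are assumptions about a recent kernel.

    /* Sketch: enable the function_graph tracer and record only functions that
     * run longer than a threshold (tracing_thresh, in microseconds). Assumes
     * debugfs is mounted at /sys/kernel/debug. */
    #include <stdio.h>

    static int write_file(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (!f) {
            perror(path);
            return -1;
        }
        fputs(val, f);
        fclose(f);
        return 0;
    }

    int main(void)
    {
        const char *t = "/sys/kernel/debug/tracing";
        char path[256];

        snprintf(path, sizeof(path), "%s/tracing_thresh", t);
        write_file(path, "100");                 /* only functions > 100 us */

        snprintf(path, sizeof(path), "%s/current_tracer", t);
        write_file(path, "function_graph");

        snprintf(path, sizeof(path), "%s/tracing_on", t);
        write_file(path, "1");                   /* start tracing */

        /* Results can then be read back from .../trace. */
        return 0;
    }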

Full paper (PDF)


The Simple Firmware Interface - A. Leonard Brown

The Simple Firmware Interface (SFI) was developed as a lightweight method for platform firmware to communicate with the Operating System.

Upcoming hand-held platforms from Intel will be deployed using SFI in lieu of ACPI or UEFI.

This talk will summarize the contents of the SFI specification, and will cover the motivation for SFI, the state of SFI Linux support, and the expected industry deployment.

Full paper (PDF)


The Corosync High Performance Shared Memory IPC Reusable C Library - Steven C Dake

Throughout the history of server application development, thousands of interprocess communication (IPC) systems have been designed and implemented, each with its own unique set of bugs and performance characteristics. The Corosync Cluster Engine project team was also guilty of implementing an IPC system with high throughput and low overhead and latency. At the conclusion of our development, we made the IPC system generic and reusable as two C libraries. Through this effort, the client-server development community can concentrate on debugging and optimizing one IPC system for a variety of uses.

The messaging and programming model is described in this presentation. The zero-copy features of the coroipc system are described. Some examples of the programming API are shown. Implementation details are covered at a high level. Finally, message throughput and latency are measured, and oprofile results are presented for a test application.

System designers of client-server applications will find this presentation interesting. A firm understanding of client-server applications and the C programming language is helpful. A basic understanding of POSIX is helpful but not required of attendees.

Full paper (PDF)


GStreamer on Texas Instruments OMAP35x Processors - D. Darling, C. Maupin, B. Singh

The Texas Instruments (TI) OMAP35x applications processors are targeted for embedded applications that need laptop-like performance with low power requirements. Combined with hardware accelerators for multimedia encoding and decoding, the OMAP35x is ideal for handheld multimedia devices. For OMAP35x processors that have both an ARM(R) and a digital signal processor (DSP), TI has created a GStreamer plugin that enables the use of the DSP and hardware accelerators for encode and decode operations while leveraging open source elements to provide common functionality such as AVI stream demuxing.

Often in the embedded applications space there are fewer computation and memory resources available than in a typical desktop system. On ARM+DSP systems, the DSP can be used for CPU-intensive tasks such as audio and video decoding to reduce the number of cycles consumed on the ARM processor. Likewise, additional hardware accelerators such as DMA engines can be used to move data without consuming ARM cycles. This leaves the ARM available to handle other operations such as running a web browser or media player, and thus provides a more feature-rich system. This paper covers the design of the TI GStreamer plugin, considerations for using GStreamer in an embedded environment, and the community project model used in ongoing development.
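
A hedged example of what such a pipeline might look like from application code is given below; filesrc and avidemux are standard open source elements, while the DSP decoder element name is invented here and does not claim to match the TI plugin's actual element names.

    /* Illustrative only: demux an AVI file with a standard open source element
     * and hand the video stream to a hypothetical DSP-accelerated decoder
     * element named "dspvideodec". */
    #include <gst/gst.h>

    int main(int argc, char *argv[])
    {
        gst_init(&argc, &argv);

        GError *err = NULL;
        GstElement *pipe = gst_parse_launch(
            "filesrc location=clip.avi ! avidemux ! dspvideodec ! "
            "ffmpegcolorspace ! autovideosink", &err);
        if (!pipe) {
            g_printerr("pipeline error: %s\n", err ? err->message : "unknown");
            return 1;
        }

        gst_element_set_state(pipe, GST_STATE_PLAYING);
        /* ... run a main loop and wait for EOS on the bus ... */
        gst_element_set_state(pipe, GST_STATE_NULL);
        gst_object_unref(pipe);
        return 0;
    }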

Full paper (PDF)


From Fast to Predictably Fast - Dominic Duval

Many software applications used in finance, telecommunications, the military, and other industries have extremely demanding timing requirements. Forcing an application to wait for a few extra milliseconds can cause vast sums of money to be lost on the stock markets, important phone calls to be dropped, or an industrial welding laser to miss its target. Highly specialized realtime operating systems have historically been the only way to guarantee that timing constraints would always be respected.

Several enhancements to the Linux kernel have recently made it possible to achieve predictable, guaranteed response times. The Linux kernel is now, more than ever before, well equipped to compete with other realtime operating systems. However, applications may still need to be modified and adjusted in order to run predictably and fully benefit from these realtime extensions. This paper describes our findings, experiences, and best practices in reducing latency in user-space applications. The discussion focuses on how applications can make the best use of the realtime extensions available in the Linux kernel, but is also relevant to any software developer concerned with application response times.
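
The paper's specific recommendations are not reproduced here, but a typical starting point for the kind of application-side changes it discusses is sketched below: locking memory to avoid page faults on the critical path and requesting a realtime scheduling class. The priority value is a placeholder.

    /* Common latency-reduction setup for a realtime thread: lock memory so
     * page faults cannot stall the critical path, and switch to SCHED_FIFO. */
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 50 };   /* placeholder */

        /* Pin all current and future pages into RAM. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            perror("mlockall");

        /* Run under the realtime FIFO scheduling policy. */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
            perror("sched_setscheduler");

        /* ... time-critical work: pre-faulted buffers, no blocking syscalls ... */
        return 0;
    }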

Full paper (PDF)


Combined Tracing of the Kernel and Applications with LTTng - Pierre-Marc Fournier

Kernel tracing provides an effective way of understanding system behavior and debugging many types of problems in the kernel and in userspace applications. In some cases, tracing events that occur in application code can further help by providing access to application activity unknown to the kernel.

LTTng now provides a way of tracing the kernel and the applications of a system simultaneously. The kernel instrumentation and event collection facilities were ported to userspace, yielding a performance impact as low as that of the kernel tracer. This presentation will demonstrate how to use LTTng in this way, and show examples of how correlating kernel and userspace events can lead to successful debugging of complex problems.

Full paper (PDF)


Twenty Years Later: Still Improving the Correctness of an NFS Server - R. Gardner, S. D'Angelo, M. Sears

The NFS Reply Cache, also known as the Duplicate Request Cache, was first described in its current form over twenty years ago as a way to help a server give correct responses to certain types of replayed operations. Some operations, called idempotent, can be safely repeated and will do no harm. Other operations, called non-idempotent, can only succeed once, and an attempt to repeat one results in failure. For example, a request to read a certain block of a file will produce the same result each time. But an operation such as rename can only succeed once. A subsequent retry of the same request will result in an error being reported to the client, even though the original operation did succeed. The Reply Cache keeps track of responses to recently performed non-idempotent transactions, and in case of a replay, the cached response is sent instead of attempting to perform the operation again. In addition to avoiding these client-visible errors, performance is also improved by avoiding unnecessary work.

The trouble begins when the size of the cache is not adequate to deal with the rate of incoming transactions. Now the mechanism breaks down, and replayed requests may result in duplicate work being done and erroneous results generated. Even modest workloads can result in enormous incoming transaction rates which would necessitate enlarging the reply cache to unacceptable levels. Heavy workloads can cause network congestion and delays that can foil attempts to cache enough transactions to maintain correctness.

We address these problems by making the cache smarter instead of larger. First, we add the concept of protecting a cache entry, which temporarily makes it exempt from the usual replacement process. Next, we add some heuristics that grant or revoke the protection of a cache entry. Finally, we eliminate automatic expiration of cache entries. Taken all together, this scheme drastically reduces the number of errors reported by clients on a large network.
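
The authors' data structures and heuristics are not given in this abstract; the fragment below is only a generic illustration of a reply-cache entry with a "protected" flag that exempts it from replacement, in the spirit described above. All names and fields are invented.

    /* Generic illustration of a duplicate request cache entry: protected
     * entries are skipped by the replacement scan. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct drc_entry {
        uint32_t xid;            /* RPC transaction id */
        uint32_t client_addr;    /* client IPv4 address */
        uint32_t proc;           /* NFS procedure number */
        bool     is_protected;   /* exempt from replacement while set */
        void    *saved_reply;    /* cached response to replay on a retransmit */
        struct drc_entry *next;
    };

    /* Pick a victim entry to recycle, never evicting a protected one. */
    static struct drc_entry *drc_find_victim(struct drc_entry *lru_head)
    {
        for (struct drc_entry *e = lru_head; e; e = e->next)
            if (!e->is_protected)
                return e;
        return NULL;             /* all entries protected: grow or defer */
    }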

Full paper (PDF)


Memory Migration on Next-Touch - Brice Goglin, Nathalie Furmento

NUMA abilities such as explicit migration of memory buffers enable flexible placement of data buffers at runtime near the tasks that actually access them. The move_pages system call may be invoked manually, but it achieves limited throughput and requires strong collaboration from the application. Indeed, the location of threads and their memory access patterns must be carefully known in order to decide which memory buffer to migrate, and when.

We present the implementation of a Next-Touch memory placement policy that enables automatic dynamic migration of pages when they are actually accessed by a task. We introduce a new PTE flag set up by madvise, and the corresponding "Copy-on-Access" codepath in the page-fault handler, which allocates the new page near the accessing task. We then look at the performance and overheads of this model and compare it to using the move_pages system call.
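
For comparison, the manual mechanism the authors measure against can be exercised as in the hedged sketch below, which explicitly migrates a buffer's pages with move_pages(2); the target node and sizes are arbitrary.

    /* Sketch: explicitly migrate the pages of a buffer to NUMA node 1 with
     * move_pages(2), the manual mechanism the next-touch policy aims to
     * replace with automatic on-access migration. Link with -lnuma. */
    #define _GNU_SOURCE
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        long psz = sysconf(_SC_PAGESIZE);
        size_t npages = 256;
        char *buf;

        if (posix_memalign((void **)&buf, psz, npages * psz) != 0)
            return 1;
        for (size_t i = 0; i < npages * psz; i += psz)
            buf[i] = 0;                          /* fault the pages in */

        void **pages = malloc(npages * sizeof(void *));
        int *nodes  = malloc(npages * sizeof(int));
        int *status = malloc(npages * sizeof(int));
        for (size_t i = 0; i < npages; i++) {
            pages[i] = buf + i * psz;
            nodes[i] = 1;                        /* target NUMA node (arbitrary) */
        }

        /* 0 = current process; MPOL_MF_MOVE moves only pages exclusive to it. */
        if (move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE) != 0)
            perror("move_pages");

        return 0;
    }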

Full paper (PDF)


Non Privileged User Package Management: Use Cases, Issues, Proposed Solutions - Francois-Denis Gonthier, Steven Pigeon

The package manager and the associated repositories play a central role in the usability and stability of user environments in GNU/Linux distributions. However, the current package management paradigm puts control of the system exclusively in the hands of the system administrator, a root-like user. The non-privileged user must rely on the administrator to install the packages he needs, while having to deal with delays or even refusal. We think that non-privileged package management is the solution to these users' woes. We show not only that non-privileged package management has realistic use cases, but also that it is quite feasible. We examine several existing solutions and show why they are not satisfactory for the deployment of unprivileged user package management. Finally, we analyse the dpkg package manager and show how it can be extended to include safe, consistent, non-privileged user package management. Among our results, we present the conflict-resolution rules needed to support multiple databases while ensuring system consistency and proper dependency management. We also present how to modify user environment initialization to include alternate install locations. We show, finally, the feasibility and usefulness of unprivileged user package management and how small the required changes to a package manager such as dpkg are.

Full paper (PDF)


GeoDNS - Geographically-aware, protocol-agnostic load balancing at the DNS level - John Hawley

The Open Source community has grown massively over the years, and now sports a complicated and diverse mirroring topology for its multitude of projects. While the mirroring structure has come to meet the needs of the world, getting users to the right place has become a larger issue. Users span the globe, and it is imperative to distribute the load not only across machines but also by physical geographic location. Many projects and companies attempt to handle this, from mod_geo inside Apache and mirrormanager to companies like Akamai that provide full solutions. The commercial solutions are outside the reach of most open source projects, and the other solutions currently rely on running within the webserver itself. This leaves other protocols like rsync, ftp, git and svn to fend for themselves, if they can.

GeoDNS is the idea of taking an incoming DNS request, doing a geographic lookup at request time, and returning different results based on the incoming IP address. This approach, taken by several DNS servers including bind-geodns, powerdns, and tinydns (with patches), allows large mirroring infrastructures like Kernel.org, Wikipedia, and many other large sites to direct users seamlessly to an appropriate server and help distribute their loads. This protocol-agnostic approach is more universal and simpler for end users, because it makes seemingly hard choices transparent to them.

Full paper (PDF)


Porting to Linux the Right Way - Neil Horman

Linux has grown to be a major development platform over the last decade, often becoming the primary target for many new applications and appliances. With businesses always wanting to stay current, the rate at which software is ported to Linux has also risen. Often this is a trivial matter, especially in environments in which the development model is similar (AIX to Linux, Solaris to Linux, even Windows to Linux). However, there are environments (particularly in the embedded space) in which porting often becomes difficult. Tight coupling of application and driver, combined with a "just get it working fast" mentality, invariably leads to substandard porting efforts, which result in products with degraded performance that leave developers and consumers alike with a bad taste in their mouths. This paper seeks to ease some of that porting effort by focusing on what has been one of the most often mis-ported areas of code: the user space / kernel space boundary, specifically the movement of data between these domains. The paper discusses, in general terms, the common monolithic application model most often associated with embedded systems development; its refactoring when porting to Linux; the modeling and description of the data that must be passed between the refactored components; and the selection of an appropriate mechanism for moving that data back and forth between user and kernel space. In so doing, the reader is exposed to several mechanisms which can be leveraged to achieve a well-ported software product that provides both a better customer experience and greater confidence in Linux as a future development platform.

Full paper (PDF)


Tracing the HA Cluster of Guests with VESPER - S. Kim, S.Moriya, S. Oshima

Recently, many tracing infrastructures, such as kprobes, tracepoints, and ftrace, have been merged into the mainline kernel. They are useful for telling what is going on inside the kernel on a physical machine, so it is natural to ask: "Can we use them to trace virtual machines?"

In this paper, we introduce VESPER, a framework for tracing guest kernel state from the host using the in-tree tracing infrastructure in just the same manner as host kernel tracing. We focus in particular on the mechanism for injecting probes into the guest and for splicing guest trace reports onto the host to alleviate data-copy overhead. To verify the efficiency of VESPER, we take an HA cluster of guests running on the in-tree hypervisors, Lguest and KVM, as test cases. By combining tracepoints with kprobes to monitor guests, VESPER improves fail-over response latency for application-level as well as system-wide failures, compared with conventional heartbeat.

Full paper (PDF)


Hardware Breakpoint (or watchpoint) usage in the Linux Kernel - Prasad Krishnan

The Hardware Breakpoint (also known as watchpoint or debug) registers have hitherto been a frugally used resource in the Linux kernel (ptrace and in-kernel debuggers being the main users), with little co-operation between the users. These debug registers can trigger exceptions upon 'events' (memory read/write/execute accesses) performed on 'monitored' address locations, to aid diagnosis of memory corruption and generation of profile data. Their role is best exemplified in

This talk will introduce the new generic interfaces and the underlying features of the abstraction layer for HW Breakpoint registers in Linux Kernel. The (potential) users of this infrastructure such as ftrace and SystemTap will be discussed, with interesting examples of their usage. The audience will also be introduced to some of the design challenges in developing interfaces over a highly diverse resource such as HW Breakpoint registers, along with a note on the future enhancements to the infrastructure.
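
The new in-kernel interfaces themselves are not quoted in this abstract. As a rough userspace illustration of what the underlying debug registers provide, the sketch below uses the perf_event breakpoint attributes through which this infrastructure is exposed in later kernels; treat the exact constants as assumptions.

    /* Sketch: ask the kernel for a hardware write-watchpoint on a variable via
     * the perf_event breakpoint interface; reading the counter tells how often
     * the address was written. */
    #define _GNU_SOURCE
    #include <linux/hw_breakpoint.h>
    #include <linux/perf_event.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long watched;                         /* the data being monitored */

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type    = PERF_TYPE_BREAKPOINT;
        attr.size    = sizeof(attr);
        attr.bp_type = HW_BREAKPOINT_W;          /* trigger on writes */
        attr.bp_addr = (unsigned long)&watched;
        attr.bp_len  = HW_BREAKPOINT_LEN_8;

        int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) {
            perror("perf_event_open");
            return 1;
        }

        watched = 42;                            /* this write is counted */

        long long count = 0;
        read(fd, &count, sizeof(count));
        printf("writes observed: %lld\n", count);
        close(fd);
        return 0;
    }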

Full paper (PDF)


Shoot first and stop the OS noise - Christopher Lameter

Latency requirements for Linux software can be extreme. One example is the financial industry: whoever can create an order first in response to a change in market conditions has an advantage. In the high performance computing area, a huge number of calculations must occur consistently with low latencies on a large number of processors in order to make it possible for rendezvous to occur with sufficient frequency. Games are another case where low latency is important. In games it's a matter of survival: whoever shoots first will win.

An operating system causes some interference with user space processing through scheduling, interrupts, timers, etc. The application code sees its execution being delayed for no discernible reason, and a variance in the execution time of code due to cache pollution by the operating system. Low latency applications are impacted in a significant way by OS noise.

We will investigate issues in software and hardware for low latency applications and show how the OS noise has been increasing in recent kernel versions.

Full paper (PDF)


Tuning 10Gb network cards on Linux - B. H. Leitao

This paper outlines the most important settings to use for optimising 10G network cards under Linux, focusing specifically on performance. Emphasis will be placed on configurations for latency and bandwidth, kernel configuration, multi-queues, QoS, multi-stream, kernel memory configurations, NAPI, and all offload options, among others.

Full paper (PDF)


A day in the life of a Linux kernel hacker... - John W. Linville

The Linux kernel is a huge project with contributors spanning the globe. Its usefulness and other advantages continue to draw new users on a daily basis. Some users will discover problems with the kernel, and others will eventually find a need to add their own features to Linux. Whether you are a user in need of support or a developer trying to enhance the kernel, it is good to know something about who is in the community and how they work together.

This topic will introduce the newcomer to some of the characters in the Linux community and some of the roles they play. It will highlight some of the tasks Linux hackers perform on a day to day basis, and give a general overview of how work gets done within the community.

Full paper (PDF)


Transcendent Memory and Linux - Dan Magenheimer

Managing a fixed amount of RAM optimally is a long-solved problem in the Linux kernel. Managing RAM optimally in a virtual environment, however, is still a challenging problem because: (1) each guest kernel is busy optimizing its entire fixed RAM allocation, oblivious to the needs of other guests, and (2) very little information is exposed to the virtual machine manager (VMM) to enable it to decide if one guest needs RAM more than another guest. Mechanisms such as "ballooning" and hot-plug memory (Schopp, OLS'2006) allow RAM to be taken from a "selfish" guest and given to a "needy" guest, but these have significant known issues and, in any case, don't solve the hard problem: Which guests are selfish and which are needy? IBM's Collaborative Memory Management (Schwidefsky, OLS'2006) attempts to collect information from each guest and provide it to the VMM, but was deemed far too complex, and attempts to upstream it have so far been rebuffed.

Transcendent Memory ("tmem" for short, see http://oss.oracle.com/projects/tmem) is a new approach to optimize RAM utilization in a virtual environment. Underutilized RAM from each guest, plus RAM unassigned to any guest ("fallow" memory), is collected into a central pool. Indirect access to that RAM is then provided by the VMM through a carefully crafted, page-copy-based interface. Linux kernel changes are required but are relatively small and not only provide valuable information to the VMM, but also furnish additional "magic memory" (which can be optionally compressed), provide performance benefits, and mitigate some of the issues that arise from ballooning/hotplug. We will introduce tmem and its implementation in the Xen environment, show how tmem can be used, describe the kernel changes required, and demonstrate tmem's advantages.

Full paper (PDF)


Incremental Checkpointing for Grids - John Mehnert-Spahn

The EU-funded project XtreemOS implements an open-source Linux-based grid operating system. Here checkpointing (CP) is used to implement fault tolerance and process migration. We have developed an incremental CP solution for saving only memory pages that have been changed since the last CP. We will present how we keep track of memory page modifications between CPs using Linux-native radix trees and how we handle virtual memory area changes. We will also provide experiment results with selected examples. We will present a new Linux connector transparently reporting application page modifications to user mode in order to adaptively control incremental CP at the grid level.

Full paper (PDF)


Putting LTP to test - Validating both the Linux kernel and Test-cases - Subrata Modak

The Linux Test Project (LTP) is receiving renewed interest and attention due to increased focus on testing and integration of Linux components from several projects in the Linux Ecosystem. LTP has not only discovered bugs in the Linux kernel, but also in other components such as libraries and documentation bugs in the man pages.

In this paper, we cover our experiences in this area and also discuss the adoption of newer technologies for static analysis of existing test cases. We use this approach to reduce errors in test cases (leading to better automation of Linux testing). We also analyze the new LTP code using various test metrics, and look at the requirements for allowing the test cases to handle errors introduced by fault injection.

Full paper (PDF)


Linux-based virtualization for HPC clusters - L. Nussbaum, F. Anhalt, O. Mornard, J. P. Gelas

There has been increasing interest in virtualization in the HPC community, as it would allow computing resources to be shared easily and efficiently between users, and would provide a simple solution to checkpointing. However, virtualization raises a number of interesting questions, on performance and overhead, of course, but also on the fairness of the sharing. In this work, we evaluate the suitability of KVM virtual machines in this context by comparing them with solutions based on Xen. We also outline areas where improvements are needed, to provide directions for future work.

Full paper (PDF)


I/O Topology - Martin K. Petersen

The smallest atomic unit a storage device can access is called a sector. With very few exceptions a sector size of 512 bytes has been akin to a mathematical constant in the storage industry for decades. That picture is now rapidly changing with hard drives moving to 4KB sectors. Flash-based solid state drives and enterprise RAID arrays also have alignment and block size requirements above and beyond what we have traditionally been honoring.

This paper will present a set of changes that expose the characteristics of the underlying storage to the Linux kernel. This information can be used by partitioning tools and filesystem formatters to lay out data optimally. Stacking devices like LVM and MD are also supported.
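
The exported characteristics end up as per-queue sysfs attributes; a small hedged reader is sketched below, using the attribute names introduced by this work and an example device name.

    /* Sketch: read the I/O topology hints a block device exports under
     * /sys/block/<dev>/queue/ so a partitioner or mkfs can align its layout.
     * The device name "sda" is just an example. */
    #include <stdio.h>

    static long read_attr(const char *dev, const char *attr)
    {
        char path[256];
        long val = -1;
        snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
        FILE *f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%ld", &val) != 1)
                val = -1;
            fclose(f);
        }
        return val;
    }

    int main(void)
    {
        const char *dev = "sda";
        printf("logical_block_size  = %ld\n", read_attr(dev, "logical_block_size"));
        printf("physical_block_size = %ld\n", read_attr(dev, "physical_block_size"));
        printf("minimum_io_size     = %ld\n", read_attr(dev, "minimum_io_size"));
        printf("optimal_io_size     = %ld\n", read_attr(dev, "optimal_io_size"));
        return 0;
    }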

Full paper (PDF)


Step two in DCCP adoption: The Libraries - L. M. Sales, H. Stuart, H. O. Almeida, A. Perkusich

Multimedia applications are very popular on the Internet. The use of UDP in most of them may result in network collapse due to the lack of congestion control. DCCP [RFC 4340] is a new protocol to deliver multimedia in congestion-controlled unreliable datagrams.

In this talk I will discuss the results of enabling DCCP in open source libraries, as part of our efforts in disseminating the DCCP protocol to developers. It is a work in progress and currently DCCP is supported in libraries such as GNU CommonCPP, CCRTP (both used in the popular twinkle VoIP application), uCommon, in the GStreamer framework and on Farsight 2.
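
As a minimal illustration of what enabling DCCP means at the socket level (independent of the libraries named above), a client might open a DCCP socket as sketched here; the service code, port, and address are placeholders.

    /* Sketch: open a DCCP client socket. SOCK_DCCP/IPPROTO_DCCP are the stock
     * kernel constants; the service code and server address are placeholders. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #ifndef SOCK_DCCP
    #define SOCK_DCCP 6
    #endif
    #ifndef IPPROTO_DCCP
    #define IPPROTO_DCCP 33
    #endif
    #ifndef SOL_DCCP
    #define SOL_DCCP 269
    #endif
    #ifndef DCCP_SOCKOPT_SERVICE
    #define DCCP_SOCKOPT_SERVICE 2
    #endif

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DCCP, IPPROTO_DCCP);
        if (s < 0) {
            perror("socket(AF_INET, SOCK_DCCP)");
            return 1;
        }

        /* DCCP requires a service code before connecting. */
        unsigned int service = htonl(42);
        setsockopt(s, SOL_DCCP, DCCP_SOCKOPT_SERVICE, &service, sizeof(service));

        struct sockaddr_in sa;
        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_port   = htons(5001);
        inet_pton(AF_INET, "192.0.2.1", &sa.sin_addr);

        if (connect(s, (struct sockaddr *)&sa, sizeof(sa)) != 0)
            perror("connect");

        close(s);
        return 0;
    }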

Full paper (PDF)


Programmatic Kernel Dump Analysis On Linux - Alex Sidorenko

Companies providing Linux support rely heavily on kernel dumps created on customers' hosts. Kernel dump analysis is an art, and it is impossible to make it fully automatic. The standard tool used for dump analysis, 'crash', provides a number of useful commands. But when we need to enhance it or to analyze several thousand similar structures, we need a programmatic API.

In this paper we describe Python bindings to 'crash', http://sourceforge.net/projects/pykdump, and compare them to the C-like SIAL extension language. After a general framework discussion, we look at some practical tools developed on top of PyKdump, such as 'xportshow'. This tool works on kernels 2.4.21-2.6.28 and provides many useful features, such as printing routing tables, emulating 'netstat', and summarizing networking system status.

Full paper (PDF)


Online Hierarchical Storage Manager - S. K. Sinha, R. B. Agrawal, V. Agarwal, R. Vashist, R. K. Sharma, S. Hendre

Intel, SanDisk, and Samsung are investing billions of dollars into SSD technology and manufacturing capacity. Unfortunately, due to the extreme cost of building the manufacturing facilities, SSD manufacturing capacity is not likely to exceed HDD manufacturing capacity for at least 10 years, and it may be 20 years or more. Most data center applications lean heavily toward database workloads, which use random read/write disk activity. For random read/write activity the performance of SSDs is 10x to 100x that of a single rotational disk. Unfortunately, the cost is also 10x to 100x that of a single rotational disk.

Due to the limited manufacturing capability of SSD, most applications are going to remain on rotational disk for the foreseeable future. We have developed OHSM to allow SSD and traditional HDD (including RAID) to be seamlessly merged into a single operational environment, thus leveraging SSD performance while using only a modest amount of SSD capacity.

In an OHSM-enabled environment, data is migrated between high-performing SSD storage and traditional storage based on various user-defined policies. Thus, if widely deployed, OHSM has the ability to improve computer performance in a significant way without a commensurate increase in cost. Being developed as open source software, OHSM also avoids the licensing issues and costs involved in using storage solution software. Because OHSM operates online, it requires no downtime and no changes to the existing namespace.

Full paper (PDF)


Effect of readahead and file system block reallocation for LBCAS - K. Suzaki, T. Yagi, K. Iijima, N. A. Quynh, Y. Watanabe

Disk prefetching, known as readahead, adjusts its window size according to the cache-hit rate. Fewer readaheads with larger windows can hide slow I/O; this is especially effective for the virtual block devices of virtual machines. A high cache-hit ratio is achieved by increasing locality of reference, namely by reallocating file system blocks based on an access profile. The reallocation does not preserve block contiguity within a file, but it is effective for an archival virtual block device such as content addressable storage.

We have developed a data block reallocation tool for ext2/3, called "ext2/3optimizer". The reallocation is applied to Linux booting on a KVM virtual machine with a virtual disk called "LBCAS (LoopBack Content Addressable Storage)". We confirmed that the readahead window stays large and that fewer access requests are made to LBCAS. The results are useful not only for OS migration but also for normal Linux booting. This work was conducted in collaboration with Toshiki Yagi (AIST), Kengo Iijima (AIST), Nguyen Anh Quynh (AIST), and Yoshihito Watanabe (Alpha Systems Inc.).

Full paper (PDF)


Scaling software on multi-core through co-scheduling of related tasks - Srivatsa Vaddagiri

Multi-core platforms pose interesting challenges for software in how to efficiently utilize a vast number of compute resources. One particular challenge relates to thread scheduling. In many scenarios, threads of an application (e.g., threads serving the same application container or database instance) work together closely, accessing shared locks and data. Scheduling such related threads in different cache domains (e.g., in different nodes or chips) leads to more cache-synchronization overhead, thereby hurting performance. The existing interface to solve this problem (sched_setaffinity) is inflexible. We talk about the need for a more flexible interface that allows applications to provide hints about related threads, and how the CPU scheduler could use those hints to do a best-effort job of co-scheduling related threads. Positive results obtained in early experiments with benchmarks like SPECjbb will be described.
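
For reference, the existing interface the abstract calls inflexible is sched_setaffinity; a minimal use, pinning the current thread to two CPUs assumed to share a cache domain, is sketched below.

    /* Sketch of the existing, inflexible approach: hard-pin a thread to CPUs 0
     * and 1 on the assumption that they share a cache domain. The paper argues
     * for hints that let the scheduler do this placement itself. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                 /* CPUs 0 and 1: assumed same cache */
        CPU_SET(1, &set);

        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        /* ... spawn/run the related threads that share locks and data ... */
        return 0;
    }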

Full paper (PDF)


Converged Networking in the Data Center - Peter P. Waskiewicz Jr.

The networking world in Linux has undergone some significant changes in the past two years. With the expansion of multiqueue networking, coupled with the growing abundance of multi-core computers with 10 Gigabit Ethernet, the concept of efficiently converging different network flows becomes a real possibility.

This paper presents the concepts behind network convergence. Using the IEEE 802.1Qaz Priority Grouping and Data Center Bridging concepts to group multiple traffic flows, this paper will demonstrate how different types of traffic, such as storage and LAN traffic, can efficiently coexist on the same physical connection. With the support of multi-core systems and MSI-X, these different traffic flows can achieve latency and throughput comparable to those of the specialized adapters traditionally used for each traffic type.

Full paper (PDF)


How to (Not) Lose Your Data - Ric Wheeler

Increasingly, Linux is the platform that major vendors use to implement everything from consumer-grade NAS devices that you can buy at your local electronics store up to expensive, enterprise-grade storage systems. This paper aims to present a high-level overview of how some of these systems are put together, how to tune Linux for storage applications, and what functionality is either on the horizon or yet to be started in the open source space that will enhance Linux as a storage system. The techniques presented are also applicable to normal home users who would like to enhance data integrity.

Full paper (PDF)


Testing and verification of cluster filesystems - Steven Whitehouse

Although software testing can never prove a program is correct, it can catch many errors early and plays an important part in the process of declaring a program "stable" and ready for release. Cluster filesystems (e.g. GFS2), by their very nature, require hardware-intensive test environments, and are thus also expensive to test. This tends to limit test coverage compared with their simpler, single-node counterparts (ext2/3/4, xfs, etc.).

Since reliability is a key feature of any filesystem, this paper considers a number of techniques which may be used to help simulate a cluster on a single node, reducing the cost of testing and increasing the coverage in the process. Although GFS2 is taken as the example filesystem, the techniques described are generic and apply to any similar filesystem.

Full paper (PDF)


Fixing PCI Suspend and Resume - Rafael J. Wysocki

Interrupt handlers implemented by some PCI device drivers can misbehave if the device they are supposed to handle is not in the state they expect it to be in. If this happens, an interrupt storm is possible and the system may lock up as a result. Unfortunately, if the device in question uses shared interrupts, this can easily happen during suspend to RAM, after the device has been put into a low power state, and even more likely during resume, before the device is brought back to the state it was in before the suspend. On some machines this leads to intermittent resume failures that are very difficult to diagnose in the majority of cases.

In Linux kernels prior to 2.6.29-rc3 the power management core did not do anything to prevent such failures from happening, but during the 2.6.29 development cycle we started to address the issue. Still, the solution finally implemented in the 2.6.29 kernel is partial, because it only covers the devices that support the native PCI power management and it only affects the resume part of the code. The complete solution, which is now scheduled for inclusion into 2.6.30 and which is described in the present paper, required us to make some radical changes, including a rearrangement of the major steps performed by the kernel during suspend and resume. However, not only should it make suspend and resume much more reliable on a number of systems, but it should also allow the writers of PCI device drivers to simplify their code, because some standard PCI device power management operations will now be carried out by the core.
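
The reworked core code itself is in the paper; for orientation only, the sketch below shows the kind of legacy-style PCI driver suspend/resume boilerplate, built from the standard helpers, that the rearranged core aims to take over. Device-specific steps are omitted.

    /* Sketch of legacy-style PCI driver suspend/resume callbacks using the
     * standard helpers; the reworked PM core described above aims to let the
     * core perform these standard steps itself. */
    #include <linux/pci.h>

    static int mydev_suspend(struct pci_dev *pdev, pm_message_t state)
    {
        /* Quiesce the device and make sure its interrupt handler can cope
         * with the hardware being powered down (the failure mode above). */
        pci_save_state(pdev);
        pci_disable_device(pdev);
        pci_set_power_state(pdev, pci_choose_state(pdev, state));
        return 0;
    }

    static int mydev_resume(struct pci_dev *pdev)
    {
        pci_set_power_state(pdev, PCI_D0);
        pci_restore_state(pdev);
        return pci_enable_device(pdev);
    }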

Full paper (PDF)


Real-Time Performance Analysis in Linux-Based Robotic Systems - H. Yoon, J. Song, J. Lee

Mobile or humanoid robots collect environmental data and reflect it back as robotic behaviors via various sensors and actuators. It is crucial that this occurs within a specified time. Although real-time flavored Linux has been used to control robot arms and legs for quite a while, it has not been reported much whether the current real-time features in Linux could still meet this requirement for a much more complicated system - a humanoid with about 60 servo motors and sensors, with multiple algorithms such as recognition, decision, and navigation running simultaneously. In this paper, in order to meet such a requirement, the adoption of EtherCAT technology is introduced and its Linux implementation is illustrated. In addition, results of real-time experiments and timing analysis on a multi-core processor are presented, showing that Linux is a viable solution that can be successfully deployed in various robotic systems.

Full paper (PDF)