Performance Guide Pt. 1: Performance Characteristics and How It Can Be Measured

September 5, 2023

We're excited to introduce a new blog post series focused on testing and enhancing storage performance. Throughout this series, we'll walk you through the entire process, from defining objectives and preparing the necessary hardware and software to optimizing performance using a software RAID engine. In our first blog post, we'll delve into the fundamentals of system performance, including its core components and methods for measurement.

The performance of a data storage system is usually evaluated based on its inherent set of interrelated characteristics, such as data transfer bandwidth, input/output performance, etc. This approach allows for a detailed comparison among existing solutions and between the solutions and specifications provided by suppliers. It is also useful for predicting the performance of an application. The output characteristics of a system may, however, vary as the load changes. For instance, they can be influenced by factors such as the number of queues, the depth of the request queue, the size of the read and write blocks, the alignment of the blocks, the locality of requests, and the ability to compress and deduplicate data.

This chapter covers data storage units and basic workload patterns, and provides instructions on how to measure and test performance.

Data Storage Units

  • GBps or GiBps. Data storage system transfer bandwidth, measured in gigabytes (decimal) or gibibytes (binary) per second. This parameter measures how much data can be processed during read and write operations per unit of time. It is typically used to evaluate data storage system performance when dealing with large block sizes, ranging from 64kB to 8MB. At a fixed block size, this parameter is directly proportional to the following one.
  • IOps. Input/output operations per second (IOPS) is a parameter that indicates the number of requests a data storage system can process. It is typically used to evaluate the performance of a data storage system when dealing with small block sizes, ranging from 512B to 64kB.
  • avg lat. Average latency is the waiting period for a request to be processed, measured in nanoseconds (ns), microseconds (us), or milliseconds (ms). This parameter is inversely proportional to the previous one (see the worked example after this list). Sometimes, in advanced analytics, average latency is divided into submission and completion latency.
  • 99.x lat. Indicates the response time within which 99.x% of requests complete, where x is 5, 9, 95, 99, etc. (i.e., the 99.5th, 99.9th, 99.95th, or 99.99th percentile). This parameter should be considered if the application is sensitive to latency or when using the data storage system in a complex application cluster with multiple nodes involved in processing requests. For more complex tests, the latency distribution can also be taken into account.
  • CPU load
  • Storage Utilization
  • Application parameters: number of supported threads, transactions, etc.
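
As a rough worked example of how the first two parameters are related (the figures are illustrative, not taken from a real test): bandwidth is simply IOps multiplied by block size, so a system sustaining 1,000,000 IOps at a 4kB block size transfers about 4 GBps, while the same number of IOps at a 64kB block size would correspond to about 64 GBps.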

However, when measuring and evaluating these parameters, it is important to consider the workload applied to the storage device, as it can significantly affect the results; devices with different features also respond to the same workload differently.

Terms and Characteristics Describing the Workload Patterns

I/O
An I/O is a single read or write request issued to a storage medium (such as a hard drive or solid-state drive).

I/O Request Size (block size)
The I/O request has a size, which can vary from small (such as 1 kilobyte) to large (several megabytes). Different application workloads issue I/O operations with varying request sizes. The size of the I/O request can impact the latency and IOPS figures discussed above.

Access patterns
Sequential access
Sequential access is a type of access where the next input/output (I/O) operation starts from the same location (address or LBA) where the previous I/O operation ended. In other words, I/O operations form a sequence of reads or writes which come sequentially, one after another.

In real-life scenarios, sequential access I/O typically uses relatively large I/O sizes.

Random access
I/O requests are issued in a seemingly random pattern to the storage media, and the data could be stored all over various regions of the media. An example of such an access pattern is a heavily utilized database server or a virtualization host running many virtual machines, all operating simultaneously.

In real life, purely sequential and purely random patterns are rarely found, but the workload generated by real applications is usually close to one of these “pure“ patterns within a given timeframe.
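
As an illustration, the two access patterns map directly onto fio's rw= parameter (the fio tool itself is covered in more detail below). The device name /dev/nvme0n1, block sizes, and queue depths here are placeholders chosen for the sketch, not values from this guide:

[global]
direct=1
ioengine=libaio
filename=/dev/nvme0n1
time_based=1
runtime=60

[seq-read-1m]
# sequential reads with a large block size, typical of streaming workloads
rw=read
bs=1M
iodepth=8

[rand-read-4k]
# random reads with a small block size, typical of databases and VM hosts
# stonewall makes this job wait until the previous one has finished
stonewall
rw=randread
bs=4k
iodepth=64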

Queue depth
The queue depth is a number (usually between 1 and ~128) that shows how many I/O requests are queued (in-flight) on average. Having a queue is beneficial as the requests in the queue can be submitted to the storage subsystem in an optimized manner and often in parallel. A queue improves performance at the cost of latency.
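
A rough rule of thumb (Little's law) ties queue depth, IOps, and average latency together: average latency ≈ queue depth / IOps. For example (illustrative figures, not measurements), a device serving 500,000 IOps with 64 requests in flight shows an average latency of about 64 / 500,000 s ≈ 128 us, whereas at a queue depth of 1 it would have to complete each request in 2 us to sustain the same rate. This is why deeper queues usually raise throughput while increasing per-request latency.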

How to Measure and Test Performance

Prior to conducting any performance tests, two essential steps should be taken in order to analyze the final results correctly:

  1. Defining test objectives
  2. Setting expectations

Generally, the approach to testing should be as follows:

  1. Defining the objectives, forming expectations, and designing a test strategy.
  2. Setting up a test environment suitable for the task.
  3. Studying the hardware specifications and evaluating the results accordingly. (Steps two and three are interlinked and usually executed at the same time: some elements of the test equipment are predetermined, while others must be chosen to ensure an optimal outcome.)
  4. Software setup.
  5. Testing individual components and removing bottlenecks.
  6. Carrying out tests.
  7. Assessing the results.

Steps 4-7, and sometimes step 2, can be repeated until the desired result is achieved.

Depending on the desired outcomes, we can determine the type of testing:

1. System testing

The primary objective of this testing is to understand the system's behavior under various load types and in different scenarios. This is the most common type of testing, and synthetic benchmarks such as fio, vdbench, and Iometer are generally used for it. During testing, parameters such as the number of threads, queue depth, read-write ratio, block size, and so on are altered (a sample parameter sweep is sketched below). This data often provides an accurate prediction of what to expect in other testing scenarios.
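
As a sketch of what such a parameter sweep might look like with fio, the loop below varies the read/write ratio at a fixed 4k block size. The device name /dev/nvme0n1, queue depth, and run time are placeholders chosen for illustration, not recommended values:

for mix in 0 50 70 100; do
    fio --name=randrw-mix${mix} --filename=/dev/nvme0n1 --direct=1 \
        --ioengine=libaio --rw=randrw --rwmixread=${mix} --bs=4k \
        --iodepth=128 --numjobs=4 --time_based --runtime=60 --group_reporting
done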

2. Testing application performance

It is used to understand how a real application will work with the storage. Either tools that imitate the activity of real applications or launches of the applications themselves are used.

3. Acceptance testing

It is used to determine if a new storage or modified settings of an existing one meet the project requirements. A fixed set of tests is used.

4. Comparative testing

Allows comparing how different storage systems perform when tested under conditions that are as similar as possible. A fixed set of tests is used.

The main focus of this blog post series is system testing.

Example of a test objective and expected results:

Objective: understanding the xiRAID RAID engine's capabilities for both random and sequential workloads on a server equipped with 16 PCI-E v5 drives.

Expected results: maximum performance levels twice as high as those obtained on previous-generation drives:

  • >40 million IOps for reads (4k block)
  • >2 million IOps for writes (4k block)
  • >200GBps for reads (1MB block)
  • >100GBps for writes (1MB block)
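
These targets are consistent with simple per-drive arithmetic: >200 GBps of reads spread across 16 drives implies roughly 12.5 GBps per drive, and >40 million IOps implies about 2.5 million IOps per drive, which is broadly in line with typical PCIe gen5 NVMe drive specifications. This is only a sanity check derived from the numbers above, not a result from a specific test.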

While running system and application performance tests, it is essential to also test the system during drive failure, reconstruction, restriping, and resizing, since array performance can drop significantly during these operations.

Restriping refers to the process of adding an additional drive to a RAID in order to increase its volume or change its level. It involves changing the configuration of the RAID.

Tool for Testing the Drive Performance

At present, the preferred tool for generating a workload on drives under Linux is fio (Flexible I/O tester). It allows defining custom workload patterns. The standard command to run fio is as follows:

# fio fio.cfg

where "fix.cfg" is a configuration file that describes the necessary FIO startup parameters, defines the tested devices, and outlines the load patterns.

Generally, this file contains one [global] section, which specifies parameters common to all jobs, and at least one job section. Most options can be specified both in the [global] section and in individual job sections.

Let us consider an example of a fio configuration file:

[global]
direct=1
bs=4k
ioengine=libaio
rw=randrw
rwmixread=50
iodepth=128
numjobs=2
offset_increment=2%
norandommap
time_based=1
runtime=600
random_generator=tausworthe64
group_reporting
gtod_reduce=1

[drive]
filename=/dev/nvme1n1

Major fio parameters:

  • [global]: [global] section start.
  • direct=1: Enables direct I/O, bypassing the operating system's cache.
  • bs=4k: Specifies the block size for I/O operations. In this example it’s set to 4 kilobytes.
  • ioengine=libaio: Sets the I/O engine to libaio, which provides asynchronous I/O support.
  • rw=randrw: Specifies the I/O access pattern. In this example it is set to mixed random read/write; the proportion of reads is set by the next parameter.
  • rwmixread=50: Sets the percentage of reads in the random read/write mix. It takes different values (0, 50, 70, or 100) for separate test runs.
  • iodepth=128: Determines the I/O submission queue depth per job, i.e. the maximum number of I/O requests in flight at any given time.
  • numjobs=2: Specifies the number of parallel jobs/threads to be used during the test. Each job defined below will be launched in “numjobs“ threads.
  • offset_increment=2%: Specifies the offset between the starting LBAs of the jobs. The actual starting LBA for each thread is offset_increment * thread_number; in this example, the offset is 2% of the target device size. It is important to use this option with sequential patterns: multiple threads working on the same stripes at the same time reduce performance because of the RAID engine's stripe handling logic, and such a pattern does not reflect real tasks. For random patterns, this option can be omitted.
  • norandommap: Disables tracking of previously accessed blocks for random I/O, so fio does not guarantee full coverage of the device; this reduces overhead.
  • time_based=1: Uses a time-based duration for the test instead of specifying the number of I/O operations.
  • runtime=600: Sets the duration of the test to 600 seconds (10 minutes).
  • random_generator=tausworthe64: Defines the random number generator algorithm used by FIO.
  • group_reporting: Enables reporting at the job group level instead of individual job level.
  • gtod_reduce=1: Activates gettimeofday reduction, which reduces the overhead of timing-related operations.
  • [drive]: Ends the previous section and starts a job named “drive“.
  • filename=/dev/nvme1n1: Specifies the target device or file to perform I/O operations on. In this case, it is set to the block device /dev/nvme1n1, an NVMe drive.

More information on configuring fio can be found in the man-page of this program.