
Orion I/O calibration tool bug

I use fio for all my I/O testing. Why not Orion from Oracle, given that almost all of my I/O testing and benchmarking has been geared toward Oracle? Several reasons.

fio is

  1. super flexible – able to configure it for almost all types of tests

  2. active community – updates almost every week, many by Jens Axboe (who wrote much of the Linux I/O layer)

  3. reliable – if there are problems, it’s open source and one can discuss them on the fio community email list

  4. easy to distribute – just one executable; it doesn’t require getting, for example, a full Oracle distribution

Orion, on the other hand, has unfortunately had problems that make it too undependable for me to trust.

In some cases Orion re-reads the same blocks, covering a much smaller data set than requested. The following strange behavior is with orion on x86 Solaris. The orion binary was from an 11g distribution. The root of the strange behavior is that orion seems to revisit the same blocks over and over when doing its random read testing.

A dtrace script was used to trace which blocks orion was reading. The blocks in the test were under /domain (strstr matches substrings, so the filter below also catches the test file, which lives on /domain0).

    #!/usr/sbin/dtrace -s
    #pragma D option quiet

    /* fire on every ZFS read of a file whose path contains "/domain" */
    ::zfs_read:entry
    / strstr((args[0])->v_path, "/domain") != NULL /
    {
        /* print the byte offset of the read */
        printf("%lld\n", args[1]->_uio_offset._f);
    }

Steps:

  1. Created a 96GB file and put its path in /domain/mytest.lun (see the setup sketch after the option notes below).

  2. Modified io.d to filter for /domain.

  3. Ensured no non-orion I/O was going to the filesystem.

  4. Started running io.d > blocks-touched.txt

  5. Kicked off orion with:

    export LD_LIBRARY_PATH=.
    ./orion -testname mytest -run advanced -matrix row -num_disks 5 -cache_size 51200 \
            -duration 60 -simulate raid0 -write 0 -num_large 0

-run advanced : users can specify customizations
-matrix row : only small random I/O
-num_disks 5 : actual number of physical disks in the test; used to generate a range of loads
-cache_size 51200 : defines a warmup period
-duration 60 : duration of each data point
-simulate raid0 : simulate striping across all the LUNs specified (there is only one LUN in this test)
-write 0 : percentage of I/O that is write, zero in this test
-num_large 0 : maximum outstanding I/Os for large random I/O (there is no large random I/O in this test)
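
Putting the steps together, the setup looked roughly like this (the mkfile usage is illustrative; the test-file path is the one orion reports below):

    # step 1: create the 96 GB test file and record its path for orion
    $ mkfile 96g /domain0/group0/external/lun96g
    $ echo /domain0/group0/external/lun96g > /domain/mytest.lun

    # steps 3-4: with the filesystem otherwise idle, start tracing reads
    $ chmod +x io.d
    $ ./io.d > blocks-touched.txt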

Once the test finished, I stopped the dtrace script io.d.

Example output from a run

   ORION VERSION 11.2.0.1.0
   Command line:
   -testname mytest -run advanced -matrix row -num_disks 5 -cache_size 51200 -duration 60 
   -simulate raid0 -write 0 -num_large 0 

   These options enable these settings:
   Test: mytest
   Small IO size: 8 KB
   Large IO size: 1024 KB
   IO types: small random IOs, large random IOs
   Sequential stream pattern: RAID-0 striping for all streams
   Writes: 0%
   Cache size: 51200 MB
   Duration for each data point: 60 seconds
   Small Columns:,      0,      1,      2,      3,      4,      5,      6,      7,      8,      9,     10,     11,     12, 
                       13,     14,     15,     16,     17,     18,     19,     20,     21,     22,     23,     24,     25
   Large Columns:,      0
   Total Data Points: 26

   Name: /domain0/group0/external/lun96g	Size: 103079215104
   1 files found.

   Maximum Small IOPS=62700 @ Small=16 and Large=0
   Minimum Small Latency=81.81 usecs @ Small=2 and Large=0

Things look wrong right away.

The average latency is in the hundreds of microseconds (above, the fastest 60-second data point averaged 81 µs) over a file that is 96 GB, twice as big as the 48 GB cache. With only half the file able to fit in cache, a good share of truly random reads should have gone to disk at millisecond latencies.

The max throughput was 489 MB/s, which is just the peak IOPS times the I/O size: 62,700 × 8 KB ≈ 489 MB/s.

Total blocks read

    # wc -l blocks-touched.txt
    78954834 blocks-touched.txt

Unique blocks read

    # sort blocks-touched.txt | uniq -c | sort -rn > block-count.txt
    # wc -l block-count.txt
    98305 block-count.txt

We hit only 98,305 unique offsets in the file, yet a 96 GB file has 12,582,912 unique 8 KB offsets.

The unique blocks touched total around 768 MB of data, which is easily cached.
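
Just to show where those numbers come from, a quick check with bc:

    # a 96 GB file contains 96 * 1024 * 1024 / 8 = 12,582,912 8 KB blocks
    $ echo '96 * 1024 * 1024 / 8' | bc
    12582912
    # 98,305 unique offsets * 8 KB is roughly 768 MB actually touched
    $ echo '98305 * 8 / 1024' | bc
    768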

Block access frequency (block-count.txt is sorted by descending hit count, so tail shows the least frequently hit offsets):

    # tail block-count.txt
    695 109297664
    694 34532360192
    693 76259328
    693 34558271488

The least frequently hit blocks were still hit almost 700 times, and the average was over 800, yet there were 78,954,834 block accesses against a file of 12,582,912 unique blocks, so the average should have been about 6 hits per block.
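
The same arithmetic check with bc, using the counts above:

    # actual average hits per *touched* block
    $ echo '78954834 / 98305' | bc
    803
    # expected average if reads were spread over the whole file
    $ echo '78954834 / 12582912' | bc
    6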

This may be caused by multiple streams starting from the beginning of the file, or at the same “random” offset, for every 60-second test duration. I’m not sure. If that is the case, the only workaround would be to increase the duration enough to ensure most of the blocks read at the beginning of the test get evicted from cache. If each thread starts at the same location and reads the same set of “random” blocks, then there is no workaround. Ideally I’d want each stream to start from a different random location and read a different set of random blocks.
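
For comparison, this per-stream randomness is exactly what fio lets you control. Below is a minimal sketch of a comparable 8 KB random-read job; the filename, size, and thread count are placeholders for this test. randrepeat=0 makes fio pick fresh random seeds each run, each job gets its own seed, and by default each job tracks the blocks it has visited so it doesn’t revisit offsets within a pass.

    ; sketch of a comparable 8 KB random-read test in fio
    [global]
    ioengine=psync
    rw=randread
    bs=8k
    runtime=60
    time_based
    ; fresh random seeds each run, so repeated runs don't replay the same offsets
    randrepeat=0

    [randread-96g]
    ; placeholder path, size, and thread count for the 96 GB test file
    filename=/domain0/group0/external/lun96g
    size=96g
    numjobs=16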

-----

Other issues


Then I thought I’d use the Oracle-supplied tool, orion, to do a random read test on an NFS-mounted file system, but this doesn’t work, at least on AIX 6.1 with my mount settings.

The first orion test gave this error:

    $ orion -run simple -testname orion
    ORION: ORacle IO Numbers -- Version 11.1.0.7.0
    orion_20101123_1503
    rwbase_read_luncfg: SlfFopen error on orion.lun
    orion_parse_args: rwbase_read_luncfg failed

OK, I have to create “orion.lun” with either my LUN locations or my file locations. I put in a file location:

    $ cat orion.lun
    /tmp/system01.dbf

Now I get another error:

    $ orion -run simple -testname orion
    ORION: ORacle IO Numbers -- Version 11.1.0.7.0
    orion_20101123_1508
    Test will take approximately 9 minutes
    Larger caches may take longer
    orion_spawn: skgpspawn failed: Error category: 27155, Detail: 2
    orion_main: orion_spawn failed
    Non test error occurred
    Orion exiting
    Illegal instruction(coredump)

Looks like the “orion” executable wasn’t being found, at least not by execve:

    $ truss -f orion -run simple -testname orion
    700502: execve("orion", 0x0FFFFFFFFFFBF2D0, 0x0FFFFFFFFFFFFB30) Err#2 ENOENT

So I ran it from the directory containing the orion executable (evidently orion spawns itself via the relative path “orion”). Now I get another error:


    $ orion -run simple -testname orion
    ORION: ORacle IO Numbers -- Version 11.1.0.7.0
    orion_20101123_1510
    Test will take approximately 9 minutes
    Larger caches may take longer
    storax_skgfr_openfiles: File identification failed on /kyle/system01.dbf
    OER 27054: please look up error in Oracle documentation
    Additional information: 6
    rwbase_lio_init_luns: lun_openvols failed
    rwbase_rwluns: rwbase_lio_init_luns failed
    orion_thread_main: rw_luns failed
    Non test error occurred
    Orion exiting

If the datafile was on /tmp it worked fine, but if it was on my NFS mount it failed with the above error. Hmm, doesn’t work over NFS? (ORA/OER 27054 is the error Oracle raises when an NFS file system isn’t mounted with the options Oracle expects.)
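
For what it’s worth, that suggests the fix is in the mount options rather than in orion itself. From memory, the commonly cited Oracle-over-NFS mount options on AIX look something like the sketch below; treat the exact option list, server name, and paths as assumptions and check Oracle’s NFS mount-options documentation for your platform and version.

    # illustrative only: AIX NFS mount options often cited for Oracle datafiles
    # (server and export path are placeholders)
    mount -o rw,bg,hard,intr,rsize=32768,wsize=32768,proto=tcp,vers=3,timeo=600 \
        nfsserver:/export/kyle /kyle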
