top of page
  • Writer's picturekyle Hailey

R: slicing and dicing data

R can do some awesome data visualizations: http://gallery.r-enthusiasts.com/thumbs.php

Instead of doing one off data visualizations like with Excel, R can automate the process allowing one to visualize many sets of data with the same visualizations.

There are lots of blogs out there on getting started with R. The one thing that I didn’t find explained well was slicing and dicing data.

Lets take some data that I want to visualize.  The following data shows the performance of network throughput. The throughput is measured by latency of communication in milliseconds (avg_ms) and throughput in MB per second (MB/s).

The parameters are the I/O message size in KB (0KB is actually 1 byte) and the number of concurrent threads sending data (threads)

IOsize ,threads ,avg_ms ,    MB/s      0 ,      1 ,   .02 ,    .010      0 ,      8 ,   .04 ,    .024      0 ,     64 ,   .20 ,    .025      8 ,      1 ,   .03 ,  70.529      8 ,      8 ,   .04 , 150.389      8 ,     64 ,   .23 ,  48.604     32 ,      1 ,   .06 , 149.405     32 ,      8 ,   .07 , 321.392     32 ,     64 ,   .18 ,  73.652    128 ,      1 ,   .03 , 226.457    128 ,      8 ,   .01 , 557.196    128 ,     64 ,   .06 , 180.176   1024 ,      1 ,   .01 , 335.587   1024 ,      8 ,   .01 , 726.876   1024 ,     64 ,   .02 , 714.162

If this data is a file, it can be easily loaded and charted with R.

Find out what directory R is working in:

getwd()

go to a directory with my data and R files:

setwd("C:/Users/Kyle/R")

list files

dir()

load data into a variable

mydata <- read.csv("mydata.csv")

Simple, et voila, the data is loaded. To see the data just type the name of the variable ( the “>” is the R prompt, like “SQL>” in SQL*Plus)

> mydata    IOsize threads avg_ms    MB.s 1       0       1   0.02   0.010 2       0       8   0.04   0.024 3       0      64   0.20   0.025 4       8       1   0.03  70.529 5       8       8   0.04 150.389 6       8      64   0.23  48.604 7      32       1   0.06 149.405 8      32       8   0.07 321.392 9      32      64   0.18  73.652 10    128       1   0.03 226.457 11    128       8   0.01 557.196 12    128      64   0.06 180.176 13   1024       1   0.01 335.587 14   1024       8   0.01 726.876 15   1024      64   0.02 714.162

Creating a chart is a breeze, just say plot(x,y) where x and y are the values you want to plot. How to we extract an x and y from mydata? First pick what to plot. Let’s plot averge ms latency (avg_ms) verse MB per sec (MB.s). Here is how to extract those columns from the data

x=mydata['avg_ms'] y=mydata['MB.s']

Now plot

> plot(x,y) Error in stripchart.default(x1, ...) : invalid plotting method

huh … what’s that Error?

If we look at x and/or y, they are actually columns from mydata and plot() wants rows (actually vectors but we’ll get there).

> x    avg_ms 1    0.02 2    0.04 3    0.20 4    0.03 5    0.04 6    0.23 7    0.06 8    0.07 9    0.18 10   0.03 11   0.01 12   0.06 13   0.01 14   0.01 15   0.02

To transpose a column into a row we can use “t()”

> t(x)        [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] avg_ms 0.02 0.04  0.2 0.03 0.04 0.23 0.06 0.07 0.18  0.03  0.01  0.06  0.01  0.01  0.02

Now we can try plotting again:

> plot(t(x),t(y))

and voila


but let’s address the issue of transforming x and y from columns to rows and specifically into vectors. Let’s look at the original data and then the transformed data

x=mydata['avg_ms']     #  column of data extracted from a data.frame tx=t(mydata['avg_ms']) #  transform the column of data into a row

Look at the datatypes of x and t(x) using the class() function

> class(mydata)

[1] "data.frame"

> class(x) [1] "data.frame" > class(tx) [1] "matrix"

the column is considered a “data.frame” and the row is considered a “matrix”.

The method of extracting a column by it’s column name only works for datatype class data.frame.

If the datatype was a matrix we would be required to supply both the row and column as in  matrix[“row”,”column”]

By leaving either row or column empty but keeping the comma in place then it acts as a wild card.

matrix[,”column”] – gives all values in that column

matrix[“row”,] – gives all the values in that row

plot() wants a vector (but it forgivingly works with rows of data as we did above).

R data types

What are these datatypes in R? There is a simple discussion of data types at http://www.statmethods.net/input/datatypes.html

The types are basically (using “value1:value2” gives a list iterating from value1 to value2 by increments of 1)

  1. integer


> i=1:5 > class(i) [1] "integer" > i [1] 1 2 3 4 5

  1. character


> c=letters[1:5] > class(c) [1] "character" > c [1] "a" "b" "c" "d" "e"

  1. (booleans are integers )


> b=FALSE:TRUE > class(b) [1] "integer" > b [1] 0 1

  1. vectors


> v=c(1,2,3,4,5) > class(v) [1] "numeric" > v [1] 1 2 3 4 5

  1. matrix


> m=matrix(c(1,2,3,4,5)) > class(m) [1] "matrix" > m      [,1] [1,]    1 [2,]    2 [3,]    3 [4,]    4 [5,]    5

  1. data.frames – mixes numeric and character


> df=matrix(1:5,letters[1:5])      # matrix can't contain character and numeric Error in matrix(1:5, letters[1:5]) : non-numeric matrix extent > > df=data.frame(1:5,letters[1:5])  # dataframe can > class(df) [1] "data.frame" > df   X1.5 letters.1.5. 1    1            a 2    2            b 3    3            c 4    4            d 5    5            e

  1. lists – like an matrix but can mix different data types together such as character, number, matrix


 > a = c(1,2,5.3,6,-2,4) # numeric vector
> # generates 5 x 4 numeric matrix 
> y=matrix(1:20, nrow=5,ncol=4)
> # example of a list with 4 components - 
> # a string, a numeric vector, a matrix, and a scaler 
> w= list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)
> w
$name
[1] "Fred"

$mynumbers
[1]  1.0  2.0  5.3  6.0 -2.0  4.0

$mymatrix
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

$age
[1] 5.3
  1. extract the various parts of a list with  list[[“name”]], as in w[[“mymatrix”]]

  2. array – are matrices with more than 2 dimensions

  3. factors

Useful functions on data types

  1. dput(var) – will give structure of var

  2. class(var) – will tell the data type

  3. dim(var) – will set dimension

  4. as.matrix(data.frame) – useful for changing a data.frame into a matrix, though be careful because if there are any character values in the data frame then all entries in the matrix will be charcter

Sometimes R transforms data in ways I don’t predict, but the best strategy is just to force R to do what I want more explicitly.

Converting columns into vectors

When originally selecting out the columns of the data, we could have selected out vectors directly instead of selecting a column and transforming the column to a vector. Instead of asking for the column which gives a column we can ask for every value in that column by adding in a “,” infront of the column name. The brackets take the equivalent of x and y coordinates or row and column position. By adding a “,” with no value before it, we are giving a wild card to the row identifier and saying give me all the values for all rows in the column “avg_ms”

x=mydata[,'avg_ms'] > class(x) [1] "numeric" > x  [1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

We can also extract the values by the column position instead of column name. The “avg_ms” is column 3

> x=mydata[,3] > class(x) [1] "numeric" > x  [1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

A third way to get the vector format is using “[[ ]]” syntax

> x=mydata[[3]] > class(x) [1] "numeric" > x  [1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

A forth way is with the matrix$col syntax

> x=mydata$avg_ms > class(x) [1] "numeric" > x  [1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

Another way that we’ll talk about in converting a row to a vector is the apply() and as.numeric() functions: The function apply can also change a column to a vector

> x=mydata['avg_ms'] > class(x) [1] "data.frame" > x    avg_ms 1    0.02 2    0.04 3    0.20 4    0.03 5    0.04 6    0.23 7    0.06 8    0.07 9    0.18 10   0.03 11   0.01 12   0.06 13   0.01 14   0.01 15   0.02 > x=apply(x,1,as.numeric) > class(x) [1] "numeric" > x [1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

These vector extractions work for columns but things are different for rows.

Extracting Rows and converting Rows to numeric vectors

The other side other coin is extracting a row into vector format. In mydata, the rows don’t have names, so we have to use position. By specifying row position with no following column names then all column values are given for that row.

> row=mydata[3,] > class(row) [1] "data.frame" > row   IOsize threads avg_ms  MB.s 3      0      64    0.2 0.025

The resulting data is a  data frame and not a vector  (ie a vector is of datatype numeric) We can use the “as.numeric” function to convert the data.frame to a vector, ie numeric. The apply() function will apply the “as.numeric” function to multiple values at once. The apply() takes 3 args

  1. input variable

  2. 1=row,2=col,1:2=both

  3. function to apply

> ra=apply(row,2,as.numeric) > class(ra) [1] "numeric" > ra  IOsize threads  avg_ms    MB.s   0.000  64.000   0.200   0.025

The above applies  the change to all columns in the given row in a data.frame.

(apply can also be used for example to change all 0 to NULLs

new_matrix = apply(matrix,1:2,function(x)if (x==0)  NULL else x)

For selecting the row out directly as a vector, the as.matrix() function can also be used

> row=as.matrix(mydata)[3,] > class(row) [1] "numeric" > row  IOsize threads  avg_ms    MB.s   0.000  64.000   0.200   0.025

yet another way

> row=c(t(mydata[3,])) > class(row) [1] "numeric" > row [1]  0.000 64.000  0.200  0.025

or yet

> row=unlist(mydata[3,]) > class(row) [1] "numeric" > row  IOsize threads  avg_ms    MB.s   0.000  64.000   0.200   0.025

Filtering Data

The data in the CSV file actually represents throughput not only at different I/O send sizes but also for different number of concurrent senders. What if I wanted to just plot the throughput by I/O send size for tests with one thread? How would I filter the data?

IOsize=subset(mydata[,'IOsize'],mydata['threads'] == 1 ) MBs=subset(mydata[,'MB.s'],mydata['threads'] == 1 ) plot(IOsize,MBs)


How about plotting the throughput by I/O size for each number of threads test? The parameter ‘type=”o”‘  makes the plot a line plot

IOsize=subset(mydata[,'IOsize'],mydata['threads'] == 1 ) MBs_1=subset(mydata[,'MB.s'],mydata['threads'] == 1 ) MBs_8=subset(mydata[,'MB.s'],mydata['threads'] == 8 ) MBs_64=subset(mydata[,'MB.s'],mydata['threads'] == 64 )

# create graph

plot(IOsize,MBs_64,type="o")

# plot other lines

lines(IOsize,MBs_1,lty=2,col="green",type="o") lines(IOsize,MBs_8,lty=3,col="red",type="o")


# add a legend legend(1,700,c("1 thread","8 threads","64 threads"), cex=0.8,    col=c("green","red","black"), lty=3:1);


Entering data

Instead of entering data via a CSV file it can be entered directly into R

> m=matrix(c(      0 ,      1 ,  .02 ,    .010 ,      0 ,      8 ,  .04 ,    .024 ,      0 ,     64 ,  .20 ,    .025 ,      8 ,      1 ,  .03 ,  70.529 ,      8 ,      8 ,  .04 , 150.389 ,      8 ,     64 ,  .23 ,  48.604 ,     32 ,      1 ,  .06 , 149.405 ,     32 ,      8 ,  .07 , 321.392 ,     32 ,     64 ,  .18 ,  73.652 ,    128 ,      1 ,  .03 , 226.457 ,    128 ,      8 ,  .01 , 557.196 ,    128 ,     64 ,  .06 , 180.176 ,   1024 ,      1 ,  .01 , 335.587 ,   1024 ,      8 ,  .01 , 726.876 ,   1024 ,     64 ,  .02 , 714.162 ), nrow=4,ncol=15, dimnames=list(rows=c( 'IOsize' ,'threads' ,'avg_ms' , 'MB/s' ))) > m rows      [,1]  [,2]   [,3]   [,4]    [,5]   [,6]    [,7]    [,8]   [,9]   [,10]   [,11]   [,12]    [,13]    [,14]    [,15]   IOsize  0.00 0.000  0.000  8.000   8.000  8.000  32.000  32.000 32.000 128.000 128.000 128.000 1024.000 1024.000 1024.000   threads 1.00 8.000 64.000  1.000   8.000 64.000   1.000   8.000 64.000   1.000   8.000  64.000    1.000    8.000   64.000   avg_ms  0.02 0.040  0.200  0.030   0.040  0.230   0.060   0.070  0.180   0.030   0.010   0.060    0.010    0.010    0.020   MB/s    0.01 0.024  0.025 70.529 150.389 48.604 149.405 321.392 73.652 226.457 557.196 180.176  335.587  726.876  714.162 > t(m)         IOsize threads avg_ms    MB/s    [1,]      0       1   0.02   0.010    [2,]      0       8   0.04   0.024    [3,]      0      64   0.20   0.025    [4,]      8       1   0.03  70.529    [5,]      8       8   0.04 150.389    [6,]      8      64   0.23  48.604    [7,]     32       1   0.06 149.405    [8,]     32       8   0.07 321.392    [9,]     32      64   0.18  73.652   [10,]    128       1   0.03 226.457   [11,]    128       8   0.01 557.196   [12,]    128      64   0.06 180.176   [13,]   1024       1   0.01 335.587   [14,]   1024       8   0.01 726.876   [15,]   1024      64   0.02 714.162

The bizarre thing about this is that the nrows corresponds to the number of columns and the matrix comes out transposed. Using t() can re-transpose it, but this is all confusing. To make it more intuitive add the argument "byrow=TRUE," and add a "NULL" for the rowname position in the row and columns name section

m=matrix(c(      0 ,      1 ,  .02 ,    .010 ,      0 ,      8 ,  .04 ,    .024 ,      0 ,     64 ,  .20 ,    .025 ,      8 ,      1 ,  .03 ,  70.529 ,      8 ,      8 ,  .04 , 150.389 ,      8 ,     64 ,  .23 ,  48.604 ,     32 ,      1 ,  .06 , 149.405 ,     32 ,      8 ,  .07 , 321.392 ,     32 ,     64 ,  .18 ,  73.652 ,    128 ,      1 ,  .03 , 226.457 ,    128 ,      8 ,  .01 , 557.196 ,    128 ,     64 ,  .06 , 180.176 ,   1024 ,      1 ,  .01 , 335.587 ,   1024 ,      8 ,  .01 , 726.876 ,   1024 ,     64 ,  .02 , 714.162 ), nrow=15,ncol=4,byrow=TRUE, dimnames=list(NULL,c( 'IOsize' ,'threads' ,'avg_ms' , 'MB/s' ))) > m      IOsize threads avg_ms    MB/s  [1,]      0       1   0.02   0.010  [2,]      0       8   0.04   0.024  [3,]      0      64   0.20   0.025  [4,]      8       1   0.03  70.529  [5,]      8       8   0.04 150.389  [6,]      8      64   0.23  48.604  [7,]     32       1   0.06 149.405  [8,]     32       8   0.07 321.392  [9,]     32      64   0.18  73.652 [10,]    128       1   0.03 226.457 [11,]    128       8   0.01 557.196 [12,]    128      64   0.06 180.176 [13,]   1024       1   0.01 335.587 [14,]   1024       8   0.01 726.876 [15,]   1024      64   0.02 714.162

Vectorwise maximum/minimum

Another issues is trying to get the max or min of two or more values on a point by point basis.

Using the “min()” function gives a single minimum and not a minimum on a point by point basis.

Use “pmax()” and “pmin()” to get point by point max and min of two or more vectors.

> lat [1]  44.370  22.558  37.708  73.070 131.950 > std [1]  37.7  21.6  67.1 136.1 186.0 > min [1] 0.0 0.6 0.6 1.0 1.0 > pmax(lat-std,min) [1] 6.670 0.958 0.600 1.000 1.000

Column Sums and Row Sums

to sum up rows or colums use “rowSums()” and   “colSUms()”

For more info

for more info on data types and manipulation see

1 view0 comments

Comments


bottom of page