### Tibbles, Data frames and Vectors

#### June 18, 2019    tibble

I learned something new today and the reason I write this is that I’m somewhat confused as to how I didn’t know it already.

# Data Frames

Let us start by creating a data frame.

temp = data.frame(x = c(1:3), y = c(10:12))
temp
  x  y
1 1 10
2 2 11
3 3 12

We can select rows

temp[1,]
  x  y
1 1 10

and columns

temp$x
[1] 1 2 3
temp[,2]
[1] 10 11 12

What I hadn’t realised before is that there is another way to select columns, using the [[ notation that I’m used to using with lists.

temp[[1]]
[1] 1 2 3

Let’s see if there is any difference in what is returned with these variations.

class(temp[1,])
[1] "data.frame"
class(temp[,1])
[1] "integer"
class(temp$x)
[1] "integer"
class(temp[[1]])
[1] "integer"

So selecting a row returns a data frame, whereas selecting a column in any way seems to return a vector.
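The vector returned by temp[,1] comes from the drop argument of [ for data frames, which defaults to TRUE when a single column is selected. A minimal base R sketch, rebuilding the same temp data frame, shows how to keep the data frame structure if that is what you want:

```r
# Recreate the data frame used above
temp <- data.frame(x = 1:3, y = 10:12)

# Default behaviour: a single selected column is dropped to a vector
class(temp[, 1])                # "integer"

# drop = FALSE keeps the data frame structure
class(temp[, 1, drop = FALSE])  # "data.frame"

# List-style single-bracket indexing also returns a data frame
class(temp[1])                  # "data.frame"
```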

# Tibbles

Let’s do the same things with a tibble

library(tibble)
temp2 = tibble(x = c(1:3), y = c(10:12))
temp2
# A tibble: 3 x 2
      x     y
  <int> <int>
1     1    10
2     2    11
3     3    12

We can select rows

temp2[1,]
# A tibble: 1 x 2
      x     y
  <int> <int>
1     1    10

and columns

temp2$x
[1] 1 2 3
temp2[,2]
# A tibble: 3 x 1
      y
  <int>
1    10
2    11
3    12
temp2[[1]]
[1] 1 2 3

You’ve probably noticed a difference already, but if we now look at what is returned we see some differences to the behaviour of data.frame().

class(temp2[1,])
[1] "tbl_df" "tbl" "data.frame"
class(temp2[,1])
[1] "tbl_df" "tbl" "data.frame"
class(temp2$x)
[1] "integer"
class(temp2[[1]])
[1] "integer"

There is now consistency using [,] notation - both temp2[1,] and temp2[,1] return a tibble. Using either $ or [[]] returns a vector, as it did for a data.frame.

# Why did this make a difference?

What happened in my case was that I’d written a function that had a data frame as one of the arguments. In this function I’d used the $ notation to select a column and then treated this as a vector. Not a problem.

I then wrote a similar function and used the [,] notation to select a column and then treated it as a vector. This wasn’t a problem when I gave it a data frame but then I gave it a tibble. This actually threw an error for me but this sort of thing could result in just the wrong value being returned. A quick demo of this could be using the length() function.

# data frame
length(temp[,1])
[1] 3
# tibble
length(temp2[,1])
[1] 1
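The "wrong value" case can be sketched with a pair of small functions. The helper names mean_single_bracket and mean_double_bracket are hypothetical, made up for this illustration, and are not the functions I originally wrote. The sketch assumes the tibble package is installed:

```r
library(tibble)

# Hypothetical helpers, named only for this illustration
mean_single_bracket <- function(df) mean(df[, 1])   # relies on [,] dropping to a vector
mean_double_bracket <- function(df) mean(df[[1]])   # always extracts a vector

df  <- data.frame(x = 1:3)
tbl <- tibble(x = 1:3)

mean_single_bracket(df)    # 2
mean_double_bracket(df)    # 2
mean_double_bracket(tbl)   # 2

# For the tibble, tbl[, 1] is still a tibble, so mean() warns and returns NA
suppressWarnings(mean_single_bracket(tbl))
```

Nothing stops here, which is what makes it easy to miss: mean() only issues a warning and hands back NA.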

You may be wondering what sort of error I got. As an example, let’s do something slightly more meaningful.

temp = data.frame(x = rnorm(100))
temp2 = tibble(x = rnorm(100, 2, 0.5))

Then something like this will work fine

t.test(temp$x, temp2$x)

Welch Two Sample t-test

data:  temp$x and temp2$x
t = -17.078, df = 143.46, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.131615 -1.689357
sample estimates:
mean of x mean of y
0.1093995 2.0198855 

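but this threw an error for me, because temp2[,1] is a one-column tibble rather than a vector (the exact behaviour may depend on the tibble version).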

t.test(temp[,1], temp2[,1])

Welch Two Sample t-test

data:  temp[, 1] and temp2[, 1]
t = -17.078, df = 143.46, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.131615 -1.689357
sample estimates:
mean of x mean of y
0.1093995 2.0198855 

Now that I know about [[]], I will definitely be using it with tibbles.
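As a rough sketch of what that looks like for the t-test example, assuming the tibble package is installed (the seed is arbitrary and only there to make the run reproducible):

```r
library(tibble)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
temp  <- data.frame(x = rnorm(100))
temp2 <- tibble(x = rnorm(100, 2, 0.5))

# [[ returns a plain numeric vector for both objects
is.numeric(temp[[1]])    # TRUE
is.numeric(temp2[[1]])   # TRUE

# so t.test() receives two vectors regardless of the input class
res <- t.test(temp[["x"]], temp2[["x"]])
res$p.value

# With a tibble, [,] only returns a vector if drop = TRUE is requested
same <- identical(temp2[, 1, drop = TRUE], temp2[[1]])
same
```

Selecting by name with [["x"]] rather than by position is a small design choice: it keeps working if the column order changes.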