Tibbles, Data frames and Vectors

June 18, 2019    tibble

I learned something new today, and I’m writing it down because I’m somewhat baffled that I didn’t know it already.

Data Frames

Let us start by creating a data frame.

temp = data.frame(x = c(1:3), y = c(10:12))
temp
  x  y
1 1 10
2 2 11
3 3 12

We can select rows

temp[1,]
  x  y
1 1 10

and columns

temp$x
[1] 1 2 3
temp[,2]
[1] 10 11 12

What I hadn’t realised before is that there is another way to select columns: the [[ ]] notation that I’m used to using with lists.

temp[[1]]
[1] 1 2 3

Let’s see if there is any difference in what is returned with these variations.

class(temp[1,])
[1] "data.frame"
class(temp[,1])
[1] "integer"
class(temp$x)
[1] "integer"
class(temp[[1]])
[1] "integer"

So selecting a row returns a data frame, whereas selecting a column in any way seems to return a vector.
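
If a single column needs to stay a data frame, base R’s [ accepts a drop argument. A small sketch of that, reusing the temp data frame from above (nothing here goes beyond base R):

```r
# Recreate the data frame from above
temp = data.frame(x = c(1:3), y = c(10:12))

# Default: a single column is simplified to a vector
class(temp[, 1])
# [1] "integer"

# drop = FALSE keeps the data frame structure
class(temp[, 1, drop = FALSE])
# [1] "data.frame"
```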

Tibbles

Let’s do the same things with a tibble.

library(tibble)
temp2 = tibble(x = c(1:3), y = c(10:12))
temp2
# A tibble: 3 × 2
      x     y
  <int> <int>
1     1    10
2     2    11
3     3    12

We can select rows

temp2[1,]
# A tibble: 1 × 2
      x     y
  <int> <int>
1     1    10

and columns

temp2$x
[1] 1 2 3
temp2[,2]
# A tibble: 3 × 1
      y
  <int>
1    10
2    11
3    12
temp2[[1]]
[1] 1 2 3

You’ve probably noticed a difference already: temp2[,2] printed as a tibble rather than a vector. Looking at the class of what is returned confirms that tibbles behave differently from data frames here.

class(temp2[1,])
[1] "tbl_df"     "tbl"        "data.frame"
class(temp2[,1])
[1] "tbl_df"     "tbl"        "data.frame"
class(temp2$x)
[1] "integer"
class(temp2[[1]])
[1] "integer"

There is now consistency using [,] notation - both temp2[1,] and temp2[,1] return a tibble. Using either $ or [[]] returns a vector as it did for a data.frame.
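
The reverse is also possible: the [ method for tibbles accepts drop = TRUE, which simplifies a single selected column to a vector. A minimal sketch, assuming the tibble package is installed:

```r
library(tibble)

temp2 = tibble(x = c(1:3), y = c(10:12))

# Default for a tibble: a one-column tibble
is.data.frame(temp2[, 1])
# [1] TRUE

# drop = TRUE simplifies a single column to a vector
class(temp2[, 1, drop = TRUE])
# [1] "integer"
```

This makes the intent explicit at the call site, although $ or [[ ]] is shorter when a vector is all that is wanted.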

Why did this make a difference?

What happened in my case was that I’d written a function that had a data frame as one of the arguments. In this function I’d used the $ notation to select a column and then treated this as a vector. Not a problem.

I then wrote a similar function that used the [,] notation to select a column and again treated the result as a vector. This wasn’t a problem when I gave it a data frame, but then I gave it a tibble. In my case this threw an error, but the same mistake could just as easily return the wrong value silently. A quick demo of this uses length(), which counts columns when it is given a data frame or a tibble.

# data frame
length(temp[,1])
[1] 3
# tibble
length(temp2[,1])
[1] 1
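
One way to make a function indifferent to what it is handed is to extract the column with [[ ]] inside the function. The helper below is purely illustrative (column_length is a made-up name, not something from a package), so treat it as a sketch:

```r
library(tibble)

# Hypothetical helper: extract a column as a vector, whatever the input class
column_length = function(data, column) {
  values = data[[column]]  # [[ ]] returns a vector for data frames and tibbles
  length(values)
}

temp = data.frame(x = c(1:3), y = c(10:12))
temp2 = tibble(x = c(1:3), y = c(10:12))

column_length(temp, 1)
# [1] 3
column_length(temp2, 1)
# [1] 3
```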

You may be wondering what sort of error I got. As an example, let’s do something slightly more meaningful.

temp = data.frame(x = rnorm(100))
temp2 = tibble(x = rnorm(100, 2, 0.5))

Then something like this will work fine

t.test(temp$x, temp2$x)

	Welch Two Sample t-test

data:  temp$x and temp2$x
t = -19.116, df = 158.14, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.248094 -1.827040
sample estimates:
  mean of x   mean of y 
-0.04498426  1.99258301 

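but this threw an error for me, because temp2[,1] is a one-column tibble rather than a numeric vector. The output below is from a run where the call went through, so whether it fails seems to depend on the R and tibble versions installed.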

t.test(temp[,1], temp2[,1])

	Welch Two Sample t-test

data:  temp[, 1] and temp2[, 1]
t = -19.116, df = 158.14, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.248094 -1.827040
sample estimates:
  mean of x   mean of y 
-0.04498426  1.99258301 

Now I know about [[]] I will definitely be using it with tibbles.
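
For the t-test above, that would look something like the sketch below. The data are random, so no output is shown, and the set.seed() call is an addition that is only there to make reruns repeatable.

```r
library(tibble)

set.seed(1)  # arbitrary seed, only so reruns give the same numbers
temp = data.frame(x = rnorm(100))
temp2 = tibble(x = rnorm(100, 2, 0.5))

# [[ ]] returns a plain numeric vector for both objects
result = t.test(temp[[1]], temp2[[1]])
result
```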