Tibbles, Data Frames and Vectors

June 18, 2019    tibble

I learned something new today, and I’m writing it up because I’m somewhat confused as to how I didn’t know it already.

Data Frames

Let us start by creating a data frame.

temp = data.frame(x = c(1:3), y = c(10:12))
temp
  x  y
1 1 10
2 2 11
3 3 12

We can select rows

  x  y
1 1 10

and columns

[1] 1 2 3
[1] 10 11 12

What I hadn’t realised before is that there is another way to select columns: the [[ notation that I’m used to using with lists.

[1] 1 2 3
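As a sketch, calls along these lines produce output like that shown above. Which selector produced which line is an assumption on my part, and the intermediate names (row1, col_x and so on) are only there for illustration.

```r
# Rebuild the data frame so this snippet runs on its own
temp = data.frame(x = 1:3, y = 10:12)

row1   = temp[1, ]   # first row, all columns
col_x  = temp$x      # column by name
col_y  = temp[, 2]   # column by position
col_x2 = temp[[1]]   # list-style selection; temp[["x"]] also works

print(row1)
print(col_x)
print(col_y)
print(col_x2)
```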

Let’s see if there is any difference in what is returned with these variations.

[1] "data.frame"
[1] "integer"
[1] "integer"
[1] "integer"

So selecting a row returns a data frame, whereas selecting a column in any way seems to return a vector.
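The check itself can be sketched with class(), assuming the same selections as before. The drop = FALSE line is an addition of mine: it is the base R way to keep a single column as a data frame rather than letting it collapse to a vector.

```r
temp = data.frame(x = 1:3, y = 10:12)

cls_row    = class(temp[1, ])   # "data.frame"
cls_dollar = class(temp$x)      # "integer"
cls_comma  = class(temp[, 2])   # "integer"
cls_double = class(temp[[1]])   # "integer"

# drop = FALSE stops the single column being simplified to a vector
one_col = temp[, 1, drop = FALSE]

print(c(cls_row, cls_dollar, cls_comma, cls_double))
print(class(one_col))
```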


Let’s do the same things with a tibble

library(tibble)
temp2 = tibble(x = c(1:3), y = c(10:12))
temp2
# A tibble: 3 x 2
      x     y
  <int> <int>
1     1    10
2     2    11
3     3    12

We can select rows

# A tibble: 1 x 2
      x     y
  <int> <int>
1     1    10

and columns

[1] 1 2 3
# A tibble: 3 x 1
      y
  <int>
1    10
2    11
3    12
[1] 1 2 3
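The equivalent selections on the tibble can be sketched as follows. As before, the mapping of selector to output line is an assumption, and the variable names are illustrative only.

```r
library(tibble)

temp2 = tibble(x = 1:3, y = 10:12)

row1   = temp2[1, ]   # first row: a 1 x 2 tibble
col_x  = temp2$x      # plain integer vector
col_y  = temp2[, 2]   # a 3 x 1 tibble, not a vector
col_x2 = temp2[[1]]   # plain integer vector

print(row1)
print(col_x)
print(col_y)
print(col_x2)
```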

You’ve probably noticed a difference already, but if we now look at the class of what is returned we see how the behaviour differs from that of a data frame.

[1] "tbl_df"     "tbl"        "data.frame"
[1] "tbl_df"     "tbl"        "data.frame"
[1] "integer"
[1] "integer"

There is now consistency using [,] notation - both temp2[1,] and temp2[,1] return a tibble. Using either $ or [[]] returns a vector as it did for a data.frame.
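A minimal sketch of that comparison is below. The last line goes slightly beyond the text above: tibble's [ method accepts drop = TRUE, which asks for the vector explicitly when a single column is selected.

```r
library(tibble)

temp2 = tibble(x = 1:3, y = 10:12)

cls_row    = class(temp2[1, ])   # "tbl_df" "tbl" "data.frame"
cls_comma  = class(temp2[, 1])   # "tbl_df" "tbl" "data.frame"
cls_dollar = class(temp2$x)      # "integer"
cls_double = class(temp2[[1]])   # "integer"

# Opting in to the data.frame-style simplification
as_vector = temp2[, 1, drop = TRUE]

print(cls_row)
print(cls_comma)
print(cls_dollar)
print(cls_double)
print(as_vector)
```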

Why did this make a difference?

What happened in my case was that I’d written a function that had a data frame as one of the arguments. In this function I’d used the $ notation to select a column and then treated this as a vector. Not a problem.

I then wrote a similar function that used the [,] notation to select a column and treated the result as a vector. This wasn’t a problem when I gave it a data frame, but then I gave it a tibble. In my case this threw an error, but the same mistake could just as easily return the wrong value without any error at all. A quick demo of this uses the length() function.

# data frame
[1] 3
# tibble
[1] 1
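The two results above can be reproduced with something like the following, assuming the three-row temp and temp2 objects from earlier. On the tibble, length() counts the columns of the one-column tibble rather than the elements of a vector.

```r
library(tibble)

temp  = data.frame(x = 1:3, y = 10:12)
temp2 = tibble(x = 1:3, y = 10:12)

n_df   = length(temp[, 1])    # 3: length of the integer vector
n_tbl  = length(temp2[, 1])   # 1: number of columns in the 3 x 1 tibble
n_tbl2 = length(temp2[[1]])   # 3: [[ ]] gives the vector again

print(n_df)
print(n_tbl)
print(n_tbl2)
```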

You may be wondering what sort of error I got. As an example, let’s do something slightly more meaningful.

temp = data.frame(x = rnorm(100))
temp2 = tibble(x = rnorm(100, 2, 0.5))

Then something like this will work fine

t.test(temp$x, temp2$x)

    Welch Two Sample t-test

data:  temp$x and temp2$x
t = -16.121, df = 138.71, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.084115 -1.628734
sample estimates:
 mean of x  mean of y 
0.05642707 1.91285169 

but this throws an error.

t.test(temp[,1], temp2[,1])
Error: Must use a vector in `[`, not an object of class matrix.

Now that I know about [[]], I will definitely be using it with tibbles.
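As a sketch of what that looks like in practice, the failing call can be rewritten with [[]], which returns a vector for both kinds of object. The set.seed() call and the col_mean() helper are my own illustrative additions, not part of the original functions.

```r
library(tibble)

set.seed(1)  # assumed seed, only so the example is reproducible
temp  = data.frame(x = rnorm(100))
temp2 = tibble(x = rnorm(100, 2, 0.5))

# [[ ]] hands t.test() plain numeric vectors in both cases
res        = t.test(temp[[1]], temp2[[1]])
res_dollar = t.test(temp$x, temp2$x)

# Hypothetical helper: accepts a data frame or a tibble and always
# extracts the column as a vector before using it
col_mean = function(df, col) {
  mean(df[[col]])
}

print(res$statistic)
print(col_mean(temp, "x"))
print(col_mean(temp2, "x"))
```

Writing functions this way means they behave the same whichever of the two they are given.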