I learned something new today, and I'm writing it down mostly because I'm somewhat confused as to how I didn't know it already.
Let us start by creating a data frame.
temp = data.frame(x = c(1:3), y = c(10:12))
temp
x y
1 1 10
2 2 11
3 3 12
We can select rows
temp[1,]
x y
1 1 10
and columns
temp$x
[1] 1 2 3
temp[,2]
[1] 10 11 12
What I hadn’t realised before is that there is another way to select columns, using the [[ notation that I’m used to using with lists.
temp[[1]]
[1] 1 2 3
Let’s see if there is any difference in what is returned with these variations.
class(temp[1,])
[1] "data.frame"
class(temp[,1])
[1] "integer"
class(temp$x)
[1] "integer"
class(temp[[1]])
[1] "integer"
So selecting a row returns a data frame, whereas selecting a column in any way seems to return a vector.
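The reason a single column comes back as a vector from [,] is the drop argument of data frame subsetting, which simplifies a one-column result by default. A minimal sketch of that behaviour, rebuilding the same temp data frame as above:

```r
temp <- data.frame(x = 1:3, y = 10:12)

# Default behaviour: a single selected column is simplified to a vector
class(temp[, 1])                 # "integer"

# drop = FALSE switches the simplification off, so a data frame is kept
class(temp[, 1, drop = FALSE])   # "data.frame"

# List-style single brackets (no comma) also keep a data frame
class(temp[1])                   # "data.frame"
```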
Let’s do the same things with a tibble
library(tibble)
temp2 = tibble(x = c(1:3), y = c(10:12))
temp2
# A tibble: 3 × 2
x y
<int> <int>
1 1 10
2 2 11
3 3 12
We can select rows
temp2[1,]
# A tibble: 1 × 2
x y
<int> <int>
1 1 10
and columns
temp2$x
[1] 1 2 3
temp2[,2]
# A tibble: 3 × 1
y
<int>
1 10
2 11
3 12
temp2[[1]]
[1] 1 2 3
You’ve probably noticed a difference already, but if we now look at what is returned we see some differences from the behaviour of data.frame().
class(temp2[1,])
[1] "tbl_df" "tbl" "data.frame"
class(temp2[,1])
[1] "tbl_df" "tbl" "data.frame"
class(temp2$x)
[1] "integer"
class(temp2[[1]])
[1] "integer"
There is now consistency using [,] notation: both temp2[1,] and temp2[,1] return a tibble. Using either $ or [[]] returns a vector, as it did for a data.frame.
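As a small sketch of the point above, [[ works with either a column position or a column name, and both give the same plain vector from a tibble. The variable names by_position and by_name are only illustrative:

```r
library(tibble)

temp2 <- tibble(x = 1:3, y = 10:12)

# [[ extracts the column itself, by position or by name
by_position <- temp2[[1]]
by_name     <- temp2[["x"]]

identical(by_position, by_name)  # TRUE
class(by_name)                   # "integer"

# [,] keeps the tibble wrapper around the single column
ncol(temp2[, 1])                 # 1
```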
What happened in my case was that I’d written a function that had a data frame as one of the arguments. In this function I’d used the $ notation to select a column and then treated this as a vector. Not a problem.
I then wrote a similar function and used the [,] notation to select a column and then treated it as a vector. This wasn’t a problem when I gave it a data frame, but then I gave it a tibble. This actually threw an error for me, but this sort of thing could just as easily return the wrong value without any warning. A quick demo of this uses the length() function.
# data frame
length(temp[,1])
[1] 3
# tibble
length(temp2[,1])
[1] 1
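The length of a data frame or tibble is its number of columns, which is why the tibble case gives 1 rather than 3. One way to guard a function against this is to extract the column with [[ so that it behaves the same for both inputs. The helper get_column below is hypothetical, written only for this illustration, and is not the function from my own code:

```r
library(tibble)

# Hypothetical helper: return one column as a plain vector,
# whether df is a data.frame or a tibble.
get_column <- function(df, col) {
  stopifnot(is.data.frame(df))   # tibbles also pass this check
  values <- df[[col]]            # [[ gives a vector for both classes
  if (is.null(values)) {
    stop("column not found: ", col)
  }
  values
}

df_input  <- data.frame(x = 1:3, y = 10:12)
tbl_input <- tibble(x = 1:3, y = 10:12)

length(get_column(df_input, 1))    # 3
length(get_column(tbl_input, 1))   # 3
```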
You may be wondering what sort of error I got. As an example, let’s do something slightly more meaningful.
temp = data.frame(x = rnorm(100))
temp2 = tibble(x = rnorm(100, 2, 0.5))
Then something like this will work fine
t.test(temp$x, temp2$x)
Welch Two Sample t-test
data: temp$x and temp2$x
t = -19.116, df = 158.14, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.248094 -1.827040
sample estimates:
mean of x mean of y
-0.04498426 1.99258301
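but this threw an error when I first ran it, because temp2[,1] is a one-column tibble rather than a numeric vector. Whether it fails appears to depend on the installed versions of R and tibble; in the output shown here the call ran and gave the same result.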
t.test(temp[,1], temp2[,1])
Welch Two Sample t-test
data: temp[, 1] and temp2[, 1]
t = -19.116, df = 158.14, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.248094 -1.827040
sample estimates:
mean of x mean of y
-0.04498426 1.99258301
Now that I know about [[]], I will definitely be using it with tibbles.
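For completeness, here is a sketch of the same comparison written with [[, which hands t.test() a plain numeric vector from both the data frame and the tibble. The set.seed() call is my addition for reproducibility and was not part of the example above, so the numbers will differ from the output shown earlier:

```r
library(tibble)

set.seed(1)  # assumption: any seed will do, it only makes the draws repeatable
temp  <- data.frame(x = rnorm(100))
temp2 <- tibble(x = rnorm(100, 2, 0.5))

# Both arguments are numeric vectors, regardless of the container class
result <- t.test(temp[[1]], temp2[[1]])

class(result)    # "htest"
result$p.value   # p-value of the Welch two sample t-test
```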