Question
In R, some packages (e.g. haven) attach a label attribute to variables, which explains the substantive name of the variable. For example, gdppc may have the label GDP per capita.
This is extremely useful, especially when importing data from Stata. However, I still struggle to know how to use this in my workflow.
How to quickly browse the variable and the variable label? Right now I have to do attributes(df$var), but this is hardly convenient to get a glimpse (a la names(df)).
How to use these labels in plots? Again, I can use attr(df$var, "label") to access the string label. However, it seems cumbersome.
Is there any official way to use these labels in a workflow? I can certainly write a custom function that wraps around attr, but it may break in the future when packages implement the label attribute differently. Thus, ideally I'd want an official way supported by haven (or other major packages).
Answer 1:
A solution with the purrr package from the tidyverse:
library(purrr)
df %>% map_chr(~attributes(.)$label)
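As far as I can tell, the one-liner above assumes every column carries a label: map_chr() expects a length-1 string per column and stops with an error when attributes(.)$label is NULL. A minimal sketch of a NULL-safe variant follows. The toy data frame and its column names are made up for illustration, and the label is set by hand rather than imported with haven.

```r
library(purrr)

# Toy data frame standing in for imported Stata data (hypothetical columns).
df <- data.frame(gdppc = c(1200, 3400), country = c("A", "B"))
attr(df$gdppc, "label") <- "GDP per capita"  # only one column is labelled

# Return NA for columns without a label instead of failing.
labels <- map_chr(df, function(x) {
  lab <- attr(x, "label", exact = TRUE)
  if (is.null(lab)) NA_character_ else as.character(lab)
})

labels  # named character vector: one entry per column
```

The result is named by column, so labels[["gdppc"]] looks up a single label.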
Answer 2:
Using sapply in a simple function to return a variable list as in Stata's Variable Window:
library(dplyr)
# Build a name/label lookup table from the "label" attribute of each column.
makeVlist <- function(dta) {
  labels <- sapply(dta, function(x) attr(x, "label"))
  tibble(name = names(labels),
         label = labels)
}
Answer 3:
This is one of the innovations addressed in rio (full disclosure: I wrote this package). Basically, it provides various ways of importing variable labels, including haven's way of doing things and foreign's. Here's a trivial example:
Start by making a reproducible example:
> library("rio")
> export(iris, "iris.dta")
Import using foreign::read.dta() (via rio::import()):
> str(import("iris.dta", haven = FALSE))
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "datalabel")= chr ""
- attr(*, "time.stamp")= chr "15 Jan 2016 20:05"
- attr(*, "formats")= chr "" "" "" "" ...
- attr(*, "types")= int 255 255 255 255 253
- attr(*, "val.labels")= chr "" "" "" "" ...
- attr(*, "var.labels")= chr "" "" "" "" ...
- attr(*, "version")= int -7
- attr(*, "label.table")=List of 1
..$ Species: Named int 1 2 3
.. ..- attr(*, "names")= chr "setosa" "versicolor" "virginica"
Read in using haven::read_dta() using its native variable attributes, which are stored at the variable level rather than the data.frame level:
> str(import("iris.dta", haven = TRUE, column.labels = TRUE))
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species :Class 'labelled' atomic [1:150] 1 1 1 1 1 1 1 1 1 1 ...
.. ..- attr(*, "labels")= Named int [1:3] 1 2 3
.. .. ..- attr(*, "names")= chr [1:3] "setosa" "versicolor" "virginica"
Read in using haven::read_dta() using an alternative that we (the rio developers) have found more convenient:
> str(import("iris.dta", haven = TRUE))
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "var.labels")=List of 5
..$ Sepal.Length: NULL
..$ Sepal.Width : NULL
..$ Petal.Length: NULL
..$ Petal.Width : NULL
..$ Species : NULL
- attr(*, "label.table")=List of 5
..$ Sepal.Length: NULL
..$ Sepal.Width : NULL
..$ Petal.Length: NULL
..$ Petal.Width : NULL
..$ Species : Named int 1 2 3
.. ..- attr(*, "names")= chr "setosa" "versicolor" "virginica"
By moving the attributes to be at the level of the data.frame, they're much easier to access using attr(data, "var.labels"), etc. rather than digging through each variable's attributes.
Note: the values of attributes will be NULL because I'm just writing a native R dataset to a local file in order to make this reproducible.
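To illustrate the idea of data.frame-level labels without depending on rio's import options, here is a rough base-R sketch that gathers per-variable label attributes into a single var.labels attribute on the data frame. This is not rio's implementation; the toy data and the hand-set label are assumptions made only for the example.

```r
# Toy data with a variable-level label, as haven would attach it (set by hand here).
df <- data.frame(gdppc = c(1200, 3400), country = c("A", "B"))
attr(df$gdppc, "label") <- "GDP per capita"

# Collect each column's "label" attribute (NULL when absent) into one named list
# and store it on the data frame itself.
attr(df, "var.labels") <- lapply(df, attr, which = "label", exact = TRUE)

# All labels are now reachable from one place.
attr(df, "var.labels")$gdppc
```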
Answer 4:
A simple solution with the labelled package and the tidyverse:
library(labelled)
library(tidyverse)
descriptions <- var_label(data_raw) %>%
  as_tibble() %>%
  gather(key = variable, value = description)
Answer 5:
The purpose of the labelled package is to provide convenient functions to manipulate variable and value labels as imported with haven.
In addition, the functions lookfor and describe from the questionr package are also useful to display variable and value labels.
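For the plotting part of the question, one possible approach is a small helper that returns a variable's label and falls back to the column name when no label is set. The sketch below uses only base R; label_or_name() is a hypothetical helper, and the toy data and labels are invented for illustration. With the labelled package, var_label() could presumably replace the attr() call.

```r
# Toy data with hand-set labels, mimicking what haven attaches on import.
df <- data.frame(gdppc = c(1200, 3400, 5600),
                 lifeexp = c(55, 68, 74),
                 year = c(2000, 2005, 2010))
attr(df$gdppc, "label") <- "GDP per capita"
attr(df$lifeexp, "label") <- "Life expectancy"

# Hypothetical helper: the variable's label if present, otherwise its name.
label_or_name <- function(data, var) {
  lab <- attr(data[[var]], "label", exact = TRUE)
  if (is.null(lab)) var else lab
}

# Use the labels as axis titles.
plot(df$gdppc, df$lifeexp,
     xlab = label_or_name(df, "gdppc"),
     ylab = label_or_name(df, "lifeexp"))
```

Because unlabelled columns fall back to their names, the same call works for columns such as year that carry no label.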
Source: https://stackoverflow.com/questions/34817457/convenient-way-to-access-variables-label-after-importing-stata-data-with-haven