R - Handling multiple values as one string in a single variable

Submitted by 不羁的心 on 2021-02-20 06:55:48

Question


In a data.frame, I have a categorical variable for the language of a text. While most texts are in only one language, some are in several. In my data, these appear in the same column, separated by commas:

text = c("Text1", "Text2", "Text3")
lang = c("fr", "en", "fr,en")
d = data.frame(text, lang)

Visually:

   text  lang
1 Text1    fr
2 Text2    en
3 Text3 fr,en

I'd like to plot the number of texts in each language, with Text3 being counted both in fr and in en.

I found how to split, with:

d$lang <- strsplit(as.character(d$lang), ",")

But then I can't find a way to plot it correctly, e.g. with a qplot barplot like this one:

library(ggplot2)
qplot(lang, data=d)

Am I doing it right? Is there a better approach?


Answer 1:


You could try:

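library(splitstackshape)
library(ggplot2)

# Split the comma-separated "lang" column into one row per language ("long" format)
dl <- cSplit(d, "lang", sep = ",", direction = "long")
qplot(lang, data = dl)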
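
For readers who would rather not install an extra package, the same "long" reshaping can be sketched in base R. This is not part of the original answer, only a minimal illustration using the question's example data; the names parts, dl_base and lang_counts are made up for this sketch.

```r
# Example data from the question (kept as character, not factor)
text <- c("Text1", "Text2", "Text3")
lang <- c("fr", "en", "fr,en")
d <- data.frame(text, lang, stringsAsFactors = FALSE)

# Split each entry into its individual language codes
parts <- strsplit(d$lang, ",", fixed = TRUE)

# Repeat each text once per language it contains
dl_base <- data.frame(
  text = rep(d$text, lengths(parts)),
  lang = unlist(parts),
  stringsAsFactors = FALSE
)

# Count texts per language; Text3 contributes to both "fr" and "en"
lang_counts <- table(dl_base$lang)
barplot(lang_counts)
```

Here dl_base has one row per text/language pair, so it can be passed to qplot(lang, data = dl_base) in the same way as dl.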



Answer 2:


Without following the suggestion in user20650's comment, you probably won't be able to avoid restructuring your data, and how you restructure it depends on how the values happen to be stored. For example, if you know that the languages are represented by distinct two-character codes (so that any code other than "fr" does not contain the sequence "fr"), you could create a new boolean column for each language by searching for its code in the comma-separated string:

# Data
text = c("Text1", "Text2", "Text3", "Text4", "Text5")
lang = c("fr", "en", "fr,en", "sp,fr", "sp,fr,en")
d = data.frame(text, lang, stringsAsFactors = FALSE)

# Get a vector of the languages that exist
languages <- unique(unlist(strsplit(d$lang, ",")))

# Create a new column for each language (literal substring match on the code)
for (language in languages) d[[language]] <- grepl(language, d$lang, fixed = TRUE)

# An example bar-plot
barplot(colSums(d[, -c(1, 2)]))
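
If the codes cannot be assumed to be free of overlapping substrings, one possible variation is to test membership in the split codes instead of searching the raw string. The sketch below is an assumption-laden alternative, not the answerer's code; has_language, codes and flags are hypothetical names introduced for illustration.

```r
# Same example data as above
text <- c("Text1", "Text2", "Text3", "Text4", "Text5")
lang <- c("fr", "en", "fr,en", "sp,fr", "sp,fr,en")
d <- data.frame(text, lang, stringsAsFactors = FALSE)

# Split once, then reuse the pieces for every language
codes <- strsplit(d$lang, ",", fixed = TRUE)
languages <- unique(unlist(codes))

# Hypothetical helper: TRUE where a row lists exactly this code
has_language <- function(language) {
  vapply(codes, function(x) language %in% x, logical(1))
}

# Logical matrix with one row per text and one column per language
flags <- vapply(languages, has_language, logical(nrow(d)))

# Number of texts per language
barplot(colSums(flags))
```

Because the comparison uses %in% on whole codes, a code such as "fr" would not be matched inside a longer code that merely contains those letters.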



Answer 3:


Consider tidyr::separate() to split and tidyr::gather() to make it long.

library(magrittr)
max_languages <- 2L # The max language count of any single text
language_positions <- paste0("language_", seq_len(max_languages))

d %>%
  tidyr::separate("lang", language_positions, sep=",", extra="merge", fill="right") %>%
  tidyr::gather("ordinal", "language_name", tidyr::all_of(language_positions)) %>%
  dplyr::filter(!is.na(language_name))

The resulting long dataset is:

   text    ordinal language_name
1 Text1 language_1            fr
2 Text2 language_1            en
3 Text3 language_1            fr
4 Text3 language_2            en

If you want to break it into two smaller steps, separate() first creates a wide dataset,

> d_wide <- d %>%
+   tidyr::separate("lang", language_positions, sep=",", extra="merge", fill="right")
> d_wide
   text language_1 language_2
1 Text1         fr       <NA>
2 Text2         en       <NA>
3 Text3         fr         en

...and then gather() converts it to tall.

d_long <- d_wide %>%
  tidyr::gather("ordinal", "language_name", tidyr::all_of(language_positions)) %>%
  dplyr::filter(!is.na(language_name))

For other reasons, I suggest adding stringsAsFactors=FALSE when you define d, although tidyr's separate functions don't seem to mind either way. The qplot call stays essentially the same: qplot(language_name, data=d_long).
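
The answer hard-codes the maximum number of languages per text as 2. As a hedged sketch (not from the original answer), that value could instead be derived from the data with base R before building the column names; codes_per_text and max_languages are illustrative names.

```r
# Example data from the question
text <- c("Text1", "Text2", "Text3")
lang <- c("fr", "en", "fr,en")
d <- data.frame(text, lang, stringsAsFactors = FALSE)

# Number of comma-separated codes in each row
# (as.character() guards against lang having been stored as a factor)
codes_per_text <- lengths(strsplit(as.character(d$lang), ",", fixed = TRUE))

# Largest number of languages found in any single text
max_languages <- max(codes_per_text)

# Column names for the wide layout, one per possible position
language_positions <- paste0("language_", seq_len(max_languages))
```

The resulting language_positions vector can then be passed to the separate and gather steps in place of the one built from a fixed count.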



Source: https://stackoverflow.com/questions/29997086/r-handling-multiple-values-as-one-string-in-a-single-variable
