Separating a column using separate (tidyr) via dplyr at the first encountered digit

北恋 2021-01-11 10:51

I'm trying to separate a rather messy column into two columns containing period and description. My data resembles the extract below:

set.         


        
2 Answers
  • 2021-01-11 11:37

    I think this might do it.

    library(tidyr)
    separate(dta, indicator, c("indicator", "period"), "(?<=[a-z]) ?(?=[0-9])")
    #           indicator   period    values
    # 1     someindicator     2001 0.2655087
    # 2     someindicator     2011 0.3721239
    # 3         some text 20022008 0.5728534
    # 4 another indicator     2003 0.9082078
    

    The following is an explanation of the regular expression, brought to you by regex101.

    • (?<=[a-z]) is a positive lookbehind: it asserts that a single lowercase letter in the range a to z (case sensitive) immediately precedes the current position, without consuming it
    • " ?" (a space followed by ?) matches the space character literally, between zero and one time, as many times as possible, giving back as needed
    • (?=[0-9]) is a positive lookahead: it asserts that a single digit in the range 0 to 9 immediately follows the current position, without consuming it
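
To see what the three parts of the pattern do on their own, here is a minimal base R sketch that locates the match with regexpr() and cuts each string around it. It reuses the example strings from this thread; the variable names (x, pattern, pos, len) are only illustrative, and this is not how separate() is implemented internally.

```r
# Example strings taken from the data in this thread
x <- c("someindicator2001", "someindicator2011",
       "some text 20022008", "another indicator 2003")

# Zero or one space that sits between a lowercase letter and a digit
pattern <- "(?<=[a-z]) ?(?=[0-9])"

# perl = TRUE is needed because lookarounds are PCRE syntax
pos <- regexpr(pattern, x, perl = TRUE)
len <- attr(pos, "match.length")  # 0 when there is no space, 1 when there is

# Everything before the match is the indicator, everything after is the period
indicator <- substr(x, 1, pos - 1)
period <- substr(x, pos + len, nchar(x))

print(data.frame(indicator, period))
```

Because the lookbehind and lookahead consume nothing, the match is either empty or a single space, so no letters or digits are lost when the string is cut.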
  • 2021-01-11 11:49

    You could also use unglue::unglue_unnest():

    dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
                                  "some text 20022008", "another indicator 2003"),
                      values = runif(n = 4))
    
    # remotes::install_github("moodymudskipper/unglue")
    library(unglue)
    unglue_unnest(dta, indicator, "{indicator}{=\\s*}{period=\\d*}")
    #>       values         indicator   period
    #> 1 0.43234262     someindicator     2001
    #> 2 0.65890900     someindicator     2011
    #> 3 0.93576805         some text 20022008
    #> 4 0.01934736 another indicator     2003
    

    Created on 2019-09-14 by the reprex package (v0.3.0)
