R: How can I extract an element from a column of data in a Spark connection (sparklyr) using pipes?

轻奢々 2021-01-27 03:38

I have a dataset as below.

Because the dataset is large, I loaded it through the sparklyr package, so I can only use pipe statements.



        
1 answer
  •  再見小時候
    2021-01-27 04:07

    Although this isn't the most elegant string of code, it should get the job done. Since no sample dataset is provided other than a screenshot, I just created a sample with the important elements you were interested in.

    library(dplyr)  # provides tibble(), mutate(), select() and %>%

    csj <- tibble(helpful = rep(c("[0,0]", "[0,1]", "[0,2]", "[1,3]"), 100),
                  overall = rep(c(5, 4, 3, 2), 100))
    # parse the two numbers out of `helpful` and create the `help` column
    csj %>%
      mutate(col1 = as.numeric(stringi::stri_extract_first_regex(helpful, pattern = "[0-9]+")), # first number
             col2 = as.numeric(stringi::stri_extract_last_regex(helpful, pattern = "[0-9]+")),  # second number
             col3 = ifelse(col2 == 0, 1, col2), # replace a 0 denominator with 1
             help = col1 / col3) %>%            # divide col1 by col3
      select(helpful, help)                     # keep only the columns you need
    

    This should work as long as you adapt the functions to your dataset. Also note that helpful is a character column in your dataset, which is why the extracted values have to be converted to numeric.
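
    For a quick local check without Spark, the same extract-and-divide logic can be sketched in base R. The helper name `helpful_ratio` is hypothetical and not part of the original answer; it assumes the strings look like "[a, b]" and treats anything with fewer than two numbers as missing.

```r
# Hypothetical helper (not from the original answer): parse strings such as
# "[2, 10]" and return first / second, treating a zero denominator as 1.
helpful_ratio <- function(x) {
  # Pull every run of digits out of each string
  nums <- regmatches(x, gregexpr("[0-9]+", x))
  vapply(nums, function(n) {
    if (length(n) < 2) return(NA_real_)  # malformed or missing input
    num <- as.numeric(n[1])
    den <- as.numeric(n[length(n)])
    if (den == 0) den <- 1               # avoid division by zero
    num / den
  }, numeric(1))
}

helpful_ratio(c("[0,0]", "[1,3]", "[2, 10]", "[,1]", NA))
```

    Matching runs of digits with "[0-9]+" rather than a single digit keeps multi-digit counts such as 10 intact.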

    EDIT: I read up on sparklyr and realized why the code above isn't working, so I created an example to test for myself. Although I didn't replicate your data completely, I came up with enough to hopefully provide a working solution.

    library(sparklyr)
    library(dplyr)
    sc <- spark_connect(master = "local")
    # create a local data frame, including malformed and missing values
    cjs <- tibble(helpful = rep(c("[0,  0]", "[0, 1]", "[0, 2]", "[1, 3]", "[,1]", NA, "a"), 100),
                  overall = rep(c(6, 5, 4, 3, 2, 1, 0), 100))

    # copy the local data frame to Spark
    csj <- copy_to(sc, cjs, "cjs")

    # regexp_replace() and substring_index() are Spark SQL functions that are passed through to Spark
    csj %>%
      mutate(newcol2 = regexp_replace(helpful, "[^0-9,]", " "),       # replace everything except digits and commas with a space
             newcol3 = as.numeric(substring_index(newcol2, ",", 1)),  # number before the comma
             newcol4 = as.numeric(substring_index(newcol2, ",", -1)), # number after the comma
             newcol5 = ifelse(newcol4 == 0, 1, newcol4),              # replace a 0 denominator with 1
             help = newcol3 / newcol5) %>%
      select(starts_with("new"), help) # select the columns you need, with help calculated
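
    To make the role of substring_index in the pipeline above easier to follow, here is a rough base-R stand-in. The function name `substring_index_r` is hypothetical and only approximates the Spark SQL function: for example, it drops a trailing empty piece when the string ends with the delimiter.

```r
# Hypothetical base-R stand-in for Spark SQL's substring_index(); illustration only.
substring_index_r <- function(str, delim, count) {
  # Split on the literal delimiter
  parts <- strsplit(str, delim, fixed = TRUE)[[1]]
  if (count > 0) {
    # positive count: everything before the count-th delimiter from the left
    paste(head(parts, count), collapse = delim)
  } else {
    # negative count: everything after the count-th delimiter from the right
    paste(tail(parts, -count), collapse = delim)
  }
}

substring_index_r("0,1", ",", 1)   # piece before the comma
substring_index_r("0,1", ",", -1)  # piece after the comma
substring_index_r(",1", ",", 1)    # empty string: nothing before the comma
```

    A count of 1 picks the first number and a count of -1 picks the last one, which is why the pipeline uses both. A malformed entry such as "[,1]" yields an empty string for the first piece, which cannot be converted to a number.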
    
