How to import csv data where some observations are on two rows

长发绾君心 2021-01-16 02:29

I have a dataset with a couple million rows. It is in csv format. I wish to import it into Stata. I can do this, but there is a problem - a small percentage (but still many)

5 answers
  •  情话喂你
    2021-01-16 02:49

    A convoluted way is (comments inline):

    clear
    set more off
    
    *----- example data -----
    
    // change delimiter, if necessary
    insheet using "~/Desktop/stata_tests/test.csv", names delim(;)
    
    list
    
    *----- what you want -----
    
    // compute number of commas
    gen numcom = length(var1var2var3var4var5) ///
        - length(subinstr(var1var2var3var4var5, ",", "", .))
    
    // save all data
    tempfile orig
    save "`orig'"
    
    // keep observations that are fine
    drop if numcom != 4
    
    // save fine data
    tempfile origfine
    save "`origfine'"
    
    *-----
    
    // load all data
    use "`orig'", clear
    
    // keep offending observations
    drop if numcom == 4
    
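    // for the -reshape-: pair up consecutive offending rows
    gen i = int((_n-1)/2) +1
    // -sort- is not stable, so number the rows within each pair directly
    gen j = mod(_n-1, 2) + 1
    
    // check that pairs add up to 4 commas
    bysort i : egen check = total(numcom)
    assert check == 4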
    
    // no longer necessary
    drop numcom check
    
    // reshape wide
    reshape wide var1var2var3var4var5, i(i) j(j)
    
    // gen definitive variable
    gen var1var2var3var4var5 = var1var2var3var4var51 + var1var2var3var4var52
    keep var1var2var3var4var5
    
    // append new observations with original good ones
    append using "`origfine'"
    
    // split
    split var1var2var3var4var5, parse(,) gen(var)
    
    // we're "done"
    drop var1var2var3var4var5 numcom
    list
    

    But we don't have the details of your data, so this may or may not work; treat it as a rough draft. Depending on how much memory your data occupy, and on other details, you may need to rework parts of the code to make it more efficient.
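
    If the reshape turns out to be too slow or memory-hungry, one possible alternative is to glue each broken row to the row that follows it, in place. This is only a sketch, and it assumes that every broken observation is split across exactly two consecutive rows and that a complete observation has exactly 4 commas. The variable names line, bad and badseq are made up for the illustration, and the data are typed in with input so the snippet runs without the csv file.

    clear
    set more off
    
    // toy data: rows 2 and 3 are the two halves of one observation
    input str60 line
    "text 1,text 2,text 3,text 4,text 5"
    "text 11,text 1"
    "2,text 13,text14,text15"
    "text16,text17,text18,text19,text20"
    end
    
    // count commas, as in the code above
    gen numcom = length(line) - length(subinstr(line, ",", "", .))
    
    // flag offending rows and number them in file order
    gen byte bad = numcom != 4
    gen long badseq = sum(bad)
    
    // odd-numbered offending rows are first halves: append the next row
    replace line = line + line[_n+1] if bad & mod(badseq, 2) == 1
    
    // even-numbered offending rows are second halves: now redundant
    drop if bad & mod(badseq, 2) == 0
    
    // every remaining row should have exactly 4 commas
    replace numcom = length(line) - length(subinstr(line, ",", "", .))
    assert numcom == 4
    
    split line, parse(,) gen(var)
    drop line numcom bad badseq
    list

    Unlike the append step above, this keeps the repaired observations in their original position in the file. The assert will stop the do-file if the two-row assumption does not hold for some observation.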

    Note: the file test.csv looks like

    var1,var2,var3,var4,var5 
    text 1,    text 2,text 3   ,text 4,text 5
    text 6,text 7,text 8,text9,text10
    text 11,text 1     
             2,text 13,text14,text15
    text16,text17,text18,text19,text20
    

    Note 2: I'm using insheet because I don't have Stata 13 at the moment. import delimited is the way to go if available.
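
    For reference, a rough import delimited equivalent of the insheet line might look like the following. It assumes Stata 13 or later and the same file path and dummy delimiter as in the example; I have not checked what name Stata assigns to the single variable in this case, so run describe and adjust the rest of the code if it differs from var1var2var3var4var5.

    // read each line of the file into one string variable
    import delimited using "~/Desktop/stata_tests/test.csv", ///
        delimiters(";") varnames(1) clear
    
    // check the name and type of the variable that was created
    describe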

    Note 3: details on how the counting of commas works can be reviewed at Stata tip 98: Counting substrings within strings, by Nick Cox.
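
    As a quick illustration of that idea on a literal string (the string itself is made up): removing the commas shortens the string by one character per comma, so the difference in lengths is the number of commas.

    // "a,b,c,d,e" has 9 characters; without commas it has 5
    display length("a,b,c,d,e") - length(subinstr("a,b,c,d,e", ",", "", .))

    This displays 4, which is the count that a complete five-field row should have.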
