How to import csv data where some observations are on two rows

别跟我提以往 2021-01-16 02:10

I have a dataset with a couple million rows. It is in csv format. I wish to import it into Stata. I can do this, but there is a problem - a small percentage (but still many)

5 answers
  • 2021-01-16 02:36

    A bit of speculation without seeing the exact data: following Roberto Ferrer's comment, you might find the Stata command filefilter useful for cleaning the CSV file before importing it. It replaces an old string pattern with a new one, and the patterns can contain ordinary characters as well as escape sequences such as \n and \r.
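
    As a rough sketch of what that cleaning could look like, suppose (and this is only an assumption about the data) that genuine record ends are Windows line endings (\r\n) while the stray breaks inside fields are bare \n. The file names and the @@EOL@@ placeholder are hypothetical; the placeholder must not occur anywhere in the file.

    ```stata
    * Protect the genuine record ends with a placeholder
    filefilter "raw.csv" "step1.csv", from("\r\n") to("@@EOL@@") replace

    * Any \n that is left is a stray break: turn it into a space
    filefilter "step1.csv" "step2.csv", from("\n") to(" ") replace

    * Restore the genuine record ends
    filefilter "step2.csv" "clean.csv", from("@@EOL@@") to("\r\n") replace
    ```

    If the stray breaks and the real line endings use the same characters, this trick cannot tell them apart and another approach is needed.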

  • 2021-01-16 02:39

    If the fields that contain the line breaks are enclosed in double quotes in the CSV file, then you can use the bindquote(strict) option of import delimited.
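
    A minimal example, assuming a hypothetical file mydata.csv whose first row holds the variable names:

    ```stata
    * bindquote(strict) keeps everything between a pair of double quotes
    * in one field, even if the quoted text runs over a line break
    import delimited using "mydata.csv", bindquote(strict) varnames(1) clear
    ```

    This only helps when the broken fields really are quoted; unquoted line breaks will still split the observation.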

  • 2021-01-16 02:42

    I would try the following strategy.

    1. Import as a single string variable.
    2. Count the commas on each line and, where a line is incomplete, combine it with the lines that follow.
    3. Delete redundant material.

    The comma count will be:

    length(variable) - length(subinstr(variable, ",", "", .)) 
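
    The three steps might be sketched as follows. This is only a draft under assumptions that are not in the question: a hypothetical file mydata.csv, 5 fields per record (so 4 commas), no commas or double quotes inside fields, no semicolons anywhere in the file, and the pieces of a broken record sitting on consecutive lines. All variable names are made up for illustration.

    ```stata
    clear

    * 1. Read each physical line into one string variable (v1) by using
    *    a delimiter that is assumed not to occur in the file
    import delimited using "mydata.csv", delimiters(";") varnames(nonames) ///
        stringcols(_all) clear
    gen long id = _n                      // remember the original line order

    * 2. Count commas per line, then keep a running total that restarts
    *    once a record is complete (4 commas)
    gen ncommas = length(v1) - length(subinstr(v1, ",", "", .))
    gen running = ncommas
    replace running = running[_n-1] + ncommas if _n > 1 & running[_n-1] < 4
    gen long rec = 1
    replace rec = rec[_n-1] + (running[_n-1] >= 4) if _n > 1

    *    Glue the pieces of each record together, in original order
    bysort rec (id) : gen full = v1 if _n == 1
    by rec : replace full = full[_n-1] + v1 if _n > 1
    by rec : keep if _n == _N

    * 3. Split into fields and delete the helper variables
    split full, parse(",") gen(var)
    drop v1 id ncommas running rec full
    ```

    Because the header line is read as data here, the first observation holds the variable names and has to be dealt with separately.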
    
  • 2021-01-16 02:52

    I can't offer any code at the moment, but I suggest you take a good look at help import. The help files for the infile and infix commands state:

    An observation can be on more than one line.

    (I don't know if this means that all observations should be on several lines, or if it can handle cases where only some observations are on more than one line.)

    Also check the manuals if the examples and notes in the help files turn out to be insufficient.

  • 2021-01-16 02:57

    A convoluted way is (comments inline):

    clear
    set more off
    
    *----- example data -----
    
    // change delimiter, if necessary
    insheet using "~/Desktop/stata_tests/test.csv", names delim(;)
    
    list
    
    *----- what you want -----
    
    // compute number of commas
    gen numcom = length(var1var2var3var4var5) ///
        - length(subinstr(var1var2var3var4var5, ",", "", .))
    
    // save all data
    tempfile orig
    save "`orig'"
    
    // keep observations that are fine
    drop if numcom != 4
    
    // save fine data
    tempfile origfine
    save "`origfine'"
    
    *-----
    
    // load all data
    use "`orig'", clear
    
    // keep offending observations
    drop if numcom == 4
    
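    // for the -reshape-: pair consecutive rows, keeping their original order
    // (bysort does not guarantee the order of rows within a group)
    gen i = int((_n-1)/2) + 1
    gen j = mod(_n-1, 2) + 1
    sort i j
    
    // check that pairs add up to 4 commas
    by i : egen check = total(numcom)
    assert check == 4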
    
    // no longer necessary
    drop numcom check
    
    // reshape wide
    reshape wide var1var2var3var4var5, i(i) j(j)
    
    // gen definitive variable
    gen var1var2var3var4var5 = var1var2var3var4var51 + var1var2var3var4var52
    keep var1var2var3var4var5
    
    // append new observations with original good ones
    append using "`origfine'"
    
    // split
    split var1var2var3var4var5, parse(,) gen(var)
    
    // we're "done"
    drop var1var2var3var4var5 numcom
    list
    

    But we don't really have the details of your data, so this may or may not work. It's just meant to be a rough draft. Depending on how much memory your data occupy, and on other details, you may need to rework parts of the code to make it more efficient.

    Note: the file test.csv looks like

    var1,var2,var3,var4,var5 
    text 1,    text 2,text 3   ,text 4,text 5
    text 6,text 7,text 8,text9,text10
    text 11,text 1     
             2,text 13,text14,text15
    text16,text17,text18,text19,text20
    

    Note 2: I'm using insheet because I don't have Stata 13 at the moment. import delimited is the way to go if available.
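
    With import delimited, the insheet line above would presumably become something like the following (untested here, for the reason just given). The variable name that import delimited derives from the header row may differ from the one insheet produces, so check it with describe before running the rest of the code.

    ```stata
    * same idea as the -insheet- call: the semicolon delimiter keeps each line in one variable
    import delimited using "~/Desktop/stata_tests/test.csv", ///
        delimiters(";") varnames(1) clear
    describe
    ```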

    Note 3: details on how the counting of commas works can be reviewed at Stata tip 98: Counting substrings within strings, by Nick Cox.
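
    In short, removing every comma shortens the string by exactly the number of commas it contained. A quick check on a made-up string:

    ```stata
    * "a,b,c,d,e" has 9 characters; without commas it has 5
    display length("a,b,c,d,e") - length(subinstr("a,b,c,d,e", ",", "", .))
    * displays 4
    ```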
