How to import csv data where some observations are on two rows


长发绾君心 2021-01-16 02:29

I have a dataset with a couple million rows. It is in csv format. I wish to import it into Stata. I can do this, but there is a problem - a small percentage (but still many)

5 answers

情话喂你 (original poster)

2021-01-16 02:49

A convoluted way is (comments inline):

clear
set more off

*----- example data -----

// read each line as one string variable by using a delimiter that
// does not occur in the file; change delimiter, if necessary
insheet using "~/Desktop/stata_tests/test.csv", names delim(;)

list

*----- what you want -----

// compute number of commas
gen numcom = length(var1var2var3var4var5) ///
    - length(subinstr(var1var2var3var4var5, ",", "", .))

// save all data
tempfile orig
save "`orig'"

// keep observations that are fine
drop if numcom != 4

// save fine data
tempfile origfine
save "`origfine'"

*-----

// load all data
use "`orig'", clear

// keep offending observations
drop if numcom == 4

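// for the -reshape-: pair up consecutive rows
// (j comes from the row position, because -bysort- does not
// keep the original order of rows that tie on i)
gen long i = int((_n-1)/2) + 1
gen byte j = mod(_n-1, 2) + 1

// check that pairs add up to 4 commas
bysort i : egen check = total(numcom)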
assert check == 4

// no longer necessary
drop numcom check

// reshape wide
reshape wide var1var2var3var4var5, i(i) j(j)

// gen definitive variable
gen var1var2var3var4var5 = var1var2var3var4var51 + var1var2var3var4var52
keep var1var2var3var4var5

// append new observations with original good ones
append using "`origfine'"

// split
split var1var2var3var4var5, parse(,) gen(var)

// we're "done"
drop var1var2var3var4var5 numcom
list

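But we don't really have the details of your data, so this may or may not work. It's just meant to be a rough draft. In particular, it assumes that every broken observation is split across exactly two consecutive rows and that no field contains a comma of its own. Depending on how much memory your data occupy, and other details, you may need to rework parts of the code to make it more efficient.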

Note: the file test.csv looks like

var1,var2,var3,var4,var5 
text 1,    text 2,text 3   ,text 4,text 5
text 6,text 7,text 8,text9,text10
text 11,text 1     
         2,text 13,text14,text15
text16,text17,text18,text19,text20

Note 2: I'm using insheet because I don't have Stata 13 at the moment. import delimited is the way to go if available.
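
For Stata 13 or later, the insheet line could be swapped for something like the sketch below. It assumes the same file path and the same dummy delimiter as above; the name given to the single string variable may differ from the one insheet produces, so check it with describe before running the rest of the code.

```stata
// assumption: Stata 13 or later, same file and dummy delimiter as above
import delimited using "~/Desktop/stata_tests/test.csv", ///
    delimiters(";") varnames(1) clear

// confirm the name of the single string variable before going on
describe
```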

Note 3: details on how the counting of commas works can be reviewed at Stata tip 98: Counting substrings within strings, by Nick Cox.
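
As a quick illustration of that counting trick (a minimal sketch with a made-up string, not taken from your data): strip every comma with subinstr() and compare the lengths before and after.

```stata
// hypothetical example string containing four commas
local s "a,b,c,d,e"

// length with commas (9) minus length without commas (5)
display length("`s'") - length(subinstr("`s'", ",", "", .))
```

This displays 4, which is the value numcom takes for a complete row in the code above.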
