I have a dataset with a couple million rows. It is in csv format. I wish to import it into Stata. I can do this, but there is a problem - a small percentage (but still many)
A convoluted way is (comments inline):
clear
set more off
*----- example data -----
// change delimiter, if necessary
insheet using "~/Desktop/stata_tests/test.csv", names delim(;)
list
*----- what you want -----
// compute number of commas
gen numcom = length(var1var2var3var4var5) ///
- length(subinstr(var1var2var3var4var5, ",", "", .))
// save all data
tempfile orig
save "`orig'"
// keep observations that are fine
drop if numcom != 4
// save fine data
tempfile origfine
save "`origfine'"
*-----
// load all data
use "`orig'", clear
// keep offending observations
drop if numcom == 4
// for the -reshape-
gen i = int((_n-1)/2) +1
bysort i : gen j = _n
// check that pairs add up to 4 commas
by i : egen check = total(numcom)
assert check == 4
// no longer necessary
drop numcom check
// reshape wide
reshape wide var1var2var3var4var5, i(i) j(j)
// gen definitive variable
gen var1var2var3var4var5 = var1var2var3var4var51 + var1var2var3var4var52
keep var1var2var3var4var5
// append new observations with original good ones
append using "`origfine'"
// split
split var1var2var3var4var5, parse(,) gen(var)
// we're "done"
drop var1var2var3var4var5 numcom
list
But we don't really have the details of your data, so this may or may not work. It's just meant to be a rough draft. Depending on the memory space occupied by your data, and other details, you may need to improve parts of the code so it be made more efficient.
Note: the file test.csv
looks like
var1,var2,var3,var4,var5
text 1, text 2,text 3 ,text 4,text 5
text 6,text 7,text 8,text9,text10
text 11,text 1
2,text 13,text14,text15
text16,text17,text18,text19,text20
Note 2: I'm using insheet
because I don't have Stata 13 at the moment. import delimited
is the way to go if available.
Note 3: details on how the counting of commas works can be reviewed at Stata tip 98: Counting substrings within strings, by Nick Cox.