I have a dataset with a couple million rows. It is in csv format. I wish to import it into Stata. I can do this, but there is a problem - a small percentage (but still many)
A bit of speculation without seeing the exact data: following Roberto Ferrer's comment, you might find the Stata command filefilter
useful in cleaning the csv file before importing. You can substitute new and old string patterns, using basic characters as well as more complex \n
and \r
terms.
If the observations in question are quoted in the CSV file, then you can use the bindquote(strict) option.
I would try the following strategy.
The comma count will be
length(variable) - length(subinstr(variable, ",", "", .))
I can't offer any code at the moment, but I suggest you take a good look at help import
. The infile
and infix
commands state:
An observation can be on more than one line.
(I don't know if this means that all observations should be on several lines, or if it can handle cases where only some observations are on more than one line.)
Check also the manuals if the examples and notes in the help
files turn out to be insufficient.
A convoluted way is (comments inline):
clear
set more off
*----- example data -----
// change delimiter, if necessary
insheet using "~/Desktop/stata_tests/test.csv", names delim(;)
list
*----- what you want -----
// compute number of commas
gen numcom = length(var1var2var3var4var5) ///
- length(subinstr(var1var2var3var4var5, ",", "", .))
// save all data
tempfile orig
save "`orig'"
// keep observations that are fine
drop if numcom != 4
// save fine data
tempfile origfine
save "`origfine'"
*-----
// load all data
use "`orig'", clear
// keep offending observations
drop if numcom == 4
// for the -reshape-
gen i = int((_n-1)/2) +1
bysort i : gen j = _n
// check that pairs add up to 4 commas
by i : egen check = total(numcom)
assert check == 4
// no longer necessary
drop numcom check
// reshape wide
reshape wide var1var2var3var4var5, i(i) j(j)
// gen definitive variable
gen var1var2var3var4var5 = var1var2var3var4var51 + var1var2var3var4var52
keep var1var2var3var4var5
// append new observations with original good ones
append using "`origfine'"
// split
split var1var2var3var4var5, parse(,) gen(var)
// we're "done"
drop var1var2var3var4var5 numcom
list
But we don't really have the details of your data, so this may or may not work. It's just meant to be a rough draft. Depending on the memory space occupied by your data, and other details, you may need to improve parts of the code so it be made more efficient.
Note: the file test.csv
looks like
var1,var2,var3,var4,var5
text 1, text 2,text 3 ,text 4,text 5
text 6,text 7,text 8,text9,text10
text 11,text 1
2,text 13,text14,text15
text16,text17,text18,text19,text20
Note 2: I'm using insheet
because I don't have Stata 13 at the moment. import delimited
is the way to go if available.
Note 3: details on how the counting of commas works can be reviewed at Stata tip 98: Counting substrings within strings, by Nick Cox.