How to import csv data where some observations are on two rows


I have a dataset with a couple million rows. It is in csv format. I wish to import it into Stata. I can do this, but there is a problem - a small percentage (but still many)


5 answers

耶瑟儿～

2021-01-16 02:36

This is somewhat speculative without seeing the exact data, but following Roberto Ferrer's comment, you might find the Stata command filefilter useful for cleaning the CSV file before importing it. It replaces one string pattern with another, and the patterns can include ordinary characters as well as escape sequences such as \n and \r.
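
As a rough sketch of that idea, suppose (purely as an assumption, since the real file has not been shown) that genuine record ends are Windows-style `\r\n` while the stray breaks inside an observation are bare `\n`. Three passes of `filefilter` could then protect the real line ends with a placeholder, delete the stray breaks, and restore the line ends. The file names `raw.csv` and `clean.csv` and the placeholder `@@EOL@@` are hypothetical, and the placeholder must not occur anywhere in the data.

```stata
* Sketch only: assumes true line ends are \r\n and stray breaks are bare \n.
* File names and the @@EOL@@ placeholder are made up for illustration.
tempfile step1 step2

* 1. protect the genuine (Windows) line ends
filefilter "raw.csv" "`step1'", from("\r\n") to("@@EOL@@")

* 2. remove the stray bare newlines that split an observation
filefilter "`step1'" "`step2'", from("\n") to("")

* 3. restore the genuine line ends
filefilter "`step2'" "clean.csv", from("@@EOL@@") to("\r\n") replace

* import the cleaned file (import delimited needs Stata 13 or later)
import delimited using "clean.csv", varnames(1) clear
```

If the stray breaks cannot be told apart from real line ends by their characters alone, this approach will not work and the lines have to be repaired after import instead.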

故里飘歌

2021-01-16 02:39

If the observations in question are quoted in the CSV file, then you can use the bindquote(strict) option of import delimited.
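
A minimal sketch of that option, assuming Stata 13 or later (where `import delimited` is available), a hypothetical file name `test.csv`, and variable names in the first row. It only helps if the fields that contain line breaks are enclosed in double quotes:

```stata
* Sketch: the file name is hypothetical; assumes broken fields are double-quoted.
* With bindquote(strict), once an opening double quote is found Stata keeps
* reading, across line breaks if necessary, until the matching closing quote.
import delimited using "test.csv", varnames(1) bindquote(strict) clear

* quick check: the observation count should equal the number of logical records
count
```

With the default, `bindquote(loose)`, a pair of quotes must open and close on the same line, so a quoted field that spans two lines is still split.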

抹茶落季

2021-01-16 02:42
I would try the following strategy.
1. Import as a single string variable.
2. Count the commas on each line and, where a line is incomplete, combine it with the line that follows.
3. Delete redundant material.
The comma count will be
```
length(variable) - length(subinstr(variable, ",", "", .)) 
```
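
The three steps can be sketched as follows. This is an illustration under stated assumptions, not a tested solution for the real file: the raw lines are already held in a single string variable (here called `line`, a made-up name, and typed in with `input` so the example is self-contained), a complete line has exactly 4 commas, no field contains a comma, and every broken observation occupies exactly two consecutive lines.

```stata
clear
* toy data standing in for "import as a single string variable"
input str40 line
"text 6,text 7,text 8,text9,text10"
"text 11,text 1"
"2,text 13,text14,text15"
"text16,text17,text18,text19,text20"
end

* worked example of the comma count: 9 characters minus 5 non-commas
display length("a,b,c,d,e") - length(subinstr("a,b,c,d,e", ",", "", .))

* step 2: count commas and flag incomplete lines (fewer than 4 commas)
gen ncommas = length(line) - length(subinstr(line, ",", "", .))
gen byte broken = ncommas < 4

* position within each run of broken lines: 1 = first half, 2 = second half
* (replace works through the observations in order, so pos[_n-1] is already updated)
gen pos = broken
replace pos = pos[_n-1] + 1 if broken & _n > 1

* glue each first half to the line that follows it
replace line = line + line[_n+1] if broken & mod(pos, 2) == 1

* step 3: delete the now redundant second halves and helper variables
drop if broken & mod(pos, 2) == 0
drop ncommas broken pos

* split the repaired lines into separate variables
split line, parse(,) gen(var)
list
```

Because nothing is sorted or appended, the observations stay in their original file order.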

难免孤独

2021-01-16 02:52

I can't offer any code at the moment, but I suggest you take a good look at help import. The help files for the infile and infix commands state:

An observation can be on more than one line.

(I don't know if this means that all observations should be on several lines, or if it can handle cases where only some observations are on more than one line.)

Also check the manuals if the examples and notes in the help files turn out to be insufficient.


北荒

2021-01-16 02:57

A convoluted way is (comments inline):

```
clear
set more off

*----- example data -----

// change delimiter, if necessary
// (with ; as delimiter, each whole line is read into one string variable)
insheet using "~/Desktop/stata_tests/test.csv", names delim(;)

list

*----- what you want -----

// compute number of commas
gen numcom = length(var1var2var3var4var5) ///
    - length(subinstr(var1var2var3var4var5, ",", "", .))

// save all data
tempfile orig
save "`orig'"

// keep observations that are fine
drop if numcom != 4

// save fine data
tempfile origfine
save "`origfine'"

*-----

// load all data
use "`orig'", clear

// keep offending observations
drop if numcom == 4

// for the -reshape-: i identifies the pair, j the position within it
// (j is computed from _n because -sort- does not keep the order of ties)
gen i = int((_n-1)/2) + 1
gen j = mod(_n-1, 2) + 1
sort i j

// check that pairs add up to 4 commas
by i : egen check = total(numcom)
assert check == 4

// no longer necessary
drop numcom check

// reshape wide
reshape wide var1var2var3var4var5, i(i) j(j)

// gen definitive variable
gen var1var2var3var4var5 = var1var2var3var4var51 + var1var2var3var4var52
keep var1var2var3var4var5

// append new observations with original good ones
append using "`origfine'"

// split
split var1var2var3var4var5, parse(,) gen(var)

// we're "done"
drop var1var2var3var4var5 numcom
list
```

But we don't really have the details of your data, so this may or may not work. It's just meant as a rough draft. Depending on how much memory your data occupy, and on other details, you may need to rework parts of the code to make it more efficient.

Note: the file test.csv looks like

```
var1,var2,var3,var4,var5 
text 1,    text 2,text 3   ,text 4,text 5
text 6,text 7,text 8,text9,text10
text 11,text 1     
         2,text 13,text14,text15
text16,text17,text18,text19,text20
```

Note 2: I'm using insheet because I don't have Stata 13 at the moment. import delimited is the way to go if available.

Note 3: details on how the counting of commas works can be reviewed at Stata tip 98: Counting substrings within strings, by Nick Cox.
