问题
i try to open a .dta as DataFrame. But an Error appears: "ValueError: Value labels for column ... are not unique. The repeated labels are:" followed by labels wich apper twice in a column.
I know labeling multiplie codes with the exact same value label in stata is not clever (not my fault :)) After some research i know, pandas will not accept repeated value labels (this IS clever).
But i can't figure out a (good) solution: Is there:
a. a smooth way to open the data with pandas and just rename the doubles (like "label" to "label(2)") in this process?
here is what the data looks like (value labels in brackets):
| multilabel
1 | 11 (oneone or twotwo)
2 | 22 (oneone or twotwo)
3 | 33 (other-label-which-is-unique)
my code so far:
import pandas as pd
#followed by any option that delivers this solution:
dataframe = pd.read_stata('file.dta')
or
b. a fast an easy way to tell stata: just rename all repeated value labels by "label(2)" instead of "label"? and yes, the code so far is also rather boring:
use "file.dta"
*followed by a loop wich finds repeated labels and changes them
save "file.dta", replace
And yes, there are to many repeated value labels to go trough it one by one.
And here the Stata-Commands to produce a minimal example:
set obs 1
generate var1 = 1 in 1
set obs 2
replace var1 = 2 in 2
set obs 3
replace var1 = 3 in 3
generate var2 = 11 in 1
replace var2 = 22 in 2
replace var2 = 33 in 3
rename var2 multilabel
label define labelrepeat 11 "oneone or twotwo" 22 "oneone or twotwo"
label values multilabel labelrepeat
I am happy for each suggestion!
回答1:
If you have a variable with repeated labels, then
decode multilabel, gen(valuelabel)
label values multilabel
puts the value labels in a string variable and then undoes the association of multilabel
values and the previously attached value labels. I don't know what else you need to do and thus why you do anything else. You now have the same information as before. I don't know whether pandas will ignore the definition of value labels.
For completeness, here's a way to find out which variables have value labels that aren't in one-to-one correspondence with numeric values.
* your sandbox, simplified and extended
clear
set obs 3
generate var1 = _n
generate multilabel = 11 * _n
label define labelrepeat 11 "oneone or twotwo" 22 "oneone or twotwo"
label values multilabel labelrepeat
label define var1 1 "frog" 2 "toad" 3 "newt"
label val var1 var1
* my code
local bad
ds *, has(vallabel)
quietly foreach v in `r(varlist)' {
tempvar decoded diff
decode `v', gen(`decoded')
bysort `decoded' (`v') : gen `diff' = `v'[1] != `v'[_N] & !missing(`decoded')
count if `diff'
if r(N) > 0 local bad `bad' `v'
drop `decoded' `diff'
}
di "`bad'"
回答2:
My final Solution (in Stata):
clear
use "file.dta"
*Find out which duplicated value labels there are
labelbook, length(12)
return list, all
*r(nuniq) contains the not-unique-values
*on all variables in r(nuniq) use the numlabels command
numlabel `r(nuniq)', add
*Look at the not unique value labels again:
labelbook, length(12)
return list, all
save "file2.dta", replace
Thank you Nick!
来源:https://stackoverflow.com/questions/46037548/stata-to-pandas-even-if-there-are-repeated-value-labels