Stata to Pandas: even if there are repeated Value Labels?

问题

i try to open a .dta as DataFrame. But an Error appears: "ValueError: Value labels for column ... are not unique. The repeated labels are:" followed by labels wich apper twice in a column.

I know labeling multiplie codes with the exact same value label in stata is not clever (not my fault :)) After some research i know, pandas will not accept repeated value labels (this IS clever).

But i can't figure out a (good) solution: Is there:

a. a smooth way to open the data with pandas and just rename the doubles (like "label" to "label(2)") in this process?

here is what the data looks like (value labels in brackets):

  | multilabel    
1 | 11 (oneone or twotwo)
2 | 22 (oneone or twotwo)
3 | 33 (other-label-which-is-unique)

my code so far:

import pandas as pd

#followed by any option that delivers this solution:
dataframe = pd.read_stata('file.dta')

b. a fast an easy way to tell stata: just rename all repeated value labels by "label(2)" instead of "label"? and yes, the code so far is also rather boring:

use "file.dta"

*followed by a loop wich finds repeated labels and changes them

save "file.dta", replace

And yes, there are to many repeated value labels to go trough it one by one.

And here the Stata-Commands to produce a minimal example:

set obs 1
generate var1 = 1 in 1
set obs 2
replace var1 = 2 in 2
set obs 3
replace var1 = 3 in 3
generate var2 = 11 in 1
replace var2 = 22 in 2
replace var2 = 33 in 3
rename var2 multilabel
label define labelrepeat 11 "oneone or twotwo" 22 "oneone or twotwo"
label values multilabel labelrepeat

I am happy for each suggestion!

回答1:

If you have a variable with repeated labels, then

decode multilabel, gen(valuelabel)
label values multilabel

puts the value labels in a string variable and then undoes the association of multilabel values and the previously attached value labels. I don't know what else you need to do and thus why you do anything else. You now have the same information as before. I don't know whether pandas will ignore the definition of value labels.

For completeness, here's a way to find out which variables have value labels that aren't in one-to-one correspondence with numeric values.

* your sandbox, simplified and extended  
clear 
set obs 3
generate var1 = _n 
generate multilabel = 11 * _n
label define labelrepeat 11 "oneone or twotwo" 22 "oneone or twotwo"
label values multilabel labelrepeat

label define var1 1 "frog" 2 "toad" 3 "newt"
label val var1 var1 


* my code 
local bad 
ds *, has(vallabel) 

quietly foreach v in `r(varlist)' { 
    tempvar decoded diff 
    decode `v', gen(`decoded') 
    bysort `decoded' (`v') : gen `diff' = `v'[1] != `v'[_N] & !missing(`decoded') 
    count if `diff' 
    if r(N) > 0 local bad `bad' `v' 
    drop `decoded' `diff' 
} 

di "`bad'"

回答2:

My final Solution (in Stata):

clear

use "file.dta"

*Find out which duplicated value labels there are
labelbook, length(12)

return list, all

*r(nuniq) contains the not-unique-values

*on all variables in r(nuniq) use the numlabels command

numlabel `r(nuniq)', add

*Look at the not unique value labels again:
labelbook, length(12)

return list, all

save "file2.dta", replace

Thank you Nick!

来源：https://stackoverflow.com/questions/46037548/stata-to-pandas-even-if-there-are-repeated-value-labels

标签

python

pandas

stata

labels