Python - Gini coefficient calculation using Numpy

Question

First of all, I'm a newbie who just started learning Python, and I'm trying to write some code to calculate the Gini index for a fake country. I've come up with the following:

GDP = (653200000000)
A = (0.49 * GDP) / 100 # Poorest 10%
B = (0.59 * GDP) / 100
C = (0.69 * GDP) / 100
D = (0.79 * GDP) / 100
E = (1.89 * GDP) / 100
F = (2.55 * GDP) / 100
G = (5.0 * GDP) / 100
H = (10.0 * GDP) / 100
I = (18.0 * GDP) / 100
J = (60.0 * GDP) / 100 # Richest 10%

# Divide into quintiles and total income within each quintile
Q1 = float(A + B) # lowest quintile
Q2 = float(C + D) # second quintile
Q3 = float(E + F) # third quintile
Q4 = float(G + H) # fourth quintile
Q5 = float(I + J) # fifth quintile

# Calculate the percent of total income in each quintile
T1 = float((100 * Q1) / GDP) / 100
T2 = float((100 * Q2) / GDP) / 100
T3 = float((100 * Q3) / GDP) / 100
T4 = float((100 * Q4) / GDP) / 100
T5 = float((100 * Q5) / GDP) / 100

TR = float(T1 + T2 + T3 + T4 + T5)

# Calculate the cumulative percentage of household income
H1 = float(T1)
H2 = float(T1+T2)
H3 = float(T1+T2+T3)
H4 = float(T1+T2+T3+T4)
H5 = float(T1+T2+T3+T4+T5)

# Magic! Using numpy to calculate area under Lorenz curve.
# Problem might be here?
import numpy as np 
from numpy import trapz

# The y values. Cumulative percentage of incomes
y = np.array([Q1,Q2,Q3,Q4,Q5])

# Compute the area using the composite trapezoidal rule.
area_lorenz = trapz(y, dx=5)

# Calculate the area below the perfect equality line.
area_perfect = (Q5 * H5) / 2

# Seems to work fine until here. 
# Manually calculated Gini using the values given for the areas above 
# turns out at .58 which seems reasonable?

Gini = area_perfect - area_lorenz

# Prints utter nonsense.
print(Gini)

The result of Gini = area_perfect - area_lorenz just makes no sense. I took the values given by the area variables and did the math by hand, and it came out fairly OK, but when I try to get the program to do it, it gives me a completely nonsensical value (-1.7198...). What am I missing? Can someone point me in the right direction?

Thanks!

Answer 1:

Stardust.

Your problem isn't with numpy.trapz; it is with 1) your definition of the perfect equality distribution, and 2) normalization of the Gini coefficient.

First, you defined the area under the perfect equality distribution as Q5*H5/2, which is half the product of the fifth quintile's income and the cumulative percentage (1.0). I'm not sure what this number is meant to represent.
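To see where the -1.7198... figure comes from, the sketch below reruns the two area computations from the question. It assumes the quintile shares implied there (1.08%, 1.48%, 4.44%, 15% and 78% of GDP). The `trapezoid` alias is not in the original code; it is introduced here only so the snippet runs on both older and newer NumPy releases. Both areas come out in currency units on mismatched scales, so their difference cannot be a coefficient between 0 and 1.

```python
import numpy as np

# Assumption: alias that picks np.trapezoid when available, else np.trapz.
trapezoid = getattr(np, "trapezoid", None) or np.trapz

GDP = 653200000000
# Quintile shares of GDP taken from the question (A+B, C+D, E+F, G+H, I+J).
shares = np.array([1.08, 1.48, 4.44, 15.0, 78.0]) / 100
Q = shares * GDP  # raw income per quintile, in currency units

# The question integrates raw incomes (not cumulative shares) with dx=5.
area_lorenz = trapezoid(Q, dx=5)

# The question's "perfect equality" area: Q5 * H5 / 2, where H5 is 1.0.
area_perfect = Q[-1] * 1.0 / 2

gini_as_asked = area_perfect - area_lorenz
print(gini_as_asked)        # a large negative number in currency units
print(gini_as_asked / GDP)  # the same difference in multiples of GDP
```

The difference works out to roughly -2.633 times GDP, which matches the magnitude reported in the question.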

Second, you have to normalize by the area under the perfect equality distribution; i.e.:

gini = (area under perfect equality - area under lorenz)/(area under perfect equality)

You don't have to worry about this if you define the perfect equality curve to have an area of 1, but it's a good safeguard in case there's an error in your definition of the perfect equality curve.
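As a quick check of the normalization, the sketch below plugs in two areas worked out by hand from the cumulative shares used in this answer (trapezoidal rule over five points with dx=5). The helper name `normalized_gini` is hypothetical and the numbers are an illustration, not output copied from the original post.

```python
def normalized_gini(area_perfect, area_lorenz):
    """Gap between the two areas, as a fraction of the perfect equality area."""
    return (area_perfect - area_lorenz) / area_perfect

# Assumed values: areas obtained by hand with the trapezoidal rule
# over 5 points and dx=5, using the question's income shares.
area_perfect = 10.0
area_lorenz = 4.105

gini = normalized_gini(area_perfect, area_lorenz)
print(round(gini, 4))

# Rescaling both areas by the same factor leaves the ratio unchanged,
# so the particular choice of dx drops out after normalization.
rescaled = normalized_gini(area_perfect / 5, area_lorenz / 5)
print(round(rescaled, 4))
```

Because the ratio is scale-free, the arbitrary dx=5 spacing does not affect the final coefficient as long as both areas use the same spacing.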

To address both of these issues, I defined the perfect equality curve with numpy.linspace. The first advantage is that it is built from the real distribution's own length (len(y)), so whether you use quartiles, quintiles, or deciles, the perfect equality CDF (y_pe, below) has the same number of points as the Lorenz curve. The second advantage is that its area is also computed with numpy.trapz, a bit of parallelism that makes the code easier to read and guards against erroneous calculations.

import numpy as np
from matplotlib import pyplot as plt

GDP = (653200000000)
A = (0.49 * GDP) / 100 # Poorest 10%
B = (0.59 * GDP) / 100
C = (0.69 * GDP) / 100
D = (0.79 * GDP) / 100
E = (1.89 * GDP) / 100
F = (2.55 * GDP) / 100
G = (5.0 * GDP) / 100
H = (10.0 * GDP) / 100
I = (18.0 * GDP) / 100
J = (60.0 * GDP) / 100 # Richest 10%

# Divide into quintiles and total income within each quintile
Q1 = float(A + B) # lowest quintile
Q2 = float(C + D) # second quintile
Q3 = float(E + F) # third quintile
Q4 = float(G + H) # fourth quintile
Q5 = float(I + J) # fifth quintile

# Calculate the percent of total income in each quintile
T1 = float((100 * Q1) / GDP) / 100
T2 = float((100 * Q2) / GDP) / 100
T3 = float((100 * Q3) / GDP) / 100
T4 = float((100 * Q4) / GDP) / 100
T5 = float((100 * Q5) / GDP) / 100

TR = float(T1 + T2 + T3 + T4 + T5)

# Calculate the cumulative percentage of household income
H1 = float(T1)
H2 = float(T1+T2)
H3 = float(T1+T2+T3)
H4 = float(T1+T2+T3+T4)
H5 = float(T1+T2+T3+T4+T5)

# The y values. Cumulative percentage of incomes
y = np.array([H1,H2,H3,H4,H5])

# The perfect equality y values. Cumulative percentage of incomes.
y_pe = np.linspace(0.0,1.0,len(y))

# Compute the area using the composite trapezoidal rule.
area_lorenz = np.trapz(y, dx=5)

# Calculate the area below the perfect equality line.
area_perfect = np.trapz(y_pe, dx=5)

# Normalize the gap between the two areas by the area under the
# perfect equality line.
Gini = (area_perfect - area_lorenz)/area_perfect

print(Gini)

plt.plot(y,label='lorenz')
plt.plot(y_pe,label='perfect_equality')
plt.legend()
plt.show()
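The same idea can be wrapped in a small function that accepts any number of income groups. This is a sketch rather than the answer's code: the function name `gini_from_shares` is hypothetical, it uses `np.cumsum` in place of the hand-written H1..H5 sums, and the `trapezoid` alias is an assumption added for compatibility across NumPy versions. It also makes one deliberate change by starting the Lorenz curve at the origin (0, 0), so its result differs from the value printed by the code above.

```python
import numpy as np

# Assumption: alias that picks np.trapezoid when available, else np.trapz.
trapezoid = getattr(np, "trapezoid", None) or np.trapz

def gini_from_shares(incomes):
    """Gini estimate from group incomes sorted from poorest to richest."""
    incomes = np.asarray(incomes, dtype=float)
    # Cumulative share of total income, with the origin prepended.
    lorenz = np.concatenate(([0.0], np.cumsum(incomes) / incomes.sum()))
    # Perfect equality: a straight line sampled at the same points.
    equality = np.linspace(0.0, 1.0, len(lorenz))
    dx = 1.0 / (len(lorenz) - 1)
    area_lorenz = trapezoid(lorenz, dx=dx)
    area_perfect = trapezoid(equality, dx=dx)
    return (area_perfect - area_lorenz) / area_perfect

# Percent of GDP per group, taken from the question.
quintiles = [1.08, 1.48, 4.44, 15.0, 78.0]
deciles = [0.49, 0.59, 0.69, 0.79, 1.89, 2.55, 5.0, 10.0, 18.0, 60.0]

gini_quintiles = gini_from_shares(quintiles)
gini_deciles = gini_from_shares(deciles)
print(gini_quintiles, gini_deciles)
```

With the question's data this gives about 0.67 for quintiles and about 0.72 for deciles; the decile figure is higher because the finer grouping exposes more of the concentration at the top.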

Source: https://stackoverflow.com/questions/31416664/python-gini-coefficient-calculation-using-numpy

Tags

python

numpy

economics