Fitting a straight line to a log-log curve in matplotlib

Posted by ε祈祈猫儿з on 2019-12-05 12:59:29

Your linear fit is not performed on the same data as shown in the loglog-plot.

Make a and b numpy arrays like this

a = numpy.asarray(a, dtype=float)
b = numpy.asarray(b, dtype=float)

Now you can perform array operations on them. What the loglog plot does is take the base-10 logarithm of both a and b. You can do the same with

logA = numpy.log10(a)
logB = numpy.log10(b)

This is what the loglog plot visualizes. Check this by plotting logA against logB as a regular plot. Then repeat the linear fit on the log data and draw the fitted line in the same plot as the logA, logB data. Note that the polynomial must be evaluated at logB, not at b, because the fit was done in log space.

coefficients = numpy.polyfit(logB, logA, 1)
polynomial = numpy.poly1d(coefficients)
ys = polynomial(logB)
plt.plot(logB, logA)
plt.plot(logB, ys)

To better understand this problem, let's first talk about plain ol' linear regression (the polyfit function, in this case, is your linear regression algorithm).


Suppose you have a set of data points (x, y).

You want to create a model that predicts y as a function of x, so you use linear regression. That uses the model:

y = mx + b

and computes the values of m and b that best predict your data, using some linear algebra.

Next, you use your model to predict values of y as a function of x. You do this by picking a set of values for x (think linspace) and computing the corresponding values of y. Plotting these (x,y) pairs gives you your regression line.
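As a minimal sketch of the steps just described, assuming some made-up sample points that are not from the question, the fit and the prediction could look like this:

```python
import numpy as np

# Hypothetical sample data lying roughly on y = 2x + 1 (made up for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# A degree-1 polyfit returns the coefficients highest power first:
# the slope m, then the intercept b.
m, b = np.polyfit(x, y, 1)

# Pick new x values and predict y at each of them to get the regression line.
x_line = np.linspace(x.min(), x.max(), 50)
y_line = m * x_line + b
```

Plotting x_line against y_line on top of the original points would give the regression line.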


Now, let's talk about logarithmic regression. In this case, we still have two variables, y versus x, and we are still interested in their relationship, i.e., being able to predict y given x. The only difference is, now y and x happen to be logarithms of two other variables, which I'll call log(F) and log(R). Thus far, this is nothing more than a simple change of name.

The linear regression also works the same way. You're still regressing y versus x. The linear regression algorithm doesn't care that y and x are actually log(F) and log(R) - it makes no difference to the algorithm.

The last step is a little bit different - and this is where you're getting tripped up in your plot above. What you're doing is computing

F = m R + b

but this is incorrect, because the relationship between F and R is not linear. (That's why you're using a log-log plot.)

Instead, you should compute

log(F) = m log(R) + b

If you transform this (raise 10 to the power of both sides and rearrange), you get

F = c R^m

where c = 10^b. This is the relationship between F and R: it is a power law relationship. (Power law relationships are what log-log plots are best at.)
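To see the transformation at work, here is a small sketch that assumes an exact, made-up power law (the constants 3 and -1.5 are illustrative only). It fits a line to the logarithms and then converts the intercept back into c:

```python
import numpy as np

# Hypothetical exact power law F = 3 * R**-1.5 (values chosen for illustration).
R = np.array([1.0, 2.0, 5.0, 10.0, 50.0, 100.0])
F = 3.0 * R ** -1.5

# Fit a straight line to the logarithms: log10(F) = m * log10(R) + b.
m, b = np.polyfit(np.log10(R), np.log10(F), 1)

# Undo the logarithm to get back to the power law F = c * R**m.
c = 10.0 ** b
F_fit = c * R ** m
```

Because the data here follow the power law exactly, m and c should come out very close to -1.5 and 3. With real, noisy data they would only be estimates.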

In your code, you're using A and B when calling polyfit, but you should be using log(A) and log(B).

The other answers offer great explanations and a solution. However, I would like to propose a solution that helped me a lot and may help you as well.

Another simple way of writing a line fit for log-log scale is the function powerfit in the code below. It takes the original x and y data, fits a power law to them, and evaluates the fit at a set of new x-points, which gives a straight line on log-log scale. In the current case xnew is the same as x (both are b).

The advantage of defining new x-coordinates is that you can get as few or as many points of the powerfitted line for whatever purpose you might need them.

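import numpy as np
from matplotlib import pyplot as plt


def powerfit(x, y, xnew):
    """line fitting on log-log scale"""
    # Fit log(y) = k * log(x) + m, which is the power law y = exp(m) * x**k
    k, m = np.polyfit(np.log(x), np.log(y), 1)
    # Convert xnew to an array so that plain lists work as well
    return np.exp(m) * np.asarray(xnew, dtype=float)**k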


a = []
b = []

# Read two whitespace-separated numbers per line; the file is closed automatically
with open("word-rank.txt", "r") as fp:
    for line in fp:
        fields = line.strip().split()
        a.append(float(fields[0]))
        b.append(float(fields[1]))

ys = powerfit(b, a, b)

plt.loglog(b,a,'ro')
plt.plot(b,ys)
plt.xlabel("Log (Rank of frequency)")
plt.ylabel("Log (Frequency)")
plt.title("Frequency vs frequency rank for words")
plt.show()
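As a rough illustration of choosing the new x-coordinates freely, the sketch below evaluates the fitted line at only five points. The rank and frequency values are made up, and powerfit is repeated here (with an array conversion on xnew) just so the snippet runs on its own:

```python
import numpy as np


def powerfit(x, y, xnew):
    """Same idea as powerfit above: fit on log-log scale, evaluate at xnew."""
    k, m = np.polyfit(np.log(x), np.log(y), 1)
    return np.exp(m) * np.asarray(xnew, dtype=float) ** k


# Hypothetical rank/frequency data following frequency = 1000 / rank.
rank = np.arange(1, 21, dtype=float)
freq = 1000.0 / rank

# Five points evenly spaced on a log axis, from 10**0 to 10**2.
# The upper end lies beyond the data, so the line is extrapolated there.
xnew = np.logspace(0, 2, 5)
ynew = powerfit(rank, freq, xnew)
```

Passing xnew and ynew to plt.plot on the same loglog axes would draw the same straight line using far fewer points.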