-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
p values does not seem accurate #86
Comments
I'm going to work on the following code example. import numpy as np
import matplotlib.pyplot as plt
import pwlf
# generate sin wave data
x = np.linspace(0, 10, num=100)
y = np.sin(x * np.pi / 2)
# add noise to the data
y = np.random.normal(0, 0.05, 100) + y
# initialize piecewise linear fit with your x and y data
my_pwlf = pwlf.PiecewiseLinFit(x, y)
res = my_pwlf.fitfast(10)
# predict for the determined points
xHat = np.linspace(min(x), max(x), num=10000)
yHat = my_pwlf.predict(xHat)
# plot the results
plt.figure()
plt.plot(x, y, 'o')
plt.plot(xHat, yHat, '-')
plt.show() How do beta parameters relate to slopes?This is a perfect place to post questions like this. I don't mind using issues like a discussion board! The p-values you are looking at are for the So here you can see the translation between slopes and beta parameters. slopes = my_pwlf.calc_slopes()
for i in range(my_pwlf.n_segments):
slope = slopes[i]
beta_slope = np.sum(my_pwlf.beta[1:i+2])
print(f'Segment: {i}\tSlope: {slope:.3f}\tBeta Sum: {beta_slope:.3f}') Which will print something like the following:
To calculate the p-values of the slopes?In order to calculate the p-values of the slopes, you are going to need to use the delta method. This is an approximate method which is needed to translate a small change in slope to changes in beta parameters. This is not currently implemented in pwlf. The procedure will be similar to what is done in the I think the following code is one way you can consider p-values. from scipy import linalg, stats
n = my_pwlf.n_data
#nb = my_pwlf.beta.size + my_pwlf # This is the number of model parameters that change
nb = my_pwlf.beta.size + my_pwlf.fit_breaks.size - 2
# you should probally count the number of breakpoints, and the intercept...
k = nb - 1
step_size = 1.0e-4
A = np.zeros((n, my_pwlf.n_segments))
slopes = my_pwlf.calc_slopes()
orig_beta = my_pwlf.beta.copy()
f0 = my_pwlf.predict(my_pwlf.x_data)
for i in range(my_pwlf.n_segments):
betas = my_pwlf.beta[1:i+2]
n_betas = len(betas)
# translate the step size in slope into beta change
betas += (step_size/n_betas)
my_pwlf.beta[1:i+2] = betas
print(f'Slope {slopes[i]} Slope+FD {np.sum(betas)}')
f = my_pwlf.predict(my_pwlf.x_data)
A[:, i] = (f - f0) / step_size
# restore orig betas
my_pwlf.beta = orig_beta
e = f0 - my_pwlf.y_data
variance = np.dot(e, e) / (n - nb)
A2inv = np.abs(linalg.inv(np.dot(A.T, A)).diagonal())
se = np.sqrt(variance * A2inv)
# calculate my t-value
t = slopes / se
p = 2.0 * stats.t.sf(np.abs(t), df=n - k - 1)
print(p) You will not be happy with your p-values for slopes!So you can play around with your degrees of freedom as to what is an actual model parameter and what isn't, but I suspect that you will not be happy with your p-values. A small change to a slope will result in large errors at some point in your model due to the continuity constraint. That large error is going to tank your p-value. I don't know what's best to do here. We can maybe try only compute errors in one region at a time, but this is really pushing on whether your model is fundamentally continuous or discontinuous. What you should do insteadYou should really look at doing lack-of-fit tests in the breakpoints regions. This will not give you a p-value for each slope, but rather give you a p-value for each slope region by computing a F-statistic for each region. The F-statistic compares the variance in your data to your model variance. Some resources of lack-of-fit
There is substantial lack-of-fit in the figures you posted. I hope this helps a bit. What I would maybe first do is attempt to compute a lack-of-fit test for the entire model, then start to look at one region at a time. However, this has a very different meaning than p-values for a particular parameter. |
Thanks for the reply. I have a view questions related to your code.
Before your reply I gave it a try myself. I approached it by trying to calculate the SSE for each segment, using the slope and the breakpoints of the segments. From these values I tried to calculate the p-value. I know it is a very poorly fit. I tried many different parameters and different settings to find the best fitting. Unfortunately due to time constraints, this is what I got and I have to make the best of it. In the end I wish to say something about the average speed (slope) of the whole dataset. Since the segments where the slope is zero is not relevant for this analysis. I wish to only use the segments where I can reject H0 (a=0.05). Hopefully in this way I can filter out the regressions where the slope is zero. So they will not influence my conclusions to much. Thanks for the advice, I will definitely have a look at it and might use it. |
I could be wrong about this, so feel free to question my logic here.
I think because the model is continuous, each slope is dependent on all other slopes. Therefore, it seems appropriate to use the full dateset rather than the discrete form.
Follow the derivation in [1] https://mae.ufl.edu/nkim/Papers/paper54.pdf , specifically II. A. That line of code comes up in equation 9.
Yes (at least I think so). The difference between one-sided and two-sided should just be this factor of 2.
This is how the |
Dear cjekel,
In my data-analysis I tried to fit linear regressions (±160) to a dataset. I determined the breakpoints of the data with an alternative package (ruptures). Subsequently, I used the pwlf package and the method: .fit_with_break(my_breaks) to determine the slope of the lines. I now wish to determine the p-values of those slopes. For this I tried the method: .p_values(method='linear', step_size=0.0001). I deleted the first value since this corresponds to the Y1 value which I'm not interested in. I am only interested in the p-value of the slope.
The results I know get seems not accurate. Some lines that seem to have a very obvious correlation have a very high p value and some lines that seem very arbitrary have a low p-value. I added a few examples that show the p values and the contributing segment. Does anyone know what is wrong or do I maybe approach this problem in a wrong way?
In the examples the yellow hover screen show the p-value of the segment on the right and the other attributes of the line.
I figured out that when I use my_pwlf.beta I get different different values than from my.pwlf.calc_slopes()
The text was updated successfully, but these errors were encountered: