Adding IVPIN #7
base: master
Conversation
add ivpin output class
add lubridate to imports
Alexandre, a very good job tackling the ambitious project of implementing ivpin and adapting it to the coding framework of PINstimation. Besides, I have tried the code and it seems to work with no problems, way to go! To improve your code further so it reaches an even better standing, I have submitted multiple comments and questions so that we make sure the code is efficient, clean, well-commented, and fits seamlessly into the PINstimation coding framework. Please comment back or reach out to me if something is unclear.
R/model_ivpin.R
Outdated
start_index <- samplength + 1
This part is well-commented but could be clarified a bit more so the reader knows why you calculate these variables. It helps to have consistent variable names: total_arrival_rate and informed_arrival_rate.
# Easley et al. (2012b) derive the VPIN estimator based on two moment conditions from Poisson processes:
# E[|VB - VS|] ≈ alpha*mu and E[|VB + VS|] = 2*epsilon + alpha*mu.
# Ke et al. (2017) suggest that these conditions should be expressed as:
# E[|VB - VS||t; theta] ≈ alpha*mu*t and E[|VB + VS||t; theta] = (2*epsilon + alpha*mu)*t.
# These equations correspond to Equations (3) on page 364 of Ke et al. (2017).
#
# In our implementation:
# - The variable `total_arrival_rate` is calculated as |VB + VS|/t. Its average (expected value) would, therefore, represent 2*epsilon + alpha*mu.
# - The variable `informed_arrival_rate` is calculated as |VB - VS|/t. Its average (expected value) would, therefore, approximate alpha*mu.
#
# This approximation allows us to estimate mu by dividing `informed_arrival_rate` by an estimated alpha.
# Once mu is estimated, we can determine epsilon using the equation for `total_arrival_rate`.
total_arrival_rate <- (bucketdata$agg.bvol + bucketdata$agg.svol) / bucketdata$duration
informed_arrival_rate <- abs(bucketdata$agg.bvol - bucketdata$agg.svol) / bucketdata$duration
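For illustration, a minimal sketch of how these two rates could feed the initial guesses used later on; alpha_init here is a hypothetical grid value, not a value taken from the draft:
# Illustrative only: initial guesses implied by the two moment conditions above.
alpha_init <- 0.3
mu_init <- mean(informed_arrival_rate) / alpha_init                  # from E[|VB - VS|/t] ≈ alpha*mu
eps_init <- (mean(total_arrival_rate) - alpha_init * mu_init) / 2    # from E[(VB + VS)/t] = 2*epsilon + alpha*mu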
R/model_ivpin.R
Outdated
                   sqrt(mu_init),
                   sqrt(eps_init))
tryCatch({
  result <- optim(initial_guess, compute_log_likelihood, t = t[j:i], Vb = Vb[j:i], Vs = Vs[j:i], method = "BFGS")
R/model_ivpin.R
Outdated
mu_init <- mean(informed_arrival[j:i]) / alpha_init
eps_init <- mean(abs(total_arrival_rate[j:i] - informed_arrival[j:i])) / 2

initial_guess <- c(logit(max(min(alpha_init, 0.999), 0.001)),
@alecriste Can you explain the use of the functions logit() and sqrt() here, which are then reversed in the function compute_log_likelihood() using sigmoid() and ^2?
It does feel cumbersome, and maybe you can find an alternative solution, but here is the issue behind it:
The optim function will not work on a small interval from 0 to 1, so we need to map the parameters to the whole real line. The function in compute_log_likelihood works with alpha and delta between 0 and 1, so we need to map them back to that interval. optim then returns alpha and delta on the real line, and we map them to the 0-1 interval using sigmoid.
The same logic applies to the square root and square functions: squaring the results at the end ensures positive values for the trader arrival rates.
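A minimal sketch of the mapping being described, with illustrative numbers; the helper names logit() and sigmoid() mirror those used in the draft:
# Map (0, 1) parameters to the real line before optim(), and back afterwards.
logit <- function(p) log(p / (1 - p))      # (0, 1) -> real line
sigmoid <- function(x) 1 / (1 + exp(-x))   # real line -> (0, 1)

# Before optimization: transform the constrained initial guesses.
unconstrained <- c(logit(0.4), logit(0.5), sqrt(2500), sqrt(1800))

# Inside compute_log_likelihood() and after optimization: invert the transform.
alpha <- sigmoid(unconstrained[1])
delta <- sigmoid(unconstrained[2])
mu <- unconstrained[3]^2                   # squaring guarantees positive arrival rates
eps <- unconstrained[4]^2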
I see. The function optim() performs unconstrained non-linear optimization, so we need a workaround to constrain the variable values. A solution we use in the package involves the nloptr package, which allows for constrained optimization by specifying lower and upper bounds for the variables. Below is an example of using nloptr inspired by the code in the model_pin.R file at line 1571. You can apply your chosen optimization method by using nloptr::lbfgs() instead of nloptr::neldermead(). This should simplify the code considerably.
# Define the lower and upper bounds for the optimization variables
low <- c(0, 0, 0, 0, 0)      # Lower bounds
up <- c(1, 1, Inf, Inf, Inf) # Upper bounds

# Try to perform the constrained optimization
tryCatch({
  # Use the 'nloptr' package to perform constrained optimization
  # 'initialsets[[i]]' contains the initial guess for the optimization
  # 'mlefn' is the objective function to be minimized
  estimates <- suppressWarnings(
    nloptr::neldermead(initialsets[[i]], mlefn, lower = low, upper = up)
  )
}, error = function(e) {
  # Handle any errors that occur during optimization
  message("Error during optimization: ", conditionMessage(e))
})
@alecriste In order to conform to the coding framework of PINstimation, make sure to use the function nloptr::lbfgs() instead of optim(..., "lbfgs") in your code.
R/model_ivpin.R
Outdated
} else {
  for (alpha_init in seq(0.1, 0.9, by = 0.2)) {
    for (delta_init in seq(0.1, 0.9, by = 0.2)) {
      mu_init <- informed_arrival[i] / alpha_init
R/model_ivpin.R
Outdated
}

if (exists("best_params")) {
  best_params[1:2] <- sigmoid(best_params[1:2])
R/model_ivpin.R
Outdated
best_log_likelihood <- Inf
exit_flag <- FALSE

if (i == start_index || i %% 10 == 0) {
The code could benefit from comments so the reader clearly knows what it does. For instance, why do we use the grid only for the first index and for indices that are multiples of 10, and not for the others? If I guess correctly, you want to improve the accuracy of your estimates by using the optimal estimates of previous runs as initial guesses for the current run. If the estimation fails, you go back to the grid method, and you regularly refresh these optimal estimates by running the grid method every tenth run. Is this correct?
This is correct, yes; I should definitely explain that in the comments, sorry you had to decipher all of it. There is also a redundancy in the code: we could have one function do the grid search. We would call that function for the first bucket, then use the optimal parameters from the previous iteration as initial guesses for each subsequent bucket, except when the optimization fails, in which case we would fall back on the grid search again. This is the process used in the paper for their market-data testing in Section 7.
In this code I actually ran the grid search every 10 iterations, but that was to combat the fact that many of the values of the duration vector were 0, which skewed the final ivpin results. I now use the average value of the duration vector over the window, but I still suspect that having many 0s impacts the result. To avoid this, I wrote a check on line 694 which detects whether more than 5% of the duration vector is 0s and tells the user to change the time bar size to a smaller value.
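For reference, a minimal sketch of the kind of check being described; the variable names and the exact wording are hypothetical, while the 5% threshold and the timebarsize argument come from the discussion above:
# Warn the user when too many volume buckets have a zero duration.
zero_share <- mean(bucketdata$duration == 0)
if (zero_share > 0.05) {
  warning("More than 5% of volume buckets have a duration of zero; ",
          "consider using a smaller 'timebarsize'.")
}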
Since this strategy is used in the research paper, it should be our first choice. We need to improve the code to make it reusable and avoid redundancy. Alternatively, because the optimization is usually very fast, we could keep the grid for each volume bucket and simply add the optimal parameters from the previous iteration to the initial sets. We can test whether the code runs in a reasonable time; if the running time is too long, we can use your strategy of calling the grid only when the optimization fails.
Regarding volume buckets with a duration of zero, I understand that this leads to a division by zero and should result in NA values for the parameters mu and epsilon, right? Is the use of the average value of the duration vector over the window considered in the research paper by Ke and Lin (2017)? Or do they assume that all duration values are above zero? I appreciate your suggestion to inform the user about the prevalence of zeros in the duration vector. We should first conduct some research and theoretically agree on the best approach for handling zero-duration buckets.
After reading the paper more closely, this is how Dr. Ke and Dr. Lin deal with large time bars:
"We use the bulk classification algorithm (Easley et al., 2012a) with a 1-min time bar to calculate the volumes of buy and sell, VτB and VτS, respectively. For a time bar, if the cumulative volume exceeds V, we assume that the trades are uniformly distributed and then calculate VτB, VτS, and tτ proportionally."
This means we can compute Vb and Vs for each time bar and, if one exceeds the volume threshold, we simply assign only the proportion of Vb and Vs needed to fill the bucket; t would then be the percentage of the minute that was used.
Before reading this part I tried a different approach, although it isn't what Dr. Ke and Dr. Lin did. What I did instead was to split, for example, a large minute bar into two 30-second bars. If they are still too large, I split them further into 15-second bars, and so on until they are under the bucket threshold. Here is an example and the code:
--------------------------------------------------------------------------
III.2 BREAK DOWN LARGE VOLUME TIMEBARS INTO SMALLER ONES
We break down these large volume timebars into smaller timebars with a
maximum volume equal to threshold.
The code recursively splits large timebars into smaller ones and removes
the original large timebars from the final dataset, until only
timebars with volumes below the threshold remain.
If, for example, the initial minute bar with interval 10:00:00 - 10:01:00
has a total volume (tbv) of 2200, which exceeds the threshold of 1000, and
is composed of the following trades:
trades                 price  volume
2023-06-30 10:00:05    1      500
2023-06-30 10:00:10    1      600
2023-06-30 10:00:20    1      700
2023-06-30 10:00:30    1      400
The first pass splits this initial minute bar into two 30-second intervals:
interval               dp  tbv
2023-06-30 10:00:00    0   1800  (first 3 trades)
2023-06-30 10:00:30    1   400   (fourth trade)
The first 30-second interval with 1800 volume exceeds the threshold,
so it will be split further into two 15-second intervals:
interval               dp  tbv
2023-06-30 10:00:00    0   1100  (first 2 trades)
2023-06-30 10:00:15    1   700   (third trade)
2023-06-30 10:00:30    1   400   (fourth trade)
The first 15-second interval with 1100 volume still exceeds the threshold,
so it will be split further into two intervals:
interval               dp  tbv
2023-06-30 10:00:00    0   500  (first trade)
2023-06-30 10:00:07.5  1   600  (second trade)
2023-06-30 10:00:30    1   400  (fourth trade)
The final timebars with acceptable volumes are:
interval               dp  tbv
2023-06-30 10:00:00    0   500  (first trade)
2023-06-30 10:00:07.5  1   600  (second trade)
2023-06-30 10:00:15    1   700  (third trade)
2023-06-30 10:00:30    1   400  (last trade)

split_timebars <- function(data, dataset, interval_duration, threshold) {
  final_timebars <- data.frame()

  # Function to split the data recursively
  recursive_split <- function(data, interval_duration) {
    large_timebars <- subset(data, tbv > threshold)
    if (nrow(large_timebars) == 0) {
      # If no large timebars, add the data to final timebars
      final_timebars <<- rbind(final_timebars, data)
      return()
    }
    for (i in seq_len(nrow(large_timebars))) {
      large_timebar <- large_timebars[i, ]
      trades <- subset(dataset, interval == large_timebar$interval)
      interval_start <- as.POSIXct(large_timebar$interval, tz = "")
      midpoint <- interval_start + interval_duration / 2
      first_half_trades <- subset(trades, timestamp <= midpoint)
      second_half_trades <- subset(trades, timestamp > midpoint)
      first_interval <- format(interval_start, "%Y-%m-%d %H:%M:%S")
      second_interval <- format(midpoint, "%Y-%m-%d %H:%M:%S")
      first_half <- data.frame(
        interval = first_interval,
        dp = ifelse(nrow(first_half_trades) > 0,
                    tail(first_half_trades$price, 1) - head(first_half_trades$price, 1), 0),
        tbv = sum(first_half_trades$volume)
      )
      second_half <- data.frame(
        interval = second_interval,
        dp = ifelse(nrow(second_half_trades) > 0,
                    tail(second_half_trades$price, 1) - head(second_half_trades$price, 1), 0),
        tbv = sum(second_half_trades$volume)
      )
      # Call the recursive split on the new intervals if their volumes are too large
      if (first_half$tbv > threshold) {
        recursive_split(first_half, interval_duration / 2)
      } else {
        final_timebars <<- rbind(final_timebars, first_half)
      }
      if (second_half$tbv > threshold) {
        recursive_split(second_half, interval_duration / 2)
      } else {
        final_timebars <<- rbind(final_timebars, second_half)
      }
      # Remove the large time bar being split from the final_timebars
      final_timebars <<- final_timebars[final_timebars$interval != large_timebar$interval, ]
    }
  }

  # Start the recursive split with the initial data and interval duration
  recursive_split(data, interval_duration)
  return(final_timebars)
}

largebars <- subset(minutebars, tbv > threshold)
if (nrow(largebars) > 0) {
  split_bars <- split_timebars(largebars, dataset, timebarsize, threshold)
  minutebars <- minutebars[-which(minutebars$tbv > threshold), ]
  minutebars <- rbind(minutebars, split_bars)
  minutebars <- minutebars[order(as.POSIXct(minutebars$interval, format = "%Y-%m-%d %H:%M:%S")), ]
}
As for the grid search, I can test the running time with and without it on the hfdata. From my experience the difference in time is minimal, but it would also be interesting to test the difference in the resulting iVPIN.
Should we let the user choose between a fast and a slow runtime as an argument of the iVPIN function?
Hej @alecriste, I have read through your comment and your code. Good attempt at the code, well done! However, in order to ensure comparability of research results, we need to stick to the strategy used by Ke and Lin (2017), i.e., assume a uniform distribution of trades and distribute VτB, VτS, and tτ proportionally. Would you like to make an attempt at it? Once this is implemented, all that is left is to work on checking, improving and testing the code.
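To make the proportional rule concrete, here is a minimal sketch under the uniform-distribution assumption; the function and variable names are hypothetical, not from the draft:
# When adding a time bar would overflow the current volume bucket, assume its
# trades are uniformly distributed in time and assign VB, VS and t proportionally.
fill_bucket <- function(vb_bar, vs_bar, t_bar, bucket_volume, threshold) {
  room <- threshold - bucket_volume     # volume still needed to fill the bucket
  total_bar <- vb_bar + vs_bar
  frac <- min(1, room / total_bar)      # fraction of the time bar used for this bucket
  list(vb = frac * vb_bar,              # proportional buy volume
       vs = frac * vs_bar,              # proportional sell volume
       t = frac * t_bar,                # proportional duration
       leftover = 1 - frac)             # share of the bar carried to the next bucket
}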
@alecriste As for the grid search, we can include an argument called grid_size which specifies the level of partitioning of the parameter space of the pairs (alpha, delta). This allows users to increase the number of initial parameter sets and eventually the accuracy of the maximum-likelihood estimates.
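A hedged sketch of what such an argument could look like; the function name is illustrative, while the moment-based guesses for mu and epsilon follow the draft code:
# Build grid_size^2 initial parameter sets over the (alpha, delta) unit square.
build_initial_sets <- function(grid_size, informed_arrival_rate, total_arrival_rate) {
  probs <- seq(0.1, 0.9, length.out = grid_size)
  sets <- list()
  for (alpha_init in probs) {
    for (delta_init in probs) {
      mu_init <- mean(informed_arrival_rate) / alpha_init
      eps_init <- mean(abs(total_arrival_rate - informed_arrival_rate)) / 2
      sets[[length(sets) + 1]] <- c(alpha_init, delta_init, mu_init, eps_init)
    }
  }
  return(sets)
}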
R/model_ivpin.R
Outdated
@@ -787,7 +771,20 @@ ivpin <- function(data, timebarsize = 60, buckets = 50, samplength = 50,
mean_duration <- mean(bucketdata$duration)
bucketdata$duration <- bucketdata$duration / mean_duration
bucketdata$duration <- ifelse(bucketdata$duration == 0, 0.1, bucketdata$duration)

# The sigmoid function
R/model_ivpin.R
Outdated
  return(y)
}

# The inverse sigmoid function
@@ -902,6 +902,10 @@ ivpin <- function(data, timebarsize = 60, buckets = 50, samplength = 50,
}
estimateivpin@ivpin <- ivpin
estimatevpin@runningtime <- ux$timediff(time_on, time_off)
num_of_nan <- sum(is.na(ivpin)) |
@alecriste Can you share when the NA values for ivpin happen?
DESCRIPTION
Outdated
@@ -18,7 +18,7 @@ LazyDataCompression: xz
RoxygenNote: 7.2.1
Roxygen: list(markdown = TRUE)
VignetteBuilder: knitr
Imports: Rdpack, knitr, methods, skellam, nloptr, furrr, future, dplyr, rmarkdown, coda
Imports: Rdpack, knitr, methods, skellam, nloptr, furrr, future, dplyr, rmarkdown, coda, lubridate
Can you please specify where you use lubridate in the ivpin code?
Added one file, model_ivpin.R, which computes the iVPIN according to "An Improved Version of the Volume-Synchronized Probability of Informed Trading".