
Adding IVPIN #7

Open · wants to merge 20 commits into master
Conversation

alecriste

Added one file, model_ivpin.R, that computes the iVPIN according to "An Improved Version of the Volume-Synchronized Probability of Informed Trading".

Owner

@monty-se monty-se left a comment

Alexandre, a very good job tackling the ambitious project of implementing ivpin and adapting it to the coding framework of PINstimation. I have also tried the code, and it seems to work with no problem, way to go! To improve your code further so it reaches an even better standing, I have submitted multiple comments and questions to make sure that the code is efficient, clean, well-commented, and fits seamlessly into the PINstimation coding framework. Please comment back or reach out to me if something is unclear.

R/model_ivpin.R Outdated

start_index <- samplength + 1

"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is well-commented but could be clarified a bit more so the reader knows why you calculate these variables. It helps to have consistent variable names: total_arrival_rate and informed_arrival_rate.

# Easley et al. (2012b) derive the VPIN estimator based on two moment conditions from Poisson processes:
# E[|VB - VS|] ≈ alpha*mu and E[|VB + VS|] = 2*epsilon + alpha*mu.
# Ke et al. (2017) suggest that these conditions should be expressed as:
# E[|VB - VS||t; theta] ≈ alpha*mu*t and E[|VB + VS||t; theta] = (2*epsilon + alpha*mu)*t.
# These equations correspond to Equations (3) on page 364 of Ke et al. (2017).
# 
# In our implementation:
# - The variable `total_arrival_rate` is calculated as  |VB + VS|/t. Its average (expected value) would, therefore, represent 2*epsilon + alpha*mu.
# - The variable `informed_arrival_rate` is calculated as |VB - VS|/t.  Its average (expected value) would, therefore, approximate alpha*mu.
# 
# This approximation allows us to estimate mu by dividing `informed_arrival_rate` by an estimated alpha.
# Once mu is estimated, we can determine epsilon using the equation for `total_arrival_rate`.

total_arrival_rate <- (bucketdata$agg.bvol + bucketdata$agg.svol) / bucketdata$duration
informed_arrival_rate <- abs(bucketdata$agg.bvol - bucketdata$agg.svol) / bucketdata$duration
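
For illustration, here is a minimal sketch (not part of the pull request) of how the two moment conditions above can be inverted to obtain starting values; alpha_hat, mu_hat, and eps_hat are illustrative names, and alpha_hat stands for some preliminary estimate of alpha.

# Minimal sketch (illustrative only): invert the two moment conditions above
# for a given preliminary estimate of alpha.
alpha_hat <- 0.3                                                # hypothetical alpha estimate
mu_hat  <- mean(informed_arrival_rate) / alpha_hat              # since E[|VB - VS|/t] ~ alpha*mu
eps_hat <- (mean(total_arrival_rate) - alpha_hat * mu_hat) / 2  # since E[(VB + VS)/t] = 2*eps + alpha*mu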

R/model_ivpin.R Outdated
sqrt(mu_init),
sqrt(eps_init))
tryCatch({
result <- optim(initial_guess, compute_log_likelihood, t = t[j:i], Vb = Vb[j:i], Vs = Vs[j:i], method = "BFGS")

This comment was marked as resolved.

R/model_ivpin.R Outdated
mu_init <- mean(informed_arrival[j:i]) / alpha_init
eps_init <- mean(abs(total_arrival_rate[j:i] - informed_arrival[j:i])) / 2

initial_guess <- c(logit(max(min(alpha_init, 0.999), 0.001)),
Owner

@alecriste Can you explain the use of the functions logit() and sqrt() here, which will be reversed in the function compute_log_likelihood() using the function sigmoid() and ^2?

Author

It does feel cumbersome, and maybe you can find an alternative solution, but here is the issue behind it:
The optim function will not work on a small interval from 0 to 1, so we need to map it to all reals. The function in compute_log_likelihood works with alpha and delta from 0 to 1, so we need to remap them to that interval. Then optim returns alpha and delta (on the real number line) and we map them back to the 0 to 1 interval using sigmoid.

The same logic applies to the square root and square functions, but in that case it is so that we can square the results at the end to ensure positive values (trader arrival rates).
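
For illustration, a minimal, self-contained sketch of this transform/inverse-transform trick (not the PR code; the objective below is a toy placeholder for the iVPIN likelihood):

# Sketch of the reparameterization described above (illustrative only).
sigmoid <- function(x) 1 / (1 + exp(-x))
logit   <- function(p) log(p / (1 - p))

neg_loglik <- function(par) {
  alpha <- sigmoid(par[1])   # mapped back into (0, 1)
  delta <- sigmoid(par[2])   # mapped back into (0, 1)
  mu    <- par[3]^2          # mapped back into [0, Inf)
  eps   <- par[4]^2          # mapped back into [0, Inf)
  # ... here the iVPIN log-likelihood would be evaluated with the constrained parameters ...
  sum((c(alpha, delta, mu, eps) - c(0.3, 0.5, 1000, 500))^2)  # toy placeholder objective
}

start <- c(logit(0.3), logit(0.5), sqrt(900), sqrt(450))      # unconstrained starting values
fit   <- optim(start, neg_loglik, method = "BFGS")
est   <- c(sigmoid(fit$par[1:2]), fit$par[3:4]^2)             # map back to the original scale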

Owner

I see. The function optim() performs unconstrained non-linear optimization, so we need a workaround to constrain the variable values. A solution we use in the package involves the nloptr package, which allows for constrained optimization by specifying lower and upper bounds for the variables. Below is an example of using nloptr inspired by the code in the model_pin.R file at line 1571. You can apply your chosen optimization method by using nloptr::lbfgs() instead of nloptr::neldermead(). This should simplify the code considerably.

# Define the lower and upper bounds for the optimization variables
low <- c(0, 0, 0, 0, 0)  # Lower bounds
up <- c(1, 1, Inf, Inf, Inf)  # Upper bounds

# Try to perform the constrained optimization
tryCatch({
  # Use the 'nloptr' package to perform constrained optimization
  # 'initialsets[[i]]' contains the initial guess for the optimization
  # 'mlefn' is the objective function to be minimized
  estimates <- suppressWarnings(
    nloptr::neldermead(initialsets[[i]], mlefn, lower = low, upper = up)
  )
}, error = function(e) {
  # Handle any errors that occur during optimization
  message("Error during optimization: ", conditionMessage(e))
})

Owner

@alecriste In order to conform to the coding framework of PINstimation, make sure to use the function nloptr::lbfgs() instead of optim(..., "lbfgs") in your code.
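
A possible shape of that call is sketched below. It is illustrative only and assumes compute_log_likelihood takes the raw four-parameter vector (alpha, delta, mu, eps) as its first argument once the logit/sqrt transforms are dropped, with extra arguments forwarded to the objective as in the nloptr wrappers; the exact interface should be checked against the nloptr documentation.

# Possible shape of the constrained call (illustrative; box constraints replace
# the logit/sqrt reparameterization, so the raw parameters are optimized directly).
low <- c(0, 0, 0, 0)            # lower bounds for alpha, delta, mu, eps
up  <- c(1, 1, Inf, Inf)        # upper bounds
estimates <- suppressWarnings(
  nloptr::lbfgs(initial_guess, compute_log_likelihood,
                lower = low, upper = up,
                t = t[j:i], Vb = Vb[j:i], Vs = Vs[j:i])
)
# estimates$par would then hold the constrained parameter estimates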

R/model_ivpin.R Outdated
} else {
for (alpha_init in seq(0.1, 0.9, by = 0.2)) {
for (delta_init in seq(0.1, 0.9, by = 0.2)) {
mu_init <- informed_arrival[i] / alpha_init

This comment was marked as resolved.

R/model_ivpin.R Outdated
}

if (exists("best_params")) {
best_params[1:2] <- sigmoid(best_params[1:2])

This comment was marked as resolved.

R/model_ivpin.R Outdated
best_log_likelihood <- Inf
exit_flag <- FALSE

if (i == start_index || i %% 10 == 0) {
Owner

The code could benefit from comments so the reader clearly knows what the code does. For instance, why do we use the grid only for the first index and for any index that is a multiple of 10, and not for the others? If I guess well, you want to improve the accuracy of your estimates by using the optimal estimates from previous runs as initial guesses for the current run. If the estimation fails, then you go back to the grid method. You regularly update these optimal estimates using the grid method every tenth run. Is this correct?

Author

@alecriste alecriste Jun 29, 2024

This is correct, yes. I should definitely explain that in the comments; sorry you had to decipher all of it. There is also a redundancy in the code: we could have one function do the grid search. We would call that function for the first bucket, then use the initial parameters from the previous iteration for each subsequent bucket, except when the optimization fails, in which case we would use the grid search again. This is the process used in the paper for their market data testing in section 7.

In this code I actually ran the grid search every 10 iterations, but this was to combat the fact that many of the values of the duration vector were 0, which skewed the final ivpin results. Now I use the average value of the duration vector over the window, but I still suspect that having many 0s impacts the result. To avoid this, I wrote a check on line 694 which detects whether more than 5% of the duration vector is 0s and informs the user to change the time bar size to a smaller value.
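
For illustration, the warm-start-with-grid-fallback strategy described above could look roughly like this self-contained toy sketch (illustrative names; a toy objective stands in for compute_log_likelihood):

# Toy sketch: grid search for the first bucket, warm-start from the previous
# optimum afterwards, and fall back to the grid whenever optimization fails.
toy_negll <- function(par, target) sum((par - target)^2)

grid_inits <- as.matrix(expand.grid(alpha = seq(0.1, 0.9, by = 0.2),
                                    delta = seq(0.1, 0.9, by = 0.2)))

grid_search <- function(target) {
  fits <- lapply(seq_len(nrow(grid_inits)),
                 function(k) optim(grid_inits[k, ], toy_negll, target = target))
  fits[[which.min(vapply(fits, `[[`, numeric(1), "value"))]]
}

targets  <- list(c(0.40, 0.60), c(0.45, 0.55), c(0.50, 0.50))  # one toy target per bucket
prev_par <- NULL
for (tg in targets) {
  fit <- if (is.null(prev_par)) grid_search(tg) else
    tryCatch(optim(prev_par, toy_negll, target = tg),
             error = function(e) grid_search(tg))              # fall back to the grid on failure
  prev_par <- fit$par
}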

Owner

Since this strategy is used in the research paper, it should be our first choice. We need to improve the code to make it reusable and avoid redundancy. Alternatively, because the optimization is usually very fast, we could keep the grid for each volume bucket and add the optimal parameters from the previous iteration. We can test whether the code runs in a reasonable time. If the running time is too long, we can use your strategy of calling the grid only when the optimization fails.
Regarding volume buckets with a duration of zero, I understand that this leads to a division by zero and should result in NA values for the parameters mu and epsilon, right? Is the use of the average value of the duration vector over the window considered in the research paper by Ke and Lin (2017)? Or do they assume that all duration values are above zero? I appreciate your suggestion to inform the user about the prevalence of zeros in the duration vector. We should first conduct some research and theoretically agree on the best approach for handling zero-duration buckets.

Author

@alecriste alecriste Jul 3, 2024

After reading the paper more closely, this is how Dr. Ke and Dr. Lin deal with large time bars:

"We use the bulk classification algorithm (Easley et al., 2012a) with a 1-min time bar to calculate the volumes of buy and sell, VτB and VτS, respectively. For a time bar, if the cumulative volume exceeds V, we assume that the trades are uniformly distributed and then calculate VτB, VτS, and tτ proportionally."

This means we can compute Vb and Vs for each time bar, and if one exceeds the volume threshold, we simply assign only the proportion of Vb and Vs needed to fill the bucket, and t would be the percentage of the minute that was used.

Before reading this part, I tried a different way, although it isn't what Dr. Ke and Dr. Lin did. What I did instead was to split, for example, a large minute bar into two 30-second bars. If they are still large, I split them further into 15-second bars, and so on until they are under the bucket threshold. Here is an example and the code:

--------------------------------------------------------------------------
III.2 BREAK DOWN LARGE VOLUME TIMEBARS INTO SMALLER ONES

We break down these large volume timebars into smaller timebars with a
maximum volume equal to threshold.
The code recursively splits large timebars into smaller ones and removes
the original large timebars from the final dataset, until only
timebars with volumes below the threshold remain.

If, for example, the initial minute bar with interval 10:00:00 - 10:01:00
has a total volume (tbv) of 2200, which exceeds the threshold of 1000, and
is composed in the following trades:

 trades              price    volume

 2023-06-30 10:00:05    1    500  
 2023-06-30 10:00:10    1    600  
 2023-06-30 10:00:20    1    700  
 2023-06-30 10:00:30    1    400   

The first pass splits this initial minute bar into two 30-second intervals:

 interval              dp    tbv

 2023-06-30 10:00:00    0    1800  (first 3 trades)
 2023-06-30 10:00:30    1    400   (fourth trade)

The first 30-second interval with 1800 volume exceeds the threshold,
so it will be split further into two 15-second intervals:

 interval              dp    tbv

 2023-06-30 10:00:00    0    1100  (first 2 trades)
 2023-06-30 10:00:15    1    700   (third trade)
 2023-06-30 10:00:30    1    400   (fourth trade)

The first 15-second interval with 1100 volume still exceeds the threshold,
so it will be split further into two intervals:

 interval              dp    tbv

 2023-06-30 10:00:00    0    500   (first trade)
 2023-06-30 10:00:07.5  1    600   (second trade)
 2023-06-30 10:00:30    1    400   (fourth trade)

The final timebars with acceptable volumes are:

 interval              dp    tbv

 2023-06-30 10:00:00    0    500   (first trade)
 2023-06-30 10:00:07.5  1    600   (second trade)
 2023-06-30 10:00:15    1    700   (third trade)
 2023-06-30 10:00:30    1    400   (last trade)

split_timebars <- function(data, dataset, interval_duration, threshold) {
final_timebars <- data.frame()

# Function to split the data recursively
recursive_split <- function(data, interval_duration) {
  large_timebars <- subset(data, tbv > threshold)
  
  if (nrow(large_timebars) == 0) {
    # If no large timebars, add the data to final timebars
    final_timebars <<- rbind(final_timebars, data)
    return()
  }
  
  for (i in seq_len(nrow(large_timebars))) {
    large_timebar <- large_timebars[i, ]
    trades <- subset(dataset, interval == large_timebar$interval)
    
    interval_start <- as.POSIXct(large_timebar$interval, tz = "")
    midpoint <- interval_start + interval_duration / 2
    
    first_half_trades <- subset(trades, timestamp <= midpoint)
    second_half_trades <- subset(trades, timestamp > midpoint)
    
    first_interval <- format(interval_start, "%Y-%m-%d %H:%M:%S")
    second_interval <- format(midpoint, "%Y-%m-%d %H:%M:%S")
    
    first_half <- data.frame(
      interval = first_interval,
      dp = ifelse(nrow(first_half_trades) > 0, tail(first_half_trades$price, 1) - head(first_half_trades$price, 1), 0),
      tbv = sum(first_half_trades$volume)
    )
    
    second_half <- data.frame(
      interval = second_interval,
      dp = ifelse(nrow(second_half_trades) > 0, tail(second_half_trades$price, 1) - head(second_half_trades$price, 1), 0),
      tbv = sum(second_half_trades$volume)
    )
    
    # Call the recursive split on the new intervals if their volumes are too large
    if (first_half$tbv > threshold) {
      recursive_split(first_half, interval_duration / 2)
    } else {
      final_timebars <<- rbind(final_timebars, first_half)
    }
    
    if (second_half$tbv > threshold) {
      recursive_split(second_half, interval_duration / 2)
    } else {
      final_timebars <<- rbind(final_timebars, second_half)
    }
    
    # Remove the large time bar being split from the final_timebars
    final_timebars <<- final_timebars[final_timebars$interval != large_timebar$interval, ]
  }
}

# Start the recursive split with the initial data and interval duration
recursive_split(data, interval_duration)
return(final_timebars)

}

largebars <- subset(minutebars, tbv > threshold)
if (nrow(largebars) > 0) {
split_bars <- split_timebars(largebars, dataset, timebarsize, threshold)
minutebars <- minutebars[-which(minutebars$tbv > threshold), ]
minutebars <- rbind(minutebars, split_bars)
minutebars <- minutebars[order(as.POSIXct(minutebars$interval, format = "%Y-%m-%d %H:%M:%S")), ]
}

Author

As for the grid search, I can test the running time with and without it on the hfdata. From my experience the difference in time is minimal; however, it would also be interesting to test the difference in the resulting iVPIN.

Should we let the user choose between a fast and a slow runtime as an argument of the iVPIN function?

Owner

Hej @alecriste, I have read through your comment and your code. A good attempt at the code, well done! However, in order to ensure comparability of research results, we need to stick to the strategy used by Ke and Lin (2017), i.e., assume a uniform distribution of trades and distribute VτB, VτS, and tτ proportionally. Would you like to make an attempt at it? Once this is implemented, all that is left is to work on checking, improving, and testing the code.
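
For illustration, the proportional rule quoted from Ke and Lin (2017) could be sketched as follows (illustrative names, not the PR code): when a time bar would overflow the current volume bucket, only the fraction needed to fill the bucket is assigned, and the bar's duration is scaled by the same fraction.

# Minimal sketch of the proportional allocation under the uniform-trades assumption.
fill_bucket <- function(vb, vs, bar_seconds, remaining) {
  total <- vb + vs
  frac  <- if (total > remaining) remaining / total else 1
  list(vb_used = vb * frac,            # buy volume assigned to the current bucket
       vs_used = vs * frac,            # sell volume assigned to the current bucket
       t_used  = bar_seconds * frac,   # share of the time bar spent on this bucket
       vb_left = vb * (1 - frac),      # leftovers carried over to the next bucket
       vs_left = vs * (1 - frac))
}

# Example: a 1-minute bar with 1500 buys and 700 sells, 300 units left in the bucket
fill_bucket(vb = 1500, vs = 700, bar_seconds = 60, remaining = 300)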

Owner

@alecriste As for the grid search, we can include an argument called grid_size which specifies the level of partitioning of the parameter space of the pairs (alpha, delta). This allows users to increase the number of initial parameter sets and, ultimately, the accuracy of the maximum-likelihood estimates.
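
A possible sketch of such a grid_size argument (illustrative only): the unit square of (alpha, delta) is partitioned into grid_size interior points per dimension, giving grid_size^2 initial parameter pairs for the maximum-likelihood search.

# Illustrative helper generating initial (alpha, delta) pairs from a grid_size argument.
make_initial_pairs <- function(grid_size = 5) {
  probs <- seq_len(grid_size) / (grid_size + 1)   # interior points of (0, 1)
  expand.grid(alpha = probs, delta = probs)
}

head(make_initial_pairs(5))   # 25 (alpha, delta) starting pairs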

R/model_ivpin.R Outdated
@@ -787,7 +771,20 @@ ivpin <- function(data, timebarsize = 60, buckets = 50, samplength = 50,
mean_duration <- mean(bucketdata$duration)
bucketdata$duration <- bucketdata$duration / mean_duration
bucketdata$duration <- ifelse(bucketdata$duration == 0, 0.1, bucketdata$duration)
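
For context, the zero-duration check Alexandre mentions could look roughly like the following sketch (illustrative, not necessarily the code at line 694 of the PR), evaluated on the raw durations before the normalization shown above:

# Warn the user when more than 5% of bucket durations are zero, suggesting a
# smaller time bar size (illustrative sketch).
zero_share <- mean(bucketdata$duration == 0)
if (zero_share > 0.05) {
  warning("More than 5% of bucket durations are zero; ",
          "consider using a smaller value for 'timebarsize'.")
}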


# The sigmoid function

This comment was marked as resolved.

R/model_ivpin.R Outdated
return(y)
}

# The inverse sigmoid function

This comment was marked as resolved.

@@ -902,6 +902,10 @@ ivpin <- function(data, timebarsize = 60, buckets = 50, samplength = 50,
}
estimateivpin@ivpin <- ivpin
estimatevpin@runningtime <- ux$timediff(time_on, time_off)
num_of_nan <- sum(is.na(ivpin))
Owner

@alecriste Can you share when the NA values for ivpin happen?

DESCRIPTION Outdated
@@ -18,7 +18,7 @@ LazyDataCompression: xz
RoxygenNote: 7.2.1
Roxygen: list(markdown = TRUE)
VignetteBuilder: knitr
Imports: Rdpack, knitr, methods, skellam, nloptr, furrr, future, dplyr, rmarkdown, coda
Imports: Rdpack, knitr, methods, skellam, nloptr, furrr, future, dplyr, rmarkdown, coda, lubridate
Owner

Can you please specify where you use lubridate in the ivpin code?

@monty-se monty-se force-pushed the master branch 2 times, most recently from 249c421 to 6b0cbca on July 28, 2024 15:28
@monty-se monty-se force-pushed the master branch 9 times, most recently from 58b39ae to dbc0fff on October 22, 2024 10:57
@monty-se monty-se force-pushed the master branch 11 times, most recently from cb5f1c4 to 20ae095 on October 22, 2024 19:46