
Adding IVPIN #7

Open · wants to merge 20 commits into master
Conversation

alecriste

Added one file, model_ivpin.R, that computes the iVPIN according to "An Improved Version of the Volume-Synchronized Probability of Informed Trading".

Owner

@monty-se monty-se left a comment

Alexandre, a very good job tackling the ambitious project of implementing ivpin and adapting it to the coding framework of PINstimation. I have also tried the code, and it seems to work with no problem, way to go! To improve your code further so it reaches an even better standing, I have submitted multiple comments and questions to make sure that the code is efficient, clean, well-commented, and fits seamlessly into the PINstimation coding framework. Please comment back or reach out to me if something is unclear.

R/model_ivpin.R Outdated

start_index <- samplength + 1

"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is well-commented but could be clarified a bit more so the reader knows why you calculate these variables. It helps to have consistent variable names: total_arrival_rate and informed_arrival_rate.

# Easley et al. (2012b) derive the VPIN estimator based on two moment conditions from Poisson processes:
# E[|VB - VS|] ≈ alpha*mu and E[|VB + VS|] = 2*epsilon + alpha*mu.
# Ke et al. (2017) suggest that these conditions should be expressed as:
# E[|VB - VS||t; theta] ≈ alpha*mu*t and E[|VB + VS||t; theta] = (2*epsilon + alpha*mu)*t.
# These equations correspond to Equations (3) on page 364 of Ke et al. (2017).
# 
# In our implementation:
# - The variable `total_arrival_rate` is calculated as  |VB + VS|/t. Its average (expected value) would, therefore, represent 2*epsilon + alpha*mu.
# - The variable `informed_arrival_rate` is calculated as |VB - VS|/t.  Its average (expected value) would, therefore, approximate alpha*mu.
# 
# This approximation allows us to estimate mu by dividing `informed_arrival_rate` by an estimated alpha.
# Once mu is estimated, we can determine epsilon using the equation for `total_arrival_rate`.

total_arrival_rate <- (bucketdata$agg.bvol + bucketdata$agg.svol) / bucketdata$duration
informed_arrival_rate <- abs(bucketdata$agg.bvol - bucketdata$agg.svol) / bucketdata$duration
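
For illustration, here is a minimal sketch (not part of the pull request) of how the two moment conditions above can be inverted to obtain starting values; alpha_hat, mu_hat, and eps_hat are illustrative names, and alpha_hat stands for some preliminary estimate of alpha.

# Minimal sketch (illustrative only): invert the two moment conditions above
# for a given preliminary estimate of alpha.
alpha_hat <- 0.3                                                # hypothetical alpha estimate
mu_hat  <- mean(informed_arrival_rate) / alpha_hat              # since E[|VB - VS|/t] ~ alpha*mu
eps_hat <- (mean(total_arrival_rate) - alpha_hat * mu_hat) / 2  # since E[(VB + VS)/t] = 2*eps + alpha*mu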

R/model_ivpin.R Outdated
sqrt(mu_init),
sqrt(eps_init))
tryCatch({
result <- optim(initial_guess, compute_log_likelihood, t = t[j:i], Vb = Vb[j:i], Vs = Vs[j:i], method = "BFGS")

This comment was marked as resolved.

R/model_ivpin.R Outdated
mu_init <- mean(informed_arrival[j:i]) / alpha_init
eps_init <- mean(abs(total_arrival_rate[j:i] - informed_arrival[j:i])) / 2

initial_guess <- c(logit(max(min(alpha_init, 0.999), 0.001)),
Owner

@alecriste Can you explain the use of the functions logit() and sqrt() here, which will be reversed in the function compute_log_likelihood() using the function sigmoid() and ^2?

Author

It does feel cumbersome, and maybe you can find an alternative solution, but here is the issue behind it:
The optim function will not work on a small interval from 0 to 1, so we need to map it to all reals. The function in compute_log_likelihood works with alpha and delta from 0 to 1, so we need to remap them to that interval. Then optim returns alpha and delta (on the real number line) and we map them back to the 0 to 1 interval using sigmoid.

The same logic applies to the square root and square functions, but in that case it is so that we can square the results at the end to ensure positive values (trader arrival rates).
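
For illustration, a minimal, self-contained sketch of this transform/inverse-transform trick (not the PR code; the objective below is a toy placeholder for the iVPIN likelihood):

# Sketch of the reparameterization described above (illustrative only).
sigmoid <- function(x) 1 / (1 + exp(-x))
logit   <- function(p) log(p / (1 - p))

neg_loglik <- function(par) {
  alpha <- sigmoid(par[1])   # mapped back into (0, 1)
  delta <- sigmoid(par[2])   # mapped back into (0, 1)
  mu    <- par[3]^2          # mapped back into [0, Inf)
  eps   <- par[4]^2          # mapped back into [0, Inf)
  # ... here the iVPIN log-likelihood would be evaluated with the constrained parameters ...
  sum((c(alpha, delta, mu, eps) - c(0.3, 0.5, 1000, 500))^2)  # toy placeholder objective
}

start <- c(logit(0.3), logit(0.5), sqrt(900), sqrt(450))      # unconstrained starting values
fit   <- optim(start, neg_loglik, method = "BFGS")
est   <- c(sigmoid(fit$par[1:2]), fit$par[3:4]^2)             # map back to the original scale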

Owner

I see. The function optim() performs unconstrained non-linear optimization, so we need a workaround to constrain the variable values. A solution we use in the package involves the nloptr package, which allows for constrained optimization by specifying lower and upper bounds for the variables. Below is an example of using nloptr inspired by the code in the model_pin.R file at line 1571. You can apply your chosen optimization method by using nloptr::lbfgs() instead of nloptr::neldermead(). This should simplify the code considerably.

# Define the lower and upper bounds for the optimization variables
low <- c(0, 0, 0, 0, 0)  # Lower bounds
up <- c(1, 1, Inf, Inf, Inf)  # Upper bounds

# Try to perform the constrained optimization
tryCatch({
  # Use the 'nloptr' package to perform constrained optimization
  # 'initialsets[[i]]' contains the initial guess for the optimization
  # 'mlefn' is the objective function to be minimized
  estimates <- suppressWarnings(
    nloptr::neldermead(initialsets[[i]], mlefn, lower = low, upper = up)
  )
}, error = function(e) {
  # Handle any errors that occur during optimization
  message("Error during optimization: ", conditionMessage(e))
})

Owner

@alecriste In order to conform to the coding framework of PINstimation, make sure to use the function nloptr::lbfgs() instead of optim(..., "lbfgs") in your code.
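
A possible shape of that call is sketched below. It is illustrative only and assumes compute_log_likelihood takes the raw four-parameter vector (alpha, delta, mu, eps) as its first argument once the logit/sqrt transforms are dropped, with extra arguments forwarded to the objective as in the nloptr wrappers; the exact interface should be checked against the nloptr documentation.

# Possible shape of the constrained call (illustrative; box constraints replace
# the logit/sqrt reparameterization, so the raw parameters are optimized directly).
low <- c(0, 0, 0, 0)            # lower bounds for alpha, delta, mu, eps
up  <- c(1, 1, Inf, Inf)        # upper bounds
estimates <- suppressWarnings(
  nloptr::lbfgs(initial_guess, compute_log_likelihood,
                lower = low, upper = up,
                t = t[j:i], Vb = Vb[j:i], Vs = Vs[j:i])
)
# estimates$par would then hold the constrained parameter estimates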

R/model_ivpin.R Outdated
} else {
for (alpha_init in seq(0.1, 0.9, by = 0.2)) {
for (delta_init in seq(0.1, 0.9, by = 0.2)) {
mu_init <- informed_arrival[i] / alpha_init

This comment was marked as resolved.

R/model_ivpin.R Outdated
}

if (exists("best_params")) {
best_params[1:2] <- sigmoid(best_params[1:2])

This comment was marked as resolved.

R/model_ivpin.R Outdated
best_log_likelihood <- Inf
exit_flag <- FALSE

if (i == start_index || i %% 10 == 0) {
Owner

The code could benefit from comments so the reader clearly knows what the code does. For instance, why do we use the grid only for the first index and for any index that is a multiple of 10, and not for the others? If I guess well, you want to improve the accuracy of your estimates by using the optimal estimates from previous runs as initial guesses for the current run. If the estimation fails, then you go back to the grid method. You regularly update these optimal estimates using the grid method every tenth run. Is this correct?

Author

@alecriste alecriste Jun 29, 2024

This is correct, yes. I should definitely explain that in the comments; sorry you had to decipher all of it. There is also a redundancy in the code: we could have one function do the grid search. We would call that function for the first bucket, then use the initial parameters from the previous iteration for each subsequent bucket, except when the optimization fails, in which case we would use the grid search again. This is the process used in the paper for their market data testing in section 7.

In this code I actually ran the grid search every 10 iterations, but this was to combat the fact that many of the values of the duration vector were 0, which skewed the final ivpin results. Now I use the average value of the duration vector over the window, but I still suspect that having many 0s impacts the result. To avoid this, I wrote a check on line 694 which detects whether more than 5% of the duration vector is 0s and informs the user to change the time bar size to a smaller value.
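
For illustration, the warm-start-with-grid-fallback strategy described above could look roughly like this self-contained toy sketch (illustrative names; a toy objective stands in for compute_log_likelihood):

# Toy sketch: grid search for the first bucket, warm-start from the previous
# optimum afterwards, and fall back to the grid whenever optimization fails.
toy_negll <- function(par, target) sum((par - target)^2)

grid_inits <- as.matrix(expand.grid(alpha = seq(0.1, 0.9, by = 0.2),
                                    delta = seq(0.1, 0.9, by = 0.2)))

grid_search <- function(target) {
  fits <- lapply(seq_len(nrow(grid_inits)),
                 function(k) optim(grid_inits[k, ], toy_negll, target = target))
  fits[[which.min(vapply(fits, `[[`, numeric(1), "value"))]]
}

targets  <- list(c(0.40, 0.60), c(0.45, 0.55), c(0.50, 0.50))  # one toy target per bucket
prev_par <- NULL
for (tg in targets) {
  fit <- if (is.null(prev_par)) grid_search(tg) else
    tryCatch(optim(prev_par, toy_negll, target = tg),
             error = function(e) grid_search(tg))              # fall back to the grid on failure
  prev_par <- fit$par
}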

Owner

Since this strategy is used in the research paper, it should be our first choice. We need to improve the code to make it reusable and avoid redundancy. Alternatively, because the optimization is usually very fast, we could keep the grid for each volume bucket and add the optimal parameters from the previous iteration. We can test whether the code runs in a reasonable time. If the running time is too long, we can use your strategy of calling the grid only when the optimization fails.
Regarding volume buckets with a duration of zero, I understand that this leads to a division by zero and should result in NA values for the parameters mu and epsilon, right? Is the use of the average value of the duration vector over the window considered in the research paper by Ke and Lin (2017)? Or do they assume that all duration values are above zero? I appreciate your suggestion to inform the user about the prevalence of zeros in the duration vector. We should first conduct some research and theoretically agree on the best approach for handling zero-duration buckets.

Author

@alecriste alecriste Jul 3, 2024

After reading the paper more closely, this is how Dr. Ke and Dr. Lin deal with large time bars:

"We use the bulk classification algorithm (Easley et al., 2012a) with a 1-min time bar to calculate the volumes of buy and sell, VτB and VτS, respectively. For a time bar, if the cumulative volume exceeds V, we assume that the trades are uniformly distributed and then calculate VτB, VτS, and tτ proportionally."

This means we can compute Vb and Vs for each time bar, and if one exceeds the volume threshold, we simply assign only the proportion of Vb and Vs needed to fill the bucket, and t would be the percentage of the minute that was used.

Before reading this part, I tried a different way, although it isn't what Dr. Ke and Dr. Lin did. What I did instead was to split, for example, a large minute bar into two 30-second bars. If they are still large, I split them further into 15-second bars, and so on until they are under the bucket threshold. Here is an example and the code:

--------------------------------------------------------------------------
III.2 BREAK DOWN LARGE VOLUME TIMEBARS INTO SMALLER ONES

We break down these large volume timebars into smaller timebars with a
maximum volume equal to threshold.
The code recursively splits large timebars into smaller ones and removes
the original large timebars from the final dataset, until only
timebars with volumes below the threshold remain.

If, for example, the initial minute bar with interval 10:00:00 - 10:01:00
has a total volume (tbv) of 2200, which exceeds the threshold of 1000, and
is composed in the following trades:

 trades              price    volume

 2023-06-30 10:00:05    1    500  
 2023-06-30 10:00:10    1    600  
 2023-06-30 10:00:20    1    700  
 2023-06-30 10:00:30    1    400   

The first pass splits this initial minute bar into two 30-second intervals:

 interval              dp    tbv

 2023-06-30 10:00:00    0    1800  (first 3 trades)
 2023-06-30 10:00:30    1    400   (fourth trade)

The first 30-second interval with 1800 volume exceeds the threshold,
so it will be split further into two 15-second intervals:

 interval              dp    tbv

 2023-06-30 10:00:00    0    1100  (first 2 trades)
 2023-06-30 10:00:15    1    700   (third trade)
 2023-06-30 10:00:30    1    400   (fourth trade)

The first 15-second interval with 1100 volume still exceeds the threshold,
so it will be split further into two intervals:

 interval              dp    tbv

 2023-06-30 10:00:00    0    500   (first trade)
 2023-06-30 10:00:07.5  1    600   (second trade)
 2023-06-30 10:00:30    1    400   (fourth trade)

The final timebars with acceptable volumes are:

 interval              dp    tbv

 2023-06-30 10:00:00    0    500   (first trade)
 2023-06-30 10:00:07.5  1    600   (second trade)
 2023-06-30 10:00:15    1    700   (third trade)
 2023-06-30 10:00:30    1    400   (last trade)

split_timebars <- function(data, dataset, interval_duration, threshold) {
final_timebars <- data.frame()

# Function to split the data recursively
recursive_split <- function(data, interval_duration) {
  large_timebars <- subset(data, tbv > threshold)
  
  if (nrow(large_timebars) == 0) {
    # If no large timebars, add the data to final timebars
    final_timebars <<- rbind(final_timebars, data)
    return()
  }
  
  for (i in seq_len(nrow(large_timebars))) {
    large_timebar <- large_timebars[i, ]
    trades <- subset(dataset, interval == large_timebar$interval)
    
    interval_start <- as.POSIXct(large_timebar$interval, tz = "")
    midpoint <- interval_start + interval_duration / 2
    
    first_half_trades <- subset(trades, timestamp <= midpoint)
    second_half_trades <- subset(trades, timestamp > midpoint)
    
    first_interval <- format(interval_start, "%Y-%m-%d %H:%M:%S")
    second_interval <- format(midpoint, "%Y-%m-%d %H:%M:%S")
    
    first_half <- data.frame(
      interval = first_interval,
      dp = ifelse(nrow(first_half_trades) > 0, tail(first_half_trades$price, 1) - head(first_half_trades$price, 1), 0),
      tbv = sum(first_half_trades$volume)
    )
    
    second_half <- data.frame(
      interval = second_interval,
      dp = ifelse(nrow(second_half_trades) > 0, tail(second_half_trades$price, 1) - head(second_half_trades$price, 1), 0),
      tbv = sum(second_half_trades$volume)
    )
    
    # Call the recursive split on the new intervals if their volumes are too large
    if (first_half$tbv > threshold) {
      recursive_split(first_half, interval_duration / 2)
    } else {
      final_timebars <<- rbind(final_timebars, first_half)
    }
    
    if (second_half$tbv > threshold) {
      recursive_split(second_half, interval_duration / 2)
    } else {
      final_timebars <<- rbind(final_timebars, second_half)
    }
    
    # Remove the large time bar being split from the final_timebars
    final_timebars <<- final_timebars[final_timebars$interval != large_timebar$interval, ]
  }
}

# Start the recursive split with the initial data and interval duration
recursive_split(data, interval_duration)
return(final_timebars)

}

largebars <- subset(minutebars, tbv > threshold)
if (nrow(largebars) > 0) {
split_bars <- split_timebars(largebars, dataset, timebarsize, threshold)
minutebars <- minutebars[-which(minutebars$tbv > threshold), ]
minutebars <- rbind(minutebars, split_bars)
minutebars <- minutebars[order(as.POSIXct(minutebars$interval, format = "%Y-%m-%d %H:%M:%S")), ]
}

Author

As for the grid search, I can test the running time with and without it on the hfdata. From my experience the difference in time is minimal; however, it would also be interesting to test the difference in the resulting iVPIN.

Should we let the user choose between a fast and a slow runtime as an argument of the iVPIN function?

Owner

Hej @alecriste, I have read through your comment and your code. A good attempt at the code, well done! However, in order to ensure comparability of research results, we need to stick to the strategy used by Ke and Lin (2017), i.e., assume a uniform distribution of trades and distribute VτB, VτS, and tτ proportionally. Would you like to make an attempt at it? Once this is implemented, all that is left is to work on checking, improving, and testing the code.
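
For illustration, the proportional rule quoted from Ke and Lin (2017) could be sketched as follows (illustrative names, not the PR code): when a time bar would overflow the current volume bucket, only the fraction needed to fill the bucket is assigned, and the bar's duration is scaled by the same fraction.

# Minimal sketch of the proportional allocation under the uniform-trades assumption.
fill_bucket <- function(vb, vs, bar_seconds, remaining) {
  total <- vb + vs
  frac  <- if (total > remaining) remaining / total else 1
  list(vb_used = vb * frac,            # buy volume assigned to the current bucket
       vs_used = vs * frac,            # sell volume assigned to the current bucket
       t_used  = bar_seconds * frac,   # share of the time bar spent on this bucket
       vb_left = vb * (1 - frac),      # leftovers carried over to the next bucket
       vs_left = vs * (1 - frac))
}

# Example: a 1-minute bar with 1500 buys and 700 sells, 300 units left in the bucket
fill_bucket(vb = 1500, vs = 700, bar_seconds = 60, remaining = 300)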

Owner

@alecriste As for the grid search, we can include an argument called grid_size which specifies the level of partitioning of the parameter space of the pairs (alpha, delta). This allows users to increase the number of initial parameter sets and, ultimately, the accuracy of the maximum-likelihood estimates.
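
A possible sketch of such a grid_size argument (illustrative only): the unit square of (alpha, delta) is partitioned into grid_size interior points per dimension, giving grid_size^2 initial parameter pairs for the maximum-likelihood search.

# Illustrative helper generating initial (alpha, delta) pairs from a grid_size argument.
make_initial_pairs <- function(grid_size = 5) {
  probs <- seq_len(grid_size) / (grid_size + 1)   # interior points of (0, 1)
  expand.grid(alpha = probs, delta = probs)
}

head(make_initial_pairs(5))   # 25 (alpha, delta) starting pairs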

R/model_ivpin.R Outdated
@@ -787,7 +771,20 @@ ivpin <- function(data, timebarsize = 60, buckets = 50, samplength = 50,
mean_duration <- mean(bucketdata$duration)
bucketdata$duration <- bucketdata$duration / mean_duration
bucketdata$duration <- ifelse(bucketdata$duration == 0, 0.1, bucketdata$duration)
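
For context, the zero-duration check Alexandre mentions could look roughly like the following sketch (illustrative, not necessarily the code at line 694 of the PR), evaluated on the raw durations before the normalization shown above:

# Warn the user when more than 5% of bucket durations are zero, suggesting a
# smaller time bar size (illustrative sketch).
zero_share <- mean(bucketdata$duration == 0)
if (zero_share > 0.05) {
  warning("More than 5% of bucket durations are zero; ",
          "consider using a smaller value for 'timebarsize'.")
}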


# The sigmoid function

This comment was marked as resolved.

R/model_ivpin.R Outdated
return(y)
}

# The inverse sigmoid function

This comment was marked as resolved.

@@ -902,6 +902,10 @@ ivpin <- function(data, timebarsize = 60, buckets = 50, samplength = 50,
}
estimateivpin@ivpin <- ivpin
estimatevpin@runningtime <- ux$timediff(time_on, time_off)
num_of_nan <- sum(is.na(ivpin))
Owner

@alecriste Can you share when the NA values for ivpin happen?

DESCRIPTION Outdated
@@ -18,7 +18,7 @@ LazyDataCompression: xz
RoxygenNote: 7.2.1
Roxygen: list(markdown = TRUE)
VignetteBuilder: knitr
Imports: Rdpack, knitr, methods, skellam, nloptr, furrr, future, dplyr, rmarkdown, coda
Imports: Rdpack, knitr, methods, skellam, nloptr, furrr, future, dplyr, rmarkdown, coda, lubridate
Owner

Can you please specify where you use lubridate in the ivpin code?

@monty-se monty-se force-pushed the master branch 2 times, most recently from 249c421 to 6b0cbca on July 28, 2024 15:28
@monty-se monty-se force-pushed the master branch 9 times, most recently from 58b39ae to dbc0fff on October 22, 2024 10:57
@monty-se monty-se force-pushed the master branch 11 times, most recently from cb5f1c4 to 20ae095 on October 22, 2024 19:46