Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Greenbids Real Time Data Module #3242

Merged
merged 104 commits into from
Nov 12, 2024

Conversation

EvgeniiMunin
Copy link
Contributor

@EvgeniiMunin EvgeniiMunin commented Jun 13, 2024

🔧 Type of changes

  • new bid adapter
  • update bid adapter
  • new feature
  • new analytics adapter
  • new module
  • bugfix
  • documentation
  • configuration
  • tech debt (test coverage, refactorings, etc.)

✨ What's the context?

Greenbids Real Time Data module to filter bidders SSP listed in the imp[].ext.prebid.bidder of the bid request at the processed-auction-request stage.

🧠 Rationale behind the change

Artefacts and caching
To perform the filtering the module uses the ML pipeline that outputs the probability of bid per bidder for each imp for the given BidRequest. Then this probability of bid is compared with the threshold to ensure target True Positive Rate (TPR) defined for each partner publisher. Thus the RTD module uses 2 artefacts that are fetched from the Greenbids Google Cloud Storage bucket

  • ML predictor in .onnx format
  • Probability thresholds in .json format with the list of thresholds and their corresponding TPRs

On Greenbids side these artefacts are updated every 15 min in the storage. In the classes ModelCache and ThresholdCache we introduce the caching of these artefacts on RTD side with the cache expiration delay of 15 min using Caffeine. When the artefacts are fetched they are put into the cache with their corresponding keys. When the new thread withe the BidRequest is hooked by the module we search the artefacts from the cache.

  • if cache hit then we proceed with prediction and filtering
  • if cache miss then the thread is tryLocked to fetch the artefacts from GCS. If another thread arrives when the mutex is already locked it skips the fetching, skips the filtering and the RTD hook returns the InvocationResult with no_action and BidRequest unchanged

When the artefacts are fetched we craft ThrottlingMessage with the features necessary for inference. The inference time for a single row of data is up to 0.2 ms in normal regime and 3.5 ms at the first call after fetching (warmup).

Communication with AnalyticsReporter and Exploration
The RTD module also communicates the filtering results with the GreenbidsAnalyticsReporter via AnalyticsTags. Here we populate AnalyticsResult of AnalyticsTag for each Imp the with

  • fingerprint (greenbidsId),
  • isKeptInAuction map of booleans for each bidder wether we keep them in auction or not for the given imp,
  • isExploration flag that is necessary to isolate the training log

Then the AnalyticsTag is then parsed by the AnalyticsReporter from ExtBidResponsePrebid and its content added to the analytics payload sent to Greenbids server.

The Exploration part of traffic is split randomly with the ratio defined for each partner publisher per bid requests and is not filtered by the RTD module.

Publisher settings
We have observed 2 options how to handle the settings by the activated partner publishers to avoid adding them to the code base

  • add the list of authorized partners with their configs (target TPR, Exploration Rate, pbuid) in the prebid-config-with-modules.yaml common config of prebid and update it buy sending manually to PBS team
  • let publishers add their configs direclty into bid-request.json where they indicate the activation of our module in bid request extenstion bid-request.ext.prebid.analytics.greenbids and bid-request.ext.prebid.analytics.greenbids-rtd . Here we use analytics extension in the hierarchy of both moduels as there is no RTD extensions in ExtRequestPrebid , so this is a workaround to add both modules configs but to name them differently, something like analytics.greenbids-analytics and. analytics.greenbids-rtd

At the given moment the 2nd option is implemented in the RTD module.

"ext": {
    "prebid": {
      "targeting": {
        "includewinners": true,
        "includebidderkeys": true
      },
      "analytics": {
        "greenbids": {
          "pbuid": "PBUID_FROM_GREENBIDS",
          "greenbidsSampling": 1
        },
        "greenbids-rtd": {
	        "pbuid": "PBUID_FROM_GREENBIDS",
	        "targetTpr": 0.95,
	        "explorationRate": 0.001
        }
      }
    }
  }

🧪 Test plan

We are testing the following cases

  • shouldExitEarlyIfPartnerNotActivatedInBidRequest when the partner is not activated so the extension bidrequest.ext.prebid.analytics.greenbids-rtd is not defined -> we skip filtering and return FutureSucceeded without modifiying BidRequest
  • shouldExitEarlyIfModelIsNotAvailable when ONNX model is not in cache on the partner and GCS bucket or the artefact init is not available -> we skip filtering and return FutureSucceeded without modifiying BidRequest
  • shouldExitEarlyIfThresholdIsNotAvailable same as above
  • shouldNotFilterBiddersAndReturnAnalyticsTagWhenExploration when we are at the exploration, the model should do the inference but the BidRequest stays unchanged. We populate the AnalyticsTag with the inference results and send it to AnalyticsReporter by indicating all bidders in the result as isKeptInAuction = true
  • shouldFilterBiddersBasedOnModelResults the nominal case when the module operates end2end: we filter the bidders and return the update BidRequest with modified bidrequest.imp[].ext.prebid.bidders and send AnalyticsTags to AnalyticsReporter.
  • shouldFilterBiddersBasedOnModelIfAnyFeatureNotAvailable if by any reason some features are not available in the BidRequest (ex: user IP address or device are null) -> we do the inference anyway same as in normal case but with filling the absent features with empty strings

🏎 Quality check

  • Are your changes following our code style guidelines?
  • Are there any breaking changes in your code?
  • Does your test coverage exceed 90%?
  • Are there any erroneous console logs, debuggers or leftover code in your changes?

@EvgeniiMunin EvgeniiMunin changed the title Add Greenbids RTD Module Add Greenbids Real Time Data Module Aug 21, 2024
@EvgeniiMunin EvgeniiMunin marked this pull request as ready for review August 21, 2024 13:59
Copy link
Collaborator

@AntoxaAntoxic AntoxaAntoxic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are definitely not the last comments, but at least a bunch of them might lead to a lot of changes, so it makes sense to continue the review once they are tackled

.expireAfterWrite(cacheExpirationMinutes, TimeUnit.MINUTES)
.build();

final ReentrantLock lock = new ReentrantLock();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding this locking thing my suggestion is following:

  • get rid of locking threads
  • if the value in cache is expired
    • remove the expired data from cache
    • fetch the actual data and put it in cache asynchronously
    • while any other thread sees the cache is empty - skip the module logic
    • the very first fetching might be done on the very first request ideally OR on the app start up alternatively

Copy link
Contributor Author

@EvgeniiMunin EvgeniiMunin Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have removed the reentrant locks. Added CompletableFuture.runAsync to fetch from GCS in case of cache miss + added Executors.newSingleThreadExecutor to make only one thread access GCS (to avoid race condition)

Now the logic is as follows

  • thread -> if cache miss -> fetch from GCS -> cache.put -> return null and skip filtering
  • thread -> if cache hit -> cache.getIfPresent -> return OnnxModelRunner and ThrottlingThresholds -> do filtering

The very first call when cache is empty is handled in the same way as cache miss -> cache.getIfPresent returns null and we fetch the artefacts and skip filtering after

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the data expires in cache, I suppose the cache is invalidated and when we do cache.getIfPresent(cacheKey) it should return null. So it should be handled automatically by cache with expiration ?

Copy link
Collaborator

@AntoxaAntoxic AntoxaAntoxic Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EvgeniiMunin I didn't understand your take on avoiding race conditions. If you point to the cache as a shared resource for threads you shouldn't worry since that's the main point of having in-memory cache where all the fetch data is preserved until expiration. I don't see any race condition problem here.

What I see you created a bottleneck for all the threads that would be waiting until each of them could fetch the data one by one. As I understand the GCS returns the same data for all the requests for 15 min, correct? If so and you want to minimize the number of calls to GCS you should make a synchronization/notification point where you decide whether the GCS should be called or not.

For example, you can use atomic boolean to notify all the threads that one thread started fetching process.

Another question: does the GCS support async execution? It's worth using them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added AtomicBoolean isFetching flag and changed executorService newCachedThreadPool. So now the fetch is not blocking for other threads.

By default the flag is set to false. When the thread is fetching from GCS it is switched to true with compareAndSet. So other threads are not blocked and skip. When the fetching is done the flag is set back to false.

Concerning GCS doc I don't see the functionallity specifically for async download of objects

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EvgeniiMunin Okay, if the GCS doesn't provide any async interface then please wrap it's call with the vertx.executeBlocking. It will uses the vertx-based thread management of PBS. Also it's fine to operate with vertx Futures since the module eventually returns the Future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra/modules/greenbids-real-time-data/pom.xml Outdated Show resolved Hide resolved
throw new PreBidException("Cache was empty, fetching and put artefacts for next request");
}

final GreenbidsInferenceData greenbidsInferenceData = new GreenbidsInferenceData(jacksonMapper, database);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this one is also can be created once

P.S. I've already asked you to check which objects can be created only once, so once more time, double-check the code and define such objects as beans

Copy link
Contributor Author

@EvgeniiMunin EvgeniiMunin Sep 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this one I've modified the class with Builder annotation and staticprepareData that returns the built GreenbidsInferenceData, because the fields are private and final and we need to recreate it at each call with different throttlingMessages, throttlingInferenceRows. So GreenbidsInferenceData has to be defined at each call

I've rechecked other object created at hook call

  • FilterService moved to GreenbidsRealTimeDataConfig as bean
  • GreenbidsInvocationResult. Same as for GreenbidsInferenceData class here we need to rebuilt the object with different private final updatedBidRequest, invocationAction, analyticsResult, so it has to be kept at hook call

Comment on lines 123 to 134
final GreenbidsInferenceData greenbidsInferenceData = GreenbidsInferenceData
.of(bidRequest, dbReader, mapper);

impsBiddersFilterMap = filterService.filterBidders(
onnxModelRunner,
greenbidsInferenceData.getThrottlingMessages(),
threshold);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably make a service bean from the GreenbidsInferenceData with the method like parseData(BidRequest bidRequest) (please come up with better name) that returns List<ThrottlingMessage>

in that way you can remove dbReader from hook's dependencies

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added Service annotation to GreenbidsInferenceDataService and then defined it in config GreenbidsRealTimeDataConfiguration as Bean. Named the method that outputs List<ThrottlingMessage> extractThrottlingMessagesFromBidRequest. Removed dbReader from hook field, as it is added to GreenbidsInferenceDataService bean

Comment on lines 40 to 46
final ModelCache modelCache = new ModelCache(
onnxModelPath,
storage,
gcsBucketName,
modelCacheWithExpiration,
onnxModelCacheKeyPrefix,
vertx);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create each time a new ModelCache is something counter intuitive

create ModelCache bean with the modelCache.get(onnxModelPath, partner.getPbuid())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defined modelCache and ThresholdCache as beans + provided paths at and get methods

@Service
public class GreenbidsInferenceDataService {

@Autowired
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant annotation, since you create bean yourself inside spring config.

Copy link
Contributor Author

@EvgeniiMunin EvgeniiMunin Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed redundant autowired annotation. Also removed removed service annotation from GreenbidsInferenceDataService because otherwise the DatabaseReader bean is not injected properly. Kept GreenbidsInferenceDataService and DatabaseReader as beans in spring config

@EvgeniiMunin
Copy link
Contributor Author

@And1sS @AntoxaAntoxic The fixes on comments are applied, could you please have a look

sample/configs/prebid-config-with-module.yaml Outdated Show resolved Hide resolved
Comment on lines 131 to 133
final Map<String, Ortb2ImpExtResult> analyticsResultFromAnalyticsTag = extractAnalyticsResultFromAnalyticsTag(
bidResponse);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

analyticsResultFromAnalyticsTag == null check happens at least 3 times further in the code, could you check it once, and skip everything that is redundant in that case?

Copy link
Collaborator

@CTMBNara CTMBNara Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have removed one redundant null check in createBidMessage.

Now we have the one in greenbidsId

private String greenbidsId(Map<String, Ortb2ImpExtResult> analyticsResultFromAnalyticsTag) {
        return Optional.ofNullable(analyticsResultFromAnalyticsTag) // <--- null check 1
                .map(Map::values)
                .map(Collection::stream)
                .flatMap(Stream::findFirst)
                .map(Ortb2ImpExtResult::getGreenbids)
                .map(ExplorationResult::getFingerprint)
                .orElseGet(() -> UUID.randomUUID().toString());
    }

And another one in createAdUnit

final Ortb2ImpResult ortb2ImpResult = Optional.ofNullable(analyticsResultFromAnalyticsTag)
                .map(analyticsResult -> analyticsResult.get(imp.getId()))
                .map(Ortb2ImpResult::of)
                .orElse(null);

I prefer to keep 2 checks because of the following

if RTD module is called -> then we populate analyticsResultFromAnalyticsTag and extract it in AnalyticsAdapter

  • here in processEvent greenbidsId = ExplorationResult.getFingerprint
  • then in createAdunit we populate Ortb2ImpResult with ExplorationResult and tid for each adunit

if RTD module is not called but AnalyticsAdapter is enabled -> then in AnalyticsAdapter analyticsResultFromAnalyticsTag = null but it does not mean we skip it. We should still populate the fields as follows

  • greenbidsId is random = UUID.randomUUID()
  • and Ortb2ImpResult = null for each adunit

should still keep the check for greenbidsId and another one for Ortb2ImpResult extracted from analyticsResultFromAnalyticsTag

Comment on lines 68 to 80
private GreenbidsRealTimeDataProcessedAuctionRequestHook target;

private JacksonMapper jacksonMapper;

private Cache<String, OnnxModelRunner> modelCacheWithExpiration;

private Cache<String, ThrottlingThresholds> thresholdsCacheWithExpiration;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we usually mock dependencies

P.S. Waiting for the rest of unit tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added new tests:
FilterServiceTest

  • shouldReturnFilteredBiddersWhenValidThrottlingMessagesProvided
  • shouldThrowPreBidExceptionWhenOrtExceptionOccurs
  • shouldThrowPreBidExceptionWhenThrottlingMessagesIsEmpty
  • shouldThrowPreBidExceptionWhenTensorSizeMismatchOccurs

GreenbidsInferenceDataServiceTest

  • shouldReturnValidThrottlingMessages
  • shouldHandleMissingIp
  • shouldThrowPreBidExceptionOnGeoIpFailure

GreenbidsInvocationServiceTest

  • shouldReturnUpdateBidRequestWhenNotExploration
  • shouldReturnNoActionWhenExploration

GreenbidsRealTimeDataModuleTest

  • shouldHaveValidInitialConfigs

GreenbidsUserAgentTest

  • getDeviceShouldReturnPCForWindowsNT
  • getDeviceShouldReturnDeviceIPhoneWhenIOS
  • getBrowserShouldReturnBrowserNameAndVersionWhenUserAgentIsPresent
  • getBrowserShouldReturnEmptyStringWhenBrowserIsNull

Copy link
Contributor Author

@EvgeniiMunin EvgeniiMunin Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed on Slack, have applied @ExtendWith(MockitoExtension.class) annotation to test classes to mock dependencies

Also have moved the bid request creation into a separate utility class TestBidRequestProvider because its methods are reused in GreenbidsInvocationServiceTest, GreenbidsInferenceDataServiceTest, shouldReturnFilteredBiddersWhenValidThrottlingMessagesProvided

@prebid prebid deleted a comment from CTMBNara Sep 30, 2024
Copy link
Collaborator

@AntoxaAntoxic AntoxaAntoxic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any tests that cover ModelCache, ThresholdCache and OnnxModel classes, the most important ones I believe

after you add them, I believe the hook test can be simplified

@EvgeniiMunin
Copy link
Contributor Author

EvgeniiMunin commented Oct 14, 2024

I don't see any tests that cover ModelCache, ThresholdCache and OnnxModel classes, the most important ones I believe

after you add them, I believe the hook test can be simplified

Hi, have added the tests

ModelCacheTest
	getShouldReturnModelFromCacheWhenPresent
	getShouldSkipFetchingWhenFetchingInProgress
	getShouldFetchModelWhenNotInCache
	getShouldThrowExceptionWhenStorageFails
	getShouldThrowExceptionWhenOnnxModelFails
	getShouldThrowExceptionWhenBucketNotFound

ThresholdCacheTest
	getShouldReturnThresholdsFromCacheWhenPresent
	getShouldSkipFetchingWhenFetchingInProgress
	getShouldFetchThresholdsWhenNotInCache
	getShouldThrowExceptionWhenStorageFails
	getShouldThrowExceptionWhenLoadingJsonFails
	getShouldThrowExceptionWhenBucketNotFound

OnnxModelRunnerTest
	runModelShouldReturnProbabilitiesWhenValidThrottlingInferenceRow
	runModelShouldThrowOrtExceptionWhenNonValidThrottlingInferenceRow

Also added OnnxModelRunnerFactory and ThrottlingThresholdsFactory to be able to do the mocks in tests (otherwise the array of byte[][] is not mockable)

Then removed some redundant test cases from hook test

Remarks on OnnxModelRunnerTest: OrtEnvironment is not very much mockable as when mocking it NPE on OnnxTensor.createTensor is thrown. So I've used the givenOnnxModelRunner to return real object OnnxModelRunner. Also we are testing the wrapper class OnnxModelRunner, so no need in testing the base classes OrtEnvironment and OrtSession.

Copy link
Collaborator

@AntoxaAntoxic AntoxaAntoxic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some of UT fails, please check before pushing

@EvgeniiMunin
Copy link
Contributor Author

EvgeniiMunin commented Oct 15, 2024

some of UT fails, please check before pushing

The errors in UT were due to Unnecessary stubbings detected when mocking GCS storage, bucket, blob. But we have to mock them to be able to properly simulate test cases. So I've added @Mock(strictness = LENIENT) on these objects. Retesting CI

Comment on lines 206 to 212
private Map<String, Ortb2ImpExtResult> extractAnalyticsResultFromAnalyticsTag(BidResponse bidResponse) {
return Optional.ofNullable(bidResponse)
.map(BidResponse::getExt)
.map(ExtBidResponse::getPrebid)
.map(ExtBidResponsePrebid::getModules)
.map(ExtModules::getTrace)
.map(ExtModulesTrace::getStages)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trace log will be available only in debug mode, maybe it is better to use AuctionContext.hookExecutionContext?

Copy link
Contributor Author

@EvgeniiMunin EvgeniiMunin Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced extraction of analytics tags from bid respone to AuctionContext.hookExecutionContext + modified UT

Comment on lines 131 to 133
final Map<String, Ortb2ImpExtResult> analyticsResultFromAnalyticsTag = extractAnalyticsResultFromAnalyticsTag(
bidResponse);
Copy link
Collaborator

@CTMBNara CTMBNara Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 43 to 55
URL url = new URL(properties.getGeoLiteCountryPath());
Path databasePath = Files.createTempFile("GeoLite2-Country", ".mmdb");

try (InputStream inputStream = url.openStream();
FileOutputStream outputStream = new FileOutputStream(databasePath.toFile())) {
inputStream.transferTo(outputStream);
}

try {
return new DatabaseReader.Builder(databasePath.toFile()).build();
} catch (IOException e) {
throw new PreBidException("Failed build DatabaseReader from DB", e);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's conceptually better to create a proxy (or factory) for DatabaseReader that also implements the Initializable interface? The application will still wait for the initialize method to complete before it can start working, but this will be easier to extend if additional logic is needed to download this file.

Besides, I just don't want to download anything while we're constructing beans :p

@And1sS what is your opinion on it?

Copy link
Contributor Author

@EvgeniiMunin EvgeniiMunin Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added DatabaseReaderFactory that implements Initializable. It also contains AtomicReference<DatabaseReader> databaseReaderRef to hold the DatabaseReader.

Provided factory into GreenbidsInferenceDataService in constructor and then get the DatabaseReader at getCountry from atomic reference

Copy link
Collaborator

@Compile-Ninja Compile-Ninja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EvgeniiMunin Thanks for all the work you’ve done! 🎉🚀💪
Finally, we’re ready to merge.
Approved!! ✅

@Compile-Ninja Compile-Ninja merged commit 84b3a70 into prebid:master Nov 12, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants