
Suggestion: Could ingest_* methods return something #191

Open

thenickg opened this issue Dec 11, 2019 · 3 comments

Comments

@thenickg commented Dec 11, 2019

Is your feature request related to a problem? Please describe.
The ingest_* methods make it difficult for developers to write robust code. They currently fail silently even when the ingestion ends up in ".show ingestion failures". Users have no way of monitoring this, because all of the underlying IDs are hidden from them, so there is nothing to check against that table.

Describe the solution you'd like
Have the ingest_* methods return something useful so that failures can be handled programmatically, ideally mimicking the more robust C# API.

Describe alternatives you've considered
I guess KIT?

edit1: So it appears our friend the KIT library has a pattern for ingestion tag monitoring: https://github.com/Azure/azure-kusto-ingestion-tools/blob/a2a256a09a66aacfe9c4756b2b0b457014013c4a/kit/kit/backends/kusto.py#L183

The problem with ingestion tag monitoring, though, is that it can only identify successful rows, at least as it is currently implemented in KIT. That said, the API that KIT offers seems much more full-featured. Why aren't some of KIT's API features available in the ingestion library? Most of KIT's methods seem to lend themselves well to it. KIT makes sense standalone for the CLI and data schema inference, but it also contains some nice, repeatedly needed patterns that anyone using the ingestion library would eventually have to write themselves.

edit2: Hmm, KIT is pretty tangled up with its manifest machinery. Could the ingest methods at least return the URL?

url = blob_service.make_blob_url(container_details.object_name, blob_name, sas_token=container_details.sas)
This would at least allow users to inspect the failures table for their specific error by filtering on the random GUID that the dataframe/file helper methods embed in the blob name. That would effectively be the unique key.
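
Something like the following is what I have in mind (just a sketch; it assumes a hypothetical return value from ingest_from_dataframe, the usual ingest_client/df/props setup, and that ".show ingestion failures" exposes the source path in its IngestionSourcePath column; import paths also differ between SDK versions):

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_aad_device_authentication("https://mycluster.kusto.windows.net")
query_client = KustoClient(kcsb)

# hypothetical: ingest_from_dataframe returns the blob URL the data was staged to
blob_url = ingest_client.ingest_from_dataframe(df, props)

# the failures table strips secrets, so match on the path without the SAS token
path = blob_url.split("?")[0]
command = '.show ingestion failures | where IngestionSourcePath startswith "{}"'.format(path)
for row in query_client.execute("MyDatabase", command).primary_results[0]:
    print(row["FailedOn"], row["Details"])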

@danield137 (Contributor)

Hi @thenickg ,

First, thanks for looking into this.
We do want to make your life easier, so let's explore this idea further. We are aware that getting the exact status can be a pain right now, and we are constantly trying to improve it, so feedback is very welcome.

As a Big Data service that allows ingestion of massive data loads, we have several considerations here:

  1. We have no way of telling whether the client will ingest many batches of small (~1 KB) data (which will be aggregated prior to ingestion for query efficiency) or a small number of massive files (~1 GB). Status reporting for these cases is very different.
  2. There are sync options for ingestion (StreamingIngest) which have performance implications, but let you know about failures right away (see the sketch after this list).
  3. The current methods implemented in the C# API are:
    Queue: lets you listen to a queue that holds statuses, but leaves a lot of the logic of figuring out what happened to the end user.
    Table: doesn't work well under heavy loads (10K ingestions per second).
    Both of these are imperfect.
  4. KIT is very useful for playing around with ingestions, but was never intended to be a replacement for the ingest package.
  5. I should note that we are working on a new monitoring feature which will be robust and may, sometime in the far future, be integrated into our SDKs (no timeline currently).
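
For reference, a minimal sketch of the sync option from item 2 (not official documentation; it assumes the Python SDK's KustoStreamingIngestClient, and import paths and parameter names vary between SDK versions):

import pandas as pd
from azure.kusto.data import KustoConnectionStringBuilder
from azure.kusto.data.exceptions import KustoServiceError
from azure.kusto.ingest import KustoStreamingIngestClient, IngestionProperties

# streaming ingestion talks to the engine endpoint, not the "ingest-" endpoint
kcsb = KustoConnectionStringBuilder.with_aad_device_authentication("https://mycluster.kusto.windows.net")
client = KustoStreamingIngestClient(kcsb)
props = IngestionProperties("MyDatabase", "MyTable")
df = pd.DataFrame({"value": [1, 2, 3]})

try:
    # the call is synchronous, so a failure surfaces immediately as an exception
    client.ingest_from_dataframe(df, props)
except KustoServiceError as e:
    print("ingestion failed:", e)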

Having said that, I agree that the ingestion tags monitoring feature is a nice trick. It too has performance implications (if you dig deeper into how indexing works in Azure Data Explorer, you will see that tags cause extents not to be merged, which reduces the gain from indexing large files).
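
For completeness, the tag trick looks roughly like this (a sketch only; the tag parameter is ingestByTags in the 0.0.x releases and ingest_by_tags in later ones, and the import paths vary between SDK versions as well):

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.ingest import KustoIngestClient, IngestionProperties

engine_kcsb = KustoConnectionStringBuilder.with_aad_device_authentication("https://mycluster.kusto.windows.net")
ingest_kcsb = KustoConnectionStringBuilder.with_aad_device_authentication("https://ingest-mycluster.kusto.windows.net")

tag = "batch-2019-12-11"  # any unique string works as the ingest-by tag
props = IngestionProperties("MyDatabase", "MyTable", ingestByTags=[tag])
KustoIngestClient(ingest_kcsb).ingest_from_file("data.csv", ingestion_properties=props)

# once the batch has been processed, the tagged extents are discoverable from the engine
query = "MyTable | extend tags = extent_tags() | where tags contains 'ingest-by:{}' | count".format(tag)
print(KustoClient(engine_kcsb).execute("MyDatabase", query).primary_results[0])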

Now, considering all the information above, I can think of several ideas that might help:

  1. Add a direct ingest method which returns the operation id and lets you check its status. This is a very limited method, as it won't scale well (no error handling, throttling, and such).
  2. Return a result object that will at least be able to show ingestion failures, as you suggested, by filtering on the blob URL (a rough sketch of what such an object could look like follows this list).
  3. Add a utility method to allow for monitored ingest (with ingest by tags)
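
To make idea 2 concrete, the returned object might look something like this (purely hypothetical; nothing like it exists in the SDK today, and the IngestionSourcePath matching is only an assumption about how it could be wired up):

from dataclasses import dataclass

@dataclass
class IngestionResult:
    """Hypothetical value returned by the ingest_* methods."""
    database: str
    table: str
    source_id: str  # the GUID generated for this ingestion
    blob_url: str   # where the payload was staged

    def show_failures(self, query_client):
        """Look this ingestion up in '.show ingestion failures' by its source id."""
        command = '.show ingestion failures | where IngestionSourcePath has "{}"'.format(self.source_id)
        return query_client.execute(self.database, command).primary_results[0]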

What do you think?

@thenickg (Author)

All of those considerations are super important to balance :) Nice to see that the monitoring feature is coming to keep increasing Kusto's GA value.

With respect to the suggestions,

  1. Direct ingest would be nice! Especially for smaller data, or data where you want heavy validation, since the programmer would be handling the cases that queued ingestion handles automatically. For my use case I'd probably still use queued ingestion because of the niceties it offers.
  2. I think this is exactly what I was looking for from this suggestion. At a minimum, having a struct-like object with metadata about your ingestion would help my cause. Even better would be helper methods that automatically subscribe to the KustoIngestStatusQueues and lazily return the success/failure status for that individual ingestion. See the code below for what I mean:
import time

# hypothetical: ingest_from_dataframe returns a handle describing this ingestion
ingest_obj = client.ingest_from_dataframe(df, props)
qs = KustoIngestStatusQueues(client)
# hypothetical helper: watch the status queues for this ingestion in the background
future_obj = qs.monitor_ingestion(ingest_obj)
# user can still do other things while the ingest is monitored in the background
while not future_obj.done():
    time.sleep(1)
ingest_results = future_obj.results()

This would help users manage monitoring in a targeted way, allowing programmers to submit multiple ingestions, still do other work, and then await the results of all of those ingestions as detailed response objects. Combined with an optional ingest-by tag, that would make for a detailed status-monitoring API. Even if this doesn't get implemented, just having an ingest_obj would be nice, but a programmer can dream :)

  3. This would be super useful for validating that each row was ingested properly given the mapping, especially if the data changes without the developer handling it. It could be wrapped into the ingestion_properties, and integrating it with the pattern above would return extra information for monitored ingests.

So all those suggestions sound great!

In the meantime, for anyone else curious about solving this problem: I'm going to use ingest_from_blob, hold the URL of the blob in memory, and use that as the ID when matching messages from the StatusQueue.
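
A sketch of that workaround (illustrative only; import paths, parameter casing, and status-message field names vary between SDK versions):

import time
from azure.kusto.data import KustoConnectionStringBuilder
from azure.kusto.ingest import KustoIngestClient, IngestionProperties, BlobDescriptor, ReportLevel, ReportMethod
from azure.kusto.ingest.status import KustoIngestStatusQueues

kcsb = KustoConnectionStringBuilder.with_aad_device_authentication("https://ingest-mycluster.kusto.windows.net")
ingest_client = KustoIngestClient(kcsb)

# ask the service to report statuses to the Azure queues that KustoIngestStatusQueues reads
props = IngestionProperties(
    "MyDatabase",
    "MyTable",
    reportLevel=ReportLevel.FailuresAndSuccesses,
    reportMethod=ReportMethod.Queue,
)

blob_url = "https://myaccount.blob.core.windows.net/mycontainer/myblob.csv.gz?<sas>"
ingest_client.ingest_from_blob(BlobDescriptor(blob_url, 1024), ingestion_properties=props)

# poll the success/failure queues and match messages back to our blob by URL
# (the service strips the SAS token, so compare the path only; note that pop()
# removes messages from the shared queues, so be careful in multi-writer setups)
qs = KustoIngestStatusQueues(ingest_client)
pending = {blob_url.split("?")[0]}
while pending:
    for msg in qs.success.pop(32) + qs.failure.pop(32):
        path = msg.IngestionSourcePath.split("?")[0]
        if path in pending:
            pending.discard(path)
            print(msg.Table, getattr(msg, "Details", "succeeded"))
    time.sleep(5)  # add a timeout/backoff in real code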

@danield137 (Contributor) commented Dec 14, 2019

@thenickg thanks for the input, and for adding your solution for future reference. I'm going to leave this open for now so that we can consider internally when and what we want to invest in. If you are interested, you can always submit a PR 😄.
