fix ETLs for scooter companies #78

Closed
hunterowens opened this issue Sep 18, 2019 · 13 comments

Comments

@hunterowens
Contributor

The upgrade to 0.3.0 has made things less stable than expected.

Review and backfill data for all companies going back to 09-01-2019

@ian-r-rose
Contributor

I don't have an encyclopedic knowledge of this DAG at the moment, but I see a few things going on with recent failures:

  1. With Jump, the initial API request is failing, possibly because it is hitting the wrong endpoint for 0.3.x: https://api.uber.com/v0.2/. Note that some of the failures have been marked as successes, which should probably be fixed.
  2. With Bird, the temp tables for status changes seem to be getting the wrong schema. In particular, associated_trip is getting marked as a double precision column. I'm not sure where this is happening (mds_provider?).
  3. With Spin, we are also getting failing API requests: {"error":"bad_request","error_description":"Invalid parameters","error_details":["start_time: 1568703600000 is invalid","end_time: 1568746800000 is invalid","time period specified must be less than or equal to 14 days. Is this also hitting the wrong endpoint? I see a v1 in the URL.
  4. With Bolt we are getting 404 errors. Has there been a change in the endpoint? There are also some false passes in the jobs.
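
As a quick sanity check on the Spin error in point 3, the two timestamps can be decoded to see how long the requested window actually is. This is only a sketch, and it assumes the values are Unix epoch milliseconds (the helper name `ms_to_utc` is made up for illustration):

```python
from datetime import datetime, timedelta, timezone


def ms_to_utc(ms: int) -> datetime:
    """Interpret an integer as Unix epoch milliseconds (an assumption)."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Values copied from the Spin error message above.
start = ms_to_utc(1568703600000)
end = ms_to_utc(1568746800000)

window = end - start
within_limit = window <= timedelta(days=14)

print(start.isoformat(), end.isoformat(), window, within_limit)
```

If that reading of the timestamps is right, the window is well under 14 days, so the rejection is more likely about the parameter format or the endpoint version than about the length of the query.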

@ian-r-rose
Contributor

With regards to point 2 above: I suspect the issue is in mds_provider.mds.db.loader where there is a DataFrame.to_sql. If pandas made the wrong inference about the dtype when creating the temp table (via sqlalchemy), that could cause the issue.

@thekaveman

@ian-r-rose correct on point 2: this is a side effect of DataFrame.to_sql. However, the point of those temp tables is that we don't necessarily care about the schema, right? At the time the data moves from temp to status_changes, we enforce the types.

@ian-r-rose
Contributor

@thekaveman thanks for taking a look. I think we do care about the schema of the temp table. At least some of the time, pandas seems to be making the inference that associated_trip is a floating point number, rather than a string. If it sees a string, it can happily make the cast to UUID; if it sees a number, it can't.

It seems to me that an appropriate fix would be to set the dtypes of the dataframe before calling DataFrame.to_sql.
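
A minimal sketch of what that could look like, assuming the bad inference happens when a batch contains no trip IDs at all (so the column is all NaN). The sample frame and the helper `force_text` are hypothetical, not part of mds-provider:

```python
import numpy as np
import pandas as pd

# Hypothetical status_changes batch in which no event has an associated trip.
df = pd.DataFrame({
    "device_id": ["a1", "b2"],
    "associated_trip": [np.nan, np.nan],
})

# With only NaN to look at, pandas falls back to a float column.
inferred = str(df["associated_trip"].dtype)


def force_text(frame: pd.DataFrame, columns) -> pd.DataFrame:
    """Return a copy with the given columns cast to pandas' string dtype."""
    out = frame.copy()
    for col in columns:
        out[col] = out[col].astype("string")
    return out


fixed = force_text(df, ["associated_trip"])
print(inferred, fixed["associated_trip"].dtype)
```

The missing values stay missing after the cast; only the column's declared type changes, which is what to_sql looks at when it creates the table.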

@thekaveman

@ian-r-rose makes sense... there is a certain amount of dtype fixing happening in mds.db.Database.load_status_changes. This can be further customized via the before_load param on that method:

before_load: callable(df=DataFrame, version=Version): DataFrame, optional
    Callback executed on the incoming DataFrame and Version.
    Should return the final DataFrame for loading.

And similarly for load_trips method (and the underlying load method).

It looks like the library already marks the associated_trip column as dtype object.

@ian-r-rose
Contributor

> It looks like the library already marks the associated_trip column as dtype object.

Thanks for the link. AFAIK object is more-or-less a shrug from pandas, so it should still need to make a choice when it inserts the data into the temp table.

An alternative to setting the dtype on the dataframe would be to set it to the appropriate sqlalchemy types in to_sql, as described here.
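
Roughly, that alternative might look like the sketch below. It uses an in-memory SQLite engine as a stand-in for the real database, and the table names and sample frame are made up for illustration; the only thing taken from pandas itself is the `dtype` argument of `DataFrame.to_sql`:

```python
import numpy as np
import pandas as pd
from sqlalchemy import create_engine, inspect
from sqlalchemy.types import Text

# In-memory SQLite as a stand-in; the real ETL targets a different database.
engine = create_engine("sqlite://")

# Hypothetical batch where associated_trip is entirely missing.
df = pd.DataFrame({
    "device_id": ["a1", "b2"],
    "associated_trip": [np.nan, np.nan],
})

# Let pandas pick the column type on its own.
df.to_sql("temp_default", engine, index=False)

# Tell pandas which SQL type to use for the problem column.
df.to_sql("temp_typed", engine, index=False, dtype={"associated_trip": Text()})


def column_type(table: str, column: str) -> str:
    """Look up the declared SQL type of one column via reflection."""
    cols = inspect(engine).get_columns(table)
    return next(str(c["type"]) for c in cols if c["name"] == column)


default_type = column_type("temp_default", "associated_trip")
typed_type = column_type("temp_typed", "associated_trip")
print(default_type, typed_type)
```

This keeps the DataFrame untouched and pushes the decision to table creation, which may be easier to maintain if the list of UUID-like columns lives next to the database code.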

@hunterowens
Contributor Author

9/25 Update:

Opened PR CityofSantaMonica/mds-provider#82 to fix silent errors issue

Fixed config.json for JUMP and SPIN.

Remaining issues: Bolt and Bird (see CityofSantaMonica/mds-provider#83), plus an email to Bolt.

@ian-r-rose
Contributor

The new Bolt URLs seem to be okay, but we are now getting 401 errors saying that the token failed to parse. Is the token up to date?

@ian-r-rose
Contributor

I am also seeing authorization errors for Bird now.

@ian-r-rose
Contributor

Okay, so it looks to me like Bolt tokens expire after a few minutes, so we can't store them directly in config.json. Instead, we would have to regenerate them before every use. I'm not up to date on the latest requirements for MDS companies, but this feels like something we should loop in the new DOT folks on.

@thekaveman

Does Bolt use an OAuth client_credentials grant flow like JUMP? (recommended in the spec)

mds-provider supports token renewal using a config like:

"provider name": {
    "client_id": "client id here",
    "client_secret": "client secret here",
    "scope": "scope(s) here",
    "token_url": "OAuth token refresh URL here"
}
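
For illustration only, here is a rough sketch of what renewing a short-lived token from a config like that involves. It is not how mds-provider implements it. The names `build_token_request` and `TokenCache` are hypothetical, the response fields `access_token` and `expires_in` are the standard OAuth 2.0 ones, and some providers expect the client credentials in a Basic auth header rather than in the form body:

```python
import time
import urllib.parse
import urllib.request


def build_token_request(config: dict) -> urllib.request.Request:
    """Build (but do not send) a client_credentials token request."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": config["client_id"],
        "client_secret": config["client_secret"],
        "scope": config["scope"],
    }).encode("utf-8")
    return urllib.request.Request(config["token_url"], data=body, method="POST")


class TokenCache:
    """Reuse a token until shortly before it expires, then fetch a new one."""

    def __init__(self, fetch, clock=time.monotonic, margin=30):
        self._fetch = fetch      # callable returning the parsed token response
        self._clock = clock      # injectable so expiry can be tested offline
        self._margin = margin    # seconds of safety before the stated expiry
        self._token = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            payload = self._fetch()
            self._token = payload["access_token"]
            self._expires_at = now + payload.get("expires_in", 0) - self._margin
        return self._token
```

In real use, `fetch` would send the request from `build_token_request` and parse the JSON response; keeping it injectable means the expiry logic can be exercised without hitting a provider's token URL.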

@hunterowens
Contributor Author

tagging @RMK0110

@hunterowens
Contributor Author

completed!


3 participants