Metadata Query Feature for ConnectorX #704
pangjunrong
started this conversation in
Ideas
Replies: 1 comment
-
If this makes sense, I can look into implementing it and raising a PR for the change. At the moment, I'm using a workaround by placing a dummy metadata query at the start of my list of queries and removing the first row of the resultset when it is returned. It works but I figured maybe this approach can be integrated given that DBs like Oracle often choose terrible execution plans with complex queries. |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Hey @wangxiaoying, so I'm noticing that the fetching metadata process uses whatever query is passed into connector-x and for cases with a list of queries, the first is chosen for the task.
While this works fine for a generic
SELECT * FROM TABLE
query, there is substantial overhead when a user passes a query which performs full table scans. For example, if a user wants to provide a list of queries because they are performing the partition logic themselves by constructing each query with specific partition conditions, the fetching metadata process will append aLIMIT 1
or equivalent clause to it.Using Oracle as an example, if the first query was something like the below where the user wants to fetch the first partition as 5 million rows:
SELECT * FROM (SELECT SQ.*, ROWNUM AS RN FROM TABLE WHERE ROWNUM < 5000000) WHERE RN >= 1
The metadata query will execute:
SELECT * FROM (SELECT SQ.*, ROWNUM AS RN FROM TABLE WHERE ROWNUM < 5000000) WHERE RN >= 1) FETCH FIRST 1 ROW ONLY
As such, even before the actual list of queries is executed to retrieve data, the metadata fetching process takes a long time to execute given that at best, it still needs to scan 5 million rows first before fetching the first row for the schema since the optimizer for Oracle will retrieve the full resultset of the inner query first.
I was wondering if we could have an optional argument for the
read_sql
method to specify a custom metadata query? This way, even if users have complex queries, they can explicitly override the internal logic for fetching metadata to use a more optimized query instead. So in the case above, we would have something like this for the arguments:conn = 'some connection string', query = ['query1', 'query2', 'query3'], metadata_query = 'SELECT * FROM TABLE'
Beta Was this translation helpful? Give feedback.
All reactions