mltable produces key error when trying to consume sdk v1 dataset type data with provided microsoft consume code #38944

Open
Bartcardi opened this issue Dec 19, 2024 · 3 comments
Assignees
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Machine Learning needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that Service Attention Workflow: This issue is responsible by Azure service team.

Comments

@Bartcardi

  • Package Name: mltable
  • Package Version: 1.6.1
  • Operating System: Ubuntu 20.04
  • Python Version: 3.10.14

Describe the bug
We are trying to consume a data asset from Azure Machine Learning studio that has type Table but an underlying dataset type of Tabular (see the first image under Screenshots). We used the Microsoft-supplied example code for reading this asset into a pandas DataFrame via an MLTable object. The call to mltable.load fails with a KeyError because the key 'paths' is missing, as shown in the error trace below.

Full error trace
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[3], line 8
      5 ml_client = MLClient.from_config(credential=DefaultAzureCredential())
      6 data_asset = ml_client.data.get("Energie_Aansluitingen_Current_1000", version="1")
----> 8 tbl = mltable.load(f'azureml:/{data_asset.id}')
     10 df = tbl.to_pandas_dataframe()
     11 df

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/azureml/dataprep/api/_loggerfactory.py:279, in track.<locals>.monitor.<locals>.wrapper(*args, **kwargs)
    277 with _LoggerFactory.track_activity(logger, func.__name__, activity_type, custom_dimensions) as activityLogger:
    278     try:
--> 279         return func(*args, **kwargs)
    280     except Exception as e:
    281         if hasattr(activityLogger, ACTIVITY_INFO_KEY) and hasattr(e, ERROR_CODE_KEY):

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mltable/mltable.py:600, in load(uri, storage_options, ml_client)
    547 @track(_get_logger,activity_type=_PUBLIC_API, custom_dimensions={'app_name': _APP_NAME})
    548 def load(uri, storage_options: dict = None, ml_client= None):
    549     """
    550     Loads the MLTable file (YAML) present at the given uri.
    551 
   (...)
    598     :rtype: mltable.MLTable
    599     """
--> 600     return _load(uri, storage_options, True, ml_client)

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/azureml/dataprep/api/_loggerfactory.py:279, in track.<locals>.monitor.<locals>.wrapper(*args, **kwargs)
    277 with _LoggerFactory.track_activity(logger, func.__name__, activity_type, custom_dimensions) as activityLogger:
    278     try:
--> 279         return func(*args, **kwargs)
    280     except Exception as e:
    281         if hasattr(activityLogger, ACTIVITY_INFO_KEY) and hasattr(e, ERROR_CODE_KEY):

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mltable/mltable.py:706, in _load(uri, storage_options, enable_validate, ml_client)
    704     return mltable_loaded
    705 except Exception as ex:
--> 706     _reclassify_rslex_error(ex)

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/azureml/dataprep/api/mltable/_validation_and_error_handler.py:90, in _reclassify_rslex_error(err)
     87 if 'ExecutionError(StreamError(PermissionDenied' in err_msg:
     88     raise UserErrorException(
     89         f'Getting permission error please make sure proper access is configured on storage: {err_msg}')
---> 90 raise err

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mltable/mltable.py:698, in _load(uri, storage_options, enable_validate, ml_client)
    696 # v1 sql dataset doesnt have paths
    697 if og_path_pairs is None:  # may have been set in _load_mltable_from_data_asset_uri
--> 698     mltable_dict, og_path_pairs = _make_all_paths_absolute(mltable_dict, base_path)
    699 mltable_loaded = MLTable._create_from_dict(mltable_yaml_dict=mltable_dict,
    700                                             path_pairs=og_path_pairs,
    701                                             load_uri=load_uri)
    702 mltable_loaded._workspace_context = _parse_workspace_context_from_longform_uri(load_uri)

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/mltable/_utils.py:74, in _make_all_paths_absolute(mltable_yaml_dict, base_path)
     72     mltable_yaml_dict[_PATHS_KEY] = list(map(lambda x: x[1], path_pairs))
     73 else:
---> 74     path_pairs = list(tuple(zip(mltable_yaml_dict[_PATHS_KEY], mltable_yaml_dict[_PATHS_KEY])))
     75 return mltable_yaml_dict, path_pairs

KeyError: 'paths'
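
The last frame of the trace can be reproduced in isolation. The sketch below is a stand-in, not mltable's code: only the indexing expression is copied from the trace (`_utils.py:74`), while the function name `make_path_pairs` and the sample dictionary are hypothetical. It shows that indexing a dictionary that has a `query_source` entry but no `paths` entry raises the same error.

```python
# Key name taken from the KeyError message in the trace.
_PATHS_KEY = "paths"


def make_path_pairs(mltable_yaml_dict):
    """Hypothetical stand-in for the expression that fails at _utils.py:74."""
    return list(tuple(zip(mltable_yaml_dict[_PATHS_KEY], mltable_yaml_dict[_PATHS_KEY])))


# Sample input (made up): a query-based definition with no 'paths' entry.
query_only = {"query_source": {"query": "SELECT 1"}}

try:
    make_path_pairs(query_only)
    raised = False
    missing_key = None
except KeyError as err:
    raised = True
    missing_key = err.args[0]

print(raised, missing_key)
```

With a dictionary that does contain `paths`, the same expression returns each path paired with itself.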

To Reproduce

  1. Set up an Azure SQL database datastore in Azure ML studio.
  2. Create a data asset from the datastore using a SQL statement, and make sure it can connect and returns data.
  3. Try to consume the data asset for interactive development using the Microsoft-supplied snippet in the data asset section of Azure ML (see the second screenshot).
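
For reference, the consume code in step 3 can be reconstructed from the cell shown in the error trace. This is a sketch under assumptions: the trace only shows cell lines 5 to 11, so the imports are assumed, and the wrapper function `load_asset_as_dataframe` is hypothetical. Running it requires a workspace config file and valid credentials.

```python
def load_asset_as_dataframe(name: str, version: str):
    """Reproduce the SDK v2 consume steps visible in the error trace."""
    # Assumed imports: cell lines 1-4 are not visible in the trace.
    import mltable
    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    ml_client = MLClient.from_config(credential=DefaultAzureCredential())
    data_asset = ml_client.data.get(name, version=version)

    # In this report, this call raises KeyError: 'paths' for the SQL-backed asset.
    tbl = mltable.load(f"azureml:/{data_asset.id}")
    return tbl.to_pandas_dataframe()
```

In the report, the asset name is "Energie_Aansluitingen_Current_1000" and the version is "1".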

Expected behavior
I expected to end up with a pandas dataframe.

Screenshots

[Image: screenshot of the data asset in Azure ML]

[Image: screenshot of the consume code snippet]

[Image: screenshot of the error trace]

Additional context
We run this code on a Standard_DS12_v2 (4 cores, 28 GB RAM, 56 GB disk) compute instance with:

azure-ai-ml==1.23.0
azure-identity==1.18.0
@github-actions github-actions bot added customer-reported Issues that are reported by GitHub users external to the Azure organization. needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Dec 19, 2024
@kristapratico kristapratico added Machine Learning Service Attention Workflow: This issue is responsible by Azure service team. Client This issue points to a problem in the data-plane of the library. and removed needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. labels Dec 19, 2024
@github-actions github-actions bot added the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Dec 19, 2024

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @Azure/azure-ml-sdk @azureml-github.

@Bartcardi

Bartcardi commented Dec 24, 2024

By the way, this is what mltable_yaml_dict looks like in our case. Note that it has a query_source entry and no paths entry:

mltable_yaml_dict = {
    "query_source": {
        "handler": "AmlDatastore",
        "query": "SELECT TOP (1000) [Extern_Energie_Aansluitingen_HashKey] ,[EAN] ,[Product] ,[Status] ,[Locatie] ,[Bouwdeel] ,[Adres] ,[Postcode] ,[Plaats] ,[Segment] ,[GTV] ,[GeldigVan] ,[GeldigTm] ,[ETLLoaddate] ,[ETLRecordValidFrom] FROM [history].[tbl_Extern_Energie_Aansluitingen_current]",
        "handler_arguments": {
            "subscription": "<SUBSCRIPTION_ID>",
            "resource_group": "<RESOURCE_GROUP>",
            "workspace_name": "<WORKSPACE_NAME>",
            "datastore_name": "zorgcontrol",
        },
    },
    "transformations": [
        {
            "convert_column_types": [
                {
                    "columns": "Extern_Energie_Aansluitingen_HashKey",
                    "column_type": "string",
                },
                {"columns": "EAN", "column_type": "string"},
                {"columns": "Product", "column_type": "string"},
                {"columns": "Status", "column_type": "string"},
                {"columns": "Locatie", "column_type": "string"},
                {"columns": "Bouwdeel", "column_type": "string"},
                {"columns": "Adres", "column_type": "string"},
                {"columns": "Postcode", "column_type": "string"},
                {"columns": "Plaats", "column_type": "string"},
                {"columns": "Segment", "column_type": "string"},
                {"columns": "GTV", "column_type": "string"},
            ]
        }
    ],
}
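
A quick way to tell which kind of definition a dictionary like this holds is to check its top-level keys. The helper below is a hypothetical illustration, not part of mltable, and the sample dictionary is a shortened stand-in with the same top-level shape as the one posted above.

```python
def describe_source(mltable_yaml_dict):
    """Report whether the definition is path-based or query-based (hypothetical helper)."""
    if "paths" in mltable_yaml_dict:
        return "paths"
    if "query_source" in mltable_yaml_dict:
        return "query_source"
    return "unknown"


# Shortened stand-in: same top-level keys as the posted dict, values trimmed.
posted_shape = {
    "query_source": {"handler": "AmlDatastore", "query": "SELECT 1"},
    "transformations": [{"convert_column_types": []}],
}

print(describe_source(posted_shape))
print(sorted(posted_shape))
```

Since there is no `paths` key at the top level, any code that indexes `mltable_yaml_dict["paths"]` unconditionally fails on this input.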

@bhathiya-pilanawithana

bhathiya-pilanawithana commented Dec 25, 2024

Encountered the same problem. In addition to the points mentioned above, the code snippet provided for consuming the data with SDK v1 works fine in the same scenario; only the SDK v2 snippet has the issue (at least in my case).
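
The SDK v1 snippet itself is not quoted in this thread. As a rough sketch of the usual v1 consume pattern, assuming the `azureml-core` package is installed and a workspace config file is available, it looks something like the following. The wrapper function `load_v1_dataset_as_dataframe` is hypothetical.

```python
def load_v1_dataset_as_dataframe(name: str):
    """Load a registered dataset with SDK v1 and return it as a pandas DataFrame."""
    # Assumption: azureml-core (SDK v1) is installed in the environment.
    from azureml.core import Dataset, Workspace

    workspace = Workspace.from_config()
    # Without an explicit version, the latest registered version is used.
    dataset = Dataset.get_by_name(workspace, name=name)
    return dataset.to_pandas_dataframe()
```

This goes through the v1 dataset API rather than `mltable.load`, so it does not touch the code path shown in the error trace.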
