ADLS Gen2 Implementation Guidance
Azurite is an open-source Azure Storage API compatible server (emulator). It currently supports the Blob, Queue, and Table services. We have received many customer requests for ADLS Gen2 support in Azurite through many channels, including but not limited to GitHub issues, email, and requests from interested teams at Microsoft.
We have received 2 PRs (PR1, PR2) submitted by the community to implement ADLS Gen2 in Azurite. However, we are unable to merge them at this time since they do not meet our expectations and merge bar.
Azurite welcomes contributions. To better cooperate with the community on an implementation of ADLS Gen2 in Azurite, this document gives the details of the plan we suggest for implementing ADLS Gen2 in Azurite, and our expectations for community submissions that we can accept as PRs.
It's very important to understand the ADLS Gen2 feature before implementing it in Azurite.
Azure Data Lake Storage Gen2 (aka ADLS Gen2) is a set of capabilities dedicated to big data analytics, built on Azure Blob Storage.
A normal Azure Storage account uses a flat namespace (FNS).
Users can provision a hierarchical namespace (HNS) storage account by creating a storage account with HNS enabled, or by migrating an existing storage account from FNS to HNS. However, users can't revert HNS accounts back to FNS.
HNS is a key feature that Azure Data Lake Storage Gen2 provides:
- High-performance data access at object storage scale and price.
- Atomic directory manipulation.
- A familiar, file-system-like interface.
Azure Data Lake Storage Gen2 is primarily designed to work with Hadoop and all frameworks that use HDFS as their data access layer. A new DFS endpoint is introduced with ADLS Gen2.
The DFS endpoint is available on both HNS and FNS accounts. Per our tests, the DFS REST API behaves differently on FNS accounts than on HNS accounts, and the Blob REST API also behaves differently on FNS/HNS accounts. The REST API docs already cover some of the differences.
| | FNS account | HNS account |
|---|---|---|
| Blob Endpoint API | Already implemented in Azurite. (The customer PR1 is mostly refining and revising the blob API implementation.) | Not in Azurite; Phase II in the plan below. Behavior is similar to FNS blob, but with small differences in the API and in performance. |
| DFS Endpoint API | Not in Azurite; Phase I in the plan below. Supports most DFS APIs, but some actions are not supported, and API behavior, performance, and atomicity differ. | Not in Azurite; Phase II in the plan below. Supports all DFS APIs, including ACL/permission support. (The customer PR2 is mostly on these APIs.) |
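Both endpoints ultimately address the same data. As a rough sketch (assuming Azurite's path-style URLs, where the account name is the first path segment; this is illustrative and not Azurite's actual parser), a DFS request path maps to blob-service concepts as follows: the first segment after the account is the filesystem (= blob container), and the remainder is the file/directory path (= blob name).

```typescript
// Hypothetical mapping of an Azurite path-style DFS URL path to blob concepts:
// /{account}/{filesystem}/{path...} -> filesystem = blob container,
// remaining path = blob name. Illustrative only.
function parseDfsPath(urlPath: string): {
  account: string;
  fileSystem: string;
  path: string;
} {
  const segments = urlPath.replace(/^\/+/, "").split("/");
  return {
    account: segments[0] ?? "",
    fileSystem: segments[1] ?? "",
    path: segments.slice(2).join("/"),
  };
}
```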
- This should be much simpler than the HNS account implementation.
- Current Azurite is based on FNS accounts, so:
  - No need to change the data store structure.
  - No need for Azurite users to distinguish HNS/FNS accounts.
- The change will add all DFS API interfaces to Azurite, which will help support Phase II.
- Code changes should be split into several small PRs as follows:
  - 1 PR to add the DFS swagger and the auto-generated API interface (no manual changes to auto-generated code)
  - 1 PR to add the DFS endpoint
  - Several PRs to implement each DFS API (with credential handler), including testing
- Need to make sure each API behavior is aligned with the REST API doc, and also aligned with the real Azure Storage server. See more in the validation criteria.
- The blob/DFS endpoints should share the same data store and talk to the same instance of BlobLokiMetadataStore & BlobSqlMetadataStore.
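To illustrate the shared-store requirement, here is a minimal sketch with a hypothetical in-memory store standing in for BlobLokiMetadataStore/BlobSqlMetadataStore: because both handlers hold the same store instance, an item created through the DFS handler is immediately visible through the blob handler.

```typescript
// Illustrative only: a toy metadata store shared by both endpoint handlers.
interface MetadataStore {
  createItem(container: string, name: string): void;
  listItems(container: string): string[];
}

class InMemoryMetadataStore implements MetadataStore {
  private items = new Map<string, string[]>();
  createItem(container: string, name: string): void {
    const list = this.items.get(container) ?? [];
    list.push(name);
    this.items.set(container, list);
  }
  listItems(container: string): string[] {
    return this.items.get(container) ?? [];
  }
}

// Both handlers receive the SAME store instance, never separate copies.
const sharedStore: MetadataStore = new InMemoryMetadataStore();
const blobHandler = { store: sharedStore };
const dfsHandler = { store: sharedStore };
```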
- Need to work with Azure Storage SDKs to change and support the new DFS port (say: 10004).
  - E.g. the .NET SDK needs changes to the blob/DFS URI conversion function in this file.
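The conversion the SDKs need is essentially a host/port rewrite. A sketch of the idea (the 10004 DFS port is the number proposed above, not a released feature; cloud hosts follow the documented `.blob.core.windows.net` / `.dfs.core.windows.net` pattern):

```typescript
// Sketch of a blob -> DFS endpoint conversion. For cloud accounts the host
// suffix changes; for Azurite the port changes (10000 -> hypothetical 10004).
function blobToDfsEndpoint(blobUrl: string): string {
  const url = new URL(blobUrl);
  if (url.hostname.endsWith(".blob.core.windows.net")) {
    url.hostname = url.hostname.replace(
      ".blob.core.windows.net",
      ".dfs.core.windows.net"
    );
  } else if (url.port === "10000") {
    url.port = "10004"; // hypothetical Azurite DFS port from the plan above
  }
  return url.toString();
}
```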
- Azurite users need to configure each Azurite account type as HNS/FNS when Azurite starts up.
  - Need to design how to input the config (default should be FNS).
  - How to handle it when a user starts Azurite with a changed account type? (Report an error?)
  - Azurite won't support FNS/HNS migration or account type changes. The storage account type is finalized after creation. A wrong configuration (like a wrong HNS/FNS type) should result in a reported error.
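One possible shape for such configuration, sketched below. The `"account1:hns;account2"` string format is purely an illustration for this design discussion, not an existing or committed Azurite CLI/environment-variable format; unspecified accounts default to FNS and invalid values are rejected at startup.

```typescript
// Hypothetical per-account namespace-type configuration parser.
// Format (invented for illustration): "account1:hns;account2:fns;account3"
// A missing type defaults to "fns"; anything else is a startup error.
type NamespaceType = "fns" | "hns";

function parseAccountTypes(config: string): Map<string, NamespaceType> {
  const result = new Map<string, NamespaceType>();
  for (const entry of config.split(";").filter((e) => e.length > 0)) {
    const [account, type = "fns"] = entry.split(":");
    if (type === "fns" || type === "hns") {
      result.set(account, type);
    } else {
      throw new Error(`Invalid namespace type "${type}" for account "${account}"`);
    }
  }
  return result;
}
```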
- Implement an HNS metadata store in Azurite.
  - Any schema change or new table design should be reviewed and signed off.
  - We need to maintain hierarchical relationships between parent and child directories/files. For example, we can add a table that maps each item (blob/dir) to its parent, and integrate the existing blob tables with this new table (detailed design needs discussion).
  - Blob/file binary payload persistence based on local files shouldn't be changed.
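To make the parent/child idea concrete, here is one illustrative shape for such a table (field names are invented for this sketch and are not the final design, which per the above still needs review and sign-off): each row records an item and the full path of its parent, so directory listing is a lookup by parent path, and the row can link into the existing blob tables for payload.

```typescript
// Illustrative row shape for a hypothetical HNS parent/child table.
interface HnsPathEntry {
  account: string;
  fileSystem: string;   // maps to the blob container
  name: string;         // item name within its parent, e.g. "file.txt"
  parentPath: string;   // full path of the parent directory, "" for root
  isDirectory: boolean;
  blobId?: string;      // link into the existing blob tables for file payloads
}

// Listing a directory is then a lookup by parent path; an atomic directory
// rename would update parentPath values inside a single transaction.
function childrenOf(entries: HnsPathEntry[], parentPath: string): HnsPathEntry[] {
  return entries.filter((e) => e.parentPath === parentPath);
}
```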
- OAuth & ACL (not a priority in Phase II):
  - Since Azurite does not emulate AAD or the related components that provide user identity information, Azurite currently always assumes a bearer token carries sufficient role permissions. OAuth authentication therefore always passes, and ACL authentication never takes effect. We might need to add configuration to make the OAuth check fail so that the ACL check takes effect.
  - For groups in ACLs, the same lack of AAD emulation means Azurite can't query AAD to check whether a user is in a given group.
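ADLS Gen2 ACLs are POSIX-style entries such as `user::rwx,user:<oid>:r--,group::r-x,other::---`. A deliberately minimal sketch of evaluating one such string against a caller follows; the real algorithm also involves the owning user/group, the mask entry, default ACLs, and group membership, which is exactly what Azurite cannot resolve without AAD emulation.

```typescript
// Minimal POSIX-style ACL check (sketch only). Handles named-user entries
// and the "other" fallback; owner/group/mask evaluation is omitted because
// Azurite has no AAD to resolve identities and group membership.
type Perm = "r" | "w" | "x";

function aclAllows(acl: string, principalOid: string, perm: Perm): boolean {
  // Each entry is "scope:qualifier:perms", e.g. "user:1234:r--".
  const entries = acl.split(",").map((e) => e.split(":"));
  const named = entries.find(([s, q]) => s === "user" && q === principalOid);
  if (named) return named[2].includes(perm);
  const other = entries.find(([s]) => s === "other");
  return other ? other[2].includes(perm) : false;
}
```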
- Need to make sure each API behavior is aligned with the REST API doc, and also aligned with the real Azure server. See more in the validation criteria.
- Implement each feature/API (priority from high to low; 1–6 should be P1):
  1. Create/Update/Delete/Get filesystem.
  2. List filesystems (continuation token).
  3. Create/Update/Delete/Get a single directory/file.
  4. List directories/files (continuation token).
  5. Set/Get ACL/permission/user/group on a single directory/file.
  6. SAS: support DFS SAS (and blob SAS, account SAS).
  7. Set/Update/Remove ACL recursively (with continuation token).
  8. OAuth: ACL works when the user logs in with an AAD account.
  9. For other remaining features/work items, we will add the detailed plan later.
- No regression in the Blob API on FNS accounts.
- Test cases:
  - Need to cover each API and its parameters.
    - For enum/bool parameters, need to cover all possible values.
    - For number parameters, need to cover maximum and minimum values.
    - For optional parameters, need to cover explicit and empty (default) values.
  - Need to cover all possible HTTP return codes and Azure Storage error codes.
  - Need to run and pass against Azurite-hosted storage accounts and Azure Storage cloud accounts.
- Needs to pass all language SDK tests (JS, .NET, Java, Go, Python, …) and ABFS tests, at least at GA.
- Needs to pass tests in Storage Explorer.
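One way to satisfy the parameter-coverage criteria above is to generate the test matrix as a cartesian product of each parameter's value set (all enum/bool values, number min/max, optional explicit/default). A small helper sketch:

```typescript
// Sketch: cartesian product of per-parameter value sets, so every
// combination becomes one test case.
function cartesian<T>(dims: T[][]): T[][] {
  return dims.reduce<T[][]>(
    (acc, dim) => acc.flatMap((row) => dim.map((v) => [...row, v])),
    [[]]
  );
}
```

For example, a bool parameter (2 values) crossed with a 3-value enum yields 6 test cases.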
- We can use GitHub issues to track the work.
- We should first merge PRs to a preview branch; after a phase completes, merge the changes back to main and release.
- Each PR should be small for quick review.
- Maintenance: the code needs to be maintained (investigating/fixing issues) by the PR owner for some time (e.g. until feature GA).
- Need clean, detailed docs to introduce the implementation.
Azure Data Lake Storage Gen2 Introduction - Azure Storage | Microsoft Learn
Azure Data Lake Storage Gen2 hierarchical namespace - Azure Storage | Microsoft Learn
Blob Storage REST API - Azure Storage | Microsoft Learn
Azure Data Lake Storage Gen2 REST API reference - Azure Storage | Microsoft Learn