Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-39320: [C++][FS][Azure] Add managed identity auth configuration #39321

Merged

Conversation

Tom-Newton
Copy link
Contributor

@Tom-Newton Tom-Newton commented Dec 20, 2023

Rationale for this change

Workload identity is a useful Azure authentication method. Also I failed to set the account_name correctly for a bunch of auths (I think this got lost in a rebase then I copy pasted the broken code).

What changes are included in this PR?

  • Make filesystem initialisation fail if account_name_.empty(). This prevents the account name configuration bug we had. Also added a test asserting that filesystem initialization fails in this case.
  • Remove account name configuration on all auth configs, in favour of setting in separately from the auth configuration.
  • Implement AzureOptions::ConfigureManagedIdentityCredential

Are these changes tested?

Added a simple test initialising a filesystem using ConfigureManagedIdentityCredential. This is not the most comprehensive test but its the same as what we agreed on for #39263.

Are there any user-facing changes?

Managed identity authentication is now supported.

@Tom-Newton
Copy link
Contributor Author

Looks like merging workload identity caused merge conflicts. I'll fix it tomorrow.

@kou kou changed the title GH-39320: Add Managed identity auth configuration GH-39320: [C++][FS][Azure] Add managed identity auth configuration Dec 21, 2023
@Tom-Newton Tom-Newton force-pushed the tomnewton/azure_managed_identity/GH-39320 branch from e71b386 to bdb0a00 Compare December 21, 2023 10:30
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
@@ -119,6 +119,14 @@ Status AzureOptions::ConfigureDefaultCredential(const std::string& account_name)
return Status::OK();
}

Status AzureOptions::ConfigureManagedIdentityCredential(const std::string& account_name,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that account_name isn't used. Is it needed?

Copy link
Contributor Author

@Tom-Newton Tom-Newton Dec 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Sorry, I have not been paying enough attention 🤦‍♂️ . I think @felipecrv refactored this slightly. I need to check how the account name is configured now. I'll do it tomorrow.

Copy link
Member

@kou kou Dec 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.
It seems that both of ConfigureWorkloadIdentityCredential() and ConfigureDefaultCredential() are the same situation too. Could you check them too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed now. I have added a check so that filesystem initialisation fails if account_name_.empty().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. I messed up and forgot to set account_name_ on the Configure... functions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should undo something I did and move account_name_ back to a public account_name field (the first in AzureOptions, but remove the account_name parameter from all the Configure... functions. That allows us to avoid the repetition of the account_name parameter and is more in line to how every credential constructor doesn't really need an account_name.

@Tom-Newton what you think and can you do this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good to me 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually need to make account_name_ public though? Personally I would just make account_name a required argument of the AzureOptions constructor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a good idea in normal classes, but these *Options classes tend to be designed like simple C structs that can be constructed from any language and easily serialized. The only exception being the secret stuff that we don't want to leak and lazily initialized fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I think its as you recommended now.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Dec 22, 2023
@Tom-Newton Tom-Newton force-pushed the tomnewton/azure_managed_identity/GH-39320 branch from bdb0a00 to d4af840 Compare December 22, 2023 12:34
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Dec 22, 2023
@Tom-Newton Tom-Newton requested a review from kou December 22, 2023 12:41
@@ -119,6 +119,14 @@ Status AzureOptions::ConfigureDefaultCredential(const std::string& account_name)
return Status::OK();
}

Status AzureOptions::ConfigureManagedIdentityCredential(const std::string& account_name,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. I messed up and forgot to set account_name_ on the Configure... functions.

@@ -119,6 +119,14 @@ Status AzureOptions::ConfigureDefaultCredential(const std::string& account_name)
return Status::OK();
}

Status AzureOptions::ConfigureManagedIdentityCredential(const std::string& account_name,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should undo something I did and move account_name_ back to a public account_name field (the first in AzureOptions, but remove the account_name parameter from all the Configure... functions. That allows us to avoid the repetition of the account_name parameter and is more in line to how every credential constructor doesn't really need an account_name.

@Tom-Newton what you think and can you do this?

credential_kind_ = CredentialKind::kTokenCredential;
token_credential_ = std::make_shared<Azure::Identity::WorkloadIdentityCredential>();
return Status::OK();
}

Result<std::unique_ptr<Blobs::BlobServiceClient>> AzureOptions::MakeBlobServiceClient()
const {
if (account_name_.empty()) {
return Status::Invalid("AzureOptions doesn't contain a valid account name");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another simplification we can make:

  • Rename kAnonymous to kDefaultCredential
  • On the kDefaultCredential case here, initialize token_credential_ to std::make_shared<Azure::Identity::DefaultAzureCredential>() if it's not initialized already.

In the kTokenCredential case you can add a DCHECK(token_credential_) as documentation that when credential_kind_ is set to kTokenCredential we expect it to be set already (something that is guaranteed by credential_kind_ being private and set only by functions that also initialize token_credential_.

What you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its useful to support both anonymous and default credential as distinct things. Admittedly, its a niche use-case but storage accounts can be public, which is when anonymous is useful https://learn.microsoft.com/en-us/azure/storage/blobs/anonymous-read-access-configure?tabs=portal.

Its a good question though which should be the default. I know adlfs deafults to anonomous rather than default credential. In some respects it might be good to be consistent but also this choice of default has caused issues for me in the past e.g. flyteorg/flyte#3962

Copy link
Contributor

@felipecrv felipecrv Dec 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Makes sense to keep kAnonymous then. But what would you return in AzureOptions::MakeBlobServiceClient() when credential_kind_ is anonymous? Lazy-init token_credential_ with DefaultAzureCredential and create the client with it? can you create the client without any credentials?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked and there is a constructor that doesn't take any credentials. I guess that's what we should call :-)

Copy link
Contributor Author

@Tom-Newton Tom-Newton Dec 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Do you mind if we do that in a separate PR to fix up anonymous and change the default to default credential. This was supposed to be a super simple PR until kou realised all the auths were broken 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem.

cpp/src/arrow/filesystem/azurefs.h Outdated Show resolved Hide resolved
@Tom-Newton Tom-Newton requested a review from felipecrv December 22, 2023 19:59
@felipecrv felipecrv merged commit d519544 into apache:main Dec 23, 2023
36 of 37 checks passed
@felipecrv felipecrv removed the awaiting change review Awaiting change review label Dec 23, 2023
@felipecrv
Copy link
Contributor

the MINGW failure is just pacman failing right away (re-run didn't fix it), so I merged the PR.

Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit d519544.

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
…ion (apache#39321)

### Rationale for this change
Workload identity is a useful Azure authentication method. Also I failed to set the account_name correctly for a bunch of auths (I think this got lost in a rebase then I copy pasted the broken code). 

### What changes are included in this PR?
- Make filesystem initialisation fail if `account_name_.empty()`. This prevents the account name configuration bug we had. Also added a test asserting that filesystem initialization fails in this case. 
- Remove account name configuration on all auth configs, in favour of setting in separately from the auth configuration. 
- Implement `AzureOptions::ConfigureManagedIdentityCredential`

### Are these changes tested?
Added a simple test initialising a filesystem using `ConfigureManagedIdentityCredential`. This is not the most comprehensive test but its the same as what we agreed on for apache#39263. 

### Are there any user-facing changes?
Managed identity authentication is now supported. 

* Closes: apache#39320

Authored-by: Thomas Newton <[email protected]>
Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…ion (apache#39321)

### Rationale for this change
Workload identity is a useful Azure authentication method. Also I failed to set the account_name correctly for a bunch of auths (I think this got lost in a rebase then I copy pasted the broken code). 

### What changes are included in this PR?
- Make filesystem initialisation fail if `account_name_.empty()`. This prevents the account name configuration bug we had. Also added a test asserting that filesystem initialization fails in this case. 
- Remove account name configuration on all auth configs, in favour of setting in separately from the auth configuration. 
- Implement `AzureOptions::ConfigureManagedIdentityCredential`

### Are these changes tested?
Added a simple test initialising a filesystem using `ConfigureManagedIdentityCredential`. This is not the most comprehensive test but its the same as what we agreed on for apache#39263. 

### Are there any user-facing changes?
Managed identity authentication is now supported. 

* Closes: apache#39320

Authored-by: Thomas Newton <[email protected]>
Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][FS][Azure] Managed Identity authentication
3 participants