Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add connector SPI for returning redactable properties #24562

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

piotrrzysko
Copy link
Member

@piotrrzysko piotrrzysko commented Dec 23, 2024

Description

An alternative approach to #23103. The main difference is that in this approach, the properties requiring redaction are selected from those provided by the user, rather than always returning a static set of predefined security-sensitive properties. The benefits are as follows:

  • By default (if a connector doesn't implement the SPI), all properties are masked.
  • Unknown (potentially misspelled) properties can also be treated as redactable.

This PR includes an implementation of the new SPI for the PostgreSQL, Hive, Iceberg, Delta, and Hudi connectors. Once we confirm that the approach is correct, we will apply it to the remaining connectors.

Here is a PR demonstrating how the new SPI could be used to mask security-sensitive properties in queries related to creating catalogs: #24563.

Additional context and related issues

Resolves #22887.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## SPI
* Add connector SPI for returning security-sensitive properties ({issue}`22887`)

{
checkArgument(!isNullOrEmpty(name), "name is null or empty");
this.name = name;
this.module = requireNonNull(module, "module is null");
Set<Class<?>> configClasses = ImmutableSet.<Class<?>>builder()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of attempting to list every configuration class, I think we should modify ConfigurationFactory in Airlift to extract the properties for us. I'm thinking (just thoughts after a brief look) that we have a method to extract all properties from a set of modules, and classify them into used, unused, and unknown. With used and unused having sub classification for secure or unsecure.

Copy link
Member Author

@piotrrzysko piotrrzysko Jan 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of getting properties from Airlift. However, I’m wondering if this is feasible, given that some modules are bound conditionally. Additionally, in some cases, we use constructs like: https://github.com/trinodb/trino/blob/master/plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeSecurityModule.java#L43-L49, which means Airlift isn’t even aware that multiple modules can be bound.

I’m assuming we want to have a static list of possible properties, rather than bootstrapping a connector when getSecuritySensitivePropertyNames is called. Is this a wrong assumption?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@piotrrzysko Can we scan the classpath to find all configuration classes annotated with @config ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do this in this test:

Set<ConfigPropertyMetadata> propertiesFoundInClasspath = findPropertiesInRuntimeClasspath(excludedClasses);
I wanted to avoid scanning the classpath at runtime to minimize the time required for plugin loading.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@piotrrzysko we could generate an index while building a project, in the trino-maven-plugin

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Classpath scanning won't work becasue you can mount configuration classes under a prefix... like we do for HTTP clients.

Copy link
Member Author

@piotrrzysko piotrrzysko Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update

I implemented the idea of retrieving properties from Airlift. See the recent changes in this PR and the corresponding PR in Airlift: airlift/airlift#1354.

To verify its functionality, I applied it to the following connectors:

  • PostgreSQL (effectively all JDBC connectors, as they share the same returning sensitive properties)
  • Hive, Iceberg, Delta, and Hudi – These are unique because they create a separate Bootstrap instance when loading trino-hdfs.

One concern I had was whether traversing the module tree might cause side effects. Specifically, I wasn’t sure what operations are allowed in the Module.configure method. For example, can you open a database connection in this method?

After some thought, I believe we’re fine because a similar situation can occur during catalog creation. Here’s an example:

  1. A user issues a query with invalid configuration.
  2. Connector bootstrapping fails after the configuration stage.
  3. If any resource was opened during the process, it would remain unclosed.

Based on this, I assume operations with side effects, such as opening database connections, shouldn't be performed in Module.configure.

Unresolved Issue

I encountered a problem while updating the PR with masking (#24563).
Since we now need to evaluate catalog properties for redaction right after parsing (reference code), more complex expressions than string literals can cause failures. For example, evaluating the following properties fails because the session doesn’t yet contain a transactionId:

CREATE CATALOG catalog USING jdbc
WITH (
    "connection-url" = 'jdbc:h2:mem:config',
    "connection-user" = lower('BOB'),
    "connection-password" = '1234'
)

To address this, I tried delaying redaction as much as possible by moving it to just before the query state machine is created (reference code). This didn’t resolve the issue. The query failed because it hadn’t been registered yet in the LanguageFunctionManager (reference code). The registration only happens after the query machine is created (reference code).

I’m not yet sure if I can delay redaction further, as the query machine references the query text, which is used to build QueryInfo objects that are exposed via the REST API. For now, when a query contains more complex expressions I mask all the properties.

@piotrrzysko piotrrzysko force-pushed the redactable-properties-spi branch from 8018b45 to a4f5809 Compare January 6, 2025 15:55
@ksobolew
Copy link
Contributor

ksobolew commented Jan 7, 2025

I think this will be helpful in getting #22669 done as well (the main stumbling block there is that by making catalog properties available in all catalog manager implementations, we may expose some security-sensitive properties).

@piotrrzysko piotrrzysko force-pushed the redactable-properties-spi branch from a4f5809 to 47db44b Compare January 9, 2025 17:51
@github-actions github-actions bot added hudi Hudi connector iceberg Iceberg connector delta-lake Delta Lake connector hive Hive connector labels Jan 9, 2025
@piotrrzysko
Copy link
Member Author

Update:

It seems the issue I described here: #24562 (comment) is solvable (thanks, @dain, for the help!). I’ve applied the necessary changes to #24563 and added tests to confirm that the evaluation of complex expressions works as expected.

This PR now depends on: airlift/airlift#1354

@piotrrzysko piotrrzysko force-pushed the redactable-properties-spi branch 2 times, most recently from c4a3d9f to 4edb088 Compare January 13, 2025 15:44
@piotrrzysko piotrrzysko marked this pull request as ready for review January 13, 2025 19:43
@piotrrzysko piotrrzysko requested a review from dain January 13, 2025 19:43
@piotrrzysko piotrrzysko force-pushed the redactable-properties-spi branch from 4edb088 to e32ef86 Compare January 20, 2025 08:49
@piotrrzysko
Copy link
Member Author

@dain @hashhar could you please review this PR again?

Copy link
Member

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good % comment

The SPI will be used by the engine to redact security-sensitive
information in statements that manage catalogs. It has been added at the
connector factory level, rather than the connector level, to allow more
flexibility in retrieving properties. In some cases, we want to perform
redacting before a connector is initiated. For example, when we create a
new catalog by issuing the CREATE CATALOG statement.
@piotrrzysko piotrrzysko force-pushed the redactable-properties-spi branch from e32ef86 to 6f81c49 Compare February 5, 2025 12:20
@github-actions github-actions bot added the postgresql PostgreSQL connector label Feb 5, 2025
Comment on lines +24 to +40
public final class ConfigUtils
{
private ConfigUtils() {}

public static Set<String> getSecuritySensitivePropertyNames(Map<String, String> config, Set<ConfigPropertyMetadata> usedProperties)
{
Set<String> sensitivePropertyNames = new HashSet<>(config.keySet());

for (ConfigPropertyMetadata propertyMetadata : usedProperties) {
if (!propertyMetadata.securitySensitive()) {
sensitivePropertyNames.remove(propertyMetadata.name());
}
}

return ImmutableSet.copyOf(sensitivePropertyNames);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public final class ConfigUtils
{
private ConfigUtils() {}
public static Set<String> getSecuritySensitivePropertyNames(Map<String, String> config, Set<ConfigPropertyMetadata> usedProperties)
{
Set<String> sensitivePropertyNames = new HashSet<>(config.keySet());
for (ConfigPropertyMetadata propertyMetadata : usedProperties) {
if (!propertyMetadata.securitySensitive()) {
sensitivePropertyNames.remove(propertyMetadata.name());
}
}
return ImmutableSet.copyOf(sensitivePropertyNames);
}
}
public final class ConfigUtils
{
private ConfigUtils() {}
public static Set<String> getSecuritySensitivePropertyNames(Map<String, String> config, Set<ConfigPropertyMetadata> usedProperties)
{
Set<String> safePropertyNames = usedProperties.stream()
.filter(metadata -> !metadata.securitySensitive())
.map(ConfigPropertyMetadata::name)
.collect(toImmutableSet());
return Sets.difference(config.keySet(), safePropertyNames);
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hashhar why oh why :)

@Override
public Set<String> getSecuritySensitivePropertyNames(String catalogName, Map<String, String> config, ConnectorContext context)
{
ClassLoader classLoader = HiveConnectorFactory.class.getClassLoader();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can inline class loader


if (quietBootstrap) {
app.quiet()
.skipErrorReporting();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to the left

module);

if (quietBootstrap) {
app.quiet()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be better to have a single method that accepts boolean flag as these are always called together in the if

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will allow for the chained call without the need to assign to a variable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add withQuiet(boolean flag) and withSkipErrorReporting(boolean flag)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting adding these methods to Airlift? If so, that would be inconsistent with other methods in Bootstrap. Currently, boolean switches like loadSecretsPlugins and doNotInitializeLogging do not take arguments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Inconsistent? Maybe, but more readable in the callsite.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency we can add all methods with boolean flags

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion on this. However, the construct with conditionally applying quiet and skipErrorReporting is currently used in five places, and I don't think it will be needed for other connectors. In fact, I can replace four of these five usages with a different approach - it will be a chained call.

It seems to me that introducing a new convention only for this case might be overkill, especially considering that Bootstrap is used extensively elsewhere.

Exposed properties fall into one of the following categories: they are
either explicitly marked as security-sensitive or are unknown. The
connector assumes that unknown properties might be misspelled
security-sensitive properties.
This preparatory commit enables bootstrapping HDFS to retrieve its
security-sensitive properties.
@piotrrzysko piotrrzysko force-pushed the redactable-properties-spi branch from 6f81c49 to 9c71c79 Compare February 6, 2025 12:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed delta-lake Delta Lake connector hive Hive connector hudi Hudi connector iceberg Iceberg connector postgresql PostgreSQL connector
Development

Successfully merging this pull request may close these issues.

Add system to identify security sensitive catalog properties
5 participants