Address comments
jinlintt committed Sep 27, 2023
1 parent 7870fe7 commit c6c2bca
Showing 5 changed files with 20 additions and 27 deletions.
4 changes: 2 additions & 2 deletions datahub-web-react/src/app/ingest/source/builder/sources.json
Original file line number Diff line number Diff line change
@@ -130,7 +130,7 @@
"name": "dynamodb",
"displayName": "DynamoDB",
"docsUrl": "https://datahubproject.io/docs/metadata-ingestion/",
"recipe": "source:\n type: dynamodb\n config:\n platform_instance: \"AWS_ACCOUNT_ID\"\n aws_access_key_id : '${AWS_ACCESS_KEY_ID}'\n aws_secret_access_key : '${AWS_SECRET_ACCESS_KEY}'\n # User could use the below option to provide a list of primary keys of a table in dynamodb format,\n # those items from given primary keys will be included when we scan the table.\n # For each table we can retrieve up to 16 MB of data, which can contain as many as 100 items.\n # We'll enforce the the primary keys list size not to exceed 100\n # The total items we'll try to retrieve in these two scenarios:\n # 1. If user don't specify include_table_item: we'll retrieve up to 100 items\n # 2. If user specifies include_table_item: we'll retrieve up to 100 items plus user specified items in\n # the table, with a total not more than 200 items\n # include_table_item:\n # table_name:\n # [\n # {\n # 'partition_key_name': { 'attribute_type': 'attribute_value' },\n # 'sort_key_name': { 'attribute_type': 'attribute_value' },\n # },\n # ]"
"recipe": "source:\n type: dynamodb\n config:\n platform_instance: \"AWS_ACCOUNT_ID\"\n aws_access_key_id : '${AWS_ACCESS_KEY_ID}'\n aws_secret_access_key : '${AWS_SECRET_ACCESS_KEY}'\n # If there are items that have most representative fields of the table, users could use the\n # `include_table_item` option to provide a list of primary keys of the table in dynamodb format.\n # For each `region.table`, the list of primary keys can be at most 100.\n # We include these items in addition to the first 100 items in the table when we scan it.\n # include_table_item:\n # region.table_name:\n # [\n # {\n # 'partition_key_name': { 'attribute_type': 'attribute_value' },\n # 'sort_key_name': { 'attribute_type': 'attribute_value' },\n # },\n # ]"
},
{
"urn": "urn:li:dataPlatform:glue",
@@ -223,4 +223,4 @@
"docsUrl": "https://datahubproject.io/docs/metadata-ingestion/",
"recipe": "source:\n type: <source-type>\n config:\n # Source-type specifics config\n <source-configs>"
}
]
]
13 changes: 5 additions & 8 deletions metadata-ingestion/docs/sources/dynamodb/dynamodb_post.md
@@ -1,21 +1,18 @@
## Limitations

For each region, the list table operation returns maximum number 100 tables, we need to further improve it by implementing pagination for listing tables

## Advanced Configurations

### Using `include_table_item` config

If there are items that have most representative fields of the table, user could use the `include_table_item` option to provide a list of primary keys of a table in dynamodb format, those items from given primary keys will be included when we scan the table.
If some items contain the most representative fields of the table, users can use the `include_table_item` option to provide a list of those items' primary keys in DynamoDB format. These items are included in addition to the first 100 items in the table when it is scanned.
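
The shape of an `include_table_item` mapping described above can be checked before a recipe is run. The following is a minimal sketch, not DataHub's own validation: the function name `validate_include_table_item` and the constant `MAX_KEYS_PER_TABLE` are hypothetical, and the only rules it encodes are the ones stated in this commit (keys named `region.table_name`, at most 100 primary keys per table, each attribute given as a single `{type: value}` pair).

```python
from typing import Any, Dict, List

# Limit taken from the recipe comments in this commit; the constant name is hypothetical.
MAX_KEYS_PER_TABLE = 100

# One primary key: attribute name -> {"attribute_type": "attribute_value"}
PrimaryKey = Dict[str, Dict[str, Any]]


def validate_include_table_item(config: Dict[str, List[PrimaryKey]]) -> None:
    """Raise ValueError if the mapping does not follow the documented shape."""
    for dataset_name, keys in config.items():
        # Split on the first dot only, since the table part may itself contain dots.
        region, sep, table = dataset_name.partition(".")
        if not (region and sep and table):
            raise ValueError(f"expected 'region.table_name', got {dataset_name!r}")
        if len(keys) > MAX_KEYS_PER_TABLE:
            raise ValueError(
                f"{dataset_name}: {len(keys)} primary keys exceeds {MAX_KEYS_PER_TABLE}"
            )
        for key in keys:
            for attr_name, typed_value in key.items():
                # DynamoDB format wraps each value in a single-entry type map, e.g. {"S": "abc"}.
                if not isinstance(typed_value, dict) or len(typed_value) != 1:
                    raise ValueError(
                        f"{dataset_name}: attribute {attr_name!r} must look like "
                        "{'attribute_type': 'attribute_value'}"
                    )
```

Calling it with `{"us-west-2.Reply": [...]}` passes silently, while a bare `Reply` key raises, which mirrors the move from `table_name` to `region.table_name` in this change.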

Take [AWS DynamoDB Developer Guide Example tables and data](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AppendixSampleTables.html) as an example, if user has a table `Reply` with composite primary key `Id` and `ReplyDateTime`, user can use `include_table_item` to include 2 items as following:
Take [AWS DynamoDB Developer Guide Example tables and data](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AppendixSampleTables.html) as an example: if an account has a table `Reply` in the `us-west-2` region with composite primary key `Id` and `ReplyDateTime`, users can use `include_table_item` to include 2 items as follows:

Example:

```yml
# put the table name and composite key in DynamoDB format
# The table name should be in the format of region.table_name
# The primary keys should be in the DynamoDB format
include_table_item:
Reply:
us-west-2.Reply:
[
{
"ReplyDateTime": { "S": "2015-09-22T19:58:22.947Z" },
6 changes: 2 additions & 4 deletions metadata-ingestion/docs/sources/dynamodb/dynamodb_pre.md
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
### Prerequisites

In order to execute this source, you will need to create access key and secret keys that have DynamoDB read access. You can create these policies and attach to your account or can ask your account admin to attach these policies to your account.
In order to execute this source, you need to attach the `AmazonDynamoDBReadOnlyAccess` policy to a user in your AWS account. Then create an API access key and secret for the user.

For access key permissions, you can create a policy with permissions below and attach to your account, you can find more details in [Managing access keys for IAM users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html)
To create an API access key, the user needs the following access key permissions. Your AWS account admin can create a policy with these permissions and attach it to the user. You can find more details in [Managing access keys for IAM users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html).

```json
{
@@ -22,5 +22,3 @@ For access key permissions, you can create a policy with permissions below and a
]
}
```

For DynamoDB read access, you can simply attach AWS managed policy `AmazonDynamoDBReadOnlyAccess` to your account, you can find more details in [Attaching a policy to an IAM user group](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_groups_manage_attach-policy.html)
16 changes: 7 additions & 9 deletions metadata-ingestion/docs/sources/dynamodb/dynamodb_recipe.yml
@@ -4,16 +4,14 @@ source:
platform_instance: "AWS_ACCOUNT_ID"
aws_access_key_id: "${AWS_ACCESS_KEY_ID}"
aws_secret_access_key: "${AWS_SECRET_ACCESS_KEY}"
# User could use the below option to provide a list of primary keys of a table in dynamodb format,
# those items from given primary keys will be included when we scan the table.
# For each table we can retrieve up to 16 MB of data, which can contain as many as 100 items.
# We'll enforce the the primary keys list size not to exceed 100
# The total items we'll try to retrieve in these two scenarios:
# 1. If user don't specify include_table_item: we'll retrieve up to 100 items
# 2. If user specifies include_table_item: we'll retrieve up to 100 items plus user specified items in
# the table, with a total not more than 200 items
#
# If there are items that have most representative fields of the table, users could use the
# `include_table_item` option to provide a list of primary keys of the table in dynamodb format.
# For each `region.table`, the list of primary keys can be at most 100.
# We include these items in addition to the first 100 items in the table when we scan it.
#
# include_table_item:
# table_name:
# region.table_name:
# [
# {
# "partition_key_name": { "attribute_type": "attribute_value" },
Original file line number Diff line number Diff line change
@@ -79,7 +79,7 @@ class DynamoDBConfig(DatasetSourceConfigMixin, StatefulIngestionConfigBase):

table_pattern: AllowDenyPattern = Field(
default=AllowDenyPattern.allow_all(),
description="regex patterns for tables to filter in ingestion.",
description="Regex patterns for tables to filter in ingestion. The table name format is 'region.table'",
)
# Custom Stateful Ingestion settings
stateful_ingestion: Optional[StatefulStaleMetadataRemovalConfig] = None
@@ -190,11 +190,11 @@ def get_workunits_internal(self) -> Iterable[MetadataWorkUnit]:
for table_name in self._list_tables(dynamodb_client):
dataset_name = f"{region}.{table_name}"
if not self.config.table_pattern.allowed(dataset_name):
logger.info(f"skipping table: {dataset_name}")
logger.debug(f"skipping table: {dataset_name}")
self.report.report_dropped(dataset_name)
continue

logger.info(f"Processing table: {dataset_name}")
logger.debug(f"Processing table: {dataset_name}")
table_info = dynamodb_client.describe_table(TableName=table_name)[
"Table"
]
@@ -331,7 +331,7 @@ def include_table_item_to_schema(
f"failed to retrieve item from table {table_name} by the given key {primary_key_list}"
)
return
logger.info(
logger.debug(
f"successfully retrieved {len(primary_key_list)} items based on supplied primary key list"
)
items = response.get(table_name)
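
Because `get_workunits_internal` builds `dataset_name` as `f"{region}.{table_name}"` before checking `table_pattern`, patterns have to be written against the `region.table` form. The sketch below illustrates that with a simplified stand-in for `AllowDenyPattern`: the helpers `is_allowed` and `filter_tables` are hypothetical, and the assumption that patterns are anchored at the start of the name (via `re.match`) is made for this example rather than taken from the diff.

```python
import re
from typing import Iterable, List


def is_allowed(dataset_name: str, allow: List[str], deny: List[str]) -> bool:
    """Simplified allow/deny check: a deny match wins, otherwise an allow match is required."""
    if any(re.match(pattern, dataset_name) for pattern in deny):
        return False
    return any(re.match(pattern, dataset_name) for pattern in allow)


def filter_tables(
    region: str, table_names: Iterable[str], allow: List[str], deny: List[str]
) -> List[str]:
    """Return the 'region.table' names that pass the filter, in input order."""
    kept = []
    for table_name in table_names:
        # Same naming scheme as the ingestion loop in the diff.
        dataset_name = f"{region}.{table_name}"
        if is_allowed(dataset_name, allow, deny):
            kept.append(dataset_name)
    return kept
```

Under these assumptions, an allow pattern such as `us-west-2\..*` keeps every table in that region, whereas a bare `Reply` keeps nothing, since the name being tested starts with the region.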