Commit

fix: fix image paths
yoonhyejin committed Jul 26, 2023
1 parent c585a1b commit e279403
Showing 262 changed files with 1,163 additions and 831 deletions.
4 changes: 3 additions & 1 deletion README.md
@@ -80,7 +80,9 @@ Please follow the [DataHub Quickstart Guide](https://datahubproject.io/docs/quic

If you're looking to build & modify datahub please take a look at our [Development Guide](https://datahubproject.io/docs/developers).

[![DataHub Demo GIF](docs/imgs/entity.png)](https://demo.datahubproject.io/)
<p align="center">
<img width="70%" href="https://demo.datahubproject.io/" src="https://raw.githubusercontent.com/acryldata/static-assets-test/master/imgs/entity.png"/>
</p>

## Source Code and Repositories

44 changes: 33 additions & 11 deletions docker/airflow/local_airflow.md
@@ -138,35 +138,55 @@ Successfully added `conn_id`=datahub_rest_default : datahub_rest://:@http://data

Navigate the Airflow UI to find the sample Airflow dag we just brought in

![Find the DAG](../../docs/imgs/airflow/find_the_dag.png)
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/acryldata/static-assets-test/master/docs/imgs/airflow/find_the_dag.png"/>
</p>

By default, Airflow loads all DAG-s in paused status. Unpause the sample DAG to use it.
![Paused DAG](../../docs/imgs/airflow/paused_dag.png)
![Unpaused DAG](../../docs/imgs/airflow/unpaused_dag.png)
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/acryldata/static-assets-test/master/docs/imgs/airflow/paused_dag.png"/>
</p>
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/acryldata/static-assets-test/master/docs/imgs/airflow/unpaused_dag.png"/>
</p>

Then trigger the DAG to run.

![Trigger the DAG](../../docs/imgs/airflow/trigger_dag.png)
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/acryldata/static-assets-test/master/docs/imgs/airflow/trigger_dag.png"/>
</p>

After the DAG runs successfully, go over to your DataHub instance to see the Pipeline and navigate its lineage.

![DataHub Pipeline View](../../docs/imgs/airflow/datahub_pipeline_view.png)
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/acryldata/static-assets-test/master/docs/imgs/airflow/datahub_pipeline_view.png"/>
</p>

![DataHub Pipeline Entity](../../docs/imgs/airflow/datahub_pipeline_entity.png)
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/acryldata/static-assets-test/master/docs/imgs/airflow/datahub_pipeline_entity.png"/>
</p>

![DataHub Task View](../../docs/imgs/airflow/datahub_task_view.png)
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/acryldata/static-assets-test/master/docs/imgs/airflow/datahub_task_view.png"/>
</p>

![DataHub Lineage View](../../docs/imgs/airflow/datahub_lineage_view.png)
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/acryldata/static-assets-test/master/docs/imgs/airflow/datahub_lineage_view.png"/>
</p>

## TroubleShooting

Most issues are related to connectivity between Airflow and DataHub.

Here is how you can debug them.

![Find the Task Log](../../docs/imgs/airflow/finding_failed_log.png)
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/acryldata/static-assets-test/master/docs/imgs/airflow/finding_failed_log.png"/>
</p>

![Inspect the Log](../../docs/imgs/airflow/connection_error.png)
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/acryldata/static-assets-test/master/docs/imgs/airflow/connection_error.png"/>
</p>

In this case, clearly the connection `datahub-rest` has not been registered. Looks like we forgot to register the connection with Airflow!
Let's execute Step 4 to register the datahub connection with Airflow.
@@ -175,4 +195,6 @@ In case the connection was registered successfully but you are still seeing `Fai

After re-running the DAG, we see success!

![Pipeline Success](../../docs/imgs/airflow/successful_run.png)
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/acryldata/static-assets-test/master/docs/imgs/airflow/successful_run.png"/>
</p>
139 changes: 73 additions & 66 deletions docs/advanced/no-code-modeling.md
@@ -1,47 +1,46 @@
# No Code Metadata
# No Code Metadata

## Summary of changes

As part of the No Code Metadata Modeling initiative, we've made radical changes to the DataHub stack.
As part of the No Code Metadata Modeling initiative, we've made radical changes to the DataHub stack.

Specifically, we've
Specifically, we've

- Decoupled the persistence layer from Java + Rest.li specific concepts
- Decoupled the persistence layer from Java + Rest.li specific concepts
- Consolidated the per-entity Rest.li resources into a single general-purpose Entity Resource
- Consolidated the per-entity Graph Index Writers + Readers into a single general-purpose Neo4J DAO
- Consolidated the per-entity Search Index Writers + Readers into a single general-purpose ES DAO.
- Consolidated the per-entity Graph Index Writers + Readers into a single general-purpose Neo4J DAO
- Consolidated the per-entity Search Index Writers + Readers into a single general-purpose ES DAO.
- Developed mechanisms for declaring search indexing configurations + foreign key relationships as annotations
on PDL models themselves.
- Introduced a special "Browse Paths" aspect that allows the browse configuration to be
pushed into DataHub, as opposed to computed in a blackbox lambda sitting within DataHub
on PDL models themselves.
- Introduced a special "Browse Paths" aspect that allows the browse configuration to be
pushed into DataHub, as opposed to computed in a blackbox lambda sitting within DataHub
- Introduced special "Key" aspects for conveniently representing the information that identifies a DataHub entities via
a normal struct.
a normal struct.
- Removed the need for hand-written Elastic `settings.json` and `mappings.json`. (Now generated at runtime)
- Removed the need for the Elastic Set Up container (indexes are not registered at runtime)
- Simplified the number of models that need to be maintained for each DataHub entity. We removed the need for
1. Relationship Models
2. Entity Models
3. Urn models + the associated Java container classes
4. 'Value' models, those which are returned by the Rest.li resource
1. Relationship Models
2. Entity Models
3. Urn models + the associated Java container classes
4. 'Value' models, those which are returned by the Rest.li resource

In doing so, dramatically reducing the level of effort required to add or extend an existing entity.

For more on the design considerations, see the **Design** section below.


## Engineering Spec

This section will provide a more in-depth overview of the design considerations that were at play when working on the No
Code initiative.
Code initiative.

# Use Cases

Who needs what & why?

| As a | I want to | because
| ---------------- | ------------------------ | ------------------------------
| DataHub Operator | Add new entities | The default domain model does not match my business needs
| DataHub Operator | Extend existing entities | The default domain model does not match my business needs
| As a | I want to | because |
| ---------------- | ------------------------ | --------------------------------------------------------- |
| DataHub Operator | Add new entities | The default domain model does not match my business needs |
| DataHub Operator | Extend existing entities | The default domain model does not match my business needs |

What we heard from folks in the community is that adding new entities + aspects is just **too difficult**.

@@ -62,24 +61,29 @@ Achieve the primary goal in a way that does not require a fork.
### Must-Haves

1. Mechanisms for **adding** a browsable, searchable, linkable GMS entity by defining one or more PDL models
- GMS Endpoint for fetching entity
- GMS Endpoint for fetching entity relationships
- GMS Endpoint for searching entity
- GMS Endpoint for browsing entity
2. Mechanisms for **extending** a ****browsable, searchable, linkable GMS ****entity by defining one or more PDL models
- GMS Endpoint for fetching entity
- GMS Endpoint for fetching entity relationships
- GMS Endpoint for searching entity
- GMS Endpoint for browsing entity

- GMS Endpoint for fetching entity
- GMS Endpoint for fetching entity relationships
- GMS Endpoint for searching entity
- GMS Endpoint for browsing entity

2. Mechanisms for **extending** a \***\*browsable, searchable, linkable GMS \*\***entity by defining one or more PDL models

- GMS Endpoint for fetching entity
- GMS Endpoint for fetching entity relationships
- GMS Endpoint for searching entity
- GMS Endpoint for browsing entity

3. Mechanisms + conventions for introducing a new **relationship** between 2 GMS entities without writing code
4. Clear documentation describing how to perform actions in #1, #2, and #3 above published on [datahubproject.io](http://datahubproject.io)
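
As a rough illustration of the relationship-fetching endpoint named in the must-haves, the sketch below assembles the same query string that the `curl` example later in this document sends. It is a minimal sketch, not the project's client: the helper name `build_relationships_url` is hypothetical, and the base URL assumes GMS is reachable at `localhost:8080` as in that `curl` example.

```python
from urllib.parse import urlencode

# Assumption: GMS listens on localhost:8080, as in this document's curl examples.
GMS_BASE_URL = "http://localhost:8080"


def build_relationships_url(urn: str, relationship_type: str, direction: str = "INCOMING") -> str:
    """Build the URL for reading relationship edges of one entity (hypothetical helper)."""
    # urlencode percent-encodes the colons in the urn, matching the curl example.
    query = urlencode({"direction": direction, "urn": urn, "types": relationship_type})
    return f"{GMS_BASE_URL}/relationships?{query}"


url = build_relationships_url("urn:li:corpuser:user1", "OwnedBy")
print(url)
```

The request itself would still need the `X-RestLi-Protocol-Version: 2.0.0` header shown in the `curl` example.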

## Nice-to-haves

1. Mechanisms for automatically generating a working GraphQL API using the entity PDL models
2. Ability to add / extend GMS entities without a fork.
- e.g. **Register** new entity / extensions *at runtime*. (Unlikely due to code generation)
- or, **configure** new entities at *deploy time*

- e.g. **Register** new entity / extensions _at runtime_. (Unlikely due to code generation)
- or, **configure** new entities at _deploy time_

## What Success Looks Like

@@ -88,7 +92,6 @@ Achieve the primary goal in a way that does not require a fork.
3. Adding a new relationship among 2 GMS entities takes 1 dev < 15 minutes
4. [Bonus] Implementing the `datahub-frontend` GraphQL API for a new / extended entity takes < 10 minutes


## Design

## State of the World
@@ -104,7 +107,8 @@ Currently, there are various models in GMS:
5. [Entities](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/entity/DatasetEntity.pdl) - Records with fields derived from the URN. Used only in graph / relationships
6. [Relationships](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/relationship/Relationship.pdl) - Edges between 2 entities with optional edge properties
7. [Search Documents](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/search/ChartDocument.pdl) - Flat documents for indexing within Elastic index
- And corresponding index [mappings.json](https://github.com/datahub-project/datahub/blob/master/gms/impl/src/main/resources/index/chart/mappings.json), [settings.json](https://github.com/datahub-project/datahub/blob/master/gms/impl/src/main/resources/index/chart/settings.json)

- And corresponding index [mappings.json](https://github.com/datahub-project/datahub/blob/master/gms/impl/src/main/resources/index/chart/mappings.json), [settings.json](https://github.com/datahub-project/datahub/blob/master/gms/impl/src/main/resources/index/chart/settings.json)

Various components of GMS depend on / make assumptions about these model types:

@@ -122,7 +126,7 @@ Various components of GMS depend on / make assumptions about these model types:
Additionally, there are some implicit concepts that require additional caveats / logic:

1. Browse Paths - Requires defining logic in an entity-specific index builder to generate.
2. Urns - Requires defining a) an Urn PDL model and b) a hand-written Urn class
2. Urns - Requires defining a) an Urn PDL model and b) a hand-written Urn class

As you can see, there are many tied up concepts. Fundamentally changing the model would require a serious amount of refactoring, as it would require new versions of numerous components.

@@ -132,25 +136,25 @@ The challenge is, how can we meet the requirements without fundamentally alterin

In a nutshell, the idea is to consolidate the number of models + code we need to write on a per-entity basis.
We intend to achieve this by making search index + relationship configuration declarative, specified as part of the model
definition itself.
definition itself.

We will use this configuration to drive more generic versions of the index builders + rest resources,
with the intention of reducing the overall surface area of GMS.
We will use this configuration to drive more generic versions of the index builders + rest resources,
with the intention of reducing the overall surface area of GMS.

During this initiative, we will also seek to make the concepts of Browse Paths and Urns declarative. Browse Paths
will be provided using a special BrowsePaths aspect. Urns will no longer be strongly typed.
will be provided using a special BrowsePaths aspect. Urns will no longer be strongly typed.
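
To make "no longer strongly typed" concrete, here is a minimal sketch of handling an urn as a plain string instead of through a per-entity Urn class. It assumes urns follow the `urn:li:<entityType>:<key>` shape seen in this document's examples (such as `urn:li:service:mydemoservice`); the `ParsedUrn` type and `parse_urn` function are hypothetical names for illustration, not DataHub APIs.

```python
from typing import NamedTuple


class ParsedUrn(NamedTuple):
    """Generic view of an urn: no per-entity class is needed (hypothetical type)."""
    entity_type: str
    key: str


def parse_urn(urn: str) -> ParsedUrn:
    # Assumption: urns have the form "urn:li:<entityType>:<key>".
    # maxsplit=3 keeps any further colons inside the key part.
    parts = urn.split(":", 3)
    if len(parts) != 4 or parts[0] != "urn" or parts[1] != "li" or not parts[2] or not parts[3]:
        raise ValueError(f"Not a recognised urn: {urn!r}")
    return ParsedUrn(entity_type=parts[2], key=parts[3])


service = parse_urn("urn:li:service:mydemoservice")
print(service.entity_type, service.key)
```

One parser of this kind can serve every entity type, which is the trade the design makes: less compile-time safety in exchange for not writing an Urn class per entity.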

To achieve this, we will attempt to generify many components throughout the stack. Currently, many of them are defined on
a *per-entity* basis, including
To achieve this, we will attempt to generify many components throughout the stack. Currently, many of them are defined on
a _per-entity_ basis, including

- Rest.li Resources
- Index Builders
- Graph Builders
- Local, Search, Browse, Graph DAOs
- Clients
- Clients
- Browse Path Logic

along with simplifying the number of raw data models that need defined, including
along with simplifying the number of raw data models that need defined, including

- Rest.li Resource Models
- Search Document Models
@@ -159,39 +163,43 @@ along with simplifying the number of raw data models that need defined, includin

From an architectural PoV, we will move from a before that looks something like this:

![no-code-before](../imgs/no-code-before.png)
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/acryldata/static-assets-test/master/imgs/no-code-before.png"/>
</p>

to an after that looks like this

![no-code-after](../imgs/no-code-after.png)
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/acryldata/static-assets-test/master/imgs/no-code-after.png"/>
</p>

That is, a move away from patterns of strong-typing-everywhere to a more generic + flexible world.
That is, a move away from patterns of strong-typing-everywhere to a more generic + flexible world.

### How will we do it?

We will accomplish this by building the following:

1. Set of custom annotations to permit declarative entity, search, graph configurations
- @Entity & @Aspect
- @Searchable
- @Relationship
- @Entity & @Aspect
- @Searchable
- @Relationship
2. Entity Registry: In-memory structures for representing, storing & serving metadata associated with a particular Entity, including search and relationship configurations.
3. Generic Entity, Search, Graph Service classes: Replaces traditional strongly-typed DAOs with flexible, pluggable APIs that can be used for CRUD, search, and graph across all entities.
2. Generic Rest.li Resources:
- 1 permitting reading, writing, searching, autocompleting, and browsing arbitrary entities
- 1 permitting reading of arbitrary entity-entity relationship edges
2. Generic Search Index Builder: Given a MAE and a specification of the Search Configuration for an entity, updates the search index.
3. Generic Graph Index Builder: Given a MAE and a specification of the Relationship Configuration for an entity, updates the graph index.
4. Generic Index + Mappings Builder: Dynamically generates index mappings and creates indices on the fly.
5. Introduce of special aspects to address other imperative code requirements
- BrowsePaths Aspect: Include an aspect to permit customization of the indexed browse paths.
- Key aspects: Include "virtual" aspects for representing the fields that uniquely identify an Entity for easy
reading by clients of DataHub.
3. Generic Entity, Search, Graph Service classes: Replaces traditional strongly-typed DAOs with flexible, pluggable APIs that can be used for CRUD, search, and graph across all entities.
4. Generic Rest.li Resources:
- 1 permitting reading, writing, searching, autocompleting, and browsing arbitrary entities
- 1 permitting reading of arbitrary entity-entity relationship edges
5. Generic Search Index Builder: Given a MAE and a specification of the Search Configuration for an entity, updates the search index.
6. Generic Graph Index Builder: Given a MAE and a specification of the Relationship Configuration for an entity, updates the graph index.
7. Generic Index + Mappings Builder: Dynamically generates index mappings and creates indices on the fly.
8. Introduce of special aspects to address other imperative code requirements
- BrowsePaths Aspect: Include an aspect to permit customization of the indexed browse paths.
- Key aspects: Include "virtual" aspects for representing the fields that uniquely identify an Entity for easy
reading by clients of DataHub.
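
The building blocks above can be sketched roughly as follows: a registry holds declarative per-aspect configuration, and generic builders use it to derive a search document and relationship edges for any entity. This is a minimal sketch under stated assumptions, not the actual implementation. Every class and function name here is hypothetical, and the `service` / `serviceInfo` names simply mirror the "Service" example used later in this document.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AspectSpec:
    """Declarative configuration for one aspect (hypothetical structure)."""
    name: str
    # Fields that a @Searchable-style annotation would mark for indexing.
    searchable_fields: List[str] = field(default_factory=list)
    # Maps a field holding a destination urn to the relationship name it declares.
    relationship_fields: Dict[str, str] = field(default_factory=dict)


@dataclass
class EntitySpec:
    name: str
    aspects: Dict[str, AspectSpec] = field(default_factory=dict)


class EntityRegistry:
    """In-memory lookup of entity configuration by entity name."""

    def __init__(self) -> None:
        self._specs: Dict[str, EntitySpec] = {}

    def register(self, spec: EntitySpec) -> None:
        self._specs[spec.name] = spec

    def get(self, entity_name: str) -> EntitySpec:
        return self._specs[entity_name]


def build_search_document(aspect: AspectSpec, value: dict) -> dict:
    # Generic search builder: copy only the fields declared searchable.
    return {f: value[f] for f in aspect.searchable_fields if f in value}


def build_edges(aspect: AspectSpec, source_urn: str, value: dict) -> List[Tuple[str, str, str]]:
    # Generic graph builder: one (source, relationship, destination) edge per declared field.
    return [(source_urn, rel, value[f]) for f, rel in aspect.relationship_fields.items() if f in value]


registry = EntityRegistry()
registry.register(EntitySpec(
    name="service",
    aspects={"serviceInfo": AspectSpec(
        name="serviceInfo",
        searchable_fields=["description"],
        relationship_fields={"owner": "OwnedBy"},
    )},
))

aspect = registry.get("service").aspects["serviceInfo"]
value = {"description": "My demo service", "owner": "urn:li:corpuser:user1"}
doc = build_search_document(aspect, value)
edges = build_edges(aspect, "urn:li:service:mydemoservice", value)
print(doc)
print(edges)
```

Neither builder contains entity-specific logic, so adding an entity only means registering another spec.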

### Final Developer Experience: Defining an Entity

We will outline what the experience of adding a new Entity should look like. We will imagine we want to define a "Service" entity representing
online microservices.
online microservices.

#### Step 1. Add aspects

@@ -236,15 +244,15 @@ record ServiceInfo {
/**
* Description of the service
*/
@Searchable = {}
@Searchable = {}
description: string
/**
* The owners of the
*/
@Relationship = {
"name": "OwnedBy",
"entityTypes": ["corpUser"]
"entityTypes": ["corpUser"]
}
owner: Urn
}
@@ -310,7 +318,7 @@ namespace com.linkedin.metadata.snapshot
* A union of all supported metadata snapshot types.
*/
typeref Snapshot = union[
...
...
ServiceSnapshot
]
```
@@ -321,15 +329,15 @@ typeref Snapshot = union[

```
curl 'http://localhost:8080/entities?action=ingest' -X POST -H 'X-RestLi-Protocol-Version:2.0.0' --data '{
"entity":{
"entity":{
"value":{
"com.linkedin.metadata.snapshot.ServiceSnapshot":{
"urn": "urn:li:service:mydemoservice",
"aspects":[
{
"com.linkedin.service.ServiceInfo":{
"description":"My demo service",
"owner": "urn:li:corpuser:user1"
"owner": "urn:li:corpuser:user1"
}
},
{
@@ -400,4 +408,3 @@ curl --location --request POST 'http://localhost:8080/entities?action=browse' \
curl --location --request GET 'http://localhost:8080/relationships?direction=INCOMING&urn=urn%3Ali%3Acorpuser%3Auser1&types=OwnedBy' \
--header 'X-RestLi-Protocol-Version: 2.0.0'
```

4 changes: 3 additions & 1 deletion docs/api/graphql/how-to-set-up-graphql.md
@@ -62,7 +62,9 @@ Postman is a popular API client that provides a graphical user interface for sen
Within Postman, you can create a `POST` request and set the request URL to the `/api/graphql` endpoint.
In the request body, select the `GraphQL` option and enter your GraphQL query in the request body.

![postman-graphql](../../imgs/apis/postman-graphql.png)
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/acryldata/static-assets-test/master/imgs/apis/postman-graphql.png"/>
</p>

Please refer to [Querying with GraphQL](https://learning.postman.com/docs/sending-requests/graphql/graphql/) in the Postman documentation for more information.
