Scripts and AZDO templates for Databricks SQL queries and jobs.
This is a very basic toolset for simple scenarios. It will continue to be developed until a better option is provided by Databricks or the community.
Databricks jobs and SQL queries are not backed by Git at this time, which creates a challenge when queries and jobs need to be promoted between Databricks environments.
This repo contains a few resources that aim to provide a basic CI/CD pipeline for those components using the Databricks API: exporting from a source environment, committing to Git, and deploying to a target environment.
- The jobs in the source DB environment follow the naming standard $projectname_$environment_$jobname (for example, myproject_dev_ingest).
- The Databricks Git repo branch for notebooks in the source environment is develop.
- The SQL queries in the source DB environment follow the naming standard $environment_$projectname_$queryname (for example, dev_myproject_daily_revenue).
- A developer develops jobs and queries in the DB UI in the source environment.
- A developer creates config files with a list of jobs and queries and commits them to the repo under the jobs and sqlqueries folders respectively.
- The job and query definitions are exported and committed to the repo under jobs/src and sqlqueries/src.
- The exported definitions are updated with the target environment values (QA name, QA cluster ID, etc.) and deployed to the target environment.
The Databricks CLI and a configured profile are required. In the examples below, the profile name is AZDO.
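If you don't have a profile yet, one way to set it up is with the CLI itself (the profile name AZDO matches the examples; the host and personal access token are prompted for):

pip install databricks-cli
databricks configure --token --profile AZDO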
The config file (sample/jobs/jobs_params.json) has a list of environments and jobs for a project.
This file is provided as a parameter to the pulljobs.sh script. The script downloads each job's JSON using the API, updates it, and saves it under the jobs/src folder.
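The sample file is the authoritative reference for the schema; purely as an illustration, a config of this kind might look something like the following (all field names and values here are hypothetical):

# Hypothetical shape only -- see sample/jobs/jobs_params.json for the real schema
cat > jobs/jobs_params.json <<'EOF'
{
  "project": "myproject",
  "environments": ["dev", "qa"],
  "jobs": ["ingest", "transform"]
}
EOF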
./pulljobs.sh -c "jobs/jobs_params.json" -p "AZDO"
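For context, pulling a job boils down to resolving its conventional name to a job ID and fetching the definition through the Jobs 2.1 REST API. A rough sketch (not the script's exact logic; names and paths are illustrative, and a real script must paginate jobs/list):

TOKEN="<personal-access-token>"
URL="https://adb-XXXXXXXXXXXXXXXXXX.XXXXX.azuredatabricks.net"
# Find the job ID by name (first page only, for brevity)
JOB_ID=$(curl -s -H "Authorization: Bearer $TOKEN" "$URL/api/2.1/jobs/list" |
  jq -r '.jobs[] | select(.settings.name == "myproject_dev_ingest") | .job_id')
# Save the full job definition
curl -s -H "Authorization: Bearer $TOKEN" "$URL/api/2.1/jobs/get?job_id=$JOB_ID" > jobs/src/myproject_dev_ingest.json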
For deployment, all the jobs from a folder are deployed, while updating the cluster ID (-c), the branch name (-b), and the environment part of the job name (-e):
./deployjobs.sh -c qa-cluster -e qa -b release -s jobs/src -p "AZDO"
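Conceptually, the rewrite is a JSON transformation on each exported job: every task's cluster reference is pointed at the target cluster, and the environment token in the job name is swapped. A minimal jq sketch, assuming jobs were exported via the Jobs 2.1 API (the actual script logic may differ):

# Illustrative only: retarget cluster IDs and rename dev -> qa in one job definition
jq --arg cid "$QA_CLUSTER_ID" '
  (.settings.tasks[]? | select(has("existing_cluster_id")) | .existing_cluster_id) |= $cid
  | .settings.name |= sub("_dev_"; "_qa_")
' jobs/src/myproject_dev_ingest.json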
The config file (sample/sqlqueries/queries_params.json) has a list of environments and queries for a project.
Each query definition contains the source query name and whether the query schedule will be saved and thus promoted.
Each query uses an env parameter, which is referenced inside the queries.
Since queries can depend on one another, the script recursively gets all of the dependencies of the queries in the list.
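Again, the sample file is authoritative for the schema; just to fix ideas, a config along these lines could look roughly like this (every field name below is hypothetical):

# Hypothetical shape only -- see sample/sqlqueries/queries_params.json for the real schema
cat > sqlqueries/queries_dev_params.json <<'EOF'
{
  "project": "myproject",
  "environments": ["dev", "qa"],
  "queries": [
    { "name": "dev_myproject_daily_revenue", "keepSchedule": true }
  ]
}
EOF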
Pull SQL queries from the source environment:
./pullsqlqueries.sh -c "sqlqueries/queries_dev_params.json" -t "DATABRICKSTOKEN" -u "https://adb-XXXXXXXXXXXXXXXXXX.XXXXX.azuredatabricks.net"
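Under the hood this corresponds to the (preview) Databricks SQL Queries REST API. Roughly, per query (a sketch with illustrative names; pagination and the recursive dependency walk are omitted):

# Search for a query by name and save the first match
curl -s -H "Authorization: Bearer $DATABRICKSTOKEN" \
  "$DATABRICKS_URL/api/2.0/preview/sql/queries?q=dev_myproject_daily_revenue" |
  jq '.results[0]' > sqlqueries/src/dev_myproject_daily_revenue.json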
Deploy SQL queries to the target environment:
./deploysqlqueries.sh -c "sqlqueries/queries_dev_params.json" -s "sqlqueries/src" -t "DATABRICKSTOKEN" -u "https://adb-XXXXXXXXXXXXXXXXXX.XXXXX.azuredatabricks.net" -e "dev"
Bash scripts for pulling/deploying queries/jobs.
The scripts can be used as part of AZDO templates and/or pipelines. The azdo_templates folder contains some templates for Databricks CI/CD pipelines.
getAzureDatabricksTokenJob.yml is specific to Azure; the others are agnostic to how Databricks is hosted.
# Sample project pipeline
# develop - push git changes to DB, pull jobs and queries to git and commit with [no-ci]
# qa - push git changes to DB, deploy jobs and queries to DB
name: project-$(Build.SourceBranchName)-$(Rev:r)

trigger:
- develop
- release
- production

pr: none

resources:
  repositories:
  - repository: AzurePipelineTemplates
    type: github
    name: pjelou/databricks-sql-jobs-cicd
    endpoint: GithubServiceConnection
    ref: refs/heads/main

variables:
- name: isProduction
  value: $[eq(variables['Build.SourceBranch'], 'refs/heads/production')]
- name: isDevelop
  value: $[eq(variables['Build.SourceBranch'], 'refs/heads/develop')]
- name: isRelease
  value: $[eq(variables['Build.SourceBranch'], 'refs/heads/release')]
- name: isManual
  value: $[eq(variables['Build.Reason'], 'Manual')]

pool:
  vmImage: 'ubuntu-latest'

stages:
# ---------------------------------------------------------------------------------------------------------------------
# Dev stage
# ---------------------------------------------------------------------------------------------------------------------
- stage: dev
  dependsOn: []
  displayName: "Deploy to Dev"
  condition: and(succeeded(), eq(variables.isDevelop, true))
  variables:
  - group: variable-group-dev
  jobs:
  - deployment: repo
    displayName: "pull git repo"
    environment: "project DEV"
    strategy:
      runOnce:
        deploy:
          steps:
          - template: pipeline-templates/databricks-git-pull.yml@AzurePipelineTemplates
            parameters:
              dataBricksURL: '$(DatabricksURL)'
              repoId: '1234567891011'
              repoBranch: 'develop'
              databricks_git_token: '$(databricks-git-token)'
  # Get Databricks Token
  - template: pipeline-templates/getAzureDatabricksTokenJob.yml@AzurePipelineTemplates
    parameters:
      spAppId: '$(ServicePrincipalClientAD)'
      spPassword: '$(ServicePrincipalSecret)'
      ResourceGroupName: 'DatabricksRG'
      DatabricksWorkspaceName: 'DatabricksWorkspaceName'
      SubscriptionId: $(SubscriptionID)
      TenantId: $(TenantID)
      AzureRegion: 'eastus'
      Jobname: 'getdbtoken'
  - job: queries
    dependsOn: [getdbtoken]
    displayName: "Pull databricks sql queries"
    pool:
      vmImage: 'ubuntu-latest'
    variables:
      BearerToken: $[dependencies.getdbtoken.outputs['outputBearerToken.BearerToken']]
    steps:
    - checkout: self
      path: self
      persistCredentials: true
    - template: pipeline-templates/pullDatabricksSqlQueries.yml@AzurePipelineTemplates
      parameters:
        databricksToken: $(BearerToken)
        dataBricksURL: '$(DatabricksURL)'
        queriesConfigFilePath: 'sqlqueries/queries_params.json'
  - job: jobs
    dependsOn: [getdbtoken]
    displayName: "Pull databricks jobs"
    pool:
      vmImage: 'ubuntu-latest'
    variables:
      BearerToken: $[dependencies.getdbtoken.outputs['outputBearerToken.BearerToken']]
    steps:
    - checkout: self
      path: self
      persistCredentials: true
    - template: pipeline-templates/pullDatabricksJobs.yml@AzurePipelineTemplates
      parameters:
        databricksToken: $(BearerToken)
        dataBricksURL: '$(DatabricksURL)'
        jobsConfigFilePath: 'jobs/jobs_params.json'
# ---------------------------------------------------------------------------------------------------------------------
# QA stage
# ---------------------------------------------------------------------------------------------------------------------
- stage: qa
  dependsOn: []
  displayName: "Deploy to QA"
  condition: and(succeeded(), eq(variables.isRelease, true))
  variables:
  - group: variable-group-qa
  jobs:
  - deployment: repo
    displayName: "pull git repo"
    environment: "project QA"
    strategy:
      runOnce:
        deploy:
          steps:
          - template: pipeline-templates/databricks-git-pull.yml@AzurePipelineTemplates
            parameters:
              dataBricksURL: '$(DatabricksURL)'
              repoId: '12345678910111213'
              repoBranch: 'release'
              databricks_git_token: '$(databricks-git-token)'
  # Get Databricks Token
  - template: pipeline-templates/getAzureDatabricksTokenJob.yml@AzurePipelineTemplates
    parameters:
      spAppId: '$(ServicePrincipalClientAD)'
      spPassword: '$(ServicePrincipalSecret)'
      ResourceGroupName: 'DatabricksRG'
      DatabricksWorkspaceName: 'DatabricksWorkspaceName'
      SubscriptionId: $(SubscriptionID)
      TenantId: $(TenantID)
      AzureRegion: 'eastus'
      Jobname: 'getdbtoken'
  - deployment: queries
    condition: and(succeeded(), eq(variables.isManual, true))
    dependsOn: [getdbtoken]
    displayName: "Deploy queries"
    environment: "project QA"
    variables:
      BearerToken: $[dependencies.getdbtoken.outputs['outputBearerToken.BearerToken']]
    strategy:
      runOnce:
        deploy:
          steps:
          - checkout: self
            path: self
            persistCredentials: true
          - template: pipeline-templates/deployDatabricksSqlQueries.yml@AzurePipelineTemplates
            parameters:
              databricksToken: $(BearerToken)
              dataBricksURL: '$(DatabricksURL)'
              queriesConfigFilePath: 'sqlqueries/queries_params.json'
              queriesFilesPath: 'sqlqueries/src'
              envName: 'qa'
  - deployment: jobs
    condition: and(succeeded(), eq(variables.isManual, true))
    dependsOn: [getdbtoken, repo]
    displayName: "Deploy jobs"
    environment: "project QA"
    variables:
      BearerToken: $[dependencies.getdbtoken.outputs['outputBearerToken.BearerToken']]
    strategy:
      runOnce:
        deploy:
          steps:
          - checkout: self
            path: self
          - template: pipeline-templates/deployDatabricksJobs.yml@AzurePipelineTemplates
            parameters:
              databricksToken: $(BearerToken)
              dataBricksURL: '$(DatabricksURL)'
              dataBricksClusterName: 'qa-cluster'
              envName: 'qa'
              jobsDirPath: 'jobs/src'
              branchName: 'release'
- The scripts assume the naming conventions described above are followed.
- The SQL/job scripts will fail if there are multiple jobs/queries with the same name.
- All jobs are assumed to use the same cluster for all jobs/tasks/sub-tasks in the target environment.
- All SQL queries are assumed to use the same data source in the target environment.
Make a PR to this repo or fork it and make a better one! Any contributions are welcome.