tech-talks

This repository contains the notebooks and presentations we use for our Databricks Tech Talks.

You can find links to the tech talks below as well as the notebooks for these sessions directly in the repo.

Sections

Upcoming Tech Talks
Featured
Previous Tech Talks
COVID-19 Samples
- Datasets
- Notebooks

Upcoming-Tech-Talks

2020-04-29 - Workshop | Introduction to Data Analysis for Aspiring Data Scientists: Introduction to Apache Spark

This workshop covers the fundamentals of Apache Spark, the most popular big data processing engine. In this workshop, you will learn how to ingest data with Spark, analyze the Spark UI, and gain a better understanding of distributed computing. We will be using data released by the Johns Hopkins Center for Systems Science and Engineering (CSSE) Novel Coronavirus (COVID-19). Prior basic Python experience is recommended.

2020-04-30 - Using Delta as a Change Data Capture Source

While it is common to use Delta Lake as a sink for change data captured from traditional data sources, customers are increasingly asking how to use Delta tables as a source for a change data capture (CDC) process. Stated differently: how can we read a stream of changes from a Delta table so that they can be propagated downstream? In each of these cases, we want to capture a change stream from a Delta table and send it somewhere for further processing. In this session, we will discuss the architecture, use cases, and solutions.
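One way to picture a change stream is as the difference between two versions of a table. The sketch below is plain Python, not the Delta Lake API, and not necessarily the approach presented in the session. The `diff_snapshots` helper, the `id` key, and the sample rows are all hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def diff_snapshots(old: List[Row], new: List[Row], key: str = "id") -> List[Tuple[str, Row]]:
    """Return (operation, row) pairs that turn the `old` snapshot into `new`."""
    old_by_key = {row[key]: row for row in old}
    new_by_key = {row[key]: row for row in new}
    changes: List[Tuple[str, Row]] = []
    # Rows present in the new snapshot are either inserts or updates.
    for k, row in new_by_key.items():
        if k not in old_by_key:
            changes.append(("insert", row))
        elif row != old_by_key[k]:
            changes.append(("update", row))
    # Rows that disappeared from the new snapshot are deletes.
    for k, row in old_by_key.items():
        if k not in new_by_key:
            changes.append(("delete", row))
    return changes


# Hypothetical table contents at two consecutive versions.
version_0 = [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}]
version_1 = [{"id": 1, "status": "shipped"}, {"id": 3, "status": "new"}]

changes = diff_snapshots(version_0, version_1)
for operation, row in changes:
    print(operation, row)
```

Each `(operation, row)` pair is what a downstream consumer would receive. Comparing full snapshots is only practical for small tables, so treat this as an illustration of the idea rather than a production pattern.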

Featured

Notebook | Johns Hopkins CSSE COVID-19 Analysis

This notebook processes and performs a quick analysis of the 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE (https://github.com/CSSEGISandData/COVID-19). The data is updated in the `/databricks-datasets/COVID/CSSEGISandData/` location regularly so you can access the data directly. The following animated GIF shows the COVID-19 confirmed cases and deaths per 100K people per the Johns Hopkins CSSE dataset spanning March 22nd to April 14th, 2020.
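The "per 100K people" figures are a simple normalization of raw counts by population. A minimal sketch follows; the `per_100k` helper and the numbers are hypothetical and are not taken from the notebook or the dataset.

```python
def per_100k(count: int, population: int) -> float:
    """Scale a raw count to a rate per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply first so the division is the only rounding step.
    return count * 100_000 / population


# Illustrative numbers only.
rate = per_100k(count=250, population=500_000)
print(rate)  # 50.0
```

Normalizing this way makes regions with very different populations comparable on the same scale.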

Notebook | NY Times COVID-19 Analysis

This notebook processes and performs a quick analysis of the NY Times COVID-19 dataset (https://github.com/nytimes/covid-19-data). The data is updated in the `/databricks-datasets/COVID/covid-19-data/` location regularly so you can access the data directly. The following animated GIFs show the COVID-19 confirmed cases and deaths per 100K people from the NY Times dataset, spanning a two-week window around when educational facilities were closed in Washington (3/13) and New York (3/18) states.

Previous-Tech-Talks

2020-04-23 - Predictive Maintenance (PdM) on IoT Data for Early Fault Detection w/ Delta Lake

Predictive Maintenance (PdM) differs from routine or time-based maintenance approaches: it combines various sensor readings with sophisticated analytics on thousands of logged events in near real time, and it promises several-fold improvements in cost savings because tasks are performed only when warranted. The collaborative Data and Analytics platform from Databricks is a great technology fit for these use cases, providing a single unified platform to ingest the sensor data, perform the necessary transformations and exploration, run ML, and generate valuable insights.

2020-04-22 - Workshop | Introduction to Data Analysis for Aspiring Data Scientists: Machine Learning with scikit-learn

scikit-learn is one of the most popular open-source machine learning libraries among data science practitioners. This workshop will walk through what machine learning is, the different types of machine learning, and how to build a simple machine learning model. This workshop focuses on the techniques of applying and evaluating machine learning methods, rather than the statistical concepts behind them. We will be using data released by the Johns Hopkins Center for Systems Science and Engineering (CSSE) Novel Coronavirus (COVID-19). Prior basic Python experience is recommended.

2020-04-16 - Diving into Delta Lake: DML Internals

In the earlier sessions of the Delta Lake Internals webinar series, we described how the Delta Lake transaction log works. In this session, we will dive deeper into how commits, snapshot isolation, and partitions and files change when performing deletes, updates, merges, and structured streaming.

2020-04-15 - Workshop | Introduction to Data Analysis for Aspiring Data Scientists: Data Analysis with Pandas

This workshop is on pandas, a powerful open-source Python package for data analysis and manipulation. In this workshop, you will learn how to read data, compute summary statistics, check data distributions, conduct basic data cleaning and transformation, and plot simple visualizations. We will be using data released by the Johns Hopkins Center for Systems Science and Engineering (CSSE) Novel Coronavirus (COVID-19). Prior basic Python experience is recommended.

2020-04-08 - Workshop | Introduction to Data Analysis for Aspiring Data Scientists: Introduction to Python on Databricks

Python is a popular programming language because of its wide applications including but not limited to data analysis, machine learning, and web development. This workshop covers major foundational concepts necessary for you to start coding in Python, with a focus on data analysis. You will learn about different types of variables, for loops, functions, and conditional statements. No prior programming knowledge is required.

2020-04-02 - Diving into Delta Lake: Enforcing and Evolving Schema

As business problems and requirements evolve over time, so too does the structure of your data. With Delta Lake, as the data changes, incorporating new dimensions is easy. Users have access to simple semantics to control the schema of their tables. These tools include schema enforcement, which prevents users from accidentally polluting their tables with mistakes or garbage data, as well as schema evolution, which enables them to automatically add new columns of rich data when those columns belong. In this webinar, we’ll dive into the use of these tools.
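The difference between enforcement and evolution can be sketched in plain Python. This is a toy model, not Delta Lake code: `append_rows`, `SchemaMismatchError`, and the sample columns are hypothetical, and the `merge_schema` flag is only loosely modeled on Delta Lake's `mergeSchema` write option.

```python
from typing import Dict, List

Schema = Dict[str, type]
Row = Dict[str, object]


class SchemaMismatchError(Exception):
    """Raised when a row does not match the table schema."""


def append_rows(schema: Schema, table: List[Row], rows: List[Row],
                merge_schema: bool = False) -> Schema:
    """Validate rows against the schema, then append them to the table.

    With merge_schema=False, unknown columns are rejected (enforcement).
    With merge_schema=True, unknown columns are added (evolution).
    """
    new_schema = dict(schema)
    for row in rows:
        for column, value in row.items():
            if column not in new_schema:
                if not merge_schema:
                    raise SchemaMismatchError(f"unexpected column: {column}")
                new_schema[column] = type(value)
            elif not isinstance(value, new_schema[column]):
                raise SchemaMismatchError(
                    f"{column} expects {new_schema[column].__name__}")
    # Only reached when every row passed validation.
    table.extend(rows)
    return new_schema


schema: Schema = {"id": int, "name": str}
table: List[Row] = []
schema = append_rows(schema, table, [{"id": 1, "name": "a"}])

try:
    append_rows(schema, table, [{"id": 2, "name": "b", "country": "US"}])
except SchemaMismatchError as err:
    print("rejected:", err)

schema = append_rows(schema, table,
                     [{"id": 2, "name": "b", "country": "US"}],
                     merge_schema=True)
print(sorted(schema))
```

The rejected write leaves the table untouched, which is the point of enforcement. The same rows are accepted only after the caller explicitly opts in to evolving the schema.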

2020-03-26 - Diving into Delta Lake: Unpacking the Transaction Log

The transaction log is key to understanding Delta Lake because it is the common thread that runs through many of its most important features, including ACID transactions, scalable metadata handling, time travel, and more. In this session, we’ll explore what the Delta Lake transaction log is, how it works at the file level, and how it offers an elegant solution to the problem of multiple concurrent reads and writes.
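To make the idea concrete, here is a simplified sketch of replaying an ordered log to find which data files make up a table at a given version. It is not the Delta Lake implementation: real commits carry much richer metadata, and the file names, the in-memory `log` list, and the `live_files` helper are hypothetical.

```python
from typing import Dict, List, Optional, Set

# Each commit is an ordered list of actions. The list index stands in for the
# table version. File names are made up for illustration.
log: List[List[Dict[str, str]]] = [
    [{"add": "part-0.parquet"}, {"add": "part-1.parquet"}],     # version 0
    [{"add": "part-2.parquet"}],                                # version 1
    [{"remove": "part-0.parquet"}, {"add": "part-3.parquet"}],  # version 2
]


def live_files(commits: List[List[Dict[str, str]]],
               version: Optional[int] = None) -> Set[str]:
    """Replay commits up to `version` (inclusive) and return the live files."""
    if version is None:
        version = len(commits) - 1
    files: Set[str] = set()
    for commit in commits[: version + 1]:
        for action in commit:
            if "add" in action:
                files.add(action["add"])
            elif "remove" in action:
                files.discard(action["remove"])
    return files


print(sorted(live_files(log)))             # latest version
print(sorted(live_files(log, version=0)))  # reading an earlier version
```

Because the state at any version is derived by replaying the log up to that point, reading an older version is just a matter of stopping the replay earlier.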

2020-03-19 - Analyzing COVID-19: Can the Data Community Help?

With the current concerns over SARS-CoV-2 and COVID-19, there are now various COVID-19 datasets on Kaggle and GitHub, competitions such as the COVID-19 Open Research Dataset Challenge (CORD-19), and models such as University of Washington’s Institute for Health Metrics and Evaluation (IHME) COVID-19 Projections. Whether you are a student or a professional data scientist, we thought we could help out by providing educational sessions on how to analyze these datasets.

2020-03-19 - Machine Learning Lessons Learned from the Field: Interview with Brooke Wenig

Developer Advocate Denny Lee will interview Brooke Wenig, Machine Learning Practice Lead, on the best practices and patterns when developing, training, and deploying Machine Learning algorithms in production.

2020-03-12 - Simplify and Scale Data Engineering Pipelines with Delta Lake

A common data engineering pipeline architecture uses tables that correspond to different quality levels, progressively adding structure to the data: data ingestion (“Bronze” tables), transformation/feature engineering (“Silver” tables), and machine learning training or prediction (“Gold” tables). Combined, we refer to these tables as a “multi-hop” architecture. It allows data engineers to build a pipeline that begins with raw data as a “single source of truth” from which everything flows. In this session, we will show how to build a scalable data engineering pipeline using Delta Lake.
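The three hops can be sketched with plain Python lists standing in for tables. This is an illustration of the layering only, not Delta Lake code or the session's notebook. The sample events and the `to_bronze`, `to_silver`, and `to_gold` helpers are hypothetical, and the Gold step is a simple aggregate standing in for the tables used in training or prediction.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Hypothetical raw input, including one malformed record.
raw_events = [
    '{"device": "a", "temp": "21.5"}',
    '{"device": "a", "temp": "22.5"}',
    'not json',
    '{"device": "b", "temp": "19.0"}',
]


def to_bronze(lines: List[str]) -> List[Dict[str, str]]:
    """Bronze: keep every raw record exactly as it arrived."""
    return [{"raw": line} for line in lines]


def to_silver(bronze: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Silver: parse and type the records, dropping those that fail."""
    silver = []
    for record in bronze:
        try:
            parsed = json.loads(record["raw"])
            silver.append({"device": parsed["device"],
                           "temp": float(parsed["temp"])})
        except (ValueError, KeyError, TypeError):
            continue  # the raw record is still available in Bronze
    return silver


def to_gold(silver: List[Dict[str, object]]) -> Dict[str, float]:
    """Gold: average temperature per device."""
    readings = defaultdict(list)
    for row in silver:
        readings[row["device"]].append(row["temp"])
    return {device: sum(values) / len(values)
            for device, values in readings.items()}


bronze = to_bronze(raw_events)
silver = to_silver(bronze)
gold = to_gold(silver)
print(len(bronze), len(silver), gold)
```

Keeping the malformed record in Bronze while excluding it from Silver shows why the raw layer acts as the single source of truth: downstream tables can always be rebuilt from it if the parsing rules change.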

2020-03-05 - Beyond Lambda: Introducing Delta Architecture

Lambda architecture is a popular technique where records are processed by a batch system and a streaming system in parallel. With the advent of Delta Lake, we are seeing a lot of our customers adopting a simple continuous data flow model to process data as it arrives. We call this architecture the Delta Architecture. In this session, we cover the major bottlenecks for adopting a continuous data flow model and how the Delta Architecture solves those problems.

2020-02-27 - Getting Data Ready for Data Science with Delta Lake and MLflow

One must take a holistic view of the entire data analytics realm when it comes to planning for data science initiatives. Data engineering is a key enabler of data science, helping furnish reliable, quality data in a timely fashion. Delta Lake, an open-source storage layer that brings reliability to data lakes, can help take your data reliability to the next level.

2020-02-19 - The Genesis of Delta Lake - An Interview with Burak Yavuz

New decade, new start! Let's kick off 2020 with our first online meetup of the year featuring Burak Yavuz, Software Engineer at Databricks, for a talk about the genesis of Delta Lake. Developer Advocate Denny Lee will interview Burak Yavuz to learn about the Delta Lake team's decision-making process and why they designed, architected, and implemented Delta Lake the way it is today. Understand the technical challenges that the team faced, how those challenges were solved, and learn about the plans for the future.

COVID-19-Samples

This section contains links to COVID-19 sample datasets and notebooks.

Datasets

| `/databricks-datasets/[location]` | Resource |
| --- | --- |
| `/../COVID/CORD-19/` | COVID-19 Open Research Dataset Challenge (CORD-19) |
| `/../COVID/CSSEGISandData/` | 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE |
| `/../COVID/ESRI_hospital_beds/` | Definitive Healthcare: USA Hospital Beds |
| `/../COVID/IHME/` | IHME (UW) COVID-19 Projections |
| `/../COVID/USAFacts/` | USA Facts: Confirmed \| Deaths |
| `/../COVID/coronavirusdataset/` | Data Science for COVID-19 (DS4C) (South Korea) |
| `/../COVID/covid-19-data/` | NY Times COVID-19 Datasets |
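In the table above, the leading `/../` is shorthand for the `/databricks-datasets` root named in the column header. The small helper below expands a table entry to its full location; `full_path` and `DATASETS_ROOT` are hypothetical names used only for this sketch.

```python
DATASETS_ROOT = "/databricks-datasets"


def full_path(short_path: str) -> str:
    """Expand a '/../COVID/...' table entry to its full dataset location."""
    prefix = "/../"
    if not short_path.startswith(prefix):
        raise ValueError(f"expected a path starting with {prefix!r}")
    return f"{DATASETS_ROOT}/{short_path[len(prefix):]}"


print(full_path("/../COVID/CSSEGISandData/"))
# /databricks-datasets/COVID/CSSEGISandData/
```

The result matches the location quoted in the Johns Hopkins notebook description earlier in this README.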

Notebooks

| Notebooks | Description | Datasets Used |
| --- | --- | --- |
| Load JSON Datasets | Loading CORD-19 JSON Datasets | COVID-19 Open Research Dataset Challenge (CORD-19) |
| Analyzing CORD-19 Datasets | Exploratory Data Analysis of the CORD-19 dataset | COVID-19 Open Research Dataset Challenge (CORD-19) |
| NLP - Exploring CV19 Literature | Exploring CORD-19 Literature using NLP | COVID-19 Open Research Dataset Challenge (CORD-19) |
| South Korea COVID-19 Analysis | Exploratory Data Analysis of the South Korea COVID-19 dataset | Data Science for COVID-19 (DS4C) (South Korea) |
| Johns Hopkins COVID-19 Analysis | Exploratory Data Analysis of the Johns Hopkins CSSE COVID-19 dataset | 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE |
| NY Times COVID-19 Analysis | Exploratory Data Analysis of the NY Times COVID-19 dataset | NY Times COVID-19 Datasets |


