This project uses PySpark to process a large dataset, applying Spark SQL queries and data transformations. The dataset comes from FiveThirtyEight and contains information about the guests of The Daily Show.
- Open the Codespaces environment and wait for it to finish installing.
- Run the pipeline: `python main.py`
- View the results in the PySpark Output Data/Summary File.
- Format the code: `make format`
- Lint the code: `make lint`
- Run the tests: `make test` (an illustrative test is sketched below)
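The repository's actual test suite is not shown here, but `make test` in projects like this typically invokes pytest. The snippet below is a minimal sketch of the kind of smoke test it might run; the column names and test layout are assumptions for illustration only.

```python
"""Hypothetical pytest-style smoke test (not the project's actual tests)."""
from pyspark.sql import SparkSession


def test_spark_can_load_sample_rows(tmp_path):
    """Spark should read a small CSV and return the expected row count."""
    # YEAR and Group are assumed column names matching the Daily Show guests data.
    csv = tmp_path / "sample.csv"
    csv.write_text("YEAR,Group\n1999,Acting\n2000,Media\n")
    spark = SparkSession.builder.appName("test").getOrCreate()
    try:
        df = spark.read.csv(str(csv), header=True, inferSchema=True)
        assert df.count() == 2
    finally:
        spark.stop()
```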
- First, I extract the dataset using the `extract` function.
- Then, I start a Spark session with `start_spark`.
- I load the dataset using `load_data`.
- Next, I compute and display descriptive statistics via the `describe` function.
- I execute a Spark SQL query on the dataset using `query`.
- I apply further transformations to the sample dataset using `example_transform`.
- Finally, I end the Spark session with `end_spark` (a sketch of this flow follows the list).
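The snippet below is a minimal sketch, under stated assumptions, of how these steps could chain together in a script like `main.py`. The function bodies, the dataset URL, the temporary view name, and the column names (`YEAR`, `Group`) are illustrative guesses and may differ from the project's actual code; only the step names match the list above.

```python
"""Hypothetical sketch of the pipeline described above (not the project's actual code)."""
import os

import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

# Assumed dataset location; the real URL and local path may differ.
DATA_URL = (
    "https://raw.githubusercontent.com/fivethirtyeight/data/"
    "master/daily-show-guests/daily_show_guests.csv"
)
LOCAL_PATH = "data/daily_show_guests.csv"


def extract(url=DATA_URL, path=LOCAL_PATH):
    """Download the CSV to a local path and return that path."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(path, "wb") as f:
        f.write(response.content)
    return path


def start_spark(app_name="DailyShowGuests"):
    """Create (or reuse) a local Spark session."""
    return SparkSession.builder.appName(app_name).getOrCreate()


def load_data(spark, path=LOCAL_PATH):
    """Read the CSV into a DataFrame, inferring the schema."""
    return spark.read.csv(path, header=True, inferSchema=True)


def describe(df):
    """Compute and display descriptive statistics."""
    df.describe().show()


def query(spark, df, view_name="guests"):
    """Run a Spark SQL query against a temporary view of the data."""
    df.createOrReplaceTempView(view_name)
    # Hypothetical query: number of guests per year.
    return spark.sql(
        f"SELECT YEAR, COUNT(*) AS guest_count FROM {view_name} "
        "GROUP BY YEAR ORDER BY YEAR"
    )


def example_transform(df):
    """Example transformation: add an uppercase copy of one column."""
    # 'Group' is an assumed column name in this dataset.
    return df.withColumn("Group_upper", upper(col("Group")))


def end_spark(spark):
    """Stop the Spark session."""
    spark.stop()


if __name__ == "__main__":
    path = extract()
    spark = start_spark()
    df = load_data(spark, path)
    describe(df)
    query(spark, df).show()
    example_transform(df).show(5)
    end_spark(spark)
```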