This project uses PySpark to process a large dataset, applying Spark SQL queries and data transformations. The dataset comes from FiveThirtyEight and contains information about the guests of The Daily Show.
- Open the Codespaces environment and wait for it to finish installing.
- Run the pipeline: `python main.py`
- View the results in the PySpark Output Data/Summary File.
- Format the code: `make format`
- Lint the code: `make lint`
- Run the tests: `make test` (an illustrative test is sketched below)
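The repository's actual test suite is not shown here, but `make test` in projects like this typically invokes pytest. The snippet below is a minimal sketch of the kind of smoke test it might run; the column names and test layout are assumptions for illustration only.

```python
"""Hypothetical pytest-style smoke test (not the project's actual tests)."""
from pyspark.sql import SparkSession


def test_spark_can_load_sample_rows(tmp_path):
    """Spark should read a small CSV and return the expected row count."""
    # YEAR and Group are assumed column names matching the Daily Show guests data.
    csv = tmp_path / "sample.csv"
    csv.write_text("YEAR,Group\n1999,Acting\n2000,Media\n")
    spark = SparkSession.builder.appName("test").getOrCreate()
    try:
        df = spark.read.csv(str(csv), header=True, inferSchema=True)
        assert df.count() == 2
    finally:
        spark.stop()
```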
- First, I extract the dataset using the `extract` function.
- Then, I start a Spark session with `start_spark`.
- I load the dataset using `load_data`.
- Next, I compute and display descriptive statistics via the `describe` function.
- I execute a Spark SQL query on the dataset using `query`.
- I apply further transformations to the sample dataset using `example_transform`.
- Finally, I end the Spark session with `end_spark` (a sketch of this flow follows the list).
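The snippet below is a minimal sketch, under stated assumptions, of how these steps could chain together in a script like `main.py`. The function bodies, the dataset URL, the temporary view name, and the column names (`YEAR`, `Group`) are illustrative guesses and may differ from the project's actual code; only the step names match the list above.

```python
"""Hypothetical sketch of the pipeline described above (not the project's actual code)."""
import os

import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

# Assumed dataset location; the real URL and local path may differ.
DATA_URL = (
    "https://raw.githubusercontent.com/fivethirtyeight/data/"
    "master/daily-show-guests/daily_show_guests.csv"
)
LOCAL_PATH = "data/daily_show_guests.csv"


def extract(url=DATA_URL, path=LOCAL_PATH):
    """Download the CSV to a local path and return that path."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(path, "wb") as f:
        f.write(response.content)
    return path


def start_spark(app_name="DailyShowGuests"):
    """Create (or reuse) a local Spark session."""
    return SparkSession.builder.appName(app_name).getOrCreate()


def load_data(spark, path=LOCAL_PATH):
    """Read the CSV into a DataFrame, inferring the schema."""
    return spark.read.csv(path, header=True, inferSchema=True)


def describe(df):
    """Compute and display descriptive statistics."""
    df.describe().show()


def query(spark, df, view_name="guests"):
    """Run a Spark SQL query against a temporary view of the data."""
    df.createOrReplaceTempView(view_name)
    # Hypothetical query: number of guests per year.
    return spark.sql(
        f"SELECT YEAR, COUNT(*) AS guest_count FROM {view_name} "
        "GROUP BY YEAR ORDER BY YEAR"
    )


def example_transform(df):
    """Example transformation: add an uppercase copy of one column."""
    # 'Group' is an assumed column name in this dataset.
    return df.withColumn("Group_upper", upper(col("Group")))


def end_spark(spark):
    """Stop the Spark session."""
    spark.stop()


if __name__ == "__main__":
    path = extract()
    spark = start_spark()
    df = load_data(spark, path)
    describe(df)
    query(spark, df).show()
    example_transform(df).show(5)
    end_spark(spark)
```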