This week, we're going to write a Python program that:
- reads AWS credential information from environment variables, which allows us to connect to S3
- creates a `SparkSession` object representing a connection to a local Spark cluster, with parameters that allow us to:
  - connect to S3
  - use 3rd party jars/libraries to interact with S3
- uses that `SparkSession` to read in a local file containing (real) Amazon review data
- queries that data using the Python Spark DataFrame API (as opposed to the SQL API)
- writes that data to S3
- stops our `SparkSession`
Note: from this point forward in the course, we will be connecting to various resources hosted on AWS. In order to connect successfully, you will need to set several environment variables on your machine.
Refer to the videos to see how to set these. Alternatively, copy `sample.env` to a file called `.env` and place the environment variables in that file.

Remember, this week we are using the Spark DataFrame API (last week was the Spark SQL API). A rough sketch of the `SparkSession` setup follows.
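Here is a minimal sketch of what that setup might look like. The `hadoop-aws` package version and the environment variable names (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) are assumptions; match them to whatever your `sample.env` defines and to your local Spark install.

```python
import os

from pyspark.sql import SparkSession

# AWS credentials are expected as environment variables (for example via your
# .env file); these names are an assumption -- match them to your sample.env.
aws_access_key = os.environ.get("AWS_ACCESS_KEY_ID")
aws_secret_key = os.environ.get("AWS_SECRET_ACCESS_KEY")

spark = (
    SparkSession.builder
    .appName("hwe-reviews-lab")
    # 3rd party jars for S3 access; this hadoop-aws version is an assumption
    # and must be compatible with your local Spark/Hadoop installation.
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
    # Point the s3a filesystem at the credentials read above.
    .config("spark.hadoop.fs.s3a.access.key", aws_access_key)
    .config("spark.hadoop.fs.s3a.secret.key", aws_secret_key)
    .getOrCreate()
)
```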
- Read the tab-separated file named `resources/reviews.tsv.gz` into a dataframe. Call it `reviews`. You will use the `reviews` dataframe defined here to answer all the questions below (rough sketches of each step follow the list)...
- Display the schema of the dataframe.
- How many records are in the dataframe? Store this number in a variable named `reviews_count`.
- Print the first 5 rows of the dataframe. Some of the columns are long - print the entire record, regardless of length.
- Create a new dataframe based on `reviews` with exactly 1 column: the value of the product category field. Look at the first 50 rows of that dataframe. Which value appears to be the most common?
- Find the most helpful review in the dataframe - the one with the highest number of helpful votes. What is the product title for that review? How many helpful votes did it have?
- How many reviews have a 5-star rating?
- Currently every field in the data file is interpreted as a string, but there are 3 that should really be numbers. Create a new dataframe with just those 3 columns, except cast them as `int`s. Look at 10 rows from this dataframe.
- Find the date with the most purchases. Print that date and its total purchase count.
- Add a column to the dataframe named `review_timestamp`, representing the current time on your computer. Print the schema and inspect a few rows of data to make sure the data is correctly populated.
- Write the dataframe with the load timestamp to `s3a://hwe-$CLASS/$HANDLE/bronze/reviews_static` in Parquet format.
- Read the tab-separated file named `resources/customers.tsv.gz` into a dataframe. Write it to S3 under `s3a://hwe-$CLASS/$HANDLE/bronze/customers`. There are no questions to answer about this data set right now, but you will use it in a later lab...
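Below are hedged sketches of these steps using the DataFrame API. Starting with reading the review data: this assumes the TSV file's first line is a header row and that `spark` is the session built earlier; verify both against your setup.

```python
# Read the gzipped, tab-separated review data; the header option assumes the
# file's first line is a header row.
reviews = (
    spark.read
    .option("sep", "\t")
    .option("header", "true")
    .csv("resources/reviews.tsv.gz")
)

# Display the schema of the dataframe.
reviews.printSchema()

# How many records are in the dataframe?
reviews_count = reviews.count()
print(reviews_count)

# Print the first 5 rows without truncating the long columns.
reviews.show(5, truncate=False)
```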
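For the product-category, most-helpful-review, and 5-star questions, one possible approach is below. The column names (`product_category`, `helpful_votes`, `product_title`, `star_rating`) are assumptions based on the usual Amazon review layout; check them against the schema you printed.

```python
from pyspark.sql.functions import col

# Single-column dataframe of product categories; eyeball the first 50 rows
# to see which value shows up most often.
reviews.select("product_category").show(50)

# Most helpful review: sort by helpful_votes descending (cast to int so the
# sort is numeric -- every column was read as a string) and take the top row.
most_helpful = (
    reviews
    .orderBy(col("helpful_votes").cast("int").desc())
    .select("product_title", "helpful_votes")
    .first()
)
print(most_helpful)

# How many reviews have a 5-star rating?
five_star_count = reviews.filter(col("star_rating") == "5").count()
print(five_star_count)
```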
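For the casting step, a sketch that assumes the three numeric fields are `star_rating`, `helpful_votes`, and `total_votes` (again an assumption; use whichever three columns are actually numeric in your file):

```python
from pyspark.sql.functions import col

# Keep only the three numeric fields, cast from string to int.
numeric_reviews = reviews.select(
    col("star_rating").cast("int").alias("star_rating"),
    col("helpful_votes").cast("int").alias("helpful_votes"),
    col("total_votes").cast("int").alias("total_votes"),
)
numeric_reviews.show(10)
```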
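For the purchase-date and timestamp questions, a sketch assuming the date column is named `purchase_date` (an assumption; substitute the actual column name from your schema):

```python
from pyspark.sql.functions import col, current_timestamp

# Date with the most purchases: group by the date column, count rows per
# date, and take the largest count.
top_date = (
    reviews
    .groupBy("purchase_date")
    .count()
    .orderBy(col("count").desc())
    .first()
)
print(top_date)

# Add a review_timestamp column holding the current time on this machine.
reviews_with_ts = reviews.withColumn("review_timestamp", current_timestamp())
reviews_with_ts.printSchema()
reviews_with_ts.show(5, truncate=False)
```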
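Finally, for the two S3 writes, a sketch assuming `CLASS` and `HANDLE` are available as environment variables (an assumption; hard-coding your class and handle works just as well). `reviews_with_ts` is the dataframe from the timestamp sketch above.

```python
import os

# Build the S3 paths; reading CLASS and HANDLE from the environment is an
# assumption -- hard-code them if that suits your setup better.
class_name = os.environ.get("CLASS")
handle = os.environ.get("HANDLE")

# Write the reviews (with the load timestamp) to the bronze layer as Parquet.
# mode("overwrite") keeps reruns idempotent; drop it to fail on existing data.
reviews_with_ts.write.mode("overwrite").parquet(
    f"s3a://hwe-{class_name}/{handle}/bronze/reviews_static"
)

# Read the customer data and write it out the same way.
customers = (
    spark.read
    .option("sep", "\t")
    .option("header", "true")
    .csv("resources/customers.tsv.gz")
)
customers.write.mode("overwrite").parquet(
    f"s3a://hwe-{class_name}/{handle}/bronze/customers"
)
```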
We will stop the `SparkSession` for you going forward.