Some examples rely on data which can be downloaded from the following site:
Here is a direct link to the file used in the examples:
- Query a Parquet file using SQL
- Query a Parquet file using the DataFrame API
- Run a SQL query and store the results in a Pandas DataFrame
- Query PyArrow Data
Within the subdirectory `tpch` there are 22 examples that reproduce queries in
the TPC-H specification. These include realistic data that can be generated at
arbitrary scale and allow the user to see use cases for a variety of data frame
operations.
In the list below we describe which new operations can be found in the examples. The queries are designed to be of increasing complexity, so it is recommended to review them in order. For brevity, the following list does not include operations found in previous examples.
- Convert CSV to Parquet
  - Read from a CSV file where the delimiter is something other than a comma
  - Specify schema during CSV reading
  - Write to a Parquet file
- Pricing Summary Report
  - Aggregation computing the maximum value, average, sum, and number of entries
  - Filter data by date and interval
  - Sorting
- Minimum Cost Supplier
  - Window operation to find minimum
  - Sorting in descending order
- Shipping Priority
- Order Priority Checking
  - Aggregating multiple times in one data frame
- Local Supplier Volume
- Forecasting Revenue Change
  - Using collect and extracting values as a Python object
- Volume Shipping
  - Finding multiple distinct and mutually exclusive values within one dataframe
  - Using `case` and `when` statements
- Market Share
  - The operations in this query are similar to those in the prior examples, but it is a more complex example of using filters, joins, and aggregates
  - Using left outer joins
- Product Type Profit Measure
  - Extract year from a date
- Returned Item Reporting
- Important Stock Identification
- Shipping Modes and Order
  - Finding non-null values using a boolean operation in a filter
  - Case statement with default value
- Customer Distribution
- Promotion Effect
- Top Supplier
- Parts/Supplier Relationship
  - Using anti joins
  - Using regular expressions (regex)
  - Creating arrays of literal values
  - Determine if an element exists within an array
- Small-Quantity-Order Revenue
- Large Volume Customer
- Discounted Revenue
  - Creating a user defined function (UDF)
  - Convert PyArrow Array to Python values
  - Filtering based on a UDF
- Potential Part Promotion
  - Extracting part of a string using substr
- Suppliers Who Kept Orders Waiting
  - Using array aggregation
  - Determining the size of array elements
- Global Sales Opportunity