Accessing Data Sources:
  priority: 0
  description: Loading data stored in filesystems or databases, and saving it.
Data Handling Options:
  priority: 2
  description: Special data handling scenarios.
DataFrame Operations:
  priority: 4
  description: Adding, removing and modifying DataFrame columns.
Transforming Data:
  priority: 6
  description: Data conversions and other modifications.
Sorting and Searching:
  priority: 8
  description: Filtering, sorting, removing duplicates and more.
Grouping:
  priority: 10
  description: Group DataFrame data by key to perform aggregates like counting, sums, averages, etc.
Joining DataFrames:
  priority: 12
  description: Joining and stacking DataFrames.
File Processing:
  priority: 14
  description: Loading file metadata and processing files.
Handling Missing Data:
  priority: 16
  description: Dealing with NULLs and NaNs in DataFrames.
Dealing with Dates:
  priority: 18
  description: Parsing and processing dates and times.
Unstructured Analytics:
  priority: 20
  description: Analyzing unstructured data like JSON, XML, etc.
Pandas:
  priority: 22
  description: Using Python's Pandas library to augment Spark. Some operations require the pyarrow library.
Data Profiling:
  priority: 24
  description: Extracting key statistics out of a body of data.
Data Management:
  priority: 26
  description: Upserts, updates and deletes on data.
Spark Streaming:
  priority: 28
  description: Spark Streaming (focuses on Structured Streaming).
Time Series:
  priority: 30
  description: Techniques for dealing with time series data.
Machine Learning:
  priority: 32
  description: Machine learning with Spark MLlib.
Performance:
  priority: 34
  description: A few performance tips and tricks.
Preamble:
  description: >
    PySpark Cheat Sheet
    ===================

    This cheat sheet will help you learn PySpark and write PySpark apps faster. Everything in here is fully functional PySpark code you can run or adapt to your programs.

    These snippets are licensed under the CC0 1.0 Universal License. That means you can freely copy and adapt these code snippets and you don't need to give attribution or include any notices.

    These snippets use DataFrames loaded from various data sources:

    - "Auto MPG Data Set" available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/auto+mpg).
    - customer_spend.csv, a generated time series dataset.
    - date_examples.csv, a generated dataset with various date and time formats.

    These snippets were tested against the Spark {version} API. This page was last updated {last_updated}.

    Make note of these helpful links:

    - [PySpark DataFrame Operations](http://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#dataframe-apis)
    - [Built-in Spark SQL Functions](https://spark.apache.org/docs/latest/api/sql/index.html)
    - [MLlib Main Guide](http://spark.apache.org/docs/latest/ml-guide.html)
    - [Structured Streaming Guide](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss.html)
    - [PySpark SQL Functions Source](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/functions.html)

    Try in a Notebook
    -----------------

    See the [Notebook How-To](notebook.md) for instructions on running these snippets in a Jupyter notebook.

    Generate the Cheatsheet
    -----------------------

    You can generate the cheatsheet by running `cheatsheet.py` in your PySpark environment as follows:

    - Install dependencies: `pip3 install -r requirements.txt`
    - Generate README.md: `python3 cheatsheet.py`
    - Generate cheatsheet.ipynb: `python3 cheatsheet.py --notebook`
Postscript:
  description:
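
# Note: the priority values above appear to control the order in which sections
# are emitted into the generated cheatsheet. A minimal sketch of reading this
# file, assuming the PyYAML package is installed; the actual loading logic in
# cheatsheet.py may differ:
#
#   import yaml
#
#   with open("categories.yaml") as f:
#       categories = yaml.safe_load(f)
#
#   # Keep only entries that carry a priority (Preamble and Postscript do not)
#   # and sort them so sections come out in the intended order.
#   ordered = sorted(
#       (name for name, meta in categories.items()
#        if isinstance(meta, dict) and "priority" in meta),
#       key=lambda name: categories[name]["priority"],
#   )
#   print(ordered)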