From c0117afa2acabfc2038552e58094d5f2241e01bf Mon Sep 17 00:00:00 2001 From: Rebecca Barter Date: Tue, 30 Jan 2024 11:46:41 -0700 Subject: [PATCH] Add incomplete files and visualization lesson --- .Rhistory | 0 content/complete/01_variables.ipynb | 8 + content/complete/02_types.ipynb | 2 + content/complete/03_type_conversions.ipynb | 6 + content/complete/04_boolean_operations.ipynb | 15 + content/complete/05_numpy.ipynb | 14 +- content/complete/06_pandas_dataframes.ipynb | 8 + content/complete/07_index.ipynb | 2 + content/complete/08_series.ipynb | 12 +- content/complete/09_subsetting.ipynb | 15 +- content/complete/10_filtering_logical.ipynb | 11 +- content/complete/11_filtering_query.ipynb | 6 + content/complete/12_iloc.ipynb | 241 +-------- .../complete/13_modifying_dataframes.ipynb | 20 +- .../complete/14_summarizing_dataframes.ipynb | 24 +- .../complete/15_grouped_computations.ipynb | 4 + content/complete/16_visualization.ipynb | 263 ++++++++++ content/complete/17_list_comprehension.ipynb | 12 + content/incomplete/01_variables.ipynb | 223 ++++++++ content/incomplete/02_types.ipynb | 481 ++++++++++++++++++ content/incomplete/03_type_conversions.ipynb | 258 ++++++++++ .../incomplete/04_boolean_operations.ipynb | 240 +++++++++ content/incomplete/05_numpy.ipynb | 194 +++++++ content/incomplete/06_pandas_dataframes.ipynb | 304 +++++++++++ content/incomplete/07_index.ipynb | 255 ++++++++++ content/incomplete/08_series.ipynb | 307 +++++++++++ content/incomplete/09_subsetting.ipynb | 227 +++++++++ content/incomplete/10_filtering_logical.ipynb | 158 ++++++ content/incomplete/11_filtering_query.ipynb | 155 ++++++ content/incomplete/12_iloc.ipynb | 180 +++++++ .../incomplete/13_modifying_dataframes.ipynb | 345 +++++++++++++ .../14_summarizing_dataframes.ipynb | 298 +++++++++++ .../incomplete/15_grouped_computations.ipynb | 93 ++++ content/incomplete/16_visualization.ipynb | 146 ++++++ 34 files changed, 4271 insertions(+), 256 deletions(-) create mode 100644 .Rhistory 
create mode 100644 content/incomplete/01_variables.ipynb create mode 100644 content/incomplete/02_types.ipynb create mode 100644 content/incomplete/03_type_conversions.ipynb create mode 100644 content/incomplete/04_boolean_operations.ipynb create mode 100644 content/incomplete/05_numpy.ipynb create mode 100644 content/incomplete/06_pandas_dataframes.ipynb create mode 100644 content/incomplete/07_index.ipynb create mode 100644 content/incomplete/08_series.ipynb create mode 100644 content/incomplete/09_subsetting.ipynb create mode 100644 content/incomplete/10_filtering_logical.ipynb create mode 100644 content/incomplete/11_filtering_query.ipynb create mode 100644 content/incomplete/12_iloc.ipynb create mode 100644 content/incomplete/13_modifying_dataframes.ipynb create mode 100644 content/incomplete/14_summarizing_dataframes.ipynb create mode 100644 content/incomplete/15_grouped_computations.ipynb create mode 100644 content/incomplete/16_visualization.ipynb diff --git a/.Rhistory b/.Rhistory new file mode 100644 index 0000000..e69de29 diff --git a/content/complete/01_variables.ipynb b/content/complete/01_variables.ipynb index 2669d86..9bcd718 100644 --- a/content/complete/01_variables.ipynb +++ b/content/complete/01_variables.ipynb @@ -13,6 +13,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Simple computations\n", + "\n", "We can use Python to do simple computations, like this:" ] }, @@ -41,6 +43,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Defining variables/objects\n", + "\n", "If I want to use the \"output\" of this code, we need to assign it to a variable/object." 
] }, @@ -166,6 +170,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Overwriting variables\n", + "\n", "You can overwrite variables, by re-assinging them:" ] }, @@ -204,6 +210,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### The `+=` shortcut\n", + "\n", "There is a shortcut that will let you add a number to a variable *and* update its value: `+=`" ] }, diff --git a/content/complete/02_types.ipynb b/content/complete/02_types.ipynb index e527da0..153bbea 100644 --- a/content/complete/02_types.ipynb +++ b/content/complete/02_types.ipynb @@ -65,6 +65,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### The `type()` function\n", + "\n", "We can check the type of `y` using the `type()` funciton" ] }, diff --git a/content/complete/03_type_conversions.ipynb b/content/complete/03_type_conversions.ipynb index e8110f7..5044bc1 100644 --- a/content/complete/03_type_conversions.ipynb +++ b/content/complete/03_type_conversions.ipynb @@ -62,6 +62,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Converting to a string using `str()`\n", + "\n", "The `str()` function will convert whatever value it is given to a string (whose shorthand is `str`). 
\n", "\n", "Below, we convert the integer `4` to a string, assign it to a variable called `a` and then we check the type of `a` (which is `str`):" @@ -130,6 +132,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Converting to an integer using `int()`\n", + "\n", "Converting the float `3.0` to an integer removes the decimal point:" ] }, @@ -215,6 +219,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Converting to a boolean using `bool()`\n", + "\n", "When you convert a number to a boolean using `bool()`, it is always converted to `True`, unless the number is equal to `0` (this is the only number that is converted to `False`):" ] }, diff --git a/content/complete/04_boolean_operations.ipynb b/content/complete/04_boolean_operations.ipynb index faf6483..4ead0d5 100644 --- a/content/complete/04_boolean_operations.ipynb +++ b/content/complete/04_boolean_operations.ipynb @@ -30,6 +30,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Asking if two things are equal with `==`\n", + "\n", "To ask a question of equality, we use two equal signs `==`" ] }, @@ -86,6 +88,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Asking if two things are not equal with `!=`\n", + "\n", "The \"not equal to\" operator is written `!=`. 
The following question asks if the `age` variable is \"not equal\" to 10:" ] }, @@ -114,6 +118,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Less than or greater than with `<` and `>`\n", + "\n", "Next, to ask questions of greater than or less than, we use the `<` and `>` operators:" ] }, @@ -219,6 +225,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Less than or greater than for strings\n", + "\n", "Strings are treated alphabetically, so `'apple'` is \"less\" than `'bannana'` because the first letter of apple \"a\" comes before the first letter of banana \"b\" in the alphabet:" ] }, @@ -264,6 +272,13 @@ "'carrot' < 'banana'" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, { "cell_type": "markdown", "metadata": {}, diff --git a/content/complete/05_numpy.ipynb b/content/complete/05_numpy.ipynb index d1d9ec6..9f37149 100644 --- a/content/complete/05_numpy.ipynb +++ b/content/complete/05_numpy.ipynb @@ -20,6 +20,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Installing numpy\n", + "\n", "Just like an application on your computer, where you need to first download and install the application before you can use it on your computer, before you can use Python libraries, you need to first download and install them. \n", "\n", "The way that you will install Python libraries depends on your Python installation. \n", @@ -66,7 +68,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "You *don't* need to include this `pip install numpy` code in your notebook.\n", + "You *don't* need to include this `pip install numpy` code in your notebook.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Importing numpy\n", "\n", "Once you've successfully installed the numpy library once, you can import the library (make its functions available) using the `import as ` command below. 
\n", "\n", @@ -87,6 +97,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Using numpy functions\n", + "\n", "Let's take a look at some of the functions that the numpy library provides.\n", "\n", "First, let's define a variable `x` that contains the value `2`:" diff --git a/content/complete/06_pandas_dataframes.ipynb b/content/complete/06_pandas_dataframes.ipynb index 2ba5326..8f83a54 100644 --- a/content/complete/06_pandas_dataframes.ipynb +++ b/content/complete/06_pandas_dataframes.ipynb @@ -41,6 +41,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Loading a data file into a pandas DataFrame\n", + "\n", "To load a .csv data file into our space, we need to use the `read_csv()` function from the pandas library. Make sure that you have saved the `gapminder.csv` file in a `data` subfolder that lives in the same place where this notebook is saved.\n", "\n", "Let's load the gapminder dataset:" @@ -618,6 +620,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### The shape attribute\n", + "\n", "To extract an attribute from an object in Python, we use the `object.attribute` syntax. So if we want to extract the `shape` attribute from the `gapminder` DataFrame object, we can do so as follows:" ] }, @@ -653,6 +657,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### The head() method\n", + "\n", "The `head()` function typically prints out the first few rows of a DataFrame. However, `head()` is not a regular function. If `head()` were a regular function, we would be able to apply it like this:" ] }, @@ -794,6 +800,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Arguments\n", + "\n", "You can provide additional arguments to the `head()` inside the parentheses. 
For example, if you want to print 10 rows instead of 5, you can do so as follows:" ] }, diff --git a/content/complete/07_index.ipynb b/content/complete/07_index.ipynb index ce11f5c..52fa733 100644 --- a/content/complete/07_index.ipynb +++ b/content/complete/07_index.ipynb @@ -150,6 +150,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Changing the index\n", + "\n", "You can change the index using the `set_index()` method and providing, for example, a column name as a string." ] }, diff --git a/content/complete/08_series.ipynb b/content/complete/08_series.ipynb index a6a5c85..8d4353a 100644 --- a/content/complete/08_series.ipynb +++ b/content/complete/08_series.ipynb @@ -227,7 +227,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "There are several ways to extract a column from a DataFrame. The first, involves writing the name of the DataFrame object followed by square parentheses inside which you provide the name of the column you want to extract as a string:" + "There are several ways to extract a column from a DataFrame. 
\n", + "\n", + "### Method 1: Using square brackets\n", + "\n", + "The first involves writing the name of the DataFrame object followed by square parentheses inside which you provide the name of the column you want to extract as a string:" ] }, { @@ -266,6 +270,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Method 2: Using the column attribute with `.`\n", + "\n", "Another way to do the same thing is to use the `.` syntax to extract the named column attribute from the DataFrame object, such as:" ] }, @@ -428,6 +434,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### The Series index\n", + "\n", "They do however have an `index` (row name) attribute, which is inherited from the DataFrame from which the Series came:" ] }, @@ -456,6 +464,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### The vectorized nature of Series objects\n", + "\n", "The nice thing about Pandas Series objects is that they are **vectorized**. \n", "\n", "This means that when you apply simple mathematical operations to them, the operation will be applied to *every* entry in the Series. For example, if we add `5` to the `year` Series object, `5` will be added to *every* value in the `year` Series object:" diff --git a/content/complete/09_subsetting.ipynb b/content/complete/09_subsetting.ipynb index f4e017a..95d931e 100644 --- a/content/complete/09_subsetting.ipynb +++ b/content/complete/09_subsetting.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Working with DataFrames\n", + "# Extracting subsets of data frames\n", "\n", "In this notebook, we will learn how to manipulate pandas DataFrame objects, starting with extracting subsets." 
] @@ -120,13 +120,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Extracting subsets of data frames" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ + "### Extracting multiple columns\n", + "\n", "Suppose that you want to extract multiple columns at once from your DataFrame object. You might imagine that you can do this by providing two column names inside the square parentheses that follow the object name, as follows:" ] }, @@ -362,6 +357,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Using `:` with `.loc` to select all rows/columns\n", + "\n", "If you want to extract all rows (or columns), you can replace the corresponding index entry with `:`. So the following code will extract all rows for the `gdpPercap` column:" ] }, @@ -702,6 +699,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Using `.loc` with non-numeric indexes\n", + "\n", "Note that the fact that we can index the rows using `.loc` with integers is solely a result of the fact that the row index corresponds to integers. If, instead the row index corresponded to the `country` values, such as in `gapminder_country`, we would not be able to use integers to subset the rows, and we would instead need to use the country names. 
\n", "\n", "Let's create `gapminder_country`, whose row index corresponds to the country variable:" diff --git a/content/complete/10_filtering_logical.ipynb b/content/complete/10_filtering_logical.ipynb index e7cb120..ded3f36 100644 --- a/content/complete/10_filtering_logical.ipynb +++ b/content/complete/10_filtering_logical.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Filtering using logical operations and `.loc`" + "# Filtering using logical operations and `.loc`" ] }, { @@ -114,6 +114,13 @@ "gapminder.head()" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Filtering with `.loc` using a boolean series\n" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -159,6 +166,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "\n", + "\n", "We can use this boolean series to subset/filter the rows of our DataFrame by providing it in the row indexing position of the `.loc` indexer. The following will filter the `gapminder` DataFrame just to the rows where the `country` value equals `'Australia'`:" ] }, diff --git a/content/complete/11_filtering_query.ipynb b/content/complete/11_filtering_query.ipynb index f125b9d..3aeb565 100644 --- a/content/complete/11_filtering_query.ipynb +++ b/content/complete/11_filtering_query.ipynb @@ -473,6 +473,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Filtering using `.query()`\n", + "\n", "The `.query()` method does the same thing, but the syntax is a bit different. Since `query` is a method, it is followed by round parentheses `()` rather than square parentheses `[]`, and unlike in the above examples where we need to explicitly create a boolean Series object from the `country` column, e.g., `gapminder['country'] == \"Australia\"`, we instead provide a string (text) argument in which we just write the name of the column that we are using to filter, `country`, followed by the condiiton `== \"Australia\"`." 
] }, @@ -653,6 +655,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### External variables in the `.query()` method\n", + "\n", "Note that if you want to use an \"external\" variable in your filtering query, you need to access it within the argument using `@variable_name`. For example, if we have defined an external variable, `selected_country` that contains the name of the country that we want to use to filter to in our query, to access this `selected_country` variable inside our query argument, we need to write `@selected_country` with the `@` symbol, which will impute the value stored in `selected_country` when the query is executed." ] }, @@ -835,6 +839,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Combining `.query()` with `.loc`\n", + "\n", "Note that since `gapminder.query()` outputs a DataFrame itself, you can follow a query method call with further subsetting which will then apply to the outputted DataFrame. The code below filters to just the country rows equal to \"Brazil\", and then uses the `.loc` indexer to subset just the \"year\" and \"lifeExp\" columns for the eventual output:" ] }, diff --git a/content/complete/12_iloc.ipynb b/content/complete/12_iloc.ipynb index a9dd46d..3700909 100644 --- a/content/complete/12_iloc.ipynb +++ b/content/complete/12_iloc.ipynb @@ -360,6 +360,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Positional indexing with `.iloc`\n", + "\n", "Now let's try and extract the rows 1, 4, and 5 and the columns `'year'`, `'lifeExp'`, and `'pop'` from this country-indexed version of gapminder:" ] }, @@ -398,239 +400,8 @@ "source": [ "We get an error, because there are no longer any rows with row index names 1, 4, 5.\n", "\n", - "To subset rows using `iloc`, we need to provide string values that match the row index values we're interested in:" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
yearlifeExppop
country
Australia195269.1208691212
Australia195770.3309712569
Australia196270.93010794968
Australia196771.10011872264
Australia197271.93013177000
Australia197773.49014074100
Australia198274.74015184200
Australia198776.32016257249
Australia199277.56017481977
Australia199778.83018565243
Australia200280.37019546792
Australia200781.23520434176
Canada195268.75014785584
Canada195769.96017010154
Canada196271.30018985849
Canada196772.13020819767
Canada197272.88022284500
Canada197774.21023796400
Canada198275.76025201900
Canada198776.86026549700
Canada199277.95028523502
Canada199778.61030305843
Canada200279.77031902268
Canada200780.65333390141
\n", - "
" - ], - "text/plain": [ - " year lifeExp pop\n", - "country \n", - "Australia 1952 69.120 8691212\n", - "Australia 1957 70.330 9712569\n", - "Australia 1962 70.930 10794968\n", - "Australia 1967 71.100 11872264\n", - "Australia 1972 71.930 13177000\n", - "Australia 1977 73.490 14074100\n", - "Australia 1982 74.740 15184200\n", - "Australia 1987 76.320 16257249\n", - "Australia 1992 77.560 17481977\n", - "Australia 1997 78.830 18565243\n", - "Australia 2002 80.370 19546792\n", - "Australia 2007 81.235 20434176\n", - "Canada 1952 68.750 14785584\n", - "Canada 1957 69.960 17010154\n", - "Canada 1962 71.300 18985849\n", - "Canada 1967 72.130 20819767\n", - "Canada 1972 72.880 22284500\n", - "Canada 1977 74.210 23796400\n", - "Canada 1982 75.760 25201900\n", - "Canada 1987 76.860 26549700\n", - "Canada 1992 77.950 28523502\n", - "Canada 1997 78.610 30305843\n", - "Canada 2002 79.770 31902268\n", - "Canada 2007 80.653 33390141" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Use .loc[,] to subset row with index 'Australia', 'Canada' and columns `year', 'lifeExp', 'pop' from gapminder_country\n", - "gapminder_country.loc[['Australia', 'Canada'],['year', 'lifeExp', 'pop']]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If you do want to do positional indexing, this is where the `.iloc` indexer comes in.\n", + "\n", + "If you want to do positional indexing, this is where the `.iloc` indexer comes in.\n", "\n", "The following code will subset the country-indexed gapminder dataset to just the 2nd positional row and the 2nd positional column (remember that Python is 0-indexed, so this is actually the third row and column). Recall also that the country column is now the index rather than a column." 
] @@ -740,6 +511,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Using `:` to mean all rows/columns\n", + "\n", "We can also use the `:` placeholder for \"all rows\" or \"all columns\". The code below will select the 0th, 3rd, and 5th columns and *all* rows:" ] }, @@ -877,6 +650,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### More general sequencing with `start:stop:step`\n", + "\n", "And finally, you can extract more general sequences of rows/columns using the `start:stop` sequencing syntax. `0:20` will correspond to a list of integers from 0 up to 20 (not inclusive -- so it will actually go up to 19). \n", "\n", "`start:stop:step`, e.g., `0:20:2`, will similarly correspond to a list of integers from 0 up to 20 (non-inclusive) with a step size of 2, so 0, 2, 4, 6, ..., 18. \n", diff --git a/content/complete/13_modifying_dataframes.ipynb b/content/complete/13_modifying_dataframes.ipynb index 429ea36..d1249a8 100644 --- a/content/complete/13_modifying_dataframes.ipynb +++ b/content/complete/13_modifying_dataframes.ipynb @@ -125,14 +125,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Creating a new column in a DataFrame" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's add a new column `gdp` which is the product of the `pop` and `gdpPercap` columns:" + "### Creating a new column in a DataFrame" ] }, { @@ -167,6 +160,13 @@ "gapminder['pop'] * gapminder['gdpPercap']" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's add a new column `gdp` which is the product of the `pop` and `gdpPercap` columns:" + ] + }, { "cell_type": "code", "execution_count": 24, @@ -366,7 +366,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Removing a column from a DataFrame" + "### Removing a column from a DataFrame using `.drop()`" ] }, { @@ -924,7 +924,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Creating a copy of a DataFrame object" + "### Creating a 
copy of a DataFrame object" ] }, { diff --git a/content/complete/14_summarizing_dataframes.ipynb b/content/complete/14_summarizing_dataframes.ipynb index 35df4cb..b682008 100644 --- a/content/complete/14_summarizing_dataframes.ipynb +++ b/content/complete/14_summarizing_dataframes.ipynb @@ -23,7 +23,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "To apply a statistical summary such as the mean to a single column from a DataFrame, you first need to extract the column of interest, e.g., using the `df['col']` syntax, and then apply the relevant method (e.g., `.mean()`) to the resulting Series object.\n", + "To apply a statistical summary such as the mean to a single column from a DataFrame, you first need to extract the column of interest, e.g., using the `df['col']` syntax, and then apply the relevant method (e.g., `.mean()`) to the resulting Series object.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Calculating the mean\n", "\n", "For example, the code below applies the `.mean()` method to the `lifeExp` column from gapminder:" ] @@ -353,6 +361,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Extracting columns of a particular type\n", + "\n", "Fortunately, you can extract all of the numeric columns of gapminder using the `.select_dtypes()` method, and then apply the `.mean()` method to the resulting DataFrame:" ] }, @@ -386,7 +396,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that `.mean()` is just one of many statistical summaries that you can compute on Pandas DataFrame and Series objects. \n", + "Note that `.mean()` is just one of many statistical summaries that you can compute on Pandas DataFrame and Series objects. 
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Other statistical summaries: sum, median, std\n", "\n", "Others include the sum:" ] @@ -484,6 +502,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Counting the number of unique categorical values with the `value_counts()` method\n", + "\n", "While the above methods can only be used for numeric (float or integer) columns, there are some summaries that you can use for categorical columns too. For example, the `.value_counts()` method will compute the number of times each unique value appears in a column.\n", "\n", "The following code counts the number of times each unique country appears in the `country` column:" diff --git a/content/complete/15_grouped_computations.ipynb b/content/complete/15_grouped_computations.ipynb index 9c14833..b7d3b91 100644 --- a/content/complete/15_grouped_computations.ipynb +++ b/content/complete/15_grouped_computations.ipynb @@ -23,6 +23,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### The `.groupby()` method\n", + "\n", "So far we have seen how to compute statistical summaries across an entire column, but sometimes we want to compute a summary separately for different groups (where the groups might be defined by the unique values in a column).\n", "\n", "The code below uses the `.groupby()` method to compute the mean of each column separately for each `year` value:" @@ -173,6 +175,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Grouping by multiple columns\n", + "\n", "We can group by multiple columns at once by providing a *list* of the column names that we want to group by as the argument of the `.groupby()` method, and we can then extract a single column (if we choose) and compute the mean of it separately within each grouped combination. 
For example, the code below computes the mean `lifeExp` value for each year-continent combination:" ] }, diff --git a/content/complete/16_visualization.ipynb b/content/complete/16_visualization.ipynb index 5267dda..90225d8 100644 --- a/content/complete/16_visualization.ipynb +++ b/content/complete/16_visualization.ipynb @@ -19,6 +19,269 @@ "\n", "gapminder = pd.read_csv('data/gapminder.csv')" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pandas visualization methods\n", + "\n", + "Pandas has a few methods that allow you to quickly create visualizations from your DataFrame. We will use the `plot` method to create a few plots.\n", + "\n", + "The general syntax is `df.plot(kind=)`" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Use `df.plot` to create a scatterplot of gdpPercap vs lifeExp\n", + "gapminder.plot(kind='scatter', x='gdpPercap', y='lifeExp')" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Use `df.plot` to create a barplot of the number of countries in each continent\n", + "# Hint: use value_counts() first to get the counts\n", + "gapminder['continent'].value_counts().plot(kind='bar')" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAjsAAAGdCAYAAAD0e7I1AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjguMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy81sbWrAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAovklEQVR4nO3deXRUZZ7/8U+RTbZUZEkqaUJAWSNLt+CEapdRk2aLNkicAxogLKOjHWwgihBbQaU1iO1G28IsDoEjNMoM0oIDGALGUSNL7LCpYRENdFIJI50UQQkhdX9/eKifJaCkqFCVx/frnHsOdZ+nbn3vc64nH5967i2bZVmWAAAADNUq2AUAAAA0J8IOAAAwGmEHAAAYjbADAACMRtgBAABGI+wAAACjEXYAAIDRCDsAAMBo4cEuIBR4PB5VVFSoffv2stlswS4HAABcBMuydOLECSUkJKhVqwvP3xB2JFVUVCgxMTHYZQAAAD8cOXJEXbp0uWA7YUdS+/btJX07WNHR0UGuBgAAXAy3263ExETv3/ELIexI3q+uoqOjCTsAALQwP7YEhQXKAADAaIQdAABgNMIOAAAwGmEHAAAYjbADAACMRtgBAABGI+wAAACjEXYAAIDRCDsAAMBohB0AAGA0wg4AADAaYQcAABiNsAMAAIxG2AEAAEYLD3YBAAAES7c5bwe7hCb7YkF6sEtocZjZAQAARiPsAAAAoxF2AACA0Qg7AADAaIQdAABgNMIOAAAwGmEHAAAYjbADAACMRtgBAABGI+wAAACjEXYAAIDRCDsAAMBohB0AAGA0wg4AADAaYQcAABiNsAMAAIxG2AEAAEYLathZvHixBgwYoOjoaEVHR8vpdGrDhg3e9lOnTik7O1sdO3ZUu3btlJGRoaqqKp9jlJeXKz09XW3atFFsbKxmzZqlM2fOXO5TAQAAISqoYadLly5asGCBSkpKtHPnTt16660aNWqU9u3bJ0maOXOm1q1bp9WrV6uoqEgVFRUaM2aM9/2NjY1KT0/X6dOn9eGHH2rZsmXKz8/X3Llzg3VKAAAgxNgsy7KCXcR3dejQQc8++6zuvPNOde7cWStXrtSdd94pSfrss8/Ut29fFRcXa8iQIdqwYYNuu+02VVRUKC4uTpK0ZMkSzZ49W8eOHVNkZORFfabb7Zbdbldtba2io6Ob7dwAAKGl25y3g11Ck32xID3YJYSMi/37HTJrdhobG7Vq1SqdPHlSTqdTJSUlamhoUFpamrdPnz591LVrVxUXF0uSiouL1b9/f2/QkaRhw4bJ7XZ7Z4fOp76+Xm6322cDAABmCnrY2bNnj9q1a6eoqCjdd999evPNN5WcnCyXy6XIyEjFxMT49I+Li5PL5ZIkuVwun6Bztv1s24Xk5eXJbrd7t8TExMCeFAAACBlBDzu9e/dWaWmptm3bpvvvv19ZWVn65JNPmvUzc3NzV
", + "text/plain": [ + "
pw5VlFRkXX48GFr9+7d1pw5cyybzWa98847lmV9e8tm165drS1btlg7d+60nE6n5XQ6g1x1aPnu3ViWxZh934MPPmi9++671uHDh60PPvjASktLszp16mRVV1dblsV4nc/27dut8PBw66mnnrIOHDhgrVixwmrTpo312muvefssWLDAiomJsf7yl79Yu3fvtkaNGtVibw0OhMbGRqtr167W7Nmzz2njGvOVlZVl/exnP/Peer5mzRqrU6dO1sMPP+ztY9L1RdiBNWXKFCspKcmKjIy0OnfubKWmpnqDjmVZ1jfffGP95je/sa688kqrTZs21h133GFVVlYGseLQ8/2ww5j5Gjt2rBUfH29FRkZaP/vZz6yxY8f6PC+G8Tq/devWWf369bOioqKsPn36WP/2b//m0+7xeKzHHnvMiouLs6KioqzU1FSrrKwsSNUG36ZNmyxJ5x0DrjFfbrfbmj59utW1a1friiuusK666irrd7/7nVVfX+/tY9L1ZbOs7zwuEQAAwDCs2QEAAEYj7AAAAKMRdgAAgNEIOwAAwGiEHQAAYDTCDgAAMBphBwAAGI2wAwAAjEbYAQAARiPsAAAAoxF2AACA0Qg7AADAaP8Pl47axCVxPW0AAAAASUVORK5CYII=", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# create a histogram of the life expectancy\n", + "# Hint: extract the lifeExp column first\n", + "gapminder['lifeExp'].plot(kind='hist')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The Seaborn library\n", + "\n", + "The inbuilt Pandas plotting functionalities are somewhat limited in what they can do.\n", + "\n", + "If you want to create more sophisticated visualizations, you'll want to use another library. One of the most popular libraries for data visualization is [Seaborn](https://seaborn.pydata.org/)." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Import seaborn\n", + "import seaborn as sns\n", + "\n", + "# Use seaborn to create a scatterplot of gdpPercap vs lifeExp\n", + "sns.scatterplot(data=gapminder, x='gdpPercap', y='lifeExp')" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# create a version of the above scatterplot for just the year 2007, \n", + "# where the color is based on continent \n", + "# and the size is based on population\n", + "# and the size range is from 20 to 500\n", + "# and the transparency is 0.7\n", + "sns.scatterplot(data=gapminder.query('year == 2007'), \n", + " x='gdpPercap', \n", + " y='lifeExp', \n", + " hue='continent', \n", + " size='pop', \n", + " sizes=(20, 500), \n", + " alpha=0.7)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Creating bar charts and histograms with seaborn is also fairly straightforward" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# create a bar chart of the number of countries in each continent using sns.countplot()\n", + "sns.countplot(data=gapminder,\n", + " y='continent')" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# create a histogram of lifeExp using sns.histplot()\n", + "sns.histplot(data=gapminder,\n", + " x='lifeExp')" + ] } ], "metadata": { diff --git a/content/complete/17_list_comprehension.ipynb b/content/complete/17_list_comprehension.ipynb index b4a46ba..f18f8c5 100644 --- a/content/complete/17_list_comprehension.ipynb +++ b/content/complete/17_list_comprehension.ipynb @@ -19,6 +19,18 @@ "\n", "gapminder = pd.read_csv('data/gapminder.csv')" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We've seen lists a few times throughout this workshop, but we haven't really talked much about them." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] } ], "metadata": { diff --git a/content/incomplete/01_variables.ipynb b/content/incomplete/01_variables.ipynb new file mode 100644 index 0000000..9ba3a5c --- /dev/null +++ b/content/incomplete/01_variables.ipynb @@ -0,0 +1,223 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Variables and objects in Python\n", + "\n", + "In this document, we will define some objects/variables and learn how to interact with them. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Simple computations\n", + "\n", + "We can use Python to do simple computations, like this:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# compute 1 + 1\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Defining variables/objects\n", + "\n", + "If I want to use the \"output\" of this code, we need to assign it to a variable/object." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# assign the output of 1 + 1 to a variable y\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To access the results of the `1 + 1` computation, we need to type the name of the variable:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# print out the results of the above computation using y\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can do mathematical operations with variables:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# multiply y by 6\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# compute y squared\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# compute y**2 + y/2 and assign it to z\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# see what z contains\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Overwriting variables\n", + "\n", + "You can overwrite variables by reassigning them:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# overwrite/reassign y with 10\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# check the value of z -- has it changed?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The `+=` shortcut\n", + "\n", + "There is a shortcut that will let you add a number to a variable *and* update its value: `+=`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": 
[ + "# print the value of z + 2 (do not reassign it)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# has the value of z changed?\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# update z to be z + 2 using the += shorthand\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# print z again. Has it changed now?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise\n", + "\n", + "Without running it, can you guess what the output of the following code cell will be?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "value = 1\n", + "computed_result = (value * 10) + (3 ** 2)\n", + "value += 2\n", + "computed_result * 2" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/content/incomplete/02_types.ipynb b/content/incomplete/02_types.ipynb new file mode 100644 index 0000000..6292aa7 --- /dev/null +++ b/content/incomplete/02_types.ipynb @@ -0,0 +1,481 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Types\n", + "\n", + "The main types in Python are:\n", + "\n", + "- Float (decimal point): 1.2, 3.0, 5.123\n", + "\n", + "- Integers: 1, 7, 11\n", + "\n", + "- String (text): 'banana', 'Utah', 'Rebecca' etc\n", + "\n", + "- Boolean (True/False): True or False" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + 
"source": [ + "### Numeric types\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's define a variable $y = 2 \\times 3.2 + 1$" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# define y\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# look at y. What do you think its type is?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The `type()` function\n", + "\n", + "We can check the type of `y` using the `type()` function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# apply the type function to the variable y\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's also create a variable $z = 1 + 5$ and check its type:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# define z\n", + "\n", + "# look at z--what type do you think z has?\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# check the type of z using type()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can do mathematical operations with float and integer type objects" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# compute y squared\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# divide z by 2\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### String types" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"String\" in Python just means \"text\", so string type objects contain text, and can be identified because they are surrounded by quotes:" + ] + 
}, 
+ { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# define w to contain the string 'John Doe'\n", + "\n", + "# look at the value of w\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What do you think will happen when you try to **multiply** a string by an *integer*?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Multiply w by 7\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What about **adding** a string and an *integer*?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Add w and 7\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What about **adding** a string to another *string*?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# add w and the string 'Smith'\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's overwrite `w` with a new string value: 'banana'. Notice that we need 'banana' to be surrounded by quotes. 
The quotes around the string values are very important since they distinguish string *values* from variable *names*:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# redefine w to contain 'banana'\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# look at w\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# check the type of w\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that you can also use the type function on *values* directly (rather than on variables):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# check the type of the string value 'a' ('a' is not assigned to a variable)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Boolean type" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The boolean type corresponds to binary True/False values:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write the boolean True value\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that `True` above is a special value, not a variable name. You can't just write any text and expect it to be printed out." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's check the type of `True`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# check the type of True\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that this is different from the type of `'True'`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# check the type of 'True' (with quotes)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The converse to `True` is `False`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# write the boolean False value\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# check the type of False\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can assign boolean values to variables, just as with integers/floats and strings:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# assign True to the variable a\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# print out the value of a\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can do mathematical operations with boolean values:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# add a and 3\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# add False and 4\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that you *cannot* add Boolean values and string values together:" + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# try to add True and 'True'\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise\n", + "\n", + "Without running the following code cells, answer the following questions:\n", + "\n", + "1. Will the computation work? \n", + "\n", + "1. If so, what will the output be? \n", + "\n", + "1. What type will the output have?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "'True' * 4" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "'banana' + 'apple'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "False + 5" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "True * 'True'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "5 + '5.2'" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/content/incomplete/03_type_conversions.ipynb b/content/incomplete/03_type_conversions.ipynb new file mode 100644 index 0000000..cc2bd92 --- /dev/null +++ b/content/incomplete/03_type_conversions.ipynb @@ -0,0 +1,258 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Type conversion\n", + "\n", + "Objects of one type can be converted to another type using a collection of functions whose names match the type shorthand 
you want to convert to." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As a reminder, the `type()` function tells us the type of a value:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# check the type of '4' (with quotes)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# check the type of 4:\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Converting to a string using `str()`\n", + "\n", + "The `str()` function will convert whatever value it is given to a string (whose shorthand is `str`). \n", + "\n", + "Below, we convert the integer `4` to a string, assign it to a variable called `a` and then we check the type of `a` (which is `str`):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# convert the integer 4 to a string and assign it to a variable a:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# check the type of a\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Rather than assigning `str(4)` to `a` and then checking the type of `a`, we can nest the function calls:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# in one line, convert the integer 4 to a string and check its type\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Converting to an integer using `int()`\n", + "\n", + "Converting the float `3.0` to an integer removes the decimal point:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# convert the float 3.0 to an integer\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When you do the same 
thing for a floating-point number with a non-zero decimal part, it simply discards the decimal part (it does not round):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# convert the float 4.2 to an integer\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What happens when you try to convert a string to an integer?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Convert the string 'two' to an integer\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Converting to a boolean using `bool()`\n", + "\n", + "When you convert a number to a boolean using `bool()`, it is always converted to `True`, unless the number is equal to `0` (this is the only number that is converted to `False`):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# convert the integer 3 to a boolean\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# convert the integer 1 to a boolean\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# convert the integer 0 to a boolean\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# convert the float 1.1 to a boolean\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Do you think a negative number will be converted to `True` or `False`?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# convert the float -3.4 to a boolean\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What will a string be converted to?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# convert the string 'hello' to a boolean\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# check the type of the boolean version of 'hello' from the previous cell\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise\n", + "\n", + "1. Create a string variable called `my_string` containing the value `'two'`. \n", + "\n", + "1. Convert `my_string` to a boolean and assign the result to a variable called `my_bool`. \n", + "\n", + "1. Print the value of `my_bool` and check its type. \n", + "\n", + "1. Convert `my_bool` to an integer and assign the result to a variable called `my_int`.\n", + "\n", + "Try to convert the original `my_string` variable to an integer directly. What happens?" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/content/incomplete/04_boolean_operations.ipynb b/content/incomplete/04_boolean_operations.ipynb new file mode 100644 index 0000000..dfcb224 --- /dev/null +++ b/content/incomplete/04_boolean_operations.ipynb @@ -0,0 +1,240 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Asking questions with boolean operations\n", + "\n", + "There are several helpful operations (`==`, `<=`, `<`, `!=`) that allow us to ask logical questions of our data whose answers are True or False." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's define a variable `age` and give 
it the value `20`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# assign age to 20\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Asking if two things are equal with `==`\n", + "\n", + "To ask a question of equality, we use two equal signs `==`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ask is age equal to 18?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This question doesn't have to involve a variable and a value; it can instead be asked directly of two values, for instance:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ask is 1 equal to 1?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Asking if two things are not equal with `!=`\n", + "\n", + "The \"not equal to\" operator is written `!=`. 
The following question asks if the `age` variable is \"not equal\" to 18:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ask is age \"not equal\" to 18?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Less than or greater than with `<` and `>`\n", + "\n", + "Next, to ask questions of greater than or less than, we use the `<` and `>` operators:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ask is age greater than 18?\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ask is the value 20 greater than the value 18?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"Greater or equal to\" is the \"greater than\" symbol (`>`) followed by the equals symbol (`=`):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# is age greater than or equal to 18?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "and similarly for less than or equal to:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# is age less than or equal to 18?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Less than or greater than for strings\n", + "\n", + "Strings are treated alphabetically, so `'apple'` is \"less\" than `'banana'` because the first letter of apple \"a\" comes before the first letter of banana \"b\" in the alphabet:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ask is 'apple' less than 'banana'\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ask is 'carrot' less than 'banana'\n" + ] + }, 
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also use these operators to compare the values contained within two variables:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# assign age_john the value 18\n", + "\n", + "# assign age_beth the value 22\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ask is age_john less than age_beth?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "1. Create two variables, `age1` and `age2` and assign them integer values `12` and `15`, respectively.\n", + "\n", + "1. Write a boolean condition to test if `age1` is equal to `age2`.\n", + "\n", + "1. Write a boolean condition to test if `age1` is greater than `age2`.\n", + "\n", + "1. Write a boolean condition to test if `age1` is less than or equal to `age2`.\n", + "\n", + "1. Write a boolean condition to test if `age1` is not equal to `age2`.\n", + "\n", + "1. Write a boolean condition to test if the absolute difference between `age1` and `age2` is greater than 5 using the `abs()` function." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/content/incomplete/05_numpy.ipynb b/content/incomplete/05_numpy.ipynb new file mode 100644 index 0000000..7127bd6 --- /dev/null +++ b/content/incomplete/05_numpy.ipynb @@ -0,0 +1,194 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The numpy library" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Some functions exist \"natively\" in the Python programming language (like `type()`, `bool()`, `int()`, etc.), but Python lacks native versions of many important functions, such as the logarithm, exponential, and square root functions. \n", + "\n", + "Libraries can be installed that provide access to many functions. \n", + "\n", + "The **Numpy** library (pronounced \"Num Pie\") is an add-on library that gives us access to many mathematical functions such as the `log()`, `exp()`, and `sqrt()` functions." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Installing numpy\n", + "\n", + "Just like an application on your computer, where you need to *first download and install* the application before you can use it on your computer, before you can use Python libraries, you need to first download and install them. 
\n", + "\n", + "\n", + "**Downloading libraries requires internet access.**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that just like **you only ever have to download and install an application to your computer _once_**, you similarly only need to install each Python library *once*. Thus it is not encouraged to repeatedly run a `pip install ` command every time you work on your notebook. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# install numpy and then comment out the installation line\n", + "# !pip install numpy" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Importing numpy\n", + "\n", + "Unlike the `pip install` command, you **do** need to include this `import` line of code in every new `.ipynb` notebook file." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# import numpy and rename it to np\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Using numpy functions\n", + "\n", + "Define a variable `x` that contains the value `2`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# define a variable x and assign it a value of 2\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# take a look at x\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we wanted to compute the logarithm of `x`, you might imagine that we would write:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# try to compute the log of x using the log() function\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But, notice that we get an error that essentially says that Python doesn't know what 
`log()` is. \n", + "\n", + "In Python, to access functions that come from imported libraries, you need to extract the function from the library shorthand/nickname using the `library.function()` syntax, e.g., `np.log()` instead of just `log()`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# try to compute the log of x using the np.log() function\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is also true of the `sqrt()` function that is imported from the `numpy` library:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# compute the square root of x\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And the `exp()` function:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# compute the exponential of 2\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise\n", + "\n", + "Compute the sum of the log of 7 and the square root of 8 and exponentiate the result, i.e., $e^{\\log(7) + \\sqrt{8}}$." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/content/incomplete/06_pandas_dataframes.ipynb b/content/incomplete/06_pandas_dataframes.ipynb new file mode 100644 index 0000000..52ce93d --- /dev/null +++ b/content/incomplete/06_pandas_dataframes.ipynb @@ -0,0 +1,304 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Pandas data frames\n", + "\n", + "In this notebook, we will start learning about Pandas data frames. \n", + "\n", + "To import the pandas library to our notebook, if you haven't done so already you will first need to download and install the pandas library (`pip install pandas`).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Then you can import the pandas library into this notebook as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# import the pandas library and alias as pd\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Loading a data file into a pandas DataFrame\n", + "\n", + "To load a .csv data file into our space, we need to use the `read_csv()` function from the pandas library. 
Make sure that you have saved the `gapminder.csv` file in a `data` subfolder that lives in the same place where this notebook is saved.\n", + "\n", + "Let's load the gapminder dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# read the csv file living in data/gapminder.csv into a pandas dataframe \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This just prints out the gapminder dataset, but it doesn't save it. \n", + "\n", + "To save the dataset so that we can use it in our notebook, we need to assign the result of the `pd.read_csv()` function to a variable called `gapminder`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Save the above dataframe as a variable called gapminder\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To view the dataset we have just loaded, we can type the name of the variable that we saved it in:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# look at the gapminder dataframe object\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can then use the same `type()` function that we used in the previous notebook to ask what kind of object the `gapminder` variable is (the answer is a pandas DataFrame):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# check the type of the dataframe object\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Extracting information/attributes from DataFrames\n", + "\n", + "In this section, we will learn how to extract attributes from DataFrame objects and how to apply DataFrame-specific \"methods\" to DataFrames, both using the `.` syntax."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As a reminder, let's print out the `gapminder` DataFrame that we're working with:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# look at gapminder again\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The shape attribute\n", + "\n", + "To extract an attribute from an object in Python, we use the `object.attribute` syntax. So if we want to extract the `shape` attribute from the `gapminder` DataFrame object, we can do so as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# extract the shape attribute from gapminder\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This `shape` attribute tells us the number of rows (1704) and the number of columns (6) and is helpful for learning about the size of our data objects" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The head() method\n", + "\n", + "The `head()` function typically prints out the first few rows of a DataFrame. However, `head()` is not a regular function. If `head()` were a regular function, we would be able to apply it like this:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# try to look at the first 5 rows of the gapminder dataset using the head() function\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But this results in an error. This is because `head()` is not a function that can be applied in the regular way. \n", + "\n", + "Instead, `head()` is a **method**, which is applied using the `object.method()` syntax rather than the `method(object)` syntax above. 
\n", + "\n", + "We can apply the `head()` method to the `gapminder` dataset as follows, which will print out the first 5 rows of the DataFrame:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# apply the head() method to gapminder\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Arguments\n", + "\n", + "You can provide additional arguments to the `head()` method inside the parentheses. For example, if you want to print 10 rows instead of 5, you can do so as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# apply the head() method to gapminder with an argument of 10\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This argument has a name `n`, which you can explicitly specify:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# apply the head method to gapminder with a *named* argument of n=10\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But you don't need to specify the `n=` part of the argument because the `head()` method knows that the first argument is the number of rows to print." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise\n", + "\n", + "1. The pandas DataFrame has an attribute called `dtypes` that will print out the *type* of each column. Extract the `dtypes` attribute from the `gapminder` DataFrame:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# extract the dtypes attribute from gapminder\n", + "gapminder.dtypes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that the \"string\" type is called `object` in pandas." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "2. 
The pandas DataFrame has a \"method\" called `select_dtypes()` that will extract just the columns of a certain type from the DataFrame. Use the `select_dtypes()` function to extract the numeric (float and integer) columns of gapminder by providing an argument `include='number'` inside the parentheses of `select_dtypes()`. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# use the select_dtypes() method to select only the columns of type number\n", + "gapminder.select_dtypes(include='number')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# to instead extract the string/object-type columns:\n", + "gapminder.select_dtypes(include='object')" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/content/incomplete/07_index.ipynb b/content/incomplete/07_index.ipynb new file mode 100644 index 0000000..d298c7f --- /dev/null +++ b/content/incomplete/07_index.ipynb @@ -0,0 +1,255 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## DataFrame indexing" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this section we will learn about the column and row indexes of the DataFrame. These are essentially the column and row names of the DataFrame.\n", + "\n", + "We will keep working with the gapminder DataFrame." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's start by importing the libraries that we need:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# import the pandas library\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's load the gapminder dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# read the data from the downloaded CSV file and save it as a pandas DataFrame called gapminder\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To view the dataset we have just loaded, we can type the name of the variable that we saved it in:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# look at the gapminder object\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The column index\n", + "\n", + "To extract the column index, which corresponds to the column names of the DataFrame, we need to extract the `columns` attribute of the DataFrame object." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# extract the column index (the column names) from the gapminder DataFrame\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that the output of the cell above is an \"Index\" object. 
\n", + "\n", + "We can use the `list()` function to convert the index object to a simpler type of object called a \"list\" (which is just a collection of values):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# use the list() function to create a \"list\" of the column names\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The row index\n", + "\n", + "The row index can be extracted using the `index` attribute (there is no `rows` attribute):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# extract the row index from gapminder\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This time, the output is a `RangeIndex` object." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To extract the actual integer values we can use the `list()` function:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# use the list() function to create a \"list\" of the row index entries\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Changing the index\n", + "\n", + "You can change the index using the `set_index()` method." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's set the `country` column to be the row index:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# use the set_index() method to set the 'country' column as the index of gapminder\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "However, notice also that this did not actually modify the `gapminder` object itself." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# show that the gapminder DataFrame still has its original index\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Below, we create a *new* DataFrame corresponding to the version of `gapminder` with the `'country'` column as the row index:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# create a new DataFrame called gapminder_country where the index is the 'country' column\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that the original gapminder dataset is unchanged:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# print gapminder to show that gapminder DataFrame still has its original index\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But that the `gapminder_country` DataFrame has the `country` column as its index. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# print gapminder_country to show that it has the 'country' column as the index \n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/content/incomplete/08_series.ipynb b/content/incomplete/08_series.ipynb new file mode 100644 index 0000000..443ef69 --- /dev/null +++ b/content/incomplete/08_series.ipynb @@ -0,0 +1,307 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Extracting columns from a DataFrame as a Pandas Series object\n", + "\n", + "In the video I was working in the old file, but I thought it was actually getting a bit long, so here is a shorter one. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's start by importing the libraries that we need:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# import the pandas library\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's load the gapminder dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# read the gapminder data and save it as a DataFrame object called 'gapminder'\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To view the dataset we have just loaded, we can type the name of the variable that we saved it in:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# take a look at gapminder\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Extracting columns from a DataFrame" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are several ways to extract a column from a DataFrame. 
\n", + "\n", + "### Method 1: Using square brackets\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# use the `df['colname']` syntax to extract the 'year' column\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Method 2: Using the column attribute with `.`\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# use the df.colname syntax to extract the 'year' column\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The Pandas Series object" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Save the `year` column from gapminder as a variable called `year`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# define a variable called year and assign to it the 'year' column of gapminder\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# take a look at the year variable\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What is the type of `year`? 
\n", + "\n", + "The answer is a **Pandas Series** object:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# check the type of year using type()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Series objects are like 1-column DataFrame objects, but they don't have a `columns` attribute like a DataFrame does:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# try to extract the columns attribute from the year object\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The Series index\n", + "\n", + "They do however have an `index` (row name) attribute, which is inherited from the DataFrame from which the Series came:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# extract the index attribute from the year object\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The vectorized nature of Series objects\n", + "\n", + "The nice thing about Pandas Series objects is that they are **vectorized**. \n", + "\n", + "This means that when you apply simple mathematical operations to them, the operation will be applied to *every* entry in the Series. 
For example, if we add `5` to the `year` Series object, `5` will be added to *every* value in the `year` Series object:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Try to add 5 to year\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Similarly, if we raise `lifeExp` to the power of 2, this computation will be applied to every single value in the Series object:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Try to raise the lifeExp column of gapminder to the power of 2\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also apply the numpy mathematical functions in a vectorized way to Series objects:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Try to compute the logarithm of the gdpPercap column of gapminder\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is important to note that this behaviour, while it seems natural, is not exhibited in other Python object types, such as lists (more on lists in a future video)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also ask boolean/logical questions of each value in a Series object simultaneously. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ask which entries in the year column of gapminder equal 2007\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ask which entries in the lifeExp column are greater or equal to 60\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise\n", + "\n", + "1. 
Extract the `country` and `continent` columns from gapminder and create a Series object that contains the country and continent values separated by a comma, e.g., the first few entries should be \"Afghanistan, Asia\"." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "2. Extract the `pop` and `gdpPercap` columns from gapminder and create a Series object that contains the total GDP for each country. " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/content/incomplete/09_subsetting.ipynb b/content/incomplete/09_subsetting.ipynb new file mode 100644 index 0000000..41ac65c --- /dev/null +++ b/content/incomplete/09_subsetting.ipynb @@ -0,0 +1,227 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Extracting subsets of data frames\n", + "\n", + "In this notebook, we will learn how to manipulate pandas DataFrame objects, starting with extracting subsets." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# import pandas\n", + "\n", + "# load the gapminder dataset and save as `gapminder`\n", + "\n", + "# take a look at the head of gapminder\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Extracting multiple columns\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# try to extract two columns: country and gdpPercap from gapminder using the `df[]` notation with two column names\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `df[]` syntax expects only one value (or object) inside the square brackets. \n", + "\n", + "Fortunately, you can provide multiple column names as a single **list** object." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# create a list containing the names of the columns we want to extract: 'country' and 'gdpPercap' \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can extract both the country and gdpPercap columns by providing this *list* inside the indexing square brackets." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# provide the list of names inside the `df[]` notation to extract the two columns from gapminder\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The `.loc` indexer\n", + "\n", + "An alternative (and ultimately more flexible) approach to subsetting a Pandas DataFrame is to use the `.loc` indexer. \n", + "\n", + "With `.loc`, the square brackets expect *two* values: one for the *row* index and one for the *column* index. \n", + "\n", + "The general syntax is `df.loc[rows, cols]`."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Use `df.loc[,]` to extract the entry with row index 3 from the 'gdpPercap' column\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Using `:` with `.loc` to select all rows/columns\n", + "\n", + "If you want to extract all rows (or columns), you can replace the corresponding index entry with `:`. So the following code will extract all rows for the `gdpPercap` column:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Use `df.loc[,]` to extract all rows from the 'gdpPercap' column\n", + "\n", + "# what are two other ways that you could do this same thing?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you want to extract multiple columns (or rows), you still need to provide all of the index values that you want to extract in a list." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# use `df.loc[,]` to extract all rows for the 'country' and 'gdpPercap' columns\n", + "\n", + "# what is another way to do the same thing?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# extract the rows with index 4, 5, 6, 7, and 8 for the country and gdpPercap columns\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If your index corresponds to a sequence of integers, you can instead provide a \"range\" object:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# use the `range()` function to simplify the code in the previous cell\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Using `.loc` with non-numeric indexes\n", + "\n", + "Let's create `gapminder_country`, whose row index corresponds to the country variable:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# define gapminder_country as a new dataframe with the country column as the row index\n", + "\n", + "# look at gapminder_country\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# use the `df.loc[,]` notation to extract the rows for Germany for the gdpPercap column\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise\n", + "\n", + "1. Extract the population and year columns for Australia using `gapminder_country`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "2. Extract the 'country' and 'lifeExp' columns for the first, second, and third rows of `gapminder_country`." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/content/incomplete/10_filtering_logical.ipynb b/content/incomplete/10_filtering_logical.ipynb new file mode 100644 index 0000000..9974f19 --- /dev/null +++ b/content/incomplete/10_filtering_logical.ipynb @@ -0,0 +1,158 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Filtering using logical operations and `.loc`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# import pandas\n", + "\n", + "# load the gapminder dataset\n", + "\n", + "# take a look at the head of gapminder\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Filtering with `.loc` using a boolean series\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Recall that you can create a **boolean series** based on a **logical condition** on a column from a DataFrame. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# use a boolean operation to identify which rows have the value 'Australia' in the 'country' column \n", + "# save the result as australia_index\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "We can use this boolean series to subset/filter the rows of our DataFrame by providing it in the row indexing position of the `.loc` indexer. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# use .loc and australia_index to select only the rows corresponding to Australia\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The simpler `df[]` syntax can be used to filter *either* the columns or the rows, but not both at the same time. \n", + "\n", + "If you provide a list of column names, Pandas will know that you are trying to subset to those *columns*, whereas if you provide a boolean series whose length equals the number of rows, Pandas will know that you are trying to subset the *rows*. \n", + "\n", + "But just looking at the code, if you don't know what the data looks like, it is very hard to tell if the syntax below is subsetting the rows or the columns." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# using the single-bracket `df[]` syntax:\n", + "gapminder[australia_index]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Thus, it is recommended that you use the `.loc` indexing syntax, which has an explicit position for the row subsetting and the column subsetting: `df.loc[row_index,column_index]`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Rather than defining a separate indexing series object for filtering the rows (like `australia_index`), it is common to just put the logical filtering condition directly in the indexing syntax:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# directly use logical conditioning inside .loc to select only the rows corresponding to Australia\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Multiple conditions\n", + "\n", + "You can provide multiple row filtering conditions by separating them with an `&`:" + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# provide multiple logical conditions in `.loc` to filter the rows so that only rows corresponding to Australia after 1990 are selected\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise\n", + "\n", + "Extract the subset of the data corresponding to Asian countries for which the life expectancy is at least 75." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/content/incomplete/11_filtering_query.ipynb b/content/incomplete/11_filtering_query.ipynb new file mode 100644 index 0000000..b3b2329 --- /dev/null +++ b/content/incomplete/11_filtering_query.ipynb @@ -0,0 +1,155 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Filtering the rows using the `.query` method\n", + "\n", + "In this section, we will introduce another row filtering approach using the `.query()` method." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# import pandas\n", + "\n", + "# load the gapminder dataset\n", + "\n", + "# take a look at the head of gapminder\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As a reminder, so far we have learned:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# filter to Australia using a logical condition inside `gapminder[]`\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# filter to Australia using a logical condition inside `gapminder.loc[,]`\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Filtering using `.query()`\n", + "\n", + "The `.query()` method does the same thing, but the syntax is a bit different. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# filter to Australia using a logical string expression inside query\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### External variables in the `.query()` method\n", + "\n", + "Note that if you want to use an \"external\" variable in your filtering query, you need to access it within the argument using `@variable_name`. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# define a variable, selected_country, and assign it the string 'Brazil'\n", + "\n", + "# use query with @ to filter gapminder to selected_country\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Combining `.query()` with `.loc`\n", + "\n", + "Note that since `gapminder.query()` outputs a DataFrame itself, you can follow a query method call with further subsetting, e.g., using `.loc[]`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Use .query() to filter to the selected country and then use df.loc[,] to select the year and lifeExp columns\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note, however, that you cannot reverse the order of the `.loc` indexer and the `query()` method in this case." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Try to use .loc[,] to select the year and lifeExp columns and then use .query() to filter to the selected country\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise\n", + "\n", + "Use the query method to filter the gapminder data to the year 2007, and return just the `country` and `lifeExp` columns.\n", + "\n", + "Then do the same thing using only the `.loc` indexer."
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/content/incomplete/12_iloc.ipynb b/content/incomplete/12_iloc.ipynb new file mode 100644 index 0000000..9227a49 --- /dev/null +++ b/content/incomplete/12_iloc.ipynb @@ -0,0 +1,180 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Subsetting with the `.iloc` indexer\n", + "\n", + "The `.iloc` indexer allows you to do positional indexing, whereas `.loc` requires that you do named indexing. We could pass integers to the `.loc` indexer previously only because the row index values of the `gapminder` DataFrame were themselves integers.\n", + "\n", + "Let's take a quick look at the gapminder dataset (notice that the row index consists of integers):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# import pandas\n", + "\n", + "# load the gapminder dataset\n", + "\n", + "# take a look at the head of gapminder\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A `.loc[,]` reminder:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# use .loc[,] to subset the rows with index 1, 4, 5 and columns 'year', 'lifeExp', 'pop' from gapminder\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define the `gapminder_country` DataFrame that has the `country` column as the row index:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + 
"source": [ + "# create gapminder_country dataframe with country as index\n", + "\n", + "# look at the head of gapminder_country\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Positional indexing with `.iloc`\n", + "\n", + "Now let's try and extract the rows 1, 4, and 5 and the columns `'year'`, `'lifeExp'`, and `'pop'` from this country-indexed version of gapminder:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Try to use .loc[,] to subset the rows with index 1, 4, 5 and columns 'year', 'lifeExp', 'pop' from gapminder_country\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We get an error, because there are no longer any rows with row index names 1, 4, 5.\n", + "\n", + "If you want to do positional indexing, use `.iloc`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Use .iloc to extract the 3rd row and 3rd column (i.e., index position of 2) of gapminder_country using positional indexing\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can subset to multiple rows and columns by providing a list of row/column positions to the corresponding entry of the `.iloc` indexer. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Use .iloc to extract rows in position 2, 5, 7, and the columns in position 1 and 3\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Using `:` to mean all rows/columns\n", + "\n", + "We can also use the `:` placeholder for \"all rows\" or \"all columns\". 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Use .iloc to extract all rows for columns 0, 3, 5\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### More general sequences with `start:stop:step`\n", + "\n", + "`start:stop:step`, e.g., `0:20:2`, will similarly correspond to a list of integers from 0 up to 20 (non-inclusive) with a step size of 2, so 0, 2, 4, 6, ..., 18. \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Use .iloc to extract rows 0, 2, 4, 6, ..., 18 (inclusive) for columns 0 to 3 (inclusive)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise\n", + "\n", + "Use `iloc` to extract every third row starting at index position 2 up to position 100, and the first two columns. " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/content/incomplete/13_modifying_dataframes.ipynb b/content/incomplete/13_modifying_dataframes.ipynb new file mode 100644 index 0000000..3f037be --- /dev/null +++ b/content/incomplete/13_modifying_dataframes.ipynb @@ -0,0 +1,345 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Adding and dropping columns from DataFrames" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This document will demonstrate how to add and remove columns from a DataFrame" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# import pandas\n", + 
"\n", + "# load the gapminder dataset\n", + "\n", + "# take a look at the head of gapminder\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Creating a new column in a DataFrame" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# compute the product of the pop and gdpPercap columns\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's add a new column `gdp` which is the product of the `pop` and `gdpPercap` columns:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# add a new column to gapminder corresponding to the product of the values in the 'pop' and 'gdpPercap' columns\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Has the original gapminder object changed?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Removing a column from a DataFrame using `.drop()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To remove a column from a DataFrame, you can use the pandas `.drop` method. 
\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# try to remove gdp from gapminder using the df.drop(columns=) method\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# did the original gapminder data object change?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "To update the `gapminder` DataFrame to be the version without the `gdp` column, you need to overwrite the `gapminder` object by assigning it to be the version without `gdp` as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# overwrite gapminder with the output of the df.drop(columns=) method\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# now look at the gapminder data object - has it changed?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Creating a copy of a DataFrame object" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Suppose that you want to keep an unmodified copy of the original `gapminder` DataFrame object in your environment, and create a different version, called `gapminder_new`, that you can modify as much as you like. 
\n", + "\n", + "You might try to create a new variable `gapminder_new` that contains the original `gapminder` DataFrame as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# define gapminder_new and set it equal to gapminder\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Indeed, `gapminder_new` contains the same DataFrame object as `gapminder`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# take a look at gapminder_new\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's add a `GDP` column to this new `gapminder_new` DataFrame object:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Define a new column in gapminder_new called 'GDP' that is equal to the product of the 'pop' and 'gdpPercap' columns\n", + "\n", + "# take a look at gapminder_new\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# take a look at the original gapminder object -- has it changed?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What's going on here?\n", + "\n", + "Let's revert the `gapminder` DataFrame object to the original dataset by re-loading the csv file:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# read in gapminder again to revert to the original dataset\n", + "gapminder = pd.read_csv('data/gapminder.csv')\n", + "gapminder" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The `.copy()` method\n", + "\n", + "The problem is that when you write `gapminder_new = gapminder`, this creates a new \"pointer\" to the `gapminder` DataFrame: `gapminder_new` acts as an \"alias\" for the original `gapminder` DataFrame. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To create an independent copy of a DataFrame, use the Pandas `.copy()` method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# define gapminder_new this time as a copy of gapminder\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's add a new column to `gapminder_new` called `gdp_new`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# add a column, gdp_new, to gapminder_new that is equal to the product of the 'pop' and 'gdpPercap' columns\n", + "\n", + "# take a look at gapminder_new\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# check whether the original gapminder object has changed\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise \n", + "\n", + "Create a version of gapminder called `gapminder_gdp` that contains three columns: country, year, and gdp (the GDP for each country-year in millions). Make sure that the original `gapminder` DataFrame is not modified." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Modifying existing columns of a DataFrame" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `df['col'] = ...` syntax can be used not only to add new columns, but also to modify existing columns."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Round the lifeExp column values to the nearest integer:\n", + "# -------------\n", + "# import numpy\n", + "\n", + "# apply np.round() to the 'lifeExp' column of gapminder\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Update the existing `lifeExp` column with this rounded version as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# update the lifeExp column of gapminder with the rounded version\n", + "\n", + "# look at gapminder\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/content/incomplete/14_summarizing_dataframes.ipynb b/content/incomplete/14_summarizing_dataframes.ipynb new file mode 100644 index 0000000..c2d05a1 --- /dev/null +++ b/content/incomplete/14_summarizing_dataframes.ipynb @@ -0,0 +1,298 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Summarizing DataFrames\n", + "\n", + "This notebook will demonstrate how to apply simple statistical methods, such as the mean, sum, median, standard deviation (for numeric columns), and counts (for categorical columns) to columns in a pandas DataFrame." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "gapminder = pd.read_csv('data/gapminder.csv')\n", + "gapminder" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To apply a statistical summary such as the mean to a single column from a DataFrame, you first need to extract the column of interest, e.g., using the `df['col']` syntax, and then apply the relevant method (e.g., `.mean()`) to the resulting Series object.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Calculating the mean\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Use the .mean() method to compute the mean of the lifeExp column\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that `.mean()` is a *method*, rather than a function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Try to use mean() as a function to compute the mean of lifeExp\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can apply `.mean()` to multiple columns at once by extracting the relevant columns (e.g., using `df[[]]` or `df.loc[,]`), and then applying `.mean()` to the resulting DataFrame:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# use the .mean() method to compute the mean of both the lifeExp and gdpPercap columns simultaneously\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What is the type of the object returned by `.mean()`?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Check the type of the result above\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since `.mean()` can be applied to multiple columns at once, why is it that applying `.mean()` directly to `gapminder` doesn't work? " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# what happens when you try to apply the .mean() method to the entire gapminder DataFrame?\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# show gapminder to try to figure out why the above command failed\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Extracting columns of a particular type\n", + "\n", + "You can extract all of the numeric columns of gapminder using the `.select_dtypes()` method, and then apply the `.mean()` method to the resulting DataFrame:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# apply the .mean() method to the gapminder DataFrame\n", + "# but only to the columns of type 'number' extracted using .select_dtypes()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Other statistical summaries: sum, median, std\n", + "\n", + "The sum:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# apply the .sum() method to the gapminder DataFrame, but only to the columns of type 'number'\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The median:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# apply the .median() method to the gapminder DataFrame, but only to the columns of type 'number'\n" + ] + }, + { + 
"cell_type": "markdown", + "metadata": {}, + "source": [ + "The standard deviation:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# apply the .std() method to the gapminder DataFrame, but only to the columns of type 'number'\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Counting the number of unique categorical values with the `value_counts()` method\n", + "\n", + "While the above methods can only be used for numeric (float or integer) columns, there are some summaries that you can use for categorical columns too. \n", + "\n", + "The `.value_counts()` method will compute the number of times each unique value appears in a column." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# apply the .value_counts() method to the country column of the gapminder DataFrame\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And the `continent` column:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# apply the .value_counts() method to the continent column of the gapminder DataFrame\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise\n", + "\n", + "Compute the average life expectancy for all countries in Asia in the year 1992." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Standardizing a DataFrame" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's create a version of gapminder that just contains the numeric columns:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# create gapminder_numeric, a subset of gapminder that contains only the columns of type 'number'\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that Pandas will perform mathematical operations column-wise, so standardization can be done by subtracting the mean and dividing by the standard deviation:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# create gapminder_std, which contains the standardized values of gapminder_numeric\n", + "\n", + "# look at gapminder_std" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/content/incomplete/15_grouped_computations.ipynb b/content/incomplete/15_grouped_computations.ipynb new file mode 100644 index 0000000..50b4ecd --- /dev/null +++ b/content/incomplete/15_grouped_computations.ipynb @@ -0,0 +1,93 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Grouped Computations for DataFrames" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "gapminder = pd.read_csv('data/gapminder.csv')\n", + "# create a version of gapminder with only the numeric columns 
called `gapminder_numeric`\n", + "gapminder_numeric = gapminder.select_dtypes(include='number').copy()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The `.groupby()` method\n", + "\n", + "Sometimes we want to compute a summary separately for different groups (where the groups might be defined by the unique values in a column).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Apply the .mean() method to gapminder_numeric\n", + "\n", + "# Apply the .mean() method to gapminder_numeric, but group by the 'year' column.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Grouping by multiple columns\n", + "\n", + "We can group by multiple columns at once by providing a *list* of the column names to the `.groupby()` method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# group by the year and continent columns and compute the mean of the lifeExp column\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise\n", + "\n", + "1. Compute the maximum population for each country\n", + "\n", + "2. 
Compute the mean gdpPercap for each continent averaged across all years after 1990" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/content/incomplete/16_visualization.ipynb b/content/incomplete/16_visualization.ipynb new file mode 100644 index 0000000..4cd6cab --- /dev/null +++ b/content/incomplete/16_visualization.ipynb @@ -0,0 +1,146 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Visualizing DataFrames\n", + "\n", + "In this notebook, we will use the pandas visualization methods to visualize our DataFrames. We will also touch on other visualization libraries that allow you to create more advanced and customized figures." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "gapminder = pd.read_csv('data/gapminder.csv')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pandas visualization methods\n", + "\n", + "Pandas has a few methods that allow you to quickly create visualizations from your DataFrame. 
We will use the `plot` method to create a few plots.\n", + "\n", + "The general syntax is `df.plot(kind=)`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Use `df.plot` to create a scatterplot of gdpPercap vs lifeExp\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Use `df.plot` to create a barplot of the number of countries in each continent\n", + "# Hint: use value_counts() first to get the counts\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# create a histogram of the life expectancy\n", + "# Hint: extract the lifeExp column first\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The Seaborn library\n", + "\n", + "The inbuilt Pandas plotting functionalities are somewhat limited in what they can do.\n", + "\n", + "If you want to create more sophisticated visualizations, you'll want to use another library. One of the most popular libraries for data visualization is [Seaborn](https://seaborn.pydata.org/)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Import seaborn\n", + "\n", + "\n", + "# Use seaborn to create a scatterplot of gdpPercap vs lifeExp\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# create a version of the above scatterplot for just the year 2007, \n", + "# where the color is based on continent \n", + "# and the size is based on population\n", + "# and the size range is from 20 to 500\n", + "# and the transparency is 0.7\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Creating bar charts and histograms with seaborn is also fairly straightforward." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# create a bar chart of the number of countries in each continent using sns.countplot()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# create a histogram of lifeExp using sns.histplot()\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}