diff --git a/Code-Drawing Template.pdf b/Code-Drawing Template.pdf new file mode 100644 index 0000000..2ad2ede Binary files /dev/null and b/Code-Drawing Template.pdf differ diff --git a/NumpyExercises/Individual_Numpy.ipynb b/NumpyExercises/Individual_Numpy.ipynb new file mode 100644 index 0000000..90a9f46 --- /dev/null +++ b/NumpyExercises/Individual_Numpy.ipynb @@ -0,0 +1,391 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 1: make a common array" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 80842, 333008, 202553, 140037, 81969],\n", + " [ 63857, 42105, 261540, 481981, 176739],\n", + " [489984, 326386, 110795, 394863, 25024],\n", + " [ 38317, 49982, 408830, 485118, 16119],\n", + " [407675, 231729, 265455, 109413, 103399],\n", + " [174677, 343356, 301717, 224120, 401101],\n", + " [140473, 254634, 112262, 25063, 108262],\n", + " [375059, 406983, 208947, 115641, 296685],\n", + " [444899, 129585, 171318, 313094, 425041],\n", + " [188411, 335140, 141681, 59641, 211420],\n", + " [287650, 8973, 477425, 382803, 465168],\n", + " [ 3975, 32213, 160603, 275485, 388234],\n", + " [246225, 56174, 244097, 9350, 496966],\n", + " [225516, 273338, 73335, 283013, 212813],\n", + " [ 38175, 282399, 318413, 337639, 379802],\n", + " [198049, 101115, 419547, 260219, 325793],\n", + " [148593, 425024, 348570, 117968, 107007],\n", + " [ 52547, 180346, 178760, 305186, 262153],\n", + " [ 11835, 449971, 494184, 472031, 353049],\n", + " [476442, 35455, 191553, 384154, 29917]])" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import numpy as np\n", + "\n", + "# Seed insures results are stable.\n", + "np.random.seed(21)\n", + "random_integers = np.random.randint(1, high=500000, size=(20, 5))\n", + "random_integers" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + 
"Exercise 2:What is the average value of the second column (to one decimal place)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "214895.8" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# The average value of the second column\n", + "random_integers[:, 1].mean()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 3: What is the average value of the first 5 rows of the third and fourth columns (to one decimal place)?" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "286058.5" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# The average of the first 5 rows of 3rd and 4th columns\n", + "subset = random_integers[:5, 2:4]\n", + "np.mean(subset)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 4: Result of matrix 1 plus matrix 2\n", + "\\begin{bmatrix}\n", + "2 & 4 & 6\\\\\n", + "5 & 7 & 9\n", + "\\end{bmatrix}" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[2 4 6]\n", + " [5 7 9]]\n" + ] + } + ], + "source": [ + "# Exercise 4 Python:\n", + "first_matrix = np.array([[1, 2, 3], [4, 5, 6]])\n", + "second_matrix = np.array([1, 2, 3])\n", + "print(first_matrix + second_matrix)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 5: Result of my_vector[selection]:\n", + "\\begin{bmatrix}\n", + "2 & 4 & 6\\\\\n", + "\\end{bmatrix}" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[2 4 6]\n" + ] + } + ], + "source": [ + "# Exercise 5 python:\n", + "my_vector = 
np.array([1, 2, 3, 4, 5, 6])\n", + "selection = my_vector % 2 == 0\n", + "print(my_vector[selection])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For exercise 6: I didn't make any errors but I learned how to do matrix notation on markdown" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 7 slicing:\n", + "\\begin{bmatrix}\n", + "2&3\\\\\n", + "5&6\n", + "\\end{bmatrix}" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[ 4 6]\n", + " [10 12]]\n" + ] + } + ], + "source": [ + "# Exercise 8\n", + "my_array = np.array([[1, 2, 3], [4, 5, 6]])\n", + "my_slice = my_array[:, 1:3]\n", + "my_array[:, :] = my_array * 2\n", + "print(my_slice)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 8 slicing and view\n", + "\\begin{bmatrix}\n", + "4&6\\\\\n", + "10&12\n", + "\\end{bmatrix}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 9 what does the slice look like?\n", + "\\begin{bmatrix}\n", + "2&3\\\\\n", + "5&6\n", + "\\end{bmatrix}\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 10: my prediction was correct. 
I knew the slice would not change because my_array = my_array * 2 creates a new array and rebinds the name my_array to it, so my_slice is still a view of the original, unmodified array" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[2 3]\n", + " [5 6]]\n" + ] + } + ], + "source": [ + "# Exercise 10\n", + "my_array = np.array([[1, 2, 3], [4, 5, 6]])\n", + "my_slice = my_array[:, 1:3]\n", + "my_array = my_array * 2\n", + "print(my_slice)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 11: \n", + "\\begin{bmatrix}\n", + "2&3\\\\\n", + "5&6\n", + "\\end{bmatrix}" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[2 3]\n", + " [5 6]]\n" + ] + } + ], + "source": [ + "my_array = np.array([[1, 2, 3], [4, 5, 6]])\n", + "my_slice = my_array[:, 1:3].copy()\n", + "my_array[:, :] = my_array * 2\n", + "print(my_slice)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 12 prediction:\n", + "y would be [\"a change\", 2]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 13 prediction: if we printed x it would be [1,2,3]" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['a change', 2]\n", + "[1, 2, 3]\n" + ] + } + ], + "source": [ + "# Exercise 12 and 13:\n", + "x = [1, 2, 3]\n", + "y = x[0:2]\n", + "y[0] = \"a change\"\n", + "print(y)\n", + "print(x)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 14: I was correct. Slicing a Python list creates a new list, so assigning to y[0] changes only y and x stays [1, 2, 3]." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[2 3]\n", + "[3]\n", + "[ 2 -1]\n", + "[-1]\n" + ] + } + ], + "source": [ + "my_array = np.array([1, 2, 3])\n", + "my_array = my_array[1:4]\n", + "print(my_array)\n", + "my_slice = my_array[1:3]\n", + "print(my_slice)\n", + "my_slice[0] = -1\n", + "print(my_array)\n", + "print(my_slice)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "base", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/NumpyExercises/Individual_Numpy_files/Individual_Numpy.html b/NumpyExercises/Individual_Numpy_files/Individual_Numpy.html new file mode 100644 index 0000000..c90aeda --- /dev/null +++ b/NumpyExercises/Individual_Numpy_files/Individual_Numpy.html @@ -0,0 +1,8221 @@ + + + + +Individual_Numpy + + + + + + + + + + + + +
+ + + \ No newline at end of file diff --git a/NumpyExercises/Individual_Numpy_files/Individual_Numpy.ipynb b/NumpyExercises/Individual_Numpy_files/Individual_Numpy.ipynb new file mode 100644 index 0000000..90a9f46 --- /dev/null +++ b/NumpyExercises/Individual_Numpy_files/Individual_Numpy.ipynb @@ -0,0 +1,391 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 1: make a common array" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 80842, 333008, 202553, 140037, 81969],\n", + " [ 63857, 42105, 261540, 481981, 176739],\n", + " [489984, 326386, 110795, 394863, 25024],\n", + " [ 38317, 49982, 408830, 485118, 16119],\n", + " [407675, 231729, 265455, 109413, 103399],\n", + " [174677, 343356, 301717, 224120, 401101],\n", + " [140473, 254634, 112262, 25063, 108262],\n", + " [375059, 406983, 208947, 115641, 296685],\n", + " [444899, 129585, 171318, 313094, 425041],\n", + " [188411, 335140, 141681, 59641, 211420],\n", + " [287650, 8973, 477425, 382803, 465168],\n", + " [ 3975, 32213, 160603, 275485, 388234],\n", + " [246225, 56174, 244097, 9350, 496966],\n", + " [225516, 273338, 73335, 283013, 212813],\n", + " [ 38175, 282399, 318413, 337639, 379802],\n", + " [198049, 101115, 419547, 260219, 325793],\n", + " [148593, 425024, 348570, 117968, 107007],\n", + " [ 52547, 180346, 178760, 305186, 262153],\n", + " [ 11835, 449971, 494184, 472031, 353049],\n", + " [476442, 35455, 191553, 384154, 29917]])" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import numpy as np\n", + "\n", + "# Seed insures results are stable.\n", + "np.random.seed(21)\n", + "random_integers = np.random.randint(1, high=500000, size=(20, 5))\n", + "random_integers" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 2:What is the average value of the second column (to one decimal 
place)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "214895.8" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# The average value of the second column\n", + "random_integers[:, 1].mean()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 3: What is the average value of the first 5 rows of the third and fourth columns (to one decimal place)?" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "286058.5" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# The average of the first 5 rows of 3rd and 4th columns\n", + "subset = random_integers[:5, 2:4]\n", + "np.mean(subset)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 4: Result of matrix 1 plus matrix 2\n", + "\\begin{bmatrix}\n", + "2 & 4 & 6\\\\\n", + "5 & 7 & 9\n", + "\\end{bmatrix}" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[2 4 6]\n", + " [5 7 9]]\n" + ] + } + ], + "source": [ + "# Exercise 4 Python:\n", + "first_matrix = np.array([[1, 2, 3], [4, 5, 6]])\n", + "second_matrix = np.array([1, 2, 3])\n", + "print(first_matrix + second_matrix)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 5: Result of my_vector[selection]:\n", + "\\begin{bmatrix}\n", + "2 & 4 & 6\\\\\n", + "\\end{bmatrix}" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[2 4 6]\n" + ] + } + ], + "source": [ + "# Exercise 5 python:\n", + "my_vector = np.array([1, 2, 3, 4, 5, 6])\n", + "selection = my_vector % 2 == 0\n", + 
"print(my_vector[selection])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For exercise 6: I didn't make any errors but I learned how to do matrix notation on markdown" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 7 slicing:\n", + "\\begin{bmatrix}\n", + "2&3\\\\\n", + "5&6\n", + "\\end{bmatrix}" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[ 4 6]\n", + " [10 12]]\n" + ] + } + ], + "source": [ + "# Exercise 8\n", + "my_array = np.array([[1, 2, 3], [4, 5, 6]])\n", + "my_slice = my_array[:, 1:3]\n", + "my_array[:, :] = my_array * 2\n", + "print(my_slice)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 8 slicing and view\n", + "\\begin{bmatrix}\n", + "4&6\\\\\n", + "10&12\n", + "\\end{bmatrix}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 9 what does the slice look like?\n", + "\\begin{bmatrix}\n", + "2&3\\\\\n", + "5&6\n", + "\\end{bmatrix}\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 10: my prediction was correct. 
I knew the slice would not change because my_array = my_array * 2 creates a new array and rebinds the name my_array to it, so my_slice is still a view of the original, unmodified array" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[2 3]\n", + " [5 6]]\n" + ] + } + ], + "source": [ + "# Exercise 10\n", + "my_array = np.array([[1, 2, 3], [4, 5, 6]])\n", + "my_slice = my_array[:, 1:3]\n", + "my_array = my_array * 2\n", + "print(my_slice)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 11: \n", + "\\begin{bmatrix}\n", + "2&3\\\\\n", + "5&6\n", + "\\end{bmatrix}" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[2 3]\n", + " [5 6]]\n" + ] + } + ], + "source": [ + "my_array = np.array([[1, 2, 3], [4, 5, 6]])\n", + "my_slice = my_array[:, 1:3].copy()\n", + "my_array[:, :] = my_array * 2\n", + "print(my_slice)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 12 prediction:\n", + "y would be [\"a change\", 2]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 13 prediction: if we printed x it would be [1,2,3]" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['a change', 2]\n", + "[1, 2, 3]\n" + ] + } + ], + "source": [ + "# Exercise 12 and 13:\n", + "x = [1, 2, 3]\n", + "y = x[0:2]\n", + "y[0] = \"a change\"\n", + "print(y)\n", + "print(x)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise 14: I was correct. Slicing a Python list creates a new list, so assigning to y[0] changes only y and x stays [1, 2, 3]." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[2 3]\n", + "[3]\n", + "[ 2 -1]\n", + "[-1]\n" + ] + } + ], + "source": [ + "my_array = np.array([1, 2, 3])\n", + "my_array = my_array[1:4]\n", + "print(my_array)\n", + "my_slice = my_array[1:3]\n", + "print(my_slice)\n", + "my_slice[0] = -1\n", + "print(my_array)\n", + "print(my_slice)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "base", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/NumpyExercises/Individual_Numpy_files/Individual_Numpy.pdf b/NumpyExercises/Individual_Numpy_files/Individual_Numpy.pdf new file mode 100644 index 0000000..b734c5f Binary files /dev/null and b/NumpyExercises/Individual_Numpy_files/Individual_Numpy.pdf differ diff --git a/NumpyExercises/Individual_Numpy_files/MathJax.js.download b/NumpyExercises/Individual_Numpy_files/MathJax.js.download new file mode 100644 index 0000000..4f36e31 --- /dev/null +++ b/NumpyExercises/Individual_Numpy_files/MathJax.js.download @@ -0,0 +1,19 @@ +/* + * /MathJax.js + * + * Copyright (c) 2009-2018 The MathJax Consortium + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ + +if(document.getElementById&&document.childNodes&&document.createElement){if(!(window.MathJax&&MathJax.Hub)){if(window.MathJax){window.MathJax={AuthorConfig:window.MathJax}}else{window.MathJax={}}MathJax.isPacked=true;MathJax.version="2.7.7";MathJax.fileversion="2.7.7";MathJax.cdnVersion="2.7.7";MathJax.cdnFileVersions={};(function(d){var b=window[d];if(!b){b=window[d]={}}var e=[];var c=function(f){var g=f.constructor;if(!g){g=function(){}}for(var h in f){if(h!=="constructor"&&f.hasOwnProperty(h)){g[h]=f[h]}}return g};var a=function(){return function(){return arguments.callee.Init.call(this,arguments)}};b.Object=c({constructor:a(),Subclass:function(f,h){var g=a();g.SUPER=this;g.Init=this.Init;g.Subclass=this.Subclass;g.Augment=this.Augment;g.protoFunction=this.protoFunction;g.can=this.can;g.has=this.has;g.isa=this.isa;g.prototype=new this(e);g.prototype.constructor=g;g.Augment(f,h);return g},Init:function(f){var g=this;if(f.length===1&&f[0]===e){return g}if(!(g instanceof f.callee)){g=new f.callee(e)}return g.Init.apply(g,f)||g},Augment:function(f,g){var h;if(f!=null){for(h in f){if(f.hasOwnProperty(h)){this.protoFunction(h,f[h])}}if(f.toString!==this.prototype.toString&&f.toString!=={}.toString){this.protoFunction("toString",f.toString)}}if(g!=null){for(h in g){if(g.hasOwnProperty(h)){this[h]=g[h]}}}return this},protoFunction:function(g,f){this.prototype[g]=f;if(typeof f==="function"){f.SUPER=this.SUPER.prototype}},prototype:{Init:function(){},SUPER:function(f){return f.callee.SUPER},can:function(f){return typeof(this[f])==="function"},has:function(f){return typeof(this[f])!=="undefined"},isa:function(f){return(f instanceof Object)&&(this instanceof f)}},can:function(f){return this.prototype.can.call(this,f)},has:function(f){return this.prototype.has.call(this,f)},isa:function(g){var f=this;while(f){if(f===g){return true}else{f=f.SUPER}}return 
false},SimpleSUPER:c({constructor:function(f){return this.SimpleSUPER.define(f)},define:function(f){var h={};if(f!=null){for(var g in f){if(f.hasOwnProperty(g)){h[g]=this.wrap(g,f[g])}}if(f.toString!==this.prototype.toString&&f.toString!=={}.toString){h.toString=this.wrap("toString",f.toString)}}return h},wrap:function(i,h){if(typeof(h)!=="function"||!h.toString().match(/\.\s*SUPER\s*\(/)){return h}var g=function(){this.SUPER=g.SUPER[i];try{var f=h.apply(this,arguments)}catch(j){delete this.SUPER;throw j}delete this.SUPER;return f};g.toString=function(){return h.toString.apply(h,arguments)};return g}})});b.Object.isArray=Array.isArray||function(f){return Object.prototype.toString.call(f)==="[object Array]"};b.Object.Array=Array})("MathJax");(function(BASENAME){var BASE=window[BASENAME];if(!BASE){BASE=window[BASENAME]={}}var isArray=BASE.Object.isArray;var CALLBACK=function(data){var cb=function(){return arguments.callee.execute.apply(arguments.callee,arguments)};for(var id in CALLBACK.prototype){if(CALLBACK.prototype.hasOwnProperty(id)){if(typeof(data[id])!=="undefined"){cb[id]=data[id]}else{cb[id]=CALLBACK.prototype[id]}}}cb.toString=CALLBACK.prototype.toString;return cb};CALLBACK.prototype={isCallback:true,hook:function(){},data:[],object:window,execute:function(){if(!this.called||this.autoReset){this.called=!this.autoReset;return this.hook.apply(this.object,this.data.concat([].slice.call(arguments,0)))}},reset:function(){delete this.called},toString:function(){return this.hook.toString.apply(this.hook,arguments)}};var ISCALLBACK=function(f){return(typeof(f)==="function"&&f.isCallback)};var EVAL=function(code){return eval.call(window,code)};var TESTEVAL=function(){EVAL("var __TeSt_VaR__ = 1");if(window.__TeSt_VaR__){try{delete window.__TeSt_VaR__}catch(error){window.__TeSt_VaR__=null}}else{if(window.execScript){EVAL=function(code){BASE.__code=code;code="try {"+BASENAME+".__result = eval("+BASENAME+".__code)} catch(err) {"+BASENAME+".__result = 
err}";window.execScript(code);var result=BASE.__result;delete BASE.__result;delete BASE.__code;if(result instanceof Error){throw result}return result}}else{EVAL=function(code){BASE.__code=code;code="try {"+BASENAME+".__result = eval("+BASENAME+".__code)} catch(err) {"+BASENAME+".__result = err}";var head=(document.getElementsByTagName("head"))[0];if(!head){head=document.body}var script=document.createElement("script");script.appendChild(document.createTextNode(code));head.appendChild(script);head.removeChild(script);var result=BASE.__result;delete BASE.__result;delete BASE.__code;if(result instanceof Error){throw result}return result}}}TESTEVAL=null};var USING=function(args,i){if(arguments.length>1){if(arguments.length===2&&!(typeof arguments[0]==="function")&&arguments[0] instanceof Object&&typeof arguments[1]==="number"){args=[].slice.call(args,i)}else{args=[].slice.call(arguments,0)}}if(isArray(args)&&args.length===1&&typeof(args[0])==="function"){args=args[0]}if(typeof args==="function"){if(args.execute===CALLBACK.prototype.execute){return args}return CALLBACK({hook:args})}else{if(isArray(args)){if(typeof(args[0])==="string"&&args[1] instanceof Object&&typeof args[1][args[0]]==="function"){return CALLBACK({hook:args[1][args[0]],object:args[1],data:args.slice(2)})}else{if(typeof args[0]==="function"){return CALLBACK({hook:args[0],data:args.slice(1)})}else{if(typeof args[1]==="function"){return CALLBACK({hook:args[1],object:args[0],data:args.slice(2)})}}}}else{if(typeof(args)==="string"){if(TESTEVAL){TESTEVAL()}return CALLBACK({hook:EVAL,data:[args]})}else{if(args instanceof Object){return CALLBACK(args)}else{if(typeof(args)==="undefined"){return CALLBACK({})}}}}}throw Error("Can't make callback from given data")};var DELAY=function(time,callback){callback=USING(callback);callback.timeout=setTimeout(callback,time);return callback};var WAITFOR=function(callback,signal){callback=USING(callback);if(!callback.called){WAITSIGNAL(callback,signal);signal.pending++}};var 
WAITEXECUTE=function(){var signals=this.signal;delete this.signal;this.execute=this.oldExecute;delete this.oldExecute;var result=this.execute.apply(this,arguments);if(ISCALLBACK(result)&&!result.called){WAITSIGNAL(result,signals)}else{for(var i=0,m=signals.length;i0&&priority=0;i--){this.hooks.splice(i,1)}this.remove=[]}});var EXECUTEHOOKS=function(hooks,data,reset){if(!hooks){return null}if(!isArray(hooks)){hooks=[hooks]}if(!isArray(data)){data=(data==null?[]:[data])}var handler=HOOKS(reset);for(var i=0,m=hooks.length;ig){g=document.styleSheets.length}if(!i){i=document.head||((document.getElementsByTagName("head"))[0]);if(!i){i=document.body}}return i};var f=[];var c=function(){for(var k=0,j=f.length;k=this.timeout){i(this.STATUS.ERROR);return 1}return 0},file:function(j,i){if(i<0){a.Ajax.loadTimeout(j)}else{a.Ajax.loadComplete(j)}},execute:function(){this.hook.call(this.object,this,this.data[0],this.data[1])},checkSafari2:function(i,j,k){if(i.time(k)){return}if(document.styleSheets.length>j&&document.styleSheets[j].cssRules&&document.styleSheets[j].cssRules.length){k(i.STATUS.OK)}else{setTimeout(i,i.delay)}},checkLength:function(i,l,n){if(i.time(n)){return}var m=0;var j=(l.sheet||l.styleSheet);try{if((j.cssRules||j.rules||[]).length>0){m=1}}catch(k){if(k.message.match(/protected variable|restricted URI/)){m=1}else{if(k.message.match(/Security error/)){m=1}}}if(m){setTimeout(a.Callback([n,i.STATUS.OK]),0)}else{setTimeout(i,i.delay)}}},loadComplete:function(i){i=this.fileURL(i);var j=this.loading[i];if(j&&!j.preloaded){a.Message.Clear(j.message);clearTimeout(j.timeout);if(j.script){if(f.length===0){setTimeout(c,0)}f.push(j.script)}this.loaded[i]=j.status;delete this.loading[i];this.addHook(i,j.callback)}else{if(j){delete this.loading[i]}this.loaded[i]=this.STATUS.OK;j={status:this.STATUS.OK}}if(!this.loadHooks[i]){return null}return 
this.loadHooks[i].Execute(j.status)},loadTimeout:function(i){if(this.loading[i].timeout){clearTimeout(this.loading[i].timeout)}this.loading[i].status=this.STATUS.ERROR;this.loadError(i);this.loadComplete(i)},loadError:function(i){a.Message.Set(["LoadFailed","File failed to load: %1",i],null,2000);a.Hub.signal.Post(["file load error",i])},Styles:function(k,l){var i=this.StyleString(k);if(i===""){l=a.Callback(l);l()}else{var j=document.createElement("style");j.type="text/css";this.head=h(this.head);this.head.appendChild(j);if(j.styleSheet&&typeof(j.styleSheet.cssText)!=="undefined"){j.styleSheet.cssText=i}else{j.appendChild(document.createTextNode(i))}l=this.timer.create.call(this,l,j)}return l},StyleString:function(n){if(typeof(n)==="string"){return n}var k="",o,m;for(o in n){if(n.hasOwnProperty(o)){if(typeof n[o]==="string"){k+=o+" {"+n[o]+"}\n"}else{if(a.Object.isArray(n[o])){for(var l=0;l="0"&&q<="9"){f[j]=p[f[j]-1];if(typeof f[j]==="number"){f[j]=this.number(f[j])}}else{if(q==="{"){q=f[j].substr(1);if(q>="0"&&q<="9"){f[j]=p[f[j].substr(1,f[j].length-2)-1];if(typeof f[j]==="number"){f[j]=this.number(f[j])}}else{var k=f[j].match(/^\{([a-z]+):%(\d+)\|(.*)\}$/);if(k){if(k[1]==="plural"){var d=p[k[2]-1];if(typeof d==="undefined"){f[j]="???"}else{d=this.plural(d)-1;var h=k[3].replace(/(^|[^%])(%%)*%\|/g,"$1$2%\uEFEF").split(/\|/);if(d>=0&&d=3){c.push([f[0],f[1],this.processSnippet(g,f[2])])}else{c.push(e[d])}}}}else{c.push(e[d])}}return c},markdownPattern:/(%.)|(\*{1,3})((?:%.|.)+?)\2|(`+)((?:%.|.)+?)\4|\[((?:%.|.)+?)\]\(([^\s\)]+)\)/,processMarkdown:function(b,h,d){var j=[],e;var c=b.split(this.markdownPattern);var g=c[0];for(var f=1,a=c.length;f1?d[1]:""));f=null}if(e&&(!b.preJax||d)){c.nodeValue=c.nodeValue.replace(b.postJax,(e.length>1?e[1]:""))}if(f&&!f.nodeValue.match(/\S/)){f=f.previousSibling}}if(b.preRemoveClass&&f&&f.className===b.preRemoveClass){a.MathJax.preview=f}a.MathJax.checked=1},processInput:function(a){var b,i=MathJax.ElementJax.STATE;var 
h,e,d=a.scripts.length;try{while(a.ithis.processUpdateTime&&a.i1){d.jax[a.outputJax].push(b)}b.MathJax.state=c.OUTPUT},prepareOutput:function(c,f){while(c.jthis.processUpdateTime&&h.i=0;q--){if((b[q].src||"").match(f)){s.script=b[q].innerHTML;if(RegExp.$2){var t=RegExp.$2.substr(1).split(/\&/);for(var p=0,l=t.length;p=parseInt(y[z])}}return true},Select:function(j){var i=j[d.Browser];if(i){return i(d.Browser)}return null}};var e=k.replace(/^Mozilla\/(\d+\.)+\d+ /,"").replace(/[a-z][-a-z0-9._: ]+\/\d+[^ ]*-[^ ]*\.([a-z][a-z])?\d+ /i,"").replace(/Gentoo |Ubuntu\/(\d+\.)*\d+ (\([^)]*\) )?/,"");d.Browser=d.Insert(d.Insert(new String("Unknown"),{version:"0.0"}),a);for(var v in a){if(a.hasOwnProperty(v)){if(a[v]&&v.substr(0,2)==="is"){v=v.slice(2);if(v==="Mac"||v==="PC"){continue}d.Browser=d.Insert(new String(v),a);var r=new RegExp(".*(Version/| Trident/.*; rv:)((?:\\d+\\.)+\\d+)|.*("+v+")"+(v=="MSIE"?" ":"/")+"((?:\\d+\\.)*\\d+)|(?:^|\\(| )([a-z][-a-z0-9._: ]+|(?:Apple)?WebKit)/((?:\\d+\\.)+\\d+)");var u=r.exec(e)||["","","","unknown","0.0"];d.Browser.name=(u[1]!=""?v:(u[3]||u[5]));d.Browser.version=u[2]||u[4]||u[6];break}}}try{d.Browser.Select({Safari:function(j){var i=parseInt((String(j.version).split("."))[0]);if(i>85){j.webkit=j.version}if(i>=538){j.version="8.0"}else{if(i>=537){j.version="7.0"}else{if(i>=536){j.version="6.0"}else{if(i>=534){j.version="5.1"}else{if(i>=533){j.version="5.0"}else{if(i>=526){j.version="4.0"}else{if(i>=525){j.version="3.1"}else{if(i>500){j.version="3.0"}else{if(i>400){j.version="2.0"}else{if(i>85){j.version="1.0"}}}}}}}}}}j.webkit=(navigator.appVersion.match(/WebKit\/(\d+)\./))[1];j.isMobile=(navigator.appVersion.match(/Mobile/i)!=null);j.noContextMenu=j.isMobile},Firefox:function(j){if((j.version==="0.0"||k.match(/Firefox/)==null)&&navigator.product==="Gecko"){var m=k.match(/[\/ ]rv:(\d+\.\d.*?)[\) ]/);if(m){j.version=m[1]}else{var 
i=(navigator.buildID||navigator.productSub||"0").substr(0,8);if(i>="20111220"){j.version="9.0"}else{if(i>="20111120"){j.version="8.0"}else{if(i>="20110927"){j.version="7.0"}else{if(i>="20110816"){j.version="6.0"}else{if(i>="20110621"){j.version="5.0"}else{if(i>="20110320"){j.version="4.0"}else{if(i>="20100121"){j.version="3.6"}else{if(i>="20090630"){j.version="3.5"}else{if(i>="20080617"){j.version="3.0"}else{if(i>="20061024"){j.version="2.0"}}}}}}}}}}}}j.isMobile=(navigator.appVersion.match(/Android/i)!=null||k.match(/ Fennec\//)!=null||k.match(/Mobile/)!=null)},Chrome:function(i){i.noContextMenu=i.isMobile=!!navigator.userAgent.match(/ Mobile[ \/]/)},Opera:function(i){i.version=opera.version()},Edge:function(i){i.isMobile=!!navigator.userAgent.match(/ Phone/)},MSIE:function(j){j.isMobile=!!navigator.userAgent.match(/ Phone/);j.isIE9=!!(document.documentMode&&(window.performance||window.msPerformance));MathJax.HTML.setScriptBug=!j.isIE9||document.documentMode<9;MathJax.Hub.msieHTMLCollectionBug=(document.documentMode<9);if(document.documentMode<10&&!s.params.NoMathPlayer){try{new ActiveXObject("MathPlayer.Factory.1");j.hasMathPlayer=true}catch(m){}try{if(j.hasMathPlayer){var i=document.createElement("object");i.id="mathplayer";i.classid="clsid:32F66A20-7614-11D4-BD11-00104BD3F987";g.appendChild(i);document.namespaces.add("m","http://www.w3.org/1998/Math/MathML");j.mpNamespace=true;if(document.readyState&&(document.readyState==="loading"||document.readyState==="interactive")){document.write('');j.mpImported=true}}else{document.namespaces.add("mjx_IE_fix","http://www.w3.org/1999/xlink")}}catch(m){}}}})}catch(c){console.error(c.message)}d.Browser.Select(MathJax.Message.browsers);if(h.AuthorConfig&&typeof h.AuthorConfig.AuthorInit==="function"){h.AuthorConfig.AuthorInit()}d.queue=h.Callback.Queue();d.queue.Push(["Post",s.signal,"Begin"],["Config",s],["Cookie",s],["Styles",s],["Message",s],function(){var i=h.Callback.Queue(s.Jax(),s.Extensions());return 
i.Push({})},["Menu",s],s.onLoad(),function(){MathJax.isReady=true},["Typeset",s],["Hash",s],["MenuZoom",s],["Post",s.signal,"End"])})("MathJax")}}; diff --git a/NumpyExercises/Individual_Numpy_files/require.min.js.download b/NumpyExercises/Individual_Numpy_files/require.min.js.download new file mode 100644 index 0000000..84d1d67 --- /dev/null +++ b/NumpyExercises/Individual_Numpy_files/require.min.js.download @@ -0,0 +1,36 @@ +/* + RequireJS 2.1.10 Copyright (c) 2010-2014, The Dojo Foundation All Rights Reserved. + Available via the MIT or new BSD license. + see: http://github.com/jrburke/requirejs for details +*/ +var requirejs,require,define; +(function(ca){function G(b){return"[object Function]"===N.call(b)}function H(b){return"[object Array]"===N.call(b)}function v(b,c){if(b){var d;for(d=0;dthis.depCount&&!this.defined){if(G(c)){if(this.events.error&&this.map.isDefine||h.onError!==da)try{f=i.execCb(b,c,e,f)}catch(d){a=d}else f=i.execCb(b,c,e,f);this.map.isDefine&&void 0===f&&((e=this.module)?f=e.exports:this.usingExports&& +(f=this.exports));if(a)return a.requireMap=this.map,a.requireModules=this.map.isDefine?[this.map.id]:null,a.requireType=this.map.isDefine?"define":"require",w(this.error=a)}else f=c;this.exports=f;if(this.map.isDefine&&!this.ignore&&(p[b]=f,h.onResourceLoad))h.onResourceLoad(i,this.map,this.depMaps);y(b);this.defined=!0}this.defining=!1;this.defined&&!this.defineEmitted&&(this.defineEmitted=!0,this.emit("defined",this.exports),this.defineEmitComplete=!0)}}else this.fetch()}},callPlugin:function(){var a= +this.map,b=a.id,d=m(a.prefix);this.depMaps.push(d);r(d,"defined",t(this,function(f){var d,g;g=j(ba,this.map.id);var J=this.map.name,u=this.map.parentMap?this.map.parentMap.name:null,p=i.makeRequire(a.parentMap,{enableBuildCallback:!0});if(this.map.unnormalized){if(f.normalize&&(J=f.normalize(J,function(a){return c(a,u,!0)})||""),f=m(a.prefix+"!"+J,this.map.parentMap),r(f,"defined",t(this,function(a){this.init([],function(){return 
a},null,{enabled:!0,ignore:!0})})),g=j(k,f.id)){this.depMaps.push(f); +if(this.events.error)g.on("error",t(this,function(a){this.emit("error",a)}));g.enable()}}else g?(this.map.url=i.nameToUrl(g),this.load()):(d=t(this,function(a){this.init([],function(){return a},null,{enabled:!0})}),d.error=t(this,function(a){this.inited=!0;this.error=a;a.requireModules=[b];B(k,function(a){0===a.map.id.indexOf(b+"_unnormalized")&&y(a.map.id)});w(a)}),d.fromText=t(this,function(f,c){var g=a.name,J=m(g),k=O;c&&(f=c);k&&(O=!1);q(J);s(l.config,b)&&(l.config[g]=l.config[b]);try{h.exec(f)}catch(j){return w(C("fromtexteval", +"fromText eval for "+b+" failed: "+j,j,[b]))}k&&(O=!0);this.depMaps.push(J);i.completeLoad(g);p([g],d)}),f.load(a.name,p,d,l))}));i.enable(d,this);this.pluginMaps[d.id]=d},enable:function(){W[this.map.id]=this;this.enabling=this.enabled=!0;v(this.depMaps,t(this,function(a,b){var c,f;if("string"===typeof a){a=m(a,this.map.isDefine?this.map:this.map.parentMap,!1,!this.skipMap);this.depMaps[b]=a;if(c=j(K,a.id)){this.depExports[b]=c(this);return}this.depCount+=1;r(a,"defined",t(this,function(a){this.defineDep(b, +a);this.check()}));this.errback&&r(a,"error",t(this,this.errback))}c=a.id;f=k[c];!s(K,c)&&(f&&!f.enabled)&&i.enable(a,this)}));B(this.pluginMaps,t(this,function(a){var b=j(k,a.id);b&&!b.enabled&&i.enable(a,this)}));this.enabling=!1;this.check()},on:function(a,b){var c=this.events[a];c||(c=this.events[a]=[]);c.push(b)},emit:function(a,b){v(this.events[a],function(a){a(b)});"error"===a&&delete this.events[a]}};i={config:l,contextName:b,registry:k,defined:p,urlFetched:T,defQueue:A,Module:$,makeModuleMap:m, +nextTick:h.nextTick,onError:w,configure:function(a){a.baseUrl&&"/"!==a.baseUrl.charAt(a.baseUrl.length-1)&&(a.baseUrl+="/");var 
b=l.shim,c={paths:!0,bundles:!0,config:!0,map:!0};B(a,function(a,b){c[b]?(l[b]||(l[b]={}),V(l[b],a,!0,!0)):l[b]=a});a.bundles&&B(a.bundles,function(a,b){v(a,function(a){a!==b&&(ba[a]=b)})});a.shim&&(B(a.shim,function(a,c){H(a)&&(a={deps:a});if((a.exports||a.init)&&!a.exportsFn)a.exportsFn=i.makeShimExports(a);b[c]=a}),l.shim=b);a.packages&&v(a.packages,function(a){var b, +a="string"===typeof a?{name:a}:a;b=a.name;a.location&&(l.paths[b]=a.location);l.pkgs[b]=a.name+"/"+(a.main||"main").replace(ja,"").replace(R,"")});B(k,function(a,b){!a.inited&&!a.map.unnormalized&&(a.map=m(b))});if(a.deps||a.callback)i.require(a.deps||[],a.callback)},makeShimExports:function(a){return function(){var b;a.init&&(b=a.init.apply(ca,arguments));return b||a.exports&&ea(a.exports)}},makeRequire:function(a,e){function g(f,c,d){var j,l;e.enableBuildCallback&&(c&&G(c))&&(c.__requireJsBuild= +!0);if("string"===typeof f){if(G(c))return w(C("requireargs","Invalid require call"),d);if(a&&s(K,f))return K[f](k[a.id]);if(h.get)return h.get(i,f,a,g);j=m(f,a,!1,!0);j=j.id;return!s(p,j)?w(C("notloaded",'Module name "'+j+'" has not been loaded yet for context: '+b+(a?"":". 
Use require([])"))):p[j]}M();i.nextTick(function(){M();l=q(m(null,a));l.skipMap=e.skipMap;l.init(f,c,d,{enabled:!0});D()});return g}e=e||{};V(g,{isBrowser:z,toUrl:function(b){var e,d=b.lastIndexOf("."),g=b.split("/")[0];if(-1!== +d&&(!("."===g||".."===g)||1g.attachEvent.toString().indexOf("[native code"))&&!Z?(O=!0,g.attachEvent("onreadystatechange",b.onScriptLoad)): +(g.addEventListener("load",b.onScriptLoad,!1),g.addEventListener("error",b.onScriptError,!1)),g.src=d,M=g,D?y.insertBefore(g,D):y.appendChild(g),M=null,g;if(fa)try{importScripts(d),b.completeLoad(c)}catch(j){b.onError(C("importscripts","importScripts failed for "+c+" at "+d,j,[c]))}};z&&!r.skipDataMain&&U(document.getElementsByTagName("script"),function(b){y||(y=b.parentNode);if(L=b.getAttribute("data-main"))return q=L,r.baseUrl||(E=q.split("/"),q=E.pop(),Q=E.length?E.join("/")+"/":"./",r.baseUrl= +Q),q=q.replace(R,""),h.jsExtRegExp.test(q)&&(q=L),r.deps=r.deps?r.deps.concat(q):[q],!0});define=function(b,c,d){var g,h;"string"!==typeof b&&(d=c,c=b,b=null);H(c)||(d=c,c=null);!c&&G(d)&&(c=[],d.length&&(d.toString().replace(la,"").replace(ma,function(b,d){c.push(d)}),c=(1===d.length?["require"]:["require","exports","module"]).concat(c)));if(O){if(!(g=M))P&&"interactive"===P.readyState||U(document.getElementsByTagName("script"),function(b){if("interactive"===b.readyState)return P=b}),g=P;g&&(b|| +(b=g.getAttribute("data-requiremodule")),h=F[g.getAttribute("data-requirecontext")])}(h?h.defQueue:S).push([b,c,d])};define.amd={jQuery:!0};h.exec=function(b){return eval(b)};h(r)}})(this); diff --git a/NumpyExercises/Numpy_Jupyer_gini/exercise_series.html b/NumpyExercises/Numpy_Jupyer_gini/exercise_series.html new file mode 100644 index 0000000..8a84adb --- /dev/null +++ b/NumpyExercises/Numpy_Jupyer_gini/exercise_series.html @@ -0,0 +1,8111 @@ + + + + + +exercise_series + + + + + + + + + + + + +
+ + diff --git a/NumpyExercises/Numpy_Jupyer_gini/exercise_series.ipynb b/NumpyExercises/Numpy_Jupyer_gini/exercise_series.ipynb new file mode 100644 index 0000000..7832b77 --- /dev/null +++ b/NumpyExercises/Numpy_Jupyer_gini/exercise_series.ipynb @@ -0,0 +1,645 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Measuring Income Equality with the Gini Coefficient\n", + "\n", + "As we discussed in our numpy exercises, one frequently used measure of inequality is the Gini Coefficient. The Gini Coefficient takes on a value of 1 when the distribution of some property is maximally unequal across a set of entities, and a value of 0 when it is evenly distributed. \n", + "\n", + "In this exercise, we will calculate the Gini Coefficient for income inequality across the countries of the world to get a sense of income inequality *across* countries. \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Gradescope Autograding\n", + "\n", + "Please follow [all standard guidance](https://www.practicaldatascience.org/html/autograder_guidelines.html) for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called `results` and ensuring your notebook runs from the start to completion without any errors.\n", + "\n", + "**Starting with this assignment, submissions that have not been formatted with `black` will be automatically rejected.**\n", + "\n", + "For this assignment, please name your file `exercise_series.ipynb` before uploading.\n", + "\n", + "You can check that you have answers for all questions in your `results` dictionary with this code:\n", + "\n", + "```python\n", + "assert set(results.keys()) == {\n", + " \"ex2_mean\",\n", + " \"ex2_median\",\n", + " \"ex3_highest_gdp_percap\",\n", + " \"ex3_lowest_gdp_percap\",\n", + " \"ex4_lessthan20_000\",\n", + " \"ex5_switzerland\",\n", + " \"ex6_gini_loop\",\n", + " \"ex7_gini_vectorized\",\n", + "
\"ex8_gini_2025\",\n", + "}\n", + "```\n", + "\n", + "### Submission Limits\n", + "\n", + "Please remember that you are **only allowed three submissions to the autograder.** Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times) will determine your grade Submissions that error out will **not** count against this total.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 1\n", + "\n", + "To get accustomed to Series, let's explore some data on the wealth of 10 randomly selected countries. Data below presents the GDP per capita for these countries in 2008. \n", + "\n", + "Use the code below to get started: \n", + "\n", + "```python\n", + "gdppercap = pd.Series(\n", + " [34605, 34493, 12393, 44200, 10041, 58138, 4709, 49284, 10109, 42536],\n", + " index=[\n", + " \"Bahrain\",\n", + " \"Belgium\",\n", + " \"Bulgaria\",\n", + " \"Ireland\",\n", + " \"Macedonia\",\n", + " \"Norway\",\n", + " \"Paraguay\",\n", + " \"Singapore\",\n", + " \"South Africa\",\n", + " \"Switzerland\",\n", + " ],\n", + ")\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "gdppercap = pd.Series(\n", + " [34605, 34493, 12393, 44200, 10041, 58138, 4709, 49284, 10109, 42536],\n", + " index=[\n", + " \"Bahrain\",\n", + " \"Belgium\",\n", + " \"Bulgaria\",\n", + " \"Ireland\",\n", + " \"Macedonia\",\n", + " \"Norway\",\n", + " \"Paraguay\",\n", + " \"Singapore\",\n", + " \"South Africa\",\n", + " \"Switzerland\",\n", + " ],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 2\n", + "\n", + "Find the mean, median, minimum and maximum values of GDP per capita in this data. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "30050.8\n", + "34549.0\n", + "58138\n", + "4709\n" + ] + }, + { + "data": { + "text/plain": [ + "{'ex2_mean': 30050.8, 'ex2_median': 34549.0}" + ] + }, + "execution_count": 66, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "mean = gdppercap.mean()\n", + "median = gdppercap.median()\n", + "min = gdppercap.min()\n", + "max = gdppercap.max()\n", + "print(mean)\n", + "print(median)\n", + "print(max)\n", + "print(min)\n", + "\n", + "results = {}\n", + "results[\"ex2_mean\"] = mean\n", + "results[\"ex2_median\"] = median\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise 3\n", + "\n", + "Programmatically, determine which country in our data has the highest income per capita, and which has the lowest income per capita.\n", + "\n", + "(Obviously, this is easier to do by just looking at the data, but that's only because this dataset is very small. With a real dataset, you would need to do it with code, so please write code to accomplish this task.)\n", + "\n", + "Hint: Country names form the index for this Series, so to get country names you'll need to access the index. 
\n", + "\n", + "Store the country names *as strings* with the keys `\"ex3_highest_gdp_percap\"` and `\"ex3_lowest_gdp_percap\"`" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'ex2_mean': 30050.8,\n", + " 'ex2_median': 34549.0,\n", + " 'ex3_highest_gdp_percap': 'Norway',\n", + " 'ex3_lowest_gdp_percap': 'Paraguay'}" + ] + }, + "execution_count": 68, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Maximum\n", + "gdppercap.head()\n", + "max_gdp = gdppercap.max()\n", + "# gdppercap.idxmax()\n", + "id_max = gdppercap[gdppercap == max_gdp].index\n", + "\n", + "# Minimum\n", + "min_gdp = gdppercap.min()\n", + "id_min = gdppercap[gdppercap == min_gdp].index\n", + "\n", + "\n", + "results[\"ex3_highest_gdp_percap\"] = id_max[0]\n", + "results[\"ex3_lowest_gdp_percap\"] = id_min[0]\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 4\n", + "\n", + "Get Python to print out the names of all the countries that have GDP per capita of less than \\$20,000.\n", + "\n", + "Store these countries in a list, sorted alphabetically, and store it in `results` under the key `\"ex4_lessthan20_000\"`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'ex2_mean': 30050.8,\n", + " 'ex2_median': 34549.0,\n", + " 'ex3_highest_gdp_percap': 'Norway',\n", + " 'ex3_lowest_gdp_percap': 'Paraguay',\n", + " 'ex4_lessthan20_000': ['Bulgaria', 'Macedonia', 'Paraguay', 'South Africa']}" + ] + }, + "execution_count": 59, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ex4 = gdppercap[gdppercap < 20000].index.to_list()\n", + "results[\"ex4_lessthan20_000\"] = ex4\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise 5 \n", + "\n", + "Get Python to print out the GDP per capita of 
Switzerland. Store the result as `ex5_switzerland`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'ex2_mean': 30050.8,\n", + " 'ex2_median': 34549.0,\n", + " 'ex3_highest_gdp_percap': 'Norway',\n", + " 'ex3_lowest_gdp_percap': 'Paraguay',\n", + " 'ex4_lessthan20_000': ['Bulgaria', 'Macedonia', 'Paraguay', 'South Africa'],\n", + " 'ex5_switzerland': 42536}" + ] + }, + "execution_count": 60, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# exercise 5\n", + "ex5 = gdppercap.loc[\"Switzerland\"]\n", + "results[\"ex5_switzerland\"] = ex5\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise 6\n", + "\n", + "One frequently used measure of inequality is the Gini Coefficient. The Gini Coefficient takes on a value of 1 when the distribution of some variable is maximally unequal across a population, and a value of 0 when it is evenly distributed. We will calculate the Gini Coefficient for income inequality in our data. \n", + "\n", + "To visualize the Gini Coefficient, we plot the cumulative share of the population (ordered from poorest to richest) on the x-axis, and cumulative share of income earned by that group on the y-axis. The Gini Coefficient is then defined as $$\\frac{A}{A + B}$$, where the areas A and B are labeled below: \n", + "\n", + "![gini_coefficient](https://upload.wikimedia.org/wikipedia/commons/thumb/5/59/Economics_Gini_coefficient2.svg/800px-Economics_Gini_coefficient2.svg.png)\n", + "\n", + "If income is evenly distributed, then the poorest 20% of a population will also have 20% of the wealth; the poorest 40% will have 40% of the wealth, and so forth, resulting in a perfect 45-degree line. In this situation, there is no area between the 45-degree line and the actual income distribution, so $A=0$, and the Gini Coefficient is 0. 
\n", + "\n", + "If, by contrast, the top 10% of people hold all the wealth in a country, then there will be no wealth for the poorest 90% of people, and wealth will jump up only at the far right side of the graph. This will generate a very large gap between the 45-degree line and actual income for most of the graph, generating a large value for the area $A$, creating a very high Gini Coefficient. \n", + "\n", + "To illustrate, here are a few different Gini plots (these come from someone studying inequality of participation, so to adapt this to our study of income, just imagine the y-axis plots share of income):\n", + "\n", + "![gini_distributions](https://miro.medium.com/max/595/0*3DTcZnzDwS6A6AtP)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For discrete data, the Gini Coefficient can be calculated with the following formula: \n", + "\n", + "$$\\frac{2 \\sum_{i=1}^n i y_i}{n \\sum_{i=1}^n y_i} -\\frac{n+1}{n}$$\n", + "\n", + "Where $i$ is each country's rank ordering from poorest to richest, and $y_i$ is the income of country $i$.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Exercise 6\n", + "\n", + "Using this formula, calculate the Gini coefficient for our income data. \n", + "\n", + "Begin by writing a function to calculate the Gini Coefficient for our data *by looping over the entries in our Series*. In other words, try and embrace the spirit of how you might normally think about interpreting the summation notation written above.\n", + "\n", + "Store the Gini coefficient you calculate in `results` under the key `\"ex6_gini_loop\"`.\n", + "\n", + "**HINT**: Be careful with 0-indexing! Python counts from 0, but mathematical formulas (like $\\sum$) start from 1!\n", + "\n", + "**HINT 2**: I'll probably ask you to use this more than once, so please put it in a function."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "def cal_gini_loop(data):\n", + " sorted_data = sorted(data)\n", + " total_income = sum(sorted_data)\n", + " cumulative_sum = 0\n", + " n = len(sorted_data)\n", + " for i in range(n):\n", + " y_i = sorted_data[i]\n", + " cumulative_sum += 2 * (i + 1) * y_i\n", + " gini_coeff = (cumulative_sum / (n * total_income)) - ((n+1)/n)\n", + " return gini_coeff\n", + "\n", + "x = cal_gini_loop(gdppercap)\n", + "print(x)\n", + "results[\"ex6_gini_loop\"] = x\n", + "results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.3382798461272245\n" + ] + }, + { + "data": { + "text/plain": [ + "{'ex2_mean': 30050.8,\n", + " 'ex2_median': 34549.0,\n", + " 'ex3_highest_gdp_percap': 'Norway',\n", + " 'ex3_lowest_gdp_percap': 'Paraguay',\n", + " 'ex4_lessthan20_000': ['Bulgaria', 'Macedonia', 'Paraguay', 'South Africa'],\n", + " 'ex5_switzerland': 42536,\n", + " 'ex6_gini_loop': 0.3382798461272245}" + ] + }, + "execution_count": 61, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def cal_gini_loop(data):\n", + " sorted_data = sorted(data)\n", + " total_income = sum(sorted_data)\n", + " cumulative_sum = 0\n", + " n = len(sorted_data)\n", + " for i in range(n):\n", + " y_i = sorted_data[i]\n", + " cumulative_sum += 2 * (i + 1) * y_i\n", + " gini_coeff = (cumulative_sum / (n * total_income)) - ((n + 1) / n)\n", + " return gini_coeff\n", + "\n", + "\n", + "x = cal_gini_loop(gdppercap)\n", + "print(x)\n", + "results[\"ex6_gini_loop\"] = x\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 7\n", + "\n", + "Excellent! 
But as we've seen in [our readings](https://nickeubank.github.io/practicaldatascience_book/notebooks/class_2/week_4/11_vectorization.html), in data science we generally strive to *not* loop over the entries in our arrays; instead, we aspire to write *vectorized code* that naturally applies a simple operation to each observation.\n", + "\n", + "So now write a new function to calculate the Gini Coefficient that *doesn't* use loops, and instead relies on vectorized code.\n", + "\n", + "Store the result in `results` under the key `\"ex7_gini_vectorized\"`.\n", + "\n", + "**HINT:** you will probably have to create some new series/vectors/arrays." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.3382798461272245\n" + ] + }, + { + "data": { + "text/plain": [ + "{'ex2_mean': 30050.8,\n", + " 'ex2_median': 34549.0,\n", + " 'ex3_highest_gdp_percap': 'Norway',\n", + " 'ex3_lowest_gdp_percap': 'Paraguay',\n", + " 'ex4_lessthan20_000': ['Bulgaria', 'Macedonia', 'Paraguay', 'South Africa'],\n", + " 'ex5_switzerland': 42536,\n", + " 'ex6_gini_loop': 0.3382798461272245,\n", + " 'ex7_gini_vectorized': 0.3382798461272245}" + ] + }, + "execution_count": 62, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import numpy as np\n", + "\n", + "\n", + "def cal_gini_vector(data):\n", + " data = np.array(data.sort_values())\n", + " n = len(data)\n", + " i = np.arange(1, n + 1)\n", + " numerator = np.sum(2 * data * i)\n", + " denominator = np.sum(n * data)\n", + " return numerator / denominator - (n + 1) / n\n", + "\n", + "\n", + "x = cal_gini_vector(gdppercap)\n", + "print(x)\n", + "results[\"ex7_gini_vectorized\"] = x\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 8\n", + "\n", + "The result we just generated offers a snap-shot of inequality for this subset of countries. 
But what are the dynamics of inequality for these countries?\n", + "\n", + "There is an idea in economics called the \"convergence hypothesis\", which argues that poorer countries are likely to grow faster, and as a result global inequality is likely to decline. Economists advocating for this hypothesis pointed out that while rich countries had to invent new technologies in order to grow, many poor countries simply had to take advantage of innovations already developed by rich countries. \n", + "\n", + "To test this hypothesis, let's do a small analysis of the dynamics of income inequality in our sample. Create the following Series in your Python session, which provides the average growth rate of GDP per capita for all the countries in our sample from 2000 to 2018. \n", + "\n", + "```python\n", + "avg_growth = pd.Series(\n", + " [\n", + " -0.29768835,\n", + " 0.980299584,\n", + " 4.52991925,\n", + " 3.686556736,\n", + " 2.621416804,\n", + " 0.775132075,\n", + " 2.015489468,\n", + " 3.345793635,\n", + " 1.349993318,\n", + " 0.982775018,\n", + " ],\n", + " index=[\n", + " \"Bahrain\",\n", + " \"Belgium\",\n", + " \"Bulgaria\",\n", + " \"Ireland\",\n", + " \"Macedonia\",\n", + " \"Norway\",\n", + " \"Paraguay\",\n", + " \"Singapore\",\n", + " \"South Africa\",\n", + " \"Switzerland\",\n", + " ],\n", + ")\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Using this data on average growth rates in GDP per capita, and assuming growth rates from 2000 to 2018 continue into the future, estimate what our Gini Coefficient may look like in 2025 (remembering that income in our data is from 2008, so we're extrapolating ahead 17 years)?\n", + "\n", + "**Hint:** the formula for compound growth (i.e. 
value of something growing at a rate of `x` percent for $t$ periods) is:\n", + "\n", + "$$future\\_value = current\\_value * (1 + \\frac{percentage\\_growth\\_rate}{100})^t$$\n", + "\n", + "Store the answer in `results` under the key `\"ex8_gini_2025\"`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.3656264991306193\n" + ] + }, + { + "data": { + "text/plain": [ + "{'ex2_mean': 30050.8,\n", + " 'ex2_median': 34549.0,\n", + " 'ex3_highest_gdp_percap': 'Norway',\n", + " 'ex3_lowest_gdp_percap': 'Paraguay',\n", + " 'ex4_lessthan20_000': ['Bulgaria', 'Macedonia', 'Paraguay', 'South Africa'],\n", + " 'ex5_switzerland': 42536,\n", + " 'ex6_gini_loop': 0.3382798461272245,\n", + " 'ex7_gini_vectorized': 0.3382798461272245,\n", + " 'ex8_gini_2025': 0.3656264991306193}" + ] + }, + "execution_count": 63, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "avg_growth = pd.Series(\n", + " [\n", + " -0.29768835,\n", + " 0.980299584,\n", + " 4.52991925,\n", + " 3.686556736,\n", + " 2.621416804,\n", + " 0.775132075,\n", + " 2.015489468,\n", + " 3.345793635,\n", + " 1.349993318,\n", + " 0.982775018,\n", + " ],\n", + " index=[\n", + " \"Bahrain\",\n", + " \"Belgium\",\n", + " \"Bulgaria\",\n", + " \"Ireland\",\n", + " \"Macedonia\",\n", + " \"Norway\",\n", + " \"Paraguay\",\n", + " \"Singapore\",\n", + " \"South Africa\",\n", + " \"Switzerland\",\n", + " ],\n", + ")\n", + "\n", + "t = 17\n", + "future_value = gdppercap * (1 + avg_growth / 100) ** t\n", + "ex8_gini_2025 = cal_gini_vector(future_value)\n", + "print(ex8_gini_2025)\n", + "results[\"ex8_gini_2025\"] = ex8_gini_2025\n", + "results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "assert set(results.keys()) == {\n", + " \"ex2_mean\",\n", + " \"ex2_median\",\n", + " \"ex3_highest_gdp_percap\",\n", + "
\"ex3_lowest_gdp_percap\",\n", + " \"ex4_lessthan20_000\",\n", + " \"ex5_switzerland\",\n", + " \"ex6_gini_loop\",\n", + " \"ex7_gini_vectorized\",\n", + " \"ex8_gini_2025\",\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 9\n", + "\n", + "Interpret your result -- does it seem to imply that we are seeing covergence or not?\n", + "\n", + "[After you're done, you can see a more systematic version of this analysis here!](https://www.cgdev.org/blog/everything-you-know-about-cross-country-convergence-now-wrong)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Answer for number 9:\n", + "We are no seeing a convergence. This is because we can see that the gap is not reducing." + ] + } + ], + "metadata": { + "interpreter": { + "hash": "f06fa9c80cc08d4d343f66ad24a278ad0285590eac640a80c32c9d748f33a802" + }, + "kernelspec": { + "display_name": "Python 3.9.7 64-bit ('base': conda)", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/NumpyExercises/Numpy_Jupyer_gini/exercise_series.pdf b/NumpyExercises/Numpy_Jupyer_gini/exercise_series.pdf new file mode 100644 index 0000000..204b726 Binary files /dev/null and b/NumpyExercises/Numpy_Jupyer_gini/exercise_series.pdf differ diff --git a/NumpyExercises/exercise_numpy_vectors.ipynb b/NumpyExercises/exercise_numpy_vectors.ipynb new file mode 100644 index 0000000..db4a50e --- /dev/null +++ b/NumpyExercises/exercise_numpy_vectors.ipynb @@ -0,0 +1,465 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## In class numpy exercise\n", + "Katie Hucker and Simrun Sharma" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + 
"outputs": [], + "source": [ + "import numpy as np\n", + "from matplotlib import pyplot as plt\n", + "\n", + "results = {}" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[ 53308. 102050. 192994. ... 407460. 19856. 154754.]\n" + ] + } + ], + "source": [ + "your_array = np.loadtxt(\n", + " \"https://raw.githubusercontent.com/nickeubank/practicaldatascience/master/Example_Data/us_household_incomes.txt\"\n", + ")\n", + "print(your_array)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(array([8.77542e+05, 4.35030e+04, 1.01710e+04, 1.92100e+03, 3.98000e+02,\n", + " 1.17000e+02, 2.90000e+01, 8.00000e+00, 4.00000e+00, 2.00000e+00]),\n", + " array([ -16942. , 225842.5, 468627. , 711411.5, 954196. , 1196980.5,\n", + " 1439765. , 1682549.5, 1925334. , 2168118.5, 2410903. ]),\n", + " )" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "plt.hist(your_array)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Interpreting the histogram\n", + "It does not look like a normal distrubtion or a uniform distribution. It is skewed to thr right, with a majority of the data being being under 200,000 USD per household. It makes it look like common equality is relatively high, so it looks like everyone is equal. " + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 53308., 102050., 192994., ..., 407460., 19856., 154754.])" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "under_500k = your_array[your_array < 500000]\n", + "under_500k" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(array([262190., 338722., 173615., 76580., 33854., 16134., 8329.,\n", + " 5401., 4556., 4038.]),\n", + " array([-16942., 34752., 86446., 138140., 189834., 241528., 293222.,\n", + " 344916., 396610., 448304., 499998.]),\n", + " )" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "plt.hist(under_500k)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Interpreting the 2nd histogram\n", + "\n", + "It is not uniform, still. It is still right skewed. However, we can now see that a majority US households earn under 100,000 USD. The skewness is for high income earners, it does not skew towards the lower income earners. " + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "from ineqpy.inequality import gini" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.14711442173300704" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "gini(your_array)\n", + "ex4_share_below_poverty = len(your_array[your_array < 20000]) / len(your_array)\n", + "results[\"ex4_share_below_poverty\"] = ex4_share_below_poverty\n", + "ex4_share_below_poverty" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'ex4_share_below_poverty': 0.14711442173300704,\n", + " 'ex4_gini': 0.4810925546879211}" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ex4_gini = gini(your_array)\n", + "ex4_gini\n", + "results[\"ex4_gini\"] = ex4_gini\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Ex 5\n", + "\n", + "Overall, European countries have lower gini scores (.27ish) then the United States. While countries in Africa and South America have higher gini scores(.45-.63). We intrepret this that U.S. has more equality of income than the higher scores. However, Europe does better than the United States with equality of wealth. 
This makes sense to us, as more developed countries have more wealth and more opportunities to build wealth through infrastructure and resources. " + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'ex4_share_below_poverty': 0.14711442173300704, 'ex4_gini': 0.4810925546879211, 'ex6_gini_policy_a': 0.46024685074894556, 'ex6_gini_policy_b': 0.4582821778789707, 'ex6_gini_which_reduced_more': 'Policy B'}\n" + ] + } + ], + "source": [ + "# Policy A (40k) and B (30k)\n", + "house_40k = your_array.copy()\n", + "house_30k = your_array.copy()\n", + "house_40k[house_40k < 40000] = house_40k[house_40k < 40000] + 5000\n", + "house_30k[house_30k < 30000] = house_30k[house_30k < 30000] + 7000\n", + "house_40k_gini = gini(house_40k)\n", + "house_30k_gini = gini(house_30k)\n", + "\n", + "results[\"ex6_gini_policy_a\"] = house_40k_gini\n", + "\n", + "results[\"ex6_gini_policy_b\"] = house_30k_gini\n", + "\n", + "results[\"ex6_gini_which_reduced_more\"] = \"Policy B\"\n", + "\n", + "print(results)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Policy B yields a more equal income distribution than Policy A (a Gini of about 0.458 versus 0.460). 
" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'ex4_share_below_poverty': 0.14711442173300704,\n", + " 'ex4_gini': 0.4810925546879211,\n", + " 'ex6_gini_policy_a': 0.46024685074894556,\n", + " 'ex6_gini_policy_b': 0.4582821778789707,\n", + " 'ex6_gini_which_reduced_more': 'Policy B',\n", + " 'ex7_gini_policy_c': 0.4756173843900714}" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "house_250k = your_array.copy()\n", + "\n", + "house_250k[house_250k > 250000] = (house_250k[house_250k > 250000]) * 0.95\n", + "house_250k_gini = gini(house_250k)\n", + "results[\"ex7_gini_policy_c\"] = house_250k_gini\n", + "results" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'ex4_share_below_poverty': 0.14711442173300704,\n", + " 'ex4_gini': 0.4810925546879211,\n", + " 'ex6_gini_policy_a': 0.46024685074894556,\n", + " 'ex6_gini_policy_b': 0.4582821778789707,\n", + " 'ex6_gini_which_reduced_more': 'Policy B',\n", + " 'ex7_gini_policy_c': 0.4756173843900714,\n", + " 'ex8_revenue_raised': 929623340.85}" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# print(house_250k)\n", + "ex_8 = your_array.copy()\n", + "ex_8 = ex_8[ex_8 > 250000] * 0.05\n", + "results[\"ex8_revenue_raised\"] = np.sum(ex_8)\n", + "results" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'ex4_share_below_poverty': 0.14711442173300704,\n", + " 'ex4_gini': 0.4810925546879211,\n", + " 'ex6_gini_policy_a': 0.46024685074894556,\n", + " 'ex6_gini_policy_b': 0.4582821778789707,\n", + " 'ex6_gini_which_reduced_more': 'Policy B',\n", + " 'ex7_gini_policy_c': 0.4756173843900714,\n", + " 'ex8_revenue_raised': 929623340.85,\n", + " 'ex9_transfers': 
4208.230382379836}" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# ex9\n", + "results[\"ex9_transfers\"] = results[\"ex8_revenue_raised\"] / len(\n", + " your_array[your_array < 30000]\n", + ")\n", + "results" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'ex4_share_below_poverty': 0.14711442173300704,\n", + " 'ex4_gini': 0.4810925546879211,\n", + " 'ex6_gini_policy_a': 0.46024685074894556,\n", + " 'ex6_gini_policy_b': 0.4582821778789707,\n", + " 'ex6_gini_which_reduced_more': 'Policy B',\n", + " 'ex7_gini_policy_c': 0.4756173843900714,\n", + " 'ex8_revenue_raised': 929623340.85,\n", + " 'ex9_transfers': 4208.230382379836,\n", + " 'ex10_gini_policy_d': 0.46166900570205466}" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ex_10 = your_array.copy()\n", + "money_per_home = results[\"ex8_revenue_raised\"] / len(ex_10[ex_10 < 30_000])\n", + "ex_10[ex_10 < 30_000] = ex_10[ex_10 < 30_000] + money_per_home\n", + "ex_10[ex_10 > 250_000] = ex_10[ex_10 > 250_000] * 0.95\n", + "results[\"ex10_gini_policy_d\"] = gini(ex_10)\n", + "results" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.46264861963052434" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# ex11\n", + "ex_11 = your_array.copy()\n", + "money_per_home2 = results[\"ex8_revenue_raised\"] / len(ex_11[ex_11 < 40_000])\n", + "ex_11[ex_11 < 40_000] = ex_11[ex_11 < 40_000] + money_per_home2\n", + "ex_11[ex_11 > 250_000] = ex_11[ex_11 > 250_000] * 0.95\n", + "results[\"ex11_gini_policy_e\"] = gini(ex_11)\n", + "results[\"ex11_gini_policy_e\"]" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + 
"text/plain": [ + "{'ex4_share_below_poverty': 0.14711442173300704,\n", + " 'ex4_gini': 0.4810925546879211,\n", + " 'ex6_gini_policy_a': 0.46024685074894556,\n", + " 'ex6_gini_policy_b': 0.4582821778789707,\n", + " 'ex6_gini_which_reduced_more': 'Policy B',\n", + " 'ex7_gini_policy_c': 0.4756173843900714,\n", + " 'ex8_revenue_raised': 929623340.85,\n", + " 'ex9_transfers': 4208.230382379836,\n", + " 'ex10_gini_policy_d': 0.46166900570205466,\n", + " 'ex11_gini_policy_e': 0.46264861963052434,\n", + " 'ex12_policy_recommendation': 'Policy D'}" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "results[\"ex12_policy_recommendation\"] = \"Policy D\"\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Citations \n", + "\n", + " Steven Ruggles, Sarah Flood, Sophia Foster, Ronald Goeken, Jose Pacas, Megan Schouweiler and Matthew Sobek. IPUMS USA: Version 11.0 [dataset]. Minneapolis, MN: IPUMS, 2021. 
https://doi.org/10.18128/D010.V11.0" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "base", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/NumpyExercises/metadata.yml b/NumpyExercises/metadata.yml new file mode 100644 index 0000000..87fe67d --- /dev/null +++ b/NumpyExercises/metadata.yml @@ -0,0 +1,13 @@ +--- +:id: 195162482 +:submitters: +- :name: Simrun Sharma + :email: simrun.sharma@duke.edu +- :name: Katelyn Hucker + :email: katelyn.hucker@duke.edu +:created_at: !ruby/object:ActiveSupport::TimeWithZone + utc: 2023-09-21 03:20:19.019812000 Z + zone: !ruby/object:ActiveSupport::TimeZone + name: America/Los_Angeles + time: 2023-09-20 20:20:19.019812000 Z +:status: processed diff --git a/Pandas_DataFrames/Exercise_series.ipynb b/Pandas_DataFrames/Exercise_series.ipynb new file mode 100644 index 0000000..14e7efe --- /dev/null +++ b/Pandas_DataFrames/Exercise_series.ipynb @@ -0,0 +1,608 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Measuring Income Equality with the Gini Coefficient\n", + "\n", + "As we discussed in our numpy exercises, one frequently used measure of inequality is the Gini Coefficient. The Gini Coefficient takes on a value of 1 when the distribution of some property is maximally unequal across a set of entities, and a value of 0 when it is evenly distributed. \n", + "\n", + "In this exercise, we will calculate the Gini Coefficient for income inequality across the countries of the world to get a sense of income inequality *across* countries. 
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Gradescope Autograding\n", + "\n", + "Please follow [all standard guidance](https://www.practicaldatascience.org/html/autograder_guidelines.html) for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called `results` and ensuring your notebook runs from the start to completion without any errors.\n", + "\n", + "**Starting with this assignment, submissions that have not been formatted with `black` will be automatically rejected.**\n", + "\n", + "For this assignment, please name your file `exercise_series.ipynb` before uploading.\n", + "\n", + "You can check that you have answers for all questions in your `results` dictionary with this code:\n", + "\n", + "```python\n", + "assert set(results.keys()) == {\n", + " \"ex2_mean\",\n", + " \"ex2_median\",\n", + " \"ex3_highest_gdp_percap\",\n", + " \"ex3_lowest_gdp_percap\",\n", + " \"ex4_lessthan20_000\",\n", + " \"ex5_switzerland\",\n", + " \"ex6_gini_loop\",\n", + " \"ex7_gini_vectorized\",\n", + " \"ex8_gini_2025\",\n", + "}\n", + "```\n", + "\n", + "### Submission Limits\n", + "\n", + "Please remember that you are **only allowed three submissions to the autograder.** Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times) will determine your grade. Submissions that error out will **not** count against this total.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 1\n", + "\n", + "To get accustomed to Series, let's explore some data on the wealth of 10 randomly selected countries. The data below present the GDP per capita for these countries in 2008. 
\n", + "\n", + "Use the code below to get started: \n", + "\n", + "```python\n", + "gdppercap = pd.Series(\n", + " [34605, 34493, 12393, 44200, 10041, 58138, 4709, 49284, 10109, 42536],\n", + " index=[\n", + " \"Bahrain\",\n", + " \"Belgium\",\n", + " \"Bulgaria\",\n", + " \"Ireland\",\n", + " \"Macedonia\",\n", + " \"Norway\",\n", + " \"Paraguay\",\n", + " \"Singapore\",\n", + " \"South Africa\",\n", + " \"Switzerland\",\n", + " ],\n", + ")\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "gdppercap = pd.Series(\n", + " [34605, 34493, 12393, 44200, 10041, 58138, 4709, 49284, 10109, 42536],\n", + " index=[\n", + " \"Bahrain\",\n", + " \"Belgium\",\n", + " \"Bulgaria\",\n", + " \"Ireland\",\n", + " \"Macedonia\",\n", + " \"Norway\",\n", + " \"Paraguay\",\n", + " \"Singapore\",\n", + " \"South Africa\",\n", + " \"Switzerland\",\n", + " ],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 2\n", + "\n", + "Find the mean, median, minimum and maximum values of GDP per capita in this data. " + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "30050.8\n", + "34549.0\n", + "58138\n", + "4709\n" + ] + } + ], + "source": [ + "mean = gdppercap.mean()\n", + "median = gdppercap.median()\n", + "min = gdppercap.min()\n", + "max = gdppercap.max()\n", + "print(mean)\n", + "print(median)\n", + "print(max)\n", + "print(min)\n", + "\n", + "results = {}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise 3\n", + "\n", + "Programmatically, determine which country in our data has the highest income per capita, and which has the lowest income per capita.\n", + "\n", + "(Obviously, this is easier to do by just looking at the data, but that's only because this dataset is very small. 
With a real dataset, you would need to do it with code, so please write code to accomplish this task.)\n", + "\n", + "Hint: Country names form the index for this Series, so to get country names you'll need to access the index. \n", + "\n", + "Store the country names *as strings* with the keys `\"ex3_highest_gdp_percap\"` and `\"ex3_lowest_gdp_percap\"`" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'ex3_highest_gdp_percap': 'Norway', 'ex3_lowest_gdp_percap': 'Paraguay'}" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Maximum\n", + "gdppercap.head()\n", + "max_gdp = gdppercap.max()\n", + "# gdppercap.idxmax()\n", + "id_max = gdppercap[gdppercap == max_gdp].index\n", + "\n", + "# Minimum\n", + "min_gdp = gdppercap.min()\n", + "id_min = gdppercap[gdppercap == min_gdp].index\n", + "\n", + "\n", + "results[\"ex3_highest_gdp_percap\"] = id_max[0]\n", + "results[\"ex3_lowest_gdp_percap\"] = id_min[0]\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 4\n", + "\n", + "Get Python to print out the names of all the countries that have GDP per capita of less than \\$20,000.\n", + "\n", + "Store these countries in a list, sorted alphabetically, and store it in `results` under the key `\"ex4_lessthan20_000\"`" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "C:\\Users\\Simrun Sharma\\AppData\\Local\\Temp\\ipykernel_31340\\4179986109.py:2: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). 
To access a value by position, use `ser.iloc[pos]`\n", + " results[\"ex4_lessthan20_000\"] = ex4[0]\n" + ] + }, + { + "data": { + "text/plain": [ + "{'ex3_highest_gdp_percap': 'Norway',\n", + " 'ex3_lowest_gdp_percap': 'Paraguay',\n", + " 'ex4_lessthan20_000': 12393}" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ex4 = gdppercap[gdppercap < 20000]\n", + "results[\"ex4_lessthan20_000\"] = ex4[0]\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise 5 \n", + "\n", + "Get Python to print out the GDP per capita of Switzerland. Store the result as `ex5_switzerland`:" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'ex3_highest_gdp_percap': 'Norway',\n", + " 'ex3_lowest_gdp_percap': 'Paraguay',\n", + " 'ex4_lessthan20_000': 12393,\n", + " 'ex5_switzerland': 42536}" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Exercise 5: look up Switzerland by its index label\n", + "ex5 = gdppercap.loc[\"Switzerland\"]\n", + "results[\"ex5_switzerland\"] = ex5\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise 6\n", + "\n", + "One frequently used measure of inequality is the Gini Coefficient. The Gini Coefficient takes on a value of 1 when the distribution of some variable is maximally unequal across a population, and a value of 0 when it is evenly distributed. We will calculate the Gini Coefficient for income inequality in our data. \n", + "\n", + "To visualize the Gini Coefficient, we plot the cumulative share of the population (ordered from poorest to richest) on the x-axis, and cumulative share of income earned by that group on the y-axis. 
The Gini Coefficient is then defined as $\\frac{A}{A + B}$, where the areas A and B are labeled below: \n", + "\n", + "![gini_coefficient](https://upload.wikimedia.org/wikipedia/commons/thumb/5/59/Economics_Gini_coefficient2.svg/800px-Economics_Gini_coefficient2.svg.png)\n", + "\n", + "If income is evenly distributed, then the poorest 20% of a population will also have 20% of the wealth; the poorest 40% will have 40% of the wealth, and so forth, resulting in a perfect 45 degree line. In this situation, there is no area between the 45 degree line and the actual income distribution, so $A=0$, and the Gini Coefficient is 0. \n", + "\n", + "If, by contrast, the top 10% of people hold all the wealth in a country, then there will be no wealth for the poorest 90% of people, and wealth will jump up at the far right side of the graph. This will generate a very large gap between the 45 degree line and actual income for most of the graph, generating a large value for the area $A$ and therefore a very high Gini Coefficient. \n", + "\n", + "To illustrate, here are a few different Gini plots (these come from someone studying inequality of participation, so to adapt this to our study of income, just imagine the y-axis plots share of income):\n", + "\n", + "![gini_distributions](https://miro.medium.com/max/595/0*3DTcZnzDwS6A6AtP)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For discrete data, the Gini Coefficient can be calculated with the following formula: \n", + "\n", + "$$\\frac{2 \\sum_{i=1}^n i y_i}{n \\sum_{i=1}^n y_i} -\\frac{n+1}{n}$$\n", + "\n", + "where $i$ is each country's rank ordering from poorest to richest, and $y_i$ is the income of the country with rank $i$.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Exercise 6\n", + "\n", + "Using this formula, calculate the Gini coefficient for our income data. 
\n", + "\n", + "Begin by writing a function to calculate the Gini Coefficient for our data *by looping over the entries in our Series*. In other words, try to embrace the spirit of how you might normally think about interpreting the summation notation written above.\n", + "\n", + "Store the gini coefficient you calculate in `results` under the key `\"ex6_gini_loop\"`.\n", + "\n", + "**HINT**: Be careful with 0-indexing! Python counts from 0, but mathematical formulas (like $\\sum$) start from 1!\n", + "\n", + "**HINT 2**: I'll probably ask you to use this more than once, so please put it in a function." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "def cal_gini_loop(data):\n", + " sorted_data = sorted(data)\n", + " total_income = sum(sorted_data)\n", + " cumulative_sum = 0\n", + " n = len(sorted_data)\n", + " for i in range(n):\n", + " y_i = sorted_data[i]\n", + " cumulative_sum += 2 * (i + 1) * y_i\n", + " gini_coeff = (cumulative_sum / (n * total_income)) - ((n+1)/n)\n", + " return gini_coeff\n", + "\n", + "x = cal_gini_loop(gdppercap)\n", + "print(x)\n", + "results[\"ex6_gini_loop\"] = x\n", + "results" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.3382798461272245\n" + ] + }, + { + "data": { + "text/plain": [ + "{'ex3_highest_gdp_percap': 'Norway',\n", + " 'ex3_lowest_gdp_percap': 'Paraguay',\n", + " 'ex4_lessthan20_000': 12393,\n", + " 'ex5_switzerland': 42536,\n", + " 'ex6_gini_loop': 0.3382798461272245}" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def cal_gini_loop(data):\n", + " sorted_data = sorted(data)\n", + " total_income = sum(sorted_data)\n", + " cumulative_sum = 0\n", + " n = len(sorted_data)\n", + " for i in range(n):\n", + " y_i = sorted_data[i]\n", + " cumulative_sum += 2 * (i + 1) * y_i\n", + " gini_coeff = (cumulative_sum / (n * 
total_income)) - ((n + 1) / n)\n", + " return gini_coeff\n", + "\n", + "\n", + "x = cal_gini_loop(gdppercap)\n", + "print(x)\n", + "results[\"ex6_gini_loop\"] = x\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 7\n", + "\n", + "Excellent! But as we've seen in [our readings](https://nickeubank.github.io/practicaldatascience_book/notebooks/class_2/week_4/11_vectorization.html), in data science we generally strive to *not* loop over the entries in our arrays; instead, we aspire to write *vectorized code* that naturally applies a simple operation to each observation.\n", + "\n", + "So now write a new function to calculate the Gini Coefficient that *doesn't* use loops, and instead relies on vectorized code.\n", + "\n", + "Store the result in `results` under the key `\"ex7_gini_vectorized\"`.\n", + "\n", + "**HINT:** you will probably have to create some new series/vectors/arrays." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.3382798461272245\n" + ] + }, + { + "data": { + "text/plain": [ + "{'ex3_highest_gdp_percap': 'Norway',\n", + " 'ex3_lowest_gdp_percap': 'Paraguay',\n", + " 'ex4_lessthan20_000': 12393,\n", + " 'ex5_switzerland': 42536,\n", + " 'ex6_gini_loop': 0.3382798461272245,\n", + " 'ex7_gini_vectorized': 0.3382798461272245}" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import numpy as np\n", + "\n", + "\n", + "def cal_gini_vector(data):\n", + " data = np.array(data.sort_values())\n", + " n = len(data)\n", + " i = np.arange(1, n + 1)\n", + " numerator = np.sum(2 * data * i)\n", + " denominator = np.sum(n * data)\n", + " return numerator / denominator - (n + 1) / n\n", + "\n", + "\n", + "x = cal_gini_vector(gdppercap)\n", + "print(x)\n", + "results[\"ex7_gini_vectorized\"] = x\n", + "results" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "### Exercise 8\n", + "\n", + "The result we just generated offers a snap-shot of inequality for this subset of countries. But what are the dynamics of inequality for these countries?\n", + "\n", + "There is an idea in economics called the \"convergence hypothesis\", which argues that poorer countries are likely to grow faster, and as a result global inequality is likely to decline. Economists advocating for this hypothesis pointed out that while rich countries had to invent new technologies in order to grow, many poor countries simply had to take advantage of innovations already developed by rich countries. \n", + "\n", + "To test this hypothesis, let's do a small analysis of the dynamics of income inequality in our sample. Create the following Series in your Python session, which provides the average growth rate of GDP per capita for all the countries in our sample from 2000 to 2018. \n", + "\n", + "```python\n", + "avg_growth = pd.Series(\n", + " [\n", + " -0.29768835,\n", + " 0.980299584,\n", + " 4.52991925,\n", + " 3.686556736,\n", + " 2.621416804,\n", + " 0.775132075,\n", + " 2.015489468,\n", + " 3.345793635,\n", + " 1.349993318,\n", + " 0.982775018,\n", + " ],\n", + " index=[\n", + " \"Bahrain\",\n", + " \"Belgium\",\n", + " \"Bulgaria\",\n", + " \"Ireland\",\n", + " \"Macedonia\",\n", + " \"Norway\",\n", + " \"Paraguay\",\n", + " \"Singapore\",\n", + " \"South Africa\",\n", + " \"Switzerland\",\n", + " ],\n", + ")\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Using this data on average growth rates in GDP per capita, and assuming growth rates from 2000 to 2018 continue into the future, estimate what our Gini Coefficient may look like in 2025 (remembering that income in our data is from 2008, so we're extrapolating ahead 17 years)?\n", + "\n", + "**Hint:** the formula for compound growth (i.e. 
value of something growing at a rate of `x` percent for $t$ periods) is:\n", + "\n", + "$$future\\_value = current\\_value * \\left(1 + \\frac{percentage\\_growth\\_rate}{100}\\right)^t$$\n", + "\n", + "Store the answer in `results` under the key `\"ex8_gini_2025\"`" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.3656264991306193\n" + ] + }, + { + "data": { + "text/plain": [ + "{'ex3_highest_gdp_percap': 'Norway',\n", + " 'ex3_lowest_gdp_percap': 'Paraguay',\n", + " 'ex4_lessthan20_000': 12393,\n", + " 'ex5_switzerland': 42536,\n", + " 'ex6_gini_loop': 0.3382798461272245,\n", + " 'ex7_gini_vectorized': 0.3382798461272245,\n", + " 'ex8_gini_2025': 0.3656264991306193}" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "avg_growth = pd.Series(\n", + " [\n", + " -0.29768835,\n", + " 0.980299584,\n", + " 4.52991925,\n", + " 3.686556736,\n", + " 2.621416804,\n", + " 0.775132075,\n", + " 2.015489468,\n", + " 3.345793635,\n", + " 1.349993318,\n", + " 0.982775018,\n", + " ],\n", + " index=[\n", + " \"Bahrain\",\n", + " \"Belgium\",\n", + " \"Bulgaria\",\n", + " \"Ireland\",\n", + " \"Macedonia\",\n", + " \"Norway\",\n", + " \"Paraguay\",\n", + " \"Singapore\",\n", + " \"South Africa\",\n", + " \"Switzerland\",\n", + " ],\n", + ")\n", + "\n", + "# Years between the 2008 income data and the 2025 projection\n", + "t = 17\n", + "future_value = gdppercap * (1 + avg_growth / 100) ** t\n", + "ex8_gini_2025 = cal_gini_vector(future_value)\n", + "print(ex8_gini_2025)\n", + "results[\"ex8_gini_2025\"] = ex8_gini_2025\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 9\n", + "\n", + "Interpret your result -- does it seem to imply that we are seeing convergence or not?\n", + "\n", + "[After you're done, you can see a more systematic version of this analysis 
here!](https://www.cgdev.org/blog/everything-you-know-about-cross-country-convergence-now-wrong)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Answer for Exercise 9:\n", + "We are not seeing convergence. The projected 2025 Gini coefficient (0.366) is higher than the 2008 value (0.338), so inequality across these countries is growing rather than shrinking." + ] + } + ], + "metadata": { + "interpreter": { + "hash": "f06fa9c80cc08d4d343f66ad24a278ad0285590eac640a80c32c9d748f33a802" + }, + "kernelspec": { + "display_name": "Python 3.9.7 64-bit ('base': conda)", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/Pandas_DataFrames/US_ACS_2017_10pct_sample.dta b/Pandas_DataFrames/US_ACS_2017_10pct_sample.dta new file mode 100644 index 0000000..22ec84e Binary files /dev/null and b/Pandas_DataFrames/US_ACS_2017_10pct_sample.dta differ diff --git a/Pandas_DataFrames/exercise_cleaning.ipynb b/Pandas_DataFrames/exercise_cleaning.ipynb new file mode 100644 index 0000000..68db073 --- /dev/null +++ b/Pandas_DataFrames/exercise_cleaning.ipynb @@ -0,0 +1,773 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Cleaning Data Exercises\n", + "\n", + "In this exercise, we'll be returning to the American Community Survey data we used previously to measure racial income inequality in the United States. 
In today's exercise, we'll be using it to measure the returns to education and how those returns vary by race and gender.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Gradescope Autograding\n", + "\n", + "Please follow [all standard guidance](https://www.practicaldatascience.org/html/autograder_guidelines.html) for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called `results` and ensuring your notebook runs from the start to completion without any errors.\n", + "\n", + "For this assignment, please name your file `exercise_missing.ipynb` before uploading.\n", + "\n", + "You can check that you have answers for all questions in your `results` dictionary with this code:\n", + "\n", + "```python\n", + "assert set(results.keys()) == {\n", + " \"ex5_age_young\",\n", + " \"ex5_age_old\",\n", + " \"ex7_avg_age\",\n", + " \"ex8_avg_age\",\n", + " \"ex9_num_college\",\n", + " \"ex11_share_male_w_degrees\",\n", + " \"ex11_share_female_w_degrees\",\n", + " \"ex12_comparing\",\n", + "}\n", + "```\n", + "\n", + "\n", + "### Submission Limits\n", + "\n", + "Please remember that you are **only allowed three submissions to the autograder.** Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times) will determine your grade. Submissions that error out will **not** count against this total." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercises\n", + "\n", + "### Exercise 1\n", + "\n", + "For these cleaning exercises, we'll return to the ACS data we've used before one last time. We'll be working with `US_ACS_2017_10pct_sample.dta`. Import the data (please use the URL for the autograder)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "pd.set_option(\"mode.copy_on_write\", True)\n", + "\n", + "\n", + "# loading the dataset\n", + "acs = pd.read_stata(\n", + " \"https://github.com/nickeubank/MIDS_Data/raw/master/US_AmericanCommunitySurvey/US_ACS_2017_10pct_sample.dta\"\n", + ")\n", + "\n", + "# initializing dictionary\n", + "results = {}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 2\n", + "\n", + "For our exercises today, we'll focus on `age`, `sex`, `educ` (education), and `inctot` (total income). Subset your data to those variables, and quickly look at a sample of 10 rows." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The following is a table that is subsetted for age, education, income, and sex\n", + " age educ inctot sex\n", + "0 4 nursery school to grade 4 9999999 female\n", + "1 17 grade 11 6000 female\n", + "2 63 4 years of college 6150 male\n", + "3 66 grade 12 14000 female\n", + "4 1 n/a or no schooling 9999999 male\n", + "... .. ... ... ...\n", + "318999 33 4 years of college 22130 female\n", + "319000 4 nursery school to grade 4 9999999 female\n", + "319001 20 grade 12 5000 male\n", + "319002 47 5+ years of college 240000 male\n", + "319003 33 5+ years of college 48000 male\n", + "\n", + "[319004 rows x 4 columns]\n" + ] + } + ], + "source": [ + "acs = acs[[\"age\", \"educ\", \"inctot\", \"sex\"]]\n", + "print(\"The following is a table that is subsetted for age, education, income, and sex\")\n", + "print(acs)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 3\n", + "\n", + "As before, all the values of `9999999` have the potential to cause us real problems, so replace all the values of `inctot` that are `9999999` with `np.nan`. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " age educ inctot sex\n", + "136741 15 grade 9 0.0 male\n", + "40068 77 grade 12 36200.0 male\n", + "59267 58 4 years of college 0.0 female\n", + "141069 59 grade 12 8400.0 male\n", + "212024 56 grade 12 100000.0 male\n", + "184478 56 4 years of college 80000.0 female\n", + "112077 48 5+ years of college 175000.0 male\n", + "263712 58 grade 12 9100.0 female\n", + "77546 22 grade 12 0.0 female\n", + "107543 60 4 years of college 61200.0 male\n", + "296739 30 4 years of college 10000.0 female\n", + "71168 15 grade 10 0.0 female\n", + "177867 22 4 years of college 3000.0 male\n", + "237886 2 n/a or no schooling NaN male\n", + "62670 49 5+ years of college 230000.0 male\n", + "108127 5 n/a or no schooling NaN male\n", + "95460 8 nursery school to grade 4 NaN male\n", + "73254 53 1 year of college 9000.0 male\n", + "228074 14 grade 5, 6, 7, or 8 NaN female\n", + "232798 58 4 years of college 33000.0 female" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# replace all the values of `inctot` that are `9999999` with `np.nan`\n", + "import numpy as np\n", + "\n", + "acs.loc[acs[\"inctot\"] == 9999999, \"inctot\"] = np.nan\n", + "acs[\"inctot\"].value_counts()\n", + "np.random.seed(12)\n", + "acs.sample(20)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 4\n", + "\n", + "Attempt to calculate the average age of people in our data. What do you get? Why are you getting that error?\n", + "\n", + "You *should* get an error in trying to answer this question, but **PLEASE LEAVE THE CODE THAT GENERATES THIS ERROR COMMENTED OUT SO YOUR NOTEBOOK WILL RUN IN THE AUTOGRADER**. \n", + "\n", + "Then talk about the error in a markdown cell." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "# I am now going to be calculating the average age of people in our data\n", + "# acs[\"age\"].mean()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Exercise 4: The reason why there is an error is because age is not appearing as a numerical variable, it is coming as a categorical variable. You can't take the mean of a categorical variable. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 5\n", + "\n", + "We want to be able to calculate things using age, so we need it to be a numeric type. Check the current type of `age`, and look at all the values of `age` to figure out why it's categorical and not numeric. You should find two problematic categories. Store the values of these categories in `\"ex5_age_young\"` and `\"ex5_age_old\"` (once you find them, it should be clear which is which)." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The dictionary now has the string attached to different ages\n" + ] + }, + { + "data": { + "text/plain": [ + "{'ex5_age_young': 'less than 1 year old',\n", + " 'ex5_age_old': '90 (90+ in 1980 and 1990)'}" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# print(acs[\"age\"].dtype)\n", + "\n", + "# print(acs.loc[:,\"age\"])\n", + "\n", + "# acs.loc[acs[\"age\"] == \"less than 1 year old\", 'age']\n", + "\n", + "# acs[~acs[\"age\"].str.isnumeric()]\n", + "\n", + "results[\"ex5_age_young\"] = \"less than 1 year old\"\n", + "results[\"ex5_age_old\"] = \"90 (90+ in 1980 and 1990)\"\n", + "\n", + "\n", + "print(\"The dictionary now has the string attached to different ages\")\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 6\n", + 
"\n", + "In order to convert `age` into a numeric variable, we need to replace those problematic entries with values that `pandas` can later convert into numbers. Pick appropriate substitutions for the existing values and replace the current values. \n", + "\n", + "**Hint 1:** Categorical variables act like strings, so you might want to use string methods! \n", + "\n", + "**Hint 2:** Remember that characters like parentheses, pluses, asterisks, etc. are special in regular expressions, and you have to escape them if you want them to be interpreted literally!\n", + "\n", + "**Hint 3:** Because the US Census has been conducted regularly for hundreds of years but exactly how the census has been conducted has occasionally changed, variables are sometimes coded in a way that might be interpreted in different ways for different census years. For example, hypothetically, one might write `90 (90+ in 1980 and 1990)` if the Censuses conducted in 1980 and 1990 top-coded age at 90 (any values *over* 90 were just coded as 90), but more recent Censuses no longer top-code age and record ages over 90 as the respondent's actual age." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The age is now numerical from categorical\n" + ] + } + ], + "source": [ + "# Exercise 6\n", + "\n", + "acs[\"age\"] = acs[\"age\"].str.replace(results[\"ex5_age_young\"], \"0\")\n", + "acs[\"age\"] = acs[\"age\"].str.replace(r\"90 \(90\+ in 1980 and 1990\)\", \"90\", regex=True)\n", + "\n", + "acs[\"age\"] = acs[\"age\"].astype(int)\n", + "\n", + "print(\"The age is now numerical from categorical\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 7\n", + "\n", + "Now convert age from a categorical to numeric. Calculate the average age among this group, and store it in `\"ex7_avg_age\"`."
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "41.30384885455982\n", + "The average age among this group is 41.30\n" + ] + } + ], + "source": [ + "print(acs[\"age\"].mean())\n", + "\n", + "results[\"ex7_avg_age\"] = acs[\"age\"].mean()\n", + "print(\"The average age among this group is {:.2f}\".format(results[\"ex7_avg_age\"]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 8\n", + "\n", + "Let's now filter out anyone in our data whose age is less than 18. Note that before we made `age` a numeric variable, we couldn't do this! Again, calculate the average age and this time store it in `\"ex8_avg_age\"`. \n", + "\n", + "Use this sample of people 18 and over for all subsequent exercises." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The average age of people who are 18 and over is 50 years old\n" + ] + } + ], + "source": [ + "# Exercise 8: keep only people aged 18 and over, then calculate their average age\n", + "age_filtered = acs[acs[\"age\"] >= 18]\n", + "results[\"ex8_avg_age\"] = age_filtered[\"age\"].mean()\n", + "\n", + "print(\n", + " \"The average age of people who are 18 and over is {:.0f} years old\".format(\n", + " results[\"ex8_avg_age\"]\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 9\n", + "\n", + "Create an indicator variable for whether each person has *at least* a college Bachelor's degree called `college_degree`. Use this variable to calculate the number of people in the dataset with a college degree. You may assume that to get a college degree you need to complete at least 4 years of college. Save the result as `\"ex9_num_college\"`."
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "This is the number of people who completed at least 4 years of college 77013\n" + ] + } + ], + "source": [ + "# Exercise 9: count how many people completed at least 4 years of college\n", + "# (college_degree here is the list of qualifying educ categories)\n", + "college_degree = [\"4 years of college\", \"5+ years of college\"]\n", + "\n", + "results[\"ex9_num_college\"] = (\n", + " age_filtered[\"educ\"].isin(college_degree).sum()\n", + ")\n", + "\n", + "print(\n", + " \"This is the number of people who completed at least 4 years of college {:.0f}\".format(\n", + " results[\"ex9_num_college\"]\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 10\n", + "\n", + "Let's examine the educational gender gap. Use `pd.crosstab` to create a cross-tabulation of `sex` and `college_degree`. `pd.crosstab` will give you the number of people who have each combination of `sex` and `college_degree` (so in this case, it will give us a 2x2 table with Male and Female as rows, and `college_degree` True and False as columns, or vice versa). " + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "educ False True \n", + "sex \n", + "male 85821 36181\n", + "female 90200 40832\n" + ] + } + ], + "source": [ + "# This is a 2 x 2 table with male and female as rows and college degree True or False as columns\n", + "\n", + "gender = age_filtered[\"sex\"]\n", + "\n", + "degree = age_filtered[\"educ\"].isin(college_degree)\n", + "\n", + "\n", + "cross_tab = pd.crosstab(gender, degree)\n", + "print(cross_tab)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 11\n", + "\n", + "Counts are kind of hard to interpret. 
`pd.crosstab` can also normalize values to give percentages. Look at the `pd.crosstab` help file to figure out how to normalize the values in the table. Normalize them so that you get the share of men with and without college degrees, and the share of women with and without college degrees.\n", + "\n", + "Store the share (between 0 and 1) of men with college degrees in `\"ex11_share_male_w_degrees\"`, and the share of women with degrees in `\"ex11_share_female_w_degrees\"`." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "educ False True \n", + "sex \n", + "male 0.703439 0.296561\n", + "female 0.688381 0.311619\n", + "The proportion of females with degrees is 0.31 and the proportion of males with degrees is 0.30\n" + ] + } + ], + "source": [ + "# Exercise 11: normalize by row to get the share of men/women with and without college degrees\n", + "cross_tab_normalized = pd.crosstab(gender, degree, normalize=\"index\")\n", + "print(cross_tab_normalized)\n", + "\n", + "results[\"ex11_share_male_w_degrees\"] = cross_tab_normalized.loc[\"male\", True]\n", + "results[\"ex11_share_female_w_degrees\"] = cross_tab_normalized.loc[\"female\", True]\n", + "print(\n", + " \"The proportion of females with degrees is {:.2f} and the proportion of males with degrees is {:.2f}\".format(\n", + " results[\"ex11_share_female_w_degrees\"], results[\"ex11_share_male_w_degrees\"]\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 12\n", + "\n", + "Now, let's recreate that table for people who are 40 and over and people under 40. Over time, what does this suggest about the absolute difference in the share of men and women earning college degrees? Has it gotten larger, stayed the same, or gotten smaller? 
Store your answer (either `\"the absolute difference has increased\"` or `\"the absolute difference has decreased\"`) in `\"ex12_comparing\"`." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Share of adults under 40 in each sex and degree cell (columns: has a college degree)\n", + "educ False True \n", + "sex \n", + "male 0.373499 0.129095\n", + "female 0.331128 0.166278\n" + ] + }, + { + "data": { + "text/plain": [ + "{'ex5_age_young': 'less than 1 year old',\n", + " 'ex5_age_old': '90 (90+ in 1980 and 1990)',\n", + " 'ex7_avg_age': 41.30384885455982,\n", + " 'ex8_avg_age': 49.75769659413359,\n", + " 'ex9_num_college': 77013,\n", + " 'ex11_share_male_w_degrees': 0.29656071211947344,\n", + " 'ex11_share_female_w_degrees': 0.3116185359301545,\n", + " 'ex12_comparing': 'the absolute difference has increased'}" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Exercise 12: cross-tabulate sex and degree for adults under 40 only, as shares of all adults under 40\n", + "# (no separate 40-and-over table is computed here)\n", + "age_40 = acs[\"age\"] < 40\n", + "cross_age_40_tab = pd.crosstab(\n", + " gender, degree, values=age_40, aggfunc=\"sum\", normalize=\"all\"\n", + ")\n", + "\n", + "print(\"Share of adults under 40 in each sex and degree cell (columns: has a college degree)\")\n", + "print(cross_age_40_tab)\n", + "\n", + "results[\"ex12_comparing\"] = \"the absolute difference has increased\"\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 13\n", + "\n", + "In words, what is causing the change noted in Exercise 12 (i.e., looking at the tables above, tell me a story about men's and women's college attainment)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Women are catching up to men among younger adults because society's view of women's roles and women's rights has changed. 
Now it is more acceptable and more desirable for a woman to get a college degree. This means women are now closing the gap seen among women and men over 40. There has also been improvement in women's rights and in education scholarships, which makes it more attainable for women to get an education." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Want More Practice?\n", + "\n", + "Calculate the educational racial gap in the United States for White Americans, Black Americans, Hispanic Americans, and other groups. \n", + "\n", + "Note that to do these calculations, you'll have to deal with the fact that unlike most Americans, the American Census Bureau treats \"Hispanic\" not as a racial category, but a linguistic one. As a result, the racial category \"White\" in `race` actually includes most Hispanic Americans. For this analysis, we wish to work with the mutually exclusive categories of \"White, non-Hispanic\", \"White, Hispanic\", \"Black (Hispanic or non-Hispanic)\", and a category for everyone else. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "assert set(results.keys()) == {\n", + " \"ex5_age_young\",\n", + " \"ex5_age_old\",\n", + " \"ex7_avg_age\",\n", + " \"ex8_avg_age\",\n", + " \"ex9_num_college\",\n", + " \"ex11_share_male_w_degrees\",\n", + " \"ex11_share_female_w_degrees\",\n", + " \"ex12_comparing\",\n", + "}" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.10.6 ('base')", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + }, + "vscode": { + "interpreter": { + "hash": "718fed28bf9f8c7851519acf2fb923cd655120b36de3b67253eeb0428bd33d2d" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/Pandas_DataFrames/exercise_dataframes.ipynb b/Pandas_DataFrames/exercise_dataframes.ipynb new file mode 100644 index 0000000..457a243 --- /dev/null +++ b/Pandas_DataFrames/exercise_dataframes.ipynb @@ -0,0 +1,1101 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Estimating Labor Market Returns to Education\n", + "\n", + "In this exercise, we're going to use data from the [American Community Survey (ACS)](https://usa.ipums.org/usa/acs.shtml) to study the relationship between educational attainment and wages. The ACS is a survey conducted by the United States Census Bureau (though it is not \"The Census,\" which is a counting of every person in the United States that takes place every 10 years) to measure numerous features of the US population. The data we will be working with includes about 100 variables from the 2017 ACS survey, and is a 10% sample of the ACS (which itself is a 1% sample of the US population, so we're working with about a 0.1% sample of the United States). 
\n", + "\n", + "This data comes from [IPUMS](https://usa.ipums.org/usa/), which provides a very useful tool for getting subsets of major survey datasets, not just from the US, but [from government statistical agencies the world over](https://international.ipums.org/international-action/sample_details).\n", + "\n", + "This is *real* data, meaning that you are being provided the data as it is provided by IPUMS. Documentation for all variables used in this data can be found [here](https://usa.ipums.org/usa-action/variables/group) (you can either search by variable name to figure out the meaning of a variable in this data, or search for something you want to see if a variable with the right name is in this data). \n", + "\n", + "Within this data is information on both the educational background and current earnings of a representative sample of Americans. We will now use this data to estimate the labor-market returns to graduating high school and college, and to learn something about the meaning of an educational degree. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Gradescope Autograding\n", + "\n", + "Please follow [all standard guidance](https://www.practicaldatascience.org/html/autograder_guidelines.html) for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called `results` and ensuring your notebook runs from the start to completion without any errors.\n", + "\n", + "For this assignment, please name your file `exercise_dataframes.ipynb` before uploading.\n", + "\n", + "You can check that you have answers for all questions in your `results` dictionary with this code:\n", + "\n", + "```python\n", + "assert set(results.keys()) == {\n", + " \"ex2_num_obs\",\n", + " \"ex3_num_vars\",\n", + " \"ex8_updated_num_obs\",\n", + " \"ex9_updated_num_obs\",\n", + " \"ex11_grade12_income\",\n", + " \"ex12_college_income\",\n", + " \"ex12_college_income_pct\",\n", + " \"ex14_high_school_dropout\",\n", + " \"ex15_grade_9\",\n", + " \"ex15_grade_10\",\n", + " \"ex15_grade_11\",\n", + " \"ex15_grade_12\",\n", + " \"ex15_4_years_of_college\",\n", + " \"ex15_graduate\",\n", + "}\n", + "```\n", + "\n", + "### Submission Limits\n", + "\n", + "Please remember that you are **only allowed three submissions to the autograder.** Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times) will determine your grade. Submissions that error out will **not** count against this total.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercises" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 1\n", + "\n", + "Data for these [exercises can be found here](https://github.com/nickeubank/MIDS_Data/tree/master/US_AmericanCommunitySurvey). \n", + "\n", + "Import `US_ACS_2017_10pct_sample.dta` into a pandas DataFrame (read it directly from a URL to help the autograder, please). 
\n", + "\n", + "This can be done with the command `pd.read_stata`, which will read in files created in the program Stata (and which uses the file suffix `.dta`). This is a format commonly used by social scientists." + ] + }, + { + "cell_type": "code", + "execution_count": 178, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "# loading the dataset\n", + "acs = pd.read_stata(\n", + " \"https://github.com/nickeubank/MIDS_Data/raw/master/US_AmericanCommunitySurvey/US_ACS_2017_10pct_sample.dta\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Getting to Know Your Data\n", + "\n", + "When you get a new dataset like this, it's good to start by trying to get a feel for its contents and organization. Toy datasets you sometimes get in classes are often very small, and easy to look at, but this is a pretty large dataset, so you can't just open it up and get a good sense of it. Here are some ways to get to know your data. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 2\n", + "\n", + "How many observations are in your data? Store the answer in your `results` dictionary with the key `\"ex2_num_obs\"`." + ] + }, + { + "cell_type": "code", + "execution_count": 179, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'ex2_num_obs': 319004}\n" + ] + } + ], + "source": [ + "results = {}\n", + "results[\"ex2_num_obs\"] = len(acs)\n", + "print(results)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 3\n", + "\n", + "How many variables are in your data? Store the answer in your `results` dictionary with the key `\"ex3_num_vars\"`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 180, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'ex2_num_obs': 319004, 'ex3_num_vars': 104}\n" + ] + } + ], + "source": [ + "results[\"ex3_num_vars\"] = len(acs.columns)\n", + "print(results)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 4\n", + "\n", + " Let's see what variables are in this dataset. First, try to see them all using the command:\n", + "\n", + "\n", + "```python\n", + "acs.columns\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you will see, `python` doesn't like to print out all the different variables when there are this many in a dataset. \n", + "\n", + "To get everything printed out, we can loop over all the columns and print them one at a time with the command:\n", + "\n", + "```\n", + "for c in acs.columns: print(c)\n", + "```\n", + "\n", + "It's definitely a bit of a hack, but honestly a pretty useful one!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 181, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "year\n", + "datanum\n", + "serial\n", + "cbserial\n", + "numprec\n", + "subsamp\n", + "hhwt\n", + "hhtype\n", + "cluster\n", + "adjust\n", + "cpi99\n", + "region\n", + "stateicp\n", + "statefip\n", + "countyicp\n", + "countyfip\n", + "metro\n", + "city\n", + "citypop\n", + "strata\n", + "gq\n", + "farm\n", + "ownershp\n", + "ownershpd\n", + "mortgage\n", + "mortgag2\n", + "mortamt1\n", + "mortamt2\n", + "respmode\n", + "pernum\n", + "cbpernum\n", + "perwt\n", + "slwt\n", + "famunit\n", + "sex\n", + "age\n", + "marst\n", + "birthyr\n", + "race\n", + "raced\n", + "hispan\n", + "hispand\n", + "bpl\n", + "bpld\n", + "citizen\n", + "yrnatur\n", + "yrimmig\n", + "language\n", + "languaged\n", + "speakeng\n", + "hcovany\n", + "hcovpriv\n", + "hinsemp\n", + "hinspur\n", + "hinstri\n", + "hcovpub\n", + "hinscaid\n", + "hinscare\n", + "hinsva\n", + "hinsihs\n", + "school\n", + "educ\n", + "educd\n", + "gradeatt\n", + "gradeattd\n", + "schltype\n", + "degfield\n", + "degfieldd\n", + "degfield2\n", + "degfield2d\n", + "empstat\n", + "empstatd\n", + "labforce\n", + "occ\n", + "ind\n", + "classwkr\n", + "classwkrd\n", + "looking\n", + "availble\n", + "inctot\n", + "ftotinc\n", + "incwage\n", + "incbus00\n", + "incss\n", + "incwelfr\n", + "incinvst\n", + "incretir\n", + "incsupp\n", + "incother\n", + "incearn\n", + "poverty\n", + "migrate1\n", + "migrate1d\n", + "migplac1\n", + "migcounty1\n", + "migmet131\n", + "vetdisab\n", + "diffrem\n", + "diffphys\n", + "diffmob\n", + "diffcare\n", + "diffsens\n", + "diffeye\n", + "diffhear\n" + ] + } + ], + "source": [ + "# Exercise 4:\n", + "for c in acs.columns:\n", + " print(c)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 5\n", + "\n", + "That's a *lot* of variables, and definitely more than we need. 
In general, life is easier when working with these kinds of huge datasets if you can narrow down the number of variables a little. Since in this exercise we will be looking at the relationship between education and wages, we need variables for: \n", + "\n", + "- Age\n", + "- Income\n", + "- Education\n", + "- Employment status (is the person actually working)\n", + "\n", + "These quantities of interest correspond to the following variables in our data: `age`, `inctot`, `educ`, and `empstat`. \n", + "\n", + "Subset your data to just those variables. " + ] + }, + { + "cell_type": "code", + "execution_count": 182, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['age', 'inctot', 'educ', 'empstat'], dtype='object')" + ] + }, + "execution_count": 182, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "acs = acs[[\"age\", \"inctot\", \"educ\", \"empstat\"]]\n", + "acs.columns" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 6 \n", + "\n", + "Now that we have a more manageable number of variables, it's often very useful to look at a handful of rows of your data. The easiest way to do this is probably the `.head()` method (which will show you the first five rows), or the `.tail()` method, which will show you the last five rows. \n", + "\n", + "But to get a good sense of your data, it's often better to use the `.sample()` method, which returns a random set of rows. As the first and last rows are sometimes not representative, a random set of rows can be very helpful. Try looking at a random sample of 20 rows (note: you don't have to run `.sample()` twenty times to get twenty rows; look at the `.sample()` help file if you're stuck). " + ] + }, + { + "cell_type": "code", + "execution_count": 183, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<table>\n", + " <thead>\n", + " <tr><th></th><th>age</th><th>inctot</th><th>educ</th><th>empstat</th></tr>\n", + " </thead>\n", + " <tbody>\n", + "
 <tr><th>112403</th><td>65</td><td>1400</td><td>grade 12</td><td>not in labor force</td></tr>\n", + "
 <tr><th>215080</th><td>26</td><td>22100</td><td>grade 12</td><td>employed</td></tr>\n", + "
 <tr><th>107083</th><td>40</td><td>20400</td><td>grade 5, 6, 7, or 8</td><td>employed</td></tr>\n", + "
 <tr><th>49967</th><td>65</td><td>-5700</td><td>5+ years of college</td><td>unemployed</td></tr>\n", + "
 <tr><th>216806</th><td>45</td><td>94000</td><td>4 years of college</td><td>employed</td></tr>\n", + "
 <tr><th>16553</th><td>55</td><td>47700</td><td>grade 12</td><td>employed</td></tr>\n", + "
 <tr><th>67231</th><td>26</td><td>0</td><td>grade 12</td><td>not in labor force</td></tr>\n", + "
 <tr><th>262925</th><td>29</td><td>20100</td><td>grade 11</td><td>employed</td></tr>\n", + "
 <tr><th>115840</th><td>10</td><td>9999999</td><td>nursery school to grade 4</td><td>n/a</td></tr>\n", + "
 <tr><th>230324</th><td>60</td><td>37100</td><td>grade 12</td><td>not in labor force</td></tr>\n", + "
 <tr><th>112569</th><td>68</td><td>12600</td><td>4 years of college</td><td>not in labor force</td></tr>\n", + "
 <tr><th>243042</th><td>33</td><td>125000</td><td>5+ years of college</td><td>employed</td></tr>\n", + "
 <tr><th>16510</th><td>12</td><td>9999999</td><td>grade 5, 6, 7, or 8</td><td>n/a</td></tr>\n", + "
 <tr><th>167511</th><td>61</td><td>18600</td><td>1 year of college</td><td>not in labor force</td></tr>\n", + "
 <tr><th>229094</th><td>10</td><td>9999999</td><td>nursery school to grade 4</td><td>n/a</td></tr>\n", + "
 <tr><th>64238</th><td>4</td><td>9999999</td><td>n/a or no schooling</td><td>n/a</td></tr>\n", + "
 <tr><th>17990</th><td>26</td><td>0</td><td>1 year of college</td><td>not in labor force</td></tr>\n", + "
 <tr><th>190413</th><td>13</td><td>9999999</td><td>grade 5, 6, 7, or 8</td><td>n/a</td></tr>\n", + "
 <tr><th>108544</th><td>33</td><td>82000</td><td>1 year of college</td><td>employed</td></tr>\n", + "
 <tr><th>275561</th><td>53</td><td>35000</td><td>2 years of college</td><td>employed</td></tr>\n", + "
 </tbody>\n", + "</table>
" + ], + "text/plain": [ + " age inctot educ empstat\n", + "112403 65 1400 grade 12 not in labor force\n", + "215080 26 22100 grade 12 employed\n", + "107083 40 20400 grade 5, 6, 7, or 8 employed\n", + "49967 65 -5700 5+ years of college unemployed\n", + "216806 45 94000 4 years of college employed\n", + "16553 55 47700 grade 12 employed\n", + "67231 26 0 grade 12 not in labor force\n", + "262925 29 20100 grade 11 employed\n", + "115840 10 9999999 nursery school to grade 4 n/a\n", + "230324 60 37100 grade 12 not in labor force\n", + "112569 68 12600 4 years of college not in labor force\n", + "243042 33 125000 5+ years of college employed\n", + "16510 12 9999999 grade 5, 6, 7, or 8 n/a\n", + "167511 61 18600 1 year of college not in labor force\n", + "229094 10 9999999 nursery school to grade 4 n/a\n", + "64238 4 9999999 n/a or no schooling n/a\n", + "17990 26 0 1 year of college not in labor force\n", + "190413 13 9999999 grade 5, 6, 7, or 8 n/a\n", + "108544 33 82000 1 year of college employed\n", + "275561 53 35000 2 years of college employed" + ] + }, + "execution_count": 183, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "acs.sample(20)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 7\n", + "\n", + "Do you see any immediate problems? What issues do you see? (Please do answer in markdown)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### I notice a couple of issues with the data. First, in the `educ` column, anyone aged 14 or younger is assigned a range of grades rather than a single grade level, unlike children aged 16-18. In the `empstat` column there are a lot of `n/a` values. In `inctot` there are a lot of zeros, with no context for what these values mean. 
Why would someone older have 0 for `inctot`, while a small child (for example, age 7) has an `inctot` value of 9,999,999?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 8 \n", + "\n", + "One problem is that many people seem to have incomes of $9,999,999. Moreover, people with those incomes seem to be very young children. \n", + "\n", + "What you are seeing is one method (a relatively old one) for representing missing data. In this case, the value 9999999 is being used as a **sentinel value**, a way to denote missing data that was used back in the day when there was no way to add a special data type for missing data. In this case, it identifies observations where the person is too young to work, so their income value is missing. \n", + "\n", + "So let's begin by dropping anyone who has `inctot` equal to 9999999.\n", + "\n", + "After dropping, how many observations do you have? Save your answer in your `results` dictionary under the key `\"ex8_updated_num_obs\"`" + ] + }, + { + "cell_type": "code", + "execution_count": 184, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'ex2_num_obs': 319004, 'ex3_num_vars': 104, 'ex8_updated_num_obs': 265103}\n" + ] + } + ], + "source": [ + "# Drop rows where inctot holds the 9999999 missing-data sentinel\n", + "acs = acs[acs[\"inctot\"] != 9999999]\n", + "results[\"ex8_updated_num_obs\"] = len(acs)\n", + "print(results)\n", + "# acs.loc[acs.index == 11]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 9\n", + "\n", + "OK, the other potential problem is that our data includes lots of people who are unemployed and people who are not in the labor force (this means they not only don't have a job, but also aren't looking for a job). For this analysis, we want to focus on the wages of people who are currently employed. So subset the dataset for the people for whom `empstat` is equal to \"employed\". 
\n", + "\n", + "Note that our decision to only look at people who are employed impacts how we should interpret the relationship we estimate between education and income. Because we are only looking at employed people, we will be estimating the relationship between education and income *for people who are employed*. That means that if education affects the *likelihood* someone is employed, we won't capture that in this analysis.\n", + "\n", + "(You might also want to run `.sample()` after this just to make sure you were successful in your subsetting).\n", + "\n", + "After this subsetting, how many observations do you have? Save your answer in your `results` dictionary under the key `\"ex9_updated_num_obs\"`" + ] + }, + { + "cell_type": "code", + "execution_count": 185, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " age inctot educ empstat\n", + "1 17 6000 grade 11 employed\n", + "2 63 6150 4 years of college employed\n", + "5 50 50000 grade 12 employed\n", + "9 17 2000 grade 12 employed\n", + "10 47 18000 n/a or no schooling employed\n", + "148758\n" + ] + }, + { + "data": { + "text/plain": [ + "{'ex2_num_obs': 319004,\n", + " 'ex3_num_vars': 104,\n", + " 'ex8_updated_num_obs': 265103,\n", + " 'ex9_updated_num_obs': 148758}" + ] + }, + "execution_count": 185, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# exercise 9\n", + "acs = acs[acs[\"empstat\"] == \"employed\"]\n", + "acs_employed = acs[acs[\"empstat\"] == \"employed\"]\n", + "print(acs_employed.head())\n", + "print(acs.shape[0])\n", + "acs.sample(20)\n", + "results[\"ex9_updated_num_obs\"] = len(acs[\"empstat\"])\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 10\n", + "\n", + "Now let's turn to education. The `educ` variable seems to have a lot of discrete values. Let's see what values exist, and their distribution, using the `value_counts()` method. 
This is an *extremely* useful tool you'll use a lot! Try the following code (modified for the name of your dataset, of course):\n", + "\n", + "```python\n", + "acs[\"educ\"].value_counts()\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 186, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "educ\n", + "grade 12 47815\n", + "4 years of college 33174\n", + "1 year of college 22899\n", + "5+ years of college 20995\n", + "2 years of college 14077\n", + "grade 11 2747\n", + "grade 5, 6, 7, or 8 2092\n", + "grade 10 1910\n", + "n/a or no schooling 1291\n", + "grade 9 1290\n", + "nursery school to grade 4 468\n", + "Name: count, dtype: int64" + ] + }, + "execution_count": 186, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "acs[\"educ\"].value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 11\n", + "\n", + "There are a lot of values in here, so let's just check a couple. What is the average value of `inctot` for people whose highest grade level is \"grade 12\" (in the US, that is someone who has graduated high school)?\n", + "\n", + "Save your answer in your `results` dictionary under the key `\"ex11_grade12_income\"`." + ] + }, + { + "cell_type": "code", + "execution_count": 187, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " age inctot educ empstat\n", + "1 17 6000 grade 11 employed\n", + "2 63 6150 4 years of college employed\n", + "5 50 50000 grade 12 employed\n", + "9 17 2000 grade 12 employed\n", + "10 47 18000 n/a or no schooling employed\n", + "... .. ... ... 
...\n", + "318995 67 125000 grade 12 employed\n", + "318999 33 22130 4 years of college employed\n", + "319001 20 5000 grade 12 employed\n", + "319002 47 240000 5+ years of college employed\n", + "319003 33 48000 5+ years of college employed\n", + "\n", + "[148758 rows x 4 columns]\n", + " age inctot educ empstat\n", + "5 50 50000 grade 12 employed\n", + "9 17 2000 grade 12 employed\n", + "22 28 58000 grade 12 employed\n", + "23 48 52000 grade 12 employed\n", + "39 20 43000 grade 12 employed\n", + "... .. ... ... ...\n", + "318973 50 32000 grade 12 employed\n", + "318983 64 45600 grade 12 employed\n", + "318987 36 50000 grade 12 employed\n", + "318995 67 125000 grade 12 employed\n", + "319001 20 5000 grade 12 employed\n", + "\n", + "[47815 rows x 4 columns]\n", + "{'ex2_num_obs': 319004, 'ex3_num_vars': 104, 'ex8_updated_num_obs': 265103, 'ex9_updated_num_obs': 148758, 'ex11_grade12_income': 38957.76068179442}\n" + ] + } + ], + "source": [ + "import numpy as np\n", + "\n", + "print(acs_employed)\n", + "acs_grade_12 = acs[acs[\"educ\"] == \"grade 12\"]\n", + "print(acs_grade_12)\n", + "mean_values = np.mean([acs_grade_12[\"inctot\"]])\n", + "results[\"ex11_grade12_income\"] = mean_values\n", + "print(results)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 12\n", + "\n", + "What is the average income of someone who has completed an undergraduate degree but not done any postgraduate education (\"4 years of college\")? \n", + "\n", + "Save your answer in your `results` dictionary under the key `\"ex12_college_income\"`.\n", + "\n", + "In percentage terms, how much does an employed college graduate earn as compared to someone who is only a high school graduate? Use the reference category that gives an answer above 100.\n", + "\n", + "Store your answer in `\"ex12_college_income_pct\"`. 
Put your answer in percentage terms (so 100 implies they earn the same amount).\n", + "\n", + "*Make sure to interpret your result in words when you print it out!*" + ] + }, + { + "cell_type": "code", + "execution_count": 188, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "75485.05293301983\n", + "38957.76068179442\n", + "1.937612727527617\n" + ] + }, + { + "data": { + "text/plain": [ + "{'ex2_num_obs': 319004,\n", + " 'ex3_num_vars': 104,\n", + " 'ex8_updated_num_obs': 265103,\n", + " 'ex9_updated_num_obs': 148758,\n", + " 'ex11_grade12_income': 38957.76068179442,\n", + " 'ex12_college_income': 75485.05293301983,\n", + " 'ex12_college_income_pct': 193.7612727527617}" + ] + }, + "execution_count": 188, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# exercise 12\n", + "acs_4_year = acs[acs[\"educ\"] == \"4 years of college\"]\n", + "mean_4_year = np.mean(acs_4_year[\"inctot\"])\n", + "# print(acs_4_year)\n", + "print(mean_4_year)\n", + "print(mean_values)\n", + "results[\"ex12_college_income\"] = mean_4_year\n", + "\n", + "# percentage terms\n", + "percent = mean_4_year / mean_values\n", + "print(percent)\n", + "results[\"ex12_college_income_pct\"] = percent * 100\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Exercise 13\n", + "What does that suggest is the value of getting a college degree after graduating high school?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### A value of 193.76 means that employed college graduates earn about 194% of what high school graduates earn, or almost double. This suggests that getting an undergraduate degree is associated with nearly twice the average income. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 14\n", + "\n", + "What is the average income for someone who has not finished high school? 
What does that suggest is the value of a high school diploma? (Treat `n/a or no schooling` as having no formal schooling, not as missing).\n", + "\n", + "**Hint:** You may find the [.isin()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html) method to be really helpful here.\n", + "\n", + "Save your answer in your `results` dictionary under the key `\"ex14_high_school_dropout\"`." + ] + }, + { + "cell_type": "code", + "execution_count": 189, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1.4854374262522865\n" + ] + }, + { + "data": { + "text/plain": [ + "{'ex2_num_obs': 319004,\n", + " 'ex3_num_vars': 104,\n", + " 'ex8_updated_num_obs': 265103,\n", + " 'ex9_updated_num_obs': 148758,\n", + " 'ex11_grade12_income': 38957.76068179442,\n", + " 'ex12_college_income': 75485.05293301983,\n", + " 'ex12_college_income_pct': 193.7612727527617,\n", + " 'ex14_high_school_dropout': 26226.45692998571}" + ] + }, + "execution_count": 189, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# exercise 14\n", + "high_school_dropout = [\n", + " \"grade 11\",\n", + " \"grade 5, 6, 7, or 8\",\n", + " \"grade 10\",\n", + " \"n/a or no schooling\",\n", + " \"grade 9\",\n", + " \"nursery school to grade 4\",\n", + "]\n", + "\n", + "results[\"ex14_high_school_dropout\"] = acs.loc[\n", + " acs[\"educ\"].isin(high_school_dropout), \"inctot\"\n", + "].mean()\n", + "\n", + "print(results[\"ex11_grade12_income\"] / results[\"ex14_high_school_dropout\"])\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "EXERCISE 14:\n", + "The average income for someone who hasn't finished high school is 26,226.\n", + "High school graduates earn 38,958 on average, about 1.49 times as much, which suggests a high school diploma is associated with roughly 12,731 in additional average income."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 15 \n", + "\n", + "Complete the following table (storing values under the provided keys where listed):\n", + "\n", + "- Average income for someone who only completed 9th grade (`ex15_grade_9`): _________\n", + "- Average income for someone who only completed 10th grade (`ex15_grade_10`): _________\n", + "- Average income for someone who only completed 11th grade (`ex15_grade_11`): _________\n", + "- Average income for someone who finished high school (12th grade) but never started college (`ex15_grade_12`): _________\n", + "- Average income for someone who completed 4 year of college (in the US, this corresponds to getting an undergraduate degree), but has no post-graduate education (no more than 4 years, `ex15_4_years_of_college`): _________\n", + "- Average income for someone who has some graduate education (more than 4 years, `ex15_graduate`): _________" + ] + }, + { + "cell_type": "code", + "execution_count": 190, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'ex2_num_obs': 319004,\n", + " 'ex3_num_vars': 104,\n", + " 'ex8_updated_num_obs': 265103,\n", + " 'ex9_updated_num_obs': 148758,\n", + " 'ex11_grade12_income': 38957.76068179442,\n", + " 'ex12_college_income': 75485.05293301983,\n", + " 'ex12_college_income_pct': 193.7612727527617,\n", + " 'ex14_high_school_dropout': 26226.45692998571,\n", + " 'ex15_grade_9': 27171.907751937986,\n", + " 'ex15_grade_10': 23018.795811518325,\n", + " 'ex15_grade_11': 21541.68693119767,\n", + " 'ex15_grade_12': 38957.76068179442,\n", + " 'ex15_4_years_of_college': 75485.05293301983,\n", + " 'ex15_graduate': 110013.2213384139}" + ] + }, + "execution_count": 190, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Exercise 15\n", + "# ex15_grade_9\n", + "acs_9 = acs[acs[\"educ\"] == \"grade 9\"]\n", + "mean_9_grade = np.mean(acs_9[\"inctot\"])\n", + "results[\"ex15_grade_9\"] = 
mean_9_grade\n", + "# ex15_grade_10\n", + "acs_10 = acs[acs[\"educ\"] == \"grade 10\"]\n", + "mean_10_grade = np.mean(acs_10[\"inctot\"])\n", + "results[\"ex15_grade_10\"] = mean_10_grade\n", + "# ex15_grade_11\n", + "acs_11 = acs[acs[\"educ\"] == \"grade 11\"]\n", + "mean_11_grade = np.mean(acs_11[\"inctot\"])\n", + "results[\"ex15_grade_11\"] = mean_11_grade\n", + "# ex15_grade_12\n", + "acs_12 = acs[acs[\"educ\"] == \"grade 12\"]\n", + "mean_12_grade = np.mean(acs_12[\"inctot\"])\n", + "results[\"ex15_grade_12\"] = mean_12_grade\n", + "# ex15_4_years_of_college\n", + "acs_undergrad = acs[acs[\"educ\"] == \"4 years of college\"]\n", + "mean_undergrad = np.mean(acs_undergrad[\"inctot\"])\n", + "results[\"ex15_4_years_of_college\"] = mean_undergrad\n", + "# ex15_graduate\n", + "acs_grad = acs[acs[\"educ\"] == \"5+ years of college\"]\n", + "mean_grad = np.mean(acs_grad[\"inctot\"])\n", + "results[\"ex15_graduate\"] = mean_grad\n", + "\n", + "results" + ] + }, + { + "cell_type": "code", + "execution_count": 191, + "metadata": {}, + "outputs": [ + { + "ename": "", + "evalue": "", + "output_type": "error", + "traceback": [ + "\u001b[1;31mThe Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click here for more info. View Jupyter log for further details." 
+ ] + } + ], + "source": [ + "assert set(results.keys()) == {\n", + " \"ex2_num_obs\",\n", + " \"ex3_num_vars\",\n", + " \"ex8_updated_num_obs\",\n", + " \"ex9_updated_num_obs\",\n", + " \"ex11_grade12_income\",\n", + " \"ex12_college_income\",\n", + " \"ex12_college_income_pct\",\n", + " \"ex14_high_school_dropout\",\n", + " \"ex15_grade_9\",\n", + " \"ex15_grade_10\",\n", + " \"ex15_grade_11\",\n", + " \"ex15_grade_12\",\n", + " \"ex15_4_years_of_college\",\n", + " \"ex15_graduate\",\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 16 \n", + "\n", + "Why do you think there is no benefit from moving from grade 9 to grade 10, or grade 10 to grade 11, but there is a huge benefit to moving from grade 11 to graduating high school (grade 12)?\n", + "\n", + "(Think carefully before reading ahead!)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### If you stopped before finishing grade 12, you haven't completed the full four years of high school. Completing only the first few years of high school is not as valuable to employers as a high school diploma or an undergraduate degree. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Take-aways\n", + "\n", + "Congratulations! You just discovered \"the sheepskin effect!\": people with degrees tend to earn substantially more than people who have *almost* as much education, but don't have an actual degree. \n", + "\n", + "In economics, this is viewed as evidence that the reason employers pay people with high school degrees more than those without a degree is *not* that they think those who graduated high school have learned specific, useful skills. If that were the case, we would expect employee earnings to rise with every year of high school, since in each year of high school we learn more. 
\n", + "\n", + "Instead, this suggests employers pay high school graduates more because they think *the kind of people* who can finish high school are the *kind of people* who are likely to succeed at their jobs. Finishing high school, in other words, isn't about accumulating specific knowledge; it's about showing that you *are the kind of person* who can rise to the challenge of finishing high school, which also suggests you are the kind of person who can succeed as an employee. \n", + "\n", + "(Obviously, this does not tell us whether that is an *accurate* inference, just that that seems to be how employers think.) \n", + "\n", + "In other words, in the eyes of employers, a high school degree is a *signal* about the kind of person you are, not certification that you've learned a specific set of skills (an idea that earned [Michael Spence](https://en.wikipedia.org/wiki/Michael_Spence) a Nobel Prize in Economics). " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.10.2 ('ds')", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + }, + "vscode": { + "interpreter": { + "hash": "b9e56a7b23b1fac2eea1a993b805ed5c611aea1439c1f46315b23590ab6d3ba0" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/Pandas_DataFrames/exercise_merging.ipynb b/Pandas_DataFrames/exercise_merging.ipynb new file mode 100644 index 0000000..bf178c8 --- /dev/null +++ b/Pandas_DataFrames/exercise_merging.ipynb @@ -0,0 +1,3146 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Merging Data to Understand the Relationship between Drug Legalization and Violent Crime\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In recent years, many US 
states have decided to legalize the use of marijuana. \n", + "\n", + "When these ideas were first proposed, there were many theories about the relationship between crime and the \"War on Drugs\" (the term given to US efforts to arrest drug users and dealers over the past several decades). \n", + "\n", + "In this exercise, we're going to test a few of those theories using drug arrest data from the state of California. \n", + "\n", + "Though California has passed a number of laws lessening penalties for marijuana possession over the years, arguably the biggest changes were in 2010, when the state changed the penalty for possessing a small amount of marijuana from a criminal offense to a \"civil\" penalty (meaning those found guilty only had to pay a fine, not go to jail), though possessing, selling, or producing larger quantities remained illegal. Then in 2016, the state fully legalized marijuana for recreational use, not only making possession of small amounts legal, but also creating a regulatory system for producing marijuana for sale. \n", + "\n", + "Proponents of drug legalization have long argued that the war on drugs contributes to violent crime by creating an opportunity for drug dealers and organized crime to sell and distribute drugs, a business which tends to generate violence when gangs battle over territory. According to this theory, with drug legalization, we should see violent crime decrease after legalization in places where drug arrests had previously been common. \n", + "\n", + "**To be clear,** this is far from the only argument for drug legalization! It is simply the argument we are well positioned to analyze today. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Gradescope Autograding\n", + "\n", + "Please follow [all standard guidance](https://www.practicaldatascience.org/html/autograder_guidelines.html) for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called `results` and ensuring your notebook runs from the start to completion without any errors.\n", + "\n", + "For this assignment, please name your file `exercise_merging.ipynb` before uploading.\n", + "\n", + "You can check that you have answers for all questions in your `results` dictionary with this code:\n", + "\n", + "```python\n", + "assert set(results.keys()) == {\n", + " \"ex6_merge_type\",\n", + " \"ex10_merged_successfully\",\n", + " \"ex16_num_obs\",\n", + " \"ex17_drug_change\",\n", + " \"ex18_violent_change\",\n", + " \"ex21_diffindiff\",\n", + " \"ex23_diffindiff_proportionate\",\n", + "}\n", + "```\n", + "\n", + "\n", + "### Submission Limits\n", + "\n", + "Please remember that you are **only allowed three submissions to the autograder.** Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times) will determine your grade Submissions that error out will **not** count against this total." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pre-Legalization Analysis\n", + "\n", + "### Exercise 1\n", + "We will begin by examining [county-level data on arrests from California in 2009](https://github.com/nickeubank/practicaldatascience/tree/master/Example_Data/ca). This data is derived directly from data hosted by the [Office of the California State Attorney General](https://openjustice.doj.ca.gov/data), but please follow the github link above and download and import the file `ca_arrests_2009.csv` (don't try and get it directly from the State Attorney General's office). 
" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " COUNTY VIOLENT PROPERTY F_DRUGOFF F_SEXOFF F_ALLOTHER \\\n", + "1682 Alameda County 4318 4640 5749 260 3502 \n", + "1683 Alpine County 8 4 2 1 1 \n", + "1684 Amador County 100 59 101 5 199 \n", + "1685 Butte County 641 602 542 34 429 \n", + "1686 Calaveras County 211 83 123 14 70 \n", + "\n", + " F_TOTAL M_TOTAL S_TOTAL \n", + "1682 18469 37247 431 \n", + "1683 16 83 0 \n", + "1684 464 801 2 \n", + "1685 2248 9026 1 \n", + "1686 501 968 3 " + ] + }, + "execution_count": 68, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Loading the arrest 2009 California dataset and loading necessary libraries. Initializing the results dictionary.\n", + "results = {}\n", + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "\n", + "arrest_df = pd.read_csv(\n", + " \"https://github.com/nickeubank/practicaldatascience/raw/master/Example_Data/ca/ca_arrests_2009.csv\",\n", + " index_col=0,\n", + ")\n", + "\n", + "arrest_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 2\n", + "Use your data exploration skills to get a feel for this data. If you need to, you can find the [original codebook here](https://github.com/nickeubank/practicaldatascience/blob/master/Example_Data/ca/arrests_codebook.pdf) (This data is a version of that data, but collapsed to one observation per county.)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 3\n", + "Figuring out what county has the most violent arrests isn't very meaningful if we don't normalize for size. A county with 10 people and 10 arrests for violent crimes is obviously worse than a county with 1,000,000 people an 11 arrests for violent crime. \n", + "\n", + "To address this, also import `nhgis_county_populations.csv` from [the directory we're working from](https://github.com/nickeubank/practicaldatascience/tree/master/Example_Data/ca)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 69, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " YEAR STATE COUNTY total_population\n", + "0 2005-2009 Alabama Autauga County 49584\n", + "1 2005-2009 Alabama Baldwin County 171997\n", + "2 2005-2009 Alabama Barbour County 29663\n", + "3 2005-2009 Alabama Bibb County 21464\n", + "4 2005-2009 Alabama Blount County 56804\n", + " YEAR STATE COUNTY total_population\n", + "186 2005-2009 California Alameda County 1457095\n", + "187 2005-2009 California Alpine County 1153\n", + "188 2005-2009 California Amador County 38039\n", + "189 2005-2009 California Butte County 217917\n", + "190 2005-2009 California Calaveras County 46548\n" + ] + } + ], + "source": [ + "nhgis_df = pd.read_csv(\n", + " \"https://github.com/nickeubank/practicaldatascience/raw/master/Example_Data/ca/nhgis_county_populations.csv\",\n", + " index_col=0,\n", + ")\n", + "\n", + "print(nhgis_df.head())\n", + "\n", + "nhgis_df_cali = nhgis_df.loc[nhgis_df[\"STATE\"] == \"California\", :]\n", + "\n", + "nhgis_df_cali_year = nhgis_df_cali[nhgis_df_cali[\"YEAR\"] == \"2005-2009\"]\n", + "\n", + "print(nhgis_df_cali_year.head())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 4\n", + "Use your data exploration skills to get used to this data, and figure out how it relates to your 2009 arrest data. 
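
One way to approach this is sketched below, under stated assumptions: the two small tables and their county names are made up for illustration (they are not the real arrest or population data), and the check uses only standard pandas calls (`Series.is_unique`, `Series.value_counts`). The idea is to test whether the column you plan to merge on is unique in each table, and which values appear in only one of them.

```python
import pandas as pd

# Hypothetical toy tables standing in for the arrest and population data.
# All names and numbers here are invented for illustration only.
arrests = pd.DataFrame(
    {"COUNTY": ["A County", "B County"], "VIOLENT": [40, 8]}
)
population = pd.DataFrame(
    {
        "YEAR": ["2005-2009", "2013-2017", "2005-2009", "2005-2009"],
        "COUNTY": ["A County", "A County", "B County", "C County"],
        "total_population": [100, 110, 10, 50],
    }
)

# Is the merge key unique in each table?
arrests_key_unique = arrests["COUNTY"].is_unique
population_key_unique = population["COUNTY"].is_unique

# Which key values appear in one table but not the other?
only_in_arrests = set(arrests["COUNTY"]) - set(population["COUNTY"])
only_in_population = set(population["COUNTY"]) - set(arrests["COUNTY"])

# How many rows share each key value in the population table?
rows_per_county = population["COUNTY"].value_counts()

print(arrests_key_unique, population_key_unique)
print(only_in_arrests, only_in_population)
print(rows_per_county)
```

If the key is not unique in one table, that is a hint the merge is not one-to-one and that something else (such as a year or state column) distinguishes the repeated rows.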
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### These are the desciptions for each variable in our data\n", + "- F_drugoff: is the sum of arrest for felony drug offenses(narcotics, marijuana, etc.)\n", + "- F_sexoff : sum of arrest for felony sex offense(lewd, unlawful sexual intercourse)\n", + "- F_allother: sum of arrest for all other felony offenses\n", + "- M_Total: Sum of all misdemeanor arrests\n", + "- S_Total: Sum of all arrests for status offenses\n", + "- Property: Sum of arrests for felony property offenses(burglary, theft, motor vehicle theft,etc)\n", + "- Violent: Sum of arrests for felony violent offenses(homicide, assault, kidnapping)\n", + "- County: County of reporting agency" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### How does nhgis_county_populations.csv data related to 2009 arrest data\n", + "In our nhgis_county population data, we can see information for all 50 states in the United States and their total population size. We can see in the 2009 arrest data, we are only looking at arrests in California. We will need to only select from nhgis_county_populations data the state California and the Total population from California. Once, this information is extracted we can merge it with the 2009 arrest data. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 5\n", + "\n", + "Once you feel like you have a good sense of the relation between our arrest and population data, merge the two datasets." + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " COUNTY VIOLENT PROPERTY F_DRUGOFF F_SEXOFF F_ALLOTHER \\\n", + "0 Alameda County 4318 4640 5749 260 3502 \n", + "1 Alpine County 8 4 2 1 1 \n", + "2 Amador County 100 59 101 5 199 \n", + "3 Butte County 641 602 542 34 429 \n", + "4 Calaveras County 211 83 123 14 70 \n", + "\n", + " F_TOTAL M_TOTAL S_TOTAL YEAR STATE total_population \n", + "0 18469 37247 431 2005-2009 California 1457095.0 \n", + "1 16 83 0 2005-2009 California 1153.0 \n", + "2 464 801 2 2005-2009 California 38039.0 \n", + "3 2248 9026 1 2005-2009 California 217917.0 \n", + "4 501 968 3 2005-2009 California 46548.0 " + ] + }, + "execution_count": 70, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "merged_df = pd.merge(arrest_df, nhgis_df_cali_year, how=\"left\", on=\"COUNTY\")\n", + "\n", + "merged_df.reset_index(drop=True, inplace=True)\n", + "\n", + "merged_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 6\n", + "\n", + "When merging data, the result will only be meaningful if your understanding of how the data sets you are merging relate to one another are correct. In some ways, this is obvious — for example, if the variable(s) that you are using to merge observations in the two datasets or to actually identifying observations that should be linked, then obviously merging using those variables will create a meaningless new dataset.\n", + "\n", + "But other properties that matter are often more subtle. For example, it's important to figure out whether your merge is a `1-to-1` merge (meaning there is only one observation of the variable you're merging on in both datasets), a `1-to-many` merge (meaning there is only one observation of the variable you're merging on in the first dataset, but multiple observations in the second), or a `many-to-many` merge (something you almost never do). \n", + "\n", + "Being correct in your assumptions about these things is *very* important. 
If you think there's only one observation per value of your merging variable in each dataset, but there are in fact 2, you'll end up with two observations for each value after the merge. Moreover, not only is the structure of your data now a mess, but the fact you were wrong means you didn't understand something about your data. \n", + "\n", + "So before running a merge, it is critical to answer the following questions:\n", + "\n", + "a) What variable(s) do you think will be consistent across these two datasets that you can use for merging? \n", + "\n", + "\n", + "b) Do you think there will be exactly 1 observation for each value of this variable(s) in your arrest data?\n", + "\n", + "\n", + "c) Do you think there will be exactly 1 observation for each value of this variable(s) in your population data?\n", + "\n", + "\n", + "So in markdown, answer these three questions for this data.\n", + "\n", + "\n", + "\n", + "\n", + "Then also specify the type of merge you were hoping to accomplish as one of the following strings (`\"one-to-one\"`, `\"one-to-many\"`, `\"many-to-one\"`, or `\"many-to-many\"`) in your `results` dictionary under the key `\"ex6_validate_keyword\"`. Assume that the first dataset we are talking about (e.g., the `one` in `one-to-many`, if that were your selection) is your arrests data and the second dataset (e.g., the `many` in `one-to-many`, if that were your selection) is your population data." + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "metadata": {}, + "outputs": [], + "source": [ + "# Exercise 6 : Deciding the merge method\n", + "results[\"ex6_validate_keyword\"] = \"one-to-many\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Exercise 6\n", + "a) What variable(s) do you think will be consistent across these two datasets that you can use for merging? 
\n", + " - County names from the state of California are the only consistent variable.\n", + "\n", + "b) Do you think there will be exactly 1 observation for each value of this variable(s) in your arrest data?\n", + " - Yes, I think for each county there is only one set of observations for each column within the arrest data. \n", + "\n", + "c) Do you think there will be exactly 1 observation for each value of this variable(s) in your population data?\n", + "- No. Within the population data there are 2 observations for each county, because the population data covers two time frames (2005-2009 and 2013-2017)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Merge Validation\n", + "\n", + "Because of the importance of answering these questions accurately, `pandas` provides a utility for *validating* these assumptions when you do a merge: the `validate` keyword! Validate will accept `\"1:1\"`, `\"1:m\"`, `\"m:1\"`, and `\"m:m\"`. It will then check to make sure your merge matches the type of merge you think it is. I *highly* recommend always using this option (...and not just because I'm the one who added `validate` to pandas).\n", + "\n", + "*Note:* `validate` only actually tests if observations are unique when a `1` is specified; if you do a `1:1` merge but pass `validate=\"1:m\"`, `validate=\"m:1\"`, or `validate=\"m:m\"`, you won't get an error, because a one-to-many merge that turns out to be a one-to-one isn't nearly as dangerous as a one-to-one merge that turns out to be one-to-many.\n", + "\n", + "### Exercise 7\n", + "\n", + "Repeat the merge you conducted above, but this time use the `validate` keyword to make sure your assumptions about the data were correct. 
If you find that you made a mistake, revise your data until the merge you think is correct actually takes place.\n", + "\n", + "To aid the autograder, please make sure to comment out any code that generates an error.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "metadata": {}, + "outputs": [], + "source": [ + "validation_df = pd.merge(arrest_df, nhgis_df, on=\"COUNTY\", validate=\"1:m\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 8\n", + "\n", + "Were your assumptions about the data correct? If not, what had you (implicitly) assumed when you did your merge in Exercise 5 that turned out not to be correct?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Exercise 8: Were your assumptions correct?\n", + "My assumption about the data was correct because I thoroughly looked at the data and checked the different year ranges as conditions. Initially, I implicitly thought I could merge the data directly and didn't realize that the arrest data was just for 2009 while the population data has two time periods. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Merge Diagnostics\n", + "\n", + "### Exercise 9\n", + "\n", + "Checking whether you are doing a 1-to-1, many-to-1, 1-to-many, or many-to-many merge is only the first type of diagnostic test you should run on *every* merge you conduct. The second test is to see if your data merged successfully!\n", + "\n", + "To help with this, the `merge` function in pandas offers a keyword option called `indicator`. If you set `indicator` to `True`, then pandas will add a column to the result of your merge called `_merge`. 
This variable will tell you, for each observation in your merged data, whether: \n", + "\n", + "- that observation came from a successful merge of both datasets, \n", + "- if that observation was in the left dataset (the first one you passed) but not the right dataset (the second one you passed), or \n", + "- if that observation was in the right dataset but not the left. \n", + "\n", + "This allows you to quickly identify failed merges!\n", + "\n", + "For example, suppose you had the following data:" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " key df1_var\n", + "0 key1 1\n", + "1 key2 2" + ] + }, + "execution_count": 73, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "\n", + "pd.set_option(\"mode.copy_on_write\", True)\n", + "\n", + "df1 = pd.DataFrame({\"key\": [\"key1\", \"key2\"], \"df1_var\": [1, 2]})\n", + "df1" + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " key df2_var\n", + "0 key1 a\n", + "1 Key2 b" + ] + }, + "execution_count": 74, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df2 = pd.DataFrame({\"key\": [\"key1\", \"Key2\"], \"df2_var\": [\"a\", \"b\"]})\n", + "df2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now suppose you *expected* that all observations should merge when you merge these datasets (because you hadn't noticed the typo in `df2` where `key2` has a capital `Key2`. If you just run a merge, it works without any problems:" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "metadata": {}, + "outputs": [], + "source": [ + "new_data = pd.merge(df1, df2, on=\"key\", how=\"outer\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And so you might carry on in life unaware your data is now corrupted: instead of two merged rows, you now have 3, only 1 of which merged correctly!" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " key df1_var df2_var\n", + "0 key1 1.0 a\n", + "1 key2 2.0 NaN\n", + "2 Key2 NaN b" + ] + }, + "execution_count": 76, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "new_data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When what you really wanted was: " + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " key df1_var df2_var\n", + "0 key1 1 a\n", + "1 key2 2 b" + ] + }, + "execution_count": 77, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df2_correct = df2.copy()\n", + "df2_correct.loc[df2.key == \"Key2\", \"key\"] = \"key2\"\n", + "pd.merge(df1, df2_correct, on=\"key\", how=\"outer\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "(in a small dataset, you'd quickly see you have 1 row instead of 2, but if you have millions of rows, a couple missing won't be evident). \n", + "\n", + "But now suppose we use the `indicator` function:" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "_merge\n", + "left_only 1\n", + "right_only 1\n", + "both 1\n", + "Name: count, dtype: int64" + ] + }, + "execution_count": 78, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "new_data = pd.merge(df1, df2, on=\"key\", how=\"outer\", indicator=True)\n", + "new_data._merge.value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We could immediately see that only one observation merged correct, and that one row from each dataset failed to merge!\n", + "\n", + "Moreover, we can look at the failed merges:" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " key df1_var df2_var _merge\n", + "1 key2 2.0 NaN left_only\n", + "2 Key2 NaN b right_only" + ] + }, + "execution_count": 79, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "new_data[new_data._merge != \"both\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Allowing us to easily diagnose the problem. \n", + "\n", + "**Note:** The `pandas` merge function allows users to decide whether to keep only observations that merge (`how='inner'`), all the observations from the first dataset pasted to merge (`how='left'`), all the observations from the second dataset passed to merge (`how='right'`), or all observations (`how='outer'`):\n", + "\n", + "![join_types](https://nickeubank.github.io/practicaldatascience_book/_images/3.4.15_merge_types.png)\n", + "\n", + "But one danger to using the more restrictive options (like the default, `how='inner'`) is that the merge throws away all the observations that fail to merge, and while this may be the *eventual* goal of your analysis, it means that you don't get to see all the observations that failed to merge that maybe you thought *would* merge. In other words, it throws away the errors so you can't look at them! \n", + "\n", + "So to use `indicator` effectively, you have to:\n", + "\n", + "- Not use `how=\"inner\"`, and\n", + "- Check the values of `_merge` after your merge. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 10\n", + "\n", + "Now repeat your previous merge using *both* the `validate` keyword *and* the `indicator` keyword with `how='outer'`. \n", + "\n", + "How many observations successfully merged (were in both datasets)? Store the result in `results` under the key `\"ex10_merged_successfully\"`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 80, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "_merge\n", + "right_only 6275\n", + "both 166\n", + "left_only 2\n", + "Name: count, dtype: int64\n", + " COUNTY VIOLENT PROPERTY F_DRUGOFF F_SEXOFF \\\n", + "18 Del Norte County 144.0 104.0 79.0 13.0 \n", + "33 Inyo County 81.0 44.0 39.0 3.0 \n", + "168 Autauga County NaN NaN NaN NaN \n", + "169 Autauga County NaN NaN NaN NaN \n", + "170 Baldwin County NaN NaN NaN NaN \n", + "... ... ... ... ... ... \n", + "6438 Yauco Municipio NaN NaN NaN NaN \n", + "6439 Kusilvak Census Area NaN NaN NaN NaN \n", + "6440 Petersburg Borough NaN NaN NaN NaN \n", + "6441 LaSalle Parish NaN NaN NaN NaN \n", + "6442 Oglala Lakota County NaN NaN NaN NaN \n", + "\n", + " F_ALLOTHER F_TOTAL M_TOTAL S_TOTAL YEAR STATE \\\n", + "18 97.0 437.0 1268.0 5.0 NaN NaN \n", + "33 38.0 205.0 851.0 1.0 NaN NaN \n", + "168 NaN NaN NaN NaN 2005-2009 Alabama \n", + "169 NaN NaN NaN NaN 2013-2017 Alabama \n", + "170 NaN NaN NaN NaN 2005-2009 Alabama \n", + "... ... ... ... ... ... ... \n", + "6438 NaN NaN NaN NaN 2013-2017 Puerto Rico \n", + "6439 NaN NaN NaN NaN 2013-2017 Alaska \n", + "6440 NaN NaN NaN NaN 2013-2017 Alaska \n", + "6441 NaN NaN NaN NaN 2013-2017 Louisiana \n", + "6442 NaN NaN NaN NaN 2013-2017 South Dakota \n", + "\n", + " total_population _merge \n", + "18 NaN left_only \n", + "33 NaN left_only \n", + "168 49584.0 right_only \n", + "169 55036.0 right_only \n", + "170 171997.0 right_only \n", + "... ... ... 
\n", + "6438 37585.0 right_only \n", + "6439 8129.0 right_only \n", + "6440 3275.0 right_only \n", + "6441 14930.0 right_only \n", + "6442 14291.0 right_only \n", + "\n", + "[6277 rows x 13 columns]\n" + ] + }, + { + "data": { + "text/plain": [ + "{'ex6_validate_keyword': 'one=to-many', 'ex10_merged_successfully': 166}" + ] + }, + "execution_count": 80, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "merged_df_validate_indi = pd.merge(\n", + " arrest_df, nhgis_df, how=\"outer\", on=\"COUNTY\", validate=\"1:m\", indicator=True\n", + ")\n", + "print(merged_df_validate_indi._merge.value_counts())\n", + "print(merged_df_validate_indi[merged_df_validate_indi._merge != \"both\"])\n", + "# Count the rows found in both datasets instead of hard-coding the value\n", + "results[\"ex10_merged_successfully\"] = int((merged_df_validate_indi._merge == \"both\").sum())\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 11\n", + "\n", + "You *should* be able to get to the point that all counties in our arrest data merge with population data. If that did not happen, can you figure out why that did not happen? Can you fix the data so that all arrest data merges with population data?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "I was able to fix the data so that the arrest data merges with the population data. I did this by restricting the population data to California counties and to the 2005-2009 time frame before merging. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Comparing Arrest Rates\n", + "\n", + "### Exercise 12\n", + "\n", + "Now that we have arrest counts and population data, we can calculate arrest *rates*. For each county, create a new variable called `violent_arrest_rate_2009` that is the number of violent arrests for 2009 divided by the population of the county from 2005-2009, and a similar new variable called `drug_arrest_rate_2009` for drug arrests divided by population."
+ ] + }, + { + "cell_type": "code", + "execution_count": 81, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[0.002963430661693301, 0.006938421509106678, 0.0026288808854070824, 0.0029414868963871564, 0.0045329552290109135, 0.0027617732488929097, 0.0029303711901974357, nan, 0.0032397224069432365, 0.004913836654504631, 0.003728801405471299, 0.0038991341286636746, 0.0037429546221427945, nan, 0.005493288328490959, 0.0026585591972514587, 0.0053894619803570324, 0.0029336586499360984, 0.0036093955266550473, 0.003128582676077738, 0.002026662775474138, 0.003582423733557235, 0.004823898640009299, 0.004825892212108077, 0.005129884304736957, 0.00402321083172147, 0.00342041183240229, 0.002776663917744169, 0.0021738458526936112, 0.0020642757348334523, 0.0021441972661484857, 0.004866180048661801, 0.0028605748454061867, 0.003854304106193275, 0.0038172121566335477, 0.004265504232030544, 0.0032843041924417488, 0.004551777250144556, 0.004849234398720512, 0.002323106325028896, 0.0020601636163137603, 0.0032137304893974255, 0.002491647285902793, 0.002995250558874772, 0.002586586541945626, 0.004320987654320987, 0.00425637329970273, 0.003933956763379512, 0.0029275038882593954, 0.004376787782209773, 0.004695197892671744, 0.003894325176152209, 0.004668869415313892, 0.005243827153079877, 0.0028693889994799234, 0.002871339988110759, 0.003031496471027185, 0.004992525315206047]\n" + ] + } + ], + "source": [ + "violent_arrest_rate_2009 = []\n", + "drug_arrest_rate_2009 = []\n", + "for index, row in merged_df.iterrows():\n", + " drug_arrest_rate_2009.append(row[\"F_DRUGOFF\"] / row[\"total_population\"])\n", + " violent_arrest_rate_2009.append(row[\"VIOLENT\"] / row[\"total_population\"])\n", + "\n", + "print(violent_arrest_rate_2009)\n", + "merged_df[\"drug_arrest_rate_2009\"] = drug_arrest_rate_2009\n", + "merged_df[\"violent_arrest_rate_2009\"] = violent_arrest_rate_2009" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ 
+ "### Exercise 13\n", + "\n", + "Make a scatter plot that shows the relationship between each county's violent arrest rate and it's drug arrest rate. Since we haven't done a lot with plotting yet, feel free to plot in whatever manner feels most comfortable. The easiest, if you're unsure, is just to use the `pandas` inbuilt `.plot()` method. Just specify the `x` keyword with your x-axis variable, the `y` keyword with your y-axis variable, and use `kind=\"scatter\"`." + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Text(0.5, 1.0, 'Violent Arrest Rate vs Drug Arrest Rate')" + ] + }, + "execution_count": 82, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "plt.scatter(violent_arrest_rate_2009, drug_arrest_rate_2009)\n", + "plt.xlabel(\"Violent Arrest Rate\")\n", + "plt.ylabel(\"Drug Arrest Rate\")\n", + "plt.title(\"Violent Arrest Rate vs Drug Arrest Rate\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 14\n", + "\n", + "Based on this simple comparison of 2009 violent arrest rates and drug arrest rates, what might you conclude about the relationship between the illegal drug trade and violent crime?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From this scatterplot, I am to determine a moderately positive relationship between Violent Arrest Rate and Drug Arrest Rate. Where if there is a higher number of violent arrests within the population we will also see higher number of drug arrest within each population of each county. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Comparing with 2018 Arrests\n", + "\n", + "The preceding analysis can tell us about whether violent crime and the drug trade are correlated, but it doesn't tell us much about whether they are *causally* related. It *could* be the case that people dealing drugs *cause* more violent crime, but it could also be that certain communities, for some other reason, tend to have *both* more drug sales *and* more violent crime. \n", + "\n", + "To help answer this question, let's examine whether violent crime arrest rates changed in response to drug legalization. In particular, let's do this by comparing violent crime arrest rates in 2009 (before drug legalization) to violent crime arrest rates in 2018 (after drug legalization). 
If the illegal drug trade causes violent crime, then we would expect the violent crime rate to fall in response to drug legalization.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 15\n", + "\n", + "Just as we created violent arrest rates and drug arrest rates for 2009, now we want to do it for 2018. Using the data on 2018 arrests (also in the [same repository](https://github.com/nickeubank/practicaldatascience/tree/master/Example_Data/ca) we used before) and the same dataset of population data (you'll have to use population from 2013-2017, as 2018 population data has yet to be released), create a dataset of arrest rates. \n", + "\n", + "As before, *be careful with your merges!!!*" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "metadata": {}, + "outputs": [], + "source": [ + "ca_2018_arrests = pd.read_csv(\n", + " \"https://raw.githubusercontent.com/nickeubank/practicaldatascience/master/Example_Data/ca/ca_arrests_2018.csv\"\n", + ")\n", + "nhgis_df_cali_correct = nhgis_df_cali.copy()\n", + "\n", + "nhgis_df_cali_2018 = nhgis_df_cali_correct[nhgis_df_cali_correct[\"YEAR\"] == \"2013-2017\"]\n", + "\n", + "merged_2018_df = pd.merge(\n", + " ca_2018_arrests, nhgis_df_cali_2018, how=\"left\", validate=\"1:m\", indicator=True\n", + ")\n", + "\n", + "merged_2018_df = merged_2018_df.reset_index(drop=True)\n", + "merged_2018_df.sample(5)\n", + "\n", + "violent_arrest_rate_2018 = []\n", + "drug_arrest_rate_2018 = []\n", + "for index, row in merged_2018_df.iterrows():\n", + " drug_arrest_rate_2018.append(row[\"F_DRUGOFF\"] / row[\"total_population\"])\n", + " violent_arrest_rate_2018.append(row[\"VIOLENT\"] / row[\"total_population\"])\n", + "\n", + "merged_2018_df[\"drug_arrest_rate_2018\"] = drug_arrest_rate_2018\n", + "merged_2018_df[\"violent_arrest_rate_2018\"] = violent_arrest_rate_2018" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 16\n", + "\n", + "Now merge our 
two county-level datasets so you have one row for every county, and variables for violent arrest rates in 2018, violent arrest rates in 2009, felony drug arrest rates in 2018, and felony drug arrest rates in 2009. Store the number of observations in your final data set in your `results` dictionary under the key `\"ex16_num_obs\"`." + ] + }, + { + "cell_type": "code", + "execution_count": 96, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
COUNTYVIOLENT_xPROPERTY_xF_DRUGOFF_xF_SEXOFF_xF_ALLOTHER_xF_TOTAL_xM_TOTAL_xS_TOTAL_xYEAR_x...F_ALLOTHER_yF_TOTAL_yM_TOTAL_yS_TOTAL_yYEAR_ySTATE_ytotal_population_y_mergedrug_arrest_rate_2018violent_arrest_rate_2018
0Alameda County431846405749260350218469372474312005-2009...26191103728305822013-2017California1629615.0both0.0006520.002536
1Alpine County84211168302005-2009...3114102013-2017California1203.0both0.0008310.004156
2Amador County10059101519946480122005-2009...14228870112013-2017California37306.0both0.0008310.001930
3Butte County641602542344292248902612005-2009...7412239885312013-2017California225207.0both0.0010170.003486
4Calaveras County21183123147050196832005-2009...9632089702013-2017California45057.0both0.0006440.003263
5Colusa County585028713227591702005-2009...12623675302013-2017California21479.0both0.0002330.003073
6Contra Costa County29763532289518920691166118539502005-2009...248987851622322013-2017California1123678.0both0.0007080.002326
7Del Norte County14410479139743712685NaN...8935515250NaNNaNNaNleft_onlyNaNNaN
8El Dorado County5703994034561420314734542005-2009...4591317350802013-2017California185015.0both0.0005510.002594
9Fresno County437735143508305325314957317103772005-2009...2899996028736382013-2017California971616.0both0.0005710.004442
10Glenn County10478104141334331283732005-2009...117293101852013-2017California27935.0both0.0008230.003580
11Humboldt County50344074754394213874604202005-2009...6161545557752013-2017California135490.0both0.0009890.003631
12Imperial County5997797302844925856739692005-2009...3211176448332013-2017California179957.0both0.0008060.002228
13Inyo County8144393382058511NaN...1012496530NaNNaNNaNleft_onlyNaNNaN
14Kern County429043843320202297515171318285242005-2009...2878976725552702013-2017California878744.0both0.0008430.003883
15Kings County39032227849499153852034152005-2009...74717855545882013-2017California150183.0both0.0005730.004015
16Lake County3492042711730311443095132005-2009...2881015321702013-2017California64095.0both0.0028550.005710
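| 17 | Lassen County | 101 | 62 | 35 | 6 | 83 | 287 | 881 | 12 | 2005-2009 | ... | 91 | 310 | 594 | 4 | 2013-2017 | California | 31470.0 | both | 0.001303 | 0.003654 |
| 18 | Los Angeles County | 35319 | 34630 | 32193 | 2041 | 23524 | 127707 | 238608 | 11049 | 2005-2009 | ... | 18256 | 73463 | 172389 | 347 | 2013-2017 | California | 10105722.0 | both | 0.000634 | 0.002856 |
| 19 | Madera County | 453 | 409 | 260 | 35 | 505 | 1662 | 3366 | 2 | 2005-2009 | ... | 502 | 1428 | 4317 | 6 | 2013-2017 | California | 154440.0 | both | 0.000848 | 0.003639 |
| 20 | Marin County | 500 | 532 | 454 | 29 | 276 | 1791 | 5795 | 178 | 2005-2009 | ... | 252 | 1225 | 5227 | 56 | 2013-2017 | California | 260814.0 | both | 0.000364 | 0.002009 |
| 21 | Mariposa County | 64 | 40 | 46 | 7 | 43 | 200 | 517 | 21 | 2005-2009 | ... | 18 | 99 | 390 | 0 | 2013-2017 | California | 17658.0 | both | 0.000793 | 0.001812 |
| 22 | Mendocino County | 415 | 296 | 551 | 17 | 364 | 1643 | 3654 | 10 | 2005-2009 | ... | 343 | 1063 | 3237 | 3 | 2013-2017 | California | 87497.0 | both | 0.001371 | 0.004354 |
| 23 | Merced County | 1169 | 1167 | 1055 | 78 | 1006 | 4475 | 9637 | 714 | 2005-2009 | ... | 903 | 2673 | 7326 | 432 | 2013-2017 | California | 267390.0 | both | 0.000621 | 0.003848 |
| 24 | Modoc County | 47 | 29 | 10 | 5 | 52 | 143 | 310 | 50 | 2005-2009 | ... | 80 | 180 | 300 | 0 | 2013-2017 | California | 9017.0 | both | 0.001109 | 0.007320 |
| 25 | Mono County | 52 | 38 | 38 | 1 | 30 | 159 | 521 | 0 | 2005-2009 | ... | 6 | 80 | 256 | 0 | 2013-2017 | California | 14058.0 | both | 0.000498 | 0.004126 |
| 26 | Monterey County | 1385 | 1311 | 967 | 136 | 911 | 4710 | 11948 | 304 | 2005-2009 | ... | 878 | 3233 | 9543 | 81 | 2013-2017 | California | 433168.0 | both | 0.000674 | 0.002872 |
| 27 | Napa County | 367 | 400 | 325 | 24 | 392 | 1508 | 3471 | 60 | 2005-2009 | ... | 574 | 1230 | 3197 | 5 | 2013-2017 | California | 141005.0 | both | 0.000596 | 0.002525 |
| 28 | Nevada County | 211 | 207 | 240 | 10 | 108 | 776 | 2319 | 155 | 2005-2009 | ... | 236 | 632 | 2307 | 61 | 2013-2017 | California | 98838.0 | both | 0.000607 | 0.001932 |
| 29 | Orange County | 6145 | 7853 | 7524 | 579 | 4049 | 26150 | 66320 | 1620 | 2005-2009 | ... | 4582 | 17804 | 66910 | 537 | 2013-2017 | California | 3155816.0 | both | 0.000588 | 0.001932 |
| 30 | Placer County | 712 | 1226 | 788 | 51 | 972 | 3749 | 7070 | 150 | 2005-2009 | ... | 988 | 2631 | 6477 | 7 | 2013-2017 | California | 374985.0 | both | 0.000632 | 0.002037 |
| 31 | Plumas County | 100 | 44 | 104 | 2 | 68 | 318 | 867 | 13 | 2005-2009 | ... | 64 | 205 | 683 | 10 | 2013-2017 | California | 18724.0 | both | 0.001389 | 0.004166 |
| 32 | Riverside County | 5825 | 7082 | 5881 | 479 | 4832 | 24099 | 40087 | 2235 | 2005-2009 | ... | 3516 | 13893 | 31557 | 131 | 2013-2017 | California | 2355002.0 | both | 0.000561 | 0.002257 |
| 33 | Sacramento County | 5302 | 5809 | 4492 | 409 | 3854 | 19866 | 32332 | 389 | 2005-2009 | ... | 4020 | 13895 | 23366 | 87 | 2013-2017 | California | 1495400.0 | both | 0.000948 | 0.003400 |
| 34 | San Benito County | 209 | 140 | 111 | 20 | 141 | 621 | 1460 | 27 | 2005-2009 | ... | 120 | 412 | 1286 | 2 | 2013-2017 | California | 58671.0 | both | 0.000784 | 0.003051 |
| 35 | San Bernardino County | 8474 | 10799 | 8793 | 720 | 4916 | 33702 | 56648 | 1119 | 2005-2009 | ... | 4899 | 24063 | 48461 | 1044 | 2013-2017 | California | 2121220.0 | both | 0.001188 | 0.004167 |
| 36 | San Diego County | 9812 | 8598 | 7648 | 579 | 5321 | 31958 | 78405 | 4107 | 2005-2009 | ... | 6763 | 21845 | 62416 | 432 | 2013-2017 | California | 3283665.0 | both | 0.000638 | 0.002611 |
| 37 | San Francisco County | 3629 | 2792 | 6867 | 51 | 3339 | 16678 | 14782 | 9 | 2005-2009 | ... | 3448 | 7499 | 7937 | 0 | 2013-2017 | California | 864263.0 | both | 0.000915 | 0.002377 |
| 38 | San Joaquin County | 3223 | 3136 | 1920 | 187 | 2606 | 11072 | 23552 | 49 | 2005-2009 | ... | 2084 | 7065 | 14031 | 5 | 2013-2017 | California | 724153.0 | both | 0.000692 | 0.004197 |
| 39 | San Luis Obispo County | 609 | 647 | 522 | 94 | 403 | 2275 | 9962 | 67 | 2005-2009 | ... | 420 | 1552 | 9427 | 5 | 2013-2017 | California | 280119.0 | both | 0.000378 | 0.002299 |
| 40 | San Mateo County | 1446 | 1527 | 1487 | 97 | 935 | 5492 | 13119 | 520 | 2005-2009 | ... | 910 | 3491 | 13494 | 43 | 2013-2017 | California | 763450.0 | both | 0.000299 | 0.001853 |
| 41 | Santa Barbara County | 1292 | 1037 | 876 | 77 | 861 | 4143 | 24386 | 366 | 2005-2009 | ... | 899 | 3466 | 14866 | 50 | 2013-2017 | California | 442996.0 | both | 0.001167 | 0.002948 |
| 42 | Santa Clara County | 4309 | 4737 | 4325 | 506 | 3095 | 16972 | 38541 | 1165 | 2005-2009 | ... | 2418 | 10738 | 27169 | 121 | 2013-2017 | California | 1911226.0 | both | 0.000460 | 0.002180 |
| 43 | Santa Cruz County | 753 | 789 | 872 | 43 | 744 | 3201 | 10455 | 115 | 2005-2009 | ... | 770 | 2303 | 8594 | 10 | 2013-2017 | California | 273263.0 | both | 0.000637 | 0.002829 |
| 44 | Shasta County | 464 | 505 | 434 | 36 | 505 | 1944 | 7829 | 422 | 2005-2009 | ... | 998 | 2130 | 7502 | 84 | 2013-2017 | California | 178919.0 | both | 0.000894 | 0.002878 |
| 45 | Sierra County | 14 | 10 | 6 | 3 | 9 | 42 | 124 | 0 | 2005-2009 | ... | 10 | 45 | 67 | 0 | 2013-2017 | California | 2885.0 | both | 0.001040 | 0.003466 |
| 46 | Siskiyou County | 189 | 145 | 180 | 9 | 163 | 686 | 1786 | 8 | 2005-2009 | ... | 226 | 585 | 1643 | 0 | 2013-2017 | California | 43530.0 | both | 0.001631 | 0.004112 |
| 47 | Solano County | 1599 | 1326 | 1422 | 81 | 1027 | 5455 | 10682 | 191 | 2005-2009 | ... | 1160 | 4239 | 10559 | 18 | 2013-2017 | California | 434981.0 | both | 0.001113 | 0.003439 |
| 48 | Sonoma County | 1359 | 1296 | 1682 | 81 | 954 | 5372 | 16767 | 99 | 2005-2009 | ... | 902 | 3536 | 13153 | 9 | 2013-2017 | California | 500943.0 | both | 0.000529 | 0.002944 |
| 49 | Stanislaus County | 2211 | 2595 | 2209 | 111 | 2104 | 9230 | 14369 | 133 | 2005-2009 | ... | 1846 | 6467 | 14008 | 34 | 2013-2017 | California | 535684.0 | both | 0.001650 | 0.004385 |
| 50 | Sutter County | 426 | 358 | 173 | 37 | 194 | 1188 | 2916 | 184 | 2005-2009 | ... | 345 | 1084 | 3314 | 6 | 2013-2017 | California | 95583.0 | both | 0.000554 | 0.004185 |
| 51 | Tehama County | 236 | 272 | 457 | 16 | 189 | 1170 | 2871 | 12 | 2005-2009 | ... | 170 | 570 | 1971 | 2 | 2013-2017 | California | 63247.0 | both | 0.000395 | 0.003305 |
| 52 | Trinity County | 65 | 50 | 63 | 10 | 53 | 241 | 527 | 2 | 2005-2009 | ... | 88 | 264 | 346 | 0 | 2013-2017 | California | 13037.0 | both | 0.005753 | 0.005139 |
| 53 | Tulare County | 2183 | 2080 | 1681 | 170 | 1767 | 7881 | 15368 | 301 | 2005-2009 | ... | 1741 | 6596 | 15163 | 360 | 2013-2017 | California | 458809.0 | both | 0.002029 | 0.005061 |
| 54 | Tuolumne County | 160 | 252 | 209 | 14 | 261 | 896 | 2062 | 83 | 2005-2009 | ... | 311 | 732 | 2019 | 12 | 2013-2017 | California | 53899.0 | both | 0.001614 | 0.003766 |
| 55 | Ventura County | 2275 | 2425 | 2040 | 120 | 2083 | 8943 | 25762 | 1734 | 2005-2009 | ... | 2193 | 6811 | 26612 | 630 | 2013-2017 | California | 847834.0 | both | 0.000863 | 0.002679 |
| 56 | Yolo County | 585 | 634 | 614 | 39 | 662 | 2534 | 5426 | 73 | 2005-2009 | ... | 518 | 1586 | 4602 | 6 | 2013-2017 | California | 212605.0 | both | 0.000564 | 0.002705 |
| 57 | Yuba County | 354 | 368 | 211 | 39 | 257 | 1229 | 2967 | 4 | 2005-2009 | ... | 257 | 1013 | 1942 | 0 | 2013-2017 | California | 74644.0 | both | 0.002197 | 0.005238 |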
\n", + "

58 rows × 29 columns

\n", + "
" + ], + "text/plain": [ + " COUNTY VIOLENT_x PROPERTY_x F_DRUGOFF_x F_SEXOFF_x \\\n", + "0 Alameda County 4318 4640 5749 260 \n", + "1 Alpine County 8 4 2 1 \n", + "2 Amador County 100 59 101 5 \n", + "3 Butte County 641 602 542 34 \n", + "4 Calaveras County 211 83 123 14 \n", + "5 Colusa County 58 50 28 7 \n", + "6 Contra Costa County 2976 3532 2895 189 \n", + "7 Del Norte County 144 104 79 13 \n", + "8 El Dorado County 570 399 403 45 \n", + "9 Fresno County 4377 3514 3508 305 \n", + "10 Glenn County 104 78 104 14 \n", + "11 Humboldt County 503 440 747 54 \n", + "12 Imperial County 599 779 730 28 \n", + "13 Inyo County 81 44 39 3 \n", + "14 Kern County 4290 4384 3320 202 \n", + "15 Kings County 390 322 278 49 \n", + "16 Lake County 349 204 271 17 \n", + "17 Lassen County 101 62 35 6 \n", + "18 Los Angeles County 35319 34630 32193 2041 \n", + "19 Madera County 453 409 260 35 \n", + "20 Marin County 500 532 454 29 \n", + "21 Mariposa County 64 40 46 7 \n", + "22 Mendocino County 415 296 551 17 \n", + "23 Merced County 1169 1167 1055 78 \n", + "24 Modoc County 47 29 10 5 \n", + "25 Mono County 52 38 38 1 \n", + "26 Monterey County 1385 1311 967 136 \n", + "27 Napa County 367 400 325 24 \n", + "28 Nevada County 211 207 240 10 \n", + "29 Orange County 6145 7853 7524 579 \n", + "30 Placer County 712 1226 788 51 \n", + "31 Plumas County 100 44 104 2 \n", + "32 Riverside County 5825 7082 5881 479 \n", + "33 Sacramento County 5302 5809 4492 409 \n", + "34 San Benito County 209 140 111 20 \n", + "35 San Bernardino County 8474 10799 8793 720 \n", + "36 San Diego County 9812 8598 7648 579 \n", + "37 San Francisco County 3629 2792 6867 51 \n", + "38 San Joaquin County 3223 3136 1920 187 \n", + "39 San Luis Obispo County 609 647 522 94 \n", + "40 San Mateo County 1446 1527 1487 97 \n", + "41 Santa Barbara County 1292 1037 876 77 \n", + "42 Santa Clara County 4309 4737 4325 506 \n", + "43 Santa Cruz County 753 789 872 43 \n", + "44 Shasta County 464 505 434 36 \n", + "45 Sierra 
County 14 10 6 3 \n", + "46 Siskiyou County 189 145 180 9 \n", + "47 Solano County 1599 1326 1422 81 \n", + "48 Sonoma County 1359 1296 1682 81 \n", + "49 Stanislaus County 2211 2595 2209 111 \n", + "50 Sutter County 426 358 173 37 \n", + "51 Tehama County 236 272 457 16 \n", + "52 Trinity County 65 50 63 10 \n", + "53 Tulare County 2183 2080 1681 170 \n", + "54 Tuolumne County 160 252 209 14 \n", + "55 Ventura County 2275 2425 2040 120 \n", + "56 Yolo County 585 634 614 39 \n", + "57 Yuba County 354 368 211 39 \n", + "\n", + " F_ALLOTHER_x F_TOTAL_x M_TOTAL_x S_TOTAL_x YEAR_x ... \\\n", + "0 3502 18469 37247 431 2005-2009 ... \n", + "1 1 16 83 0 2005-2009 ... \n", + "2 199 464 801 2 2005-2009 ... \n", + "3 429 2248 9026 1 2005-2009 ... \n", + "4 70 501 968 3 2005-2009 ... \n", + "5 132 275 917 0 2005-2009 ... \n", + "6 2069 11661 18539 50 2005-2009 ... \n", + "7 97 437 1268 5 NaN ... \n", + "8 614 2031 4734 54 2005-2009 ... \n", + "9 3253 14957 31710 377 2005-2009 ... \n", + "10 133 433 1283 73 2005-2009 ... \n", + "11 394 2138 7460 420 2005-2009 ... \n", + "12 449 2585 6739 69 2005-2009 ... \n", + "13 38 205 851 1 NaN ... \n", + "14 2975 15171 31828 524 2005-2009 ... \n", + "15 499 1538 5203 415 2005-2009 ... \n", + "16 303 1144 3095 13 2005-2009 ... \n", + "17 83 287 881 12 2005-2009 ... \n", + "18 23524 127707 238608 11049 2005-2009 ... \n", + "19 505 1662 3366 2 2005-2009 ... \n", + "20 276 1791 5795 178 2005-2009 ... \n", + "21 43 200 517 21 2005-2009 ... \n", + "22 364 1643 3654 10 2005-2009 ... \n", + "23 1006 4475 9637 714 2005-2009 ... \n", + "24 52 143 310 50 2005-2009 ... \n", + "25 30 159 521 0 2005-2009 ... \n", + "26 911 4710 11948 304 2005-2009 ... \n", + "27 392 1508 3471 60 2005-2009 ... \n", + "28 108 776 2319 155 2005-2009 ... \n", + "29 4049 26150 66320 1620 2005-2009 ... \n", + "30 972 3749 7070 150 2005-2009 ... \n", + "31 68 318 867 13 2005-2009 ... \n", + "32 4832 24099 40087 2235 2005-2009 ... \n", + "33 3854 19866 32332 389 2005-2009 ... 
\n", + "34 141 621 1460 27 2005-2009 ... \n", + "35 4916 33702 56648 1119 2005-2009 ... \n", + "36 5321 31958 78405 4107 2005-2009 ... \n", + "37 3339 16678 14782 9 2005-2009 ... \n", + "38 2606 11072 23552 49 2005-2009 ... \n", + "39 403 2275 9962 67 2005-2009 ... \n", + "40 935 5492 13119 520 2005-2009 ... \n", + "41 861 4143 24386 366 2005-2009 ... \n", + "42 3095 16972 38541 1165 2005-2009 ... \n", + "43 744 3201 10455 115 2005-2009 ... \n", + "44 505 1944 7829 422 2005-2009 ... \n", + "45 9 42 124 0 2005-2009 ... \n", + "46 163 686 1786 8 2005-2009 ... \n", + "47 1027 5455 10682 191 2005-2009 ... \n", + "48 954 5372 16767 99 2005-2009 ... \n", + "49 2104 9230 14369 133 2005-2009 ... \n", + "50 194 1188 2916 184 2005-2009 ... \n", + "51 189 1170 2871 12 2005-2009 ... \n", + "52 53 241 527 2 2005-2009 ... \n", + "53 1767 7881 15368 301 2005-2009 ... \n", + "54 261 896 2062 83 2005-2009 ... \n", + "55 2083 8943 25762 1734 2005-2009 ... \n", + "56 662 2534 5426 73 2005-2009 ... \n", + "57 257 1229 2967 4 2005-2009 ... 
\n", + "\n", + " F_ALLOTHER_y F_TOTAL_y M_TOTAL_y S_TOTAL_y YEAR_y STATE_y \\\n", + "0 2619 11037 28305 82 2013-2017 California \n", + "1 3 11 41 0 2013-2017 California \n", + "2 142 288 701 1 2013-2017 California \n", + "3 741 2239 8853 1 2013-2017 California \n", + "4 96 320 897 0 2013-2017 California \n", + "5 126 236 753 0 2013-2017 California \n", + "6 2489 8785 16223 2 2013-2017 California \n", + "7 89 355 1525 0 NaN NaN \n", + "8 459 1317 3508 0 2013-2017 California \n", + "9 2899 9960 28736 38 2013-2017 California \n", + "10 117 293 1018 5 2013-2017 California \n", + "11 616 1545 5577 5 2013-2017 California \n", + "12 321 1176 4483 3 2013-2017 California \n", + "13 101 249 653 0 NaN NaN \n", + "14 2878 9767 25552 70 2013-2017 California \n", + "15 747 1785 5545 88 2013-2017 California \n", + "16 288 1015 3217 0 2013-2017 California \n", + "17 91 310 594 4 2013-2017 California \n", + "18 18256 73463 172389 347 2013-2017 California \n", + "19 502 1428 4317 6 2013-2017 California \n", + "20 252 1225 5227 56 2013-2017 California \n", + "21 18 99 390 0 2013-2017 California \n", + "22 343 1063 3237 3 2013-2017 California \n", + "23 903 2673 7326 432 2013-2017 California \n", + "24 80 180 300 0 2013-2017 California \n", + "25 6 80 256 0 2013-2017 California \n", + "26 878 3233 9543 81 2013-2017 California \n", + "27 574 1230 3197 5 2013-2017 California \n", + "28 236 632 2307 61 2013-2017 California \n", + "29 4582 17804 66910 537 2013-2017 California \n", + "30 988 2631 6477 7 2013-2017 California \n", + "31 64 205 683 10 2013-2017 California \n", + "32 3516 13893 31557 131 2013-2017 California \n", + "33 4020 13895 23366 87 2013-2017 California \n", + "34 120 412 1286 2 2013-2017 California \n", + "35 4899 24063 48461 1044 2013-2017 California \n", + "36 6763 21845 62416 432 2013-2017 California \n", + "37 3448 7499 7937 0 2013-2017 California \n", + "38 2084 7065 14031 5 2013-2017 California \n", + "39 420 1552 9427 5 2013-2017 California \n", + "40 910 3491 
13494 43 2013-2017 California \n", + "41 899 3466 14866 50 2013-2017 California \n", + "42 2418 10738 27169 121 2013-2017 California \n", + "43 770 2303 8594 10 2013-2017 California \n", + "44 998 2130 7502 84 2013-2017 California \n", + "45 10 45 67 0 2013-2017 California \n", + "46 226 585 1643 0 2013-2017 California \n", + "47 1160 4239 10559 18 2013-2017 California \n", + "48 902 3536 13153 9 2013-2017 California \n", + "49 1846 6467 14008 34 2013-2017 California \n", + "50 345 1084 3314 6 2013-2017 California \n", + "51 170 570 1971 2 2013-2017 California \n", + "52 88 264 346 0 2013-2017 California \n", + "53 1741 6596 15163 360 2013-2017 California \n", + "54 311 732 2019 12 2013-2017 California \n", + "55 2193 6811 26612 630 2013-2017 California \n", + "56 518 1586 4602 6 2013-2017 California \n", + "57 257 1013 1942 0 2013-2017 California \n", + "\n", + " total_population_y _merge drug_arrest_rate_2018 \\\n", + "0 1629615.0 both 0.000652 \n", + "1 1203.0 both 0.000831 \n", + "2 37306.0 both 0.000831 \n", + "3 225207.0 both 0.001017 \n", + "4 45057.0 both 0.000644 \n", + "5 21479.0 both 0.000233 \n", + "6 1123678.0 both 0.000708 \n", + "7 NaN left_only NaN \n", + "8 185015.0 both 0.000551 \n", + "9 971616.0 both 0.000571 \n", + "10 27935.0 both 0.000823 \n", + "11 135490.0 both 0.000989 \n", + "12 179957.0 both 0.000806 \n", + "13 NaN left_only NaN \n", + "14 878744.0 both 0.000843 \n", + "15 150183.0 both 0.000573 \n", + "16 64095.0 both 0.002855 \n", + "17 31470.0 both 0.001303 \n", + "18 10105722.0 both 0.000634 \n", + "19 154440.0 both 0.000848 \n", + "20 260814.0 both 0.000364 \n", + "21 17658.0 both 0.000793 \n", + "22 87497.0 both 0.001371 \n", + "23 267390.0 both 0.000621 \n", + "24 9017.0 both 0.001109 \n", + "25 14058.0 both 0.000498 \n", + "26 433168.0 both 0.000674 \n", + "27 141005.0 both 0.000596 \n", + "28 98838.0 both 0.000607 \n", + "29 3155816.0 both 0.000588 \n", + "30 374985.0 both 0.000632 \n", + "31 18724.0 both 0.001389 \n", + "32 
2355002.0 both 0.000561 \n", + "33 1495400.0 both 0.000948 \n", + "34 58671.0 both 0.000784 \n", + "35 2121220.0 both 0.001188 \n", + "36 3283665.0 both 0.000638 \n", + "37 864263.0 both 0.000915 \n", + "38 724153.0 both 0.000692 \n", + "39 280119.0 both 0.000378 \n", + "40 763450.0 both 0.000299 \n", + "41 442996.0 both 0.001167 \n", + "42 1911226.0 both 0.000460 \n", + "43 273263.0 both 0.000637 \n", + "44 178919.0 both 0.000894 \n", + "45 2885.0 both 0.001040 \n", + "46 43530.0 both 0.001631 \n", + "47 434981.0 both 0.001113 \n", + "48 500943.0 both 0.000529 \n", + "49 535684.0 both 0.001650 \n", + "50 95583.0 both 0.000554 \n", + "51 63247.0 both 0.000395 \n", + "52 13037.0 both 0.005753 \n", + "53 458809.0 both 0.002029 \n", + "54 53899.0 both 0.001614 \n", + "55 847834.0 both 0.000863 \n", + "56 212605.0 both 0.000564 \n", + "57 74644.0 both 0.002197 \n", + "\n", + " violent_arrest_rate_2018 \n", + "0 0.002536 \n", + "1 0.004156 \n", + "2 0.001930 \n", + "3 0.003486 \n", + "4 0.003263 \n", + "5 0.003073 \n", + "6 0.002326 \n", + "7 NaN \n", + "8 0.002594 \n", + "9 0.004442 \n", + "10 0.003580 \n", + "11 0.003631 \n", + "12 0.002228 \n", + "13 NaN \n", + "14 0.003883 \n", + "15 0.004015 \n", + "16 0.005710 \n", + "17 0.003654 \n", + "18 0.002856 \n", + "19 0.003639 \n", + "20 0.002009 \n", + "21 0.001812 \n", + "22 0.004354 \n", + "23 0.003848 \n", + "24 0.007320 \n", + "25 0.004126 \n", + "26 0.002872 \n", + "27 0.002525 \n", + "28 0.001932 \n", + "29 0.001932 \n", + "30 0.002037 \n", + "31 0.004166 \n", + "32 0.002257 \n", + "33 0.003400 \n", + "34 0.003051 \n", + "35 0.004167 \n", + "36 0.002611 \n", + "37 0.002377 \n", + "38 0.004197 \n", + "39 0.002299 \n", + "40 0.001853 \n", + "41 0.002948 \n", + "42 0.002180 \n", + "43 0.002829 \n", + "44 0.002878 \n", + "45 0.003466 \n", + "46 0.004112 \n", + "47 0.003439 \n", + "48 0.002944 \n", + "49 0.004385 \n", + "50 0.004185 \n", + "51 0.003305 \n", + "52 0.005139 \n", + "53 0.005061 \n", + "54 0.003766 \n", + 
"55 0.002679 \n", + "56 0.002705 \n", + "57 0.005238 \n", + "\n", + "[58 rows x 29 columns]" + ] + }, + "execution_count": 96, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "merged_2018_2009_df = pd.merge(merged_df, merged_2018_df, how=\"outer\", on=\"COUNTY\")\n", + "merged_2018_2009_df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 17\n", + "\n", + "Did drug arrests go down from 2009 to 2018 in response to drug legalization? (they sure better! This is what's called a \"sanity check\" of your data and analysis. If you find drug arrests went *up*, you know something went wrong with your code or your understanding of the situations). \n", + "\n", + "Store the average county-level change in drug arrests per capita in `results` under the key `\"ex17_drug_change\"`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 18\n", + "\n", + "Now we want to look at whether violent crime decreased following drug legalization. Did the average violent arrest rate decrease? By how much? (Note: We're assuming that arrest rates are proportionate to crime rates. If policing increased so that there were more arrests per crime committed, that would impact our interpretation of these results. But this is just an exercise, so...)\n", + "\n", + "Store the average county-level change in violent arrests per capita in `results` under the key `\"ex18_violent_change\"`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 19\n", + "\n", + "Based on your answers to exercises 17 and 18, what might you conclude about the relationship between the illegal drug trade and violent crime? Did legalizing drugs increase violent crime (assuming arrest rates are a good proxy for crime rates)? Decrease violent crime? Have no effect? 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Difference in Difference Analysis\n", + "\n", + "The preceding analysis is something we sometimes call a \"pre-post\" analysis, in that it is a comparison of how and outcome we care about (violent arrest rates) changes from before a treatment is introduced (\"pre\") to after (\"post\"). BUT: pre-post comparisons are imperfect. If we knew that violent crime was not going to change at all in a world without drug legalization, then this comparison is perfectly valid. But what if, absent drug legalization, violent crime would have fallen on its own (maybe because of advances in policing practices or a better economy)? Or maybe it would have increased?\n", + "\n", + "This is actually a very common problem. For example, imagine you're trying to figure out whether taking tylenol helps with headaches. You have a patient with a headache, you give them tylenol, and then the next day you ask them if they still have a headache, and find out that they don't — does that mean that tylenol cured the headache? Maybe... but most headaches eventually resolve on their own, so maybe the headache would have passed with or without the patient taking tylenol! In fact, there's a term for this phenomenon in medicine — the \"natural history\" of a disease, which is the trajectory that we think a disease might follow absent treatment. And the natural history of the disease is almost never for it to stay exactly the same indefinitely.\n", + "\n", + "(All of this is closely related to the discipline of causal inference, and if it makes your head to hurt, don't worry — that means you're doing it right! We will talk lots and lots more about it in the weeks and months to come.)\n", + "\n", + "One way to try to overcome this problem is with something called a difference-in-difference analysis. 
Rather than just looking at whether violent arrest rates increase or decrease between 2009 and 2018, we can split our sample of counties into those that were *more* impacted by drug legalization and those that were *less* impacted by drug legalization and evaluate whether we see a greater change in the violent arrest rate in the counties that were more impacted. \n", + "\n", + "What does it mean to have been \"more impacted\" by drug legalization? In this case, we can treat the counties that had higher drug arrest rates in 2009 as counties that were more impacted by drug legalization than those that had low drug arrest rates in 2009. After all, in a county that had no drug arrests, legalization wouldn't do anything, would it? \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Exercise 20\n", + "\n", + "First, split our sample into two groups: high drug arrests in 2009, and low drug arrests in 2009 (cut the sample at the average drug arrest rate in 2009). \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Exercise 21\n", + "\n", + "Now, determine whether violent crime changed *more* from 2009 to 2018 in the counties that had lots of drug arrests in 2009 (where legalization likely had more of an effect) than in counties with fewer drug arrests in 2009 (where legalization likely mattered less). \n", + "\n", + "Calculate this difference-in-difference:\n", + "\n", + "```\n", + "(the change in violent crime rate per capita for counties with lots of drug arrests in 2009) \n", + "- (the change in violent crime rate per capita for counties with few drug arrests in 2009)\n", + "\n", + "```\n", + "\n", + "Store your \"difference-in-difference\" estimate in your `results` dictionary under the key `\"ex21_diffindiff\"`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 22\n", + "\n", + "Interpret your difference in difference result."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 23\n", + "\n", + "The quantity we estimated above is a little difficult to interpret. Rather than calculating the *absolute* change in violent arrest rates per capita, let's calculate the *proportionate* change.\n", + "\n", + "Calculate:\n", + "\n", + "```\n", + "(the county-level percentage change in violent crime rate for counties with lots of drug arrests in 2009) \n", + "- (the county-level percentage change in violent crime rate for counties with few drug arrests in 2009)\n", + "```\n", + "Store your \"difference-in-difference\" estimate in your `results` dictionary under the key `\"ex24_diffindiff_proportionate\"`. Report your result in percentages, such that a value of `-100` would imply that the average county experienced a 100% decrease in the violent arrest rate." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.10.6 ('base')", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + }, + "vscode": { + "interpreter": { + "hash": "718fed28bf9f8c7851519acf2fb923cd655120b36de3b67253eeb0428bd33d2d" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/jupyter_exercise_materials/Jupyter_First.pdf b/jupyter_exercise_materials/Jupyter_First.pdf new file mode 100644 index 0000000..a8f8a3e Binary files /dev/null and b/jupyter_exercise_materials/Jupyter_First.pdf differ diff --git a/jupyter_exercise_materials/__pycache__/analyze_health_and_income.cpython-311.pyc b/jupyter_exercise_materials/__pycache__/analyze_health_and_income.cpython-311.pyc new file mode 100644 index 0000000..3e9873b Binary files /dev/null and b/jupyter_exercise_materials/__pycache__/analyze_health_and_income.cpython-311.pyc differ diff --git 
a/jupyter_exercise_materials/analyze_health_and_income.R b/jupyter_exercise_materials/analyze_health_and_income.R new file mode 100644 index 0000000..580c86f --- /dev/null +++ b/jupyter_exercise_materials/analyze_health_and_income.R @@ -0,0 +1,26 @@ +###################### +# +# Import World Development Indicators +# and look at the relationship between income +# and health outcomes across countries +# +###################### + +# Download World Development Indicators +wdi <- read.csv("https://media.githubusercontent.com/media/nickeubank/MIDS_Data/master/World_Development_Indicators/wdi_small_tidy_2015.csv") + +# Get Mortality and GDP per capita for 2015 +wdi$loggdppercap <- log(wdi[["GDP.per.capita..constant.2010.US.."]]) + +# Plot +library(ggplot2) +ggplot( + wdi, + aes( + x = loggdppercap, + y = Mortality.rate..under.5..per.1.000.live.births. + ) +) + + geom_point() + + geom_label(aes(label = Country.Name)) + + geom_smooth() diff --git a/jupyter_exercise_materials/analyze_health_and_income.py b/jupyter_exercise_materials/analyze_health_and_income.py new file mode 100644 index 0000000..a998798 --- /dev/null +++ b/jupyter_exercise_materials/analyze_health_and_income.py @@ -0,0 +1,39 @@ +###################### +# +# Import World Development Indicators +# and look at the relationship between income +# and health outcomes across countries +# +###################### + +import pandas as pd +import numpy as np + +# Download World Development Indicators +wdi = pd.read_csv( + "https://media.githubusercontent.com/" + "media/nickeubank/MIDS_Data/" + "master/World_Development_Indicators/wdi_small_tidy_2015.csv" +) + +# GDP Per Capita has a REALLY long right tail, so we want to log it for readability. 
+wdi["Log GDP Per Capita"] = np.log(wdi["GDP per capita (constant 2010 US$)"]) + +# Plot +import seaborn.objects as so +import seaborn as sns +from matplotlib import style + +my_chart = ( + so.Plot( + wdi, x="Log GDP Per Capita", y="Mortality rate, under-5 (per 1,000 live births)" + ) + .add(so.Line(), so.PolyFit(order=2)) + .add(so.Dot()) + .label(title="Log GDP and Under-5 Mortality") + .theme({**style.library["seaborn-whitegrid"]}) +) + +my_chart + +print("Done!") diff --git a/jupyter_exercise_materials/first_jupyter_notebook.html b/jupyter_exercise_materials/first_jupyter_notebook.html new file mode 100644 index 0000000..f606625 --- /dev/null +++ b/jupyter_exercise_materials/first_jupyter_notebook.html @@ -0,0 +1,7877 @@ + + + + + +first_jupyter_notebook + + + + + + + + + + + + +
+ + diff --git a/jupyter_exercise_materials/first_jupyter_notebook.ipynb b/jupyter_exercise_materials/first_jupyter_notebook.ipynb new file mode 100644 index 0000000..e8f6715 --- /dev/null +++ b/jupyter_exercise_materials/first_jupyter_notebook.ipynb @@ -0,0 +1,533 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# National Income and Infant Mortality\n", + "\n", + "In this Jupyter Notebook, we'll analyze the relationship between a country's GDP per capita (a measure of average income per person) and infant mortality (in particular, the number out of every 1,000 children born who do not reach their fifth birthday). " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data\n", + "\n", + "Data for this analysis comes from the World Bank's *World Development Indicators* database. " + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "\n", + "# Download World Development Indicators\n", + "wdi = pd.read_csv(\n", + " \"https://media.githubusercontent.com/\"\n", + " \"media/nickeubank/MIDS_Data/\"\n", + " \"master/World_Development_Indicators/wdi_small_tidy_2015.csv\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's quickly look at our data. The next command just shows us the first 5 rows of our data (we'll spend more time on these tools in later lessons). Note also that you can scroll right to see more columns!" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| | Country Name | Adolescent fertility rate (births per 1,000 women ages 15-19) | Antiretroviral therapy coverage for PMTCT (% of pregnant women living with HIV) | Battle-related deaths (number of people) | CPIA building human resources rating (1=low to 6=high) | CPIA business regulatory environment rating (1=low to 6=high) | CPIA debt policy rating (1=low to 6=high) | CPIA economic management cluster average (1=low to 6=high) | CPIA efficiency of revenue mobilization rating (1=low to 6=high) | CPIA equity of public resource use rating (1=low to 6=high) | ... | Women participating in the three decisions (own health care, major household purchases, and visiting family) (% of women age 15-49) | Women who believe a husband is justified in beating his wife (any of five reasons) (%) | Women who believe a husband is justified in beating his wife when she argues with him (%) | Women who believe a husband is justified in beating his wife when she burns the food (%) | Women who believe a husband is justified in beating his wife when she goes out without telling him (%) | Women who believe a husband is justified in beating his wife when she neglects the children (%) | Women who believe a husband is justified in beating his wife when she refuses sex with him (%) | Women who were first married by age 15 (% of women ages 20-24) | Women who were first married by age 18 (% of women ages 20-24) | Women's share of population ages 15+ living with HIV (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Afghanistan | 73.1264 | NaN | 17273.0 | 3.5 | 2.5 | 3.0 | 3.0 | 3.0 | 3.0 | ... | 32.6 | 80.2 | 59.2 | 18.2 | 66.9 | 48.4 | 33.4 | 8.8 | 34.8 | NaN |
| 1 | Albania | 20.6922 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 30.3 |
| 2 | Algeria | 10.7052 | 28.0 | 110.0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 44.8 |
| 3 | American Samoa | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Andorra | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
\n", + "

5 rows × 129 columns

\n", + "
" + ], + "text/plain": [ + " Country Name \\\n", + "0 Afghanistan \n", + "1 Albania \n", + "2 Algeria \n", + "3 American Samoa \n", + "4 Andorra \n", + "\n", + " Adolescent fertility rate (births per 1,000 women ages 15-19) \\\n", + "0 73.1264 \n", + "1 20.6922 \n", + "2 10.7052 \n", + "3 NaN \n", + "4 NaN \n", + "\n", + " Antiretroviral therapy coverage for PMTCT (% of pregnant women living with HIV) \\\n", + "0 NaN \n", + "1 NaN \n", + "2 28.0 \n", + "3 NaN \n", + "4 NaN \n", + "\n", + " Battle-related deaths (number of people) \\\n", + "0 17273.0 \n", + "1 NaN \n", + "2 110.0 \n", + "3 NaN \n", + "4 NaN \n", + "\n", + " CPIA building human resources rating (1=low to 6=high) \\\n", + "0 3.5 \n", + "1 NaN \n", + "2 NaN \n", + "3 NaN \n", + "4 NaN \n", + "\n", + " CPIA business regulatory environment rating (1=low to 6=high) \\\n", + "0 2.5 \n", + "1 NaN \n", + "2 NaN \n", + "3 NaN \n", + "4 NaN \n", + "\n", + " CPIA debt policy rating (1=low to 6=high) \\\n", + "0 3.0 \n", + "1 NaN \n", + "2 NaN \n", + "3 NaN \n", + "4 NaN \n", + "\n", + " CPIA economic management cluster average (1=low to 6=high) \\\n", + "0 3.0 \n", + "1 NaN \n", + "2 NaN \n", + "3 NaN \n", + "4 NaN \n", + "\n", + " CPIA efficiency of revenue mobilization rating (1=low to 6=high) \\\n", + "0 3.0 \n", + "1 NaN \n", + "2 NaN \n", + "3 NaN \n", + "4 NaN \n", + "\n", + " CPIA equity of public resource use rating (1=low to 6=high) ... \\\n", + "0 3.0 ... \n", + "1 NaN ... \n", + "2 NaN ... \n", + "3 NaN ... \n", + "4 NaN ... 
\n", + "\n", + " Women participating in the three decisions (own health care, major household purchases, and visiting family) (% of women age 15-49) \\\n", + "0 32.6 \n", + "1 NaN \n", + "2 NaN \n", + "3 NaN \n", + "4 NaN \n", + "\n", + " Women who believe a husband is justified in beating his wife (any of five reasons) (%) \\\n", + "0 80.2 \n", + "1 NaN \n", + "2 NaN \n", + "3 NaN \n", + "4 NaN \n", + "\n", + " Women who believe a husband is justified in beating his wife when she argues with him (%) \\\n", + "0 59.2 \n", + "1 NaN \n", + "2 NaN \n", + "3 NaN \n", + "4 NaN \n", + "\n", + " Women who believe a husband is justified in beating his wife when she burns the food (%) \\\n", + "0 18.2 \n", + "1 NaN \n", + "2 NaN \n", + "3 NaN \n", + "4 NaN \n", + "\n", + " Women who believe a husband is justified in beating his wife when she goes out without telling him (%) \\\n", + "0 66.9 \n", + "1 NaN \n", + "2 NaN \n", + "3 NaN \n", + "4 NaN \n", + "\n", + " Women who believe a husband is justified in beating his wife when she neglects the children (%) \\\n", + "0 48.4 \n", + "1 NaN \n", + "2 NaN \n", + "3 NaN \n", + "4 NaN \n", + "\n", + " Women who believe a husband is justified in beating his wife when she refuses sex with him (%) \\\n", + "0 33.4 \n", + "1 NaN \n", + "2 NaN \n", + "3 NaN \n", + "4 NaN \n", + "\n", + " Women who were first married by age 15 (% of women ages 20-24) \\\n", + "0 8.8 \n", + "1 NaN \n", + "2 NaN \n", + "3 NaN \n", + "4 NaN \n", + "\n", + " Women who were first married by age 18 (% of women ages 20-24) \\\n", + "0 34.8 \n", + "1 NaN \n", + "2 NaN \n", + "3 NaN \n", + "4 NaN \n", + "\n", + " Women's share of population ages 15+ living with HIV (%) \n", + "0 NaN \n", + "1 30.3 \n", + "2 44.8 \n", + "3 NaN \n", + "4 NaN \n", + "\n", + "[5 rows x 129 columns]" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "wdi.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + 
"metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['Country Name',\n", + " 'Adolescent fertility rate (births per 1,000 women ages 15-19)',\n", + " 'Antiretroviral therapy coverage for PMTCT (% of pregnant women living with HIV)',\n", + " 'Battle-related deaths (number of people)',\n", + " 'CPIA building human resources rating (1=low to 6=high)',\n", + " 'CPIA business regulatory environment rating (1=low to 6=high)',\n", + " 'CPIA debt policy rating (1=low to 6=high)',\n", + " 'CPIA economic management cluster average (1=low to 6=high)',\n", + " 'CPIA efficiency of revenue mobilization rating (1=low to 6=high)',\n", + " 'CPIA equity of public resource use rating (1=low to 6=high)',\n", + " ...\n", + " 'Women participating in the three decisions (own health care, major household purchases, and visiting family) (% of women age 15-49)',\n", + " 'Women who believe a husband is justified in beating his wife (any of five reasons) (%)',\n", + " 'Women who believe a husband is justified in beating his wife when she argues with him (%)',\n", + " 'Women who believe a husband is justified in beating his wife when she burns the food (%)',\n", + " 'Women who believe a husband is justified in beating his wife when she goes out without telling him (%)',\n", + " 'Women who believe a husband is justified in beating his wife when she neglects the children (%)',\n", + " 'Women who believe a husband is justified in beating his wife when she refuses sex with him (%)',\n", + " 'Women who were first married by age 15 (% of women ages 20-24)',\n", + " 'Women who were first married by age 18 (% of women ages 20-24)',\n", + " 'Women's share of population ages 15+ living with HIV (%)'],\n", + " dtype='object', length=129)" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Now we can just print out the column names:\n", + "wdi.columns" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ 
+ "## Visualizing the Relationship between Log GDP Per Capita and Infant Mortality\n", + "\n", + "[Now it's your turn! insert the plot from `analyze_health_and_income.py` here and make any required changes to make it work]" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Done!\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "c:\\Users\\Simrun Sharma\\miniconda3\\Lib\\site-packages\\seaborn\\_core\\rules.py:72: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead\n", + " if pd.api.types.is_categorical_dtype(vector):\n", + "c:\\Users\\Simrun Sharma\\miniconda3\\Lib\\site-packages\\seaborn\\_core\\rules.py:72: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead\n", + " if pd.api.types.is_categorical_dtype(vector):\n", + "c:\\Users\\Simrun Sharma\\miniconda3\\Lib\\site-packages\\seaborn\\_core\\plot.py:1491: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.\n", + " with pd.option_context(\"mode.use_inf_as_na\", True):\n", + "c:\\Users\\Simrun Sharma\\miniconda3\\Lib\\site-packages\\seaborn\\_core\\plot.py:1491: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. 
Convert inf values to NaN before operating instead.\n", + " with pd.option_context(\"mode.use_inf_as_na\", True):\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "" + ] + }, + "execution_count": 4, + "metadata": { + "image/png": { + "height": 378.25, + "width": 509.15 + } + }, + "output_type": "execute_result" + } + ], + "source": [ + "import analyze_health_and_income as ahi\n", + "\n", + "ahi.my_chart" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Tell Me Something Cool You've Learned\n", + "\n", + "Write me a little markdown cell (with some fun formatting!) telling me something you saw in the plot you didn't expect." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Some countries have a high log GDP per capita but still have a high infant mortality rate. This is not what I expected: I assumed that a higher GDP per capita would correlate strongly with a lower infant mortality rate. One possible explanation is that, despite a high GDP per capita, some countries do not allocate resources evenly, leaving some impoverished communities with a high infant mortality rate. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Export and Send Your Notebook to Me!\n", + "\n", + "When you are finished, the next step is to export this notebook. \n", + "\n", + "In my experience, the best way to do this is as follows: \n", + "\n", + "1. Along the top of your notebook, select \"Export\" (it might be in the three-dot menu).\n", + "2. Choose \"HTML\" and **save it next to this notebook file**. This is important because any images in the HTML have relative file paths that are set up to be in reference to the location of your notebook, so if you save it somewhere else, when you open it you may lose all your images. \n", + "3. Open that HTML in your normal web browser (Chrome, Firefox, etc), **not** in VS Code.\n", + "4. 
Print the page to PDF.\n", + "\n", + "Why do this instead of choosing the PDF option when exporting? If you try to export a notebook directly to PDF, VS Code will try to use a tool to convert it to a LaTeX document, compile that LaTeX document, and then print it. Getting this setup right can be a pain, and the LaTeX conversion often causes problems. So while a little convoluted, that's my recommendation. \n", + "\n", + "Now that you have a PDF, please upload it to Gradescope!" + ] + } + ], + "metadata": { + "interpreter": { + "hash": "f06fa9c80cc08d4d343f66ad24a278ad0285590eac640a80c32c9d748f33a802" + }, + "kernelspec": { + "display_name": "Python 3.9.6 64-bit ('base': conda)", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}