allow two ways to complete word_count (4b), spelling corrections
joncbates committed Jun 7, 2015
1 parent 5386db3 commit b1abc03
Showing 2 changed files with 12 additions and 7 deletions.
15 changes: 10 additions & 5 deletions lab1_word_count_student.ipynb
@@ -568,7 +568,7 @@
 "  + #### All punctuation should be removed.\n",
 "  + #### Any leading or trailing spaces on a line should be removed.\n",
 "  \n",
-"#### Define the function `removePunctuation` that converts all text to lower case, removes leading and trailing spaces, and removes any punctuation. Use the Python [re](https://docs.python.org/2/library/re.html) module to remove any text that is not a letter, number, or space. Reading `help(re.sub)` might be useful."
+"#### Define the function `removePunctuation` that converts all text to lower case, removes any punctuation, and removes leading and trailing spaces. Use the Python [re](https://docs.python.org/2/library/re.html) module to remove any text that is not a letter, number, or space. Reading `help(re.sub)` might be useful."
 ]
 },
 {
@@ -582,11 +582,12 @@
 "# TODO: Replace <FILL IN> with appropriate code\n",
 "import re\n",
 "def removePunctuation(text):\n",
-"    \"\"\"Removes punctuation, changes to lowercase, and strips leading and trailing spaces.\n",
+"    \"\"\"Removes punctuation, changes to lower case, and strips leading and trailing spaces.\n",
 "\n",
 "    Note:\n",
 "        Only spaces, letters, and numbers should be retained. Other characters should be\n",
-"        eliminated. (e.g. it's becomes its)\n",
+"        eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed after\n",
+"        punctuation is removed.\n",
 "\n",
 "    Args:\n",
 "        text (str): A string.\n",
@@ -595,7 +596,8 @@
 "        str: The cleaned up string.\n",
 "    \"\"\"\n",
 "    <FILL IN>\n",
-"print removePunctuation('Hi, you!')"
+"print removePunctuation('Hi, you!')\n",
+"print removePunctuation(' No under_score!')"
 ]
 },
 {
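For readers following the lab, here is a minimal sketch of one way the `<FILL IN>` above could be completed, consistent with the docstring: a single `re.sub` that keeps only letters, digits, and spaces, followed by `lower()` and `strip()`. The regex pattern and the ordering are illustrative assumptions, not the official solution.

```python
import re

def removePunctuation(text):
    # Illustrative sketch (not the official solution): drop every character
    # that is not a letter, digit, or space, then lower-case and trim.
    return re.sub(r'[^a-zA-Z0-9 ]', '', text).lower().strip()

print(removePunctuation('Hi, you!'))          # hi you
print(removePunctuation(' No under_score!'))  # no underscore
```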
@@ -679,7 +681,10 @@
 "outputs": [],
 "source": [
 "# TEST Words from lines (4d)\n",
-"Test.assertEquals(shakespeareWordCount, 928908, 'incorrect value for shakespeareWordCount')\n",
+"# This test allows for leading spaces to be removed either before or after\n",
+"# punctuation is removed.\n",
+"Test.assertTrue(shakespeareWordCount == 927631 or shakespeareWordCount == 928908,\n",
+"                'incorrect value for shakespeareWordCount')\n",
 "Test.assertEquals(shakespeareWordsRDD.top(5),\n",
 "                  [u'zwaggerd', u'zounds', u'zounds', u'zounds', u'zounds'],\n",
 "                  'incorrect value for shakespeareWordsRDD')"
4 changes: 2 additions & 2 deletions spark_tutorial_student.ipynb
@@ -6,7 +6,7 @@
 "source": [
 "#![Spark Logo](http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png) + ![Python Logo](http://spark-mooc.github.io/web-assets/images/python-logo-master-v3-TM-flattened_small.png)\n",
 "# **Spark Tutorial: Learning Apache Spark**\n",
-"#### This tutorial will teach you how to use [Apache Spark](http://spark.apache.org/), a framework for large-scale data processing, within a notebook. Many traditional frameworks were designed to be run on a single computer. However, many datasets today are too large to be stored on a single computer, and even when a dataset can be stored on one computer (such as the datasets in this tutorial), the dataset can often be processed much more quickly using multiple computers. Spark has efficient implementations of a number of transformations and actions that can be composed together to perform data processing and analysis. Spark excels at distributing these operations across a cluster while abstracting away many of the underlying implementatation details. Spark has been designed with a focus on scalability and efficiency. With Spark you can begin developing your solution on your laptop, using a small dataset, and then use that same code to process terabytes or even petabytes across a distributed cluster.\n",
+"#### This tutorial will teach you how to use [Apache Spark](http://spark.apache.org/), a framework for large-scale data processing, within a notebook. Many traditional frameworks were designed to be run on a single computer. However, many datasets today are too large to be stored on a single computer, and even when a dataset can be stored on one computer (such as the datasets in this tutorial), the dataset can often be processed much more quickly using multiple computers. Spark has efficient implementations of a number of transformations and actions that can be composed together to perform data processing and analysis. Spark excels at distributing these operations across a cluster while abstracting away many of the underlying implementation details. Spark has been designed with a focus on scalability and efficiency. With Spark you can begin developing your solution on your laptop, using a small dataset, and then use that same code to process terabytes or even petabytes across a distributed cluster.\n",
 "#### **During this tutorial we will cover:**\n",
 "#### *Part 1:* Basic notebook usage and [Python](https://docs.python.org/2/) integration\n",
 "#### *Part 2:* An introduction to using [Apache Spark](https://spark.apache.org/) with the Python [pySpark API](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD) running in the browser\n",
@@ -607,7 +607,7 @@
 "#### One useful thing to do when we have a new dataset is to look at the first few entries to obtain a rough idea of what information is available. In Spark, we can do that using the `first()`, `take()`, `top()`, and `takeOrdered()` actions. Note that for the `first()` and `take()` actions, the elements that are returned depend on how the RDD is *partitioned*.\n",
 "#### Instead of using the `collect()` action, we can use the `take(n)` action to return the first n elements of the RDD. The `first()` action returns the first element of an RDD, and is equivalent to `take(1)`.\n",
 "#### The `takeOrdered()` action returns the first n elements of the RDD, using either their natural order or a custom comparator. The key advantage of using `takeOrdered()` instead of `first()` or `take()` is that `takeOrdered()` returns a deterministic result, while the other two actions may return differing results, depending on the number of partitions or execution environment. `takeOrdered()` returns the list sorted in *ascending order*. The `top()` action is similar to `takeOrdered()` except that it returns the list in *descending order.*\n",
-"#### The `reduce()` action reduces the elements of a RDD to a single value by applying a function that takes two parameters and returns a single value. The function should be commutative and associative, as `reduce()` is applied at the partition level and then again to aggregate results from partitions. If these rules don't hold, the results from `reduce()` will be inconsistent. Reducing locally at paritions makes `reduce()` very efficient."
+"#### The `reduce()` action reduces the elements of a RDD to a single value by applying a function that takes two parameters and returns a single value. The function should be commutative and associative, as `reduce()` is applied at the partition level and then again to aggregate results from partitions. If these rules don't hold, the results from `reduce()` will be inconsistent. Reducing locally at partitions makes `reduce()` very efficient."
 ]
 },
 {
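As a compact illustration of the actions this cell describes, a minimal sketch, assuming the tutorial's SparkContext is available as `sc`:

```python
# Assumes a live SparkContext named `sc`, as elsewhere in the tutorial.
rdd = sc.parallelize([5, 3, 1, 4, 2])

print(rdd.first())         # one element; result depends on partitioning
print(rdd.take(3))         # first three elements; depends on partitioning
print(rdd.takeOrdered(3))  # [1, 2, 3]: deterministic, ascending order
print(rdd.top(3))          # [5, 4, 3]: deterministic, descending order
print(rdd.reduce(lambda a, b: a + b))  # 15; + is commutative and associative
```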