diff --git a/2023_ProgrammingWithPython_AnswerKeys_RP.ipynb b/2023_ProgrammingWithPython_AnswerKeys_RP.ipynb new file mode 100644 index 0000000..cd25143 --- /dev/null +++ b/2023_ProgrammingWithPython_AnswerKeys_RP.ipynb @@ -0,0 +1,4140 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "30dd1211", + "metadata": {}, + "source": [ + "# Programming with Python\n", + "\n", + "Originally derived from [Novice Python lesson using Python 2.7 - Copyright © Software Carpentry](https://github.com/swcarpentry/python-novice-inflammation-2.7) by Zach Mielko\n", + "\n", + "Adapted and taught by RP Pornmongkolsuk" + ] + }, + { + "cell_type": "markdown", + "id": "1b753813", + "metadata": {}, + "source": [ + "## Data Types\n", + "\n", + "What you are touching right now is a computer. So if it's going to do anything, it should at least compute? The simplest thing you can do in Python is to perform a mathematical operation." + ] + }, + { + "cell_type": "markdown", + "id": "b4f199eb", + "metadata": {}, + "source": [ + "You can add. " + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "df61c2a2", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "8" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "5+3" + ] + }, + { + "cell_type": "markdown", + "id": "7ee80ef7", + "metadata": {}, + "source": [ + "You can subtract." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "c38f252f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "5-3" + ] + }, + { + "cell_type": "markdown", + "id": "18d1cb60", + "metadata": {}, + "source": [ + "You can multiply." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "0dbedd05", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "20" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "2*10" + ] + }, + { + "cell_type": "markdown", + "id": "7e1ddabc", + "metadata": {}, + "source": [ + "And you can divide." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "9d124f39", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "6.0" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "30/5" + ] + }, + { + "cell_type": "markdown", + "id": "4ad60092", + "metadata": {}, + "source": [ + "Even though Jupyter Notebook automatically prints out the output of our mathematical operations, there's a specific `function` for printing the output. We're going to talk more about `functions` later, but a function is very similar to a `command` in `bash` that you learned this morning. Calling a `function` tells Python to do something.\n", + "\n", + "And this is where readability, one of Python's strengths, comes in, because guess what, the function for printing is called `print()`. In Python, when you call a `function`, you type the name of the function, followed by a set of parentheses. Inside the parentheses, you include whatever parameters that function needs. \n", + "\n", + "For `print()`, the only parameter needed is just the thing you want to be printed." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "8f21037f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "8\n" + ] + } + ], + "source": [ + "print(5+3)" + ] + }, + { + "cell_type": "markdown", + "id": "7ec94efe", + "metadata": {}, + "source": [ + "You might wonder, what's the point of the `print()` function if the Notebook is going to print it out for me anyway. Well, by default, Jupyter Notebook only prints out the last output, so if you have a multi-line cell and are calling to print out a couple outputs without calling the `print()` function, it will only print the result of your last line." + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "fef5e30b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "20+50\n", + "4-3" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "8c0c69b5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "70\n", + "1\n" + ] + } + ], + "source": [ + "print(20+50)\n", + "print(4-3)" + ] + }, + { + "cell_type": "markdown", + "id": "d9104f54", + "metadata": {}, + "source": [ + "Now that we know the function `print()`, let's try to print something else, shall we? Here's the single most popular and iconic first exercise of any programming language ever, it is to print out the phrase \"Hello World\" as if the computer has just gained its first consciousness." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "4a88bd52", + "metadata": { + "scrolled": true, + "tags": [ + "raises-exception" + ] + }, + "outputs": [ + { + "ename": "SyntaxError", + "evalue": "invalid syntax. Perhaps you forgot a comma? 
(4293340409.py, line 1)", + "output_type": "error", + "traceback": [ + "\u001b[0;36m Cell \u001b[0;32mIn[23], line 1\u001b[0;36m\u001b[0m\n\u001b[0;31m print(Hello World)\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax. Perhaps you forgot a comma?\n" + ] + } + ], + "source": [ + "print(Hello World)" + ] + }, + { + "cell_type": "markdown", + "id": "46c097c0", + "metadata": {}, + "source": [ + "Oooooh noooo, there's an error, what do I do? Well, first, don't panic. Thankfully, this is not like an experiment where one mistake means you have to start over from the beginning. You can always go back, edit the cell, and run it again. Moreover, errors can be really helpful. They give us a hint as to why our code didn't work. \n", + "\n", + "Let's break this one down. Here, it said `line 1`, meaning the error is in the first line, which is a given as there's only one line of code. Then, it said `SyntaxError: invalid syntax`. There is something wrong with our syntax. This is because, although numbers can be included in the code directly, the same cannot be said about text. Think about it, how can a computer distinguish the word `print` that you meant as a function from the word \"print\" that you meant literally? So the syntax for text that you want Python to interpret literally is to wrap it in quotes (`''`). " + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "b1be78c9", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Hello World\n" + ] + } + ], + "source": [ + "print('Hello World')" + ] + }, + { + "cell_type": "markdown", + "id": "0f10e3d7", + "metadata": {}, + "source": [ + "Alternatively, you can also use double quotes (`\"\"`)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "fe6c7227", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Hello World\n" + ] + } + ], + "source": [ + "print(\"Hello World\")" + ] + }, + { + "cell_type": "markdown", + "id": "a5ae3e8b", + "metadata": {}, + "source": [ + "There, congratulations! You just wrote your first computer program in Python. If that makes you feel amazing, more power to you! If you think that's underwhelming, let's say we should be thankful that printing \"Hello World\" only takes one line of code in Python, because that is not always true in other programming languages. \n", + "\n", + "Note that when you open a quote, don't forget to close it as well. Otherwise, Python will treat everything after the opening quote, up to the end of the line, as part of the same text." + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "ad20bc04", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [ + { + "ename": "SyntaxError", + "evalue": "unterminated string literal (detected at line 1) (2771583741.py, line 1)", + "output_type": "error", + "traceback": [ + "\u001b[0;36m Cell \u001b[0;32mIn[26], line 1\u001b[0;36m\u001b[0m\n\u001b[0;31m print(\"Hello World)\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m unterminated string literal (detected at line 1)\n" + ] + } + ], + "source": [ + "print(\"Hello World)" + ] + }, + { + "cell_type": "markdown", + "id": "cbe99779", + "metadata": {}, + "source": [ + "Older versions of Python report this error as `EOL while scanning string literal`, where EOL stands for \"End of Line.\" Either way, the error is saying that you ended the line without ending the quote. " + ] + }, + { + "cell_type": "markdown", + "id": "755f3487", + "metadata": {}, + "source": [ + "As you may have noticed in the error message, there is a specific term for the \"text\" data type in Python--`string`. Think of it as a string of characters. You can check the type of the data by calling the function `type()`. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "55c1a1a4", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "str" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "type(\"Hello World\")" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "af13e610", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "str" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "type('Hello World')" + ] + }, + { + "cell_type": "markdown", + "id": "8b936b84", + "metadata": {}, + "source": [ + "`str` stands for `string`. Now, let's try to use `type()` on a number." + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "b174c029", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "int" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "type(1)" + ] + }, + { + "cell_type": "markdown", + "id": "e06df613", + "metadata": {}, + "source": [ + "`int` stands for `integer`, which you might notice, does not represent all real numbers. This is because Python treats numbers without decimal points differently from decimal numbers or fractions. Let's try calling `type()` on a decimal number!" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "3e2d2ca2", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "float" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "type(20.3)" + ] + }, + { + "cell_type": "markdown", + "id": "bcffe0fc", + "metadata": {}, + "source": [ + "Or a fraction." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "6b50db7f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "float" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "type(5/3)" + ] + }, + { + "cell_type": "markdown", + "id": "0afe8069", + "metadata": {}, + "source": [ + "Decimal numbers are called `floats` in Python. I know the name seems kind of random, but it refers to how the decimal point can float left or right within a decimal number when you change the exponent above base 10, like 1.5 is 0.15 x 10^1. \n", + "\n", + "**[Sticky Check]**" + ] + }, + { + "cell_type": "markdown", + "id": "e3f00c3a", + "metadata": {}, + "source": [ + "## Variables\n", + "\n", + "Besides computing, one other thing that a computer is incredibly good at is to memorize things for us. In Python, you can store a `string`, an `integer`, a `float`, or any kind of information into something called, `variables`. To do that, you have to give the variable a name or a label that you can call it later, and assign a value--whether it be `string`, `integer`, etc--to it. The syntax for assigning the value in Python is the equal sign (`=`), where the name of the `variable` is to the left of the sign, and the value is to the right. " + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "8882d8da", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Hola, Mundo\n" + ] + } + ], + "source": [ + "greeting = \"Hola, Mundo\"\n", + "print(greeting)" + ] + }, + { + "cell_type": "markdown", + "id": "92fd3b5b", + "metadata": {}, + "source": [ + "`Variables` are stored within the Jupyter Notebook environment. It means that you can still call `greeting` again in another cell, as long as you stay within this Notebook.\n", + "\n", + "`print()` can print multiple items if you add all of them inside the parentheses separated by commas (`,`). 
" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "26b50c3f", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Hola, Mundo Derek\n" + ] + } + ], + "source": [ + "neighbor = \"Derek\"\n", + "print(greeting, neighbor)" + ] + }, + { + "cell_type": "markdown", + "id": "52a0eac8", + "metadata": {}, + "source": [ + "Numbers also work the same way." + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "4fc92f93", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "My daughter is 10 years old.\n", + "Next year she will be 11\n" + ] + } + ], + "source": [ + "age = 10\n", + "print('My daughter is', age, 'years old.')\n", + "print('Next year she will be', age + 1)" + ] + }, + { + "cell_type": "markdown", + "id": "c101fc54", + "metadata": {}, + "source": [ + "You can store a `float` or even an equation." + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "c49586b6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "110.00000000000001\n" + ] + } + ], + "source": [ + "lb = 50.0\n", + "kg = lb * 2.2\n", + "print(kg)" + ] + }, + { + "cell_type": "markdown", + "id": "0dd8c2eb", + "metadata": {}, + "source": [ + "Python is very smart when it comes to combining different data types. It implicitly determines what the result data type should be or if it is possible." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "5ea739ee", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "my_float <class 'float'>\n", + "my_integer <class 'int'>\n", + "60.5 <class 'float'>\n" + ] + } + ], + "source": [ + "my_float = 50.5\n", + "print(\"my_float\", type(my_float))\n", + "my_integer = 10\n", + "print(\"my_integer\", type(my_integer))\n", + "float_plus_integer = my_float + my_integer\n", + "print(float_plus_integer, type(float_plus_integer))" + ] + }, + { + "cell_type": "markdown", + "id": "6ab4197d", + "metadata": {}, + "source": [ + "When you combine a `float` and an `integer` together, the sum is automatically determined to be a `float`. Now, let's consider something that doesn't make sense, like combining a `string` and a `float`. " + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "278edb29", + "metadata": { + "scrolled": true, + "tags": [ + "raises-exception" + ] + }, + "outputs": [ + { + "ename": "TypeError", + "evalue": "unsupported operand type(s) for +: 'float' and 'str'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[0;32mIn[37], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m my_string \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mHello\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m----> 2\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[43mmy_float\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m+\u001b[39;49m\u001b[43m \u001b[49m\u001b[43mmy_string\u001b[49m)\n", + "\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'float' and 'str'" + ] + } + ], + "source": [ + "my_string = \"Hello\"\n", + "print(my_float + my_string)" + ] + }, + { + "cell_type": "markdown", + "id": "d8147371", + "metadata": {}, + "source": [ + "That resulted in an error. 
\n", + "\n", + "Note that when I named my `variables`, I used underscores (`_`) to separate between words within the name. This is because whitespaces can be meaningfully interpreted in Python. Just out of curiosity, let's try to name a `variable` `my float` without an underscore." + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "a5a80dca", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [ + { + "ename": "SyntaxError", + "evalue": "invalid syntax (1262429664.py, line 1)", + "output_type": "error", + "traceback": [ + "\u001b[0;36m Cell \u001b[0;32mIn[38], line 1\u001b[0;36m\u001b[0m\n\u001b[0;31m my float = 20\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n" + ] + } + ], + "source": [ + "my float = 20" + ] + }, + { + "cell_type": "markdown", + "id": "5b5bfdaa", + "metadata": {}, + "source": [ + "There are 4 rules to naming variables in Python.\n", + "- It must start with a letter or the underscore character\n", + "- It cannot start with a number\n", + "- It can only contain letters, numbers, and underscores\n", + "- Variable names are case-sensitive" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "e52c3184", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [ + { + "ename": "SyntaxError", + "evalue": "invalid decimal literal (861385535.py, line 1)", + "output_type": "error", + "traceback": [ + "\u001b[0;36m Cell \u001b[0;32mIn[39], line 1\u001b[0;36m\u001b[0m\n\u001b[0;31m 2var = 20\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid decimal literal\n" + ] + } + ], + "source": [ + "2var = 20" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "146166da", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [ + { + "ename": "SyntaxError", + "evalue": "cannot assign to expression here. Maybe you meant '==' instead of '='? 
(1024293175.py, line 1)", + "output_type": "error", + "traceback": [ + "\u001b[0;36m Cell \u001b[0;32mIn[40], line 1\u001b[0;36m\u001b[0m\n\u001b[0;31m var-1 = 1\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m cannot assign to expression here. Maybe you meant '==' instead of '='?\n" + ] + } + ], + "source": [ + "var-1 = 1" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "0577300e", + "metadata": { + "scrolled": true, + "tags": [ + "raises-exception" + ] + }, + "outputs": [ + { + "ename": "NameError", + "evalue": "name 'abc' is not defined", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[0;32mIn[41], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m ABC \u001b[38;5;241m=\u001b[39m \u001b[38;5;241m0\u001b[39m\n\u001b[0;32m----> 2\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[43mabc\u001b[49m)\n", + "\u001b[0;31mNameError\u001b[0m: name 'abc' is not defined" + ] + } + ], + "source": [ + "ABC = 0\n", + "print(abc)" + ] + }, + { + "cell_type": "markdown", + "id": "44891132", + "metadata": {}, + "source": [ + "Lastly, you should not name `variables` after Python's built-in names. Python will let you do it, but your value then replaces the built-in for the rest of the session, so don't actually run this step as it will pretty much break your Notebook. (True keywords such as `for` and `in` cannot be used as names at all.) For example, don't name your `variable` `float`. " + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "cd75f35d", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "# float = 20.0\n", + "# print(float)" + ] + }, + { + "cell_type": "markdown", + "id": "d2237241", + "metadata": {}, + "source": [ + "Nor should you name a `variable` `print`. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "c1852153", + "metadata": { + "scrolled": true, + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "# print = \"hi\"\n", + "# print" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "id": "09eb28b5", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "# print(print)" + ] + }, + { + "cell_type": "markdown", + "id": "8ab0f7af", + "metadata": {}, + "source": [ + "Two `variables` can contain the same value. " + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "id": "7da5a9cf", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "pencil pencil\n" + ] + } + ], + "source": [ + "x = y = 'pencil'\n", + "print(x, y)" + ] + }, + { + "cell_type": "markdown", + "id": "1a2c2f57", + "metadata": {}, + "source": [ + "`Variables` can also be reassigned to a different value. I'm just curious. Let's change `x` to a different value, 5. What happens to `y`?" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "44966782", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "5 pencil\n" + ] + } + ], + "source": [ + "x = 5\n", + "print(x,y)" + ] + }, + { + "cell_type": "markdown", + "id": "f95848ed", + "metadata": {}, + "source": [ + "One thing to keep in mind is that `variables` don't remember where they came from. When you use a variable in a computation, Python will use the value of that `variable` when the code is run. So, the order of the lines matters." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "76a416c0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0 4\n" + ] + } + ], + "source": [ + "a = 3\n", + "b = a + 1\n", + "a = 0\n", + "print(a, b)" + ] + }, + { + "cell_type": "markdown", + "id": "588bbf75", + "metadata": {}, + "source": [ + "The behaviors of data types and variables, and the interactions between them, can be confusing at first, so don't worry if you don't get it right away. But I encourage you to play around with them more. Eventually, these behaviors will just become your intuition when working in Python. **[Sticky Check]** " + ] + }, + { + "cell_type": "markdown", + "id": "1c0acc59", + "metadata": {}, + "source": [ + "## More String\n", + "Let's talk more about `strings`. `Strings` are useful to hold small bits of text - words, sentences, etc. We use them all the time in Python.\n", + "\n", + "`Strings` are made up of characters strung in a certain order. These characters are considered elements of a `string`. Something really cool in Python is that we can access those elements directly by using an **integer index** and **square brackets**:" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "0d5779ef", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "P\n" + ] + } + ], + "source": [ + "name = 'Python'\n", + "print(name[0])" + ] + }, + { + "cell_type": "markdown", + "id": "e548f1f5", + "metadata": {}, + "source": [ + "**This is very important!!** Python starts counting at **0**. 
So the _first letter_ in our **name** `string` is at `name[0]`" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "aee9a019", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "y\n", + "t\n", + "n\n" + ] + } + ], + "source": [ + "print(name[1])\n", + "print(name[2])\n", + "print(name[5])" + ] + }, + { + "cell_type": "markdown", + "id": "b114819b", + "metadata": {}, + "source": [ + "In Python, we can also use negative numbers to count backwards from the end of the sequence." + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "609c4e48", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "n\n", + "o\n" + ] + } + ], + "source": [ + "print(name[-1])\n", + "print(name[-2])" + ] + }, + { + "cell_type": "markdown", + "id": "05eae4ed", + "metadata": {}, + "source": [ + "We can even extract a portion of a `string` using a concept called **slices**. We can do that with the format **[start:end]** where **start is inclusive** and **end is exclusive**.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "id": "79fcd1c1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Python\n", + "hon\n", + "Pyt\n" + ] + } + ], + "source": [ + "print(name[0:6])\n", + "print(name[3:6])\n", + "print(name[0:3])" + ] + }, + { + "cell_type": "markdown", + "id": "93e62ef5", + "metadata": {}, + "source": [ + "Notice that the word 'Python' only has 6 letters and `name[0:6]` includes the whole thing even though `name[5]` returns the last letter. Because the end parameter in a slice is exclusive, it is the number just after." + ] + }, + { + "cell_type": "markdown", + "id": "319d73bc", + "metadata": {}, + "source": [ + "A slice notation is kind of funny, you need to include the colon but the start and stop parameters are optional. 
By default, they will be the beginning and end of the sequence" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "id": "035f3866", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Python\n", + "hon\n", + "Pyt\n" + ] + } + ], + "source": [ + "print(name[:])\n", + "print(name[3:])\n", + "print(name[:3])" + ] + }, + { + "cell_type": "markdown", + "id": "7b4874bd", + "metadata": {}, + "source": [ + "One more optional parameter you can use is called a **step** in the format **[start:end:step]**. This parameter determines what you \"count by\" and is 1 by default." + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "id": "ed6abfaf", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Python\n", + "Pto\n" + ] + } + ], + "source": [ + "print(name[0:6:1])\n", + "print(name[0:6:2])" + ] + }, + { + "cell_type": "markdown", + "id": "40abb354", + "metadata": {}, + "source": [ + "We'll get to use this a lot more later in the lesson, so don't worry if you haven't gotten a hold of it yet. " + ] + }, + { + "cell_type": "markdown", + "id": "a5bcf998", + "metadata": {}, + "source": [ + "Multiple `strings` can be concatenated together using the `+` sign." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "feef9a1a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'I am Ironman.'" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "'I ' + 'am ' + 'Ironman.'" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "id": "451fc4fd", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "I am a billionaire.\n" + ] + } + ], + "source": [ + "a = 'I '\n", + "b = 'am '\n", + "c = 'a billionaire.'\n", + "print(a + b + c)" + ] + }, + { + "cell_type": "markdown", + "id": "204a109f", + "metadata": {}, + "source": [ + "We can also check if a letter exists in a string:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "id": "a624aaf2", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "False" + ] + }, + "execution_count": 56, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "'a' in name" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "id": "0dc20b07", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "<class 'bool'> <class 'bool'>\n" + ] + } + ], + "source": [ + "print(type(True), type(False))" + ] + }, + { + "cell_type": "markdown", + "id": "a228f5c5", + "metadata": {}, + "source": [ + "**True** and **False** are actually a new data type called a `boolean`. We'll cover these later, but it's very helpful to do things like check membership or compare two numbers, and `boolean` values are used for that. If you're wondering, `boolean` just came from the name of George Boole who developed the concept.\n", + "\n", + "Notice that the word `in` is bold and green in Jupyter notebooks. Commands like these are known as **keywords**." 
+ ] + }, + { + "cell_type": "markdown", + "id": "79285ea3", + "metadata": {}, + "source": [ + "We can also check if a letter does **not** exist, by using the keyword `not`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "id": "0e95e657", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 58, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "'a' not in name" + ] + }, + { + "cell_type": "markdown", + "id": "0c8ec010", + "metadata": {}, + "source": [ + "Okay, we just went over quite a few basic elements in Python. Now, let's take a deep breath before we apply the knowledge we just learned on something more relevant to our research--genomic data. **[Sticky Check]**.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "e3021ef1", + "metadata": {}, + "source": [ + "## Working with a DNA Sequence" + ] + }, + { + "cell_type": "markdown", + "id": "7a39f5d2", + "metadata": {}, + "source": [ + "Here's a made-up DNA sequence, and let's make sure we have the same sequence here (**copy and paste it in the collaborative notebook**). This is a `string`, and we are going to store it in a `variable` called `seq`:" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "id": "67fa0ad5", + "metadata": {}, + "outputs": [], + "source": [ + "seq = 'ACCTGCATGC'" + ] + }, + { + "cell_type": "markdown", + "id": "92f31657", + "metadata": {}, + "source": [ + "Let's use indexes to retrieve the first couple nucleotides in the sequence and store them in variables. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "id": "33a2a3cf", + "metadata": {}, + "outputs": [], + "source": [ + "b1 = seq[0]\n", + "b2 = seq[1]\n", + "b3 = seq[2]" + ] + }, + { + "cell_type": "markdown", + "id": "99ffedb7", + "metadata": {}, + "source": [ + "We can name these variables with numbers in them, but remember, a variable name cannot start with a number.\n", + "\n", + "Remember, too, that we can combine multiple strings together using the `+` sign, let's try to string together the first 2 bases as well as recombine them in in a reverse order:" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "id": "c5525610", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ACC\n" + ] + } + ], + "source": [ + "first3 = b1 + b2 + b3\n", + "print(first3)" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "id": "64911ed2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CCA\n" + ] + } + ], + "source": [ + "r3 = b3 + b2 + b1\n", + "print(r3)" + ] + }, + { + "cell_type": "markdown", + "id": "b7a9e57d", + "metadata": {}, + "source": [ + "You can call elements within the `string`, but you **cannnot**--and this is a very fitting word--mutate the elements within it. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "id": "6ce458a8", + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [ + { + "ename": "TypeError", + "evalue": "'str' object does not support item assignment", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[0;32mIn[63], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mr3\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m2\u001b[39;49m\u001b[43m]\u001b[49m \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mG\u001b[39m\u001b[38;5;124m'\u001b[39m\n", + "\u001b[0;31mTypeError\u001b[0m: 'str' object does not support item assignment" + ] + } + ], + "source": [ + "r3[2] = 'G'" + ] + }, + { + "cell_type": "markdown", + "id": "fe669835", + "metadata": {}, + "source": [ + "What we **can** do is reassignment. For example, if you want to mutate `r3` but also want to store the old value somewhere, we can reassign the value of current `r3` to a new variable called `old_r3`, and give `r3` a new value." + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "id": "ee89b39c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CCA CCG\n" + ] + } + ], + "source": [ + "old_r3 = r3\n", + "r3 = b3 + b2 + 'G'\n", + "print(old_r3, r3)" + ] + }, + { + "cell_type": "markdown", + "id": "549ebebc", + "metadata": {}, + "source": [ + "We call `strings` **immutable**. We'll cover some data types that are mutable, but when working with strings, we have to build new ones." + ] + }, + { + "cell_type": "markdown", + "id": "81787b18-2f03-48d5-90b7-1e8e07d409c7", + "metadata": {}, + "source": [ + "Now, at this point you should know enough about data types and variables to start doing something with the DNA sequence. 
But before that, let's quickly go over the recap slides to see what we've learned so far. **[Sticky Check]** **[RECAP]**" + ] + }, + { + "cell_type": "markdown", + "id": "8248b99d", + "metadata": {}, + "source": [ + "## Loops - Reversing\n", + "Now, let's perform an operation called **\"reverse complement\"** on the sequence. It's a useful process when working with sequences, and it's a great example of a problem that you can easily explain and even do by hand, but may not immediately know how to program.\n", + "\n", + "With any programming endeavor, we'll start by breaking the problem down into parts and then assembling them later. So to reverse-complement a sequence, first we'll learn how to reverse it. \n", + "\n", + "How should we go about reversing a sequence? We can try using the same approach as demonstrated before." + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "id": "190bbf12", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ACCTGCATGC CGTA\n" + ] + } + ], + "source": [ + "rev = seq[9] + seq[8] + seq[7] + seq[6] # ...\n", + "print(seq, rev)" + ] + }, + { + "cell_type": "markdown", + "id": "067d1bc6", + "metadata": {}, + "source": [ + "But that's a bad approach for two reasons:\n", + "\n", + "1. It doesn't scale:\n", + " if we want to reverse a string that's hundreds of letters long,\n", + " we'd be better off just typing the characters in.\n", + "\n", + "1. It's fragile:\n", + " if we give it a longer sequence,\n", + " it only reverses part of the data,\n", + " and if we give it a shorter one,\n", + " it produces an error because we're asking for characters that don't exist." + ] + }, + { + "cell_type": "markdown", + "id": "fa55d254", + "metadata": {}, + "source": [ + "A better way to do this is with a `loop`. The main idea of a `loop` is to go through each element within a `string` one by one, regardless of how long the `string` is. This is what a basic `loop` looks like."
+ ] + }, + { + "cell_type": "code", + "execution_count": 66, + "id": "fcfdb4a3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "A\n", + "C\n", + "C\n", + "T\n", + "G\n", + "C\n", + "A\n", + "T\n", + "G\n", + "C\n" + ] + } + ], + "source": [ + "for s in seq:\n", + "    print(s)" + ] + }, + { + "cell_type": "markdown", + "id": "89033fa6", + "metadata": {}, + "source": [ + "Just two lines, but a few things already happened here. \n", + "\n", + "1. We used a new keyword, **`for`**, together with a previously introduced keyword, `in`\\*. Together they construct the loop and say we want to work on each element in `seq`. This line must end with a **colon**.\n", + "2. We created a new variable `s`. This is called a **`loop variable`**, because it changes at each iteration of the loop.\n", + "3. We indented the code that we want to repeat. I mentioned before that whitespace in Python is meaningful. The body of the loop is defined by the lines that are indented after the colon. Many other programming languages instead use special characters like curly braces to define the boundary of a `loop`. \n", + "\n", + "\\* Note: the behavior of the `in` keyword is modified by the `for` keyword. Here it does not check if the value on the left is a member of the collection of values on the right. Rather, it is used as a separator between the loop variable and the sequence, and it makes the syntax of a Python `for loop` more readable."
+ ] + }, + { + "cell_type": "code", + "execution_count": 67, + "id": "67f5054c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Before\n", + "Loop: A\n", + "Loop: C\n", + "Loop: C\n", + "Loop: T\n", + "Loop: G\n", + "Loop: C\n", + "Loop: A\n", + "Loop: T\n", + "Loop: G\n", + "Loop: C\n", + "After\n" + ] + } + ], + "source": [ + "print('Before')\n", + "for b in seq:\n", + " print('Loop:', b)\n", + "print('After')" + ] + }, + { + "cell_type": "markdown", + "id": "c20077ab", + "metadata": {}, + "source": [ + "We can do other things in the body - they don't have to relate to the `loop variable`. For example, if we want to count how long the sequence is, we can use a `variable` to which we add 1 every iteration:" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "id": "8546f8e2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n" + ] + } + ], + "source": [ + "count = 0\n", + "for s in seq:\n", + " count = count + 1\n", + "print(count)" + ] + }, + { + "cell_type": "markdown", + "id": "25a931e6", + "metadata": {}, + "source": [ + "### Exercise 1: reverse" + ] + }, + { + "cell_type": "markdown", + "id": "7824ee03", + "metadata": {}, + "source": [ + "Let's make this a hands-on exercise to write some code that reverses the sequence in `seq` and puts it in a variable called `rev`. Remember, you can add strings together and the order is important! 
\n", + "\n", + "**Hint:** you can assign a blank string to a `variable` by just typing `empty_string = ''`" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "id": "43e685a0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CGTACGTCCA\n" + ] + } + ], + "source": [ + "# Your code here!\n", + "\n", + "rev = ''\n", + "for s in seq:\n", + " rev = s + rev\n", + "print(rev)" + ] + }, + { + "cell_type": "markdown", + "id": "b967af4e", + "metadata": {}, + "source": [ + "**[Sticky Check]**" + ] + }, + { + "cell_type": "markdown", + "id": "06f5018b", + "metadata": {}, + "source": [ + "## Dictionaries - Lookups" + ] + }, + { + "cell_type": "markdown", + "id": "b6b03c33", + "metadata": {}, + "source": [ + "So, we've learned how to reverse a sequence using a `loop`, next up is the **'complement'** part. If you are trying to make a complement DNA strand--I mean you probably have the table memorized by now, but imagine doing it for the first time--you are going to look up the table for all complementary bases. \n", + "\n", + "Python has a data type called a `dictionary` that's great for looking things up. It associates a **`key`** to a **`value`**--similar to a word and a definition. \n", + "\n", + "`Dictionaries` are createed using curly braces `{}`. Each `key` and `value` are seperated by a colon. For example:" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "id": "bc8a3cd4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'key': 'value', 'second_key': 2, 3: 'third_value'}\n" + ] + } + ], + "source": [ + "my_dictionary = {\"key\": \"value\", \"second_key\": 2, 3: \"third_value\"}\n", + "print(my_dictionary)" + ] + }, + { + "cell_type": "markdown", + "id": "222016f2", + "metadata": {}, + "source": [ + "Let's say we want to keep track of the counts of the nucleotides in our sequence. 
We can create a `dictionary` where the `keys` are the letters, and the `values` are the quantity of each:" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "id": "aaec7dae", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'A': 2, 'C': 4, 'G': 2, 'T': 2}\n" + ] + } + ], + "source": [ + "counts = {'A': 2, 'C': 4, 'G': 2, 'T': 2}\n", + "print(counts)" + ] + }, + { + "cell_type": "markdown", + "id": "1b4cc680", + "metadata": {}, + "source": [ + "As with a `string`, we access elements in the `dictionary` with **square brackets `[]`**. But, instead of a number, we use the `key` to index, because elements inside a `dictionary` are looked up by `key` rather than by position. So to ask *how many A's do I have*, you write:" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "id": "9679a4e7", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "I have 2 As\n" + ] + } + ], + "source": [ + "print('I have', counts['A'], 'As')" + ] + }, + { + "cell_type": "markdown", + "id": "ad2511b7", + "metadata": {}, + "source": [ + "But unlike a `string`, `dictionaries` are **mutable**. This means we can change their contents. 
Let's change the number of As to 3:" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "id": "aab7c07a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Now I have 3 As\n" + ] + } + ], + "source": [ + "counts['A'] = 3\n", + "print('Now I have', counts['A'], 'As')" + ] + }, + { + "cell_type": "markdown", + "id": "0775c1cc", + "metadata": {}, + "source": [ + "And we can use a variable as the `key`, so let's choose a specific letter and put that in the variable `base`" + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "id": "91b25807", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "I have 4 Cs\n" + ] + } + ], + "source": [ + "base = 'C'\n", + "print('I have', counts[base], base + 's')\n" + ] + }, + { + "cell_type": "markdown", + "id": "66173c6d", + "metadata": {}, + "source": [ + "One thing about a `dictionary` is that each `key` has to be unique. However, two different `keys` can give the same `value`." + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "id": "197a46a5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "vegetable\n" + ] + } + ], + "source": [ + "fruit_or_veg = {'tomato': 'fruit', 'tomato': 'vegetable'}\n", + "print(fruit_or_veg['tomato'])" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "id": "e863a589", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "veg veg\n" + ] + } + ], + "source": [ + "fruit_or_veg = {'cabbage':'veg', 'lettuce':'veg'}\n", + "print(fruit_or_veg['cabbage'], fruit_or_veg['lettuce'])" + ] + }, + { + "cell_type": "markdown", + "id": "d35ead5d", + "metadata": {}, + "source": [ + "Since `dictionaries` are collections of items, we can loop over them. 
Let's see what happens when we loop over our `counts` `dictionary`:" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "id": "735cc9ae", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "A\n", + "C\n", + "G\n", + "T\n" + ] + } + ], + "source": [ + "for c in counts:\n", + " print(c)" + ] + }, + { + "cell_type": "markdown", + "id": "63e6b436", + "metadata": {}, + "source": [ + "When we loop over the `dictionary`, the **loop variable** is set to each **key**. Looping a `dictionary` doesn't give you the `value` directly, but we can get the `value` indirectly since we have the `key` and the `dictionary`:" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "id": "a3f93cbd", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "A 3\n", + "C 4\n", + "G 2\n", + "T 2\n" + ] + } + ], + "source": [ + "for b in counts:\n", + " qty = counts[b]\n", + " print(b, qty)" + ] + }, + { + "cell_type": "markdown", + "id": "d90b8191", + "metadata": {}, + "source": [ + "### Exercise 2 - counting quantity" + ] + }, + { + "cell_type": "markdown", + "id": "c04c11a3", + "metadata": {}, + "source": [ + "As an exercise, let's get the total quantity of nucleotides we have in the `dictionary` counts. This `loop` should look a lot like counting the bases in the sequence." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 79, + "id": "c41078c5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'A': 2, 'C': 4, 'G': 2, 'T': 2}\n" + ] + } + ], + "source": [ + "# Your code here!\n", + "\n", + "counts['A'] = 0\n", + "counts['T'] = 0\n", + "counts['C'] = 0\n", + "counts['G'] = 0\n", + "\n", + "for b in seq:\n", + " counts[b] = counts[b] + 1\n", + " \n", + "print(counts)" + ] + }, + { + "cell_type": "markdown", + "id": "9caf4ed0", + "metadata": {}, + "source": [ + "`Dictionaries` are great because they're flexible, so we can use them to look up anything like complementary bases. So let's start by creating a dictionary with the complements for A and T." + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "id": "f91b1420", + "metadata": {}, + "outputs": [], + "source": [ + "comps = {'A':'T', 'T':'A'}" + ] + }, + { + "cell_type": "markdown", + "id": "d8d8cce1", + "metadata": {}, + "source": [ + "So to read the complement of `A`, we can write `comps['A']`:" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "id": "ea89866f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "T\n" + ] + } + ], + "source": [ + "print(comps['A'])" + ] + }, + { + "cell_type": "markdown", + "id": "6e588792", + "metadata": {}, + "source": [ + "We just have A and T here. 
With `dictionaries`, we can add more items or even change existing ones:" + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "id": "080e2204", + "metadata": {}, + "outputs": [], + "source": [ + "comps['C'] = 'G'\n", + "comps['G'] = 'C'" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "id": "d5855f27", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}\n" + ] + } + ], + "source": [ + "print(comps)" + ] + }, + { + "cell_type": "markdown", + "id": "94fee962", + "metadata": {}, + "source": [ + "Now, we can complement the sequence from before by looking up the complement inside a loop:" + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "id": "19c4d6a2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ACCTGCATGC CGTACGTCCA GCATGCAGGT\n" + ] + } + ], + "source": [ + "revcomp = ''\n", + "# starting with rev, the already reversed sequence\n", + "for s in rev:\n", + "    c = comps[s]\n", + "    revcomp = revcomp + c\n", + "print(seq, rev, revcomp)" + ] + }, + { + "cell_type": "markdown", + "id": "a43f6c0a", + "metadata": {}, + "source": [ + "**[Sticky Check]**" + ] + }, + { + "cell_type": "markdown", + "id": "e78c54e1", + "metadata": {}, + "source": [ + "## Lists\n", + "\n", + "We've looked at `strings` and `dictionaries` as collection types. There's another one that's really useful, and it's called a `list`. In contrast to a `dictionary`, where we look things up by `key`, a `list` is an **ordered** collection of elements that we access by position. Lists use the `[]` square brackets, and elements are separated by commas.\n", + "\n", + "Here's one that contains our 3 sequences."
+ ] + }, + { + "cell_type": "code", + "execution_count": 85, + "id": "88602470", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['ACCTGCATGC', 'CGTACGTCCA', 'GCATGCAGGT'] <class 'list'>\n" + ] + } + ], + "source": [ + "seqs = [seq, rev, revcomp]\n", + "print(seqs, type(seqs))" + ] + }, + { + "cell_type": "markdown", + "id": "8c746d87", + "metadata": {}, + "source": [ + "Just like with `strings`, we can use a `for loop` to iterate through each element in the `list` and perform commands. Let's use a `for loop` to print each element in the `list`." + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "id": "77dd704c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ACCTGCATGC\n", + "CGTACGTCCA\n", + "GCATGCAGGT\n" + ] + } + ], + "source": [ + "for s in seqs:\n", + "    print(s)" + ] + }, + { + "cell_type": "markdown", + "id": "bfd54c65", + "metadata": {}, + "source": [ + "**`Lists`**, unlike `strings`, are **mutable**. So we can add items to a `list` or remove them from it, reorder it, or swap items out."
+ ] + }, + { + "cell_type": "code", + "execution_count": 89, + "id": "86d81e6a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['CGTACGTCCA', 'GCATGCAGGT']\n" + ] + } + ], + "source": [ + "del seqs[0]\n", + "print(seqs)" + ] + }, + { + "cell_type": "markdown", + "id": "959e3e92", + "metadata": {}, + "source": [ + "**[Sticky Check]**" + ] + }, + { + "cell_type": "markdown", + "id": "fad49f07", + "metadata": {}, + "source": [ + "## Making choices / Conditionals\n", + "\n", + "Think about the advantages of programming: you can do repetitive tasks accurately and efficiently. But sometimes, you don't want to do every repeat exactly the same way. It's useful for your program to be able to read some data or results, and make a decision about what to do next. This is called a **conditional**, and the most common version is the `if-else`. This topic will bring back some memories from learning about logic in math classes.\n", + "\n", + "Here's how it works. We start with an `if` keyword, and then we write some expression that will either be **True** or **False**. Then we write a colon, and indent what should happen **if** the expression is **True**. \n", + "\n", + "We can also write what should happen if the expression is **False**, after an `else`." + ] + }, + { + "cell_type": "code", + "execution_count": 90, + "id": "188056cb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Seq contains GC\n", + "done\n" + ] + } + ], + "source": [ + "if 'GC' in seq:\n", + "    print('Seq contains GC')\n", + "else:\n", + "    print('Seq does not contain GC')\n", + "print('done')" + ] + }, + { + "cell_type": "markdown", + "id": "71846b2a", + "metadata": {}, + "source": [ + "Only one branch or the other is ever executed.\n", + "\n", + "Conditional statements don't have to include an `else`. If there isn't one, Python simply does nothing if the test is false."
+ ] + }, + { + "cell_type": "markdown", + "id": "5330cb5b", + "metadata": {}, + "source": [ + "Let's do an example where we combine the concepts of **for loops** and **if statements**. In this example, let's write code that calculates the GC-content percentage of a DNA sequence. That is, what percentage of the nucleotides in a DNA sequence are 'G' or 'C'? We can use the **for loop** to go through each nucleotide and the **if statement** to determine if the letter is a 'G' or a 'C'. " + ] + }, + { + "cell_type": "code", + "execution_count": 91, + "id": "9cfe0749", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "GC-content percentage: 60.0 %\n" + ] + } + ], + "source": [ + "gc = 0 ## counting GC\n", + "atgc = 0 ## counting everything\n", + "\n", + "for s in seq:\n", + "    if s == 'G': # notice the double indent\n", + "        gc = gc + 1\n", + "    if s == 'C':\n", + "        gc = gc + 1\n", + "    atgc = atgc + 1 # outside the ifs\n", + " \n", + "# Outside the loop\n", + "percent = (gc * 100) / atgc\n", + "print('GC-content percentage:', percent, '%')" + ] + }, + { + "cell_type": "markdown", + "id": "e4f997bb", + "metadata": {}, + "source": [ + "Notice the new operator `==`. It is a comparison operator that means \"is equal to\". Its counterpart, `!=`, means \"is not equal to\"." + ] + }, + { + "cell_type": "markdown", + "id": "c87a1d45", + "metadata": {}, + "source": [ + "This code works, but notice something: we've repeated ourselves. In both cases (`G` and `C`), we're running the same code. That's quite redundant. Surely, there's a way to make this code more succinct. A widely followed programming principle is *Don't Repeat Yourself*, and it's generally a good idea.\n", + "\n", + "So, when we want to check if one condition is true **or** another, we use the keyword `or` in between the two expressions."
+ ] + }, + { + "cell_type": "code", + "execution_count": 92, + "id": "67a87f74", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "GC-content percentage: 60.0 %\n" + ] + } + ], + "source": [ + "gc = 0\n", + "atgc = 0\n", + "for s in seq:\n", + "    if s == 'G' or s == 'C':\n", + "        gc = gc + 1\n", + "    atgc = atgc + 1 # outside the if\n", + "# Outside the loop\n", + "percent = (gc * 100) / atgc\n", + "print('GC-content percentage:', percent, '%')" + ] + }, + { + "cell_type": "markdown", + "id": "f19ab4fc", + "metadata": {}, + "source": [ + "Note that `or` has a counterpart called `and`, which is **True** only when both conditional expressions are **True**, rather than at least one. " + ] + }, + { + "cell_type": "markdown", + "id": "8b457588", + "metadata": {}, + "source": [ + "Now, another thing we can test for is if values are greater or less than others. This can be done with the operators `<`, `>`, `<=`, and `>=`, which behave in the same way as the mathematical expressions. We can classify these percentages as high, normal, or low. Let's say anything below 35 is low, anything from 35 up to (but not including) 56 is normal, and anything 56 or above is high. Written with nested `if-else` statements, it looks like this:" + ] + }, + { + "cell_type": "code", + "execution_count": 93, + "id": "ddcadb47", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Low\n" + ] + } + ], + "source": [ + "percent = 20\n", + "if percent < 35:\n", + "    print('Low')\n", + "else:\n", + "    if percent < 56:\n", + "        print('Normal')\n", + "    else:\n", + "        print('High')" + ] + }, + { + "cell_type": "markdown", + "id": "d55dd620", + "metadata": {}, + "source": [ + "Notice how adding multiple options results in lots of indentation. There is another keyword, `elif`, that we can use to shorten this. `elif` combines `else` and `if`. 
Here is the previous code but with `elif`:" + ] + }, + { + "cell_type": "code", + "execution_count": 94, + "id": "19986249", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Low\n" + ] + } + ], + "source": [ + "percent = 20\n", + "if percent < 35:\n", + "    print('Low')\n", + "elif percent < 56:\n", + "    print('Normal')\n", + "else:\n", + "    print('High')" + ] + }, + { + "cell_type": "markdown", + "id": "bf08e89a", + "metadata": {}, + "source": [ + "**[Sticky Check]**" + ] + }, + { + "cell_type": "markdown", + "id": "d4196f3b", + "metadata": {}, + "source": [ + "### Recap in presentation: dictionaries, lists, and conditionals" + ] + }, + { + "cell_type": "markdown", + "id": "68c5d5fa", + "metadata": {}, + "source": [ + "## Break" + ] + }, + { + "cell_type": "markdown", + "id": "af091079", + "metadata": {}, + "source": [ + "## Functions" + ] + }, + { + "cell_type": "markdown", + "id": "cc474901", + "metadata": {}, + "source": [ + "At this point, you've learned about different data types and variables, and how to work with them. But there's a crucial feature of any programming language that I kind of mentioned before but haven't properly introduced--**functions**. `print()` was the first function that we saw today. Python has about 70 functions that are built-in and just ready to use. They can do simple things like tell you the length of a collection, or more advanced things like sorting, working with files, or converting between data types.\n", + "\n", + "I like to think of functions as recipes and of Python as a cook. Functions are essentially just a set of instructions, and your input parameters are the ingredients. There are built-in functions, like the commonly-known recipes that every cook knows how to make. Python already knows how to perform these functions--such as printing output. And there are custom functions, like recipes that you invented yourself. 
You have to teach these recipes to Python before it can perform them.\n", + "\n", + "As we saw with `print`, to use a function, you type its name, followed by parentheses.\n", + "\n", + "Let's use the **len** function to get the length of our sequence:" + ] + }, + { + "cell_type": "code", + "execution_count": 95, + "id": "b3fc853e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n" + ] + } + ], + "source": [ + "print(len(seq)) # much less code than writing a loop!" + ] + }, + { + "cell_type": "markdown", + "id": "9bd153d0", + "metadata": {}, + "source": [ + "And that works on any collection - so, `dictionaries`, `lists`, etc." + ] + }, + { + "cell_type": "code", + "execution_count": 96, + "id": "947994ad", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "4 2\n" + ] + } + ], + "source": [ + "print(len(comps), len(seqs))" + ] + }, + { + "cell_type": "markdown", + "id": "f2de4d79", + "metadata": {}, + "source": [ + "There's the `sorted` function that takes a `list` and returns a new `list` in order:" + ] + }, + { + "cell_type": "code", + "execution_count": 97, + "id": "446c0c64", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[2, 3, 4, 7, 8]\n" + ] + } + ], + "source": [ + "x = [7,3,2,8,4]\n", + "y = sorted(x)\n", + "print(y)\n" + ] + }, + { + "cell_type": "markdown", + "id": "174f240e", + "metadata": {}, + "source": [ + "Functions open up a lot of possibilities. They're bits of code you can use, but don't have to write. One really important function is called `help()`. 
Python has its own built-in documentation on functions, and to use it, you just type:" + ] + }, + { + "cell_type": "code", + "execution_count": 98, + "id": "e43d0ed0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Help on built-in function sorted in module builtins:\n", + "\n", + "sorted(iterable, /, *, key=None, reverse=False)\n", + " Return a new list containing all items from the iterable in ascending order.\n", + " \n", + " A custom key function can be supplied to customize the sort order, and the\n", + " reverse flag can be set to request the result in descending order.\n", + "\n" + ] + } + ], + "source": [ + "help(sorted)" + ] + }, + { + "cell_type": "markdown", + "id": "ca421d5a", + "metadata": {}, + "source": [ + "There's also plenty of documentation on the internet: https://docs.python.org/3/library/functions.html" + ] + }, + { + "cell_type": "markdown", + "id": "20f2421e", + "metadata": {}, + "source": [ + "Many data types have their own functions **built into the data type**. We talked about `strings`, and they often contain multiple words separated by spaces. So there's a built-in function called **split()** that makes a `list` containing those words. **By default, it splits on whitespace**, but it can optionally take a string to split on instead. These functions use a different syntax, `datatype.function_name()` instead of `function_name()`."
+ ] + }, + { + "cell_type": "code", + "execution_count": 99, + "id": "6376510b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['rna', 'dna', 'protein']\n" + ] + } + ], + "source": [ + "seqs = 'rna dna protein' #This order is important for sorting later\n", + "seqs_list = seqs.split()\n", + "print(seqs_list)" + ] + }, + { + "cell_type": "markdown", + "id": "b9747e1c", + "metadata": {}, + "source": [ + "This function is **part of** the `string`, so when we want to use the function, we start with the string, then type a dot and the function name. Functions like these that are part of something are called **methods**. There are thousands of functions that are **part of** different data types. " + ] + }, + { + "cell_type": "markdown", + "id": "6f4651e6", + "metadata": {}, + "source": [ + "Let's use the `help()` function on the `str` data type and see what methods it has. " + ] + }, + { + "cell_type": "code", + "execution_count": 100, + "id": "bb8c904b-860b-4f2e-b983-9475a2f7fa66", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Help on class str in module builtins:\n", + "\n", + "class str(object)\n", + " | str(object='') -> str\n", + " | str(bytes_or_buffer[, encoding[, errors]]) -> str\n", + " | \n", + " | Create a new string object from the given object. 
If encoding or\n", + " | errors is specified, then the object must expose a data buffer\n", + " | that will be decoded using the given encoding and error handler.\n", + " | Otherwise, returns the result of object.__str__() (if defined)\n", + " | or repr(object).\n", + " | encoding defaults to sys.getdefaultencoding().\n", + " | errors defaults to 'strict'.\n", + " | \n", + " | Methods defined here:\n", + " | \n", + " | __add__(self, value, /)\n", + " | Return self+value.\n", + " | \n", + " | __contains__(self, key, /)\n", + " | Return key in self.\n", + " | \n", + " | __eq__(self, value, /)\n", + " | Return self==value.\n", + " | \n", + " | __format__(self, format_spec, /)\n", + " | Return a formatted version of the string as described by format_spec.\n", + " | \n", + " | __ge__(self, value, /)\n", + " | Return self>=value.\n", + " | \n", + " | __getattribute__(self, name, /)\n", + " | Return getattr(self, name).\n", + " | \n", + " | __getitem__(self, key, /)\n", + " | Return self[key].\n", + " | \n", + " | __getnewargs__(...)\n", + " | \n", + " | __gt__(self, value, /)\n", + " | Return self>value.\n", + " | \n", + " | __hash__(self, /)\n", + " | Return hash(self).\n", + " | \n", + " | __iter__(self, /)\n", + " | Implement iter(self).\n", + " | \n", + " | __le__(self, value, /)\n", + " | Return self<=value.\n", + " | \n", + " | __len__(self, /)\n", + " | Return len(self).\n", + " | \n", + " | __lt__(self, value, /)\n", + " | Return self int\n", + " | \n", + " | Return the number of non-overlapping occurrences of substring sub in\n", + " | string S[start:end]. 
Optional arguments start and end are\n", + " | interpreted as in slice notation.\n", + " | \n", + " | encode(self, /, encoding='utf-8', errors='strict')\n", + " | Encode the string using the codec registered for encoding.\n", + " | \n", + " | encoding\n", + " | The encoding in which to encode the string.\n", + " | errors\n", + " | The error handling scheme to use for encoding errors.\n", + " | The default is 'strict' meaning that encoding errors raise a\n", + " | UnicodeEncodeError. Other possible values are 'ignore', 'replace' and\n", + " | 'xmlcharrefreplace' as well as any other name registered with\n", + " | codecs.register_error that can handle UnicodeEncodeErrors.\n", + " | \n", + " | endswith(...)\n", + " | S.endswith(suffix[, start[, end]]) -> bool\n", + " | \n", + " | Return True if S ends with the specified suffix, False otherwise.\n", + " | With optional start, test S beginning at that position.\n", + " | With optional end, stop comparing S at that position.\n", + " | suffix can also be a tuple of strings to try.\n", + " | \n", + " | expandtabs(self, /, tabsize=8)\n", + " | Return a copy where all tab characters are expanded using spaces.\n", + " | \n", + " | If tabsize is not given, a tab size of 8 characters is assumed.\n", + " | \n", + " | find(...)\n", + " | S.find(sub[, start[, end]]) -> int\n", + " | \n", + " | Return the lowest index in S where substring sub is found,\n", + " | such that sub is contained within S[start:end]. 
Optional\n", + " | arguments start and end are interpreted as in slice notation.\n", + " | \n", + " | Return -1 on failure.\n", + " | \n", + " | format(...)\n", + " | S.format(*args, **kwargs) -> str\n", + " | \n", + " | Return a formatted version of S, using substitutions from args and kwargs.\n", + " | The substitutions are identified by braces ('{' and '}').\n", + " | \n", + " | format_map(...)\n", + " | S.format_map(mapping) -> str\n", + " | \n", + " | Return a formatted version of S, using substitutions from mapping.\n", + " | The substitutions are identified by braces ('{' and '}').\n", + " | \n", + " | index(...)\n", + " | S.index(sub[, start[, end]]) -> int\n", + " | \n", + " | Return the lowest index in S where substring sub is found,\n", + " | such that sub is contained within S[start:end]. Optional\n", + " | arguments start and end are interpreted as in slice notation.\n", + " | \n", + " | Raises ValueError when the substring is not found.\n", + " | \n", + " | isalnum(self, /)\n", + " | Return True if the string is an alpha-numeric string, False otherwise.\n", + " | \n", + " | A string is alpha-numeric if all characters in the string are alpha-numeric and\n", + " | there is at least one character in the string.\n", + " | \n", + " | isalpha(self, /)\n", + " | Return True if the string is an alphabetic string, False otherwise.\n", + " | \n", + " | A string is alphabetic if all characters in the string are alphabetic and there\n", + " | is at least one character in the string.\n", + " | \n", + " | isascii(self, /)\n", + " | Return True if all characters in the string are ASCII, False otherwise.\n", + " | \n", + " | ASCII characters have code points in the range U+0000-U+007F.\n", + " | Empty string is ASCII too.\n", + " | \n", + " | isdecimal(self, /)\n", + " | Return True if the string is a decimal string, False otherwise.\n", + " | \n", + " | A string is a decimal string if all characters in the string are decimal and\n", + " | there is at least one 
character in the string.\n", + " | \n", + " | isdigit(self, /)\n", + " | Return True if the string is a digit string, False otherwise.\n", + " | \n", + " | A string is a digit string if all characters in the string are digits and there\n", + " | is at least one character in the string.\n", + " | \n", + " | isidentifier(self, /)\n", + " | Return True if the string is a valid Python identifier, False otherwise.\n", + " | \n", + " | Call keyword.iskeyword(s) to test whether string s is a reserved identifier,\n", + " | such as \"def\" or \"class\".\n", + " | \n", + " | islower(self, /)\n", + " | Return True if the string is a lowercase string, False otherwise.\n", + " | \n", + " | A string is lowercase if all cased characters in the string are lowercase and\n", + " | there is at least one cased character in the string.\n", + " | \n", + " | isnumeric(self, /)\n", + " | Return True if the string is a numeric string, False otherwise.\n", + " | \n", + " | A string is numeric if all characters in the string are numeric and there is at\n", + " | least one character in the string.\n", + " | \n", + " | isprintable(self, /)\n", + " | Return True if the string is printable, False otherwise.\n", + " | \n", + " | A string is printable if all of its characters are considered printable in\n", + " | repr() or if it is empty.\n", + " | \n", + " | isspace(self, /)\n", + " | Return True if the string is a whitespace string, False otherwise.\n", + " | \n", + " | A string is whitespace if all characters in the string are whitespace and there\n", + " | is at least one character in the string.\n", + " | \n", + " | istitle(self, /)\n", + " | Return True if the string is a title-cased string, False otherwise.\n", + " | \n", + " | In a title-cased string, upper- and title-case characters may only\n", + " | follow uncased characters and lowercase characters only cased ones.\n", + " | \n", + " | isupper(self, /)\n", + " | Return True if the string is an uppercase string, False otherwise.\n", + " 
| \n", + " | A string is uppercase if all cased characters in the string are uppercase and\n", + " | there is at least one cased character in the string.\n", + " | \n", + " | join(self, iterable, /)\n", + " | Concatenate any number of strings.\n", + " | \n", + " | The string whose method is called is inserted in between each given string.\n", + " | The result is returned as a new string.\n", + " | \n", + " | Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'\n", + " | \n", + " | ljust(self, width, fillchar=' ', /)\n", + " | Return a left-justified string of length width.\n", + " | \n", + " | Padding is done using the specified fill character (default is a space).\n", + " | \n", + " | lower(self, /)\n", + " | Return a copy of the string converted to lowercase.\n", + " | \n", + " | lstrip(self, chars=None, /)\n", + " | Return a copy of the string with leading whitespace removed.\n", + " | \n", + " | If chars is given and not None, remove characters in chars instead.\n", + " | \n", + " | partition(self, sep, /)\n", + " | Partition the string into three parts using the given separator.\n", + " | \n", + " | This will search for the separator in the string. If the separator is found,\n", + " | returns a 3-tuple containing the part before the separator, the separator\n", + " | itself, and the part after it.\n", + " | \n", + " | If the separator is not found, returns a 3-tuple containing the original string\n", + " | and two empty strings.\n", + " | \n", + " | removeprefix(self, prefix, /)\n", + " | Return a str with the given prefix string removed if present.\n", + " | \n", + " | If the string starts with the prefix string, return string[len(prefix):].\n", + " | Otherwise, return a copy of the original string.\n", + " | \n", + " | removesuffix(self, suffix, /)\n", + " | Return a str with the given suffix string removed if present.\n", + " | \n", + " | If the string ends with the suffix string and that suffix is not empty,\n", + " | return string[:-len(suffix)]. 
Otherwise, return a copy of the original\n", + " | string.\n", + " | \n", + " | replace(self, old, new, count=-1, /)\n", + " | Return a copy with all occurrences of substring old replaced by new.\n", + " | \n", + " | count\n", + " | Maximum number of occurrences to replace.\n", + " | -1 (the default value) means replace all occurrences.\n", + " | \n", + " | If the optional argument count is given, only the first count occurrences are\n", + " | replaced.\n", + " | \n", + " | rfind(...)\n", + " | S.rfind(sub[, start[, end]]) -> int\n", + " | \n", + " | Return the highest index in S where substring sub is found,\n", + " | such that sub is contained within S[start:end]. Optional\n", + " | arguments start and end are interpreted as in slice notation.\n", + " | \n", + " | Return -1 on failure.\n", + " | \n", + " | rindex(...)\n", + " | S.rindex(sub[, start[, end]]) -> int\n", + " | \n", + " | Return the highest index in S where substring sub is found,\n", + " | such that sub is contained within S[start:end]. Optional\n", + " | arguments start and end are interpreted as in slice notation.\n", + " | \n", + " | Raises ValueError when the substring is not found.\n", + " | \n", + " | rjust(self, width, fillchar=' ', /)\n", + " | Return a right-justified string of length width.\n", + " | \n", + " | Padding is done using the specified fill character (default is a space).\n", + " | \n", + " | rpartition(self, sep, /)\n", + " | Partition the string into three parts using the given separator.\n", + " | \n", + " | This will search for the separator in the string, starting at the end. 
If\n", + " | the separator is found, returns a 3-tuple containing the part before the\n", + " | separator, the separator itself, and the part after it.\n", + " | \n", + " | If the separator is not found, returns a 3-tuple containing two empty strings\n", + " | and the original string.\n", + " | \n", + " | rsplit(self, /, sep=None, maxsplit=-1)\n", + " | Return a list of the substrings in the string, using sep as the separator string.\n", + " | \n", + " | sep\n", + " | The separator used to split the string.\n", + " | \n", + " | When set to None (the default value), will split on any whitespace\n", + " | character (including \\\\n \\\\r \\\\t \\\\f and spaces) and will discard\n", + " | empty strings from the result.\n", + " | maxsplit\n", + " | Maximum number of splits (starting from the left).\n", + " | -1 (the default value) means no limit.\n", + " | \n", + " | Splitting starts at the end of the string and works to the front.\n", + " | \n", + " | rstrip(self, chars=None, /)\n", + " | Return a copy of the string with trailing whitespace removed.\n", + " | \n", + " | If chars is given and not None, remove characters in chars instead.\n", + " | \n", + " | split(self, /, sep=None, maxsplit=-1)\n", + " | Return a list of the substrings in the string, using sep as the separator string.\n", + " | \n", + " | sep\n", + " | The separator used to split the string.\n", + " | \n", + " | When set to None (the default value), will split on any whitespace\n", + " | character (including \\\\n \\\\r \\\\t \\\\f and spaces) and will discard\n", + " | empty strings from the result.\n", + " | maxsplit\n", + " | Maximum number of splits (starting from the left).\n", + " | -1 (the default value) means no limit.\n", + " | \n", + " | Note, str.split() is mainly useful for data that has been intentionally\n", + " | delimited. 
With natural text that includes punctuation, consider using\n", + " | the regular expression module.\n", + " | \n", + " | splitlines(self, /, keepends=False)\n", + " | Return a list of the lines in the string, breaking at line boundaries.\n", + " | \n", + " | Line breaks are not included in the resulting list unless keepends is given and\n", + " | true.\n", + " | \n", + " | startswith(...)\n", + " | S.startswith(prefix[, start[, end]]) -> bool\n", + " | \n", + " | Return True if S starts with the specified prefix, False otherwise.\n", + " | With optional start, test S beginning at that position.\n", + " | With optional end, stop comparing S at that position.\n", + " | prefix can also be a tuple of strings to try.\n", + " | \n", + " | strip(self, chars=None, /)\n", + " | Return a copy of the string with leading and trailing whitespace removed.\n", + " | \n", + " | If chars is given and not None, remove characters in chars instead.\n", + " | \n", + " | swapcase(self, /)\n", + " | Convert uppercase characters to lowercase and lowercase characters to uppercase.\n", + " | \n", + " | title(self, /)\n", + " | Return a version of the string where each word is titlecased.\n", + " | \n", + " | More specifically, words start with uppercased characters and all remaining\n", + " | cased characters have lower case.\n", + " | \n", + " | translate(self, table, /)\n", + " | Replace each character in the string using the given translation table.\n", + " | \n", + " | table\n", + " | Translation table, which must be a mapping of Unicode ordinals to\n", + " | Unicode ordinals, strings, or None.\n", + " | \n", + " | The table must implement lookup/indexing via __getitem__, for instance a\n", + " | dictionary or list. If this operation raises LookupError, the character is\n", + " | left untouched. 
Characters mapped to None are deleted.\n", + " | \n", + " | upper(self, /)\n", + " | Return a copy of the string converted to uppercase.\n", + " | \n", + " | zfill(self, width, /)\n", + " | Pad a numeric string with zeros on the left, to fill a field of the given width.\n", + " | \n", + " | The string is never truncated.\n", + " | \n", + " | ----------------------------------------------------------------------\n", + " | Static methods defined here:\n", + " | \n", + " | __new__(*args, **kwargs) from builtins.type\n", + " | Create and return a new object. See help(type) for accurate signature.\n", + " | \n", + " | maketrans(...)\n", + " | Return a translation table usable for str.translate().\n", + " | \n", + " | If there is only one argument, it must be a dictionary mapping Unicode\n", + " | ordinals (integers) or characters to Unicode ordinals, strings or None.\n", + " | Character keys will be then converted to ordinals.\n", + " | If there are two arguments, they must be strings of equal length, and\n", + " | in the resulting dictionary, each character in x will be mapped to the\n", + " | character at the same position in y. If there is a third argument, it\n", + " | must be a string, whose characters will be mapped to None in the result.\n", + "\n" + ] + } + ], + "source": [ + "help(str)" + ] + }, + { + "cell_type": "markdown", + "id": "8689d46b", + "metadata": {}, + "source": [ + "### Exercise 3: using functions" + ] + }, + { + "cell_type": "markdown", + "id": "01b1c440", + "metadata": {}, + "source": [ + "Write code that does the following:\n", + "1. Makes a list of the words from the bases string\n", + "2. Converts them all into uppercase\n", + "3. Reverses their order\n", + "4. (Bonus) Print the first letter from each word\n", + "\n", + "**Hint:** Use the `help()` function on `str` and `list` to see what methods are available to assist with the task. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 101, + "id": "2e21935c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['THYMINE', 'GUANINE', 'CYTOSINE', 'ADENINE']\n" + ] + } + ], + "source": [ + "bases = 'adenine cytosine guanine thymine'\n", + "# Your code here!\n", + "\n", + "base_list = bases.upper().split()\n", + "base_list.reverse()\n", + "print(base_list)" + ] + }, + { + "cell_type": "code", + "execution_count": 102, + "id": "931c4262-8554-45af-8048-09e1a368be9f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "T\n", + "G\n", + "C\n", + "A\n" + ] + } + ], + "source": [ + "for base in base_list:\n", + " print(base[0])" + ] + }, + { + "cell_type": "markdown", + "id": "48a7516c", + "metadata": {}, + "source": [ + "## Writing functions" + ] + }, + { + "cell_type": "markdown", + "id": "48f0ff95", + "metadata": {}, + "source": [ + "We've already written code that performs basic operations on DNA sequences: reverse complement and calculating GC content percentage. But that code only runs on our example `seq`, and it would be more useful if we could run it on any sequence\n", + "\n", + "To do that, we're going to write our own functions to do those. This is like making our own custom recipe for Python to cook!!!" + ] + }, + { + "cell_type": "markdown", + "id": "e9778cc9", + "metadata": {}, + "source": [ + "Let's write a short function\n", + "\n", + "- Start with `def`, a keyword meaning we define a function\n", + "- Give it a name, in this case **double**\n", + " - Use a descriptive name that indicates what the function does\n", + "- Open parenthesis and create some argument variables (more on that in a minute). 
\"What do you want me to double\"\n", + "- Close the parenthesis and type a colon (like loops or conditionals)\n", + "- Write indented code that does the work of our function\n", + "- Use the `return` keyword to return values from the function\n" + ] + }, + { + "cell_type": "code", + "execution_count": 103, + "id": "33bb52b6", + "metadata": {}, + "outputs": [], + "source": [ + "def double(x):\n", + " result = x * 2\n", + " return result" + ] + }, + { + "cell_type": "markdown", + "id": "7e231140", + "metadata": {}, + "source": [ + "Calling your own functions is the same as calling the built-in ones - just use the name and parenthesis" + ] + }, + { + "cell_type": "code", + "execution_count": 104, + "id": "22bbc7db", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "200\n" + ] + } + ], + "source": [ + "doubled_value = double(100)\n", + "print(doubled_value)" + ] + }, + { + "cell_type": "markdown", + "id": "cac83a69", + "metadata": {}, + "source": [ + "What if someone wants to know what your function does? Let's try using `help()`:" + ] + }, + { + "cell_type": "code", + "execution_count": 105, + "id": "2509a9fb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Help on function double in module __main__:\n", + "\n", + "double(x)\n", + "\n" + ] + } + ], + "source": [ + "help(double)" + ] + }, + { + "cell_type": "markdown", + "id": "ede23cae", + "metadata": {}, + "source": [ + "Notice that the message given does not actually say anything about what it does. You can write a description about your function to help others read and understand your code using what is called a **docstring**. This is a string that you add right after the `def` statement as the first line in the code block. 
Here is an example:" + ] + }, + { + "cell_type": "code", + "execution_count": 106, + "id": "68449b08", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Help on function double in module __main__:\n", + "\n", + "double(x)\n", + " Takes a value and doubles it\n", + "\n" + ] + } + ], + "source": [ + "def double(x):\n", + " \"Takes a value and doubles it\"\n", + " result = x * 2\n", + " return result\n", + "help(double)" + ] + }, + { + "cell_type": "markdown", + "id": "b26bebd5", + "metadata": {}, + "source": [ + "We will cover docstrings and how to format them in a later lesson. For now, just remember that you can (**and should**) write documentation to help yourself and others use functions you wrote. " + ] + }, + { + "cell_type": "markdown", + "id": "f0defb00", + "metadata": {}, + "source": [ + "### Arguments and return values\n", + "\n", + "Within our function we use the variable `x`, which we never assigned a value to in the usual way. It's a function **parameter**, meaning that it will be set to whatever we put inside the parentheses when we call the function. It's a placeholder for our **argument**. Like when you are reading a recipe, it might say chop up some lemon, but that lemon isn't real. The lemon in the recipe book is the parameter. When you are cooking, that's when the lemon is real. That lemon in your hand is the argument. \n", + "\n", + "While this is different from writing `x = 100` explicitly, it's actually a great feature. It's the reason we can reuse functions and not have them interfere with each other.\n", + "\n", + "Every time we call `double`, that function gets its own private `x` to use. If we have a variable called `x` in other places, it won't replace or conflict with the one inside our function. That's great, because I use simple short names all the time!\n", + "\n", + "We also have this new keyword, `return`. Most functions will use this, where you want to do some work or computation and return a result. 
In this example, you'd expect a function with an action name like `double` to do some work and give you the result." + ] + }, + { + "cell_type": "markdown", + "id": "9d6c5360", + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "source": [ + "## Reverse-complementing\n", + "\n", + "We have two operations that are perfect to turn into functions, so that we can use them independently or together." + ] + }, + { + "cell_type": "markdown", + "id": "7e2e855a", + "metadata": {}, + "source": [ + "To write a function that reverses a sequence, we want to take an input value (the sequence), reverse it, and return the result." + ] + }, + { + "cell_type": "code", + "execution_count": 107, + "id": "075eae0b", + "metadata": {}, + "outputs": [], + "source": [ + "def reverse(seq):\n", + " \"Returns the reverse of a sequence\"\n", + " rev = ''\n", + " for s in seq:\n", + " rev = s + rev\n", + " return rev\n" + ] + }, + { + "cell_type": "code", + "execution_count": 108, + "id": "c316d6d6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "GTCA\n" + ] + } + ], + "source": [ + "print(reverse('ACTG'))" + ] + }, + { + "cell_type": "code", + "execution_count": 109, + "id": "bd2bec5f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ACCTGCATGC\n", + "CGTACGTCCA\n" + ] + } + ], + "source": [ + "print(seq)\n", + "print(reverse(seq))" + ] + }, + { + "cell_type": "markdown", + "id": "1eda227f", + "metadata": {}, + "source": [ + "To test that the function works, we could:\n", + "- see if the output is correct ourselves\n", + "- test that double-reversing gives us the original sequence" + ] + }, + { + "cell_type": "code", + "execution_count": 110, + "id": "5b9007bf", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'ACCTGCATGC'" + ] + }, + "execution_count": 110, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + 
"reverse(reverse(seq))" + ] + }, + { + "cell_type": "code", + "execution_count": 111, + "id": "df971824", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 111, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "reverse(reverse(seq)) == seq" + ] + }, + { + "cell_type": "markdown", + "id": "822b8af6", + "metadata": {}, + "source": [ + "Now we can move onto complementing. So let's pull up that code and move it into a function. Let's again tidy up our variable names" + ] + }, + { + "cell_type": "code", + "execution_count": 112, + "id": "4cb8f5cf", + "metadata": {}, + "outputs": [], + "source": [ + "def complement(seq):\n", + " \"Switches all A and T characters. Switches all G and C characters\"\n", + " comp = ''\n", + " for s in seq:\n", + " c = comps[s]\n", + " comp = comp + c\n", + " return comp" + ] + }, + { + "cell_type": "code", + "execution_count": 113, + "id": "525bbfe8", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ACCTGCATGC\n", + "TGGACGTACG\n" + ] + } + ], + "source": [ + "print(seq)\n", + "print(complement(seq))" + ] + }, + { + "cell_type": "markdown", + "id": "541890d9", + "metadata": {}, + "source": [ + "That works, but there's still one piece that's odd. We have our **seq** variable that's private to this function and nobody else can access it. But we're depending on this **comps** dictionary, which we don't create or check.\n", + "\n", + "That's called a **global variable**, because it exists outside of a function and it's available for reading and writing everywhere. And it's dangerous because \n", + "\n", + "1. If we run this function without assigning `comps`, the function will fail\n", + "2. If we run it with `comps` set to something we don't expect, our function will give us the wrong answer\n", + "3. 
We can't look at this function and know what it's going to do, because we don't know what `comps` is or where it comes from\n" + ] + }, + { + "cell_type": "markdown", + "id": "820a3fc6", + "metadata": {}, + "source": [ + "Some things do need to be global variables, but generally we should avoid them. So here, I'm going to create a **local variable** inside the function with my complements. This is good practice." + ] + }, + { + "cell_type": "code", + "execution_count": 114, + "id": "cba44b16", + "metadata": {}, + "outputs": [], + "source": [ + "def complement(seq):\n", + " \"Switches all A and T characters. Switches all G and C characters\"\n", + " comps = {'A': 'T', 'C': 'G', 'T': 'A', 'G': 'C'}\n", + " comp = ''\n", + " for s in seq:\n", + " c = comps[s]\n", + " comp = comp + c\n", + " return comp" + ] + }, + { + "cell_type": "code", + "execution_count": 115, + "id": "4e434c1b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'TGG'" + ] + }, + "execution_count": 115, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "complement('ACC')" + ] + }, + { + "cell_type": "code", + "execution_count": 116, + "id": "98454cef", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'TGG'" + ] + }, + "execution_count": 116, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Overwrite the global name; complement() now uses its own local comps\n", + "comps = 42\n", + "complement('ACC')" + ] + }, + { + "cell_type": "markdown", + "id": "09e21adf", + "metadata": {}, + "source": [ + "Now we've got solid functions for reverse and complement. 
We've tested them independently, so now let's put them together in another function." + ] + }, + { + "cell_type": "code", + "execution_count": 117, + "id": "9ff783ca", + "metadata": {}, + "outputs": [], + "source": [ + "def reverse_complement(seq):\n", + " \"Reverses the complement of a sequence\"\n", + " return reverse(complement(seq))\n" + ] + }, + { + "cell_type": "code", + "execution_count": 118, + "id": "ae673765", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'TTGG'" + ] + }, + "execution_count": 118, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "reverse_complement('CCAA')" + ] + }, + { + "cell_type": "markdown", + "id": "00b518e8-f0f3-42f8-ad6c-09e6ae0e89ec", + "metadata": {}, + "source": [ + "#### Recap + exercise in slides [Sticky check]" + ] + }, + { + "cell_type": "markdown", + "id": "66d056eb", + "metadata": {}, + "source": [ + "### Exercise 4: update the reverse function" + ] + }, + { + "cell_type": "markdown", + "id": "4864160e", + "metadata": {}, + "source": [ + "Update the `reverse()` function to use slicing instead of a loop. With slicing, this can be done by making the step argument -1. Recall that the format for slicing is `[start:end:step]`. \n", + "\n", + "Afterwards, run the `reverse_complement()` function on the given sequence. Do you have to make any changes to the `complement()` or `reverse_complement()` functions written before?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 119, + "id": "3dd7f7aa", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'TTGG'" + ] + }, + "execution_count": 119, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Update this!\n", + "def reverse(seq):\n", + " \"Returns the reverse of a sequence\"\n", + " rev = ''\n", + " for s in seq:\n", + " rev = s + rev\n", + " return rev\n", + "\n", + "# Write any additional code here (if needed)\n", + "def reverse(seq):\n", + " \"Returns the reverse of a sequence\"\n", + " return seq[::-1]\n", + "\n", + "# Does this behave the same after the update?\n", + "reverse_complement(\"CCAA\")" + ] + }, + { + "cell_type": "markdown", + "id": "594261df", + "metadata": {}, + "source": [ + "# Working with Files" + ] + }, + { + "cell_type": "markdown", + "id": "8f8d3bf3", + "metadata": {}, + "source": [ + "When working with real-world data, it will typically be in a file, and **not** in your code. Fortunately, Python has functions to read files. These work with simple text files, and if you need to handle images or other binary formats, there are libraries that can help with that.\n", + "\n", + "Sequence data is typically stored in text files, like `fasta`, which is a simple text format. So let's walk through reading one of those." + ] + }, + { + "cell_type": "markdown", + "id": "fc7d2129", + "metadata": {}, + "source": [ + "## Open and close\n", + "\n", + "When you use a word processor or spreadsheet, you open files, work with them, and then close them when you're done. In Python, you do the same thing." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 120, + "id": "b5d3ea30", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ">lcl|AE014075.1_gene_2 [locus_tag=c0002] [location=534..911]\n", + "\n", + "GTGTTCTACAGAGAGAAGCGTAGAGCAATAGGCTGTATTTTGAGAAAGCTGTGTGAGTGGAAAAGTGTAC\n", + "\n", + "GGATTCTGGAAGCTGAATGCTGTGCAGATCATATCCATATGCTTGTGGAGATCCCGCCCAAAATGAGCGT\n", + "\n", + "ATCAGGCTTTATGGGATATCTGAAAGGGAAAAGCAGTCTGATGCCTTACGAGCAGTTTGGTGATTTGAAA\n", + "\n", + "TTCAAATACAGGAACAGGGAGTTCTGGTGCAGAGGGTATTACGTCGATACGGTGGGTAAGAACACGGCGA\n", + "\n", + "\n", + "\n" + ] + } + ], + "source": [ + "f = open('ae.fa')\n", + "for line in f:\n", + " print(line)\n", + "f.close()" + ] + }, + { + "cell_type": "markdown", + "id": "6840f975", + "metadata": {}, + "source": [ + "Let's go through the steps we just did.\n", + "\n", + "1. We used the `open()` **function** on a string that represents a path to a file.\n", + " - The result of that function was saved to the variable `f`. This value is called a **file object**.\n", + "2. We wrote a for loop. When you write a for loop for a file object, each loop variable represents a line in the file. \n", + "3. We printed the loop variable for each loop. \n", + "4. We used the **method** `.close()` to close the file. " + ] + }, + { + "cell_type": "markdown", + "id": "7a455578", + "metadata": {}, + "source": [ + "## Reading lines\n", + "\n", + "When we work with text files, we can read them line by line in a `loop`. Each line of text from the file is set into our `loop variable`, and we print it out from the `loop`.\n", + "\n", + "You'll probably notice that we have a blank line in between each line from the file. Text files and programs indicate that there is a new line using a special newline character, `\\n`. 
In our previous example, each line in the file includes the `\\n` newline character at the end.\n", + "\n", + "Let's look into this by adding each line to a single `string`, `all_lines`, and compare printing the `string` vs the raw data." + ] + }, + { + "cell_type": "code", + "execution_count": 121, + "id": "a4afc818", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ">lcl|AE014075.1_gene_2 [locus_tag=c0002] [location=534..911]\n", + "GTGTTCTACAGAGAGAAGCGTAGAGCAATAGGCTGTATTTTGAGAAAGCTGTGTGAGTGGAAAAGTGTAC\n", + "GGATTCTGGAAGCTGAATGCTGTGCAGATCATATCCATATGCTTGTGGAGATCCCGCCCAAAATGAGCGT\n", + "ATCAGGCTTTATGGGATATCTGAAAGGGAAAAGCAGTCTGATGCCTTACGAGCAGTTTGGTGATTTGAAA\n", + "TTCAAATACAGGAACAGGGAGTTCTGGTGCAGAGGGTATTACGTCGATACGGTGGGTAAGAACACGGCGA\n", + "\n", + "\n" + ] + }, + { + "data": { + "text/plain": [ + "'>lcl|AE014075.1_gene_2 [locus_tag=c0002] [location=534..911]\\nGTGTTCTACAGAGAGAAGCGTAGAGCAATAGGCTGTATTTTGAGAAAGCTGTGTGAGTGGAAAAGTGTAC\\nGGATTCTGGAAGCTGAATGCTGTGCAGATCATATCCATATGCTTGTGGAGATCCCGCCCAAAATGAGCGT\\nATCAGGCTTTATGGGATATCTGAAAGGGAAAAGCAGTCTGATGCCTTACGAGCAGTTTGGTGATTTGAAA\\nTTCAAATACAGGAACAGGGAGTTCTGGTGCAGAGGGTATTACGTCGATACGGTGGGTAAGAACACGGCGA\\n\\n'" + ] + }, + "execution_count": 121, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "f = open('ae.fa')\n", + "all_lines = ''\n", + "for line in f:\n", + " all_lines = all_lines + line\n", + "f.close()\n", + "print(all_lines) # First output is the print result\n", + "all_lines # Second is what the string data looks like" + ] + }, + { + "cell_type": "markdown", + "id": "8c2a185b", + "metadata": {}, + "source": [ + "If we know we have the whole line, we can strip off the newline character with the `.strip()` method." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 122, + "id": "e5eefcf1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'There is a newline at the end\\n'" + ] + }, + "execution_count": 122, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "string_with_newline = \"There is a newline at the end\\n\"\n", + "string_with_newline" + ] + }, + { + "cell_type": "code", + "execution_count": 123, + "id": "c026a958", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'There is a newline at the end'" + ] + }, + "execution_count": 123, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "string_with_newline.strip()" + ] + }, + { + "cell_type": "markdown", + "id": "e08e0600", + "metadata": {}, + "source": [ + "Let's try this with our original code to print each line in a file." + ] + }, + { + "cell_type": "code", + "execution_count": 124, + "id": "b7f31192", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ">lcl|AE014075.1_gene_2 [locus_tag=c0002] [location=534..911]\n", + "GTGTTCTACAGAGAGAAGCGTAGAGCAATAGGCTGTATTTTGAGAAAGCTGTGTGAGTGGAAAAGTGTAC\n", + "GGATTCTGGAAGCTGAATGCTGTGCAGATCATATCCATATGCTTGTGGAGATCCCGCCCAAAATGAGCGT\n", + "ATCAGGCTTTATGGGATATCTGAAAGGGAAAAGCAGTCTGATGCCTTACGAGCAGTTTGGTGATTTGAAA\n", + "TTCAAATACAGGAACAGGGAGTTCTGGTGCAGAGGGTATTACGTCGATACGGTGGGTAAGAACACGGCGA\n", + "\n" + ] + } + ], + "source": [ + "f = open('ae.fa')\n", + "for line in f:\n", + " line = line.strip()\n", + " print(line)\n", + "f.close()" + ] + }, + { + "cell_type": "markdown", + "id": "68842b55-454e-4f95-976e-a34fcf31e8f0", + "metadata": {}, + "source": [ + "#### Recap and Exercise in slides [Sticky check]" + ] + }, + { + "cell_type": "markdown", + "id": "214b6c96", + "metadata": {}, + "source": [ + "### Exercise 5: reading a fasta file" + ] + }, + { + "cell_type": "markdown", + "id": "d92f9b10", + "metadata": {}, + "source": [ + "Write a function 
called `read_fasta(filename)` that takes in the input filename and returns a single string for all of the DNA sequences in the file. " + ] + }, + { + "cell_type": "code", + "execution_count": 125, + "id": "f440795a", + "metadata": {}, + "outputs": [], + "source": [ + "#Your code here!\n", + "def read_fasta(filename):\n", + " all_lines = ''\n", + " \n", + " f = open(filename)\n", + " for line in f:\n", + " if line[0] in ['A', 'T', 'C', 'G']:\n", + " all_lines = all_lines + line.strip()\n", + " return all_lines" + ] + }, + { + "cell_type": "code", + "execution_count": 126, + "id": "4ae34103", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "GTGTTCTACAGAGAGAAGCGTAGAGCAATAGGCTGTATTTTGAGAAAGCTGTGTGAGTGGAAAAGTGTACGGATTCTGGAAGCTGAATGCTGTGCAGATCATATCCATATGCTTGTGGAGATCCCGCCCAAAATGAGCGTATCAGGCTTTATGGGATATCTGAAAGGGAAAAGCAGTCTGATGCCTTACGAGCAGTTTGGTGATTTGAAATTCAAATACAGGAACAGGGAGTTCTGGTGCAGAGGGTATTACGTCGATACGGTGGGTAAGAACACGGCGA\n" + ] + } + ], + "source": [ + "print(read_fasta('ae.fa'))" + ] + }, + { + "cell_type": "code", + "execution_count": 131, + "id": "e71e3fc3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TCGCCGTGTTCTTACCCACCGTATCGACGTAATACCCTCTGCACCAGAACTCCCTGTTCCTGTATTTGAATTTCAAATCACCAAACTGCTCGTAAGGCATCAGACTGCTTTTCCCTTTCAGATATCCCATAAAGCCTGATACGCTCATTTTGGGCGGGATCTCCACAAGCATATGGATATGATCTGCACAGCATTCAGCTTCCAGAATCCGTACACTTTTCCACTCACACAGCTTTCTCAAAATACAGCCTATTGCTCTACGCTTCTCTCTGTAGAACAC\n" + ] + } + ], + "source": [ + "print(reverse_complement(read_fasta('ae.fa')))" + ] + }, + { + "cell_type": "markdown", + "id": "51ad96b0", + "metadata": {}, + "source": [ + "## Scripts" + ] + }, + { + "cell_type": "markdown", + "id": "1961bd2d", + "metadata": {}, + "source": [ + "We've done a good job of organizing our code into functions here, but we've only been running them from this notebook. 
So next, we're going to take our code and put it in a script, starting with the `read_fasta` function." + ] + }, + { + "cell_type": "markdown", + "id": "7a289960", + "metadata": {}, + "source": [ + "`$ nano read_fasta.py`" + ] + }, + { + "cell_type": "markdown", + "id": "20a9788e", + "metadata": {}, + "source": [ + "Let's start with a script that reads the `ae.fa` file specifically and prints it. " + ] + }, + { + "cell_type": "markdown", + "id": "a89399f4", + "metadata": {}, + "source": [ + "Notice that the first line of the cell below starts with `%%writefile`, followed by a file name. This is a \"cell magic\" command, specific to Jupyter notebooks, that writes the contents of the cell to a file instead of running it." + ] + }, + { + "cell_type": "code", + "execution_count": 132, + "id": "b09350aa", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Overwriting read_fasta_v1.py\n" + ] + } + ], + "source": [ + "%%writefile read_fasta_v1.py\n", + "def read_fasta(filename):\n", + "    \"Reads a fasta file and returns all sequences concatenated\"\n", + "    sequence = ''\n", + "    f = open(filename)\n", + "    for line in f:\n", + "        line = line.strip()\n", + "        if '>' not in line:\n", + "            # Append to the last sequence\n", + "            sequence = sequence + line\n", + "    f.close()\n", + "    return sequence\n", + "\n", + "print(read_fasta('ae.fa'))\n" + ] + }, + { + "cell_type": "markdown", + "id": "ee78eda5", + "metadata": {}, + "source": [ + "Our script reads our `ae.fa` file every time we run it, but we know most programs don't work that way. The programs we used in bash expected a data file as an *argument*, and that's a good convention for programs we write too.\n", + "\n", + "In Python, our program can get these arguments, but first we have to import a module called `sys` from the standard library, a collection of modules that ships with Python but must be imported before use. 
The documentation for these modules is part of the Python documentation; for `sys`, see https://docs.python.org/3/library/sys.html\n", + "\n", + "Libraries are incredibly useful: there are libraries for working with numeric and scientific data, generating plots, fetching data from the web, working with image and document files, databases, and more. And of course, there's a library for getting things like your script's command-line arguments.\n", + "\n", + "So, let's change our `read_fasta.py` program slightly." + ] + }, + { + "cell_type": "code", + "execution_count": 129, + "id": "b57df7d9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Writing read_fasta_v2.py\n" + ] + } + ], + "source": [ + "%%writefile read_fasta_v2.py\n", + "import sys\n", + "\n", + "def read_fasta(filename):\n", + "    \"Reads a fasta file and returns all sequences concatenated\"\n", + "    sequence = ''\n", + "    f = open(filename)\n", + "    for line in f:\n", + "        line = line.strip()\n", + "        if '>' not in line:\n", + "            # Append to the last sequence\n", + "            sequence = sequence + line\n", + "    f.close()\n", + "    return sequence\n", + "\n", + "print(read_fasta(sys.argv[1]))\n" + ] + }, + { + "cell_type": "markdown", + "id": "5e8465cc", + "metadata": {}, + "source": [ + "But what happens if we don't have an input file name? According to the documentation, `sys.argv` is a list whose first item, `sys.argv[0]`, is the name of the script, and whose remaining items are the command-line arguments. If no argument is passed, `sys.argv` contains just the script name, so `sys.argv[1]` raises an `IndexError`."
+ ] + }, + { + "cell_type": "code", + "execution_count": 130, + "id": "9ebc6a4e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Writing read_fasta_v3.py\n" + ] + } + ], + "source": [ + "%%writefile read_fasta_v3.py\n", + "import sys\n", + "\n", + "def read_fasta(filename):\n", + "    \"Reads a fasta file and returns all sequences concatenated\"\n", + "    sequence = ''\n", + "    f = open(filename)\n", + "    for line in f:\n", + "        line = line.strip()\n", + "        if '>' not in line:\n", + "            # Append to the last sequence\n", + "            sequence = sequence + line\n", + "    f.close()\n", + "    return sequence\n", + "\n", + "if len(sys.argv) < 2:\n", + "    print('Usage:', sys.argv[0], '')\n", + "    sys.exit(1)\n", + "\n", + "print(read_fasta(sys.argv[1]))\n" + ] + }, + { + "cell_type": "markdown", + "id": "83854705", + "metadata": {}, + "source": [ + "### Summary slide in presentation" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}