diff --git a/courses/CODE_OF_CONDUCT.md b/courses/CODE_OF_CONDUCT.md index d8beac73..b5163fab 100644 --- a/courses/CODE_OF_CONDUCT.md +++ b/courses/CODE_OF_CONDUCT.md @@ -25,7 +25,7 @@ Note: Some LinkedIn-managed communities have codes of conduct that pre-date this We encourage all communities to resolve issues on their own whenever possible. This builds a broader and deeper understanding and ultimately a healthier interaction. In the event that an issue cannot be resolved locally, please feel free to report your concerns by contacting [oss@linkedin.com](mailto:oss@linkedin.com). -In your report please include: +In your report, please include: * Your contact information. * Names (real, usernames or pseudonyms) of any individuals involved. If there are additional witnesses, please include them as well. diff --git a/courses/CONTRIBUTING.md b/courses/CONTRIBUTING.md index 97bb92e0..5b2e4f1a 100644 --- a/courses/CONTRIBUTING.md +++ b/courses/CONTRIBUTING.md @@ -7,16 +7,16 @@ As a contributor, you represent that the content you submit is not plagiarised. ### Contributing Guidelines Ensure that you adhere to the following guidelines: -* Should be about principles and concepts that can be applied in any company or individual project. Do not focus on particular tools or tech stack(which usually change over time). +* Should be about principles and concepts that can be applied in any company or individual project. Do not focus on particular tools or tech stack (which usually change over time). * Adhere to the [Code of Conduct](/school-of-sre/CODE_OF_CONDUCT/). * Should be relevant to the roles and responsibilities of an SRE. -* Should be locally tested (see steps for testing) and well formatted. +* Should be locally tested (see steps for testing) and well-formatted. * It is good practice to open an issue first and discuss your changes before submitting a pull request. This way, you can incorporate ideas from others before you even start. ### Building and testing locally Run the following commands to build and view the site locally before opening a PR. -``` +```shell python3 -m venv .venv source .venv/bin/activate pip install -r requirements.txt diff --git a/courses/index.md b/courses/index.md index f1d474fc..df5ff091 100644 --- a/courses/index.md +++ b/courses/index.md @@ -2,16 +2,16 @@ -Site Reliability Engineers (SREs) sits at the intersection of software engineering and systems engineering. While there are potentially infinite permutations and combinations of how infrastructure and software components can be put together to achieve an objective, focusing on foundational skills allows SREs to work with complex systems and software, regardless of whether these systems are proprietary, 3rd party, open systems, run on cloud/on-prem infrastructure, etc. Particularly important is to gain a deep understanding of how these areas of systems and infrastructure relate to each other and interact with each other. The combination of software and systems engineering skills is rare and is generally built over time with exposure to a wide variety of infrastructure, systems, and software. +Site Reliability Engineers (SREs) sits at the intersection of software engineering and systems engineering. 
While there are potentially infinite permutations and combinations of how infrastructure and software components can be put together to achieve an objective, focusing on foundational skills allows SREs to work with complex systems and software, regardless of whether these systems are proprietary, 3rd party, open systems, run on cloud/on-prem infrastructure, etc. Particularly, it is important to gain a deep understanding of how these areas of systems and infrastructure relate to each other and interact with each other. The combination of software and systems engineering skills is rare and is generally built over time with exposure to a wide variety of infrastructure, systems, and software. SREs bring in engineering practices to keep the site up. Each distributed system is an agglomeration of many components. SREs validate business requirements, convert them to SLAs for each of the components that constitute the distributed system, monitor and measure adherence to SLAs, re-architect or scale out to mitigate or avoid SLA breaches, add these learnings as feedback to new systems or projects and thereby reduce operational toil. Hence SREs play a vital role right from the day 0 design of the system. -In early 2019, we started visiting campuses across India to recruit the best and brightest minds to make sure LinkedIn, and all the services that make up its complex technology stack are always available for everyone. This critical function at LinkedIn falls under the purview of the Site Engineering team and Site Reliability Engineers (SREs) who are Software Engineers, specialized in reliability. +In early 2019, we started visiting campuses across India to recruit the best and brightest minds to make sure LinkedIn and all the services that make up its complex technology stack are always available for everyone. This critical function at LinkedIn falls under the purview of the Site Engineering team and Site Reliability Engineers (SREs) who are Software Engineers, specialized in reliability. -As we continued on this journey we started getting a lot of questions from these campuses on what exactly the site reliability engineering role entails? And, how could someone learn the skills and the disciplines involved to become a successful site reliability engineer? Fast forward a few months, and a few of these campus students had joined LinkedIn either as interns or as full-time engineers to become a part of the Site Engineering team; we also had a few lateral hires who joined our organization who were not from a traditional SRE background. That's when a few of us got together and started to think about how we can onboard new graduate engineers to the Site Engineering team. +As we continued on this journey, we started getting a lot of questions from these campuses on what exactly the site reliability engineering role entails? And, how could someone learn the skills and the disciplines involved to become a successful site reliability engineer? Fast forward a few months, and a few of these campus students had joined LinkedIn either as interns or as full-time engineers to become a part of the Site Engineering team; we also had a few lateral hires who joined our organization who were not from a traditional SRE background. That's when a few of us got together and started to think about how we can onboard new graduate engineers to the Site Engineering team. There are very few resources out there guiding someone on the basic skill sets one has to acquire as a beginner SRE. 
Because of the lack of these resources, we felt that individuals have a tough time getting into open positions in the industry. We created the School Of SRE as a starting point for anyone wanting to build their career as an SRE. -In this course, we are focusing on building strong foundational skills. The course is structured in a way to provide more real life examples and how learning each of these topics can play an important role in day to day job responsibilities of an SRE. Currently we are covering the following topics under the School Of SRE: +In this course, we are focusing on building strong foundational skills. The course is structured in a way to provide more real life examples and how learning each of these topics can play an important role in day-to-day job responsibilities of an SRE. Currently, we are covering the following topics under the School Of SRE: - Level 101 - Fundamentals Series @@ -20,8 +20,8 @@ In this course, we are focusing on building strong foundational skills. The cour - [Linux Networking](https://linkedin.github.io/school-of-sre/level101/linux_networking/intro/) - [Python and Web](https://linkedin.github.io/school-of-sre/level101/python_web/intro/) - Data - - [Relational databases(MySQL)](https://linkedin.github.io/school-of-sre/level101/databases_sql/intro/) - - [NoSQL concepts](https://linkedin.github.io/school-of-sre/level101/databases_nosql/intro/) + - [Relational Databases (MySQL)](https://linkedin.github.io/school-of-sre/level101/databases_sql/intro/) + - [NoSQL Concepts](https://linkedin.github.io/school-of-sre/level101/databases_nosql/intro/) - [Big Data](https://linkedin.github.io/school-of-sre/level101/big_data/intro/) - [Systems Design](https://linkedin.github.io/school-of-sre/level101/systems_design/intro/) - [Metrics and Monitoring](https://linkedin.github.io/school-of-sre/level101/metrics_and_monitoring/introduction/) @@ -30,11 +30,11 @@ In this course, we are focusing on building strong foundational skills. The cour - Level 102 - [Linux Intermediate](https://linkedin.github.io/school-of-sre/level102/linux_intermediate/introduction/) - Linux Advanced - - [Containers and orchestration](https://linkedin.github.io/school-of-sre/level102/containerization_and_orchestration/intro/) + - [Containers and Orchestration](https://linkedin.github.io/school-of-sre/level102/containerization_and_orchestration/intro/) - [System Calls and Signals](https://linkedin.github.io/school-of-sre/level102/system_calls_and_signals/intro/) - [Networking](https://linkedin.github.io/school-of-sre/level102/networking/introduction/) - [System Design](https://linkedin.github.io/school-of-sre/level102/system_design/intro/) - - [System troubleshooting and performance improvements](https://linkedin.github.io/school-of-sre/level102/system_troubleshooting_and_performance/introduction/) + - [System Troubleshooting and Performance Improvements](https://linkedin.github.io/school-of-sre/level102/system_troubleshooting_and_performance/introduction/) - [Continuous Integration and Continuous Delivery](https://linkedin.github.io/school-of-sre/level102/continuous_integration_and_continuous_delivery/introduction/) We believe continuous learning will help in acquiring deeper knowledge and competencies in order to expand your skill sets, every module has added references that could be a guide for further learning. Our hope is that by going through these modules we should be able to build the essential skills required for a Site Reliability Engineer. 
diff --git a/courses/level101/big_data/evolution.md b/courses/level101/big_data/evolution.md index a232ae63..a7450016 100644 --- a/courses/level101/big_data/evolution.md +++ b/courses/level101/big_data/evolution.md @@ -6,7 +6,7 @@ 1. **HDFS** 1. The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. - 2. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. + 2. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large datasets. 3. HDFS is part of the [Apache Hadoop Core project](https://github.com/apache/hadoop). ![HDFS Architecture](images/hdfs_architecture.png) @@ -14,7 +14,7 @@ The main components of HDFS include: 1. NameNode: is the arbitrator and central repository of file namespace in the cluster. The NameNode executes the operations such as opening, closing, and renaming files and directories. 2. DataNode: manages the storage attached to the node on which it runs. It is responsible for serving all the read and writes requests. It performs operations on instructions on NameNode such as creation, deletion, and replications of blocks. - 3. Client: Responsible for getting the required metadata from the namenode and then communicating with the datanodes for reads and writes.


+ 3. Client: Responsible for getting the required metadata from the NameNode and then communicating with the DataNodes for reads and writes.
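To make the client flow above concrete, here is a small Python sketch that talks to HDFS over the WebHDFS REST interface: the client first asks the NameNode for metadata, and the NameNode then redirects the actual read to a DataNode. The host, the port (9870 is the usual NameNode HTTP port in Hadoop 3.x), and the file path are placeholders, and the cluster is assumed to have WebHDFS enabled with security switched off — this is a sketch of the interaction, not a reference client.

```python
# Minimal sketch of the HDFS client flow via WebHDFS.
# NAMENODE and PATH are placeholders; assumes WebHDFS is enabled.
import requests

NAMENODE = "http://namenode.example.com:9870"   # assumed NameNode HTTP address
PATH = "/data/sample.txt"                       # assumed HDFS path

# 1. Ask the NameNode for metadata about the file (length, block size, etc.).
meta = requests.get(f"{NAMENODE}/webhdfs/v1{PATH}", params={"op": "GETFILESTATUS"})
print(meta.json())

# 2. Open the file: the NameNode answers with a redirect to a DataNode that
#    holds the data, and the DataNode streams the bytes -- the same
#    metadata-from-NameNode, data-from-DataNode split described above.
data = requests.get(f"{NAMENODE}/webhdfs/v1{PATH}", params={"op": "OPEN"})
print(data.text[:200])
```

In practice the `hdfs dfs` command line or a native client library performs this flow for you; the REST calls are shown only to make the NameNode/DataNode roles visible.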


2. **YARN** YARN stands for “Yet Another Resource Negotiator“. It was introduced in Hadoop 2.0 to remove the bottleneck on Job Tracker which was present in Hadoop 1.0. YARN was described as a “Redesigned Resource Manager” at the time of its launching, but it has now evolved to be known as a large-scale distributed operating system used for Big Data processing. @@ -22,21 +22,21 @@ ![YARN Architecture](images/yarn_architecture.gif) The main components of YARN architecture include: - 1. Client: It submits map-reduce(MR) jobs to the resource manager. + 1. Client: It submits map-reduce (MR) jobs to the resource manager. 2. Resource Manager: It is the master daemon of YARN and is responsible for resource assignment and management among all the applications. Whenever it receives a processing request, it forwards it to the corresponding node manager and allocates resources for the completion of the request accordingly. It has two major components: 1. Scheduler: It performs scheduling based on the allocated application and available resources. It is a pure scheduler, which means that it does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as Capacity Scheduler and Fair Scheduler to partition the cluster resources. 2. Application manager: It is responsible for accepting the application and negotiating the first container from the resource manager. It also restarts the Application Manager container if a task fails. 3. Node Manager: It takes care of individual nodes on the Hadoop cluster and manages application and workflow and that particular node. Its primary job is to keep up with the Node Manager. It monitors resource usage, performs log management, and also kills a container based on directions from the resource manager. It is also responsible for creating the container process and starting it at the request of the Application master. - 4. Application Master: An application is a single job submitted to a framework. The application manager is responsible for negotiating resources with the resource manager, tracking the status, and monitoring the progress of a single application. The application master requests the container from the node manager by sending a Container Launch Context(CLC) which includes everything an application needs to run. Once the application is started, it sends the health report to the resource manager from time-to-time. - 5. Container: It is a collection of physical resources such as RAM, CPU cores, and disk on a single node. The containers are invoked by Container Launch Context(CLC) which is a record that contains information such as environment variables, security tokens, dependencies, etc.

+ 4. Application Master: An application is a single job submitted to a framework. The application master is responsible for negotiating resources with the resource manager, tracking the status, and monitoring the progress of a single application. The application master requests containers from the scheduler and launches them by sending the node manager a Container Launch Context (CLC), which includes everything an application needs to run. Once the application is started, it sends the health report to the resource manager from time to time. + 5. Container: It is a collection of physical resources such as RAM, CPU cores, and disk on a single node. The containers are invoked by the Container Launch Context (CLC), which is a record that contains information such as environment variables, security tokens, dependencies, etc.
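As a rough illustration of the Resource Manager's central role, the sketch below queries its REST API for cluster capacity and the applications it is tracking (each of which has its own Application Master negotiating containers). The hostname is a placeholder, 8088 is the usual ResourceManager web port, and an unsecured cluster is assumed.

```python
# Minimal sketch: querying the YARN ResourceManager REST API.
# RM address is a placeholder; assumes an unsecured cluster.
import requests

RM = "http://resourcemanager.example.com:8088"

# Cluster-wide resource view maintained by the Resource Manager.
metrics = requests.get(f"{RM}/ws/v1/cluster/metrics").json()
print(metrics["clusterMetrics"]["totalMB"], "MB available to the scheduler")

# Applications currently tracked by the Resource Manager.
apps = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()
for app in (apps.get("apps") or {}).get("app", []):
    print(app["id"], app["name"], app["state"])
```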

# MapReduce framework ![MapReduce Framework](images/map_reduce.jpg) -1. The term MapReduce represents two separate and distinct tasks Hadoop programs perform-Map Job and Reduce Job. Map jobs take data sets as input and process them to produce key-value pairs. Reduce job takes the output of the Map job i.e. the key-value pairs and aggregates them to produce desired results. -2. Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for distributed processing of large data sets on computing clusters. Mapreduce helps to split the input data set into a number of parts and run a program on all data parts parallel at once. +1. The term MapReduce represents two separate and distinct tasks Hadoop programs perform—Map Job and Reduce Job. Map jobs take datasets as input and process them to produce key-value pairs. Reduce job takes the output of the Map job i.e. the key-value pairs and aggregates them to produce desired results. +2. Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for distributed processing of large datasets on computing clusters. MapReduce helps to split the input dataset into a number of parts and run a program on all data parts parallel at once. 3. Please find the below Word count example demonstrating the usage of the MapReduce framework: ![Word Count Example](images/mapreduce_example.jpg) @@ -45,41 +45,42 @@ # Other tooling around Hadoop 1. [**Hive**](https://hive.apache.org/) - 1. Uses a language called HQL which is very SQL like. Gives non-programmers the ability to query and analyze data in Hadoop. Is basically an abstraction layer on top of map-reduce. + 1. Uses a language called HQL which is very SQL like. Gives non-programmers the ability to query and analyze data in Hadoop. Is basically an abstraction layer on top of map-reduce. 2. Ex. HQL query: - 1. _SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);_ + 1. `SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);` 3. In mysql: - 1. _SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name;_ + 1. `SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name;` 2. [**Pig**](https://pig.apache.org/) - 1. Uses a scripting language called Pig Latin, which is more workflow driven. Don't need to be an expert Java programmer but need a few coding skills. Is also an abstraction layer on top of map-reduce. + 1. Uses a scripting language called Pig Latin, which is more workflow driven. Don't need to be an expert Java programmer but need a few coding skills. Is also an abstraction layer on top of map-reduce. 2. Here is a quick question for you: - What is the output of running the pig queries in the right column against the data present in the left column in the below image? + What is the output of running the Pig queries in the right column against the data present in the left column in the below image? ![Pig Example](images/pig_example.png) Output: - ``` +

     7,Komal,Nayak,24,9848022334,trivendram
     8,Bharathi,Nambiayar,24,9848022333,Chennai
     5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
     6,Archana,Mishra,23,9848022335,Chennai
-    ```
+    
3. [**Spark**](https://spark.apache.org/) - 1. Spark provides primitives for in-memory cluster computing that allows user programs to load data into a cluster’s memory and query it repeatedly, making it well suited to machine learning algorithms. + 1. Spark provides primitives for in-memory cluster computing that allows user programs to load data into a cluster’s memory and query it repeatedly, making it well-suited to machine learning algorithms. 4. [**Presto**](https://prestodb.io/) 1. Presto is a high performance, distributed SQL query engine for Big Data. 2. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, Alluxio, MySQL, Cassandra, Kafka, and MongoDB. - 3. Example presto query: - ``` - use studentDB; - show tables; - SELECT roll_no, name FROM studentDB.studentDetails where section=’A’ limit 5; - ``` + 3. Example Presto query: +

+    USE studentDB;
+    SHOW TABLES;
    SELECT roll_no, name FROM studentDB.studentDetails WHERE section='A' LIMIT 5;
+    
+
# Data Serialisation and storage -1. In order to transport the data over the network or to store on some persistent storage, we use the process of translating data structures or objects state into binary or textual form. We call this process serialization.. -2. Avro data is stored in a container file (a .avro file) and its schema (the .avsc file) is stored with the data file. +1. In order to transport the data over the network or to store on some persistent storage, we use the process of translating data structures or objects state into binary or textual form. We call this process serialization. +2. Avro data is stored in a container file (a `.avro` file) and its schema (the `.avsc` file) is stored with the data file. 3. Apache Hive provides support to store a table as Avro and can also query data in this serialisation format. diff --git a/courses/level101/big_data/intro.md b/courses/level101/big_data/intro.md index 3e5beb5e..13fb3b94 100644 --- a/courses/level101/big_data/intro.md +++ b/courses/level101/big_data/intro.md @@ -16,18 +16,18 @@ Writing programs to draw analytics from data. ## Course Contents 1. [Overview of Big Data](https://linkedin.github.io/school-of-sre/level101/big_data/intro/#overview-of-big-data) -2. [Usage of Big Data techniques](https://linkedin.github.io/school-of-sre/level101/big_data/intro/#usage-of-big-data-techniques) +2. [Usage of Big Data Techniques](https://linkedin.github.io/school-of-sre/level101/big_data/intro/#usage-of-big-data-techniques) 3. [Evolution of Hadoop](https://linkedin.github.io/school-of-sre/level101/big_data/evolution/) -4. [Architecture of hadoop](https://linkedin.github.io/school-of-sre/level101/big_data/evolution/#architecture-of-hadoop) +4. [Architecture of Hadoop](https://linkedin.github.io/school-of-sre/level101/big_data/evolution/#architecture-of-hadoop) 1. HDFS 2. Yarn -5. [MapReduce framework](https://linkedin.github.io/school-of-sre/level101/big_data/evolution/#mapreduce-framework) -6. [Other tooling around hadoop](https://linkedin.github.io/school-of-sre/level101/big_data/evolution/#other-tooling-around-hadoop) +5. [MapReduce Framework](https://linkedin.github.io/school-of-sre/level101/big_data/evolution/#mapreduce-framework) +6. [Other Tooling Around Hadoop](https://linkedin.github.io/school-of-sre/level101/big_data/evolution/#other-tooling-around-hadoop) 1. Hive 2. Pig 3. Spark 4. Presto -7. [Data Serialisation and storage](https://linkedin.github.io/school-of-sre/level101/big_data/evolution/#data-serialisation-and-storage) +7. [Data Serialization and Storage](https://linkedin.github.io/school-of-sre/level101/big_data/evolution/#data-serialisation-and-storage) # Overview of Big Data @@ -50,7 +50,7 @@ Writing programs to draw analytics from data. 1. Take the example of the traffic lights problem. 1. There are more than 300,000 traffic lights in the US as of 2018. 2. Let us assume that we placed a device on each of them to collect metrics and send it to a central metrics collection system. - 3. If each of the IoT devices sends 10 events per minute, we have 300000x10x60x24 = 432x10^7 events per day. + 3. If each of the IoT devices sends 10 events per minute, we have `300000 x 10 x 60 x 24 = 432 x 10 ^ 7` events per day. 4. How would you go about processing that and telling me how many of the signals were “green” at 10:45 am on a particular day? 2. Consider the next example on Unified Payments Interface (UPI) transactions: 1. We had about 1.15 billion UPI transactions in the month of October 2019 in India. 
diff --git a/courses/level101/big_data/tasks.md b/courses/level101/big_data/tasks.md index fc2ec5ec..e7a416c0 100644 --- a/courses/level101/big_data/tasks.md +++ b/courses/level101/big_data/tasks.md @@ -2,9 +2,9 @@ ## Post-training tasks: -1. Try setting up your own 3 node Hadoop cluster. - 1. A VM based solution can be found [here](http://hortonworks.com/wp-content/uploads/2015/04/Import_on_VBox_4_07_2015.pdf) -2. Write a simple spark/MR job of your choice and understand how to generate analytics from data. +1. Try setting up your own three-node Hadoop cluster. + 1. A VM-based solution can be found [here](http://hortonworks.com/wp-content/uploads/2015/04/Import_on_VBox_4_07_2015.pdf) +2. Write a simple Spark/MR job of your choice and understand how to generate analytics from data. 1. Sample dataset can be found [here](https://grouplens.org/datasets/movielens/) ## References: diff --git a/courses/level101/databases_nosql/intro.md b/courses/level101/databases_nosql/intro.md index b91ec204..3d2aca16 100644 --- a/courses/level101/databases_nosql/intro.md +++ b/courses/level101/databases_nosql/intro.md @@ -5,12 +5,12 @@ ## What to expect from this course -At the end of training, you will have an understanding of what a NoSQL database is, what kind of advantages or disadvantages it has over traditional RDBMS, learn about different types of NoSQL databases and understand some of the underlying concepts & trade offs w.r.t to NoSQL. +At the end of training, you will have an understanding of what a NoSQL database is, what kind of advantages or disadvantages it has over traditional RDBMS, learn about different types of NoSQL databases and understand some of the underlying concepts & trade-offs w.r.t to NoSQL. ## What is not covered under this course -We will not be deep diving into any specific NoSQL Database. +We will not be deep diving into any specific NoSQL database. ## Course Contents @@ -36,19 +36,17 @@ Such databases have existed since the late 1960s, but the name "NoSQL" was only ### Types of NoSQL databases: -Over time due to the way these NoSQL databases were developed to suit requirements at different companies, we ended up with quite a few types of them. However, they can be broadly classified into 4 types. Some of the databases can overlap between different types. They are +Over time due to the way these NoSQL databases were developed to suit requirements at different companies, we ended up with quite a few types of them. However, they can be broadly classified into 4 types. Some of the databases can overlap between different types. They are: +1. **Document databases:** They store data in documents similar to [JSON](https://www.json.org/json-en.html) (JavaScript Object Notation) objects. Each document contains pairs of fields and values. The values can typically be a variety of types including things like strings, numbers, booleans, arrays, or objects, and their structures typically align with objects developers are working with in code. The advantages include intuitive data model & flexible schemas. Because of their variety of field value types and powerful query languages, document databases are great for a wide variety of use cases and can be used as a general purpose database. They can horizontally scale-out to accomodate large data volumes. Ex: MongoDB, Couchbase +2. **Key-Value databases:** These are a simpler type of databases where each item contains keys and values. 
A value can typically only be retrieved by referencing its key, so learning how to query for a specific key-value pair is typically simple. Key-value databases are great for use cases where you need to store large amounts of data but you don’t need to perform complex queries to retrieve it. Common use cases include storing user preferences or caching. Ex: [Redis](https://redis.io/), [DynamoDB](https://aws.amazon.com/dynamodb/), [Voldemort](https://www.project-voldemort.com/voldemort/)/[Venice](https://engineering.linkedin.com/blog/2017/04/building-venice--a-production-software-case-study) (Linkedin). -1. **Document databases:** They store data in documents similar to [JSON](https://www.json.org/json-en.html) (JavaScript Object Notation) objects. Each document contains pairs of fields and values. The values can typically be a variety of types including things like strings, numbers, booleans, arrays, or objects, and their structures typically align with objects developers are working with in code. The advantages include intuitive data model & flexible schemas. Because of their variety of field value types and powerful query languages, document databases are great for a wide variety of use cases and can be used as a general purpose database. They can horizontally scale-out to accomodate large data volumes. Ex: MongoDB, Couchbase -2. **Key-Value databases:** These are a simpler type of databases where each item contains keys and values. A value can typically only be retrieved by referencing its key, so learning how to query for a specific key-value pair is typically simple. Key-value databases are great for use cases where you need to store large amounts of data but you don’t need to perform complex queries to retrieve it. Common use cases include storing user preferences or caching. Ex: [Redis](https://redis.io/), [DynamoDB](https://aws.amazon.com/dynamodb/), [Voldemort](https://www.project-voldemort.com/voldemort/)/[Venice](https://engineering.linkedin.com/blog/2017/04/building-venice--a-production-software-case-study) (Linkedin), 3. **Wide-Column stores:** They store data in tables, rows, and dynamic columns. Wide-column stores provide a lot of flexibility over relational databases because each row is not required to have the same columns. Many consider wide-column stores to be two-dimensional key-value databases. Wide-column stores are great for when you need to store large amounts of data and you can predict what your query patterns will be. Wide-column stores are commonly used for storing Internet of Things data and user profile data. [Cassandra](https://cassandra.apache.org/) and [HBase](https://hbase.apache.org/) are two of the most popular wide-column stores. 4. **Graph Databases:** These databases store data in nodes and edges. Nodes typically store information about people, places, and things while edges store information about the relationships between the nodes. The underlying storage mechanism of graph databases can vary. Some depend on a relational engine and “store” the graph data in a table (although a table is a logical element, therefore this approach imposes another level of abstraction between the graph database, the graph database management system and the physical devices where the data is actually stored). Others use a key-value store or document-oriented database for storage, making them inherently NoSQL structures. 
Graph databases excel in use cases where you need to traverse relationships to look for patterns such as social networks, fraud detection, and recommendation engines. Ex: [Neo4j](https://neo4j.com/) - ### **Comparison** -
@@ -128,7 +126,6 @@ Over time due to the way these NoSQL databases were developed to suit requiremen The table below summarizes the main differences between SQL and NoSQL databases. - - @@ -196,19 +193,15 @@ The table below summarizes the main differences between SQL and NoSQL databases.
@@ -175,7 +172,7 @@ The table below summarizes the main differences between SQL and NoSQL databases. Supported Most do not support multi-record ACID transactions. However, some—like MongoDB—do. + Most do not support multi-record ACID transactions. However, some like MongoDB do.
- - ### Advantages - - * **Flexible Data Models** Most NoSQL systems feature flexible schemas. A flexible schema means you can easily modify your database schema to add or remove fields to support for evolving application requirements. This facilitates with continuous application development of new features without database operation overhead. * **Horizontal Scaling** - Most NoSQL systems allow you to scale horizontally, which means you can add in cheaper & commodity hardware, whenever you want to scale a system. On the other hand SQL systems generally scale Vertically (a more powerful server). NoSQL systems can also host huge data sets when compared to traditional SQL systems. + Most NoSQL systems allow you to scale horizontally, which means you can add in cheaper & commodity hardware, whenever you want to scale a system. On the other hand, SQL systems generally scale Vertically (a more powerful server). NoSQL systems can also host huge datasets when compared to traditional SQL systems. * **Fast Queries** @@ -216,4 +209,4 @@ The table below summarizes the main differences between SQL and NoSQL databases. * **Developer productivity** - NoSQL systems tend to map data based on the programming data structures. As a result developers need to perform fewer data transformations leading to increased productivity & fewer bugs. + NoSQL systems tend to map data based on the programming data structures. As a result, developers need to perform fewer data transformations leading to increased productivity & fewer bugs. diff --git a/courses/level101/databases_nosql/key_concepts.md b/courses/level101/databases_nosql/key_concepts.md index f321721c..35434a41 100644 --- a/courses/level101/databases_nosql/key_concepts.md +++ b/courses/level101/databases_nosql/key_concepts.md @@ -1,15 +1,10 @@ # Key Concepts -Lets looks at some of the key concepts when we talk about NoSQL or distributed systems - - -### CAP Theorem - - - -In a keynote titled “[Towards Robust Distributed Systems](https://sites.cs.ucsb.edu/~rich/class/cs293b-cloud/papers/Brewer_podc_keynote_2000.pdf)” at ACM’s PODC symposium in 2000 Eric Brewer came up with the so-called CAP-theorem which is widely adopted today by large web companies as well as in the NoSQL community. The CAP acronym stands for **C**onsistency, **A**vailability & **P**artition Tolerance. +Lets looks at some of the key concepts when we talk about NoSQL or distributed systems. +### CAP Theorem +In a keynote titled “[Towards Robust Distributed Systems](https://sites.cs.ucsb.edu/~rich/class/cs293b-cloud/papers/Brewer_podc_keynote_2000.pdf)” at ACM’s PODC symposium in 2000, Eric Brewer came up with the so-called CAP-theorem which is widely adopted today by large web companies as well as in the NoSQL community. The CAP acronym stands for **C**onsistency, **A**vailability & **P**artition Tolerance. * **Consistency** @@ -24,17 +19,15 @@ In a keynote titled “[Towards Robust Distributed Systems](https://sites.cs.ucs It is the ability of the system to continue operations in the event of a network partition. A network partition occurs when a failure causes two or more islands of networks where the systems can’t talk to each other across the islands temporarily or permanently. -Brewer alleges that one can at most choose two of these three characteristics in a shared-data system. The CAP-theorem states that a choice can only be made for two options out of consistency, availability and partition tolerance. 
A growing number of use cases in large scale applications tend to value reliability implying that availability & redundancy are more valuable than consistency. As a result these systems struggle to meet ACID properties. They attain this by loosening on the consistency requirement i.e Eventual Consistency. +Brewer alleges that one can at most choose two of these three characteristics in a shared-data system. The CAP-theorem states that a choice can only be made for two options out of consistency, availability and partition tolerance. A growing number of use cases in large scale applications tend to value reliability implying that availability & redundancy are more valuable than consistency. As a result these systems struggle to meet ACID properties. They attain this by loosening on the consistency requirement, i.e Eventual Consistency. -**Eventual Consistency **means that all readers will see writes, as time goes on: “In a steady state, the system will eventually return the last written value”. Clients therefore may face an inconsistent state of data as updates are in progress. For instance, in a replicated database updates may go to one node which replicates the latest version to all other nodes that contain a replica of the modified dataset so that the replica nodes eventually will have the latest version. +**Eventual Consistency** means that all readers will see writes, as time goes on: “In a steady state, the system will eventually return the last written value”. Clients therefore may face an inconsistent state of data as updates are in progress. For instance, in a replicated database updates may go to one node which replicates the latest version to all other nodes that contain a replica of the modified dataset so that the replica nodes eventually will have the latest version. NoSQL systems support different levels of eventual consistency models. For example: - - * **Read Your Own Writes Consistency** - Clients will see their updates immediately after they are written. The reads can hit nodes other than the one where it was written. However they might not see updates by other clients immediately. + Clients will see their updates immediately after they are written. The reads can hit nodes other than the one where it was written. However, they might not see updates by other clients immediately. * **Session Consistency** @@ -44,16 +37,12 @@ NoSQL systems support different levels of eventual consistency models. For examp A system provides causal consistency if the following condition holds: write operations that are related by potential causality are seen by each process of the system in order. Different processes may observe concurrent writes in different orders - - - Eventual consistency is useful if concurrent updates of the same partitions of data are unlikely and if clients do not immediately depend on reading updates issued by themselves or by other clients. Depending on what consistency model was chosen for the system (or parts of it), determines where the requests are routed, ex: replicas. **CAP alternatives illustration** -
Choice @@ -110,11 +99,9 @@ Web caching When data is distributed across nodes, it can be modified on different nodes at the same time (assuming strict consistency is enforced). Questions arise on conflict resolution for concurrent updates. Some of the popular conflict resolution mechanism are - - * **Timestamps** - This is the most obvious solution. You sort updates based on chronological order and choose the latest update. However this relies on clock synchronization across different parts of the infrastructure. This gets even more complicated when parts of systems are spread across different geographic locations. + This is the most obvious solution. You sort updates based on chronological order and choose the latest update. However, this relies on clock synchronization across different parts of the infrastructure. This gets even more complicated when parts of systems are spread across different geographic locations. * **Optimistic Locking** @@ -126,33 +113,24 @@ When data is distributed across nodes, it can be modified on different nodes at

- ![alt_text](images/vector_clocks.png "Vector Clocks") - - -

Vector clocks illustration
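The idea behind the illustration: each replica carries a map of per-node counters that travels with every update, and comparing two maps tells you whether one update causally follows another or whether they are concurrent and need conflict resolution. A minimal sketch, assuming a plain dict representation (node names and the merge policy are illustrative only):

```python
# Minimal vector clock sketch: each replica keeps {node_name: counter}.
def increment(clock, node):
    """Record a local update made on `node`."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def merge(a, b):
    """Combine two clocks, e.g. when replicas sync with each other."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def happens_before(a, b):
    """True if `a` causally precedes `b` (every counter in a <= b, clocks differ)."""
    return a != b and all(a.get(n, 0) <= b.get(n, 0) for n in set(a) | set(b))

v1 = increment({}, "node_a")   # write applied on node_a
v2 = increment(v1, "node_b")   # later write routed through node_b
v3 = increment(v1, "node_c")   # independent write routed through node_c

print(happens_before(v1, v2))                            # True: v2 descends from v1
print(happens_before(v2, v3), happens_before(v3, v2))    # False, False: concurrent -> conflict
```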

-Vector clocks have the following advantages over other conflict resolution mechanism - - +Vector clocks have the following advantages over other conflict resolution mechanism: 1. No dependency on synchronized clocks 2. No total ordering of revision nos required for casual reasoning No need to store and maintain multiple versions of the data on different nodes.** ** - ### Partitioning When the amount of data crosses the capacity of a single node, we need to think of splitting data, creating replicas for load balancing & disaster recovery. Depending on how dynamic the infrastructure is, we have a few approaches that we can take. - - 1. **Memory cached** - These are partitioned in-memory databases that are primarily used for transient data. These databases are generally used as a front for traditional RDBMS. Most frequently used data is replicated from a rdbms into a memory database to facilitate fast queries and to take the load off from backend DB’s. A very common example is memcached or couchbase. + These are partitioned in-memory databases that are primarily used for transient data. These databases are generally used as a front for traditional RDBMS. Most frequently used data is replicated from a RDBMS into a memory database to facilitate fast queries and to take the load off from backend DB’s. A very common example is Memcached or Couchbase. 2. **Clustering** @@ -160,7 +138,7 @@ When the amount of data crosses the capacity of a single node, we need to think 3. **Separating reads from writes** - In this method, you will have multiple replicas hosting the same data. The incoming writes are typically sent to a single node (Leader) or multiple nodes (multi-Leader), while the rest of the replicas (Follower) handle reads requests. The leader replicates writes asynchronously to all followers. However the write lag can’t be completely avoided. Sometimes a leader can crash before it replicates all the data to a follower. When this happens, a follower with the most consistent data can be turned into a leader. As you can realize now, it is hard to enforce full consistency in this model. You also need to consider the ratio of read vs write traffic. This model won’t make sense when writes are higher than reads. The replication methods can also vary widely. Some systems do a complete transfer of state periodically, while others use a delta state transfer approach. You could also transfer the state by transferring the operations in order. The followers can then apply the same operations as the leader to catch up. + In this method, you will have multiple replicas hosting the same data. The incoming writes are typically sent to a single node (Leader) or multiple nodes (multi-Leader), while the rest of the replicas (Follower) handle reads requests. The leader replicates writes asynchronously to all followers. However, the write lag can’t be completely avoided. Sometimes a leader can crash before it replicates all the data to a follower. When this happens, a follower with the most consistent data can be turned into a leader. As you can realize now, it is hard to enforce full consistency in this model. You also need to consider the ratio of read vs write traffic. This model won’t make sense when writes are higher than reads. The replication methods can also vary widely. Some systems do a complete transfer of state periodically, while others use a delta state transfer approach. You could also transfer the state by transferring the operations in order. 
The followers can then apply the same operations as the leader to catch up. 4. **Sharding** @@ -168,25 +146,20 @@ When the amount of data crosses the capacity of a single node, we need to think

- ![alt_text]( images/database_sharding.png "Sharding") -

Sharding example
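As a small sketch of approach 3 above (separating reads from writes), the toy store below routes writes to a leader and serves reads from followers. The class and replica names are made up, and replication is deliberately simplified; a real system ships changes asynchronously, so followers may briefly return stale values.

```python
import random

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

class LeaderFollowerStore:
    """Writes go to the leader; reads are spread across followers."""
    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = followers

    def write(self, key, value):
        self.leader.data[key] = value      # single write path
        self._replicate(key, value)        # in reality: asynchronous, may lag

    def _replicate(self, key, value):
        for follower in self.followers:
            follower.data[key] = value

    def read(self, key):
        replica = random.choice(self.followers)   # spread read load
        return replica.data.get(key)

store = LeaderFollowerStore(Replica("leader"), [Replica("f1"), Replica("f2")])
store.write("user:42", {"name": "Asha"})
print(store.read("user:42"))
```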

- ### Hashing A hash function is a function that maps one piece of data—typically describing some kind of object, often of arbitrary size—to another piece of data, typically an integer, known as _hash code_, or simply _hash_. In a partitioned database, it is important to consistently map a key to a server/replica. For ex: you can use a very simple hash as a modulo function. - _p = k mod n_ Where - p -> partition, @@ -195,14 +168,13 @@ Where n -> no of nodes -The downside of this simple hash is that, whenever the cluster topology changes, the data distribution also changes. When you are dealing with memory caches, it will be easy to distribute partitions around. Whenever a node joins/leaves a topology, partitions can reorder themselves, a cache miss can be re-populated from backend DB. However when you look at persistent data, it is not possible as the new node doesn’t have the data needed to serve it. This brings us to consistent hashing. - +The downside of this simple hash is that, whenever the cluster topology changes, the data distribution also changes. When you are dealing with memory caches, it will be easy to distribute partitions around. Whenever a node joins/leaves a topology, partitions can reorder themselves, a cache miss can be re-populated from backend DB. However, when you look at persistent data, it is not possible as the new node doesn’t have the data needed to serve it. This brings us to consistent hashing. #### Consistent Hashing Consistent hashing is a distributed hashing scheme that operates independently of the number of servers or objects in a distributed _hash table_ by assigning them a position on an abstract circle, or _hash ring_. This allows servers and objects to scale without affecting the overall system. -Say that our hash function h() generates a 32-bit integer. Then, to determine to which server we will send a key k, we find the server s whose hash h(s) is the smallest integer that is larger than h(k). To make the process simpler, we assume the table is circular, which means that if we cannot find a server with a hash larger than h(k), we wrap around and start looking from the beginning of the array. +Say that our hash function *h*() generates a 32-bit integer. Then, to determine to which server we will send a key *k*, we find the server *s* whose hash *h*(*s*) is the smallest integer that is larger than *h*(*k*). To make the process simpler, we assume the table is circular, which means that if we cannot find a server with a hash larger than *h*(*k*), we wrap around and start looking from the beginning of the array.

@@ -212,62 +184,50 @@ Say that our hash function h() generates a 32-bit integer. Then, to determine to

Consistent hashing illustration

-In consistent hashing when a server is removed or added then only the keys from that server are relocated. For example, if server S3 is removed then, all keys from server S3 will be moved to server S4 but keys stored on server S4 and S2 are not relocated. But there is one problem, when server S3 is removed then keys from S3 are not equally distributed among remaining servers S4 and S2. They are only assigned to server S4 which increases the load on server S4. - -To evenly distribute the load among servers when a server is added or removed, it creates a fixed number of replicas ( known as virtual nodes) of each server and distributes it along the circle. So instead of server labels S1, S2 and S3, we will have S10 S11…S19, S20 S21…S29 and S30 S31…S39. The factor for a number of replicas is also known as _weight_, depending on the situation. +In consistent hashing, when a server is removed or added, then only the keys from that server are relocated. For example, if server S3 is removed then, all keys from server S3 will be moved to server S4 but keys stored on server S4 and S2 are not relocated. But there is one problem, when server S3 is removed then keys from S3 are not equally distributed among remaining servers S4 and S2. They are only assigned to server S4 which increases the load on server S4. +To evenly distribute the load among servers when a server is added or removed, it creates a fixed number of replicas (known as virtual nodes) of each server and distributes it along the circle. So instead of server labels S1, S2 and S3, we will have S10,S11,…,S19, S20,S21,…,S29 and S30,S31,…,S39. The factor for a number of replicas is also known as _weight_, depending on the situation. - +All keys which are mapped to replicas Sij are stored on server Si. To find a key, we do the same thing, find the position of the key on the circle and then move forward until you find a server replica. If the server replica is Sij, then the key is stored in server Si. -All keys which are mapped to replicas Sij are stored on server Si. To find a key we do the same thing, find the position of the key on the circle and then move forward until you find a server replica. If the server replica is Sij then the key is stored in server Si. +Suppose server S3 is removed, then all S3 replicas with labels S30,S31,…,S39 must be removed. Now, the objects keys adjacent to S3X labels will be automatically re-assigned to S1X, S2X and S4X. All keys originally assigned to S1, S2 & S4 will not be moved. -Suppose server S3 is removed, then all S3 replicas with labels S30 S31 … S39 must be removed. Now the objects keys adjacent to S3X labels will be automatically re-assigned to S1X, S2X and S4X. All keys originally assigned to S1, S2 & S4 will not be moved. - -Similar things happen if we add a server. Suppose we want to add a server S5 as a replacement of S3 then we need to add labels S50 S51 … S59. In the ideal case, one-fourth of keys from S1, S2 and S4 will be reassigned to S5. +Similar things happen if we add a server. Suppose we want to add a server S5 as a replacement of S3, then we need to add labels S50,S51,…,S59. In the ideal case, one-fourth of keys from S1, S2 and S4 will be reassigned to S5. 
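A minimal sketch of the ring described above, including virtual nodes, so you can see that removing a server relocates only the keys it owned. The hash function, server names, and virtual-node count are illustrative choices under the assumptions of this section, not a reference implementation:

```python
import bisect
import hashlib

def h(key):
    """Stable 32-bit hash (the h() of the text); md5 is used only for illustration."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    """Each server is placed on the ring at several points (virtual nodes)."""
    def __init__(self, servers, vnodes=10):
        self.ring = sorted((h(f"{s}-{i}"), s) for s in servers for i in range(vnodes))

    def server_for(self, key):
        # Walk clockwise: first ring position at or after h(key), wrapping around.
        idx = bisect.bisect(self.ring, (h(key),)) % len(self.ring)
        return self.ring[idx][1]

ring_before = HashRing(["S1", "S2", "S3", "S4"])
ring_after = HashRing(["S1", "S2", "S4"])          # S3 removed

keys = [f"key-{i}" for i in range(1000)]
moved = sum(1 for k in keys if ring_before.server_for(k) != ring_after.server_for(k))
print(f"{moved} of {len(keys)} keys moved")        # roughly only the keys that lived on S3
```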
When applied to persistent storages, further issues arise: if a node has left the scene, data stored on this node becomes unavailable, unless it has been replicated to other nodes before; in the opposite case of a new node joining the others, adjacent nodes are no longer responsible for some pieces of data which they still store but not get asked for anymore as the corresponding objects are no longer hashed to them by requesting clients. In order to address this issue, a replication factor (r) can be introduced. Introducing replicas in a partitioning scheme—besides reliability benefits—also makes it possible to spread workload for read requests that can go to any physical node responsible for a requested piece of data. Scalability doesn’t work if the clients have to decide between multiple versions of the dataset, because they need to read from a quorum of servers which in turn reduces the efficiency of load balancing. - - - ### Quorum Quorum is the minimum number of nodes in a cluster that must be online and be able to communicate with each other. If any additional node failure occurs beyond this threshold, the cluster will stop running. - - - - -To attain a quorum, you need a majority of the nodes. Commonly it is (N/2 + 1), where N is the total no of nodes in the system. For ex, +To attain a quorum, you need a majority of the nodes. Commonly, it is (N/2 + 1), where _N_ is the total no of nodes in the system. For example, -In a 3 node cluster, you need 2 nodes for a majority, +- In a 3-node cluster, you need 2 nodes for a majority. -In a 5 node cluster, you need 3 nodes for a majority, +- In a 5-node cluster, you need 3 nodes for a majority. -In a 6 node cluster, you need 4 nodes for a majority. +- In a 6-node cluster, you need 4 nodes for a majority.
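The (N/2 + 1) majority rule is easy to sanity-check in code; a tiny helper using the node counts from the examples above:

```python
def quorum_size(total_nodes):
    """Minimum number of nodes that must stay connected: a strict majority."""
    return total_nodes // 2 + 1

# Matches the examples above: 3 -> 2, 5 -> 3, 6 -> 4.
for n in (3, 5, 6):
    print(f"{n}-node cluster -> quorum of {quorum_size(n)}")

# A partition keeps running only if it still holds a quorum.
print(3 >= quorum_size(5))   # the 3-node side of a 3/2 split survives: True
```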

![alt_text](images/Quorum.png "image_tooltip") -

Quorum example

- +

Quorum example

Network problems can cause communication failures among cluster nodes. One set of nodes might be able to communicate together across a functioning part of a network but not be able to communicate with a different set of nodes in another part of the network. This is known as split brain in cluster or cluster partitioning. Now the partition which has quorum is allowed to continue running the application. The other partitions are removed from the cluster. -Eg: In a 5 node cluster, consider what happens if nodes 1, 2, and 3 can communicate with each other but not with nodes 4 and 5. Nodes 1, 2, and 3 constitute a majority, and they continue running as a cluster. Nodes 4 and 5, being a minority, stop running as a cluster. If node 3 loses communication with other nodes, all nodes stop running as a cluster. However, all functioning nodes will continue to listen for communication, so that when the network begins working again, the cluster can form and begin to run. +Eg: In a 5-node cluster, consider what happens if nodes 1, 2, and 3 can communicate with each other but not with nodes 4 and 5. Nodes 1, 2, and 3 constitute a majority, and they continue running as a cluster. Nodes 4 and 5, being a minority, stop running as a cluster. If node 3 loses communication with other nodes, all nodes stop running as a cluster. However, all functioning nodes will continue to listen for communication, so that when the network begins working again, the cluster can form and begin to run. Below diagram demonstrates Quorum selection on a cluster partitioned into two sets.

- ![alt_text](images/cluster_quorum.png "image_tooltip") **

Cluster Quorum example

** diff --git a/courses/level101/databases_sql/backup_recovery.md b/courses/level101/databases_sql/backup_recovery.md index 81867e40..26b660cc 100644 --- a/courses/level101/databases_sql/backup_recovery.md +++ b/courses/level101/databases_sql/backup_recovery.md @@ -1,5 +1,5 @@ ### Backup and Recovery -Backups are a very crucial part of any database setup. They are generally a copy of the data that can be used to reconstruct the data in case of any major or minor crisis with the database. In general terms backups can be of two types:- +Backups are a very crucial part of any database setup. They are generally a copy of the data that can be used to reconstruct the data in case of any major or minor crisis with the database. In general terms, backups can be of two types: - **Physical Backup** - the data directory as it is on the disk - **Logical Backup** - the table structure and records in it @@ -7,65 +7,87 @@ Backups are a very crucial part of any database setup. They are generally a copy Both the above kinds of backups are supported by MySQL with different tools. It is the job of the SRE to identify which should be used when. #### Mysqldump -This utility is available with MySQL installation. It helps in getting the logical backup of the database. It outputs a set of SQL statements to reconstruct the data. It is not recommended to use mysqldump for large tables as it might take a lot of time and the file size will be huge. However, for small tables it is the best and the quickest option. +This utility is available with MySQL installation. It helps in getting the logical backup of the database. It outputs a set of SQL statements to reconstruct the data. It is not recommended to use `mysqldump` for large tables as it might take a lot of time and the file size will be huge. However, for small tables it is the best and the quickest option. -`mysqldump [options] > dump_output.sql` +```shell +mysqldump [options] > dump_output.sql +``` -There are certain options that can be used with mysqldump to get an appropriate dump of the database. +There are certain options that can be used with `mysqldump` to get an appropriate dump of the database. -To dump all the databases +To dump all the databases: -`mysqldump -u -p --all-databases > all_dbs.sql` +```shell +mysqldump -u -p --all-databases > all_dbs.sql +``` -To dump specific databases +To dump specific databases: -`mysqldump -u -p --databases db1 db2 db3 > dbs.sql` +```shell +mysqldump -u -p --databases db1 db2 db3 > dbs.sql +``` -To dump a single database -`mysqldump -u -p --databases db1 > db1.sql` +To dump a single database: +```shell +mysqldump -u -p --databases db1 > db1.sql +``` OR +```shell +mysqldump -u -p db1 > db1.sql +``` -`mysqldump -u -p db1 > db1.sql` - -The difference between the above two commands is that the latter one does not contain the **CREATE DATABASE** command in the backup output. +The difference between the above two commands is that the latter one does not contain the `CREATE DATABASE` command in the backup output. 
-To dump specific tables in a database +To dump specific tables in a database: -`mysqldump -u -p db1 table1 table2 > db1_tables.sql` +```shell +mysqldump -u -p db1 table1 table2 > db1_tables.sql +``` -To dump only table structures and no data +To dump only table structures and no data: -`mysqldump -u -p --no-data db1 > db1_structure.sql` +```shell +mysqldump -u -p --no-data db1 > db1_structure.sql +``` -To dump only table data and no CREATE statements +To dump only table data and no `CREATE` statements: -`mysqldump -u -p --no-create-info db1 > db1_data.sql` +```shell +mysqldump -u -p --no-create-info db1 > db1_data.sql +``` -To dump only specific records from a table +To dump only specific records from a table: -`mysqldump -u -p --no-create-info db1 table1 --where=”salary>80000” > db1_table1_80000.sql` +```shell +mysqldump -u -p --no-create-info db1 table1 --where=”salary>80000” > db1_table1_80000.sql +``` -Mysqldump can also provide output in CSV, other delimited text or XML format to support use-cases if any. The backup from mysqldump utility is offline i.e. when the backup finishes it will not have the changes to the database which were made when the backup was going on. For example, if the backup started at 3 PM and finished at 4 PM, it will not have the changes made to the database between 3 and 4 PM. +`mysqldump` can also provide output in CSV, other delimited text or XML format to support use-cases if any. The backup from `mysqldump` utility is offline, i.e. when the backup finishes it will not have the changes to the database which were made when the backup was going on. For example, if the backup started at 3:00 pm and finished at 4:00 pm, it will not have the changes made to the database between 3:00 and 4:00 pm. -**Restoring** from mysqldump can be done in the following two ways:- +**Restoring** from `mysqldump` can be done in the following two ways: From shell -`mysql -u -p < all_dbs.sql` - +```shell +mysql -u -p < all_dbs.sql +``` OR -From shell if the database is already created +From shell, if the database is already created: -`mysql -u -p db1 < db1.sql` +```shell +mysql -u -p db1 < db1.sql +``` -From within MySQL shell +From within MySQL shell: -`mysql> source all_dbs.sql` +```shell +mysql> source all_dbs.sql +``` -#### Percona Xtrabackup -This utility is installed separately from the MySQL server and is open source, provided by Percona. It helps in getting the full or partial physical backup of the database. It provides online backup of the database i.e. it will have the changes made to the database when the backup was going on as explained at the end of the previous section. +#### Percona XtraBackup +This utility is installed separately from the MySQL server and is open source, provided by Percona. It helps in getting the full or partial physical backup of the database. It provides online backup of the database, i.e. it will have the changes made to the database when the backup was going on as explained at the end of the previous section. - **Full Backup** - the complete backup of the database. - **Partial Backup** - Incremental @@ -74,74 +96,90 @@ This utility is installed separately from the MySQL server and is open source, p ![partial backups - differential and cummulative](images/partial_backup.png "Differential and Cumulative Backups") -Percona xtrabackup allows us to get both full and incremental backups as we desire. 
However, incremental backups take less space than a full backup (if taken per day) but the restore time of incremental backups is more than that of full backups. +Percona XtraBackup allows us to get both full and incremental backups as we desire. However, incremental backups take less space than a full backup (if taken per day) but the restore time of incremental backups is more than that of full backups. **Creating a full backup** -`xtrabackup --defaults-file= --user= --password= --backup --target-dir=` +```shell +xtrabackup --defaults-file= --user= --password= --backup --target-dir= +``` -Example +Example: -`xtrabackup --defaults-file=/etc/my.cnf --user=some_user --password=XXXX --backup --target-dir=/mnt/data/backup/` +```shell +xtrabackup --defaults-file=/etc/my.cnf --user=some_user --password=XXXX --backup --target-dir=/mnt/data/backup/ +``` Some other options -- `--stream` - can be used to stream the backup files to standard output in a specified format. xbstream is the only option for now. -- `--tmp-dir` - set this to a tmp directory to be used for temporary files while taking backups. +- `--stream` - can be used to stream the backup files to standard output in a specified format. `xbstream` is the only option for now. +- `--tmp-dir` - set this to a `tmp` directory to be used for temporary files while taking backups. - `--parallel` - set this to the number of threads that can be used to parallely copy data files to target directory. -- `--compress` - by default - quicklz is used. Set this to have the backup in compressed format. Each file is a .qp compressed file and can be extracted by qpress file archiver. -- `--decompress` - decompresses all the files which were compressed with the .qp extension. It will not delete the .qp files after decompression. To do that, use `--remove-original` along with this. Please note that the decompress option should be run separately from the xtrabackup command that used the compress option. +- `--compress` - by default - `quicklz` is used. Set this to have the backup in compressed format. Each file is a `.qp` compressed file and can be extracted by `qpress` file archiver. +- `--decompress` - decompresses all the files which were compressed with the `.qp` extension. It will not delete the `.qp` files after decompression. To do that, use `--remove-original` along with this. Please note that the `decompress` option should be run separately from the `xtrabackup` command that used the compress option. **Preparing a backup** -Once the backup is done with the --backup option, we need to prepare it in order to restore it. This is done to make the datafiles consistent with point-in-time. There might have been some transactions going on while the backup was being executed and those have changed the data files. When we prepare a backup, all those transactions are applied to the data files. +Once the backup is done with the `--backup` option, we need to prepare it in order to restore it. This is done to make the data files consistent with point-in-time. There might have been some transactions going on while the backup was being executed and those have changed the data files. When we prepare a backup, all those transactions are applied to the data files. 
-`xtrabackup --prepare --target-dir=` +```shell +xtrabackup --prepare --target-dir= +``` -Example +Example: -`xtrabackup --prepare --target-dir=/mnt/data/backup/` +```shell +xtrabackup --prepare --target-dir=/mnt/data/backup/ +``` It is not recommended to halt a process which is preparing the backup as that might cause data file corruption and backup cannot be used further. The backup will have to be taken again. **Restoring a Full Backup** -To restore the backup which is created and prepared from above commands, just copy everything from the backup target-dir to the data-dir of MySQL server, change the ownership of all files to mysql user (the linux user used by MySQL server) and start mysql. +To restore the backup which is created and prepared from above commands, just copy everything from the backup `target-dir` to the `data-dir` of MySQL server, change the ownership of all files to MySQL user (the Linux user used by MySQL server) and start MySQL. Or the below command can be used as well, -`xtrabackup --defaults-file=/etc/my.cnf --copy-back --target-dir=/mnt/data/backups/` +```shell +xtrabackup --defaults-file=/etc/my.cnf --copy-back --target-dir=/mnt/data/backups/ +``` **Note** - the backup has to be prepared in order to restore it. **Creating Incremental backups** -Percona Xtrabackup helps create incremental backups, i.e only the changes can be backed up since the last backup. Every InnoDB page contains a log sequence number or LSN that is also mentioned as one of the last lines of backup and prepare commands. -``` +Percona XtraBackup helps create incremental backups, i.e, only the changes can be backed up since the last backup. Every InnoDB page contains a log sequence number or LSN that is also mentioned as one of the last lines of backup and prepare commands. + +```shell xtrabackup: Transaction log of lsn to was copied. ``` OR -``` +```shell InnoDB: Shutdown completed; log sequence number completed OK! ``` + This indicates that the backup has been taken till the log sequence number mentioned. This is a key information in understanding incremental backups and working towards automating one. Incremental backups do not compare data files for changes, instead, they go through the InnoDB pages and compare their LSN to the last backup’s LSN. So, without one full backup, the incremental backups are useless. -The xtrabackup command creates a xtrabackup_checkpoint file which has the information about the LSN of the backup. Below are the key contents of the file:- -``` +The `xtrabackup` command creates a `xtrabackup_checkpoint` file which has the information about the LSN of the backup. Below are the key contents of the file: + +```shell backup_type = full-backuped | incremental from_lsn = 0 (full backup) | to_lsn of last backup to_lsn = last_lsn = ``` -There is a difference between **to\_lsn** and **last\_lsn**. When the **last\_lsn** is more than **to\_lsn** that means there are transactions that ran while we took the backup and are yet to be applied. That is what --prepare is used for. +There is a difference between `to_lsn` and `last_lsn`. When the `last_lsn` is more than `to_lsn` that means there are transactions that ran while we took the backup and are yet to be applied. That is what `--prepare` is used for. To take incremental backups, first, we require one full backup. -`xtrabackup --defaults-file=/etc/my.cnf --user=some_user --password=XXXX --backup --target-dir=/mnt/data/backup/full/` - -Let’s assume the contents of the xtrabackup_checkpoint file to be as follows. 
+```shell +xtrabackup --defaults-file=/etc/my.cnf --user=some_user --password=XXXX --backup --target-dir=/mnt/data/backup/full/ ``` + +Let’s assume the contents of the `xtrabackup_checkpoint` file to be as follows: + +```shell backup_type = full-backuped from_lsn = 0 to_lsn = 1000 @@ -149,62 +187,80 @@ last_lsn = 1000 ``` Now that we have one full backup, we can have an incremental backup that takes the changes. We will go with differential incremental backups. -`xtrabackup --defaults-file=/etc/my.cnf --user=some_user --password=XXXX --backup --target-dir=/mnt/data/backup/incr1/ --incremental-basedir=/mnt/data/backup/full/` - -There are delta files created in the incr1 directory like, **ibdata1.delta**, **db1/tbl1.ibd.delta** with the changes from the full directory. The xtrabackup_checkpoint file will thus have the following contents. +```shell +xtrabackup --defaults-file=/etc/my.cnf --user=some_user --password=XXXX --backup --target-dir=/mnt/data/backup/incr1/ --incremental-basedir=/mnt/data/backup/full/ ``` + +There are delta files created in the `incr1` directory like, `ibdata1.delta`, `db1/tbl1.ibd.delta` with the changes from the full directory. The `xtrabackup_checkpoint` file will thus have the following contents. + +```shell backup_type = incremental from_lsn = 1000 to_lsn = 1500 last_lsn = 1500 ``` -Hence, the **from\_lsn** here is equal to the **to\_lsn** of the last backup or the basedir provided for the incremental backups. For the next incremental backup we can use this incremental backup as the basedir. -`xtrabackup --defaults-file=/etc/my.cnf --user=some_user --password=XXXX --backup --target-dir=/mnt/data/backup/incr2/ --incremental-basedir=/mnt/data/backup/incr1/` +Hence, the `from_lsn` here is equal to the `to_lsn` of the last backup or the `basedir` provided for the incremental backups. For the next incremental backup, we can use this incremental backup as the `basedir`. -The xtrabackup_checkpoint file will thus have the following contents. +```shell +xtrabackup --defaults-file=/etc/my.cnf --user=some_user --password=XXXX --backup --target-dir=/mnt/data/backup/incr2/ --incremental-basedir=/mnt/data/backup/incr1/ ``` + +The `xtrabackup_checkpoint` file will thus have the following contents: + +```shell backup_type = incremental from_lsn = 1500 to_lsn = 2000 last_lsn = 2200 ``` + **Preparing Incremental backups** -Preparing incremental backups is not the same as preparing a full backup. When prepare runs, two operations are performed - *committed transactions are applied on the data files* and *uncommitted transactions are rolled back*. While preparing incremental backups, we have to skip rollback of uncommitted transactions as it is likely that they might get committed in the next incremental backup. If we rollback uncommitted transactions the further incremental backups cannot be applied. +Preparing incremental backups is not the same as preparing a full backup. When prepare runs, two operations are performed - *committed transactions are applied on the data files* and *uncommitted transactions are rolled back*. While preparing incremental backups, we have to skip rollback of uncommitted transactions as it is likely that they might get committed in the next incremental backup. If we rollback uncommitted transactions, the further incremental backups cannot be applied. -We use **--apply-log-only** option along with **--prepare** to avoid the rollback phase. +We use `--apply-log-only` option along with `--prepare` to avoid the rollback phase. 
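Putting that rule together, the overall prepare sequence can be sketched as a small script before we walk through the individual commands below. This is only an illustrative sketch; the directory layout is the same example one used in this section:

```shell
#!/bin/bash
# Illustrative sketch: prepare a full backup plus differential incremental
# backups, using --apply-log-only for every step except the last one.

FULL=/mnt/data/backup/full
INCREMENTALS=(/mnt/data/backup/incr1 /mnt/data/backup/incr2)
LAST=$(( ${#INCREMENTALS[@]} - 1 ))

# Prepare the base backup without rolling back uncommitted transactions.
xtrabackup --prepare --apply-log-only --target-dir="${FULL}"

for i in "${!INCREMENTALS[@]}"; do
  if [ "${i}" -lt "${LAST}" ]; then
    # Intermediate incrementals: still skip the rollback phase.
    xtrabackup --prepare --apply-log-only --target-dir="${FULL}" --incremental-dir="${INCREMENTALS[i]}"
  else
    # Final incremental: no --apply-log-only, so the rollback happens here.
    xtrabackup --prepare --target-dir="${FULL}" --incremental-dir="${INCREMENTALS[i]}"
  fi
done
```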
-From the last section, we had the following directories with complete backup ``` +From the last section, we had the following directories with complete backup: + +```shell /mnt/data/backup/full /mnt/data/backup/incr1 /mnt/data/backup/incr2 ``` -First, we prepare the full backup, but only with the --apply-log-only option. -`xtrabackup --prepare --apply-log-only --target-dir=/mnt/data/backup/full` +First, we prepare the full backup, but only with the `--apply-log-only` option. -The output of the command will contain the following at the end. +```shell +xtrabackup --prepare --apply-log-only --target-dir=/mnt/data/backup/full ``` + +The output of the command will contain the following at the end. + +```shell InnoDB: Shutdown complete; log sequence number 1000 Completed OK! ``` -Note the LSN mentioned at the end is the same as the to\_lsn from the xtrabackup_checkpoint created for full backup. + +Note the LSN mentioned at the end is the same as the `to_lsn` from the `xtrabackup_checkpoint` created for the full backup. Next, we apply the changes from the first incremental backup to the full backup. -`xtrabackup --prepare --apply-log-only --target-dir=/mnt/data/backup/full --incremental-dir=/mnt/data/backup/incr1` +```shell +xtrabackup --prepare --apply-log-only --target-dir=/mnt/data/backup/full --incremental-dir=/mnt/data/backup/incr1 +``` This applies the delta files in the incremental directory to the full backup directory. It rolls the data files in the full backup directory forward to the time of incremental backup and applies the redo logs as usual. Lastly, we apply the last incremental backup same as the previous one with just a small change. -`xtrabackup --prepare --target-dir=/mnt/data/backup/full --incremental-dir=/mnt/data/backup/incr1` +```shell +xtrabackup --prepare --target-dir=/mnt/data/backup/full --incremental-dir=/mnt/data/backup/incr2 +``` -We do not have to use the **--apply-log-only** option with it. It applies the *incr2 delta files* to the full backup data files taking them forward, applies redo logs on them and finally rollbacks the uncommitted transactions to produce the final result. The data now present in the full backup directory can now be used to restore. +We do not have to use the `--apply-log-only` option with it. It applies the *incr2 delta files* to the full backup data files taking them forward, applies redo logs on them, and finally rolls back the uncommitted transactions to produce the final result. The data present in the full backup directory can now be used to restore. -**Note** - To create cumulative incremental backups, the incremental-basedir should always be the full backup directory for every incremental backup. While preparing, we can start with the full backup with the --apply-log-only option and use just the last incremental backup for the final --prepare as that has all the changes since the full backup. +**Note**: To create cumulative incremental backups, the `incremental-basedir` should always be the full backup directory for every incremental backup. While preparing, we can start with the full backup with the `--apply-log-only` option and use just the last incremental backup for the final `--prepare` as that has all the changes since the full backup.
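To make the note above concrete, here is a rough sketch of how the cumulative variant differs; the paths are the same illustrative ones used so far, and only the `--incremental-basedir` and the prepare steps change compared to the differential flow:

```shell
# Cumulative incremental backups: every incremental is taken against
# the full backup, not against the previous incremental.
xtrabackup --defaults-file=/etc/my.cnf --user=some_user --password=XXXX --backup \
  --target-dir=/mnt/data/backup/incr1/ --incremental-basedir=/mnt/data/backup/full/
xtrabackup --defaults-file=/etc/my.cnf --user=some_user --password=XXXX --backup \
  --target-dir=/mnt/data/backup/incr2/ --incremental-basedir=/mnt/data/backup/full/

# Preparing needs only the full backup and the latest incremental,
# since incr2 already contains all changes since the full backup.
xtrabackup --prepare --apply-log-only --target-dir=/mnt/data/backup/full
xtrabackup --prepare --target-dir=/mnt/data/backup/full --incremental-dir=/mnt/data/backup/incr2
```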
**Restoring Incremental backups** diff --git a/courses/level101/databases_sql/concepts.md b/courses/level101/databases_sql/concepts.md index f7321fa8..caadc759 100644 --- a/courses/level101/databases_sql/concepts.md +++ b/courses/level101/databases_sql/concepts.md @@ -28,44 +28,42 @@ A query language to interact with and manage data. - [CRUD operations](https://stackify.com/what-are-crud-operations/) - create, read, update, delete queries + [CRUD operations](https://stackify.com/what-are-crud-operations/)—create, read, update, delete queries - Management operations - create DBs/tables/indexes etc, backup, import/export, users, access controls + Management operations—create DBs/tables/indexes, backup, import/export, users, access controls, etc - Exercise: Classify the below queries into the four types - DDL (definition), DML(manipulation), DCL(control) and TCL(transactions) and explain in detail. + *Exercise*: Classify the below queries into the four types—DDL (definition), DML (manipulation), DCL (control) and TCL (transactions) and explain in detail. insert, create, drop, delete, update, commit, rollback, truncate, alter, grant, revoke You can practise these in the [lab section](https://linkedin.github.io/school-of-sre/level101/databases_sql/lab/). - - * Constraints Rules for data that can be stored. Query fails if you violate any of these defined on a table. + *Primary key*: One or more columns that contain UNIQUE values, and cannot contain NULL values. A table can have only ONE primary key. An index on it is created by default. + + *Foreign key*: Links two tables together. Its value(s) match a primary key in a different table - Primary key: one or more columns that contain UNIQUE values, and cannot contain NULL values. A table can have only ONE primary key. An index on it is created by default. + *Not null*: Does not allow null values - Foreign key: links two tables together. Its value(s) match a primary key in a different table \ - Not null: Does not allow null values \ - Unique: Value of column must be unique across all rows \ - Default: Provides a default value for a column if none is specified during insert + *Unique*: Value of column must be unique across all rows - Check: Allows only particular values (like Balance >= 0) + *Default*: Provides a default value for a column if none is specified during insert + *Check*: Allows only particular values (like Balance >= 0) * [Indexes](https://datageek.blog/en/2018/06/05/rdbms-basics-indexes-and-clustered-indexes/) Most indexes use B+ tree structure. - Why use them: Speeds up queries (in large tables that fetch only a few rows, min/max queries, by eliminating rows from consideration etc) + Why use them: Speeds up queries (in large tables that fetch only a few rows, min/max queries, by eliminating rows from consideration, etc) - Types of indexes: unique, primary key, fulltext, secondary - - Write-heavy loads, mostly full table scans or accessing large number of rows etc. do not benefit from indexes + *Types of indexes*: unique, primary key, fulltext, secondary + Write-heavy loads, mostly full table scans or accessing large number of rows, etc. do not benefit from indexes * [Joins](https://www.sqlservertutorial.net/sql-server-basics/sql-server-joins/) @@ -73,21 +71,20 @@ Allows you to fetch related data from multiple tables, linking them together with some common field. Powerful but also resource-intensive and makes scaling databases difficult. 
This is the cause of many slow performing queries when run at scale, and the solution is almost always to find ways to reduce the joins. - * [Access control](https://dev.mysql.com/doc/refman/8.0/en/access-control.html) - DBs have privileged accounts for admin tasks, and regular accounts for clients. There are finegrained controls on what actions(DDL, DML etc. discussed earlier )are allowed for these accounts. + DBs have privileged accounts for admin tasks, and regular accounts for clients. There are fine-grained controls on what actions (DDL, DML, etc. discussed earlier) are allowed for these accounts. DB first verifies the user credentials (authentication), and then examines whether this user is permitted to perform the request (authorization) by looking up these information in some internal tables. - Other controls include activity auditing that allows examining the history of actions done by a user, and resource limits which define the number of queries, connections etc. allowed. + Other controls include activity auditing that allows examining the history of actions done by a user, and resource limits which define the number of queries, connections, etc. allowed. ### Popular databases -Commercial, closed source - Oracle, Microsoft SQL Server, IBM DB2 +Commercial, closed source: Oracle, Microsoft SQL Server, IBM DB2 -Open source with optional paid support - MySQL, MariaDB, PostgreSQL +Open source with optional paid support: MySQL, MariaDB, PostgreSQL Individuals and small companies have always preferred open source DBs because of the huge cost associated with commercial software. diff --git a/courses/level101/databases_sql/conclusion.md b/courses/level101/databases_sql/conclusion.md index d9626aab..350f58bd 100644 --- a/courses/level101/databases_sql/conclusion.md +++ b/courses/level101/databases_sql/conclusion.md @@ -1,5 +1,5 @@ # Conclusion -We have covered basic concepts of SQL databases. We have also covered some of the tasks that an SRE may be responsible for - there is so much more to learn and do. We hope this course gives you a good start and inspires you to explore further. +We have covered basic concepts of SQL databases. We have also covered some of the tasks that an SRE may be responsible for—there is so much more to learn and do. We hope this course gives you a good start and inspires you to explore further. ### Further reading diff --git a/courses/level101/databases_sql/innodb.md b/courses/level101/databases_sql/innodb.md index a42ae9af..c6fed27a 100644 --- a/courses/level101/databases_sql/innodb.md +++ b/courses/level101/databases_sql/innodb.md @@ -1,6 +1,6 @@ ### Why should you use this? -General purpose, row level locking, ACID support, transactions, crash recovery and multi-version concurrency control etc. +General purpose, row level locking, ACID support, transactions, crash recovery and multi-version concurrency control, etc. ### Architecture @@ -11,7 +11,7 @@ General purpose, row level locking, ACID support, transactions, crash recovery a ### Key components: * Memory: - * Buffer pool: LRU cache of frequently used data(table and index) to be processed directly from memory, which speeds up processing. Important for tuning performance. + * Buffer pool: LRU cache of frequently used data (table and index) to be processed directly from memory, which speeds up processing. Important for tuning performance. * Change buffer: Caches changes to secondary index pages when those pages are not in the buffer pool and merges it when they are fetched. 
Merging may take a long time and impact live queries. It also takes up part of the buffer pool. Avoids the extra I/O to read secondary indexes in. * Adaptive hash index: Supplements InnoDB’s B-Tree indexes with fast hash lookup tables like a cache. Slight performance penalty for misses, also adds maintenance overhead of updating it. Hash collisions cause AHI rebuilding for large DBs. * Log buffer: Holds log data before flush to disk. diff --git a/courses/level101/databases_sql/intro.md b/courses/level101/databases_sql/intro.md index 7a6e980e..9ba46c75 100644 --- a/courses/level101/databases_sql/intro.md +++ b/courses/level101/databases_sql/intro.md @@ -8,14 +8,14 @@ You will have an understanding of what relational databases are, their advantages, and some MySQL specific concepts. ### What is not covered under this course -* In depth implementation details +* In-depth implementation details * Advanced topics like normalization, sharding * Specific tools for administration ### Introduction -The main purpose of database systems is to manage data. This includes storage, adding new data, deleting unused data, updating existing data, retrieving data within a reasonable response time, other maintenance tasks to keep the system running etc. +The main purpose of database systems is to manage data. This includes storage, adding new data, deleting unused data, updating existing data, retrieving data within a reasonable response time, other maintenance tasks to keep the system running, etc. ### Pre-reads [RDBMS Concepts](https://beginnersbook.com/2015/04/rdbms-concepts/) diff --git a/courses/level101/databases_sql/lab.md b/courses/level101/databases_sql/lab.md index c3293def..810c46e4 100644 --- a/courses/level101/databases_sql/lab.md +++ b/courses/level101/databases_sql/lab.md @@ -2,40 +2,36 @@ Install Docker - **Setup** -Create a working directory named sos or something similar, and cd into it. - -Enter the following into a file named my.cnf under a directory named custom. +Create a working directory named `sos` or something similar, and `cd` into it. +Enter the following into a file named `my.cnf` under a directory named `custom`: -``` +```shell sos $ cat custom/my.cnf [mysqld] + # These settings apply to MySQL server # You can set port, socket path, buffer size etc. 
# Below, we are configuring slow query settings + slow_query_log=1 slow_query_log_file=/var/log/mysqlslow.log long_query_time=1 ``` - Start a container and enable slow query log with the following: - -``` +```shell sos $ docker run --name db -v custom:/etc/mysql/conf.d -e MYSQL_ROOT_PASSWORD=realsecret -d mysql:8 sos $ docker cp custom/my.cnf $(docker ps -qf "name=db"):/etc/mysql/conf.d/custom.cnf sos $ docker restart $(docker ps -qf "name=db") ``` +Import a sample database: -Import a sample database - - -``` +```shell sos $ git clone git@github.com:datacharmer/test_db.git sos $ docker cp test_db $(docker ps -qf "name=db"):/home/test_db/ sos $ docker exec -it $(docker ps -qf "name=db") bash @@ -45,17 +41,18 @@ root@3ab5b18b0c7d:/etc# touch /var/log/mysqlslow.log root@3ab5b18b0c7d:/etc# chown mysql:mysql /var/log/mysqlslow.log ``` - _Workshop 1: Run some sample queries_ -Run the following -``` + +Run the following: + +```shell $ mysql -uroot -prealsecret mysql mysql> # inspect DBs and tables # the last 4 are MySQL internal DBs -mysql> show databases; +mysql> SHOW DATABASES; +--------------------+ | Database | +--------------------+ @@ -66,8 +63,8 @@ mysql> show databases; | sys | +--------------------+ -> use employees; -mysql> show tables; +mysql> USE employees; +mysql> SHOW TABLES; +----------------------+ | Tables_in_employees | +----------------------+ @@ -82,20 +79,22 @@ mysql> show tables; +----------------------+ # read a few rows -mysql> select * from employees limit 5; +mysql> SELECT * FROM employees LIMIT 5; # filter data by conditions -mysql> select count(*) from employees where gender = 'M' limit 5; +mysql> SELECT COUNT(*) FROM employees WHERE gender = 'M' LIMIT 5; # find count of particular data -mysql> select count(*) from employees where first_name = 'Sachin'; +mysql> SELECT COUNT(*) FROM employees WHERE first_name = 'Sachin'; ``` _Workshop 2: Use explain and explain analyze to profile a query, identify and add indexes required for improving performance_ -``` + +```shell # View all indexes on table -#(\G is to output horizontally, replace it with a ; to get table output) -mysql> show index from employees from employees\G +# (\G is to output horizontally, replace it with a ; to get table output) + +mysql> SHOW INDEX FROM employees FROM employees\G *************************** 1. row *************************** Table: employees Non_unique: 0 @@ -113,10 +112,11 @@ Index_comment: Visible: YES Expression: NULL -# This query uses an index, idenitfied by 'key' field +# This query uses an index, identified by 'key' field # By prefixing explain keyword to the command, # we get query plan (including key used) -mysql> explain select * from employees where emp_no < 10005\G + +mysql> EXPLAIN SELECT * FROM employees WHERE emp_no < 10005\G *************************** 1. row *************************** id: 1 select_type: SIMPLE @@ -132,7 +132,8 @@ possible_keys: PRIMARY Extra: Using where # Compare that to the next query which does not utilize any index -mysql> explain select first_name, last_name from employees where first_name = 'Sachin'\G + +mysql> EXPLAIN SELECT first_name, last_name FROM employees WHERE first_name = 'Sachin'\G *************************** 1. 
row *************************** id: 1 select_type: SIMPLE @@ -148,20 +149,22 @@ possible_keys: NULL Extra: Using where # Let's see how much time this query takes -mysql> explain analyze select first_name, last_name from employees where first_name = 'Sachin'\G + +mysql> EXPLAIN ANALYZE SELECT first_name, last_name FROM employees WHERE first_name = 'Sachin'\G *************************** 1. row *************************** EXPLAIN: -> Filter: (employees.first_name = 'Sachin') (cost=30143.55 rows=29911) (actual time=28.284..3952.428 rows=232 loops=1) -> Table scan on employees (cost=30143.55 rows=299113) (actual time=0.095..1996.092 rows=300024 loops=1) -# Cost(estimated by query planner) is 30143.55 +# Cost (estimated by query planner) is 30143.55 # actual time=28.284ms for first row, 3952.428 for all rows # Now lets try adding an index and running the query again -mysql> create index idx_firstname on employees(first_name); + +mysql> CREATE INDEX idx_firstname ON employees(first_name); Query OK, 0 rows affected (1.25 sec) Records: 0 Duplicates: 0 Warnings: 0 -mysql> explain analyze select first_name, last_name from employees where first_name = 'Sachin'; +mysql> EXPLAIN ANALYZE SELECT first_name, last_name FROM employees WHERE first_name = 'Sachin'; +--------------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +--------------------------------------------------------------------------------------------------------------------------------------------+ @@ -173,26 +176,31 @@ mysql> explain analyze select first_name, last_name from employees where first_n # Actual time=0.551ms for first row # 2.934ms for all rows. A huge improvement! # Also notice that the query involves only an index lookup, -# and no table scan (reading all rows of table) -# ..which vastly reduces load on the DB. +# and no table scan (reading all rows of the table), +# which vastly reduces load on the DB. ``` _Workshop 3: Identify slow queries on a MySQL server_ -``` + +```shell # Run the command below in two terminal tabs to open two shells into the container. -docker exec -it $(docker ps -qf "name=db") bash -# Open a mysql prompt in one of them and execute this command +$ docker exec -it $(docker ps -qf "name=db") bash + +# Open a `mysql` prompt in one of them and execute this command # We have configured to log queries that take longer than 1s, -# so this sleep(3) will be logged -mysql -uroot -prealsecret mysql +# so this `sleep(3)` will be logged + +$ mysql -uroot -prealsecret mysql mysql> select sleep(3); # Now, in the other terminal, tail the slow log to find details about the query + root@62c92c89234d:/etc# tail -f /var/log/mysqlslow.log /usr/sbin/mysqld, Version: 8.0.21 (MySQL Community Server - GPL). 
started with: Tcp port: 3306 Unix socket: /var/run/mysqld/mysqld.sock Time Id Command Argument + # Time: 2020-11-26T14:53:44.822348Z # User@Host: root[root] @ localhost [] Id: 9 # Query_time: 5.404938 Lock_time: 0.000000 Rows_sent: 1 Rows_examined: 1 @@ -200,6 +208,7 @@ use employees; # Time: 2020-11-26T14:53:58.015736Z # User@Host: root[root] @ localhost [] Id: 9 # Query_time: 10.000225 Lock_time: 0.000000 Rows_sent: 1 Rows_examined: 1 + SET timestamp=1606402428; select sleep(3); ``` diff --git a/courses/level101/databases_sql/mysql.md b/courses/level101/databases_sql/mysql.md index 6e127897..2f393ad2 100644 --- a/courses/level101/databases_sql/mysql.md +++ b/courses/level101/databases_sql/mysql.md @@ -6,25 +6,21 @@ MySQL architecture enables you to select the right storage engine for your needs Application layer: -* Connection handling - each client gets its own connection which is cached for the duration of access) -* Authentication - server checks (username,password,host) info of client and allows/rejects connection -* Security: server determines whether the client has privileges to execute each query (check with _show privileges_ command) +* Connection handling: each client gets its own connection which is cached for the duration of access +* Authentication: server checks (username, password, host) info of client and allows/rejects connection +* Security: server determines whether the client has privileges to execute each query (check with `SHOW PRIVILEGES` command) Server layer: - - -* Services and utilities - backup/restore, replication, cluster etc -* SQL interface - clients run queries for data access and manipulation -* SQL parser - creates a parse tree from the query (lexical/syntactic/semantic analysis and code generation) -* Optimizer - optimizes queries using various algorithms and data available to it(table level stats), modifies queries, order of scanning, indexes to use etc. (check with explain command) -* Caches and buffers - cache stores query results, buffer pool(InnoDB) stores table and index data in [LRU](https://en.wikipedia.org/wiki/Cache_replacement_policies#Least_recently_used_(LRU)) fashion +* Services and utilities: backup/restore, replication, cluster, etc +* SQL interface: clients run queries for data access and manipulation +* SQL parser: creates a parse tree from the query (lexical/syntactic/semantic analysis and code generation) +* Optimizer: optimizes queries using various algorithms and data available to it (table-level stats), modifies queries, order of scanning, indexes to use, etc. (check with `EXPLAIN` command) +* Caches and buffers: cache stores query results, buffer pool (InnoDB) stores table and index data in [LRU](https://en.wikipedia.org/wiki/Cache_replacement_policies#Least_recently_used_(LRU)) fashion Storage engine options: - - -* InnoDB: most widely used, transaction support, ACID compliant, supports row-level locking, crash recovery and multi-version concurrency control. Default since MySQL 5.5+. +* InnoDB: most-widely used, transaction support, ACID compliant, supports row-level locking, crash recovery and multi-version concurrency control. Default since MySQL 5.5+. * MyISAM: fast, does not support transactions, provides table-level locking, great for read-heavy workloads, mostly in web and data warehousing. Default upto MySQL 5.1. 
* Archive: optimised for high speed inserts, compresses data as it is inserted, does not support transactions, ideal for storing and retrieving large amounts of seldom referenced historical, archived data * Memory: tables in memory. Fastest engine, supports table-level locking, does not support transactions, ideal for creating temporary tables or quick lookups, data is lost after a shutdown @@ -35,4 +31,4 @@ It is possible to migrate from one storage engine to another. But this migration General guideline is to use InnoDB unless you have a specific need for one of the other storage engines. -Running `mysql> SHOW ENGINES; `shows you the supported engines on your MySQL server. \ No newline at end of file +Running `mysql> SHOW ENGINES;` shows you the supported engines on your MySQL server. \ No newline at end of file diff --git a/courses/level101/databases_sql/operations.md b/courses/level101/databases_sql/operations.md index 2a97f2bc..acc2a506 100644 --- a/courses/level101/databases_sql/operations.md +++ b/courses/level101/databases_sql/operations.md @@ -1,8 +1,8 @@ * Explain and explain+analyze - EXPLAIN <query> analyzes query plans from the optimizer, including how tables are joined, which tables/rows are scanned etc. + `EXPLAIN ` analyzes query plans from the optimizer, including how tables are joined, which tables/rows are scanned, etc. - Explain analyze shows the above and additional info like execution cost, number of rows returned, time taken etc. + `EXPLAIN ANALYZE` shows the above and additional info like execution cost, number of rows returned, time taken, etc. This knowledge is useful to tweak queries and add indexes. @@ -12,7 +12,7 @@ * [Slow query logs](https://dev.mysql.com/doc/refman/5.7/en/slow-query-log.html) - Used to identify slow queries (configurable threshold), enabled in config or dynamically with a query + Used to identify slow queries (configurable threshold), enabled in config or dynamically with a query. Checkout the [lab section](https://linkedin.github.io/school-of-sre/level101/databases_sql/lab/) about identifying slow queries. @@ -20,27 +20,23 @@ This includes creation and changes to users, like managing privileges, changing password etc. - - * Backup and restore strategies, pros and cons - Logical backup using mysqldump - slower but can be done online + - Logical backup using `mysqldump` - slower but can be done online - Physical backup (copy data directory or use xtrabackup) - quick backup/recovery. Copying data directory requires locking or shut down. xtrabackup is an improvement because it supports backups without shutting down (hot backup). - - Others - PITR, snapshots etc. + - Physical backup (copy data directory or use XtraBackup) - quick backup/recovery. Copying data directory requires locking or shut down. XtraBackup is an improvement because it supports backups without shutting down (hot backup). + - Others - PITR, snapshots etc. 
* Crash recovery process using redo logs - After a crash, when you restart server it reads redo logs and replays modifications to recover - + After a crash, when you restart server, it reads redo logs and replays modifications to recover * Monitoring MySQL - Key MySQL metrics: reads, writes, query runtime, errors, slow queries, connections, running threads, InnoDB metrics + - Key MySQL metrics: reads, writes, query runtime, errors, slow queries, connections, running threads, InnoDB metrics - Key OS metrics: CPU, load, memory, disk I/O, network + - Key OS metrics: CPU, load, memory, disk I/O, network * Replication @@ -49,7 +45,7 @@ * High Availability - Ability to cope with failure at software, hardware and network level. Essential for anyone who needs 99.9%+ uptime. Can be implemented with replication or clustering solutions from MySQL, Percona, Oracle etc. Requires expertise to setup and maintain. Failover can be manual, scripted or using tools like Orchestrator. + Ability to cope with failure at software, hardware and network level. Essential for anyone who needs 99.9%+ uptime. Can be implemented with replication or clustering solutions from MySQL, Percona, Oracle, etc. Requires expertise to setup and maintain. Failover can be manual, scripted or using tools like Orchestrator. * [Data directory](https://dev.mysql.com/doc/refman/8.0/en/data-directory.html) @@ -57,7 +53,7 @@ * [MySQL configuration](https://dev.mysql.com/doc/refman/5.7/en/server-configuration.html) - This can be done by passing [parameters during startup](https://dev.mysql.com/doc/refman/5.7/en/server-options.html), or in a [file](https://dev.mysql.com/doc/refman/8.0/en/option-files.html). There are a few [standard paths](https://dev.mysql.com/doc/refman/8.0/en/option-files.html#option-file-order) where MySQL looks for config files, `/etc/my.cnf` is one of the commonly used paths. These options are organized under headers (mysqld for server and mysql for client), you can explore them more in the lab that follows. + This can be done by passing [parameters during startup](https://dev.mysql.com/doc/refman/5.7/en/server-options.html), or in a [file](https://dev.mysql.com/doc/refman/8.0/en/option-files.html). There are a few [standard paths](https://dev.mysql.com/doc/refman/8.0/en/option-files.html#option-file-order) where MySQL looks for config files, `/etc/my.cnf` is one of the commonly used paths. These options are organized under headers (`mysqld` for server and `mysql` for client), you can explore them more in the lab that follows. * [Logs](https://dev.mysql.com/doc/refman/5.7/en/server-logs.html) diff --git a/courses/level101/databases_sql/query_performance.md b/courses/level101/databases_sql/query_performance.md index 5e22c46c..95acc96f 100644 --- a/courses/level101/databases_sql/query_performance.md +++ b/courses/level101/databases_sql/query_performance.md @@ -2,7 +2,7 @@ Query Performance is a very crucial aspect of relational databases. If not tuned correctly, the select queries can become slow and painful for the application, and for the MySQL server as well. The important task is to identify the slow queries and try to improve their performance by either rewriting them or creating proper indexes on the tables involved in it. #### The Slow Query Log -The slow query log contains SQL statements that take a longer time to execute then set in the config parameter long_query_time. These queries are the candidates for optimization. 
There are some good utilities to summarize the slow query logs like, mysqldumpslow (provided by MySQL itself), pt-query-digest (provided by Percona), etc. Following are the config parameters that are used to enable and effectively catch slow queries +The slow query log contains SQL statements that take a longer time to execute than set in the config parameter `long_query_time`. These queries are the candidates for optimization. There are some good utilities to summarize the slow query logs like, `mysqldumpslow` (provided by MySQL itself), `pt-query-digest` (provided by Percona), etc. Following are the config parameters that are used to enable and effectively catch slow queries | Variable | Explanation | Example value | | --- | --- | --- | @@ -11,32 +11,50 @@ The slow query log contains SQL statements that take a longer time to execute th | long_query_time | Threshold time. The query that takes longer than this time is logged in slow query log | 5 | | log_queries_not_using_indexes | When enabled with the slow query log, the queries which do not make use of any index are also logged in the slow query log even though they take less time than long_query_time. | ON | -So, for this section, we will be enabling **slow_query_log**, **long_query_time** will be kept to **0.3 (300 ms)**, and **log_queries_not_using** index will be enabled as well. +So, for this section, we will be enabling `slow_query_log`, `long_query_time` will be kept to **0.3 (300 ms)**, and `log_queries_not_using` index will be enabled as well. -Below are the queries that we will execute on the employees database. +Below are the queries that we will execute on the `employees` database. -1. select * from employees where last_name = 'Koblick'; -2. select * from salaries where salary >= 100000; -3. select * from titles where title = 'Manager'; -4. select * from employees where year(hire_date) = 1995; -5. select year(e.hire_date), max(s.salary) from employees e join salaries s on e.emp_no=s.emp_no group by year(e.hire_date); +1. + ``` + SELECT * FROM employees WHERE last_name = 'Koblick' + ``` +1. + ``` + SELECT * FROM salaries WHERE salary >= 100000 + ``` +1. + ``` + SELECT * FROM titles WHERE title = 'Manager' + ``` +1. + ``` + SELECT * FROM employees WHERE year(hire_date) = 1995 + ``` +1. + ``` + SELECT year(e.hire_date), max(s.salary) FROM employees e JOIN salaries s ON e.emp_no=s.emp_no GROUP BY year(e.hire_date) + ``` -Now, queries **1**, **3** and **4** executed under 300 ms but if we check the slow query logs, we will find these queries logged as they are not using any of the index. Queries **2** and **5** are taking longer than 300ms and also not using any index. +Now, queries **1**, **3** and **4** executed under 300ms but if we check the slow query logs, we will find these queries logged as they are not using any of the index. Queries **2** and **5** are taking longer than 300ms and also not using any index. -Use the following command to get the summary of the slow query log +Use the following command to get the summary of the slow query log: -`mysqldumpslow /var/lib/mysql/mysql-slow.log` +```shell +mysqldumpslow /var/lib/mysql/mysql-slow.log +``` ![slow query log analysis](images/mysqldumpslow_out.png "slow query log analysis") -There are some more queries in the snapshot that were along with the queries mentioned. Mysqldumpslow replaces actual values that were used by N (in case of numbers) and S (in case of strings). 
That can be overridden by `-a` option, however that will increase the output lines if different values are used in similar queries. +There are some more queries in the snapshot that were along with the queries mentioned. `mysqldumpslow` replaces actual values that were used by _N_ (in case of numbers) and _S_ (in case of strings). That can be overridden by `-a` option, however, that will increase the output lines if different values are used in similar queries. #### The EXPLAIN Plan -The **EXPLAIN** command is used with any query that we want to analyze. It describes the query execution plan, how MySQL sees and executes the query. EXPLAIN works with Select, Insert, Update and Delete statements. It tells about different aspects of the query like, how tables are joined, indexes used or not, etc. The important thing here is to understand the basic Explain plan output of a query to determine its performance. +The `EXPLAIN` command is used with any query that we want to analyze. It describes the query execution plan, how MySQL sees and executes the query. `EXPLAIN` works with `SELECT`, `INSERT`, `UPDATE` and `DELETE` statements. It tells about different aspects of the query like, how tables are joined, indexes used or not, etc. The important thing here is to understand the basic `EXPLAIN` plan output of a query to determine its performance. Let's take the following query as an example, -``` -mysql> explain select * from salaries where salary = 100000; + +```shell +mysql> EXPLAIN SELECT * FROM salaries WHERE salary = 100000; +----+-------------+----------+------------+------+---------------+------+---------+------+---------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+----------+------------+------+---------------+------+---------+------+---------+----------+-------------+ @@ -44,40 +62,45 @@ mysql> explain select * from salaries where salary = 100000; +----+-------------+----------+------------+------+---------------+------+---------+------+---------+----------+-------------+ 1 row in set, 1 warning (0.00 sec) ``` -The key aspects to understand in the above output are:- + +The key aspects to understand in the above output are: - **Partitions** - the number of partitions considered while executing the query. It is only valid if the table is partitioned. - **Possible_keys** - the list of indexes that were considered during creation of the execution plan. - **Key** - the index that will be used while executing the query. - **Rows** - the number of rows examined during the execution. - **Filtered** - the percentage of rows that were filtered out of the rows examined. The maximum and most optimized result will have 100 in this field. -- **Extra** - this tells some extra information on how MySQL evaluates, whether the query is using only where clause to match target rows, any index or temporary table, etc. +- **Extra** - this tells some extra information on how MySQL evaluates, whether the query is using only `WHERE` clause to match target rows, any index or temporary table, etc. -So, for the above query, we can determine that there are no partitions, there are no candidate indexes to be used and so no index is used at all, over 2M rows are examined and only 10% of them are included in the result, and lastly, only a where clause is used to match the target rows. 
+So, for the above query, we can determine that there are no partitions, there are no candidate indexes to be used and so no index is used at all, over 2M rows are examined and only 10% of them are included in the result, and lastly, only a `WHERE` clause is used to match the target rows. #### Creating an Index Indexes are used to speed up selecting relevant rows for a given column value. Without an index, MySQL starts with the first row and goes through the entire table to find matching rows. If the table has too many rows, the operation becomes costly. With indexes, MySQL determines the position to start looking for the data without reading the full table. A primary key is also an index which is also the fastest and is stored along with the table data. Secondary indexes are stored outside of the table data and are used to further enhance the performance of SQL statements. Indexes are mostly stored as B-Trees, with some exceptions like spatial indexes use R-Trees and memory tables use hash indexes. -There are 2 ways to create indexes:- +There are 2 ways to create indexes: -- While creating a table - if we know beforehand the columns that will drive the most number of where clauses in select queries, then we can put an index over them while creating a table. -- Altering a Table - To improve the performance of a troubling query, we create an index on a table which already has data in it using ALTER or CREATE INDEX command. This operation does not block the table but might take some time to complete depending on the size of the table. +- While creating a table - if we know beforehand the columns that will drive the most number of `WHERE` clauses in `SELECT` queries, then we can put an index over them while creating a table. +- Altering a Table - To improve the performance of a troubling query, we create an index on a table which already has data in it using `ALTER` or `CREATE INDEX` command. This operation does not block the table but might take some time to complete depending on the size of the table. Let’s look at the query that we discussed in the previous section. It’s clear that scanning over 2M records is not a good idea when only 10% of those records are actually in the resultset. Hence, we create an index on the salary column of the salaries table. -`create index idx_salary on salaries(salary)` - +```SQL +CREATE INDEX idx_salary ON salaries(salary) +``` OR -`alter table salaries add index idx_salary(salary)` - -And the same explain plan now looks like this +```SQL +ALTER TABLE salaries ADD INDEX idx_salary(salary) ``` -mysql> explain select * from salaries where salary = 100000; + +And the same explain plan now looks like this: + +```shell +mysql> EXPLAIN SELECT * FROM salaries WHERE salary = 100000; +----+-------------+----------+------------+------+---------------+------------+---------+-------+------+----------+-------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+----------+------------+------+---------------+------------+---------+-------+------+----------+-------+ @@ -85,11 +108,13 @@ mysql> explain select * from salaries where salary = 100000; +----+-------------+----------+------------+------+---------------+------------+---------+-------+------+----------+-------+ 1 row in set, 1 warning (0.00 sec) ``` -Now the index used is idx_salary, the one we recently created. The index actually helped examine only 13 records and all of them are in the resultset. 
Also, the query execution time is also reduced from over 700ms to almost negligible. -Let’s look at another example. Here we are searching for a specific combination of first\_name and last\_name. But, we might also search based on last_name only. -``` -mysql> explain select * from employees where last_name = 'Dredge' and first_name = 'Yinghua'; +Now the index used is `idx_salary`, the one we recently created. The index actually helped examine only 13 records and all of them are in the resultset. Also, the query execution time is also reduced from over 700ms to almost negligible. + +Let’s look at another example. Here, we are searching for a specific combination of `first_name` and `last_name`. But, we might also search based on `last_name` only. + +```shell +mysql> EXPLAIN SELECT * FROM employees WHERE last_name = 'Dredge' AND first_name = 'Yinghua'; +----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+ @@ -97,11 +122,15 @@ mysql> explain select * from employees where last_name = 'Dredge' and first_name +----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+ 1 row in set, 1 warning (0.00 sec) ``` -Now only 1% record out of almost 300K is the resultset. Although the query time is particularly quick as we have only 300K records, this will be a pain if the number of records are over millions. In this case, we create an index on last\_name and first\_name, not separately, but a composite index including both the columns. -`create index idx_last_first on employees(last_name, first_name)` +Now only 1% record out of almost 300K is the resultset. Although the query time is particularly quick as we have only 300K records, this will be a pain if the number of records are over millions. In this case, we create an index on `last_name` and `first_name`, not separately, but a composite index including both the columns. + +```SQL +CREATE INDEX idx_last_first ON employees(last_name, first_name) ``` -mysql> explain select * from employees where last_name = 'Dredge' and first_name = 'Yinghua'; + +```shell +mysql> EXPLAIN SELECT * FROM employees WHERE last_name = 'Dredge' AND first_name = 'Yinghua'; +----+-------------+-----------+------------+------+----------------+----------------+---------+-------------+------+----------+-------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-----------+------------+------+----------------+----------------+---------+-------------+------+----------+-------+ @@ -109,9 +138,11 @@ mysql> explain select * from employees where last_name = 'Dredge' and first_name +----+-------------+-----------+------------+------+----------------+----------------+---------+-------------+------+----------+-------+ 1 row in set, 1 warning (0.00 sec) ``` -We chose to put last\_name before first\_name while creating the index as the optimizer starts from the leftmost prefix of the index while evaluating the query. For example, if we have a 3-column index like idx(c1, c2, c3), then the search capability of the index follows - (c1), (c1, c2) or (c1, c2, c3) i.e. if your where clause has only first_name this index won’t work. 
-``` -mysql> explain select * from employees where first_name = 'Yinghua'; + +We chose to put `last_name` before `first_name` while creating the index as the optimizer starts from the leftmost prefix of the index while evaluating the query. For example, if we have a 3-column index like `idx(c1, c2, c3)`, then the search capability of the index follows - (c1), (c1, c2) or (c1, c2, c3) i.e. if your `WHERE` clause has only `first_name`, this index won’t work. + +```shell +mysql> EXPLAIN SELECT * FROM employees WHERE first_name = 'Yinghua'; +----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+ @@ -119,9 +150,11 @@ mysql> explain select * from employees where first_name = 'Yinghua'; +----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+ 1 row in set, 1 warning (0.00 sec) ``` -But, if you have only the last_name in the where clause, it will work as expected. -``` -mysql> explain select * from employees where last_name = 'Dredge'; + +But, if you have only the `last_name` in the `WHERE` clause, it will work as expected. + +```shell +mysql> EXPLAIN SELECT * FROM employees WHERE last_name = 'Dredge'; +----+-------------+-----------+------------+------+----------------+----------------+---------+-------+------+----------+-------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-----------+------------+------+----------------+----------------+---------+-------+------+----------+-------+ @@ -129,22 +162,28 @@ mysql> explain select * from employees where last_name = 'Dredge'; +----+-------------+-----------+------------+------+----------------+----------------+---------+-------+------+----------+-------+ 1 row in set, 1 warning (0.00 sec) ``` -For another example, use the following queries:- -``` -create table employees_2 like employees; -create table salaries_2 like salaries; -alter table salaries_2 drop primary key; + +For another example, use the following queries: + +```SQL +CREATE TABLE employees_2 LIKE employees; +CREATE TABLE salaries_2 LIKE salaries; +ALTER TABLE salaries_2 DROP PRIMARY KEY; ``` -We made copies of employees and salaries tables without the Primary Key of salaries table to understand an example of Select with Join. + +We made copies of `employees` and `salaries` tables without the Primary Key of `salaries` table to understand an example of `SELECT` with `JOIN`. When you have queries like the below, it becomes tricky to identify the pain point of the query. -``` -mysql> select e.first_name, e.last_name, s.salary, e.hire_date from employees_2 e join salaries_2 s on e.emp_no=s.emp_no where e.last_name='Dredge'; + +```shell +mysql> SELECT e.first_name, e.last_name, s.salary, e.hire_date FROM employees_2 e JOIN salaries_2 s ON e.emp_no=s.emp_no WHERE e.last_name='Dredge'; 1860 rows in set (4.44 sec) ``` + This query is taking about 4.5 seconds to complete with 1860 rows in the resultset. Let’s look at the Explain plan. There will be 2 records in the Explain plan as 2 tables are used in the query. 
-``` -mysql> explain select e.first_name, e.last_name, s.salary, e.hire_date from employees_2 e join salaries_2 s on e.emp_no=s.emp_no where e.last_name='Dredge'; + +```shell +mysql> EXPLAIN SELECT e.first_name, e.last_name, s.salary, e.hire_date FROM employees_2 e JOIN salaries_2 s ON e.emp_no=s.emp_no WHERE e.last_name='Dredge'; +----+-------------+-------+------------+--------+------------------------+---------+---------+--------------------+---------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+--------+------------------------+---------+---------+--------------------+---------+----------+-------------+ @@ -153,13 +192,17 @@ mysql> explain select e.first_name, e.last_name, s.salary, e.hire_date from empl +----+-------------+-------+------------+--------+------------------------+---------+---------+--------------------+---------+----------+-------------+ 2 rows in set, 1 warning (0.00 sec) ``` -These are in order of evaluation i.e. salaries\_2 will be evaluated first and then employees\_2 will be joined to it. As it looks like, it scans almost all the rows of salaries\_2 table and tries to match the employees\_2 rows as per the join condition. Though where clause is used in fetching the final resultset, but the index corresponding to the where clause is not used for the employees\_2 table. -If the join is done on two indexes which have the same data-types, it will always be faster. So, let’s create an index on the *emp_no* column of salaries_2 table and analyze the query again. +These are in order of evaluation, i.e. `salaries_2` will be evaluated first and then `employees_2` will be joined to it. As it looks like, it scans almost all the rows of `salaries_2` table and tries to match the `employees_2` rows as per the `JOIN` condition. Though `WHERE` clause is used in fetching the final resultset, but the index corresponding to the `WHERE` clause is not used for the `employees_2` table. + +If the join is done on two indexes which have the same data-types, it will always be faster. So, let’s create an index on the `emp_no` column of `salaries_2` table and analyze the query again. -`create index idx_empno on salaries_2(emp_no);` +```SQL +CREATE INDEX idx_empno ON salaries_2(emp_no) ``` -mysql> explain select e.first_name, e.last_name, s.salary, e.hire_date from employees_2 e join salaries_2 s on e.emp_no=s.emp_no where e.last_name='Dredge'; + +```shell +mysql> EXPLAIN SELECT e.first_name, e.last_name, s.salary, e.hire_date FROM employees_2 e JOIN salaries_2 s ON e.emp_no=s.emp_no WHERE e.last_name='Dredge'; +----+-------------+-------+------------+------+------------------------+----------------+---------+--------------------+------+----------+-------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+------------------------+----------------+---------+--------------------+------+----------+-------+ @@ -168,8 +211,10 @@ mysql> explain select e.first_name, e.last_name, s.salary, e.hire_date from empl +----+-------------+-------+------------+------+------------------------+----------------+---------+--------------------+------+----------+-------+ 2 rows in set, 1 warning (0.00 sec) ``` -Now, not only did the index help the optimizer to examine only a few rows in both tables, it reversed the order of the tables in evaluation. 
The employees\_2 table is evaluated first and rows are selected as per the index respective to the where clause. Then the records are joined to salaries\_2 table as per the index used due to the join condition. The execution time of the query came down **from 4.5s to 0.02s**. -``` -mysql> select e.first_name, e.last_name, s.salary, e.hire_date from employees_2 e join salaries_2 s on e.emp_no=s.emp_no where e.last_name='Dredge'\G + +Now, not only did the index help the optimizer to examine only a few rows in both tables, it reversed the order of the tables in evaluation. The `employees_2` table is evaluated first and rows are selected as per the index respective to the `WHERE` clause. Then, the records are joined to `salaries_2` table as per the index used due to the `JOIN` condition. The execution time of the query came down **from 4.5s to 0.02s**. + +```shell +mysql> SELECT e.first_name, e.last_name, s.salary, e.hire_date FROM employees_2 e JOIN salaries_2 s ON e.emp_no=s.emp_no WHERE e.last_name='Dredge'\G 1860 rows in set (0.02 sec) ``` \ No newline at end of file diff --git a/courses/level101/databases_sql/replication.md b/courses/level101/databases_sql/replication.md index f059e3cc..3c2ff110 100644 --- a/courses/level101/databases_sql/replication.md +++ b/courses/level101/databases_sql/replication.md @@ -1,61 +1,62 @@ ### MySQL Replication Replication enables data from one MySQL host (termed as Primary) to be copied to another MySQL host (termed as Replica). MySQL Replication is asynchronous in nature by default, but it can be changed to semi-synchronous with some configurations. -Some common applications of MySQL replication are:- +Some common applications of MySQL replication are: - **Read-scaling** - as multiple hosts can replicate the data from a single primary host, we can set up as many replicas as we need and scale reads through them, i.e. application writes will go to a single primary host and the reads can balance between all the replicas that are there. Such a setup can improve the write performance as well, as the primary is dedicated to only updates and not reads. - **Backups using replicas** - the backup process can sometimes be a little heavy. But if we have replicas configured, then we can use one of them to get the backup without affecting the primary data at all. - **Disaster Recovery** - a replica in some other geographical region paves a proper path to configure disaster recovery. -MySQL supports different types of synchronizations as well:- +MySQL supports different types of synchronizations as well: - **Asynchronous** - this is the default synchronization method. It is one-way, i.e. one host serves as primary and one or more hosts as replica. We will discuss this method throughout the replication topic. ![replication topologies](images/replication_topologies.png "Different Replication Scenarios") - **Semi-Synchronous** - in this type of synchronization, a commit performed on the primary host is blocked until at least one replica acknowledges it. Post the acknowledgement from any one replica, the control is returned to the session that performed the transaction. This ensures strong consistency but the replication is slower than asynchronous. -- **Delayed** - we can deliberately lag the replica in a typical MySQL replication by the number of seconds desired by the use case. 
This type of replication safeguards from severe human errors of dropping or corrupting the data on the primary, for example, in the above diagram for Delayed Replication, if a DROP DATABASE is executed by mistake on the primary, we still have 30 minutes to recover the data from R2 as that command has not been replicated on R2 yet. +- **Delayed** - we can deliberately lag the replica in a typical MySQL replication by the number of seconds desired by the use case. This type of replication safeguards from severe human errors of dropping or corrupting the data on the primary, for example, in the above diagram for Delayed Replication, if a `DROP DATABASE` is executed by mistake on the primary, we still have 30 minutes to recover the data from R2 as that command has not been replicated on R2 yet. **Pre-Requisites** -Before we dive into setting up replication, we should know about the binary logs. Binary logs play a very important role in MySQL replication. Binary logs, or commonly known as *binlogs* contain events about the changes done to the database, like table structure changes, data changes via DML operations, etc. They are not used to log SELECT statements. For replication, the primary sends the information to the replicas using its binlogs about the changes done to the database, and the replicas make the same data changes. +Before we dive into setting up replication, we should know about the binary logs. Binary logs play a very important role in MySQL replication. Binary logs, or commonly known as *binlogs* contain events about the changes done to the database, like table structure changes, data changes via DML operations, etc. They are not used to log `SELECT` statements. For replication, the primary sends the information to the replicas using its `binlogs` about the changes done to the database, and the replicas make the same data changes. + +With respect to MySQL replication, the binary log format can be of two types that decides the main type of replication: -With respect to MySQL replication, the binary log format can be of two types that decides the main type of replication:- - Statement-Based Replication or SBR - Row-Based Replication or RBR -**Statement Based Binlog Format** +**Statement-Based Binlog Format** -Originally, the replication in MySQL was based on SQL statements getting replicated and executed on the replica from the primary. This is called statement based logging. The binlog contains the exact SQL statement run by the session. +Originally, the replication in MySQL was based on SQL statements getting replicated and executed on the replica from the primary. This is called statement-based logging. The `binlog` contains the exact SQL statement run by the session. ![SBR update example](images/sbr_example_update_1.png "SBR update example") -So If we run the above statements to insert 3 records and the update 3 in a single update statement, they will be logged exactly the same as when we executed them. +So, if we run the above statements to insert 3 records and the update 3 in a single update statement, they will be logged exactly the same as when we executed them. ![SBR binlog](images/sbr_binlog_view_1.png "SBR binlog") -**Row Based Binlog Format** +**Row-Based Binlog Format** -The Row based is the default one in the latest MySQL releases. This is a lot different from the Statement format as here, row events are logged instead of statements. 
By that we mean, in the above example one update statement affected 3 records, but binlog had only one update statement; if it is a row based format, binlog will have an event for each record updated. +The row-based is the default one in the latest MySQL releases. This is a lot different from the Statement format as here, row events are logged instead of statements. By that we mean, in the above example one update statement affected 3 records, but `binlog` had only one `UPDATE` statement; if it is a row-based format, `binlog` will have an event for each record updated. ![RBR update example](images/rbr_example_update_1.png "RBR update example") ![RBR binlog](images/rbr_binlog_view_1.png "RBR binlog") -**Statement Based v/s Row Based binlogs** +**Statement-Based v/s Row-Based binlogs** Let’s have a look at the operational differences between statement-based and row-based binlogs. -| Statement Based | Row Based | -|---|---| +| Statement-Based | Row-Based | +|------------------------------|--------------------------------------------------------------------| | Logs SQL statements as executed | Logs row events based on SQL statements executed | | Takes lesser disk space | Takes more disk space | | Restoring using binlogs is faster | Restoring using binlogs is slower | -| When used for replication, if any statement has a predefined function that has its own value, like sysdate(), uuid() etc, the output could be different on the replica which makes it inconsistent. | Whatever is executed becomes a row event with values, so there will be no problem if such functions are used in SQL statements. | -| Only statements are logged so no other row events are generated. | A lot of events are generated when a table is copied into another using INSERT INTO SELECT. | +| When used for replication, if any statement has a predefined function that has its own value, like `sysdate()`, `uuid()` etc, the output could be different on the replica which makes it inconsistent. | Whatever is executed becomes a row event with values, so there will be no problem if such functions are used in SQL statements. | +| Only statements are logged so no other row events are generated. | A lot of events are generated when a table is copied into another using `INSERT INTO SELECT`. | -**Note** - There is another type of binlog format called **Mixed**. With mixed logging, statement based is used by default but it switches to row based in certain cases. If MySQL cannot guarantee that statement based logging is safe for the statements executed, it issues a warning and switches to row based for those statements. +**Note**: There is another type of `binlog` format called **Mixed**. With mixed logging, statement-based is used by default but it switches to row-based in certain cases. If MySQL cannot guarantee that statement-based logging is safe for the statements executed, it issues a warning and switches to row-based for those statements. We will be using binary log format as Row for the entire replication topic. @@ -65,22 +66,23 @@ We will be using binary log format as Row for the entire replication topic. The above figure indicates how a typical MySQL replication works. -1. Replica_IO_Thread is responsible to fetch the binlog events from the primary binary logs to the replica +1. `Replica_IO_Thread` is responsible to fetch the binlog events from the primary binary logs to the replica. 2. On the Replica host, relay logs are created which are exact copies of the binary logs. 
If the binary logs on primary are in row format, the relay logs will be the same. -3. Replica_SQL_Thread applies the relay logs on the replica MySQL server. -4. If log-bin is enabled on the replica, then the replica will have its own binary logs as well. If log-slave-updates is enabled, then it will have the updates from the primary logged in the binlogs as well. +3. `Replica_SQL_Thread` applies the relay logs on the replica MySQL server. +4. If `log-bin` is enabled on the replica, then the replica will have its own binary logs as well. If `log-slave-updates` is enabled, then it will have the updates from the primary logged in the binlogs as well. #### Setting up Replication -In this section, we will set up a simple asynchronous replication. The binlogs will be in row based format. The replication will be set up on two fresh hosts with no prior data present. There are two different ways in which we can set up replication. +In this section, we will set up a simple asynchronous replication. The binlogs will be in row-based format. The replication will be set up on two fresh hosts with no prior data present. There are two different ways in which we can set up replication. - **Binlog based** - Each replica keeps a record of the binlog coordinates on the primary - current binlog and position in the binlog till where it has read and processed. So, at a time different replicas might be reading different parts of the same binlog. -- **GTID based** - Every transaction gets an identifier called global transaction identifier or GTID. There is no need to keep the record of binlog coordinates, as long as the replica has all the GTIDs executed on the primary, it is consistent with the primary. A typical GTID is the server_uuid:# positive integer. +- **GTID based** - Every transaction gets an identifier called global transaction identifier or GTID. There is no need to keep the record of binlog coordinates, as long as the replica has all the GTIDs executed on the primary, it is consistent with the primary. A typical GTID is the `server_uuid:#` positive integer. -We will set up a GTID based replication in the following section but will also discuss binlog based replication setup as well. +We will set up a GTID-based replication in the following section but will also discuss binlog-based replication setup as well. **Primary Host Configurations** -The following config parameters should be present in the primary my.cnf file for setting up GTID based replication. +The following config parameters should be present in the primary `my.cnf` file for setting up GTID-based replication. + ``` server-id - a unique ID for the mysql server log-bin - the binlog location @@ -88,9 +90,11 @@ binlog-format - ROW | STATEMENT (we will use ROW) gtid-mode - ON enforce-gtid-consistency - ON (allows execution of only those statements which can be logged using GTIDs) ``` + **Replica Host Configurations** -The following config parameters should be present in the replica my.cnf file for setting up replication. +The following config parameters should be present in the replica `my.cnf` file for setting up replication. + ``` server-id - different than the primary host log-bin - (optional, if you want replica to log its own changes as well) @@ -99,21 +103,25 @@ gtid-mode - ON enforce-gtid-consistency - ON log-slave-updates - ON (if binlog is enabled, then we can enable this. This enables the replica to log the changes coming from the primary along with its own changes. 
Helps in setting up chain replication) ``` + **Replication User** -Every replica connects to the primary using a mysql user for replicating. So there must be a mysql user account for the same on the primary host. Any user can be used for this purpose provided it has REPLICATION SLAVE privilege. If the sole purpose is replication then we can have a user with only the required privilege. +Every replica connects to the primary using a `mysql` user for replicating. So there must be a `mysql` user account for the same on the primary host. Any user can be used for this purpose provided it has `REPLICATION SLAVE` privilege. If the sole purpose is replication, then we can have a user with only the required privilege. -On the primary host -``` -mysql> create user repl_user@ identified by 'xxxxx'; +On the primary host: -mysql> grant replication slave on *.* to repl_user@''; +```shell +mysql> CREATE USER repl_user@ IDENTIFIED BY 'xxxxx'; + +mysql> GRANT REPLICATION SLAVE ON *.* TO repl_user@''; ``` + **Obtaining Starting position from Primary** -Run the following command on the primary host -``` -mysql> show master status\G +Run the following command on the primary host: + +```shell +mysql> SHOW MASTER STATUS\G *************************** 1. row *************************** File: mysql-bin.000001 Position: 73 @@ -122,46 +130,60 @@ mysql> show master status\G Executed_Gtid_Set: e17d0920-d00e-11eb-a3e6-000d3aa00f87:1-3 1 row in set (0.00 sec) ``` -If we are working with binary log based replication, the top two output lines are the most important ones. That tells the current binlog on the primary host and till what position it has written. For fresh hosts we know that no data is written so we can directly set up replication using the very first binlog file and position 4. If we are setting up a replication from a backup, then that changes the way we obtain the starting position. For GTIDs, the executed_gtid_set is the value where primary is right now. Again, for a fresh setup, we don’t have to specify anything about the starting point and it will start from the transaction id 1, but when we set up from a backup, the backup will contain the GTID positions till where backup has been taken. + +If we are working with binary log-based replication, the top two output lines are the most important ones. That tells the current binlog on the primary host and till what position it has written. For fresh hosts we know that no data is written, so we can directly set up replication using the very first `binlog` file and position 4. If we are setting up a replication from a backup, then that changes the way we obtain the starting position. For GTIDs, the `executed_gtid_set` is the value where primary is right now. Again, for a fresh setup, we don’t have to specify anything about the starting point and it will start from the transaction id 1, but when we set up from a backup, the backup will contain the GTID positions till where backup has been taken. **Setting up Replica** -The replication setup must know about the primary host, the user and password to connect, the binlog coordinates (for binlog based replication) or the GTID auto-position parameter. -The following command is used for setting up -``` -change master to +The replication setup must know about the primary host, the user and password to connect, the binlog coordinates (for binlog-based replication) or the GTID auto-position parameter. 
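+
+Before running the setup command below, it can be worth a quick sanity check that GTID mode is actually active on both hosts; `gtid_mode` and `enforce_gtid_consistency` are standard MySQL system variables, so a minimal check looks like:
+
+```SQL
+-- run on both the primary and the replica; both variables should report ON
+SHOW VARIABLES LIKE 'gtid_mode';
+SHOW VARIABLES LIKE 'enforce_gtid_consistency';
+```
+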
+The following command is used for setting up: + +```SQL +CHANGE MASTER TO master_host = '', master_port = , master_user = 'repl_user', master_password = 'xxxxx', master_auto_position = 1; ``` -**Note** - the *Change Master To* command has been replaced with *Change Replication Source To* from Mysql 8.0.23 onwards, also all the *master* and *slave* keywords are replaced with *source* and *replica*. -If it is binlog based replication, then instead of master_auto_position, we need to specify the binlog coordinates. +**Note**: The `CHANGE MASTER TO` command has been replaced with `CHANGE REPLICATION SOURCE TO` from Mysql 8.0.23 onwards, also all the *master* and *slave* keywords are replaced with *source* and *replica*. + +If it is binlog-based replication, then instead of `master_auto_position`, we need to specify the binlog coordinates. + ``` master_log_file = 'mysql-bin.000001', master_log_pos = 4 ``` + **Starting Replication and Check Status** Now that everything is configured, we just need to start the replication on the replica via the following command -`start slave;` +```SQL +START SLAVE; +``` OR from MySQL 8.0.23 onwards, -`start replica;` +```SQL +START REPLICA; +``` -Whether or not the replication is running successfully, we can determine by running the following command +Whether or not the replication is running successfully, we can determine by running the following command: -`show slave status\G` +```SQL +SHOW SLAVE STATUS\G +``` OR from MySQL 8.0.23 onwards, -`show replica status\G` +```SQL +SHOW REPLICA STATUS\G ``` -mysql> show replica status\G + +```shell +mysql> SHOW REPLICA STATUS\G *************************** 1. row *************************** Replica_IO_State: Waiting for master to send event Source_Host: @@ -225,36 +247,42 @@ Source_SSL_Verify_Server_Cert: No Network_Namespace: 1 row in set (0.00 sec) ``` -Some of the parameters are explained below:- - -- **Relay_Source_Log_File** - the primary’s file where replica is currently reading from -- **Execute_Source_Log_Pos** - for the above file on which position is the replica reading currently from. These two parameters are of utmost importance when binlog based replication is used. -- **Replica_IO_Running** - IO thread of replica is running or not -- **Replica_SQL_Running** - SQL thread of replica is running or not -- **Seconds_Behind_Source** - the difference of seconds when a statement was executed on Primary and then on Replica. This indicates how much replication lag is there. -- **Source_UUID** - the uuid of the primary host -- **Retrieved_Gtid_Set** - the GTIDs fetched from the primary host by the replica to be executed. -- **Executed_Gtid_Set** - the GTIDs executed on the replica. This set remains the same for the entire cluster if the replicas are in sync. -- **Auto_Position** - it directs the replica to fetch the next GTID automatically + +Some of the parameters are explained below: + +| Parameters | Description | +|--------------------------|------------------------------------------------------------| +| Relay_Source_Log_File | the primary’s file where replica is currently reading from | +| Execute_Source_Log_Pos | for the above file on which position is the replica reading currently from. 
These two parameters are of utmost importance when binlog based replication is used | +| Replica_IO_Running | IO thread of replica is running or not | +| Replica_SQL_Running | SQL thread of replica is running or not | +| Seconds_Behind_Source | the difference of seconds when a statement was executed on Primary and then on Replica. This indicates how much replication lag is there | +| Source_UUID | the uuid of the primary host | +| Retrieved_Gtid_Set | the GTIDs fetched from the primary host by the replica to be executed | +| Executed_Gtid_Set | the GTIDs executed on the replica. This set remains the same for the entire cluster if the replicas are in sync | +| Auto_Position | it directs the replica to fetch the next GTID automatically| **Create a Replica for the already setup cluster** The steps discussed in the previous section talks about the setting up replication on two fresh hosts. When we have to set up a replica for a host which is already serving applications, then the backup of the primary is used, either fresh backup taken for the replica (should only be done if the traffic it is serving is less) or use a recently taken backup. -If the size of the databases on the MySQL primary server is small, less than 100G recommended, then mysqldump can be used to take backup along with the following options. +If the size of the databases on the MySQL primary server is small, less than 100G recommended, then `mysqldump` can be used to take backup along with the following options. -`mysqldump -uroot -p -hhost_ip -P3306 --all-databases --single-transaction --master-data=1 > primary_host.bkp` +```shell +mysqldump -uroot -p -hhost_ip -P3306 --all-databases --single-transaction --master-data=1 > primary_host.bkp +``` - `--single-transaction` - this option starts a transaction before taking the backup which ensures it is consistent. As transactions are isolated from each other, so no other writes affect the backup. -- `--master-data` - this option is required if binlog based replication is desired to be set up. It includes the binary log file and log file position in the backup file. +- `--master-data` - this option is required if binlog-based replication is desired to be set up. It includes the binary log file and log file position in the backup file. -When GTID mode is enabled and **mysqldump** is executed, it includes the GTID executed to be used to start the replica after the backup position. The contents of the mysqldump output file will have the following +When GTID mode is enabled and `mysqldump` is executed, it includes the GTID executed to be used to start the replica after the backup position. The contents of the `mysqldump` output file will have the following ![GTID info in mysqldump](images/mysqldump_gtid_text.png "GTID info in mysqldump") -It is recommended to comment these before restoring otherwise they could throw errors. Also, using master-data=2 will automatically comment the master\_log\_file line. +It is recommended to comment these before restoring otherwise they could throw errors. Also, using `master-data=2` will automatically comment the `master_log_file` line. + +Similarly, when taking backup of the host using `xtrabackup`, the file `xtrabckup_info` file contains the information about binlog file and file position, as well as the GTID executed set. -Similarly, when taking backup of the host using **xtrabackup**, the file *xtrabckup_info* file contains the information about binlog file and file position, as well as the GTID executed set. 
``` server_version = 8.0.25 start_time = 2021-06-22 03:45:17 @@ -269,9 +297,11 @@ format = file compressed = N encrypted = N ``` -Now, after setting MySQL server on the desired host, restore the backup taken from any one of the above methods. If the intended way is binlog based replication, then use the binlog file and position info in the following command + +Now, after setting MySQL server on the desired host, restore the backup taken from any one of the above methods. If the intended way is binlog-based replication, then use the binlog file and position info in the following command: + ``` -change Replication Source to +CHANGE REPLICATION SOURCE TO source_host = ‘primary_ip’, source_port = 3306, source_user = ‘repl_user’, @@ -279,20 +309,23 @@ source_password = ‘xxxxx’, source_log_file = ‘mysql-bin.000007’, source_log_pos = ‘196’; ``` -If the replication needs to be set via GITDs, then run the below command to tell the replica about the GTIDs already executed. On the Replica host, run th following commands + +If the replication needs to be set via GITDs, then run the below command to tell the replica about the GTIDs already executed. On the Replica host, run the following commands: + ``` -reset master; +RESET MASTER; set global gtid_purged = ‘e17d0920-d00e-11eb-a3e6-000d3aa00f87:1-5’ -change replication source to +CHANGE REPLICATION SOURCE TO source_host = ‘primary_ip’, source_port = 3306, source_user = ‘repl_user’, source_password = ‘xxxxx’, source_auto_position = 1 ``` -The reset master command resets the position of the binary log to initial. It can be skipped if the host is a freshly installed MySQL, but we restored a backup so it is necessary. The gtid_purged global variable lets the replica know the GTIDs that have already been executed, so that the replication can start after that. Then in the change source command, we set the auto-position to 1 which automatically gets the next GTID to proceed. + +The reset master command resets the position of the binary log to initial. It can be skipped if the host is a freshly installed MySQL, but we restored a backup so it is necessary. The `gtid_purged` global variable lets the replica know the GTIDs that have already been executed, so that the replication can start after that. Then in the change source command, we set the `auto-position` to 1 which automatically gets the next GTID to proceed. #### Further Reading diff --git a/courses/level101/databases_sql/select_query.md b/courses/level101/databases_sql/select_query.md index 6879dffc..50c01dd4 100644 --- a/courses/level101/databases_sql/select_query.md +++ b/courses/level101/databases_sql/select_query.md @@ -1,6 +1,7 @@ ### SELECT Query -The most commonly used command while working with MySQL is SELECT. It is used to fetch the result set from one or more tables. -The general form of a typical select query looks like:- +The most commonly used command while working with MySQL is `SELECT`. It is used to fetch the resultset from one or more tables. +The general form of a typical select query looks like: + ``` SELECT expr FROM table1 @@ -9,19 +10,21 @@ FROM table1 [ORDER BY column_list ASC|DESC] [LIMIT #] ``` -The above general form contains some commonly used clauses of a SELECT query:- + +The above general form contains some commonly used clauses of a `SELECT` query: - **expr** - comma-separated column list or * (for all columns) - **WHERE** - a condition is provided, if true, directs the query to select only those records. 
-- **GROUP BY** - groups the entire result set based on the column list provided. An aggregate function is recommended to be present in the select expression of the query. **HAVING** supports grouping by putting a condition on the selected or any other aggregate function. -- **ORDER BY** - sorts the result set based on the column list in ascending or descending order. +- **GROUP BY** - groups the entire resultset based on the column list provided. An aggregate function is recommended to be present in the select expression of the query. **HAVING** supports grouping by putting a condition on the selected or any other aggregate function. +- **ORDER BY** - sorts the resultset based on the column list in ascending or descending order. - **LIMIT** - commonly used to limit the number of records. Let’s have a look at some examples for a better understanding of the above. The dataset used for the examples below is available [here](https://dev.mysql.com/doc/employee/en/employees-installation.html) and is free to use. **Select all records** -``` -mysql> select * from employees limit 5; + +```shell +mysql> SELECT * FROM employees LIMIT 5; +--------+------------+------------+-----------+--------+------------+ | emp_no | birth_date | first_name | last_name | gender | hire_date | +--------+------------+------------+-----------+--------+------------+ @@ -33,9 +36,11 @@ mysql> select * from employees limit 5; +--------+------------+------------+-----------+--------+------------+ 5 rows in set (0.00 sec) ``` + **Select specific fields for all records** -``` -mysql> select first_name, last_name, gender from employees limit 5; + +```shell +mysql> SELECT first_name, last_name, gender FROM employees LIMIT 5; +------------+-----------+--------+ | first_name | last_name | gender | +------------+-----------+--------+ @@ -47,9 +52,11 @@ mysql> select first_name, last_name, gender from employees limit 5; +------------+-----------+--------+ 5 rows in set (0.00 sec) ``` + **Select all records Where hire_date >= January 1, 1990** -``` -mysql> select * from employees where hire_date >= '1990-01-01' limit 5; + +```shell +mysql> SELECT * FROM employees WHERE hire_date >= '1990-01-01' LIMIT 5; +--------+------------+------------+-------------+--------+------------+ | emp_no | birth_date | first_name | last_name | gender | hire_date | +--------+------------+------------+-------------+--------+------------+ @@ -61,9 +68,11 @@ mysql> select * from employees where hire_date >= '1990-01-01' limit 5; +--------+------------+------------+-------------+--------+------------+ 5 rows in set (0.01 sec) ``` + **Select first_name and last_name from all records Where birth_date >= 1960 AND gender = ‘F’** -``` -mysql> select first_name, last_name from employees where year(birth_date) >= 1960 and gender='F' limit 5; + +```shell +mysql> SELECT first_name, last_name FROM employees WHERE year(birth_date) >= 1960 AND gender='F' LIMIT 5; +------------+-----------+ | first_name | last_name | +------------+-----------+ @@ -75,32 +84,38 @@ mysql> select first_name, last_name from employees where year(birth_date) >= 196 +------------+-----------+ 5 rows in set (0.00 sec) ``` + **Display the total number of records** -``` -mysql> select count(*) from employees; + +```shell +mysql> SELECT COUNT(*) FROM employees; +----------+ -| count(*) | +| COUNT(*) | +----------+ | 300024 | +----------+ 1 row in set (0.05 sec) ``` + **Display gender-wise count of all records** -``` -mysql> select gender, count(*) from employees group by gender; + +```shell +mysql> 
SELECT gender, COUNT(*) FROM employees GROUP BY gender; +--------+----------+ -| gender | count(*) | +| gender | COUNT(*) | +--------+----------+ | M | 179973 | | F | 120051 | +--------+----------+ 2 rows in set (0.14 sec) ``` + **Display the year of hire_date and number of employees hired that year, also only those years where more than 20k employees were hired** -``` -mysql> select year(hire_date), count(*) from employees group by year(hire_date) having count(*) > 20000; + +```shell +mysql> SELECT year(hire_date), COUNT(*) FROM employees GROUP BY year(hire_date) HAVING COUNT(*) > 20000; +-----------------+----------+ -| year(hire_date) | count(*) | +| year(hire_date) | COUNT(*) | +-----------------+----------+ | 1985 | 35316 | | 1986 | 36150 | @@ -113,9 +128,11 @@ mysql> select year(hire_date), count(*) from employees group by year(hire_date) +-----------------+----------+ 8 rows in set (0.14 sec) ``` -**Display all records ordered by their hire_date in descending order. If hire_date is the same then in order of their birth_date ascending order** -``` -mysql> select * from employees order by hire_date desc, birth_date asc limit 5; + +**Display all records ordered by their hire_date in descending order. If hire_date is the same, then in order of their birth_date ascending order** + +```shell +mysql> SELECT * FROM employees ORDER BY hire_date DESC, birth_date ASC LIMIT 5; +--------+------------+------------+-----------+--------+------------+ | emp_no | birth_date | first_name | last_name | gender | hire_date | +--------+------------+------------+-----------+--------+------------+ @@ -127,9 +144,12 @@ mysql> select * from employees order by hire_date desc, birth_date asc limit 5; +--------+------------+------------+-----------+--------+------------+ 5 rows in set (0.12 sec) ``` + ### SELECT - JOINS -JOIN statement is used to produce a combined result set from two or more tables based on certain conditions. It can be also used with Update and Delete statements but we will be focussing on the select query. -Following is a basic general form for joins +`JOIN` statement is used to produce a combined resultset from two or more tables based on certain conditions. It can be also used with `UPDATE` and `DELETE` statements, but we will be focussing on the select query. + +Following is a basic general form for joins: + ``` SELECT table1.col1, table2.col1, ... (any combination) FROM @@ -137,16 +157,18 @@ table1 table2 ON (or USING depends on join_type) table1.column_for_joining = table2.column_for_joining WHERE … ``` -Any number of columns can be selected, but it is recommended to select only those which are relevant to increase the readability of the resultset. All other clauses like where, group by are not mandatory. + +Any number of columns can be selected, but it is recommended to select only those which are relevant to increase the readability of the resultset. All other clauses like `WHERE`, `GROUP BY` are not mandatory. Let’s discuss the types of JOINs supported by MySQL Syntax. **Inner Join** This joins table A with table B on a condition. Only the records where the condition is True are selected in the resultset. 
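+
+In MySQL, the bare `JOIN` keyword used in the examples below is a syntactic equivalent of `INNER JOIN`, so spelling it out changes nothing. A minimal sketch on the same employees dataset, just to make the equivalence explicit:
+
+```SQL
+-- returns the same rows as "employees e JOIN salaries s ON e.emp_no = s.emp_no"
+SELECT e.emp_no, s.salary
+FROM employees e
+INNER JOIN salaries s ON e.emp_no = s.emp_no
+LIMIT 5;
+```
+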
-Display some details of employees along with their salary -``` -mysql> select e.emp_no,e.first_name,e.last_name,s.salary from employees e join salaries s on e.emp_no=s.emp_no limit 5; +Display some details of employees along with their salary: + +```shell +mysql> SELECT e.emp_no,e.first_name,e.last_name,s.salary FROM employees e JOIN salaries s ON e.emp_no=s.emp_no LIMIT 5; +--------+------------+-----------+--------+ | emp_no | first_name | last_name | salary | +--------+------------+-----------+--------+ @@ -158,9 +180,11 @@ mysql> select e.emp_no,e.first_name,e.last_name,s.salary from employees e join s +--------+------------+-----------+--------+ 5 rows in set (0.00 sec) ``` -Similar result can be achieved by -``` -mysql> select e.emp_no,e.first_name,e.last_name,s.salary from employees e join salaries s using (emp_no) limit 5; + +Similar result can be achieved by: + +```shell +mysql> SELECT e.emp_no,e.first_name,e.last_name,s.salary FROM employees e JOIN salaries s USING (emp_no) LIMIT 5; +--------+------------+-----------+--------+ | emp_no | first_name | last_name | salary | +--------+------------+-----------+--------+ @@ -172,9 +196,11 @@ mysql> select e.emp_no,e.first_name,e.last_name,s.salary from employees e join s +--------+------------+-----------+--------+ 5 rows in set (0.00 sec) ``` -And also by -``` -mysql> select e.emp_no,e.first_name,e.last_name,s.salary from employees e natural join salaries s limit 5; + +And also by: + +```shell +mysql> SELECT e.emp_no,e.first_name,e.last_name,s.salary FROM employees e NATURAL JOIN salaries s LIMIT 5; +--------+------------+-----------+--------+ | emp_no | first_name | last_name | salary | +--------+------------+-----------+--------+ @@ -186,15 +212,18 @@ mysql> select e.emp_no,e.first_name,e.last_name,s.salary from employees e natura +--------+------------+-----------+--------+ 5 rows in set (0.00 sec) ``` + **Outer Join** -Majorly of two types:- +Majorly of two types: + - **LEFT** - joining complete table A with table B on a condition. All the records from table A are selected, but from table B, only those records are selected where the condition is True. -- **RIGHT** - Exact opposite of the left join. +- **RIGHT** - Exact opposite of the `LEFT JOIN`. -Let us assume the below tables for understanding left join better. -``` -mysql> select * from dummy1; +Let us assume the below tables for understanding `LEFT JOIN` better. + +```shell +mysql> SELECT * FROM dummy1; +----------+------------+ | same_col | diff_col_1 | +----------+------------+ @@ -203,7 +232,7 @@ mysql> select * from dummy1; | 3 | C | +----------+------------+ -mysql> select * from dummy2; +mysql> SELECT * FROM dummy2; +----------+------------+ | same_col | diff_col_2 | +----------+------------+ @@ -211,9 +240,11 @@ mysql> select * from dummy2; | 3 | Y | +----------+------------+ ``` -A simple select join will look like the one below. 
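+
+The `dummy1` and `dummy2` tables are not part of the employees sample database. If you want to follow along, a minimal sketch to create and populate them (the column types are assumed from the listings above; any compatible types will do):
+
+```SQL
+-- assumed definitions matching the data shown above
+CREATE TABLE dummy1 (same_col INT, diff_col_1 VARCHAR(10));
+CREATE TABLE dummy2 (same_col INT, diff_col_2 VARCHAR(10));
+INSERT INTO dummy1 VALUES (1, 'A'), (2, 'B'), (3, 'C');
+INSERT INTO dummy2 VALUES (1, 'X'), (3, 'Y');
+```
+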
-``` -mysql> select * from dummy1 d1 left join dummy2 d2 on d1.same_col=d2.same_col; + +A simple `SELECT JOIN` will look like the one below: + +```shell +mysql> SELECT * FROM dummy1 d1 LEFT JOIN dummy2 d2 ON d1.same_col=d2.same_col; +----------+------------+----------+------------+ | same_col | diff_col_1 | same_col | diff_col_2 | +----------+------------+----------+------------+ @@ -223,9 +254,11 @@ mysql> select * from dummy1 d1 left join dummy2 d2 on d1.same_col=d2.same_col; +----------+------------+----------+------------+ 3 rows in set (0.00 sec) ``` -Which can also be written as -``` -mysql> select * from dummy1 d1 left join dummy2 d2 using(same_col); + +Which can also be written as: + +```shell +mysql> SELECT * FROM dummy1 d1 LEFT JOIN dummy2 d2 USING(same_col); +----------+------------+------------+ | same_col | diff_col_1 | diff_col_2 | +----------+------------+------------+ @@ -235,9 +268,11 @@ mysql> select * from dummy1 d1 left join dummy2 d2 using(same_col); +----------+------------+------------+ 3 rows in set (0.00 sec) ``` -And also as -``` -mysql> select * from dummy1 d1 natural left join dummy2 d2; + +And also as: + +```shell +mysql> SELECT * FROM dummy1 d1 NATURAL LEFT JOIN dummy2 d2; +----------+------------+------------+ | same_col | diff_col_1 | diff_col_2 | +----------+------------+------------+ @@ -247,13 +282,15 @@ mysql> select * from dummy1 d1 natural left join dummy2 d2; +----------+------------+------------+ 3 rows in set (0.00 sec) ``` + **Cross Join** This does a cross product of table A and table B without any condition. It doesn’t have a lot of applications in the real world. -A Simple Cross Join looks like this -``` -mysql> select * from dummy1 cross join dummy2; +A Simple `CROSS JOIN` looks like this: + +```shell +mysql> SELECT * FROM dummy1 CROSS JOIN dummy2; +----------+------------+----------+------------+ | same_col | diff_col_1 | same_col | diff_col_2 | +----------+------------+----------+------------+ @@ -266,9 +303,11 @@ mysql> select * from dummy1 cross join dummy2; +----------+------------+----------+------------+ 6 rows in set (0.01 sec) ``` -One use case that can come in handy is when you have to fill in some missing entries. For example, all the entries from dummy1 must be inserted into a similar table dummy3, with each record must have 3 entries with statuses 1, 5 and 7. -``` -mysql> desc dummy3; + +One use case that can come in handy is when you have to fill in some missing entries. For example, all the entries from `dummy1` must be inserted into a similar table `dummy3`, with each record must have 3 entries with statuses 1, 5 and 7. + +```shell +mysql> DESC dummy3; +----------+----------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +----------+----------+------+-----+---------+-------+ @@ -278,12 +317,14 @@ mysql> desc dummy3; +----------+----------+------+-----+---------+-------+ 3 rows in set (0.02 sec) ``` -Either you create an insert query script with as many entries as in dummy1 or use cross join to produce the required resultset. -``` -mysql> select * from dummy1 -cross join -(select 1 union select 5 union select 7) T2 -order by same_col; + +Either you create an `INSERT` query script with as many entries as in `dummy1` or use `CROSS JOIN` to produce the required resultset. 
+ +```shell +mysql> SELECT * FROM dummy1 +CROSS JOIN +(SELECT 1 UNION SELECT 5 UNION SELECT 7) T2 +ORDER BY same_col; +----------+------------+---+ | same_col | diff_col_1 | 1 | +----------+------------+---+ @@ -299,13 +340,15 @@ order by same_col; +----------+------------+---+ 9 rows in set (0.00 sec) ``` + The **T2** section in the above query is called a *sub-query*. We will discuss the same in the next section. **Natural Join** This implicitly selects the common column from table A and table B and performs an inner join. -``` -mysql> select e.emp_no,e.first_name,e.last_name,s.salary from employees e natural join salaries s limit 5; + +```shell +mysql> SELECT e.emp_no,e.first_name,e.last_name,s.salary FROM employees e NATURAL JOIN salaries s LIMIT 5; +--------+------------+-----------+--------+ | emp_no | first_name | last_name | salary | +--------+------------+-----------+--------+ @@ -317,20 +360,22 @@ mysql> select e.emp_no,e.first_name,e.last_name,s.salary from employees e natura +--------+------------+-----------+--------+ 5 rows in set (0.00 sec) ``` -Notice how natural join and using takes care that the common column is displayed only once if you are not explicitly selecting columns for the query. -Some More Examples +Notice how `NATURAL JOIN` and using takes care that the common column is displayed only once if you are not explicitly selecting columns for the query. -Display emp_no, salary, title and dept of the employees where salary > 80000 -``` -mysql> select e.emp_no, s.salary, t.title, d.dept_no -from +**Some More Examples** + +Display `emp_no`, `salary`, `title` and `dept` of the employees where salary > 80000. + +```shell +mysql> SELECT e.emp_no, s.salary, t.title, d.dept_no +FROM employees e -join salaries s using (emp_no) -join titles t using (emp_no) -join dept_emp d using (emp_no) -where s.salary > 80000 -limit 5; +JOIN salaries s USING (emp_no) +JOIN titles t USING (emp_no) +JOIN dept_emp d USING (emp_no) +WHERE s.salary > 80000 +LIMIT 5; +--------+--------+--------------+---------+ | emp_no | salary | title | dept_no | +--------+--------+--------------+---------+ @@ -342,16 +387,18 @@ limit 5; +--------+--------+--------------+---------+ 5 rows in set (0.00 sec) ``` -Display title-wise count of employees in each department order by dept_no -``` -mysql> select d.dept_no, t.title, count(*) -from titles t -left join dept_emp d using (emp_no) -group by d.dept_no, t.title -order by d.dept_no -limit 10; + +Display title-wise count of employees in each department ordered by `dept_no`: + +```shell +mysql> SELECT d.dept_no, t.title, COUNT(*) +FROM titles t +LEFT JOIN dept_emp d USING (emp_no) +GROUP BY d.dept_no, t.title +ORDER BY d.dept_no +LIMIT 10; +---------+--------------------+----------+ -| dept_no | title | count(*) | +| dept_no | title | COUNT(*) | +---------+--------------------+----------+ | d001 | Manager | 2 | | d001 | Senior Staff | 13940 | @@ -366,18 +413,20 @@ limit 10; +---------+--------------------+----------+ 10 rows in set (1.32 sec) ``` + #### SELECT - Subquery -A subquery is generally a smaller resultset that can be used to power a select query in many ways. It can be used in a ‘where’ condition, can be used in place of join mostly where a join could be an overkill. -These subqueries are also termed as derived tables. They must have a table alias in the select query. +A subquery is generally a smaller resultset that can be used to power a `SELECT` query in many ways. 
It can be used in a `WHERE` condition, can be used in place of `JOIN` mostly where a `JOIN` could be an overkill. +These subqueries are also termed as derived tables. They must have a table alias in the `SELECT` query. Let’s look at some examples of subqueries. -Here we got the department name from the departments table by a subquery which used dept_no from dept_emp table. -``` -mysql> select e.emp_no, -(select dept_name from departments where dept_no=d.dept_no) dept_name from employees e -join dept_emp d using (emp_no) -limit 5; +Here, we got the department name from the `departments` table by a subquery which used `dept_no` from `dept_emp` table. + +```shell +mysql> SELECT e.emp_no, +(SELECT dept_name FROM departments WHERE dept_no=d.dept_no) dept_name FROM employees e +JOIN dept_emp d USING (emp_no) +LIMIT 5; +--------+-----------------+ | emp_no | dept_name | +--------+-----------------+ @@ -389,24 +438,26 @@ limit 5; +--------+-----------------+ 5 rows in set (0.01 sec) ``` -Here, we used the ‘avg’ query above (which got the avg salary) as a subquery to list the employees whose latest salary is more than the average. -``` -mysql> select avg(salary) from salaries; + +Here, we used the `AVG` query above (which got the avg salary) as a subquery to list the employees whose latest salary is more than the average. + +```shell +mysql> SELECT AVG(salary) FROM salaries; +-------------+ -| avg(salary) | +| AVG(salary) | +-------------+ | 63810.7448 | +-------------+ 1 row in set (0.80 sec) -mysql> select e.emp_no, max(s.salary) -from employees e -natural join salaries s -group by e.emp_no -having max(s.salary) > (select avg(salary) from salaries) -limit 10; +mysql> SELECT e.emp_no, MAX(s.salary) +FROM employees e +NATURAL JOIN salaries s +GROUP BY e.emp_no +HAVING MAX(s.salary) > (SELECT AVG(salary) FROM salaries) +LIMIT 10; +--------+---------------+ -| emp_no | max(s.salary) | +| emp_no | MAX(s.salary) | +--------+---------------+ | 10001 | 88958 | | 10002 | 72527 | diff --git a/courses/level101/git/branches.md b/courses/level101/git/branches.md index 1edd10d8..da4501b2 100644 --- a/courses/level101/git/branches.md +++ b/courses/level101/git/branches.md @@ -1,6 +1,6 @@ # Working With Branches -Coming back to our local repo which has two commits. So far, what we have is a single line of history. Commits are chained in a single line. But sometimes you may have a need to work on two different features in parallel in the same repo. Now one option here could be making a new folder/repo with the same code and use that for another feature development. But there's a better way. Use _branches._ Since git follows tree like structure for commits, we can use branches to work on different sets of features. From a commit, two or more branches can be created and branches can also be merged. +Coming back to our local repo which has two commits. So far, what we have is a single line of history. Commits are chained in a single line. But sometimes you may have a need to work on two different features in parallel in the same repo. Now one option here could be making a new folder/repo with the same code and use that for another feature development. But there's a better way. Use _branches_. Since git follows tree-like structure for commits, we can use branches to work on different sets of features. From a commit, two or more branches can be created and branches can also be merged. Using branches, there can exist multiple lines of histories and we can checkout to any of them and work on it. 
Checking out, as we discussed earlier, would simply mean replacing contents of the directory (repo) with the snapshot at the checked out version. @@ -13,7 +13,7 @@ $ git log --oneline --graph * df2fb7a adding file 1 ``` -We create a branch called `b1`. Git log tells us that b1 also points to the last commit (7f3b00e) but the `HEAD` is still pointing to master. If you remember, HEAD points to the commit/reference wherever you are checkout to. So if we checkout to `b1`, HEAD should point to that. Let's confirm: +We create a branch called `b1`. Git log tells us that `b1` also points to the last commit (`7f3b00e`) but the `HEAD` is still pointing to `master`. If you remember, `HEAD` points to the commit/reference wherever you are checkout to. So if we checkout to `b1`, `HEAD` should point to that. Let's confirm: ```bash $ git checkout b1 @@ -23,7 +23,7 @@ $ git log --oneline --graph * df2fb7a adding file 1 ``` -`b1` still points to the same commit but HEAD now points to `b1`. Since we create a branch at commit `7f3b00e`, there will be two lines of histories starting this commit. Depending on which branch you are checked out on, the line of history will progress. +`b1` still points to the same commit but `HEAD` now points to `b1`. Since we create a branch at commit `7f3b00e`, there will be two lines of histories starting this commit. Depending on which branch you are checked out on, the line of history will progress. At this moment, we are checked out on branch `b1`, so making a new commit will advance branch reference `b1` to that commit and current `b1` commit will become its parent. Let's do that. @@ -44,7 +44,7 @@ $ git log --oneline --graph $ ``` -Do note that master is still pointing to the old commit it was pointing to. We can now checkout to master branch and make commits there. This will result in another line of history starting from commit 7f3b00e. +Do note that master is still pointing to the old commit it was pointing to. We can now checkout to `master` branch and make commits there. This will result in another line of history starting from commit `7f3b00e`. ```bash # checkout to master branch @@ -66,7 +66,7 @@ $ git log --oneline --graph * df2fb7a adding file 1 ``` -Notice how branch b1 is not visible here since we are on the master. Let's try to visualize both to get the whole picture: +Notice how branch `b1` is not visible here since we are on the `master`. Let's try to visualize both to get the whole picture: ```bash $ git log --oneline --graph --all @@ -77,13 +77,13 @@ $ git log --oneline --graph --all * df2fb7a adding file 1 ``` -Above tree structure should make things clear. Notice a clear branch/fork on commit 7f3b00e. This is how we create branches. Now they both are two separate lines of history on which feature development can be done independently. +Above tree structure should make things clear. Notice a clear branch/fork on commit `7f3b00e`. This is how we create branches. Now they both are two separate lines of history on which feature development can be done independently. **To reiterate, internally, git is just a tree of commits. Branch names (human readable) are pointers to those commits in the tree. We use various git commands to work with the tree structure and references. Git accordingly modifies contents of our repo.** ## Merges -Now say the feature you were working on branch `b1` is complete and you need to merge it on master branch, where all the final version of code goes. 
So first you will checkout to branch master and then you pull the latest code from upstream (eg: GitHub). Then you need to merge your code from `b1` into master. There could be two ways this can be done. +Now say the feature you were working on branch `b1` is complete and you need to merge it on `master` branch, where all the final version of code goes. So first, you will `checkout` to branch `master` and then you `pull` the latest code from `upstream` (eg: GitHub). Then you need to merge your code from `b1` into `master`. There could be two ways this can be done. Here is the current history: @@ -96,7 +96,7 @@ $ git log --oneline --graph --all * df2fb7a adding file 1 ``` -**Option 1: Directly merge the branch.** Merging the branch b1 into master will result in a new merge commit. This will merge changes from two different lines of history and create a new commit of the result. +**Option 1: Directly merge the branch.** Merging the branch `b1` into `master` will result in a new merge commit. This will merge changes from two different lines of history and create a new commit of the result. ```bash $ git merge b1 @@ -114,9 +114,9 @@ $ git log --oneline --graph --all * df2fb7a adding file 1 ``` -You can see a new merge commit created (8fc28f9). You will be prompted for the commit message. If there are a lot of branches in the repo, this result will end-up with a lot of merge commits. Which looks ugly compared to a single line of history of development. So let's look at an alternative approach +You can see a new merge commit created (`8fc28f9`). You will be prompted for the commit message. If there are a lot of branches in the repo, this result will end-up with a lot of merge commits. Which looks ugly compared to a single line of history of development. So let's look at an alternative approach. -First let's [reset](https://git-scm.com/docs/git-reset) our last merge and go to the previous state. +First, let's [reset](https://git-scm.com/docs/git-reset) our last merge and go to the previous state. ```bash $ git reset --hard 60dc441 @@ -129,7 +129,7 @@ $ git log --oneline --graph --all * df2fb7a adding file 1 ``` -**Option 2: Rebase.** Now, instead of merging two branches which has a similar base (commit: 7f3b00e), let us rebase branch b1 on to current master. **What this means is take branch `b1` (from commit 7f3b00e to commit 872a38f) and rebase (put them on top of) master (60dc441).** +**Option 2: Rebase.** Now, instead of merging two branches which has a similar base (commit: `7f3b00e`), let us rebase branch `b1` on to current master. **What this means is take branch `b1` (from commit `7f3b00e` to commit `872a38f`) and rebase (put them on top of) master (`60dc441`).** ```bash # Switch to b1 diff --git a/courses/level101/git/git-basics.md b/courses/level101/git/git-basics.md index 16f1ca46..a63461e1 100644 --- a/courses/level101/git/git-basics.md +++ b/courses/level101/git/git-basics.md @@ -3,14 +3,14 @@ ## Prerequisites 1. Have Git installed [https://git-scm.com/downloads](https://git-scm.com/downloads) -2. Have taken any git high level tutorial or following LinkedIn learning courses +2. 
Have taken any git high-level tutorial or following LinkedIn learning courses - [https://www.linkedin.com/learning/git-essential-training-the-basics/](https://www.linkedin.com/learning/git-essential-training-the-basics/) - [https://www.linkedin.com/learning/git-branches-merges-and-remotes/](https://www.linkedin.com/learning/git-branches-merges-and-remotes/) - [The Official Git Docs](https://git-scm.com/doc) ## What to expect from this course -As an engineer in the field of computer science, having knowledge of version control tools becomes almost a requirement. While there are a lot of version control tools that exist today like SVN, Mercurial, etc, Git perhaps is the most used one and this course we will be working with Git. While this course does not start with Git 101 and expects basic knowledge of git as a prerequisite, it will reintroduce the git concepts known by you with details covering what is happening under the hood as you execute various git commands. So that next time you run a git command, you will be able to press enter more confidently! +As an engineer in the field of computer science, having knowledge of version control tools becomes almost a requirement. While there are a lot of version control tools that exist today like SVN, Mercurial, etc, Git perhaps is the most used one and this course we will be working with Git. While this course does not start with Git 101 and expects basic knowledge of git as a prerequisite, it will reintroduce the git concepts known by you with details covering what is happening under the hood as you execute various `git` commands. So that next time you run a `git` command, you will be able to press `enter` more confidently! ## What is not covered under this course @@ -49,11 +49,11 @@ $ ls .git/ HEAD config description hooks info objects refs ``` -There are a bunch of folders and files in the `.git` folder. As I said, all these enables git to do its magic. We will look into some of these folders and files. But for now, what we have is an empty git repository. +There are a bunch of folders and files in the `.git` folder. As I said, all these enable git to do its magic. We will look into some of these folders and files. But for now, what we have is an empty git repository. ### Tracking a File -Now as you might already know, let us create a new file in our repo (we will refer to the folder as _repo_ now.) And see git status +Now as you might already know, let us create a new file in our repo (we will refer to the folder as _repo_ now.) And see `git status`: ```bash $ echo "I am file 1" > file1.txt @@ -70,7 +70,7 @@ Untracked files: nothing added to commit but untracked files present (use "git add" to track) ``` -The current git status says `No commits yet` and there is one untracked file. Since we just created the file, git is not tracking that file. We explicitly need to ask git to track files and folders. (also checkout [gitignore](https://git-scm.com/docs/gitignore)) And how we do that is via `git add` command as suggested in the above output. Then we go ahead and create a commit. +The current git status says `No commits yet` and there is one untracked file. Since we just created the file, git is not tracking that file. We explicitly need to ask git to track files and folders. (Also checkout [gitignore](https://git-scm.com/docs/gitignore)) And how we do that is via `git add` command as suggested in the above output. Then, we go ahead and create a commit. 
```bash
$ git add file1.txt
@@ -90,7 +90,7 @@ $ git commit -m "adding file 1"
 create mode 100644 file1.txt
 ```

-Notice how after adding the file, git status says `Changes to be committed:`. What it means is whatever is listed there, will be included in the next commit. Then we go ahead and create a commit, with an attached messaged via `-m`.
+Notice how after adding the file, `git status` says `Changes to be committed:`. Whatever is listed there will be included in the next commit. Then, we go ahead and create a commit, with a message attached via `-m`.

### More About a Commit

@@ -123,7 +123,7 @@ $ git log --oneline --graph

`git log`, as the name suggests, prints the log of all the git commits. Here you see two additional arguments, `--oneline` prints the shorter version of the log, ie: the commit message only and not the person who made the commit and when. `--graph` prints it in graph format.

-**Now at this moment the commits might look like just one in each line but all commits are stored as a tree like data structure internally by git. That means there can be two or more children commits of a given commit. And not just a single line of commits. We will look more into this part when we get to the Branches section. For now this is our commit history:**
+**Now at this moment, the commits might look like just one per line, but all commits are stored as a tree-like data structure internally by git. That means there can be two or more child commits of a given commit, and not just a single line of commits. We will look more into this part when we get to the Branches section. For now, this is our commit history:**

```bash
df2fb7a ===> 7f3b00e
@@ -131,7 +131,7 @@ $ git log --oneline --graph

### Are commits really linked?

-As I just said, the two commits we just made are linked via tree like data structure and we saw how they are linked. But let's actually verify it. Everything in git is an object. Newly created files are stored as an object. Changes to file are stored as an objects and even commits are objects. To view contents of an object we can use the following command with the object's ID. We will take a look at the contents of the second commit
+As I just said, the two commits we just made are linked via a tree-like data structure, and we saw how they are linked. But let's actually verify it. Everything in git is an object. Newly created files are stored as objects. Changes to files are stored as objects, and even commits are objects. To view the contents of an object, we can use the following command with the object's ID. We will take a look at the contents of the second commit:

```bash
$ git cat-file -p 7f3b00e
@@ -143,7 +143,7 @@ committer Sanket Patel 1603273316 -0700
adding file 2
```

-Take a note of `parent` attribute in the above output. It points to the commit id of the first commit we made. So this proves that they are linked! Additionally you can see the second commit's message in this object. As I said all this magic is enabled by `.git` folder and the object to which we are looking at also is in that folder.
+Take note of the `parent` attribute in the above output. It points to the commit ID of the first commit we made. So this proves that they are linked! Additionally, you can see the second commit's message in this object. As I said, all this magic is enabled by the `.git` folder, and the object we are looking at is also stored in that folder.
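Two related commands help make sense of what follows (a small sketch, reusing the abbreviated commit ID from above):

```shell
# Expand the abbreviated ID into the full 40-character object ID
git rev-parse 7f3b00e

# Ask git what kind of object this is; for our commit it prints "commit"
git cat-file -t 7f3b00e
```

The first two characters of the full ID become a directory name under `.git/objects/` and the rest becomes the file name, which is exactly the path listed below: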
```bash $ ls .git/objects/7f/3b00eaa957815884198e2fdfec29361108d6a9 @@ -154,7 +154,7 @@ It is stored in `.git/objects/` folder. All the files and changes to them as wel ### The Version Control part of Git -We already can see two commits (versions) in our git log. One thing a version control tool gives you is ability to browse back and forth in history. For example: some of your users are running an old version of code and they are reporting an issue. In order to debug the issue, you need access to the old code. The one in your current repo is the latest code. In this example, you are working on the second commit (7f3b00e) and someone reported an issue with the code snapshot at commit (df2fb7a). This is how you would get access to the code at any older commit +We already can see two commits (versions) in our git log. One thing a version control tool gives you is ability to browse back and forth in history. For example: some of your users are running an old version of code and they are reporting an issue. In order to debug the issue, you need access to the old code. The one in your current repo is the latest code. In this example, you are working on the second commit (`7f3b00e`) and someone reported an issue with the code snapshot at commit (`df2fb7a`). This is how you would get access to the code at any older commit. ```bash # Current contents, two files present @@ -181,13 +181,13 @@ $ ls file1.txt ``` -So this is how we would get access to old versions/snapshots. All we need is a _reference_ to that snapshot. Upon executing `git checkout ...`, what git does for you is use the `.git` folder, see what was the state of things (files and folders) at that version/reference and replace the contents of current directory with those contents. The then-existing content will no longer be present in the local dir (repo) but we can and will still get access to them because they are tracked via git commit and `.git` folder has them stored/tracked. +So this is how we would get access to old versions/snapshots. All we need is a _reference_ to that snapshot. Upon executing `git checkout ...`, what git does for you is use the `.git` folder, see what was the state of things (files and folders) at that version/reference and replace the contents of current directory with those contents. The then-existing content will no longer be present in the local dir (repo) but we can and will still get access to them because they are tracked via `git commit` and `.git` folder has them stored/tracked. ### Reference I mention in the previous section that we need a _reference_ to the version. By default, git repo is made of tree of commits. And each commit has a unique IDs. But the unique ID is not the only thing we can reference commits via. There are multiple ways to reference commits. For example: `HEAD` is a reference to current commit. _Whatever commit your repo is checked out at, `HEAD` will point to that._ `HEAD~1` is reference to previous commit. So while checking out previous version in section above, we could have done `git checkout HEAD~1`. -Similarly, master is also a reference (to a branch). Since git uses tree like structure to store commits, there of course will be branches. And the default branch is called `master`. Master (or any branch reference) will point to the latest commit in the branch. Even though we have checked out to the previous commit in out repo, `master` still points to the latest commit. 
And we can get back to the latest version by checkout at `master` reference
+Similarly, `master` is also a reference (to a branch). Since git uses a tree-like structure to store commits, there of course will be branches. And the default branch is called `master`. Master (or any branch reference) will point to the latest commit in the branch. Even though we have checked out to the previous commit in our repo, `master` still points to the latest commit. And we can get back to the latest version by checking out the `master` reference:

```bash
$ git checkout master
@@ -218,7 +218,7 @@ $ cat .git/refs/heads/master
7f3b00eaa957815884198e2fdfec29361108d6a9
```

-Viola! Where master is pointing to is stored in a file. **Whenever git needs to know where master reference is pointing to, or if git needs to update where master points, it just needs to update the file above.** So when you create a new commit, a new commit is created on top of the current commit and the master file is updated with the new commit's ID.
+Voila! Where `master` is pointing to is stored in a file. **Whenever git needs to know where the `master` reference is pointing to, or if git needs to update where `master` points, it just needs to update the file above.** So when you create a new commit, it is created on top of the current commit and the `master` file is updated with the new commit's ID.

Similary, for `HEAD` reference:

@@ -239,7 +239,7 @@ $ git log --oneline --graph
* df2fb7a adding file 1
```

-Now let's change master to point to the previous/first commit.
+Now, let's change `master` to point to the previous/first commit.

```bash
$ echo df2fb7a61f5d40c1191e0fdeb0fc5d6e7969685a > .git/refs/heads/master
diff --git a/courses/level101/git/github-hooks.md b/courses/level101/git/github-hooks.md
index b04d7753..d44fda83 100644
--- a/courses/level101/git/github-hooks.md
+++ b/courses/level101/git/github-hooks.md
@@ -1,13 +1,13 @@
# Git with GitHub

-Till now all the operations we did were in our local repo while git also helps us in a collaborative environment. GitHub is one place on the internet where you can centrally host your git repos and collaborate with other developers.
+Till now, all the operations we did were in our local repo, but git also helps us in a collaborative environment. GitHub is one place on the Internet where you can centrally host your git repos and collaborate with other developers.

Most of the workflow will remain the same as we discussed, with addition of couple of things:

- 1. Pull: to pull latest changes from github (the central) repo
- 2. Push: to push your changes to github repo so that it's available to all people
+ 1. Pull: to pull the latest changes from the GitHub (central) repo
+ 2. Push: to push your changes to the GitHub repo so that they are available to everyone

-GitHub has written nice guides and tutorials about this and you can refer them here:
+GitHub has written nice guides and tutorials about this and you can refer to them here:

- [GitHub Hello World](https://guides.github.com/activities/hello-world/)
- [Git Handbook](https://guides.github.com/introduction/git-handbook/)

@@ -22,7 +22,7 @@ applypatch-msg.sample fsmonitor-watchman.sample pre-applypatch.sample pr
commit-msg.sample post-update.sample pre-commit.sample pre-rebase.sample prepare-commit-msg.sample
```

-Names are self explanatory. These hooks are useful when you want to do certain things when a certain event happens. If you want to run tests before pushing code, you would want to setup `pre-push` hooks. Let's try to create a pre commit hook.
+Names are self-explanatory. These hooks are useful when you want to do certain things when a certain event happens. If you want to run tests before pushing code, you would want to setup `pre-push` hooks. Let's try to create a pre commit hook. ```bash $ echo "echo this is from pre commit hook" > .git/hooks/pre-commit diff --git a/courses/level101/linux_basics/command_line_basics.md b/courses/level101/linux_basics/command_line_basics.md index 630ccc8f..43a2ecc6 100644 --- a/courses/level101/linux_basics/command_line_basics.md +++ b/courses/level101/linux_basics/command_line_basics.md @@ -2,60 +2,52 @@ ## Lab Environment Setup -One can use an online bash interpreter to run all the commands that are provided as examples in this course. This will also help you in getting a hands-on experience of various linux commands. +One can use an online Bash interpreter to run all the commands that are provided as examples in this course. This will also help you in getting a hands-on experience of various Linux commands. -[REPL](https://repl.it/languages/bash) is one of the popular online bash interpreters for running linux commands. We will be using it for running all the commands mentioned in this course. +[REPL](https://repl.it/languages/bash) is one of the popular online Bash interpreters for running Linux commands. We will be using it for running all the commands mentioned in this course. ## What is a Command A command is a program that tells the operating system to perform -specific work. Programs are stored as files in linux. Therefore, a +specific work. Programs are stored as files in Linux. Therefore, a command is also a file which is stored somewhere on the disk. Commands may also take additional arguments as input from the user. These arguments are called command line arguments. Knowing how to use the commands is important and there are many ways to get help in Linux, especially for commands. Almost every command will have some form of -documentation, most commands will have a command-line argument -h or -\--help that will display a reasonable amount of documentation. But the -most popular documentation system in Linux is called man pages - short +documentation, most commands will have a command-line argument `-h` or +`--help` that will display a reasonable amount of documentation. But the +most popular documentation system in Linux is called `man` pages—short for manual pages. -Using \--help to show the documentation for ls command. +Using `--help` to show the documentation for `ls` command. ![](images/linux/commands/image19.png) ## File System Organization -The linux file system has a hierarchical (or tree-like) structure with -its highest level directory called root ( denoted by / ). Directories -present inside the root directory stores file related to the system. +The Linux file system has a hierarchical (or tree-like) structure with +its highest-level directory called `root` (denoted by `/`). Directories +present inside the root directory stores files related to the system. These directories in turn can either store system files or application -files or user related files. +files or user-related files. ![](images/linux/commands/image17.png) - - bin | The executable program of most commonly used commands reside in bin directory - dev | This directory contains files related to devices on the system - - etc | This directory contains all the system configuration files - - home | This directory contains user related files and directories. 
- - lib | This directory contains all the library files - - mnt | This directory contains files related to mounted devices on the system - - proc | This directory contains files related to the running processes on the system - - root | This directory contains root user related files and directories. - - sbin | This directory contains programs used for system administration. - - tmp | This directory is used to store temporary files on the system - - usr | This directory is used to store application programs on the system +| Directory | Description | +|------------|--------------------------------------------------------------------------------| +| bin | The executable program of most commonly used commands reside in `bin` directory| +| dev | This directory contains files related to devices on the system | +| etc | This directory contains all the system configuration files | +| home | This directory contains user-related files and directories | +| lib | This directory contains all the library files | +| mnt | This directory contains files related to mounted devices on the system | +| proc | This directory contains files related to the running processes on the system | +| root | This directory contains root user-related files and directories | +| sbin | This directory contains programs used for system administration | +| tmp | This directory is used to store temporary files on the system | +| usr | This directory is used to store application programs on the system | ## Commands for Navigating the File System @@ -70,42 +62,42 @@ file system: We will now try to understand what each command does and how to use these commands. You should also practice the given examples on the -online bash shell. +online Bash shell. ### pwd (print working directory) At any given moment of time, we will be standing in a certain directory. To get the name of the directory in which we are standing, we can use -the pwd command in linux. +the `pwd` command in Linux. ![](images/linux/commands/image2.png) -We will now use the cd command to move to a different directory and then +We will now use the `cd` command to move to a different directory and then print the working directory. ![](images/linux/commands/image20.png) ### cd (change directory) -The cd command can be used to change the working directory. Using the +The `cd` command can be used to change the working directory. Using the command, you can move from one directory to another. -In the below example, we are initially in the root directory. we have -then used the cd command to change the directory. +In the below example, we are initially in the `root` directory. We have +then used the `cd` command to change the directory. ![](images/linux/commands/image3.png) ### ls (list files and directories)** -The ls command is used to list the contents of a directory. It will list +The `ls` command is used to list the contents of a directory. It will list down all the files and folders present in the given directory. -If we just type ls in the shell, it will list all the files and +If we just type `ls` in the shell, it will list all the files and directories present in the current directory. ![](images/linux/commands/image7.png) -We can also provide the directory name as argument to ls command. It +We can also provide the directory name as argument to `ls` command. It will then list all the files and directories inside the given directory. 
![](images/linux/commands/image4.png) @@ -127,17 +119,17 @@ files: We will now try to understand what each command does and how to use these commands. You should also practice the given examples on the -online bash shell. +online Bash shell. ### touch (create new file) -The touch command can be used to create an empty new file. -This command is very useful for many other purposes but we will discuss +The `touch` command can be used to create an empty new file. +This command is very useful for many other purposes, but we will discuss the simplest use case of creating a new file. -General syntax of using touch command +General syntax of using `touch` command: -``` +```shell touch ``` @@ -145,12 +137,12 @@ touch ### mkdir (create new directories) -The mkdir command is used to create directories.You can use ls command +The `mkdir` command is used to create directories. You can use `ls` command to verify that the new directory is created. -General syntax of using mkdir command +General syntax of using `mkdir` command: -``` +```shell mkdir ``` @@ -158,90 +150,90 @@ mkdir ### rm (delete files and directories) -The rm command can be used to delete files and directories. It is very +The `rm` command can be used to delete files and directories. It is very important to note that this command permanently deletes the files and directories. It's almost impossible to recover these files and -directories once you have executed rm command on them successfully. Do +directories once you have executed `rm` command on them successfully. Do run this command with care. -General syntax of using rm command: +General syntax of using `rm` command: -``` +```shell rm ``` -Let's try to understand the rm command with an example. We will try to -delete the file and directory we created using touch and mkdir command +Let's try to understand the `rm` command with an example. We will try to +delete the file and directory we created using `touch` and `mkdir` command respectively. ![](images/linux/commands/image18.png) ### cp (copy files and directories) -The cp command is used to copy files and directories from one location -to another. Do note that the cp command doesn't do any change to the +The `cp` command is used to copy files and directories from one location +to another. Do note that the `cp` command doesn't do any change to the original files or directories. The original files or directories and -their copy both co-exist after running cp command successfully. +their copy both co-exist after running `cp` command successfully. -General syntax of using cp command: +General syntax of using `cp` command: -``` +```shell cp ``` -We are currently in the '/home/runner' directory. We will use the mkdir -command to create a new directory named "test_directory". We will now -try to copy the "\_test_runner.py" file to the directory we created just +We are currently in the `/home/runner` directory. We will use the `mkdir` +command to create a new directory named `test_directory`. We will now +try to copy the `_test_runner.py` file to the directory we created just now. ![](images/linux/commands/image23.png) -Do note that nothing happened to the original "\_test_runner.py" file. +Do note that nothing happened to the original `_test_runner.py` file. It's still there in the current directory. A new copy of it got created -inside the "test_directory". +inside the `test_directory`. 
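The same flow as a runnable sketch (`demo.txt` is a hypothetical file name; `_test_runner.py` and `test_directory` are from the example above):

```shell
touch demo.txt                        # create an empty file
mkdir test_directory                  # create a new, empty directory
cp _test_runner.py test_directory/    # copy the file; the original stays where it was
ls test_directory                     # verify that the copy is inside the directory
rm demo.txt                           # permanently delete the example file; use with care
```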
![](images/linux/commands/image14.png) -We can also use the cp command to copy the whole directory from one +We can also use the `cp` command to copy the whole directory from one location to another. Let's try to understand this with an example. ![](images/linux/commands/image12.png) -We again used the mkdir command to create a new directory called -"another_directory". We then used the cp command along with an -additional argument '-r' to copy the "test_directory". +We again used the `mkdir` command to create a new directory called +`another_directory`. We then used the `cp` command along with an +additional argument `-r` to copy the `test_directory`. **mv (move files and directories)** -The mv command can either be used to move files or directories from one +The `mv` command can either be used to move files or directories from one location to another or it can be used to rename files or directories. Do note that moving files and copying them are very different. When you move the files or directories, the original copy is lost. -General syntax of using mv command: +General syntax of using `mv` command: -``` +```shell mv ``` -In this example, we will use the mv command to move the -"\_test_runner.py" file to "test_directory". In this case, this file -already exists in "test_directory". The mv command will just replace it. +In this example, we will use the `mv` command to move the +`_test_runner.py` file to `test_directory`. In this case, this file +already exists in `test_directory`. The `mv` command will just replace it. **Do note that the original file doesn't exist in the current directory -after mv command ran successfully.** +after `mv` command ran successfully.** ![](images/linux/commands/image26.png) -We can also use the mv command to move a directory from one location to -another. In this case, we do not need to use the '-r' flag that we did -while using the cp command. Do note that the original directory will not -exist if we use mv command. +We can also use the `mv` command to move a directory from one location to +another. In this case, we do not need to use the `-r` flag that we did +while using the `cp` command. Do note that the original directory will not +exist if we use `mv` command. -One of the important uses of the mv command is to rename files and +One of the important uses of the `mv` command is to rename files and directories. Let's see how we can use this command for renaming. -We have first changed our location to "test_directory". We then use the -mv command to rename the ""\_test_runner.py" file to "test.py". +We have first changed our location to `test_directory`. We then use the +`mv` command to rename the `_test_runner.py` file to `test.py`. ![](images/linux/commands/image29.png) @@ -262,9 +254,9 @@ files: We will now try to understand what each command does and how to use these commands. You should also practice the given examples on the -online bash shell. +online Bash shell. -We will create a new file called "numbers.txt" and insert numbers from 1 +We will create a new file called `numbers.txt` and insert numbers from 1 to 100 in this file. Each number will be in a separate line. ![](images/linux/commands/image21.png) @@ -277,7 +269,7 @@ later sections. ### cat -The most simplest use of cat command is to print the contents of the file on +The most simplest use of `cat` command is to print the contents of the file on your output screen. This command is very useful and can be used for many other purposes. We will study about other use cases later. 
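One way to create the `numbers.txt` file used in these examples and print it back (a sketch; `seq` is available on most Linux systems):

```shell
seq 1 100 > numbers.txt    # write the numbers 1 to 100, one per line
cat numbers.txt            # print the whole file to the screen
wc -l numbers.txt          # count the lines; should report 100
```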
@@ -289,63 +281,63 @@ all the numbers. ### head -The head command displays the first 10 lines of the file by default. We +The `head` command displays the first 10 lines of the file by default. We can include additional arguments to display as many lines as we want from the top. In this example, we are only able to see the first 10 lines from the -file when we use the head command. +file when we use the `head` command. ![](images/linux/commands/image15.png) -By default, head command will only display the first 10 lines. If we +By default, `head` command will only display the first 10 lines. If we want to specify the number of lines we want to see from start, use the -'-n' argument to provide the input. +`-n` argument to provide the input. ![](images/linux/commands/image16.png) ### tail -The tail command displays the last 10 lines of the file by default. We +The `tail` command displays the last 10 lines of the file by default. We can include additional arguments to display as many lines as we want from the end of the file. ![](images/linux/commands/image22.png) -By default, the tail command will only display the last 10 lines. If we -want to specify the number of lines we want to see from the end, use '-n' +By default, the `tail` command will only display the last 10 lines. If we +want to specify the number of lines we want to see from the end, use `-n` argument to provide the input. ![](images/linux/commands/image10.png) In this example, we are only able to see the last 5 lines from the file -when we use the tail command with explicit -n option. +when we use the `tail` command with explicit `-n` option. ### more -More command displays the contents of a file or a command output, +The `more` command displays the contents of a file or a command output, displaying one screen at a time in case the file is large (Eg: log files). It also allows forward navigation and limited backward navigation in the file. ![](images/linux/commands/image33.png) -More command displays as much as can fit on the current screen and waits for user input to advance. Forward navigation can be done by pressing Enter, which advances the output by one line and Space, which advances the output by one screen. +The `more` command displays as much as can fit on the current screen and waits for user input to advance. Forward navigation can be done by pressing `Enter`, which advances the output by one line and `Space`, which advances the output by one screen. ### less -Less command is an improved version of more. It displays the contents of a file or a command output, one page at a time. -It allows backward navigation as well as forward navigation in the file and also has search options. We can use arrow keys for advancing backward or forward by one line. For moving forward by one page, press Space and for moving backward by one page, press b on your keyboard. +The `less` command is an improved version of `more`. It displays the contents of a file or a command output, one page at a time. +It allows backward navigation as well as forward navigation in the file and also has search options. We can use `arrow keys` for advancing backward or forward by one line. For moving forward by one page, press `Space` and for moving backward by one page, press `b` on your keyboard. You can go to the beginning and the end of a file instantly. ## Echo Command in Linux -The echo command is one of the simplest commands that is used in the -shell. 
This command is equivalent to what we have in other +The `echo` command is one of the simplest commands that is used in the +shell. This command is equivalent to `print` in other programming languages. -The echo command prints the given input string on the screen. +The `echo` command prints the given input string on the screen. ![](images/linux/commands/image34.png) @@ -371,61 +363,61 @@ texts: We will now try to understand what each command does and how to use these commands. You should also practice the given examples on the -online bash shell. +online Bash shell. -We will create a new file called "numbers.txt" and insert numbers from 1 +We will create a new file called `numbers.txt` and insert numbers from 1 to 10 in this file. Each number will be in a separate line. ![](images/linux/commands/image8.png) ### grep -The grep command in its simplest form can be used to search particular +The `grep` command in its simplest form can be used to search particular words in a text file. It will display all the lines in a file that contains a particular input. The word we want to search is provided as -an input to the grep command. +an input to the `grep` command. -General syntax of using grep command: +General syntax of using `grep` command: -``` +```shell grep ``` In this example, we are trying to search for a string "1" in this file. -The grep command outputs the lines where it found this string. +The `grep` command outputs the lines where it found this string. ![](images/linux/commands/image36.png) ### sed -The sed command in its simplest form can be used to replace a text in a +The `sed` command in its simplest form can be used to replace a text in a file. -General syntax of using the sed command for replacement: +General syntax of using the `sed` command for replacement: -``` +```shell sed 's///' ``` Let's try to replace each occurrence of "1" in the file with "3" using -sed command. +`sed` command. ![](images/linux/commands/image31.png) The content of the file will not change in the above -example. To do so, we have to use an extra argument '-i' so that the +example. To do so, we have to use an extra argument `-i` so that the changes are reflected back in the file. ### sort -The sort command can be used to sort the input provided to it as an +The `sort` command can be used to sort the input provided to it as an argument. By default, it will sort in increasing order. Let's first see the content of the file before trying to sort it. ![](images/linux/commands/image27.png) -Now, we will try to sort the file using the sort command. The sort +Now, we will try to sort the file using the `sort` command. The `sort` command sorts the content in lexicographical order. ![](images/linux/commands/image32.png) @@ -437,11 +429,11 @@ example. Each open file gets assigned a file descriptor. A file descriptor is an unique identifier for open files in the system. There are always three -default files open, stdin (the keyboard), stdout (the screen), and -stderr (error messages output to the screen). These files can be +default files open, `stdin` (the keyboard), `stdout` (the screen), and +`stderr` (error messages output to the screen). These files can be redirected. -Everything is a file in linux - +Everything is a file in Linux - [https://unix.stackexchange.com/questions/225537/everything-is-a-file](https://unix.stackexchange.com/questions/225537/everything-is-a-file) Till now, we have displayed all the output on the screen which is the @@ -449,12 +441,12 @@ standard output. 
We can use some special operators to redirect the output of the command to files or even to the input of other commands. I/O redirection is a very powerful feature. -In the below example, we have used the '>' operator to redirect the -output of ls command to output.txt file. +In the below example, we have used the `>` operator to redirect the +output of `ls` command to `output.txt` file. ![](images/linux/commands/image30.png) -In the below example, we have redirected the output from echo command to +In the below example, we have redirected the output from `echo` command to a file. ![](images/linux/commands/image13.png) @@ -462,13 +454,13 @@ a file. We can also redirect the output of a command as an input to another command. This is possible with the help of pipes. -In the below example, we have passed the output of cat command as an -input to grep command using pipe(\|) operator. +In the below example, we have passed the output of `cat` command as an +input to `grep` command using pipe (`|`) operator. ![](images/linux/commands/image6.png) -In the below example, we have passed the output of sort command as an -input to uniq command using pipe(\|) operator. The uniq command only +In the below example, we have passed the output of `sort` command as an +input to `uniq` command using pipe (`|`) operator. The `uniq` command only prints the unique numbers from the input. ![](images/linux/commands/image28.png) diff --git a/courses/level101/linux_basics/conclusion.md b/courses/level101/linux_basics/conclusion.md index 6634b9d6..0efa0e4d 100644 --- a/courses/level101/linux_basics/conclusion.md +++ b/courses/level101/linux_basics/conclusion.md @@ -1,6 +1,6 @@ # Conclusion -We have covered the basics of Linux operating systems and basic commands used in linux. +We have covered the basics of Linux operating systems and basic commands used in Linux. We have also covered the Linux server administration commands. We hope that this course will make it easier for you to operate on the command line. @@ -13,12 +13,12 @@ We hope that this course will make it easier for you to operate on the command l 4. `tail` command is very useful to view the latest data in the log file. 5. Different users will have different permissions depending on their roles. We will also not want everyone in the company to access our servers for security reasons. Users permissions can be restricted with `chown`, `chmod` and `chgrp` commands. 6. `ssh` is one of the most frequently used commands for a SRE. Logging into servers and troubleshooting along with performing basic administration tasks will only be possible if we are able to login into the server. -7. What if we want to run an apache server or nginx on a server? We will first install it using the package manager. Package management commands become important here. -8. Managing services on servers is another critical responsibility of a SRE. Systemd related commands can help in troubleshooting issues. If a service goes down, we can start it using `systemctl start` command. We can also stop a service in case it is not needed. -9. Monitoring is another core responsibility of a SRE. Memory and CPU are two important system level metrics which should be monitored. Commands like `top` and `free` are quite helpful here. -10. If a service is throwing an error, how do we find out the root cause of the error ? We will certainly need to check logs to find out the whole stack trace of the error. 
The log file will also tell us the number of times the error has occurred along with time when it started. +7. What if we want to run an Apache server or NGINX on a server? We will first install it using the package manager. Package management commands become important here. +8. Managing services on servers is another critical responsibility of a SRE. `systemd`-related commands can help in troubleshooting issues. If a service goes down, we can start it using `systemctl start` command. We can also stop a service in case it is not needed. +9. Monitoring is another core responsibility of a SRE. Memory and CPU are two important system-level metrics which should be monitored. Commands like `top` and `free` are quite helpful here. +10. If a service throws an error, how do we find out the root cause of the error? We will certainly need to check logs to find out the whole stack trace of the error. The log file will also tell us the number of times the error has occurred along with time when it started. -## Useful Courses and tutorials +## Useful Courses and Tutorials * [Edx basic linux commands course](https://courses.edx.org/courses/course-v1:LinuxFoundationX+LFS101x+1T2020/course/) * [Edx Red Hat Enterprise Linux Course](https://courses.edx.org/courses/course-v1:RedHat+RH066x+2T2017/course/) diff --git a/courses/level101/linux_basics/intro.md b/courses/level101/linux_basics/intro.md index 1060d01a..5cb2334e 100644 --- a/courses/level101/linux_basics/intro.md +++ b/courses/level101/linux_basics/intro.md @@ -3,7 +3,7 @@ ## Introduction ### Prerequisites -- Should be comfortable in using any operating systems like Windows, Linux or Mac +- Should be comfortable in using any operating systems like Windows, Linux or - Expected to have fundamental knowledge of operating systems ## What to expect from this course @@ -15,17 +15,17 @@ difference between GUI and CLI. In the second part, we cover some basic commands used in Linux. We will focus on commands used for navigating the file system, viewing and manipulating files, -I/O redirection etc. +I/O redirection, etc. -In the third part, we cover Linux system administration. This includes day to day tasks +In the third part, we cover Linux system administration. This includes day-to-day tasks performed by Linux admins, like managing users/groups, managing file permissions, monitoring system performance, log files etc. -In the second and third part, we will be taking examples to understand the concepts. +In the second and third part, we will be showing examples to understand the concepts. ## What is not covered under this course -We are not covering advanced Linux commands and bash scripting in this +We are not covering advanced Linux commands and Bash scripting in this course. We will also not be covering Linux internals. ## Course Contents @@ -69,32 +69,32 @@ Most of us are familiar with the Windows operating system used in more than 75% of the personal computers. The Windows operating systems are based on Windows NT kernel. -A kernel is the most important part of -an operating system - it performs important functions like process -management, memory management, filesystem management etc. +A _kernel_ is the most important part of +an operating system—it performs important functions like process +management, memory management, filesystem management, etc. -Linux operating systems are based on the Linux kernel. A Linux based +Linux operating systems are based on the Linux kernel. 
A Linux-based operating system will consist of Linux kernel, GUI/CLI, system libraries and system utilities. The Linux kernel was independently developed and -released by Linus Torvalds. The Linux kernel is free and open-source - -[https://github.com/torvalds/linux](https://github.com/torvalds/linux) +released by Linus Torvalds. The Linux kernel is free and open-source (See +[https://github.com/torvalds/linux](https://github.com/torvalds/linux)). -Linux is a kernel and not a complete operating system. Linux kernel is combined with GNU system to make a complete operating system. Therefore, linux based operating systems are also called as GNU/Linux systems. GNU is an extensive collection of free softwares like compiler, debugger, C library etc. -[Linux and the GNU System](https://www.gnu.org/gnu/linux-and-gnu.en.html) +Linux is a kernel and not a complete operating system. Linux kernel is combined with GNU system to make a complete operating system. Therefore, Linux-based operating systems are also called as GNU/Linux systems. GNU is an extensive collection of free softwares like compiler, debugger, C library etc. (See +[Linux and the GNU System](https://www.gnu.org/gnu/linux-and-gnu.en.html)) History of Linux - [https://en.wikipedia.org/wiki/History_of_Linux](https://en.wikipedia.org/wiki/History_of_Linux) ## What are popular Linux distributions -A Linux distribution(distro) is an operating system based on +A Linux distribution (_distro_) is an operating system based on the Linux kernel and a package management system. A package management system consists of tools that help in installing, upgrading, configuring and removing softwares on the operating system. Software are usually adopted to a distribution and are packaged in a -distro specific format. These packages are available through a distro -specific repository. Packages are installed and managed in the operating +distro-specific format. These packages are available through a distro-specific +repository. Packages are installed and managed in the operating system by a package manager. **List of popular Linux distributions:** @@ -116,8 +116,8 @@ system by a package manager. | Packaging systems | Distributions | Package manager | ---------------------- | ------------------------------------------ | ----------------- -| Debian style (.deb) | Debian, Ubuntu | APT -| Red Hat style (.rpm) | Fedora, CentOS, Red Hat Enterprise Linux | YUM +| Debian style (`.deb`) | Debian, Ubuntu | APT +| Red Hat style (`.rpm`) | Fedora, CentOS, Red Hat Enterprise Linux | YUM ## Linux Architecture @@ -141,11 +141,11 @@ Operating system based on Linux kernel are widely used in: - Mobile phones - Android is based on Linux operating system -- Embedded devices - watches, televisions, traffic lights etc +- Embedded devices - watches, televisions, traffic lights, etc. - Satellites -- Network devices - routers, switches etc. +- Network devices - routers, switches, etc. ## Graphical user interface (GUI) vs Command line interface (CLI) @@ -172,9 +172,9 @@ programs available on Linux servers. Other popular shell programs are zsh, ksh and tcsh. Terminal is a program that opens a window and lets you interact with the -shell. Some popular examples of terminals are gnome-terminal, xterm, -konsole etc. +shell. Some popular examples of terminals are GNOME-terminal, xterm, +Konsole, etc. -Linux users do use the terms shell, terminal, prompt, console etc. +Linux users do use the terms shell, terminal, prompt, console, etc. interchangeably. 
In simple terms, these all refer to a way of taking commands from the user. diff --git a/courses/level101/linux_basics/linux_server_administration.md b/courses/level101/linux_basics/linux_server_administration.md index 26b9cbd6..57a8fb6d 100644 --- a/courses/level101/linux_basics/linux_server_administration.md +++ b/courses/level101/linux_basics/linux_server_administration.md @@ -1,6 +1,6 @@ # Linux Server Administration -In this course will try to cover some of the common tasks that a linux +In this course, will try to cover some of the common tasks that a Linux server administrator performs. We will first try to understand what a particular command does and then try to understand the commands using examples. Do keep in mind that it's very important to practice the Linux @@ -8,7 +8,7 @@ commands on your own. ## Lab Environment Setup -- Install docker on your system - [https://docs.docker.com/engine/install/](https://docs.docker.com/engine/install/) OR you can used online [Docker playground](https://labs.play-with-docker.com/) +- Install docker on your system - [https://docs.docker.com/engine/install/](https://docs.docker.com/engine/install/) OR you can use online [Docker playground](https://labs.play-with-docker.com/) - We will be running all the commands on Red Hat Enterprise Linux (RHEL) 8 system. @@ -18,7 +18,7 @@ commands on your own. ## Multi-User Operating Systems -An operating system is considered as multi-user if it allows multiple people/users to use a computer and not affect each other's files and preferences. Linux based operating systems are multi-user in nature as it allows multiple users to access the system at the same time. A typical computer will only have one keyboard and monitor but multiple users can log in via SSH if the computer is connected to the network. We will cover more about SSH later. +An operating system is considered as multi-user if it allows multiple people/users to use a computer and not affect each other's files and preferences. Linux-based operating systems are multi-user in nature as it allows multiple users to access the system at the same time. A typical computer will only have one keyboard and monitor but multiple users can log in via SSH if the computer is connected to the network. We will cover more about SSH later. As a server administrator, we are mostly concerned with the Linux servers which are physically present at a very large distance from us. We can connect to these servers with the help of remote login methods like SSH. @@ -37,26 +37,28 @@ Since Linux supports multiple users, we need to have a method which can protect ### id command -`id` command can be used to find the uid and gid associated with an user. +`id` command can be used to find the `uid` and `gid` associated with an user. It also lists down the groups to which the user belongs to. -The uid and gid associated with the root user is 0. +The `uid` and `gid` associated with the root user is 0. + ![](images/linux/admin/image30.png) -A good way to find out the current user in Linux is to use the whoami +A good way to find out the current user in Linux is to use the `whoami` command. ![](images/linux/admin/image35.png) -**"root" user or superuser is the most privileged user with** +**`root` user or superuser is the most privileged user with** **unrestricted access to all the resources on the system. 
It has UID 0** ### Important files associated with users/groups -| /etc/passwd | Stores the user name, the uid, the gid, the home directory, the login shell etc | -| -------------| --------------------------------------------------------------------------------- -| /etc/shadow | Stores the password associated with the users | -| /etc/group | Stores information about different groups on the system | +| Files | Description | +|--------------|----------------------------------------------------------------------------------------| +| /etc/passwd | Stores the user name, the `uid`, the `gid`, the home directory, the login shell etc | +| /etc/shadow | Stores the password associated with the users | +| /etc/group | Stores information about different groups on the system | ![](images/linux/admin/image23.png) @@ -64,7 +66,7 @@ command. ![](images/linux/admin/image9.png) -If you want to understand each filed discussed in the above outputs, you can go +If you want to understand each field discussed in the above outputs, you can go through below links: - [https://tldp.org/LDP/lame/LAME/linux-admin-made-easy/shadow-file-formats.html](https://tldp.org/LDP/lame/LAME/linux-admin-made-easy/shadow-file-formats.html) @@ -77,21 +79,18 @@ Some of the commands which are used frequently to manage users/groups on Linux are following: - `useradd` - Creates a new user - - `passwd` - Adds or modifies passwords for a user - - `usermod` - Modifies attributes of an user - - `userdel` - Deletes an user ### useradd -The useradd command adds a new user in Linux. +The `useradd` command adds a new user in Linux. -We will create a new user 'shivam'. We will also verify that the user -has been created by tailing the /etc/passwd file. The uid and gid are +We will create a new user `shivam`. We will also verify that the user +has been created by tailing the `/etc/passwd` file. The `uid` and `gid` are 1000 for the newly created user. The home directory assigned to the user -is /home/shivam and the login shell assigned is /bin/bash. Do note that +is `/home/shivam` and the login shell assigned is `/bin/bash`. Do note that the user home directory and login shell can be modified later on. ![](images/linux/admin/image41.png) @@ -104,17 +103,17 @@ override these default values when creating a new user. ### passwd -The passwd command is used to create or modify passwords for a user. +The `passwd` command is used to create or modify passwords for a user. In the above examples, we have not assigned any password for users -'shivam' or 'amit' while creating them. +`shivam` or `amit` while creating them. -"!!" in an account entry in shadow means the account of an user has +`!!` in an account entry in shadow means the account of an user has been created, but not yet given a password. ![](images/linux/admin/image13.png) -Let's now try to create a password for user "shivam". +Let's now try to create a password for user `shivam`. ![](images/linux/admin/image55.png) @@ -129,118 +128,118 @@ Also, when you login using root user, the password will be asked. ### usermod -The usermod command is used to modify the attributes of an user like the +The `usermod` command is used to modify the attributes of an user like the home directory or the shell. -Let's try to modify the login shell of user "amit" to "/bin/bash". +Let's try to modify the login shell of user `amit` to `/bin/bash`. ![](images/linux/admin/image17.png) In a similar way, you can also modify many other attributes for a user. -Try 'usermod -h' for a list of attributes you can modify. 
+Try `usermod -h` for a list of attributes you can modify. ### userdel -The userdel command is used to remove a user on Linux. Once we remove a +The `userdel` command is used to remove a user on Linux. Once we remove a user, all the information related to that user will be removed. -Let's try to delete the user "amit". After deleting the user, you will -not find the entry for that user in "/etc/passwd" or "/etc/shadow" file. +Let's try to delete the user `amit`. After deleting the user, you will +not find the entry for that user in `/etc/passwd` or `/etc/shadow` file. ![](images/linux/admin/image34.png) ## Important commands for managing groups -Commands for managing groups are quite similar to the commands used for managing users. Each command is not explained in detail here as they are quite similar. You can try running these commands on your system. +Commands for managing groups are quite similar to the commands used for managing users. Each command is not explained in detail here as they are quite similar. You can try running these commands on your system. - -| groupadd \ | Creates a new group | -| ------------------------ | ------------------------------- | -| groupmod \ | Modifies attributes of a group | -| groupdel \ | Deletes a group | -| gpasswd \ | Modifies password for group | +| Command | Description | +| -----------------------| ------------------------------- | +| groupadd | Creates a new group | +| groupmod | Modifies attributes of a group | +| groupdel | Deletes a group | +| gpasswd | Modifies password for group | ![](images/linux/admin/image52.png) -We will now try to add user "shivam" to the group we have created above. +We will now try to add user `shivam` to the group we have created above. ![](images/linux/admin/image33.png) ## Becoming a Superuser **Before running the below commands, do make sure that you have set up a -password for user "shivam" and user "root" using the passwd command +password for user `shivam` and user `root` using the `passwd` command described in the above section.** -The su command can be used to switch users in Linux. Let's now try to -switch to user "shivam". +The `su` command can be used to switch users in Linux. Let's now try to +switch to user `shivam`. ![](images/linux/admin/image37.png) -Let's now try to open the "/etc/shadow" file. +Let's now try to open the `/etc/shadow` file. ![](images/linux/admin/image29.png) -The operating system didn't allow the user "shivam" to read the content -of the "/etc/shadow" file. This is an important file in Linux which -stores the passwords of users. This file can only be accessed by root or -users who have the superuser privileges. +The operating system didn't allow the user `shivam` to read the content +of the `/etc/shadow` file. This is an important file in Linux which +stores the passwords of users. This file can only be accessed by `root` or +users who have the `superuser` privileges. -**The sudo command allows a** **user to run commands with the security +**The `sudo` command allows a** **user to run commands with the security privileges of the root user.** Do remember that the root user has all -the privileges on a system. We can also use su command to switch to the +the privileges on a system. We can also use `su` command to switch to the root user and open the above file but doing that will require the password of the root user. 
An alternative way which is preferred on most -modern operating systems is to use sudo command for becoming a +modern operating systems is to use `sudo` command for becoming a superuser. Using this way, a user has to enter his/her password and they -need to be a part of the sudo group. +need to be a part of the `sudo` group. **How to provide superpriveleges to other users ?** -Let's first switch to the root user using su command. Do note that using +Let's first switch to the root user using `su` command. Do note that using the below command will need you to enter the password for the root user. ![](images/linux/admin/image44.png) -In case, you forgot to set a password for the root user, type "exit" and +In case, you forgot to set a password for the root user, type `exit` and you will be back as the root user. Now, set up a password using the -passwd command. +`passwd` command. -**The file /etc/sudoers holds the names of users permitted to invoke -sudo**. In redhat operating systems, this file is not present by -default. We will need to install sudo. +**The file `/etc/sudoers` holds the names of users permitted to invoke +`sudo`**. In Red Hat operating systems, this file is not present by +default. We will need to install `sudo`. ![](images/linux/admin/image3.png) -We will discuss the yum command in detail in later sections. +We will discuss the `yum` command in detail in later sections. -Try to open the "/etc/sudoers" file on the system. The file has a lot of +Try to open the `/etc/sudoers` file on the system. The file has a lot of information. This file stores the rules that users must follow when -running the sudo command. For example, root is allowed to run any +running the `sudo` command. For example, `root` is allowed to run any commands from anywhere. ![](images/linux/admin/image8.png) One easy way of providing root access to users is to add them to a group -which has permissions to run all the commands. "wheel" is a group in -redhat Linux with such privileges. +which has permissions to run all the commands. `wheel` is a group in +Red Hat Linux with such privileges. ![](images/linux/admin/image25.png) -Let's add the user "shivam" to this group so that it also has sudo +Let's add the user `shivam` to this group so that it also has `sudo` privileges. ![](images/linux/admin/image48.png) -Let's now switch back to user "shivam" and try to access the -"/etc/shadow" file. +Let's now switch back to user `shivam` and try to access the +`/etc/shadow` file. ![](images/linux/admin/image56.png) -We need to use sudo before running the command since it can only be -accessed with the sudo privileges. We have already given sudo privileges -to user “shivam” by adding him to the group “wheel”. +We need to use `sudo` before running the command since it can only be +accessed with the `sudo` privileges. We have already given `sudo` privileges +to user `shivam` by adding him to the group `wheel`. ## File Permissions @@ -250,8 +249,8 @@ permissions for the owner of the file, the members of a group of related users and everybody else. This is to make sure that one user is not allowed to access the files and resources of another user. -To see the permissions of a file, we can use the ls command. Let's look -at the permissions of /etc/passwd file. +To see the permissions of a file, we can use the `ls` command. Let's look +at the permissions of `/etc/passwd` file. ![](images/linux/admin/image40.png) @@ -265,10 +264,10 @@ related to file permissions. 
### Chmod command -The chmod command is used to modify files and directories permissions in +The `chmod` command is used to modify files and directories permissions in Linux. -The chmod command accepts permissions in as a numerical argument. We can +The `chmod` command accepts permissions in as a numerical argument. We can think of permission as a series of bits with 1 representing True or allowed and 0 representing False or not allowed. @@ -288,26 +287,26 @@ We will now create a new file and check the permission of the file. ![](images/linux/admin/image15.png) The group owner doesn't have the permission to write to this file. Let's -give the group owner or root the permission to write to it using chmod +give the group owner or root the permission to write to it using `chmod` command. ![](images/linux/admin/image26.png) -Chmod command can be also used to change the permissions of a directory +`chmod` command can be also used to change the permissions of a directory in the similar way. ### Chown command -The chown command is used to change the owner of files or +The `chown` command is used to change the owner of files or directories in Linux. -Command syntax: chown \ \ +Command syntax: `chown \ \` ![](images/linux/admin/image6.png) -**In case, we do not have sudo privileges, we need to use sudo -command**. Let's switch to user 'shivam' and try changing the owner. We -have also changed the owner of the file to root before running the below +**In case, we do not have `sudo` privileges, we need to use `sudo` +command**. Let's switch to user `shivam` and try changing the owner. We +have also changed the owner of the file to `root` before running the below command. ![](images/linux/admin/image12.png) @@ -317,53 +316,53 @@ similar way. ### Chgrp command -The chgrp command can be used to change the group ownership of files or -directories in Linux. The syntax is very similar to that of chown +The `chgrp` command can be used to change the group ownership of files or +directories in Linux. The syntax is very similar to that of `chown` command. ![](images/linux/admin/image27.png) -Chgrp command can also be used to change the owner of a directory in the +`chgrp` command can also be used to change the owner of a directory in the similar way. ## SSH Command -The ssh command is used for logging into the remote systems, transfer files between systems and for executing commands on a remote machine. SSH stands for secure shell and is used to provide an encrypted secured connection between two hosts over an insecure network like the internet. +The `ssh` command is used for logging into the remote systems, transfer files between systems and for executing commands on a remote machine. `SSH` stands for secure shell and is used to provide an encrypted secured connection between two hosts over an insecure network like the internet. Reference: [https://www.ssh.com/ssh/command/](https://www.ssh.com/ssh/command/) We will now discuss passwordless authentication which is secure and most -commonly used for ssh authentication. +commonly used for `ssh` authentication. ### Passwordless Authentication Using SSH -Using this method, we can ssh into hosts without entering the password. +Using this method, we can `ssh` into hosts without entering the password. This method is also useful when we want some scripts to perform ssh-related tasks. Passwordless authentication requires the use of a public and private key pair. As the name implies, the public key can be shared with anyone but the private key should be kept private. 
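Before walking through the steps, a quick way to check whether a key pair already exists on your machine (a small sketch):

```shell
ls ~/.ssh/    # id_rsa and id_rsa.pub (or id_ed25519 / id_ed25519.pub) indicate an existing key pair
```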
-Lets not get into the details of how this authentication works. You can read more about it +Let's not get into the details of how this authentication works. You can read more about it [here](https://www.digitalocean.com/community/tutorials/understanding-the-ssh-encryption-and-connection-process) Steps for setting up a passwordless authentication with a remote host: 1. Generating public-private key pair - **If we already have a key pair stored in \~/.ssh directory, we will not need to generate keys again.** + **If we already have a key pair stored in `~/.ssh` directory, we will not need to generate keys again.** - Install openssh package which contains all the commands related to ssh. + Install `openssh` package which contains all the commands related to `ssh`. ![](images/linux/admin/image49.png) - Generate a key pair using the ssh-keygen command. One can choose the + Generate a key pair using the `ssh-keygen` command. One can choose the default values for all prompts. ![](images/linux/admin/image47.png) - After running the ssh-keygen command successfully, we should see two - keys present in the \~/.ssh directory. Id_rsa is the private key and - id_rsa.pub is the public key. Do note that the private key can only be + After running the `ssh-keygen` command successfully, we should see two + keys present in the `~/.ssh` directory. `id_rsa` is the private key and + `id_rsa.pub` is the public key. Do note that the private key can only be read and modified by you. ![](images/linux/admin/image7.png) @@ -372,40 +371,48 @@ Steps for setting up a passwordless authentication with a remote host: There are multiple ways to transfer the public key to the remote server. We will look at one of the most common ways of doing it using the - ssh-copy-id command. + `ssh-copy-id` command. ![](images/linux/admin/image11.png) - Install the openssh-clients package to use ssh-copy-id command. + Install the `openssh-clients` package to use `ssh-copy-id` command. ![](images/linux/admin/image46.png) - Use the ssh-copy-id command to copy your public key to the remote host. + Use the `ssh-copy-id` command to copy your public key to the remote host. ![](images/linux/admin/image50.png) - Now, ssh into the remote host using the password authentication. + Now, `ssh` into the remote host using the password authentication. ![](images/linux/admin/image51.png) - Our public key should be there in \~/.ssh/authorized_keys now. + Our public key should be there in `~/.ssh/authorized_keys` now. ![](images/linux/admin/image4.png) - \~/.ssh/authorized_key contains a list of public keys. The users - associated with these public keys have the ssh access into the remote + `~/.ssh/authorized_key` contains a list of public keys. The users + associated with these public keys have the `ssh` access into the remote host. ### How to run commands on a remote host ? -General syntax: ssh \@\ \ +General syntax: + +```shell +ssh \@\ \ +``` ![](images/linux/admin/image14.png) ### How to transfer files from one host to another host ? -General syntax: scp \ \ +General syntax: + +```shell +scp \ \ +``` ![](images/linux/admin/image32.png) @@ -418,32 +425,32 @@ systems. 
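The generic `ssh`/`scp` syntax shown above appears to have lost its angle-bracket placeholders in formatting; as a hedged illustration (user, host and file names are made up), the two commands are typically invoked like this:

```shell
# Run a single command on a remote host and print its output locally
ssh shivam@remote-host uptime

# Copy a local file into the remote user's home directory
scp ./notes.txt shivam@remote-host:~/
```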
| Packaging systems | Distributions | | ---------------------- | ------------------------------------------ | -| Debian style (.deb) | Debian, Ubuntu | -| Red Hat style (.rpm) | Fedora, CentOS, Red Hat Enterprise Linux | +| Debian style (`.deb`) | Debian, Ubuntu | +| Red Hat style (`.rpm`) | Fedora, CentOS, Red Hat Enterprise Linux | **Popular Packaging Systems in Linux** |Command | Description | | ----------------------------- | --------------------------------------------------- | -| yum install \ | Installs a package on your system | -| yum update \ | Updates a package to it's latest available version | -| yum remove \ | Removes a package from your system | -| yum search \ | Searches for a particular keyword | +| yum install | Installs a package on your system | +| yum update | Updates a package to its latest available version | +| yum remove | Removes a package from your system | +| yum search | Searches for a particular keyword | [DNF](https://docs.fedoraproject.org/en-US/quick-docs/dnf/) is the successor to YUM which is now used in Fedora for installing and -managing packages. DNF may replace YUM in the future on all RPM based +managing packages. DNF may replace YUM in the future on all RPM-based Linux distributions. ![](images/linux/admin/image20.png) -We did find an exact match for the keyword httpd when we searched using -yum search command. Let's now install the httpd package. +We did find an exact match for the keyword `httpd` when we searched using +`yum search` command. Let's now install the `httpd` package. ![](images/linux/admin/image28.png) -After httpd is installed, we will use the yum remove command to remove -httpd package. +After `httpd` is installed, we will use the `yum remove` command to remove +`httpd` package. ![](images/linux/admin/image43.png) @@ -454,15 +461,15 @@ used to monitor the processes on Linux systems. ### ps (process status) -The ps command is used to know the information of a process or list of +The `ps` command is used to know the information of a process or list of processes. ![](images/linux/admin/image24.png) -If you get an error "ps command not found" while running ps command, do -install **procps** package. +If you get an error "ps command not found" while running `ps` command, do +install `procps` package. -ps without any arguments is not very useful. Let's try to list all the +`ps` without any arguments is not very useful. Let's try to list all the processes on the system by using the below command. Reference: @@ -470,28 +477,28 @@ Reference: ![](images/linux/admin/image42.png) -We can use an additional argument with ps command to list the -information about the process with a specific process ID. +We can use an additional argument with `ps` command to list the +information about the process with a specific process ID (PID). ![](images/linux/admin/image2.png) -We can use grep in combination with ps command to list only specific +We can use `grep` in combination with `ps` command to list only specific processes. ![](images/linux/admin/image1.png) ### top -The top command is used to show information about Linux processes +The `top` command is used to show information about Linux processes running on the system in real time. It also shows a summary of the system information. ![](images/linux/admin/image53.png) -For each process, top lists down the process ID, owner, priority, state, -cpu utilization, memory utilization and much more information. 
It also -lists down the memory utilization and cpu utilization of the system as a -whole along with system uptime and cpu load average. +For each process, `top` lists down the process ID, owner, priority, state, +CPU utilization, memory utilization and much more information. It also +lists down the memory utilization and CPU utilization of the system as a +whole along with system uptime and CPU load average. ## Memory Management @@ -500,21 +507,21 @@ used to view information about the system memory. ### free -The free command is used to display the memory usage of the system. The +The `free` command is used to display the memory usage of the system. The command displays the total free and used space available in the RAM along with space occupied by the caches/buffers. ![](images/linux/admin/image22.png) -free command by default shows the memory usage in kilobytes. We can use +`free` command by default shows the memory usage in kilobytes. We can use an additional argument to get the data in human-readable format. ![](images/linux/admin/image5.png) ### vmstat -The vmstat command can be used to display the memory usage along with -additional information about io and cpu usage. +The `vmstat` command can be used to display the memory usage along with +additional information about IO and CPU usage. ![](images/linux/admin/image38.png) @@ -525,27 +532,27 @@ used to view disk space on Linux. ### df (disk free) -The df command is used to display the free and available space for each +The `df` command is used to display the free and available space for each mounted file system. ![](images/linux/admin/image36.png) ### du (disk usage) -The du command is used to display disk usage of files and directories on +The `du` command is used to display disk usage of files and directories on the system. ![](images/linux/admin/image10.png) The below command can be used to display the top 5 largest directories -in the root directory. +in the `root` directory. ![](images/linux/admin/image18.png) ## Daemons -A computer program that runs as a background process is called a daemon. -Traditionally, the name of daemon processes ended with d - sshd, httpd +A computer program that runs as a background process is called a _daemon_. +Traditionally, the name of daemon processes ends with `d` - `sshd`, `httpd`, etc. We cannot interact with a daemon process as they run in the background. @@ -553,12 +560,12 @@ Services and daemons are used interchangeably most of the time. ## Systemd -Systemd is a system and service manager for Linux operating systems. -Systemd units are the building blocks of systemd. These units are +`systemd` is a system and service manager for Linux operating systems. +`systemd` units are the building blocks of `systemd`. These units are represented by unit configuration files. The below examples shows the unit configuration files available at -/usr/lib/systemd/system which are distributed by installed RPM packages. +`/usr/lib/systemd/system` which are distributed by installed RPM packages. We are more interested in the configuration file that ends with service as these are service units. @@ -566,8 +573,8 @@ as these are service units. ### Managing System Services -Service units end with .service file extension. Systemctl command can be -used to start/stop/restart the services managed by systemd. +Service units end with `.service` file extension. `systemctl` command can be +used to start/stop/restart the services managed by `systemd`. 
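As a quick illustration before the command reference below, a typical interaction with a service unit looks roughly like this; `sshd` is only an example unit and may differ on your system:

```shell
systemctl status sshd        # show whether the unit is active, plus recent log lines
sudo systemctl stop sshd     # stop the running service
sudo systemctl start sshd    # start it again
sudo systemctl enable sshd   # make it start automatically at boot
```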
| Command | Description | | ------------------------------- | -------------------------------------- | diff --git a/courses/level101/linux_networking/conclusion.md b/courses/level101/linux_networking/conclusion.md index 0b15e6dc..8f5743a2 100644 --- a/courses/level101/linux_networking/conclusion.md +++ b/courses/level101/linux_networking/conclusion.md @@ -1,11 +1,11 @@ # Conclusion -With this we have traversed through the TCP/IP stack completely. We hope there will be a different perspective when one opens any website in the browser post the course. +With this, we have traversed through the TCP/IP stack completely. We hope there will be a different perspective when one opens any website in the browser post the course. During the course we have also dissected what are common tasks in this pipeline which falls under the ambit of SRE. # Post Training Exercises -1. Setup own DNS resolver in the dev environment which acts as an authoritative DNS server for example.com and forwarder for other domains. Update resolv.conf to use the new DNS resolver running in localhost -2. Set up a site dummy.example.com in localhost and run a webserver with a self signed certificate. Update the trusted CAs or pass self signed CA’s public key as a parameter so that curl https://dummy.example.com -v works properly without self signed cert warning -3. Update the routing table to use another host(container/VM) in the same network as a gateway for 8.8.8.8/32 and run ping 8.8.8.8. Do the packet capture on the new gateway to see L3 hop is working as expected(might need to disable icmp_redirect) +1. Set up your own DNS resolver in the `dev` environment which acts as an authoritative DNS server for `example.com` and forwarder for other domains. Update `resolv.conf` to use the new DNS resolver running in `localhost`. +2. Set up a site `dummy.example.com` in `localhost` and run a webserver with a self-signed certificate. Update the trusted CAs or pass self-signed CA’s public key as a parameter so that `curl https://dummy.example.com -v` works properly without self-signed cert warning. +3. Update the routing table to use another host (container/VM) in the same network as a gateway for `8.8.8.8/32` and run `ping 8.8.8.8`. Do the packet capture on the new gateway to see L3 hop is working as expected (might need to disable `icmp_redirect`). diff --git a/courses/level101/linux_networking/dns.md b/courses/level101/linux_networking/dns.md index 1a3d19e2..b79b75be 100644 --- a/courses/level101/linux_networking/dns.md +++ b/courses/level101/linux_networking/dns.md @@ -1,14 +1,14 @@ # DNS -Domain Names are the simple human-readable names for websites. The Internet understands only IP addresses, but since memorizing incoherent numbers is not practical, domain names are used instead. These domain names are translated into IP addresses by the DNS infrastructure. When somebody tries to open [www.linkedin.com](https://www.linkedin.com) in the browser, the browser tries to convert [www.linkedin.com](https://www.linkedin.com) to an IP Address. This process is called DNS resolution. A simple pseudocode depicting this process looks this +Domain Names are the simple human-readable names for websites. The Internet understands only IP addresses, but since memorizing incoherent numbers is not practical, domain names are used instead. These domain names are translated into IP addresses by the DNS infrastructure. 
When somebody tries to open [www.linkedin.com](https://www.linkedin.com) in the browser, the browser tries to convert [www.linkedin.com](https://www.linkedin.com) to an IP Address. This process is called DNS resolution. A simple pseudocode depicting this process looks this: ```python ip, err = getIPAddress(domainName) if err: - print(“unknown Host Exception while trying to resolve:%s”.format(domainName)) + print("unknown Host Exception while trying to resolve:%s".format(domainName)) ``` -Now let’s try to understand what happens inside the getIPAddress function. The browser would have a DNS cache of its own where it checks if there is a mapping for the domainName to an IP Address already available, in which case the browser uses that IP address. If no such mapping exists, the browser calls gethostbyname syscall to ask the operating system to find the IP address for the given domainName +Now, let’s try to understand what happens inside the `getIPAddress` function. The browser would have a DNS cache of its own where it checks if there is a mapping for the `domainName` to an IP Address already available, in which case the browser uses that IP address. If no such mapping exists, the browser calls `gethostbyname` syscall to ask the operating system to find the IP address for the given `domainName`. ```python def getIPAddress(domainName): @@ -23,15 +23,15 @@ def getIPAddress(domainName): return resp ``` -Now lets understand what operating system kernel does when the [gethostbyname](https://man7.org/linux/man-pages/man3/gethostbyname.3.html) function is called. The Linux operating system looks at the file [/etc/nsswitch.conf](https://man7.org/linux/man-pages/man5/nsswitch.conf.5.html) file which usually has a line +Now, let's understand what operating system kernel does when the [gethostbyname](https://man7.org/linux/man-pages/man3/gethostbyname.3.html) function is called. The Linux operating system looks at the file [/etc/nsswitch.conf](https://man7.org/linux/man-pages/man5/nsswitch.conf.5.html) file which usually has a line. ```bash hosts: files dns ``` -This line means the OS has to look up first in file (/etc/hosts) and then use DNS protocol to do the resolution if there is no match in /etc/hosts. +This line means the OS has to look up first in file (`/etc/hosts`) and then use DNS protocol to do the resolution if there is no match in `/etc/hosts`. -The file /etc/hosts is of format +The file `/etc/hosts` is of format: IPAddress FQDN [FQDN].\* @@ -40,13 +40,13 @@ IPAddress FQDN [FQDN].\* ::1 localhost.localdomain localhost ``` -If a match exists for a domain in this file then that IP address is returned by the OS. Lets add a line to this file +If a match exists for a domain in this file, then that IP address is returned by the OS. Let's add a line to this file: ```bash 127.0.0.1 test.linkedin.com ``` -And then do ping test.linkedin.com +And then do ping [test.linkedin.com](https://test.linkedin.com/). ```bash ping test.linkedin.com -n @@ -60,13 +60,13 @@ PING test.linkedin.com (127.0.0.1) 56(84) bytes of data. ``` -As mentioned earlier, if no match exists in /etc/hosts, the OS tries to do a DNS resolution using the DNS protocol. The linux system makes a DNS request to the first IP in /etc/resolv.conf. If there is no response, requests are sent to subsequent servers in resolv.conf. These servers in resolv.conf are called DNS resolvers. The DNS resolvers are populated by [DHCP](https://en.wikipedia.org/wiki/Dynamic_Host_Configuration_Protocol) or statically configured by an administrator. 
+As mentioned earlier, if no match exists in `/etc/hosts`, the OS tries to do a DNS resolution using the DNS protocol. The Linux system makes a DNS request to the first IP in `/etc/resolv.conf`. If there is no response, requests are sent to subsequent servers in `resolv.conf`. These servers in `resolv.conf` are called DNS resolvers. The DNS resolvers are populated by [DHCP](https://en.wikipedia.org/wiki/Dynamic_Host_Configuration_Protocol) or statically configured by an administrator. [Dig](https://linux.die.net/man/1/dig) is a userspace DNS system which creates and sends request to DNS resolvers and prints the response it receives to the console. ```bash -#run this command in one shell to capture all DNS requests +# run this command in one shell to capture all DNS requests sudo tcpdump -s 0 -A -i any port 53 -#make a dig request from another shell +# make a dig request from another shell dig linkedin.com ``` @@ -79,9 +79,9 @@ dig linkedin.com ..)........ ``` -The packet capture shows a request is made to 172.23.195.101:53 (this is the resolver in /etc/resolv.conf) for linkedin.com and a response is received from 172.23.195.101 with the IP address of linkedin.com 108.174.10.10 +The packet capture shows a request is made to `172.23.195.101:53` (this is the resolver in `/etc/resolv.conf`) for [linkedin.com](https://www.linkedin.com/) and a response is received from `172.23.195.101` with the IP address of [linkedin.com](https://www.linkedin.com/) `108.174.10.10`. -Now let's try to understand how DNS resolver tries to find the IP address of linkedin.com. DNS resolver first looks at its cache. Since many devices in the network can query for the domain name linkedin.com, the name resolution result may already exist in the cache. If there is a cache miss, it starts the DNS resolution process. The DNS server breaks “linkedin.com” to “.”, “com.” and “linkedin.com.” and starts DNS resolution from “.”. The “.” is called root domain and those IPs are known to the DNS resolver software. DNS resolver queries the root domain nameservers to find the right top-level domain (TLD) nameservers which could respond regarding details for "com.". The address of the TLD nameserver of “com.” is returned. Now the DNS resolution service contacts the TLD nameserver for “com.” to fetch the authoritative nameserver for “linkedin.com”. Once an authoritative nameserver of “linkedin.com” is known, the resolver contacts Linkedin’s nameserver to provide the IP address of “linkedin.com”. This whole process can be visualized by running the following - +Now, let's try to understand how DNS resolver tries to find the IP address of [linkedin.com](https://www.linkedin.com/). DNS resolver first looks at its cache. Since many devices in the network can query for the domain name [linkedin.com](https://www.linkedin.com/), the name resolution result may already exist in the cache. If there is a cache miss, it starts the DNS resolution process. The DNS server breaks “linkedin.com” to “.”, “com.” and “linkedin.com.” and starts DNS resolution from “.”. The “.” is called root domain and those IPs are known to the DNS resolver software. DNS resolver queries the root domain nameservers to find the right top-level domain (TLD) nameservers which could respond regarding details for "com.". The address of the TLD nameserver of “com.” is returned. Now the DNS resolution service contacts the TLD nameserver for “com.” to fetch the authoritative nameserver for “linkedin.com”. 
Once an authoritative nameserver of “linkedin.com” is known, the resolver contacts LinkedIn’s nameserver to provide the IP address of “linkedin.com”. This whole process can be visualized by running the following: ```bash dig +trace linkedin.com @@ -91,15 +91,15 @@ dig +trace linkedin.com linkedin.com. 3600 IN A 108.174.10.10 ``` -This DNS response has 5 fields where the first field is the request and the last field is the response. The second field is the Time to Live which says how long the DNS response is valid in seconds. In this case this mapping of linkedin.com is valid for 1 hour. This is how the resolvers and application(browser) maintain their cache. Any request for linkedin.com beyond 1 hour will be treated as a cache miss as the mapping has expired its TTL and the whole process has to be redone. +This DNS response has 5 fields where the first field is the request and the last field is the response. The second field is the Time-to-Live (TTL) which says how long the DNS response is valid in seconds. In this case, this mapping of [linkedin.com](https://www.linkedin.com/) is valid for 1 hour. This is how the resolvers and application (browser) maintain their cache. Any request for [linkedin.com](https://www.linkedin.com/) beyond 1 hour will be treated as a cache miss as the mapping has expired its TTL and the whole process has to be redone. The 4th field says the type of DNS response/request. Some of the various DNS query types are A, AAAA, NS, TXT, PTR, MX and CNAME. - A record returns IPV4 address of the domain name - AAAA record returns the IPV6 address of the domain Name - NS record returns the authoritative nameserver for the domain name -- CNAME records are aliases to the domain names. Some domains point to other domain names and resolving the latter domain name gives an IP which is used as an IP for the former domain name as well. Example www.linkedin.com’s IP address is the same as 2-01-2c3e-005a.cdx.cedexis.net. -- For the brevity we are not discussing other DNS record types, the RFC of each of these records are available [here](https://en.wikipedia.org/wiki/List_of_DNS_record_types). +- CNAME records are aliases to the domain names. Some domains point to other domain names and resolving the latter domain name gives an IP which is used as an IP for the former domain name as well. Example [www.linkedin.com](https://www.linkedin.com)’s IP address is the same as `2-01-2c3e-005a.cdx.cedexis.net`. +- For the brevity, we are not discussing other DNS record types, the RFC of each of these records are available [here](https://en.wikipedia.org/wiki/List_of_DNS_record_types). ```bash dig A linkedin.com +short @@ -124,16 +124,16 @@ dig www.linkedin.com CNAME +short 2-01-2c3e-005a.cdx.cedexis.net. ``` -Armed with these fundamentals of DNS lets see usecases where DNS is used by SREs. +Armed with these fundamentals of DNS lets see use cases where DNS is used by SREs. ## Applications in SRE role -This section covers some of the common solutions SRE can derive from DNS +This section covers some of the common solutions SRE can derive from DNS. -1. Every company has to have its internal DNS infrastructure for intranet sites and internal services like databases and other internal applications like wiki. So there has to be a DNS infrastructure maintained for those domain names by the infrastructure team. This DNS infrastructure has to be optimized and scaled so that it doesn’t become a single point of failure. 
Failure of the internal DNS infrastructure can cause API calls of microservices to fail and other cascading effects. -2. DNS can also be used for discovering services. For example the hostname serviceb.internal.example.com could list instances which run service b internally in example.com company. Cloud providers provide options to enable DNS discovery([example](https://docs.aws.amazon.com/whitepapers/latest/microservices-on-aws/service-discovery.html#dns-based-service-discovery)) -3. DNS is used by cloud providers and CDN providers to scale their services. In Azure/AWS, Load Balancers are given a CNAME instead of IPAddress. They update the IPAddress of the Loadbalancers as they scale by changing the IP Address of alias domain names. This is one of the reasons why A records of such alias domains are short lived like 1 minute. +1. Every company has to have its internal DNS infrastructure for intranet sites and internal services like databases and other internal applications like Wiki. So there has to be a DNS infrastructure maintained for those domain names by the infrastructure team. This DNS infrastructure has to be optimized and scaled so that it doesn’t become a single point of failure. Failure of the internal DNS infrastructure can cause API calls of microservices to fail and other cascading effects. +2. DNS can also be used for discovering services. For example the hostname `serviceb.internal.example.com` could list instances which run service `b` internally in `example.com` company. Cloud providers provide options to enable DNS discovery ([example](https://docs.aws.amazon.com/whitepapers/latest/microservices-on-aws/service-discovery.html#dns-based-service-discovery)). +3. DNS is used by cloud providers and CDN providers to scale their services. In Azure/AWS, Load Balancers are given a CNAME instead of IPAddress. They update the IPAddress of the Loadbalancers as they scale by changing the IP Address of alias domain names. This is one of the reasons why A records of such alias domains are short-lived like 1 minute. 4. DNS can also be used to make clients get IP addresses closer to their location so that their HTTP calls can be responded faster if the company has a presence geographically distributed. -5. SRE also has to understand since there is no verification in DNS infrastructure, these responses can be spoofed. This is safeguarded by other protocols like HTTPS(dealt later). DNSSEC protects from forged or manipulated DNS responses. -6. Stale DNS cache can be a problem. Some [apps](https://stackoverflow.com/questions/1256556/how-to-make-java-honor-the-dns-caching-timeout) might still be using expired DNS records for their api calls. This is something SRE has to be wary of when doing maintenance. +5. SRE also has to understand since there is no verification in DNS infrastructure, these responses can be spoofed. This is safeguarded by other protocols like HTTPS (dealt later). DNSSEC protects from forged or manipulated DNS responses. +6. Stale DNS cache can be a problem. Some [apps](https://stackoverflow.com/questions/1256556/how-to-make-java-honor-the-dns-caching-timeout) might still be using expired DNS records for their API calls. This is something SRE has to be wary of when doing maintenance. 7. DNS Loadbalancing and service discovery also has to understand TTL and the servers can be removed from the pool only after waiting till TTL post the changes are made to DNS records. If this is not done, a certain portion of the traffic will fail as the server is removed before the TTL. 
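As a small illustration of points 6 and 7, the TTL a resolver is currently serving for a record can be checked with `dig`; the values below are only indicative and mirror the earlier example:

```bash
# Print only the answer section; the second column is the remaining TTL in seconds
dig +noall +answer linkedin.com A
# linkedin.com.    3600    IN    A    108.174.10.10
```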
diff --git a/courses/level101/linux_networking/http.md b/courses/level101/linux_networking/http.md index a9610576..ccd4b8d6 100644 --- a/courses/level101/linux_networking/http.md +++ b/courses/level101/linux_networking/http.md @@ -1,10 +1,10 @@ # HTTP -Till this point we have only got the IP address of linkedin.com. The HTML page of linkedin.com is served by HTTP protocol which the browser renders. Browser sends a HTTP request to the IP of the server determined above. -Request has a verb GET, PUT, POST followed by a path and query parameters and lines of key value pair which gives information about the client and capabilities of the client like contents it can accept and a body (usually in POST or PUT) +Till this point we have only got the IP address of [linkedin.com](https://www.linkedin.com/). The HTML page of [linkedin.com](https://www.linkedin.com/) is served by HTTP protocol which the browser renders. Browser sends a HTTP request to the IP of the server determined above. +Request has a verb GET, PUT, POST followed by a path and query parameters and lines of key-value pair which gives information about the client and capabilities of the client like contents it can accept and a body (usually in POST or PUT). ```bash -# Eg run the following in your container and have a look at the headers +# Eg. run the following in your container and have a look at the headers curl linkedin.com -v ``` ```bash @@ -25,7 +25,7 @@ curl linkedin.com -v * Closing connection 0 ``` -Here, in the first line GET is the verb, / is the path and 1.1 is the HTTP protocol version. Then there are key value pairs which give client capabilities and some details to the server. The server responds back with HTTP version, [Status Code and Status message](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes). Status codes 2xx means success, 3xx denotes redirection, 4xx denotes client side errors and 5xx server side errors. +Here, in the first line `GET` is the verb, `/` is the path and `1.1` is the HTTP protocol version. Then there are key-value pairs which give client capabilities and some details to the server. The server responds back with HTTP version, [Status Code and Status message](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes). Status codes `2xx` means success, `3xx` denotes redirection, `4xx` denotes client-side errors and `5xx` server-side errors. We will now jump in to see the difference between HTTP/1.0 and HTTP/1.1. @@ -39,15 +39,14 @@ USER-AGENT: curl ``` +This would get server response and waits for next input as the underlying connection to [www.linkedin.com](https://www.linkedin.com/) can be reused for further queries. While going through TCP, we can understand the benefits of this. But in HTTP/1.0, this connection will be immediately closed after the response meaning new connection has to be opened for each query. HTTP/1.1 can have only one inflight request in an open connection but connection can be reused for multiple requests one after another. One of the benefits of HTTP/2.0 over HTTP/1.1 is we can have multiple inflight requests on the same connection. We are restricting our scope to generic HTTP and not jumping to the intricacies of each protocol version but they should be straight forward to understand post the course. -This would get server response and waits for next input as the underlying connection to www.linkedin.com can be reused for further queries. While going through TCP, we can understand the benefits of this. 
But in HTTP/1.0 this connection will be immediately closed after the response meaning new connection has to be opened for each query. HTTP/1.1 can have only one inflight request in an open connection but connection can be reused for multiple requests one after another. One of the benefits of HTTP/2.0 over HTTP/1.1 is we can have multiple inflight requests on the same connection. We are restricting our scope to generic HTTP and not jumping to the intricacies of each protocol version but they should be straight forward to understand post the course. +HTTP is called **stateless protocol**. This section we will try to understand what stateless means. Say we logged in to [linkedin.com](https://www.linkedin.com/), each request to [linkedin.com](https://www.linkedin.com/) from the client will have no context of the user and it makes no sense to prompt user to login for each page/resource. This problem of HTTP is solved by *COOKIE*. A user is created a session when a user logs in. This session identifier is sent to the browser via *SET-COOKIE* header. The browser stores the COOKIE till the expiry set by the server and sends the cookie for each request from hereon for [linkedin.com](https://www.linkedin.com/). More details on cookies are available [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Cookies). Cookies are a critical piece of information like password and since HTTP is a plain text protocol, any man-in-the-middle can capture either password or cookies and can breach the privacy of the user. Similarly as discussed during DNS, a spoofed IP of [linkedin.com](https://www.linkedin.com/) can cause a phishing attack on users where an user can give LinkedIn’s password to login on the malicious site. To solve both problems, HTTPS came in place and HTTPS has to be mandated. -HTTP is called **stateless protocol**. This section we will try to understand what stateless means. Say we logged in to linkedin.com, each request to linkedin.com from the client will have no context of the user and it makes no sense to prompt user to login for each page/resource. This problem of HTTP is solved by *COOKIE*. A user is created a session when a user logs in. This session identifier is sent to the browser via *SET-COOKIE* header. The browser stores the COOKIE till the expiry set by the server and sends the cookie for each request from hereon for linkedin.com. More details on cookies are available [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Cookies). Cookies are a critical piece of information like password and since HTTP is a plain text protocol, any man in the middle can capture either password or cookies and can breach the privacy of the user. Similarly as discussed during DNS a spoofed IP of linkedin.com can cause a phishing attack on users where an user can give linkedin’s password to login on the malicious site. To solve both problems HTTPs came in place and HTTPs has to be mandated. - -HTTPS has to provide server identification and encryption of data between client and server. The server administrator has to generate a private public key pair and certificate request. This certificate request has to be signed by a certificate authority which converts the certificate request to a certificate. The server administrator has to update the certificate and private key to the webserver. The certificate has details about the server (like domain name for which it serves, expiry date), public key of the server. 
The private key is a secret to the server and losing the private key loses the trust the server provides. When clients connect, the client sends a HELLO. The server sends its certificate to the client. The client checks the validity of the cert by seeing if it is within its expiry time, if it is signed by a trusted authority and the hostname in the cert is the same as the server. This validation makes sure the server is the right server and there is no phishing. Once that is validated, the client negotiates a symmetrical key and cipher with the server by encrypting the negotiation with the public key of the server. Nobody else other than the server who has the private key can understand this data. Once negotiation is complete, that symmetric key and algorithm is used for further encryption which can be decrypted only by client and server from thereon as they only know the symmetric key and algorithm. The switch to symmetric algorithm from asymmetric encryption algorithm is to not strain the resources of client devices as symmetric encryption is generally less resource intensive than asymmetric. +HTTPS has to provide server identification and encryption of data between client and server. The server administrator has to generate a private-public key pair and certificate request. This certificate request has to be signed by a certificate authority which converts the certificate request to a certificate. The server administrator has to update the certificate and private key to the webserver. The certificate has details about the server (like domain name for which it serves, expiry date), public key of the server. The private key is a secret to the server and losing the private key loses the trust the server provides. When clients connect, the client sends a HELLO. The server sends its certificate to the client. The client checks the validity of the cert by seeing if it is within its expiry time, if it is signed by a trusted authority and the hostname in the cert is the same as the server. This validation makes sure the server is the right server and there is no phishing. Once that is validated, the client negotiates a symmetrical key and cipher with the server by encrypting the negotiation with the public key of the server. Nobody else other than the server who has the private key can understand this data. Once negotiation is complete, that symmetric key and algorithm is used for further encryption which can be decrypted only by client and server from thereon as they only know the symmetric key and algorithm. The switch to symmetric algorithm from asymmetric encryption algorithm is to not strain the resources of client devices as symmetric encryption is generally less resource intensive than asymmetric. ```bash -#Try the following on your terminal to see the cert details like Subject Name(domain name), Issuer details, Expiry date +# Try the following on your terminal to see the cert details like Subject Name (domain name), Issuer details, Expiry date curl https://www.linkedin.com -v ``` ```bash @@ -124,6 +123,6 @@ date: Mon, 09 Nov 2020 10:50:10 GMT * Closing connection 0 ``` -Here my system has a list of certificate authorities it trusts in this file /etc/ssl/cert.pem. Curl validates the certificate is for www.linkedin.com by seeing the CN section of the subject part of the certificate. It also makes sure the certificate is not expired by seeing the expire date. It also validates the signature on the certificate by using the public key of issuer Digicert in /etc/ssl/cert.pem. 
Once this is done, using the public key of www.linkedin.com it negotiates cipher TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 with a symmetric key. Subsequent data transfer including first HTTP request uses the same cipher and symmetric key. +Here, my system has a list of certificate authorities it trusts in this file `/etc/ssl/cert.pem`. cURL validates the certificate is for [www.linkedin.com](https://www.linkedin.com/) by seeing the CN section of the subject part of the certificate. It also makes sure the certificate is not expired by seeing the expire date. It also validates the signature on the certificate by using the public key of issuer Digicert in `/etc/ssl/cert.pem`. Once this is done, using the public key of [www.linkedin.com](https://www.linkedin.com/) it negotiates cipher `TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384` with a symmetric key. Subsequent data transfer including first HTTP request uses the same cipher and symmetric key. diff --git a/courses/level101/linux_networking/intro.md b/courses/level101/linux_networking/intro.md index e7eef936..03a3526b 100644 --- a/courses/level101/linux_networking/intro.md +++ b/courses/level101/linux_networking/intro.md @@ -2,7 +2,7 @@ ## Prerequisites -- High-level knowledge of commonly used jargon in TCP/IP stack like DNS, TCP, UDP and HTTP +- High-level knowledge of commonly used jargon in TCP/IP stack like DNS, TCP, UDP and HTTP - [Linux Commandline Basics](https://linkedin.github.io/school-of-sre/level101/linux_basics/command_line_basics/) ## What to expect from this course @@ -11,11 +11,11 @@ Throughout the course, we cover how an SRE can optimize the system to improve th ## What is not covered under this course -This course spends time on the fundamentals. We are not covering concepts like [HTTP/2.0](https://en.wikipedia.org/wiki/HTTP/2), [QUIC](https://en.wikipedia.org/wiki/QUIC), [TCP congestion control protocols](https://en.wikipedia.org/wiki/TCP_congestion_control), [Anycast](https://en.wikipedia.org/wiki/Anycast), [BGP](https://en.wikipedia.org/wiki/Border_Gateway_Protocol), [CDN](https://en.wikipedia.org/wiki/Content_delivery_network), [Tunnels](https://en.wikipedia.org/wiki/Virtual_private_network) and [Multicast](https://en.wikipedia.org/wiki/Multicast). We expect that this course will provide the relevant basics to understand such concepts +This course spends time on the fundamentals. We are not covering concepts like [HTTP/2.0](https://en.wikipedia.org/wiki/HTTP/2), [QUIC](https://en.wikipedia.org/wiki/QUIC), [TCP congestion control protocols](https://en.wikipedia.org/wiki/TCP_congestion_control), [Anycast](https://en.wikipedia.org/wiki/Anycast), [BGP](https://en.wikipedia.org/wiki/Border_Gateway_Protocol), [CDN](https://en.wikipedia.org/wiki/Content_delivery_network), [Tunnels](https://en.wikipedia.org/wiki/Virtual_private_network) and [Multicast](https://en.wikipedia.org/wiki/Multicast). We expect that this course will provide the relevant basics to understand such concepts. ## Birds eye view of the course -The course covers the question “What happens when you open linkedin.com in your browser?” The course follows the flow of TCP/IP stack.More specifically, the course covers topics of Application layer protocols DNS and HTTP, transport layer protocols UDP and TCP, networking layer protocol IP and Data Link Layer protocol +The course covers the question “What happens when you open [linkedin.com](https://www.linkedin.com) in your browser?” The course follows the flow of TCP/IP stack. 
More specifically, the course covers topics of Application layer protocols (DNS and HTTP), transport layer protocols (UDP and TCP), networking layer protocol (IP) and data link layer protocol. ## Course Contents 1. [DNS](https://linkedin.github.io/school-of-sre/level101/linux_networking/dns/) diff --git a/courses/level101/linux_networking/ipr.md b/courses/level101/linux_networking/ipr.md index 2e404fcf..ed541e12 100644 --- a/courses/level101/linux_networking/ipr.md +++ b/courses/level101/linux_networking/ipr.md @@ -1,8 +1,8 @@ # IP Routing and Data Link Layer -We will dig how packets that leave the client reach the server and vice versa. When the packet reaches the IP layer, the transport layer populates source port, destination port. IP/Network layer populates destination IP(discovered from DNS) and then looks up the route to the destination IP on the routing table. +We will dig how packets that leave the client reach the server and vice versa. When the packet reaches the IP layer, the transport layer populates source port, destination port. IP/Network layer populates destination IP (discovered from DNS) and then looks up the route to the destination IP on the routing table. ```bash -#Linux route -n command gives the default routing table +# Linux `route -n` command gives the default routing table route -n ``` @@ -13,20 +13,20 @@ Destination Gateway Genmask Flags Metric Ref Use Iface 172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth0 ``` -Here the destination IP is bitwise AND’d with the Genmask and if the answer is the destination part of the table then that gateway and interface is picked for routing. Here linkedin.com’s IP 108.174.10.10 is AND’d with 255.255.255.0 and the answer we get is 108.174.10.0 which doesn’t match with any destination in the routing table. Then Linux does an AND of destination IP with 0.0.0.0 and we get 0.0.0.0. This answer matches the default row +Here, the destination IP is bitwise AND’d with the Genmask and if the answer is the destination part of the table, then that gateway and interface is picked for routing. Here, [linkedin.com](https://www.linkedin.com)’s IP `108.174.10.10` is AND’d with `255.255.255.0` and the answer we get is `108.174.10.0` which doesn’t match with any destination in the routing table. Then, Linux does an AND of destination IP with `0.0.0.0` and we get `0.0.0.0`. This answer matches the default row. -Routing table is processed in the order of more octets of 1 set in genmask and genmask 0.0.0.0 is the default route if nothing matches. -At the end of this operation Linux figured out that the packet has to be sent to next hop 172.17.0.1 via eth0. The source IP of the packet will be set as the IP of interface eth0. -Now to send the packet to 172.17.0.1 linux has to figure out the MAC address of 172.17.0.1. MAC address is figured by looking at the internal arp cache which stores translation between IP address and MAC address. If there is a cache miss, Linux broadcasts ARP request within the internal network asking who has 172.17.0.1. The owner of the IP sends an ARP response which is cached by the kernel and the kernel sends the packet to the gateway by setting Source mac address as mac address of eth0 and destination mac address of 172.17.0.1 which we got just now. Similar routing lookup process is followed in each hop till the packet reaches the actual server. Transport layer and layers above it come to play only at end servers. During intermediate hops only till the IP/Network layer is involved. 
+Routing table is processed in the order of more octets of 1 set in Genmask and Genmask `0.0.0.0` is the default route if nothing matches. +At the end of this operation, Linux figured out that the packet has to be sent to next hop `172.17.0.1` via `eth0`. The source IP of the packet will be set as the IP of interface `eth0`. +Now, to send the packet to `172.17.0.1`, Linux has to figure out the MAC address of `172.17.0.1`. MAC address is figured by looking at the internal ARP cache which stores translation between IP address and MAC address. If there is a cache miss, Linux broadcasts ARP request within the internal network asking who has `172.17.0.1`. The owner of the IP sends an ARP response which is cached by the kernel and the kernel sends the packet to the gateway by setting Source MAC address as MAC address of `eth0` and destination MAC address of `172.17.0.1` which we got just now. Similar routing lookup process is followed in each hop till the packet reaches the actual server. Transport layer and layers above it come to play only at end servers. During intermediate hops, only till the IP/Network layer is involved. ![Screengrab for above explanation](images/arp.gif) -One weird gateway we saw in the routing table is 0.0.0.0. This gateway means no Layer3(Network layer) hop is needed to send the packet. Both source and destination are in the same network. Kernel has to figure out the mac of the destination and populate source and destination mac appropriately and send the packet out so that it reaches the destination without any Layer3 hop in the middle +One weird gateway we saw in the routing table is `0.0.0.0`. This gateway means no Layer3 (Network layer) hop is needed to send the packet. Both source and destination are in the same network. Kernel has to figure out the MAC of the destination and populate source and destination MAC appropriately and send the packet out so that it reaches the destination without any Layer3 hop in the middle. -As we followed in other modules, lets complete this session with SRE usecases +As we followed in other modules, let's complete this session with SRE use cases. ## Applications in SRE role -1. Generally the routing table is populated by DHCP and playing around is not a good practice. There can be reasons where one has to play around the routing table but take that path only when it's absolutely necessary -2. Understanding error messages better like, “No route to host” error can mean mac address of the destination host is not found and it can mean the destination host is down -3. On rare cases looking at the ARP table can help us understand if there is a IP conflict where same IP is assigned to two hosts by mistake and this is causing unexpected behavior +1. Generally the routing table is populated by DHCP and playing around is not a good practice. There can be reasons where one has to play around the routing table but take that path only when it's absolutely necessary. +2. Understanding error messages better like, “No route to host” error can mean MAC address of the destination host is not found and it can mean the destination host is down. +3. On rare cases, looking at the ARP table can help us understand if there is a IP conflict where same IP is assigned to two hosts by mistake and this is causing unexpected behavior. 
diff --git a/courses/level101/linux_networking/tcp.md b/courses/level101/linux_networking/tcp.md index 4a194eb8..1aeef508 100644 --- a/courses/level101/linux_networking/tcp.md +++ b/courses/level101/linux_networking/tcp.md @@ -1,35 +1,35 @@ # TCP TCP is a transport layer protocol like UDP but it guarantees reliability, flow control and congestion control. -TCP guarantees reliable delivery by using sequence numbers. A TCP connection is established by a three way handshake. In our case, the client sends a SYN packet along with the starting sequence number it plans to use, the server acknowledges the SYN packet and sends a SYN with its sequence number. Once the client acknowledges the syn packet, the connection is established. Each data transferred from here on is considered delivered reliably once acknowledgement for that sequence is received by the concerned party +TCP guarantees reliable delivery by using sequence numbers. A TCP connection is established by a three-way handshake. In our case, the client sends a `SYN` packet along with the starting sequence number it plans to use, the server acknowledges the `SYN` packet and sends a `SYN` with its sequence number. Once the client acknowledges the `SYN` packet, the connection is established. Each data transferred from here on is considered delivered reliably once acknowledgement for that sequence is received by the concerned party. ![3-way handshake](images/established.png) ```bash -#To understand handshake run packet capture on one bash session +# To understand handshake run packet capture on one bash session tcpdump -S -i any port 80 -#Run curl on one bash session +# Run curl on one bash session curl www.linkedin.com ``` ![tcpdump-3way](images/pcap.png) -Here client sends a syn flag shown by [S] flag with a sequence number 1522264672. The server acknowledges receipt of SYN with an ack [.] flag and a Syn flag for its sequence number[S]. The server uses the sequence number 1063230400 and acknowledges the client it’s expecting sequence number 1522264673 (client sequence+1). Client sends a zero length acknowledgement packet to the server(server sequence+1) and connection stands established. This is called three way handshake. The client sends a 76 bytes length packet after this and increments its sequence number by 76. Server sends a 170 byte response and closes the connection. This was the difference we were talking about between HTTP/1.1 and HTTP/1.0. In HTTP/1.1 this same connection can be reused which reduces overhead of 3 way handshake for each HTTP request. If a packet is missed between client and server, server won’t send an ack to the client and client would retry sending the packet till the ACK is received. This guarantees reliability. -The flow control is established by the win size field in each segment. The win size says available TCP buffer length in the kernel which can be used to buffer received segments. A size 0 means the receiver has a lot of lag to catch from its socket buffer and the sender has to pause sending packets so that receiver can cope up. This flow control protects from slow receiver and fast sender problem +Here, client sends a `SYN` flag shown by [S] flag with a sequence number `1522264672`. The server acknowledges receipt of `SYN` with an `ACK` [.] flag and a `SYN` flag for its sequence number [S]. The server uses the sequence number `1063230400` and acknowledges the client it's expecting sequence number `1522264673` (client sequence + 1). 
Client sends a zero length acknowledgement packet to the server (server sequence + 1) and connection stands established. This is called three way handshake. The client sends a 76 bytes length packet after this and increments its sequence number by 76. Server sends a 170 byte response and closes the connection. This was the difference we were talking about between HTTP/1.1 and HTTP/1.0. In HTTP/1.1, this same connection can be reused which reduces overhead of three-way handshake for each HTTP request. If a packet is missed between client and server, server won’t send an `ACK` to the client and client would retry sending the packet till the `ACK` is received. This guarantees reliability. +The flow control is established by the `WIN` size field in each segment. The `WIN` size says available TCP buffer length in the kernel which can be used to buffer received segments. A size 0 means the receiver has a lot of lag to catch from its socket buffer and the sender has to pause sending packets so that receiver can cope up. This flow control protects from slow receiver and fast sender problem. -TCP also does congestion control which determines how many segments can be in transit without an ack. Linux provides us the ability to configure algorithms for congestion control which we are not covering here. +TCP also does congestion control which determines how many segments can be in transit without an `ACK`. Linux provides us the ability to configure algorithms for congestion control which we are not covering here. -While closing a connection, client/server calls a close syscall. Let's assume client do that. Client’s kernel will send a FIN packet to the server. Server’s kernel can’t close the connection till the close syscall is called by the server application. Once server app calls close, server also sends a FIN packet and client enters into time wait state for 2*MSS(120s) so that this socket can’t be reused for that time period to prevent any TCP state corruptions due to stray stale packets. +While closing a connection, client/server calls a close syscall. Let's assume client do that. Client’s kernel will send a `FIN` packet to the server. Server’s kernel can’t close the connection till the close syscall is called by the server application. Once server app calls close, server also sends a `FIN` packet and client enters into `TIME_WAIT` state for 2*MSS (120s) so that this socket can’t be reused for that time period to prevent any TCP state corruptions due to stray stale packets. ![Connection tearing](images/closed.png) -Armed with our TCP and HTTP knowledge lets see how this is used by SREs in their role +Armed with our TCP and HTTP knowledge, let's see how this is used by SREs in their role. ## Applications in SRE role 1. Scaling HTTP performance using load balancers need consistent knowledge about both TCP and HTTP. There are [different kinds of load balancing](https://blog.envoyproxy.io/introduction-to-modern-network-load-balancing-and-proxying-a57f6ff80236?gi=428394dbdcc3) like L4, L7 load balancing, Direct Server Return etc. HTTPs offloading can be done on Load balancer or directly on servers based on the performance and compliance needs. -2. Tweaking sysctl variables for rmem and wmem like we did for UDP can improve throughput of sender and receiver. -3. Sysctl variable tcp_max_syn_backlog and socket variable somax_conn determines how many connections for which the kernel can complete 3 way handshake before app calling accept syscall. This is much useful in single threaded applications. 
Once the backlog is full, new connections stay in SYN_RCVD state (when you run netstat) till the application calls accept syscall -4. Apps can run out of file descriptors if there are too many short lived connections. Digging through [tcp_reuse and tcp_recycle](http://lxr.linux.no/linux+v3.2.8/Documentation/networking/ip-sysctl.txt#L464) can help reduce time spent in the time wait state(it has its own risk). Making apps reuse a pool of connections instead of creating ad hoc connection can also help -5. Understanding performance bottlenecks by seeing metrics and classifying whether its a problem in App or network side. Example too many sockets in Close_wait state is a problem on application whereas retransmissions can be a problem more on network or on OS stack than the application itself. Understanding the fundamentals can help us narrow down where the bottleneck is +2. Tweaking `sysctl` variables for `rmem` and `wmem` like we did for UDP can improve throughput of sender and receiver. +3. `sysctl` variable `tcp_max_syn_backlog` and socket variable `somax_conn` determines how many connections for which the kernel can complete 3-way handshake before app calling accept syscall. This is much useful in single-threaded applications. Once the backlog is full, new connections stay in `SYN_RCVD` state (when you run `netstat`) till the application calls accept syscall. +4. Apps can run out of file descriptors if there are too many short-lived connections. Digging through [tcp_reuse and tcp_recycle](http://lxr.linux.no/linux+v3.2.8/Documentation/networking/ip-sysctl.txt#L464) can help reduce time spent in the `TIME_WAIT` state (it has its own risk). Making apps reuse a pool of connections instead of creating ad hoc connection can also help. +5. Understanding performance bottlenecks by seeing metrics and classifying whether it's a problem in App or network side. Example too many sockets in `CLOSE_WAIT` state is a problem on application whereas retransmissions can be a problem more on network or on OS stack than the application itself. Understanding the fundamentals can help us narrow down where the bottleneck is. diff --git a/courses/level101/linux_networking/udp.md b/courses/level101/linux_networking/udp.md index 351e59f9..c8741491 100644 --- a/courses/level101/linux_networking/udp.md +++ b/courses/level101/linux_networking/udp.md @@ -1,15 +1,15 @@ # UDP -UDP is a transport layer protocol. DNS is an application layer protocol that runs on top of UDP(most of the times). Before jumping into UDP, let's try to understand what an application and transport layer is. DNS protocol is used by a DNS client(eg dig) and DNS server(eg named). The transport layer makes sure the DNS request reaches the DNS server process and similarly the response reaches the DNS client process. Multiple processes can run on a system and they can listen on any [ports](https://en.wikipedia.org/wiki/Port_(computer_networking)). DNS servers usually listen on port number 53. When a client makes a DNS request, after filling the necessary application payload, it passes the payload to the kernel via **sendto** system call. The kernel picks a random port number([>1024](https://www.cyberciti.biz/tips/linux-increase-outgoing-network-sockets-range.html)) as source port number and puts 53 as destination port number and sends the packet to lower layers. 
When the kernel on server side receives the packet, it checks the port number and queues the packet to the application buffer of the DNS server process which makes a **recvfrom** system call and reads the packet. This process by the kernel is called multiplexing(combining packets from multiple applications to same lower layers) and demultiplexing(segregating packets from single lower layer to multiple applications). Multiplexing and Demultiplexing is done by the Transport layer. +UDP is a transport layer protocol. DNS is an application layer protocol that runs on top of UDP (most of the times). Before jumping into UDP, let's try to understand what an application and transport layer is. DNS protocol is used by a DNS client (eg `dig`) and DNS server (eg `named`). The transport layer makes sure the DNS request reaches the DNS server process and similarly the response reaches the DNS client process. Multiple processes can run on a system and they can listen on any [ports](https://en.wikipedia.org/wiki/Port_(computer_networking)). DNS servers usually listen on port number `53`. When a client makes a DNS request, after filling the necessary application payload, it passes the payload to the kernel via **sendto** system call. The kernel picks a random port number ([>1024](https://www.cyberciti.biz/tips/linux-increase-outgoing-network-sockets-range.html)) as source port number and puts 53 as destination port number and sends the packet to lower layers. When the kernel on server-side receives the packet, it checks the port number and queues the packet to the application buffer of the DNS server process which makes a **recvfrom** system call and reads the packet. This process by the kernel is called multiplexing (combining packets from multiple applications to same lower layers) and demultiplexing (segregating packets from single lower layer to multiple applications). Multiplexing and Demultiplexing is done by the Transport layer. -UDP is one of the simplest transport layer protocol and it does only multiplexing and demultiplexing. Another common transport layer protocol TCP does a bunch of other things like reliable communication, flow control and congestion control. UDP is designed to be lightweight and handle communications with little overhead. So it doesn’t do anything beyond multiplexing and demultiplexing. If applications running on top of UDP need any of the features of TCP, they have to implement that in their application +UDP is one of the simplest transport layer protocol and it does only multiplexing and demultiplexing. Another common transport layer protocol TCP does a bunch of other things like reliable communication, flow control and congestion control. UDP is designed to be lightweight and handle communications with little overhead. So, it doesn’t do anything beyond multiplexing and demultiplexing. If applications running on top of UDP need any of the features of TCP, they have to implement that in their application. -This [example from python wiki](https://wiki.python.org/moin/UdpCommunication) covers a sample UDP client and server where “Hello World” is an application payload sent to server listening on port number 5005. The server receives the packet and prints the “Hello World” string from the client +This [example from python wiki](https://wiki.python.org/moin/UdpCommunication) covers a sample UDP client and server where “Hello World” is an application payload sent to server listening on port number `5005`. 
The server receives the packet and prints the “Hello World” string from the client. ## Applications in SRE role -1. If the underlying network is slow and the UDP layer is unable to queue packets down to the networking layer, sendto syscall from the application will hang till the kernel finds some of its buffer is freed. This can affect the throughput of the system. Increasing write memory buffer values using [sysctl variables](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/5/html/tuning_and_optimizing_red_hat_enterprise_linux_for_oracle_9i_and_10g_databases/sect-oracle_9i_and_10g_tuning_guide-adjusting_network_settings-changing_network_kernel_settings) *net.core.wmem_max* and *net.core.wmem_default* provides some cushion to the application from the slow network -2. Similarly if the receiver process is slow in consuming from its buffer, the kernel has to drop packets which it can’t queue due to the buffer being full. Since UDP doesn’t guarantee reliability these dropped packets can cause data loss unless tracked by the application layer. Increasing sysctl variables *rmem_default* and *rmem_max* can provide some cushion to slow applications from fast senders. +1. If the underlying network is slow and the UDP layer is unable to queue packets down to the networking layer, `sendto` syscall from the application will hang till the kernel finds some of its buffer is freed. This can affect the throughput of the system. Increasing write memory buffer values using [sysctl variables](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/5/html/tuning_and_optimizing_red_hat_enterprise_linux_for_oracle_9i_and_10g_databases/sect-oracle_9i_and_10g_tuning_guide-adjusting_network_settings-changing_network_kernel_settings) *net.core.wmem_max* and *net.core.wmem_default* provides some cushion to the application from the slow network +2. Similarly, if the receiver process is slow in consuming from its buffer, the kernel has to drop packets which it can’t queue due to the buffer being full. Since UDP doesn’t guarantee reliability these dropped packets can cause data loss unless tracked by the application layer. Increasing sysctl variables *rmem_default* and *rmem_max* can provide some cushion to slow applications from fast senders. diff --git a/courses/level101/messagequeue/intro.md b/courses/level101/messagequeue/intro.md index 73437504..180af82e 100644 --- a/courses/level101/messagequeue/intro.md +++ b/courses/level101/messagequeue/intro.md @@ -3,7 +3,7 @@ ## What to expect from this course -At the end of training, you will have an understanding of what a Message Services is, learn about different types of Message Service implementation and understand some of the underlying concepts & trade offs. +At the end of training, you will have an understanding of what a Message Services is, learn about different types of Message Service implementation and understand some of the underlying concepts & trade-offs. 
## What is not covered under this course diff --git a/courses/level101/messagequeue/key_concepts.md b/courses/level101/messagequeue/key_concepts.md index 03b2b604..5f637de4 100644 --- a/courses/level101/messagequeue/key_concepts.md +++ b/courses/level101/messagequeue/key_concepts.md @@ -1,6 +1,6 @@ # Key Concepts -Lets looks at some of the key concepts when we talk about messaging system +Let's looks at some of the key concepts when we talk about messaging system ### Delivery guarantees diff --git a/courses/level101/metrics_and_monitoring/alerts.md b/courses/level101/metrics_and_monitoring/alerts.md index d235e628..57149d2c 100644 --- a/courses/level101/metrics_and_monitoring/alerts.md +++ b/courses/level101/metrics_and_monitoring/alerts.md @@ -4,11 +4,11 @@ Earlier we discussed different ways to collect key metric data points from a service and its underlying infrastructure. This data gives us a better understanding of how the service is performing. One of the main -objectives of monitoring is to detect any service degradations early +objectives of monitoring is to detect any service degradations early (reduce Mean Time To Detect) and notify stakeholders so that the issues are either avoided or can be fixed early, thus reducing Mean Time To Recover (MTTR). For example, if you are notified when resource usage by -a service exceeds 90 percent, you can take preventive measures to avoid +a service exceeds 90%, you can take preventive measures to avoid any service breakdown due to a shortage of resources. On the other hand, when a service goes down due to an issue, early detection and notification of such incidents can help you quickly fix the issue. @@ -20,10 +20,10 @@ Today most of the monitoring services available provide a mechanism to set up alerts on one or a combination of metrics to actively monitor the service health. These alerts have a set of defined rules or conditions, and when the rule is broken, you are notified. These rules can be as -simple as notifying when the metric value exceeds n to as complex as a -week over week (WoW) comparison of standard deviation over a period of +simple as notifying when the metric value exceeds _n_ to as complex as a +week-over-week (WoW) comparison of standard deviation over a period of time. Monitoring tools notify you about an active alert, and most of these tools support instant messaging (IM) platforms, SMS, email, or phone calls. Figure 8 shows a sample alert notification received on -Slack for memory usage exceeding 90 percent of total RAM space on the +Slack for memory usage exceeding 90% of total RAM space on the host. diff --git a/courses/level101/metrics_and_monitoring/best_practices.md b/courses/level101/metrics_and_monitoring/best_practices.md index 5454bde4..6eccb63b 100644 --- a/courses/level101/metrics_and_monitoring/best_practices.md +++ b/courses/level101/metrics_and_monitoring/best_practices.md @@ -5,35 +5,35 @@ When setting up monitoring for a service, keep the following best practices in mind. -- **Use the right metric type** -- Most of the libraries available +- **Use the right metric type**—Most of the libraries available today offer various metric types. Choose the appropriate metric type for monitoring your system. Following are the types of metrics and their purposes. - - **Gauge --** *Gauge* is a constant type of metric. After the + - **Gauge**—*Gauge* is a constant type of metric. After the metric is initialized, the metric value does not change unless you intentionally update it. 
- - **Timer --** *Timer* measures the time taken to complete a + - **Timer**—*Timer* measures the time taken to complete a task. - - **Counter --** *Counter* counts the number of occurrences of a + - **Counter**—*Counter* counts the number of occurrences of a particular event. For more information about these metric types, see [Data Types](https://statsd.readthedocs.io/en/v0.5.0/types.html). -- **Avoid over-monitoring** -- Monitoring can be a significant - engineering endeavor***.*** Therefore, be sure not to spend too +- **Avoid over-monitoring**—Monitoring can be a significant + engineering endeavor. Therefore, be sure not to spend too much time and resources on monitoring services, yet make sure all important metrics are captured. -- **Prevent alert fatigue** -- Set alerts for metrics that are +- **Prevent alert fatigue**—Set alerts for metrics that are important and actionable. If you receive too many non-critical alerts, you might start ignoring alert notifications over time. As a result, critical alerts might get overlooked. -- **Have a runbook for alerts** -- For every alert, make sure you have +- **Have a runbook for alerts**—For every alert, make sure you have a document explaining what actions and checks need to be performed when the alert fires. This enables any engineer on the team to handle the alert and take necessary actions, without any help from diff --git a/courses/level101/metrics_and_monitoring/command-line_tools.md b/courses/level101/metrics_and_monitoring/command-line_tools.md index 987466b5..65d662e7 100644 --- a/courses/level101/metrics_and_monitoring/command-line_tools.md +++ b/courses/level101/metrics_and_monitoring/command-line_tools.md @@ -6,48 +6,48 @@ monitor the system's performance. These tools help you measure and understand various subsystem statistics (CPU, memory, network, and so on). Let's look at some of the tools that are predominantly used. -- `ps/top `-- The process status command (ps) displays information +- **`ps/top`**: The process status command (`ps`) displays information about all the currently running processes in a Linux system. The - top command is similar to the ps command, but it periodically + top command is similar to the `ps` command, but it periodically updates the information displayed until the program is terminated. - An advanced version of top, called htop, has a more user-friendly + An advanced version of top, called `htop`, has a more user-friendly interface and some additional features. These command-line utilities come with options to modify the operation and output of the command. Following are some important options supported by the - ps command. + `ps` command. - - `-p ` -- Displays information about processes + - `-p `: Displays information about processes that match the specified process IDs. Similarly, you can use `-u ` and `-g ` to display information about processes belonging to a specific user or group. - - `-a` -- Displays information about other users' processes, as well + - `-a`: Displays information about other users' processes, as well as one's own. - - `-x` -- When displaying processes matched by other options, + - `-x`: When displaying processes matched by other options, includes processes that do not have a controlling terminal. ![Results of top command](images/image12.png)

Figure 2: Results of top command

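The same process information can be pulled into small SRE scripts. Below is a minimal, illustrative sketch (an addition for this edit, assuming a Linux host with a procps-style `ps` on the PATH; the column list `pid,rss,comm` is just one possible choice) that shells out to `ps` and prints the largest resident-memory processes.

```python
# Illustrative sketch: list the top memory-consuming processes using `ps` output.
# Assumes a Linux host with procps-style `ps` available on PATH.
import subprocess

def top_memory_processes(limit=5):
    # `ps -eo pid,rss,comm` prints PID, resident memory (in kB) and command name
    output = subprocess.run(
        ["ps", "-eo", "pid,rss,comm"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()[1:]  # drop the header row
    rows = []
    for line in output:
        pid, rss, comm = line.split(None, 2)
        rows.append((int(rss), int(pid), comm.strip()))
    return sorted(rows, reverse=True)[:limit]  # largest RSS first

if __name__ == "__main__":
    for rss_kb, pid, comm in top_memory_processes():
        print(f"{pid:>8}  {rss_kb:>10} kB  {comm}")
```

Sorting in Python instead of relying on `ps` sort flags keeps the sketch portable across `ps` implementations.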
-- `ss` -- The socket statistics command (ss) displays information +- **`ss`**: The socket statistics command (`ss`) displays information about network sockets on the system. This tool is the successor of [netstat](https://man7.org/linux/man-pages/man8/netstat.8.html), which is deprecated. Following are some command-line options - supported by the ss command: + supported by the `ss` command: - - `-t` -- Displays the TCP socket. Similarly, `-u` displays UDP + - `-t`: Displays the TCP socket. Similarly, `-u` displays UDP sockets, `-x` is for UNIX domain sockets, and so on. - - `-l` -- Displays only listening sockets. + - `-l`: Displays only listening sockets. - - `-n` -- Instructs the command to not resolve service names. + - `-n`: Instructs the command to not resolve service names. Instead displays the port numbers. ![List of listening sockets on a system](images/image8.png)

Figure 3: List of listening sockets on a system

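Socket states such as `CLOSE_WAIT` and `TIME_WAIT` (discussed in the TCP section) can also be summarised without `ss` by reading `/proc/net/tcp` directly. The sketch below is only an illustration and assumes a Linux host; IPv6 sockets in `/proc/net/tcp6` are ignored for brevity.

```python
# Illustrative sketch: count TCP sockets by state from /proc/net/tcp,
# similar to the per-state summary you can derive from `ss -tan`.
from collections import Counter

# State codes used by the kernel in /proc/net/tcp (hexadecimal)
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def tcp_state_counts(path="/proc/net/tcp"):
    counts = Counter()
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            state_hex = line.split()[3]  # 4th column is the socket state in hex
            counts[TCP_STATES.get(state_hex, state_hex)] += 1
    return counts

if __name__ == "__main__":
    for state, count in tcp_state_counts().most_common():
        print(f"{state:<12} {count}")
```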
-- `free` -- The free command displays memory usage statistics on the +- **`free`**: The `free` command displays memory usage statistics on the host like available memory, used memory, and free memory. Most often, this command is used with the `-h` command-line option, which displays the statistics in a human-readable format. @@ -55,7 +55,7 @@ on). Let's look at some of the tools that are predominantly used. ![Memory statistics on a host in human-readable form](images/image6.png)

Figure 4: Memory statistics on a host in human-readable form

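`free` reads these numbers from `/proc/meminfo`, so monitoring scripts can pick them up directly. A minimal sketch, assuming a Linux host (values in `/proc/meminfo` are reported in kB):

```python
# Illustrative sketch: read memory statistics straight from /proc/meminfo,
# the same source the `free` command uses.
def meminfo(keys=("MemTotal", "MemFree", "MemAvailable")):
    stats = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            if key in keys:
                stats[key] = int(value.split()[0])  # drop the trailing "kB" unit
    return stats

if __name__ == "__main__":
    info = meminfo()
    # One rough approximation of "used": total minus what the kernel reports as available
    used_kb = info["MemTotal"] - info["MemAvailable"]
    print(f"total={info['MemTotal']} kB  used~={used_kb} kB  available={info['MemAvailable']} kB")
```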
-- `df --` The df command displays disk space usage statistics. The +- **`df`**: The `df` command displays disk space usage statistics. The `-i` command-line option is also often used to display [inode](https://en.wikipedia.org/wiki/Inode) usage statistics. The `-h` command-line option is used for displaying @@ -65,12 +65,12 @@ on). Let's look at some of the tools that are predominantly used.

Figure 5: Disk usage statistics on a system in human-readable form

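For disk space, the Python standard library exposes the same numbers through `shutil.disk_usage`. The short sketch below (the mount point and the GiB formatting are arbitrary choices) shows how a script might warn on a nearly full filesystem.

```python
# Illustrative sketch: disk usage for a mount point via the standard library.
import shutil

def disk_usage_percent(path="/"):
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage, usage.used * 100 / usage.total

if __name__ == "__main__":
    usage, pct = disk_usage_percent("/")
    gib = 1024 ** 3
    print(f"/: {usage.used / gib:.1f} GiB used of {usage.total / gib:.1f} GiB ({pct:.1f}%)")
    if pct > 90:
        print("WARNING: filesystem is over 90% full")
```

Note that `df` may report a slightly different percentage because it accounts for blocks reserved for the superuser.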
-- `sar` -- The sar utility monitors various subsystems, such as CPU +- **`sar`**: The `sar` utility monitors various subsystems, such as CPU and memory, in real time. This data can be stored in a file specified with the `-o` option. This tool helps to identify anomalies. -- `iftop` -- The interface top command (`iftop`) displays bandwidth +- **`iftop`**: The interface top command (`iftop`) displays bandwidth utilization by a host on an interface. This command is often used to identify bandwidth usage by active connections. The `-i` option specifies which network interface to watch. @@ -80,22 +80,22 @@ on). Let's look at some of the tools that are predominantly used.

Figure 6: Network bandwidth usage by active connection on the host

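Interface byte counters are exposed in `/proc/net/dev`, so a rough throughput number can be derived by sampling them twice. The sketch below only illustrates the idea behind tools like `iftop` on a Linux host; a real per-connection breakdown needs packet capture, which is what `iftop` itself uses.

```python
# Illustrative sketch: estimate per-interface receive/transmit throughput by
# sampling the byte counters in /proc/net/dev over a short interval.
import time

def read_counters():
    counters = {}
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:  # first two lines are column headers
            iface, data = line.split(":", 1)
            fields = data.split()
            # field 0 is received bytes, field 8 is transmitted bytes
            counters[iface.strip()] = (int(fields[0]), int(fields[8]))
    return counters

def throughput(interval=1.0):
    before = read_counters()
    time.sleep(interval)
    after = read_counters()
    for iface, (rx2, tx2) in after.items():
        rx1, tx1 = before.get(iface, (rx2, tx2))
        yield iface, (rx2 - rx1) / interval, (tx2 - tx1) / interval

if __name__ == "__main__":
    for iface, rx_bps, tx_bps in throughput():
        print(f"{iface:<12} rx {rx_bps / 1024:.1f} KiB/s   tx {tx_bps / 1024:.1f} KiB/s")
```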
-- `tcpdump` -- The tcpdump command is a network monitoring tool that +- **`tcpdump`**: The `tcpdump` command is a network monitoring tool that captures network packets flowing over the network and displays a description of the captured packets. The following options are available: - - `-i ` -- Interface to listen on + - `-i `: Interface to listen on - - `host ` -- Filters traffic going to or from the + - `host `: Filters traffic going to or from the specified host - - `src/dst` -- Displays one-way traffic from the source (src) or to + - `src/dst`: Displays one-way traffic from the source (src) or to the destination (dst) - - `port ` -- Filters traffic to or from a particular + - `port `: Filters traffic to or from a particular port ![tcpdump of packets on an interface](images/image10.png) -

Figure 7: *tcpdump* of packets on *docker0* +

Figure 7: tcpdump of packets on docker0 interface on a host

\ No newline at end of file diff --git a/courses/level101/metrics_and_monitoring/conclusion.md b/courses/level101/metrics_and_monitoring/conclusion.md index 6d426517..d1550b51 100644 --- a/courses/level101/metrics_and_monitoring/conclusion.md +++ b/courses/level101/metrics_and_monitoring/conclusion.md @@ -2,13 +2,13 @@ A robust monitoring and alerting system is necessary for maintaining and troubleshooting a system. A dashboard with key metrics can give you an -overview of service performance, all in one place. Well-defined alerts +overview of service performance, all in one place. Well-defined alerts (with realistic thresholds and notifications) further enable you to quickly identify any anomalies in the service infrastructure and in resource saturation. By taking necessary actions, you can avoid any service degradations and decrease MTTD for service breakdowns. -In addition to in-house monitoring, monitoring real user experience can +In addition to in-house monitoring, monitoring real-user experience can help you to understand service performance as perceived by the users. Many modules are involved in serving the user, and most of them are out of your control. Therefore, you need to have real-user monitoring in @@ -30,7 +30,6 @@ observability: Shkuro](https://learning.oreilly.com/library/view/mastering-distributed-tracing/9781788628464/) - ## References - [Google SRE book: Monitoring Distributed diff --git a/courses/level101/metrics_and_monitoring/introduction.md b/courses/level101/metrics_and_monitoring/introduction.md index fda7d7ef..f9e8cfb0 100644 --- a/courses/level101/metrics_and_monitoring/introduction.md +++ b/courses/level101/metrics_and_monitoring/introduction.md @@ -76,7 +76,7 @@ a system, analyzing the data to derive meaningful information, and displaying the data to the users. In simple terms, you measure various metrics regularly to understand the state of the system, including but not limited to, user requests, latency, and error rate. *What gets -measured, gets fixed*---if you can measure something, you can reason +measured, gets fixed*—if you can measure something, you can reason about it, understand it, discuss it, and act upon it with confidence. @@ -102,14 +102,14 @@ book](https://sre.google/sre-book/monitoring-distributed-systems/), if you can measure only four metrics of your service, focus on these four. Let's look at each of the four golden signals. -- **Traffic** -- *Traffic* gives a better understanding of the service +- **Traffic**—*Traffic* gives a better understanding of the service demand. Often referred to as *service QPS* (queries per second), traffic is a measure of requests served by the service. This signal helps you to decide when a service needs to be scaled up to handle increasing customer demand and scaled down to be cost-effective. -- **Latency** -- *Latency* is the measure of time taken by the service +- **Latency**—*Latency* is the measure of time taken by the service to process the incoming request and send the response. Measuring service latency helps in the early detection of slow degradation of the service. Distinguishing between the latency of successful @@ -121,7 +121,7 @@ four. Let's look at each of the four golden signals. HTTP 500 error indicates a failed request, factoring 500s into overall latency might result in misleading calculations. -- **Error (rate)** -- *Error* is the measure of failed client +- **Error (rate)**—*Error* is the measure of failed client requests. 
These failures can be easily identified based on the response codes ([HTTP 5XX error](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses)). @@ -136,7 +136,7 @@ four. Let's look at each of the four golden signals. [instrumentation](https://en.wikipedia.org/wiki/Instrumentation_(computer_programming))) in place to capture errors in addition to the response codes. -- **Saturation** -- *Saturation* is a measure of the resource +- **Saturation**—*Saturation* is a measure of the resource utilization by a service. This signal tells you the state of service resources and how full they are. These resources include memory, compute, network I/O, and so on. Service performance @@ -168,17 +168,17 @@ service health. With access to historical data collected over time, you can build intelligent applications to address specific needs. Some of the key use cases follow: -- **Reduction in time to resolve issues** -- With a good monitoring +- **Reduction in time to resolve issues**—With a good monitoring infrastructure in place, you can identify issues quickly and resolve them, which reduces the impact caused by the issues. -- **Business decisions** -- Data collected over a period of time can +- **Business decisions**—Data collected over a period of time can help you make business decisions such as determining the product release cycle, which features to invest in, and geographical areas to focus on. Decisions based on long-term data can improve the overall product experience. -- **Resource planning** -- By analyzing historical data, you can +- **Resource planning**—By analyzing historical data, you can forecast service compute-resource demands, and you can properly allocate resources. This allows financially effective decisions, with no compromise in end-user experience. @@ -186,35 +186,35 @@ the key use cases follow: Before we dive deeper into monitoring, let's understand some basic terminologies. 
-- **Metric** -- A metric is a quantitative measure of a particular - system attribute---for example, memory or CPU +- **Metric**—A metric is a quantitative measure of a particular + system attribute—for example, memory or CPU -- **Node or host** -- A physical server, virtual machine, or container +- **Node or host**—A physical server, virtual machine, or container where an application is running -- **QPS** -- *Queries Per Second*, a measure of traffic served by the +- **QPS**—*Queries Per Second*, a measure of traffic served by the service per second -- **Latency** -- The time interval between user action and the - response from the server---for example, time spent after sending a +- **Latency**—The time interval between user action and the + response from the server—for example, time spent after sending a query to a database before the first response bit is received -- **Error** **rate** -- Number of errors observed over a particular +- **Error** **rate**—Number of errors observed over a particular time period (usually a second) -- **Graph** -- In monitoring, a graph is a representation of one or +- **Graph**—In monitoring, a graph is a representation of one or more values of metrics collected over time -- **Dashboard** -- A dashboard is a collection of graphs that provide +- **Dashboard**—A dashboard is a collection of graphs that provide an overview of system health -- **Incident** -- An incident is an event that disrupts the normal +- **Incident**—An incident is an event that disrupts the normal operations of a system -- **MTTD** -- *Mean Time To Detect* is the time interval between the +- **MTTD**—*Mean Time To Detect* is the time interval between the beginning of a service failure and the detection of such failure -- **MTTR** -- Mean Time To Resolve is the time spent to fix a service +- **MTTR**—Mean Time To Resolve is the time spent to fix a service failure and bring the service back to its normal state Before we discuss monitoring an application, let us look at the @@ -230,7 +230,7 @@ In addition, a monitoring infrastructure includes alert subsystems for notifying concerned parties during any abnormal behavior. Let's look at each of these infrastructure components: -- **Host metrics agent --** A *host metrics agent* is a process +- **Host metrics agent**—A *host metrics agent* is a process running on the host that collects performance statistics for host subsystems such as memory, CPU, and network. These metrics are regularly relayed to a metrics collector for storage and @@ -239,7 +239,7 @@ each of these infrastructure components: [telegraf](https://www.influxdata.com/time-series-platform/telegraf/), and [metricbeat](https://www.elastic.co/beats/metricbeat). -- **Metric aggregator --** A *metric aggregator* is a process running +- **Metric aggregator**—A *metric aggregator* is a process running on the host. Applications running on the host collect service metrics using [instrumentation](https://en.wikipedia.org/wiki/Instrumentation_(computer_programming)). @@ -249,7 +249,7 @@ each of these infrastructure components: collector in batches. An example is [StatsD](https://github.com/statsd/statsd). -- **Metrics collector --** A *metrics collector* process collects all +- **Metrics collector**—A *metrics collector* process collects all the metrics from the metric aggregators running on multiple hosts. The collector takes care of decoding and stores this data on the database. 
Metric collection and storage might be taken care of by @@ -258,19 +258,19 @@ each of these infrastructure components: next. An example is [carbon daemons](https://graphite.readthedocs.io/en/latest/carbon-daemons.html). -- **Storage --** A time-series database stores all of these metrics. +- **Storage**—A time-series database stores all of these metrics. Examples are [OpenTSDB](http://opentsdb.net/), [Whisper](https://graphite.readthedocs.io/en/stable/whisper.html), and [InfluxDB](https://www.influxdata.com/). -- **Metrics server --** A *metrics server* can be as basic as a web +- **Metrics server**—A *metrics server* can be as basic as a web server that graphically renders metric data. In addition, the metrics server provides aggregation functionalities and APIs for fetching metric data programmatically. Some examples are [Grafana](https://github.com/grafana/grafana) and [Graphite-Web](https://github.com/graphite-project/graphite-web). -- **Alert manager --** The *alert manager* regularly polls metric data +- **Alert manager**—The *alert manager* regularly polls metric data available and, if there are any anomalies detected, notifies you. Each alert has a set of rules for identifying such anomalies. Today many metrics servers such as diff --git a/courses/level101/metrics_and_monitoring/observability.md b/courses/level101/metrics_and_monitoring/observability.md index d08acd16..12d1ca7b 100644 --- a/courses/level101/metrics_and_monitoring/observability.md +++ b/courses/level101/metrics_and_monitoring/observability.md @@ -3,7 +3,7 @@ # Observability Engineers often use observability when referring to building reliable -systems. *Observability* is a term derived from control theory, It is a +systems. *Observability* is a term derived from control theory, it is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. Service infrastructures used on a daily basis are becoming more and more complex; proactive monitoring @@ -82,7 +82,7 @@ Figure 10 shows a log processing platform using ELK (Elasticsearch, Logstash, Kibana), which provides centralized log processing. Beats is a collection of lightweight data shippers that can ship logs, audit data, network data, and so on over the network. In this use case specifically, -we are using filebeat as a log shipper. Filebeat watches service log +we are using Filebeat as a log shipper. Filebeat watches service log files and ships the log data to Logstash. Logstash parses these logs and transforms the data, preparing it to store on Elasticsearch. Transformed log data is stored on Elasticsearch and indexed for fast retrieval. diff --git a/courses/level101/metrics_and_monitoring/third-party_monitoring.md b/courses/level101/metrics_and_monitoring/third-party_monitoring.md index b81ff0c9..0fae2e06 100644 --- a/courses/level101/metrics_and_monitoring/third-party_monitoring.md +++ b/courses/level101/metrics_and_monitoring/third-party_monitoring.md @@ -8,13 +8,13 @@ addition, a number of companies such as monitoring-as-a-service. In this section, we are not covering monitoring-as-a-service in depth. -In recent years, more and more people have access to the internet. Many +In recent years, more and more people have access to the Internet. Many services are offered online to cater to the increasing user base. As a result, web pages are becoming larger, with increased client-side scripts. Users want these services to be fast and error-free. 
From the service point of view, when the response body is composed, an HTTP 200 OK response is sent, and everything looks okay. But there might be -errors during transmission or on the client side. As previously +errors during transmission or on the client-side. As previously mentioned, monitoring services from within the service infrastructure give good visibility into service health, but this is not enough. You need to monitor user experience, specifically the availability of @@ -29,7 +29,7 @@ service is globally accessible. Other third-party monitoring solutions for real user monitoring (RUM) provide performance statistics such as service uptime and response time, from different geographical locations. This allows you to monitor the user experience from these locations, -which might have different internet backbones, different operating +which might have different Internet backbones, different operating systems, and different browsers and browser versions. [Catchpoint Global Monitoring Network](https://pages.catchpoint.com/overview-video) is a diff --git a/courses/level101/python_web/intro.md b/courses/level101/python_web/intro.md index 84aa91b1..cac9a27a 100644 --- a/courses/level101/python_web/intro.md +++ b/courses/level101/python_web/intro.md @@ -2,24 +2,24 @@ ## Prerequisites -- Basic understanding of python language. -- Basic familiarity with flask framework. +- Basic understanding of Python language. +- Basic familiarity with Flask framework. ## What to expect from this course -This course is divided into two high level parts. In the first part, assuming familiarity with python language’s basic operations and syntax usage, we will dive a little deeper into understanding python as a language. We will compare python with other programming languages that you might already know like Java and C. We will also explore concepts of Python objects and with help of that, explore python features like decorators. +This course is divided into two high-level parts. In the first part, assuming familiarity with Python language’s basic operations and syntax usage, we will dive a little deeper into understanding Python as a language. We will compare Python with other programming languages that you might already know like Java and C. We will also explore concepts of Python objects and with help of that, explore Python features like decorators. -In the second part which will revolve around the web, and also assume familiarity with the Flask framework, we will start from the socket module and work with HTTP requests. This will demystify how frameworks like flask work internally. +In the second part which will revolve around the web, and also assume familiarity with the Flask framework, we will start from the `socket` module and work with HTTP requests. This will demystify how frameworks like Flask work internally. -And to introduce SRE flavour to the course, we will design, develop and deploy (in theory) a URL shortening application. We will emphasize parts of the whole process that are more important as an SRE of the said app/service. +And to introduce SRE flavour to the course, we will design, develop and deploy (in theory) a URL-shortening application. We will emphasize parts of the whole process that are more important as an SRE of the said app/service. ## What is not covered under this course -Extensive knowledge of python internals and advanced python. +Extensive knowledge of Python internals and advanced Python. 
## Lab Environment Setup -Have latest version of python installed +Have latest version of Python installed ## Course Contents @@ -29,21 +29,21 @@ Have latest version of python installed 2. [Python and Web](https://linkedin.github.io/school-of-sre/level101/python_web/python-web-flask/) 1. [Sockets](https://linkedin.github.io/school-of-sre/level101/python_web/python-web-flask/#sockets) 2. [Flask](https://linkedin.github.io/school-of-sre/level101/python_web/python-web-flask/#flask) -3. [The URL Shortening App](https://linkedin.github.io/school-of-sre/level101/python_web/url-shorten-app/) +3. [The URL-Shortening App](https://linkedin.github.io/school-of-sre/level101/python_web/url-shorten-app/) 1. [Design](https://linkedin.github.io/school-of-sre/level101/python_web/url-shorten-app/#design) 2. [Scaling The App](https://linkedin.github.io/school-of-sre/level101/python_web/sre-conclusion/#scaling-the-app) 3. [Monitoring The App](https://linkedin.github.io/school-of-sre/level101/python_web/sre-conclusion/#monitoring-strategy) ## The Python Language -Assuming you know a little bit of C/C++ and Java, let's try to discuss the following questions in context of those two languages and python. You might have heard that C/C++ is a compiled language while python is an interpreted language. Generally, with compiled language we first compile the program and then run the executable while in case of python we run the source code directly like `python hello_world.py`. While Java, being an interpreted language, still has a separate compilation step and then its run. So what's really the difference? +Assuming you know a little bit of C/C++ and Java, let's try to discuss the following questions in context of those two languages and Python. You might have heard that C/C++ is a compiled language while Python is an interpreted language. Generally, with compiled language we first compile the program and then run the executable while in case of Python we run the source code directly like `python hello_world.py`. While Java, being an interpreted language, still has a separate compilation step and then it's run. So, what's really the difference? ### Compiled vs. Interpreted -This might sound a little weird to you: python, in a way is a compiled language! Python has a compiler built-in! It is obvious in the case of java since we compile it using a separate command ie: `javac helloWorld.java` and it will produce a `.class` file which we know as a _bytecode_. Well, python is very similar to that. One difference here is that there is no separate compile command/binary needed to run a python program. +This might sound a little weird to you: Python, in a way is a compiled language! Python has a compiler built-in! It is obvious in the case of Java since we compile it using a separate command, ie: `javac helloWorld.java` and it will produce a `.class` file which we know as a _bytecode_. Well, Python is very similar to that. One difference here is that there is no separate compile command/binary needed to run a Python program. -**What is the difference then, between java and python?** -Well, Java's compiler is more strict and sophisticated. As you might know Java is a statically typed language. So the compiler is written in a way that it can verify types related errors during compile time. While python being a _dynamic_ language, types are not known until a program is run. So in a way, python compiler is dumb (or, less strict). But there indeed is a compile step involved when a python program is run. 
You might have seen python bytecode files with `.pyc` extension. Here is how you can see bytecode for a given python program. +**What is the difference then, between Java and Python?** +Well, Java's compiler is more strict and sophisticated. As you might know Java is a statically typed language. So the compiler is written in a way that it can verify types-related errors during compile time. While Python being a _dynamic_ language, types are not known until a program is run. So in a way, Python compiler is dumb (or, less strict). But there indeed is a compile step involved when a Python program is run. You might have seen Python bytecode files with `.pyc` extension. Here is how you can see bytecode for a given Python program. ```bash # Create a Hello World @@ -63,15 +63,15 @@ $ python -m dis hello_world.py 10 RETURN_VALUE ``` -Read more about dis module [here](https://docs.python.org/3/library/dis.html) +Read more about `dis` module [here](https://docs.python.org/3/library/dis.html). -Now coming to C/C++, there of course is a compiler. But the output is different than what java/python compiler would produce. Compiling a C program would produce what we also know as _machine code_. As opposed to bytecode. +Now coming to C/C++, there of course is a compiler. But the output is different than what Java/Python compiler would produce. Compiling a C program would produce what we also know as _machine code_, as opposed to _bytecode_. ### Running The Programs We know compilation is involved in all 3 languages we are discussing. Just that the compilers are different in nature and they output different types of content. In case of C/C++, the output is machine code which can be directly read by your operating system. When you execute that program, your OS will know how exactly to run it. **But this is not the case with bytecode.** -Those bytecodes are language specific. Python has its own set of bytecode defined (more in `dis` module) and so does java. So naturally, your operating system will not know how to run it. To run this bytecode, we have something called Virtual Machines. Ie: The JVM or the Python VM (CPython, Jython). These so called Virtual Machines are the programs which can read the bytecode and run it on a given operating system. Python has multiple VMs available. Cpython is a python VM implemented in C language, similarly Jython is a Java implementation of python VM. **At the end of the day, what they should be capable of is to understand python language syntax, be able to compile it to bytecode and be able to run that bytecode.** You can implement a python VM in any language! (And people do so, just because it can be done) +Those bytecodes are language specific. Python has its own set of bytecode defined (more in `dis` module) and so does Java. So naturally, your operating system will not know how to run it. To run this bytecode, we have something called Virtual Machines. Ie: The JVM or the Python VM (CPython, Jython). These so-called Virtual Machines are the programs which can read the bytecode and run it on a given operating system. Python has multiple VMs available. CPython is a Python VM implemented in C language, similarly Jython is a Java implementation of Python VM. **At the end of the day, what they should be capable of is to understand Python language syntax, be able to compile it to bytecode and be able to run that bytecode.** You can implement a Python VM in any language! 
(And people do so, just because it can be done) ``` The Operating System @@ -108,5 +108,5 @@ hello_world.c OS Specific machinecode | A New Pr Two things to note for above diagram: -1. Generally, when we run a python program, a python VM process is started which reads the python source code, compiles it to byte code and run it in a single step. Compiling is not a separate step. Shown only for illustration purpose. +1. Generally, when we run a Python program, a Python VM process is started which reads the Python source code, compiles it to bytecode and run it in a single step. Compiling is not a separate step. Shown only for illustration purpose. 2. Binaries generated for C like languages are not _exactly_ run as is. Since there are multiple types of binaries (eg: ELF), there are more complicated steps involved in order to run a binary but we will not go into that since all that is done at OS level. diff --git a/courses/level101/python_web/python-concepts.md b/courses/level101/python_web/python-concepts.md index bceaf382..8cc11555 100644 --- a/courses/level101/python_web/python-concepts.md +++ b/courses/level101/python_web/python-concepts.md @@ -1,12 +1,12 @@ # Some Python Concepts -Though you are expected to know python and its syntax at basic level, let us discuss some fundamental concepts that will help you understand the python language better. +Though you are expected to know python and its syntax at basic level, let us discuss some fundamental concepts that will help you understand the Python language better. **Everything in Python is an object.** -That includes the functions, lists, dicts, classes, modules, a running function (instance of function definition), everything. In the CPython, it would mean there is an underlying struct variable for each object. +That includes the functions, lists, dicts, classes, modules, a running function (instance of function definition), everything. In the CPython, it would mean there is an underlying `struct` variable for each object. -In python's current execution context, all the variables are stored in a dict. It'd be a string to object mapping. If you have a function and a float variable defined in the current context, here is how it is handled internally. +In Python's current execution context, all the variables are stored in a dict. It'd be a string to object mapping. If you have a function and a float variable defined in the current context, here is how it is handled internally. ```python >>> float_number=42.0 @@ -35,7 +35,7 @@ Since functions too are objects, we can see what all attributes a function conta '__subclasshook__'] ``` -While there are a lot of them, let's look at some interesting ones +While there are a lot of them, let's look at some interesting ones. #### __globals__ @@ -53,7 +53,7 @@ This attribute, as the name suggests, has references of global variables. If you ### __code__ -This is an interesting one! As everything in python is an object, this includes the bytecode too. The compiled python bytecode is a python code object. Which is accessible via `__code__` attribute here. A function has an associated code object which carries some interesting information. +This is an interesting one! As everything in Python is an object, this includes the bytecode too. The compiled Python bytecode is a Python code object. Which is accessible via `__code__` attribute here. A function has an associated code object which carries some interesting information. 
```python # the file in which function is defined @@ -74,11 +74,11 @@ This is an interesting one! As everything in python is an object, this includes b't\x00d\x01|\x00\x9b\x00d\x02\x9d\x03\x83\x01\x01\x00d\x00S\x00' ``` -There are more code attributes which you can enlist by `>>> dir(hello.__code__)` +There are more code attributes which you can enlist by `>>> dir(hello.__code__)`. ## Decorators -Related to functions, python has another feature called decorators. Let's see how that works, keeping `everything is an object` in mind. +Related to functions, Python has another feature called decorators. Let's see how that works, keeping `everything is an object` in mind. Here is a sample decorator: @@ -115,7 +115,7 @@ What goes inside the `deco` function might seem complex. Let's try to uncover it 1. Function `hello_world` is created 2. It is passed to `deco` function 3. `deco` create a new function - 1. This new function is calls `hello_world` function + 1. This new function calls `hello_world` function 2. And does a couple other things 4. `deco` returns the newly created function 5. `hello_world` is replaced with above function @@ -156,7 +156,7 @@ Note how the `hello_world` name points to a new function object but that new fun ## Some Gotchas -- While it is very quick to build prototypes in python and there are tons of libraries available, as the codebase complexity increases, type errors become more common and will get hard to deal with. (There are solutions to that problem like type annotations in python. Checkout [mypy](http://mypy-lang.org/).) -- Because python is dynamically typed language, that means all types are determined at runtime. And that makes python run very slow compared to other statically typed languages. +- While it is very quick to build prototypes in Python and there are tons of libraries available, as the codebase complexity increases, type errors become more common and will get hard to deal with. (There are solutions to that problem like type annotations in Python. Checkout [mypy](http://mypy-lang.org/).) +- Because Python is dynamically typed language, that means all types are determined at runtime. And that makes Python run very slow compared to other statically typed languages. - Python has something called [GIL](https://www.dabeaz.com/python/UnderstandingGIL.pdf) (global interpreter lock) which is a limiting factor for utilizing multiple CPU cores for parallel computation. -- Some weird things that python does: https://github.com/satwikkansal/wtfpython +- Some weird things that Python does: [https://github.com/satwikkansal/wtfpython](https://github.com/satwikkansal/wtfpython). diff --git a/courses/level101/python_web/python-web-flask.md b/courses/level101/python_web/python-web-flask.md index c7978307..5554b4c2 100644 --- a/courses/level101/python_web/python-web-flask.md +++ b/courses/level101/python_web/python-web-flask.md @@ -1,12 +1,12 @@ # Python, Web and Flask -Back in the old days, websites were simple. They were simple static html contents. A webserver would be listening on a defined port and according to the HTTP request received, it would read files from disk and return them in response. But since then, complexity has evolved and websites are now dynamic. Depending on the request, multiple operations need to be performed like reading from database or calling other API and finally returning some response (HTML data, JSON content etc.) +Back in the old days, websites were simple. They were simple static html contents. 
A webserver would be listening on a defined port and according to the HTTP request received, it would read files from disk and return them in response. But since then, complexity has evolved and websites are now dynamic. Depending on the request, multiple operations need to be performed like reading from database or calling other API and finally returning some response (HTML data, JSON content, etc.) -Since serving web requests is no longer a simple task like reading files from disk and return contents, we need to process each http request, perform some operations programmatically and construct a response. +Since serving web requests is no longer a simple task like reading files from disk and return contents, we need to process each HTTP request, perform some operations programmatically and construct a response. ## Sockets -Though we have frameworks like flask, HTTP is still a protocol that works over TCP protocol. So let us setup a TCP server and send an HTTP request and inspect the request's payload. Note that this is not a tutorial on socket programming but what we are doing here is inspecting HTTP protocol at its ground level and look at what its contents look like. (Ref: [Socket Programming in Python (Guide) on RealPython](https://realpython.com/python-sockets/)) +Though we have frameworks like Flask, HTTP is still a protocol that works over TCP protocol. So, let us setup a TCP server and send an HTTP request and inspect the request's payload. Note that this is not a tutorial on socket programming but what we are doing here is inspecting HTTP protocol at its ground level and look at what its contents look like. (Ref: [Socket Programming in Python (Guide) on RealPython](https://realpython.com/python-sockets/)) ```python import socket @@ -27,7 +27,7 @@ with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: print(data) ``` -Then we open `localhost:65432` in our web browser and following would be the output: +Then, we open `localhost:65432` in our web browser and following would be the output: ```bash Connected by ('127.0.0.1', 54719) @@ -47,10 +47,10 @@ So though it's a blob of bytes, knowing [http protocol specification](https://to Flask, and other such frameworks does pretty much what we just discussed in the last section (with added more sophistication). They listen on a port on a TCP socket, receive an HTTP request, parse the data according to protocol format and make it available to you in a convenient manner. -ie: you can access headers in flask by `request.headers` which is made available to you by splitting above payload by `/r/n`, as defined in http protocol. +That is you can access headers in Flask by `request.headers` which is made available to you by splitting above payload by `/r/n`, as defined in HTTP protocol. -Another example: we register routes in flask by `@app.route("/hello")`. What flask will do is maintain a registry internally which will map `/hello` with the function you decorated with. Now whenever a request comes with the `/hello` route (second component in the first line, split by space), flask calls the registered function and returns whatever the function returned. +Another example: we register routes in Flask by `@app.route("/hello")`. What Flask will do is maintain a registry internally which will map `/hello` with the function you decorated with. Now, whenever a request comes with the `/hello` route (second component in the first line, split by space), Flask calls the registered function and returns whatever the function returned. 
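To make the registry idea concrete, here is a deliberately simplified sketch. This is not how Flask is actually implemented; the `route` decorator, the `routes` dict and the `dispatch` function are made-up names used only to illustrate mapping a path to a handler and parsing the request line.

```python
# A toy, Flask-like route registry: a decorator records which function serves
# which path, and a dispatcher parses the raw HTTP request line to pick one.
routes = {}

def route(path):
    def register(handler):
        routes[path] = handler  # remember the handler for this path
        return handler
    return register

@route("/hello")
def hello():
    return "Hello World"

def dispatch(raw_request: bytes) -> str:
    # The first request line looks like: b"GET /hello HTTP/1.1"
    request_line = raw_request.split(b"\r\n", 1)[0].decode()
    _method, path, _version = request_line.split(" ")
    handler = routes.get(path)
    body = handler() if handler else "Not Found"
    status = "200 OK" if handler else "404 Not Found"
    return f"HTTP/1.1 {status}\r\nContent-Length: {len(body)}\r\n\r\n{body}"

if __name__ == "__main__":
    print(dispatch(b"GET /hello HTTP/1.1\r\nHost: localhost\r\n\r\n"))
```

Flask layers a lot more on top of this (URL converters, HTTP methods, blueprints, WSGI integration), but the registry-plus-dispatch core is the same idea.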
Same with all other web frameworks in other languages too. They all work on similar principles. What they basically do is understand the HTTP protocol, parses the HTTP request data and gives us programmers a nice interface to work with HTTP requests. -Not so much of magic, innit? +Not so much of magic in it? diff --git a/courses/level101/python_web/sre-conclusion.md b/courses/level101/python_web/sre-conclusion.md index 529c5030..3a621e69 100644 --- a/courses/level101/python_web/sre-conclusion.md +++ b/courses/level101/python_web/sre-conclusion.md @@ -4,7 +4,7 @@ The design and development is just a part of the journey. We will need to setup continuous integration and continuous delivery pipelines sooner or later. And we have to deploy this app somewhere. -Initially we can start with deploying this app on one virtual machine on any cloud provider. But this is a `Single point of failure` which is something we never allow as an SRE (or even as an engineer). So an improvement here can be having multiple instances of applications deployed behind a load balancer. This certainly prevents problems of one machine going down. +Initially, we can start with deploying this app on one virtual machine on any cloud provider. But this is a `Single point of failure` which is something we never allow as an SRE (or even as an engineer). So an improvement here can be having multiple instances of applications deployed behind a load balancer. This certainly prevents problems of one machine going down. Scaling here would mean adding more instances behind the load balancer. But this is scalable upto only a certain point. After that, other bottlenecks in the system will start appearing. ie: DB will become the bottleneck, or perhaps the load balancer itself. How do you know what is the bottleneck? You need to have observability into each aspects of the application architecture. @@ -14,32 +14,32 @@ Get deeper insights into scaling from School Of SRE's [Scalability module](../sy ## Monitoring Strategy -Once we have our application deployed. It will be working ok. But not forever. Reliability is in the title of our job and we make systems reliable by making the design in a certain way. But things still will go down. Machines will fail. Disks will behave weirdly. Buggy code will get pushed to production. And all these possible scenarios will make the system less reliable. So what do we do? **We monitor!** +Once we have our application deployed. It will be working okay. But not forever. Reliability is in the title of our job and we make systems reliable by making the design in a certain way. But things still will go down. Machines will fail. Disks will behave weirdly. Buggy code will get pushed to production. And all these possible scenarios will make the system less reliable. So what do we do? **We monitor!** We keep an eye on the system's health and if anything is not going as expected, we want ourselves to get alerted. -Now let's think in terms of the given url shortening app. We need to monitor it. And we would want to get notified in case something goes wrong. But we first need to decide what is that _something_ that we want to keep an eye on. +Now let's think in terms of the given URL-shortening app. We need to monitor it. And we would want to get notified in case something goes wrong. But we first need to decide what is that _something_ that we want to keep an eye on. 1. Since it's a web app serving HTTP requests, we want to keep an eye on HTTP Status codes and latencies 2. 
Request volume again is a good candidate, if the app is receiving an unusual amount of traffic, something might be off. -3. We also want to keep an eye on the database so depending on the database solution chosen. Query times, volumes, disk usage etc. +3. We also want to keep an eye on the database so depending on the database solution chosen. Query times, volumes, disk usage, etc. 4. Finally, there also needs to be some external monitoring which runs periodic tests from devices outside of your data centers. This emulates customers and ensures that from customer point of view, the system is working as expected. ## Applications in SRE role -In the world of SRE, python is a widely used language. For small scripts and tooling developed for various purposes. Since tooling developed by SRE works with critical pieces of infrastructure and has great power (to bring things down), it is important to know what you are doing while using a programming language and its features. Also it is equally important to know the language and its characteristics while debugging the issues. As an SRE having a deeper understanding of python language, it has helped me a lot to debug very sneaky bugs and be generally more aware and informed while making certain design decisions. +In the world of SRE, Python is a widely used language for small scripts and tooling developed for various purposes. Since tooling developed by SRE works with critical pieces of infrastructure and has great power (to bring things down), it is important to know what you are doing while using a programming language and its features. Also it is equally important to know the language and its characteristics while debugging the issues. As an SRE having a deeper understanding of Python language, it has helped me a lot to debug very sneaky bugs and be generally more aware and informed while making certain design decisions. While developing tools may or may not be part of SRE job, supporting tools or services is more likely to be a daily duty. Building an application or tool is just a small part of productionization. While there is certainly that goes in the design of the application itself to make it more robust, as an SRE you are responsible for its reliability and stability once it is deployed and running. And to ensure that, you’d need to understand the application first and then come up with a strategy to monitor it properly and be prepared for various failure scenarios. ## Optional Exercises 1. Make a decorator that will cache function return values depending on input parameters. -2. Host the URL shortening app on any cloud provider. -3. Setup monitoring using many of the tools available like catchpoint, datadog etc. -4. Create a minimal flask-like framework on top of TCP sockets. +2. Host the URL-shortening app on any cloud provider. +3. Setup monitoring using many of the tools available like Catchpoint, Datadog, etc. +4. Create a minimal Flask-like framework on top of TCP sockets. ## Conclusion -This module, in the first part, aims to make you more aware of the things that will happen when you choose python as your programming language and what happens when you run a python program. With the knowledge of how python handles things internally as objects, lot of seemingly magic things in python will start to make more sense. +This module, in the first part, aims to make you more aware of the things that will happen when you choose Python as your programming language and what happens when you run a Python program. 
With the knowledge of how Python handles things internally as objects, lot of seemingly magic things in Python will start to make more sense. -The second part will first explain how a framework like flask works using the existing knowledge of protocols like TCP and HTTP. It then touches the whole lifecycle of an application development lifecycle including the SRE parts of it. While the design and areas in architecture considered will not be exhaustive, it will give a good overview of things that are also important being an SRE and why they are important. +The second part will first explain how a framework like Flask works using the existing knowledge of protocols like TCP and HTTP. It then touches the whole lifecycle of an application development lifecycle including the SRE parts of it. While the design and areas in architecture considered will not be exhaustive, it will give a good overview of things that are also important being an SRE and why they are important. diff --git a/courses/level101/python_web/url-shorten-app.md b/courses/level101/python_web/url-shorten-app.md index 058dd0b4..24c3b889 100644 --- a/courses/level101/python_web/url-shorten-app.md +++ b/courses/level101/python_web/url-shorten-app.md @@ -1,6 +1,6 @@ # The URL Shortening App -Let's build a very simple URL shortening app using flask and try to incorporate all aspects of the development process including the reliability aspects. We will not be building the UI and we will come up with a minimal set of API that will be enough for the app to function well. +Let's build a very simple URL-shortening app using Flask and try to incorporate all aspects of the development process including the reliability aspects. We will not be building the UI and we will come up with a minimal set of API that will be enough for the app to function well. ## Design @@ -8,19 +8,19 @@ We don't jump directly to coding. First thing we do is gather requirements. Come ### 1. High Level Operations and API Endpoints -Since it's a URL shortening app, we will need an API for generating the shorten link given an original link. And an API/Endpoint which will accept the shorten link and redirect to original URL. We are not including the user aspect of the app to keep things minimal. These two API should make app functional and usable by anyone. +Since it's a URL-shortening app, we will need an API for generating the shorten link given an original link. And an API/Endpoint which will accept the shorten link and redirect to original URL. We are not including the user aspect of the app to keep things minimal. These two API should make app functional and usable by anyone. ### 2. How to shorten? -Given a url, we will need to generate a shortened version of it. One approach could be using random characters for each link. Another thing that can be done is to use some sort of hashing algorithm. The benefit here is we will reuse the same hash for the same link. ie: if lot of people are shortening `https://www.linkedin.com` they all will have the same value, compared to multiple entries in DB if chosen random characters. +Given a URL, we will need to generate a shortened version of it. One approach could be using random characters for each link. Another thing that can be done is to use some sort of hashing algorithm. The benefit here is we will reuse the same hash for the same link. Ie: if lot of people are shortening `https://www.linkedin.com`, they all will have the same value, compared to multiple entries in DB if chosen random characters. 
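+
+A minimal sketch of this hashing idea in Python, assuming we pick MD5 (as the reference code later in this module does) and keep only a short prefix of the hex digest; the `shorten()` name and the 7-character prefix are illustrative choices, not part of the design:
+
+```python
+from hashlib import md5
+
+def shorten(url: str, length: int = 7) -> str:
+    # The same URL always hashes to the same digest, so the short code is deterministic.
+    return md5(url.encode("utf-8")).hexdigest()[:length]
+
+print(shorten("https://www.linkedin.com"))  # same input, same short code, every time
+```
+
+Two different URLs can still end up with the same prefix, which is exactly the collision concern discussed next.
+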
-What about hash collisions? Even in random characters approach, though there is a less probability, hash collisions can happen. And we need to be mindful of them. In that case we might want to prepend/append the string with some random value to avoid conflict. +What about hash collisions? Even in random characters approach, though there is a less probability, hash collisions can happen. And we need to be mindful of them. In that case, we might want to prepend/append the string with some random value to avoid conflict. Also, choice of hash algorithm matters. We will need to analyze algorithms. Their CPU requirements and their characteristics. Choose one that suits the most. ### 3. Is URL Valid? -Given a URL to shorten, how do we verify if the URL is valid? Do we even verify or validate? One basic check that can be done is see if the URL matches a regex of a URL. To go even further we can try opening/visiting the URL. But there are certain gotchas here. +Given a URL to shorten, how do we verify if the URL is valid? Do we even verify or validate? One basic check that can be done is see if the URL matches a regex of a URL. To go even further, we can try opening/visiting the URL. But there are certain gotchas here. 1. We need to define success criteria. ie: HTTP 200 means it is valid. 2. What if the URL is in private network? @@ -32,13 +32,12 @@ Finally, storage. Where will we store the data that we will generate over time? ### 5. Other -We are not accounting for users into our app and other possible features like rate limiting, customized links etc but it will eventually come up with time. Depending on the requirements, they too might need to get incorporated. +We are not accounting for users into our app and other possible features like rate limiting, customized links, etc. but it will eventually come up with time. Depending on the requirements, they too might need to get incorporated. -The minimal working code is given below for reference but I'd encourage you to come up with your own. +The minimal working code is given below for reference, but I'd encourage you to come up with your own. ```python from flask import Flask, redirect, request - from hashlib import md5 app = Flask("url_shortener") diff --git a/courses/level101/security/conclusion.md b/courses/level101/security/conclusion.md index 8eeceead..ef4e537a 100644 --- a/courses/level101/security/conclusion.md +++ b/courses/level101/security/conclusion.md @@ -8,18 +8,23 @@ This course provides fundamental everyday knowledge on security domain which wil Some books that would be a great resource -- Holistic Info-Sec for Web Developers - Free and downloadable book series with very broad and deep coverage of what Web Developers and DevOps Engineers need to know in order to create robust, reliable, maintainable and secure software, networks and other, that are delivered continuously, on time, with no nasty surprises -- Docker Security - Quick Reference: For DevOps Engineers - A book on understanding the Docker security defaults, how to improve them (theory and practical), along with many tools and techniques. -- How to Hack Like a Legend - A hacker’s tale breaking into a secretive offshore company, Sparc Flow, 2018 -- How to Investigate Like a Rockstar - Live a real crisis to master the secrets of forensic analysis, Sparc Flow, 2017 -- Real World Cryptography - This early-access book teaches you applied cryptographic techniques to understand and apply security at every level of your systems and applications. 
-- AWS Security - This early-access book covers commong AWS security issues and best practices for access policies, data protection, auditing, continuous monitoring, and incident response. +- Holistic Info-Sec for Web Developers ()—Free and downloadable book series with very broad and deep coverage of what Web Developers and DevOps Engineers need to know in order to create robust, reliable, maintainable and secure software, networks and other, that are delivered continuously, on time, with no nasty surprises. + +- Docker Security: Quick Reference—For DevOps Engineers ()—A book on understanding the Docker security defaults, how to improve them (theory and practical), along with many tools and techniques. + +- How to Hack Like a Legend ()—A hacker’s tale breaking into a secretive offshore company, Sparc Flow, 2018 + +- How to Investigate Like a Rockstar ()—Live a real crisis to master the secrets of forensic analysis, Sparc Flow, 2017 + +- Real World Cryptography ()—This early-access book teaches you applied cryptographic techniques to understand and apply security at every level of your systems and applications. + +- AWS Security ()—This early-access book covers common AWS security issues and best practices for access policies, data protection, auditing, continuous monitoring, and incident response. ## Post Training asks/ Further Reading -- CTF Events like : -- Penetration Testing : -- Threat Intelligence : -- Threat Detection & Hunting : +- CTF Events like: +- Penetration Testing: +- Threat Intelligence: +- Threat Detection & Hunting: - Web Security: -- Building Secure and Reliable Systems : +- Building Secure and Reliable Systems: diff --git a/courses/level101/security/fundamentals.md b/courses/level101/security/fundamentals.md index 5c82e679..18ac213d 100644 --- a/courses/level101/security/fundamentals.md +++ b/courses/level101/security/fundamentals.md @@ -13,23 +13,23 @@ - They have quite a big role in System design & hence are quite sometimes the first line of defence. - SRE’s help in preventing bad design & implementations which can affect the overall security of the infrastructure. - Successfully designing, implementing, and maintaining systems requires a commitment to **the full system lifecycle**. This commitment is possible only when security and reliability are central elements in the architecture of systems. -- Core Pillars of Information Security : - - **Confidentiality** – only allow access to data for which the user is permitted - - **Integrity** – ensure data is not tampered or altered by unauthorized users - - **Availability** – ensure systems and data are available to authorized users when they need it +- Core Pillars of Information Security: + - **Confidentiality**—only allow access to data for which the user is permitted + - **Integrity**—ensure data is not tampered or altered by unauthorized users + - **Availability**—ensure systems and data are available to authorized users when they need it -- Thinking like a Security Engineer - - When starting a new application or re-factoring an existing application, you should consider each functional feature, and consider: - - Is the process surrounding this feature as safe as possible? In other words, is this a flawed process? - - If I were evil, how would I abuse this feature? Or more specifically failing to address how a feature can be abused can cause design flaws. - - Is the feature required to be on by default? If so, are there limits or options that could help reduce the risk from this feature? 
+- Thinking like a Security Engineer: + - When starting a new application or re-factoring an existing application, you should consider each functional feature, and consider: + - Is the process surrounding this feature as safe as possible? In other words, is this a flawed process? + - If I were evil, how would I abuse this feature? Or more specifically failing to address how a feature can be abused can cause design flaws. + - Is the feature required to be on by default? If so, are there limits or options that could help reduce the risk from this feature? - Security Principles By OWASP (Open Web Application Security Project) - - Minimize attack surface area : + - Minimize attack surface area: - Every feature that is added to an application adds a certain amount of risk to the overall application. The aim of secure development is to reduce the overall risk by reducing the attack surface area. - For example, a web application implements online help with a search function. The search function may be vulnerable to SQL injection attacks. If the help feature was limited to authorized users, the attack likelihood is reduced. If the help feature’s search function was gated through centralized data validation routines, the ability to perform SQL injection is dramatically reduced. However, if the help feature was re-written to eliminate the search function (through a better user interface, for example), this almost eliminates the attack surface area, even if the help feature was available to the Internet at large. - Establish secure defaults: - - There are many ways to deliver an “out of the box” experience for users. However, by default, the experience should be secure, and it should be up to the user to reduce their security – if they are allowed. + - There are many ways to deliver an “out of the box” experience for users. However, by default, the experience should be secure, and it should be up to the user to reduce their security—if they are allowed. - For example, by default, password ageing and complexity should be enabled. Users might be allowed to turn these two features off to simplify their use of the application and increase their risk. - Default Passwords of routers, IoT devices should be changed - Principle of Least privilege @@ -41,20 +41,17 @@ - For example, a flawed administrative interface is unlikely to be vulnerable to an anonymous attack if it correctly gates access to production management networks, checks for administrative user authorization, and logs all access. - Fail securely - Applications regularly fail to process transactions for many reasons. How they fail can determine if an application is secure or not. - - ``` - - is_admin = true; - try { - code_which_may_faile(); - is_admin = is_user_assigned_role("Adminstrator"); - } - catch (Exception err) { - log.error(err.toString()); - } - - ``` - - If either codeWhichMayFail() or isUserInRole fails or throws an exception, the user is an admin by default. This is obviously a security risk. +

+      is_admin = true;
+      try {
+        code_which_may_fail();
+        is_admin = is_user_assigned_role("Administrator");
+      }
+      catch (Exception err) {
+        log.error(err.toString());
+      }
+      
+    - If either `code_which_may_fail()` or `is_user_assigned_role()` fails or throws an exception, the user is an admin by default. This is obviously a security risk; a fail-secure version is sketched below.
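+    - A fail-secure version defaults to the least privilege and only elevates after the role check succeeds; a minimal sketch in Python, where `is_user_assigned_role()` is a stand-in for whatever role lookup the application actually uses:
+
+      ```python
+      import logging
+
+      log = logging.getLogger(__name__)
+
+      def is_user_assigned_role(role: str) -> bool:
+          # Stand-in for the real role lookup; assume it can raise when the backend is unavailable.
+          raise RuntimeError("role service unavailable")
+
+      is_admin = False  # default to least privilege
+      try:
+          is_admin = is_user_assigned_role("Administrator")
+      except Exception as err:
+          log.error(err)
+          # is_admin stays False, so a failure cannot grant admin rights
+      ```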
 
   - Don’t trust services
     - Many organizations utilize the processing capabilities of third-party partners, who more than likely have different security policies and posture than you. It is unlikely that you can influence or control any external third party, whether they are home users or major suppliers or partners.
@@ -63,7 +60,7 @@
   - Separation of duties
     - The key to fraud control is the separation of duties. For example, someone who requests a computer cannot also sign for it, nor should they directly receive the computer. This prevents the user from requesting many computers and claiming they never arrived.
     - Certain roles have different levels of trust than normal users. In particular, administrators are different from normal users. In general, administrators should not be users of the application.
-    - For example, an administrator should be able to turn the system on or off, set password policy but shouldn’t be able to log on to the storefront as a super privileged user, such as being able to “buy” goods on behalf of other users.
+    - For example, an administrator should be able to turn the system on or off, set password policy but shouldn't be able to log on to the storefront as a super privileged user, such as being able to "buy" goods on behalf of other users.
   - Avoid security by obscurity
     - Security through obscurity is a weak security control, and nearly always fails when it is the only control. This is not to say that keeping secrets is a bad idea, it simply means that the security of systems should not be reliant upon keeping details hidden.
     - For example, the security of an application should not rely upon knowledge of the source code being kept secret. The security should rely upon many other factors, including reasonable password policies, defence in depth, business transaction limits, solid network architecture, and fraud, and audit controls.
@@ -77,7 +74,7 @@
     - For example, a user has found that they can see another user’s balance by adjusting their cookie. The fix seems to be relatively straightforward, but as the cookie handling code is shared among all applications, a change to just one application will trickle through to all other applications. The fix must, therefore, be tested on all affected applications.
   - Reliability & Security
     - Reliability and security are both crucial components of a truly trustworthy system, but building systems that are both reliable and secure is difficult. While the requirements for reliability and security share many common properties, they also require different design considerations. It is easy to miss the subtle interplay between reliability and security that can cause unexpected outcomes
-    - Ex: A password management application failure was triggered by a reliability problem i.e poor load-balancing and load-shedding strategies and its recovery were later complicated by multiple measures (HSM mechanism which needs to be plugged into server racks, which works as an authentication & the HSM token supposedly locked inside a case.. & the problem can be further elongated ) designed to increase the security of the system.
+    - Ex: A password management application failure was triggered by a reliability problem, i.e. poor load-balancing and load-shedding strategies, and its recovery was later complicated by multiple measures designed to increase the security of the system (an HSM that has to be physically plugged into server racks to act as an authenticator, with the HSM token supposedly locked inside a case, which can prolong the outage even further).
 
 ---
 
@@ -99,7 +96,7 @@
 
 ***OpenID*** is an authentication protocol that allows us to authenticate users without using a local auth system. In such a scenario, a user has to be registered with an OpenID Provider and the same provider should be integrated with the authentication flow of your application. To verify the details, we have to forward the authentication requests to the provider. On successful authentication, we receive a success message and/or profile details with which we can execute the necessary flow.
 
-***OAuth*** is an authorization mechanism that allows your application user access to a provider(Gmail/Facebook/Instagram/etc). On successful response, we (your application) receive a token with which the application can access certain APIs on behalf of a user. OAuth is convenient in case your business use case requires some certain user-facing APIs like access to Google Drive or sending tweets on your behalf. Most OAuth 2.0 providers can be used for pseudo authentication. Having said that, it can get pretty complicated if you are using multiple OAuth providers to authenticate users on top of the local authentication system.
+***OAuth*** is an authorization mechanism that allows a user to grant your application access to a provider (Gmail/Facebook/Instagram, etc.). On a successful response, we (your application) receive a token with which the application can access certain APIs on behalf of the user. OAuth is convenient when your business use case requires access to certain user-facing APIs, like access to Google Drive or sending tweets on the user's behalf. Most OAuth 2.0 providers can be used for pseudo authentication. Having said that, it can get pretty complicated if you are using multiple OAuth providers to authenticate users on top of the local authentication system.
 
 ---
 
@@ -129,13 +126,13 @@ D(k,E(k,m)) = m
 
 Stream Ciphers:
 
-- The message is broken into characters or bits and enciphered with a key or keystream(should be random and generated independently of the message stream) that is as long as the plaintext bitstream.
+- The message is broken into characters or bits and enciphered with a key or keystream (should be random and generated independently of the message stream) that is as long as the plaintext bitstream.
 - If the keystream is random, this scheme would be unbreakable unless the keystream was acquired, making it unconditionally secure. The keystream must be provided to both parties in a secure way to prevent its release.
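+
+A toy illustration of the idea in Python (not production cryptography): generate a random keystream as long as the message and XOR it with the plaintext; XORing the ciphertext with the same keystream recovers the message.
+
+```python
+import secrets
+
+message = b"attack at dawn"
+keystream = secrets.token_bytes(len(message))  # random, and as long as the plaintext
+
+ciphertext = bytes(m ^ k for m, k in zip(message, keystream))
+recovered = bytes(c ^ k for c, k in zip(ciphertext, keystream))
+
+assert recovered == message
+```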
 
 Block Ciphers:
 
-- Block ciphers — process messages in blocks, each of which is then encrypted or decrypted.
-- A block cipher is a symmetric cipher in which blocks of plaintext are treated as a whole and used to produce ciphertext blocks. The block cipher takes blocks that are b bits long and encrypts them to blocks that are also b bits long. Block sizes are typically 64 or 128 bits long. 
+- Block ciphers—process messages in blocks, each of which is then encrypted or decrypted.
+- A block cipher is a symmetric cipher in which blocks of plaintext are treated as a whole and used to produce ciphertext blocks. The block cipher takes blocks that are *b* bits long and encrypts them to blocks that are also *b* bits long. Block sizes are typically 64 or 128 bits long. 
 
     ![image5](images/image5.png)
     ![image6](images/image6.png)
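+
+A small example of the block structure using AES-128 (128-bit blocks), assuming the third-party `cryptography` package is installed; the key, IV and plaintext are throwaway values for illustration:
+
+```python
+import os
+
+from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
+
+key = os.urandom(16)             # 128-bit key
+iv = os.urandom(16)              # one block worth of IV
+plaintext = b"16 byte block!!!"  # exactly one 128-bit block
+
+encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
+ciphertext = encryptor.update(plaintext) + encryptor.finalize()
+
+decryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
+assert decryptor.update(ciphertext) + decryptor.finalize() == plaintext
+
+print(len(plaintext), len(ciphertext))  # blocks in, blocks out: both 16 bytes
+```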
@@ -143,7 +140,7 @@ Block Ciphers:
 Encryption
 
 - **Secret Key (Symmetric Key)**: the same key is used for encryption and decryption
-- **Public Key (Asymmetric Key)** in an asymmetric, the encryption and decryption keys are different but related. The encryption key is known as the public key and the decryption key is known as the private key. The public and private keys are known as a key pair.
+- **Public Key (Asymmetric Key)**: in an asymmetric scheme, the encryption and decryption keys are different but related. The encryption key is known as the public key and the decryption key is known as the private key. The public and private keys are known as a key pair.
 
 Symmetric Key Encryption
 
@@ -181,7 +178,7 @@ Asymmetric Key Algorithm
 
 Diffie-Hellman
 
-- The protocol has two system parameters, p and g. They are both public and may be used by everybody. Parameter p is a prime number, and parameter g (usually called a generator) is an integer that is smaller than p, but with the following property: For every number n between 1 and p – 1 inclusive, there is a power k of g such that n = gk mod p.
+- The protocol has two system parameters, *p* and *g*. They are both public and may be used by everybody. Parameter *p* is a prime number, and parameter *g* (usually called a generator) is an integer that is smaller than *p*, but with the following property: for every number *n* between 1 and p – 1 inclusive, there is a power *k* of *g* such that `n = g^k mod p`.
 - Diffie Hellman algorithm is an asymmetric algorithm used to establish a shared secret for a symmetric key algorithm. Nowadays most of the people use hybrid cryptosystem i.e, a combination of symmetric and asymmetric encryption. Asymmetric Encryption is used as a technique in key exchange mechanism to share a secret key and after the key is shared between sender and receiver, the communication will take place using symmetric encryption. The shared secret key will be used to encrypt the communication.
 - Refer: 
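+
+A toy walk-through of the exchange in Python, using deliberately tiny, insecure numbers so the arithmetic is visible (real deployments use large primes or elliptic curves):
+
+```python
+p, g = 23, 5                       # public parameters: a small prime and a generator (toy values)
+
+a, b = 6, 15                       # Alice's and Bob's private keys
+
+A = pow(g, a, p)                   # Alice sends g^a mod p
+B = pow(g, b, p)                   # Bob sends g^b mod p
+
+shared_alice = pow(B, a, p)        # (g^b)^a mod p
+shared_bob = pow(A, b, p)          # (g^a)^b mod p
+
+assert shared_alice == shared_bob  # both sides derive the same shared secret
+print(shared_alice)
+```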
 
@@ -199,8 +196,8 @@ Hashing Algorithms
 - A hash function, which is a one-way function to input data to produce a fixed-length digest (fingerprint) of output data. The digest is cryptographically strong; that is, it is impossible to recover input data from its digest. If the input data changes just a little, the digest (fingerprint) changes substantially in what is called an avalanche effect.
 
 - More:
-  - 
-  - 
+    - 
+    - 
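+
+A quick way to see the avalanche effect described above, using Python's standard `hashlib` (SHA-256 here is an arbitrary choice for illustration):
+
+```python
+from hashlib import sha256
+
+print(sha256(b"school of sre").hexdigest())
+print(sha256(b"school of srf").hexdigest())  # one character changed, the digest changes completely
+```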
 
 MD5
 
@@ -269,6 +266,7 @@ The major features and guarantees of the SSH protocol are:
   - a server: Kerberos protected hosts reside
 
     ![image10](images/image10.png)
+    
   - a Key Distribution Center (KDC), which acts as the trusted third-party authentication service.
 
 The KDC includes the following two servers:
@@ -281,10 +279,10 @@ The KDC includes the following two servers:
 
 ### Certificate Chain
 
-The first part of the output of the OpenSSL command shows three certificates numbered 0, 1, and 2(not 2 anymore). Each certificate has a subject, s, and an issuer, i. The first certificate, number 0, is called the end-entity certificate. The subject line tells us it’s valid for any subdomain of google.com because its subject is set to *.google.com. 
-
+The first part of the output of the OpenSSL command shows the certificates of the chain, here numbered 0 and 1 (older outputs also included a certificate number 2). Each certificate has a subject, *s*, and an issuer, *i*. The first certificate, number 0, is called the end-entity certificate. The subject line tells us it’s valid for any subdomain of `google.com` because its subject is set to `*.google.com`.
 
-`$ openssl s_client -connect www.google.com:443 -CApath /etc/ssl/certs
+```shell
+$ openssl s_client -connect www.google.com:443 -CApath /etc/ssl/certs
 CONNECTED(00000005)
 depth=2 OU = GlobalSign Root CA - R2, O = GlobalSign, CN = GlobalSign
 verify return:1
@@ -298,8 +296,10 @@ Certificate chain
    i:/C=US/O=Google Trust Services/CN=GTS CA 1O1
  1 s:/C=US/O=Google Trust Services/CN=GTS CA 1O1
    i:/OU=GlobalSign Root CA - R2/O=GlobalSign/CN=GlobalSign
----`
-`Server certificate`
+---
+```
+
+**Server certificate**
 
- The issuer line indicates it’s issued by GTS CA 1O1, which also happens to be the subject of the second certificate, number 1.
 - What the OpenSSL command line doesn’t show here is the trust store that contains the list of CA certificates trusted by the system OpenSSL runs on.
@@ -313,15 +313,15 @@ Certificate chain
 
 1. The client sends a HELLO message to the server with a list of protocols and algorithms it supports.
 2. The server says HELLO back and sends its chain of certificates. Based on the capabilities of the client, the server picks a cipher suite.
-3. If the cipher suite supports ephemeral key exchange, like ECDHE does(ECDHE is an algorithm known as the Elliptic Curve Diffie-Hellman Exchange), the server and the client negotiate a pre-master key with the Diffie-Hellman algorithm. The pre-master key is never sent over the wire.
+3. If the cipher suite supports ephemeral key exchange, like ECDHE does (ECDHE is an algorithm known as the Elliptic Curve Diffie-Hellman Exchange), the server and the client negotiate a pre-master key with the Diffie-Hellman algorithm. The pre-master key is never sent over the wire.
 4. The client and server create a session key that will be used to encrypt the data transiting through the connection.
 
-At the end of the handshake, both parties possess a secret session key used to encrypt data for the rest of the connection. This is what OpenSSL refers to as Master-Key
+At the end of the handshake, both parties possess a secret session key used to encrypt data for the rest of the connection. This is what OpenSSL refers to as Master-Key.
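+
+You can observe the negotiated parameters yourself with Python's standard `ssl` module (`www.google.com` is just an example host):
+
+```python
+import socket
+import ssl
+
+ctx = ssl.create_default_context()
+with socket.create_connection(("www.google.com", 443)) as sock:
+    with ctx.wrap_socket(sock, server_hostname="www.google.com") as tls:
+        print(tls.version())    # negotiated protocol version, e.g. TLSv1.3
+        print(tls.cipher())     # (cipher suite, protocol, secret bits)
+        cert = tls.getpeercert()
+        print(cert["subject"])  # the end-entity certificate's subject
+        print(cert["issuer"])   # its issuer, the next link in the chain
+```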
 
 **NOTE**
 
-- There are 3 versions of TLS , TLS 1.0, 1.1 & 1.2
-- TLS 1.0 was released in 1999, making it a nearly two-decade-old protocol. It has been known to be vulnerable to attacks—such as BEAST and POODLE—for years, in addition to supporting weak cryptography, which doesn’t keep modern-day connections sufficiently secure.
+- There are 3 versions of TLS: TLS 1.0, 1.1 & 1.2.
+- TLS 1.0 was released in 1999, making it a nearly two-decade-old protocol. It has been known to be vulnerable to attacks—such as BEAST and POODLE—for years, in addition to supporting weak cryptography, which doesn’t keep modern-day connections sufficiently secure.
 - TLS 1.1 is the forgotten “middle child.” It also has bad cryptography like its younger sibling. In most software, it was leapfrogged by TLS 1.2 and it’s rare to see TLS 1.1 used.
 
 ### “Perfect” Forward Secrecy
@@ -331,4 +331,4 @@ At the end of the handshake, both parties possess a secret session key used to e
 - An ephemeral key exchange like DHE, or its variant on elliptic curve, ECDHE, solves this problem by not transmitting the pre-master key over the wire. Instead, the pre-master key is computed by both the client and the server in isolation, using nonsensitive information exchanged publicly. Because the pre-master key can’t be decrypted later by an attacker, the session key is safe from future attacks: hence, the term perfect forward secrecy.
 - Keys are changed every X blocks along the stream. That prevents an attacker from simply sniffing the stream and applying brute force to crack the whole thing. "Forward secrecy" means that just because I can decrypt block M, does not mean that I can decrypt block Q
 - Downside:
-  - The downside to PFS is that all those extra computational steps induce latency on the handshake and slow the user down. To avoid repeating this expensive work at every connection, both sides cache the session key for future use via a technique called session resumption. This is what the session-ID and TLS ticket are for: they allow a client and server that share a session ID to skip over the negotiation of a session key, because they already agreed on one previously, and go directly to exchanging data securely.
+    - The downside to PFS is that all those extra computational steps induce latency on the handshake and slow the user down. To avoid repeating this expensive work at every connection, both sides cache the session key for future use via a technique called session resumption. This is what the session-ID and TLS ticket are for: they allow a client and server that share a session ID to skip over the negotiation of a session key, because they already agreed on one previously, and go directly to exchanging data securely.
diff --git a/courses/level101/security/intro.md b/courses/level101/security/intro.md
index e8415cc8..dd585c57 100644
--- a/courses/level101/security/intro.md
+++ b/courses/level101/security/intro.md
@@ -9,7 +9,7 @@
 
 ## What to expect from this course
 
-The course covers fundamentals of information security along with touching on subjects of system security, network & web security. This course aims to get you familiar with the basics of information security in day to day operations & then as an SRE develop the mindset of ensuring that security takes a  front-seat while developing solutions. The course also serves as an introduction to common risks and best practices along with practical ways to find out vulnerable systems and loopholes which might become compromised if not secured.
+The course covers fundamentals of information security along with touching on subjects of system security, network & web security. This course aims to get you familiar with the basics of information security in day-to-day operations and then as an SRE develop the mindset of ensuring that security takes a front-seat while developing solutions. The course also serves as an introduction to common risks and best practices along with practical ways to find out vulnerable systems and loopholes which might become compromised if not secured.
 
 
 ## What is not covered under this course
diff --git a/courses/level101/security/network_security.md b/courses/level101/security/network_security.md
index 2ddf2f6d..10503a9f 100644
--- a/courses/level101/security/network_security.md
+++ b/courses/level101/security/network_security.md
@@ -6,7 +6,7 @@
 - The OSI model is a seven-layer architecture. The OSI architecture is similar to the TCP/IP architecture, except that the OSI model specifies two additional layers between the application layer and the transport layer in the TCP/IP architecture. These two layers are the presentation layer and the session layer. Figure 5.1 shows the relationship between the TCP/IP layers and the OSI layers. The application layer in TCP/IP corresponds to the application layer and the presentation layer in OSI. The transport layer in TCP/IP corresponds to the session layer and the transport layer in OSI. The remaining three layers in the TCP/IP architecture are one-to-one correspondent to the remaining three layers in the OSI model.
 
     ![image14](images/image14.png)
-    Correspondence between layers of the TCP/IP architecture and the OSI model. Also shown are placements of cryptographic algorithms in network layers, where the dotted arrows indicate actual communications of cryptographic algorithms
+    Correspondence between layers of the TCP/IP architecture and the OSI model. Also shown are placements of cryptographic algorithms in network layers, where the _dotted arrows_ indicate actual communications of cryptographic algorithms
 
 The functionalities of OSI layers are briefly described as follows:
 
@@ -15,7 +15,7 @@ The functionalities of OSI layers are briefly described as follows:
 3. The session layer is responsible for creating, managing, and closing a communication connection.
 4. The transport layer is responsible for providing reliable connections, such as packet sequencing, traffic control, and congestion control.
 5. The network layer is responsible for routing device-independent data packets from the current hop to the next hop.
-6. The data-link layer is responsible for encapsulating device-independent data packets into device-dependent data frames. It has two sublayers: logical link control and media access control.
+6. The data-link layer is responsible for encapsulating device-independent data packets into device-dependent data frames. It has two sublayers: logical link control (LLC) and media access control (MAC).
 7. The physical layer is responsible for transmitting device-dependent frames through some physical media.
 
 - Starting from the application layer, data generated from an application program is passed down layer-by-layer to the physical layer. Data from the previous layer is enclosed in a new envelope at the current layer, where the data from the previous layer is also just an envelope containing the data from the layer before it. This is similar to enclosing a smaller envelope in a larger one. The envelope added at each layer contains sufficient information for handling the packet. Application-layer data are divided into blocks small enough to be encapsulated in an envelope at the next layer.
@@ -48,13 +48,13 @@ The functionalities of OSI layers are briefly described as follows:
 ### PGP & S/MIME : Email Security
 
 - There are several security protocols at the application layer. The most used of these protocols are email security protocols namely PGP and S/MIME.
-- SMTP (“Simple Mail Transfer Protocol”) is used for sending and delivering from a client to a server via port 25: it’s the outgoing server. On the contrary, POP (“Post Office Protocol”) allows the users to pick up the message and download it into their inbox: it’s the incoming server. The latest version of the Post Office Protocol is named POP3, and it’s been used since 1996; it uses port 110
+- SMTP (“Simple Mail Transfer Protocol”) is used for sending and delivering from a client to a server via port 25: it’s the outgoing server. On the contrary, POP (“Post Office Protocol”) allows the users to pick up the message and download it into their inbox: it’s the incoming server. The latest version of the Post Office Protocol is named POP3, and it’s been used since 1996; it uses port 110.
 
 PGP
 
 - PGP implements all major cryptographic algorithms, the ZIP compression algorithm, and the Base64 encoding algorithm.
 - It can be used to authenticate a message, encrypt a message, or both. PGP follows the following general process: authentication, ZIP compression, encryption, and Base64 encoding.
-- The Base64 encoding procedure makes the message ready for SMTP transmission
+- The Base64 encoding procedure makes the message ready for SMTP transmission.
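+
+For example, Python's standard `base64` module turns arbitrary binary data into 7-bit-safe ASCII, which is why PGP applies it as the final step before handing a message to SMTP (the bytes below are arbitrary):
+
+```python
+import base64
+
+encrypted = bytes([0x8f, 0x00, 0xff, 0x10])  # arbitrary binary output, not printable ASCII
+wire_safe = base64.b64encode(encrypted)      # b'jwD/EA==': plain ASCII, safe for 7-bit SMTP transport
+assert base64.b64decode(wire_safe) == encrypted
+print(wire_safe.decode("ascii"))
+```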
 
 GPG (GnuPG)
 
@@ -65,25 +65,25 @@ GPG (GnuPG)
 
 S/MIME
 
-- SMTP can only handle 7-bit ASCII text (You can use UTF-8 extensions to alleviate these limitations, ) messages. While POP can handle other content types besides 7-bit ASCII, POP may, under a common default setting, download all the messages stored in the mail server to the user's local computer. After that, if POP removes these messages from the mail server. This makes it difficult for the users to read their messages from multiple computers.
+- SMTP can only handle 7-bit ASCII text messages (you can use UTF-8 extensions to alleviate these limitations). While POP can handle other content types besides 7-bit ASCII, POP may, under a common default setting, download all the messages stored in the mail server to the user's local computer and then remove them from the mail server, which makes it difficult for the users to read their messages from multiple computers.
 - The Multipurpose Internet Mail Extension protocol (MIME) was designed to support sending and receiving email messages in various formats, including nontext files generated by word processors, graphics files, sound files, and video clips. Moreover, MIME allows a single message to include mixed types of data in any combination of these formats.
-- The Internet Mail Access Protocol (IMAP), operated on TCP port 143(only for non-encrypted), stores (Configurable on both server & client just like PoP) incoming email messages in the mail server until the user deletes them deliberately. This allows the users to access their mailbox from multiple machines and download messages to a local machine without deleting it from the mailbox in the mail server.
+- The Internet Mail Access Protocol (IMAP), operated on TCP port 143 (only for non-encrypted connections), stores incoming email messages in the mail server until the user deletes them deliberately (this behaviour is configurable on both server & client, just like POP). This allows the users to access their mailbox from multiple machines and download messages to a local machine without deleting them from the mailbox in the mail server.
 
 SSL/TLS
 
 - SSL uses a PKI to decide if a server’s public key is trustworthy by requiring servers to use a security certificate signed by a trusted CA.
 - When Netscape Navigator 1.0 was released, it trusted a single CA operated by the RSA Data Security corporation.
 - The server’s public RSA keys were used to be stored in the security certificate, which can then be used by the browser to establish a secure communication channel. The security certificates we use today still rely on the same standard (named X.509) that Netscape Navigator 1.0 used back then.
-- Netscape intended to train users(though this didn’t work out later) to differentiate secure communications from insecure ones, so they put a lock icon next to the address bar. When the lock is open, the communication is insecure. A closed lock means communication has been secured with SSL, which required the server to provide a signed certificate. You’re obviously familiar with this icon as it’s been in every browser ever since. The engineers at Netscape truly created a standard for secure internet communications.
+- Netscape intended to train users (though this didn’t work out later) to differentiate secure communications from insecure ones, so they put a lock icon next to the address bar. When the lock is open, the communication is insecure. A closed lock means communication has been secured with SSL, which required the server to provide a signed certificate. You’re obviously familiar with this icon as it’s been in every browser ever since. The engineers at Netscape truly created a standard for secure Internet communications.
 - A year after releasing SSL 2.0, Netscape fixed several security issues and released SSL 3.0, a protocol that, albeit being officially deprecated since June 2015, remains in use in certain parts of the world more than 20 years after its introduction. To standardize SSL, the Internet Engineering Task Force (IETF) created a slightly modified SSL 3.0 and, in 1999, unveiled it as Transport Layer Security (TLS) 1.0. The name change between SSL and TLS continues to confuse people today. Officially, TLS is the new SSL, but in practice, people use SSL and TLS interchangeably to talk about any version of the protocol.
 
 - Must See:
-  - 
-  - 
+    - 
+    - 
 
-## Network  Perimeter Security
+## Network Perimeter Security
 
-Let us see how we keep a check on the perimeter i.e the edges, the first layer of protection
+Let us see how we keep a check on the perimeter, i.e. the edges, the first layer of protection.
 
 ### General Firewall Framework
 
@@ -96,28 +96,28 @@ Let us see how we keep a check on the perimeter i.e the edges, the first layer o
 ### Packet Filters
 
 - It inspects ingress packets coming to an internal network from outside and inspects egress packets going outside from an internal network
-- Packing filtering only inspects IP headers and TCP headers, not the payloads generated at the application layer
+- Packet-filtering only inspects IP headers and TCP headers, not the payloads generated at the application layer.
 - A packet-filtering firewall uses a set of rules to determine whether a packet should be allowed or denied to pass through.
 - 2 types:
-  - Stateless
-    - It treats each packet as an independent object, and it does not keep track of any previously processed packets. In other words, stateless filtering inspects a packet when it arrives and makes a decision without leaving any record of the packet being inspected.
+    - Stateless
+        - It treats each packet as an independent object, and it does not keep track of any previously processed packets. In other words, stateless filtering inspects a packet when it arrives and makes a decision without leaving any record of the packet being inspected.
 
-  - Stateful
-    - Stateful filtering, also referred to as connection-state filtering, keeps track of connections between an internal host and an external host. A connection state (or state, for short) indicates whether it is a TCP connection or a UDP connection and whether the connection is established.
+    - Stateful
+        - Stateful filtering, also referred to as connection-state filtering, keeps track of connections between an internal host and an external host. A connection state (or state, for short) indicates whether it is a TCP connection or a UDP connection and whether the connection is established.
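+
+Going back to the stateless case: a stateless filter boils down to matching each packet, independently, against an ordered rule list. A minimal sketch in Python (the rule fields and values are made up for illustration):
+
+```python
+RULES = [
+    # (protocol, destination port, action); first match wins
+    ("tcp", 22, "deny"),    # block inbound SSH from outside
+    ("tcp", 443, "allow"),  # allow HTTPS
+    ("udp", 53, "allow"),   # allow DNS
+]
+
+def filter_packet(protocol: str, dst_port: int, default: str = "deny") -> str:
+    # Each packet is evaluated on its own; no connection state is kept.
+    for rule_proto, rule_port, action in RULES:
+        if protocol == rule_proto and dst_port == rule_port:
+            return action
+    return default  # nothing matched: fall back to the default policy
+
+print(filter_packet("tcp", 443))   # allow
+print(filter_packet("tcp", 8080))  # deny (default)
+```
+
+A stateful filter would additionally keep a table of established connections and consult it before the rules.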
 
 ### Circuit Gateways
 
-- Circuit gateways, also referred to as circuit-level gateways, are typically operated at the transportation layer
+- Circuit gateways, also referred to as circuit-level gateways, are typically operated at the transportation layer.
 - They evaluate the information of the IP addresses and the port numbers contained in TCP (or UDP) headers and use it to determine whether to allow or to disallow an internal host and an external host to establish a connection.
 - It is common practice to combine packet filters and circuit gateways to form a dynamic packet filter (DPF).
 
-### Application Gateways(ALG)
+### Application Gateways (ALG)
 
 - Aka PROXY Servers
 - An Application Level Gateway (ALG) acts as a proxy for internal hosts, processing service requests from external clients.
 - An ALG performs deep inspections on each IP packet (ingress or egress).
 - In particular, an ALG inspects application program formats contained in the packet (e.g., MIME format or SQL format) and examines whether its payload is permitted.
-  - Thus, an ALG may be able to detect a computer virus contained in the payload. Because an ALG inspects packet payloads, it may be able to detect malicious code and quarantine suspicious packets, in addition to blocking packets with suspicious IP addresses and TCP ports. On the other hand, an ALG also incurs substantial computation and space overheads.
+    - Thus, an ALG may be able to detect a computer virus contained in the payload. Because an ALG inspects packet payloads, it may be able to detect malicious code and quarantine suspicious packets, in addition to blocking packets with suspicious IP addresses and TCP ports. On the other hand, an ALG also incurs substantial computation and space overheads.
 
 ### Trusted Systems & Bastion Hosts
 
@@ -128,8 +128,8 @@ Let us see how we keep a check on the perimeter i.e the edges, the first layer o
   - Its system management is appropriate.
 
 - Bastion Hosts
-  - Bastion hosts are computers with strong defence mechanisms. They often serve as host computers for implementing application gateways, circuit gateways, and other types of firewalls. A bastion host is operated on a trusted operating system that must not contain unnecessary functionalities or programs. This measure helps to reduce error probabilities and makes it easier to conduct security checks. Only those network application programs that are necessary, for example, SSH, DNS, SMTP, and authentication programs, are installed on a bastion host.
-  - Bastion hosts are also primarily used as controlled ingress points so that the security monitoring can focus more narrowly on actions happening at a single point closely.
+    - Bastion hosts are computers with strong defence mechanisms. They often serve as host computers for implementing application gateways, circuit gateways, and other types of firewalls. A bastion host is operated on a trusted operating system that must not contain unnecessary functionalities or programs. This measure helps to reduce error probabilities and makes it easier to conduct security checks. Only those network application programs that are necessary, for example, SSH, DNS, SMTP, and authentication programs, are installed on a bastion host.
+    - Bastion hosts are also primarily used as controlled ingress points so that security monitoring can focus more narrowly on actions happening at a single point.
 
 ---
 
@@ -138,9 +138,9 @@ Let us see how we keep a check on the perimeter i.e the edges, the first layer o
 ### Scanning Ports with Nmap
 
 - Nmap ("Network Mapper") is a free and open-source (license) utility for network discovery and security auditing.  Many systems and network administrators also find it useful for tasks such as network inventory, managing service upgrade schedules, and monitoring host or service uptime.
-- The best thing about Nmap is it’s free and open-source and is very flexible and versatile
+- The best thing about Nmap is it’s free and open-source and is very flexible and versatile.
 - Nmap is often used to determine alive hosts in a network, open ports on those hosts, services running on those open ports, and version identification of that service on that port.
-- More at http://scanme.nmap.org/
+- More at [http://scanme.nmap.org/](http://scanme.nmap.org/).
 
 ```
 nmap [scan type] [options] [target specification]
@@ -149,48 +149,48 @@ nmap [scan type] [options] [target specification]
 
 Nmap uses 6 different port states:
 
-- **Open** — An open port is one that is actively accepting TCP, UDP or SCTP connections. Open ports are what interests us the most because they are the ones that are vulnerable to attacks. Open ports also show the available services on a network.
-- **Closed** — A port that receives and responds to Nmap probe packets but there is no application listening on that port. Useful for identifying that the host exists and for OS detection.
-- **Filtered** — Nmap can’t determine whether the port is open because packet filtering prevents its probes from reaching the port. Filtering could come from firewalls or router rules. Often little information is given from filtered ports during scans as the filters can drop the probes without responding or respond with useless error messages e.g. destination unreachable.
-- **Unfiltered** — Port is accessible but Nmap doesn’t know if it is open or closed. Only used in ACK scan which is used to map firewall rulesets. Other scan types can be used to identify whether the port is open.
-- **Open/filtered** — Nmap is unable to determine between open and filtered. This happens when an open port gives no response. No response could mean that the probe was dropped by a packet filter or any response is blocked.
-- **Closed/filtered** — Nmap is unable to determine whether a port is closed or filtered. Only used in the IP ID idle scan.
+- **Open**—An open port is one that is actively accepting TCP, UDP or SCTP connections. Open ports are what interests us the most because they are the ones that are vulnerable to attacks. Open ports also show the available services on a network.
+- **Closed**—A port that receives and responds to Nmap probe packets but there is no application listening on that port. Useful for identifying that the host exists and for OS detection.
+- **Filtered**—Nmap can’t determine whether the port is open because packet filtering prevents its probes from reaching the port. Filtering could come from firewalls or router rules. Often little information is given from filtered ports during scans as the filters can drop the probes without responding or respond with useless error messages, e.g. destination unreachable.
+- **Unfiltered**—Port is accessible but Nmap doesn’t know if it is open or closed. Only used in ACK scan which is used to map firewall rulesets. Other scan types can be used to identify whether the port is open.
+- **Open/filtered**—Nmap is unable to determine between open and filtered. This happens when an open port gives no response. No response could mean that the probe was dropped by a packet filter or any response is blocked.
+- **Closed/filtered**—Nmap is unable to determine whether a port is closed or filtered. Only used in the IP ID idle scan.
 
 ### Types of Nmap Scan:
 
 1. TCP Connect
-   - TCP Connect scan completes the 3-way handshake.
-   - If a port is open, the operating system completes the TCP three-way handshake and the port scanner immediately closes the connection to avoid DOS. This is “noisy” because the services can log the sender IP address and might trigger Intrusion Detection Systems.
+    - TCP Connect scan completes the three-way handshake (a socket-level sketch follows this list).
+    - If a port is open, the operating system completes the TCP three-way handshake and the port scanner immediately closes the connection to avoid DOS. This is “noisy” because the services can log the sender IP address and might trigger Intrusion Detection Systems.
 2. UDP Scan
-   - This scan checks to see if any UDP ports are listening.
-   - Since UDP does not respond with a positive acknowledgement like TCP and only responds to an incoming UDP packet when the port is closed,
+    - This scan checks to see if any UDP ports are listening.
+    - Since UDP does not respond with a positive acknowledgement like TCP and only responds to an incoming UDP packet when the port is closed.
 
 3. SYN Scan
-   - SYN scan is another form of TCP scanning.
-   - This scan type is also known as “half-open scanning” because it never actually opens a full TCP connection.
-   - The port scanner generates a SYN packet. If the target port is open, it will respond with an SYN-ACK packet. The scanner host responds with an RST packet, closing the connection before the handshake is completed.
-   - If the port is closed but unfiltered, the target will instantly respond with an RST packet.
-   - SYN scan has the advantage that the individual services never actually receive a connection.
+    - SYN scan is another form of TCP scanning.
+    - This scan type is also known as “half-open scanning” because it never actually opens a full TCP connection.
+    - The port scanner generates a SYN packet. If the target port is open, it will respond with a SYN-ACK packet. The scanner host responds with an RST packet, closing the connection before the handshake is completed.
+    - If the port is closed but unfiltered, the target will instantly respond with an RST packet.
+    - SYN scan has the advantage that the individual services never actually receive a connection.
 
 4. FIN Scan
-   - This is a stealthy scan, like the SYN scan, but sends a TCP FIN packet instead.
+    - This is a stealthy scan, like the SYN scan, but sends a TCP FIN packet instead.
 
 5. ACK Scan
-   - Ack scanning determines whether the port is filtered or not.
-6. Null Scan
-   - Another very stealthy scan that sets all the TCP header flags to off or null.
-   - This is not normally a valid packet and some hosts will not know what to do with this.
+    - ACK scanning determines whether the port is filtered or not.
+6. NULL Scan
+    - Another very stealthy scan that sets all the TCP header flags to off or NULL.
+    - This is not normally a valid packet and some hosts will not know what to do with this.
 7. XMAS Scan
-   - Similar to the NULL scan except for all the flags in the TCP header is set to on
+    - Similar to the NULL scan except that all the flags in the TCP header are set to on.
 8. RPC Scan
-   - This special type of scan looks for machine answering to RPC (Remote Procedure Call) services
+    - This special type of scan looks for machines answering to RPC (Remote Procedure Call) services.
 9. IDLE Scan
-   - It is a super stealthy method whereby the scan packets are bounced off an external host.
-   - You don’t need to have control over the other host but it does have to set up and meet certain requirements. You must input the IP address of our “zombie” host and what port number to use. It is one of the more controversial options in Nmap since it only has a use for malicious attacks.
+    - It is a super stealthy method whereby the scan packets are bounced off an external host.
+    - You don’t need to have control over the other host, but it does have to be set up and meet certain requirements. You must input the IP address of our “zombie” host and what port number to use. It is one of the more controversial options in Nmap since it only has a use for malicious attacks.
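+
+The socket-level idea behind a TCP Connect scan (item 1 above) can be sketched with Python's standard `socket` module; the ports are placeholders, and you should only probe hosts you are authorized to scan (`scanme.nmap.org` is provided by the Nmap project for exactly that):
+
+```python
+import socket
+
+def tcp_connect_check(host: str, port: int, timeout: float = 1.0) -> bool:
+    # Attempt a full TCP three-way handshake; connect_ex() returns 0 when the port accepts.
+    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
+        sock.settimeout(timeout)
+        return sock.connect_ex((host, port)) == 0
+
+for port in (22, 80, 443):
+    state = "open" if tcp_connect_check("scanme.nmap.org", port) else "closed/filtered"
+    print(port, state)
+```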
 
 Scan Techniques
 
-A couple of scan techniques which can be used to gain more information about a system and its ports. You can read more at 
+There are a couple of other scan techniques which can be used to gain more information about a system and its ports. You can read more at .
 
 ### OpenVAS
 
@@ -198,12 +198,12 @@ A couple of scan techniques which can be used to gain more information about a s
 - OpenVAS is a framework of services and tools that provides a comprehensive and powerful vulnerability scanning and management package
 - OpenVAS, which is an open-source program, began as a fork of the once-more-popular scanning program, Nessus.
 - OpenVAS is made up of three main parts. These are:
-  - a regularly updated feed of Network Vulnerability Tests (NVTs);
-  - a scanner, which runs the NVTs; and
-  - an SQLite 3 database for storing both your test configurations and the NVTs’ results and configurations.
-  - 
+    - a regularly updated feed of Network Vulnerability Tests (NVTs);
+    - a scanner, which runs the NVTs; and
+    - an SQLite 3 database for storing both your test configurations and the NVTs’ results and configurations.
+    - 
 
-### WireShark
+### Wireshark
 
 - Wireshark is a protocol analyzer.
 - This means Wireshark is designed to decode not only packet bits and bytes but also the relations between packets and protocols.
@@ -211,103 +211,118 @@ A couple of scan techniques which can be used to gain more information about a s
 
 A simple demo of Wireshark
 
-1. Capture only udp packets:
-   - Capture filter = “udp”
+1. Capture only UDP packets:
+    - `Capture filter = “udp”`
 
-2. Capture only tcp packets
-   - Capture filter = “tcp”
+2. Capture only TCP packets:
+    - `Capture filter = “tcp”`
 
-3. TCP/IP 3 way Handshake
+3. TCP/IP three-way Handshake:

![image17](images/image17.png) 4. Filter by IP address: displays all traffic from IP, be it source or destination - - ip.addr == 192.168.1.1 + - `ip.addr == 192.168.1.1` + 5. Filter by source address: display traffic only from IP source - - ip.src == 192.168.0.1 + - `ip.src == 192.168.0.1` 6. Filter by destination: display traffic only form IP destination - - ip.dst == 192.168.0.1 + - `ip.dst == 192.168.0.1` 7. Filter by IP subnet: display traffic from subnet, be it source or destination - - ip.addr = 192.168.0.1/24 + - `ip.addr = 192.168.0.1/24` 8. Filter by protocol: filter traffic by protocol name - - dns - - http - - ftp - - arp - - ssh - - telnet - - icmp + - dns + - http + - ftp + - arp + - ssh + - telnet + - icmp 9. Exclude IP address: remove traffic from and to IP address - - !ip.addr ==192.168.0.1 + - `!ip.addr ==192.168.0.1` 10. Display traffic between two specific subnet - - ip.addr == 192.168.0.1/24 and ip.addr == 192.168.1.1/24 + - `ip.addr == 192.168.0.1/24 and ip.addr == 192.168.1.1/24` 11. Display traffic between two specific workstations - - ip.addr == 192.168.0.1 and ip.addr == 192.168.0.2 + - `ip.addr == 192.168.0.1 and ip.addr == 192.168.0.2` + 12. Filter by MAC - - eth.addr = 00:50:7f:c5:b6:78 + - `eth.addr = 00:50:7f:c5:b6:78` 13. Filter TCP port - - tcp.port == 80 + - `tcp.port == 80` + 14. Filter TCP port source - - tcp.srcport == 80 + - `tcp.srcport == 80` + 15. Filter TCP port destination - - tcp.dstport == 80 + - `tcp.dstport == 80` + 16. Find user agents - - http.user_agent contains Firefox - - !http.user_agent contains || !http.user_agent contains Chrome + - `http.user_agent contains Firefox` + - `!http.user_agent contains || !http.user_agent contains Chrome` + 17. Filter broadcast traffic - - !(arp or icmp or dns) + - `!(arp or icmp or dns)` + 18. Filter IP address and port - - tcp.port == 80 && ip.addr == 192.168.0.1 - -19. Filter all http get requests - - http.request -20. Filter all http get requests and responses - - http.request or http.response -21. Filter three way handshake - - tcp.flags.syn==1 or (tcp.seq==1 and tcp.ack==1 and tcp.len==0 and tcp.analysis.initial_rtt) + - `tcp.port == 80 && ip.addr == 192.168.0.1` + +19. Filter all HTTP GET requests + - `http.request` + +20. Filter all HTTP GET requests and responses + - `http.request or http.response` + +21. Filter three-way handshake + - `tcp.flags.syn==1 or (tcp.seq==1 and tcp.ack==1 and tcp.len==0 and tcp.analysis.initial_rtt)` + 22. Find files by type - - frame contains “(attachment|tar|exe|zip|pdf)” + - `frame contains “(attachment|tar|exe|zip|pdf)”` + 23. Find traffic based on keyword - - tcp contains facebook - - frame contains facebook + - `tcp contains facebook` + - `frame contains facebook` + 24. Detecting SYN Floods - - tcp.flags.syn == 1 and tcp.flags.ack == 0 + - `tcp.flags.syn == 1 and tcp.flags.ack == 0` + **Wireshark Promiscuous Mode** - - By default, Wireshark only captures packets going to and from the computer where it runs. By checking the box to run Wireshark in Promiscuous Mode in the Capture Settings, you can capture most of the traffic on the LAN. -### DumpCap + - By default, Wireshark only captures packets going to and from the computer where it runs. By checking the box to run Wireshark in Promiscuous Mode in the Capture Settings, you can capture most of the traffic on the LAN. -- Dumpcap is a network traffic dump tool. It captures packet data from a live network and writes the packets to a file. 
Dumpcap’s native capture file format is pcapng, which is also the format used by Wireshark. -- By default, Dumpcap uses the pcap library to capture traffic from the first available network interface and writes the received raw packet data, along with the packets’ time stamps into a pcapng file. The capture filter syntax follows the rules of the pcap library. -- The Wireshark command-line utility called 'dumpcap.exe' can be used to capture LAN traffic over an extended period of time. -- Wireshark itself can also be used, but dumpcap does not significantly utilize the computer's memory while capturing for long periods. +### Dumpcap -### DaemonLogger +- Dumpcap is a network traffic dump tool. It captures packet data from a live network and writes the packets to a file. Dumpcap’s native capture file format is `pcapng`, which is also the format used by Wireshark. +- By default, Dumpcap uses the `pcap` library to capture traffic from the first available network interface and writes the received raw packet data, along with the packets’ time stamps into a `pcapng` file. The capture filter syntax follows the rules of the `pcap` library. +- The Wireshark command-line utility called `dumpcap.exe` can be used to capture LAN traffic over an extended period of time. +- Wireshark itself can also be used, but Dumpcap does not significantly utilize the computer's memory while capturing for long periods. -- Daemonlogger is a packet logging application designed specifically for use in Network and Systems Management (NSM) environments. -- The biggest benefit Daemonlogger provides is that, like Dumpcap, it is simple to use for capturing packets. In order to begin capturing, you need only to invoke the command and specify an interface. - - daemonlogger –i eth1 - - This option, by default, will begin capturing packets and logging them to the current working directory. - - Packets will be collected until the capture file size reaches 2 GB, and then a new file will be created. This will continue indefinitely until the process is halted. +### DaemonLogger -### NetSniff-NG +- DaemonLogger is a packet logging application designed specifically for use in Network and Systems Management (NSM) environments. +- The biggest benefit DaemonLogger provides is that, like Dumpcap, it is simple to use for capturing packets. In order to begin capturing, you need only to invoke the command and specify an interface. + - `daemonlogger –i eth1` + - This option, by default, will begin capturing packets and logging them to the current working directory. + - Packets will be collected until the capture file size reaches 2 GB, and then a new file will be created. This will continue indefinitely until the process is halted. -- Netsniff-NG is a high-performance packet capture utility -- While the utilities we’ve discussed to this point rely on Libpcap for capture, Netsniff-NG utilizes zero-copy mechanisms to capture packets. This is done with the intent to support full packet capture over high throughput links. -- To begin capturing packets with Netsniff-NG, we have to specify an input and output. In most cases, the input will be a network interface, and the output will be a file or folder on disk. +### netsniff-ng - `netsniff-ng –i eth1 –o data.pcap` +- netsniff-ng is a high-performance packet capture utility +- While the utilities we’ve discussed to this point rely on `libpcap` for capture, netsniff-ng utilizes zero-copy mechanisms to capture packets. This is done with the intent to support full packet capture over high throughput links. 
+- To begin capturing packets with netsniff-ng, we have to specify an input and output. In most cases, the input will be a network interface, and the output will be a file or folder on disk. +```shell +netsniff-ng –i eth1 –o data.pcap +``` -### Netflow +### NetFlow -- NetFlow is a feature that was introduced on Cisco routers around 1996 that provides the ability to collect IP network traffic as it enters or exits an interface. By analyzing the data provided by NetFlow, a network administrator can determine things such as the source and destination of traffic, class of service, and the causes of congestion. A typical flow monitoring setup (using NetFlow) consists of three main components:[1] +- NetFlow is a feature that was introduced on Cisco routers around 1996 that provides the ability to collect IP network traffic as it enters or exits an interface. By analyzing the data provided by NetFlow, a network administrator can determine things such as the source and destination of traffic, class of service, and the causes of congestion. A typical flow monitoring setup (using NetFlow) consists of three main components:[1] - Flow exporter: aggregates packets into flows and exports flow records towards one or more flow collectors. - Flow collector: responsible for reception, storage and pre-processing of flow data received from a flow exporter. @@ -317,7 +332,7 @@ A simple demo of Wireshark ### IDS A security solution that detects security-related events in your environment but does not block them. -IDS sensors can be software and hardware-based used to collect and analyze the network traffic. These sensors are available in two varieties, network IDS and host IDS. +IDS sensors can be software- and hardware-based used to collect and analyze the network traffic. These sensors are available in two varieties, network IDS and host IDS. - A host IDS is a server-specific agent running on a server with a minimum of overhead to monitor the operating system. - A network IDS can be embedded in a networking device, a standalone appliance, or a module monitoring the network traffic. @@ -325,7 +340,7 @@ IDS sensors can be software and hardware-based used to collect and analyze the n Signature Based IDS - The signature-based IDS monitors the network traffic or observes the system and sends an alarm if a known malicious event is happening. -- It does so by comparing the data flow against a database of known attack patterns +- It does so by comparing the data flow against a database of known attack patterns. - These signatures explicitly define what traffic or activity should be considered as malicious. - Signature-based detection has been the bread and butter of network-based defensive security for over a decade, partially because it is very similar to how malicious activity is detected at the host level with antivirus utilities - The formula is fairly simple: an analyst observes a malicious activity, derives indicators from the activity and develops them into signatures, and then those signatures will alert whenever the activity occurs again. @@ -342,7 +357,7 @@ Policy-Based IDS Anomaly Based IDS -- The anomaly-based IDS looks for traffic that deviates from the normal, but the definition of what is a normal network traffic pattern is the tricky part +- The anomaly-based IDS looks for traffic that deviates from the normal, but the definition of what is a normal network traffic pattern is the tricky part. 
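+
+One informal way to get a feel for what "normal" traffic looks like, assuming you already have a representative capture on disk, is Wireshark's protocol hierarchy statistics. This is only a rough baseline sketch; `baseline.pcap` is a placeholder for your own capture file.
+
+```shell
+# Summarize the protocol mix in a capture; large deviations from the usual
+# baseline are a starting point for anomaly investigation.
+tshark -r baseline.pcap -q -z io,phs
+```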
- Two types of anomaly-based IDS exist: statistical and nonstatistical anomaly detection - Statistical anomaly detection learns the traffic patterns interactively over a period of time. - In the nonstatistical approach, the IDS has a predefined configuration of the supposedly acceptable and valid traffic patterns. @@ -355,7 +370,7 @@ Host-Based IDS & Network-Based IDS Honeypots -- The use of decoy machines to direct intruders' attention away from the machines under protection is a major technique to preclude intrusion attacks. Any device, system, directory, or file used as a decoy to lure attackers away from important assets and to collect intrusion or abusive behaviours is referred to as a honeypot. +- The use of decoy machines to direct intruders' attention away from the machines under protection is a major technique to preclude intrusion attacks. Any device, system, directory, or file used as a decoy to lure attackers away from important assets and to collect intrusion or abusive behaviors is referred to as a honeypot. - A honeypot may be implemented as a physical device or as an emulation system. The idea is to set up decoy machines in a LAN, or decoy directories/files in a file system and make them appear important, but with several exploitable loopholes, to lure attackers to attack these machines or directories/files, so that other machines, directories, and files can evade intruders' attentions. A decoy machine may be a host computer or a server computer. Likewise, we may also set up decoy routers or even decoy LANs. --- @@ -373,46 +388,45 @@ Honeypots IP Spoofing Detection Techniques - Direct TTL Probes - - In this technique we send a packet to a host of suspect spoofed IP that triggers reply and compares TTL with suspect packet; if the TTL in the reply is not the same as the packet being checked; it is a spoofed packet. - - - This Technique is successful when the attacker is in a different subnet from the victim. - ![image19](images/image19.png) + - In this technique, we send a packet to a host of suspect spoofed IP that triggers reply and compares TTL with suspect packet; if the TTL in the reply is not the same as the packet being checked; it is a spoofed packet. + - This Technique is successful when the attacker is in a different subnet from the victim. + ![image19](images/image19.png) - IP Identification Number. - Send a probe to the host of suspect spoofed traffic that triggers a reply and compares IP ID with suspect traffic. - - If IP IDs are not in the near value of packet being checked, suspect traffic is spoofed + - If IP IDs are not in the near value of packet being checked, suspect traffic is spoofed. - TCP Flow Control Method - - Attackers sending spoofed TCP packets will not receive the target’s SYN-ACK packets. - - Attackers cannot, therefore, be responsive to change in the congestion window size - - When the receiver still receives traffic even after a windows size is exhausted, most probably the packets are spoofed. + - Attackers sending spoofed TCP packets will not receive the target’s SYN-ACK packets. + - Attackers cannot, therefore, be responsive to change in the congestion window size. + - When the receiver still receives traffic even after a windows size is exhausted, most probably the packets are spoofed. 
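+
+A rough version of the direct TTL probe above can be assembled with standard tools. This is only a sketch of the idea, not a reliable detector; the capture file name and the IP address are placeholders.
+
+```shell
+# TTL values seen in the suspect traffic (suspect.pcap and 192.0.2.10 are placeholders).
+tshark -r suspect.pcap -Y "ip.src == 192.0.2.10" -T fields -e ip.ttl | sort | uniq -c
+
+# TTL in a direct probe reply; a large mismatch with the values above suggests spoofing.
+ping -c 3 192.0.2.10
+```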
### Covert Channel - A covert or clandestine channel can be best described as a pipe or communication channel between two entities that can be exploited by a process or application transferring information in a manner that violates the system's security specifications. - More specifically for TCP/IP, in some instances, covert channels are established, and data can be secretly passed between two end systems. - - Ex: ICMP resides at the Internet layer of the TCP/IP protocol suite and is implemented in all TCP/IP hosts. Based on the specifications of the ICMP Protocol, an ICMP Echo Request message should have an 8-byte header and a 56-byte payload. The ICMP Echo Request packet should not carry any data in the payload. However, these packets are often used to carry secret information. The ICMP packets are altered slightly to carry secret data in the payload. This makes the size of the packet larger, but no control exists in the protocol stack to defeat this behaviour. The alteration of ICMP packets allows intruders to program specialized client-server pairs. These small pieces of code export confidential information without alerting the network administrator. - - ICMP can be leveraged for more than data exfiltration. For eg. some C&C tools such as Loki used ICMP channel to establish encrypted interactive session back in 1996. - - - Deep packet inspection has since come a long way. A lot of IDS/IPS detect ICMP tunnelling. - - Check for echo responses that do not contain the same payload as request - - Check for the volume of ICMP traffic especially for volumes beyond an acceptable threshold + - Ex: ICMP resides at the Internet layer of the TCP/IP protocol suite and is implemented in all TCP/IP hosts. Based on the specifications of the ICMP Protocol, an ICMP Echo Request message should have an 8-byte header and a 56-byte payload. The ICMP Echo Request packet should not carry any data in the payload. However, these packets are often used to carry secret information. The ICMP packets are altered slightly to carry secret data in the payload. This makes the size of the packet larger, but no control exists in the protocol stack to defeat this behavior. The alteration of ICMP packets allows intruders to program specialized client-server pairs. These small pieces of code export confidential information without alerting the network administrator. + - ICMP can be leveraged for more than data exfiltration. For eg. some C&C tools such as Loki used ICMP channel to establish encrypted interactive session back in 1996. + + - Deep packet inspection has since come a long way. A lot of IDS/IPS detect ICMP tunnelling. + - Check for Echo responses that do not contain the same payload as request. + - Check for the volume of ICMP traffic especially for volumes beyond an acceptable threshold. ### IP Fragmentation Attack -- The TCP/IP protocol suite, or more specifically IP, allows the fragmentation of packets.(this is a feature & not a bug) +- The TCP/IP protocol suite, or more specifically IP, allows the fragmentation of packets. (this is a feature & not a bug) - IP fragmentation offset is used to keep track of the different parts of a datagram. -- The information or content in this field is used at the destination to reassemble the datagrams +- The information or content in this field is used at the destination to reassemble the datagrams. - All such fragments have the same Identification field value, and the fragmentation offset indicates the position of the current fragment in the context of the original packet. 
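+
+Fragmented datagrams are easy to spot in a capture using Wireshark's fragmentation fields; a small illustrative sketch (the capture file name is a placeholder):
+
+```shell
+# Show IP fragments: either the More Fragments bit is set or the fragment
+# offset is non-zero.
+tshark -r capture.pcap -Y "ip.flags.mf == 1 || ip.frag_offset > 0"
+```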
- Many access routers and firewalls do not perform packet reassembly. In normal operation, IP fragments do not overlap, but attackers can create artificially fragmented packets to mislead the routers or firewalls. Usually, these packets are small and almost impractical for end systems because of data and computational overhead. -- A good example of an IP fragmentation attack is the Ping of Death attack. The Ping of Death attack sends fragments that, when reassembled at the end station, create a larger packet than the maximum permissible length. +- A good example of an IP fragmentation attack is the Ping of Death (PoD) attack. The Ping of Death attack sends fragments that, when reassembled at the end station, create a larger packet than the maximum permissible length. TCP Flags - Data exchange using TCP does not happen until a three-way handshake has been completed. This handshake uses different flags to influence the way TCP segments are processed. - There are 6 bits in the TCP header that are often called flags. Namely: - - 6 different flags are part of the TCP header: Urgent pointer field (URG), Acknowledgment field (ACK), Push function (PSH), Reset the connection (RST), Synchronize sequence numbers (SYN), and the sender is finished with this connection (FIN). + - six different flags are part of the TCP header: Urgent pointer field (URG), Acknowledgment field (ACK), Push function (PSH), Reset the connection (RST), Synchronize sequence numbers (SYN), and the sender is finished with this connection (FIN). ![image20](images/image20.png) - Abuse of the normal operation or settings of these flags can be used by attackers to launch DoS attacks. This causes network servers or web servers to crash or hang. @@ -430,8 +444,8 @@ TCP Flags SYN FLOOD -- The timers (or lack of certain timers) in 3 way handshake are often used and exploited by attackers to disable services or even to enter systems. -- After step 2 of the three-way handshake, no limit is set on the time to wait after receiving a SYN. The attacker initiates many connection requests to the webserver of Company XYZ (almost certainly with a spoofed IP address). +- The timers (or lack of certain timers) in three-way handshake are often used and exploited by attackers to disable services or even to enter systems. +- After step 2 of the three-way handshake, no limit is set on the time-to-wait after receiving a SYN. The attacker initiates many connection requests to the webserver of Company XYZ (almost certainly with a spoofed IP address). - The SYN+ACK packets (Step 2) sent by the web server back to the originating source IP address are not replied to. This leaves a TCP session half-open on the webserver. Multiple packets cause multiple TCP sessions to stay open. - Based on the hardware limitations of the server, a limited number of TCP sessions can stay open, and as a result, the webserver refuses further connection establishments attempts from any host as soon as a certain limit is reached. These half-open connections need to be completed or timed out before new connections can be established. @@ -474,7 +488,7 @@ Mechanism: - The first task can be achieved in two ways: by injecting the code in the right address space or by using the existing code and modifying certain parameters slightly. The second task is a little more complex because the program's control flow needs to be modified to make the program jump to the dirty code. 
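+
+Before looking at the countermeasures below, it can help to check what protections a given Linux binary was actually built with. A minimal sketch, assuming a local binary called `./myprog` (a placeholder name):
+
+```shell
+# GNU_STACK flags of "RW" mean the stack is non-executable; "RWE" means
+# code injected onto the stack could be executed. The flags appear on the
+# line after the GNU_STACK header.
+readelf -l ./myprog | grep -A1 GNU_STACK
+```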
-CounterMeasure: +Counter Measure: - The most important approach is to have a concerted focus on writing correct code. - A second method is to make the data buffers (memory locations) address space of the program code non-executable. This type of address space makes it impossible to execute code, which might be infiltrated in the program's buffers during an attack. @@ -485,10 +499,10 @@ Address Resolution Protocol Spoofing - The Address Resolution Protocol (ARP) provides a mechanism to resolve, or map, a known IP address to a MAC sublayer address. - Using ARP spoofing, the cracker can exploit this hardware address authentication mechanism by spoofing the hardware address of Host B. Basically, the attacker can convince any host or network device on the local network that the cracker's workstation is the host to be trusted. This is a common method used in a switched environment. - - ARP spoofing can be prevented with the implementation of static ARP tables in all the hosts and routers of your network. Alternatively, you can implement an ARP server that responds to ARP requests on behalf of the target host. + - ARP spoofing can be prevented with the implementation of static ARP tables in all the hosts and routers of your network. Alternatively, you can implement an ARP server that responds to ARP requests on behalf of the target host. DNS Spoofing - DNS spoofing is the method whereby the hacker convinces the target machine that the system it wants to connect to is the machine of the cracker. - The cracker modifies some records so that name entries of hosts correspond to the attacker's IP address. There have been instances in which the complete DNS server was compromised by an attack. -- To counter DNS spoofing, the reverse lookup detects these attacks. The reverse lookup is a mechanism to verify the IP address against a name. The IP address and name files are usually kept on different servers to make compromise much more difficult +- To counter DNS spoofing, the reverse lookup detects these attacks. The reverse lookup is a mechanism to verify the IP address against a name. The IP address and name files are usually kept on different servers to make compromise much more difficult. diff --git a/courses/level101/security/threats_attacks_defences.md b/courses/level101/security/threats_attacks_defences.md index 145a17d7..43b156ba 100644 --- a/courses/level101/security/threats_attacks_defences.md +++ b/courses/level101/security/threats_attacks_defences.md @@ -7,12 +7,12 @@ - Since DNS responses are cached, a quick response can be provided for repeated translations. DNS negative queries are also cached, e.g., misspelt words, and all cached data periodically times out. Cache poisoning is an issue in what is known as pharming. This term is used to describe a hacker’s attack in which a website’s traffic is redirected to a bogus website by forging the DNS mapping. In this case, an attacker attempts to insert a fake address record for an Internet domain into the DNS. -If the server accepts the fake record, the cache is poisoned and subsequent requests for the address of the domain are answered with the address of a server controlled by the attacker. As long as the fake entry is cached by the server, browsers or e-mail servers will automatically go to the address provided by the compromised DNS server. -the typical time to live (TTL) for cached entries is a couple of hours, thereby permitting ample time for numerous users to be affected by the attack. 
+If the server accepts the fake record, the cache is poisoned and subsequent requests for the address of the domain are answered with the address of a server controlled by the attacker. As long as the fake entry is cached by the server, browsers or e-mail servers will automatically go to the address provided by the compromised DNS server. +The typical time-to-live (TTL) for cached entries is a couple of hours, thereby permitting ample time for numerous users to be affected by the attack. ### DNSSEC (Security Extension) -- The long-term solution to these DNS problems is authentication. If a resolver cannot distinguish between valid and invalid data in a response, then add source authentication to verify that the data received in response is equal to the data entered by the zone administrator +- The long-term solution to these DNS problems is authentication. If a resolver cannot distinguish between valid and invalid data in a response, then add source authentication to verify that the data received in response is equal to the data entered by the zone administrator. - DNS Security Extensions (DNSSEC) protects against data spoofing and corruption and provides mechanisms to authenticate servers and requests, as well as mechanisms to establish authenticity and integrity. - When authenticating DNS responses, each DNS zone signs its data using a private key. It is recommended that this signing be done offline and in advance. The query for a particular record returns the requested resource record set (RRset) and signature (RRSIG) of the requested resource record set. The resolver then authenticates the response using a public key, which is pre-configured or learned via a sequence of key records in the DNS hierarchy. - The goals of DNSSEC are to provide authentication and integrity for DNS responses without confidentiality or DDoS protection. @@ -20,7 +20,7 @@ the typical time to live (TTL) for cached entries is a couple of hours, thereby ### BGP - BGP stands for border gateway protocol. It is a routing protocol that exchanges routing information among multiple Autonomous Systems (AS) - - An Autonomous System is a collection of routers or networks with the same network policy usually under single administrative control. + - An Autonomous System is a collection of routers or networks with the same network policy usually under single administrative control. - BGP tells routers which hop to use in order to reach the destination network. - BGP is used for both communicating information among routers in an AS (interior) and between multiple ASes (exterior). @@ -31,28 +31,28 @@ the typical time to live (TTL) for cached entries is a couple of hours, thereby - BGP is responsible for finding a path to a destination router & the path it chooses should be the shortest and most reliable one. - This decision is done through a protocol known as Link state. With the link-state protocol, each router broadcasts to all other routers in the network the state of its links and IP subnets. Each router then receives information from the other routers and constructs a complete topology view of the entire network. The next-hop routing table is based on this topology view. - The link-state protocol uses a famous algorithm in the field of computer science, Dijkstra’s shortest path algorithm: - - We start from our router considering the path cost to all our direct neighbours. 
- - The shortest path is then taken - - We then re-look at all our neighbours that we can reach and update our link state table with the cost information. We then continue taking the shortest path until every router has been visited. + - We start from our router considering the path cost to all our direct neighbours. + - The shortest path is then taken + - We then re-look at all our neighbours that we can reach and update our link state table with the cost information. We then continue taking the shortest path until every router has been visited. ## BGP Vulnerabilities -- By corrupting the BGP routing table we are able to influence the direction traffic flows on the internet! This action is known as BGP hijacking. +- By corrupting the BGP routing table, we are able to influence the direction traffic flows on the Internet! This action is known as BGP hijacking. - Injecting bogus route advertising information into the BGP-distributed routing database by malicious sources, accidentally or routers can disrupt Internet backbone operations. - Blackholing traffic: - Blackhole route is a network route, i.e., routing table entry, that goes nowhere and packets matching the route prefix are dropped or ignored. Blackhole routes can only be detected by monitoring the lost traffic. - Blackhole routes are the best defence against many common viral attacks where the traffic is dropped from infected machines to/from command & control hosts. - Infamous BGP Injection attack on Youtube -- Ex: In 2008, Pakistan decided to block YouTube by creating a BGP route that led into a black hole. Instead, this routing information got transmitted to a hong kong ISP and from there accidentally got propagated to the rest of the world meaning millions were routed through to this black hole and therefore unable to access YouTube. -- Potentially, the greatest risk to BGP occurs in a denial of service attack in which a router is flooded with more packets than it can handle. Network overload and router resource exhaustion happen when the network begins carrying an excessive number of BGP messages, overloading the router control processors, memory, routing table and reducing the bandwidth available for data traffic. +- Ex: In 2008, Pakistan decided to block YouTube by creating a BGP route that led into a black hole. Instead, this routing information got transmitted to a Hong Kong ISP and from there accidentally got propagated to the rest of the world meaning millions were routed through to this black hole and therefore unable to access YouTube. +- Potentially, the greatest risk to BGP occurs in a denial-of-service attack in which a router is flooded with more packets than it can handle. Network overload and router resource exhaustion happen when the network begins carrying an excessive number of BGP messages, overloading the router control processors, memory, routing table and reducing the bandwidth available for data traffic. - Refer: -- Router flapping is another type of attack. Route flapping refers to repetitive changes to the BGP routing table, often several times a minute. Withdrawing and re-advertising at a high-rate can cause a serious problem for routers since they propagate the announcements of routes. If these route flaps happen fast enough, e.g., 30 to 50 times per second, the router becomes overloaded, which eventually prevents convergence on valid routes. The potential impact for Internet users is a slowdown in message delivery, and in some cases, packets may not be delivered at all. 
+- Router flapping is another type of attack. Route flapping refers to repetitive changes to the BGP routing table, often several times a minute. Withdrawing and re-advertising at a high-rate can cause a serious problem for routers since they propagate the announcements of routes. If these route flaps happen fast enough, e.g., 30-50 times per second, the router becomes overloaded, which eventually prevents convergence on valid routes. The potential impact for Internet users is a slowdown in message delivery, and in some cases, packets may not be delivered at all. BGP Security - Border Gateway Protocol Security recommends the use of BGP peer authentication since it is one of the strongest mechanisms for preventing malicious activity. - - The authentication mechanisms are Internet Protocol Security (IPsec) or BGP MD5. + - The authentication mechanisms are Internet Protocol Security (IPsec) or BGP MD5. - Another method, known as prefix limits, can be used to avoid filling router tables. In this approach, routers should be configured to disable or terminate a BGP peering session, and issue warning messages to administrators when a neighbour sends in excess of a preset number of prefixes. - IETF is currently working on improving this space @@ -62,7 +62,7 @@ BGP Security - HTTP response splitting attack may happen where the server script embeds user data in HTTP response headers without appropriate sanitation. - This typically happens when the script embeds user data in the redirection URL of a redirection response (HTTP status code 3xx), or when the script embeds user data in a cookie value or name when the response sets a cookie. -- HTTP response splitting attacks can be used to perform web cache poisoning and cross-site scripting attacks. +- HTTP response splitting attacks can be used to perform web cache poisoning and cross-site scripting (XSS) attacks. - HTTP response splitting is the attacker’s ability to send a single HTTP request that forces the webserver to form an output stream, which is then interpreted by the target as two HTTP responses instead of one response. ### Cross-Site Request Forgery (CSRF or XSRF) @@ -70,9 +70,9 @@ BGP Security - A Cross-Site Request Forgery attack tricks the victim’s browser into issuing a command to a vulnerable web application. - Vulnerability is caused by browsers automatically including user authentication data, session ID, IP address, Windows domain credentials, etc. with each request. - Attackers typically use CSRF to initiate transactions such as transfer funds, login/logout user, close account, access sensitive data, and change account details. -- The vulnerability is caused by web browsers that automatically include credentials with each request, even for requests caused by a form, script, or image on another site. CSRF can also be dynamically constructed as part of a payload for a cross-site scripting attack +- The vulnerability is caused by web browsers that automatically include credentials with each request, even for requests caused by a form, script, or image on another site. CSRF can also be dynamically constructed as part of a payload for a cross-site scripting attack. - All sites relying on automatic credentials are vulnerable. Popular browsers cannot prevent cross-site request forgery. Logging out of high-value sites as soon as possible can mitigate CSRF risk. 
It is recommended that a high-value website must require a client to manually provide authentication data in the same HTTP request used to perform any operation with security implications. Limiting the lifetime of session cookies can also reduce the chance of being used by other malicious sites. -- OWASP recommends website developers include a required security token in HTTP requests associated with sensitive business functions in order to mitigate CSRF attacks +- OWASP recommends website developers include a required security token in HTTP requests associated with sensitive business functions in order to mitigate CSRF attacks. ### Cross-Site Scripting (XSS) Attacks @@ -90,7 +90,7 @@ BGP Security - The technique works by hiding malicious link/scripts under the cover of the content of a legitimate site. - Buttons on a website actually contain invisible links, placed there by the attacker. So, an individual who clicks on an object they can visually see is actually being duped into visiting a malicious page or executing a malicious script. - When mouseover is used together with clickjacking, the outcome is devastating. Facebook users have been hit by a clickjacking attack, which tricks people into “liking” a particular Facebook page, thus enabling the attack to spread since Memorial Day 2010. -- There is not yet effective defence against clickjacking, and disabling JavaScript is the only viable method +- There is not yet effective defence against clickjacking, and disabling JavaScript is the only viable method. ## DataBase Attacks & Defenses @@ -98,23 +98,23 @@ BGP Security - It exploits improper input validation in database queries. - A successful exploit will allow attackers to access, modify, or delete information in the database. -- It permits attackers to steal sensitive information stored within the backend databases of affected websites, which may include such things as user credentials, email addresses, personal information, and credit card numbers +- It permits attackers to steal sensitive information stored within the backend databases of affected websites, which may include such things as user credentials, email addresses, personal information, and credit card numbers. -``` +```SQL SELECT USERNAME,PASSWORD from USERS where USERNAME='' AND PASSWORD=''; +``` +Here, the username & password is the input provided by the user. Suppose an attacker gives the input as ` OR '1'='1' ` in both fields. Therefore the SQL query will look like: -Here the username & password is the input provided by the user. Suppose an attacker gives the input as " OR '1'='1'" in both fields. Therefore the SQL query will look like: - +```SQL SELECT USERNAME,PASSWORD from USERS where USERNAME='' OR '1'='1' AND PASSOWRD='' OR '1'='1'; - -This query results in a true statement & the user gets logged in. This example depicts the bost basic type of SQL injection ``` +This query results in a true statement & the user gets logged in. This example depicts the most basic type of SQL injection. ### SQL Injection Attack Defenses - SQL injection can be protected by filtering the query to eliminate malicious syntax, which involves the employment of some tools in order to (a) scan the source code. -- In addition, the input fields should be restricted to the absolute minimum, typically anywhere from 7-12 characters, and validate any data, e.g., if a user inputs an age make sure the input is an integer with a maximum of 3 digits. 
+- In addition, the input fields should be restricted to the absolute minimum, typically anywhere from 7-12 characters, and validate any data, e.g., if a user inputs an age, make sure the input is an integer with a maximum of 3 digits. ## VPN @@ -130,15 +130,15 @@ In spite of the most aggressive steps to protect computers from attacks, attacke ### Denial of Service Attacks -- Denial of service (DoS) attacks result in downtime or inability of a user to access a system. DoS attacks impact the availability of tenet of information systems security. A DoS attack is a coordinated attempt to deny service by occupying a computer to perform large amounts of unnecessary tasks. This excessive activity makes the system unavailable to perform legitimate operations +- Denial-of-service (DoS) attacks result in downtime or inability of a user to access a system. DoS attacks impact the availability of tenet of information systems security. A DoS attack is a coordinated attempt to deny service by occupying a computer to perform large amounts of unnecessary tasks. This excessive activity makes the system unavailable to perform legitimate operations - Two common types of DoS attacks are as follows: - - Logic attacks—Logic attacks use software flaws to crash or seriously hinder the performance of remote servers. You can prevent many of these attacks by installing the latest patches to keep your software up to date. - - Flooding attacks—Flooding attacks overwhelm the victim computer’s CPU, memory, or network resources by sending large numbers of useless requests to the machine. + - Logic attacks—Logic attacks use software flaws to crash or seriously hinder the performance of remote servers. You can prevent many of these attacks by installing the latest patches to keep your software up to date. + - Flooding attacks—Flooding attacks overwhelm the victim computer’s CPU, memory, or network resources by sending large numbers of useless requests to the machine. - Most DoS attacks target weaknesses in the overall system architecture rather than a software bug or security flaw - One popular technique for launching a packet flood is a SYN flood. - One of the best defences against DoS attacks is to use intrusion prevention system (IPS) software or devices to detect and stop the attack. -### Distributed Denial of Service Attacks +### Distributed Denial-of-Service Attacks - DDoS attacks differ from regular DoS attacks in their scope. In a DDoS attack, attackers hijack hundreds or even thousands of Internet computers, planting automated attack agents on those systems. The attacker then instructs the agents to bombard the target site with forged messages. This overloads the site and blocks legitimate traffic. The key here is strength in numbers. The attacker does more damage by distributing the attack across multiple computers. @@ -149,8 +149,8 @@ In spite of the most aggressive steps to protect computers from attacks, attacke - Attackers can tap telephone lines and data communication lines. Wiretapping can be active, where the attacker makes modifications to the line. It can also be passive, where an unauthorized user simply listens to the transmission without changing the contents. Passive intrusion can include the copying of data for a subsequent active attack. - Two methods of active wiretapping are as follows: - - Between-the-lines wiretapping—This type of wiretapping does not alter the messages sent by the legitimate user but inserts additional messages into the communication line when the legitimate user pauses. 
- - Piggyback-entry wiretapping—This type of wiretapping intercepts and modifies the original message by breaking the communications line and routing the message to another computer that acts as a host. + - Between-the-lines wiretapping—This type of wiretapping does not alter the messages sent by the legitimate user but inserts additional messages into the communication line when the legitimate user pauses. + - Piggyback-entry wiretapping—This type of wiretapping intercepts and modifies the original message by breaking the communications line and routing the message to another computer that acts as a host. ### Backdoors @@ -162,44 +162,46 @@ In spite of the most aggressive steps to protect computers from attacks, attacke - Once an attacker compromises a hashed password file, a birthday attack is performed. A birthday attack is a type of cryptographic attack that is used to make a brute-force attack of one-way hashes easier. It is a mathematical exploit that is based on the birthday problem in probability theory. - Further Reading: - - - - + - + - ### Brute-Force Password Attacks -- In a brute-force password attack, the attacker tries different passwords on a system until one of them is successful. Usually, the attacker employs a software program to try all possible combinations of a likely password, user ID, or security code until it locates a match. This occurs rapidly and in sequence. This type of attack is called a brute-force password attack because the attacker simply hammers away at the code. There is no skill or stealth involved—just brute force that eventually breaks the code. +- In a brute-force password attack, the attacker tries different passwords on a system until one of them is successful. Usually, the attacker employs a software program to try all possible combinations of a likely password, user ID, or security code until it locates a match. This occurs rapidly and in sequence. This type of attack is called a brute-force password attack because the attacker simply hammers away at the code. There is no skill or stealth involved—just brute force that eventually breaks the code. - Further Reading: - - - - + - + - ### Dictionary Password Attacks - A dictionary password attack is a simple attack that relies on users making poor password choices. In a dictionary password attack, a simple password-cracker program takes all the words from a dictionary file and attempts to log on by entering each dictionary entry as a password. - Further Reading: -https://capec.mitre.org/data/definitions/16.html + - ### Replay Attacks - Replay attacks involve capturing data packets from a network and retransmitting them to produce an unauthorized effect. The receipt of duplicate, authenticated IP packets may disrupt service or have some other undesired consequence. Systems can be broken through replay attacks when attackers reuse old messages or parts of old messages to deceive system users. This helps intruders to gain information that allows unauthorized access into a system. - Further reading: - + - ### Man-in-the-Middle Attacks - A man-in-the-middle attack takes advantage of the multihop process used by many types of networks. In this type of attack, an attacker intercepts messages between two parties before transferring them on to their intended destination. - Web spoofing is a type of man-in-the-middle attack in which the user believes a secure session exists with a particular web server. In reality, the secure connection exists only with the attacker, not the webserver. 
The attacker then establishes a secure connection with the webserver, acting as an invisible go-between. The attacker passes traffic between the user and the webserver. In this way, the attacker can trick the user into supplying passwords, credit card information, and other private data. - Further Reading: - - + - ### Masquerading - In a masquerade attack, one user or computer pretends to be another user or computer. Masquerade attacks usually include one of the other forms of active attacks, such as IP address spoofing or replaying. Attackers can capture authentication sequences and then replay them later to log on again to an application or operating system. For example, an attacker might monitor usernames and passwords sent to a weak web application. The attacker could then use the intercepted credentials to log on to the web application and impersonate the user. -- Further Reading: +- Further Reading: + - + - ### Eavesdropping -- Eavesdropping, or sniffing, occurs when a host sets its network interface on promiscuous mode and copies packets that pass by for later analysis. Promiscuous mode enables a network device to intercept and read each network packet(of course given some conditions) given sec, even if the packet’s address doesn’t match the network device. It is possible to attach hardware and software to monitor and analyze all packets on that segment of the transmission media without alerting any other users. Candidates for eavesdropping include satellite, wireless, mobile, and other transmission methods. +- Eavesdropping, or sniffing, occurs when a host sets its network interface on promiscuous mode and copies packets that pass by for later analysis. Promiscuous mode enables a network device to intercept and read each network packet (of course given some conditions) given sec, even if the packet’s address doesn’t match the network device. It is possible to attach hardware and software to monitor and analyze all packets on that segment of the transmission media without alerting any other users. Candidates for eavesdropping include satellite, wireless, mobile, and other transmission methods. ### Social Engineering diff --git a/courses/level101/security/writing_secure_code.md b/courses/level101/security/writing_secure_code.md index 8be7b8c2..42a06fa7 100644 --- a/courses/level101/security/writing_secure_code.md +++ b/courses/level101/security/writing_secure_code.md @@ -32,7 +32,7 @@ The first and most important step in reducing security and reliability issues is ### Refactoring -- Refactoring is the most effective way to keep a codebase clean and simple. Even a healthy codebase occasionally needs to be +- Refactoring is the most effective way to keep a codebase clean and simple. Even a healthy codebase occasionally needs to be. - Regardless of the reasons behind refactoring, you should always follow one golden rule: never mix refactoring and functional changes in a single commit to the code repository. Refactoring changes are typically significant and can be difficult to understand. - If a commit also includes functional changes, there’s a higher risk that an author or reviewer might overlook bugs. @@ -42,11 +42,11 @@ The first and most important step in reducing security and reliability issues is ### Fuzz Testing -- Fuzz testing is a technique that complements the previously mentioned testing techniques. Fuzzing involves using a fuzzing engine to generate a large number of candidate inputs that are then passed through a fuzz driver to the fuzz target. 
The fuzzer then analyzes how the system handles the input. Complex inputs handled by all kinds of software are popular targets for fuzzing - for example, file parsers, compression algorithms, network protocol implementation and audio codec. +- Fuzz testing is a technique that complements the previously mentioned testing techniques. Fuzzing involves using a fuzzing engine to generate a large number of candidate inputs that are then passed through a fuzz driver to the fuzz target. The fuzzer then analyzes how the system handles the input. Complex inputs handled by all kinds of software are popular targets for fuzzing—for example, file parsers, compression algorithms, network protocol implementation and audio codec. ### Integration Testing -- Integration testing moves beyond individual units and abstractions, replacing fake or stubbed-out implementations of abstractions like databases or network services with real implementations. As a result, integration tests exercise more complete code paths. Because you must initialize and configure these other dependencies, integration testing may be slower and flakier than unit testing—to execute the test, this approach incorporates real-world variables like network latency as services communicate end-to-end. As you move from testing individual low-level units of code to testing how they interact when composed together, the net result is a higher degree of confidence that the system is behaving as expected. +- Integration testing moves beyond individual units and abstractions, replacing fake or stubbed-out implementations of abstractions like databases or network services with real implementations. As a result, integration tests exercise more complete code paths. Because you must initialize and configure these other dependencies, integration testing may be slower and flakier than unit testing—to execute the test, this approach incorporates real-world variables like network latency as services communicate end-to-end. As you move from testing individual low-level units of code to testing how they interact when composed together, the net result is a higher degree of confidence that the system is behaving as expected. 
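+
+Returning to fuzz testing for a moment: in practice you would reach for a coverage-guided engine such as AFL or libFuzzer. The loop below is only a naive sketch of the generate-run-observe idea, with `./file_parser` as a hypothetical target binary.
+
+```shell
+# Naive fuzzing loop: feed random bytes to a parser and keep any input that crashes it.
+mkdir -p crashes
+for i in $(seq 1 1000); do
+  head -c 512 /dev/urandom > input.bin
+  ./file_parser input.bin > /dev/null 2>&1
+  status=$?
+  # Exit codes >= 128 mean the process died from a signal (e.g. 139 = SIGSEGV).
+  if [ "$status" -ge 128 ]; then
+    cp input.bin "crashes/crash_$i.bin"
+  fi
+done
+```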
### Last But not the least diff --git a/courses/level101/systems_design/availability.md b/courses/level101/systems_design/availability.md index 87410c4b..14bfdc41 100644 --- a/courses/level101/systems_design/availability.md +++ b/courses/level101/systems_design/availability.md @@ -3,43 +3,42 @@ Availability is generally expressed as “Nines”, common ‘Nines’ are list | Availability % | Downtime per year | Downtime per month | Downtime per week | Downtime per day | |---------------------------------|:-----------------:|:-------------------:|:-----------------:|:----------------:| -| 99%(Two Nines) | 3.65 days | 7.31 hours | 1.68 hours | 14.40 minutes | -| 99.5%(Two and a half Nines) | 1.83 days | 3.65 hours | 50.40 minutes | 7.20 minutes | -| 99.9%(Three Nines) | 8.77 hours | 43.83 minutes | 10.08 minutes | 1.44 minutes | -| 99.95%(Three and a half Nines) | 4.38 hours | 21.92 minutes | 5.04 minutes | 43.20 seconds | -| 99.99%(Four Nines) | 52.60 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds | -| 99.995%(Four and a half Nines) | 26.30 minutes | 2.19 minutes | 30.24 seconds | 4.32 seconds | -| 99.999%(Five Nines) | 5.26 minutes | 26.30 seconds | 6.05 seconds | 864.0 ms | +| 99% (Two Nines) | 3.65 days | 7.31 hours | 1.68 hours | 14.40 minutes | +| 99.5% (Two and a half Nines) | 1.83 days | 3.65 hours | 50.40 minutes | 7.20 minutes | +| 99.9% (Three Nines) | 8.77 hours | 43.83 minutes | 10.08 minutes | 1.44 minutes | +| 99.95% (Three and a half Nines) | 4.38 hours | 21.92 minutes | 5.04 minutes | 43.20 seconds | +| 99.99% (Four Nines) | 52.60 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds | +| 99.995% (Four and a half Nines) | 26.30 minutes | 2.19 minutes | 30.24 seconds | 4.32 seconds | +| 99.999% (Five Nines) | 5.26 minutes | 26.30 seconds | 6.05 seconds | 864.0 ms | ### Refer -- https://en.wikipedia.org/wiki/High_availability#Percentage_calculation +- [https://en.wikipedia.org/wiki/High_availability#Percentage_calculation](https://en.wikipedia.org/wiki/High_availability#Percentage_calculation) ## HA - Availability Serial Components -A System with components is operating in the series If the failure of a part leads to the combination becoming inoperable. +A System with components is operating in the series if the failure of a part leads to the combination becoming inoperable. For example, if LB in our architecture fails, all access to app tiers will fail. LB and app tiers are connected serially. - -The combined availability of the system is the product of individual components availability +The combined availability of the system is the product of individual components availability: *A = Ax x Ay x …..* ### Refer -- http://www.eventhelix.com/RealtimeMantra/FaultHandling/system_reliability_availability.htm +- [http://www.eventhelix.com/RealtimeMantra/FaultHandling/system_reliability_availability.htm](http://www.eventhelix.com/RealtimeMantra/FaultHandling/system_reliability_availability.htm) ## HA - Availability Parallel Components -A System with components is operating in parallel If the failure of a part leads to the other part taking over the operations of the failed part. +A System with components is operating in parallel if the failure of a part leads to the other part taking over the operations of the failed part. 
-If we have more than one LB and if the rest of the LBs can take over the traffic during one LB failure then LBs are operating in parallel +If we have more than one LB and if the rest of the LBs can take over the traffic during one LB failure, then LBs are operating in parallel. The combined availability of the system is *A = 1 - ( (1-Ax) x (1-Ax) x ….. )* ### Refer -- http://www.eventhelix.com/RealtimeMantra/FaultHandling/system_reliability_availability.htm +- [http://www.eventhelix.com/RealtimeMantra/FaultHandling/system_reliability_availability.htm](http://www.eventhelix.com/RealtimeMantra/FaultHandling/system_reliability_availability.htm) ## HA - Core Principles @@ -47,10 +46,10 @@ The combined availability of the system is **Reliable crossover** In redundant systems, the crossover point itself tends to become a single point of failure. Reliable systems must provide for reliable crossover. -**Detection of failures as they occur** If the two principles above are observed, then a user may never see a failure +**Detection of failures as they occur** If the two principles above are observed, then a user may never see a failure. ### Refer -- https://en.wikipedia.org/wiki/High_availability#Principles +- [https://en.wikipedia.org/wiki/High_availability#Principles](https://en.wikipedia.org/wiki/High_availability#Principles) ## HA - SPOF @@ -58,7 +57,7 @@ The combined availability of the system is **WHEN TO USE:** During architecture reviews and new designs. -**HOW TO USE:** Identify single instances on architectural diagrams. Strive for active/active configurations. At the very least we should have a standby to take control when active instances fail. +**HOW TO USE:** Identify single instances on architectural diagrams. Strive for active/active configurations. At the very least, we should have a standby to take control when active instances fail. **WHY:** Maximize availability through multiple instances. @@ -74,16 +73,17 @@ The combined availability of the system is **WHY:** Maximize availability and ensure data handling semantics are preserved. -**KEY TAKEAWAYS:** Strive for active/active rather than active/passive solutions, they have a lesser risk of cross over being unreliable. Use LB and the right load balancing methods to ensure reliable failover. Model and build your data systems to ensure data is correctly handled when crossover happens. Generally, DB systems follow active/passive semantics for writes. Masters accept writes and when the master goes down, the follower is promoted to master(active from being passive) to accept writes. We have to be careful here that the cutover never introduces more than one master. This problem is called a split brain. +**KEY TAKEAWAYS:** Strive for active/active rather than active/passive solutions, they have a lesser risk of cross over being unreliable. Use LB and the right load-balancing methods to ensure reliable failover. Model and build your data systems to ensure data is correctly handled when crossover happens. Generally, DB systems follow active/passive semantics for writes. Masters accept writes and when the master goes down, the follower is promoted to master (active from being passive) to accept writes. We have to be careful here that the cutover never introduces more than one master. This problem is called a split brain. ## Applications in SRE role 1. SRE works on deciding an acceptable SLA and make sure the system is available to achieve the SLA 2. 
SRE is involved in architecture design right from building the data center to make sure the site is not affected by a network switch, hardware, power, or software failures 3. SRE also run mock drills of failures to see how the system behaves in uncharted territory and comes up with a plan to improve availability if there are misses. -https://engineering.linkedin.com/blog/2017/11/resilience-engineering-at-linkedin-with-project-waterbear +[https://engineering.linkedin.com/blog/2017/11/resilience-engineering-at-linkedin-with-project-waterbear](https://engineering.linkedin.com/blog/2017/11/resilience-engineering-at-linkedin-with-project-waterbear) + +Post our understanding about HA, our architecture diagram looks something like this below: -Post our understanding about HA, our architecture diagram looks something like this below ![HA Block Diagram](images/availability.jpg) diff --git a/courses/level101/systems_design/conclusion.md b/courses/level101/systems_design/conclusion.md index 8e3cb47c..346a33d7 100644 --- a/courses/level101/systems_design/conclusion.md +++ b/courses/level101/systems_design/conclusion.md @@ -1,3 +1,3 @@ # Conclusion -Armed with these principles, we hope the course will give a fresh perspective to design software systems. It might be over-engineering to get all this on day zero. But some are really important from day 0 like eliminating single points of failure, making scalable services by just increasing replicas. As a bottleneck is reached, we can split code by services, shard data to scale. As the organization matures, bringing in [chaos engineering](https://en.wikipedia.org/wiki/Chaos_engineering) to measure how systems react to failure will help in designing robust software systems. +Armed with these principles, we hope the course will give a fresh perspective to design software systems. It might be over-engineering to get all this on day zero. But some are really important from day 0 like eliminating single points of failure, making scalable services by just increasing replicas. As a bottleneck is reached, we can _split code by services_, _shard data_ to scale. As the organization matures, bringing in [chaos engineering](https://en.wikipedia.org/wiki/Chaos_engineering) to measure how systems react to failure will help in designing robust software systems. diff --git a/courses/level101/systems_design/fault-tolerance.md b/courses/level101/systems_design/fault-tolerance.md index 2ff9cd69..2bb77733 100644 --- a/courses/level101/systems_design/fault-tolerance.md +++ b/courses/level101/systems_design/fault-tolerance.md @@ -3,10 +3,10 @@ Failures are not avoidable in any system and will happen all the time, hence we need to build systems that can tolerate failures or recover from them. - In systems, failure is the norm rather than the exception. -- "Anything that can go wrong will go wrong” -- Murphy’s Law -- “Complex systems contain changing mixtures of failures latent within them” -- How Complex Systems Fail. +- "Anything that can go wrong will go wrong”—Murphy’s Law +- “Complex systems contain changing mixtures of failures latent within them”—How Complex Systems Fail. -### Fault Tolerance - Failure Metrics +### Fault Tolerance: Failure Metrics Common failure metrics that get measured and tracked for any system. @@ -27,40 +27,41 @@ Common failure metrics that get measured and tracked for any system. **Failure rate:** Another reliability metric, which measures the frequency with which a component or system fails. 
It is expressed as a number of failures over a unit of time. #### Refer -- https://www.splunk.com/en_us/data-insider/what-is-mean-time-to-repair.html +- [https://www.splunk.com/en_us/data-insider/what-is-mean-time-to-repair.html](https://www.splunk.com/en_us/data-insider/what-is-mean-time-to-repair.html) -### Fault Tolerance - Fault Isolation Terms +### Fault Tolerance: Fault Isolation Terms Systems should have a short circuit. Say in our content sharing system, if “Notifications” is not working, the site should gracefully handle that failure by removing the functionality instead of taking the whole site down. Swimlane is one of the commonly used fault isolation methodologies. Swimlane adds a barrier to the service from other services so that failure on either of them won’t affect the other. Say we roll out a new feature ‘Advertisement’ in our content sharing app. We can have two architectures + ![Swimlane](images/swimlane-1.jpg) -If Ads are generated on the fly synchronously during each Newsfeed request, the faults in the Ads feature get propagated to the Newsfeed feature. Instead if we swimlane the “Generation of Ads” service and use a shared storage to populate Newsfeed App, Ads failures won’t cascade to Newsfeed, and worst case if Ads don’t meet SLA , we can have Newsfeed without Ads. +If Ads are generated on the fly synchronously during each Newsfeed request, the faults in the Ads feature get propagated to the Newsfeed feature. Instead if we swimlane the “Generation of Ads” service and use a shared storage to populate Newsfeed App, Ads failures won’t cascade to Newsfeed, and worst case if Ads don’t meet SLA, we can have Newsfeed without Ads. -Let's take another example, we have come up with a new model for our Content sharing App. Here we roll out an enterprise content sharing App where enterprises pay for the service and the content should never be shared outside the enterprise. +Let's take another example, we have come up with a new model for our Content sharing App. Here, we roll out an enterprise content sharing App where enterprises pay for the service and the content should never be shared outside the enterprise. ![Swimlane-principles](images/swimlane-2.jpg) ### Swimlane Principles -**Principle 1:** Nothing is shared (also known as “share as little as possible”). The less that is shared within a swim lane, the more fault isolative the swim lane becomes. (as shown in Enterprise use-case) +**Principle 1:** Nothing is shared (also known as “share as little as possible”). The less that is shared within a swimlane, the more fault isolative the swimlane becomes. (as shown in Enterprise use-case) -**Principle 2:** Nothing crosses a swim lane boundary. Synchronous (defined by expecting a request—not the transfer protocol) communication never crosses a swim lane boundary; if it does, the boundary is drawn incorrectly. (as shown in Ads feature) +**Principle 2:** Nothing crosses a swimlane boundary. Synchronous (defined by expecting a request—not the transfer protocol) communication never crosses a swimlane boundary; if it does, the boundary is drawn incorrectly. (as shown in Ads feature) ### Swimlane Approaches -**Approach 1:** Swim lane the money-maker. Never allow your cash register to be compromised by other systems. (Tier 1 vs Tier 2 in enterprise use case) +**Approach 1:** Swimlane the money-maker. Never allow your cash register to be compromised by other systems. (Tier 1 vs Tier 2 in enterprise use case) -**Approach 2:** Swim lane the biggest sources of incidents. 
Identify the recurring causes of pain and isolate them. (if Ads feature is in code yellow, swim laning it is the best option)
+**Approach 2:** Swimlane the biggest sources of incidents. Identify the recurring causes of pain and isolate them. (If Ads feature is in code yellow, swimlaning it is the best option.)

-**Approach 3:** Swim lane natural barriers. Customer boundaries make good swim lanes. (Public vs Enterprise customers)
+**Approach 3:** Swimlane natural barriers. Customer boundaries make good swimlanes. (Public vs Enterprise customers)

#### Refer
-- https://learning.oreilly.com/library/view/the-art-of/9780134031408/ch21.html#ch21
+- [https://learning.oreilly.com/library/view/the-art-of/9780134031408/ch21.html#ch21](https://learning.oreilly.com/library/view/the-art-of/9780134031408/ch21.html#ch21)

### Applications in SRE role

-1. Work with the DC tech or cloud team to distribute infrastructure such that its immune to switch or power failures by creating fault zones within a Data Center
-https://docs.microsoft.com/en-us/azure/virtual-machines/manage-availability#use-availability-zones-to-protect-from-datacenter-level-failures
-2. Work with the partners and design interaction between services such that one service breakdown is not amplified in a cascading fashion to all upstreams
+1. Work with the DC tech or cloud team to distribute infrastructure such that it's immune to switch or power failures by creating fault zones within a Data Center ([https://docs.microsoft.com/en-us/azure/virtual-machines/manage-availability#use-availability-zones-to-protect-from-datacenter-level-failures](https://docs.microsoft.com/en-us/azure/virtual-machines/manage-availability#use-availability-zones-to-protect-from-datacenter-level-failures)).
+2. Work with the partners and design interaction between services such that one service breakdown is not amplified in a cascading fashion to all upstreams.

diff --git a/courses/level101/systems_design/intro.md b/courses/level101/systems_design/intro.md
index 802a05a9..f2a0ee14 100644
--- a/courses/level101/systems_design/intro.md
+++ b/courses/level101/systems_design/intro.md
@@ -31,19 +31,18 @@ More light will be shed on concepts rather than on setting up and configuring co

So, how do you go about learning to design a system?

-*” Like most great questions, it showed a level of naivety that was breathtaking. The only short answer I could give was, essentially, that you learned how to design a system by designing systems and finding out what works and what doesn’t work.” -Jim Waldo, Sun Microsystems, On System Design*
+"*Like most great questions, it showed a level of naivety that was breathtaking. The only short answer I could give was, essentially, that you learned how to design a system by designing systems and finding out what works and what doesn’t work.*"—Jim Waldo, Sun Microsystems, On System Design

As software and hardware systems have multiple moving parts, we need to think about how those parts will grow, their failure modes, their inter-dependencies, how it will impact the users and the business.

There is no one-shot method or way to learn or do system design, we only learn to design systems by designing and iterating on them.

-This course will be a starter to make one think about scalability, availability, and fault tolerance during systems design.
+This course will be a starter to make one think about _scalability_, _availability_, and _fault tolerance_ during systems design.
## Backstory

-Let’s design a simple content sharing application where users can share photos, media in our application which can be liked by their friends. Let’s start with a simple design of the application and evolve it as we learn system design concepts
+Let’s design a simple content sharing application where users can share photos and media that can be liked by their friends. Let’s start with a simple design of the application and evolve it as we learn system design concepts.

![First architecture diagram](images/first-architecture.jpg)

diff --git a/courses/level101/systems_design/scalability.md b/courses/level101/systems_design/scalability.md
index a69ff2cc..75bab801 100644
--- a/courses/level101/systems_design/scalability.md
+++ b/courses/level101/systems_design/scalability.md
@@ -1,27 +1,27 @@
# Scalability

-What does scalability mean for a system/service? A system is composed of services/components, each service/component scalability needs to be tackled separately, and the scalability of the system as a whole.
+**What does scalability mean for a system/service?** A system is composed of services/components; each service's or component's scalability needs to be tackled separately, as does the scalability of the system as a whole.

-A service is said to be scalable if, as resources are added to the system, it results in increased performance in a manner proportional to resources added
+A service is said to be scalable if, as resources are added to the system, it results in increased performance in a manner proportional to the resources added.

-An always-on service is said to be scalable if adding resources to facilitate redundancy does not result in a loss of performance
+An always-on service is said to be scalable if adding resources to facilitate redundancy does not result in a loss of performance.

## Refer

- [https://www.allthingsdistributed.com/2006/03/a_word_on_scalability.html](https://www.allthingsdistributed.com/2006/03/a_word_on_scalability.html)

-## Scalability - AKF Scale Cube
+## Scalability: AKF Scale Cube

-The [Scale Cube](https://akfpartners.com/growth-blog/scale-cube) is a model for segmenting services, defining microservices, and scaling products. It also creates a common language for teams to discuss scale related options in designing solutions. The following section talks about certain scaling patterns based on our inferences from the AKF cube
+The [Scale Cube](https://akfpartners.com/growth-blog/scale-cube) is a model for segmenting services, defining microservices, and scaling products. It also creates a common language for teams to discuss scale-related options in designing solutions. The following section talks about certain scaling patterns based on our inferences from the AKF cube.

-## Scalability - Horizontal scaling
+## Scalability: Horizontal scaling

Horizontal scaling stands for cloning of an application or service such that work can easily be distributed across instances with absolutely no bias.

-Let's see how our monolithic application improves with this principle
+Let's see how our monolithic application improves with this principle.

![Horizontal Scaling](images/horizontal-scaling.jpg)

-Here DB is scaled separately from the application. This is to let you know each component’s scaling capabilities can be different. Usually, web applications can be scaled by adding resources unless there is state stored inside the application. 
But DBs can be scaled only for Reads by adding more followers but Writes have to go to only one leader to make sure data is consistent. There are some DBs that support multi-leader writes but we are keeping them out of scope at this point. +Here, DB is scaled separately from the application. This is to let you know each component’s scaling capabilities can be different. Usually, web applications can be scaled by adding resources unless there is state stored inside the application. But DBs can be scaled only for Reads by adding more followers but Writes have to go to only one leader to make sure data is consistent. There are some DBs that support multi-leader writes but we are keeping them out of scope at this point. Apps should be able to differentiate between Reads and Writes to choose appropriate DB servers. Load balancers can split traffic between identical servers transparently. @@ -33,17 +33,17 @@ Apps should be able to differentiate between Reads and Writes to choose appropri **WHY:** Allows for the fast scale of transactions at the cost of duplicated data and functionality. -**KEY TAKEAWAYS:** This is fast to implement, is a low cost from a developer effort perspective, and can scale transaction volumes nicely. However, they tend to be high cost from the perspective of the operational cost of data. The cost here means if we have 3 followers and 1 Leader DB, the same database will be stored as 4 copies in the 4 servers. Hence added storage cost +**KEY TAKEAWAYS:** This is fast to implement, is a low cost from a developer effort perspective, and can scale transaction volumes nicely. However, they tend to be high cost from the perspective of the operational cost of data. The cost here means if we have 3 followers and 1 Leader DB, the same database will be stored as 4 copies in the 4 servers. Hence added storage cost. ### Refer - [https://learning.oreilly.com/library/view/the-art-of/9780134031408/ch23.html](https://learning.oreilly.com/library/view/the-art-of/9780134031408/ch23.html) -### Scalability Pattern - Load Balancing +### Scalability Pattern: Load Balancing -Improves the distribution of workloads across multiple computing resources, such as computers, a computer cluster, network links, central processing units, or disk drives. A commonly used technique is load balancing traffic across identical server clusters. A similar philosophy is used to load balance traffic across network links by [ECMP](https://en.wikipedia.org/wiki/Equal-cost_multi-path_routing), disk drives by [RAID](https://en.wikipedia.org/wiki/RAID),etc +Improves the distribution of workloads across multiple computing resources, such as computers, a computer cluster, network links, central processing units, or disk drives. A commonly used technique is load balancing traffic across identical server clusters. A similar philosophy is used to load balance traffic across network links by [ECMP](https://en.wikipedia.org/wiki/Equal-cost_multi-path_routing), disk drives by [RAID](https://en.wikipedia.org/wiki/RAID), etc. Aims to optimize resource use, maximize throughput, minimize response time, and avoid overload of any single resource. -Using multiple components with load balancing instead of a single component may increase reliability and availability through redundancy. In our updated architecture diagram we have 4 servers to handle app traffic instead of a single server +Using multiple components with load balancing instead of a single component may increase reliability and availability through redundancy. 
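As a small illustration of the earlier point that apps should differentiate reads from writes, here is a hypothetical Python sketch of that routing decision. The endpoint names are invented, and in practice this logic usually lives in a database driver, proxy, or connection pooler rather than in hand-rolled application code.

```python
import random

# Hypothetical endpoints for the horizontally scaled database tier:
# one leader that accepts writes, several read-only followers.
LEADER = "db-leader-1:5432"
FOLLOWERS = ["db-follower-1:5432", "db-follower-2:5432", "db-follower-3:5432"]


def pick_db(is_write: bool) -> str:
    """Send writes to the leader and spread reads across the followers."""
    if is_write:
        return LEADER
    return random.choice(FOLLOWERS)


if __name__ == "__main__":
    print("INSERT routed to", pick_db(is_write=True))
    print("SELECT routed to", pick_db(is_write=False))
```

Reads served by followers can be slightly stale because of replication lag, which is part of the consistency trade-off that comes with adding replicas.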
In our updated architecture diagram, we have 4 servers to handle app traffic instead of a single server. The device or system that performs load balancing is called a load balancer, abbreviated as LB. @@ -52,9 +52,9 @@ The device or system that performs load balancing is called a load balancer, abb - [https://blog.envoyproxy.io/introduction-to-modern-network-load-balancing-and-proxying-a57f6ff80236](https://blog.envoyproxy.io/introduction-to-modern-network-load-balancing-and-proxying-a57f6ff80236) - [https://learning.oreilly.com/library/view/load-balancing-in/9781492038009/](https://learning.oreilly.com/library/view/load-balancing-in/9781492038009/) - [https://learning.oreilly.com/library/view/practical-load-balancing/9781430236801/](https://learning.oreilly.com/library/view/practical-load-balancing/9781430236801/) -- [http://shop.oreilly.com/product/9780596000509.do](http://shop.oreilly.com/product/9780596000509.do) +- [https://shop.oreilly.com/product/9780596000509.do](https://shop.oreilly.com/product/9780596000509.do) -### Scalability Pattern - LB Tasks +### Scalability Pattern: LB Tasks What does an LB do? @@ -63,36 +63,37 @@ What does an LB do? What backends are available in the system? In our architecture, 4 servers are available to serve App traffic. LB acts as a single endpoint that clients can use transparently to reach one of the 4 servers. #### Health checking: -What backends are currently healthy and available to accept requests? If one out of the 4 App servers turns bad, LB should automatically short circuit the path so that clients don’t sense any application downtime +What backends are currently healthy and available to accept requests? If one out of the 4 App servers turns bad, LB should automatically short circuit the path so that clients don’t sense any application downtime. #### Load balancing: -What algorithm should be used to balance individual requests across the healthy backends? There are many algorithms to distribute traffic across one of the four servers. Based on observations/experience, SRE can pick the algorithm that suits their pattern +What algorithm should be used to balance individual requests across the healthy backends? There are many algorithms to distribute traffic across one of the four servers. Based on observations/experience, SRE can pick the algorithm that suits their pattern. -### Scalability Pattern - LB Methods +### Scalability Pattern: LB Methods Common Load Balancing Methods #### Least Connection Method -directs traffic to the server with the fewest active connections. Most useful when there are a large number of persistent connections in the traffic unevenly distributed between the servers. Works if clients maintain long-lived connections +This method directs traffic to the server with the fewest active connections. Most useful when there are a large number of persistent connections in the traffic unevenly distributed between the servers. Works if clients maintain long-lived connections. #### Least Response Time Method -directs traffic to the server with the fewest active connections and the lowest average response time. Here response time is used to provide feedback of the server’s health +This method directs traffic to the server with the fewest active connections and the lowest average response time. Here, response time is used to provide feedback of the server’s health. #### Round Robin Method -rotates servers by directing traffic to the first available server and then moves that server to the bottom of the queue. 
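The balancing methods described in this section can be sketched in a few lines of Python. This is a toy illustration only: the backend names are invented and health checking is ignored; production load balancers such as NGINX, HAProxy, or Envoy implement these algorithms for you.

```python
import hashlib
import itertools

SERVERS = ["app-1", "app-2", "app-3", "app-4"]  # hypothetical backends

# Round robin: cycle through the backends in order.
_rr = itertools.cycle(SERVERS)


def round_robin() -> str:
    return next(_rr)


# Least connections: pick the backend with the fewest active connections.
active_connections = {server: 0 for server in SERVERS}


def least_connections() -> str:
    return min(active_connections, key=active_connections.get)


# IP hash: the client's address decides the backend, giving stickiness.
def ip_hash(client_ip: str) -> str:
    digest = hashlib.md5(client_ip.encode("utf-8")).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]


if __name__ == "__main__":
    print([round_robin() for _ in range(5)])  # app-1, app-2, app-3, app-4, app-1
    active_connections.update({"app-1": 7, "app-2": 2, "app-3": 5, "app-4": 9})
    print(least_connections())                # app-2 has the fewest connections
    print(ip_hash("203.0.113.42"))            # the same IP always maps to the same server
```

Which method to pick depends on the traffic pattern described for each method in this section: long-lived connections favor least connections, uniform short requests favor round robin, and stickiness needs something like IP hash.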
Most useful when servers are of equal specification and there are not many persistent connections. +This method rotates servers by directing traffic to the first available server and then moves that server to the bottom of the queue. Most useful when servers are of equal specification and there are not many persistent connections. #### IP Hash -the IP address of the client determines which server receives the request. This can sometimes cause skewness in distribution but is useful if apps store some state locally and need some stickiness +The IP address of the client determines which server receives the request. This can sometimes cause skewness in distribution but is useful if apps store some state locally and need some stickiness. -More advanced client/server-side example techniques -- https://docs.nginx.com/nginx/admin-guide/load-balancer/ -- http://cbonte.github.io/haproxy-dconv/2.2/intro.html#3.3.5 -- https://twitter.github.io/finagle/guide/Clients.html#load-balancing +More advanced client/server-side example techniques: +- [https://docs.nginx.com/nginx/admin-guide/load-balancer/](https://docs.nginx.com/nginx/admin-guide/load-balancer/) +- [https://cbonte.github.io/haproxy-dconv/2.2/intro.html#3.3.5](https://cbonte.github.io/haproxy-dconv/2.2/intro.html#3.3.5) +- [https://twitter.github.io/finagle/guide/Clients.html#load-balancing](https://twitter.github.io/finagle/guide/Clients.html#load-balancing) -### Scalability Pattern - Caching - Content Delivery Networks (CDN) +### Scalability Pattern: Caching—Content Delivery Networks (CDN) -CDNs are added closer to the client’s location. If the app has static data like images, Javascript, CSS which don’t change very often, they can be cached. Since our example is a content sharing site, static content can be cached in CDNs with a suitable expiry. + +CDNs are added closer to the client’s location. If the app has static data like images, JavaScript, CSS which don’t change very often, they can be cached. Since our example is a content-sharing site, static content can be cached in CDNs with a suitable expiry. ![CDN block diagram](images/cdn.jpg) @@ -103,42 +104,42 @@ CDNs are added closer to the client’s location. If the app has static data lik **HOW TO USE:** Most CDNs leverage DNS to serve content on your site’s behalf. Thus you may need to make minor DNS changes or additions and move content to be served from new subdomains. Eg -media-exp1.licdn.com is a domain used by Linkedin to serve static content - -Here a CNAME points the domain to the DNS of the CDN provider +`media-exp1.licdn.com` is a domain used by Linkedin to serve static content +Here, a CNAME points the domain to the DNS of the CDN provider. +``` dig media-exp1.licdn.com +short 2-01-2c3e-005c.cdx.cedexis.net. - +``` **WHY:** CDNs help offload traffic spikes and are often economical ways to scale parts of a site’s traffic. They also often substantially improve page download times. **KEY TAKEAWAYS:** CDNs are a fast and simple way to offset the spikiness of traffic as well as traffic growth in general. Make sure you perform a cost-benefit analysis and monitor the CDN usage. If CDNs have a lot of cache misses, then we don’t gain much from CDN and are still serving requests using our compute resources. -## Scalability - Microservices +## Scalability: Microservices -This pattern represents the separation of work by service or function within the application. Microservices are meant to address the issues associated with growth and complexity in the code base and data sets. 
The intent is to create fault isolation as well as to reduce response times. +This pattern represents the separation of work by service or function within the application. Microservices are meant to address the issues associated with growth and complexity in the codebase and datasets. The intent is to create fault isolation as well as to reduce response times. Microservices can scale transactions, data sizes, and codebase sizes. They are most effective in scaling the size and complexity of your codebase. They tend to cost a bit more than horizontal scaling because the engineering team needs to rewrite services or, at the very least, disaggregate them from the original monolithic application. ![Microservices block diagram](images/microservices.jpg) -**WHAT:** Sometimes referred to as scale through services or resources, this rule focuses on scaling by splitting data sets, transactions, and engineering teams along verb (services) or noun (resources) boundaries. +**WHAT:** Sometimes referred to as scale through services or resources, this rule focuses on scaling by splitting datasets, transactions, and engineering teams along verb (services) or noun (resources) boundaries. -**WHEN TO USE:** Very large data sets where relations between data are not necessary. Large, complex systems where scaling engineering resources requires specialization. +**WHEN TO USE:** Very large datasets where relations between data are not necessary. Large, complex systems where scaling engineering resources requires specialization. **HOW TO USE:** Split up actions by using verbs, or resources by using nouns, or use a mix. Split both the services and the data along the lines defined by the verb/noun approach. -**WHY:** Allows for efficient scaling of not only transactions but also very large data sets associated with those transactions. It also allows for the efficient scaling of teams. +**WHY:** Allows for efficient scaling of not only transactions but also very large datasets associated with those transactions. It also allows for the efficient scaling of teams. -**KEY TAKEAWAYS:** Microservices allow for efficient scaling of transactions, large data sets, and can help with fault isolation. It helps reduce the communication overhead of teams. The codebase becomes less complex as disjoint features are decoupled and spun as new services thereby letting each service scale independently specific to its requirement. +**KEY TAKEAWAYS:** Microservices allow for efficient scaling of transactions, large datasets, and can help with fault isolation. It helps reduce the communication overhead of teams. The codebase becomes less complex as disjoint features are decoupled and spun as new services thereby letting each service scale independently specific to its requirement. ### Refer -- https://learning.oreilly.com/library/view/the-art-of/9780134031408/ch23.html +- [https://learning.oreilly.com/library/view/the-art-of/9780134031408/ch23.html](https://learning.oreilly.com/library/view/the-art-of/9780134031408/ch23.html) -## Scalability - Sharding +## Scalability: Sharding This pattern represents the separation of work based on attributes that are looked up to or determined at the time of the transaction. Most often, these are implemented as splits by requestor, customer, or client. @@ -154,30 +155,30 @@ There can be other ways to split ![Sharding-block-2](images/sharding-2.jpg) -Here the whole data center is split and replicated and clients are directed to a data center based on their geography. 
This helps in improving performance as clients are directed to the closest data center and performance increases as we add more data centers. There are some replication and consistency overhead with this approach one needs to be aware of. This also gives fault tolerance by rolling out test features to one site and rollback if there is an impact to that geography
+Here, the whole data center is split and replicated, and clients are directed to a data center based on their geography. This helps in improving performance as clients are directed to the closest data center, and performance increases as we add more data centers. There is some replication and consistency overhead with this approach that one needs to be aware of. This also gives fault tolerance: we can roll out test features to one site and roll back if there is an impact to that geography.

**WHAT:** This is very often a split by some unique aspect of the customer such as customer ID, name, geography, and so on.

-**WHEN TO USE:** Very large, similar data sets such as large and rapidly growing customer bases or when the response time for a geographically distributed customer base is important.
+**WHEN TO USE:** Very large, similar datasets such as large and rapidly growing customer bases or when the response time for a geographically distributed customer base is important.

**HOW TO USE:** Identify something you know about the customer, such as customer ID, last name, geography, or device, and split or partition both data and services based on that attribute.

**WHY:** Rapid customer growth exceeds other forms of data growth, or you have the need to perform fault isolation between certain customer groups as you scale.

-**KEY TAKEAWAYS:** Shards are effective at helping you to scale customer bases but can also be applied to other very large data sets that can’t be pulled apart using the microservices methodology.
+**KEY TAKEAWAYS:** Shards are effective at helping you to scale customer bases but can also be applied to other very large datasets that can’t be pulled apart using the microservices methodology.

### Refer
-- https://learning.oreilly.com/library/view/the-art-of/9780134031408/ch23.html
+- [https://learning.oreilly.com/library/view/the-art-of/9780134031408/ch23.html](https://learning.oreilly.com/library/view/the-art-of/9780134031408/ch23.html)

## Applications in SRE role

1. SREs in coordination with the network team work on how to map users' traffic to a particular site.
-https://engineering.linkedin.com/blog/2017/05/trafficshift--load-testing-at-scale
+[https://engineering.linkedin.com/blog/2017/05/trafficshift--load-testing-at-scale](https://engineering.linkedin.com/blog/2017/05/trafficshift--load-testing-at-scale)
2. SREs work closely with the Dev team to split monoliths to multiple microservices that are easy to run and manage
-3. SREs work on improving Load Balancers' reliability, service discovery, and performance
+3. SREs work on improving Load Balancers' reliability, service discovery, and performance.
4. SREs work closely to split Data into shards and manage data integrity and consistency.
-https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store
+[https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store](https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store)
5. SREs work to set up, configure, and improve the CDN cache hit rate.
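To make the attribute-based split concrete, here is a minimal Python sketch of hash-based sharding by customer ID, in the spirit of the sharding pattern above; the shard count and ID format are invented for illustration.

```python
import hashlib

NUM_SHARDS = 4  # illustrative only


def shard_for(customer_id: str) -> int:
    """Map a customer ID to a shard using a stable hash.

    A stable hash (unlike Python's built-in hash()) returns the same
    value across processes and restarts, so every app server routes a
    given customer to the same shard.
    """
    digest = hashlib.md5(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


if __name__ == "__main__":
    for customer_id in ("user-1001", "user-1002", "user-1003"):
        print(customer_id, "-> shard", shard_for(customer_id))
```

A simple modulo scheme like this reshuffles most keys whenever the shard count changes, which is why real systems often use consistent hashing or a lookup/directory service to map customers to shards.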