Luminousmen blog · https://luminousmen.com/feeds/

# Demystifying hypothesis testing

*2019-09-20 · https://luminousmen.com/post/35*

There are a lot of engineers who have never been involved in statistics or data science. So, when they build data science pipelines or rewrite code produced by data scientists into adequate, easily maintained code, many nuances and misunderstandings arise on the engineering side. This series of articles is for those Data/ML engineers and novice data scientists. I'll try to explain some basic approaches in plain English and, based on them, explain some of the Data Science model concepts.
The whole series:
- [Data Science. Probability](https://luminousmen.com/post/data-science-probability)
- [Data Science. Bayes theorem](https://luminousmen.com/post/data-science-bayes-theorem)
- [Data Science. Probability distributions](https://luminousmen.com/post/data-science-probability-distributions)
- [Data Science. Measures](https://luminousmen.com/post/data-science-measures)
- [Data Science. Correlation](https://luminousmen.com/post/data-science-correlation)
- [Data Science. The Central Limit Theorem and sampling](https://luminousmen.com/post/data-science-central-limit-theorem)
- [Demystifying hypothesis testing](https://luminousmen.com/post/demystifying-hypothesis-testing)
---
## Hypothesis testing
Let us begin by getting some conceptual clarity on what a hypothesis test is. It all starts with a hypothesis.
*So what is a hypothesis?*
This is a claim based on a specific observation that you need to prove. For a question to become a hypothesis, it must be testable — this is the main condition. For example, you can test the claim that changing the title in an ad will increase conversion by 20%, but you won't be able to check the question "Will changing the title help increase conversion?" In other words, a hypothesis should be specific, not vague.
Hence, the process of **hypothesis testing** consists of forming questions about the data based on the gathered information and testing them using statistical methods.
Example: Testing the hypothesis that women spend more time talking on the phone than men. Suppose that 62 men and 53 women participated in the study. The average talk time was 37 minutes per day for men and 41 minutes per day for women. At first glance, the differences are obvious, and the results confirm the hypothesis.
However, such a result can be obtained by chance, even if there are no differences in the population. Therefore, the question arises — is the difference obtained in the average values sufficient to say that in general, women on average speak on the phone longer than men? What is the probability of that being true? Is this difference statistically significant?
## Null hypothesis and the alternative hypothesis
![The null hypothesis](/media/demystifying-hypothesis-testing/support-null-hypothesis.png)
**The null hypothesis** is the claim about a population, which is usually formulated as the absence of differences between two populations, the absence of the influence of a factor, the absence of an effect, etc. For our example, this would be *the assumption that women and men spend the same amount of time talking on the phone*.
The other assumption under test (not necessarily the strict opposite of the first) is called the **alternative** hypothesis. For the example above, it would be the following — *women spend more time talking on the phone than men do*.
When I started studying the topic, I did not understand why the null hypothesis is the assumption under test; it seemed to me that the null hypothesis should be our assumption about the population (H0 and H1 swapped). Over time, I realized that, in essence, the null hypothesis is the zero state — we assume that we know nothing about the population. The alternative hypothesis is the tested assumption about the population: the next state, the next level of knowledge.
## Significance Levels
![Error types](/media/demystifying-hypothesis-testing/error-types.png)
When testing statistical hypotheses, it is possible to make a mistake by rejecting a true hypothesis or accepting a false one. The significance level is the probability of making a **Type I error**: the threshold probability of rejecting the null hypothesis when it is actually true. In other words, it is the acceptable (from the researcher's point of view) probability of a statistical error of the first kind — the error of *treating differences as significant when they are actually random*. This probability is denoted by the letter α (alpha).
I'll tell you right away about the **Type II error** — failing to reject the null hypothesis when it is false, i.e., concluding that there is no effect when one actually exists. The probability of a Type II error is denoted by β (beta).
![Significance levels](/media/demystifying-hypothesis-testing/significance-levels.jpg)
Usually, significance levels of 0.05, 0.01 and 0.001 are used, because the smaller the significance level, the less likely we are to make a Type I error. *For example, a significance level of 0.05 means that no more than a 5% probability of error is allowed*. Hence, the null hypothesis can be rejected in favor of the alternative hypothesis if, according to the results of the statistical test, the probability of the detected difference arising by chance does not exceed 5 in 100 — there are only 5 chances in 100 of making a mistake. If this significance level is not reached (the probability of error is above 5%), we consider that the difference may be random and therefore cannot reject the null hypothesis.
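The meaning of α can be checked empirically: if we repeatedly draw two samples from the *same* population (so the null hypothesis is true by construction) and test them at α = 0.05, we should falsely reject H0 in roughly 5% of the experiments. Below is a small stdlib-only simulation sketch; the sample sizes and the distribution are arbitrary choices for illustration:

```python
import math
import random

random.seed(42)
alpha = 0.05
trials = 2000
n = 50
false_rejections = 0

for _ in range(trials):
    # Both samples come from the SAME population, so H0 is true by construction
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / (n - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n - 1)
    z = (mean_a - mean_b) / math.sqrt(var_a / n + var_b / n)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value, normal approximation
    if p < alpha:
        false_rejections += 1  # a Type I error: rejecting a true H0

rate = false_rejections / trials
print(f"observed Type I error rate: {rate:.3f} (target alpha = {alpha})")
```

The observed rate hovers around 0.05, which is exactly what the significance level promises.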
![p-value](/media/demystifying-hypothesis-testing/p-value.png)
This probability of error, computed from the data itself, is called the **p-value**. It corresponds to the risk of a Type I error: if p < α, the null hypothesis is rejected.
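Applied to the phone-call example above, the calculation might look as follows. Note that the article reports only the means, so the standard deviations here are invented for illustration; with samples this large, a normal (z) approximation of the two-sample t-test is a reasonable sketch:

```python
import math

# Summary statistics from the example. The standard deviations (10 minutes)
# are ASSUMED for illustration -- the article gives only the means.
n_men, mean_men, sd_men = 62, 37.0, 10.0
n_women, mean_women, sd_women = 53, 41.0, 10.0

# Standard error of the difference in means (Welch), then the z statistic
se = math.sqrt(sd_men ** 2 / n_men + sd_women ** 2 / n_women)
z = (mean_women - mean_men) / se

# One-sided p-value under H0 "no difference": P(Z >= z)
p_value = 0.5 * math.erfc(z / math.sqrt(2))

alpha = 0.05
print(f"z = {z:.2f}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

With these assumed spreads, the 4-minute difference is statistically significant at α = 0.05; with larger standard deviations it would not be — which is the whole point of running the test rather than eyeballing the means.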
Here is a good video to consolidate the material for those who are still confused:
<iframe width="560" height="315" src="https://www.youtube.com/embed/42QuXLucH3Q" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
## Choosing statistical criteria
The general idea of hypothesis testing involves:
1. State two hypotheses so that only one can be right.
2. Collect the evidence (data).
3. Formulate an analysis plan and set the criteria for a decision.
4. Analyze the sample data — compute the test statistic.
5. Based on the available evidence (data), either accept or reject the null hypothesis.
The first two steps are clear — we make an assumption and gather all the needed data. The remaining steps vary based on the **statistical criteria** used.
To decide which of the hypotheses (null or alternative) should be accepted, **statistical criteria** are used. A criterion includes a method for calculating a certain indicator, on the basis of which the decision is made to reject or accept the hypothesis, as well as the conditions for that decision.
For data obtained on metric scales (interval or ratio), with distributions close to normal, **parametric methods** are used, based on indicators such as the mean and standard deviation. They are used, for example, to determine whether samples came from the same population. In particular, the **[Student's t-test](https://en.wikipedia.org/wiki/Student%27s_t-test)** is used to assess the significance of the difference in means between two samples, while the **[F-test](https://en.wikipedia.org/wiki/F-test)** or **[analysis of variance (ANOVA)](https://en.wikipedia.org/wiki/Analysis_of_variance)** is used to examine differences between three or more samples. The ANOVA test is also often used to determine the influence that independent variables have on the dependent variable in a regression study.
If the researcher is dealing with data obtained on non-metric (nominal or ordinal) scales, or the samples are too small to be sure that the underlying populations are normally distributed, **nonparametric methods** are used — the **[Chi-squared test](https://en.wikipedia.org/wiki/Chi-squared_test)**, the **[Mann–Whitney U test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test)**, the **[Wilcoxon signed-rank test](https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test)**. These methods are very simple in terms of both calculation and application.
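To make the nonparametric case concrete, here is a minimal, stdlib-only sketch of the Mann–Whitney U test with a normal approximation for the p-value. It assumes there are no tied values and moderately sized samples, and the sample data is made up for illustration; a real analysis would reach for `scipy.stats.mannwhitneyu` instead:

```python
import math

def mann_whitney_u(x, y):
    """Mann-Whitney U with a normal approximation for the p-value.
    Simplified sketch: assumes no tied values across the samples."""
    nx, ny = len(x), len(y)
    ranked = sorted(x + y)
    rank = {v: i + 1 for i, v in enumerate(ranked)}   # 1-based ranks
    rank_sum_x = sum(rank[v] for v in x)
    u = rank_sum_x - nx * (nx + 1) / 2                # U statistic for sample x
    mu = nx * ny / 2                                  # mean of U under H0
    sigma = math.sqrt(nx * ny * (nx + ny + 1) / 12)   # std of U under H0
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))              # two-sided p-value
    return u, p

# Two clearly separated samples -> U = 0 and a small p-value
u, p = mann_whitney_u([1, 2, 3, 4, 5], [10, 11, 12, 13, 14])
print(f"U = {u}, p = {p:.4f}")
```

Because the test works on ranks rather than raw values, it makes no assumption about the shape of the underlying distributions — which is exactly why it fits the non-metric/small-sample case described above.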
The choice of statistical criterion also depends on whether the samples are independent (e.g., taken from two different groups) or dependent (e.g., reflecting the results of the same group of subjects before and after exposure, or after two different treatments).
## Conclusion
Hypothesis testing is most often used when conducting A/B tests, as well as simply for gaining new knowledge about the data.
Hypothesis testing is an essential procedure in statistics. A hypothesis test evaluates two mutually exclusive statements about a population to determine which statement is best supported by the data sample.
There are defined steps in the hypothesis testing algorithm, but one of them can vary — choosing the statistical criterion. It is not trivial and requires some knowledge of the analyzed data, as well as of the conditions under which each criterion can be used.
Do not believe anyone but the data and test even the most obvious hypotheses. Cheers!
**Additional material:**
[A Dirty Dozen: Twelve P-Value Misconceptions](http://www.perfendo.org/docs/BayesProbability/twelvePvaluemisconceptions.pdf)

# Big Data file formats

*2019-08-18 · https://luminousmen.com/post/34*

Apache Spark supports many different data formats, such as the ubiquitous CSV format and the web-friendly JSON format. Common formats used primarily for big data analytical purposes are Apache Parquet and Apache Avro.
In this post, we're going to cover the properties of these four formats — CSV, JSON, Parquet and Avro — with Apache Spark.
## CSV
CSV files (comma-separated values) are commonly used to exchange tabular data between systems as plain text. CSV is a row-based file format: every line of the file is a row of the table. A CSV file usually contains a header row that provides column names for the data; otherwise, the files are considered only partially structured. CSV cannot natively represent hierarchical or relational data; data connections are typically organized using multiple CSV files. Foreign keys are stored in columns of one or more files, but the links between these files are not expressed by the format itself. Also, the CSV format is not fully standardized: files can use delimiters other than commas, such as tabs or spaces.
Another property of CSV files is that they are only splittable when the file is raw and uncompressed, or when a splittable compression format such as [bzip2](https://en.wikipedia.org/wiki/Bzip2) or [lzo](https://ru.wikipedia.org/wiki/LZO) is used (note: lzo needs to be indexed to be splittable).
**Advantages**:
- CSV is human-readable and easy to edit manually;
- CSV provides a straightforward information schema;
- CSV is processed by almost all existing applications;
- CSV is simple to implement and parse;
- CSV is compact: XML needs a start tag and an end tag for each column in each row, while CSV writes the column headers only once.
**Disadvantages**:
- CSV allows working with flat data. Complex data structures need to be handled aside from format;
- No support for column types. There is no distinction between text and numeric columns;
- No standard way to represent binary data;
- Problems with importing CSV (e.g., no distinction between NULL and an empty string);
- Poor support of special characters;
- Lack of universal standard.
Despite these limitations, CSV files are a popular choice for data sharing, as they are supported by a wide range of business, consumer, and scientific applications. Similarly, most batch and streaming data processing frameworks (for example, Spark and Hadoop) natively support serialization and deserialization of CSV files and offer ways to add a schema on read.
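The type-related disadvantages are easy to see with Python's standard `csv` module — every parsed value comes back as a string, and a missing value is indistinguishable from an empty string (the tiny in-memory "file" below is illustrative):

```python
import csv
import io

# A tiny in-memory CSV "file"; the second row has a missing score
raw = "id,name,score\n1,alice,9.5\n2,bob,\n"
rows = list(csv.DictReader(io.StringIO(raw)))

print(rows[0])                 # every value is a string, even id and score
print(type(rows[0]["id"]))     # str -- the format carries no type information
print(rows[1]["score"] == "")  # missing value and empty string look the same
```

Any casting to integers, floats or dates is entirely the application's responsibility, which is exactly why Spark lets you attach a schema when reading CSV.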
## JSON
JSON (JavaScript Object Notation) represents data as key-value pairs in a partially structured format. JSON is often compared to XML because both can store data in a hierarchical format, with child data nested inside parent data. Both formats are self-describing and human-readable, but JSON documents are usually much smaller. They are therefore more often used in network communication, especially with the rise of REST-based web services.
Since so much data is already transferred in JSON format, most web languages support JSON natively or through external serialization libraries. Thanks to this support, JSON is used for logical representations of data structures, as an interchange format for hot data, and in cold data stores.
Many batch and stream data processing frameworks natively support JSON serialization and deserialization. Although the data contained in JSON documents is ultimately often stored in more performance-optimized formats such as Parquet or Avro, the JSON documents serve as raw data, which is very important for reprocessing (if necessary).
JSON files have several **advantages**:
- JSON supports hierarchical structures, simplifying the storage of related data in one document and the presentation of complex relations;
- Most languages provide simplified JSON serialization libraries or built-in support for JSON serialization/deserialization;
- JSON supports lists of objects, helping to avoid erratic transformations of lists into a relational data model;
- JSON is a widely used file format for NoSQL databases such as MongoDB, Couchbase and Azure Cosmos DB;
- Built-in support in most modern tools.
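A small example of the hierarchical structures mentioned above, using Python's standard `json` module — nested objects and a list of objects live in a single document and survive a serialization round trip (the record itself is made up):

```python
import json

# A hierarchical record: a nested object and a list of objects in one document
order = {
    "id": 42,
    "customer": {"name": "Alice", "country": "US"},
    "items": [
        {"sku": "A1", "qty": 2},
        {"sku": "B2", "qty": 1},
    ],
}

encoded = json.dumps(order)      # serialize to a compact text document
decoded = json.loads(encoded)    # deserialize back to native structures

print(decoded["items"][0]["sku"])  # nested access survives the round trip
```

Modeling the same order in CSV would require at least two files (orders and items) plus a foreign key convention — which is precisely the relational workaround described in the CSV section.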
## [Parquet](https://parquet.apache.org/)
Launched in 2013, Parquet was developed by Cloudera and Twitter to serve as a column-based storage format, optimized for multi-column datasets. Because data is stored by columns, it can be highly compressed (compression algorithms perform better on data with low information entropy, which is usually the case within a column) and splittable. The developers of the format claim that this storage format is ideal for Big Data problems.
Unlike CSV and JSON, Parquet files are **binary** files that contain **metadata** about their content. So, without reading/parsing the content of the file(s), Spark can just rely on the metadata to determine column names, compression/encodings, data types and even some basic statistics. The column metadata for a Parquet file is stored at the end of the file, which allows for fast, one-pass writing.
Parquet is optimized for the *[Write Once Read Many (WORM)](https://en.wikipedia.org/wiki/Write_once_read_many)* paradigm. It's slow to write, but incredibly fast to read, especially when you're only accessing a subset of the total columns. Parquet is a **good choice for read-heavy workloads**. For use cases requiring operating on entire rows of data, a format like CSV or AVRO should be used.
The **advantages** of data storage in Parquet:
- Parquet is a *columnar format*. Only required columns would be fetched/read, it *reduces the disk I/O*. This concept is called *projection pushdown*;
- Schema travels with the data so data is *self-describing*;
- Despite the fact that it is created for HDFS, data can be stored in other file systems, such as GlusterFs or on top of NFS;
- Parquet files are just files, which means it is easy to work with them: move, back up, and replicate them;
- Native support in Spark out of the box provides the ability to simply take and save the file to your storage;
- Parquet provides very good compression, up to 75%, even when used with compression formats like [snappy](https://en.wikipedia.org/wiki/Snappy_(compression));
- As practice shows, this format is the fastest for read-heavy workflows compared to other file formats;
- Parquet is well suited for data warehouse solutions where aggregations are required on a certain column over a huge set of data;
- Parquet can be read and written using the Avro API and Avro schema (which suggests storing all raw data in Avro format and all processed data in Parquet);
- It also provides *predicate pushdown*, thus reducing further disk I/O cost.
**Disadvantages**:
- The column-based design makes you think about the schema and data types;
- Parquet does not always have native support in tools other than Spark;
- It does not support data modification and has limited schema evolution. Spark can merge schemas if they change over time (you need to specify a special option when reading), but to change something in an already existing file you can do nothing short of overwriting it; the one exception is that you can add new columns.
### Predicate Pushdown / Filter Pushdown
The basic idea of predicate pushdown is that certain parts of a query (the predicates) can be "pushed" to where the data is stored. For example, when we specify filter criteria, the data store tries to filter records at the time they are read from disk. The advantage of predicate pushdown is less disk I/O and hence better overall performance. Otherwise, the whole dataset would be brought into memory and then filtered, which requires a lot of memory.
This optimization can drastically reduce query/processing time by filtering out data earlier rather than later. Depending on the processing framework, predicate pushdown can optimize your query by doing things like filtering data before it is transferred over the network, filtering data before loading into memory, or skipping reading entire files or chunks of files.
This concept is followed by most RDBMSs and has been adopted by big data storage formats like Parquet and ORC as well.
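The idea can be sketched in a few lines of Python. This is only a toy model of a data store (a list of text lines stands in for a file, and the names are illustrative); the point is that the pushdown variant applies the predicate as records are read, so rejected rows never accumulate in memory:

```python
# Toy contrast: load everything then filter, vs. filter while reading.

def read_all_then_filter(lines, predicate):
    records = [int(line) for line in lines]   # the whole dataset lands in memory
    return [r for r in records if predicate(r)]

def read_with_pushdown(lines, predicate):
    out = []
    for line in lines:                        # records are examined one by one
        record = int(line)
        if predicate(record):                 # rejected rows are dropped at read time
            out.append(record)
    return out

data = [str(i) for i in range(10)]
keep_even = lambda r: r % 2 == 0

result_naive = read_all_then_filter(data, keep_even)
result_pushdown = read_with_pushdown(data, keep_even)
print(result_pushdown)  # [0, 2, 4, 6, 8]
```

Both produce the same answer; the difference is where the work happens. In a real engine, the pushed-down predicate can additionally use Parquet's per-chunk statistics to skip entire chunks without decoding them.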
### Projection Pushdown
![Predicate Pushdown / Filter Pushdown](/media/big-data-file-formats/predicate-pushdown.jpg)
When data is read from the data store, only the columns required by the query are read — not all the fields. Columnar formats like Parquet and ORC follow this concept, which results in better I/O performance.
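A toy sketch of the same idea in Python: the table below is stored once as rows and once as columns, and a query for a single field touches every whole record in the row layout but only one list in the columnar layout:

```python
# The same tiny table in two layouts (data is made up for illustration)
rows = [
    {"user": "a", "age": 31, "city": "NYC"},
    {"user": "b", "age": 25, "city": "LA"},
]
columns = {
    "user": ["a", "b"],
    "age": [31, 25],
    "city": ["NYC", "LA"],
}

# Row store: every full record must be visited to extract one field
ages_from_rows = [r["age"] for r in rows]

# Column store: the query touches exactly one column's data
ages_from_columns = columns["age"]

print(ages_from_rows == ages_from_columns)  # same answer, far less data touched
```

On disk the difference is dramatic: a row format must read `user` and `city` bytes just to get at `age`, while a columnar format seeks straight to the `age` column chunk.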
## [Avro](https://avro.apache.org/)
Apache Avro was released by the Hadoop working group in 2009. It is a **row-based format** that is highly splittable. It is also described as a data serialization system similar to Java Serialization. The schema is stored in JSON format, while the data is stored in *binary format*, minimizing file size and maximizing efficiency. Avro has robust support for schema evolution by managing added fields, missing fields, and fields that have changed. This allows old software to read new data and new software to read old data — a critical feature if your data has the potential to change.
With Avro’s capacity to manage schema evolution, it’s possible to update components independently, at different times, with low risk of incompatibility. This saves applications from having to write if-else statements to process different schema versions and saves the developer from having to look at old code to understand old schemas. Because all versions of the schema are stored in a human-readable JSON header, it’s easy to understand all the fields that you have available.
Avro supports many different programming languages. Because the schema is stored in JSON while the data is binary, Avro is a relatively compact option for both persistent data storage and wire transfer. Avro is typically the format of choice for write-heavy workloads, given that it is easy to append new rows.
**Advantages**:
- Avro is a language-neutral data serialization system;
- Avro stores the schema in the header of the file so data is *self-describing*;
- Avro formatted files are *splittable and compressible* and hence it's a good candidate for data storage in Hadoop ecosystem;
- The schema used to read an Avro file need not be the same as the schema that was used to write it. This makes it possible to add new fields independently;
- Just as with Sequence Files, Avro files contain sync markers to separate blocks. This makes them highly splittable;
- These blocks can be compressed using compression formats such as snappy.
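As an illustration of the schema evolution mentioned above, here is a hypothetical Avro schema (the record and field names are made up). The `default` on the newly added `country` field is what allows a reader using this schema to consume old data that was written before the field existed:

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": "string"},
    {"name": "country", "type": "string", "default": "unknown"}
  ]
}
```

Avro's schema resolution rules fill in `"unknown"` when an old record lacks `country`, and silently ignore fields the reader's schema does not know about — which is how old and new software keep interoperating.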
## Summary
| | CSV | JSON | Parquet | Avro |
|---------------------------------|------|------|---------|------|
| Columnar | No | No | Yes | No |
| Compressible | Yes | Yes | Yes | Yes |
| Splittable | Yes\* | Yes\* | Yes | Yes |
| Human readable | Yes | Yes | No | No |
| Support complex data structures | No | Yes | Yes | Yes |
| Schema evolution | No | No | Yes | Yes |
\* JSON has the same splittability conditions when compressed as CSV, with one extra difference: when the "wholeFile" option is set to true (re: [SPARK-18352](http://web.archive.org/web/20190329085319/https://issues.apache.org/jira/browse/SPARK-18352)), JSON is NOT splittable.
## Performance tests
In order to compare the formats with each other, I created a set of tests using a [Netflix dataset](http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a). This is a so-called narrow dataset — it has only three columns and a large number of rows.
All the tests were run on my laptop (I won't list its exact specs — I don't think they matter for the comparison) using PySpark 2.4.3 via `spark-submit`. Experiments were run 3 times for each of the formats and then averaged. The test scripts are public: [here](https://gist.github.com/luminousmen/ac38d8935ea71ac738ba92677b0eb703).
### Space utilization per format
The basic function of any file is to store data. In the big data sphere, it is a necessity to store different types of data for different purposes — at the very least, the raw data. How much space this data occupies determines how much money it will cost.
![Space utilization per format](/media/big-data-file-formats/benchmark-1.png)
It should be noted that all configurations were left at their defaults and no compression was used anywhere. I admit that an aggressively gzipped source CSV file might take no more than 1.5 GB, so the advantage of the binary formats would not be so dramatic. Conclusion — do not use JSON for storing raw data.
### Ingestion latency per format
Next, I tried to save the data in various file formats and calculate the latency.
![Ingestion latency per format](/media/big-data-file-formats/benchmark-2.png)
It may be that most of this time is spent reading the 2-gigabyte source data, with everything else taking 5-10 seconds (I didn't check the Spark UI). Anyway, the results are understandable — plain text will always be easier to write than encoding data and collecting the metadata of more complex formats (like Avro and Parquet).
### Random data lookup latency per format
In this test, I try to measure the latency of fetching records by record identifier (using the `sample` transformation in Spark).
![Random data lookup latency per format](/media/big-data-file-formats/benchmark-3.png)
An individual record stored in CSV, JSON or Avro format can only be retrieved by a brute-force scan of an entire data partition. In Parquet, the columnar nature of the format allows partition scans to be performed relatively quickly: thanks to column projection and column predicate pushdown, the scan's input set is ultimately reduced.
### Basic statistics per format
This test shows how quickly we can calculate frequently used column statistics (`min`, `max`, `count`).
![Basic statistics per format](/media/big-data-file-formats/benchmark-4.png)
Parquet and Avro have a trick: to calculate the number of rows in a dataset, it is enough to [read the metadata](https://github.com/dennyglee/databricks/blob/master/misc/parquet-count-metadata-explanation.md). For CSV and JSON, the situation is worse — they have to parse every byte.
### Processing data per format
This test filters the data on the time column.
![Processing data per format](/media/big-data-file-formats/benchmark-5.png)
### Grouping by column per format
This test uses a more complex transformation — grouping by column.
![Grouping by column per format](/media/big-data-file-formats/benchmark-6.png)
### Get distinct values latency per format
This test will try to measure the latency of getting distinct values in the column.
![Get distinct values latency per format](/media/big-data-file-formats/benchmark-7.png)
The result looks much the same as for random data lookup, but the `distinct` operation significantly complicates the task. For simplicity, we can assume that it adds a reduce phase to the job, which means the overhead of exchanging data between workers is also being measured.
### Lessons learned from performance tests
- CSV should generally be the fastest to write, JSON the easiest for a human to understand and Parquet the fastest to read a subset of the columns, while Avro is the fastest to read all columns at once.
- JSON is the standard for communication on the web. APIs and websites constantly exchange JSON because of its usability.
- Parquet and Avro are definitely more optimized for big data needs — splittability, compression support, great support for complex data structures — but readability and write speed are quite bad.
- The unit of parallel data access for Parquet and Avro is an HDFS file block, which makes it very easy to evenly distribute processing across all the resources available on a Hadoop cluster.
- When choosing a data storage format in Hadoop, you need to consider many factors, such as integration with third-party applications, schema evolution, support for specific data types... But if you put performance first, the tests above convincingly show that Parquet is your best choice.
- Notably, **compression** algorithms play a significant role not only in reducing data volume but also in enhancing the performance of data ingestion and data access — but this part of the picture wasn't tested here.
- **Apache Avro** has proven to be a fast universal encoder for structured data. Due to very efficient serialization and deserialization, this format can guarantee very good performance whenever access to all the attributes of a record is required at the same time.
- According to the tests, columnar stores like **Apache Parquet** delivered very good flexibility between fast data ingestion, fast random data lookup, and scalable data analytics. In many cases, this provides the added advantage of keeping systems simple as only one type of technology is needed for storing the data and serving different use cases (random data access and analytics).
- Whatever you do, never use the JSON format: in almost all the tests, it proved to be the worst format to use.
*P.S.* Columnar formats are generally used when you need to query only a few columns rather than all the fields of a row, because their column-oriented storage pattern is well suited to that access pattern. Row formats, on the other hand, are used when you need to access all the fields of a row. That is why Avro is generally used to store raw data: during processing, usually all the fields are required.

# How to make color emoji work on Chrome for Linux

*2019-07-20 · https://luminousmen.com/post/33*

To support emoji on Linux, we will use the [Noto Color Emoji font](https://www.google.com/get/noto/help/emoji/).
The script is simple:
```bash
# Download the Noto Color Emoji font
wget https://noto-website.storage.googleapis.com/pkgs/NotoColorEmoji-unhinted.zip
# Install it into the system font directory
sudo mkdir -p /usr/local/share/fonts/truetype
sudo unzip NotoColorEmoji-unhinted.zip -d /usr/local/share/fonts/truetype/noto
sudo rm /usr/local/share/fonts/truetype/noto/LICENSE_OFL.txt
# Remove the Symbola fonts so they don't shadow the color emoji font
sudo apt-get remove ttf-ancient-fonts-symbola fonts-symbola
sudo chmod 644 /usr/local/share/fonts/truetype/noto/NotoColorEmoji.ttf
# Rebuild the font cache and clean up
fc-cache -f -v
rm NotoColorEmoji-unhinted.zip
```
Or using gist script:
```bash
curl https://gist.githubusercontent.com/luminousmen/7ba2f1f672213a396e8c42a3802348df/raw/434c8ef6bad8bb0df8c794bf598021eef7eebc7c/ubuntu-chrome-emoji.sh | bash
```
This script uses the [apt-get](https://linux.die.net/man/8/apt-get) package management utility; please adapt it in order to run it on your system.

# Spark. Anatomy of Spark application

*2019-07-14 · https://luminousmen.com/post/32*

[Apache Spark](https://spark.apache.org/) is considered a powerful complement to [Hadoop](https://hadoop.apache.org/), big data's original technology. Spark is a more accessible, powerful and capable big data tool for tackling various big data challenges. It has become the mainstream and most in-demand big data framework across all major industries. Spark has been part of the Hadoop ecosystem since Hadoop 2.0, and it is one of the most useful technologies for Python big data engineers.
This series of posts is a single-stop resource that gives an overview of the Spark architecture; it's good for people looking to learn Spark.
Whole series:
- [Things you need to know about Hadoop and YARN being a Spark developer](https://luminousmen.com/post/hadoop-yarn-spark)
- [Spark core concepts explained](https://luminousmen.com/post/spark-core-concepts-explained)
- [Spark. Anatomy of Spark application](https://luminousmen.com/post/spark-anatomy-of-spark-application)
---
## Architecture
![Spark yarn architecture](/media/spark-yarn-architecture.jpg)
The components of the spark application are:
- Driver
- Application Master
- Spark Context
- Executors
- Cluster Resource Manager(aka Cluster Manager)
Spark uses a master/slave architecture with a central coordinator, called the **Driver**, and a set of worker processes, called **Executors**, which are located on different cluster nodes.
### Driver
The Driver (aka the application's driver process) is responsible for converting a user application into smaller execution units called **tasks** and then scheduling them to run on the executors. The Driver is also responsible for the execution of the Spark application and for returning the status/results back to the user.
Spark Driver contains various components – `DAGScheduler`, `TaskScheduler`, `BackendScheduler` and `Block Manager`. They are responsible for the translation of user code into actual spark jobs executed on the cluster.
Other Driver properties:
- can run in an independent process, or on one of the worker nodes for high availability (HA);
- stores the metadata about all the Resilient Distributed Datasets and their partitions;
- is created once the user submits the Spark application to the cluster manager (YARN in our case);
- runs in its own JVM;
- optimizes the logical DAG of transformations and combines them into stages where possible;
- brings up the Spark WebUI with application details.
### Application Master
As we described in the [first post](https://luminousmen.com/post/hadoop-yarn-spark) — Application Master is a framework-specific entity and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the application tasks.
The Application Master is created at the same time as the Driver, on the same node (in cluster mode), when the user submits the Spark application using `spark-submit`.
The Driver informs the Application Master of the application's executor requirements, and the Application Master negotiates resources with the Resource Manager to host these executors.
In standalone mode, the Spark Master plays the role of the cluster manager.
### Spark Context
The Spark Context is the main entry point for Spark functionality and thus the heart of any Spark application. It allows the Spark Driver to access the cluster through a Cluster Resource Manager, and it can be used to create RDDs, [accumulators and broadcast variables](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#shared-variables) on the cluster. The Spark Context also keeps track of live executors by sending heartbeat messages regularly.
The Spark Context is created by the Spark Driver for each individual Spark application when it is first submitted by the user. It exists throughout the entire life of the Spark application.
The Spark Context terminates once the Spark application completes. Only one Spark Context can be active per JVM; you must `stop()` the active Spark Context before creating a new one.
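A minimal sketch of this lifecycle, assuming a local PySpark installation (the application names here are illustrative, not from the post):

```python
from pyspark.sql import SparkSession

# Create a session (and with it the underlying Spark Context).
spark = SparkSession.builder \
    .appName("lifecycle-demo") \
    .master("local[2]") \
    .getOrCreate()

sc = spark.sparkContext
print(sc.applicationId)  # the context is alive and registered

# Only one Spark Context can be active per JVM,
# so stop it before creating a new one:
spark.stop()

spark2 = SparkSession.builder.appName("second-app").getOrCreate()
spark2.stop()
```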
### Cluster Resource Manager
The Cluster Manager in a distributed Spark application is the process that monitors, governs, and reserves resources in the form of containers on the cluster worker nodes. These containers are requested by the Application Masters and allocated to them when resources are released or become available.
Once the Cluster Manager allocates the containers, the Application Master passes the containers' resources back to the Spark Driver, and the Spark Driver is responsible for executing the various stages and tasks of the Spark application.
### Executors
Executors are processes on the worker nodes whose job is to execute the assigned tasks. These tasks are executed on the partitioned RDDs on the worker nodes, and the results are returned to the Spark Driver.
Executors launch once at the beginning of a Spark application and then run for its entire lifetime; this is known as "static allocation of executors". However, users can also opt for dynamic allocation, where Spark executors are added or removed dynamically to match the overall workload. Even if a Spark executor fails, the Spark application can continue.
Executors provide in-memory storage for RDD partitions that are cached locally in Spark applications (via the `BlockManager`).
Other executor properties:
- stores the data in the cache in JVM heap or on HDDs
- reads data from external sources
- writes data to external sources
- performs all the data processing
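The dynamic allocation mentioned above is switched on through configuration. A sketch (the property names are standard Spark settings; the values are illustrative):

```python
from pyspark.sql import SparkSession

# Let Spark grow and shrink the executor pool with the workload.
spark = SparkSession.builder \
    .appName("dynamic-allocation-demo") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.minExecutors", "1") \
    .config("spark.dynamicAllocation.maxExecutors", "10") \
    .config("spark.shuffle.service.enabled", "true") \
    .getOrCreate()
```

The external shuffle service is enabled so that shuffle files survive when an executor is removed.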
## Spark Application running steps
At this level of understanding, let's create and break down one of the simplest Spark applications.
```python
from pyspark.sql import SparkSession

# initialization of the Spark session (the entry point to Spark)
spark = SparkSession\
    .builder\
    .appName("PythonWordCount")\
    .getOrCreate()
sc = spark.sparkContext

# read data from HDFS; as a result we get an RDD of lines
linesRDD = sc.textFile("hdfs://...")
# from the RDD of lines create an RDD of words
wordsRDD = linesRDD.flatMap(lambda line: line.split(" "))
# from the RDD of words make an RDD of (word, count) tuples,
# where the count starts at 1
wordCountRDD = wordsRDD.map(lambda word: (word, 1))
# combine elements with the same word value
resultRDD = wordCountRDD.reduceByKey(lambda a, b: a + b)
# write it back to HDFS
resultRDD.saveAsTextFile("hdfs://...")
spark.stop()
```
![Spark word count example](/media/spark-wordcount.jpg)
1. When we submit a Spark application in cluster mode, the `spark-submit` utility will interact with the Cluster Resource Manager to start the Application Master.
2. The Resource Manager takes responsibility for allocating the required container in which the Application Master will be launched. Then the Resource Manager launches the Application Master.
3. The Application Master registers itself with the Resource Manager. Registration allows the client program to ask the Resource Manager for specific information, which allows it to communicate directly with its own Application Master.
4. Next, the Spark Driver runs in the Application Master container (in the case of cluster mode).
5. The Driver implicitly converts user code containing transformations and actions into a logical plan, a DAG of RDDs. All RDDs are created on the Driver and do not do anything until an action is called. At this step, the Driver also performs optimizations such as pipelining transformations.
6. After that, it converts the DAG into a physical execution plan and creates physical execution units called tasks under each stage.
7. Now the Driver talks to the Cluster Manager and negotiates resources. The Cluster Manager then allocates containers, launches executors on the allocated containers and assigns tasks to run on behalf of the Driver. When the executors start, they register themselves with the Driver, so the Driver has a complete view of the executors that are executing the tasks.
8. At this point, the Driver sends the tasks to the executors based on data placement. The Cluster Manager is responsible for the scheduling and allocation of resources across the worker machines forming the cluster.
9. Upon successful receipt of the containers, the Application Master launches the container by providing the Node Manager with a container configuration.
10. Inside the container, the user application code starts. It provides information (stage of execution, status) to Application Master.
11. So, at this step, we actually start executing our code. Our first RDD is created by reading the data from HDFS into different partitions on different nodes in parallel, so each node has a subset of the data.
12. After reading the data, we have two map transformations, which will be executed in parallel on each partition.
13. Then we have a `reduceByKey` transformation; it is not a standard pipelined operation like `map`, so it creates an additional stage. It combines records with the same keys, moving data between nodes and partitions (a shuffle) to bring the same keys together.
14. Then we have the action operation — writing back to HDFS — which triggers the whole DAG execution.
15. During the user application execution, the client communicates with the Application Master to obtain the status of the application.
16. When the application has completed execution and all the necessary work has been finished, the Application Master deregisters from Resource Manager and shuts down, freeing its container for other purposes.
Data Science. The Central Limit Theorem and sampling
https://luminousmen.com/post/31
2019-07-07T00:00:00Z
There are a lot of engineers who have never been involved in statistics or data science. So, in order to build data science pipelines or rewrite code produced by data scientists into adequate, easily maintained code, many nuances and misunderstandings arise on the engineering side. For these Data/ML engineers and novice data scientists, I make this series of articles. I'll try to explain some basic approaches in plain English and, based on them, explain some of the Data Science model concepts.
The whole series:
- [Data Science. Probability](https://luminousmen.com/post/data-science-probability)
- [Data Science. Bayes theorem](https://luminousmen.com/post/data-science-bayes-theorem)
- [Data Science. Probability distributions](https://luminousmen.com/post/data-science-probability-distributions)
- [Data Science. Measures](https://luminousmen.com/post/data-science-measures)
- [Data Science. Correlation](https://luminousmen.com/post/data-science-correlation)
- [Data Science. The Central Limit Theorem and sampling](https://luminousmen.com/post/data-science-central-limit-theorem)
- [Demystifying hypothesis testing](https://luminousmen.com/post/demystifying-hypothesis-testing)
---
The practice of studying random phenomena shows that, although the results of individual observations, even those carried out under the same conditions, can differ, the average results of a sufficiently large number of observations are stable and fluctuate only weakly from one set of observations to another. The theoretical basis for this remarkable property of random phenomena is the law of large numbers; the Central Limit Theorem goes further and describes the distribution of those averages.
**According to the Central Limit Theorem, as the sample size increases, the average value of a data sample will get closer to the average value of the whole population, and its distribution will be approximately normal.** The importance of this theorem comes from the fact that **this is true regardless of the distribution of the population.**
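A quick way to see this numerically is to start from a deliberately non-normal population. This minimal sketch (the exponential distribution is my illustrative choice, not from the post) uses only the standard library:

```python
import random
import statistics

random.seed(42)

# A clearly non-normal population: exponential with mean 1.
def sample_mean(n):
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

# Draw many sample means for a moderately large sample size.
means = [sample_mean(100) for _ in range(2000)]

# The sample means cluster around the population mean (1.0)...
print(round(statistics.mean(means), 2))
# ...and their spread shrinks roughly like 1/sqrt(n),
# here close to 1/sqrt(100) = 0.1:
print(round(statistics.stdev(means), 2))
```

A histogram of `means` would look bell-shaped even though the population itself is heavily skewed.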
To illustrate the concept, check the following animation of the die roll and the code:
![Die roll](https://luminousmen.com/media/die-roll.gif)
```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation

# n simulations of the die roll
n, m = 200, 31
# In each simulation, there is one trial more than in the previous one
avg = []
for i in range(2, n):
    a = np.random.randint(0, m, i)
    avg.append(np.average(a))

# Function that will plot the histogram, where current is the latest frame
def clt(current):
    plt.cla()
    # if the animation is at the last frame, stop it
    if current == n:
        a.event_source.stop()
    plt.xlim(0, m)
    plt.hist(avg[0:current])
    plt.gca().set_title('Expected value of die rolls')
    plt.gca().set_xlabel('Average from die roll')
    plt.gca().set_ylabel('Frequency')
    plt.annotate('Die roll = {}'.format(current), [3, 27])

fig = plt.figure()
a = animation.FuncAnimation(fig, clt, interval=1, save_count=n)
a.save('animation.gif', writer='imagemagick', fps=10)
```
In the practical world, to understand the characteristics of a population, scientists usually sample data and work with sample statistics. They deal with samples to understand and generalize insights about the population. With a big enough sample size, **the Central Limit Theorem allows us to apply the properties of the normal distribution in this process.**
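One practical payoff, sketched here under my own assumptions (the skewed population and sample size are illustrative, not from the post), is a rough 95% confidence interval for a population mean built from a single sample, justified by the approximate normality of the sample mean:

```python
import math
import random
import statistics

random.seed(7)

# A single sample from a skewed population (exponential, true mean 1.0):
sample = [random.expovariate(1.0) for _ in range(400)]

mean = statistics.mean(sample)
# Standard error of the mean: sample spread shrunk by sqrt(n).
sem = statistics.stdev(sample) / math.sqrt(len(sample))

# Because the sample mean is approximately normal (CLT),
# mean +/- 1.96 * sem covers the true mean about 95% of the time.
low, high = mean - 1.96 * sem, mean + 1.96 * sem
print(round(low, 2), round(high, 2))
```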
We already know that the normal distribution is special. We can also use some of its properties for distributions that, strictly speaking, cannot be called normal.
## Sampling
Earlier we talked about how — [knowing the theoretical distribution](https://luminousmen.com/post/data-science-probability-distributions), [knowing the theory of probability](https://luminousmen.com/post/data-science-bayes-theorem), [knowing the parameters of the distribution](https://luminousmen.com/post/data-science-measures) — to use a sample to evaluate what happens in the general population. But there is a problem: even the best theory and the best-known distribution will not help us if the sample on which we estimate is designed incorrectly or does not represent the population.
![Sampling concept](https://luminousmen.com/media/sampling_concept.jpg)
*Why sampling?*
Well, the logical question is: if we take a sample, we discard some data in any case. Why not take all the data and work with all the elements of the population?
First, the data needs to be collected, and that is very **expensive**. Think about surveys. Even if we are not talking about the population of all Americans, but want to represent, for example, all the residents of [California](https://en.wikipedia.org/wiki/California), that is 39 million people. In order to interview 39 million people, we need budgets which, of course, we don’t have. In addition, even if we had such budgets, it is almost impossible to reach all residents of any state.
And the idea of sampling is, in general, simple — take a small set which is nevertheless quite heterogeneous in terms of the key criteria that represent our general population. That is, not to interview all California residents, but to take some kind of slice that will represent California by the criteria important to us.
The idea of sampling is very similar to tasting soup. When we cook a soup that contains a lot of ingredients, cut differently and added at different times, we need to evaluate its quality. Clearly, we do not need to eat all the soup in order to evaluate how tasty it turned out; if we did, the whole idea of collective cooking would be somewhat absurd. Instead, we take a spoon, scoop up a portion and, **on the basis of this small portion, try to assess whether the soup is done** or something needs to change. If we just scoop from the top, we get a spoon full of water, which tells us nothing about the ingredients (vegetables or meat). If we scoop carelessly from the bottom, we may only get large pieces and learn nothing about the small ones. To get all the ingredients of our soup into the sample from which we judge the taste, we first mix the soup well and then scoop; then all the ingredients — large, small, and water — end up in the spoon, and we can assess from this one portion how well everything is cooked. The same goes for sampling.
The analog of this mixing in the case of sampling is **random sampling**. It is a random selection whose essence is to ensure an equal probability for each element of the population to get into the sample; this is what provides the representativeness of the sample.
*What is so bad about a non-representative sample?*
Well, I will highlight a few problems. The simplest and most understandable example is when we select an **available sample** (a convenience sample).
![Available sample](https://luminousmen.com/media/available_sample.jpg)
That is, we study the preferences of young people, but since we study or work at a certain university, we only interview students of our university and claim that we will learn about all young people on the basis of this study. Obviously, we will not learn about all the young people, because the sample is a very specific group. The available sample gives us some part of the picture, but not a complete one. In this case, a substantial part of young people who do not study at universities, or who study at other universities, will not be covered by our sample. Another problem: we can select only those people who want to talk to us. That is, the trouble with such non-representative samples is that we do not give different people and different points of view equal opportunities to be represented in our sample. A random sample at least formally guarantees the possibility of such representation.
### Probability sampling methods
![Simple random sampling](https://luminousmen.com/media/random_sampling.jpg)
The simplest and most understandable method is a **simple random sample**, when we have a complete list of the elements of the general population. For example, our population is all owners of NYC telephone numbers, and we have a complete list of these numbers. We turn on a random number generator, select the number of objects we need and call these phone numbers — a simple random selection.
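Drawing a simple random sample is one line with Python's standard library. A toy sketch (the "population" of numbered entries is hypothetical):

```python
import random

random.seed(0)

# A toy population: 10,000 numbered entries (think phone-list rows).
population = list(range(10000))

# Each element has an equal chance of ending up in the sample;
# random.sample draws without replacement.
sample = random.sample(population, k=100)

print(len(sample), len(set(sample)))  # 100 distinct elements
```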
![Stratified sampling](https://luminousmen.com/media/stratified_sampling.jpg)
Another option is **stratified sampling**. Here we are no longer doing a purely random selection, because we know something about our population: we know that it consists of several homogeneous groups (strata) that need to be represented in our sample — for example, men and women, who may have different opinions on some question. We first divide our population into the men and women strata, and then sample randomly within each, to guarantee the representation of both groups in the final sample.
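A minimal sketch of this idea (the population, the group labels and the 10% fraction are all illustrative choices of mine):

```python
import random

random.seed(1)

# Hypothetical population: (person_id, group) pairs with unequal group sizes.
population = [(i, "men") for i in range(600)] + \
             [(i, "women") for i in range(600, 1000)]

def stratified_sample(population, key, fraction):
    """Sample the same fraction from every stratum independently."""
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for group in strata.values():
        sample.extend(random.sample(group, k=int(len(group) * fraction)))
    return sample

sample = stratified_sample(population, key=lambda item: item[1], fraction=0.1)

# Each stratum is represented in proportion to its size: 60 men, 40 women.
print(sum(1 for _, g in sample if g == "men"),
      sum(1 for _, g in sample if g == "women"))
```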
![Cluster sampling](https://luminousmen.com/media/cluster_sampling.jpg)
And one more variant of random sampling is **cluster sampling**. Such samples are used when we explore, for example, cities, which are very often divided into districts. Some districts are similar to each other, some are different, and so we have clusters of areas that are similar, say, by socio-economic conditions. We first divide the city into such clusters, and then randomly select a few of them — so as not to go to all twelve districts, for example, but to choose three out of twelve at random — and then work inside these selected areas.
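Cluster sampling can be sketched the same way (the district count and resident names are made up):

```python
import random

random.seed(2)

# Hypothetical city: residents grouped into 12 districts of 50 people each.
districts = {d: [f"resident-{d}-{i}" for i in range(50)] for d in range(12)}

# Cluster sampling: randomly pick whole districts,
# then survey everyone inside the chosen districts.
chosen = random.sample(sorted(districts), k=3)
sample = [person for d in chosen for person in districts[d]]

print(len(chosen), len(sample))  # 3 districts, 150 residents
```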
### Non-probability sampling methods
![Snowball sampling](https://luminousmen.com/media/snowball_sampling.jpg)
Non-probability sampling methods are also needed; moreover, in some cases they are irreplaceable. For example, there is a method called **snowball sampling**. It is necessary when we investigate hard-to-reach groups, or when we don’t know the exact size of the general population. We talk to the first person, they connect us with the next one, then the next, and so on — we accumulate a snowball. We grow the sample starting from one person, or sometimes start several such snowballs in order to guarantee the heterogeneity of the sample. Of course, such a sample is statistically unrepresentative, but there are tasks we simply cannot do without it.
## Conclusion
The Central Limit Theorem is quite an important concept in statistics, and consequently in data science. It will allow us to test so-called **statistical hypotheses**, i.e. to test whether assumptions apply to the whole population. We will be covering this concept in later posts.
Sampling is a cheap and understandable way of getting a small but representative piece of the population. Probability methods are preferable for most research problems, but there are tasks for which only non-random samples can help. For some tasks they are irreplaceable, yet in the statistical sense non-random samples are not representative. Therefore, all of this theoretical knowledge about distributions, and about drawing conclusions about the general population on the basis of a sample, applies only to random samples.
Spark core concepts explained
https://luminousmen.com/post/30
2019-06-23T00:00:00Z
[Apache Spark](https://spark.apache.org/) is considered a powerful complement to [Hadoop](https://hadoop.apache.org/), big data’s original technology. Spark is a more accessible, powerful and capable big data tool for tackling various big data challenges. It has become the mainstream and most in-demand big data framework across all major industries. Spark has been part of Hadoop since 2.0, and it is one of the most useful technologies for Python Big Data Engineers.
This series of posts is a single-stop resource that gives spark architecture overview and it's good for people looking to learn spark.
Whole series:
- [Things you need to know about Hadoop and YARN being a Spark developer](https://luminousmen.com/post/hadoop-yarn-spark)
- [Spark core concepts explained](https://luminousmen.com/post/spark-core-concepts-explained)
- [Spark. Anatomy of Spark application](https://luminousmen.com/post/spark-anatomy-of-spark-application)
---
Apache Spark Architecture is based on two main abstractions:
- Resilient Distributed Dataset (RDD)
- Directed Acyclic Graph (DAG)
Let's dive into these concepts.
## RDD — the Spark basic concept
The key to understanding Apache Spark is the RDD — the **Resilient Distributed Dataset**. An RDD holds an **arbitrary collection of objects**. Each dataset in an RDD is logically distributed across the cluster nodes so that it can be processed in parallel.
Physically, an RDD is stored as an object in the driver JVM, and it refers to data stored either in persistent storage (HDFS, Cassandra, HBase, etc.), in the cache (memory, memory + disk, disk only, etc.) or in another RDD.
RDD stores the following metadata:
- Partitions — the set of data splits associated with this RDD. They are located on the cluster nodes. One partition is the minimal data batch that will be processed by each cluster node;
- Dependencies — the list of parent RDDs involved in the computation, aka the lineage graph;
- Computation — the function to compute the child RDD given the parent RDDs from the Dependencies;
- Preferred Locations — where the best place is to put computations on partitions (data locality);
- Partitioner — how the data is split into Partitions (by default they are split by `HashPartitioner`).
An RDD can be recreated, as well as the data it refers to, because every RDD knows how it was created (by storing the lineage graph). An RDD can also be materialized, in memory or on disk.
Example:
```python
>>> rdd = sc.parallelize(range(20)) # create RDD
>>> rdd
ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195
>>> rdd.collect() # collect data on driver and show
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
```
RDDs can also be cached and manually partitioned. Caching is beneficial when we use an RDD several times (and may slow down our computations otherwise). Manual partitioning is important for correctly balancing data between partitions: generally, more and smaller partitions allow distributing data more equally among more executors, while fewer, larger partitions may speed up jobs with a lot of shuffles (data reorganizations during computations), since there is less per-partition overhead and less data movement.
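Continuing the same REPL session, caching and manual partitioning might look like this (a sketch; `rdd` is the RDD created above):

```python
>>> rdd = rdd.cache()  # mark the RDD for reuse; it is materialized on the first action
>>> rdd.repartition(8).getNumPartitions()  # more partitions, via a full shuffle
8
>>> rdd.coalesce(2).getNumPartitions()  # fewer partitions, avoiding a full shuffle
2
```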
Let's check the number of partitions and data on them:
```python
>>> rdd.getNumPartitions() # get current number of paritions
4
>>> rdd.glom().collect() # collect data on driver based on partitions
[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13, 14], [15, 16, 17, 18, 19]]
```
All the interesting things that happen in Spark happen through RDD operations. That is, a Spark application usually looks like the following: we create an RDD (for example, we set the data source as a file on HDFS), we transform it (`map`, `filter`, `join`, `groupBy`, `aggregate`, `reduce`, ...), and we do something with the result (for example, write it back to HDFS).
![Spark Application Flow](/media/spark_application_flow.jpg)
Over RDD, you can do two types of operations (and, accordingly, all the work with the data is in the sequence of these two types): **transformations and actions**.
### Transformations
The result of applying a transformation to an RDD is a **new RDD**. As a rule, these are operations that in some way transform the elements of the given dataset.
Transformations are **lazy in nature**, meaning that when we call such an operation on an RDD, it does not execute immediately. Instead, Spark maintains a record of which operations have been called (through the DAG; we will talk about it later).
We can think of a Spark RDD as the data that we build up through transformations. Because transformations are lazy, we can execute them at any time by calling an action on the data. Hence, the data is not loaded until it is necessary. This gives Spark plenty of opportunities for low-level optimizations.
At a high level, there are two groups of transformations that can be applied onto the RDDs, namely **narrow transformations**, and **wide transformations**.
![Narrow/wide transformations](/media/narrow_wide_transformations.jpg)
**Narrow transformation** doesn't require the data to be **shuffled or reorganized** across the partitions. For example, `map`, `filter`, etc. The narrow transformations will be grouped (or pipe-lined) together into a single stage.
A **shuffle** occurs when data is rearranged between partitions. This is required when a transformation requires information from other partitions, such as summing all the values in a column. Spark will gather the required data from each partition and combine it into a new partition, likely on a different executor.
But there are exceptions: operations like `coalesce` may cause a task to work with multiple input partitions, yet the transformation is still considered narrow, because the input records used to compute any output record can still be found in only a limited subset of partitions.
Let's use filter transformation on our data:
```python
>>> filteredRDD = rdd.filter(lambda x: x > 10)
>>> print(filteredRDD.toDebugString()) # to see the execution graph; only one stage is created
(4) PythonRDD[1] at RDD at PythonRDD.scala:53 []
| ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195 []
>>> filteredRDD.collect()
[11, 12, 13, 14, 15, 16, 17, 18, 19]
```
In this example we don't need to shuffle the data; each partition can be processed independently.
However, Spark also supports transformations with wide dependencies (namely, **wide transformations**), such as `groupByKey`, `reduceByKey`, etc. With such dependencies, the data required for the calculation may be located in several partitions of the parent RDD. All data with the same key must end up in the same partition, processed by a single task. To implement these operations, Spark must perform a shuffle, moving data across the cluster and forming a new stage with a new set of partitions, as in the example below:
```python
>>> groupedRDD = filteredRDD.groupBy(lambda x: x % 2) # group data based on mod
>>> print(groupedRDD.toDebugString()) # two separate stages are created, because of the shuffle
(4) PythonRDD[6] at RDD at PythonRDD.scala:53 []
| MapPartitionsRDD[5] at mapPartitions at PythonRDD.scala:133 []
| ShuffledRDD[4] at partitionBy at NativeMethodAccessorImpl.java:0 []
+-(4) PairwiseRDD[3] at groupBy at <ipython-input-5-a92aa13dcb83>:1 []
| PythonRDD[2] at groupBy at <ipython-input-5-a92aa13dcb83>:1 []
| ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195 []
```
### Actions
Actions are applied when it is necessary to materialize the result — save the data to disk or output part of it to the console. The `collect` operation we have used so far is also an action — it collects data.
Actions are not lazy — they actually will trigger the data processing. Actions are RDD operations that produce non-RDD values.
For example, to get sum of our filtered data we can use `reduce`:
```python
>>> filteredRDD.reduce(lambda a, b: a + b)
135
```
## DAG
Unlike Hadoop, where the user has to break the whole job down into smaller jobs and chain them together to go along with MapReduce, Spark identifies the tasks that can be computed in parallel with the partitioned data on the cluster. With these identified tasks, Spark builds a logical flow of operations that can be represented as a graph which is directed and acyclic, also known as a **DAG** (Directed Acyclic Graph), where a node is an RDD partition and an edge is a transformation on the data. Thus Spark implicitly builds its own execution plan from the provided Spark application.
![DAG flow](/media/dag_flow.jpg)
The **DAGScheduler** divides operators into stages of tasks. A stage consists of tasks based on the input data partitions. The DAGScheduler pipelines some transformations together; for example, many `map` operators can be squashed into a single stage. The final result of the DAGScheduler is a set of stages. The stages are passed on to the **TaskScheduler**. The number of tasks submitted depends on the number of partitions. The TaskScheduler launches tasks via the **[cluster manager](https://luminousmen.com/post/hadoop-yarn-spark)**. The TaskScheduler does not know about the dependencies between stages.
RDDs are capable of defining a location preference for computing partitions. A location preference refers to information about the RDD's location. The **DAGScheduler** places the partitions in such a way that a task is as close to the data as possible (data locality).
Things you need to know about Hadoop and YARN being a Spark developer
https://luminousmen.com/post/29
2019-06-16T00:00:00Z
[Apache Spark](https://spark.apache.org/) is considered a powerful complement to [Hadoop](https://hadoop.apache.org/), big data’s original technology. Spark is a more accessible, powerful and capable big data tool for tackling various big data challenges. It has become the mainstream and most in-demand big data framework across all major industries. Spark has been part of Hadoop since 2.0, and it is one of the most useful technologies for Python Big Data Engineers.
This series of posts is a single-stop resource that gives spark architecture overview and it's good for people looking to learn spark.
Whole series:
- [Things you need to know about Hadoop and YARN being a Spark developer](https://luminousmen.com/post/hadoop-yarn-spark)
- [Spark core concepts explained](https://luminousmen.com/post/spark-core-concepts-explained)
- [Spark. Anatomy of Spark application](https://luminousmen.com/post/spark-anatomy-of-spark-application)
---
*What is Spark?*
Apache Spark is an open-source cluster computing framework for batch and stream processing, designed for fast in-memory data processing.
Spark is a **framework**, and it is mostly used on top of other systems. You can run Spark using its standalone cluster mode, on [EC2](https://aws.amazon.com/ru/ec2/), on [Hadoop YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html), on [Mesos](http://mesos.apache.org/), or on [Kubernetes](https://kubernetes.io/). I will tell you about the most popular setup — Spark on top of Hadoop YARN.
So before we go into the depths of what Apache Spark consists of, let's briefly understand the Hadoop platform and what YARN is doing there.
## The OS analogy
In order to understand what Hadoop is, I will make an analogy with an operating system. A traditional operating system, at a high level, consists of two main parts: the file system and the processing component.
On a single machine there is a **file system**; it can be FAT32, HPFS, ext2, NFS, ZFS, etc. It is basically a system for storing and managing data.
![Classic OS](/media/classic_os.jpg)
The OS also has a **processing component**: the kernel, the scheduler, and the threads and processes that allow us to run programs on the data.
![Hadoop](/media/hadoop_os.jpg)
When we move that concept of storage/processing parts to the cluster level and put it inside Hadoop, we basically get the same separation, the same two components. But the storage layer, instead of a single-node file system, will be **HDFS** — the Hadoop Distributed File System. And **YARN** (Yet Another Resource Negotiator) takes the role of the processing component: execution, scheduling, and deciding what gets to run particular tasks and where.
Let's take a closer look at it.
## YARN architecture
![YARN](/media/yarn.jpg)
*So how does YARN work?*
YARN has a kind of main process — the **ResourceManager**. The ResourceManager is the ultimate authority that arbitrates the division of resources among all the applications in the system. It has minions running on all cluster nodes, called **NodeManagers**.
![YARN NodeManagers](/media/yarn_nodemanager.jpg)
All nodes in the cluster have a certain number of **containers**. Containers are computational units, a kind of wrapper around node resources for performing the user application's tasks. They are the main processing units that YARN manages. Containers have their own parameters that can be configured upon request (like RAM, CPU, etc.).
![YARN Containers](/media/yarn_containers.jpg)
The containers on each node are controlled by the NodeManager daemon. When a new application is launched on the cluster, the ResourceManager allocates one container for the **ApplicationMaster**.
The per-application ApplicationMaster is a framework-specific entity and is tasked with negotiating for resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks.
The ResourceManager has a pluggable **scheduler** component, which is responsible for allocating resources to the various running applications subject to the familiar constraints of capacity, queues, and other factors.
After the ApplicationMaster is started, it will be responsible for the whole life cycle of the distributed application. First and foremost, it will be sending resource requests to the ResourceManager to ask for containers needed to run an application’s tasks. A resource request is simply a request for a number of containers that satisfies some resource requirements, such as:
- An amount of resources, today expressed as megabytes of memory and CPU shares
- A preferred location, specified by hostname, rackname, or `*` to indicate no preference
- A priority within this application, and not across multiple applications
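To make the shape of such a request concrete, here is a minimal sketch in Python. The class and field names are mine, invented for illustration; YARN's real API is the Java `ResourceRequest` in `org.apache.hadoop.yarn.api.records` and differs in naming and detail.

```python
from dataclasses import dataclass

# Illustrative model of a YARN resource request; names are invented and
# do not match Hadoop's actual Java API.
@dataclass
class ResourceRequest:
    memory_mb: int           # amount of memory requested, in megabytes
    vcores: int              # CPU shares (virtual cores)
    location: str = "*"      # hostname, rackname, or "*" for no preference
    priority: int = 0        # priority within this application only
    num_containers: int = 1  # how many such containers are needed

# An ApplicationMaster might ask for 4 containers of 2 GB / 1 vcore each,
# anywhere in the cluster:
request = ResourceRequest(memory_mb=2048, vcores=1, num_containers=4)
```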
If the ApplicationMaster crashes or becomes unavailable, the ResourceManager can create another container and restart the ApplicationMaster on it.
The ResourceManager stores information about running applications and completed tasks in HDFS. If the ResourceManager is restarted, it recreates the state of applications and re-runs only incomplete tasks.
The ResourceManager, the NodeManager, and a container are not concerned about the type of application or task. All application framework-specific code is simply moved to its ApplicationMaster so that **any distributed framework can be supported by YARN** — as long as someone implements an appropriate ApplicationMaster for it. Thanks to this generic approach, the dream of a Hadoop YARN cluster running many various workloads comes true.
## Submit application
Armed with the knowledge from the previous sections, it will be useful to find out how the applications themselves work in YARN.
Let's go through the sequence of steps of application start:
1. A **client program** submits the application, including the necessary specifications to launch the application-specific **ApplicationMaster** itself.
2. The **ResourceManager** takes responsibility for allocating the container in which the **ApplicationMaster** will start, and then launches the **ApplicationMaster**.
3. The **ApplicationMaster** registers itself with the **ResourceManager**. Registration allows the **client program** to ask the **ResourceManager** for specific information, which allows it to communicate directly with its own **ApplicationMaster**.
4. During normal operation, the **ApplicationMaster** requests the appropriate resource containers.
5. Upon successful receipt of the containers, the **ApplicationMaster** launches them by providing the **NodeManager** with the container configuration.
6. Inside the containers, the user application code starts. The user application then reports information (stage of execution, status) to the **ApplicationMaster**.
7. During the user application execution, the client communicates with the **ApplicationMaster** to obtain the status of the application.
8. When the application has completed execution and all the necessary work has been finished, the **ApplicationMaster** deregisters from **ResourceManager** and shuts down, freeing its container for other purposes.
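The steps above can be sketched as a toy simulation. Every class and method name below is invented for illustration and does not correspond to Hadoop's real Java classes; the point is only the order of the interactions.

```python
# Toy walk-through of the YARN submission flow; all names are invented
# and record only the order of events, not real Hadoop behavior.
events = []

class ApplicationMaster:
    def __init__(self, rm):
        self.rm = rm
        events.append("AM registers with RM")                 # step 3

    def run(self, app):
        self.rm.allocate(app["containers"])                   # step 4
        events.append("NM launches containers, tasks run")    # steps 5-6
        events.append("AM deregisters and shuts down")        # step 8

class ResourceManager:
    def submit(self, app):
        events.append("client submits application")           # step 1
        events.append("RM allocates container, starts AM")    # step 2
        am = ApplicationMaster(self)
        am.run(app)

    def allocate(self, n):
        events.append(f"RM grants {n} container(s)")          # step 4

ResourceManager().submit({"containers": 2})
```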
## Interesting facts and features
YARN offers several other great features. Describing all of them is outside the scope of the post, but I’ve included some noteworthy features:
- *Uberization* is the possibility to run all tasks of a MapReduce job in the ApplicationMaster’s JVM if the job is small enough. This way, you avoid the overhead of requesting containers from the ResourceManager and asking the NodeManagers to start (supposedly small) tasks.
- Simplified user-log management and access. Logs generated by applications are not left on individual slave nodes (as with MRv1) but are moved to central storage, such as HDFS. Later, they can be used for debugging purposes or for historical analyses to discover performance issues.
# Data Science. Correlation

*2019-06-09, https://luminousmen.com/post/28*
---
Correlation (correlation dependence) is a dependence or association between two or more variables: when the value of one variable changes, the other variable tends to change (decrease or increase) as well.
When calculating correlations, we are trying to determine whether there is a statistically significant relationship between two or more variables in one or several samples. For example, the relationship between height and weight, the relationship between performance and results of the IQ test, between work experience and productivity.
If you read somewhere a phrase like "it turned out that these events have such a correlation here", then in about 99% of cases, unless otherwise specified, it is the **Pearson correlation coefficient**. It is the **default correlation**, per se.
## Pearson Correlation Coefficient
<img src="https://luminousmen.com/media/Pearson_Correlation_Coefficient_meme.png" style="width: auto !important;align: center;"/>
Suppose we have two processes, and for each we measure some parameter. As a result, we have a set of pairs of numbers (the value of the 1st process and the value of the 2nd process). Assuming these processes are somehow connected, we expect that this connection should reveal itself numerically in the parameters' values. Hence, from the resulting pairs of numbers, we can somehow extract information about the presence or absence of a connection.
However, the interconnection can be of varying strength; therefore, it is desirable to obtain not a binary "yes" or "no", but a continuous number that characterizes the strength of the interconnection between the two variables. Enter the Pearson correlation coefficient.
It can vary from -1 (negative correlation) to +1 (positive correlation). A correlation coefficient of 0 indicates the absence of interconnection between the variables. A coefficient close to 1 (or -1) means a **strong correlation**, and one close to 0 a **weak correlation**.
With a **positive correlation**, an increase (or decrease) in the values of one variable leads to a regular increase (or decrease) in another variable.
With a **negative correlation**, an increase (or decrease) in the values of one variable leads to a regular decrease (or increase) in another variable.
The formula for Pearson correlation coefficient for two variables X and Y:
$$r=\frac{cov}{\sigma_x\sigma_y}$$
where $$cov=M((X - M(X))(Y - M(Y)))$$
Frankly speaking, I do not really like the mathematical version of the coefficient either. Scary little letters in scary combinations, all that (I get you, guys). Plain text with examples seems much more understandable, so that is the path I will follow.
There is a term "mathematical expectation of a quantity", for brevity called the **expectation** or **mean** (we talked about it in the [previous post](https://luminousmen.com/post/data-science-measures)). In the simplest case, its meaning is extremely simple: it is the arithmetic average of all values:
$$M(X)=\frac{(x_1 + x_2 + ... + x_n)}{n}$$
Next, we find for each number in the list its deviation from the mean:
$$mx = M(X)$$
$${dx_1, dx_2, ...} = {x_1 - mx, x_2 - mx, x_3 - mx, ...}$$
The same can be done with another list in which there are measurements of the second variable, measured simultaneously with the first:
$$Y={y_1, y_2, ..., y_n}$$
$$my = M(Y)$$
$${dy_1, dy_2, ...} = {y_1 - my, y_2 - my, ..., y_n - my}$$
For example, we measured the temperature of every patient in the ward. And, besides that, we recorded how many aspirin pills the patient took every day. By the procedure above we can build the list of deviations of every patient in the ward from the mean temperature of the patients. And apply the same procedure to the corresponding list of deviations of the number of aspirin pills taken by the patient from the average number of pills taken.
Hence, we can assume that if temperature is somehow connected with the number of aspirin pills taken, then the deviations will be connected as well.
For example, if taking aspirin leads to an increase in temperature, then we will see the following. If the patient took more pills than others, then his temperature **deviates from the average upwards**. If the patient took fewer pills than others, then his temperature **deviates from the average downwards**. That is both deviations — in the same direction.
If we multiply in pairs all the deviations, then the result will be positive.
And vice versa — if taking aspirin lowers the temperature, then all products of deviations will be negative.
In other words, we have obtained a certain indicator with an interesting property — **it is positive for the positive relation of phenomena and negative for negative relation**.
$$D = \{dx_1 dy_1, dx_2 dy_2, ...\}$$
But if the variables are not really connected, then over a large number of measurements positive and negative products of deviations should appear about equally often (we will cover this topic in the next post). If you add them up, you should get something around zero, and the more measurements there are, the closer to zero the average gets.
Yes, it is the mean again. So here it is, the criterion: the [covariance](https://en.wikipedia.org/wiki/Covariance):
$$cov = M(D)$$
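A quick sketch of this sign property on made-up patient data (the numbers are invented for illustration):

```python
def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """cov = M(D): the mean of the pairwise products of deviations."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

pills = [1, 2, 3, 4, 5]                     # pills taken per patient (made up)
temp_up = [36.6, 36.8, 37.0, 37.2, 37.4]    # temperature rises with pills
temp_down = [37.4, 37.2, 37.0, 36.8, 36.6]  # temperature falls with pills

covariance(pills, temp_up)    # positive: deviations point the same way
covariance(pills, temp_down)  # negative: deviations point opposite ways
```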
The covariance can have an arbitrary value, and therefore, in order to draw a conclusion about the relationship between the lists, it is also necessary to know the maximum range just for these variables.
To do this, we introduce another interesting value, derived from the list of squared deviations from the mean.
$$DX^2 = {dx_1^2, dx_2^2, ...}$$
$$\sigma_x = \sqrt{M(DX^2)}$$
It is called "[standard deviation](https://luminousmen.com/post/data-science-measures)".
So, it can be shown that the covariance in its absolute value does not exceed the product of the standard deviations of these two lists.
$$|cov| ≤ \sigma_x * \sigma_y$$
Well, since the standard deviation by construction is always positive, we can conclude that
$$-\sigma_x*\sigma_y ≤ cov ≤ \sigma_x*\sigma_y $$
or
$$r=\frac{cov}{\sigma_x\sigma_y}$$
$$-1≤r≤1$$
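Putting all the pieces together, here is a minimal Pearson coefficient in plain Python, following the formulas above (population formulas with `n` in the denominator throughout; the normalization cancels either way):

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def pearson_r(xs, ys):
    """r = cov / (sigma_x * sigma_y)."""
    mx, my = mean(xs), mean(ys)
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    cov = mean([a * b for a, b in zip(dx, dy)])
    sigma_x = math.sqrt(mean([d ** 2 for d in dx]))
    sigma_y = math.sqrt(mean([d ** 2 for d in dy]))
    return cov / (sigma_x * sigma_y)

pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # 1.0  (perfect positive relation)
pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # -1.0 (perfect negative relation)
```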
In general, if we take such coefficient as a measure of association, then it will turn out to be very convenient and very universal: it will show whether the values are related to each other and how much.
Everything is very convenient and universal, but the above reasoning has a fair number of flaws which, in view of the universality of the obtained "measure of association", are very convenient to ignore, yet they can lead to wrong conclusions.
### Example #1
There once was a student who decided to find out the connection between a patient's body temperature and the patient's well-being. Everything seems obvious: at a normal temperature of 36.6 °C, well-being is at its best. When the temperature rises, well-being deteriorates. However, when the temperature drops, well-being also deteriorates...
![Student story](https://luminousmen.com/media/Student_story.jpg)
If the measurements to the left and right of the normal temperature are symmetric (see the picture), the correlation will be close to zero. From this the student concluded that there is no connection between temperature and well-being (ha!).
### ~~Example #1~~
The example shows: the "measure of association" introduced by us is **not universal**. The reasoning made at its construction applies **only to linear dependencies**.
Roughly speaking, this only works when the observed dependency is of the form:
$$y(x) = kx$$
If the dependency is different, then, generally speaking, the correlation coefficient may be **random**. Yes, yes, not zero, but random.
In other words, a researcher somehow **must know in advance** that there is either a linear relationship or no association at all in order to draw more or less correct conclusions.
*But how can he know this before research?*
Only from other studies. Where the analysis was done by other methods. For example, it can be based on a thoughtful examination of graphs or distributions.
But if all this has already been done, then why should he even consider the correlation at all? He already has more reliable results.
How good is such a measure, which gives a random number, and thus leads to completely different conclusions?
As a result, it turns out that the only thing correlation alone can establish is **the absence of a linear relationship throughout the interval**. It speaks only of absence: even the existence of such a relationship cannot be concluded from it.
And exactly the same thing will be with aspirin pills that have an optimal dose — with this dose, the results will be the best, but with a smaller and with a larger one, they will be worse.
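The student's trap is easy to reproduce numerically: a perfect but symmetric (quadratic) dependence yields a Pearson coefficient of exactly zero. A minimal sketch with made-up numbers:

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    sx = math.sqrt(mean([(x - mx) ** 2 for x in xs]))
    sy = math.sqrt(mean([(y - my) ** 2 for y in ys]))
    return cov / (sx * sy)

# "Temperature" deviations symmetric around the norm, and a "well-being"
# score that is worst at both extremes: a perfect, but nonlinear, relation.
temp = [-2, -1, 0, 1, 2]
wellbeing = [-(t ** 2) for t in temp]  # [-4, -1, 0, -1, -4]

pearson_r(temp, wellbeing)  # 0.0: the coefficient misses the dependence
```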
### Example #2
There once was a scientist. Every morning he saw the same person passing by his window. After some time, he noticed an amazing pattern: if this person took an umbrella with him, then it rained that day.
Of course, this did not happen every time — sometimes the person took an umbrella, but it did not rain, sometimes he did not take it, but it still rained. But too often the presence of an umbrella with rain and the absence of an umbrella with no rain occurred at the same time.
The scientist, of course, did not simply trust this hypothesis; instead, he began to carefully write down his observations every day. After a while, he encoded the events "umbrella" and "rain" as 1, the events "no umbrella" and "no rain" as 0, and computed the correlation from his records.
The correlation was very high: 0.95. These two events were clearly linked!
The proud scientist wrote an article entitled "How to cause rain", in which he convincingly proved that it was this guy who controls the rain at his place of residence by carrying an umbrella.
After all, clearly, such a high correlation cannot be the result of chance.
### ~~Example #2~~
![Scientist story](https://luminousmen.com/media/Scientist_story.jpg)
Even if the correlation coefficient reflects a real connection between two variables (and is not a random number, as with the student), it still does not prove that one phenomenon causes the other, in either direction.
In these cases, people say "there may be some third phenomenon that causes both", but sometimes this is more misleading than helpful. Such a phenomenon may indeed exist, but **the possible options do not end there**.
Moreover, no one guarantees that the particular "third phenomenon" someone proposed is actually what causes the other two.
Of course, there is no external mystical power in the story above. The umbrella guy simply watched the weather forecast in the morning, and, if it promised rain, he would take an umbrella with him.
It may seem that, even if there is no mystical power, there is still some third phenomenon causing the first two.
Suppose there is. *But which one?*
*Does the weather forecast cause the rain?*
*Or does the future rain cause the forecast?*
*Well, okay, not the rain itself, but the atmospheric pressure, the speed and direction of the wind, the presence of a lake, etc., which cause both the rain and the forecast?*
However, this person takes an umbrella in the morning simply because he reads the published forecasts; he does not know the upcoming weather at all. To check this, we could ask the source he reads to publish deliberately incorrect forecasts, and thus see firsthand that he takes the umbrella when the forecast says rain, and not when some physical parameters indicate its high probability.
That is, **there is no one "third phenomenon"** that causes the first two. Instead, there is a whole set of phenomena that are in very difficult relationships with each other.
## Conclusion
The moral of the stories is that **no statistical indicator alone can confirm the theory you like**. Theories are confirmed only by a set of indicators from a correctly constructed series of experiments. A series, not a single experiment, even one with a large amount of data.
Whatever statistic you have calculated, it only gives you some grounds for hypotheses and assumptions ([take a look at how Nicolas Cage causes people to drown in pools](http://www.tylervigen.com/spurious-correlations)). For hypotheses, not for "theories", which many people like to declare at the first stage.
The first experiment is just that: the first. Based on its results, you formulate a hypothesis, and in the following experiments you check whether it gives true predictions.
The point is that the first experiment provides the data on which the hypothesis is built. You cannot check on that same data whether the hypothesis really works: you built it on them, so of course it will fit them. Any hypothesis would, even a wrong one.
Predictions that come true in new experiments, however, do provide "statistical evidence": the proposed dependence lets you make predictions about data that did not yet exist when the hypothesis was formed. This, and not just a high correlation, demonstrates that the connection is real.
Moreover, it is not enough to repeat the same experiment and make sure it worked a second time. Whether it worked or not, you also have to check it under other conditions. After all, a real theory cannot describe just one particular case; it should extend to a rather broad range of possible situations.
But this is still not the end. Even if the hypothesis really does give true predictions, it is still necessary to check in further experiments that alternative hypotheses do not make the same predictions. Otherwise, you have not proved the correctness of your hypothesis, but only the correctness of a rather broad set of hypotheses that includes yours.
Such is the science of data.

# Data Science. Measures

*2019-05-12, https://luminousmen.com/post/27*
---
In statistics, what we see in our data is not interesting in itself, but as an estimate of where it comes from (the population). To assess and describe the distribution of characteristics, we need to know two things. First, the values of these characteristics that are **typical** for the distribution under study. Second, how much these values differ (scatter), i.e. **how typical they really are**.
The first task is solved by measures of central tendency. The second is solved by measures of variation.
# Measures of Central Tendency
To study the various measures of central tendency, we need some data; let's construct it in the form of a story.
Imagine 5 people sitting in a bar, each with an income of $40k. The average income of the people in the bar is $40k.
```python
data = [40, 40, 40, 40, 40]
```
With this data in hand, let's turn to the theory.
## Mean
Probability theory is concerned with the study of random variables. For this purpose, various characteristics are calculated that describe their behavior. One of the main characteristics of a random variable is the mean (also called the mathematical expectation or average), which is a kind of **center of the data** around which the values are grouped.
The mean gives us a generalized estimate and helps us make decisions. For example, the average income in a company gives a rough estimate of how much a new employee can earn there, and the average restaurant check helps us decide whether it is worth going there or not.
The mean is an intuitive measure with the following formula:
$$
mean = \sum\limits_{i = 1}^n {x_i p_i}
$$
where $x_i$ are the possible values of the random variable and $p_i$ their probabilities.
Hence, the mean of a random variable is a weighted sum of random variable values, where the weights are equal to the corresponding probabilities.
For example, if you calculate the mean value of the sum of points when throwing two dice, you get the number 7. But there we know all the possible values and their probabilities exactly. What if there is no such information, only the results of some observations? This is where statistics comes in: it allows us to obtain an approximate value of the mean by estimating it from the available data.
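The two-dice claim can be checked by exhaustive enumeration of the 36 equally likely outcomes:

```python
from itertools import product
from fractions import Fraction

# Every (a, b) outcome of two dice has probability 1/36; the mean of the
# sum is the probability-weighted sum over all possible outcomes.
outcomes = list(product(range(1, 7), repeat=2))
mean_sum = sum(Fraction(1, 36) * (a + b) for a, b in outcomes)

mean_sum  # exactly 7: the expected sum of two dice
```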
Mathematical statistics provides several options for estimating the mean value. The main one is **the arithmetic mean**, which has a number of useful properties. For example, the arithmetic mean is an unbiased estimate: its expected value equals the expectation being estimated.
The arithmetic mean is calculated by the formula known to any student:
$$
mean = \frac{1}{n} \sum\limits_{i = 1}^n {x_i}
$$
where $x_i$ – random variables, *n* – number of values.
```python
def mean(x):
    return sum(x) / len(x)

mean(data) # 40.0
```
The disadvantage of this measure is its sensitivity to deviations and inhomogeneities in the sample; in other words, it is strongly distorted by outliers that deviate sharply from the distribution center. For distributions with large asymmetry, it may not correspond to the intuitive notion of "average".
If another person comes to our imaginary bar with an income of $40k, the average income of the people in the bar will not change.
```python
data1 = data + [40]
mean(data1) # 40
```
But if Jeff Bezos walks into the bar with an income of $10 million (10,000 in our thousands-of-dollars units), the average income at the bar becomes `1700`, although the income of the first five people has not changed.
```python
data2 = data + [10000]
mean(data2) # 1700
```
In order to deal with this problem, there are other measures of the tendency: [truncated mean](https://en.wikipedia.org/wiki/Truncated_mean), mode and median.
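As a small sketch of one of these alternatives, a truncated mean simply drops the k smallest and k largest values before averaging. The function below is my own illustration, not a library API:

```python
def truncated_mean(xs, k=1):
    """Drop the k smallest and k largest values, then average the rest."""
    if len(xs) <= 2 * k:
        raise ValueError("not enough data to trim")
    trimmed = sorted(xs)[k:len(xs) - k]
    return sum(trimmed) / len(trimmed)

data = [40, 40, 40, 40, 40]
truncated_mean(data + [10000])  # 40.0: the outlier no longer skews the result
```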
## Mode
Another intuitive measure. The mode is **the most common value**: simply the value found most often in the data sample.
There can be several modes, and the presence of several modes is itself a characteristic of the data worth noticing. It is a sign that the data has some internal structure, that there may be subgroups that are qualitatively different from each other. Maybe it makes sense not to look at the distribution as a whole, but to split it into subgroups and look at them separately.
```python
from collections import Counter

def mode(x):
    """returns a list; there may be more than one mode"""
    counts = Counter(x)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items() if count == max_count]
mode(data) # [40]
```
The mode is irreplaceable for nominal variables and of little use for quantitative ones. It also helps us find the most typical value in a data sample.
## Median
The central tendency can be viewed not only as the value with zero total deviation (arithmetic mean) or maximum frequency (mode), but also as a mark (a certain level of the analyzed indicator) dividing the ordered data (sorted ascending or descending) into two equal parts. That is, half of the source data is below this mark and half is above it. This is the **median**.
Mode and median are important measures, they reflect the data structure and are sometimes used instead of the arithmetic mean.
```python
def median(v):
    """finds the 'middle-most' value of v"""
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2
    if n % 2 == 1:
        # if odd, return the middle value
        return sorted_v[midpoint]
    else:
        # if even, return the average of the middle values
        lo = midpoint - 1
        hi = midpoint
        return (sorted_v[lo] + sorted_v[hi]) / 2

median(data) # 40
median(data2) # 40.0
```
Obviously, with a symmetric distribution, the middle that divides the sample in half lies at the very center, in the same place as the arithmetic mean (and the mode). This is, so to speak, the ideal situation: the mode, the median, and the arithmetic mean coincide, and all their properties fall on one point: the maximum frequency, the division in half, and the zero sum of deviations. However, life is not as symmetric as the normal distribution.
# Measures of Variability(Dispersion)
In order to understand how characteristics of the data sample behave, it is not enough to know the mean, it is not enough to know the typical values of these characteristics, you also need to know their **variability**. That is, we must not only know what is typical, but we must also know how variable the values are, how strongly those values that do not resemble it differ from the mean. And for this, we have measures of variation.
Let's go back to our imaginary situation. Imagine that we now have two bars:
```python
data1 = [40, 40, 40, 40, 40]
data2 = [80, 40, 15, 25, 40]
mean(data1) # 40
mean(data2) # 40
median(data1) # 40
median(data2) # 40
mode(data1) # [40]
mode(data2) # [40]
```
By these measures, the two bars look identical to an analyst. But the data is different.
## Range
The simplest and most understandable measure of variability is range.
Range is the distance between the minimum and maximum value of the characteristic.
```python
def data_range(x):
    return max(x) - min(x)

data_range(data1) # 0
data_range(data2) # 65
```
On the one hand, the range can be quite informative and useful. For example, the maximum and minimum cost of an apartment in the city of `N`, the maximum and minimum salary in the region, and so on. On the other hand, the range can be very large and not have any practical meaning.
This measure says how much the values in the sample vary but does not say anything about the distribution itself.
## Variance
If the mean reflects the center of a random variable, then the variance gives a characteristic of **the spread of data** around the center and it takes into account the influence of the values of all objects.
The formula for variance is:
$$
s^2 =\frac{1}{n}\sum\limits_{i = 1}^n {\left( {x_i - \bar x} \right)^2 }
$$
where *x* – data value, $\bar x$ – mean value, *n* – number of objects.
Hence, for each object we take its deviation from the mean, square it, sum the squares, and divide by the number of objects in the sample.
*Why square?*
The sum of negative and positive deviations gives zero, because deviations in minus and deviations in plus cancel each other out. To avoid this mutual cancellation, the deviations in the numerator are squared. As for the denominator: the formula above divides by `n`, but using other values improves the estimator in various ways. A common choice for the denominator is `n − 1`, which eliminates [bias](https://en.wikipedia.org/wiki/Bias_of_an_estimator).
```python
def de_mean(x):
    """translate x by subtracting its mean (so the result has mean 0)"""
    x_bar = mean(x)
    return [x_i - x_bar for x_i in x]

def sum_of_squares(y):
    """the total squared variation of y_i's from their mean"""
    return sum(v ** 2 for v in de_mean(y))

def variance(x):
    """assumes x has at least two elements"""
    n = len(x)
    deviations = de_mean(x)
    return sum_of_squares(deviations) / (n - 1)

variance(data1) # 0.0
variance(data2) # 612.5
```
Thus, we take into account each deviation, and the sum divided by the number of objects gives us an estimate of the variability.
*What are the problems here?*
The fact that we square the deviations inflates the units of measurement. If with our salaries we were speaking in thousands of dollars, then after squaring we are suddenly operating in millions or even billions. This is reasonable from the point of view of the math, but **not informative** from the point of view of the actual salaries that people in the organization get.
## Standard deviation
To bring the variance back to reality, that is, to use it for more practical purposes, we take its square root. The result is the so-called **standard deviation**. Its formula is:
$$
s = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^n {\left( {x_i - \bar x} \right)^2 } }
$$
```python
import math

def standard_deviation(x):
    return math.sqrt(variance(x))

standard_deviation(data1) # 0.0
standard_deviation(data2) # ≈ 24.75
```
Obviously, the standard deviation also characterizes the measure of variability, but now (unlike the variance) it can be compared with the original data since they have the same units of measurement (this is clear from the calculation formula). But this measurement in its pure form is not very informative since it contains too many interim calculations that confuse (deviation, squared, sum, average, root). However, it is already possible to work directly with the standard deviation, because the properties of this measurement are well studied and known.
For example, there is a [three-sigma rule](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule) which states that, for normally distributed data, roughly 997 values out of 1000 fall within ± 3 sigma of the arithmetic mean. The standard deviation, as a measure of uncertainty, is also involved in many statistical calculations: it is used to establish the accuracy of various estimates and forecasts. If the variation is very large, the standard deviation will also be large, and the forecast will be correspondingly inaccurate, which shows up, for example, as very wide confidence intervals.
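The three-sigma rule is easy to check empirically. A minimal sketch (the sample size and seed are arbitrary choices, not from the article):

```python
import random

random.seed(42)

# draw a large sample from a normal distribution with mean 0, sigma 1
sample = [random.gauss(0, 1) for _ in range(100_000)]

def within(k):
    """Fraction of the sample lying within +-k standard deviations of the mean."""
    return sum(abs(x) <= k for x in sample) / len(sample)

print(within(1), within(2), within(3))  # roughly 0.68, 0.95, 0.997
```

The three printed fractions land close to the classic 68-95-99.7 percentages.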
# Conclusion
These are the basic measures a data engineer should know, though not all of them. They will be used in the EDA section of the series, where some new measures will be introduced.

# Data Science. Probability distributions

*2019-04-07, https://luminousmen.com/post/26*
## Numerical data types
Numerical data are values on which arithmetic makes sense: you can add them, subtract them, take averages, and so on.
1. **Continuous.** A number within a range of values, usually measured, such as height (within the range of human heights).
2. **Discrete**. Only take certain values (can’t be decimal), usually counted, such as the count of students in a class.
There are many distributions, but here we will talk about the most common and widely used ones. First, though, we need to understand the plots, and the common plots for distributions are the pdf and the cdf.
## Probability Density Function(pdf)
Consider an experiment in which the probabilities of events are as follows: the probabilities of getting the numbers 1, 2, 3, 4 are 1/10, 2/10, 3/10 and 4/10 respectively. It is more convenient to have an equation for this experiment that gives these values directly. For example, this experiment is described by f(x) = x/10, where x = 1, 2, 3, 4. This equation (equivalently, a function) is called the **[probability distribution function](https://en.wikipedia.org/wiki/Probability_density_function)**, although some authors also refer to it as the **probability function**, the **frequency function**, or (for discrete variables) the **probability mass function**. It tells us the probability that the random variable takes the value x.
## Cumulative Distribution Function(cdf)
The [cumulative distribution function](https://en.wikipedia.org/wiki/Cumulative_distribution_function) provides an integral picture of the probability distribution. As the name *cumulative* suggests, it is simply the probability accumulated up to a particular value of the random variable. In the example above, given x = 3, the cdf is the total probability of all values from 1 to 3.
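The toy experiment above fits in a few lines: the pmf is f(x) = x/10, and the cdf just accumulates it:

```python
def pmf(x):
    """Probability mass function of the toy experiment: f(x) = x/10."""
    return x / 10

def cdf(x):
    """Cumulative probability: sum of the pmf over all values up to x."""
    return sum(pmf(k) for k in range(1, x + 1))

print(pmf(3))  # 0.3
print(cdf(3))  # 0.1 + 0.2 + 0.3, i.e. about 0.6
print(cdf(4))  # about 1.0: probabilities over the whole support sum to one
```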
## Continuous distributions
In this section, as the title suggests, we are going to investigate probability distributions of continuous random variables, that is, random variables whose support contains an infinite interval of possible outcomes.
### Uniform Distribution
<img src="https://luminousmen.com/media/uniform.png" style="width: auto !important;"/>
```python
from scipy.stats import uniform
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots(1, 1, figsize=(15,15))
# calculate a few first moments:
#mean, var, skew, kurt = uniform.stats(moments='mvsk')
# display the probability density function (``pdf``):
x = np.linspace(uniform.ppf(0.01), uniform.ppf(0.99), 100)
ax.plot(x, uniform.pdf(x),'r-', lw=5, alpha=0.6, label='uniform pdf')
ax.plot(x, uniform.cdf(x),'b-', lw=5, alpha=0.6, label='uniform cdf')
# Check accuracy of ``cdf`` and ``ppf``:
vals = uniform.ppf([0.001, 0.5, 0.999])
np.allclose([0.001, 0.5, 0.999], uniform.cdf(vals))
# generate random numbers:
r = uniform.rvs(size=1000)
# and compare the histogram:
ax.hist(r, density=True, histtype='stepfilled', alpha=0.2)
ax.legend(loc='best', frameon=False)
plt.show()
```
A uniform distribution is a probability distribution in which all outcomes are equally likely: **each value has the same probability** of being the outcome. A deck of cards contains uniform distributions, because the probability of drawing a heart, a club, a diamond or a spade is the same. A coin also has a uniform distribution, because heads and tails are equally likely in a toss.
More formally, this is the distribution of a random variable that can take any value in the interval (a, b). The probability of falling into any segment inside (a, b) is proportional to the length of the segment and does not depend on its position, and the probability of values outside (a, b) is `0`.
So, a continuous random variable `x` has a uniform distribution, denoted U(a, b), if its probability density function is:
$$
f(x)=\dfrac{1}{b-a}, \quad a \le x \le b
$$
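In `scipy.stats`, U(a, b) is parametrized through `loc` and `scale`: `loc=a`, `scale=b-a`. A small sketch with an arbitrary interval chosen for illustration:

```python
from scipy.stats import uniform

a, b = 2.0, 6.0                   # illustrative interval, not from the article
u = uniform(loc=a, scale=b - a)   # scipy's way of writing U(a, b)

density = u.pdf(3.0)   # 1 / (b - a) = 0.25 anywhere inside (a, b)
outside = u.pdf(8.0)   # 0 outside the interval
half = u.cdf(4.0)      # (4 - a) / (b - a) = 0.5
print(density, outside, half)
```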
### Normal Distribution
<img src="https://luminousmen.com/media/normal.png" style="width: auto !important;"/>
```python
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots(1, 1)
# calculate a few first moments:
mean, var, skew, kurt = norm.stats(moments='mvsk')
# display the probability density function (``pdf``):
x = np.linspace(norm.ppf(0.01), norm.ppf(0.99), 100)
ax.plot(x, norm.pdf(x),
'r-', lw=5, alpha=0.6, label='norm pdf')
ax.plot(x, norm.cdf(x),
'b-', lw=5, alpha=0.6, label='norm cdf')
# check accuracy of ``cdf`` and ``ppf``:
vals = norm.ppf([0.001, 0.5, 0.999])
np.allclose([0.001, 0.5, 0.999], norm.cdf(vals))
# generate random numbers:
r = norm.rvs(size=1000)
# and compare the histogram:
ax.hist(r, density=True, histtype='stepfilled', alpha=0.2)
ax.legend(loc='best', frameon=False)
plt.show()
```
At the center of statistics lies the normal distribution, known to millions of people as the bell curve, or the bell-shaped curve. This is actually a two-parameter family of curves that are graphs of the probability density functions:
$$
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(−\frac{(x−\mu)^2}{2\sigma^2}\right)
$$
It looks a bit scary, but we will work through it. There are two mathematical constants in the density function of the normal distribution:
- π — the ratio of a circle's circumference to its diameter and it is approximately `3.142`;
- e — the base of the natural logarithm, approximately `2.718`;
And two parameters that set the shape of a particular curve:
- `µ` is a mathematical expectation or mean. It shows that data near the mean are more frequent in occurrence than data far from the mean.
- $σ^2$ — the variance, which will be discussed below;
And, of course, the variable `x` itself, for which the value of the function is calculated, i.e. probability density.
Constants, of course, do not change. But the parameters — this is what gives the final appearance of a specific normal distribution.
So, the specific form of the normal distribution depends on 2 parameters: **the expectation (µ) and the variance ($σ^2$)**, briefly denoted $N(µ, σ^2)$. The parameter µ (the expectation) determines the center of the distribution, which corresponds to the maximum height of the graph. The variance $σ^2$ characterizes the range of variation, that is, the "spread" of the data.
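The role of the two parameters can be checked numerically: the density peaks at x = µ with height 1/(σ√2π), so doubling σ halves the peak and flattens the curve. A sketch with arbitrary illustrative values of µ and σ:

```python
import math
from scipy.stats import norm

mu, sigma = 5.0, 2.0  # illustrative parameters, not from the article

# the density peaks at x = mu with height 1 / (sigma * sqrt(2 * pi))
peak = norm.pdf(mu, loc=mu, scale=sigma)
print(peak, 1 / (sigma * math.sqrt(2 * math.pi)))

# doubling sigma halves the peak: the curve spreads out
wider_peak = norm.pdf(mu, loc=mu, scale=2 * sigma)
print(wider_peak)
```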
*Why is this distribution so popular?*
The importance of the normal distributions stems primarily from the fact that the distributions of many natural phenomena are at least **approximately normally distributed**. One of the first applications of the normal distribution was to the analysis of errors of measurement made in astronomical observations, errors that occurred because of imperfect instruments and imperfect observers. Galileo in the 17th century noted that these errors were symmetric and that small errors occurred more frequently than large errors. This led to several hypothesized distributions of errors, but it was not until the early 19th century that it was discovered that these errors followed a normal distribution. Independently, the mathematicians Adrain in 1808 and Gauss in 1809 developed the formula for the normal distribution and showed that errors were fit well by this distribution.
## Discrete distributions
### Binomial distribution (Bernoulli trials)
<img src="https://luminousmen.com/media/binominal.png" style="width: auto !important;"/>
```python
from scipy.stats import bernoulli
import seaborn as sb
data_bern = bernoulli.rvs(size=1000,p=0.6)
ax = sb.distplot(
data_bern,
kde=True,
color='b',
hist_kws={'alpha':1},
kde_kws={"color": "r", "lw": 3, "label": "KDE"})
ax.set(xlabel='Bernouli', ylabel='Frequency')
```
Not all phenomena are measured on a quantitative scale like 1, 2, 3 ... 100500. Not every phenomenon can take an infinite or even a large number of different states. For example, a person's sex is either male or female, a shooter either hits the target or misses, you can vote either "for" or "against", and so on. In other words, such values reflect the state of a binary feature: the event either occurred or it did not. The occurrence of the event (the positive outcome) is also called a "success". Such phenomena can be massive and random as well, so they can be measured and used to draw statistically valid conclusions.
Experiments with such data are called [the Bernoulli scheme](https://en.wikipedia.org/wiki/Bernoulli_scheme), in honor of the famous Swiss mathematician who found that, with a large number of trials, the ratio of positive outcomes to the total number of trials tends to the probability of the event. The number of successes in a series of such trials follows the binomial distribution:
$$
f(x)=\dbinom{n}{x} p^x (1-p)^{n-x}
$$
- `n` - the number of trials in the series;
- `x` - a random variable, the number of occurrences of event A;
- `p` - the probability that A happens in a single trial;
- `q = 1 - p` - the probability that A does not happen in a single trial.
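The formula above can be evaluated directly with `math.comb` and compared against `scipy.stats.binom`; the numbers n = 10 and p = 0.6 below are purely illustrative:

```python
import math
from scipy.stats import binom

def binomial_pmf(x, n, p):
    """P(exactly x successes in n trials), straight from the formula."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.6  # illustrative: 10 trials, 60% success probability
print(binomial_pmf(6, n, p))  # manual formula
print(binom.pmf(6, n, p))     # scipy agrees
```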
### Poisson Distribution
<img src="https://luminousmen.com/media/poisson.png" style="width: auto !important;"/>
```python
from scipy.stats import poisson
import seaborn as sb
import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(15,15))
data_binom = poisson.rvs(mu=4, size=10000)
ax = sb.distplot(data_binom, kde=True, color='b',
bins=np.arange(data_binom.min(), data_binom.max() + 1),
kde_kws={"color": "r", "lw": 3, "label": "KDE"})
ax.set(xlabel='Poisson', ylabel='Frequency')
```
The Poisson distribution is obtained as a limiting case of the binomial distribution if we push `p` to zero and `n` to infinity so that their product remains constant: np = λ. Formally, this transition leads to the formula
$$
f(x) = \frac{{e^{ - \lambda } \lambda ^x }}{{x!}}
$$
- `x` is a random variable (the number of occurrences of event A);
- $\lambda$ - the average number of events in an interval, also called the event rate or rate parameter; it equals both the mean and the variance of the distribution.
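The limiting transition can be verified numerically: fix λ = np and let n grow, and the binomial pmf converges to the Poisson pmf. The values λ = 4 and x = 3 are arbitrary choices for the sketch:

```python
import math
from scipy.stats import binom, poisson

lam = 4.0  # fixed event rate (illustrative)
x = 3      # number of occurrences we look at (illustrative)

for n in (10, 100, 10_000):
    p = lam / n  # keep n * p = lambda constant
    print(n, binom.pmf(x, n, p))

# the binomial pmf approaches the Poisson pmf e^-lambda * lambda^x / x!
print(poisson.pmf(x, lam))
print(math.exp(-lam) * lam ** x / math.factorial(x))
```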
Very many random variables in science and practical life follow the Poisson distribution: equipment breakdowns, the duration of repair work, printing errors, the number of goals scored by a football team.
## Conclusion
There are a lot of theoretical distributions: Normal, Poisson, Student, Fisher, binomial, etc. Each of them was designed to analyze data of various origins and having certain characteristics. In practice, **these distributions are used as some kind of template for analyzing real data of a similar type**. In other words, they try to impose the structure of the chosen theoretical distribution on the real data, thereby calculating the probabilities of interest to the analyst.
More strictly speaking, **theoretical distributions are probabilistic-statistical models** whose properties are used to analyze empirical data. It works something like this: data is collected and compared with a known theoretical distribution. If there is a similarity, then the properties of the theoretical model are transferred, with some degree of confidence, to the empirical data, with the corresponding conclusions. This approach underlies the classical methods of statistical hypothesis testing (calculation of confidence intervals, comparison of means, checking the significance of parameters, etc.).
If the available data do not correspond to any known theoretical distribution (which often happens in practice, though it rarely bothers anyone), then it is not recommended to use the chosen template (probabilistic-statistical model). Illegitimate use of the parametric distributions listed above puts the analyst in the position of searching for his keys not where he lost them, but under the lamppost where there is light. To solve such problems there are other approaches, associated with non-parametric statistics.
### Links
- [MIT Introduction to Probability](https://ocw.mit.edu/resources/res-6-012-introduction-to-probability-spring-2018/part-i-the-fundamentals/)
- [Intro Probability Theory](https://newonlinecourses.science.psu.edu/stat414/)

# Data Science. Bayes theorem

*2019-03-31, https://luminousmen.com/post/25*
Bayes theorem is one of the most important rules of probability theory used in Data Science. It provides us with a way to update our beliefs based on the arrival of new events.
Imagine we have two related events A and B. For example, A — I get wet today; B — it is rainy today. Let's calculate the probability of A given that B has already happened.
![A ∩ B](https://luminousmen.com/media/data-science-bayes-theorem-1.jpg)
Now, since B has happened, the part that matters for A is the shaded region, which is exactly A ∩ B. So the probability of A given B turns out to be:
$$
P(A|B) = \frac{P(A ∩ B)}{P(B)}
$$
Similarly, for event B given that A has already occurred:
$$
P(B|A) = \frac{P(A ∩ B)}{P(A)}
$$
Now, expressing P(A ∩ B) from the formula for P(B|A) and substituting it into the formula for P(A|B) gives:
$$
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
$$
That's all: these are the only steps needed to arrive at Bayes' theorem. Let's combine everything into one picture and rename the members of the formula:
![Bayes' theorem](https://luminousmen.com/media/data-science-bayes-theorem-2.jpg)
- P(A|B) is the **posterior** probability, or the probability of A to occur given event B occurred
- P(B|A) is the **likelihood**, or the probability of B given A
- P(A), P(B) is the **prior** probability of event A or B to occur
It should be noted that for independent events P(B|A) = P(B), which is logical: if the occurrence of event A does not affect the occurrence of event B, there is nothing extra to take into account.
---
Example: If a single card is drawn from a standard deck of playing cards, the probability that the card is a king is 4/52, since there are 4 kings in a standard deck of 52 cards. Rewording this: if King is the event "this card is a king", the prior probability is P(King) = 4/52 = 1/13.
If evidence is provided (for instance, someone looks at the card) that the single card is a face card, then the posterior probability P(King|Face) can be calculated using Bayes' theorem:
P(King|Face) = P(Face|King) * P(King) / P(Face)
Since every king is also a face card, P(Face|King) = 1. Since there are 3 face cards in each suit (jack, queen, king), the probability of a face card is P(Face) = 12/52 = 3/13. Combining these gives a likelihood ratio of P(Face|King) / P(Face) = 1 / (3/13) = 13/3.
Using Bayes' theorem gives P(King|Face) = (13/3) * (1/13) = 1/3.
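The card example can also be checked by brute force over a full deck; exact fractions keep the arithmetic transparent:

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(r, s) for r in ranks for s in suits]

kings = [c for c in deck if c[0] == "K"]
faces = [c for c in deck if c[0] in ("J", "Q", "K")]

p_king = Fraction(len(kings), len(deck))   # prior: 4/52 = 1/13
p_face = Fraction(len(faces), len(deck))   # evidence: 12/52 = 3/13
p_face_given_king = Fraction(1)            # every king is a face card

# Bayes' theorem: posterior = likelihood * prior / evidence
p_king_given_face = p_face_given_king * p_king / p_face
print(p_king_given_face)  # 1/3
```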
---
## Intuitive understanding
A man is sitting with his back to a perfectly flat and perfectly square table. He asks an assistant to throw a ball onto the table. This ball could obviously land **anywhere** on the table, and he wants to figure out where. So he asks his assistant to throw another ball and tell him whether it landed to the left, to the right, in front of, or behind the first ball. He notes this down and then asks for more and more balls to be thrown onto the table. He realizes that through this method he can **keep updating his idea of where the first ball is**. Of course, he can never be completely certain, but with each new piece of evidence he gets more and more accurate.
And that’s how Bayes saw the world, it is his thought experiment. It wasn’t that he thought the world was not determined, that reality didn’t exist, but it was that we couldn’t know it perfectly and all we could hope to do was update our understanding as more and more evidence was available.
I recommend watching the video below; it's cool anyway: <a href="https://youtu.be/R13BD8qKeTg">Bayes theorem</a>
## Conclusion
The fundamental idea of Bayesian inference is to become "less wrong" with more data. The process is straightforward: we have an initial belief, known as a prior, which we update as we gain additional information.
The conclusions drawn from Bayes' theorem are logical but counterintuitive. Almost always, people pay a lot of attention to the posterior probability but overlook the prior.
Using this simple formula we can already construct some models, but let's not get ahead of ourselves. Basics first.
# Data Science. Probability

*2019-03-24, https://luminousmen.com/post/24*
> True logic of this world lies in the calculus of probabilities
> — James Clerk Maxwell
Data science often uses statistical inference to predict or analyze trends from data, and statistical inference relies on the probability distributions of that data. Hence, knowing probability and its applications is important for working effectively on data science problems.
## Events. Probabilities of events
One of the basic concepts in statistics is an event. Events are simply results of some experiments. Events can be certain, impossible and random.
A **certain event** is an event that as a result of an experiment (the execution of certain actions, a certain set of conditions) will necessarily occur. For example, a tossed coin will certainly fall.
An **impossible event** — an event that, as the name implies, will not occur as a result of the experiment. For example, a tossed coin will fly away into the sky.
And finally, an event is called a **random event** if it may or may not occur as a result of the experiment. Such an experiment must satisfy a fundamental criterion of randomness: a random event is a consequence of **random factors** whose influence cannot be predicted, or is extremely difficult to predict. For example, as a result of a coin toss, the coin lands tails. Here the random factors are the shape and physical characteristics of the coin, the strength and direction of the throw, air resistance, and so on.
![Probability](/media/data-science-probability_1.jpg)
Let us consider in more detail the flip of a coin (meaning a fair coin — a coin in which both results ("heads and tails") are equally likely). There are 2 mutually exclusive outcomes: heads or tails. The outcome of the flip is random since the observer cannot analyze and take into account all the factors that influence the result. What is the probability of heads? Most answer ½, but why?
Let's denote by A the event that the coin comes up heads, and let the coin be tossed `n` times. Then the probability of event A can be defined as:
`The probability of an event happening = Number of ways it can happen / Total number of outcomes`
This ratio is called **the frequency of event A in a long series of trials**.
---
*Example: there are 4 Kings in a deck of 52 cards. What is the probability of picking a King?*
Number of ways it can happen: 4 (there are 4 Kings)
Total number of outcomes: 52 (there are 52 cards in total)
So the probability = 4/52 = 1/13
---
It turns out that in various test series the corresponding frequency for large `n` is fluctuating around a constant value `P(A)`. This value is called the probability of event A and is denoted by the letter P — an abbreviation for probability.
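This frequency interpretation is easy to simulate: toss a virtual coin many times and watch the frequency settle around 0.5. The sample sizes and seed below are arbitrary choices:

```python
import random

random.seed(7)

def heads_frequency(n):
    """Toss a fair coin n times and return the fraction of heads."""
    return sum(random.random() < 0.5 for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    print(n, heads_frequency(n))  # fluctuates around 0.5, tighter as n grows
```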
The probability lies in the range [0, 1], where, in general, 0 indicates the impossibility of the event, and 1 indicates certainty. The higher the probability of an event, the greater the likelihood that an event will occur.
---
*Example: What is the probability of drawing a Jack or a Queen from a well-shuffled deck of 52 cards?*
Since we have 4 Jacks and 4 Queens, and no card can be both, the probability is simply the sum of the individual probabilities.
Number of ways it can happen: 4 (there are 4 Jacks) and 4 (there are 4 Queens) = 8
Total number of outcomes: 52 (there are 52 cards in total)
So the P(Jack or Queen) = 8/52 = 2/13
---
## Dependent/Independent/Disjoint Events
Two random events A and B are called **independent** if the occurrence of one of them does not change the probability of the occurrence of the other. Otherwise, events A and B are called **dependent**.
Knowing that the coin landed on a head on the first toss, does not provide any useful information for determining what the coin will land on in the second toss. The probability of a head or a tail on the second toss is 1/2, regardless of the outcome of the first toss. Probabilities of independent events can be multiplied to get the total probability of the occurrence of all of them.
---
*Example: What are the chances of getting heads 3 times in a row?*
Let's define possible outcomes of 3 coin tosses(H - heads, T - tails):
HHH, HHT, HTH, THH, TTH, THT, HTT, TTT
Number of ways it can happen: 1
Total number of outcomes: 8
So, the answer is 1/8. However, we know that the results of a coin toss are independent and we can multiply them to get the total probability: P(3x heads) = P(heads)P(heads)P(heads) = 1/2 * 1/2 * 1/2 = 1/8
---
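Enumerating the sample space confirms the multiplication rule from the example above:

```python
from itertools import product

# all 2**3 = 8 equally likely outcomes of three tosses
outcomes = list(product("HT", repeat=3))

p_three_heads = sum(o == ("H", "H", "H") for o in outcomes) / len(outcomes)
print(p_three_heads)    # 0.125
print(0.5 * 0.5 * 0.5)  # the multiplication rule gives the same 1/8
```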
On the other hand, knowing that the first card drawn from a deck is an ace does provide useful information for calculating the probabilities of outcomes in the second draw: the probability of drawing another ace is 3 over 51 instead of 4 over 52.
**Disjoint Events** cannot happen at the same time. A synonym for this term is mutually exclusive.
For example, the outcome of a single coin toss cannot be a head and a tail, it can be either head or tails.
Non-disjoint events can happen at the same time. They can overlap, and the probability of the overlap should be excluded from the total probability to avoid double counting.
---
Example: *What is the probability of drawing a Jack or a Red card from a well-shuffled deck of 52 cards?*
Number of ways it can happen: 4 (there are 4 Jacks) and 26 (there are 26 Red cards). But 2 cards overlap: the two red jacks fit both criteria.
Total number of outcomes: 52 (there are 52 cards in total)
So, P(Jack or Red card) = P(Jack) + P(Red card) - P(Jack and Red card) = 4/52 + 26/52 - 2/52 = 7/13
---
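The same deck enumeration verifies the inclusion-exclusion arithmetic from the example above:

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(r, s) for r in ranks for s in suits]

jack_or_red = [
    c for c in deck if c[0] == "J" or c[1] in ("hearts", "diamonds")
]
# 4 jacks + 26 red cards - 2 red jacks counted twice = 28 cards
print(Fraction(len(jack_or_red), len(deck)))  # 7/13
```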
## Types of probabilities
![Joint Probability](/media/data-science-probability_3.jpg)
**Joint Probability** is a probability of more than one event occurring simultaneously. The joint probability is the probability that event A will occur at the same time as event B.
For example, from a deck of 52 cards, the joint probability of drawing a card that is both red and a 6 is P(6 ∩ red) = 2/52 = 1/26, since there are two red sixes in the deck: the six of hearts and the six of diamonds. You can also use the multiplication formula, since rank and color are independent: P(6 ∩ red) = P(6)P(red) = 4/52 * 26/52 = 1/26.
The symbol "∩" in joint probability is an intersection. The probability of occurrence of an event A and an event B is the same as the intersection of A and B. Therefore, the joint probability is also called the intersection of two or more events. Venn Diagram is perhaps the best visual tool to explain.
**Marginal Probability** — a probability of any single event occurring unconditioned on any other events.
Whenever someone asks you whether the weather is going to be rainy or sunny today(without any conditional or prior information), you are computing a marginal probability.
![Conditional Probability](/media/data-science-probability_4.jpg)
**Conditional probability** — is a measure of the probability of an event (some particular situation occurring) given that (by assumption, presumption, assertion or evidence) another event has occurred. When I ask you what is the probability that today will be rainy or sunny given that I noticed the temperature is going to be above 80 degrees, you are computing a [conditional probability](https://youtu.be/KqCKZwh5WY8). There is a specific notation for conditional probability, it is shown on the image above.
So, we want the probability of event B given A. Since A has occurred, the outcomes that lead to B are those in A ∩ B, and the set of all possible outcomes is now A. Thus,
$$
P(B|A) = \frac{P(A ∩ B)}{P(A)}
$$
---
*Example: A math teacher gave her class two tests. 25% of the class passed both tests and 42% of the class passed the first test. What percent of those who passed the first test also passed the second test?*
P(Second|First) = P(First and Second) / P(First) = 0.25/0.42 ≈ 0.60 = 60%
---
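The teacher example above reduces to one division; a quick check:

```python
p_first = 0.42             # fraction that passed the first test
p_first_and_second = 0.25  # fraction that passed both tests

# conditional probability: P(Second | First) = P(First and Second) / P(First)
p_second_given_first = p_first_and_second / p_first
print(round(p_second_given_first, 2))  # 0.6, i.e. about 60%
```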
## Conclusion
In this post we got acquainted with the basic concepts of probability and probability algebra. In the next post we will talk about the Bayes theorem and look at the world through the eyes of Bayes.
### Additional material
- [Conditional probability visualizations](http://setosa.io/conditional/)
- [Dependent probability introduction](https://www.khanacademy.org/math/math2/math2-prob)
- [Independent & dependent probability](https://www.khanacademy.org/math/math2/math2-prob/math2-mul-rule-dependent/v/independent-events-1?modal=1)
- [Independent Problems](https://www.khanacademy.org/math/math2/math2-prob/math2-mul-rule-independent/v/independent-events-2?modal=1)
- [Conditional Prob Article](https://www.khanacademy.org/math/math2/math2-prob/math2-conditional-prob/a/conditional-probability-using-two-way-tables?modal=1)

# Best blogs/podcasts to follow for Python developers

*2019-03-10, https://luminousmen.com/post/23*

I want to share the blogs and podcasts I occasionally read and listen to.
## [Python subreddit](https://www.reddit.com/r/Python/)
First things first. This is a good feed for any Python related news.
## [Dev.to](https://dev.to/t/python)
This is a great source of information for beginners and I really love the community around this resource.
## [Ned Batchelder blog](https://nedbatchelder.com/blog/)
This is the personal blog of well known in Python community Ned Batchelder.
## [Nikita Sobolev blog](https://sobolevn.me/)
Nikita Sobolev's blog mostly about Python, engineering practices and career development. Always interesting for me to read his posts and thoughts.
## [Real Python](https://realpython.com/)
Here you learn not so much the basics of the language as how to really write applications in Python. High signal-to-noise ratio.
## [The Invent with Python Blog](http://inventwithpython.com/blog/)
Blog of the author of "Automate the Boring Stuff with Python". Will be interesting for beginners.
## [Talk Python to me](https://talkpython.fm/episodes/all)
Talk Python To Me is a podcast for developers who want to know news about the language and related technologies. Sometimes good, sometimes boring but worth sharing.
## [Podcast.__init__](https://www.pythonpodcast.com)
I like its slogan: "A podcast about Python and the people who make it great". It provides insights into the projects, platforms, and practices that engineers, business leaders, and data scientists need to know about to learn and grow in their careers. Also worth mentioning is its sister podcast: [Data Engineering podcast](https://www.dataengineeringpodcast.com).
## [Inside the Head of PyDanny (Daniel Greenfeld)](https://www.pydanny.com/)
Blog of Daniel Roy Greenfeld about Python and Django.
## [SaltyCrane Blog](https://www.saltycrane.com/blog/)
It will be interesting for Python full stack developers as well as JS developers.
## [Weekly Python Newsletter](https://importpython.com/newsletter/)
It is a weekly Python Newsletter. It picks the best of Python blog posts, open source projects, interesting tweets.
## [Python Coder](https://stevepython.wordpress.com/)
Blog for Python beginners. The author provides a lot of tips and code snippets.
## [The Python Guru](https://thepythonguru.com/blog/)
The author writes about Python-related things but with a bias towards Data Engineering.
## [Python Tips](https://pythontips.com/)
Good source of information, but recently not really Python-related.
## [Neopythonic](http://neopythonic.blogspot.com/)
Then there is Guido's own blog :). It is not regularly updated, but it may be interesting.
## [Luminousmen blog](http://luminousmen.com)
This is the blog you are currently reading. I write about Python, Data Science, Big Data, and software engineering.

# Asynchronous programming. Python3.5+

[https://luminousmen.com/post/22](https://luminousmen.com/post/22) (2019-03-03)

This is a practical post in the series on asynchronous programming.
Whole series:
- [Asynchronous programming. Blocking I/O and non-blocking I/O](https://luminousmen.com/post/asynchronous-programming-blocking-and-non-blocking)
- [Asynchronous programming. Cooperative multitasking](https://luminousmen.com/post/asynchronous-programming-cooperative-multitasking)
- [Asynchronous programming. Await the Future](https://luminousmen.com/post/asynchronous-programming-await-the-future)
- [Asynchronous programming. Python3.5+](https://luminousmen.com/post/asynchronous-programming-python3.5)
In this post we will map the concepts covered so far onto the Python stack: from the simplest primitives, threads and processes, up to the asyncio library.
Asynchronous programming in Python has become more and more popular lately. There are many different libraries in Python for asynchronous programming. One of them is asyncio, a standard library module added in Python 3.4. In Python 3.5 we got the async/await syntax. Asyncio is part of the reason asynchronous programming is becoming more popular in Python. This article will explain what asynchronous programming is and compare some of these libraries.
---
## Quick Recap
What we have learned so far from the previous posts:
- **Sync:** Blocking operations.
- **Async:** Non-blocking operations.
- **Concurrency:** Making progress together.
- **Parallelism:** Making progress in parallel.
- **Parallelism implies Concurrency**. But [Concurrency doesn’t always mean Parallelism](https://luminousmen.com/post/concurrency-and-parallelism-are-different).
---
Python code can now be mainly executed in one of two worlds, synchronous or asynchronous. You should think of them as **separate worlds with different libraries and styles of calls**, but sharing variables and syntax.
In the synchronous Python world, which has existed for several decades, you call functions directly, and everything is processed sequentially, exactly as you wrote your code. There are options to run the code concurrently.
## Synchronous world
In this post we will be comparing different implementations of the same code. We will try to execute two functions. The first one raises a number to a power:
```python
def cpu_bound(a, b):
    return a ** b
```
We will do it N times:
```python
def simple_1(N, a, b):
    for i in range(N):
        cpu_bound(a, b)
```
And the second one downloads data from the web:
```python
def io_bound(urls):
    data = []
    for url in urls:
        data.append(urlopen(url).read())
    return data

def simple_2(N, urls):
    for i in range(N):
        io_bound(urls)
```
To measure how long a function runs, we implement a simple combined decorator/context manager:
```python
import time
import functools

class timeit(object):
    def __call__(self, f):
        @functools.wraps(f)
        def decorated(*args, **kwds):
            with self:
                return f(*args, **kwds)
        return decorated

    def __enter__(self):
        self.start_time = time.time()

    def __exit__(self, *args, **kw):
        elapsed = time.time() - self.start_time
        print("{:.3} sec".format(elapsed))
```
Now let's put it all together and run it, to see how long my machine takes to execute this code:
```python
import time
import functools
from urllib.request import urlopen

class timeit(object):
    def __call__(self, f):
        @functools.wraps(f)
        def decorated(*args, **kwds):
            with self:
                return f(*args, **kwds)
        return decorated

    def __enter__(self):
        self.start_time = time.time()

    def __exit__(self, *args, **kw):
        elapsed = time.time() - self.start_time
        print("{:.3} sec".format(elapsed))

def cpu_bound(a, b):
    return a ** b

def io_bound(urls):
    data = []
    for url in urls:
        data.append(urlopen(url).read())
    return data

@timeit()
def simple_1(N, a, b):
    for i in range(N):
        cpu_bound(a, b)

@timeit()
def simple_2(N, urls):
    for i in range(N):
        io_bound(urls)

if __name__ == '__main__':
    a = 7777
    b = 200000
    urls = [
        "http://google.com",
        "http://yahoo.com",
        "http://linkedin.com",
        "http://facebook.com",
    ]

    simple_1(10, a, b)
    simple_2(10, urls)
```
We executed the same functions `N` times sequentially.
On my hardware, `cpu_bound` function took 2.18 sec, `io_bound` — 31.4 sec.
So we have our baseline performance. Let's move on to threads.
### Threads
![threads](https://luminousmen.com/media/asynchronous-programming-python3.5-1.jpg)
A thread is the smallest unit of processing that can be performed in an OS.
Threads of a process can **share the memory of global variables**. If a global variable is changed in one thread, this change is valid for all threads.
In simple words, a **thread** is a sequence of such operations within a program that can be executed independently of other code.
Threads execute **concurrently**, but they may also execute in parallel; it depends on the system they are running on.
Python threads are implemented using OS threads in all implementations I know (CPython, PyPy and Jython). For each Python thread, there is an underlying OS thread.
One thread is executed on one processor core per unit of time. It works until it consumes its time slice (default is 100 ms) or until it gives up control to the next thread by making a system call.
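The claim about shared globals is easy to verify with a minimal sketch (the variable and function names here are mine, invented for illustration):

```python
import threading

counter = 0  # a global shared by all threads of the process

def bump():
    global counter
    counter += 1  # every thread mutates the same object

threads = [threading.Thread(target=bump) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 5: all five threads saw and changed the same global
```

Note that in real code such unsynchronized increments need a lock; this toy example only demonstrates that the memory is shared.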
Let's try to implement our example functionality using threads:
```python
from threading import Thread

@timeit()
def threaded(n_threads, func, *args):
    jobs = []
    for i in range(n_threads):
        thread = Thread(target=func, args=args)
        jobs.append(thread)

    # start the threads
    for j in jobs:
        j.start()

    # ensure all of the threads have finished
    for j in jobs:
        j.join()

if __name__ == '__main__':
    ...
    threaded(10, cpu_bound, a, b)
    threaded(10, io_bound, urls)
```
On my hardware, `cpu_bound` took 2.47 sec, `io_bound` — 7.9 sec.
The I/O-bound function executed about 4 times faster because we download the data concurrently in separate threads. But why does the CPU-bound operation run slower?
In CPython, the reference implementation of Python, there is the infamous GIL (Global Interpreter Lock). Which brings us to the next section...
### Global Interpreter Lock (GIL)
First of all, the GIL is a lock that must be held before any access to Python internals (and this means not only executing Python code but also calling the Python C API). In essence, **the GIL is a global semaphore** that allows at most one thread to work within the interpreter at a time.
Strictly speaking, the only call available to a thread that does not hold the GIL is the call to acquire it. Violating this rule leads to an instant crash (the best option) or a delayed crash of the program (much worse and harder to debug).
*How does it work?*
When a thread starts, it acquires the GIL. After a while, the process scheduler decides that the current thread has done enough and gives control to the next thread. Thread #2 sees that the GIL is taken, so instead of running it puts itself to sleep, yielding the processor to thread #1.
But a **thread cannot hold the GIL indefinitely**. Before the new GIL arrived in Python 3.2, the interpreter switched threads every 100 bytecode instructions. In later versions, a thread can hold the GIL for no longer than 5 ms by default. The GIL is also released when the thread makes a system call or works with the disk or network (an I/O-bound operation).
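The 5 ms slice is exposed in the standard library as the interpreter's thread switch interval; a quick sketch:

```python
import sys

# CPython exposes the GIL "time slice" as the thread switch interval;
# the default on modern CPython is 5 ms.
print(sys.getswitchinterval())  # 0.005

# It can be tuned, e.g. to make threads switch less often:
sys.setswitchinterval(0.01)
print(sys.getswitchinterval())
```

Tuning this is rarely needed, but it makes the "thread cannot hold the GIL forever" mechanism tangible.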
In fact, the GIL in Python makes using threads for parallelism in computational problems (CPU-bound operations) pointless. They will run sequentially even on a multiprocessor system. **On CPU-bound tasks, the program will not speed up**, but only slow down, because the threads now have to share processor time. At the same time, the **GIL will not slow down I/O-bound operations**, since the thread releases the GIL before making a system call.
It is clear that the GIL slows down the execution of our program due to the extra work of creating threads and communicating between them, acquiring and releasing the semaphore, and preserving context. But it needs to be mentioned that the GIL **does not forbid** parallel execution outside the interpreter: I/O and C extensions that release the lock can still run in parallel.
GIL is not part of the language and does not exist in all language implementations, but only in the above mentioned CPython.
*So why the heck does it even exist?*
**The GIL protects interpreter data structures from concurrent access problems**. For example, it prevents race conditions when an object's reference count changes. The GIL makes it easy to integrate non-thread-safe C libraries; thanks to it, we have so many fast modules and bindings for almost everything.
There are exceptions: the GIL control mechanism is available to C libraries. For example, `NumPy` releases it during long operations, and with the `numba` package the programmer can control releasing the lock.
On this sad note, we can conclude that threads are enough for parallelizing tasks tied to I/O, but computational tasks should run in separate processes.
### Processes
From the OS point of view, a process is a data structure that holds a memory area and some other resources, for example, files opened by it.
Often a process has one thread, called the **main** thread, but a program can create any number of threads. A thread is not allocated individual resources at the start; instead, it uses the memory and resources of the process that spawned it. Because of this, threads can start and stop quickly.
Multitasking is handled by the scheduler — part of the OS kernel, which in turn loads the execution threads into the central processor.
Like threads, processes always execute **concurrently**, and they can also execute in parallel, depending on the hardware.
```python
from multiprocessing import Process

@timeit()
def multiprocessed(n_processes, func, *args):
    processes = []
    for i in range(n_processes):
        p = Process(target=func, args=args)
        processes.append(p)

    # start the processes
    for p in processes:
        p.start()

    # ensure all processes have finished execution
    for p in processes:
        p.join()

if __name__ == '__main__':
    ...
    multiprocessed(10, cpu_bound, a, b)
    multiprocessed(10, io_bound, urls)
```
On my hardware, `cpu_bound` took 1.12 sec, `io_bound` — 7.22 sec.
So, the computational operation executed faster than the threaded implementation because we are no longer stuck acquiring the GIL, while the I/O-bound function took slightly longer because processes are more heavyweight than threads.
## Asynchronous world
![asynchronous programming](https://luminousmen.com/media/asynchronous-programming.jpg)
In the asynchronous world, things change a bit. Everything runs in a central event loop, a small piece of code that allows you to run several coroutines at the same time (an important term; in simple words, coroutines are not OS-managed threads but cooperatively multitasking routines, hence not truly parallel). Coroutines run synchronously until they reach an operation they must wait for, then stop, transfer control to the event loop, and something else can happen in the meantime.
### Green threads
![Green threads](https://luminousmen.com/media/asynchronous-programming-python3.5-2.jpg)
[Green threads](https://luminousmen.com/post/asynchronous-programming-await-the-future) are a primitive level of asynchronous programming. A green thread is a regular thread, except that switching between threads is done in application code (at the user level), not by the processor (at the OS level). The core of it is [non-blocking operations](https://luminousmen.com/post/asynchronous-programming-blocking-and-non-blocking). Switching between threads happens only on I/O; a thread that never does I/O will hold control forever.
[Gevent](http://www.gevent.org/) is a well-known Python library for green threads and non-blocking I/O. `gevent.monkey` modifies the behavior of the standard Python libraries so that they perform non-blocking I/O operations.
Other libraries:
- [eventlet](http://eventlet.net/)
- [Tornado](https://www.tornadoweb.org/en/stable/)
- [Twisted](https://twistedmatrix.com/)
- [Stackless Python](https://en.wikipedia.org/wiki/Stackless_Python)
Let's see how performance changes if we start using green threads with the gevent library:
```python
import gevent
import gevent.monkey

# patch other imported modules that contain blocking code
# to make them asynchronous
gevent.monkey.patch_all()

@timeit()
def green_threaded(n_threads, func, *args):
    jobs = []
    for i in range(n_threads):
        jobs.append(gevent.spawn(func, *args))
    # ensure all jobs have finished execution
    gevent.wait(jobs)

if __name__ == '__main__':
    ...
    green_threaded(10, cpu_bound, a, b)
    green_threaded(10, io_bound, urls)
```
Results are: `cpu_bound` — 2.23 sec, `io_bound` — 6.85 sec.
Slower for CPU-bound function, and faster for I/O-bound function. As expected.
### Asyncio
The `asyncio` package is described in the Python documentation as [a library to write concurrent code](https://docs.python.org/3/library/asyncio.html). However, `asyncio` is neither multithreading nor multiprocessing; it is not built on top of either.
While Gevent and Twisted aim to be higher-level frameworks, asyncio aims to be a lower-level implementation of an asynchronous event loop, with the intention that higher-level frameworks like Twisted, Gevent, or Tornado will build on top of it. By itself, though, it makes a suitable framework too.
In fact, `asyncio` is a single-threaded, single-process project: it uses [**cooperative multitasking**](https://luminousmen.com/post/asynchronous-programming-cooperative-multitasking). `asyncio` allows us to write asynchronous concurrent programs running in the same thread, using an event loop for scheduling tasks and multiplexing I/O through sockets (and other resources).
`asyncio` provides us an event loop along with other good stuff. The event loop tracks different I/O events and switches to tasks which are ready and pauses the ones which are waiting on I/O. Thus we don’t waste time on tasks which are not ready to run right now.
### How it works
Synchronous and asynchronous functions/callables are different types: **you can't just mix them**. If you block a coroutine synchronously (say, by using `time.sleep(10)` rather than `await asyncio.sleep(10)`), you never return control to the event loop and the entire process blocks.
You should think of your codebase as comprised of pieces of either sync code or async code — anything inside an `async def` is async code, anything else (including the main body of a Python file or class) is synchronous code.
![asynchronous api](https://luminousmen.com/media/asynchronous-programming-await-the-future-4.jpg)
The idea is very simple. There's an event loop, and we have asynchronous functions (coroutines) in Python, declared with `async def`, which changes how calling them behaves. In particular, **calling one immediately returns a coroutine object**, which basically says: "I can run the coroutine and return a result when you await me".
We give those functions to the event loop and ask it to run them for us. The event loop gives us back a `Future` object: it's like **a promise that we will get something back in the future**. We hold on to the promise, check from time to time whether it has a value (when we feel impatient), and finally, when the future has a value, we use it in other operations.
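A tiny sketch of that promise mechanic (the `compute` coroutine and its value are made up for illustration):

```python
import asyncio

async def compute():
    await asyncio.sleep(0.1)  # pretend to wait on I/O
    return 42

loop = asyncio.new_event_loop()
task = loop.create_task(compute())  # a Task is a subclass of Future
print(task.done())                  # False: the promise has no value yet

loop.run_until_complete(task)       # drive the loop until the future resolves
print(task.done())                  # True: the promise is fulfilled
print(task.result())                # 42
loop.close()
```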
When you call `await`, the function is suspended while whatever you asked to wait on happens; when it finishes, the event loop wakes the function up again and resumes it from the `await` call, passing the result out. Example:
```python
import asyncio

async def say(what, when):
    await asyncio.sleep(when)
    print(what)

loop = asyncio.get_event_loop()
loop.run_until_complete(say('hello world', 1))
loop.close()
```
In this example, the `say()` coroutine pauses and gives control back to the event loop, which sees that `sleep` needs to run, calls it; that call awaits and gets suspended with a marker to resume it in one second. Once it resumes, `say` completes and prints its message, and the event loop finishes `run_until_complete` with the returned value.
This is how async code can have so many things happening at once: anything that blocks calls await and gets put onto the event loop's list of paused coroutines so something else can run. Everything that is paused has an associated callback that will wake it up again; some are time-based, some are I/O-based, and most are like the example above, waiting for a result from another coroutine.
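A small sketch that demonstrates the effect: two 0.2-second sleeps awaited together take about 0.2 seconds total, not 0.4 (`asyncio.run` requires Python 3.7+):

```python
import asyncio
import time

async def nap(delay):
    await asyncio.sleep(delay)

async def main():
    start = time.monotonic()
    # both coroutines are parked on the event loop at the same time,
    # so the waits overlap instead of adding up
    await asyncio.gather(nap(0.2), nap(0.2))
    return time.monotonic() - start

elapsed = asyncio.run(main())
print(round(elapsed, 1))  # ~0.2
```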
Let's return to our example. We have two blocking functions, `cpu_bound` and `io_bound`. As I said, we cannot mix synchronous and asynchronous operations: **we must make all of them asynchronous**. Naturally, there are not asynchronous libraries for everything. **Some code remains [blocking](https://luminousmen.com/post/asynchronous-programming-blocking-and-non-blocking)**, and it must somehow be run so that it does not block our event loop. For that there is the handy `run_in_executor()` method, which runs whatever we pass to it in one of the threads of a built-in pool, without blocking the main thread with the event loop. We will use this for our CPU-bound function. The I/O-bound function we will rewrite completely, awaiting at the moments when we are waiting for an event.
```python
import asyncio
import aiohttp

async def async_func(N, func, *args):
    coros = [func(*args) for _ in range(N)]
    # run awaitable objects concurrently
    await asyncio.gather(*coros)

async def a_cpu_bound(a, b):
    # run the blocking function in the default thread pool
    result = await loop.run_in_executor(None, cpu_bound, a, b)
    return result

async def a_io_bound(urls):
    # a coroutine that downloads a single url
    async def download_coroutine(session, url):
        async with session.get(url, timeout=10) as response:
            await response.text()

    # set up an aiohttp session and download all our urls
    async with aiohttp.ClientSession(loop=loop) as session:
        for url in urls:
            await download_coroutine(session, url)

if __name__ == '__main__':
    ...
    loop = asyncio.get_event_loop()
    with timeit():
        loop.run_until_complete(async_func(10, a_cpu_bound, a, b))
    with timeit():
        loop.run_until_complete(async_func(10, a_io_bound, urls))
```
Results are: `cpu_bound` — 2.23 sec, `io_bound` — 4.37 sec.
Slower for CPU-bound function, and almost twice as fast as the threaded example for I/O-bound function.
## Making the Right Choice
- CPU-bound -> multiprocessing
- I/O-bound, fast I/O, Limited Number of Connections -> multithreading
- I/O-bound, slow I/O, many connections -> asyncio
## Conclusion
Threads are the easier choice if you have a typical web application that does not depend on external services and a relatively limited number of users, for whom response times will be predictably short.
Async is suitable if the application spends most of its time reading and writing data rather than processing it. For example, you have a lot of slow requests (websockets, long polling), or there are slow external synchronous backends whose requests finish at unpredictable times.
Synchronous programming is where application development most often begins: commands are executed sequentially.
Even with conditional branching, loops, and function calls, we think about the code as performing one step at a time; only after the current step completes does it proceed to the next.
An asynchronous application behaves differently. It still runs one step at a time, but the difference is that the system keeps moving forward: it does not wait for the current execution step to complete. As a result, we arrive at [event-driven programming](https://en.wikipedia.org/wiki/Event-driven_programming).
asyncio is a great library, and it's cool that it was included in the Python standard library. An ecosystem (aiohttp, asyncpg, etc.) has already begun to grow around asyncio for application development. There are other event loop implementations ([uvloop](https://github.com/MagicStack/uvloop), [dabeaz/curio](https://github.com/dabeaz/curio), [python-trio/trio](https://github.com/python-trio/trio)), and I think asyncio will evolve into an even more powerful tool.
### Links
- [PEP 342](https://www.python.org/dev/peps/pep-0342/)
- [PEP 492](https://www.python.org/dev/peps/pep-0492/)
- Check the old [Guido presentation](https://www.dropbox.com/s/essjj4qmmtrhys4/SFMeetup2013.pdf) of the asyncio approach.
- An interesting talk by Robert Smallshire: ["Get to grips with asyncio in Python 3"](https://youtu.be/M-UcUs7IMIM)
- [David Beazley's Curio library](https://curio.readthedocs.io/en/latest/)
- [Trio project](https://trio.readthedocs.io/en/latest/)
- [David Beazley talk](https://youtu.be/ZzfHjytDceU) about getting rid of asyncio
- [uvloop - faster event-loop for asyncio](https://magic.io/blog/uvloop-blazing-fast-python-networking/)
- [Some thoughts on asynchronous API design in a post-async/await world](https://vorpus.org/blog/some-thoughts-on-asynchronous-api-design-in-a-post-asyncawait-world/)

# 11 steps of Scrum

[https://luminousmen.com/post/21](https://luminousmen.com/post/21) (2019-02-24)

99% of those dissatisfied with the results of their project are sure they know who is to blame. Owners scold lazy contractors and programmers, programmers scold stupid customers, and so on. Who is right? All of them are mistaken. Each of us believes that only we react to a situation objectively, while the behavior of others is motivated purely by their personal characteristics.
**Blame is generally silly**. Do not look for bad people; look for harmful systems: systems that provoke bad processes and reward bad work. Scrum is a technique that, instead of looking for the guilty, investigates the system that became the source of the error and fixes it.
Scrum is based on a simple idea — whenever a project is started, nothing prevents you from regularly checking the progress of work and consistently finding out:
- Can your team handle the tasks;
- Are you moving in the right direction;
- Does your team create exactly what the customer actually wants.
The process [described by Jeff Sutherland](https://www.scrumguides.org/docs/scrumguide/v2016/2016-Scrum-Guide-US.pdf) is called **"check and adapt"**. In other words, at any suitable moment and as often as possible you should interrupt the work, review what has already been done, and find out whether everything is done correctly, what is needed, and how to do it better.
How to do it? The author of the scrum technique offers a list of 11 items.
## 1. Pick a Product Owner
[Jeff Sutherland](https://scrumguides.org/jeff.html) calls the owner of the product **not the main investor**, but a specialist, professional, who has the vision of what your team is going to do/create/achieve. He makes the final decisions, is always available for the team and ready to answer questions. It is he who takes into account the risks and benefits: what needs to be done, what can be done to maximize the benefits of the product and what will inspire the whole team. It is the product owner who is responsible for the [OODA cycle](https://en.wikipedia.org/wiki/OODA_loop) (an abbreviation for observing, orienting, deciding, acting).
![OODA](/media/11-steps-of-scrum-2.jpg)
This concept was created by the military pilot John Boyd. In the skies over North Korea, working by OODA principles helped F-86 pilots, whose aircraft offered a wider field of view, gain superiority over the more maneuverable and better-armed MiG-15. The one who masters OODA is able to get inside the opponent's decision-making cycle. So what is it about?
**Observe** means seeing the situation clearly as events unfold and somehow converting these events into data. Boyd recommends going beyond your own "I" to see the picture as a whole, not only from your own point of view.
**Orient** is not only about where you are; it is also the variety of situations you can see: a menu of opportunities you create for yourself. Here you transform the findings from the first step, understand them, and analyze them.
**Decide** is where you choose a solution from the options, based on the plan developed at the orient stage.
**Act** speaks for itself.
Then the cycle repeats: you need to start watching again — the results of your own actions, the actions of your opponents, competitors or the market reaction. Then set a new plan or goal where you need to go and choose the way you will be reaching that goal and act by your plan.
The scrum methodology, with the addition of the notion **"increment of work" or "sprint"** to this model, allows the Product Owner to see how much value this increment creates and how clients react to it. Based on the information received, he can change the tasks or its priorities for the next sprint. In this way, a constant feedback loop is established, which speeds up the process of innovation and adaptation and allows the product owner to measure the amount of added value.
## 2. Build a team
Who are the people who have the job to do? Professionals in the team must have all the skills and knowledge necessary to bring the product owner’s idea to life. A gold standard of scrum is to have **from 3 to 9 people in one team**. I think the ideal size is 6 people including the Scrum Master — large enough to ensure adequate skill sets, but small enough to collaborate easily.
The main attributes of the team:
- motivation to create a product;
- ability to create a product;
- autonomy, the freedom to do the job in the way the team considers best;
- focusing on the customer and their needs;
## 3. Pick a Scrum Master
It is critical that someone asks difficult questions. You need a character like Shakespeare's wise jester: the person who follows the project, ensures that all the short meetings are held, and helps the team eliminate obstacles. Most often the Scrum Master is a project manager. But it is never a good idea to combine the Scrum Master and Product Owner roles; they have different responsibilities, and a conflict of interest may arise.
## 4. Create a Product Backlog
The **product backlog** is a list of absolutely all requirements for the product, ordered by priority. The backlog exists and evolves throughout the whole life of the product, serving as its guide. The product backlog is the single, unambiguous statement of **"everything the team could, in principle, do, in order of priority"**.
There can be only one product backlog. This means that the product owner must make priority decisions based on the full range of tasks. The product owner must talk with all interested people and the team to ensure full feedback and display all the requirements and wishes of the client in the backlog.
## 5. Specify and estimate the product backlog
It is very important that the team members who will perform the tasks from the backlog estimate how much effort they will take. In Scrum terms, this is called **grooming**.
The team members should look at each task and determine:
1. Whether it is **feasible** at all.
2. Is there **enough** information to complete the task?
3. Is it **understandable** enough to be evaluated?
4. Is there a **common understanding** of what standards and criteria it must meet to be fulfilled?
5. Does it create **real value**?
It must be possible to demonstrate the result of each task. Do not estimate backlog tasks in hours or money; people do poorly with this. Estimate in **relative sizes**: "small", "medium", "large". It is better to use the Fibonacci sequence and assign a number of points to each task: 1, 2, 3, 5, 8, 13, 21. Of course, nothing will be exactly five, or eight, or thirteen, but by applying these numbers you can collect opinions on the scale of a problem while everyone uses roughly the same abstract measurement system, so the team will quickly reach agreement. It is also possible to measure a task in units of time: hours, days, weeks. Check out my post about [how to estimate time properly](https://luminousmen.com/post/how-to-estimate-time). It turns out that a collective task assessment gives much more accurate results than an individual one and may raise important questions for the Product Owner.
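As an illustration (this little helper is my own invention, not part of Scrum), snapping a raw gut-feel estimate to the nearest value on the scale could look like:

```python
# the Fibonacci-like story point scale from the text
SCALE = [1, 2, 3, 5, 8, 13, 21]

def to_story_points(raw_guess):
    # pick the scale value closest to the raw numeric guess
    return min(SCALE, key=lambda p: abs(p - raw_guess))

print(to_story_points(6))   # 5
print(to_story_points(11))  # 13
```

Forcing estimates onto a coarse scale like this is precisely what keeps the discussion about the scale of the problem rather than false precision.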
## 6. Plan your sprint
A sprint is the time interval in which the team creates at least a minimally functioning part of the product, which can then be immediately shown to the client. Each sprint is preceded by a meeting where the team, the Scrum Master, and the Product Owner plan the tasks for the sprint.
Sprints always have a fixed duration, which must be less than a month. As a rule, teams choose sprints one or two weeks long (questionable, but for me it works best). The team looks at the top of the backlog and estimates the number of tasks that can be done in the next sprint. If the team has already done a couple of sprints, it should take into account the number of points completed in the last sprint. The Scrum Master and the team should try to increase the number of points taken in each sprint.
**Sprint planning** is another opportunity for the product owner and team to make sure that everyone understands exactly how the implementation of the tasks goes into the final product. At this meeting, everyone must agree on the goal of the sprint and determine what should be done for the sprint.
The main rule of scrum — if the team agreed on a certain number of tasks that need to be completed in one sprint, then **new ones can no longer be added**.
## 7. Make workflow transparent
It's all about building the right system. It is necessary to give people the freedom, respect, and the right to do their own work on their own. Heads of organizations where it is common to keep everything secret do not want ordinary employees, leading experts, or even managers close to the top of the hierarchy to know what is happening now, what has already been done, and in what time frame. Information and knowledge are not shared, since hidden information is the only guarantee of their power (at least they think so). They pursue only their own interests, which most often do not help the product or the company.
Here we are dealing with the type of thinking responsible for the widespread phenomenon of management failure. **Transparency of all actions and processes** ensures the goal is reached as early as possible.
![Scrum board](/media/11-steps-of-scrum-3.jpg )
The most common way to achieve this is a scrum board with the columns "Todo or backlog", "In progress", and "Done". Stickers represent requirements that need to be implemented; the team moves the stickers from one column to another as they are completed.
Another way to make the work transparent is to create a [Burn down chart](https://en.wikipedia.org/wiki/Burn_down_chart). On one axis, plot the number of points that the team took into the sprint; on the other, the number of days. Every day the Scrum master counts the points for the completed tasks and reflects this on the chart. Ideally, by the end of the sprint, the plot should reach zero. Another plus of transparency is that team members can see each other's tasks and promptly help those who lag behind, so the goals set for the sprint are still achieved.
## 8. Daily meeting or daily scrum
This is the pulse of the whole Scrum process. Every day at the same time, for no more than fifteen minutes, the team and the Scrum master meet and answer three simple questions:
- What did you do yesterday to help the team finish the sprint?
- What will you do today to help the team finish the sprint?
- What obstacles or problems block you from completing your tasks?
That's all. That's the whole meeting. If it takes more than fifteen minutes, you are doing something wrong. The point of these meetings is that the whole team knows exactly which task is at which stage in the current sprint.
Will all tasks be completed on time? Can team members help each other overcome obstacles? Nobody hands out tasks: the team is independent and decides everything itself. No one writes detailed reports to management; there is no point in that.
## 9. Demo to customer
It is an open meeting where the team demonstrates what was done during the sprint. The product owner, the Scrum master, the team, and any interested parties (the customer, management representatives, potential consumers) are present at this meeting. The team must demonstrate only what meets **the definition of "Done"**: the criteria that determine when a backlog item is complete, which the team should describe and agree on in advance. In other words, only what is completely and finally ready. It can be a fully finished product or a single finished feature.
Couldn't finish everything? That means too many tasks were selected for this sprint.
## 10. Retrospective meeting
After the team has shown what it has done for the last sprint and what can be handed over to the client to receive feedback, everyone sits down and discusses the sprint. What went well? What could be done better? What was totally wrong? What improvement can the team implement immediately?
For the meeting to be effective, the Scrum master and the team need to create an atmosphere of trust and show the necessary emotional maturity. The main thing to remember is that you are not blaming anyone; you are discussing the work process. **Why did this happen? Why did we miss it? What could speed up the progress of the work?**
It is especially important that people feel like a team and take responsibility for all processes and their results. Solutions are sought by the whole team. Group members should have a certain psychological maturity, so that their discussions are aimed at solving the problem rather than at finding someone to blame. It is absolutely unacceptable for even one team member to be forced into a defensive position: everyone in the group should hear and understand each other. By the end of the meeting, the team and the Scrum master must agree on a process improvement that will be put into effect in the next sprint.
## 11. Start the next sprint immediately
The most important thing in this methodology is **customer orientation**. The customer must receive what they want, on time and at minimal cost. In contrast to the linear (waterfall) approach, where the project is planned from start to finish and the result appears somewhere "at the end of the path", this method delivers a working product in a short time at minimal cost. Of course, it does not yet have all the required characteristics, but it can already be used. Then, over the course of the project, the contractor receives feedback from the client, on the basis of which the functionality is iteratively extended and the product improved.
Scrum's main characteristic is **flexibility**. The methodology allows you to respond quickly to changes in customer requirements and adapt the product to them. There may be difficulties in implementing Scrum: first, it assumes the active participation of the customer in the project, and second, it requires well-coordinated teamwork.
## What scrum is not
**...a tool for controlling people** (yes, I've heard this from developers). Of course, that is not true; [it's more productive to work less](https://www.scruminc.com/maxwell-curve-getting-more-production/). Agile in general is about controlling risk, not each other.
**...a complex framework**. Go to [http://agilemanifesto.org/](http://agilemanifesto.org/) and read the 4 values: just four points. That's it.
**...unclear requirements**, even though requirements may change during product development. Requirements need to be clear enough to be defined as user stories or items in the Product Backlog.
**...only for software development**. Many industries may benefit from this framework.
### Additional material
- [5 False Hopes of Scrum and How to Fix Them](https://www.toptal.com/project-managers/scrum/five-scrum-myths)