Introduction to PySpark join types

#spark #data-engineering #database

This article is a cheat sheet that visualizes the different join types: all of them listed in one place, with examples and without stupid circles. Aaah, circles!

I am tired of these explanations of joins with intersections of sets and circles. They seem both clear and understandable, but in fact they are inaccurate at best and often plain wrong. Let's see why and go over a couple of nuances of joins.

Firstly, a DataFrame/table is not a set. In a set, all elements are unique; there are no duplicates. Tables, in general, can contain duplicate rows. Secondly, the term "intersection" is confusing to people.

Let's describe the data we will be working with:

heroes_data = [
    ('Deadpool', 3), 
    ('Iron man', 1),
    ('Groot', 7),
]
race_data = [
    ('Kryptonian', 5), 
    ('Mutant', 3), 
    ('Human', 1), 
]
heroes = spark.createDataFrame(heroes_data, ['name', 'id'])
races = spark.createDataFrame(race_data, ['race', 'id'])

Two DataFrames of superheroes and their races. They're connected through the id column.

Cross join

As the saying goes, the cross-product of big data and big data is an out-of-memory exception. [Holden’s "High-Performance Spark"]

Let's start with the cross-join.

This join simply combines each row of the first table with each row of the second table. If we have m rows in one table and n rows in the other, the resulting table has m*n rows. So a small table of 1000 customers combined with a product table of 1000 records produces 1,000,000 records! Try to avoid this with large tables in production.
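To make the m*n growth concrete, here is a minimal plain-Python sketch (not Spark code) that models each table as a list of tuples. The `cross` and `pair_count` names are only illustrative.

```python
from itertools import product

heroes = [("Deadpool", 3), ("Iron man", 1), ("Groot", 7)]
races = [("Kryptonian", 5), ("Mutant", 3), ("Human", 1)]

# A cross join pairs every left row with every right row, with no condition.
cross = [hero + race for hero, race in product(heroes, races)]
print(len(cross))  # 9, i.e. 3 * 3 rows

# Count the pairs lazily instead of building a million-element list.
pair_count = sum(1 for _ in product(range(1000), range(1000)))
print(pair_count)  # 1000000
```

The row count is always the product of the two input sizes, which is why this join gets out of hand so quickly.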

Also, to get around the AnalysisException raised when running a query with an implicit cross join, we would have to set spark.sql.crossJoin.enabled to true in our Spark session builder, but please do not do that. (This describes Spark 2.x; since Spark 3.0 the setting is true by default.) For this type of join, it is better to use the dedicated Spark API method crossJoin. That way you explicitly tell Spark to use this dangerous operation in one particular place, knowing what you are doing. And with the setting left at false you stay safe in the other case, where you accidentally forget the on argument (which Spark treats as a cross join): Spark will stop you with the exception.

>>> heroes.show()
+--------+---+
|    name| id|
+--------+---+
|Deadpool|  3|
|Iron man|  1|
|   Groot|  7|
+--------+---+
>>> races.show()
+----------+---+
|      race| id|
+----------+---+
|Kryptonian|  5|
|    Mutant|  3|
|     Human|  1|
+----------+---+
>>> heroes.crossJoin(races).show()
+--------+---+----------+---+  
|    name| id|      race| id|
+--------+---+----------+---+
|Deadpool|  3|Kryptonian|  5|
|Deadpool|  3|    Mutant|  3|
|Deadpool|  3|     Human|  1|
|Iron man|  1|Kryptonian|  5|
|Iron man|  1|    Mutant|  3|
|Iron man|  1|     Human|  1|
|   Groot|  7|Kryptonian|  5|
|   Groot|  7|    Mutant|  3|
|   Groot|  7|     Human|  1|
+--------+---+----------+---+

Inner join

This is the default join type in Spark. The inner join essentially removes anything that is not common to both tables. It returns all rows that have a match under the join condition (the predicate in the on argument) in both tables. This means that if one of the tables is empty, the result will also be empty.

>>> heroes.show()
+--------+---+
|    name| id|
+--------+---+
|Deadpool|  3|
|Iron man|  1|
|   Groot|  7|
+--------+---+
>>> races.show()
+----------+---+
|      race| id|
+----------+---+
|Kryptonian|  5|
|    Mutant|  3|
|     Human|  1|
+----------+---+
>>> heroes.join(races, on='id', how='inner').show()
+---+--------+------+ 
| id|    name|  race|
+---+--------+------+
|  1|Iron man| Human|
|  3|Deadpool|Mutant|
+---+--------+------+

In the example above, only Iron man and Deadpool have entries in both tables, so the inner join returns only these rows.

But if there are duplicates in the source data, those duplicates will show up in the result as well:

>>> heroes.show()
+--------+---+
|    name| id|
+--------+---+
|Deadpool|  3|
|Iron man|  1|
|Iron man|  1|
|   Groot|  7|
+--------+---+
>>> races.show()
+----------+---+
|      race| id|
+----------+---+
|Kryptonian|  5|
|    Mutant|  3|
|     Human|  1|
+----------+---+
>>> heroes.join(races, on='id', how='inner').show()
+---+--------+------+
| id|    name|  race|
+---+--------+------+
|  1|Iron man| Human|
|  1|Iron man| Human|
|  3|Deadpool|Mutant|
+---+--------+------+

Yes, the circles don't show that.
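The effect is even stronger when both sides contain duplicates, because every matching pair produces a row. The following is a plain-Python sketch of the join semantics only; Spark picks its own physical join strategy. The `inner_join` helper and the extra duplicated `Human` row are assumptions made for illustration.

```python
heroes = [("Deadpool", 3), ("Iron man", 1), ("Iron man", 1), ("Groot", 7)]
races = [("Kryptonian", 5), ("Mutant", 3), ("Human", 1), ("Human", 1)]


def inner_join(left, right):
    """Nested-loop inner join on the id stored in the second field."""
    return [
        (hero_id, name, race)
        for name, hero_id in left
        for race, race_id in right
        if hero_id == race_id
    ]


rows = inner_join(heroes, races)
for row in rows:
    print(row)
# Two 'Iron man' rows times two 'Human' rows give four result rows,
# plus one row for Deadpool: five rows in total.
```

A set intersection could never return more elements than either input, yet here four input rows with id 1 become four output rows.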

Left join / Left outer join

We looked at the inner join, which returns only those combinations of left/right table rows for which the predicate in the on expression is true. An outer join additionally lets us keep the rows of one table that have no matching rows in the other table.

In a left join, all rows of the left table are kept, regardless of whether there is a match in the right table or not. When an id match is found in the right table, the matching values are returned; otherwise the columns coming from the right table are null.

>>> heroes.show()
+--------+---+
|    name| id|
+--------+---+
|Deadpool|  3|
|Iron man|  1|
|   Groot|  7|
+--------+---+
>>> races.show()
+----------+---+
|      race| id|
+----------+---+
|Kryptonian|  5|
|    Mutant|  3|
|     Human|  1|
+----------+---+
>>> heroes.join(races, on='id', how='left').show()
+---+--------+------+
| id|    name|  race|
+---+--------+------+
|  7|   Groot|  null|
|  1|Iron man| Human|
|  3|Deadpool|Mutant|
+---+--------+------+
>>> heroes.join(races, on='id', how='leftouter').show()
+---+--------+------+
| id|    name|  race|
+---+--------+------+
|  7|   Groot|  null|
|  1|Iron man| Human|
|  3|Deadpool|Mutant|
+---+--------+------+

You can use left, leftouter or left_outer and the results are exactly the same. They are just aliases in Spark. It seems like this is a convenience for people coming from different SQL flavor backgrounds.

Right join / Right outer join

The right outer join performs the same task as the left outer join but for the right table. Thus, it returns all the rows of the right table as a result.

>>> heroes.show()
+--------+---+
|    name| id|
+--------+---+
|Deadpool|  3|
|Iron man|  1|
|   Groot|  7|
+--------+---+
>>> races.show()
+----------+---+
|      race| id|
+----------+---+
|Kryptonian|  5|
|    Mutant|  3|
|     Human|  1|
+----------+---+
>>> heroes.join(races, on='id', how='right').show()
+---+--------+----------+
| id|    name|      race|
+---+--------+----------+
|  5|    null|Kryptonian|
|  1|Iron man|     Human|
|  3|Deadpool|    Mutant|
+---+--------+----------+
>>> heroes.join(races, on='id', how='rightouter').show()
+---+--------+----------+ 
| id|    name|      race|
+---+--------+----------+
|  5|    null|Kryptonian|
|  1|Iron man|     Human|
|  3|Deadpool|    Mutant|
+---+--------+----------+

Here the right table is races, so all rows from the races table are returned. Yeah, there are no Kryptonians among our heroes, so that row gets a null name.

Also, right join and right outer join produce the same result.

Full outer join

We use a full outer join to keep records from both tables, with null values in the columns of the left/right table wherever there is no match. This is a fairly rare case, but it is typically used in situations where you do not want to lose data from either table.

>>> heroes.show()
+--------+---+
|    name| id|
+--------+---+
|Deadpool|  3|
|Iron man|  1|
|   Groot|  7|
+--------+---+
>>> races.show()
+----------+---+
|      race| id|
+----------+---+
|Kryptonian|  5|
|    Mutant|  3|
|     Human|  1|
+----------+---+
>>> heroes.join(races, on='id', how='outer').show()
+---+--------+----------+
| id|    name|      race|
+---+--------+----------+
|  7|   Groot|      null|
|  5|    null|Kryptonian|
|  1|Iron man|     Human|
|  3|Deadpool|    Mutant|
+---+--------+----------+
>>> heroes.join(races, on='id', how='full').show()
+---+--------+----------+
| id|    name|      race|
+---+--------+----------+
|  7|   Groot|      null|
|  5|    null|Kryptonian|
|  1|Iron man|     Human|
|  3|Deadpool|    Mutant|
+---+--------+----------+

A full outer join can be thought of as the matched rows of an inner join, plus the unmatched rows of the left table, plus the unmatched rows of the right table.
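That decomposition can be sketched in plain Python (a model of the semantics, not of how Spark executes the join). The variable names `matched`, `left_only` and `right_only` are illustrative, and `None` stands in for Spark's null.

```python
heroes = [("Deadpool", 3), ("Iron man", 1), ("Groot", 7)]
races = [("Kryptonian", 5), ("Mutant", 3), ("Human", 1)]

hero_ids = {hero_id for _, hero_id in heroes}
race_ids = {race_id for _, race_id in races}

# Rows with a partner on the other side (the inner join).
matched = [
    (hero_id, name, race)
    for name, hero_id in heroes
    for race, race_id in races
    if hero_id == race_id
]
# Rows without a partner, padded with None on the missing side.
left_only = [(hero_id, name, None) for name, hero_id in heroes if hero_id not in race_ids]
right_only = [(race_id, None, race) for race, race_id in races if race_id not in hero_ids]

left_join = matched + left_only
right_join = matched + right_only
full_outer = matched + left_only + right_only

for row in sorted(full_outer, key=lambda r: r[0]):
    print(row)
```

The matched rows are counted once, so the full outer join here has four rows, not the six you would get by naively stacking the left and right join results.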

Left semi-join

Only the rows of the left table that have a match in the right table, based on the condition in on, will be returned. Unlike the left join, in which the columns of the right-hand table are also present in the result, here the right-hand table's data is omitted from the output.

>>> heroes.show()
+--------+---+
|    name| id|
+--------+---+
|Deadpool|  3|
|Iron man|  1|
|   Groot|  7|
+--------+---+
>>> races.show()
+----------+---+
|      race| id|
+----------+---+
|Kryptonian|  5|
|    Mutant|  3|
|     Human|  1|
+----------+---+
>>> heroes.join(races, on='id', how='leftsemi').show()
+---+--------+
| id|    name|
+---+--------+
|  1|Iron man|
|  3|Deadpool|
+---+--------+

Left anti join

As the name suggests, it does the exact opposite of the left semi-join. It simply returns the rows of the left table that have no match in the right table. The result will only include the columns from the left table.

>>> heroes.show()
+--------+---+
|    name| id|
+--------+---+
|Deadpool|  3|
|Iron man|  1|
|   Groot|  7|
+--------+---+
>>> races.show()
+----------+---+
|      race| id|
+----------+---+
|Kryptonian|  5|
|    Mutant|  3|
|     Human|  1|
+----------+---+
>>> heroes.join(races, on='id', how='leftanti').show()
+---+-----+
| id| name|
+---+-----+
|  7|Groot|
+---+-----+
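Both the semi and the anti join behave like a membership filter on the left table, which a plain-Python sketch makes visible. This models the semantics only, and the duplicated `Human` row is an assumption added to show that a semi join, unlike an inner join, does not multiply left rows.

```python
heroes = [("Deadpool", 3), ("Iron man", 1), ("Groot", 7)]
races = [("Kryptonian", 5), ("Mutant", 3), ("Human", 1), ("Human", 1)]

# Only the existence of a match matters, so a set of ids is enough.
race_ids = {race_id for _, race_id in races}

left_semi = [(hero_id, name) for name, hero_id in heroes if hero_id in race_ids]
left_anti = [(hero_id, name) for name, hero_id in heroes if hero_id not in race_ids]

print(left_semi)  # [(3, 'Deadpool'), (1, 'Iron man')]
print(left_anti)  # [(7, 'Groot')]
```

Every left row lands in exactly one of the two results, so together they partition the left table.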

Some notes

If you join on an expression and both DataFrames have a column with the same name, the joined DataFrame contains both columns. Accessing one of these duplicate-named columns by its bare name returns an error, which basically means that Spark doesn't know which column you meant. Apply an alias to each DataFrame before the join and qualify the column with it.
In the second parameter (on), you use the & (ampersand) symbol for AND and the | (pipe) symbol for OR between column conditions.