Home / Introduction to Pyspark join types

Introduction to Pyspark join types

Introduction to Pyspark join types

This article is written to visualize various types of joins, a cheat sheet so that all join types are listed in the same place with examples and without stupid circles. Aaaah, circles!

I've tired of these explanations of joins with intersections of sets and circles. It seems to be both clear and understandable, but the fact it's at least not accurate and in the general case incorrect. Let's look at why and touch on a couple of subtleties of joins.

Join meme

Firstly, a dataframe/table is not a set. In the set all the elements are unique, there should be no duplicates. In the tables in general, this is actually not true. Secondly, the term "intersection" confuses people.


Let's describe the data we will be working with:

heroes_data = [
	('Deadpool', 3), 
	('Iron man', 1),
	('Groot', 7),
]
race_data = [
	('Kryptonian', 5), 
	('Mutant', 3), 
	('Human', 1), 
]
heroes = spark.createDataFrame(heroes_data, ['name', 'id'])
races = spark.createDataFrame(race_data, ['race', 'id'])

Two dataframes of superheroes and their races. They are connected via id column.

Cross join

As the saying goes, the cross product of big data and big data is an out-of-memory exception. [Holden’s "High-Performance Spark"]

We start with a cross join.

This join simply combines each row of the first table with each row of the second table. For example, we have m rows in one table, and n rows in another, this will give us m * n rows in the result table. So, imagine that a small table of 1,000 customers combined with a product table of 1,000 records will produce 1,000,000 records! Try to avoid this with large tables in the prod.

Also, to bypass this AnalysisException we have to set the spark.sql.crossJoin.enabled to true in our Spark session builder object but please don't do this. It's better to use a special API method for that type of join, crossJoin. The reason here is that you're explicitly telling Spark to use that dangerous method in a particular situation, knowing what you are doing(kind of). But you will be safe in the next case where you accidentally forgot to add on argument(for Spark that means cross join) - spark will crash telling you that.

>>> heroes.show()
+--------+---+
|    name| id|
+--------+---+
|Deadpool|  3|
|Iron man|  1|
|   Groot|  7|
+--------+---+
>>> races.show()
+----------+---+
|      race| id|
+----------+---+
|Kryptonian|  5|
|    Mutant|  3|
|     Human|  1|
+----------+---+
>>> heroes.crossJoin(races).show()
+--------+---+----------+---+  
|    name| id|      race| id|
+--------+---+----------+---+
|Deadpool|  3|Kryptonian|  5|
|Deadpool|  3|    Mutant|  3|
|Deadpool|  3|     Human|  1|
|Iron man|  1|Kryptonian|  5|
|Iron man|  1|    Mutant|  3|
|Iron man|  1|     Human|  1|
|   Groot|  7|Kryptonian|  5|
|   Groot|  7|    Mutant|  3|
|   Groot|  7|     Human|  1|
+--------+---+----------+---+

Inner join

Inner join

This is the default join type in Spark. Inner join basically removes all the things that are not common in both tables. It returns back all the data that has a match on the join condition(predicate in on argument) from both sides of the table. It means that if one of the tables is empty, then the result will also be empty.

>>> heroes.show()
+--------+---+
|    name| id|
+--------+---+
|Deadpool|  3|
|Iron man|  1|
|   Groot|  7|
+--------+---+
>>> races.show()
+----------+---+
|      race| id|
+----------+---+
|Kryptonian|  5|
|    Mutant|  3|
|     Human|  1|
+----------+---+
>>> heroes.join(races, on="id", how="inner").show()
+---+--------+------+ 
| id|    name|  race|
+---+--------+------+
|  1|Iron man| Human|
|  3|Deadpool|Mutant|
+---+--------+------+

So in the example above, only Iron man and Deadpool have entries in both tables hence inner join returns only those rows.

But if there are any duplicates in the original data these duplicates will be in the result as well:

>>> heroes.show()
+--------+---+
|    name| id|
+--------+---+
|Deadpool|  3|
|Iron man|  1|
|Iron man|  1|
|   Groot|  7|
+--------+---+
>>> races.show()
+----------+---+
|      race| id|
+----------+---+
|Kryptonian|  5|
|    Mutant|  3|
|     Human|  1|
+----------+---+
>>> heroes.join(races, on="id", how="inner").show()
+---+--------+------+
| id|    name|  race|
+---+--------+------+
|  1|Iron man| Human|
|  1|Iron man| Human|
|  3|Deadpool|Mutant|
+---+--------+------+

Yes, the circles don't show that.

Left join / Left outer join

Left join

We've looked at the inner join, which returns only those combinations of rows of the left / right table for which the predicate value in the on clause is true. The outer joins allow us to include in the result rows of one table for which no matching rows were found in another table.

In a left join, all the rows from the left table remain unchanged whether or not there is a match in the right side table. When a matching id is found in the right table, its value is returned and null otherwise.

>>> heroes.show()
+--------+---+
|    name| id|
+--------+---+
|Deadpool|  3|
|Iron man|  1|
|   Groot|  7|
+--------+---+
>>> races.show()
+----------+---+
|      race| id|
+----------+---+
|Kryptonian|  5|
|    Mutant|  3|
|     Human|  1|
+----------+---+
>>> heroes.join(races, on="id", how="left").show()
+---+--------+------+
| id|    name|  race|
+---+--------+------+
|  7|   Groot|  null|
|  1|Iron man| Human|
|  3|Deadpool|Mutant|
+---+--------+------+
>>> heroes.join(races, on="id", how="leftouter").show()
+---+--------+------+
| id|    name|  race|
+---+--------+------+
|  7|   Groot|  null|
|  1|Iron man| Human|
|  3|Deadpool|Mutant|
+---+--------+------+

You can use left or left_outer and the results are exactly the same. It is just an alias in Spark. It seems like this is a convenience for people coming from different SQL flavor backgrounds.

Right join / Right outer join

Right join

The right outer join performs the same task as the left outer join, but for the right table. That is, it returns all the rows of the right table as a result.

>>> heroes.show()
+--------+---+
|    name| id|
+--------+---+
|Deadpool|  3|
|Iron man|  1|
|   Groot|  7|
+--------+---+
>>> races.show()
+----------+---+
|      race| id|
+----------+---+
|Kryptonian|  5|
|    Mutant|  3|
|     Human|  1|
+----------+---+
>>> heroes.join(races, on="id", how="right").show()
+---+--------+----------+
| id|    name|      race|
+---+--------+----------+
|  5|    null|Kryptonian|
|  1|Iron man|     Human|
|  3|Deadpool|    Mutant|
+---+--------+----------+
>>> heroes.join(races, on="id", how="rightouter").show()
+---+--------+----------+ 
| id|    name|      race|
+---+--------+----------+
|  5|    null|Kryptonian|
|  1|Iron man|     Human|
|  3|Deadpool|    Mutant|
+---+--------+----------+

Here the right side table is the race, hence all the data from the race table is returned. Yeah, we don't see any Kryptonians in the data.

Also, the right join and right outer join yield the same output. Theoretically speaking all the things that could be achieved from the right join can be achieved by using left join but there can be few scenarios where right-join might come in handy.

Full outer join

Full outer join

We use a full outer join to keep records from both tables along with the associated null values in the respective left/right tables. It is kind of rare case but generally used in situations when you don't want to lose data from any table.

>>> heroes.show()
+--------+---+
|    name| id|
+--------+---+
|Deadpool|  3|
|Iron man|  1|
|   Groot|  7|
+--------+---+
>>> races.show()
+----------+---+
|      race| id|
+----------+---+
|Kryptonian|  5|
|    Mutant|  3|
|     Human|  1|
+----------+---+
>>> heroes.join(races, on="id", how="outer").show()
+---+--------+----------+
| id|    name|      race|
+---+--------+----------+
|  7|   Groot|      null|
|  5|    null|Kryptonian|
|  1|Iron man|     Human|
|  3|Deadpool|    Mutant|
+---+--------+----------+
>>> heroes.join(races, on="id", how="full").show()
+---+--------+----------+
| id|    name|      race|
+---+--------+----------+
|  7|   Groot|      null|
|  5|    null|Kryptonian|
|  1|Iron man|     Human|
|  3|Deadpool|    Mutant|
+---+--------+----------+

The full join can be thought of as a combination of inner join + left join + right join.

Left semi-join

Left semi-join

This returns only the data from the left side that has a match on the right side based on the condition provided for the join statement. In contrast to left join where all the rows from the right side table are also present in the result, the right side table data is omitted from the output.

>>> heroes.show()
+--------+---+
|    name| id|
+--------+---+
|Deadpool|  3|
|Iron man|  1|
|   Groot|  7|
+--------+---+
>>> races.show()
+----------+---+
|      race| id|
+----------+---+
|Kryptonian|  5|
|    Mutant|  3|
|     Human|  1|
+----------+---+
>>> heroes.join(races, on="id", how="leftsemi").show()
+---+--------+
| id|    name|
+---+--------+
|  1|Iron man|
|  3|Deadpool|
+---+--------+

Left anti join

Left anti join

As the name suggests, it does exactly the opposite of left semi-join. It just returns the data that doesn’t have a match on the right side table. Only the columns on the left side table would be included in the result.

>>> heroes.show()
+--------+---+
|    name| id|
+--------+---+
|Deadpool|  3|
|Iron man|  1|
|   Groot|  7|
+--------+---+
>>> races.show()
+----------+---+
|      race| id|
+----------+---+
|Kryptonian|  5|
|    Mutant|  3|
|     Human|  1|
+----------+---+
>>> heroes.join(races, on="id", how="leftanti").show()
+---+-----+
| id| name|
+---+-----+
|  7|Groot|
+---+-----+
  1. If you do not apply an alias to the dataframe, you’ll receive an error after you create your joined dataframe. With two columns named the same, referencing one of the duplicates named columns returns an error that essentially says it doesn’t know which one you selected.

  2. In the second parameter, you use the &(ampersand) symbol for AND the |(pipe) symbol for OR between columns.

Support author