Heads up! To view this whole video, sign in with your Courses account or enroll in your free 7-day trial. Sign In Enroll
Well done!
You have completed Combining Data for Analysis!
You have completed Combining Data for Analysis!
Preview
Welcome back! In this video, I will provide an overview of the data we will use in the workshop, and review the concept of joining data.
Types of Joins
- The Inner Join returns only matching records found in both tables.
- The Outer Join returns all records from both tables (aka Full Outer Join).
- The Left Join returns all records from the left table and the matching records from the right table (aka Left Outer Join).
- The Right Join returns all records from the right table and the matching records from the left table (aka Right Outer Join).
Related Discussions
Have questions about this video? Start a discussion with the community and Treehouse staff.
Sign upRelated Discussions
Have questions about this video? Start a discussion with the community and Treehouse staff.
Sign up
Welcome back.
0:00
In this video, I will provide an overview
of the data we will use in the workshop,
0:02
and review the concept of joining data.
0:06
If you're familiar with SQL or
relational algebra,
0:09
you're probably familiar with the term "join".
0:12
You may have seen the Venn diagram
with the two interlocking shapes, and
0:15
we'll revisit that idea.
0:19
There are four types of joins.
0:21
So, let's look at how each join
type yields a different result.
0:23
For this example, I have two tables.
0:27
The first table is the Top 10 songs
from the Billboard 100 Chart,
0:30
released on July 28, 2018.
0:35
It contains the Billboard rank
along with the song information.
0:37
The second table contains the 12 songs
with the most streams on the Spotify music
0:42
platform for the same week.
0:46
This table contains the song info and
the number of streams for that week.
0:49
Comparing the name column in both tables,
there are six songs present in both,
0:54
In My Feelings by Drake,
I Like It by Cardi B,
1:02
Girls Like You by Maroon 5,
Nice for What, also by Drake,
1:06
Lucid Dreams by Juice WRLD,
Better Now by Post Malone.
1:11
There are additional songs in each table
that are not present in the other table.
1:16
Each join option handles those
additional songs differently.
1:21
The inner join returns only matching
records found in both tables.
1:26
The matching records will display as
a combination from the two tables as
1:31
a single record.
1:35
For rows for which there is no
match between the two tables,
1:37
the records are discarded.
1:40
The inner join results in
the smallest combined dataset.
1:43
For this example, the combined table will
contain the six records found in both
1:47
tables, listing the Billboard rank,
song info, and number of Spotify streams.
1:52
The other songs from both
tables are discarded.
1:58
The outer join, or full outer join,
returns all records from both tables.
2:02
The matching records will display as
a combination from the two tables,
2:08
as a single record.
2:12
The number of rows in the resulting table,
will be the sum of the rows from the two
2:14
tables minus the number
of matching records.
2:18
In this case, 10 rows from the left table,
plus 12 rows from the right table,
2:22
minus 6 matching records,
equals 16 rows in the combined table.
2:27
The combined table will include
the Billboard rank, song info, and
2:33
number of Spotify streams in each row.
2:37
We have the Billboard rank and
Spotify streams for six songs.
2:41
However, we don't have the number of
Spotify streams for four songs for
2:46
the Billboard table, and
2:50
we don't have the Billboard rank of
six songs from the Spotify table.
2:51
For the missing numbers,
we show them as null or NAN, Not a Number.
2:56
In an outer join,
we preserve all data from both tables.
3:02
We don't lose any information
with a full outer join.
3:06
This results in the largest
combined dataset.
3:10
The left join, or left outer join, returns
all records from the left table and
3:13
the matching records from the right
table if any matching records exist.
3:18
The resulting table will contain the same
number of rows as the left table, and
3:24
contain the same columns.
3:28
Additional columns found in the right
table will be added, as well.
3:30
For this example, the combined
table will contain ten records,
3:34
listing the Billboard rank song info,
and number of Spotify streams.
3:38
We have both Billboard rank and
Spotify streams for six songs.
3:44
However, we don't have the number
of Spotify streams for
3:48
four songs from the Billboard table.
3:51
For those values, we usually show these
missing values as null or Not a Number.
3:54
In the left join, we preserve all
data from the Billboard table and
3:59
add relevant data from the Spotify table.
4:03
The other songs from the Spotify
table are discarded.
4:06
The right join for right outer join,
returns all records from the right table
4:10
and the matching records from the left
table, if any matching records exists.
4:14
The resulting table will contain the same
number of rows as the right table and
4:20
contain the same columns.
4:23
Additional columns found in the left
table will be added, as well.
4:26
For rows for
which there is no match on the right side,
4:30
the resulting table will contain
a null value or Not a Number.
4:33
For this example, that combined
table will contain 12 records,
4:37
listing the Billboard rank, song info, and
number of Spotify streams.
4:41
We have both Billboard rank and
Spotify streams for six songs.
4:46
However, we don't have the Billboard rank
for six songs from the Spotify table.
4:50
For those values, we usually show these
missing values as null or Not a Number.
4:56
In the right join, we preserve all
data from the Spotify table, and
5:01
add relevant data from
the Billboard table.
5:05
The other songs from the Billboard
table are discarded.
5:08
We have much more data from Billboard and
Spotify for our analysis.
5:12
In fact, our data sets include
song information on the weekly
5:16
Billboard 100 and
Spotify 200 charts for TWO WHOLE YEARS.
5:20
That's tens of thousands of records for
us to analyze.
5:25
Here's our mission.
5:29
For every song on the Billboard Chart,
5:30
we wanna know how many Spotify
streams are received for that week.
5:32
Billboard will be our left dataset,
and Spotify will be our right dataset.
5:36
Can you guess which type
of join we want to perform?
5:40
If you said left join, or
left outer join, you would be correct.
5:44
For our purposes, if a song was not on the
Billboard Chart that week, we don't need
5:49
to know how many times it was streamed,
so that data will be discarded.
5:54
Here's another scenario.
5:59
What if we have another table that
contains the next five songs on
6:01
the Billboard Chart for our chosen week?
6:04
We could append these new
songs to the original list.
6:07
And now we have a new table with the top
15 songs on the Billboard Chart.
6:10
We will learn how to combine
data in this fashion, as well.
6:16
Like I said,
Pandas is a very powerful tool.
6:19
I can't wait to get started.
6:23
In the next video,
we will create our datasets and
6:25
learn how to merge them
within the Pandas tool.
6:27
See you there.
6:30
You need to sign up for Treehouse in order to download course files.
Sign upYou need to sign up for Treehouse in order to set up Workspace
Sign up