Heads up! To view this whole video, sign in with your Courses account or enroll in your free 7-day trial. Sign In Enroll
Well done!
You have completed Combining Data for Analysis!
You have completed Combining Data for Analysis!
Preview
Welcome! In this video, we'll introduce you to the merge() function and the arguments used to successfully merge two datasets.
Data files
Definitions
- DataFrame - a two-dimensional data structure, arranged in rows and columns
Similar terms
- DataFrame ⇔ dataset
- record ⇔ row
- field ⇔ column
Load data into pandas
billboard = pd.read_csv("Billboard_100_2017-2018.csv", index_col="ID")
spotify = pd.read_csv("Spotify_200_2017-2018.csv", index_col="ID")
Create DataFrames for Ariana Grande Billboard and Spotify song data
ariana_bill = billboard[billboard["Artists"].str.contains("Ariana Grande")]
ariana_spot = spotify[spotify["Artists"].str.contains("Ariana Grande")]
Merge Ariana Billboard dataset and Ariana Spotify dataset
ariana_combined = pd.merge(ariana_bill, ariana_spot, how='left', on=['Name', 'Artists', 'BB.Week'])
Additional Resources
- Pandas API: merge() function
Related Discussions
Have questions about this video? Start a discussion with the community and Treehouse staff.
Sign upRelated Discussions
Have questions about this video? Start a discussion with the community and Treehouse staff.
Sign up
Welcome.
0:00
In this video, we will set up
our data frames in Pandas,
0:01
then merge the data from two data
frames into one combined data frame.
0:04
Download the two CSV files from the
teacher's notes to make sure you remember
0:10
where you save them.
0:13
Anaconda navigator is open on my screen.
0:15
We'll run JupyterLab.
0:17
So I click the Launch button
in the JupyterLab cell.
0:19
It'll launch a new tab
in my browser window.
0:26
This is my file tree on the left,
0:29
I'll navigate to where I
downloaded my CSV files.
0:32
I created a folder specifically for
this workshop,
0:35
Documents > anaconda > combine-data.
0:39
CSV stands for comma separated values.
0:44
A CSV file is a text file.
0:47
Each line of text is a row, and the commas
separate each line into columns.
0:50
The first line of text
contains the headings and
0:57
each additional line is a record of data.
0:59
Billboard_100_2017-2018 shows
1:03
the Billboard 100 chart
data from 2017 to 2018.
1:08
JupyterLab displays the file as
a chart in rows and columns.
1:13
Spotify_200_2017-2018 contains
the Spotify 200 data from those years.
1:17
Back on my Launcher tab,
I want to create a new notebook.
1:31
So I'll click Python3 under Notebook.
1:34
I can rename the notebook now,
right-click or
1:38
Ctrl-click the notebook name,
choose Rename,
1:41
Then enter billboard_spotify.
1:49
Let's first import the Pandas package and
check the version.
1:56
import pandas as pd.
2:01
pd.__version__.
2:09
I'll click the Run button
here in the toolbar.
2:15
The keyboard shortcut to run
the code in the cell is Shift+Enter,
2:18
I'll use them moving forward.
2:21
Pressing Enter just adds
a new line within this cell,
2:23
like I did here with input and version.
2:26
But this shows that Pandas is installed,
your version may be higher.
2:29
Then let's load the data from each
CSV file into a new data frame.
2:34
Billboard = pd.read_csv
2:41
("Billboard_100_2017-
2:48
2018.csv",
2:59
index_col="ID").
3:05
spotify =
3:13
pd.read_csv("Spotify_200_2017-2018.csv,
3:15
index_col="ID").
3:31
The index_col tells Pandas that
the ID field is the index or
3:41
unique identifier for each record.
3:44
In other words, no two rows in a data
frame should have the same ID number.
3:48
We identify this field so that Pandas
doesn't add its own index column.
3:52
Let's use a couple of Pandas tools to
get a sense of what we just imported.
3:57
First .shape will show us
the dimensions of each data frame.
4:01
Billboard.shape, spotify.shape.
4:08
This shows us there are over 10,000
records in the Billboard data set,
4:16
with 7 columns other than the index.
4:20
There are more than 20,000 rows and 5
non-index columns in the Spotify data set.
4:23
Let's get a peek at the first few rows of
each data frame using the head method.
4:30
Billboard.head().
4:37
This shows us the column headings
and the first five rows of data.
4:42
By the way, if I wanted to see eight rows,
for example, I would say head(8),
4:47
like this spotify.head(8).
4:53
But five is the default if I
don't specify an argument.
5:04
And this is fine because I really only
care about the headings right now.
5:07
Both data frames have a name field,
which is the name of the song.
5:11
Both have an artist column, which lists
the recording artist of the song.
5:15
And both have a BB.Week field which is
the release date of the weekly Billboard
5:20
100 chart.
5:24
So for our purposes we will use
the combination of name, artists, and
5:26
BB.Week to merge our data frames.
5:30
Actually, we have a lot of data here.
5:34
Let's first work with a smaller sample of
the data just to make sure we're doing
5:36
it right.
5:41
If you remember 2017, 2018, there were
a few people who dominated American music,
5:41
Drake, Ariana Grande,
Imagine Dragons, few others.
5:48
They all had a lot of
songs on these charts.
5:53
Let's choose Ariana.
5:56
I'll make a data frame of just Ariana
Grande's songs on the Billboard chart.
5:59
Ariana_bill =
billboard[billboard["Artists"],
6:07
.str.contains("Ariana Grande").
6:18
Do the same with Spotify data, ariana_spot
6:29
=
spotify[spotify["Artists"].str.contains("-
6:37
Ariana Grande")].
6:46
By the way, make sure you
capitalize the A in artists,
6:52
the leading A in Ariana, and the G in
Grande, everything else is lowercase.
6:56
Our data is very case sensitive.
7:03
Now, let's check the dimensions of each
Ariana data frame, ariana_bill.shape,
7:07
Ariana_spot.shape.
7:21
A few 100 records is much more manageable.
7:26
If we find any errors we
can check them manually.
7:28
Let's talk about the merge function.
7:32
It has two required arguments, the left
data frame and the right data frame.
7:35
By default, it performs an inner
join on the index column.
7:40
But we don't want that,
we want a left join on the name,
7:44
artists, and BB.Week columns.
7:49
So let's see what that looks like.
7:53
Ariana_combined =
7:58
pd.merge(ariana_bill, ariana_spot,
8:02
how= 'left',
8:13
on=['Name'], 'Artist',
8:19
BB.Week]).
8:30
So we called our new data
frame ariana_combined.
8:36
ariana_bill is our left data set.
8:41
ariana_spot is our right data set.
8:44
The how argument is the type of join
we wanna perform, which is left.
8:48
The on argument is a column or
columns on which to perform the join.
8:53
Since we have three columns,
we need to enter them as a Python list.
8:58
Because this is a left join, I expect for
9:02
this new data frame to have the same
number of rows as the left data frame.
9:04
Let's see, arianna_combined.
9:09
shape.
9:13
Great, and now let's look at
the head of this new data set,
9:17
ariana_combined.head.
9:23
And now I see that my new data set has all
the columns from the Billboard data frame,
9:32
and a few extra columns taken
from the Spotify data frame.
9:36
Let's display this data frame again.
9:40
But since we only really care
about certain columns,
9:42
let's specify which columns we wanna view
and in which order.
9:44
ariana_combined[['Artists', 'Name',
9:51
'BB.Week', 'Weekly.Rank',
10:00
'Streams'}].
10:07
Now we have a better
understanding of this data merge.
10:15
Each record shows Ariana Grande as
the artist, the name of the song,
10:19
the Billboard chart date for that week,
the Billboard ranking for that week, and
10:22
the number of Spotify streams for
that week.
10:26
For any missing data,
there's a value of NaN or not a number.
10:30
This means that there's no matching
record between the two data sets.
10:35
It's a bit hard to find for
10:39
Ariana Grande, most of her songs that
charted on Billboard had a lot of streams.
10:40
There has to be a few so let's check for
10:45
songs in ariana_combined where
Streams is NaN, not a number.
10:48
We'll use the isnull() function.
10:53
ariana_combined[ariana_combined
10:57
['Streams'].isnull()
11:09
For example, the song Everyday,
11:19
it was number 85 on the Billboard
chart on March 4, 2017.
11:22
But we don't know how many Spotify
streams because it was not in the top 200
11:26
streaming songs for that week.
11:30
Okay, I've added some helpful
resources in the teacher's notes,
11:34
including some other optional arguments
available for the merge function.
11:37
Now here's my challenge to you.
11:41
We merged a portion of
our data frames together,
11:44
specifically the Ariana Grande
songs in our list.
11:46
I would like you to use the same method
demonstrated in this video to combine
11:51
the full Billboard and Spotify data sets,
call your new data frame bill_spot.
11:57
In the next video I'll
show you my solution, and
12:03
then we'll learn how to concatenate
two data sets within the Pandas tool.
12:06
See you there.
12:10
You need to sign up for Treehouse in order to download course files.
Sign upYou need to sign up for Treehouse in order to set up Workspace
Sign up