AI can only take you so far. 🌟 Start with core skills in JavaScript, HTML, CSS, or Python. 🚀

Join the Treehouse affiliate program and earn 25% recurring commission!

New No-Code Track! 🚀start learning today!

🌟 Dreaming of a bright future? 🎓 Ask about the Treehouse Scholarship program! 🚀

✨ Earn college credits in Cybersecurity, JS, HTML, CSS and Python

Well done!

You have completed Combining Data for Analysis!

Sign up for Treehouse Back to Library

Preview

Sign up for Treehouse Continue

Merge Data with pandas.merge()

12:12 with Richard Tillies

Welcome! In this video, we'll introduce you to the merge() function and the arguments used to successfully merge two datasets.

Teacher's Notes
Questions?
Video Transcript
Downloads
Workspaces

Data files

Definitions

DataFrame - a two-dimensional data structure, arranged in rows and columns

Similar terms

DataFrame ⇔ dataset
record ⇔ row
field ⇔ column

Load data into pandas

billboard = pd.read_csv("Billboard_100_2017-2018.csv", index_col="ID")
spotify = pd.read_csv("Spotify_200_2017-2018.csv", index_col="ID")

Create DataFrames for Ariana Grande Billboard and Spotify song data

ariana_bill = billboard[billboard["Artists"].str.contains("Ariana Grande")]
ariana_spot = spotify[spotify["Artists"].str.contains("Ariana Grande")]

Merge Ariana Billboard dataset and Ariana Spotify dataset

ariana_combined = pd.merge(ariana_bill, ariana_spot, how='left', on=['Name', 'Artists', 'BB.Week'])

Additional Resources

Pandas API: merge() function

Related Discussions

Have questions about this video? Start a discussion with the community and Treehouse staff.

Sign up

Related Discussions

Have questions about this video? Start a discussion with the community and Treehouse staff.

Sign up

Welcome. 0:00

In this video, we will set up our data frames in Pandas, 0:01

then merge the data from two data frames into one combined data frame. 0:04

Download the two CSV files from the teacher's notes to make sure you remember 0:10

where you save them. 0:13

Anaconda navigator is open on my screen. 0:15

We'll run JupyterLab. 0:17

So I click the Launch button in the JupyterLab cell. 0:19

It'll launch a new tab in my browser window. 0:26

This is my file tree on the left, 0:29

I'll navigate to where I downloaded my CSV files. 0:32

I created a folder specifically for this workshop, 0:35

Documents > anaconda > combine-data. 0:39

CSV stands for comma separated values. 0:44

A CSV file is a text file. 0:47

Each line of text is a row, and the commas separate each line into columns. 0:50

The first line of text contains the headings and 0:57

each additional line is a record of data. 0:59

Billboard_100_2017-2018 shows 1:03

the Billboard 100 chart data from 2017 to 2018. 1:08

JupyterLab displays the file as a chart in rows and columns. 1:13

Spotify_200_2017-2018 contains the Spotify 200 data from those years. 1:17

Back on my Launcher tab, I want to create a new notebook. 1:31

So I'll click Python3 under Notebook. 1:34

I can rename the notebook now, right-click or 1:38

Ctrl-click the notebook name, choose Rename, 1:41

Then enter billboard_spotify. 1:49

Let's first import the Pandas package and check the version. 1:56

import pandas as pd. 2:01

pd.__version__. 2:09

I'll click the Run button here in the toolbar. 2:15

The keyboard shortcut to run the code in the cell is Shift+Enter, 2:18

I'll use them moving forward. 2:21

Pressing Enter just adds a new line within this cell, 2:23

like I did here with input and version. 2:26

But this shows that Pandas is installed, your version may be higher. 2:29

Then let's load the data from each CSV file into a new data frame. 2:34

Billboard = pd.read_csv 2:41

("Billboard_100_2017- 2:48

2018.csv", 2:59

index_col="ID"). 3:05

spotify = 3:13

pd.read_csv("Spotify_200_2017-2018.csv, 3:15

index_col="ID"). 3:31

The index_col tells Pandas that the ID field is the index or 3:41

unique identifier for each record. 3:44

In other words, no two rows in a data frame should have the same ID number. 3:48

We identify this field so that Pandas doesn't add its own index column. 3:52

Let's use a couple of Pandas tools to get a sense of what we just imported. 3:57

First .shape will show us the dimensions of each data frame. 4:01

Billboard.shape, spotify.shape. 4:08

This shows us there are over 10,000 records in the Billboard data set, 4:16

with 7 columns other than the index. 4:20

There are more than 20,000 rows and 5 non-index columns in the Spotify data set. 4:23

Let's get a peek at the first few rows of each data frame using the head method. 4:30

Billboard.head(). 4:37

This shows us the column headings and the first five rows of data. 4:42

By the way, if I wanted to see eight rows, for example, I would say head(8), 4:47

like this spotify.head(8). 4:53

But five is the default if I don't specify an argument. 5:04

And this is fine because I really only care about the headings right now. 5:07

Both data frames have a name field, which is the name of the song. 5:11

Both have an artist column, which lists the recording artist of the song. 5:15

And both have a BB.Week field which is the release date of the weekly Billboard 5:20

100 chart. 5:24

So for our purposes we will use the combination of name, artists, and 5:26

BB.Week to merge our data frames. 5:30

Actually, we have a lot of data here. 5:34

Let's first work with a smaller sample of the data just to make sure we're doing 5:36

it right. 5:41

If you remember 2017, 2018, there were a few people who dominated American music, 5:41

Drake, Ariana Grande, Imagine Dragons, few others. 5:48

They all had a lot of songs on these charts. 5:53

Let's choose Ariana. 5:56

I'll make a data frame of just Ariana Grande's songs on the Billboard chart. 5:59

Ariana_bill = billboard[billboard["Artists"], 6:07

.str.contains("Ariana Grande"). 6:18

Do the same with Spotify data, ariana_spot 6:29

= spotify[spotify["Artists"].str.contains("- 6:37

Ariana Grande")]. 6:46

By the way, make sure you capitalize the A in artists, 6:52

the leading A in Ariana, and the G in Grande, everything else is lowercase. 6:56

Our data is very case sensitive. 7:03

Now, let's check the dimensions of each Ariana data frame, ariana_bill.shape, 7:07

Ariana_spot.shape. 7:21

A few 100 records is much more manageable. 7:26

If we find any errors we can check them manually. 7:28

Let's talk about the merge function. 7:32

It has two required arguments, the left data frame and the right data frame. 7:35

By default, it performs an inner join on the index column. 7:40

But we don't want that, we want a left join on the name, 7:44

artists, and BB.Week columns. 7:49

So let's see what that looks like. 7:53

Ariana_combined = 7:58

pd.merge(ariana_bill, ariana_spot, 8:02

how= 'left', 8:13

on=['Name'], 'Artist', 8:19

BB.Week]). 8:30

So we called our new data frame ariana_combined. 8:36

ariana_bill is our left data set. 8:41

ariana_spot is our right data set. 8:44

The how argument is the type of join we wanna perform, which is left. 8:48

The on argument is a column or columns on which to perform the join. 8:53

Since we have three columns, we need to enter them as a Python list. 8:58

Because this is a left join, I expect for 9:02

this new data frame to have the same number of rows as the left data frame. 9:04

Let's see, arianna_combined. 9:09

shape. 9:13

Great, and now let's look at the head of this new data set, 9:17

ariana_combined.head. 9:23

And now I see that my new data set has all the columns from the Billboard data frame, 9:32

and a few extra columns taken from the Spotify data frame. 9:36

Let's display this data frame again. 9:40

But since we only really care about certain columns, 9:42

let's specify which columns we wanna view and in which order. 9:44

ariana_combined[['Artists', 'Name', 9:51

'BB.Week', 'Weekly.Rank', 10:00

'Streams'}]. 10:07

Now we have a better understanding of this data merge. 10:15

Each record shows Ariana Grande as the artist, the name of the song, 10:19

the Billboard chart date for that week, the Billboard ranking for that week, and 10:22

the number of Spotify streams for that week. 10:26

For any missing data, there's a value of NaN or not a number. 10:30

This means that there's no matching record between the two data sets. 10:35

It's a bit hard to find for 10:39

Ariana Grande, most of her songs that charted on Billboard had a lot of streams. 10:40

There has to be a few so let's check for 10:45

songs in ariana_combined where Streams is NaN, not a number. 10:48

We'll use the isnull() function. 10:53

ariana_combined[ariana_combined 10:57

['Streams'].isnull() 11:09

For example, the song Everyday, 11:19

it was number 85 on the Billboard chart on March 4, 2017. 11:22

But we don't know how many Spotify streams because it was not in the top 200 11:26

streaming songs for that week. 11:30

Okay, I've added some helpful resources in the teacher's notes, 11:34

including some other optional arguments available for the merge function. 11:37

Now here's my challenge to you. 11:41

We merged a portion of our data frames together, 11:44

specifically the Ariana Grande songs in our list. 11:46

I would like you to use the same method demonstrated in this video to combine 11:51

the full Billboard and Spotify data sets, call your new data frame bill_spot. 11:57

In the next video I'll show you my solution, and 12:03

then we'll learn how to concatenate two data sets within the Pandas tool. 12:06

See you there. 12:10

You need to sign up for Treehouse in order to download course files.

Sign up

You need to sign up for Treehouse in order to set up Workspace

Sign up