Dig into the books dataset to determine the most popular book.
What is the most popular book of the 1960s?
Use pd.Timestamp to compare dates. For example:
books['publication_date'] > pd.Timestamp(1960,1,1)
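The hint above checks a lower bound only. As a minimal sketch, two such comparisons can be combined to isolate a date range. The sample rows are made up, and the `publication_date` column is assumed to have been parsed into datetimes already.

```python
import pandas as pd

# Made-up sample rows; the real dataset is assumed to have a
# publication_date column that has already been parsed to datetimes.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "publication_date": pd.to_datetime(["1959-06-01", "1965-03-15", "1972-11-30"]),
})

# Each comparison yields a boolean Series; & combines them row by row.
start = pd.Timestamp(1960, 1, 1)
end = pd.Timestamp(1970, 1, 1)
in_sixties = (books["publication_date"] >= start) & (books["publication_date"] < end)

# Keep only the rows where both conditions hold.
sixties_books = books.loc[in_sixties]
print(sixties_books)
```

The parentheses around each comparison matter, because `&` binds more tightly than `>=` and `<` in Python.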
We're going to get into
analyzing book popularity.
0:00
I'm gonna start off with
a couple of questions I have and
0:03
I'll add these to my notebook
as Markdown cells.
0:06
First, what is the most popular book?
0:13
And then second, are books
0:20
with fewer pages rated higher than
0:26
those with large page counts?
0:31
Now that we've got our initial questions,
0:38
let's start digging into our
data to find the answers.
0:40
If I scroll up a bit, I can see that we
0:43
have average ratings for our books.
0:47
This shows us how users on Goodreads have
rated each book on a scale from one for
0:54
the worst to five for the best.
0:58
This one seems pretty easy: we just need to see what
the max value currently is in the rating
1:01
column.
1:06
So let's add a cell here.
1:07
From books,
we're gonna get the average rating and
1:12
we're gonna get the max value, and
we get back a five.
1:19
So the highest rating in
the database is a five.
1:26
Out of curiosity, let's do a quick min.
1:30
And unsurprisingly it's a zero.
1:34
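A minimal sketch of the max and min check described here. The column name `average_rating` and the sample rows are assumptions for illustration, not the actual Goodreads data.

```python
import pandas as pd

# Made-up sample rows; the column name average_rating is assumed.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "average_rating": [5.0, 4.57, 4.78, 0.0],
})

# Selecting one column gives a Series, which has max() and min().
highest = books["average_rating"].max()
lowest = books["average_rating"].min()
print(highest, lowest)
```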
So we need to see what the five
star rated book or books are.
1:37
So let's change this up.
1:44
Let's do books.loc, L-O-C,
1:46
where we're looking for the books,
1:51
average rating is equal to a 5.0.
1:56
Looks like we get a few books
with five star ratings.
2:03
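The `.loc` lookup described here might look like the following sketch. The sample rows and the column names `average_rating` and `ratings_count` are assumed for illustration.

```python
import pandas as pd

# Made-up sample rows; column names are assumed.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "average_rating": [5.0, 4.57, 5.0, 3.9],
    "ratings_count": [0, 2000000, 3, 150],
})

# A boolean mask inside .loc keeps only the rows where the condition holds.
five_star = books.loc[books["average_rating"] == 5.0]
print(five_star)
```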
You may think this question is complete,
but
2:10
on further look at the data I see that
there's also a ratings count column.
2:13
This says how many people
have rated a book.
2:20
This is important to add to our
analysis because what if one person
2:23
rates a book as a 5, but
5,000 people rated another book and
2:28
it's at a 4.7,
which is actually more popular?
2:33
The first book here has zero ratings,
so is it really a popular book?
2:38
I think we should go solely based on
the number of reviews to show how many people
2:44
at least read a book, and then use
the ratings as a secondary ranking.
2:49
In my head this makes more sense.
2:54
If a book is popular,
it's probably going to have many reviews.
2:56
And then if it's a good book,
it should have a high rating.
3:00
Let's make a note here, so
we don't lose our thoughts.
3:04
Great, now let's fix our code.
3:28
Let's sort to see the books
with a high number of ratings.
3:30
I think this is a good place to start.
3:34
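The code typed at this point is not captured in the transcript. One way to do the sort described, assuming a `ratings_count` column and using made-up rows, is `sort_values` in descending order:

```python
import pandas as pd

# Made-up sample rows; column names are assumed.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "average_rating": [5.0, 4.57, 4.78, 3.9],
    "ratings_count": [0, 2000000, 30000, 150],
})

# ascending=False puts the largest ratings_count first.
by_count = books.sort_values(by="ratings_count", ascending=False)
print(by_count)
```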
Let's look at our data now.
3:49
It looks like we have some
Harry Potter books and
3:51
looks like some His Dark Materials,
and quite a few others.
3:54
I think we can agree that these are names
you may recognize more compared to our
4:02
previous results.
4:06
So I think we're getting somewhere.
4:08
I think there's another
layer needed here though.
4:10
Our top book has a rating of 4.57.
4:13
But there may be others that
are rated higher than the one that
4:16
we've currently found like
this one that's rated 4.78.
4:20
I think we need to specify
a rating conditional as in
4:25
a rating should be above a 4.0,
or maybe even a 4.5.
4:29
Let's try both to see what we get.
4:35
I'm gonna save this as a variable.
4:37
And then, where popularity,
4:43
average rating, is greater than 4.0.
4:53
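A sketch of this step, assuming the saved variable is named `popularity` and holds the books sorted by `ratings_count`. The rows are made up and the column names are assumed.

```python
import pandas as pd

# Made-up sample rows; column names are assumed.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "average_rating": [5.0, 4.57, 4.78, 3.9],
    "ratings_count": [0, 2000000, 30000, 150],
})

# Save the sorted frame, then keep only rows rated above 4.0.
popularity = books.sort_values(by="ratings_count", ascending=False)
highly_rated = popularity.loc[popularity["average_rating"] > 4.0]
print(highly_rated)
```

Filtering keeps the existing row order, so the result is still ordered by `ratings_count`, and a higher-rated book can sit lower in the list.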
So our results look very similar,
5:02
we still have a book lower
down that's rated higher.
5:05
So let's also sort our values by the
average rating just to make sure we end up
5:10
with the highest rated books at the top.
5:15
I'm gonna set our same variable
5:18
equal to our new filter.
5:23
And then I'm gonna do popularity.sort_values
5:29
by average rating,
5:36
ascending equals False.
5:44
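The two spoken steps, reassigning the filter and then sorting by rating, could be sketched as follows. The transcript does not capture every line typed, so this is an approximation with made-up rows and assumed column names.

```python
import pandas as pd

# Made-up sample rows; column names are assumed.
books = pd.DataFrame({
    "title": ["Book B", "Book C", "Book D", "Book E"],
    "average_rating": [4.57, 4.78, 3.9, 4.82],
    "ratings_count": [2000000, 30000, 150, 32000],
})

popularity = books.sort_values(by="ratings_count", ascending=False)

# Reassign so the variable now holds the filtered rows.
popularity = popularity.loc[popularity["average_rating"] > 4.0]

# Re-sort so the highest average_rating comes first.
ranked = popularity.sort_values(by="average_rating", ascending=False)
print(ranked)
```

Sorting on `average_rating` alone replaces the earlier `ratings_count` ordering. On data that includes highly rated books with very few ratings, those would rise to the top, so a minimum `ratings_count` filter is worth considering as well.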
Now with all that put together,
I have The Complete Calvin and
5:49
Hobbes by Bill Watterson, which has
an average rating of 4.82 and
5:54
has over 32,000 ratings.
5:58
I think we can call that a popular book.
6:01
And just to make it super clear
I'm gonna add a slice here
6:04
at the end just to get one result.
6:08
There we go.
6:13
That way, we just don't have
a whole bunch of rows there.
6:14
We only need the first one.
6:16
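Adding a slice at the end to keep a single row might look like this sketch, which again uses made-up rows and assumed column names.

```python
import pandas as pd

# Made-up sample rows; column names are assumed.
popularity = pd.DataFrame({
    "title": ["Book B", "Book C", "Book E"],
    "average_rating": [4.57, 4.78, 4.82],
    "ratings_count": [2000000, 30000, 32000],
})

# [:1] slices rows by position, keeping only the first row after sorting.
top_book = popularity.sort_values(by="average_rating", ascending=False)[:1]
print(top_book)
```

Calling `.head(1)` instead of the `[:1]` slice returns the same single row.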
On to the next question.
6:17
Are books with fewer pages rated higher
than those with large page counts?
6:21
This one is a comparison to see if there's
a correlation between the number of pages
6:26
and a book's rating.
6:30
We can filter again to see
the books with low page numbers.
6:32
So let's do books where the books
6:36
num_pages is, let's say, less than 300.
6:41
We also need to organize the books by
rating count to make sure we're getting
6:50
books that have a good amount of
ratings to support their score.
6:55
So I'm gonna set this as few pages.
6:59
And let's do few pages where
7:05
few pages, ratings count.
7:10
And we'll do the same as we've been
doing before, greater than 1,000, okay?
7:16
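The two filters described so far, page count under 300 and more than 1,000 ratings, can be sketched like this. The variable name `few_pages`, the column names, and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Made-up sample rows; column names are assumed.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "average_rating": [4.76, 4.82, 5.0, 3.8],
    "ratings_count": [23000, 32000, 5, 4000],
    "num_pages": [176, 1456, 120, 250],
})

# First filter: books with fewer than 300 pages.
few_pages = books.loc[books["num_pages"] < 300]

# Second filter: of those, keep books with more than 1,000 ratings.
few_pages = few_pages.loc[few_pages["ratings_count"] > 1000]
print(few_pages)
```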
And then lastly, let's make sure
we sort by the average ratings so
7:24
we can see the best one of the bunch.
7:28
So I'm gonna set this equal
to the variable again.
7:32
So it now contains both of our filters and
7:35
then we can do few_pages.sort_values
7:42
by the average rating.
7:47
We want ascending equals false.
7:52
And it looks like we got It's
a Magical World which is Calvin and
7:58
Hobbes number 11.
8:03
Rated a 4.76.
8:04
It has 176 pages and 23,000 ratings.
8:06
Same thing here I'm gonna add a slice so
8:08
we just get the first one.
8:14
Just to clear up our notebook a bit.
8:19
Cool, now we need to do the opposite.
8:22
We can do this by
modifying the first line.
8:24
So, I'm just gonna copy all of this,
I'm gonna paste it and then,
8:27
just to be clear with our variable names,
I'm gonna switch this to be many.
8:33
And I think that's all with that, cool.
8:52
And if I run it, oops, we got the same
one. I forgot to change this.
8:54
[LAUGH] This will probably be helpful.
9:01
So we had less than 300 for few pages.
9:03
Let's do greater than 300 for many pages,
and we run it, and it's Calvin and Hobbes again.
9:06
It's our same popular book
that we got previously.
9:12
With an average rating of 4.82,
so between our two books,
9:16
we don't have much of a difference
in the overall rating.
9:20
Between 4.76 and 4.82,
it's, what, 0.06 between the two.
9:24
That's not a lot.
9:29
Let's add a note in here.
9:33
There isn't a large difference
between the book ratings.
9:37
Only 0.06 between the top
9:49
in each category.
9:56
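The 0.06 figure can be checked with a quick calculation using the two ratings quoted in the transcript. The variable names are made up for this sketch.

```python
# Ratings quoted in the transcript for the top book in each category.
few_pages_top = 4.76
many_pages_top = 4.82

# Round to two decimals, since floating-point subtraction is not exact.
difference = round(many_pages_top - few_pages_top, 2)
print(difference)
```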
Now while we don't really see a difference
between these two numbers, a chart might
9:59
better show if there is a correlation and
may just give us a better visualization.
10:03
We won't get into charting
in this workshop, but
10:08
it's a good thing to note in your
analysis for future improvements.
10:11
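Short of a chart, one numeric check for a linear relationship is the Pearson correlation between the two columns. This is a sketch using the pandas `Series.corr` method on made-up rows with assumed column names. It is not part of the workshop, and the number it prints says nothing about the real dataset.

```python
import pandas as pd

# Made-up sample rows; column names are assumed.
books = pd.DataFrame({
    "num_pages": [176, 1456, 120, 250, 640],
    "average_rating": [4.76, 4.82, 4.10, 3.80, 4.30],
})

# Series.corr defaults to the Pearson correlation coefficient.
page_rating_corr = books["num_pages"].corr(books["average_rating"])
print(page_rating_corr)
```

A value near 0 suggests little linear relationship, while values near 1 or -1 suggest a strong one.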
As a challenge, in
the teacher's notes below,
10:22
see if you can find the most
popular book of the 1960s.
10:27
There are some hints in the teacher's
notes to help you out.
10:35
Nice work, Pythonistas,
you've done a ton of code so far.
10:38
Keep it up.
10:42