How do you store all this data?
Learn More
- SQL Basics Course
- MongoDB Basics
- PostgreSQL
- MySQL
- HDFS Overview
- Cassandra
- Amazon S3 at Dropbox
- They have since moved away (only as of 2016), but it is still a great use case for Amazon S3.
- Free Graph Databases e-book from O'Reilly
- Neo4j
- Market Survey of Graph Databases/Analytics Platforms
- GraphDatabases Neo4j 2.0 examples
- Discover Graph Databases with Neo4j and PHP
In order to work with data, you often
need to store it in some medium before or
0:00
after processing it.
0:04
For this reason, data storage tools and
frameworks make up a large part of
0:05
the major tools and
frameworks in the big data ecosystem.
0:09
We'll be taking a look at the
major classes of data storage systems.
0:13
Relational databases are used to store
structured data like we just talked about.
0:17
These databases store the structured data
according to what is known as a schema.
0:21
A schema defines the way the data
is organized in the system,
0:26
also known as its structure.
0:30
Relational databases use a query
language that can access these schemas,
0:32
most typically a dialect of SQL, which
stands for structured query language.
0:37
SQL provides a standard way to query,
manipulate, and
0:42
store data in relational databases.
0:45
Check the teacher's notes if you're
looking to learn more about SQL.
0:48
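To make the idea of a schema and a SQL query concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `people` table, its columns, and the sample rows are made up for illustration and are not from the video.

```python
import sqlite3


def build_people_db():
    """Create an in-memory database with a small, illustrative schema."""
    conn = sqlite3.connect(":memory:")
    # The schema: every row in `people` has the same three columns.
    conn.execute(
        "CREATE TABLE people ("
        "id INTEGER PRIMARY KEY, "
        "name TEXT NOT NULL, "
        "city TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO people (name, city) VALUES (?, ?)",
        [("Ada", "London"), ("Grace", "New York"), ("Alan", "London")],
    )
    conn.commit()
    return conn


def names_in_city(conn, city):
    """Query the table with SQL and return matching names in order."""
    rows = conn.execute(
        "SELECT name FROM people WHERE city = ? ORDER BY name", (city,)
    ).fetchall()
    return [row[0] for row in rows]


conn = build_people_db()
print(names_in_city(conn, "London"))  # ['Ada', 'Alan']
```

The same `CREATE TABLE`, `INSERT`, and `SELECT` statements would look very similar in PostgreSQL or MySQL, though each database has its own dialect.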
Relational databases perform very well for
data that is not sparse.
0:52
Sparsity in a database is defined
by the number of blank entries.
0:57
In the case of relational databases,
the less sparse the data, the better.
1:01
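One rough way to put a number on sparsity is the fraction of cells that are blank. The sketch below assumes rows are tuples and that a blank entry is represented by `None`; both the `sparsity` helper and the sample rows are hypothetical.

```python
def sparsity(rows):
    """Return the fraction of cells that are blank (None) across all rows."""
    total_cells = sum(len(row) for row in rows)
    if total_cells == 0:
        return 0.0
    blank_cells = sum(1 for row in rows for value in row if value is None)
    return blank_cells / total_cells


# Three rows of (name, city, phone); 3 of the 9 cells are blank.
rows = [
    ("Ada", "London", None),
    ("Grace", None, None),
    ("Alan", "London", "555-0100"),
]
print(round(sparsity(rows), 2))  # 0.33
```

A value near 0 means most cells are filled in, which is the case where a fixed relational schema tends to fit well.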
They also do well with data that can
be contained on a single machine.
1:06
They become less appropriate when data
needs to be spread across many machines
1:10
and accessed in parallel.
1:14
There are databases built specifically for
1:16
these highly distributed purposes and
we'll cover those here shortly.
1:18
A few of the major relational
databases you've probably heard
1:22
of are PostgreSQL, MySQL,
MS SQL, and MariaDB.
1:27
Non-relational databases are often based
around documents, which you can think
1:31
of as a piece of data that doesn't have
a predefined schema, or structure.
1:35
Now these documents could be JSON, which
stands for JavaScript Object Notation,
1:40
XML, or just plain old text blobs.
1:45
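Here is a small sketch of what "no predefined schema" can look like in practice: two JSON documents in the same collection with different fields. The documents and the `field_names` helper are invented for illustration, using only Python's standard json module.

```python
import json

# Two documents in the same collection; neither follows a fixed schema.
documents = [
    '{"name": "Ada", "languages": ["English", "French"]}',
    '{"name": "Grace", "employer": {"name": "Navy", "rank": "Rear Admiral"}}',
]


def field_names(raw_documents):
    """Collect every top-level field name that appears in any document."""
    names = set()
    for raw in raw_documents:
        names.update(json.loads(raw).keys())
    return sorted(names)


print(field_names(documents))  # ['employer', 'languages', 'name']
```

Only `name` appears in both documents; the other fields exist in just one of them, which a document store accepts without complaint.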
Non-relational databases, or NoSQL,
perform better when the data needs
1:47
to be distributed or
shared across many machines.
1:52
It opens up the possibility for
having everything accessed in parallel
1:57
with the ability to read and
write in parallel across a cluster.
2:00
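One common way to spread data across machines is to hash each record's key and use the result to pick a machine, often called a shard. The sketch below shows that general idea only; it is an assumption for illustration, not how any particular NoSQL database does it, and `shard_for` and `partition` are hypothetical names.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(keys, num_shards):
    """Group keys by the shard they would be stored on."""
    shards = [[] for _ in range(num_shards)]
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


shards = partition(["user:1", "user:2", "user:3", "user:4"], 3)
for index, keys in enumerate(shards):
    print(index, keys)
```

Because the same key always lands on the same shard, any client can work out where to read or write without asking a central coordinator, which is what makes parallel access practical.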
One of the most popular NoSQL databases
is MongoDB, a document-based NoSQL
2:05
database that stores data in BSON,
which is a binary format.
2:10
And then clients retrieve
the results in the form of JSON.
2:14
Remember that there's often more data
than can fit on a single computer.
2:18
When you need to scale the number of
machines where you're storing your data
2:22
to potentially thousands or more, you
have to use specialized storage systems.
2:26
These systems have the ability to
scale and have been battle tested and
2:31
can now store up to petabytes of data.
2:34
A few of the most popular storage
engines for large distributed data sets
2:37
are the Hadoop Distributed File System,
or HDFS, and Cassandra.
2:41
These are used for unstructured or
structured text data.
2:47
Amazon's Simple Storage Service,
more commonly referred to as Amazon S3,
2:51
is used to store files of nearly any size.
2:55
Hadoop grew out of papers Google published on the
systems it built to index the entire web, like all of it.
2:59
Cassandra was originally built at Facebook
to power search across its inbox.
3:04
Amazon S3 is used by Dropbox and
many others for
3:09
storing files across many
regions of the world.
3:12
And last but not least,
we should discuss graph based databases.
3:15
Graph databases store data that can be
represented by nodes and edges, where
3:20
a node could be a person and an edge could
be a relationship between two nodes, like a friendship.
3:24
They help search and
walk relationships, and
3:30
find patterns in
the interconnectivity between nodes.
3:32
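Walking relationships can be sketched with a plain dictionary, treating each person as a node and each friendship as an edge. This is a toy in-memory version with made-up names, not how Neo4j or any other graph database stores data.

```python
# Each pair is an edge: a friendship between two people (nodes).
friendships = [("Ada", "Grace"), ("Grace", "Alan"), ("Alan", "Linus")]


def build_graph(edges):
    """Build an undirected adjacency map from a list of edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def friends_of_friends(graph, person):
    """Walk two edges out from a person, skipping direct friends and self."""
    direct = graph.get(person, set())
    reachable = set()
    for friend in direct:
        reachable |= graph.get(friend, set())
    return sorted(reachable - direct - {person})


graph = build_graph(friendships)
print(friends_of_friends(graph, "Ada"))  # ['Alan']
```

A graph database is designed to answer this kind of "who is two hops away" question directly, where a relational database would typically need one or more joins.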
The canonical example of a good use case
for a graph database is a social network.
3:36
It's important to keep in mind that you
don't wanna use a graph database
3:41
just for
the sake of using a graph database.
3:45
It sounds cool, but often,
the normal SQL database will do the trick.
3:47
If you do find this is a good choice for
your data, Neo4j and Dgraph
3:52
are two very popular graph databases
that are open source and widely used.
3:56
Now that we've taken a brief overview
of the domain of data storage,
4:01
let's start looking at our next domain,
computation.
4:04