Big data is a blanket term for any collection of data sets so large or complex that
they become difficult to process using traditional data management techniques
such as relational database management systems (RDBMS).
You can think of the relationship between big data and data science as being like
the relationship between crude oil and an oil refinery. Data science and big data
evolved from statistics and traditional data management but are now considered to
be distinct disciplines.
Benefits and uses of data science and big data
Data science and big data are used almost everywhere in both commercial and noncommercial settings. The number of use cases is vast.
Commercial companies in almost every industry use data science and big data to
gain insights into their customers, processes, staff, competition, and products. Many
companies use data science to offer customers a better user experience, as well as to
cross-sell, up-sell, and personalize their offerings. A good example of this is Google
AdSense, which collects data from internet users so relevant commercial messages can
be matched to the person browsing the internet.
Governmental organizations are also aware of data’s value. Many governmental
organizations not only rely on internal data scientists to discover valuable information, but also share their data with the public. You can use this data to gain insights or
build data-driven applications. Data.gov is but one example; it’s the home of the US
Government’s open data. A data scientist in a governmental organization gets to work
on diverse projects such as detecting fraud and other criminal activity.
Nongovernmental organizations (NGOs) are also no strangers to using data. They
use it to raise money and defend their causes. The World Wildlife Fund (WWF), for
instance, employs data scientists to increase the effectiveness of their fundraising
efforts. Many data scientists devote part of their time to helping NGOs, because NGOs
often lack the resources to collect data and employ data scientists. DataKind is one
such data scientist group that devotes its time to the benefit of mankind.
The big data ecosystem and data science
Currently many big data tools and frameworks exist, and it’s easy to get lost because
new technologies appear rapidly. It’s much easier once you realize that the big data
ecosystem can be grouped into technologies that have similar goals and functionalities, which we’ll discuss in this section. Data scientists use many different technologies, but not all of them; we’ll dedicate a separate chapter to the most important data
science technology classes. The mind map in figure 1.6 shows the components of the
big data ecosystem and where the different technologies belong.
Let’s look at the different groups of tools in this diagram and see what each does.
We’ll start with distributed file systems.
Distributed programming framework
Once you have the data stored on the distributed file system, you want to exploit it.
One important aspect of working on a distributed hard disk is that you won’t move
your data to your program, but rather you’ll move your program to the data. When
you start from scratch with a normal general-purpose programming language such as
C, Python, or Java, you need to deal with the complexities that come with distributed
programming, such as restarting jobs that have failed, tracking the results from the
different subprocesses, and so on. Luckily, the open source community has developed
many frameworks to handle this for you, and these give you a much better experience
working with distributed data and dealing with many of the challenges it carries.
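As a rough illustration of what such a framework takes off your hands, the sketch below runs a small task against each partition of the data, retries a task that fails, and merges the partial results. It is a single-machine simulation, not a real distributed framework: the `PARTITIONS` data and the `count_words`, `run_with_retry`, and `word_count` names are hypothetical, and threads stand in for the nodes that would each hold one block of data.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for a block stored on a different node.
PARTITIONS = [
    ["big data", "data science"],
    ["data science and big data"],
    ["science"],
]


def count_words(lines):
    """Map step: runs against one partition and returns only a small summary."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def run_with_retry(task, partition, max_attempts=3):
    """Re-run a failed task; this is the bookkeeping a framework does for you."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(partition)
        except Exception:
            if attempt == max_attempts:
                raise


def word_count(partitions, task=count_words):
    """Send the task to every partition, then merge the partial results (reduce step)."""
    total = Counter()
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda p: run_with_retry(task, p), partitions)
        for partial in partials:
            total.update(partial)
    return total


totals = word_count(PARTITIONS)
print(totals.most_common(2))
```

Only the small per-partition summaries travel back to the caller, which is the point of moving the program to the data rather than the data to the program.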
Machine learning frameworks
When you have the data in place, it’s time to extract the coveted insights. This is where
you rely on the fields of machine learning, statistics, and applied mathematics. Before
World War II everything needed to be calculated by hand, which severely limited
the possibilities of data analysis. After World War II, computers and scientific computing were developed. A single computer could do all the counting and calculations, and a world of opportunities opened. Ever since this breakthrough, people only
need to derive the mathematical formulas, write them in an algorithm, and load
their data.
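The three steps (derive a formula, write it as an algorithm, load the data) can be sketched with one of the simplest models, a straight-line fit by least squares. This is a minimal sketch and not an example taken from the text: the `fit_line` function and the sample numbers are hypothetical. The formulas are the standard closed-form solution:

slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
intercept = mean_y - slope * mean_x

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept using the closed-form least-squares formulas."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: how x and y vary together. Denominator: how much x varies.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that happens to lie exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

In practice a machine learning library would do this calculation for you; writing it out by hand only shows how little is left to do once the formula is known.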
If you need to store huge amounts of data, you require software that’s specialized in
managing and querying this data. Traditionally, this has been the playing field of relational databases such as Oracle Database, MySQL, Sybase IQ, and others. While they’re still the go-to technology for many use cases, new types of databases have emerged under
the grouping of NoSQL databases.
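To make the contrast concrete, the sketch below stores similar customer records in two ways: in a relational table with a fixed schema, using Python's built-in `sqlite3` module, and as free-form JSON documents, which is roughly how a document store (one of several kinds of NoSQL database) treats its records. The table, field names, and sample records are hypothetical, and a plain Python list stands in for an actual NoSQL database.

```python
import json
import sqlite3

# Relational style: a fixed schema declared up front, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ada", "London"), (2, "Grace", "New York")],
)
london = conn.execute(
    "SELECT name FROM customers WHERE city = ?", ("London",)
).fetchall()
conn.close()

# Document style: each record is a self-describing document, and records
# do not have to share the same fields.
documents = [
    json.dumps({"id": 1, "name": "Ada", "city": "London", "tags": ["vip"]}),
    json.dumps({"id": 2, "name": "Grace", "orders": [{"sku": "A1", "qty": 2}]}),
]
with_orders = [doc["name"] for doc in map(json.loads, documents) if "orders" in doc]

print(london, with_orders)
```

The relational version rejects data that doesn't fit its columns but supports rich queries; the document version accepts records of any shape and leaves it to the application to interpret them.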