"We have lots of data – now what?"
(How can we unlock valuable insight from our data?)
Data science is deep knowledge discovery through data inference and exploration.
This discipline often involves using mathematical and algorithmic
techniques to solve some of the most analytically complex business
problems, leveraging troves of raw information to uncover hidden
insight that lies beneath the surface. It centers around evidence-based
analytical rigor and building robust decision capabilities.
Ultimately, data science matters because it enables companies to operate and strategize more intelligently. It is all about adding substantial enterprise value by learning from data.
The variety of projects that a data scientist may be engaged in is incredibly broad. Here are a few examples:
- tactical optimization – improvement of marketing campaigns, business processes, etc.
- predictive analytics – anticipating future demand, future events, etc.
- nuanced learning – e.g. developing deep understanding of consumer behavior
- recommendation engines – e.g. Amazon product recs, Netflix movie recs
- automated decision engines – e.g. automated fraud detection, and even self-driving cars
The objectives of these types of initiatives may be clear, but the
problems require extensive quantitative expertise to solve. They may
require building predictive models, attribution models, segmentation
models, heuristics for deep pattern discovery in data, etc. This
demands broad knowledge of machine-learning algorithms and sharp
technical ability. As you might guess, these are not the easiest skills
to pick up.
Data science is multidisciplinary; the skill set of a data scientist lies at the intersection of three main competencies: mathematics expertise, technology and hacking skills, and business/strategy acumen.
At the heart of deriving insight from data is the ability to view the
data through a quantitative lens. There are textures, patterns,
dimensions, and correlations in data that can be expressed numerically,
and discovering inference from data becomes a brain teaser of
mathematical techniques. Solutions to many business problems often
involve building analytic models that are deeply grounded in hard math
theory, and being able to understand how models work is as important as
knowing the process to build them (there is real danger in building
without knowing the math).
Also, a big misconception is that data science is all about statistics.
While statistics is important, it is not the only type of mathematics
that should be well understood by a data scientist. First, there are two
main branches of statistics – classical statistics and Bayesian
statistics. When most people refer to stats they are generally referring
to classical stats, but knowledge of both types is very helpful.
Furthermore, many inferential techniques and machine learning algorithms
lean heavily on knowledge of linear algebra. For example, key data
science processes like SVD (singular value decomposition, used for
dimension reduction / latent variable discovery) are grounded in matrix
mathematics and have much less to do with classical statistics. Overall,
data scientists should have substantial breadth and depth in their
knowledge of math.
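To make the SVD point concrete, here is a minimal sketch of how a single latent dimension can be pulled out of a small matrix. It is an illustration under assumptions, not a production routine: the ratings matrix is made up, the function name `top_singular_triplet` is hypothetical, and plain power iteration is swapped in for a full SVD solver so that the example needs nothing beyond the Python standard library. In practice a linear algebra library would do this work.

```python
import math

def top_singular_triplet(A, iters=200):
    """Estimate the largest singular value of A and its singular vectors.

    Uses power iteration on A^T A. A is a list of rows (hypothetical helper).
    """
    n_rows, n_cols = len(A), len(A[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # unit-length starting guess
    for _ in range(iters):
        # One power-iteration step: w = A^T (A v), then renormalize.
        Av = [sum(a * x for a, x in zip(row, v)) for row in A]
        w = [sum(A[i][j] * Av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = [sum(a * x for a, x in zip(row, v)) for row in A]
    sigma = math.sqrt(sum(x * x for x in Av))  # singular value = |A v|
    u = [x / sigma for x in Av]                # left singular vector
    return sigma, u, v

# Made-up ratings: rows are users, columns are items.
ratings = [
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [1.0, 2.0, 4.0],
]

sigma, u, v = top_singular_triplet(ratings)

# Rank-1 approximation: each user and item is described by one latent number.
rank1 = [[sigma * ui * vj for vj in v] for ui in u]
```

Here `u` gives each user a single latent score and `v` gives each item one, which is the dimension reduction idea in miniature: a 4x3 table is summarized by 4 + 3 numbers and one scale factor.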
First, let's clarify that we are not talking about hacking as in breaking into computers. We're referring to the tech/developer subculture meaning of hacking – i.e., creativity and ingenuity in using technical skills to build things and find clever solutions to problems.
Why is hacking ability important? Because data scientists absolutely
need to leverage technology in order to wrangle enormous data sets and
work with complex algorithms, and that requires tools far more
sophisticated than Excel. Examples of such tools are SQL, SAS, and R,
all of which require technical/coding ability. With these
high-performance tools, a true 'hacker' is a technical ninja, able to
use ingenious problem-solving ability to achieve mastery in data
exploration – piecing together unstructured information and teasing out
golden nuggets of insight.
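As a small taste of what working with SQL looks like, the sketch below runs an aggregate query against a throwaway in-memory database using Python's built-in `sqlite3` module. The `orders` table, its columns, and its rows are invented purely for illustration; a real warehouse would be far larger and messier.

```python
import sqlite3

# In-memory database with a made-up orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 20.0), ("bob", 5.0), ("ann", 35.0), ("cy", 9.0), ("bob", 7.5)],
)

# Summarize raw rows into one line per customer, biggest spender first.
rows = conn.execute(
    "SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
conn.close()

for customer, n_orders, total in rows:
    print(customer, n_orders, total)
```

The same `GROUP BY` pattern scales from five rows to billions, which is a large part of why a spreadsheet stops being enough.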
Another way to define a hacker is as a solid algorithmic thinker – that
is, having the ability to break down messy problems and
recompose them in ways that are solvable. This is critical for good data
science, especially since data scientists work intimately within
existing algorithmic frameworks and oftentimes create their own
algorithms to solve complex problems. Clarity of thinking within
deeply-abstract mental maps of data dimensions and processing capability
is how challenging problems get solved.
It is very important to note that a data scientist is first and foremost a strategy consultant.
Data science teams have become invaluable resources within companies
because by being able to learn from data in ways no one else can, they
are extraordinarily well-positioned to figure out how to add substantial
business value. But this means having a keen sense of how to dissect
and approach business problems becomes as important as having a keen
sense of how to approach algorithmic problems. Ultimately, the value
doesn't come from numbers; it comes from strategic thinking based on
those numbers.
Additionally, a core competency of data science is using data to
cogently tell a story. This means no data-puking; rather, it means
presenting a cohesive narrative of problem and solution, using data
insights as supporting pillars, that leads to guidance.
Clearly, get all the competencies right (math, technology, and
business) and this is an incredibly potent combination. There is a
reason why data scientists are well paid and probably will never have
to worry about job security. It is not a bad place to be, having the
rarefied talents that big companies everywhere are trying to recruit.
A defining personality trait of data scientists is that they are deep thinkers with intense intellectual curiosity.
Data science is all about being inquisitive – asking new questions,
making new discoveries, and learning new things. Ask true data
scientists what drives them in their job, and they will not say "money".
The real motivator is being able to use their creativity and ingenuity
to solve hard problems and constantly indulge in their curiosity.
Deriving insight from data is not about getting an answer; it is about
uncovering "truth" that lies hidden beneath the surface. Problem solving
is not a task, but rather an intellectually stimulating journey to a
solution. There is passion for the work, and great satisfaction in
taking on a challenge.
While solid math skills are necessary, there is a glaring
misconception out there that you need a Ph.D in Statistics to become a
legitimate data scientist. That view completely misses the point that
data science is multidisciplinary; years of study in academia may not
leave graduates with the correct set of experience and abilities to
excel – i.e. a Ph.D statistician may not have nimble hacking skills or
strategic business intuition to complete the trifecta.
As a matter of fact, data science is such a relatively new and rising
discipline that universities have not caught up in developing
comprehensive data science degree programs – meaning that no one can
really claim to have "done all the schooling" to become a data
scientist. Where does much of the training come from? The unyielding
intellectual curiosity that data scientists possess drives them to be
passionate autodidacts, motivated to learn skills on their own with deep
determination (read: where can you find people like this?).
There is a slew of terms closely related to data science that we hope to add some clarity around.
Analytics has risen quickly in popular business lingo over the past
several years; the term is used loosely, but is generally meant to
describe critical thinking that is quantitative in nature. Technically,
analytics is the "science of analysis" – put another way, the practice
of analyzing information to make decisions.
Is "analytics" the same thing as data science? It depends on context.
Sometimes it is synonymous with the definition of data science that we
have described, and sometimes it represents something else. A data
scientist using raw data to build a predictive behavior model falls into
the scope of analytics. At the same time, a general business user
interpreting pre-built dashboard reports (e.g. GA) is also in the realm
of analytics, but does not cross into the specialized skill needed in
data science. Analytics has come to have a fairly broad meaning, though
at the end of the day, the semantics don't matter much.
"Analyst" is a somewhat ambiguous term that can represent many different
types of roles (marketing analyst, operations analyst, portfolio
analyst, financial analyst, etc.). Is an analyst the same as a data
scientist? We've discussed a pretty strict canon around what a data
scientist is – an expert's role with requisite talents in math,
technology, and strategy consulting. Let's just say that some analysts
are definitely data-scientists-in-training. As represented in this
visual, there is a place in the middle where the distinction can blur a
bit.
Here are examples of growth from analyst to veritable data scientist:
- An analyst who has previously only mastered Excel learns how to dive into raw warehouse data using SQL and R
- An analyst who previously only knew enough stats to report the
results of an A/B test gains the expertise to build a predictive model
with latent variable analysis and cross-validation
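For readers who have not met cross-validation, here is a minimal sketch of the idea: hold out part of the data, fit the model on the rest, and measure error only on the held-out part. The data points, the function names `fit_line` and `k_fold_mse`, and the choice of a one-variable least-squares line as the "predictive model" are all assumptions made for illustration. Latent variable analysis is not covered here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def k_fold_mse(xs, ys, k=4):
    """Average held-out mean squared error over k folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        held_out = set(range(fold, n, k))  # every k-th point is held out
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        slope, intercept = fit_line(train_x, train_y)
        # Score the model only on points it did not see during fitting.
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k

# Made-up data that roughly follows y = 2x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

cv_mse = k_fold_mse(xs, ys, k=4)
print(cv_mse)
```

The number that comes out estimates how the model would do on data it has never seen, which is a more honest score than its error on the data it was fit to.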
The overall point is that moving in the direction of "data scientist"
requires motivation to learn many new skills. Many companies have
actually found success cultivating their own home-grown data scientists
by giving their analysts the resources and training to take their
abilities to the next level.
Machine learning is a term that is closely tied to data science.
Simply, it means being able to train systems or algorithms to derive
insight from a data set. The actual types of machine learning are
varied, ranging from regression models to support vector machines to
neural nets, but it all centers around 'teaching' a computer to become very good at pattern recognition. Examples of machine learning include:
- predictive models that can anticipate user behavior
- clustering algorithms that mine for natural similarities between different customers
- classification models that can recognize and filter out spam
- recommendation engines that 'learn' about preferences at an individual level
- neural nets that can recognize what image patterns look like
Data scientists work intimately with machine learning techniques to
build algorithms that automate elements of their problem-solving. It is a
requisite part of the data science toolset, needed to tackle some of
the most complex data-driven projects.
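To show what 'teaching' a computer to find patterns can look like, here is a bare-bones sketch of one clustering algorithm, k-means, grouping customers by similarity. The customer data (monthly visits, average spend), the starting centroids, and the function names are all made up for illustration, and the fixed starting points are a simplification; real implementations pick starting centroids more carefully.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return dists.index(min(dists))

def kmeans(points, centroids, iters=10):
    """Plain k-means: assign points to the nearest centroid, then recenter."""
    centroids = list(centroids)
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its members (keep it if empty).
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Made-up customers: (visits per month, average spend).
customers = [(1, 10), (2, 12), (1, 9), (9, 80), (10, 95), (8, 85)]

centroids, labels = kmeans(customers, centroids=[customers[0], customers[3]])
print(labels)
```

Nobody told the algorithm which customers are casual and which are heavy spenders; the grouping falls out of the distances alone.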
Raw data can be unstructured and messy, with information from
disparate data sources and mismatched records. Data munging is a term
that describes the important process of cleaning up data so that it is
ready for analysis and use in machine learning algorithms. This requires
good pattern-recognition ability and clever hacking skills in order to
merge and transform masses of raw information. Dirty data can obfuscate
the 'truth' hidden in the data and completely mislead an analysis. Thus,
any data scientist must be skillful and nimble at data munging in order
to have accurate data for deriving insight.
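A tiny sketch of what munging can involve: two sources describe overlapping customers with inconsistent casing, stray whitespace, and a record with no usable key. The source names (`crm`, `web`), the field names, and the rule of keying on a normalized email address are assumptions chosen for illustration; real munging rules depend entirely on the data at hand.

```python
def clean_record(rec):
    """Normalize one raw record so that equivalent records compare equal."""
    return {
        "email": rec["email"].strip().lower(),
        "name": " ".join(rec["name"].split()).title(),  # collapse extra spaces
    }

def merge_sources(*sources):
    """Combine sources, dropping keyless records and duplicate emails."""
    merged = {}
    for source in sources:
        for rec in source:
            cleaned = clean_record(rec)
            if not cleaned["email"]:
                continue  # no key to match on, so the record is unusable
            merged.setdefault(cleaned["email"], cleaned)  # first source wins
    return list(merged.values())

# Made-up raw data from two hypothetical systems.
crm = [
    {"email": " Ann@Example.com ", "name": "ann  lee"},
    {"email": "bob@example.com", "name": "Bob Ray"},
]
web = [
    {"email": "ann@example.com", "name": "Ann Lee"},
    {"email": "", "name": "unknown"},
    {"email": "CY@example.com", "name": "cy  tan"},
]

customers = merge_sources(crm, web)
print(customers)
```

Without the normalization step, Ann would be counted as two different customers, which is exactly the kind of quiet error that misleads an analysis.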
In any organization that wants to leverage big data to gain value,
data science is the secret sauce. But it is incredibly difficult to
find experts who embody all the necessary talents – so if you manage to
hire a data scientist, nurture them, keep them engaged, and give them
autonomy to be their own architects in figuring out how to add value to
the business. At the end of the day, data science is a capability that
turns information into gold, and data scientists are uniquely positioned
to be transformative figures within a company.
A superb article from DataJobs.com