Data science is deep knowledge discovery through data inference and exploration. This discipline often involves using mathematical and algorithmic techniques to solve some of the most analytically complex business problems, leveraging troves of raw information to uncover hidden insight that lies beneath the surface. It centers on evidence-based analytical rigor and building robust decision capabilities.
"We have lots of data – now what?" (How can we unlock valuable insight from our data?)
Ultimately, data science matters because it enables companies to operate and strategize more intelligently. It is all about adding substantial enterprise value by learning from data.
The variety of projects that a data scientist may be engaged in is incredibly broad. Here are a few examples:
- tactical optimization – improvement of marketing campaigns, business processes, etc
- predictive analytics – anticipate future demand, future events, etc
- nuanced learning – e.g. developing deep understanding of consumer behavior
- recommendation engines – e.g. Amazon product recs, Netflix movie recs
- automated decision engines – e.g. automated fraud detection, and even self-driving cars
The objectives of these types of initiatives may be clear, but the
problems require extensive quantitative expertise to solve. They may
require building predictive models, attribution models, segmentation
models, heuristics for deep pattern-discovery in data, etc. This
demands broad working knowledge of many kinds of machine-learning
algorithms and sharp technical ability. As you might guess, these are
not the easiest skills to pick up.
What is data science – the requisite skill set
Data science is multidisciplinary; the skill set of a data scientist lies at the intersection of 3 main competencies:
Mathematics Expertise
At the heart of deriving insight from data is the ability to view the
data through a quantitative lens. There are textures, patterns,
dimensions, and correlations in data that can be expressed numerically,
and discovering inference from data becomes a brain teaser of
mathematical techniques. Solutions to many business problems
involve building analytic models that are deeply grounded in hard
mathematical theory, and understanding how the models work is as
important as knowing the process to build them: building a model without knowing the math behind it is dangerous.
Also, a big misconception is that data science is all about statistics.
While statistics is important, it is not the only type of mathematics
that should be well-understood by a data scientist. First, there are two
main branches of statistics – classical statistics and Bayesian statistics. When most people refer to stats they are generally referring to classical stats,
but knowledge of both types is very helpful. Furthermore, many
inferential techniques and machine learning algorithms lean heavily on
knowledge of linear algebra. For example, key data science processes like singular value decomposition (SVD,
used for dimension reduction / latent variable discovery) are grounded
in matrix mathematics and have much less to do with classical
statistics. Overall, data scientists should have substantial breadth and
depth in their knowledge of math.
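To make the linear algebra point concrete, here is a minimal sketch of the idea behind SVD-style dimension reduction. It is not a full SVD: it uses plain power iteration to approximate only the leading singular value and its singular vectors, and it uses nothing beyond the Python standard library. The ratings matrix and the function name leading_singular_triplet are made up for illustration, and the all-ones starting vector is an assumption that works here because every entry of the matrix is positive.

```python
import math

def leading_singular_triplet(rows, iterations=200):
    """Approximate the largest singular value of a matrix (a list of rows)
    and its left/right singular vectors via power iteration on A^T A."""
    n_rows, n_cols = len(rows), len(rows[0])
    # Assumed starting guess: a unit vector with equal weights.
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # av = A v
        av = [sum(a * b for a, b in zip(row, v)) for row in rows]
        # w = A^T (A v)
        w = [sum(rows[i][j] * av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    av = [sum(a * b for a, b in zip(row, v)) for row in rows]
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av] if sigma else av
    return sigma, u, v

# Hypothetical data: rows are customers, columns are products.
ratings = [
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [1.0, 2.0, 4.0],
]
sigma, u, v = leading_singular_triplet(ratings)

# One number per customer: the projection onto the leading direction.
scores = [sigma * ui for ui in u]
```

Each customer's three ratings are compressed into a single score along the direction that captures the most variation in the matrix. In practice a numerical library's SVD routine would return all components at once.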
Technology and Hacking
First, let's clarify that we are not talking about hacking as in breaking into computers. We're referring to the tech/developer subculture meaning of hacking – i.e., creativity and ingenuity in using technical skills to build things and find clever solutions to problems.
Why is hacking ability important? Because data scientists absolutely need to leverage technology in order to wrangle enormous data sets and work with complex algorithms, and it requires using tools far more sophisticated than Excel. Examples of such tools are SQL, SAS, and R, all of which require technical/coding ability. With these high-performance tools, a true 'hacker' is a technical ninja, able to use ingenious problem solving ability to achieve mastery in data exploration – piecing together unstructured information and teasing out golden nuggets of insight.
Another way to define a hacker is as a solid algorithmic thinker – that is, having the ability to break down messy problems and recompose them in ways that are solvable. This is critical for good data science, especially since data scientists work intimately within existing algorithmic frameworks and oftentimes create their own algorithms to solve complex problems. Clarity of thinking within deeply-abstract mental maps of data dimensions and processing capability is how challenging problems get solved.
Strong Business Acumen
It is very important to note that a data scientist is first and foremost a strategy consultant.
Data science teams have become invaluable resources within companies
because by being able to learn from data in ways no one else can, they
are extraordinarily well-positioned to figure out how to add substantial
business value. But this means having a keen sense of how to dissect
and approach business problems becomes as important as having a keen
sense of how to approach algorithmic problems. Ultimately, the value
doesn't come from numbers; it comes from strategic thinking based on
those numbers.
Additionally, a core competency of data science is using data to cogently tell a story. This means no data-puking; rather, it means presenting a cohesive narrative of problem and solution that leads to guidance, using data insights as supporting pillars.
Clearly, getting all the competencies right (math, technology, and business) makes for an incredibly potent combination. There is a reason why data scientists are well paid
and probably will never have to worry about job security. It is not a bad
place to be, having the rarefied talents that big companies everywhere
are trying to recruit.
What is a data scientist – curiosity and training
The Mindset
A defining personality trait of data scientists is they are deep thinkers with intense intellectual curiosity.
Data science is all about being inquisitive – asking new questions,
making new discoveries, and learning new things. Ask true data
scientists what drives them in their job, and they will not say "money".
The real motivator is being able to use their creativity and ingenuity
to solve hard problems and constantly indulge in their curiosity.
Deriving insight from data is not about getting an answer; it is about
uncovering "truth" that lies hidden beneath the surface. Problem solving is not a task,
but rather an intellectually-stimulating journey to a solution. There is
passion for the work, and great satisfaction in taking on a challenge.
Training
While solid math skills are necessary, there is a glaring
misconception out there that you need a Ph.D in Statistics to become a
legitimate data scientist. That view completely misses the point that
data science is multidisciplinary; years of study in academia may not
leave graduates with the correct set of experience and abilities to
excel – i.e. a Ph.D statistician may not have nimble hacking skills or
strategic business intuition to complete the trifecta.
As a matter of fact, data science is such a relatively new and rising
discipline that universities have not caught up in developing
comprehensive data science degree programs – meaning that no one can
really claim to have "done all the schooling"
to become a data scientist. Where does much of the training come
from? The unyielding intellectual curiosity that data scientists possess
drives them to be passionate autodidacts, motivated to learn skills on their own with deep determination (Read: where can you find people like this?).
Analytics and machine learning – how it ties to data science
There is a slew of terms closely related to data science that we hope to add some clarity around.
What is Analytics?
Analytics has risen quickly in popular business lingo over the past several years; the term is used loosely, but is generally meant to describe critical thinking that is quantitative in nature. Technically, analytics is the "science of analysis": put another way, the practice of analyzing information to make decisions.
Is "analytics" the same thing as data science? It depends on context. Sometimes it is synonymous with the definition of data science that we have described, and sometimes it represents something else. A data scientist using raw data to build a predictive behavior model falls into the scope of analytics. At the same time, a general business user interpreting pre-built dashboard reports (e.g. GA) is also in the realm of analytics, but does not cross into the specialized skill needed in data science. Analytics has come to have a fairly broad meaning, though at the end of the day, the semantics don't matter much.
What is the difference between an analyst and a data scientist?
"Analyst"
is a somewhat ambiguous term that can represent many different
types of roles (marketing analyst, operations analyst, portfolio
analyst, financial analyst, etc.). Is an analyst the same as a data
scientist? We've set out a fairly strict definition of a data
scientist: an expert role requiring talent in math,
technology, and strategy consulting. Let's just say that some analysts
are definitely data-scientists-in-training. As represented in this
visual, there is a place in the middle where the distinction can blur a
bit.
Here are examples of growth from analyst to veritable data scientist:
- An analyst who has previously only mastered Excel learns how to dive into raw warehouse data using SQL and R
- An analyst who previously only knew enough stats to report the results of an A/B test gains the expertise to build a predictive model with latent variable analysis and cross-validation
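As a rough illustration of the cross-validation mentioned in the second example, the sketch below fits a one-variable least-squares line and scores it on held-out folds. It covers only the cross-validation part, not latent variable analysis. The spend and sign-up numbers are invented, the helper names fit_line and k_fold_mse are hypothetical, and folds are assigned by index modulo k purely for simplicity.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def k_fold_mse(xs, ys, k=3):
    """Average mean squared error on held-out data across k folds."""
    fold_errors = []
    for fold in range(k):
        # Every k-th point is held out; the rest are used for fitting.
        test_idx = [i for i in range(len(xs)) if i % k == fold]
        train_idx = [i for i in range(len(xs)) if i % k != fold]
        slope, intercept = fit_line(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx]
        )
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k

# Hypothetical data: weekly ad spend vs. sign-ups.
spend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
signups = [12.0, 19.0, 31.0, 42.0, 48.0, 61.0]
cv_error = k_fold_mse(spend, signups, k=3)
```

The point of the exercise is that the model is always judged on data it did not see during fitting, which gives a more honest estimate of predictive error than the fit on the training data alone.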
What is Machine Learning?
Machine learning is a term that is closely tied to data science.
Simply, it means being able to train systems or algorithms to derive
insight from a data set. The actual types of machine learning are
varied, ranging from regression models to support vector machines to
neural nets, but it all centers around 'teaching' a computer to become very good at pattern recognition. Examples of machine learning include:
- predictive models that can anticipate user behavior
- clustering algorithms that mine for natural similarities between different customers
- classification models that can recognize and filter out spam
- recommendation engines that 'learn' about preferences at an individual level
- neural nets that can recognize what image patterns look like
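To give a feel for what 'teaching' a computer looks like, here is a minimal sketch of one item from the list above, a spam classifier. It assumes a naive Bayes model over word counts with add-one (Laplace) smoothing, which is one common approach among many. The four training messages are made up, the function names are hypothetical, and a real filter would need far more data and better text handling.

```python
import math
from collections import Counter

def train_naive_bayes(messages):
    """messages: list of (text, label) pairs.
    Returns per-label word counts and per-label message counts."""
    word_counts = {}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability for the text."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_messages = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Start from the prior: how common this label is overall.
        score = math.log(label_counts[label] / total_messages)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data.
training = [
    ("win a free prize now", "spam"),
    ("free money claim now", "spam"),
    ("meeting moved to monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
word_counts, label_counts = train_naive_bayes(training)

print(classify("claim your free prize", word_counts, label_counts))  # spam
print(classify("team lunch monday", word_counts, label_counts))      # ham
```

Nothing in the code says which words indicate spam; the model picks that up from the labeled examples, which is the pattern recognition the section describes.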
What is Data Munging?
Raw data can be unstructured and messy, with information from disparate data sources and mismatched records. Data munging is a term to describe the important process of cleaning up data so that it is ready for data analysis and use in machine learning algorithms. This requires good pattern-recognition ability and clever hacking skills in order to merge and transform masses of raw information. Dirty data can obfuscate the 'truth' hidden in the data and completely mislead an analysis; thus, any data scientist must be skillful and nimble at data munging in order to have accurate data for deriving insight.
Final word
In any organization that wants to leverage big data to gain value, data science is the secret sauce. But it is incredibly difficult to find experts who embody all the necessary talents – so if you manage to hire a data scientist, nurture them, keep them engaged, and give them autonomy to be their own architects in figuring out how to add value to the business. At the end of the day, data science is a capability that turns information into gold, and data scientists are uniquely positioned to be transformative figures within a company.
A superb article from DataJobs.com