Data Science In English - Part 1

Category: Data Science
Status: Draft
A fucking straightforward introduction to what computers can do for you.
Pretty much wherever you go in the working world, you've heard terms like "data science", "machine learning", and "AI" thrown around with varying degrees of plausibility. Let's be clear up front: there is no such thing as a "let's just let AI solve this" approach to your business challenge. Not yet, anyway.
In a nutshell, what is usually missing from debates about how to use data to address business challenges is a thorough understanding of what options are on the table. That knowledge tends to sit with the technical members of a team, and it can be lost or misinterpreted on its way to the key decision makers.
With that said, the essentials of Data Science are not hard. In fact, here they are.
First off: Data Science vs. Machine Learning
  • Data Science is a field that studies the best ways to derive human-relatable insights or recommendations from data.
  • Machine Learning (ML) is a subset of Data Science that focuses on the algorithms that a computer uses to generate those insights. There are a few different types or classes of ML algorithms that do different things based on the type of data you are analyzing and what type of insights you need to uncover (we'll get into this).
One term you may also hear is "AI" or "Artificial Intelligence". This term is often misused, and it is a fairly reliable indicator that the person speaking doesn't have a full grasp of data science. Technically, Artificial Intelligence is a general term for the broad field that considers how computers can learn things. In that way, much of data science and machine learning is "artificial intelligence."
However, "AI" can also be used in a specific way (consider an IBM Watson commercial) that suggests some kind of anthropomorphic character that pops out of the computer and just solves things. We don't really have this yet, and we don't know when we will. The closest thing we do have is what we'd more accurately call "Deep Learning", which is a type of Machine Learning that uses a specific kind of algorithm called a Neural Network.
Hopefully you are with me so far! From here, we're going to focus on the Machine Learning world and the subsets within that. If you have someone asking "we have this data, can we do X with it?" then the likely answer is in the world of machine learning.
Machine Learning
As we noted above, machine learning is a field of many algorithms that tell the computer how to learn. There are many different techniques a computer can use, but we can broadly separate these into a few buckets:
  • Supervised Learning - these are algorithms where you give the computer a bunch of data where you already know the correct answer, and you want the computer to reliably figure out how to arrive at it.
  • Unsupervised Learning - these are algorithms where there is no "correct" answer, and the computer is essentially searching for natural trends or relationships within the data.
  • Deep Learning - as we mentioned above, this is a specific class of algorithms that are most often used in a supervised way (they still rely on knowing what is a good or correct outcome vs. what is a bad or incorrect outcome), but they largely interpret the data and identify features themselves.
This isn't exhaustive, but it covers the vast majority of cases that you'll see. It's worth noting here: machine learning is concerned with how we can predict something. It is forward looking. Anything outside of that would be broadly considered data analysis or analytics. It is still very useful to know, as an example, sales trends for each category over the past year. The computer doesn't need to learn anything to do that, though; it can just show it. Even so, some initial data analysis is a critical step before using machine learning techniques.
With that said, the typical uses for each of these classes are fairly clear-cut. If you are trying to predict whether or not someone will open an e-mail or how much money they will spend in your store, that's supervised learning. If you are exploring large volumes of text (books, articles, comments, or e-mails) without labeled answers, then you'll usually start with unsupervised learning. And if you are trying to identify faces within a set of pictures, then deep learning is the right approach. As we go through each example, we will provide some more context on this.
Supervised Learning
Supervised learning is common, and almost everyone has had some exposure to it in the form of Linear Regression. That's right, linear regression is machine learning! Admittedly, it is a simple algorithm, but the gist of it is this: the computer takes in a whole bunch of data points, and "learns" how the input variables (or features) can be used to predict the target or outcome variable.
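To make "learning" from data points concrete, here is a minimal sketch of one-variable linear regression fitted by ordinary least squares. The data (ad spend vs. sales) and the function names fit_line and predict are made up for illustration; a real project would more likely use a library than hand-rolled code.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much x and y move together, relative to how much x moves on its own.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Use the learned slope and intercept on a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical data: the feature is ad spend, the target is sales.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]  # constructed as exactly 2 * x + 1

model = fit_line(ad_spend, sales)
forecast = predict(model, 6.0)
print(model, forecast)
```

The "learning" here is nothing more than finding the slope and intercept that best fit the known answers, then reusing them on an input the model has not seen.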
Fortunately, there are many more advanced algorithms within supervised learning: more advanced forms of regression, support vector machines, random forests, gradient boosted machines, and more! But fundamentally these all do something similar: they take in the features and, given the related set of targets or outcome variables, they will "learn" the best way to predict the outcome.
We also need to note that there are two main types of supervised learning problems: classification and regression. Classification is focused on predicting discrete outcomes (e.g. did the team win, lose, or tie the game), while regression is focused on predicting numeric outcomes (e.g. how many points will the team score). Most algorithms can be used for both, but with slightly different mechanics under the hood. More tangibly, we evaluate the accuracy of a model differently in these cases.
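As a rough illustration of how evaluation differs between the two problem types, the sketch below scores a classifier by the share of labels it got right and a regressor by its average miss. Accuracy and mean absolute error are just two common choices picked here as examples, and all of the game results are invented.

```python
def accuracy(actual, predicted):
    """Classification: share of predictions that exactly match the true label."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)


def mean_absolute_error(actual, predicted):
    """Regression: average size of the miss, in the target's own units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical results for four games.
actual_results = ["win", "lose", "tie", "win"]
predicted_results = ["win", "lose", "win", "win"]

actual_points = [21, 14, 30, 10]
predicted_points = [24, 14, 27, 12]

result_accuracy = accuracy(actual_results, predicted_results)
points_error = mean_absolute_error(actual_points, predicted_points)
print(result_accuracy, points_error)
```

A classification guess is either right or wrong, while a regression guess can be close or far off, which is why the two need different yardsticks.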
Unsupervised Learning
Unsupervised learning is less commonly used and a bit less well defined. Perhaps the simplest example is clustering. There are various clustering algorithms that take in all the data and determine what natural groupings exist within it. You may do this with your customer data in order to come up with different personas.
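One widely used clustering algorithm is k-means. The sketch below is a stripped-down, one-dimensional version, assuming hypothetical customer spend figures and hand-picked starting centers; it is meant to show the idea of "natural groupings" rather than a production approach.

```python
def k_means_1d(values, centers, rounds=10):
    """Tiny 1-D k-means: assign each value to its nearest center,
    then move each center to the mean of the values assigned to it."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical monthly spend per customer; no labels are given.
spend = [12, 15, 18, 20, 95, 100, 110, 105]

# Assumption: start the two centers at the smallest and largest values.
centers, groups = k_means_1d(spend, centers=[min(spend), max(spend)])
print(centers, groups)
```

Nobody told the computer which customers were "low spenders" or "high spenders"; the two groups fall out of the data, and naming them as personas is left to a human.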
Another example is recommendation generation (e.g. Netflix or Amazon). For them, there is no "right" answer about what you should click on next, but we can write various algorithms to generate a recommendation based on what you've viewed previously.
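A very simple way to generate recommendations from viewing history is to count which titles tend to show up alongside the ones a person has already viewed. The sketch below assumes made-up viewing histories and a hypothetical recommend function; it is not how Netflix or Amazon actually do it, just the general flavor.

```python
from collections import Counter


def recommend(histories, viewed, top_n=2):
    """Suggest titles that most often appear alongside what was already viewed."""
    seen = set(viewed)
    scores = Counter()
    for history in histories:
        overlap = len(seen & set(history))
        if overlap == 0:
            continue  # this viewer shares nothing with us, so ignore them
        for title in history:
            if title not in seen:
                scores[title] += overlap
    return [title for title, _ in scores.most_common(top_n)]


# Hypothetical viewing histories of other people.
histories = [
    ["space epic", "robot thriller", "desert saga"],
    ["space epic", "desert saga"],
    ["rom-com one", "rom-com two"],
    ["robot thriller", "desert saga", "alien linguistics"],
]

suggestions = recommend(histories, viewed=["space epic"])
print(suggestions)
```

There is no labeled "right" answer anywhere in this data; the scores come purely from patterns in what people watched together.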
Lastly, there is Natural Language Processing (NLP for short) which is broadly the process of having a computer derive understanding/insights from text. As with the rest of Machine Learning, this isn't magic either! It typically involves using algorithms to convert text into numeric values, clustering, and more.
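One common way to convert text into numeric values is a "bag of words": count how often each word appears in each document. The sketch below assumes two invented customer comments and a deliberately naive tokenizer (lowercase letters and apostrophes only).

```python
import re
from collections import Counter


def bag_of_words(documents):
    """Turn each document into a list of word counts over a shared vocabulary."""
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]
    vocabulary = sorted({word for words in tokenized for word in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical customer comments.
comments = ["Great product, great price", "Terrible product"]

vocabulary, vectors = bag_of_words(comments)
print(vocabulary)
print(vectors)
```

Once every comment is a row of numbers, the same clustering or prediction techniques used on any other numeric data can be applied to text.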
Because we don't know a "right" answer for unsupervised learning, it is a bit tougher to understand if you are doing it well. Practically speaking, creating a new algorithm and then testing it versus the old one is one way to get insight into it. But you could also turn almost any unsupervised problem into a supervised problem if you don't mind manually classifying the right answers! Let's say you wanted a perfect recommendation engine: you could classify all of the recommendations it generates as either good or bad. Then you have a classification problem that you can use supervised learning techniques on!
Point being, the line is fairly gray, but all you need to know is that there are some different ways to analyze cases where there isn't a right answer already in your data set.
Deep Learning
Last but certainly not least, there is deep learning! When we think of futuristic AIs that will either make life a breeze or terminate all of us, well, deep learning is as close as we currently come, and it is already in heavy use. Some examples include photo recognition, generation of deep fakes (audio/video), and the playing of various games (Chess, Go, StarCraft) at superhuman levels.
No matter the subject, deep learning relies on the same algorithm class to do its thing: neural networks. Neural networks were first proposed in the 1940s and 1950s as a model of how human brains interpret data and learn at a cellular level, but it has only been recently that computers have become powerful enough for them to be feasibly applied to real use cases.
It is actually worth providing a high-level overview of how a neural network works. It takes in data at its base level, and then has a series of nodes that are all trying to predict the outcome. But there are multiple levels (layers) of these predictors, with each layer feeding the next, and every connection between nodes carries its own weight. The result is a web of nodes, all weighted differently, that together generate a prediction. Don't worry too much about the details (or dive into them if you want), but the important thing is to get an overall picture in your head like this one:
[Picture]
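For readers who prefer code to pictures, the layered web of weighted nodes can be sketched as a "forward pass": each layer takes the previous layer's outputs, applies its weights, and hands the result on. The weights below are hand-picked, hypothetical numbers (a real network learns them from data), and the sigmoid squashing function is one common choice assumed here for illustration.

```python
import math


def sigmoid(x):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """Each node takes a weighted sum of all inputs, adds its bias, and squashes it."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]


def network_forward(inputs, layers):
    """Feed the data through every layer in order; the last layer's output is the prediction."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 2 inputs -> 3 hidden nodes -> 1 output node.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.5, 0.2]], [0.05]),
]

prediction = network_forward([1.0, 2.0], layers)
print(prediction)
```

Training a network means nudging all of those weights until the predictions line up with known outcomes; this sketch only shows how a prediction is produced once the weights exist.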
Neural networks have a number of interesting properties that we'll have to save for another time. What is relevant here is that researchers have discovered a few shortcuts and computational tricks that let you build really deep (many layers) and wide (many nodes per layer) neural networks that still run in a reasonable amount of time.
Why does that matter? Well, it's these big neural networks that can handle all those use cases we mentioned above. You simply need a lot of layers and nodes to give the computer the flexibility it needs to interpret and classify a picture, let's say.
Oh, and one other benefit: neural networks can be stored and updated just by passing the model the new data points. So, as an example, say you build a model to play Tic Tac Toe. You can have that model play 100 games. This will generate a network similar to the above which reflects what it learned from those 100 games. Now you can have it play 10 more games.
Many traditional modeling approaches would leave you with no option but to have the computer process all 110 games again. A neural network lets you load just the 10 new games and update the model so that it reflects what it learned from all 110 games, without reprocessing the first 100. As you can imagine, that's also a big benefit when dealing with really large data sets and/or streaming data that flows in over time.
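The idea of updating a model with only the new data can be sketched with a single linear node trained one example at a time. This is a deliberately tiny stand-in for a full neural network, and the class name OnlineLinearModel, the learning rate, and the data (generated from y = 2x + 1) are all assumptions made for illustration.

```python
class OnlineLinearModel:
    """y is approximated by weight * x + bias, adjusted one example at a time."""

    def __init__(self, learning_rate=0.1):
        self.weight = 0.0
        self.bias = 0.0
        self.learning_rate = learning_rate
        self.examples_seen = 0

    def predict(self, x):
        return self.weight * x + self.bias

    def update(self, batch):
        """Nudge the weights using only the examples in this batch."""
        for x, y in batch:
            error = self.predict(x) - y
            self.weight -= self.learning_rate * error * x
            self.bias -= self.learning_rate * error
            self.examples_seen += 1


def make_examples(count):
    """Hypothetical data that follows y = 2x + 1 exactly."""
    xs = [(i % 10) / 10 for i in range(count)]
    return [(x, 2 * x + 1) for x in xs]


model = OnlineLinearModel()

# First, learn from the original 100 examples.
model.update(make_examples(100))
after_first = (model.weight, model.bias)

# Later, 10 new examples arrive. Only those 10 are processed;
# what was learned from the first 100 lives on in the weights.
model.update(make_examples(10))
after_update = (model.weight, model.bias)

print(after_first, after_update, model.examples_seen)
```

The weights themselves are the model's memory, so saving them and resuming later with fresh data is all that an update requires.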
Wrapping It Up
Hopefully this gives you a baseline for the world of machine learning. Most notably, consider these three takeaways:
  • Machine learning refers to algorithms within Data Science that tell a computer how it should learn.
  • These algorithms are fed data in large or small quantities and learn the best way to predict a key output or target.
  • Deep Learning uses an algorithm called a neural network to approximate how we believe a human brain learns things, and has been very effective so far in a variety of fields.
We will cover more in [Part 2] of this piece, notably how these can be implemented and the questions that would come up along the way.