If you missed last time, we started at the ground floor of what Data Science and Machine Learning are, so take a look.
Today, we're going to get into some of the mechanics of it for relatively classic problems. We won't focus on what the code would look like - there are plenty of great tutorials for that out there if you are so inclined. Rather, we're going to provide a high-level voiceover of what you would do and the questions/decisions you would have to consider along the way.
The Start: Choosing Your Development Environment
This should be fairly obvious: you need a place to write code and process the data! Still, choosing a development environment is an important decision. It could range anywhere from:
- I have a couple hundred rows of data and just want to run a linear regression: I'll do this in Excel
- I have 200 million rows of data coming in every day: I'll need to process using Spark on a distributed cluster on the cloud
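To make the simple end of that spectrum concrete, here's a minimal sketch of the "couple hundred rows plus a linear regression" case in Python. The data is synthetic (a made-up `y = 3x + 5` relationship with noise), purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for "a couple hundred rows of data"
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))          # 200 rows, one feature
y = 3 * X[:, 0] + 5 + rng.normal(0, 1, 200)    # linear trend plus noise

# Fit an ordinary linear regression and recover the trend
model = LinearRegression()
model.fit(X, y)
print(f"slope ~ {model.coef_[0]:.2f}, intercept ~ {model.intercept_:.2f}")
```

At this scale, Excel, Python, or R all work fine; the tooling question only gets interesting as data volume grows.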
And there's a lot in between. Point being, you are going to need different tools depending on the use case. Despite what job postings typically ask for, someone who is good at one of these will be able to pick up the others and add value very quickly. There's a learning curve when picking up new technology, but the concepts are more or less the same, so don't get too bogged down.
If you are getting started on something, a fairly standard environment or "tech stack" is to run an Anaconda environment in Python, writing code in Jupyter notebooks. That may look like gibberish, but tech tends to have funny names. Point being, you can go to a website and install this just like any other software. Alternatively, you can run code directly on the cloud through sites like Kaggle or Google's Colab offering (both free). Both run something analogous to an Anaconda environment, except in the cloud rather than on your own computer. If you have less than 16 GB of RAM, this is probably the best approach anyway.
Next Comes the Data
Whatever the system, the next step is getting the data into it! This should be fairly easy, but once the data is in you'll need to do three things, which is where the data science really starts. They are:
- Look at the data and get a general summary of what it is showing. This is called an Exploratory Data Analysis or an EDA.
- Determine what it is that you are trying to analyze or predict, what question(s) you are trying to answer.
- Start to clean and prepare the data. This is a whole topic in itself, but generally speaking you need a way to handle missing data, a way to define and handle anomalous (outlier) data points, a way to remove data that isn't relevant, and a way to normalize your data, as many algorithms work better with normally distributed data.
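The three steps above can be sketched with pandas. The tiny DataFrame below is a hypothetical dataset invented for illustration, and the specific cleaning choices (median fill, a simple domain rule for outliers, z-score normalization) are just one reasonable set of assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [34, 45, np.nan, 29, 120, 41],     # 120 looks anomalous
    "income": [52000, 61000, 48000, np.nan, 55000, 58000],
})

# 1. EDA: summary statistics and missing-value counts
print(df.describe())
print(df.isna().sum())

# 2. Missing data: here, fill each gap with the column median
df = df.fillna(df.median())

# 3. Outliers: here, a simple domain rule dropping implausible ages
df = df[df["age"] <= 100]

# 4. Normalize: rescale each column to zero mean and unit variance
df = (df - df.mean()) / df.std()
```

In practice each of these choices (drop vs. fill, what counts as an outlier, which scaling to use) is a judgment call that depends on the data and the question you're answering.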
This phase often takes the most time and is often where the value of having someone who knows what they are doing comes in. Simply put, computers still learn in very specific, fixed ways. Since that's the case, you have to do a lot of preprocessing in order to ensure that the computer is learning the right thing based on the right data.
Now We Model!
Once you have the data set up, you can finally get to your machine learning! As we've mentioned, the modeling phase essentially has the computer processing all the data and identifying what features are useful in predicting the target. At this point, we can try a few different models.
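Trying a few different models often looks something like the sketch below: fit two or more algorithms on the same prepared data and compare their cross-validated scores. The dataset and the two model choices here are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for cleaned, prepared data
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 2] > 0).astype(int)   # hypothetical binary target

scores = {}
for name, model in [
    ("logistic regression", LogisticRegression()),
    ("decision tree", DecisionTreeClassifier(random_state=0)),
]:
    # 5-fold cross-validation: average accuracy across five splits
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
print(scores)
```

The point isn't the specific algorithms; it's that comparing candidates on the same footing is cheap once the data is prepared.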
Just as there's nothing preventing you from skipping the data cleaning, there's also nothing to prevent you from just calling a modeling function and tossing it at the data without any more thought than that. As you can imagine, this is a serious mistake. Here are some of the factors you need to consider in modeling:
- Protecting Against Overfitting - these algorithms have a tendency to see patterns everywhere (just like humans) and tend to make a model too tightly fit to the training data. This means the model will fit the data really well, but will be bad at predicting in practice. Among other precautions, it's essential to hold out a subset of your data (often 20%) that can be used to validate that the model predicts that data as well as it predicts the other 80% used to build the model.
- Parameter Tuning - most models have a variety of options that they can be run with. Neural nets are a good example of this: you can run almost any network architecture that you'd want. And most approaches have at least one parameter that can be varied to tweak how the algorithm fits the data. One of the packages installed with Anaconda, scikit-learn, has functionality that allows you to try sets of these parameters and assess how well they perform relative to each other.
- Connecting it back to the real world - often forgotten, Data Scientists can get lost in just trying to build the best model possible. That may be what you need, but there should be a decision point of where and when a model becomes "good enough," informed by how the model is actually going to be used. Say you are modeling which customers are "at risk" and should receive a free month on their subscription. In this case you'd want to assign a benefit to a true positive detection and a cost to a false positive detection. Based on that, you can quickly get a sense of a) the value of the model itself, and b) the value attributable to any increase in the model's accuracy.
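The holdout precaution described above can be sketched with scikit-learn's `train_test_split`. The dataset and model below are synthetic assumptions; the key idea is that the 20% holdout is never seen during fitting, so similar scores on both splits suggest the model isn't overfit:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # hypothetical binary target

# 80/20 split: the 20% holdout is set aside before any fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}, holdout accuracy: {test_acc:.3f}")
```

A large gap between the two numbers is the classic symptom of overfitting.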
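For the parameter tuning point, scikit-learn's `GridSearchCV` is one example of the kind of function mentioned above: it tries every combination in a parameter grid and scores each with cross-validation. The dataset, model, and grid here are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# A small synthetic dataset with a non-linear pattern
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # hypothetical target

# Try every combination of these parameter values with 5-fold
# cross-validation, then keep the best-scoring combination.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Grids grow multiplicatively with each parameter you add, which is why tuning is usually done over a handful of carefully chosen values rather than everything at once.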
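And the real-world connection above can be made concrete with a little arithmetic. The benefit and cost figures below are made-up assumptions for the subscription example, as are the counts, but the shape of the calculation is the point:

```python
# Assumed business values for the "at risk customer" example
benefit_per_true_positive = 30.0   # value of retaining an at-risk customer
cost_per_false_positive = 10.0     # free month given to a customer who was fine

# Hypothetical counts from evaluating the model on holdout data
true_positives = 120
false_positives = 80

# Net value of acting on the model's predictions
model_value = (true_positives * benefit_per_true_positive
               - false_positives * cost_per_false_positive)
print(f"net value of acting on the model: ${model_value:,.0f}")
```

Rerunning this with the counts from a slightly more accurate model tells you exactly what an extra point of accuracy is worth, which is usually a better stopping criterion than chasing the metric itself.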
There's a bit more to it than just those three things, but if you have a straightforward case and plenty of data, these are the main things to consider. As you can imagine, knowing how to use the tools in Python packages to validate the model fairly (e.g. on a holdout group) is critical. There are certain types of models that handle certain types of data better, but these rules aren't hard and fast and can be picked up through trial and error.
What Comes Next
At this point, you can (and should) have a model that can predict something given the inputs. That's great, but only if you do something with that knowledge and put it into action. Depending on what you've modeled, this action may be manual, but it is ideally baked into a software process. There are tons of examples of this: algorithmic trading picking stocks, prices being set for coffee and other goods, video recommendations being made, et cetera.
You could consider this the "plumbing" of Data Science - it largely involves connecting data pipelines to the algorithm, and the recommendation or action back into the system as well. That said, it'd be a mistake to short-change this part: there are myriad tools and approaches you can use depending on the systems you are integrating with. The speed of the algorithm is also of critical importance (e.g. in the case of a live video recommendation engine). Last but certainly not least, a model can be quite sensitive to any changes in the data, so setting up systems to monitor the inputs and outputs is also critical. Increasingly, this is becoming the realm of Data Engineers rather than Data Scientists, but the lines are blurred and knowledge of both aspects is quite helpful.
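The monitoring idea above can be sketched very simply: save summary statistics of the data the model was trained on, then flag incoming batches that look different. The statistics, tolerance, and batches below are all illustrative assumptions; production systems use more sophisticated drift tests, but the shape is the same:

```python
import numpy as np

# Summary statistic saved at training time (assumed value)
training_mean = 50.0

def check_drift(batch, tolerance=5.0):
    """Flag a batch whose mean has moved more than `tolerance` from training."""
    return abs(np.mean(batch) - training_mean) > tolerance

rng = np.random.default_rng(7)
normal_batch = rng.normal(50, 10, size=500)    # looks like training data
shifted_batch = rng.normal(65, 10, size=500)   # e.g. an upstream pipeline changed

print(check_drift(normal_batch), check_drift(shifted_batch))
```

Catching a shifted input feed early is often the difference between a quiet fix and weeks of silently wrong predictions.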