This is a guest blog post from Colaberry, a Premier sponsor of the StampedeCon Artificial Intelligence Conference 2017 in St. Louis on October 17.
Every organization and every individual data scientist performs a set of tasks in order to run predictions on input datasets. At Colaberry we have extensive experience working with data science and predictive analytics pipelines. As part of our effort to share our experience and expertise with our clients and the data science community, we created a learn-by-doing data science platform, refactored.ai. On the platform, there are various paths that one can follow to learn how to analyze various types of datasets and create predictive analytics pipelines for their organizations. Python is used for all learning, in a Jupyter notebook environment. If you are new to Python, you can start learning the Python that is relevant to data science at: https://refactored.ai/path/python/.
In this blog post, we discuss data science and predictive analytics pipelines and point to where you can learn more on the refactored.ai platform.
A Data Science workflow or a pipeline refers to the standard activities that a Data Scientist performs from acquiring data to delivering final results using powerful visualizations.
Here are the important steps in the pipeline:
- Data Ingestion
- Identify Nature of Dataset
- Data Visualization
- Statistical Analysis
- Anomaly Detection
- Mapping Algorithm to the Dataset
- Problem Identification
- Model Validation and Fine Tuning
- Model Building Using Machine Learning Algorithms
- Scaling and Big Data
The pipeline can be explained with the help of the diagram as shown below:
1. Data Ingestion
Acquiring data is the first step in the pipeline. This involves working with Data Engineers and Infrastructure Engineers to acquire data in a structured format such as JSON, CSV, or text. Data Engineers are expected to provide the data to the Data Scientists in a known format. This involves parsing the data and pushing it to a SQL database or another format that is easy to work with, applying a schema that is either already known or can be inferred from the original data. When the original data is unstructured, it needs to be cleaned and the relevant data extracted from it. This typically involves a regular expression parser or other parsing methods, such as Perl and Unix scripts or a language of your choice.
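As a minimal sketch of the regular-expression cleaning step described above, the snippet below turns raw text lines into a pandas DataFrame. The log format, the field names and the `parse_lines` helper are assumptions made up for illustration; real source data will need its own pattern.

```python
import re
import pandas as pd

# Hypothetical raw lines; this format is an assumption for illustration only.
RAW_LINES = [
    "2017-10-17 09:15:02 INFO user=alice action=login",
    "corrupted line without structure",
    "2017-10-17 09:16:40 WARN user=bob action=upload",
]

# Named groups double as the schema (column names) of the structured output.
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) user=(?P<user>\w+) action=(?P<action>\w+)$"
)


def parse_lines(lines):
    """Return a DataFrame of the lines matching LINE_PATTERN; skip the rest."""
    records = []
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match:
            records.append(match.groupdict())
    frame = pd.DataFrame(records, columns=["date", "time", "level", "user", "action"])
    # Combine the date and time strings into a single typed timestamp column.
    frame["timestamp"] = pd.to_datetime(frame["date"] + " " + frame["time"])
    return frame.drop(columns=["date", "time"])


events = parse_lines(RAW_LINES)
print(events)
```

Lines that do not match the pattern are dropped silently here; in practice you may prefer to log or count them so that data loss is visible.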
An example of acquiring data is shown for the “Women in STEM” dataset tutorial at: https://refactored.ai/path/data-acquisition/. This dataset provides information about various college majors that women are graduating from.
To understand more about data ingestion, you can follow our Junior Data Scientist track at: https://refactored.ai/path/data-analyst/
2. Identify the Nature of the Dataset
Identifying the nature of the dataset is the second step in the pipeline. At a high level, datasets can be classified as linearly separable, linearly inseparable, convex, or non-convex. Linear separability refers to datasets in which a linear hyperplane, or decision boundary, classifies the data with good accuracy. Convexity refers to datasets in which every line segment joining two points in the dataset lies within the dataset. This identification is typically the foundation for deciding what type of data modeling can be done and which machine learning algorithms can potentially be applied to analyze the data and perform predictive analytics.
Here are a few visual examples of what kinds of data we may encounter:
It is helpful to analyze the type of dataset and its features, such as linear separability, convexity and sparsity. Such characteristics help identify the nature of the dataset so that we can apply relevant algorithms later in the pipeline. The process of identifying the nature of data is described in the data intelligence conference workshop at https://refactored.ai/path/data-intelligence/
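One rough way to probe linear separability is to fit a linear classifier and look at its training accuracy. The sketch below does this on two synthetic datasets; the datasets, the `linear_fit_accuracy` helper and the choice of a linear SVM are assumptions for illustration, not a method prescribed by the workshop.

```python
from sklearn.datasets import make_blobs, make_circles
from sklearn.svm import SVC


def linear_fit_accuracy(X, y):
    """Training accuracy of a linear SVM; a value near 1.0 suggests the
    classes can be split by a single hyperplane."""
    model = SVC(kernel="linear", C=1.0)
    model.fit(X, y)
    return model.score(X, y)


# Two well-separated clusters: expected to be linearly separable.
X_blobs, y_blobs = make_blobs(
    n_samples=200, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=0
)
# Concentric circles: no straight line separates the inner class from the outer.
X_circ, y_circ = make_circles(n_samples=200, noise=0.05, factor=0.3, random_state=0)

blobs_acc = linear_fit_accuracy(X_blobs, y_blobs)
circles_acc = linear_fit_accuracy(X_circ, y_circ)
print(f"blobs: {blobs_acc:.3f}, circles: {circles_acc:.3f}")
```

A low score on the circles data is a hint that a non-linear model or a feature transformation will be needed later in the pipeline.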
3. Exploratory Data Analysis (EDA)
Exploratory Data Analysis is the third step, which involves looking at statistics and visualizations generated from various dimensions of the dataset. Core EDA activities include anomaly detection, statistical analysis, data visualization, clustering and cleaning.
Anomaly detection may involve simple statistical anomalies or complex anomalies. For example, suppose a dataset has a column of social security numbers in which a few entries are all zeros (000-00-0000). Such incorrect data could cause problems when applying machine learning techniques. We can identify these spurious numbers by looking at statistics such as frequency counts: a histogram will show them occurring far more often than any genuine value, signaling that spurious numbers are present. Looking at the mean, median, variance and other statistical measures also conveys information about the characteristics of the data. Anomalies in a time-series graph of a user's credit card activity could signal fraudulent activity. There are other ways of reading graphs to identify anomalies and spurious data as well. The detected anomalies can then be put through a cleaning process to get the data ready for further processing. You can learn many EDA methods by following our Data Science Track at: https://refactored.ai/path/data-science/.
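The frequency-count idea for the social security number example can be sketched with pandas as follows. The values are made up for illustration, and the rule "any repeated identifier is suspicious" is an assumption that fits this toy column rather than a general-purpose detector.

```python
import pandas as pd

# Made-up identifiers for illustration; not real data.
ssn = pd.Series(
    [
        "123-45-6789", "000-00-0000", "987-65-4321", "000-00-0000",
        "555-12-3456", "000-00-0000", "321-54-9876",
    ],
    name="ssn",
)

# Frequency count of each distinct value in the column.
counts = ssn.value_counts()

# A genuine identifier should appear once, so repeated values stand out.
suspicious = counts[counts > 1]

# Replace suspicious values with a missing marker so later steps can handle them.
cleaned = ssn.mask(ssn.isin(suspicious.index))

print(suspicious)
print(cleaned)
```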
Let’s explore EDA through an example using the Titanic dataset:
Titanic Survivors example:
Kaggle hosts a famous Titanic dataset at https://www.kaggle.com/c/titanic. The challenge is about identifying the survivors among the people aboard the Titanic.
The train_data and test_data provided in the challenge can be loaded from a copy of the data on GitHub with the pandas read_csv command:

    import pandas as pd

    train_data = pd.read_csv("https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv")
    test_data = pd.read_csv("https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/test.csv")
The source data is missing some values in the Age column, which is referred to as sparsity. Instead of throwing away such sparse rows entirely, we can estimate the missing ages through statistical analysis, here by computing the mean age for each title found in the Name column, as shown below:
    # Mean age per title; regex=False makes the "." in each title match literally
    miss_est = train_data[train_data['Name'].str.contains('Miss. ', regex=False)].Age.mean()
    master_est = train_data[train_data['Name'].str.contains('Master. ', regex=False)].Age.mean()
    mrs_est = train_data[train_data['Name'].str.contains('Mrs. ', regex=False)].Age.mean()
    mr_est = train_data[train_data['Name'].str.contains('Mr. ', regex=False)].Age.mean()
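The per-title means above are only estimates; they still have to be written back into the missing Age cells. One possible way to do that is sketched below on a tiny made-up frame that mimics the Name and Age columns. The sample rows, the `Title` column and the title-extracting regular expression are assumptions for illustration, not the code used in the original notebook.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample shaped like the Titanic 'Name' and 'Age' columns.
df = pd.DataFrame(
    {
        "Name": [
            "Smith, Mr. John", "Smith, Mrs. Jane", "Doe, Miss. Ann",
            "Doe, Master. Tom", "Brown, Mr. Alan", "Brown, Miss. Eve",
        ],
        "Age": [40.0, 38.0, 20.0, 4.0, np.nan, np.nan],
    }
)

# Extract the title between the comma and the first period, e.g. "Mr" or "Miss".
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Mean age of each title group, aligned row by row (missing ages are ignored).
title_means = df.groupby("Title")["Age"].transform("mean")

# Fill only the missing ages; known ages are left untouched.
df["Age"] = df["Age"].fillna(title_means)

print(df[["Name", "Title", "Age"]])
```

Grouping by an extracted title avoids writing one line per title and handles any title that appears in the data, at the cost of depending on the "Surname, Title. Given names" layout.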
In this example, we can visualize the results of the analysis using violin plots. Violin plots are 2-D plots that show the distribution of a variable, and a split violin compares that distribution for two groups within a dataset. In the plot below, each violin is split in half and is asymmetric about the vertical axis. In the leftmost violin, you can see that most females survived: the peak is high around 1.0 (survival probability) with low variance (less uncertainty about the outcome). Hence, we can conclude that most women in class 1 survived compared to men.
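The comparison that the violin plot makes visually can also be checked numerically by averaging the survival flag per class and sex. The sketch below uses a few made-up rows with the Titanic column names Pclass, Sex and Survived; the values are illustrative only and are not taken from the real dataset.

```python
import pandas as pd

# Made-up rows using the Titanic column names; not real passenger data.
sample = pd.DataFrame(
    {
        "Pclass": [1, 1, 1, 1, 3, 3, 3, 3],
        "Sex": ["female", "female", "male", "male",
                "female", "female", "male", "male"],
        "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
    }
)

# The mean of a 0/1 flag is the survival rate within each (class, sex) group.
rates = sample.groupby(["Pclass", "Sex"])["Survived"].mean().unstack("Sex")
print(rates)
```

Applied to train_data, the same groupby gives one survival rate per class and sex, which is the quantity the two halves of each violin are centered around.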
More on data visualization and EDA can be found at: https://refactored.ai/user/rf/notebooks/1/dv.ipynb
4. Mapping Algorithm to the Dataset
This stage typically involves problem identification, modeling, model validation and fine-tuning. At this step of the pipeline, we associate one or more relevant algorithms with the dataset based on its nature, apply them, and measure their performance.
Problem identification involves identifying what type of problem we are dealing with, such as causal or non-causal (involving time as a feature or not), classification, prediction or anomaly detection. For example, a dataset containing time as a column is referred to as a time-series dataset, and by identifying it as such, we can pick from the class of time-series algorithms.
Modeling refers to applying machine learning models to the dataset. The models we have built need to be validated for performance and their parameters tuned; this stage is called hyperparameter tuning. The first run of mapping algorithms to the dataset will produce parameters that fit the model but are not necessarily optimal. To determine the optimal parameters, we need to apply tuning techniques. Commonly used techniques in linear regression include cross-validation and regularization. You can try an example of applying linear regression to the Boston housing dataset at: https://refactored.ai/user/rf/notebooks/10/reg-journey.ipynb
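To illustrate how cross-validation and regularization work together during tuning, here is a minimal sketch that selects the regularization strength of a ridge regression with 5-fold cross-validation. The synthetic dataset, the candidate alpha values and the choice of ridge regression are assumptions for illustration; the linked notebook may proceed differently.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for a real dataset.
X, y = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)

# Candidate regularization strengths; each is scored with 5-fold cross-validation.
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_mse = -search.best_score_  # scores are negated errors, so flip the sign
print(f"best alpha: {best_alpha}, cross-validated MSE: {best_mse:.2f}")
```

Because every candidate is scored on held-out folds, the selected alpha reflects how well the model generalizes rather than how closely it fits the training data.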
5. Model Building using Machine Learning Algorithms
It is necessary to build models from scratch when the existing algorithms in standard packages fail. This is when we refer to the literature to build custom machine learning models. More on machine learning is available on our ML Track: https://refactored.ai/path/machine-learning/
Often we encounter datasets whose classes are linearly inseparable. For linearly separable classes, it is easy to use an SVM to find a hyperplane that classifies the data. For linearly inseparable data, however, a transformation to a higher dimension can map the data to a space where it is linearly separable. One elegant way to perform such a mapping is by using functions called kernels.
Here we look at an example in two dimensions that, when mapped to three dimensions, can be classified with high accuracy.
First, let us create a 2-D circles dataset.
    from sklearn.datasets import make_circles
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score
    import seaborn as sns; sns.set()
    import matplotlib.pyplot as plt
    import numpy as np

    # Two noisy concentric circles, one per class
    X, y = make_circles(n_samples=500, random_state=20092017, noise=0.2, factor=0.2)

    # Scatter plot with one color per class
    plt.figure(figsize=(8,6))
    plt.scatter(X[y==0, 0], X[y==0, 1], color='red', alpha=0.5)
    plt.scatter(X[y==1, 0], X[y==1, 1], color='blue', alpha=0.5)
    plt.show()
When we fit a support vector machine (SVM) with a linear kernel and calculate the accuracy, we arrive at 0.644. However, if we apply a Radial Basis Function (RBF) kernel instead of the linear kernel, the accuracy goes up to 0.986. We can also apply a transformation function to convert the 2-D points into 3-D points by warping the space, as seen in the illustration below.
    # Add a third coordinate: the squared distance of each point from the origin
    Z = X[:, 0]**2 + X[:, 1]**2
    trans_X = np.c_[X, Z]

    svm = SVC(C=0.5, kernel='linear')
    svm.fit(trans_X, y)
    # Output of fit:
    # SVC(C=0.5, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, gamma='auto', kernel='linear', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)

    y_hat = svm.predict(trans_X)
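The three approaches discussed above (linear kernel, RBF kernel, and a linear kernel on the lifted 3-D points) can be compared side by side with a sketch like the one below. The `train_accuracy` helper and the use of C=0.5 for all three models are assumptions for illustration. Accuracy is measured on the training data, and the exact figures may differ from those quoted in this post depending on the scikit-learn version, since the default `gamma` of `SVC` has changed between releases.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Same circles dataset as in the example above.
X, y = make_circles(n_samples=500, random_state=20092017, noise=0.2, factor=0.2)


def train_accuracy(features, labels, kernel):
    """Fit an SVM with the given kernel and return its training accuracy."""
    model = SVC(C=0.5, kernel=kernel)
    model.fit(features, labels)
    return accuracy_score(labels, model.predict(features))


# Linear and RBF kernels on the original 2-D points.
linear_acc = train_accuracy(X, y, "linear")
rbf_acc = train_accuracy(X, y, "rbf")

# Linear kernel on 3-D points lifted with z = x1^2 + x2^2.
Z = X[:, 0] ** 2 + X[:, 1] ** 2
lifted_acc = train_accuracy(np.c_[X, Z], y, "linear")

print(f"linear: {linear_acc:.3f}, rbf: {rbf_acc:.3f}, lifted linear: {lifted_acc:.3f}")
```

The lifted model uses only a linear kernel, yet the extra squared-radius coordinate lets a flat plane separate the inner circle from the outer one.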
Understanding and applying Machine Learning needs a thorough understanding of linear algebra and probability. You can learn them through visualizations at:
6. Scaling and Big Data
Models that perform well on small datasets might not do so on large datasets due to the variance present in the dataset. Hence, working with big data and scaling up the algorithms is a challenge. Models are initially validated with small datasets before working with big data. The popular technology stacks for working with large datasets are Hadoop and Spark. For prediction on smaller datasets, the pandas, scikit-learn and NumPy libraries are used, and for large datasets, Spark MLlib is used.
Colaberry is a data science consulting and training company. Refactored.ai is a data science learn-by-doing platform created by data scientists, data engineers and machine learning specialists at Colaberry. Contact Colaberry at email@example.com
Ram Katamaraja is the founder and CEO of Colaberry and the architect of the Refactored.ai platform. He can be reached at firstname.lastname@example.org
Harish Krishnamurthy is chief data scientist at Colaberry and the primary content author of the Refactored.ai platform. He can be reached at email@example.com