Data Science — Building Blocks

Data Science Authority
3 min read · May 29, 2019

Data science is a multidisciplinary field that requires knowledge of mathematics, technology, and the problem domain.

Based on the business requirements, the types of analysis needed are:

  • Exploratory analysis is the process of analyzing a dataset to summarize it or get an overview of it. It is often done with visual methods, using libraries like matplotlib and d3.js or applications like Tableau (a short sketch follows this list).
  • Predictive analysis is the major branch of data science where models are created using existing data to make predictions on future or unknown data.
  • Prescriptive analysis is an extension of predictive analysis in the sense that it not only predicts what will happen but also suggests decision options to change the outcome.
  • Interpretative phenomenological analysis (IPA) is a qualitative approach used in psychological research.
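As a concrete illustration of exploratory analysis, here is a minimal Python sketch using pandas and matplotlib; the file name sales.csv and the revenue column are hypothetical placeholders for your own data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset; replace "sales.csv" and the column names with your own.
df = pd.read_csv("sales.csv")

# Summarize the dataset: shape, data types, and basic statistics.
print(df.shape)
print(df.dtypes)
print(df.describe())

# Visual overview: histogram of a numeric column.
df["revenue"].hist(bins=30)
plt.xlabel("revenue")
plt.ylabel("count")
plt.title("Distribution of revenue")
plt.show()
```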

Tools/Products

  • Visualization: For exploratory analysis, Tableau is a popular tool for creating interactive data visualizations, and D3.js is an open-source JavaScript library for creating visualizations inside web pages.
  • Programming Languages: Python and R are the languages most widely used by data scientists. Python is useful for building end-to-end products since it can also be used to create websites, while R is preferred for research purposes.
  • For dealing with large amounts of data, open-source big data tools like Spark, Hive, and Hadoop are useful (a brief sketch follows this list).
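To give a flavour of how these big data tools are used, below is a minimal PySpark sketch; the file events.csv and the user_id column are assumptions for illustration only.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (use a cluster master URL in production).
spark = SparkSession.builder.appName("example").getOrCreate()

# Hypothetical file and columns; adjust to your data.
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Aggregate events per user, a typical large-scale summary.
counts = df.groupBy("user_id").agg(F.count("*").alias("event_count"))
counts.show(10)

spark.stop()
```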

Data Science Life-cycle

Business Requirement

The first step is to define the objective: discuss with customers or stakeholders to identify the business problem and define the target metric for the project.

Collecting the data

The next step is to acquire the relevant data, either from direct sources such as internal analytics systems or from third-party sources if necessary. High-quality data is an important requirement for a data science project.
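The snippet below sketches both acquisition routes, pulling from a third-party API and loading a direct export; the endpoint URL and file name are hypothetical placeholders.

```python
import pandas as pd
import requests

# Hypothetical third-party API endpoint; replace with a real data source.
resp = requests.get("https://api.example.com/v1/orders", timeout=30)
resp.raise_for_status()
orders = pd.DataFrame(resp.json())

# Data exported directly from an internal analytics system, e.g. as CSV.
analytics = pd.read_csv("analytics_export.csv")

print(orders.head())
print(analytics.head())
```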

Understanding the data

Before training a model, it is important to explore the data first. Most production data has missing values and errors; these should be handled using domain knowledge and the available algorithms. The data may also need to be normalized and transformed for better model training.
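As a minimal sketch of this step, the following code inspects missing values, fills or drops them, and normalizes numeric columns; the dataset customers.csv and its column names are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset and columns; adjust names to your data.
df = pd.read_csv("customers.csv")

# Inspect missing values per column before deciding how to handle them.
print(df.isna().sum())

# Simple strategies: fill numeric gaps with the median, drop rows
# that are missing the target column entirely.
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["churned"])

# Normalize numeric features so they are on a comparable scale.
numeric_cols = ["age", "monthly_spend"]
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
```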

Creating a model

Out of all the columns available in the dataset, choosing the relevant ones is an important task; together with deriving new columns, this is commonly called feature engineering. It takes exploration of the data and domain expertise to decide which features to use for training the model.
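One simple way to explore which columns matter is to look at tree-based feature importances, as in the sketch below; the dataset and column names are hypothetical, and importances are only one of several possible selection criteria.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature matrix and target; column names are placeholders.
df = pd.read_csv("customers.csv")
X = df[["age", "monthly_spend", "tenure_months"]]
y = df["churned"]

# Fit a tree-based model and rank columns by their importance scores.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

for name, score in sorted(zip(X.columns, model.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.3f}")
```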

Based on the problem statement of the project, there are different types of models to choose from. The models can be compared against each other using metrics such as accuracy.

The model creation includes the following steps:

  • Split the data randomly into train, validation, and test sets. A commonly used approach is to use 50% to 70% of the data for training, 20% for validation, and 10% for testing, but this can vary based on the dataset.
  • Build the model using the training dataset, use the validation data to fine-tune hyperparameters, and retrain the model on the training data.
  • Evaluate the model: after the model is finalized using the training and validation datasets, evaluate its accuracy on the test data (see the sketch after this list).
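The following sketch walks through these three steps with scikit-learn; a bundled dataset stands in for project data, and the 70/20/10 split and logistic regression model are just one possible choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# A bundled dataset stands in for real project data here.
X, y = load_breast_cancer(return_X_y=True)

# 70% train, 20% validation, 10% test (one of the splits mentioned above).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=1 / 3, random_state=0)

# Try a few hyperparameter settings and keep the best on validation data.
best_model, best_val_acc = None, 0.0
for c in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=c, max_iter=5000).fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val_acc:
        best_model, best_val_acc = model, val_acc

# Final evaluation on data the model has never seen.
print("validation accuracy:", best_val_acc)
print("test accuracy:", accuracy_score(y_test, best_model.predict(X_test)))
```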

Deploying the model

Decide whether the accuracy of the model is sufficient for use in production. If not, try different models and collect more data if necessary. Once the model is finalized, deploy it to the web so that users can get predictions on their own data. APIs can also be used to serve predictions to other applications.
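A common way to expose a finalized model over the web is a small HTTP API. The Flask sketch below assumes the trained model was saved to a hypothetical model.pkl file during training; the /predict route and payload format are illustrative choices.

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical path to the finalized model saved during training.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [[1.2, 3.4, ...]]}.
    payload = request.get_json()
    prediction = model.predict(payload["features"]).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```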

Source: Data Science Authority | DSA Hyderabad
