
Data Science Life Cycle: CRISP-DM and OSEMN frameworks

Summary

Two frameworks, CRISP-DM and OSEMN, are commonly used to describe the data science project life cycle at a high level. The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model with six phases that naturally describes the data science life cycle, while the OSEMN framework categorises the general workflow that data scientists typically perform.

How to use Data Science?

Data science has two frameworks that are generally used to describe the data science life cycle, the OSEMN framework and the CRISP-DM framework. Let’s have a look at them both.

Data Science Project Life Cycle in 6 phases – the CRISP-DM framework

The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model with six phases that naturally describes the data science life cycle. View it as a set of guidelines to help you set up, plan and bring your data science or machine learning project to life.

The CRISP-DM process has six steps: 

  1. Business understanding: What does the business need?
  2. Data understanding: What data do we have or need? Is it ready to use?
  3. Data preparation: How do we organise the data for modelling?
  4. Modelling: What modelling techniques should be applied?
  5. Evaluation: What type of model best meets the business objectives?
  6. Deployment: How do we share insights and give stakeholders access to the results?

These six steps can be illustrated as follows:

[Figure: Data Science Process CRISP-DM]

Let’s take a closer look at each of the six steps.

01. Business Understanding

This phase focuses on understanding the objectives and requirements of the project. It is important to determine the business objectives and define the business success criteria. In other words, what should the project try to achieve, and which measurements (KPIs) would indicate a successful project?

02. Data Understanding

The second phase focuses on identifying, collecting, and analysing the data sets used in the project. This includes collecting initial data, describing the data, exploring data, and verifying the data quality.
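
As a rough illustration of describing the data and verifying its quality, the sketch below profiles a small in-memory table using only the Python standard library. The records, the column names (`age`, `plan`) and the helper `profile_column` are made up for this example and do not come from any particular project or tool.

```python
from statistics import mean

# Hypothetical sample records; None marks a missing value.
records = [
    {"age": 34, "plan": "basic"},
    {"age": None, "plan": "premium"},
    {"age": 51, "plan": "basic"},
    {"age": 29, "plan": None},
]


def profile_column(rows, column):
    """Return simple descriptive and quality figures for one column."""
    values = [row[column] for row in rows]
    present = [v for v in values if v is not None]
    profile = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    # Numeric columns also get a range and a mean.
    if present and all(isinstance(v, (int, float)) for v in present):
        profile["min"] = min(present)
        profile["max"] = max(present)
        profile["mean"] = mean(present)
    return profile


age_profile = profile_column(records, "age")
plan_profile = profile_column(records, "plan")
print(age_profile)
print(plan_profile)
```

A profile like this is only a starting point, but it tends to surface missing values and suspicious ranges before any modelling starts.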

03. Data Preparation

Next up is preparing the data for modelling; this involves selecting and cleaning the relevant data, constructing and deriving new attributes that will be helpful, and integrating data from various sources. Basically, make sure the fuel (data) is ready to be used in your model.
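
The following sketch shows one way the preparation tasks named above (selecting, cleaning, deriving new attributes, integrating sources) might look in plain Python. The two data sources, the field names and the `prepare` function are hypothetical, and the reference year is an assumption passed in as a parameter.

```python
# Hypothetical sources: customer records and order totals keyed by customer id.
customers = [
    {"id": 1, "name": " Ada ", "signup_year": 2019},
    {"id": 2, "name": "Grace", "signup_year": None},  # incomplete record
    {"id": 3, "name": "Linus", "signup_year": 2021},
]
orders = {1: [120.0, 80.0], 3: [45.5]}


def prepare(customers, orders, current_year=2024):
    """Build one modelling-ready row per usable customer."""
    prepared = []
    for row in customers:
        # Clean: skip records that lack a required field.
        if row["signup_year"] is None:
            continue
        totals = orders.get(row["id"], [])
        prepared.append({
            "id": row["id"],
            "name": row["name"].strip(),  # clean stray whitespace
            "tenure_years": current_year - row["signup_year"],  # derived attribute
            "total_spent": sum(totals),  # integrated from the orders source
            "order_count": len(totals),
        })
    return prepared


model_input = prepare(customers, orders)
for row in model_input:
    print(row)
```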

04. Modelling

Next, build and assess the models. This is often where you get to write code and dig into the data set. Build and assess several models based on different modelling techniques. This means that you need to determine which algorithms to try, generate a test design, and divide the data into training, validation, and test sets.
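
Dividing the data into training, validation and test sets can be sketched as below. The 70/15/15 proportions, the fixed seed and the function name `split_dataset` are assumptions chosen for illustration; the right proportions depend on the project.

```python
import random


def split_dataset(rows, train=0.7, validation=0.15, seed=42):
    """Shuffle rows and split them into training, validation and test sets."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    train_end = round(len(shuffled) * train)
    validation_end = train_end + round(len(shuffled) * validation)
    return (
        shuffled[:train_end],
        shuffled[train_end:validation_end],
        shuffled[validation_end:],  # whatever remains becomes the test set
    )


rows = list(range(100))
train_set, validation_set, test_set = split_dataset(rows)
print(len(train_set), len(validation_set), len(test_set))
```

The validation set is used to compare candidate models, and the test set is held back until the end so that the final estimate is not influenced by model selection.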

05. Evaluation

The evaluation phase assesses which of the models best meets the business objectives and determines what to do next. This includes evaluating the results and reviewing the process to see whether the models meet the business success criteria and whether anything was overlooked. Finally, decide on the next steps and actions to take.
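
One simple way to check models against business success criteria is to compare each model's metrics with agreed minimum values, as in this sketch. The model names, the metric values and the thresholds are invented for the example.

```python
# Hypothetical validation results for two candidate models.
candidates = {
    "baseline": {"recall": 0.62, "precision": 0.80},
    "gradient_boosting": {"recall": 0.78, "precision": 0.74},
}

# Hypothetical success criteria agreed on during business understanding.
success_criteria = {"recall": 0.75, "precision": 0.70}


def meets_criteria(metrics, criteria):
    """Return True when every metric reaches its agreed minimum."""
    return all(metrics[name] >= minimum for name, minimum in criteria.items())


accepted = [
    name for name, metrics in candidates.items()
    if meets_criteria(metrics, success_criteria)
]
print(accepted)
```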

06. Deployment

In the final step, the results are prepared and delivered to the business or organisation. Depending on the requirements, the deployment phase can be as straightforward as producing a report or as complex as implementing a repeatable data mining process.

Data Science Project Life Cycle in 5 steps – OSEMN framework

In a 2010 post called “A Taxonomy of Data Science” on the dataists blog, Hilary Mason and Chris Wiggins introduced the OSEMN framework. It categorises the general workflow that data scientists typically perform, as a list of tasks a data scientist should be familiar with and comfortable working on.

The image below shows a summary of the five steps in the framework:

[Figure: Data Science Process OSEMN Framework]

Now, let’s walk through the five steps of the OSEMN framework in more detail.

Obtain – Data Collection

The first step is to obtain the data you need from the available data sources. It is very important to collect complete and reliable data. Data projects sometimes fail at this very first step, so pay attention to data quality and remember the saying “garbage in, garbage out” (GIGO). Obtaining data can also be one of the most time-consuming steps in the process, although it often comes second to step two, cleaning.
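
As a minimal sketch of obtaining data with completeness in mind, the example below parses CSV text with Python's `csv` module, fails early if an expected column is absent, and lists rows with empty fields. The CSV content, the column names and the `load_rows` helper are assumptions for illustration; a real source could just as well be a file, a database or an API.

```python
import csv
import io

# Hypothetical CSV export standing in for a real data source.
raw_csv = """customer_id,country,revenue
1,SE,120.50
2,NO,
3,DK,87.00
"""

REQUIRED_COLUMNS = {"customer_id", "country", "revenue"}


def load_rows(text):
    """Parse CSV text and fail early if a required column is absent."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)


rows = load_rows(raw_csv)
# Rows with at least one empty field are flagged for the cleaning step.
incomplete = [row for row in rows if any(value == "" for value in row.values())]
print(len(rows), "rows loaded,", len(incomplete), "incomplete")
```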

Scrub – Data Cleaning

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When merging multiple data sources, there are many chances for data to be duplicated or mislabeled.

If the data is incorrect, outcomes and algorithms are unreliable, even though they may look accurate. Although this is often the most time-consuming (and, for most people, the most boring) task, it is a crucial step.
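
The cleaning tasks described above (fixing formatting, removing incomplete rows, removing duplicates) could look roughly like this in plain Python. The `scrub` function and the `email` and `country` fields are hypothetical, and the rule that these two fields identify a duplicate is an assumption of the example.

```python
def scrub(rows):
    """Normalise formatting, drop incomplete rows, then drop duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        # Fix inconsistent formatting before comparing values.
        email = (row.get("email") or "").strip().lower()
        country = (row.get("country") or "").strip().upper()
        if not email or not country:  # incomplete record
            continue
        key = (email, country)
        if key in seen:  # duplicate of an earlier record
            continue
        seen.add(key)
        cleaned.append({"email": email, "country": country})
    return cleaned


raw = [
    {"email": "Ann@Example.com ", "country": "se"},
    {"email": "ann@example.com", "country": "SE"},  # duplicate after normalising
    {"email": "bob@example.com", "country": None},  # incomplete
    {"email": "cy@example.com", "country": "no"},
]
clean_rows = scrub(raw)
print(clean_rows)
```

Normalising before deduplicating matters here: the first two records only turn out to be duplicates once case and whitespace have been fixed.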

Explore – Data Analysis

This step is exploratory data analysis. It allows us to understand the data so that we can decide on a course of action and identify the areas to explore in the modelling phase.
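
A small example of exploratory analysis is comparing summary statistics across groups. The sketch below does this with the standard library; the observations, the `segment` and `spend` fields and the helper `summarise_by` are invented for illustration.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical cleaned observations.
observations = [
    {"segment": "new", "spend": 20.0},
    {"segment": "new", "spend": 35.0},
    {"segment": "returning", "spend": 80.0},
    {"segment": "returning", "spend": 95.0},
    {"segment": "returning", "spend": 110.0},
]


def summarise_by(rows, group_key, value_key):
    """Group rows by one field and summarise another field per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {
        group: {"n": len(values), "mean": mean(values), "median": median(values)}
        for group, values in groups.items()
    }


summary = summarise_by(observations, "segment", "spend")
for segment, stats in summary.items():
    print(segment, stats)
```

A clear difference between groups, as in this made-up data, would suggest that the grouping field is worth trying as a model input.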

Model – Modelling data

In the fourth step, we use machine learning techniques to make sense of the data and acquire important insights for data-driven decision-making. Many people would call this “where the magic happens”.

In short, regression predicts numeric values, such as a forecast of future sales, while classification assigns each observation to a category.
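
To make the difference between regression and classification concrete, here is a deliberately tiny sketch of each in plain Python: a least-squares line used to forecast a number, and a nearest-centroid rule used to assign a label. Both functions, the data and the labels are illustrative assumptions; real projects would normally rely on a machine learning library and far more data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (
        sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
        / sum((x - x_mean) ** 2 for x in xs)
    )
    return slope, y_mean - slope * x_mean


def nearest_centroid(train, point):
    """Classify a single number by the closest class mean."""
    centroids = {label: sum(vals) / len(vals) for label, vals in train.items()}
    return min(centroids, key=lambda label: abs(centroids[label] - point))


# Regression: forecast a numeric value for month 5 from months 1 to 4.
slope, intercept = fit_line([1, 2, 3, 4], [10.0, 12.0, 14.0, 16.0])
forecast = slope * 5 + intercept

# Classification: assign a new observation to one of two labelled groups.
label = nearest_centroid({"low": [5, 8, 10], "high": [40, 55, 60]}, 45)

print(forecast, label)
```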

Interpret – Model Deployment

In the final step, we try to make sense of the data by simplifying and summarising results from all the models built and by communicating our findings.

This requires reaching meaningful conclusions and justifying actionable findings, which helps you and your colleagues determine the next course of action.

FAQ: Data Science Life Cycle

What is the CRISP-DM framework?

The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model with six phases that naturally describes the data science life cycle. View it as a set of guidelines to help you set up, plan and bring your data science or machine learning project to life.

What are the six phases in the CRISP-DM framework?

The six phases in the CRISP-DM framework are:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modelling
5. Evaluation
6. Deployment

What is the OSEMN framework?

OSEMN is an acronym that stands for Obtain, Scrub, Explore, Model, and iNterpret. It is a list of tasks a data scientist should be familiar with and comfortable working on.

Eric J.

Meet Eric, the data "guru" behind Datarundown. When he's not crunching numbers, you can find him running marathons, playing video games, and trying to win the Fantasy Premier League using his prediction model (not going so well).

Eric is passionate about helping businesses make sense of their data and turning it into actionable insights. Follow along on Datarundown for all the latest insights and analysis from the data world.