
Mastering the Data Science Process: An Introduction

What is data science?

Definition

Data science is an interdisciplinary field that focuses on extracting information and insights from data to guide and inform decision making. It is a key part of the overall data lifecycle in an organization.

It solves business problems by combining domain knowledge, statistics, and computer science techniques. Typically, one would learn algorithms and the mathematical and computer science theories behind them, then apply them to a use case. You would learn to follow a standard process along the lines of data wrangling, cleaning, model building, and evaluation. In practice, though, you may find that extra steps are involved, or that some can be skipped. There is no one true, standard process of data science. It isn’t always the linear workflow that you may have learned in class. Data science is an iterative process, meaning you go back and forth between different stages of the project.

For example, suppose you have chosen a linear regression model to predict an outcome and the model is showing promising results, but there is a feature that you could drop, add, or engineer to boost its performance and accuracy. If there are stakeholders involved, you will find that the consultation process runs throughout the entire project. This is partly because your domain knowledge of the business problem may not be as deep as theirs, and their knowledge of the data and the data science process may be limited. Sometimes you may see a pattern they might have missed, and the only way to find out is to bring them the information for review to determine whether it is something important to keep in mind for the project, something they have already considered, or perhaps nothing at all.
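As a minimal sketch of that kind of iteration, the snippet below compares cross-validated scores before and after engineering a feature. The housing.csv file and its price, rooms, households, and median_income columns are hypothetical stand-ins; substitute your own data.

```python
# A minimal sketch of iterating on features for a linear regression model.
# The dataset and column names here are hypothetical.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("housing.csv")
y = df["price"]

# Baseline feature set
X_base = df[["rooms", "households", "median_income"]]
base_score = cross_val_score(LinearRegression(), X_base, y, cv=5).mean()

# Engineer a new feature, then re-evaluate with the same model
df["rooms_per_household"] = df["rooms"] / df["households"]
X_new = df[["rooms_per_household", "median_income"]]
new_score = cross_val_score(LinearRegression(), X_new, y, cv=5).mean()

print(f"baseline R^2: {base_score:.3f}, engineered R^2: {new_score:.3f}")
```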

To reiterate, data science is an iterative process, and once you get into processing, analyzing and modeling the data, you will see that unique steps are taken to transform the extracted information into actionable insights.

Oftentimes, stakeholders do not have in-depth knowledge of the data science process and give a prompt that is ambiguous. Managing this ambiguity is an important skill for a data scientist, allowing them to transform stakeholder requirements into something actionable.

One way to do this is to ask the right questions, such as:

  1. What does the current process look like?
  2. What are the pain points you are looking to fix?
  3. Who are the key customers?
  4. What does success look like and how do we measure it?
  5. What is the business problem we are aiming to solve?
  6. Can you give examples?

There is a plethora of questions to ask the stakeholders, yourself, and your team.

The general data science process

Generally, the data science process is a systematic approach to solving the business problem at hand. It is a structured framework designed for articulating the problem/question, determining how to solve it and presenting the solution to stakeholders.

The general data science life cycle is as follows:

  1. Problem definition
  2. Data collection and cleaning (a short pandas sketch of these checks follows this list)
    • Check for corrupted, invalid, or missing values
    • Ensure times, dates, and other data are in the correct formats
    • Check certain aggregates to ensure that the values make sense
  3. Exploratory analysis
  4. Model building
    • Communicate the results of the analysis
  5. Model deployment
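As referenced in step 2 above, here is a minimal sketch of those cleaning checks using pandas. The orders.csv file and its order_date and amount columns are hypothetical stand-ins for whatever data you are working with.

```python
# A minimal sketch of the cleaning checks from step 2; the orders.csv file
# and its column names (order_date, amount) are hypothetical.
import pandas as pd

df = pd.read_csv("orders.csv")

# Corrupted, invalid, or missing values
print(df.isna().sum())            # count missing values per column
df = df.drop_duplicates()         # drop exact duplicate records

# Times, dates, and other data in the correct formats
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Check aggregates to ensure the values make sense
print(df["amount"].describe())    # any negative or absurdly large amounts?
print(df.groupby(df["order_date"].dt.month)["amount"].sum())  # monthly totals
```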

Keep in mind that this is not a linear process; you should flow between the steps as necessary during the iterative cycle as you present artifacts and solicit feedback from stakeholders.

There is no one-size-fits-all process flow for data science projects. The steps you use will vary for each use case, as the data you are given varies from project to project. Sometimes you will get a nicely cleansed data set, and other times you may need to do additional collecting, cleaning, and preparation.

There are several different processes available to follow, and each suits specific use cases. The one I have personally used and seen the most is the Cross Industry Standard Process for Data Mining (CRISP-DM). It is a good starting point for understanding the general process flow because it closely resembles the general process outlined previously.

CRISP-DM

CRISP-DM is a model with six phases:

  1. Business understanding – what does the business need
    • The Business Understanding phase focuses on obtaining a deep understanding of the stakeholders’ needs and the objectives and requirements of the project. This phase has a few tasks:
      • Determine business objectives: Understand what the stakeholders are trying to accomplish, determine the project requirements, and define the success criteria.
      • Assess situation: Determine resource and code availability, assess risks and contingencies, and conduct a cost-benefit analysis.
      • Produce the project plan: Select technologies, resources and tools that are required to complete the project. Then determine the project plan for each phase.
  2. Data understanding – what data do we have? What do we need?
    • This phase focuses on identifying, collecting, and analyzing the data sets needed to accomplish the project goals. It also has a few tasks (see the first sketch after this list):
      • Acquire the necessary data
      • Verify data quality: How clean/dirty is the data?
      • Describe data: Examine the data properties like data format, number of records, or field identities, and generate summary statistics
      • Explore data: Analyze, visualize, and identify relationships in the data
  3. Data preparation – how do we organize the data for modeling?
    • Select data: Choose the data to use and document the reasons for inclusion or exclusion. Create new data sets by combining data from multiple sources if necessary.
    • Clean data: Correct, impute, or remove incorrect values and analyze outliers. Keep in mind the saying, garbage in – garbage out: if you put bad data into the model, you will get bad results.
    • Feature engineering: Create new attributes that may help explain the relationships in the data, such as deriving body mass index from height and weight fields (see the second sketch after this list).
  4. Modeling – which modeling techniques and algorithms should we apply?
    • Build and assess various models and algorithms (see the third sketch after this list). This phase has four tasks:
      • Determine the modeling techniques and algorithms to apply.
      • Split the data: Depending on the modeling approach, you may need to split the data into training, test, and validation sets.
      • Build models: Choose one to act as a baseline against which to compare results.
      • Assess models: Compare multiple models against each other and interpret the results.
    • Iterate the model building and assessment step until you have found the best models, then proceed through the CRISP-DM lifecycle to make model improvements.
  5. Model evaluation – which model best meets the business objectives?
    • This phase looks at which model best answers the business problem and what the next steps are. This phase has three tasks:
      • Evaluate results: Do the models meet the business success criteria?
      • Review process: Review the work completed. Was anything overlooked? Were all steps properly executed? Summarize findings and share results with the stakeholders, as well as other data science and domain experts for more input.
      • Next steps: Determine whether to deploy the solution, iterate again and make necessary changes, or initiate a new project.
  6. Deployment – how do stakeholders access and use the results?
    • Plan deployment: Develop and document a plan for deploying the model and model outputs.
    • Plan monitoring and maintenance: Develop a monitoring and maintenance plan to avoid issues during the operational phase of a model.
      • CRISP-DM does not have an outline for what to do after model deployment. But, generally, it is best practice to implement continuous monitoring and tuning of the model at planned intervals sometime after the project ends.
    • Final report: Summarize the project and final presentation of the results, and document what went well, what could be better, and how to improve in future projects.
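To ground these phases, here is the first of the three minimal sketches referenced in the list above, covering the Data Understanding tasks in pandas. The project_data.csv file is a hypothetical placeholder for your project's data.

```python
# A minimal sketch of the Data Understanding tasks; project_data.csv is
# a hypothetical placeholder for your project's data.
import pandas as pd

df = pd.read_csv("project_data.csv")

# Describe data: format, number of records, field identities, summary statistics
print(df.shape)       # number of records and fields
print(df.dtypes)      # field types
print(df.describe())  # summary statistics for numeric fields

# Verify data quality: how clean/dirty is the data?
print(df.isna().mean().sort_values(ascending=False))  # share missing per column

# Explore data: identify relationships between numeric fields
print(df.corr(numeric_only=True))
```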
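The second sketch covers the Data Preparation tasks, using the body mass index example from the text. The patients.csv file and the height_cm and weight_kg column names are hypothetical.

```python
# A minimal sketch of the Data Preparation tasks; the file and column
# names are hypothetical.
import pandas as pd

df = pd.read_csv("patients.csv")

# Clean data: drop rows with missing or impossible heights, impute missing weights
df = df[df["height_cm"] > 0]
df["weight_kg"] = df["weight_kg"].fillna(df["weight_kg"].median())

# Feature engineering: derive body mass index from height and weight
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2
```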
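The third sketch runs through Modeling and Evaluation: split the data, fit a trivial baseline alongside a candidate model, compare them on held-out data, and persist the winner for deployment. The dataset, columns, and model choices are illustrative assumptions, not a prescribed setup.

```python
# A minimal sketch of the Modeling and Evaluation phases; the dataset,
# columns, and model choices are hypothetical.
import joblib
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("patients.csv")
X, y = df[["bmi", "age"]], df["blood_pressure"]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Build models: a trivial baseline to compare against, plus a candidate model
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = RandomForestRegressor(random_state=42).fit(X_train, y_train)

# Assess models: compare them against each other and the success criteria
print("baseline MAE:", mean_absolute_error(y_test, baseline.predict(X_test)))
print("model MAE:   ", mean_absolute_error(y_test, model.predict(X_test)))

# Deployment: persist the chosen model for downstream use
joblib.dump(model, "model.joblib")
```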

Visual of the CRISP-DM data science process:

[Image: CRISP-DM Data Science Process Model, image by author]

There are plenty of other data science process flows out there to choose from such as OSEMN, Microsoft TDSP, and SEMMA. Understanding the components of these processes will help you plan the design of each project as many vary in size, objective and data.

Benefits, weaknesses, and recommendations

If CRISP-DM is the approach you are thinking of implementing, it is worth weighing its benefits and weaknesses, along with a few recommendations.

Benefits

  • Even though CRISP-DM was designed for data mining, all data science projects start with understanding the business problem, gathering data, cleaning, analyzing, and applying algorithms and techniques. This process provides guidance for the majority of data science projects one would come across.
  • Starting with the right scope and avoiding scope creep: focusing first on the business problem helps align the technical work with the business needs and keeps data scientists from starting the project without properly understanding the business objectives.
  • The final phase addresses important considerations for closing out the project and transitioning to deployment and maintenance.
  • A loose CRISP-DM implementation can be flexible enough to provide many of the benefits of agile principles and practices. The data scientist can iterate through the phases, each time gaining a deeper understanding of the data and the problem, and the empirical knowledge learned from previous iterations can feed into the following ones.

Weaknesses

  • Because one is expected to add documentation at every step in the process, it can become too documentation heavy and unnecessarily slow things down.
  • It is not applicable to every project. It works well for individuals or small teams, but it may not be suitable for larger projects, especially those involving big data.

Recommendations

  • Iterate, iterate, iterate. Don’t fall into a waterfall trap by working horizontally across layers of the project. Instead, think vertically and deliver small pieces of value. Your first deliverable might not be too useful. That’s okay. Iterate.
  • Because documentation can become heavy with this methodology, don’t overdo it. Only document the most important steps and the items that future owners will need to consider.
  • When to communicate with stakeholders is not explicitly outlined in the process. Lay out with the stakeholders beforehand when to meet to go over iterations so you can take their feedback and make adjustments accordingly.
  • CRISP-DM is not one-size-fits-all and is not suitable for every project. It may be helpful to combine it with other project management frameworks such as Kanban and Scrum.
  • While not mentioned in the steps above, planning for data quality, management, and governance throughout the process is also important during project planning.

Conclusion

You are not required to use any particular framework for the data science process, but it is best practice to have a structured plan so you can easily map out the steps for stakeholders. A structured plan also helps you stay on track and make progress toward a well-organized deliverable.

Questions & Comments

If you have any questions or comments, or just want to chat, please email us at: [email protected]