Introduction

Data Cleaning

  • Data Cleaning is the task of cleaning up raw data to make it usable and ready for analysis.

Normalization: Standardizing or “cleaning up a column” by transforming it in some way to make it ready for analysis.

Strategy

  1. What data do you need?: Review the problem you are solving. Review what data you need to run an analysis and solve the problem at hand.

  2. What data do you have?: Understand the data types across the dataset. Take stock of not only the information you have in your dataset today but what data types those fields are. Do these align with your data needs?

  3. How will you clean your data?: Manipulate an existing column or create new ones. Build a game plan of how you’ll convert the data you currently have to the data you need. What types of actions and data cleaning techniques will you have to apply? Do you have the skills you need to go from the current to future state?

  4. How will you analyze your data?: Now, it’s game time! How do you run an effective analysis? Build an approach for analysis, as well. And visualize your plan to solve the problem. Finally, remember to question “so what?” at the end of your results, which will help drive recommendations for your organization.

Methods

The following set of methods cover three types of data cleaning techniques:

  • parsing information,

    • Substr: Extracts a substring from a string (starting at any position)

    • Right: Extracts a # of characters from a string starting from the right

    • Left: Extracts a # of characters from a string starting from the left

  • returning where information lives, and

    • Coalesce: Returns the first non-null value in a list

    • Strpos: Returns the position of a substring within a string

    • Position: Returns the position of the first occurrence of a substring in a string

  • changing the data type of the information.

    • Cast: Converts a value of any type into a specific, different data type

    • Concat: Adds two or more expressions together

LEFT(string, number_of_chars)
RIGHT(string, number_of_chars)
SUBSTR(string, start, number_of_chars)

CONCAT(string1, string2, string3)
STRING_SPLIT() Example
WITH table AS(
SELECT student_information, value, 
        ROW_NUMBER() OVER(PARTITION BY student_information ORDER BY (SELECT NULL)) AS row_number
FROM student_db
CROSS APPLY STRING_SPLIT(student_information, ',') AS back_values
)
SELECT student_information,
        [1] AS STUDENT_ID,
        [2] AS GENDER,
        [3] AS CITY,
        [4] AS GPA,
        [5] AS SALARY
FROM table
PIVOT(MAX(VALUE) FOR row_number IN([1],[2],[3],[4],[5])) AS PVT)

Last updated