Introduction
Data Cleaning
Data Cleaning is the task of cleaning up raw data to make it usable and ready for analysis.
Normalization: Standardizing or “cleaning up a column” by transforming it in some way to make it ready for analysis.
Strategy
What data do you need?: Review the problem you are solving. Review what data you need to run an analysis and solve the problem at hand.
What data do you have?: Understand the data types across the dataset. Take stock of not only the information you have in your dataset today but what data types those fields are. Do these align with your data needs?
How will you clean your data?: Manipulate an existing column or create new ones. Build a game plan of how you’ll convert the data you currently have to the data you need. What types of actions and data cleaning techniques will you have to apply? Do you have the skills you need to go from the current to future state?
How will you analyze your data?: Now, it’s game time! How do you run an effective analysis? Build an approach for analysis, as well. And visualize your plan to solve the problem. Finally, remember to question “so what?” at the end of your results, which will help drive recommendations for your organization.
Methods
The following set of methods cover three types of data cleaning techniques:
parsing information,
Substr: Extracts a substring from a string (starting at any position)
Right: Extracts a # of characters from a string starting from the right
Left: Extracts a # of characters from a string starting from the left
returning where information lives, and
Coalesce: Returns the first non-null value in a list
Strpos: Returns the position of a substring within a string
Position: Returns the position of the first occurrence of a substring in a string
changing the data type of the information.
Cast: Converts a value of any type into a specific, different data type
Concat: Adds two or more expressions together
Last updated