- Course: Axel The Analyst - Cleaning Data in Excel | Excel Tutorials for Beginners
- Source file: https://www.youtube.com/watch?v=_jmiEGZ6PIY
- Cleaned file: https://1drv.ms/x/c/315a2e7b2c65925e/EZ1sU5PuxwJDv__HNrjtVloBFru5VaN7D2KIFiX85ABspA?e=PTKMbA
To get an overview how analyst uses Excel for data cleaning.
- Find if there are any duplicates
- Data -> Remove Duplicates.
- Cleaning the President column
- UPPER/LOWER
- PROPER - each word starts with a capital letter.
- Cleaning the Party column
- Removing blank fields
- Whig/Republican/Democrats - choosing one version or correcting spelling mistakes.
- Democrat - Republic - not familiar with history of USA - maybe this distinction is needed
- Cleaning the Vice column
- TRIM - cleans up the spaces before, between duplicate spaces, and after
- Cleaning the Salary column
- Since $ isn't useful in DB, changed the column to a number.
- Cleaning the Date columns
- Format to Short Date (dd.mm.yyyy).
- At the moment I kept all columns. But as keeping the source file and actual working file separatly - probably columns A, C, G, E won't be needed in cleaned file.
-- After using Google Sheets, I changed following:
- In Google Sheets, trimmed in the Prior column "Democratic- Republican". Correcting hint dissappeared, but it kept the space before "Republican". In Excel, I used CTRL+H and changed it to "Democratic-Republican".
- And there was also encoding problem in Prior column:
There are several solutions to do it. I exported the .csv file, saved it in VSCode as .xls. Then uploaded in the desktop version of Excel again. The program opened the Export Wizard:
- Choose the UTF-8 encoding.
- Check the checkmark to mark column headers are in file.
- And choose the correct delimiter. In my case ";".
- PROPER is an interesting finding. Upper/lower or something similar is used in different programming languages. But writing every word with first capital letter - you have to really search or know it because different wording is used e.g. capitalize vs proper.
- Finding and replacing strings can be very tricky. Column name or some part of word which does not need changing can be replaced without noticing. What fun to correct, yeah!
- The biggest lesson came actually after finishing tutorial. I uploaded it to Google Sheets and I later opening it, I saw next picture:
- First time I noticed this Cleanup section. I do not know what caused this pop-up, but it can also be found Data -> Data Cleanup.
- Columns E/G can be ignored, because these are orignal columns, but rows 29-30 got my attention (marked yellow on the pic). Why did not Excel remove one of them as a duplicate? A little bit digging and the answer was in column E: a spelling mistake in word "democratic" was the reason. As my flow was following: removing duplicates -> removing spelling mistakes -> Excel coudn't remove because the rows were different.
- The workflow should have been: remove duplicates -> manipulating strings (e.g. spelling mistakes, trimming, choosing on standard-word, replacing) -> remove duplicates once again .
- If the dataset is not big, use different tools to cross-check yourself.

