From the course: Learning Data Analytics: 1 Foundations

Working with flat files

From the course: Learning Data Analytics: 1 Foundations

Working with flat files

People are not always able to directly connect to the back end of a database system, and sometimes they're limited to exports of the data, or they receive files that are in Excel format or CSV. I would refer to these as flat files, and they're disconnected from the data. So what do I mean by flat or disconnected data? You're not directly connected to the system that houses that data, which means when the data in the system is updated, your data in that flat file is not updated. You would need another file to capture those data changes. CSV are common flat files. CSV stands for comma separated values. With a CSV, it's a delimited file and the comma serves as the delimiter. What that delimiter does is separates each of the columns from each other. There's also other types like tab delimited or fixed width, which isn't a delimiter but an amount of space. I've always had more CSV or tab delimited than any other style and by default CSV files open in Excel. All too often I see these individuals stuck just using XLS their data solution because that's what they receive and it's already in Excel so it makes total sense to stay in the tool but they may not be leveraging the tools in Excel like Power Query to support their efforts. You may receive a comprehensive list of information. It has everything you need to report on and then the next time you receive the file it has all of the information you already reported on plus any new data. You can then change the data out and update your file. You may run a series of macros whatever you need to do to get the necessary changes to the data into your file. I want to show you a way to leverage Power Query to handle this type of solution because to me it's a whole lot easier than doing a lot of copying and pasting. Okay, let's go take a look at our chapter five folder in the 0503 folder. You see two files here. They're both CSV files, Research Project March and Research Project March and April. So let me set us up. The very first file we ever received is the Research Project March. You notice the Excel icon has a little A there. This is actually a comma separated value file. Okay, so it's the first file we receive. And then next month, they send us a new comprehensive data file that has March and April. So if this was us and we didn't know Power Query, we might open this April file, we might pull all of the April data out and copy and paste it into March. There is definitely a better way to work with this type of data. Okay, let's go to a new blank workbook. Okay, let's go to my favorite tab of Excel, Data, and my favorite option, Get Data, and choose From File, From Text. I go to my Chapter 50503 folder, and I'll double-click that March file. Okay, so now I've connected my Excel file to my CSV. It reads the type of file and gives me some default information. And interesting, look, data type detection based on the first 200 rows. Okay, great. So on the bottom right, I'm going to choose Load. We'll talk about Transform later. Now the data comes from Power Query and is loaded to my sheet. Notice I have 126 rows loaded. I'm going to build a basic pivot just for demonstration purposes. I'll go to Summarize with Pivot. I'll click OK. I'll go ahead and drag the Respondent ID to Values and change that Sum to Account. It sees the Respondent ID as a number, which is why it summed it. Okay, I'll drag my Start Date to my Columns, and then I'll drag my current age to my rows. Okay, let me close my pivot table fields. So if you look at the pivot table, you now see each age of the respondent and when they responded to the survey. This is great. I can set it up, I can build more visuals, and then when April rolls around, they're going to send me March and April's data. Okay, so again, most people would go add the the April data to the bottom of the March data. There's such an easier way to do that. So let's take a look. I'm going to right-click my research project March. I'll choose edit. I'm in Power Query and on the right-hand side, I'll choose data source settings. It shows me my current source is the March CSV. I'll go change my source on the bottom left and then I'll browse. I'll choose my March and April data. Now, worth noting, these file structures are identical. I'll go ahead and choose Import, and then Okay, and then I'll choose close. I have 126 rows, which I can see on the bottom left of my screen. Let me refresh. Now, it's pulled in the April data. I have 480 rows of data. I'll go ahead and choose close and load. It shows me that it's refreshed over here with 480 rows. I'll go to my sheet 3 which has my pivot and I'll go ahead and refresh here. I'll go to my pivot table analyze and choose refresh. Okay great let me close my connections and you'll notice any new data that came in, and then also there's my April data. Okay, let's work through another scenario. Let's go ahead and open up a new blank workbook. Sometimes all you receive is the new data. So in your 0503 folder, do you see the research project data? If you open that up, you see date stamped data sets. This research project has different data sets that come in 323, 327, 330, and 401. These data sets only contain the newest data at that point. Okay, I'm going to tell Power Query to read the entire folder. That way if I add a new file to the folder, it's automatically brought into my data set. Let's go back to our new blank spreadsheet. All right, I'll go back to Data, I'll choose Get Data. I'll choose From File, and then I'll select From Folder. I'll go ahead and browse, and I'll navigate to my Chapter 5 folder in 0503. And I'll choose my Research Project Data, and click OK. And then I'll choose OK again. Okay this lets me see all of the files that are in that folder. Okay I'll go ahead and choose combine and I'll tell it to combine and load. Now it's going to sample everything based on the first file and again it's important to note that all of these files are structured with the same headings in each file. All right I'll go ahead and click OK, and then you'll notice that the research project data has 246 rows loaded. And then if you look in the source name, you're going to notice that it has the actual source name with the .csv. If I hit that drop down, I can see all four of those files from that folder were brought into my data set. This means when I get, let's just say, the 4.15 2020 file, if I dump it into that folder, I can come directly to my spreadsheet, choose refresh, and I'll have all that data included. No matter how you work with flat files, you can leverage the data tools that you have right inside of Excel to make it an easier process.

Contents