From the course: Complete Guide to Google BigQuery for Data and ML Engineers
Batch data ingestion - BigQuery Tutorial
- Okay, let's work on batch data ingestion. Now, when we think about data ingestion, we want to immediately think about formats: what kind of data can we load into BigQuery? Well, BigQuery supports a wide array of data formats. We can have comma- or character-separated values (CSV), JSON files, Avro files, Parquet, of course, which is a columnar format, ORC, Cloud Datastore backup format, and Iceberg. Iceberg is a format for high-performance analytic tables, and one of the nice things about it is that it can be used across many different kinds of analytic tools. So these are some of the data formats that are supported for batch ingestion in BigQuery.

Now, sometimes you have large volumes of data that you want to move from different sources. For that, there's something known as the BigQuery Data Transfer Service, and it's really useful if you want to move data from object storage services like Google Cloud Storage, Amazon S3, or Microsoft Azure Blob Storage. The Transfer Service also works with SaaS sources like Google Ads and Facebook Ads, Google Merchant Center, Google Play, Oracle, Salesforce, ServiceNow, and YouTube, both channel and content owner data. When we work with the Transfer Service, we work with configurations where we define things like the source, the destination, how we want to append the data, what file format we're using, and file format options such as the field delimiter, whether we should skip a header row, and whether there's a reference file. We also define which service account to use, and, if we want to be notified about transfer runs, where to send an email notification or publish a message to a Pub/Sub topic. We can do all of these things with the BigQuery Data Transfer Service.

Now, BigQuery also supports federated querying, or querying other data stores. This is the ability to query data that's external to BigQuery. There are different types of sources that we can use for federated queries. We can use Cloud SQL, which supports PostgreSQL, MySQL, and SQL Server, as well as AlloyDB, which is Google's scalable, PostgreSQL-compatible database. You may have data stored in Cloud SQL, for example reference tables or lookup tables that are constantly changing, but you want access to that reference data in BigQuery. Rather than load it into BigQuery, you can use a federated query to read that data while it sits in Cloud SQL. Similarly, you could use Cloud Spanner, which is Google's globally distributed relational database. There are some geographic region restrictions, so you'll want to check those. When you use BigQuery federated queries, you're using service accounts that act on your behalf to query the database and return the results to BigQuery, so you're going to have to think about configuring those. You'll also create a connection between BigQuery and your database, and you'll use a function called EXTERNAL_QUERY to actually execute the SQL statement that generates the results and returns them to you.
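Now, just to give you a feel for what a Transfer Service configuration might look like outside the console, here is a rough sketch using the bq command-line tool. The bucket path, table name, and the parameter keys for the google_cloud_storage data source are assumptions on my part, so check them against the Transfer Service documentation before relying on them.

```
# Sketch: a Data Transfer Service configuration that loads CSV files from a
# hypothetical Cloud Storage bucket into retail_dataset. The --params keys
# shown are assumptions for the google_cloud_storage data source; verify
# them in the Transfer Service documentation.
bq mk --transfer_config \
  --data_source=google_cloud_storage \
  --target_dataset=retail_dataset \
  --display_name="Daily product files" \
  --params='{
    "data_path_template": "gs://my-example-bucket/products/*.csv",
    "destination_table_name_template": "products",
    "file_format": "CSV",
    "field_delimiter": ",",
    "skip_leading_rows": "1"
  }'
```

Notice how the configuration captures the same choices we just talked about: the source, the destination table, the file format, the delimiter, and whether to skip a header row.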
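And here is what a federated query can look like in practice, again as a minimal sketch. It assumes you have already created a Cloud SQL connection; the connection name retail_lookup_conn and the category_lookup table are hypothetical.

```
# Sketch: a federated query against Cloud SQL through a hypothetical
# connection named retail_lookup_conn. EXTERNAL_QUERY sends the inner SQL
# statement to Cloud SQL and returns the result set to BigQuery.
bq query --use_legacy_sql=false "
SELECT *
FROM EXTERNAL_QUERY(
  'projects/bq-data-ml-engineering/locations/us-west1/connections/retail_lookup_conn',
  'SELECT category_code, category_label FROM category_lookup;')"
```

The inner statement runs in Cloud SQL, so the reference data never has to be loaded into BigQuery.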
Now we're going to take a look at how we can actually load data into BigQuery. So we're going to jump over into the BigQuery Console and get to work loading some data. Okay, what I've done is open a browser, and I'm going to navigate to Google Cloud. For that, I'll go to console.cloud.google.com.

What we'll see here is that I've already authenticated, so I don't need to log in, and I'm now at the main page for the Google Cloud Console. I'm working with a project called bq-data-ml-engineering, and that's where we're going to do all of the work for this course, so all of my resources will be in this project. Now, I'm interested in working with BigQuery, so I'm going to navigate to BigQuery, and that opens up the main BigQuery Console.

Here in the Explorer, at the top of our hierarchy, we have bq-data-ml-engineering, which is our project. I can drill down and see different resources, for example, Notebooks and Queries. I'm particularly interested in datasets. I have a couple of datasets created that we will use later on, but in this lesson I want to focus on how we can go about creating a dataset and batch loading data into it. So I'm going to click the three-dot menu next to the bq-data-ml-engineering project, and I'm going to select the option to create a dataset.

Here, I'm going to specify a Dataset ID. For this, I'm going to load a few tables that are related to retail, so I'm going to call it retail_dataset. Now, when we store data, we store it in one or more Google Cloud regions. I'm going to choose a single region for storing the data. If I wanted high availability, or I wanted users in multiple regions to get a similar level of performance, I might choose multi-region; with multi-region, BigQuery stores multiple copies of my data in different regions and keeps them all in sync, which is really useful in many use cases. From a learning perspective, though, a single region is perfectly sufficient. I'm in the US and I like to work in us-west1, so I will pick us-west1, and I will create the dataset now. And when I go back to the Explorer, I see I now have a retail_dataset, and there are no tables in it yet.
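By the way, if you prefer the command line, the same dataset can be created with the bq tool. This is a small sketch that mirrors the console steps we just walked through, using the same project, dataset name, and region.

```
# Sketch: create the retail_dataset dataset in us-west1 with the bq CLI,
# equivalent to the Create dataset dialog in the console.
bq mk --dataset \
  --location=us-west1 \
  bq-data-ml-engineering:retail_dataset
```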
So I'm going to go ahead and create a table, and, again, I'm going to use the menu for that. Now, there are different ways to create tables, and since I'm interested in batch loading, I'm going to create the table by uploading data. There are several ways you can pull data in, in kind of a batch mode, when you create a table: I can upload a file, I can work with Google Drive, or I can work with Google Bigtable, which is a NoSQL database. Bigtable is a really good option if you have large volumes of data that you need to read and write at very low latency and you're able to use a NoSQL model. If you're familiar with Apache Cassandra, Bigtable and Cassandra both model data as sparse, multidimensional sorted maps. We could also load data from Amazon S3 or Azure Blob Storage. I'm going to choose Upload, because that allows me to load from a file.

What we'll see here is that I have three CSV files: products, sales_transaction, and stores. I'm going to start with products and open that. The file is in CSV format, but you can see we have different format options, so I'm going to just leave it at CSV. I'm going to use the bq-data-ml-engineering project and the retail_dataset, and I'm going to call this table products, so it mirrors the name of the file. The table type is Native table. For the schema, I'm going to use Auto detect. I could add fields manually if I wanted to define the schema by hand, but I rarely do that with BigQuery; I find Auto detect works really well. I'm working with small amounts of data, so I'm not going to choose partitioning, and I'm also not going to choose a clustering setting. At this point, I'm going to just go ahead and create the table.

What we see here is a table with product_id, product_name, category, price, and stock_quantity, and we can see some details about the table: we loaded it today, and it has 100 rows. If we preview the data to get a sense of what it looks like, it's like, "Okay, here we go. Here we have some data." This is a very simple products table from a synthetic dataset: a product_name, a category, a price, and a quantity in stock, so relatively simple data. So that's an example of how we can batch load using BigQuery's Explorer.

Now let's go ahead and create the tables for the other files that we have, and we'll select a file here. We uploaded products, so we'll now do sales_transaction, which is also in CSV format. We'll call the table sales_transaction. Did I spell that right? Yep. Native table, Auto detect, and we won't be partitioning or clustering here, so I'll go ahead and create that.

Let's go to the job history, where you can see that there is a job running. Basically, anytime you do any work in BigQuery, you're going to be creating a job, and it runs in the job environment. That is because BigQuery is a distributed system: it's a service running on many different compute resources, many different nodes. BigQuery uses an abstraction called a slot, which you can think of as a combination of some CPU resources and some memory resources. Those slots do different kinds of work, whether it's loading data or querying data, and they run on many different physical servers, possibly spread across different zones within a region, so it's highly distributed. Anytime we have a task like loading data or running a query, BigQuery will create a job, and then it'll go find a slot that's available and can do the work for us. That's why you see the message that a job is being created, and once the job is finished, you can go back to the table.

For our sales_transactions table, we have a transaction_id, store_id, and product_id, pretty simple, and we can see the details about the table here. It looks like we have a hundred rows there, so we can preview the data. This is, again, a simplified version of what a sales transactions table might actually look like, but you can see we have transaction_id, store_id, product_id, quantity, and so forth.

Then let's go ahead and load one more table. We'll create a table, use Upload one more time, the format is CSV, and the file we're going to upload is stores, so this has the stores data. We will put this into a table called stores, use Auto detect one more time, and follow the same pattern of no clustering and no partitioning. Then we'll see we have stores here, and if we double-click on that, we see our schema: a store_id, name, location, city, state, and zip_code. Let's just jump to the preview: we have five stores, so relatively simple data.
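Incidentally, you can keep an eye on those load jobs and peek at the loaded tables from the command line as well. Here is a small sketch; the table names match what we just created in the console.

```
# Sketch: inspect recent jobs and preview loaded data with the bq CLI.

# List the most recent jobs in the project; load jobs show up here too.
bq ls --jobs=true --max_results=5

# Show table metadata, such as the schema and row count, for products.
bq show bq-data-ml-engineering:retail_dataset.products

# Preview the first few rows, similar to the Preview tab in the console.
bq head -n 5 bq-data-ml-engineering:retail_dataset.products
```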
So that's how we can go ahead and load data in batch mode. Now, if you prefer to work from the command line, BigQuery has a command-line utility called bq, and you can load data with bq as well. It's very similar to what we just did: you specify the dataset and the table that you want to load into, along with the file and its format. But that's basically how you can go about batch loading data into BigQuery.
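As a rough sketch of that bq workflow, using the same dataset and products file we loaded through the console, a load command might look like this; the local file path is hypothetical.

```
# Sketch: batch load a local CSV file into retail_dataset.products with the
# bq CLI, mirroring the console upload. The local path is hypothetical.
bq load \
  --source_format=CSV \
  --autodetect \
  --skip_leading_rows=1 \
  bq-data-ml-engineering:retail_dataset.products \
  ./products.csv
```

Either way, whether you use the console upload or bq load, you end up with the same batch-loaded table.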