From the course: Complete Guide to Google BigQuery for Data and ML Engineers
Introduction to machine learning - BigQuery Tutorial
- Okay, in this lesson we're going to take a more in-depth look at the discipline of machine learning. Now, there are a few different types of machine learning. We have supervised learning, where we basically want the machine to learn from labeled examples. That's in contrast to unsupervised learning, where we don't have any labeled examples and we want the machine to draw insights from the data without looking at labels. A third and important type of machine learning is known as reinforcement learning, where agents, or learning mechanisms, learn from feedback from the environment. Reinforcement learning is particularly important in training large language models, but it's not something we do very often when we talk about building machine learning systems in enterprises. Typically, when we're talking about machine learning in enterprises, we're using either supervised or unsupervised machine learning. With supervised learning, we train with labeled datasets, in which each data point has a known target value. So for example, if we're training a model to detect whether images contain a dog, then the target value would be either true or false, depending on whether or not there is a dog in the image. So basically, we provide labeled training data, and the model learns patterns or characteristics of cases where the label has one value and tries to discriminate those from cases where the label has a different value. Now, there are multiple algorithms that we use for supervised learning. Some really popular ones are logistic regression and random forests. Unsupervised learning, by contrast, trains with datasets that don't have an attribute explicitly labeled as the target, or something that we're trying to learn. Rather than trying to learn a label, what we're trying to do instead is group the data in different ways. 
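To make the supervised learning idea concrete, here is a minimal sketch using scikit-learn (a library not covered in the course itself; the data points are made up): we fit a logistic regression classifier on a small labeled dataset and then predict labels for unseen points.

```python
# Supervised learning sketch: train on labeled examples, predict on new data.
from sklearn.linear_model import LogisticRegression

# Labeled training data: each point has two features and a known target
# value (1 = "dog", 0 = "not a dog", in the spirit of the image example).
X_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y_train = [0, 0, 1, 1]

model = LogisticRegression()
model.fit(X_train, y_train)  # learn patterns from the labeled examples

# Predict labels for points the model has never seen.
print(model.predict([[1.2, 1.9], [5.5, 8.5]]))  # → [0 1]
```

The key point is that `fit` sees both the features and the known target values; the model's job is to discriminate between the two label values.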
So for example, one family of algorithms is K-means, which tends to cluster groups of data points that are similar, or that you can think of spatially as being close together. There's also another algorithm known as principal component analysis, which groups the attributes that describe the most variability in the data so that we can reduce the number of dimensions, or attributes, that we actually look at; that's what PCA is useful for. Now, in addition to the predictive AI that we've just been describing, there's also generative AI. Generative AI is a subset of AI in which models are created to generate content. That content can be of different types: it could be text or image or audio, and it all depends on what the model was trained on. Oftentimes we build on large language models, which are trained primarily with text, and on foundation models. Foundation models are a more general classification, and they include some of the large language models, but the key distinction is that foundation models are used to solve multiple problems. So for example, a foundation model could be used for classifying documents, or identifying sentiment, or extracting key names and entities. Now, when we're working with generative AI, it's important to keep in mind the limitations we have to deal with. These include bias and inaccuracy, which typically depend on what data was used to train the model and how long it was trained. There are also problems with lack of reasoning capability. Many models that are currently available commercially, like Gemini or Claude, are actually incorporating more reasoning capabilities, but you just want to be aware of the limitations of the reasoning capabilities of any generative AI model you happen to be using. 
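The K-means idea mentioned above can be sketched in a few lines of scikit-learn (again, an illustrative example with synthetic points, not material from the course): notice there is no target column at all, only points that happen to form two spatial groups.

```python
# Unsupervised learning sketch: K-means groups nearby points into clusters
# without any labels being provided.
import numpy as np
from sklearn.cluster import KMeans

# No labels here -- just six points forming two obvious spatial groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # points in the same spatial group share a cluster id
```

K-means only sees the geometry of the data; which cluster gets id 0 versus 1 is arbitrary, but points that are close together end up in the same cluster.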
And then we also have to be aware of potential misuses, like fraud or copyright issues. Now, large language models are AI algorithms that use deep learning techniques to understand and generate text. Deep learning techniques are methods that use large neural networks with many layers; the "deep" in deep learning refers to the number of layers in a neural network, and there are many layers in large language models and other deep learning models. They're typically trained on massive datasets, so think of web-scale datasets, and they're used to perform natural language processing tasks like classifying documents or identifying the sentiment in, for example, customer comments: was the sentiment positive, neutral, or negative? We also use large language models for things like chatbots and conversational AI. And one of my favorite use cases, of course, is code generation. Now, within Google Cloud, there are multiple generative AI services. There is Gemini, which is really the foundation of many of the tools within Google Cloud; for example, the help within BigQuery that generates SQL for you is based on Gemini. There's also a service called Model Garden, where you have access to different large language models and other kinds of foundation models from within Google Cloud. So for example, if you're working in Google Cloud and you want to work with, say, a Llama model or some other model, you can probably find it in Model Garden and deploy it easily into Google Cloud. There's also AI Studio, as well as specialized tools like Vertex AI Search and Conversation. Now, with predictive AI, things like supervised learning and unsupervised learning, what we're trying to do, in the case of supervised learning, is learn the probabilities of labels for a given input. But with predictive AI, we're not generating new data. 
We use predictive AI for things like classification, so categorizing or grouping things, and regression for predicting values: how much is this house likely to sell for, based on what other houses have sold for? Another common unsupervised learning task is clustering: how can we group, for example, customers into segments that allow us to more precisely target and tailor our messages and our solicitations to those particular groups of customers? Now, generally, when we're doing predictive AI, we're using methods, algorithms, and datasets that are more computationally efficient than when we're building generative AI models, which can be really resource intensive. We use predictive AI for supervised learning tasks, with the goal of classification where the prediction is based on labeled data, but we can also use it for unsupervised tasks as well. Now, in Google Cloud and in BigQuery, we have access to a number of different kinds of predictive AI and different algorithms. For example, we can use linear regression for making predictions, logistic regression for classification, K-means clustering for grouping or clustering, and matrix factorization if we want to build recommendation engines. We also have principal component analysis, which I mentioned earlier, which is a type of dimensionality reduction. And then we also have access to time series forecasting. Now, there are different kinds of classification problems. There's binary classification; for example, is this a picture of a dog or is it not? There's multi-class classification, where you might have multiple classifications; so it might be that this is a picture of a dog, and it's a picture of an animal, and it's a picture of a vertebrate, for example. Those are multiple classes. Now, they don't necessarily have to be hierarchical, but that example was. We can also have multi-label classification. 
So for example, we could have a picture, and it could be a picture of a dog, and it could be a color picture or a black-and-white picture. So we might have multiple labels, some indicating the content and others indicating something about the representation itself. Now, there are different types of algorithms we use for classification. Logistic regression and decision trees are popular ones. Support vector machines and random forests are also widely used. Gradient boosting, especially the XGBoost algorithm, is really good in a lot of different problem spaces. So if you're starting out and you're not quite sure what to use, starting with XGBoost is probably going to be a good choice in many cases. Now, when we're doing regression, what we're trying to do is learn a relationship between dependent and independent variables; typically, the dependent variable is some numeric value that we want to be able to predict. We also want to be able to quantify the strength and direction of relationships when we're doing this. Different types of regression that we might use are things like linear regression, and there's simple or multiple linear regression. There's also non-linear regression, which is used, as the name implies, for relationships where you don't have a strict linear relationship. Now, with linear models, there's linear regression and there are generalized linear models. There are also tree-based methods like decision trees, random forests, and gradient boosting. So as you're getting deeper into regression, you might want to look into, say, when you would use generalized linear models versus linear models, but that gets into details that are beyond the scope of this particular course. You can also use neural networks for regression; that's particularly useful if you're dealing with non-linear relationships. 
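The house-price regression idea can be illustrated with a tiny scikit-learn example (the square-footage and price numbers are invented for illustration): we learn the relationship between an independent variable and a dependent numeric value, then predict for a new house.

```python
# Regression sketch: learn the relationship between square footage
# (independent variable) and sale price (dependent variable).
import numpy as np
from sklearn.linear_model import LinearRegression

sqft  = np.array([[1000], [1500], [2000], [2500]])      # independent variable
price = np.array([200_000, 300_000, 400_000, 500_000])  # dependent variable

reg = LinearRegression().fit(sqft, price)

# The learned coefficient quantifies strength and direction of the
# relationship: here, roughly 200 dollars per additional square foot.
print(reg.coef_[0])
print(reg.predict([[1800]]))  # → [360000.]
```

Because this toy data is perfectly linear, the fit recovers the exact relationship; real data would have noise, and the coefficient would be an estimate.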
Now, with unsupervised learning, we often break down or organize the kinds of problems that we solve. One type of problem is called clustering, which is basically: how do we group things together? K-means is really useful when you want to do things like customer segmentation and figure out the different types of customers you have. Sometimes, though, you want to work with hierarchies, and hierarchical clustering is useful for that. For example, if you wanted to build up a taxonomy of animals, learning from the examples in a particular dataset, you could use hierarchical clustering for that. Another type of clustering can be based on the density of data points, and DBSCAN is a good algorithm for that. Sometimes we want to group data based on how likely items are to, say, be purchased together; the Apriori algorithm is good for that. And then I mentioned dimensionality reduction earlier. That's really useful if you have a lot of dimensions in your data. You might have data with maybe hundreds of different attributes, and you really want to reduce it down to maybe 10 or 20 of the most important dimensions. Dimensionality reduction can help map those hundreds of dimensions into 10 or 20, or fewer, dimensions that capture a lot of the variability that the full set of hundreds of dimensions does.
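That hundreds-of-dimensions-down-to-a-handful idea can be sketched with PCA in scikit-learn. The dataset here is synthetic (100 correlated features generated from 5 underlying signals), chosen so that a few components really do capture almost all of the variability.

```python
# Dimensionality reduction sketch: map 100 correlated features down to
# 10 principal components while keeping most of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 5))    # 5 underlying signals per sample
mix  = rng.normal(size=(5, 100))
X = base @ mix                      # 200 samples x 100 correlated features

pca = PCA(n_components=10).fit(X)
X_small = pca.transform(X)

print(X_small.shape)                           # (200, 10)
print(pca.explained_variance_ratio_.sum())     # close to 1.0
```

Because the synthetic data only has 5 independent sources of variation, 10 components capture essentially all of its variance; with real data you would inspect `explained_variance_ratio_` to decide how many dimensions to keep.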