Wine Review Classification

There's so many wines out there today, and often time a trip to the grocery store to pick up a quick drink to go with dinner can take longer than expected. Even when you know the type of wine you like, there's so many different wineries, and let's not even start on the years. Even wanting to try something new can be daunting, what if you spend $15 dollars just to find out you can't stand more than one sip?! Luckily, there's sommeliers in the world... unluckily, they probably aren't hanging around your local grocery or liquor store. Classification beyond the wine types could prove valuable in this case! Using machine learning and a hefty data source of sommelier's reviews, I plan to perform LSA text classification to find groups of wine, possibly with more in common than just being the same type! To watch a video presentation of the below topic, please go here.

Step 1

Exploratory Data Analysis

Before we start LSA, it can be valuable to begin with an exploration of the dataset being used, even if parts of the dataset won’t be used in the LSA process itself. For full details, see the process here. Our data looks fairly good, and with some changes to the headers and data types, we are able to dive in.

There are lots of countries represented in this list of wine reviews!

There’s some clear leaders when it comes to countries of origin in this dataset, the US, France, and Italy. We can focus on those as we dive into this large dataset (97,850 reviews).

Overall, we can see some interesting trends on numeric values.

And those are echoed in the top 3 countries as well.

Have you ever wondered if paying more for a bottle of wine means you’ll get a better quality of wine? Me too! There are no obvious visual correlations going on and running a Pearson’s Correlation test confirms that there isn’t a correlation in the dataset as a whole, as the resultant value is .390. Pearson’s tests indicate a full correlation at 1 or -1, with 0 being no correlation. The standard of .5 or -.5 is the standard for a notable correlation (Brownlee, 2018).

An interesting further use of this dataset might be examining whether this correlation is the same at all price points, or if certain price ranges are more likely to be worth your money.

Step 2

Text Preparation

Before we can begin the LSA process, we need to complete some additional process-specific cleansing on the review text. You can see the full code for this section here. Our first step is removing the non-alphabetical characters.

Before:

After:

Next, we can remove words that are three characters in length or less.

We then want to put everything in lowercase letters, so that our later analysis doesn’t think Ripe and ripe are two different words.

You could begin the LSA process from here, but initial attempts revealed that there are a lot of wine specific words, such as the names of the types, used within the review text. These are proper nouns in this case, and could unduly influence the topics by creating them by type more so than the other words within the review. Here’s the list of words we’ve removed as a result.

Step 3

Perform LSA by vectorization and SVD

With all of preparation completed, we can get to the main event, Latent Semantic Analysis, or LSA. You can see the full code here. Our first step is creating the vectorizer, which will create a table of ids by words. We are going to limit this to the top 1000 used words as there are a lot in this data set, but that isn’t necessary for smaller data sets or if computing power is not a concern. We are also going to complete a traditional stop word removal at this point, as words such as “this”, “that”, “from”, etc., won’t be very helpful when creating topics!

Before we create our model, since we are using the Truncated SVD package from sklearn, we need to define the number of topics we want the model to generate. While some datasets might have an obvious number of topics, ours doesn’t really! You might think the wine variety could be a good place to start, but our dataset is large there 632 distinct groups… while we will end up using a rather large number of topics, 632 is still quite unwieldly. We want our topics to account for as much of the variance as possible, though some recommend that covering 50% of the variance is a good standard (LSA, 2015). We can create some initial models and fit them with different component sizes in order to asses the number of topics we want to create. It’s best to start big here so that we can pare back from there.

We will aim for at least the 50% mark for this large of a dataset, which occurs at 219 topics.

We can round up a little to create an even number of groups, and use a final component size of 250, which explains 54% of the variance.

Now let’s take a look at some of the topics! You can see a dashboard here for the full results, and below is a small sample.

We can also tie this back to our original dataset (George, 2020b).

To explore these many topics further, or start looking for a new wine to try, check out this interactive dashboard here. Variety seems to play little roll within these groups surprisingly enough, my favorite variety of Sauvignon Blanc occurs in over 40 groups!

There’s a lot of potential reasons for this, which gives us many ideas for refining this LSA project in the future.

There are a lot of groups, and one of the most obvious methods to pare this down is to pare down the dataset. This could easily be done by price, rating, wine variety, country of origin, or so on.
Another way to reduce the groups would be to optimize the number of words our model has to analyze. It’s clear that after looking over the existing topics for awhile that there are more stop words that are highly specific to a wine review that we could remove. We could also lemmatize the words, so that words such as smoke and smoky are considered the same.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
Images		Images
LSA_Result_Data		LSA_Result_Data
1_WineClassification_CleansingandEDA.ipynb		1_WineClassification_CleansingandEDA.ipynb
2_WineClassification_TextPreparation.ipynb		2_WineClassification_TextPreparation.ipynb
3_WineClassification_LSA.ipynb		3_WineClassification_LSA.ipynb
InteractiveDashboard.md		InteractiveDashboard.md
README.md		README.md
References.docx		References.docx
VideoPresentation.md		VideoPresentation.md
WINE CLASSIFICATION.ipynb		WINE CLASSIFICATION.ipynb
Wine.csv		Wine.csv
WineClassificationPresentation.pptx		WineClassificationPresentation.pptx
WineClean.csv		WineClean.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Wine Review Classification

Step 1

Exploratory Data Analysis

Step 2

Text Preparation

Step 3

Perform LSA by vectorization and SVD

About

Uh oh!

Releases

Packages

Languages

LBusalacchi/msds696

Folders and files

Latest commit

History

Repository files navigation

Wine Review Classification

Step 1

Exploratory Data Analysis

Step 2

Text Preparation

Step 3

Perform LSA by vectorization and SVD

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages