Not Just Thinking in Red and White
The practice of wine tasting is almost as old as the history of wine itself. With the entire industry now built around wineries, wine sommeliers, and wine ratings, some wines are as expensive as $10,000 while others are known as “one-buck-chucks” found at the grocery store.
For the average layperson, the complexities of wine are often obfuscated in fancy labels and marketing tactics. We wanted to create a wine classifier which can provide the average person some insight into wine without actually having to go to Italy for the wine tasting!
For this project, we are using one dataset which comes from Kaggle, linked here. We initially tried scraping the Vivino website for complementary data but were not able to include the results in our final dataset. We explain later in the Challenges and Obstacles section the process we went through with the Vivino data.
The Kaggle data was already quite clean, but we added a few columns to help with the data analysis for the classifier:
- We added a column for the year the wine was produced by parsing the wine title, for example “Citation 2004 Pinot Noir (Oregon)”.
- We added a column for the grape color type based on the Description, Title, and Variety.
- We also added a column for price group, assigning each wine to a group based on its price. Wines under $25 were placed in price group 1, wines between $25 and $50 in price group 2, wines between $50 and $100 in price group 3, and wines between $100 and $150 in price group 4. Finally, wines above $150 were placed in price group 5.
- Finally, we added columns for Latitude and Longitude, the coordinates of the winery that produced each wine. We obtained these by processing the data in batches and using a geolocator package to look up the city of each winery.
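The price grouping in the list above can be sketched with Python's standard `bisect` module. This is only a minimal illustration: the name `price_group` is made up for this example, and sending a price that sits exactly on a boundary (say, $25) to the higher group is an assumption, since the description does not specify it.

```python
from bisect import bisect_right

# Upper edges of price groups 1 to 4; anything from $150 up lands in group 5.
PRICE_EDGES = [25, 50, 100, 150]

def price_group(price):
    """Map a price in dollars to a price group from 1 to 5.

    Assumption: a price exactly on an edge goes to the higher group.
    """
    return bisect_right(PRICE_EDGES, price) + 1

print([price_group(p) for p in (12, 25, 75, 120, 300)])  # [1, 2, 3, 4, 5]
```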
Exploratory Data Analysis
The charts below display some initial analysis that was done to learn some overall information about the wine data. We first plotted all the wineries on a map to see what the spread we were working with looked like. This map was created through the added feature column containing latitude and longitude coordinates of each winery.
We determined the top 10 countries for wine production as well as the top 10 types of wine. Within this dataset, it was clear that the United States was the largest wine-producing country. We could hypothesize this to be a result of the dataset origin (Wine Enthusiast), which was produced by a company headquartered in the United States, potentially skewing the results to be limited to largely US-available wine. Many of the wineries on Wine Enthusiast were based in Europe, but these wineries seemed to produce less wine or produce less recognized wine on Wine Enthusiast — even though there were more wineries concentrated in Europe, the United States still ranked first as the highest wine-producing country.
We also calculated the top wine varieties. Pinot Noir was the most common, with over 12,000 wines in the dataset. Chardonnay, Cabernet Sauvignon, and Red Blend were also quite common.
We used a violin plot to show the price range of the wines produced by the top 10 wineries, ranked by the volume of wines produced. It was interesting to note that Louis Latour not only produced wine on the lower end but also produced one of the most expensive wines. The other wineries mostly stayed within the same $100 range.
We used a scatterplot for the price range of wines by country. England had the highest price range and Ukraine had the lowest. The US was also on the higher end but still relatively less expensive than Italy and France, indicating that US wineries could be more commercialized in their wine production as they also produce the highest volume of wine.
People tend to believe that the more expensive something is, the better it tastes. We decided to plot the price range of each wine against the points/rating it was given on a scale of 0–100. First we broke the price of wine into five price ranges: $0–$25, $25–$50, $50–$100, $100–$150, and $150+. The pie chart below shows how the wines were distributed across the ranges. Red shows the wines under $25, which made up 45.9% of the dataset. Blue shows $25–$50, yellow $50–$100, green $100–$150, and purple $150+.
Next we plotted price range vs points. As expected, there was a positive correlation with price and points. Another observation we made was that as the price range increases, the point rating increase becomes incrementally smaller.
After adding a feature column for grape color, we performed exploratory analysis on the dataset to see if there were any features which correlate with grape color.
We graphed grape color against price range using a violin plot. Here, we can observe a higher concentration of white wines at the more affordable prices, whereas the expensive wines are more often red than white.
We also plotted points versus grape color, finding that the ratings of white wines skewed slightly toward the left as opposed to the red wines. However, given the generally lower price points of white wines, the violin plot indicates that even though white wines are generally less expensive, they receive similar ratings to the red wines. This made us curious about including grape colors in our classifier and machine learning model as they are binary and often used as buckets for wine.
Next, we looked at wine prices over the years. It was interesting to see that prices of wine decrease over time. The data spans from 1904 to 2017. Prices reached almost $1000 in 1935 and have gradually declined to the $20–60 range. When we looked into this online, we found a CNN report that the price of wine is on the decline due to a grape surplus in California and a decrease in demand across the country. This also indicates the general accessibility of wine nowadays.
Finally we used word clouds to analyze the descriptions of each wine. After removing all the stop words (which included “wine”), we first looked at the most common words used in the top 100 rated wines. ‘Year’, ‘tannin’, and ‘vintage’ all stood out here.
Next we looked at the most common words used to describe the lowest 100 rated wines. We found that words like finish, aroma, and palate were commonly used.
Model 1: Predicting Grape Color
For our first model, we wanted to start with a binary classification problem and decided to tackle grape color classification. Although even amateur wine drinkers can typically tell whether the wine came from a red or white grape based on the liquid color, some wines’ grape origins are non-obvious (for example, champagne is typically a blend of two red grape varieties and one white grape variety).
For a feature set, we used all numerical features from the original dataset and one-hot-encoded the country and variety variables, dropping all other non-numeric features. The final feature set included points, price, year, latitude, longitude, and the one-hot-encoded country and variety, with 658 features total. The labels themselves consisted of binary 0s and 1s, where 1 represented “Red” and 0 represented “White”.
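To make the one-hot encoding concrete, here is a small standard-library sketch of the idea. The helper `one_hot` and the sample country list are illustrative only; on the full dataset a library encoder would be the more practical choice.

```python
def one_hot(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one column is switched on per row
        rows.append(row)
    return categories, rows

# Illustrative sample, not rows from the dataset.
countries = ["US", "France", "Italy", "US"]
categories, encoded = one_hot(countries)
print(categories)  # ['France', 'Italy', 'US']
print(encoded)     # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Each distinct country (or variety) becomes its own column, which is why the feature count grows to several hundred once both variables are encoded.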
We ran PCA on the feature set for dimensionality reduction purposes. We initially tried scaling the data using StandardScaler but found that it detracted from our model’s ability to accurately predict grape color. Since the majority of our features are binary due to the one-hot-encoding, it makes sense that the features should not be scaled. The explained variance ratio output from the PCA transformation on the training feature set is shown below.
Based on these results, we PCA-transformed the feature set down to 50 components, since these explained 99.5% of the variance in the data. Since the classification task is not too complicated, we started with a linear regression model to predict the two classes, where outputs greater than 0.5 are considered label “1” and outputs smaller than 0.5 are considered label “0”. We considered Lasso, Ridge, and ElasticNet regressions, and Ridge regression ultimately had the highest performance on the validation set, suggesting that most of our features are useful for prediction. Since this was a binary classification problem in which false negatives and false positives are equally important, we used the F1 score as the evaluation metric. Using Ridge regression, the training set had an F1 score of 93.6% with a mean squared error (MSE) of 0.067, and the testing set had an F1 score of 93.7% with an MSE of 0.066.
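The thresholding and scoring steps can be sketched in plain Python. The scores and labels below are invented for illustration, and sending a score of exactly 0.5 to label 0 is an assumption, since the rule above only covers values strictly above or below 0.5.

```python
def to_labels(scores, threshold=0.5):
    """Turn continuous regression outputs into 0/1 labels."""
    return [1 if score > threshold else 0 for score in scores]

def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up regression outputs and true labels (1 = red, 0 = white).
scores = [0.91, 0.12, 0.64, 0.48, 0.30]
y_true = [1, 0, 0, 1, 0]
y_pred = to_labels(scores)
print(y_pred)                    # [1, 0, 1, 0, 0]
print(f1_score(y_true, y_pred))  # 0.5
```

Because F1 combines precision and recall, a false positive and a false negative each pull the score down, which matches the goal of treating both kinds of error as equally important.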
Out of curiosity, we decided to look into the 20 most important PCA components in the PCA-transformed feature set, which are listed below. We predicted that price or points would be highly correlated with grape color, but it is interesting to note that the varieties and country of origin are much more important in determining grape color.
Model 2: Predicting Taster Identity
Next, we sought to predict the identity of the taster based on their descriptions of the wines alone. Example descriptions include “Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, horseradish. In the mouth, this is fairly full bodied, with tomatoey acidity. Spicy, herbal flavors complement dark plum fruit, while the finish is fresh but grabby.” (Michael Schachner), and “This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It’s already drinkable, although it will certainly be better from 2016.” (Roger Voss). Although the descriptions for different wines did not look too distinct between tasters, we were curious to see whether the descriptions alone would be a sufficient feature set for the classification task. The distribution of the number of reviews for each taster is shown below:
Based on this distribution, we decided to only use data where the taster reviewed at least 3000 wines, since the others had too few representative instances for classification. This left us with only the top 10 tasters (out of 19 total), lending itself to a 10-class multi-class classification problem.
To use the descriptions as features, we used the NLTK library’s built-in stopwords and WordNetLemmatizer. For each entry in the “Description” column, we lemmatized all words to obtain their base forms, converted them to lower case and tokenized them, checked that each token was alphanumeric and not in NLTK’s stopwords, and then added the tokens as a list of strings to a new column in the dataframe. Using the example from above, the description “This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It’s already drinkable, although it will certainly be better from 2016.” became [‘ripe’, ‘fruity’, ‘wine’, ‘smooth’, ‘still’, ‘structured’, ‘firm’, ‘tannin’, ‘filled’, ‘juicy’, ‘red’, ‘berry’, ‘fruit’, ‘freshened’, ‘acidity’, ‘already’, ‘drinkable’, ‘although’, ‘certainly’, ‘better’, ‘2016’].
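The shape of this preprocessing step can be sketched without any external dependencies. This is a simplified stand-in rather than the NLTK pipeline itself: the stopword set is a tiny hand-written sample, the tokenizer is a plain whitespace split, and `lemmatize` is a hypothetical hook where a real lemmatizer (such as NLTK's WordNetLemmatizer) would be plugged in. By default it leaves words unchanged.

```python
import string

# Characters stripped from the edges of each token, including curly quotes.
EDGE_PUNCTUATION = string.punctuation + "“”‘’"

def preprocess(description, stopwords, lemmatize=lambda word: word):
    """Lower-case, tokenize, lemmatize, and filter one description."""
    tokens = []
    for raw in description.lower().split():
        word = lemmatize(raw.strip(EDGE_PUNCTUATION))
        # Keep only alphanumeric tokens that are not stopwords.
        if word.isalnum() and word not in stopwords:
            tokens.append(word)
    return tokens

# A tiny illustrative stopword set; NLTK's list is much longer.
SAMPLE_STOPWORDS = {"this", "is", "and", "a", "that", "while"}

text = "This is ripe and fruity, a wine that is smooth while still structured."
print(preprocess(text, SAMPLE_STOPWORDS))
# ['ripe', 'fruity', 'wine', 'smooth', 'still', 'structured']
```

Tokens with internal punctuation, such as “it’s”, fail the alphanumeric check and are dropped, while a vintage like “2016” passes it and is kept.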
After splitting the data into a train and test set, however, it was clear that the dataset was highly imbalanced. One taster had 17953 instances, while another had only 3925. To fix this problem, we used the imblearn library’s built-in SMOTE function (standing for Synthetic Minority Over-Sampling Technique). The function upsamples all minority classes to have the same number of instances as the majority class, using a K-nearest neighbors approach with original instances to synthetically create new instances. We used the TfidfVectorizer function to turn the representative words of each instance into a sparse matrix, and called SMOTE on the training features so the training feature set would be perfectly balanced.
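The core idea behind SMOTE, interpolating between a minority-class instance and one of its nearest minority-class neighbors, can be sketched in a few lines. This is a toy illustration, not imblearn's implementation: the function name `smote_like`, the default `k`, and the sample points are all made up here.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points from a list of minority-class tuples.

    Requires at least two minority points so that a neighbor exists.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # The k nearest other minority points, by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # position along the segment from x to the neighbor
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote_like(minority, n_new=5)
print(len(new_points))  # 5
```

Every synthetic point lies on a line segment between two real minority instances, so the upsampled class fills in the region the minority class already occupies instead of duplicating rows exactly.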
Given the more complex multi-class classification problem, we decided to use a random forest classifier. We first ran a grid search on the number of estimators and maximum depth of each tree using GridSearchCV with a range of 75–150 and 10–50, respectively. We found that the optimal number of estimators was 125, though the maximum depth parameter did not seem to increase the validation accuracy at all, suggesting that the training and validation sets have very similar distributions. Thus, we used a random forest classifier with 125 estimators and no maximum depth, which resulted in a training accuracy of 100% and a testing accuracy of 92.1%.
The success of our classifier was a complete shock, given that we did not notice clear distinctions between the descriptions of different tasters looking at the dataset originally. As a result of these findings, we decided to plot the most common words used by each of the top 3 tasters with the most reviews, the results of which are shown below:
It is interesting to note that although there are some shared commonly used words between the tasters (for example, “palate”, “aroma”, “tannin”, etc.), by and large the most commonly used words are quite distinct between different tasters. This means it is likely that although each taster independently reviews different wine types, colors, vintages, etc., their writing style tends towards certain words that are specific to each taster’s diction, making it possible to predict the identity of the taster based on their words alone.
Model 3: Predicting Point Ratings
For our third classification task, we aimed to predict the point ratings of each wine, which was the most complex problem of all. The point values are integers ranging from 80 to 100, with 21 possible values total.
Given the nature of the point rating system, we first tried to use a linear regression model to predict the ratings using the sparse matrix of vectorized descriptions (as described in Model 2 above) along with one-hot-encoded price, grape, country, and variety features. The resulting concatenated sparse matrix had 24073 features total. We initially tried using truncated SVD for dimensionality reduction given the large number of features, but even with 3000 components only 80% of the variance could be explained, and thus SVD-transformed feature sets led to worse results on the validation set. Instead, we used the entire feature set for Lasso, Ridge, and ElasticNet regressions. Ridge regression had the best results with a 29.7% training accuracy, 26.7% test accuracy, and MSE of 2.32, whereas Lasso and ElasticNet both had validation accuracies of about 13.1%. This again indicates that most of the features used are useful for the regression task. It is interesting to note that when we tried training the Ridge regression with the entire feature set without the vectorized words, the validation accuracy was only 14.3%, so clearly the description is very important for predicting the points ratings. To increase the complexity of the regression models, we also tried using a Random Forest Regressor as a form of ensemble learning of regression, but found that even with the best parameters, the validation accuracy was at most 18.6%. Overall, the Ridge regression strategy at its best yielded a test accuracy of 26.7%, which is much better than random but not satisfactory by our standards.
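An accuracy figure for a regression model requires turning continuous predictions into integer point ratings first. The sketch below shows one plausible way to do that, under the assumption (not stated above) that predictions are rounded to the nearest integer and clipped to the 80 to 100 range; the helper names and sample numbers are illustrative.

```python
import math

def to_point_rating(prediction, lo=80, hi=100):
    """Round a raw prediction to the nearest integer and clip to [lo, hi]."""
    rounded = math.floor(prediction + 0.5)  # round half up, unlike built-in round()
    return max(lo, min(hi, rounded))

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true rating exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_raw):
    """Mean squared error on the raw (unrounded) predictions."""
    return sum((t - r) ** 2 for t, r in zip(y_true, y_raw)) / len(y_true)

# Made-up predictions and true ratings.
raw = [87.4, 91.6, 79.2, 88.5]
true = [87, 90, 80, 89]
ratings = [to_point_rating(r) for r in raw]
print(ratings)                  # [87, 92, 80, 89]
print(accuracy(true, ratings))  # 0.75
print(mse(true, raw))
```

Exact-match accuracy is a strict measure here: a prediction that is off by a single point counts as wrong, which helps explain how a model with an MSE near 2 can still score below 30% accuracy.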
Since the Ridge regression’s training and testing accuracies were similar and low, we concluded that the model was likely underfitting and decided to next try a feed-forward neural network, which is a more complex model. Since PyTorch does not take sparse matrices as inputs and converting the 24073 sparse features into tensors resulted in frequent RAM crashes, we decided to use all the numeric data from our original dataset (points, price, year, latitude, longitude, and the one-hot-encoded country and variety), with 658 features total. Through trial and error, we found the best neural network structure to be a 4-layer network with ReLU nonlinearities, with 8,671,521 trainable parameters. We found that a more complex network with even more trainable parameters was better for the classification problem at hand, but the Google Colab RAM was unable to handle a larger network.
We tried using different optimizers with varying learning rates as well as various loss criteria, but the best-performing network used an SGD optimizer with a learning rate of 0.0005 and cross entropy loss. Although we verified that the loss steadily decreased and accuracy steadily increased, the maximum testing accuracy we could achieve was 13.1%.
Based on how low the accuracies were at predicting 21 possible points, we binned the points into 4 labels to reflect bad wines, mediocre wines, good wines, and great wines. Based on the dataset, there are many mediocre and good wines, few bad wines, and very few great wines. The distribution is shown below:
Initially, we used a feature set of price, year, latitude, longitude, grape type, and one-hot-encoded country and variety and discarded all other non-numeric features. After splitting into a training and testing set, we again used SMOTE to upsample the training set for a more balanced training set. Since this classification problem has many fewer classes and thus is much simpler, we tried using a K-Nearest Neighbors classifier on this initial feature set, and obtained an 82.7% training accuracy and 53.6% testing accuracy. Although this is a much higher accuracy compared to the Ridge regression from before, given that in this classification problem we only had 4 possible classes, we still hoped for a higher validation accuracy.
We ran PCA on the feature set to reduce dimensionality, and decided to transform the feature set with 100 components, explaining 99.95% of the data variance. We tried using a multinomial logistic regression model with the PCA-reduced features, resulting in a 57.7% training accuracy and 47.8% testing accuracy. To increase the complexity of the model, we experimented with using Support Vector Machine (SVM) for this classification problem, using linear support vector classification (LinearSVC) with an L2 penalty on the PCA-transformed feature set. The results were not promising, with a 57.5% training accuracy and 36.9% testing accuracy, with a 0.98 testing MSE.
We also tried running a feed-forward neural network with the PCA-transformed feature set and the 4-neuron output layer. The training accuracy and loss are shown below:
As shown above, the training accuracy levels off after about 12 epochs of training, and the resulting model had a training accuracy of 50.2% and training loss of 1.16, and a testing accuracy of 32.2% and testing loss of 1.29. Based on the model results above, where classifiers ranging from simple to very complex could only achieve a maximum testing accuracy of ~50%, we concluded that our current feature set was unable to capture sufficient patterns in the data for predicting points.
Based on our results, we wondered if even the PCA-transformed feature set was too complex for the classification problem at hand. Given the points vs. price and points vs. grape color graphs we produced in our EDA, we created a feature set consisting only of the price, the price group, and grape label. A simple Ridge regression resulted in a training accuracy of 60.9%, a testing accuracy of 60.7%, and a test MSE of 0.33. Using a more complex model, a Random Forest Classifier, the training accuracy was 64.3% and testing accuracy was 63.6%. The close training and testing accuracies for both classifiers indicate underfitting, but they both did result in higher accuracies compared to the larger feature set we were using before. This indicated to us that we needed to be more selective in choosing useful features for the classifiers, but that more features than just price and grape color are needed.
We decided to augment our feature set by including the vectorized descriptions, where the new feature set consisted of the sparse vectorized descriptions, one-hot-encoded price, and grape label. Since the Random Forest Classifier performed well on the simpler feature set, we decided to use the same classifier for this augmented data set as well. We ran a grid search using GridSearchCV on the number of estimators and maximum depth of the classifier, and found that the optimal number of estimators was 200, with a maximum depth of 75. The resulting training accuracy was 99.7% and the testing accuracy was 74.2%. We were satisfied with this classification accuracy, and were pleasantly surprised at how useful the descriptions of the wines are as features.
Challenges and Obstacles
There were a number of challenges and obstacles that we faced during this project. Given the popularity of the Wine Enthusiast data, we initially wanted to augment the dataset by scraping Vivino’s website. Vivino is a popular app for those who want to discover good wines, especially on a smaller budget. There are hundreds of thousands of reviews, user-ratings, and information about wines, but we could not find any of their data in a published dataset, nor does Vivino have an associated API. We found a paper that analyzed data from Vivino and contacted the authors, who shared their dataset with us (Vivino_data.jsonl). However, their data focused on user interactions instead of the wine database, which was not helpful for augmenting the Kaggle dataset.
We then wrote our own crawler using Selenium’s built-in libraries, using XPath elements to automatically locate pertinent information such as alcohol content, user rating, number of reviews, wine flavors, etc. The crawler started from Vivino’s main search home page (which includes roughly 600,000 wines), performed infinite scrolling to the bottom of the page, identified the links to all wines listed on the search page, and then individually clicked into each of those links to scrape specific information about each wine. The information for each wine was saved as a dictionary and written to a TXT file. Unfortunately, after scraping 1251 wines Vera’s IP address was blocked for overloading Vivino’s server. For ethical reasons and the time constraints of the project, we decided not to use proxies to continue scraping the Vivino dataset. The scraper code is included in the Google Drive (545FinalProject_Scraper.ipynb) along with the scraped data (wine_data_compiled), and a sample of the scraped data is shown below.
In the end we were not able to use the data from Vivino given how small the scraped dataset was, which was a shame given how much time we put into the scraper. With more time, we would have scraped the data in smaller portions to avoid overloading the server. This data would have been wonderful for augmenting the Kaggle dataset — we were interested in comparing the user ratings to the sommelier-given points, and were curious to see how alcohol content plays into the perceived quality of a wine.
Another problem that arose was the large size of the dataset. With over 120,000 instances, many functions that were simple ended up taking a lot longer to run than expected; for example, extracting the latitude and longitude coordinates of the winery locations. A mixture of Colab crashing, the API disconnecting, and a laptop trying to run multiple files at once caused us to spend around a week extracting this feature. We overcame this obstacle by splitting the data into over 300 smaller batches and merging the resulting dataframes together to obtain the coordinates for the entire dataset.
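The batching approach can be sketched as follows. The `lookup` argument is a hypothetical stand-in for whatever geocoding call is used, and the batch size and sample cities are illustrative; in a real run, each batch's results would be written to disk before starting the next one so that a crash costs only one batch.

```python
def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def geocode_in_batches(cities, lookup, size=400):
    """Apply `lookup` (a hypothetical geocoding function) batch by batch."""
    coordinates = []
    for batch in batches(cities, size):
        # A real pipeline would save this batch's results before continuing.
        coordinates.extend(lookup(city) for city in batch)
    return coordinates

# Fake lookup table standing in for a geocoding service.
FAKE_COORDS = {
    "Napa": (38.30, -122.29),
    "Bordeaux": (44.84, -0.58),
    "Siena": (43.32, 11.33),
}
print(list(batches([1, 2, 3, 4, 5, 6, 7], 3)))  # [[1, 2, 3], [4, 5, 6], [7]]
print(geocode_in_batches(["Napa", "Bordeaux", "Siena"], FAKE_COORDS.get, size=2))
```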
Additionally, extracting the year of production as a feature presented an interesting challenge. We noticed that many of the titles included the year the wine was produced so we initially parsed the titles and extracted all numeric data, but a quick look through the resulting dataframe showed us that some wine titles include random numbers (for example, “007”, “1”, etc.), which made extracting only the year more difficult. To get around this, we split the extracted numbers from the titles with a space, replaced all other characters with 0s, and used only the maximum extracted number as the year. We then also found that some wineries place their founding year on all of the wine titles (for example, “1847”) instead of the production year. We limited the year features to be within a reasonable range, keeping only those years between 1900 and 2021, and replacing all other extracted years with Null values. Given the year vs. price bar chart in our EDA, we are quite confident that our year extraction is correct, since wines that have aged for longer are generally priced much higher.
Lastly, creating the model for predicting points proved to be very tricky. Compared to predicting grape color and taster identity, where the relevant features were clearer, it was not immediately obvious to us which features were most useful for predicting points. We started by using all features and found that our models were insufficient, then tried using only a few features that we thought were useful and found that the models were underfitting. With a lot of time and experimentation we finally discovered that the descriptions, price, and grape color were most pertinent to the classification problem.
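The year-extraction logic can be approximated with a regular expression. This is a simplified sketch rather than the exact steps described above: it pulls out digit runs directly instead of replacing other characters with 0s, and all titles except the first are made up for illustration.

```python
import re

def extract_year(title, lo=1900, hi=2021):
    """Return the largest number in the title if it is a plausible year, else None."""
    numbers = [int(n) for n in re.findall(r"\d+", title)]
    if not numbers:
        return None
    year = max(numbers)
    # Anything outside the accepted range is treated as missing.
    return year if lo <= year <= hi else None

print(extract_year("Citation 2004 Pinot Noir (Oregon)"))  # 2004
print(extract_year("Hypothetical 007 Red Blend"))         # None
print(extract_year("Estate 1847 Reserve 2012 Merlot"))    # 2012
print(extract_year("Estate 1847 Reserve Merlot"))         # None
```

Taking the maximum handles titles that contain both a small stray number and a vintage, while the range check discards founding years and other numbers that cannot be a production year.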
Next Steps and Future Directions
We have a few potential future steps in mind to take with this project:
- First and foremost, we would like to finish the Vivino scraping to augment the Kaggle dataset and our subsequent models. We would like to compare user ratings from Vivino to sommelier ratings from Wine Enthusiast (Kaggle dataset) — specifically, whether price has as much of an influence on user ratings as it does on sommelier-given points. With this dataset, we would like to explore how alcohol content is related to country of origin, user rating, and price, and how it performs as a feature in our models above.
- Another interesting path to continue on is to see how the same wines produced by the same wineries change over their years of production. We broadly noticed the trend that wines produced in earlier years typically have higher prices, but it would be interesting to investigate what other differences, if any, these wines from different years possess. If they are distinct, we would further like to create a model for predicting a specific wine’s year of production — it would be interesting to see which features are most useful for this prediction task.
- In our current analysis, we did not use the winery information at all — in the future, we would like to perform a more in-depth exploratory analysis of the wineries — for example, do wineries lean towards producing red vs. white wines, and how does that affect the average price of their wines? It would also be interesting to augment the dataset with the winery’s year of founding to see how that has an effect on other features as well.
- Given how useful the vectorized descriptions were as features for almost all of our models, it would be interesting in the future to do a more in-depth analysis of what other features the descriptions alone can predict. For example, can we get a modest prediction on the price of the wine based on the descriptions? Perhaps specific words in the descriptions are used more often for different countries of origin? From our analysis alone, it is clear that the descriptions themselves are good predictors for many different aspects of each wine, and it would be an interesting future step to further explore their predictive capabilities.
And that’s it!
Thanks for reading, hope you enjoyed learning about our project!
-Ally, Alicia, and Vera