Not Just Thinking in Red and White

https://winefolly.com/
  1. We added a column for the year the wine was produced by parsing the wine title; for example, “Citation 2004 Pinot Noir (Oregon)” yields 2004.
  2. We added a column for the grape color type based on the Description, Title, and Variety.
  3. We also added a column for price group, assigning each wine to a group based on its price: group 1 for wines under $25, group 2 for wines between $25 and $50, group 3 for wines between $50 and $100, group 4 for wines between $100 and $150, and group 5 for wines above $150.
  4. Finally, we added a column for Latitude and Longitude, the coordinates of the winery that produced the wine. This column was added by processing the data in batches and using the geolocator package to look up the coordinates for the city of each winery.
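The year and price-group steps above can be sketched in a few lines of Python. This is only an illustration, not the code we actually ran: the function names `extract_year` and `price_group` are hypothetical, the regular expression assumes the vintage is the first four-digit number starting with 19 or 20 in the title, and prices that fall exactly on a boundary (for example, $25) are assumed to go into the higher group, since the description above does not say which side they belong to.

```python
import re
from bisect import bisect_right

# Assumption: the vintage is the first standalone 4-digit number
# beginning with 19 or 20 that appears in the title.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")

# Upper edges of price groups 1 through 4; anything at or above
# the last edge lands in group 5.
PRICE_EDGES = [25, 50, 100, 150]


def extract_year(title):
    """Return the production year found in a wine title, or None."""
    match = YEAR_PATTERN.search(title)
    return int(match.group()) if match else None


def price_group(price):
    """Map a price in dollars to a price group from 1 to 5.

    Assumption: a price exactly on a boundary goes to the higher group.
    """
    return bisect_right(PRICE_EDGES, price) + 1


if __name__ == "__main__":
    title = "Citation 2004 Pinot Noir (Oregon)"
    print(extract_year(title))
    print(price_group(30.0))
```

Titles with no recognizable year return `None`, so those rows can be dropped or handled separately rather than being given a guessed vintage.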
Our final cleaned, wrangled, and feature-extracted dataset included 120975 rows × 15 columns.
A world map of all the wineries in our dataset.
  1. First and foremost, we would like to finish the Vivino scraping to augment the Kaggle dataset and our subsequent models. We would like to compare user ratings from Vivino to sommelier ratings from Wine Enthusiast (the Kaggle dataset), specifically whether price influences user ratings as much as it does sommelier-given points. With the augmented dataset, we would also like to explore how alcohol content relates to country of origin, user rating, and price, and how it performs as a feature in our models above.
  2. Another interesting path is to see how the same wines produced by the same wineries change over their years of production. We noticed a broad trend that wines produced in earlier years typically have higher prices, but it would be interesting to investigate what other differences, if any, distinguish wines from different years. If they are distinct, we would like to build a model that predicts a specific wine’s year of production and see which features are most useful for that task.
  3. In our current analysis, we did not use the winery information at all. In the future, we would like to perform a more in-depth exploratory analysis of the wineries: for example, do wineries lean towards producing red or white wines, and how does that affect the average price of their wines? It would also be interesting to augment the dataset with each winery’s year of founding to see how it affects other features.
  4. Given how useful the vectorized descriptions were as features for almost all of our models, we would like to analyze in more depth what else the descriptions alone can predict. For example, can we get a modest prediction of a wine’s price from its description? Are specific words used more often for particular countries of origin? Our analysis makes it clear that the descriptions are good predictors for many different aspects of each wine, and exploring their predictive capabilities further would be an interesting next step.
