Can you predict a Major League Baseball playoff appearance using sentiment analysis from Reddit comments? That’s what my teammate and I set out to discover for our Natural Language Processing course last summer.
The obvious answer is “no, not really”, but our shared love of baseball made this an appealing framework for exploring sentiment analysis techniques. Our thinking was that, if this could work at all, it would be because fans often react to changes, whether a key player getting injured or a major mid-season trade, faster than the team’s on-field performance reflects them.
The inspiration for the project came from a failed startup my teammate knew about, which had tried to use Reddit sentiment to predict stock performance. We hoped the average Redditor would be somewhat more literate about baseball than about stocks.
The Data
I scraped four years of data from the thirty team-specific subreddits linked in the /r/mlb hub. This was tricky from the start: Reddit’s API doesn’t let you search a specific date range, only “this hour”, “today”, “this week”, “this month”, “this year”, and “all time”. My workaround was to pull the top 1000 posts of all time, check their dates, and save each post plus its top 20 comments if it fell within our timeframe.
This is also why we limited ourselves to 2021-2024 for our data years: top posts skew recent because Reddit has grown in popularity, so anything pre-2020 would have left us with sparser data, and 2020 itself was a season truncated by the pandemic.
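As a sketch, the scrape amounts to pulling each subreddit’s all-time top posts and keeping only those whose timestamps fall in range. The `in_timeframe` helper below is illustrative rather than our actual code, and the commented PRAW calls assume configured API credentials:

```python
from datetime import datetime, timezone

def in_timeframe(created_utc: float, start_year: int = 2021, end_year: int = 2024) -> bool:
    """Return True if a post's UTC timestamp falls within the target seasons."""
    year = datetime.fromtimestamp(created_utc, tz=timezone.utc).year
    return start_year <= year <= end_year

# With PRAW (and API credentials configured), the loop might look like:
# for post in reddit.subreddit("phillies").top(time_filter="all", limit=1000):
#     if in_timeframe(post.created_utc):
#         post.comments.replace_more(limit=0)   # flatten "load more" stubs
#         save(post, post.comments[:20])        # hypothetical save helper
```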
This left us with nearly 200,000 documents. My preprocessing removed hyperlinks, short words (2 characters or fewer), and certain punctuation, then corrected spelling and lemmatized. This led to plenty of errors on team-specific slang, proper nouns, and internet shorthand:
| Comment | Machine-processed Text | Human-processed Text |
| --- | --- | --- |
| Phillies in phirst! | phillies thirst | phillies first |
| Orsillo is a class act | oreilly class act | orsillo class act |
| tf bruh | brush | the fuck bro |
With more time, we would have compiled a list of these errors and added an extra preprocessing layer to convert them into more machine-friendly text first.
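A minimal sketch of the cleaning pass described above (the function name and regexes are illustrative, not our exact pipeline); spelling correction and lemmatization are left as comments since both need extra model downloads:

```python
import re

def clean(text: str) -> str:
    """Strip links and punctuation, then drop words of 2 or fewer characters."""
    text = re.sub(r"https?://\S+", "", text)   # remove hyperlinks
    text = re.sub(r"[^\w\s']", " ", text)      # strip most punctuation
    tokens = [w for w in text.lower().split() if len(w) > 2]
    # Spelling correction (e.g. TextBlob's .correct()) and lemmatization
    # (e.g. NLTK's WordNetLemmatizer) would run on the tokens next; both
    # need extra corpora downloads, so they're omitted from this sketch.
    return " ".join(tokens)
```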
Splitting and Labeling
Since we were predicting playoff appearances for a single season, we used posts from 2021-2023 as our training set and posts from 2024 as our test set. We considered more granular splits (such as early, mid, and late season), but left that for future work.
Because we were working with a novel dataset we had compiled ourselves, and were limited on time, we used an existing Python library (TextBlob) to label our training data as positive or negative. This was not always reliable: while a human reader would classify the comment “let’s fucking go!” as highly positive, TextBlob scored it as negative with 75% confidence.
Overall we ended up with a 45% positive, 55% negative split. We did try having a “neutral” class for anything with less than 20% confidence, but we found that the results were worse than random guessing. For example, one of our RoBERTa models would predict “neutral” 99.8% of the time, because 60% of the training data ended up being classed as neutral.
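The labeling step can be sketched as a threshold on TextBlob’s polarity score. The helper below is our illustration, not the project’s exact code, and it assumes polarity magnitude as a rough proxy for confidence:

```python
def label(polarity: float, neutral_band: float = 0.0) -> str:
    """Map a TextBlob polarity score in [-1, 1] to a sentiment label.

    Setting neutral_band to 0.2 reproduces the abandoned three-class
    scheme, where low-magnitude scores were called "neutral".
    """
    if abs(polarity) < neutral_band:
        return "neutral"
    return "positive" if polarity >= 0 else "negative"

# The polarity itself would come from TextBlob:
# from textblob import TextBlob
# polarity = TextBlob(comment_text).sentiment.polarity
```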
Methods
For the sentiment analysis portion, we wanted to compare two methods: a statistical model in the form of a Naive Bayes classifier, and a transformer-based model with RoBERTa. We then fed the output of each model into a logistic regression to predict playoff appearance.
I worked on the Naive Bayes classifier, a conditional probability model that assumes each feature (here, each word) is independent of the others given the class. For NLP work, this is often a “good enough” simple model that serves as a baseline against which to compare more advanced models.
With almost 35,000 words in the training vocabulary, the individual likelihood of any one word was tiny and prone to underflow, so I summed log-likelihoods instead of multiplying raw probabilities, with Laplace smoothing to account for words that appeared in only one of the classes.
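A compact sketch of this kind of classifier (the function names and exact smoothing form are illustrative, not my original implementation):

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Train a multinomial Naive Bayes model in log space."""
    vocab = {w for d in docs for w in d.split()}
    log_prior, log_like = {}, {}
    for c in set(labels):
        class_docs = [d for d, lab in zip(docs, labels) if lab == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        # Laplace smoothing: words seen only in the other class
        # still get a small nonzero probability here
        denom = sum(counts.values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
    return log_prior, log_like, vocab

def predict_nb(doc, log_prior, log_like, vocab):
    """Score each class by summed log-likelihoods; return the argmax."""
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in doc.split() if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)
```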
My teammate worked on the RoBERTa models; RoBERTa is a BERT variant with a larger pretraining corpus and an increased batch size. One we called the “vanilla” model: the pre-trained RoBERTa model as-is, with no adjustments to the hyperparameters. The other was our custom model, which re-tokenized the preprocessed text and tuned its hyperparameters via cross-validated grid search.
Results
We fed each model’s sentiment output into the logistic regression as its only input, then compared the models on accuracy, precision, recall, and the combined F1 score.
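For reference, all four metrics reduce to simple ratios over the confusion-matrix counts; a minimal helper (the names are ours, not from the paper):

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```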

Naive Bayes was, unsurprisingly, the worst overall performer. Its accuracy was a passable 53%, but mostly because it predicted slightly more results correctly than incorrectly. It was, however, the only model whose number of positive predictions matched the number of teams that actually make the MLB playoffs.

The vanilla RoBERTa model had a comparable accuracy of 57%, but much higher recall and F1, because it predicted a positive result 76% of the time. Since 23 of 30 teams cannot mathematically make the playoffs, this is not a particularly useful result.

The custom RoBERTa model, meanwhile, had the inverse problem: extremely high precision, but the lowest recall of any of our models.
Overall, our models had 8 shared predictions:
| Result | Teams |
| --- | --- |
| True Positive | Phillies, Orioles, Guardians |
| True Negative | Angels, Rockies, Nationals, Red Sox |
| False Positive | n/a |
| False Negative | Braves |
On these shared results alone, the models scored higher than any individual model on all our metrics. Though this covers less than a third of the teams, we think exploring a proper ensemble model would be interesting future work.
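The “shared predictions” idea amounts to a unanimity ensemble: keep only the teams on which every model agrees. A sketch, with hypothetical team labels and data structures:

```python
def shared_predictions(model_preds: list[dict]) -> dict:
    """Keep only the teams where every model gives the same label
    (1 = predicted playoffs, 0 = predicted no playoffs)."""
    first = model_preds[0]
    return {team: first[team]
            for team in first
            if all(preds[team] == first[team] for preds in model_preds)}
```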
Overall, we think many of our accuracy issues stemmed from our data handling, since we underestimated how much domain-specific preprocessing and sentiment scoring the data would need. A confounding factor was that certain subreddits simply lacked data, due to Reddit’s recency bias and the inability to search by date range. In our training data, /r/miamimarlins had a combined 20 posts from 2021 and 2022, then over 3,000 in 2023 (the only year in this span that the Marlins made the playoffs).
This was a project I greatly enjoyed, even when it frustrated me at times. It’s definitely a dataset I’d like to revisit, to finally do the updated data processing and see whether that changes the results. I can also see a lot of visualization potential in this type of data.
Our paper, including full results, can be found here.