• Can you predict a Major League Baseball playoff appearance using sentiment analysis from Reddit comments? That’s what my teammate and I set out to discover for our Natural Language Processing course last summer.

    The very obvious answer is “no, not really”, but we wanted to use this as a framework to explore sentiment analysis techniques because of our shared love of baseball. Our thinking was that, if this could work, it would be because fans often react to changes (a key player’s injury, a major mid-season trade) faster than those changes show up in the team’s actual performance.

    The inspiration for the project came from a failed startup my teammate knew about that had tried to use Reddit sentiment to predict stock performance. We hoped that the average Redditor would be slightly more literate when it came to baseball than stocks.

    The Data

    I scraped four years of data from the thirty team-specific subreddits, as linked in the /r/mlb hub. This was tricky from the start: Reddit’s API doesn’t allow you to search for a specific date range, only “this hour”, “today”, “this week”, “this month”, “this year”, and “all time”. My solution was to pull the top 1000 posts of all time and check their dates, saving the post and its top 20 comments if it fell within our timeframe.

    This was also why we limited ourselves to 2021-2024 for our data years, as top posts tend to be more recent due to Reddit gaining popularity. Pulling from anything pre-2020 would leave us with sparser data, and 2020 was a truncated season due to the pandemic.
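    The pull-and-filter step above can be sketched as below. This is a minimal sketch, not our actual scraping script: it assumes PRAW with an authenticated `reddit` client, and `save()` is a hypothetical helper.

```python
from datetime import datetime, timezone

def in_timeframe(created_utc: float, start_year: int = 2021, end_year: int = 2024) -> bool:
    """True if a post's Unix timestamp falls within the target seasons."""
    year = datetime.fromtimestamp(created_utc, tz=timezone.utc).year
    return start_year <= year <= end_year

# With PRAW (assuming an authenticated `reddit` client), the pull could look like:
#
#   for post in reddit.subreddit("phillies").top(time_filter="all", limit=1000):
#       if in_timeframe(post.created_utc):
#           post.comments.replace_more(limit=0)
#           save(post, post.comments[:20])  # save() is a hypothetical helper
```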

    This left us with nearly 200,000 documents to use. My preprocessing method removed hyperlinks, short (2 or fewer character) words, and certain punctuation, then corrected spelling and lemmatized. This did lead to a lot of errors when it came to team-specific slang, proper nouns, and internet shorthand:

    Comment                | Machine-processed Text | Human-processed Text
    Phillies in phirst!    | phillies thirst        | phillies first
    Orsillo is a class act | oreilly class act      | orsillo class act
    tf bruh                | brush                  | the fuck bro

    If we had had more time, we would have wanted to compile a list of these errors and add an additional layer of preprocessing to convert them into more machine-friendly text first.
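    The mechanical cleaning steps can be sketched as follows. This is a simplified sketch of the approach rather than the project code; the spelling correction and lemmatization (done with library tools in the project) are omitted since they depend on downloaded language models.

```python
import re

def preprocess(comment: str) -> str:
    """Strip hyperlinks, most punctuation, and words of 2 or fewer
    characters; spelling correction and lemmatization are omitted here."""
    text = comment.lower()
    text = re.sub(r"https?://\S+", " ", text)          # remove hyperlinks
    text = re.sub(r"[^\w\s']", " ", text)              # remove most punctuation
    tokens = [t for t in text.split() if len(t) > 2]   # drop short words
    return " ".join(tokens)
```

    For example, `preprocess("Phillies in phirst!")` yields `"phillies phirst"`, which is where the spell-correction layer would then introduce the “thirst” error shown above.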

    Splitting and Labeling

    Since we were predicting a season’s playoff appearances, we used posts from 2021-2023 as our training set and posts from 2024 as our test set. We considered more granular splits (such as early, mid, and late season), but decided to leave that for future work.

    Because we were using a novel dataset that we compiled ourselves and were limited by time, we used a pre-existing Python library (textblob) to classify our training data as positive or negative. This was not always reliable. While a human reader would classify the comment “let’s fucking go!” as highly positive, textblob scored that as negative with 75% confidence.

    Overall we ended up with a 45% positive, 55% negative split. We did try having a “neutral” class for anything with less than 20% confidence, but we found that the results were worse than random guessing. For example, one of our RoBERTa models would predict “neutral” 99.8% of the time, because 60% of the training data ended up being classed as neutral.
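    The labeling rule can be sketched as a simple threshold on the polarity score. This is a sketch under one assumption: it treats the magnitude of the polarity score as the “confidence” value described above.

```python
def label_comment(polarity: float, neutral_band: float = 0.0) -> str:
    """Map a polarity score in [-1, 1] to a class label.

    neutral_band=0.2 reproduces the three-class experiment described
    above; the default 0.0 gives binary positive/negative labels."""
    if abs(polarity) < neutral_band:
        return "neutral"
    return "positive" if polarity > 0 else "negative"

# With TextBlob (assumed installed), the polarity itself would come from:
#   polarity = TextBlob(comment_text).sentiment.polarity
```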

    Methods

    For the sentiment analysis portion, we wanted to compare two methods: a statistical model in the form of a Naive Bayes classifier, and a transformer-based model with RoBERTa. We then fed the output of each model into a logistic regression to predict playoff appearance.

    I worked on the Naive Bayes classifier, a conditional probability model that assumes each feature (here, each word) is independent of the others given the class. For NLP work, this is often a “good enough” simple model that can be used as a baseline against which to compare more advanced models.

    With almost 35,000 words in the training vocabulary, the individual likelihood of any one word was small and prone to underflow, so I used log-likelihood to calculate the probabilities, with some Laplace smoothing to account for any words that were only in one of the classes.
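    As a sketch of the technique (not the exact course code), a multinomial Naive Bayes with log-probabilities and add-one smoothing fits in a few lines:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Multinomial Naive Bayes trainer: log-probabilities avoid underflow,
    add-one (Laplace) smoothing handles words seen in only one class."""
    classes = sorted(set(labels))
    counts = {c: Counter() for c in classes}
    priors = Counter(labels)
    for doc, lab in zip(docs, labels):
        counts[lab].update(doc.split())
    vocab = {w for c in classes for w in counts[c]}
    model = {}
    for c in classes:
        denom = sum(counts[c].values()) + len(vocab)   # Laplace denominator
        model[c] = {
            "prior": math.log(priors[c] / len(labels)),
            "loglik": {w: math.log((counts[c][w] + 1) / denom) for w in vocab},
            "unseen": math.log(1 / denom),             # smoothed unseen-word probability
        }
    return model

def predict_nb(model, doc):
    """Pick the class with the highest total log-probability."""
    def score(c):
        m = model[c]
        return m["prior"] + sum(m["loglik"].get(w, m["unseen"]) for w in doc.split())
    return max(model, key=score)
```

    Summing log-likelihoods instead of multiplying raw probabilities keeps a 35,000-word vocabulary numerically stable, and the add-one smoothing guarantees a word unique to one class never zeroes out the other.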

    My teammate worked on the RoBERTa models; RoBERTa is a variant of BERT with a larger pretraining corpus and increased batch size. One model, which we called “vanilla”, used the pre-trained RoBERTa model as-is, with no adjustments to the hyperparameters. The other was our custom model, which re-tokenized the preprocessed text and tuned the hyperparameters with cross-validated grid search.

    Results

    We fed each model’s sentiment output into the logistic regression (the sentiment scores were the only input to the regression) and compared the models on accuracy, recall, precision, and the combined F1 score.

    Naive Bayes was, unsurprisingly, the worst in terms of overall performance. It had a reasonable accuracy at 53%, but mostly because it predicted slightly more results correctly than incorrectly. It was, however, the only model where the number of positive predictions matched the number of teams in the MLB playoffs.

    The vanilla RoBERTa model had a comparable accuracy at 57%, but much higher recall and F1. This was because it predicted a positive result 76% of the time. Since 23 out of 30 teams cannot all make the playoffs, this is not a particularly useful result.

    The custom RoBERTa model, meanwhile, had the inverse problem: extremely high precision, but the lowest recall of any of our models.

    Overall, our models had 8 shared predictions:

    Result         | Teams
    True Positive  | Phillies, Orioles, Guardians
    True Negative  | Angels, Rockies, Nationals, Red Sox
    False Positive | n/a
    False Negative | Braves

    When looking at the shared results alone, this scored higher than the individual models on all our metrics. Though this is less than a third of the total teams, we thought that exploring an ensemble model in future work would be interesting.

    Overall, we think a lot of our accuracy issues stemmed from our data-handling, since we underestimated how much the data would need domain-specific preprocessing and sentiment scoring. There was also the confounding factor that certain subreddits lacked data, due to Reddit’s recency bias and inability to search by date range. In our training data, /r/miamimarlins had a combined 20 posts from 2021 and 2022, then over 3000 in 2023 (which was the only year the Marlins made the playoffs in this timespan).

    This was a project I greatly enjoyed, even when I found it frustrating at times. It’s definitely a dataset I’d like to revisit in the future, to finally do the updated data processing and see if that changes the results at all. I can also see a lot of vis potential with this type of data.

    Our paper, including full results, can be found here.

  • One of my final homework assignments for CS 7250, Information Visualization, was to create a visualization on something personal. Gaming is one of my major hobbies, so I gathered data on every video game I’ve played for at least one hour according to Steam’s built-in metrics (this data was gathered on November 18, 2025). Aside from just hours played and date last played, there were also achievements to consider, the genre of the game, whether or not a game is “randomized” (different every time you play, like a board game), and some other miscellaneous data I ended up not using (such as “could I run it on my current computer with little effort” and “how many times have I bought this game”).

    The most challenging part was figuring out how I wanted to do a rating system. There were a lot of games I hadn’t thought about in almost 10 years and didn’t remember anything about. Instead of a 1-5 “how much did you like this game” scale, I decided to ask a yes/no/maybe “would I play this game again?” for each game.

    The Vis

    The color selection tool is below the vis. Select any bar in the bar charts or color icon in the legend to show only that type of game, or brush over the points in the scatter plot to limit to a certain amount of hours played or achievements completed.

    Some fun insights I gained about myself and my gaming habits from this project:

    • Despite being the largest genre in my played library and having the highest average achievement completion rate, puzzle games have the second-least total time spent playing and the lowest average hours per game of all the genres. I do think this is easily explained by the fact that once a puzzle game is solved, there’s very little replay value for me unless there’s some other draw like an interesting story or a particularly impressive visual style. The games that are classified as puzzle and “would replay” are all story-driven puzzle games (point-and-click adventures, essentially) that do something interesting, like Return of the Obra Dinn with its stunning art style and framing device.
    • I have put a lot more time into games with randomized runs, particularly strategy games. About a third of my overall gameplay time is in randomized strategy games, which doesn’t surprise me given my love of card battler/deck-building games like Slay the Spire. The trend of a game with randomization having more hours than a game of the same genre without randomization holds true for all but simulation games and puzzle games. For puzzle games this is unsurprising, since there are very few randomized puzzle games worth playing; the two I have are a word game and a multiplayer puzzle game. Similarly, there are not many simulation games with randomization. The genre with the starkest difference is action, where randomized games have a mean play time of 68 hours while non-randomized games have a mean play time of 22 hours.
    • I have put more time into, and had higher completion rates for, games I thought I would replay, but this does depend on the genre. Only the action and strategy genres have more play time in the “replayable” games than in the “non-replayable” ones, but combined, the replayable games in those genres represent more than a third of my overall play time at over 1200 hours. This is enough to make the “yes replay” category have a higher total play time than the “maybe” and “no” categories.
    • I tend to play a lot more new games for a short amount of time in the beginning of a year, and fewer towards the end of a year. January and March are the two most common months for me to stop playing games, which have typically correlated with winter and spring breaks for when I was in school. Many of these games had a low play time, meaning I would put a few hours into a new game before moving on.

    Design Choices

    With the scatter plot, I used a log scale for hours played to spread out the points across the whole chart, rather than having most bunched up along the left side. Additionally, the size of each point correlates to the number of earned achievements. When I was gathering the data, I noticed there were a lot of games where I had 100% completed them, but they had only 10 achievements. Comparatively, there were some games in the 40-60% completion rate where I had over 200 achievements earned.

    For the timeline charts, I did initially want a toggle between the three time units, but I was unable to figure that out (still working on that problem!). I used an area chart rather than a line chart because I wanted the visible bands of color, rather than heavily overlapping lines. Visually it’s much cleaner, and also allows me to see the overall trends. However, the drawback of this is that if I select very few points, it doesn’t show anything interesting and just becomes a full bar of color.

    Color-wise, I decided to use three different encodings to highlight different data I wanted to know about: genre, randomization, and replayability.

    • Genre feels like an obvious choice to examine, since that’s often one of the first things I check when deciding if I’m going to buy a new game or not. (I’m pickier when it comes to RPGs versus strategy games, for example.) 
    • When a game is randomized (so that every round is different from the last), I feel like I put more hours into it because it stays fresh longer, so I wanted to examine whether this feeling was true (it is). 
    • I find that there’s some games I go back to a lot, and examining if there were any trends in that was interesting.

  • PhD Unionization Visualization Process

    See What’s at Stake: Northeastern/GENU-UAW Contract Proposal Visualized

    Alenna Spiro, Noah Rae-Grant

    CS 7250 Fall 2025 Final Project

    Unlike traditional contract negotiations which are article-by-article, Northeastern University released a complete ‘Final’ contract proposal all at once. Our dashboard visualizes key provisions to help union members and curious graduate students understand what’s being offered.

    Motivation

    The two of us are both graduate student workers at Northeastern University, and we were curious about the ongoing contract negotiations between the Graduate Employees’ Network Union (GENU-UAW) and Northeastern University. As graduate student workers, the outcome of these negotiations would directly impact us, so we wanted to better understand what was being proposed.

    We were both familiar with the existence of the union but we didn’t know what state the contract negotiations were in, why it was taking so long, or what graduate student workers would benefit from in the contract.

    After interviewing a few union members, we realized our primary goal should be creating a visualization that serves as a tool to inform and engage graduate student workers. The union has primarily done grassroots outreach through labs, which makes it difficult to reach master’s students or students in labs with no current union presence. And if the union does have the opportunity to speak to a student who wants to learn more, reading through the full contract proposals on GENU-UAW’s website is time-consuming and complex. A visualization dashboard that supplements the grassroots outreach and communicates the negotiations quickly and effectively will help educate graduate union members.

    As part of this primary goal, we asked our interviewees about current issues they would like to focus on communicating to the graduate student workers. They had several key messages. The first was to communicate the back-and-forth nature of the negotiations, as many students were under the impression that the union was not effective, when in reality Northeastern had returned several full contract proposals, which was unorthodox compared to traditional per-article negotiations. The second was to show students the current disparity between the graduate student stipends and the living wage in this area. Expanding on that, some departments are significantly underfunded compared to others, which was a concern for some union members. Finally, one of their big concerns was comparing the healthcare benefits they are asking for from the university with those offered by other local universities. They had tried to communicate these healthcare benefits in the past, but their visualizations were not effective in retaining student involvement.

    Data

    For this project, we predominantly focused on the contract and its various iterations, but we also incorporated:

    • Self-reported stipend data from phdstipends.com from 2011-2025 for Northeastern and other Boston-area students
    • 2025 health plan information from Boston-area universities

    AI Disclosure: For the task of turning the contract into data, we did use the Claude LLM for two things: finding the differences between each iteration of the contract, and summarizing the final versions of the different articles. This was the only use of LLM tools in our project.

    The final contract data contains entries consisting of each article, the sub-topics which were changed within each article, the date it was changed, the party that made the change, and a short summary of what was changed. Each of these articles was grouped into a larger themed group (such as “Employment (Requirements)” or “Benefits”) for display purposes.

    The Boston-area universities we are using for comparison are the Massachusetts Institute of Technology (MIT), Boston University (BU), Harvard, University of Massachusetts Boston (UMass), and Tufts University. We chose these universities because they are local, unionized, and have negotiated or are in the process of negotiating contracts. 

    Due to the nature of the stipend data being self-reported, it can be somewhat erratic and inconsistent. However, we felt that it does successfully show the general trend of how stipends grow but still remain significantly below the living wage for a single person in the area, as calculated by the MIT Living Wage Calculator. Unfortunately, previous years of the MIT Living Wage Calculation were not available online.

    Task Analysis

    At a high level, our primary task was to offer an explanation of the contract and a tool for graduate students to analyze it themselves. To do so, we had to figure out how to compare important aspects to other local universities and what specific information we wanted to convey.

    We decided on three major topics we wanted to examine: stipends, healthcare, and the contract negotiations themselves. We would consider our project successful if a user was able to:

    • Compare stipend amounts between different Boston-area universities, as well as between departments and colleges within Northeastern
    • Compare what different Boston-area universities offered in their healthcare plans
    • Explore how articles have changed over the course of negotiations to determine what benefits were gained or lost

    Design Process

    Post-it note sketches with a central contract framework surrounded by potential subplots

    While our original dashboard plan had separate visualizations akin to a traditional dashboard, the project shifted to a more cohesive design when we decided to use Northeastern’s final proposal contract itself as our framework. Clicking on a part of the contract relevant to a visualization pops out a chart that explains that section or provides a related comparison.

    Whiteboard sketch of a timeline, with text bubbles at different points saying "GENU did this" and "NEU did that"

    The timeline ended up very different from our original sketch, due to the complexities of the negotiation process. In our original sketches we designed a timeline with a single point for each negotiation event, which proved insufficient for the amount of information we wanted to convey.

    Whiteboard sketch of a line chart showing increasing PhD stipends, with a wiggly dashed line for inflation

    Along with the timeline, the stipend visualizations also went through several iterations. Initially we wanted to compare stipends against inflation, but the data we had was not sufficient to do so accurately. Instead, we added horizontal lines to compare the Massachusetts poverty line and the Boston-area living wage to our stipend data.

    Paper sketch of a unit chart to show healthcare benefits

    Our plan for comparing healthcare benefits manifested as a unit chart. The eventual visualization stayed reasonably close to its initial design, with largely aesthetic changes for clarity. The upper portion of the chart shows benefits included in each plan, while the lower portion shows items excluded from plans. Each colored icon represents a different element of a health plan, such as primary care, dental plans, or emergency room costs. In the original idea, hovering over one icon would show tooltips on the matching topic for every university; instead, we added a bar chart to show the differences between benefits quantitatively.

    Final Visualization

    Link to video if embedded video unavailable

    Final landing page design

    Our final design allows a user to scroll through Northeastern’s most recent full proposal contract. At the top, we have included a short description of the negotiation context, explaining how traditional negotiations are conducted. We have also included instructions on how to use the dashboard. On certain relevant paragraphs, a yellow box indicates a clickable portion that will pop out a visualization. The title of the visualization is shown to the left of the corresponding box for context.

    Navigation tab

    To facilitate ease of use and to prevent users from missing important information, we provide a table of contents at the top for quickly jumping between the different visualizations rather than scrolling. Clicking on the “Jump to Visualization” button opens a menu which displays the titles and descriptions of each visualization, allowing the user to choose which visualization they would like to investigate and see in context.

    Timeline

    Final timeline visualization

    The timeline visualization went through the most iteration, due to the complexity of the negotiations we wished to convey. Our usability testing showed that the first iteration of the timeline was the most difficult to navigate and understand. That version included the modified Gantt chart, a table showing the individual sub-topic changes, and a bar chart showing the number of changes over time. When a piece of the timeline was selected, both sub-plots would update to show more detail about what changed on that date.

    When exploring, users read the Gantt chart as a horizontal bar graph, asked why the articles were grouped the way they were, and found the table of changes difficult to read due to its density. Additionally, the bar chart was redundant with the Gantt chart, as both were showing the number of changes over time. Users requested the ability to compare between different article versions and more instructions on what is selectable in the visualization.

    To stop the Gantt chart from being read as a bar graph, we added a simple arrow to signal that it should be read as a timeline. To incorporate information about the number of topic changes in each proposal, after much back-and-forth between options, we settled on a diverging color palette to represent which party made the change (reds for the union, blues for the university) and used opacity to represent the number of changes made relative to the maximum that article went through. This let us remove the redundant bar chart sub-plot and replace it with a more informative table. We added a “Most Recent Language” table, which summarizes the latest version of each article so it can be easily compared with the selected changes. To remove any ambiguity in how to use the chart, we added comprehensive instructions and prompted the user to choose an article to then see the tables.

    Grouping dropdown select and tooltip

    Visualized above is the dropdown containing all of the themed article groups. We initially tried to display all 41 articles in one visualization, but it was severely cramped and difficult to read. We found that groups of 4-7 were optimal, so we made thematic groups and a dropdown to swap between them. After user feedback, we added a description explaining the general purpose of each set of articles. When interacting with the main Gantt plot, hovering over a bar shows what date the changes were made, how many were made, and how long it took until the return proposal from the opposite party. “Present” is represented as May 30, 2025, as (to our knowledge) there have not been any changes to the contract proposal after that date.

    Default proposal exploration view

    Proposal examination view after selecting a bar

    When no bar is selected, instructions prompt the user to choose an article. Selecting any bar in the timeline opens an examination of what changed in that article on that date, as well as a summary of the final version of the article.

    With this final version of the timeline, we hoped to let users explore the contract negotiations, compare different versions of articles, and understand the back-and-forth nature of the negotiations.

    Stipends

    Both stipend-related charts are line charts, showing the change over time since our earliest data. As noted in the “Data” section above, the stipend data is erratic due to the nature of being self-reported. To mitigate this as much as possible, we averaged the stipend amounts by university or department, filtered out certain departments that only had a single point of data, and grouped together some departments that were within the same college at Northeastern.

    Line chart depicting the change of Boston-area PhD stipends since 2011

    The Boston-area university stipend chart shows how the average stipend for each university has changed since 2011. Horizontal lines indicate the Massachusetts poverty line and the Boston-area living wage for a single person in 2025, to provide context for how these stipends compare to cost of living in the area. In our first iteration of this visualization, users noted that it was not obvious that the data was averages for each university, so we added that to the title. Additionally, our original color scheme used several different bright hues for each university, but it was difficult to pick out Northeastern specifically. We changed the color scheme to have Northeastern in a bright red, while the other universities are in more muted hues, so that we could use the pop-out effect.

    Line chart depicting changes in Northeastern PhD stipends since 2013

    The Northeastern department-level stipend chart shows how average stipends have changed for different departments at Northeastern since 2013 (the earliest year we had department-level data for). In our first iteration, users asked for a way to filter by college in addition to seeing each department, and we were able to add that feature in our final iteration.

    Hover tooltip detail for the Boston-area university stipend chart

    Hovering over either chart opens up a tooltip displaying what the average pay was for that university or department during that year.

    Health Plan Benefits

    Our primary goal in comparing health plans was to express to student workers the differences in coverage and costs among the various plans available in the area. During the interview, the union workers had expressed that many students asked them about negotiating for dental and vision insurance, so we made sure to include those benefits in our comparison.

    Unit chart depicting what Boston area university health plans offer

    The health plan unit chart went through a few iterations, finally ending up with the icon overview on the left and the detail bar chart on the right. The icon unit chart is organized by university, showing included benefits above the center line and excluded benefits below. Icons represent each type of benefit and are also shown in the legend at the bottom right, added at our test users’ request. We initially used a green/red color scheme for its good/bad connotations, but given how we use red as a pop-out effect elsewhere, we changed it to blue/yellow on the advice of Prof. Borkin.

    Detail view of the health plan benefits tooltip

    Hovering over an icon shows what that specific university’s plan provides for that benefit, and clicking on it will open up the comparison bar chart showing how much is covered by each plan across all universities. For the bar chart, blue indicates a “good” benefit which exists above or below the mean (depending on the type of benefit chosen), such as lower cost co-pay or a larger amount covered by insurance.

    Hover detail for health plan benefit legend

    Hovering over each item on the legend also provides an explanation of the healthcare terms for those not familiar with healthcare plan language. The health plans can be filtered by provider as well, in case someone wants to see if their preferred provider is within network.

    Data Analysis

    Stipends

    Given the data we have, we can see that Northeastern’s 2025 overall average pay is slightly lower than other private universities in the area (due to its nature as a public university, UMass-Boston has a different pay scheme than other area universities). However, none of the stipends are near the calculated living wage of $63,942 for a single person in the Boston area, meaning almost all PhD students in the area would need to live with at least one other person. Stipends have overall gone up since 2011, and most are at least closer to the living wage than they are to the Massachusetts poverty line ($15,650 annual income).

    As for Northeastern-specific departments, the stipend amount varies depending on what college a program is in. For example, departments under the Khoury College of Computer Sciences tend to have higher stipends than programs in the College of Social Sciences and Humanities. In 2023, the average computer science stipend was approximately $50,000, while the average political science stipend was roughly $10k less, at approximately $40,000.

    However, given the inconsistencies of self-reported data drawn from a single site, this trend would require further examination.

    Health Plan Benefits

    While most health plans cover most things people would require, there are three benefits that are commonly excluded:

    • Birth control is only explicitly covered by Tufts’s and Harvard’s plans
    • Only MIT offers dental coverage as part of the primary plan. Tufts and Harvard offer optional dental plans.
    • Eyeglasses are only covered by the MIT and Tufts health plans. 

    Overall, it does seem like Northeastern’s health insurance is competitive with other area universities, though it lacks in the three areas listed above. Some of the plan pricing does rely on being in Blue Cross Blue Shield’s network, which might mean certain specialty care areas aren’t easily accessible in-network.

    Notably, the Tufts plan details that we were able to find are unclear about how much the plan covers. There are no listed co-pay or deductible amounts.

    Timeline

    When we first conceived of the timeline, we expected that bargaining and employment-related articles would have the most back-and-forth changes. For the most part this was correct, but one of the most surprisingly contentious issues was that of professional development (which we grouped under “Benefits”). This particular article was changed 8 times overall, alternating between the union and the university, before ending in its final form of offering up to $750 reimbursement per year for paid professional development opportunities with supervisor approval. Some of the other most back-and-forth article changes are for employment records, appointments and reappointments (both under “Employment (Requirements)”), travel (under “Employment (Rights)”), and the labor management committee (under “Union (General)”).

    Additionally, there are some articles that get introduced but never acknowledged or changed by the other party, particularly for certain benefits that the union proposed like tax assistance, retirement, relocation assistance, and housing support.

    Conclusion

    Following legal negotiations as a non-expert is a challenge, and we hope that our project helps bridge that gap for the average graduate student at Northeastern and helps them understand what this contract would mean for them.

    In future work, we would like to add some more visualizations for other sections, as well as pull in data for comparable non-Boston universities. In particular, we would like to gather information from our target audience on what they would want to see next, and reach out to other graduate student worker unions for copies of their contracts to compare. One in particular we would want to add came at the request of one of our test users, who asked about visa benefits for international students and what different area universities offered in terms of visa support.

    We also want to try using our benefit unit chart design for other types of benefits, like time off (vacation, sick, and leave of absence) or housing/relocation assistance. This was a design we particularly liked and would want to iterate on in future work.