In my latest research project, I investigated how social media analytics can predict elections. By analyzing millions of tweets about the 2020 US Presidential candidates, I discovered some interesting correlations between Twitter engagement and actual voting patterns.
Using natural language processing and machine learning techniques, I developed models that predict state-level electoral results with high accuracy. This work demonstrates how data and sentiment analysis can broaden our understanding of political behavior in the digital age.
My analysis revealed striking differences in how Twitter users perceived the candidates. Biden received significantly more positive tweets (39.1%) compared to Trump (34.4%), while Trump faced more negative sentiment (36.5% vs Biden's 25.9%). These sentiment patterns aligned remarkably well with the final election results.
The bar plot below shows the distribution of positive, neutral, and negative tweets for each candidate during the election period. The disproportionate volume of negative sentiment toward Trump is consistent with the final electoral outcome.
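As an illustration of how such sentiment shares can be derived, here is a minimal sketch that turns VADER-style compound scores into positive, neutral, and negative percentages. The cut-offs of +0.05 and -0.05 are the thresholds conventionally used with VADER; whether this project used the same ones is an assumption, and the scores in the example are made up rather than taken from the dataset.

```python
from collections import Counter

# Conventional VADER cut-offs on the compound score. Assumption: the
# write-up does not state which thresholds were actually used.
POS_THRESHOLD = 0.05
NEG_THRESHOLD = -0.05


def label_from_compound(compound: float) -> str:
    """Map a VADER-style compound score in [-1, 1] to a sentiment label."""
    if compound >= POS_THRESHOLD:
        return "positive"
    if compound <= NEG_THRESHOLD:
        return "negative"
    return "neutral"


def sentiment_shares(compound_scores: list[float]) -> dict[str, float]:
    """Return the percentage of tweets that fall into each sentiment label."""
    labels = ("positive", "neutral", "negative")
    total = len(compound_scores)
    if total == 0:
        return {label: 0.0 for label in labels}
    counts = Counter(label_from_compound(score) for score in compound_scores)
    return {label: round(100 * counts[label] / total, 1) for label in labels}


# Made-up compound scores for illustration only, not data from the study.
example_scores = [0.62, 0.0, -0.41, 0.30, 0.01, -0.07, 0.85, -0.02]
print(sentiment_shares(example_scores))
```

Running the same function once per candidate gives the three bars shown for that candidate.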
One of the best ways to see how the popular vote went is to look at a geographic heat map of the US. Here we have mapped, state by state, which candidate won the most votes, with red for the Republican nominee Donald Trump and blue for the Democratic nominee Joe Biden. The map shows strong regional variation: in states like New York and California a large share of the population voted for Biden, while Texas and Florida leaned slightly toward Trump. This electoral map serves as the ground truth against which our predictive models were evaluated.
My research mapped sentiment and engagement across the United States, revealing fascinating regional variations. As in the popular vote, coastal states like New York and California generally showed more positive sentiment toward Biden, while sentiment in central and southern regions was more mixed. The big surprise is Texas: the state has historically voted Republican, and did so again in the popular vote, yet its Twitter population leans toward Joe Biden. This might be because the sampled population differs from the actual electorate. In other words, people on Twitter in Texas might not fully represent the people who actually vote.
The analysis of Twitter engagement metrics (likes, retweets, and replies) showed that Biden consistently generated higher engagement than Trump across almost all states. Looking at the engagement metric alone, most of the country engages more with tweets about Biden. Note, however, that not all engagement is positive. For example, some people reply negatively to a tweet, and our metric still counts that as engagement.
We can see this in California, where the popular vote went to Joe Biden but the engagement metric favors Trump. Conversely, Texas and Florida are more engaged with tweets about Biden. One explanation might be that much of this engagement is not positive.
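To make the engagement metric concrete, the sketch below sums likes, retweets, and replies per state and candidate, then reports which candidate leads in each state. The unweighted sum and the field names (`state`, `candidate`, `likes`, `retweets`, `replies`) are assumptions made for illustration, and the sample rows are invented.

```python
from collections import defaultdict


def engagement_by_state(tweets):
    """Sum likes + retweets + replies for every (state, candidate) pair."""
    totals = defaultdict(int)
    for tweet in tweets:
        key = (tweet["state"], tweet["candidate"])
        totals[key] += tweet["likes"] + tweet["retweets"] + tweet["replies"]
    return dict(totals)


def leader_by_state(totals):
    """Return the candidate with the highest engagement total in each state."""
    best = {}
    for (state, candidate), score in totals.items():
        if state not in best or score > best[state][1]:
            best[state] = (candidate, score)
    return {state: candidate for state, (candidate, _) in best.items()}


# Invented rows for illustration only, not data from the study.
sample_tweets = [
    {"state": "NV", "candidate": "Biden", "likes": 10, "retweets": 4, "replies": 1},
    {"state": "NV", "candidate": "Trump", "likes": 3, "retweets": 1, "replies": 6},
    {"state": "NV", "candidate": "Biden", "likes": 2, "retweets": 0, "replies": 0},
    {"state": "CA", "candidate": "Trump", "likes": 8, "retweets": 5, "replies": 9},
    {"state": "CA", "candidate": "Biden", "likes": 7, "retweets": 2, "replies": 1},
]

totals = engagement_by_state(sample_tweets)
print(leader_by_state(totals))
```

Because replies are added in regardless of their tone, a candidate can lead on this metric in a state where much of the reaction is hostile, which is the caveat raised above.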
Looking at the top states by engagement volume reveals a consistent pattern: Biden outperformed Trump in 9 out of 10 states analyzed. New York showed the largest absolute gap, with Biden generating 1.1M engagements compared to Trump's 782.9K, a difference of roughly 40%.
While California was the exception where Trump (194.1K) slightly exceeded Biden's engagement (183.0K), the battleground states told a compelling story. In Pennsylvania, Biden led with 40.1K to Trump's 34.7K, while Nevada showed the most dramatic relative difference, with Biden's 39.9K engagements far surpassing Trump's 11.4K. In these battleground states, the engagement lead pointed in the same direction as the eventual electoral outcome.
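The relative gaps behind these figures can be checked with a few lines of arithmetic. The sketch below uses only the engagement counts quoted in the text (in thousands); note that New York's 1.1M is a rounded value, so its percentage is approximate.

```python
def relative_lead(leader: float, trailer: float) -> float:
    """Percentage by which `leader` exceeds `trailer`."""
    return 100 * (leader - trailer) / trailer


# Engagement counts quoted in the text as (Biden, Trump), in thousands.
# New York's Biden figure is the rounded 1.1M from the write-up.
engagement = {
    "New York": (1100.0, 782.9),
    "California": (183.0, 194.1),
    "Pennsylvania": (40.1, 34.7),
    "Nevada": (39.9, 11.4),
}

for state, (biden, trump) in engagement.items():
    if biden >= trump:
        print(f"{state}: Biden leads by {relative_lead(biden, trump):.1f}%")
    else:
        print(f"{state}: Trump leads by {relative_lead(trump, biden):.1f}%")
```

On these numbers New York comes out at about 40.5%, while Nevada's gap is far larger in relative terms because Trump's base there is small.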
Looking at the correlation plots, we can see a negative relationship between the number of engagements with tweets concerning Trump and his relative vote share. That is, states with lower engagement with Trump-related tweets tend to show a higher relative vote share for him, which underlines the fact that not all engagement is positive. The same cannot be said for Biden: there, a higher engagement score goes with a higher relative vote share.
On the other hand, the correlation between sentiment score and relative vote share is positive for both candidates, with some outliers.
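For readers who want to reproduce this kind of check, the following is a minimal sketch of a Pearson correlation between a per-state engagement score and relative vote share. Using Pearson's r is an assumption, since the text does not say which correlation measure the plots are based on, and the numbers are invented to show a negative relationship.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented per-state values for illustration only, not data from the study.
engagement_score = [1.0, 2.0, 3.0, 4.0, 5.0]
relative_vote_share = [0.60, 0.55, 0.50, 0.48, 0.40]

r = pearson_r(engagement_score, relative_vote_share)
print(f"r = {r:.2f}")  # a negative r means more engagement goes with lower vote share
```

A value near -1 or +1 indicates a strong linear relationship. With only about fifty states, a handful of outliers can shift r noticeably.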
Finally, to combine the engagement and sentiment metrics in machine learning models, I trained multiple algorithms on the data. Before training, I divided the data into training and testing sets (80/20), and I found some remarkable results.
The ensemble methods (Random Forest, Gradient Boosting, and a Tuned Model) achieved perfect accuracy, correctly predicting the winner in every state in the test set. Even the simpler Logistic Regression model performed exceptionally well with 93.33% accuracy.
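The evaluation setup can be sketched in plain Python: a reproducible 80/20 hold-out split and an accuracy score over predicted state winners. This is not the project's actual code. The helper names, the seed, and the example labels are hypothetical, and the models themselves are left out.

```python
import random


def train_test_split_80_20(rows, seed=42):
    """Shuffle rows reproducibly and hold out the last 20% as a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of rows whose predicted winner matches the actual winner."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical row identifiers standing in for per-state feature rows.
rows = list(range(50))
train_rows, test_rows = train_test_split_80_20(rows)
print(len(train_rows), len(test_rows))

# Hypothetical labels: two of three predictions are correct.
actual = ["Biden", "Trump", "Biden"]
predicted = ["Biden", "Trump", "Trump"]
print(f"{accuracy(actual, predicted):.2%}")
```

With a test set this small, each misclassified row moves the score a lot. As a purely arithmetic illustration, one error among 15 test rows gives 14/15, or about 93.33%. The size of the actual test set is not stated here.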
My research employed rigorous computational methods to analyze Twitter data:
Data Collection: I obtained a comprehensive dataset of tweets containing hashtags related to Donald Trump and Joe Biden, including engagement metrics and geographic information.
Preprocessing: I normalized all text data through case standardization, removal of extraneous content, tokenization, and lemmatization to prepare it for analysis.
Sentiment Analysis: I utilized the VADER framework, specifically designed for social media content, to quantify the emotional tone of tweets about each candidate.
Engagement Metrics: I analyzed likes, retweets, and replies at both candidate-specific and state-level aggregations to measure public interaction with content.
Machine Learning: I implemented multiple predictive models with engineered features from sentiment and engagement data to forecast electoral outcomes.
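The preprocessing step listed above can be sketched roughly as follows, using only the standard library. Treating URLs, @-mentions, and the hash symbol as the "extraneous content" is an assumption, and lemmatization is left out because it needs an external NLP library such as NLTK or spaCy.

```python
import re

# Patterns for content assumed to be "extraneous" in a tweet.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# A token is a run of lowercase letters, digits, or apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def preprocess(tweet: str) -> list[str]:
    """Lowercase a tweet, strip URLs and mentions, and split it into tokens."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")  # keep the hashtag word, drop the symbol
    return TOKEN_RE.findall(text)


# Invented example tweet for illustration only.
tokens = preprocess("Voting for @JoeBiden today! #Election2020 https://example.com/x")
print(tokens)
```

One design note: VADER uses capitalization and punctuation as intensity cues, so in practice it is often run on lightly cleaned text, with heavier normalization like this reserved for the features fed to the machine learning models. Which order this project used is not stated.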
This research demonstrates that systematic analysis of Twitter sentiment and engagement can yield substantial predictive power for electoral outcomes. The perfect accuracy achieved by ensemble machine learning models suggests that digital engagement patterns may serve as valuable indicators of political trends and voter behavior.
The implications extend beyond electoral predictions, suggesting new methods for understanding public opinion, tracking political sentiment in real-time, and potentially enhancing democratic processes through data science.
For more details, access the full research paper here.