To identify standout candidates, I devised a 'Performance' metric by calculating the difference between each district's Partisan Voter Index (PVI) and the candidate's electoral margin in 2022. PVI measures how partisan a district is compared to the nation as a whole, based on how its voters voted in recent presidential elections. This approach identified candidates who significantly outperformed their district's typical partisan lean.
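As a rough sketch of this metric, assuming both the PVI lean and the 2022 margin are expressed as signed points from the candidate's party's perspective (the sign convention, the function name, and the district labels and numbers below are all made up for illustration, not real results):

```python
def performance(pvi_lean: float, margin: float) -> float:
    """Points by which a candidate beat the district's partisan lean.

    Assumption of this sketch: both inputs are signed from the perspective
    of the candidate's party, so +3.0 means the party is favored (or won)
    by 3 points and -3.0 means the opposite.
    """
    return margin - pvi_lean


# Hypothetical districts with illustrative numbers, not real 2022 results.
districts = {
    "XX-01": {"pvi_lean": -5.0, "margin": 1.0},
    "XX-02": {"pvi_lean": 2.0, "margin": 4.5},
    "XX-03": {"pvi_lean": -1.0, "margin": -3.0},
}

scores = {
    name: performance(d["pvi_lean"], d["margin"]) for name, d in districts.items()
}

# Rank districts from biggest overperformance to biggest underperformance.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Under this convention, a candidate who wins by 1 point in a district that leans 5 points toward the other party scores +6, which is the kind of gap the metric is meant to surface.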

Of the top 18 overperforming candidates indicated in the graph above by district title, I narrowed my focus to first-time candidates to avoid any influence of incumbency effects. Mary Peltola from Alaska was also excluded due to the state's use of Ranked Choice Voting, which, while I am personally a fan of RCV, complicates direct comparison of candidates in this context.

That left me with 6 candidates to consider, all having overperformed their districts' partisan lean by at least 5 points. The following 4 candidates greatly overperformed in their districts, but were eliminated from consideration for various reasons:

Emilia Sykes would have been fun to analyze (and I love her glasses), but she deleted her campaign account following the election. Adam Frisch, who just barely fell short of victory in CO-03, was initially a candidate of interest, but was excluded due to the sheer volume of his tweets, which, thanks to Elon Musk's recent termination of free API access for Twitter, made data collection too labor-intensive.

Chosen Candidates:

Ultimately, I found myself drawn to the candidate who arguably pulled off the biggest flip of the midterms. Her unique campaign and distinctive messaging strategy provided ample material for analysis, ultimately leading me to...

Marie Gluesenkamp Pérez! She faced cuckoo-bird Joe Kent, who expressed extreme views such as supporting the arrest of Dr. Anthony Fauci and endorsing claims of a stolen 2020 election. In fact, he became the Republican candidate for WA-03 after successfully primarying the sitting Republican Congressperson, Jaime Herrera Beutler, one of only 10 House Republicans who voted to impeach Donald Trump following the events of January 6th.

The next candidate I wanted to assess took a little more research to come to a decision, but I wanted to find a Democrat who overperformed in their district, while contending against an opponent who was a more mainstream Republican. I landed on...

Chris Deluzio! He competed in a pure toss-up district and significantly overperformed against Jeremy Shaffer, who notably tried to sidestep affirming or denying the 2020 election fraud claims, and even released an ad promising to "protect women's healthcare."

Tweet Collection

As mentioned before, the termination of free API access meant manually compiling tweets for Chris Deluzio and Marie Gluesenkamp Pérez, and then using a custom parsing script to organize and format these tweets into a structured dataset for analysis. Tweets were manually copied, separated by a pipe '|' delimiter, and then organized into a corpus of around 1000 total tweets.
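The parsing script itself isn't shown, so the following is only a minimal sketch of what such a step could look like. The function name, the record fields, and the sample text are hypothetical, and it assumes no tweet contains a literal '|' character:

```python
def parse_tweets(raw: str, candidate: str) -> list[dict]:
    """Split a pipe-delimited blob of copied tweets into one record per tweet."""
    records = []
    for chunk in raw.split("|"):
        # Collapse line breaks and runs of whitespace left over from copy-paste.
        text = " ".join(chunk.split())
        if text:  # skip empty pieces from trailing or doubled delimiters
            records.append(
                {"candidate": candidate, "tweet_id": len(records), "text": text}
            )
    return records


# Made-up sample text, not actual tweets from either campaign.
raw_blob = (
    "Knocking doors in Vancouver today!\nJoin us. "
    "| Proud to stand with working families. ||"
)
rows = parse_tweets(raw_blob, candidate="example_candidate")
```

Each record keeps the candidate label alongside the text, so the two candidates' tweets can live in one dataset and be filtered later.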

LDA with TF-IDF

Latent Dirichlet Allocation (LDA) on Term Frequency-Inverse Document Frequency (TF-IDF)

As a baseline, I ran Latent Dirichlet Allocation (LDA) on Term Frequency-Inverse Document Frequency (TF-IDF) features to analyze my candidates' tweets. TF-IDF measures the importance of a word in a document (a tweet) relative to the corpus (the collection of all tweets from the campaign season). However, with only around 1000 already-short tweets, LDA's effectiveness may be limited, so I treated it strictly as a baseline for comparison.
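To make the TF-IDF idea concrete, here is a small sketch using the textbook formula (term frequency times the log of inverse document frequency). Library implementations typically use smoothed or normalized variants, so this is not necessarily the exact weighting used in the analysis, and the sample tweets are made up:

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: in how many tweets does each word appear at least once?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {
                word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
                for word, count in counts.items()
            }
        )
    return weights


# Made-up, pre-tokenized tweets for illustration only.
tweets = [
    ["join", "us", "to", "knock", "doors"],
    ["join", "the", "volunteer", "team"],
    ["protect", "our", "rights", "join", "us"],
]
scores = tf_idf(tweets)
```

In this toy corpus "join" appears in every tweet, so its weight drops to zero, while a word that shows up in only one tweet, like "doors", gets the highest weight in that tweet.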

LDA uses these term weights to look for words that tend to appear together and groups them into topics. It's up to the user to interpret these topics and discern underlying patterns.

Sorting Marie Gluesenkamp Pérez's (MGP's) tweets into 5 topics produced the following keyword associations for each topic:

It seems like Topic 1 involves canvassing and GOTV messaging, with terms like "volunteer", "join", "doors", and "Vancouver" (a big population center in the district, where running up turnout numbers would be important to win). The other topics' words offer some hints at overarching themes, but they are not as easy to discern as the first topic.

TF-IDF scores words based on frequency and rarity, then LDA identifies topics based on these scores. When determining topics, it assigns each word a weight indicating its importance to the topic. To demonstrate this concept, below is a graph showing the weights of the top words in MGP's first topic.

Now, this is all well and good, but it is a baseline model, so let's not dive too deep into it and see if we can go ahead and up the ante a bit with more complex modeling.

Twitter-Trained GloVe Embeddings with NMF

GloVe (Global Vectors for Word Representation)

GloVe is an unsupervised learning algorithm developed by a team at Stanford. It can be trained on any corpus, but the pre-trained GloVe model I used was trained on 2 billion tweets, which matters for a few reasons. GloVe learns from word-word co-occurrence rates, so my model reflects specifically how words are used together on Twitter. Compared with the corpora normally used for text classification, Twitter is not newspaper articles, books, or technical journals, so the word-word co-occurrence rates that develop on Twitter are, to a large degree, shaped by the character limit itself! The language is also more vernacular, and tweets are designed to be shared, commented on, and interacted with. It's just a different semantic universe from other corpora.
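Pre-trained GloVe vectors are distributed as plain text, one word per line followed by its vector components. The loader below is a sketch of reading that format. The function name is hypothetical, and the in-memory sample uses made-up 3-dimensional vectors in place of the real file (assumed here to be something like `glove.twitter.27B.100d.txt`):

```python
import io


def load_glove(lines, vocab=None):
    """Parse GloVe's text format: one 'word v1 v2 ... vN' entry per line.

    If `vocab` is given, only words in that set are kept, which saves
    memory when the corpus uses a small slice of the full vocabulary.
    """
    vectors = {}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if vocab is not None and word not in vocab:
            continue
        vectors[word] = [float(v) for v in values]
    return vectors


# Tiny in-memory stand-in for the real file, with made-up values.
sample = io.StringIO(
    "follow 0.1 0.2 0.3\n"
    "mention 0.1 0.25 0.28\n"
    "bridge -0.4 0.9 0.0\n"
)
embeddings = load_glove(sample, vocab={"follow", "bridge"})
```

With the real download, the same function could be called on an open file handle instead of the `StringIO` object.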

So, given all these aspects of Twitter language, I used a model that maps every word to a 100-dimensional vector. Word embeddings place words that appear in similar contexts close together, so semantically related terms end up with similar vectors, whereas the TF-IDF used in my baseline model treats every distinct word as its own unrelated feature, regardless of meaning.
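One common way to see what a dense embedding adds is cosine similarity between word vectors. The sketch below uses made-up 3-dimensional vectors rather than real GloVe values, purely to show the computation:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors; real GloVe vectors here would have 100 dimensions.
toy = {
    "follow": [0.9, 0.1, 0.3],
    "mention": [0.8, 0.2, 0.35],
    "bridge": [-0.2, 0.9, -0.1],
}

sim_related = cosine_similarity(toy["follow"], toy["mention"])
sim_unrelated = cosine_similarity(toy["follow"], toy["bridge"])
```

In a bag-of-words representation like TF-IDF, any two distinct words are simply different columns, so there is no equivalent notion of one pair being closer than another.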

Non-Negative Matrix Factorization

Non-Negative Matrix Factorization (NMF) is a technique that decomposes high-dimensional datasets into lower-dimensional, non-negative components. Compared to LDA on TF-IDF, NMF can handle denser data representations like GloVe embeddings more naturally, leveraging the semantic information embedded in word vectors. TF-IDF was like sorting through a giant word salad and counting the words that appear, but NMF with Twitter-trained GloVe vectors knows that terms like 'Follow' and 'Mention' have related meanings in this semantic universe. This leads to better grouping and more interpretable, distinct topics.
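For intuition about the factorization itself, here is a bare-bones sketch of NMF using the classic multiplicative update rules for the squared-error objective. The original analysis presumably relied on a library implementation, so this is illustrative rather than a reproduction, and the input matrix is made up:

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def squared_error(v, w, h):
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    )


def nmf(v, k, iters=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (n x m) into w (n x k) times h (k x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h


# Made-up 4 x 3 matrix: rows stand in for tweets, columns for features.
v = [
    [1.0, 0.9, 0.0],
    [0.8, 1.0, 0.1],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
]
w, h = nmf(v, k=2)

# Each row of w holds one tweet's loading on each of the k topics.
dominant_topic = [max(range(2), key=lambda t: row[t]) for row in w]
```

Because the updates only ever multiply by non-negative ratios, `w` and `h` stay non-negative throughout, which is what lets each row of `w` be read as how strongly a tweet loads on each topic.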

Process

After some limited pre-processing, each word within the tweets was converted into a 100-dimensional vector using the GloVe model. The word vectors were averaged to produce a single vector that represents each tweet. These tweet vectors were stacked into a matrix, which served as the input for the NMF model to break down into associated topics. Given the non-negativity constraint inherent in NMF, absolute values of the tweet vectors were used to ensure all inputs were non-negative. (I also tried shifting the vector values so they all sit in positive space, but it didn't yield a noticeable improvement in the resulting topics.)
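The averaging and absolute-value steps can be sketched as follows. The function name and toy embeddings are hypothetical, and skipping words that are missing from the embedding vocabulary is an assumption of this sketch, since the text does not say how such words were handled:

```python
def tweet_vector(tokens, embeddings, dim):
    """Average the vectors of known words, then take absolute values for NMF."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim  # no known words: fall back to a zero vector
    mean = [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
    # NMF requires non-negative input, so take the absolute value of each entry.
    return [abs(x) for x in mean]


# Made-up 3-dimensional embeddings; the real ones are 100-dimensional.
toy_embeddings = {
    "join": [0.2, -0.4, 0.6],
    "us": [0.0, -0.2, 0.2],
    "vote": [-0.5, 0.5, 0.1],
}
tweets = [["join", "us", "saturday"], ["vote"], ["unknownword"]]

# One row per tweet: this stacked matrix is what would be handed to NMF.
matrix = [tweet_vector(t, toy_embeddings, dim=3) for t in tweets]
```

Note that taking absolute values folds opposite directions onto each other (a component of -0.5 and one of +0.5 become indistinguishable), which is one reason to compare it against the shifting approach mentioned above.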

1. MGP Topic 1 -- "Voice for Working Class"
2. MGP Topic 2 -- "Digital & Community Engagement"
3. MGP Topic 3 -- "Endorsements & Policy Priorities"
4. MGP Topic 4 -- "Voter Mobilization Efforts"
5. MGP Topic 5 -- "Anti-Extremism"
6. MGP Topic 6 -- "Volunteer & Fundraising"
7. MGP Topic 7 -- "Defending Rights & Freedoms"

1. Deluzio Topic 1 -- "Union Solidarity & Local Empowerment"
2. Deluzio Topic 2 -- "Reproductive Rights & Fighting Extremism"
3. Deluzio Topic 3 -- "Community Events"
4. Deluzio Topic 4 -- "Jobs & Infrastructure"
5. Deluzio Topic 5 -- "Advocacy & Community Solidarity"
6. Deluzio Topic 6 -- "Corporate Greed & Economic Fairness"
7. Deluzio Topic 7 -- "Defending Rights & Democracy"

Samuel Forrest Williams

Winning in Trump Country
