Hot Twitter topics for GE16 candidates one week in

This post is timely given RTE’s recent publication of their analysis of social media usage during the 2016 general election in Ireland, available here. It looks like they’ve partnered with the ADAPT Centre for Digital Content Technology to produce the Twitter data, and with Facebook for the Facebook content. It’s a pity that they:

  1. don’t indicate the source of their data
  2. don’t indicate on what basis Tweets were deemed to be about the election (was it simply the presence of the #ge16 hashtag?)
  3. don’t indicate the basis for the topic codings
  4. don’t mention the sentiment analyser algo (state of the art is still pretty far from robust, often missing e.g. sarcasm)
  5. don’t indicate what constitutes a “mention” of a political party.

It would be great to take on each of these, but I only have time to examine the topics that were identified in online Twitter posts. Ranked by volume (presumably over tweets tagged with #ge16, but I don’t know) they are:

  1. Election News
  2. Finance
  3. Irish Water
  4. Healthcare
  5. Crime/Justice
  6. Housing
  7. Abortion
  8. Employment, Enterprise
  9. Education
  10. Transport

When I collect the data that I expect was used to prepare that ranking (#ge16-tagged posts) I will check to see if the topics suggested by the topic modeling algorithm LDA (Latent Dirichlet Allocation) match up with these. In the meantime, I can tell you what LDA makes of the tweets by candidates. Basically, this algo makes up its own mind about what topics are out there (it is unsupervised). Since it doesn’t really have a clue about the meaning of topics, we have to label them manually (topic interpretation) to make much sense of them. To do that, we look at the words that define a topic, in the eyes of the model. Sometimes they don’t cohere all that much. Plus, LDA might not be that robust with such small documents: tweets are pretty short, after all. But it’s a good first approximation.
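To make the labeling step a bit more concrete: a fitted topic is essentially a probability distribution over the vocabulary, and the "words that define a topic" are simply the most probable ones. Here is a minimal sketch of that idea. The function name `top_words` and the weights are made up for illustration; they are not output from my model.

```python
def top_words(topic, n=5):
    """Return the n most probable words of a topic (a word -> probability dict)."""
    ranked = sorted(topic.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# Hypothetical weights, invented for this example only.
toy_topic = {
    "water": 0.09,
    "charges": 0.07,
    "irish": 0.05,
    "the": 0.01,
    "protest": 0.04,
    "bills": 0.03,
    "today": 0.02,
}

print(top_words(toy_topic))
```

The human part comes afterwards: reading a short list like this and deciding whether it deserves a label such as "Water" or just "Junk".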

So here are the topics that LDA thinks that candidates tweet about (unranked) and my interpretation/label for what I think they capture. I assign them arbitrary numbers.

| Topic ID | Keywords that define the topic | My label |
| --- | --- | --- |
| 0 | @pb4p @senjohnwhelan government issues dublin | Leftist chat in Dublin |
| 1 | #limerick support use level @trevorocsf | Limerick |
| 2 | would need great know workers | Unclear - general plamas |
| 3 | ar party 2 one final | Vote No. 1 / No. 2 topic |
| 4 | campaign green launch meet clare | Greens launch campaign? |
| 5 | candidate support north @labour dublin | North Dublin general |
| 6 | @tonguelash every galway working great | @tonguelash weird account |
| 7 | don’t :) twitter https://t… free | Junk |
| 8 | @joanburton support please need now! | Burton support |
| 9 | election 2016 issue general via | High level general topic |
| 10 | #repealthe8th yes sure made us | Repeal the 8th |
| 11 | @carrie_smyth water win last #peoplesdebate | Water |
| 12 | best luck week love thanks | Gratitude/Hope |
| 13 | make manifesto @finegael launching fine | Fine Gael manifesto launch? |
| 14 | #sinnféin young people social time | SF (interesting connection with youth here) |
| 15 | #betterwithsf #nopegida sinn listen @gbayfm | SF Antifa chat |
| 16 | great #votegreen2016 canvassing today delighted | GP canvassing chat |
| 17 | signing times register vote last | Get on the register |
| 18 | well done thanks candidates much | More gratitude |
| 19 | tax good think serious believe | Questions of revenue raising |

Source: Twitter. Tweets by candidates in #ge16 as estimated by Storyful, from February 5th to February 11th (~ midnight EST)

Transparency

Data journalism is hard. It’s labour intensive and requires a degree of transparency of method that’s lacking here. Essentially the article on RTE News takes Facebook’s data at face value and announces “Fine Gael the most talked about party on Facebook”. Now, while that might accurately sum up the state of affairs online on Facebook (I don’t know, I deleted my account years ago), it obscures the complexity behind that predicate “talked about”. Does it count mentions of candidates? Parties by name? By Facebook page? By policy?

What it means for a party to be “talked about” is far more complex, and it does the public forum an injustice to release this data without a shred of indication of methodology, or even a PSA-style caveat acknowledging that this kind of data is subject to multiple interpretations.
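To illustrate how much the definition matters, here is a small sketch that counts "mentions" of one party under three different definitions. Everything in it is an assumption made for illustration: the four posts are invented, the regular expressions are my own guesses at plausible definitions, and none of this reflects how RTE or Facebook actually counted.

```python
import re

# Hypothetical posts, invented for this example only.
posts = [
    "Great canvass with Fine Gael in Galway today",
    "Enda Kenny was on the radio again this morning",
    "Thanks @FineGael for the leaflet, straight in the bin",
    "FG and Labour won't get my vote",
]

# Three possible definitions of a "mention" of one party (all assumptions).
definitions = {
    "party name": r"\bfine gael\b",
    "name, abbreviation or handle": r"\bfine gael\b|\bfg\b|@finegael\b",
    "name, abbreviation, handle or leader": (
        r"\bfine gael\b|\bfg\b|@finegael\b|\benda kenny\b"
    ),
}


def count_mentions(pattern, texts):
    """Count the texts in which the pattern occurs at least once."""
    regex = re.compile(pattern, re.IGNORECASE)
    return sum(1 for text in texts if regex.search(text))


counts = {label: count_mentions(p, posts) for label, p in definitions.items()}
for label, n in counts.items():
    print(f"{label}: {n} of {len(posts)} posts")
```

The same four posts yield 1, 3 or 4 mentions depending on which definition you pick, and one of those mentions is hardly an endorsement. That is the kind of choice a methodology note should spell out.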

For a good role model in methodological transparency, RTE should turn to the work of the Guardian’s data journalism team, or that of the New York Times or something like FiveThirtyEight. None of these outlets needs my inbound links, and each is far bigger than the team at RTE working on this, but a couple of hits from Montrose wouldn’t go amiss. This kind of reportage has the potential to be transformative for the state broadcaster in the right hands and given sufficient support.

~ New York, NY

Appendix - Code

I used gensim with code looking something like this:

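from collections import defaultdict
from datetime import datetime

import gensim
from nltk.corpus import stopwords  # needs a one-off nltk.download('stopwords')


# df is my big data source (Pandas); keep tweets created after February 5th
a = df[df['created_dt'] > datetime(2016, 2, 5)]

# lowercase every tweet
documents = [d.lower() for d in a.text]

# English stopwords plus Twitter noise and stray punctuation tokens
stoplist = set(stopwords.words('english'))
stoplist.update(['rt', "i'm", 'im', '#ge16', '&', 'must', '-', '.', "it's", '...', 'it.'])

# split on whitespace and drop stopwords
texts = [[word for word in document.split() if word not in stoplist] for document in documents]

# count how often each token occurs across the whole corpus
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# drop tokens that appear only once
texts = [[token for token in text if frequency[token] > 1] for text in texts]

# bag-of-words corpus, then a 20-topic LDA model
dictionary = gensim.corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]
lda = gensim.models.LdaModel(corpus, id2word=dictionary, num_topics=20)

# print the five top words per topic as table rows
for i in range(20):
    # current gensim returns (word, probability) pairs; older releases had the order reversed
    key_words = [word for word, _ in lda.show_topic(i, topn=5)]
    print("| {} | {} | ".format(i, " ".join(key_words)))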

Output:

| 0 | @pb4p @senjohnwhelan government issues dublin | 
| 1 | #limerick support use level @trevorocsf | 
| 2 | would need great know workers | 
| 3 | ar party 2 one final | 
| 4 | campaign green launch meet clare | 
| 5 | candidate support north @labour dublin | 
| 6 | @tonguelash every galway working great | 
| 7 | don't :) twitter https://t… free | 
| 8 | @joanburton support please need now! | 
| 9 | election 2016 issue general via | 
| 10 | #repealthe8th yes sure made us | 
| 11 | @carrie_smyth water win last #peoplesdebate | 
| 12 | best luck week love thanks | 
| 13 | make manifesto @finegael launching fine | 
| 14 | #sinnféin young people social time | 
| 15 | #betterwithsf #nopegida sinn listen @gbayfm | 
| 16 | great #votegreen2016 canvassing today delighted | 
| 17 | signing times register vote last | 
| 18 | well done thanks candidates much | 
| 19 | tax good think serious believe |
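A side note on this output: tokens such as now! and https://t… show up because the text is split on whitespace only, so punctuation and links survive as "words". One possible refinement, which is not part of the code above, is to strip links and edge punctuation while keeping hashtags and @-handles. The function name `tokenize`, the punctuation list and the sample tweet below are all hypothetical, so treat this as a sketch rather than a tested pipeline.

```python
import re

URL = re.compile(r"https?://\S+")
# Punctuation to trim from both ends of a token; '#' and '@' are not listed,
# so hashtags and handles keep their prefix.
EDGE_PUNCT = ".,!?:;\"'()[]…-"


def tokenize(text, stoplist=frozenset()):
    """Lowercase, remove links, trim edge punctuation and drop stopwords."""
    text = URL.sub(" ", text.lower())
    tokens = (raw.strip(EDGE_PUNCT) for raw in text.split())
    return [tok for tok in tokens if tok and tok not in stoplist]


# Invented tweet, for illustration only.
sample = "RT @joanburton: Support needed NOW! https://t… #ge16"
print(tokenize(sample, {"rt", "#ge16"}))
```

A side effect worth knowing about is that emoticons like :) disappear entirely, which may or may not be what you want if sentiment matters.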