TechPolicy.ca Data mining politics and public policy. The politics of data mining.

1 Feb 2010

Tracking the Press: MPs in Canada

In my last post, I discussed different approaches to social media mining. While I am currently working on complex approaches to mining information in blogs, newspapers, and other forms of social and news media, even simple approaches can yield interesting information.

For example, the graph below plots, for each Canadian Member of Parliament (MP), the number of articles that mention that MP against the number of other MPs mentioned in those same articles. If MP A is mentioned in 30 articles, and those articles between them mention two other MPs, then A is located at point (30, 2). Note that the graph covers posts over the week ending on January 31.
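The counting behind such a graph can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the graph: the function name `mention_points`, the sample articles, and the plain substring matching are all assumptions made for the example.

```python
from collections import defaultdict

def mention_points(articles, mp_names):
    """Return {mp: (articles mentioning mp, distinct other MPs co-mentioned)}."""
    article_counts = defaultdict(int)
    co_mentioned = defaultdict(set)
    for text in articles:
        # Naive assumption: an MP is "mentioned" if the full name appears verbatim.
        present = {name for name in mp_names if name in text}
        for name in present:
            article_counts[name] += 1
            co_mentioned[name] |= present - {name}
    return {name: (article_counts[name], len(co_mentioned[name]))
            for name in article_counts}

# Made-up sample articles, for illustration only.
articles = [
    "Stephen Harper and Michael Ignatieff debated the budget.",
    "Stephen Harper met with Jack Layton.",
    "Michael Ignatieff gave a speech.",
]
mps = ["Stephen Harper", "Michael Ignatieff", "Jack Layton"]
points = mention_points(articles, mps)
print(points["Stephen Harper"])  # (2, 2)
```

Each value is an (x, y) pair of the kind described above: articles mentioning the MP, and distinct other MPs appearing alongside.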

What's interesting about this graph is that it shows the centrality of MPs to political discussions. As one would expect, Stephen Harper is mentioned fairly often and in relation to many other MPs. The same is true for Michael Ignatieff. While we lose a great deal of information by not reading the articles themselves, it is instructive to see how observing the information in aggregate helps elucidate the underlying social and political structure of Canada.

Note that Stockwell Day is seemingly mentioned in a great many articles, but this is an artifact of the data collection process. Specifically, "day" is a common word that appears in articles whether or not they discuss the MP. I wanted to leave this data point in, however, to show how developing tools for press and media tracking is often more difficult than one would expect. The initial software for downloading newspaper or blog articles and counting words is seemingly straightforward to build. However, many practical hurdles often hamper the process. Differentiating between "Day" the politician and "day" the common noun is but one example.
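To make the "Day" versus "day" problem concrete, here is a small sketch comparing a case-insensitive surname match with a case-sensitive full-name match. The function names and the sample sentences are hypothetical, and full-name matching is only one possible mitigation: it would miss a mention such as "Mr. Day".

```python
import re

def naive_count(articles, surname):
    """Count articles containing the surname, ignoring case (over-counts 'day')."""
    pattern = re.compile(r"\b" + re.escape(surname) + r"\b", re.IGNORECASE)
    return sum(1 for text in articles if pattern.search(text))

def full_name_count(articles, full_name):
    """Count articles containing the exact, case-sensitive full name."""
    pattern = re.compile(r"\b" + re.escape(full_name) + r"\b")
    return sum(1 for text in articles if pattern.search(text))

# Made-up sample sentences, for illustration only.
articles = [
    "Stockwell Day announced new trade talks.",
    "It was a long day in the House of Commons.",
    "The bill passed on the last day of the session.",
]
naive = naive_count(articles, "Day")
strict = full_name_count(articles, "Stockwell Day")
print(naive, strict)  # 3 1
```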

28 Jan 2010

Social Media and Business Intelligence

To say that social media is important would be an understatement. Microblogging sites like Twitter, social networks like Facebook, blogs, and social news sites like Reddit all allow people to share their ideas, thoughts, and opinions nearly in real time. A very important question that any large business should ask is, "What role does social media play in our business intelligence strategy?"

Why Social Media?

Two major reasons for analyzing social media and incorporating it into your business intelligence strategy are that (1) it allows you to better understand various players in your markets, and (2) it helps warn you of any surprises -- think damage control.

In the first case, there are numerous examples of social media sites responding to announcements in near real-time. Numerous bloggers and Twitter users discussed the launch of Apple's iPad on January 27. Such responses are not limited to major product launches. Though the media traffic is often much lower, many press releases get discussed in blogs, and news stories are regularly circulated. At the very least, online forums and newsgroups -- especially those focused on investing in public corporations -- discuss seemingly mundane and banal details numerous times a day.

Knowing what the groups above are saying is important when designing products or exploring what customers are thinking. In the worst-case scenario, such monitoring helps with damage control. Monitoring the media and understanding what is taking place allows a company to deal with unforeseen events. An infamous example is Google's accidental posting of a six-year-old story about United Airlines going bankrupt. The mistake made the company's share price fall 75%. Monitoring social media for such blatant mistakes is worth the investment.

While this is an extreme example, knowing when customers are complaining or organizing themselves against (or in support of) your brand is crucial information. In such cases, an early warning -- even by a few hours -- could help with damage control.

A Social Media Mining Strategy

Numerous options exist for data mining social media. In my opinion, a good social media mining strategy would incorporate a number of data collection and analytical methods:

  • Web Crawling: you do not know what people are saying without visiting their websites. As such, the most important factor is actually visiting websites and logging what they say. One can make a list of relevant websites, or simply crawl the web and look for times when a specific company or brand is mentioned.
  • Link Tracing: with the web crawls above, one has access to two important pieces of information, (1) text of the sites, and (2) their hyperlinks. Mapping out which sites link to which other sites will inform you of how various people discussing your company actually link to each other and whether they are aware of each other.
  • Content Analysis: using the text of the sites themselves can inform you of what people say about your company, and which words they tend to associate with your brands. One can also automate the content analysis to provide near-real-time warnings about negative sentiment about a brand (though whether such warnings are accurate is still a controversial debate).
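The link tracing and content analysis steps above can be sketched on pages that have already been downloaded. This is a minimal illustration using only the Python standard library; the class `PageParser`, the function `analyze`, the brand "Acme", and the example URLs are all hypothetical, and the crawling step itself is left out.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser
from urllib.parse import urlparse
import re

class PageParser(HTMLParser):
    """Collect hyperlink targets and visible text from one HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_parts.append(data)

def analyze(pages, brand):
    """Return (site -> linked sites, words co-occurring with the brand)."""
    link_graph = defaultdict(set)
    co_words = Counter()
    for url, html in pages.items():
        parser = PageParser()
        parser.feed(html)
        parser.close()
        source = urlparse(url).netloc
        link_graph[source]  # make sure sites with no outgoing links appear
        for href in parser.links:
            target = urlparse(href).netloc
            if target and target != source:
                link_graph[source].add(target)
        words = re.findall(r"[a-z']+", " ".join(parser.text_parts).lower())
        if brand.lower() in words:
            co_words.update(w for w in words if w != brand.lower())
    return {site: sorted(targets) for site, targets in link_graph.items()}, co_words

# Made-up pages standing in for the result of a crawl.
pages = {
    "http://blog-a.example/post":
        '<p>Acme support was slow. See <a href="http://blog-b.example/review">this review</a>.</p>',
    "http://blog-b.example/review":
        "<p>Acme shipping was slow but fair.</p>",
}
link_graph, co_words = analyze(pages, "Acme")
print(link_graph)
print(co_words.most_common(3))
```

A real pipeline would also need stop-word removal and some notion of sentiment; this sketch only counts raw co-occurring words.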

Next Steps

If you are interested in learning more about social media mining, feel free to contact me.

24 Jan 2010

Rebranding the Blog

A number of recent developments have convinced me to rebrand this blog. Rather than focusing on research notes, I'm going to begin actively discussing data mining and politics.

There are a number of reasons for this change. I'm currently based in the UK, and a number of developments here have led me to want to share my views on these issues. First, the launch of data.gov.uk, alongside similar initiatives in my home town (Toronto), has made it clear that government-focused data analytics is important and gaining in popularity. Rumours are circulating of a national election in the UK in May, and I may be analyzing the data surrounding this. Furthermore, a great deal of governmental scrutiny is going into data mining, profiling, and tracking information about Internet users, travelers, customers, and so on.

A big question surrounding these developments is: "What does it all mean?" And this is what this blog will focus on. My research tries to explore such questions, but there's more to this issue than analyzing political discussions online. So stay tuned!

28 Nov 2009

What I’m Reading

I've been very slow in updating this blog, as I've been spending a lot of my time reading books and running experiments on my data. I figure it would be useful to outline the books I'm reading (but haven't quite finished) and see if they affect my research later.

Communication Power, by Manuel Castells. This book explores Castells' theory of power and how it is entrenched in control of the media and influencing how people think. Note that "power" is a social scientific term more closely associated with "influence" than with "coercion".

Micromotives and Macrobehavior by Thomas Schelling. While this is a popular science book from a while ago, it's quite relevant to my research. Indeed, I think anyone researching bloggers, or more broadly, running sociologically-focused agent-based models, should be reading this book. One of the key discussions in the book focuses on how it is difficult to use macro-level observations to discover micro-level agency.

Two books I am about to start are What Does China Think? and The Cult of Statistical Significance. Both look like fun books!

10 Nov 2009

Agent-Based Models of the Blogosphere

My upcoming dissertation will be focused on building mathematical models of bloggers in an effort to understand the social dynamics of the individuals themselves. There are a number of reasons why this is a useful strategy. By building agent-based (i.e. blogger-based) models, we can test to see if certain social strategies actually play a role in forming the blog network itself. Furthermore, if the models are very good, it might be possible to make predictions about the bloggers themselves.

One of the biggest challenges with building mathematical models is defining a scope for the model. First, one should decide what one is trying to predict. Next, one should decide how to go about making those predictions. This post focuses specifically on agent-based models, where each individual blogger or post is modeled. One assumes this agent makes its own decisions, such as who he or she will link to, or what he or she will write about. Using this information, we can start to get an idea of how the blogosphere in our model will work.
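As a toy illustration of what "each blogger makes its own decisions" can mean in code, the sketch below simulates a few bloggers who each decide, day by day, whether to post and whom to link to. Everything here is an assumption made for the example, not a description of my actual models: the function `simulate`, the posting probability, and the rule that already-linked blogs are more likely to be linked again.

```python
import random

def simulate(bloggers, days, post_prob=0.5, seed=0):
    """Simulate daily posting and linking decisions for a set of bloggers."""
    rng = random.Random(seed)
    in_links = {b: 0 for b in bloggers}
    links = []  # (day, source, target)
    for day in range(days):
        for blogger in bloggers:
            # Decision 1: does this blogger post today?
            if rng.random() >= post_prob:
                continue
            # Decision 2: whom to link to. Assumed rule: blogs with more
            # incoming links are proportionally more likely to be chosen.
            others = [b for b in bloggers if b != blogger]
            weights = [in_links[b] + 1 for b in others]
            target = rng.choices(others, weights=weights, k=1)[0]
            links.append((day, blogger, target))
            in_links[target] += 1
    return links, in_links

links, in_links = simulate(["a", "b", "c"], days=10)
print(len(links), in_links)
```

The output of such a simulation (who linked to whom, and when) is what then has to be compared against the real blogosphere.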

The difficulty with building such models is the validation portion of the process. We can build a simple model that states: "Every blogger will post once a day at noon, and will not link to anyone else." This model is obviously incorrect, but how incorrect is it? What is our measure of error? In other words, what are we trying to predict and how good are our predictions?
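One way to put a number on "how incorrect" the once-a-day model is would be to compare its predicted daily post counts with observed counts. The sketch below uses mean absolute error, which is just one possible choice of error measure; the blog names and counts are made up for illustration.

```python
def mean_absolute_error(predicted, observed):
    """Average absolute difference between predicted and observed counts."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)

def baseline_errors(observed_posts):
    """Error of the 'every blogger posts once a day' model, per blog."""
    return {
        blog: mean_absolute_error([1] * len(counts), counts)
        for blog, counts in observed_posts.items()
    }

# Made-up daily post counts over four days.
observed_posts = {
    "blog_a": [3, 0, 2, 1],
    "blog_b": [0, 0, 1, 0],
}
errors = baseline_errors(observed_posts)
print(errors)  # {'blog_a': 1.0, 'blog_b': 0.75}
```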

While this list is by no means exhaustive, there are three areas I've been exploring for defining error in models of the blogosphere: the macro, meso, and micro levels.

The Micro Level

At one end of the spectrum, we can try to predict the most minute details of the blogosphere. We can try to predict specific links (e.g. that "The Huffington Post" will link to "Wonkette" tomorrow), new posts (e.g. "Daily Kos" will post about Iraq next week), or new entrants (e.g. 30 new political blogs will be created in the next 12 hours). Such predictions are clearly useful, and error is easy to calculate as we simply need to check if our predictions were correct. In Topic-Link LDA: Joint Models of Topic and Author Community, the authors attempt to do this with interblog linking.

The biggest drawback with such specific predictions is the accuracy. Yes, predictions are easy to validate, but the accuracy of such predictions is extremely low. The models have a long way to go before one can actually trust them. Indeed, it might be that such models of blogger behaviour will never be trustworthy to a practical extent.
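Checking micro-level link predictions against what actually happened can be sketched with precision and recall, two standard measures for this kind of task. The function name and the predicted and actual links below are invented for illustration and do not come from any real data set or from the paper mentioned above.

```python
def precision_recall(predicted_links, actual_links):
    """Return (precision, recall) for a set of predicted directed links."""
    predicted, actual = set(predicted_links), set(actual_links)
    hits = predicted & actual
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(actual) if actual else 0.0
    return precision, recall

# Made-up (source, target) links for illustration.
predicted = [
    ("huffingtonpost", "wonkette"),
    ("dailykos", "wonkette"),
    ("wonkette", "dailykos"),
    ("dailykos", "huffingtonpost"),
]
actual = [
    ("huffingtonpost", "wonkette"),
    ("wonkette", "huffingtonpost"),
]
precision, recall = precision_recall(predicted, actual)
print(precision, recall)  # 0.25 0.5
```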

The Macro Level

At the opposite end of the spectrum, one looks at global properties of the blogosphere. Since predicting what individual bloggers will be doing is so difficult, why not try to predict larger-scale properties? For example, we might predict the number of blog posts likely to be posted on a given day, or the distribution of links in the blog network. While not blog-focused, a good example of such work is provided in Team Assembly Mechanisms Determine Collaboration Network Structure and Team Performance. In this paper, the authors predict the distribution of collaborations between Broadway actors. The actors themselves have specific agendas and make choices about who to work with, and the authors validate their model by seeing how their model predicts "aggregate collaboration" -- in other words, the distribution of collaboration counts on the Broadway scene.

While error rates for such models tend to be lower, a good question to ask is whether such models provide practical information. For example, if we know that only 5 Broadway actors will collaborate with more than 100 actors during their careers, how can we actually use this information? The same question applies to the blogosphere. If our model predicts that next month, only three blogs will receive more than 100 hyperlinks to their posts, what do we really know? Can we act on this information?
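At the macro level, validation means comparing distributions rather than individual links. The sketch below compares the in-link distribution of an observed network with that of a model-generated one using total variation distance. That distance is one of several reasonable choices and is an assumption here, as are the function names and the tiny example networks.

```python
from collections import Counter

def in_degree_distribution(edges, nodes):
    """Return {in-degree: fraction of nodes with that in-degree}."""
    in_degree = Counter(target for _, target in edges)
    counts = Counter(in_degree.get(node, 0) for node in nodes)
    return {degree: n / len(nodes) for degree, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Made-up networks over the same four blogs.
nodes = ["a", "b", "c", "d"]
observed_edges = [("a", "b"), ("c", "b"), ("d", "b"), ("a", "c")]
model_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]

observed_dist = in_degree_distribution(observed_edges, nodes)
model_dist = in_degree_distribution(model_edges, nodes)
distance = total_variation(observed_dist, model_dist)
print(distance)  # 0.75
```

A distance of 0 would mean the model reproduces the observed distribution exactly; 1 would mean the two distributions do not overlap at all.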

The Meso Level

If the micro level provides too little accuracy and the macro level doesn't provide enough "practical" (your mileage will vary) information, let's make a compromise. Specifically, what about making predictions about subgroups in the blogosphere? Can we predict that certain communities of bloggers will do certain things? Imagine if you knew that a certain community of bloggers will likely be more active than another community. If this community is the pro-life movement in the political blogosphere, then you certainly have useful information.

Two papers in this area are Correlation Profiles and Motifs in Complex Networks and Patterns of Cascading Behavior in Large Blog Graphs. In each case, the authors look at small subgraphs in the networks they model and try to see what happens with those subgraphs. Do they appear more often than is expected? Can one predict how often they appear?
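The question "do they appear more often than is expected?" can be illustrated with the simplest possible subgraph: a triangle of three mutually linked blogs. The sketch below counts triangles in a small network and compares the count with the average over random networks with the same number of nodes and links. It treats links as undirected and uses a uniform random baseline, both simplifications of my own and not the method of either paper.

```python
import random
from itertools import combinations

def count_triangles(edges):
    """Count triangles, treating every link as undirected."""
    neighbours = {}
    for a, b in edges:
        if a == b:
            continue
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return sum(
        1
        for x, y, z in combinations(sorted(neighbours), 3)
        if y in neighbours[x] and z in neighbours[x] and z in neighbours[y]
    )

def random_baseline(nodes, n_edges, trials=200, seed=0):
    """Average triangle count over random graphs with n_edges links."""
    rng = random.Random(seed)
    possible = list(combinations(sorted(nodes), 2))
    total = sum(count_triangles(rng.sample(possible, n_edges)) for _ in range(trials))
    return total / trials

# Made-up network of six blogs and seven links.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("c", "e"), ("e", "f")]

observed = count_triangles(edges)
expected = random_baseline(nodes, len(edges))
print(observed, expected)
```

If the observed count is well above the random average, the subgraph is over-represented, which is the kind of signal the meso-level approach looks for.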

I believe that such an approach has a great deal to offer those modeling political blogs and other networked data sets. While it might be very difficult to predict whether a specific edge will form, if one can instead predict what a group of 3 or 4 bloggers is likely to do, one still has a great deal of practical information. Whether such models have higher error rates remains, as far as my research shows, to be seen.

7 Nov 2009

Types of Bloggers

I recently gave a talk at Brunel University. It was about 40 minutes long and focused on my work in data mining the political blogosphere. While I won't discuss most of the work in this post, one area that really got me excited was categorizing bloggers by their posting habits.

I haven't done any formal work in this area yet, but I've plotted a few graphs showing how different bloggers post in the context of the 2008 US Presidential election. The graphs show the number of posts on the blog within a seven-day period. The election itself takes place around day 175.

What really jumps out at me with these graphs is that bloggers are very different, but that some intuitive categories exist. The obvious one is a very active blogger regardless of political context or external events. The graph below shows one such example.

[Graph: posts per seven-day period for a consistently active blogger (image: jpeg-9091-sm)]

Another example is a single issue blogger. These could include blogs focusing on the "Palin for VP" or "Clinton for President" campaigns. Such bloggers tend to be very active around the time when there is most hope for success in such a campaign, and activity drops off when the campaign shuts down or fails.

[Graph: posts per seven-day period for a single-issue blogger (image: jpeg-8196-sm)]

While the single issue blogger above seems to ramp up and then die down slowly, some bloggers have much more obvious swings in activity. An obvious "issue" is the presidential election itself. While very active blogs were active throughout the entire period, some of the less popular and less active blogs had a big increase in activity around November 2008. This is shown in the example below.

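[Graph: posts per seven-day period for a blogger whose activity spikes around the election (image: jpeg-7527-sm)]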

I'm not quite sure where this research can go, but I have a few ideas. My broader research focus is social influence in the blogosphere, and I imagine being able to categorize blogs using mathematical definitions based on the above would certainly help my work.
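As a first guess at what a "mathematical definition" of these categories might look like, the sketch below bins posts into seven-day periods and applies two simple rules. The category names, the thresholds (`active_share`, `burst_share`), and the sample posting days are all hypothetical; this is a starting point, not a validated classification.

```python
def weekly_counts(post_days, n_days):
    """Bin post days (0-based day indices) into seven-day periods."""
    counts = [0] * ((n_days + 6) // 7)
    for day in post_days:
        if 0 <= day < n_days:
            counts[day // 7] += 1
    return counts

def categorize(counts, active_share=0.8, burst_share=0.5):
    """Label a blog from its weekly post counts using assumed thresholds."""
    total = sum(counts)
    if total == 0:
        return "inactive"
    active_weeks = sum(1 for c in counts if c > 0)
    if active_weeks / len(counts) >= active_share:
        return "consistently active"
    if max(counts) / total >= burst_share:
        return "event-driven"
    return "intermittent"

# Made-up posting histories over a 28-day window.
steady = weekly_counts(list(range(28)), n_days=28)         # one post every day
bursty = weekly_counts([0, 21, 21, 22, 23], n_days=28)     # spike in the last week

print(steady, categorize(steady))  # [7, 7, 7, 7] consistently active
print(bursty, categorize(bursty))  # [1, 0, 0, 4] event-driven
```

Distinguishing a slow ramp-up and decline from a sharp spike would need a richer rule than this, for example one that looks at how many weeks surround the peak.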

3 Nov 2009

A New Blog

This coming year, I'll be studying at the Oxford Internet Institute (OII), where I'll be doing an M.Sc. in the Social Science of the Internet. While there will be some course work, my main focus will be on blog analytics, data mining, and technology policy and law. In addition to the work being done at the OII, I'll also be involved in projects related to my technology-focused non-profit organization, Five Minutes to Midnight, and exploring some business and writing opportunities during the year.

The goal of this blog is to document some of this work, as well as share thoughts and ideas on technology policy and the potential that data mining, network science, and machine learning have for public policy and law. Keep in touch!