TechPolicy.ca Data mining politics and public policy. The politics of data mining.

17 Aug 2010

Information Flow and Arbitrage in the Political Blogosphere

As some of you may know, I recently submitted my dissertation for the MSc in Social Science of the Internet at the Oxford Internet Institute. The dissertation is still being graded, so I don't want to post it here just yet. However, the title of the dissertation is "Information Flow and Arbitrage in the Political Blogosphere" and the abstract is below. E-mail me if you'd like to discuss it or get a copy of the dissertation.

Over the last decade, political blogging has grown significantly and is now a popular form of political engagement and information collection. This dissertation explores the political blogosphere in the context of the 2008 US Presidential election. From May 2008 to April 2009, 16,741 blogs were crawled on a daily basis, with their content and hyperlinks stored and analyzed. Using this data and social network analysis, the dissertation analyzes the flow of information through the blogosphere. Through a number of network-based methodological approaches, it is shown that the political blogosphere is organized in a core-periphery structure, with popular, elite bloggers organized in the core. The core itself is fragmented, composed of tightly-knit communities whose members are mutually aware of each other, and information does not easily flow between these communities. The periphery acts as a bridge, however, with information flowing to peripheral bloggers from multiple communities.

Through a node-level analysis, the dissertation further concludes that one can define different forms of influence and precursors to influence. Using a statistical model that controls for in- and out-degree distributions of the network, this dissertation is able to identify bloggers who act as information arbiters within their personal networks. A striking finding of the research is that bloggers in the core are no better at information arbitrage, or at introducing their personal networks to new sources of information, than bloggers in the periphery; being a popular blogger entrenched in an elite community does not make it easier to promote new sources of information.

Furthermore, this thesis makes significant contributions to methodological work in social network analysis. It provides new approaches to analyzing longitudinal network data, and explores new statistical models for analyzing the significance of different social mechanisms in a network.

28 Jun 2010

Data Mining for Development: Sales Pitch

Over the last few weeks, I've been rekindling my interest in mathematics and international development. I studied both subjects during my undergraduate degree, and have kept trying to figure out a way to combine the two. I'm hoping to spend some time over the next two months running pilot projects in this area to see how well data mining, artificial intelligence, and statistics can be used to help development agencies, non-profits, and related institutions to do their work. Below is the rough draft of a short sales pitch I am working on in this area. E-mail me or comment if you are interested in learning more.

Data Mining for Development: Proposal

Organizations face two major data-related challenges: (1) how to collect meaningful data about their work, and (2) how to improve their operations and impact using that data. We’re here to help.

Today, non-profit organizations and social enterprises work in a challenging environment. With governments focusing on austerity measures and funding bodies receiving lower returns on their financial investments, non-profit organizations are continually facing an uphill battle. They are constantly being encouraged to improve how they operate and prove that they are having a positive effect.

Data Mining for Development (DM4D) is a non-profit project that aims to help organizations and social enterprises achieve their objectives. Composed of a team of graduate students, researchers, and professionals trained in mathematics, statistics, and computer science, DM4D helps organizations in a number of ways:

  1. DM4D helps organizations decide how to collect data and run project evaluations at a lower cost. We do this by finding ways to automate the data collection process, and by developing indicators for variables that are difficult to measure.
  2. DM4D researchers are familiar with extracting meaningful information from thousands of documents at a time. Organizations produce countless reports, e-mails, and articles, and we can help make sense of such data.
  3. DM4D builds mathematical models to help predict how the work environments of non-profit organizations are changing, so they can better prepare for unexpected events and changes coming in the future.

Collecting and understanding large amounts of information and data is difficult and expensive. However, doing so can help an organization achieve its mission and objectives. DM4D is here to provide services and advice to help organizations leverage their data.

For more information, please contact Wojciech Gryc at wojciech@gmail.com.

28 Jan 2010

Social Media and Business Intelligence

To say that social media is important would be an understatement. Microblogging sites like Twitter, social networks like Facebook, blogs, and social news sites like Reddit all allow people to share their ideas, thoughts, and opinions nearly in real time. A very important question that any large business should ask is, "What role does social media play in our business intelligence strategy?"

Why Social Media?

There are two major reasons for analyzing social media and incorporating it into your business intelligence strategy: (1) it allows you to better understand the various players in your markets, and (2) it warns you of surprises -- think damage control.

In the first case, there are numerous examples of social media sites responding to announcements in near real time. Numerous bloggers and Twitter users discussed the unveiling of Apple's iPad on January 27. Such responses are not limited to major product launches. Though the traffic is often much lower, many press releases get discussed in blogs, and news stories are regularly circulated. At the very least, online forums and newsgroups -- especially those focused on investing in public corporations -- discuss seemingly mundane details numerous times a day.

Knowing what the groups above are saying is important when designing products or exploring what customers are thinking. In the worst case, such monitoring helps with damage control: understanding what is taking place in the media allows a company to respond to unforeseen events. An infamous example is Google's accidental posting of a six-year-old story about United Airlines going bankrupt. The mistake made United Airlines' share price fall 75%, and monitoring social media for such blatant mistakes is worth the investment.

While this is an extreme example, knowing when customers are complaining or organizing themselves against (or in support of) your brand is crucial information. In such cases, an early warning, even by a few hours, could help with damage control.

A Social Media Mining Strategy

Numerous options exist for data mining social media. In my opinion, a good social media mining strategy would incorporate a number of data collection and analytical methods:

  • Web Crawling: you do not know what people are saying without visiting their websites. As such, the most important factor is actually visiting websites and logging what they say. One can make a list of relevant websites, or simply crawl the web and look for times when a specific company or brand is mentioned.
  • Link Tracing: with the web crawls above, one has access to two important pieces of information, (1) text of the sites, and (2) their hyperlinks. Mapping out which sites link to which other sites will inform you of how various people discussing your company actually link to each other and whether they are aware of each other.
  • Content Analysis: using the text of the sites themselves can inform you of what people say about your company, and which words they tend to associate with your brands. One can also automate the content analysis to provide near-real-time warnings about negative sentiment about a brand (though whether such warnings are accurate is still debated).
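
As a rough illustration of how the link tracing and content analysis steps might look for a single page, here is a minimal Python sketch. It assumes the page has already been downloaded (the crawling itself is left out), and the names `LinkAndTextParser` and `analyze_page`, the example URLs, and the brand "Acme" are all made up for this example. The "content analysis" here is only a naive mention count, not sentiment analysis.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkAndTextParser(HTMLParser):
    """Collects hyperlink targets and text from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Record the absolute URL of every <a href="..."> on the page.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Note: this also picks up script and style text, which a real
        # crawler would want to filter out.
        self.text_parts.append(data)


def analyze_page(base_url, html, brand):
    """Return the other sites a page links to and how often it names a brand."""
    parser = LinkAndTextParser(base_url)
    parser.feed(html)
    own_host = urlparse(base_url).netloc
    hosts = {urlparse(link).netloc for link in parser.links}
    text = " ".join(parser.text_parts).lower()
    return {
        "outbound_hosts": sorted(hosts - {"", own_host}),
        "mentions": text.count(brand.lower()),
    }


# Hypothetical page content, standing in for a crawled blog post.
page = (
    "<p>Acme is great. I love ACME.</p>"
    '<a href="http://other.example/post">a post</a>'
    '<a href="/about">about</a>'
)
result = analyze_page("http://blog.example/", page, "Acme")
print(result)
```

Running `analyze_page` over every crawled page gives both pieces of information mentioned above: a list of site-to-site links that can be assembled into a network, and a per-page count of brand mentions.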

Next Steps

If you are interested in learning more about social media mining, feel free to contact me.

10 Nov 2009

Agent-Based Models of the Blogosphere

My upcoming dissertation will focus on building mathematical models of bloggers in an effort to understand the social dynamics of the individuals themselves. There are a number of reasons why this is a useful strategy. By building agent-based (i.e. blogger-based) models, we can test whether certain social strategies actually play a role in forming the blog network itself. Furthermore, if the models are very good, it might be possible to make predictions about the bloggers themselves.

One of the biggest challenges with building mathematical models is defining a scope for the model. First, one should decide what one is trying to predict. Next, one should decide how to go about making those predictions. This post focuses specifically on agent-based models, where each individual blogger or post is modeled. One assumes this agent makes its own decisions, such as who he or she will link to, or what he or she will write about. Using this information, we can start to get an idea of how the blogosphere in our model will work.

The difficulty with building such models is the validation portion of the process. We can build a simple model that states: "Every blogger will post once a day at noon, and will not link to anyone else." This model is obviously incorrect, but how incorrect is it? What is our measure of error? In other words, what are we trying to predict and how good are our predictions?
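
To make the "how incorrect is it?" question concrete, here is a small Python sketch that scores the posting half of the naive model against observed data. The observed counts are invented for illustration, and mean absolute error is just one possible choice of error measure, not necessarily the right one.

```python
def baseline_posts_per_day(blogs, days):
    """The naive model: every blog posts exactly once per day."""
    return {blog: [1] * days for blog in blogs}


def mean_absolute_error(predicted, observed):
    """Average absolute gap between predicted and observed daily post counts."""
    total = 0
    n = 0
    for blog, observed_counts in observed.items():
        for p, o in zip(predicted[blog], observed_counts):
            total += abs(p - o)
            n += 1
    return total / n


# Made-up daily post counts for two hypothetical blogs over three days.
observed = {"blog_a": [3, 0, 2], "blog_b": [1, 1, 0]}
predicted = baseline_posts_per_day(observed, days=3)
error = mean_absolute_error(predicted, observed)
print(f"mean absolute error: {error:.3f} posts per blog per day")
```

Any more sophisticated model can then be judged by whether it beats this baseline on the same measure.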

While this list is by no means exhaustive, there are three areas I've been exploring for defining error in models of the blogosphere: the macro, meso, and micro levels.

The Micro Level

At one end of the spectrum, we can try to predict the most minute details of the blogosphere. We can try to predict specific links (e.g. that "The Huffington Post" will link to "Wonkette" tomorrow), new posts (e.g. "Daily Kos" will post about Iraq next week), or new entrants (e.g. 30 new political blogs will be created in the next 12 hours). Such predictions are clearly useful, and error is easy to calculate as we simply need to check if our predictions were correct. In Topic-Link LDA: Joint Models of Topic and Author Community, the authors attempt to do this with interblog linking.
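
Checking micro-level link predictions can be as simple as comparing two sets of links. The sketch below uses precision and recall, which is my choice of measure for illustration and not necessarily what the paper above uses; the predicted and actual links are made up.

```python
def score_link_predictions(predicted, actual):
    """Return (precision, recall) for predicted (source, target) links."""
    predicted, actual = set(predicted), set(actual)
    hits = predicted & actual
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical predictions for tomorrow's links, and what "actually" happened.
predicted_links = {
    ("huffingtonpost", "wonkette"),
    ("dailykos", "wonkette"),
    ("wonkette", "dailykos"),
}
actual_links = {
    ("huffingtonpost", "wonkette"),
    ("dailykos", "huffingtonpost"),
}
precision, recall = score_link_predictions(predicted_links, actual_links)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Precision says how many of the predicted links appeared; recall says how many of the links that appeared were predicted.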

The biggest drawback with such specific predictions is the accuracy. Yes, predictions are easy to validate, but the accuracy of such predictions is extremely low. The models have a long way to go before one can actually trust them. Indeed, it might be that such models of blogger behaviour will never be trustworthy to a practical extent.

The Macro Level

At the opposite end of the spectrum, one looks at global properties of the blogosphere. Since predicting what individual bloggers will be doing is so difficult, why not try to predict larger-scale properties? For example, we might predict the number of blog posts likely to be posted on a given day, or the distribution of links in the blog network. While not blog-focused, a good example of such work is provided in Team Assembly Mechanisms Determine Collaboration Network Structure and Team Performance. In this paper, the authors predict the distribution of collaborations between Broadway actors. The actors themselves have specific agendas and make choices about who to work with, and the authors validate their model by seeing how their model predicts "aggregate collaboration" -- in other words, the distribution of collaboration counts on the Broadway scene.
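
For a macro-level comparison in the blogosphere, one could compare the distribution of incoming links in a model's output against the observed network. The Python sketch below does this with total variation distance, which is an assumption on my part and not the validation method used in the paper above; both edge lists are invented.

```python
from collections import Counter


def in_degree_distribution(edges, nodes):
    """Fraction of nodes that have each in-degree (number of incoming links)."""
    in_degree = Counter(target for _, target in edges)
    counts = Counter(in_degree.get(node, 0) for node in nodes)
    return {degree: c / len(nodes) for degree, c in counts.items()}


def total_variation_distance(p, q):
    """Half the summed absolute difference between two distributions (0 to 1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


nodes = ["a", "b", "c", "d"]
# Made-up observed network: blog "b" attracts most of the links.
observed_edges = [("a", "b"), ("c", "b"), ("d", "b"), ("a", "c")]
# Made-up model output: links are spread evenly.
simulated_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]

observed_dist = in_degree_distribution(observed_edges, nodes)
simulated_dist = in_degree_distribution(simulated_edges, nodes)
distance = total_variation_distance(observed_dist, simulated_dist)
print(f"distance between in-degree distributions: {distance:.2f}")
```

A distance of 0 would mean the model reproduces the observed distribution exactly, even if it gets every individual link wrong.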

While error rates for such models tend to be lower, a good question to ask is whether such models provide practical information. For example, if we know that only 5 Broadway actors will collaborate with more than 100 actors during their careers, how can we actually use this information? The same question applies to the blogosphere. If our model predicts that next month, only three blogs will receive more than 100 hyperlinks to their posts, what do we really know? Can we act on this information?

The Meso Level

If the micro level provides too little accuracy and the macro level doesn't provide enough "practical" (your mileage will vary) information, let's make a compromise. Specifically, what about making predictions about subgroups in the blogosphere? Can we predict that certain communities of bloggers will do certain things? Imagine if you knew that a certain community of bloggers will likely be more active than another community. If this community is the pro-life movement in the political blogosphere, then you certainly have useful information.

Two papers in this area are Correlation Profiles and Motifs in Complex Networks and Patterns of Cascading Behavior in Large Blog Graphs. In each case, the authors look at small subgraphs in the networks they model and try to see what happens with those subgraphs. Do they appear more often than is expected? Can one predict how often they appear?

I believe that such an approach has a great deal to offer those modeling political blogs and other networked data sets. While it might be very difficult to predict whether a specific edge will form, if one can instead predict what a group of 3 or 4 bloggers is likely to do, one still has a great deal of practical information. Whether such models have higher error rates remains, as far as my research goes, to be seen.
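
To give a flavour of the "do these subgraphs appear more often than expected?" question, here is a Python sketch that counts one small pattern and compares it against randomized networks. The choice of pattern (A links to B, B links to C, and A also links to C), the edge-swap randomization, and the toy network are all my own assumptions for illustration; the two papers above use their own, more careful methods.

```python
import random


def count_transitive_triads(edges):
    """Count triples (a, b, c) with links a->b, b->c and a->c (no self-links)."""
    edge_set = set(edges)
    successors = {}
    for source, target in edge_set:
        successors.setdefault(source, set()).add(target)
    count = 0
    for a, b in edge_set:
        for c in successors.get(b, ()):
            if c != a and (a, c) in edge_set:
                count += 1
    return count


def degree_preserving_shuffle(edges, swaps, rng):
    """Randomize a network by swapping the targets of random edge pairs.

    Every blog keeps its number of incoming and outgoing links.
    """
    edges = sorted(set(edges))
    edge_set = set(edges)
    for _ in range(swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new_1, new_2 = (a, d), (c, b)
        # Skip swaps that would create a self-link or a duplicate link.
        if a == d or c == b or new_1 in edge_set or new_2 in edge_set:
            continue
        edge_set -= {edges[i], edges[j]}
        edge_set |= {new_1, new_2}
        edges[i], edges[j] = new_1, new_2
    return edges


# Made-up link network among four hypothetical blogs.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "a")]
observed = count_transitive_triads(edges)

rng = random.Random(0)  # fixed seed so the run is repeatable
null_counts = [
    count_transitive_triads(degree_preserving_shuffle(edges, 50, rng))
    for _ in range(200)
]
expected = sum(null_counts) / len(null_counts)
print(f"observed={observed}, average in randomized networks={expected:.2f}")
```

If the observed count sits well above the randomized average, the pattern is over-represented relative to what the degree distributions alone would produce.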

7 Nov 2009

Types of Bloggers

I recently gave a talk at Brunel University. It was about 40 minutes long and focused on my work in data mining the political blogosphere. While I won't discuss most of the work in this post, one area that really got me excited was categorizing bloggers by their posting habits.

I haven't done any formal work in this area yet, but I've plotted a few graphs showing how different bloggers post in the context of the 2008 US Presidential election. The graphs show the number of posts on the blog within a seven-day period. The election itself takes place around day 175.
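
For anyone wanting to produce similar series, the counts can be computed with a few lines of Python. This sketch assumes non-overlapping seven-day windows (a rolling window would work too), and the start date and post dates are invented for the example.

```python
from collections import Counter
from datetime import date


def posts_per_week(post_dates, start):
    """Count posts in consecutive seven-day windows beginning at `start`."""
    weeks = Counter((d - start).days // 7 for d in post_dates if d >= start)
    n_weeks = max(weeks) + 1 if weeks else 0
    return [weeks.get(week, 0) for week in range(n_weeks)]


# Hypothetical post dates for one blog; the start date is also an assumption.
start = date(2008, 5, 1)
post_dates = [
    date(2008, 5, 1),
    date(2008, 5, 3),
    date(2008, 5, 8),
    date(2008, 5, 22),
]
weekly_counts = posts_per_week(post_dates, start)
print(weekly_counts)
```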

What really jumps out at me with these graphs is that bloggers are very different, but that some intuitive categories exist. The obvious one is a very active blogger regardless of political context or external events. The graph below shows one such example.

[Figure jpeg-9091-sm: posts per seven-day period for a consistently active blogger]

Another example is a single issue blogger. These could include blogs focusing on the "Palin for VP" or "Clinton for President" campaigns. Such bloggers tend to be very active around the time when there is most hope for success in such a campaign, and activity drops off when the campaign shuts down or fails.

[Figure jpeg-8196-sm: posts per seven-day period for a single-issue blogger]

While the single issue blogger above seems to ramp up and then die down slowly, some bloggers have much more obvious swings in activity. An obvious "issue" is the presidential election itself. While very active blogs were active throughout the entire period, some of the less popular and less active blogs had a big increase in activity around November 2008. This is shown in the example below.

[Figure jpeg-7527-sm: posts per seven-day period for a blog whose activity spikes around November 2008]

I'm not quite sure where this research can go, but I have a few ideas. My broader research focus is social influence in the blogosphere, and I imagine being able to categorize blogs using mathematical definitions based on the above would certainly help my work.
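
As a first, very rough idea of what such a mathematical definition could look like, here is a Python sketch that labels a blog from its weekly post counts. The category names, the thresholds (`active_threshold`, `burst_ratio`), and the example series are all placeholders I made up; this is not a tested classification, and it does not yet separate single-issue bloggers from election-driven ones.

```python
from statistics import mean


def categorize_blogger(weekly_counts, active_threshold=10, burst_ratio=3.0):
    """Label a blog from its weekly post counts using crude, assumed thresholds."""
    average = mean(weekly_counts)
    peak = max(weekly_counts)
    if average == 0:
        return "inactive"
    # A peak far above the average suggests activity driven by an event or issue.
    if peak >= burst_ratio * average:
        return "bursty"
    if average >= active_threshold:
        return "consistently active"
    return "low activity"


# Invented weekly series for three hypothetical blogs.
examples = {
    "steady_blog": [20, 22, 19, 21],
    "election_blog": [1, 0, 2, 30, 1, 0],
    "quiet_blog": [1, 2, 1, 2],
}
labels = {name: categorize_blogger(counts) for name, counts in examples.items()}
print(labels)
```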