Diary of an internet geography project #4

Reblogged from ‘Connectivity, Inclusivity and Inequality’

Continuing our series of blog posts exposing the workings behind a multidisciplinary big data project, this week we talk about the process of moving between small data and big data analyses. Last week, we did a group deep dive into our data. Extending the metaphor: Shilad caught the fish and dumped them on the boat for us to sort through. We wanted to know whether our method of collecting and determining the origins of the fish was working, so we looked at a bunch of randomly selected fish up close. Working out how we would do the sorting was the biggest challenge. Some of us wanted really strict rules for identifying the fish: ‘small’ wasn’t a good enough description; better would be that small = 10-15cm diameter after a maximum of 30 minutes out of the water. Through this process we learned a few lessons about how to do this close looking as a team.

Step 1: Randomly selecting items from the corpus

We wanted to know two things about the data that we were selecting through this ‘small data’ analysis: Q1) Were we getting every citation in the article or were we missing/duplicating any? Q2) What was the best way to determine the location of the source?

Shilad used the WikiBrain software library, which he developed with Brent, to identify the roughly one million geotagged Wikipedia articles. He then collected all external URLs (about 2.9 million unique URLs) appearing within those articles and used this data to create two samples for coding tasks. He sampled about 50 geotagged articles (to answer Q1) and selected a few hundred random URLs cited within particular articles (to answer Q2).

  • Batch 1 for Q1: 50 documents, each containing an article title, URL, list of citations, and an empty list of ‘missing citations’
  • Batch 2 for Q2: a spreadsheet of 500 random citations occurring in 500 random geotagged articles.
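
As a rough illustration of how the two coding batches could be drawn, here is a minimal sketch in Python. It assumes the geotagged articles and their extracted external URLs are already available as a plain dictionary; the function name and data structure are hypothetical, since the actual sampling was done with the WikiBrain library.

```python
import random

random.seed(42)  # make the draw reproducible for the whole team

# article_urls: {article_title: [external URLs cited in that article], ...}
def draw_samples(article_urls, n_articles=50, n_citations=500):
    # Batch 1: whole articles, so coders can check for missing/duplicated citations.
    article_sample = random.sample(sorted(article_urls), n_articles)

    # Batch 2: one randomly chosen citation from each of n_citations random articles
    # (assumes every article in the dictionary has at least one external URL).
    citation_sample = [
        (title, random.choice(article_urls[title]))
        for title in random.sample(sorted(article_urls), n_citations)
    ]
    return article_sample, citation_sample

# Toy example: two articles with a couple of external links each.
toy = {"Mombasa": ["http://example.com/a", "http://example.org/b"],
       "Accra": ["http://example.net/c"]}
articles, citations = draw_samples(toy, n_articles=1, n_citations=2)
```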


Full disclosure: Diary of an internet geography project #3

Reblogged from ‘Connectivity, Inclusivity and Inequality’

In this series of blog posts, we are documenting the process by which a group of computer and social scientists are working together on a project to understand the geography of Wikipedia citations. Our aim is not only to better understand how far Wikipedia has come in representing ‘the sum of all human knowledge’ but to do so in a way that lays bare the processes by which ‘big data’ is selected and visualized. In this post, I outline the way we initially thought about locating citations, and Dave Musicant tells the story of how he has started to build a foundation for coding citation location at scale. It includes feats of superhuman effort, including the posting of letters to a host of companies around the world (and you thought that data scientists sat in front of their computers all day!).

Many articles about places on Wikipedia include a list of citations and references linked to particular statements in the text of the article. Some of the smaller language Wikipedias have fewer citations than the English, Dutch or German Wikipedias, and some have very, very few, but the source of information about places can still act as an important signal of ‘how much information about a place comes from that place’.

When Dave, Shilad and I wrote our overview paper (‘Getting to the Source’) looking at citations on English Wikipedia, we manually looked up the whois data for a set of 500 randomly collected citations from articles across the encyclopedia (not just articles about places). We coded citations according to their top-level domain: if the domain was a country code top-level domain (such as ‘.za’), we coded it according to the country (South Africa), but if it used a generic top-level domain such as .com or .org, we looked up the whois data and entered the country of the administrative contact (since the technical contact is often the domain registration company, which may be located in a different country). The results were interesting, but perhaps unsurprising. We found that the majority of publishers were from the US (at 56% of the sample), followed by the UK (at 13%) and then a long tail of countries including Australia, Germany, India, New Zealand, the Netherlands and France at either 2 or 3% of the sample.
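
To make that coding rule concrete, here is a minimal sketch of the logic, assuming a Unix `whois` command is available on the system. The ccTLD table is only a small illustrative subset, and whois record formats vary enough that the coding for the paper was done by hand, so treat this as a sketch rather than the project's actual tooling.

```python
import subprocess
from urllib.parse import urlparse

# Illustrative subset only; a real run would need the full ISO 3166 list.
CCTLD_COUNTRIES = {"za": "South Africa", "uk": "United Kingdom", "de": "Germany", "ke": "Kenya"}
GENERIC_TLDS = {"com", "org", "net", "info", "edu"}

def code_citation_country(url: str) -> str:
    """Code a cited URL by country: directly for ccTLDs, via whois for generic TLDs."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    if tld in CCTLD_COUNTRIES:
        # Country code TLD: code directly by country.
        return CCTLD_COUNTRIES[tld]
    if tld in GENERIC_TLDS:
        # Generic TLD: fall back to the whois record and look for the
        # administrative contact's country (field names vary by registrar).
        record = subprocess.run(["whois", host], capture_output=True,
                                text=True, timeout=30).stdout
        for line in record.splitlines():
            if line.strip().lower().startswith(("admin country", "registrant country", "country")):
                return line.split(":", 1)[1].strip()
    return "unknown - needs manual lookup"

print(code_citation_country("http://www.statssa.gov.za/census"))  # -> South Africa
```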

[Figure: Geographic distribution of English Wikipedia sources, grouped by country and continent. Ref: ‘Getting to the Source: Where does Wikipedia get its information from?’ Ford, Musicant, Sen, Miller (2013).]

This was useful to some extent, but we also knew that we needed to extend this to capture more citations, and to do so across particular types of article in order for it to be more meaningful. We were beginning to understand that local citation practices (local in the sense of the type of article and the language edition) dictated particular citation norms, and that we needed to look at particular types of article in order to better understand what was happening in the dataset. This is a common problem besetting many ‘big data’ projects when the scale is too large to get at meaningful answers. It is this deeper understanding that we’re aiming at with our Wikipedia geography of citations research project. Instead of just a random sample of English Wikipedia citations, we’re going to be looking up citation geography for millions of articles across many different languages, but only for articles about places. We’re also going to be complementing the quantitative analysis with some deep-dive qualitative analysis of citation practice within articles about places, and doing the analysis across many language versions, not just English. In the meantime, though, Dave has been working on the technical challenge of how to scale up location data for citations, using the whois lookups as a starting point.

Full disclosure: Diary of an internet geography project #2

Reblogged from ‘Connectivity, Inclusivity and Inequality’

In this series of blog posts, Heather Ford documents the process by which a group of computer and social scientists are working together on a project to understand the geography of Wikipedia citations. Their aim is not only to better understand how far Wikipedia has come in representing ‘the sum of all human knowledge’ but to do so in a way that lays bare the processes by which ‘big data’ is selected and visualized. In this post, Heather discusses how the group are focusing their work on a series of exploratory research questions.

In last week’s call, we had a conversation about articulating the initial research questions that we’re trying to answer. At its simplest level, we decided that what we’re interested in is:

‘How much information about a place on Wikipedia comes from that place?’

In the English Wikipedia article about Guinea Bissau, for example, how many of the citations originate from organisations or authors in Guinea Bissau? In the Spanish Wikipedia article about Argentina, what proportion of editors are from Argentina? Cumulatively, can we see any patterns among different language versions that indicate that some language versions contain more ‘local’ voices than others? We think that these are important questions because they point to the extent to which Wikipedia projects can be said to be a reflection of how people from a particular place see the world; they also point to the importance of particular countries in shaping information about certain places from outside their borders. We think it makes a difference to the content of Wikipedia that the US’s Central Intelligence Agency (CIA) is responsible for such a large proportion of the citations, for example.

Past research by Brendan Luyt and Tan (2010, PDF) is instructive here. In 2010, Luyt and Tan took a random sample of national history articles on the English Wikipedia and found that 17% of the cited sources were government sites; of those, four of the top five sites were US government sites, including the US Department of State and the CIA World Fact Book. The authors write that this is problematic because ‘the nature of the institutions producing these documents makes it difficult for certain points of view to be included. Being part of the US government, the CIA World Fact Book, for example, is unlikely to include a reading of Guatemalan history that stresses the role of US intervention as an explanation for that country’s long and brutal civil war.’ (p. 719)

Instead of Luyt and Tan’s history articles, we’re looking at country articles, and we’re zeroing in on citations and trying to ‘locate’ those citations in different ways. While we were talking on Skype, Shilad drew a really great diagram to show how we seem to be looking at this question of information geography. In this research, we are looking at locating all three elements (the location of the article, the sources/citations and the editors) and then establishing the relationships between them, i.e.:

RQ1a What proportion of editors from a place edit articles about that place?

RQ1b What proportion of sources in an article about a place come from that place?

RQ1c What proportion of sources from particular places are added by editors from that place?

We started out by taking the address of the administrative contact in a source’s domain registration as the signal for the source’s location, but we’ve come up against a number of issues as we’ve discussed the initial results. A question that seems to be a precursor to the questions above is how we define ‘location’ in the context of a citation contained within an article about a particular place. There are numerous signals that we might use to associate a citation with a particular place: the HQ of the publisher, for example, or the nationality of the author; the place in which the article/paper/book is set, or the place in which the publishers are located. An added complexity has to do with the fact that websites sometimes host content produced elsewhere. Are we using ‘author’ or ‘publisher’ when we attempt to locate a citation? If we attribute the source to the HQ of the website and not to the actual text, are we still accurately portraying the location of the source?

In order to understand which signals to use in our large-scale analysis, then, we’ve decided to try to get a better understanding of both the shape of these citations and the context in which they occur by looking more closely at a random sample of citations from articles about places and asking the questions:

RQ0a To what extent might signals like ‘administrative contact of the domain registrant’ or ‘country domain’ accurately reflect the location of authors of Wikipedia sources about places?

RQ0b What alternative signals might more accurately capture the locations of sources?

Already in my own initial analysis of the English Wikipedia article on Mombasa, I noticed that citations to some articles written by locals were hosted on domains such as wordpress.com and wikidot.com, which are registered in the US and Poland respectively. There was also a citation to the Kenyan 2009 census, authored by the Kenya National Bureau of Statistics but hosted on Scribd.com, and a link to an article about Mombasa written by a Ugandan on a US-based travel blog. All this means that we would under-represent East Africans’ participation in the writing of this place-based article about Mombasa if we used signals like domain registration.

We can, of course, ‘solve’ each of these problems by removing hosting sites like WordPress from our analysis, but the concern is whether this will negatively affect the representation of efforts by those few in developing countries who are doing their best to produce local content on Wikipedia. Right now, though, we’re starting with the micro-level instances and developing a deeper understanding that way, rather than the other way around. And that I really appreciate.
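
For illustration, here is a minimal sketch of one way to flag citations whose domain registration is unlikely to reflect the author’s location, i.e. URLs on hosting platforms such as wordpress.com or scribd.com. The platform list is an illustrative assumption, not the project’s actual exclusion list, and the point of the sketch is flagging for closer review rather than outright removal.

```python
from urllib.parse import urlparse

# Illustrative assumption: a handful of platforms that host third-party content.
HOSTING_PLATFORMS = {"wordpress.com", "wikidot.com", "scribd.com", "blogspot.com"}

def registration_signal_is_trustworthy(url: str) -> bool:
    """Return False when the URL sits on a known hosting platform, i.e. when
    domain registration says more about the host than about the author."""
    host = (urlparse(url).hostname or "").lower()
    return not any(host == p or host.endswith("." + p) for p in HOSTING_PLATFORMS)

# A citation hosted on Scribd would be flagged for qualitative review rather
# than being coded by its domain registration.
print(registration_signal_is_trustworthy("http://www.scribd.com/doc/kenya-census-2009"))  # False
print(registration_signal_is_trustworthy("http://www.knbs.or.ke/census"))                  # True
```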

Beyond reliability: An ethnographic study of Wikipedia sources

First published on Ethnographymatters.net and Ushahidi.com 

Almost a year ago, I was hired by Ushahidi to work as an ethnographic researcher on a project to understand how Wikipedians managed sources during breaking news events. Ushahidi cares a great deal about this kind of work because of a new project called SwiftRiver that seeks to collect and enable the collaborative curation of streams of data from the real-time web about a particular issue or event. If another Haiti earthquake happened, for example, would there be a way for us to filter out the irrelevant and the misinformation, and build a stream of relevant, meaningful and accurate content about what was happening for those who needed it? And on Wikipedia’s side, could the same tools be used to help editors curate a stream of relevant sources as a team rather than as individuals?

Original designs for voting a source up or down in order to determine “veracity”

When we first started thinking about the problem of filtering the web, we naturally thought of a ranking system that would rank sources according to their reliability or veracity. The algorithm would consider a variety of variables involved in determining accuracy, as well as whether sources had been chosen or voted up or down by users in the past, and eventually be able to suggest sources according to the subject at hand. My job would be to determine what those variables are, i.e. what were editors looking at when deciding whether or not to use a source?
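
The post describes this ranking idea only at a high level, so the following is a toy sketch of what such a score might look like. The variable names and weights are invented for illustration; the actual SwiftRiver design was never reduced to a formula like this here.

```python
from dataclasses import dataclass

@dataclass
class Source:
    url: str
    upvotes: int
    downvotes: int
    times_cited_before: int   # how often editors chose this source in the past
    has_named_author: bool    # one possible accuracy-related signal

def score(source: Source) -> float:
    """Combine past votes and a couple of accuracy-related signals; higher
    scores would float a source toward the top of the stream."""
    total_votes = max(1, source.upvotes + source.downvotes)
    vote_score = (source.upvotes - source.downvotes) / total_votes
    history_score = min(1.0, source.times_cited_before / 10)
    author_score = 1.0 if source.has_named_author else 0.0
    return 0.5 * vote_score + 0.3 * history_score + 0.2 * author_score

ranked = sorted(
    [Source("http://example.org/report", 12, 3, 5, True),
     Source("http://example.com/rumour", 2, 9, 0, False)],
    key=score, reverse=True,
)
print([s.url for s in ranked])
```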

Update on the Wikipedia sources project

This post first appeared on the Ushahidi blog.

Last month I presented the first results of the WikiSweeper project, an ethnographic research project to understand how Wikipedia editors track, evaluate and verify sources on rapidly evolving pages of Wikipedia, the results of which will inform ongoing development of the SwiftRiver (then called Sweeper) platform. Wikipedians are some of the most sophisticated managers of online sources, and we were excited to learn how they collaboratively decided which sources to use and which to dismiss in the first days of the 2011 Egyptian Revolution. In the past few months, I’ve interviewed users from the Middle East, Kenya, Mexico and the United States, studied hundreds of ‘talk page’ discussions relating to the article, analysed the article’s edits, users and references, and compared these findings to what Wikipedia policy says about sources. In the end, I came up with four key findings that I’m busy refining for the upcoming report:

1. The source (the original version of the article and its author) can play a significant role: Wikipedia policy indicates that characteristics of the book, author and publisher of an article’s citations all affect reliability. But the 2011 Egyptian Revolution article showed how influential the Wikipedia editor who writes the first version of the page can be. Making Wikipedia editors’ reputations, edit histories, etc. more easily readable is a critical component of understanding points of view while editing and reading rapidly evolving Wikipedia articles.