Diary of an internet geography project #4

Reblogged from ‘Connectivity, Inclusivity and Inequality’

Continuing with our series of blog posts exposing the workings behind a multidisciplinary big data project, we talk this week about the process of moving between small data and big data analyses. Last week, we did a group deep dive into our data. Extending the metaphor: Shilad caught the fish and dumped them on the boat for us to sort through. We wanted to know whether our method of collecting and determining the origins of the fish was working by looking at a bunch of randomly selected fish up close. Working out how we would do the sorting was the biggest challenge. Some of us liked really strict rules about how we were identifying the fish: ‘small’ wasn’t a good enough description; better would be that small = 10-15cm diameter after a maximum of 30 minutes out of the water. Through this process we learned a few lessons about how to do this close-looking as a team.

Step 1: Randomly selecting items from the corpus

We wanted to know two things about the data that we were selecting through this ‘small data’ analysis: Q1) Were we getting every citation in the article or were we missing/duplicating any? Q2) What was the best way to determine the location of the source?

Shilad used the WikiBrain software library he developed with Brent to identify all roughly one million geo-tagged Wikipedia articles. He then collected all external URLs (about 2.9 million unique URLs) appearing within those articles and used this data to create two samples for coding tasks. He sampled about 50 geotagged articles (to answer Q1) and selected a few hundred random URLs cited within particular articles (to answer Q2).

  • Batch 1 for Q1: 50 documents, each containing an article title, URL, list of citations, and an empty list of ‘missing citations’
  • Batch 2 for Q2: Spreadsheet of 500 random citations occurring in 500 random geotagged articles.
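The actual sampling was done with the WikiBrain library (Java); purely as an illustration of the step described above, the two coding batches could be drawn in Python roughly like this. The data structure and function names here are my own stand-ins, not the project’s code:

```python
import random

def build_coding_samples(articles, n_articles=50, n_citations=500, seed=42):
    """Draw the two random samples used for the hand-coding tasks.

    `articles` maps an article title to its list of cited URLs.
    Returns (batch1, batch2): article titles sampled for Q1, and
    (title, url) citation pairs sampled for Q2.
    """
    rng = random.Random(seed)  # fixed seed so the samples are reproducible
    batch1 = rng.sample(sorted(articles), min(n_articles, len(articles)))
    all_citations = [(t, u) for t, urls in articles.items() for u in urls]
    batch2 = rng.sample(all_citations, min(n_citations, len(all_citations)))
    return batch1, batch2

# Toy example with two geotagged articles
articles = {
    "Mombasa": ["http://example.com/a", "http://example.org/b"],
    "Guinea-Bissau": ["http://example.net/c"],
}
b1, b2 = build_coding_samples(articles, n_articles=1, n_citations=2)
```

Fixing the random seed matters for a transparency-minded project like this one: anyone re-running the sampling gets the same batches to audit.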


Wikipedia and breaking news: The promise of a global media platform and the threat of the filter bubble

I gave this talk at Wikimania in London yesterday. 

In the first years of Wikipedia’s existence, many of us said that, as an example of citizen journalism and journalism by the people, Wikipedia would be able to avoid the gatekeeping problems faced by traditional media. The theory was that because we didn’t have the burden of shareholders and the practices that favoured elite viewpoints, we could produce a media that was about ‘all of us’ and not just ‘some of us’.

Dan Gillmor (2004) wrote that Wikipedia was an example of a wave of citizen journalism projects initiated at the turn of the century in which ‘news was being produced by regular people who had something to say and show, and not solely by the “official” news organizations that had traditionally decided how the first draft of history would look’ (Gillmor, 2004: x).

Yochai Benkler (2006) wrote that projects like Wikipedia enable ‘many more individuals to communicate their observations and their viewpoints to many others, and to do so in a way that cannot be controlled by media owners and is not as easily corruptible by money as were the mass media.’ (Benkler, 2006: 11)

I think that at that time we were all really buoyed by the idea that Wikipedia and peer production could produce information products that were much more representative of “everyone’s” experience. But the idea that Wikipedia could avoid bias completely, I now believe, is fundamentally wrong. Wikipedia presents a particular view of the world while rejecting others. Its bias arises both from its dependence on sources that are themselves biased and from its own policies and practices, which favour particular viewpoints. Although Wikipedia is as close to a truly global media product as we have probably ever come*, like every media product it is a representation of the world and is the result of a series of editorial, technical and social decisions made to prioritise certain narratives over others.

Big Data and Small: Collaborations between ethnographers and data scientists

This article first appeared in Big Data and Society journal published by Sage and is licensed by the author under a Creative Commons Attribution license. [PDF]

Abstract

In the past three years, Heather Ford—an ethnographer and now a PhD student—has worked on ad hoc collaborative projects around Wikipedia sources with two data scientists from Minnesota, Dave Musicant and Shilad Sen. In this essay, she talks about how the three met, how they worked together, and what they gained from the experience. Three themes became apparent through their collaboration: that data scientists and ethnographers have much in common, that their skills are complementary, and that discovering the data together rather than compartmentalizing research activities was key to their success.


Full disclosure: Diary of an internet geography project #3

Reblogged from ‘Connectivity, Inclusivity and Inequality’

In this series of blog posts, we are documenting the process by which a group of computer and social scientists are working together on a project to understand the geography of Wikipedia citations. Our aim is not only to better understand how close Wikipedia has come to representing ‘the sum of all human knowledge’ but to do so in a way that lays bare the processes by which ‘big data’ is selected and visualized. In this post, I outline the way we initially thought about locating citations, and Dave Musicant tells the story of how he has started to build a foundation for coding citation location at scale. It includes feats of superhuman effort, including the posting of letters to a host of companies around the world (and you thought that data scientists sat in front of their computers all day!)

Many articles about places on Wikipedia include a list of citations and references linked to particular statements in the text of the article. Some of the smaller language Wikipedias have fewer citations than the English, Dutch or German Wikipedias, and some have very, very few, but the source of information about places can still act as an important signal of ‘how much information about a place comes from that place’.

When Dave, Shilad and I did our overview paper (‘Getting to the Source’) looking at citations on English Wikipedia, we manually looked up the whois data for a set of 500 randomly collected citations for articles across the encyclopedia (not just about places). We coded citations according to their top-level domain: if the domain was a country code top-level domain (such as ‘.za’), we coded it according to the country (South Africa), but if it used a generic top-level domain such as .com or .org, we looked up the whois data and entered the country of the administrative contact (since the technical contact is often the domain registration company, which may be located in a different country). The results were interesting, but perhaps unsurprising. We found that the majority of publishers were from the US (at 56% of the sample), followed by the UK (at 13%) and then a long tail of countries including Australia, Germany, India, New Zealand, the Netherlands and France at either 2 or 3% of the sample.
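The coding rule just described (a country code TLD maps straight to a country; a generic TLD requires a whois lookup of the administrative contact) can be sketched as follows. This is an illustration, not the project’s actual code: the ccTLD table is a tiny subset of the real registry, and the function name is invented.

```python
from urllib.parse import urlparse

# Tiny illustrative subset of ccTLD -> country; the actual coding used
# the full ccTLD registry plus manual whois lookups for generic TLDs.
CCTLD_COUNTRY = {"za": "South Africa", "ke": "Kenya", "uk": "United Kingdom",
                 "de": "Germany", "fr": "France", "nl": "Netherlands"}
GENERIC_TLDS = {"com", "org", "net", "edu", "info"}

def code_citation(url):
    """Return a country for a ccTLD, or flag a generic TLD for whois lookup."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    if tld in CCTLD_COUNTRY:
        return CCTLD_COUNTRY[tld]
    if tld in GENERIC_TLDS:
        # Whois step omitted here: code by the administrative
        # contact's country, not the technical contact's.
        return "whois lookup needed"
    return "unknown"

print(code_citation("http://www.gov.za/documents"))  # a ccTLD case
print(code_citation("http://example.com/page"))      # a generic-TLD case
```

As the posts below discuss, even the ccTLD branch is only a heuristic: a .com domain registered in the US can host content written anywhere.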


Geographic distribution of English Wikipedia sources, grouped by country and continent. Ref: ‘Getting to the Source: Where does Wikipedia get its information from?’ Ford, Musicant, Sen, Miller (2013).

This was useful to some extent, but we also knew that we needed to extend this to capture more citations, and to do so across particular types of article in order for it to be more meaningful. We were beginning to understand that local citation practices (local in the sense of the type of article and the language edition) dictated particular citation norms, and that we needed to look at particular types of article in order to better understand what was happening in the dataset. This is a common problem besetting many ‘big data’ projects: the scale is too large to get at meaningful answers. It is this deeper understanding that we’re aiming at with our Wikipedia geography of citations research project. Instead of just a random sample of English Wikipedia citations, we’re going to be looking up citation geography for millions of articles across many different languages, but only for articles about places. We’re also going to be complementing the quantitative analysis with some deep-dive qualitative analysis of citation practice within articles about places, and doing the analysis across many language versions, not just English. In the meantime, though, Dave has been working on the technical challenge of how to scale up location data for citations using the whois lookups as a starting point.

Full disclosure: Diary of an internet geography project #2

Reblogged from ‘Connectivity, Inclusivity and Inequality’

In this series of blog posts, Heather Ford documents the process by which a group of computer and social scientists are working together on a project to understand the geography of Wikipedia citations. Their aim is not only to better understand how close Wikipedia has come to representing ‘the sum of all human knowledge’ but to do so in a way that lays bare the processes by which ‘big data’ is selected and visualized. In this post, Heather discusses how the group are focusing their work on a series of exploratory research questions. In last week’s call, we had a conversation about articulating the initial research questions that we’re trying to answer. At its simplest level, we decided that what we’re interested in is:

‘How much information about a place on Wikipedia comes from that place?’

In the English Wikipedia article about Guinea-Bissau, for example, how many of the citations originate from organisations or authors in Guinea-Bissau? In the Spanish Wikipedia article about Argentina, what proportion of editors are from Argentina? Cumulatively, can we see any patterns among different language versions that indicate that some language versions contain more ‘local’ voices than others? We think that these are important questions because they point to the extent to which Wikipedia projects can be said to be a reflection of how people from a particular place see the world; they also point to the importance of particular countries in shaping information about certain places from outside their borders. We think it makes a difference to the content of Wikipedia that the US’s Central Intelligence Agency (CIA) is responsible for such a large proportion of the citations, for example.

Past research from Brendan Luyt and Tan (2010, PDF) is instructive here. In 2010, Luyt and Tan took a random sample of national history articles on Wikipedia (English) and found that 17% of the cited sites were government sites, and of those, four of the top five were US government sites, including the US Department of State and the CIA World Fact Book. The authors write that this is problematic because ‘the nature of the institutions producing these documents makes it difficult for certain points of view to be included. Being part of the US government, the CIA World Fact Book, for example, is unlikely to include a reading of Guatemalan history that stresses the role of US intervention as an explanation for that country’s long and brutal civil war.’ (p719) Instead of Luyt and Tan’s history articles, we’re looking at country articles, and we’re zeroing in on citations and trying to ‘locate’ those citations in different ways. While we were talking on Skype, Shilad drew a really great diagram to show how we seem to be looking at this question of information geography. In this research, we seem to be looking at locating all three elements (the location of the article, the sources/citations and the editors) and then establishing the relationships between them, i.e.

RQ1a What proportion of editors from a place edit articles about that place?

RQ1b What proportion of sources in an article about a place come from that place?

RQ1c What proportion of sources from particular places are added by editors from that place?

We started out by taking the address of the administrative contact in a source’s domain registration as the signal for the source’s location, but we’ve come up against a number of issues as we’ve discussed the initial results. A question that seems to be a precursor to the questions above is how we define ‘location’ in the context of a citation contained within an article about a particular place. There are numerous signals that we might use to associate a citation with a particular place: the HQ of the publisher, for example, or the nationality of the author, or the place in which the article/paper/book is set. An added complexity has to do with the fact that websites sometimes host content produced elsewhere. Are we using ‘author’ or ‘publisher’ when we attempt to locate a citation? If we attribute the source to the HQ of the website and not the actual text, are we still accurately portraying the location of the source?

In order to understand which signals to use in our large-scale analysis, then, we’ve decided to try to get a better understanding of both the shape of these citations and the context in which they occur by looking more closely at a random sample of citations from articles about places and asking the questions:

RQ0a To what extent might signals like ‘administrative contact of the domain registrant’ or ‘country domain’ accurately reflect the location of authors of Wikipedia sources about places?

RQ0b What alternative signals might more accurately capture the locations of sources?

Already in my own initial analysis of the English Wikipedia article on Mombasa, I noticed that citations to some articles written by locals were hosted on domains such as wordpress.com and wikidot.com that are registered in the US and Poland respectively. There was also a citation to the Kenyan 2009 census authored by the Kenya National Bureau of Statistics hosted by Scribd.com, and a link to an article about Mombasa written by a Ugandan on a US-based travel blog. All this means that we are going to under-represent East Africans’ participation in the writing of this place-based article about Mombasa if we use signals like domain registration.

We can, of course, ‘solve’ each of these problems by removing hosting sites like WordPress from our analysis, but the concern is whether this will negatively affect the representation of efforts by those few in developing countries who are doing their best to produce local content on Wikipedia. Right now, though, we’re starting with the micro-level instances and developing a deeper understanding that way, rather than the other way around. And that I really appreciate.
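One way to operationalise the hosting-platform problem described above is to flag domain registration as an unreliable location signal whenever the URL points at a known content-hosting site. A minimal sketch, assuming a hand-maintained (and here deliberately tiny) platform list; the list and function name are illustrative, not the project’s:

```python
from urllib.parse import urlparse

# Illustrative, non-exhaustive list of hosting platforms whose domain
# registration says little about where the content was actually produced.
HOSTING_DOMAINS = {"wordpress.com", "wikidot.com", "scribd.com", "blogspot.com"}

def registration_signal_is_reliable(url):
    """Return False when the URL sits on a content-hosting platform,
    where the registrant's country reflects the host, not the author."""
    host = (urlparse(url).hostname or "").lower()
    return not any(host == d or host.endswith("." + d) for d in HOSTING_DOMAINS)
```

A filter like this would route the Mombasa-style cases (a Kenyan census on Scribd, a local blog on wordpress.com) to a different signal, such as hand-coding the author, rather than silently attributing them to the US or Poland.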

Full disclosure: Diary of an internet geography project #1

Reblogged from ‘Connectivity, Inclusivity and Inequality’

OII research fellow Mark Graham and DPhil student Heather Ford (both part of the CII group) are working with a group of computer scientists including Brent Hecht, Dave Musicant and Shilad Sen to understand how close Wikipedia has come to representing ‘the sum of all human knowledge’. As part of the project, they will be making explicit the methods that they use to analyse millions of data records from Wikipedia articles about places in many languages. The hope is that by experimenting with a reflexive method of doing a multidisciplinary ‘big data’ project, others might be able to use this as a model for pursuing their own analyses in the future. This is the first post in a series in which Heather outlines the team’s plans and processes.

It was a beautiful day in Oxford and we wanted to show our Minnesotan friends some Harry Pottery architecture, so Mark and I sat on a bench in the Balliol gardens while we called Brent, Dave and Shilad who are based in Minnesota for our inaugural Skype meeting. I have worked with Dave and Shilad on a paper about Wikipedia sources in the past, and Mark and Brent know each other because they both have produced great work on Wikipedia geography, but we’ve never all worked together as a team. A recent grant from Oxford University’s John Fell Fund provided impetus for the five of us to get together and pool efforts in a short, multidisciplinary project that will hopefully catalyse further collaborative work in the future.

In last week’s meeting, we talked about our goals and timing and how we wanted to work as a team. Since we’re a multidisciplinary group who really value both quantitative and qualitative approaches, we thought that it might make sense to present our goals as consisting of two main strands: 1) to investigate the origins of knowledge about places on Wikipedia in many languages, and 2) to do this in a way that is both transparent and reflexive.

In her eight ‘big tent’ criteria for excellent qualitative research, Sarah Tracy (2010, PDF) includes self-reflexivity and transparency in her conception of researcher ‘sincerity’. Tracy believes that sincerity is a valuable quality that relates to researchers being earnest and vulnerable in their work and ‘considering not only their own needs but also those of their participants, readers, coauthors and potential audiences’. Despite the focus on qualitative research in Tracy’s influential paper, we think that practising transparency and reflexivity can have enormous benefits for quantitative research as well, but one of the challenges is finding ways to pursue transparency and reflexivity as a team rather than as individual researchers.

Transparency

Tracy writes that transparency is about researchers being honest about the research process.

‘Transparent research is marked by disclosure of the study’s challenges and unexpected twists and turns and revelation of the ways research foci transformed over time.’

She writes that, in practice, transparency requires a formal audit trail of all research decisions and activities. For this project, we’ve set up a series of Google docs folders for our meeting agendas, minutes, Skype calls, screenshots of our video call as well as any related spreadsheets and analyses produced during the week. After each session, I clean up the meeting minutes that we’ve co-produced on the Google doc while we’re talking, and write a more narrative account about what we did and what we learned beneath that.

Although we’re co-editing these documents as a team, it’s important to note that, as the documenter of the process, it’s my perspective that is foregrounded, and I have to be really mindful of this as I reflect on what happened. Our team meetings are occasions for discussion of the week’s activities, challenges and revelations, which I try to document as accurately as possible, but I will probably also need to conduct interviews with individual members of the team further along in the process in order to capture individual responses to the project and the process that aren’t necessarily accommodated in the weekly meetings.

Reflexivity

According to Tracy, self-reflexivity involves ‘honesty and authenticity with one’s self, one’s research and one’s audience’. Apart from the focus on interrogating our own biases as researchers, reflexivity is about being frank about our strengths and weaknesses, and, importantly, about examining our impact on the scene and asking for feedback from participants.

Soliciting feedback from participants is something quite rare in quantitative research but we believe that gaining input from Wikipedians and other stakeholders can be extremely valuable for improving the rigor of our results and for providing insight into the humans behind the data.

As an example, a few years ago when I was at a Wikimedia Kenya meetup, I asked what editors thought about Mark Graham’s Swahili Wikipedia maps. One respondent was immediately able to explain the concentration of geolocated articles from Turkey because he knew the editor who was known as a specialist of Turkey geography stubs. Suddenly the map took on a more human form — a reflection of the relationships between real people trying to represent their world. More recently, a Swahili Wikipedian contacted Mark about the same maps and engaged him in a conversation about how they could be made better. Inspired by these engagements, we want to really encourage those conversations and invite people to comment on our process as it evolves. To do this, we’ll be blogging about the progress of the project and inviting particular groups of stakeholders to provide comments and questions. We’ll then discuss those comments and questions in our weekly meetings and try to respond to as many of them as possible in thinking about how we move the analysis forward.

In conclusion, transparency and reflexivity are two really important aspects of researcher sincerity. The challenge with this project is trying to put them into practice in a quantitative rather than a qualitative project, and one driven by a team rather than an individual researcher. Potential risks are that I inaccurately report on what we’re doing, or expose something about our process that is considered inappropriate. What I’m hoping is that we can mark these entries clearly as my initial, necessarily incomplete reflections on our process and that this can feed into the team’s reflections going forward. Knowing the researchers in the team and having worked with all of them in the past, my goal is to reflect the ways in which they bring what Tracy values in ‘sincere’ researchers: the empathy, kindness, self-awareness and self-deprecation that I know all of these team members display in their daily work.

Review of ‘code/space: software and everyday life’

This review was published in Environment and Planning B last year. I really loved the book and think that it’s a powerful reminder of the importance of context in thinking about how code does work in the world. 

Code/space: Software and Everyday Life By Rob Kitchin and Martin Dodge; MIT Press, Cambridge, London, 2011, 290 pages, ISBN: 978-0262042482

Kitchin and Dodge’s important new book, Code/space: Software and Everyday Life, opens with the crucial phrase – “software matters”. It matters, they argue, because software increasingly mediates our everyday lives – from the digital trail that extends just that little bit further when we order our morning coffee, to the data which is sent to a remote location about our electricity and gas usage from so-called “smart meters”, and the airport security databases that determine whether we are allowed to travel or not. The power of software is its ability to make our lives easier and improve efficiency and productivity; but such efficiencies come at the cost of pervasive surveillance, a feature that is producing a society that “never forgets”.

The key premise of the book is that there are two key gaps in the way that we talk about technology and society. The first critique is aimed at social science and humanities approaches that deal too much with the technologies that software enables rather than explaining the particular code that affects activity and behavior in different contexts. This is akin, the authors argue, to looking only at the effects of ill health on society, rather than also considering ‘the specifics of different diseases, their etiology (causes, origins, evolution, and implications), and how these manifest themselves in shaping social relations’ (p13).

Software studies, on the other hand, is a nascent research field that seeks ‘to open the black box of processors and arcane algorithms to understand how software – its lines and routines of code – does work in the world by instructing various technologies how to act’ (p13). The problem, they write, is that the majority of software studies are aspatial, presuming that space is merely a neutral backdrop against which human activity occurs. Here, Kitchin and Dodge’s critique is directed at scholars such as Lawrence Lessig, whose book Code and Other Laws of Cyberspace (Lessig, 1999) refers to code (in the form of software) as having the ability to automatically regulate activity online. But code, argue Kitchin and Dodge, is not law. It is neither universal nor deterministic, but rather contingent and relational. And space is not simply a container in which things happen but rather ‘subtly evolving layers of context and practices that fold together people and things and actively shape social relations’ (p13).

Kitchin and Dodge formulate two important concepts in arguing for a spatial approach to software studies. The first is ‘code/space’, the moment when space and code are mutually dependent on one another (a check-in desk at an airport, for example); the second is ‘coded spaces’, which do not entirely depend on code to function (the use of PowerPoint slides during a presentation, for example). After detailing the types of software employed in the areas of home, travel and consumption, Kitchin and Dodge conclude with a Manifesto for Software Studies that sets out an agenda for software studies to produce ‘detailed case studies of how software does work in the world’ as well as ‘theoretical tools for describing how and explaining why, and the effects, of that work’ (p249). Here they propose studies comparing the effects of code in rural Ireland and urban Manchester, for example, where code is analyzed in a manner that is sensitive to place and scale, cultural histories, and modes of activity (p249).

The concept that code is not universal or immutable but contingent and contextual is powerful, but Kitchin and Dodge have a tendency to analyze code in home, travel and consumption without any reference to differences in code’s impact in different places or spaces (the use of international passenger record databases when traveling from Rio de Janeiro, or trying to buy books on Amazon.com from Johannesburg, for example). Although they maintain that their intention is to provide a broad field for future research, the book would have been stronger with some analysis of the different ways in which code codes differing practices in different places and spaces, how those practices are understood and used, and the different moments of code’s instantiation.

Where the book succeeds, and makes its most useful contribution, is in its lucid explanation of how detailed analyses of code are essential to understanding how software is becoming ingrained into our everyday lives, and how it has both an empowering and disciplining effect. Kitchin and Dodge are charting new disciplinary territory here, bringing together the fields of computer science, social science and spatial studies in highly promising ways. Theirs is an inspiring approach to how we might come to unveil the hidden choices behind the code that governs our everyday lives, and how we might come to understand a phenomenon that has an increasingly powerful role in society.

References

Lessig, L, 1999 Code: And Other Laws of Cyberspace (Basic Books, New York)