The Intimate Encyclopedia

The Intimate Encyclopedia is an experiment that makes explicit the subjectivities of encyclopedic knowledge. Using Wikipedia as inspiration, it offers three core principles guiding the writing of articles. It asks authors to present the 1. Subjective Point of View (IE:SPOV), warns readers that content is 2. Unverifiable and encourages 3. All Original Research (AOR). Although the Intimate Encyclopedia is no longer, this record reminds us of the alternative ways of representing knowledge, distinct from the logics that guide our current truthmaking practices.

Revision 245 of the Intimate Encyclopedia as at 11 December 2020

The following is from a talk I gave at the recent Digital Intimacies symposium organised by Paul Byron, Suneel Jethani, Amelia Johns and Natalie Krikowa, from my discipline group (Digital and Social Media) at the University of Technology Sydney.

I spend most of my time these days trying to understand what it means to know, whose knowledge is recognised and how knowledge should be governed. I do this in a world materially constituted by data and epistemologically by a moment in which truth seems to be located either as a result of machinic (as opposed to human) processes, or in the humans and crowds who seem to epitomise the rejection of a kind of politics that seems to muddy the truth. Seems, because even the algorithms that drive our truth machines are, we know, a very human craft and very much political artefacts. Seems, because the politicians who rise on the back of an idea that politics is corrupt, we learn are themselves often politically corrupt. Seems, because crowds are not – as Surowiecki claimed – all wise. They do not always produce more truthful representations than individuals or groups, even if accuracy were the only thing we were in need of right now.

I’m interested in the governance of knowledge and my primary site of study is Wikipedia. When I tried to think about how I’d contribute to a conference dedicated to “Digital Intimacies”, I couldn’t imagine how. Wikipedia seems the opposite of intimate knowledge. Its policies are conservative and representative of Western enlightenment traditions. It asks editors to leave their knowledge at the door in favour of what it considers “reliable sources”, not to do original research, to represent the Neutral Point of View (NPOV).

And yet, in the decade of my research about the 2011 Egyptian Revolution article, constructed as protests descended in ever increasing waves on Egypt’s streets, I learned that intimate knowledge was everywhere. It was in the decisions about what facts to exclude, about who to contact on the ground for verification, in the knowledge about how Wikipedia really works and who to engage in order to make it work for them. As Donna Haraway wrote: “All knowledges are situated. There can be no ‘infinite vision’ – it is a ‘god trick’” (Haraway, 1988, p. 581).

And so, I started to imagine what an encyclopedia that opened itself up to this idea would look like and how it would be governed. This experiment makes knowledge’s subjectivity explicit. With the help of my colleagues in the Digital and Social Media discipline at the UTS School of Communication, we wrote seven encyclopedic articles for the inaugural and only version of the Intimate Encyclopedia. My instructions to authors were to write encyclopedia articles from a personal rather than objective point of view. The other rules came later, as they did with Wikipedia.

The Intimate Encyclopedia begins with three core content principles:

1. Subjective Point of View (IE:SPOV)

All Intimate Encyclopedia articles and other encyclopedic content must be written from a subjective point of view, representing the authors’ views truthfully, momentarily and with as much bias as possible.

IE:SPOV

In the example below, Tisha Dejmanee defines the suitcase not only as a “form of luggage” but as a companion (accompanying Tisha to “grad school and new jobs, new houses and growing networks”) that is too big to hide in her new home. For Dejmanee, the suitcase (her suitcase) is symbolic of “the ruptures of 2020 while also serving as a reminder of the continued longing that carries people and hope across the world”. This statement is highly subjective (since when are suitcases symbolic?!) and thus perfectly suitable for the Intimate Encyclopedia.

“Suitcase” by Tisha Dejmanee

In another example, Paul Byron defines his chosen object, the “Portable Webcam” as “a video camera that feeds or streams an image or video in real time” but also as an instrument of oppression that represents constant surveillance and that is reflective of “a sad story of somebody who spends a lot of time at a desk.” This perspective on the webcam is reflective of a very particular moment in time and contains opinions rather than knowledge. Its place in the Intimate Encyclopedia is guaranteed!

“Portable Webcam” by Paul Byron

The second core content principle of the Intimate Encyclopedia is that it is:

2. Unverifiable (IE:U)

References provided are an indication but not evidence of the source for authors’ inspiration. Readers of the Intimate Encyclopedia must accept that authors have produced an accurate representation of their thoughts and feelings. The Intimate Encyclopedia was at one time open for challenge but is no longer*.

IE:U

In the example below, I write about the “Teapot”, “a vessel for steeping black tea leaves in boiling water”. “Only BLACK TEA?” you cry! This is an unverifiable statement (along with the method of making Proper Way tea). The citations here are a ruse – they do not support the statements made. Thankfully there is no need for verifiable knowledge on the Intimate Encyclopedia. Teapots, for this author, are “fragile things” whose “fragility reminded Ford of the tenuousness of our existence and the importance of celebrating small joys – even if they consisted only in a sip of a properly made cup of tea in a real tea cup and from a pot of freshly brewed tea made, importantly, in a teapot.”

“Teapot” by Heather Ford

“Kangaroo Paw” by Amelia Johns is equally unverifiable. Little to Johns’ knowledge, the kangaroo paw was sourced from a warehouse in Melbourne, but we must rely on Johns’ account because no original receipt was included. Kangaroo Paw, according to Johns, is the companion and toilet to Ella and a reminder of “the delicate balance of nature-animal-human cohabitations that have thrived during the pandemic.”

“Kangaroo paw” by Amelia Johns

3. All Original Research (IE:AOR)

The Intimate Encyclopedia only publishes original, untarnished thought. Although some facts may be attributed to a reliable source, authors must intersperse these with definitions of their own design so that the rendering is completely original.

IE:AOR

Bhuva Narayan’s article on the X-Ray is a very personal account of the object. Instead of an image of a generic human hand, she reveals that this image is, indeed, of her own hands, her own feet. These reflections are interspersed with factual statements about the precursors of X-rays, such as how “pre-historic hunting cultures depicted animals by drawing or painting the skeletal frame and internal organs (Chaloupka, 1993)”.

X-Ray by Bhuva Narayan

In the next article about the “Dummy”, Natalie Krikowa classifies dummies as both “nipple substitute(s)” and objects “located in the cracks between couch cushions”. This original rendering is of a very particular set of dummies belonging to a very particular human.

In the final article, about the “Book”, Alan McKee presents a truly original portrait of this common object, making it very strange in this original rendering. Books, according to McKee, are not only “primitive forms of computers” but also objects that enable anxious people to “avoid staring straight into the face of the terrifying world around them”. The image is not an image of “a book” one might regularly see in an encyclopedic article about books but “a book nibbled by a parrot”. Parrots featuring in articles about books! Original indeed.

“Book” by Alan McKee

Coda

This tiny experiment demonstrates, among other things, that there are multiple ways of representing knowledges and that the rules that govern the dominant representations (from Wikipedia, for example) are not natural or obvious but shaped by particular ways of understanding what it means to know.

Through the experiment, I learned a few facts about books, plants, webcams, suitcases, teapots, x-rays and dummies. I also learned about what is possibly more important: about the hopes, longings, anxieties and dreams of the people I spend many of my days with. Intimate knowledges are, indeed, a worthy pursuit… alongside the Other (objective) forms we are so obsessed with at this moment in time.

* The Intimate Encyclopedia was technically available to the public for only a few weeks, even though we didn’t let anyone other than the participants of the conference know about it. This is the only record of its existence.

Thanks to Francesco Bailo for installing our Intimate Encyclopedia and helping its authors with their contributions.

Fact Factories: Wikipedia and Writing History as it Happens

I will be speaking at the Digital Histories Research Seminar on Thursday 8 October 2020, 6.00pm (AEST).

On the 24th of January, 2011, an Egyptian-born Wikipedia editor, “The Egyptian Liberal”, published the first draft of an article titled “2011 Egyptian protests” on English Wikipedia. Working with hundreds of other editors over the next two weeks, “The Egyptian Liberal” documented the events that catalysed the downfall of Hosni Mubarak as hundreds of thousands of people descended on Tahrir Square and on cities throughout the country to demand change. In this talk, I’ll discuss my forthcoming book, Fact Factories. I’ll introduce the concept of traveling facts and the mirroring (and sometimes refracting) of material realities on Wikipedia and in the streets of Egypt in ways that framed and eventually helped determine the result of the protests. The talk is about the writing of history as it happens, about the role of automated technology in our collaborative narration of events and about how Wikipedia’s narration will always be a partial one.

Join via Zoom: https://utsmeet.zoom.us/j/99750414645 

Data analyst/visualisation expert needed

Tamson Pietsch, Head of the Centre for Public History at UTS, and I are leading a small pilot project at UTS to analyse Wikipedia’s scope and progress over the past twenty years in Australia, together with collaborators at Wikimedia Australia <https://wikimedia.org.au/wiki/Wikimedia_Australia> (including Pru Mitchell and Toby Hudson, 99of9). We are looking for someone to help us develop a series of visualisations for the pilot. This will involve extracting data about en.wp.org articles (either from Wikipedia or via Wikidata) and comparing it to another dataset (possibly the Australian Honours List), cleaning and coding data and, importantly, visualising the data using mapping and other visualisation tools. This is a pilot project with resources for a few days’ work, which we would ideally like to happen over the next month. Experience with Wikimedia data analysis is a plus.

Please contact me for more info!

Wikipedia’s relationship to academia and academics

I was recently quoted in an article for Science News about the relationship between academia and Wikipedia by Bethany Brookshire. I was asked to comment on a recent paper by MIT Sloan’s Neil Thompson and Douglas Hanley, who investigated the relationship between Wikipedia articles and scientific papers using examples from chemistry and econometrics. There are a bunch of studies on a similar topic (if you’re interested, here is a good place to start) and I’ve been working on this topic – but from a very different angle – for a qualitative study to be published soon. I thought I would share my answers to the interview questions here since many of them are questions that friends and colleagues ask regularly about citing Wikipedia articles and about quality issues on Wikipedia.

Have you ever edited Wikipedia articles?  What do you think of the process?

Some, yes. Being a successful editor on English Wikipedia is a complicated process, particularly if you’re writing about topics that are either controversial or outside the purview of the majority of Western editors. Editing is complicated not only because it is technical (even with the excellent new tools that have been developed to support editing without having to learn wiki markup) – most of the complications come with knowing the norms, the rules and the power dynamics at play.

You’ve worked previously with Wikipedia on things like verification practices. What are the verification practices currently?

That’s a big question 🙂 Verification practices involve a complicated set of norms, rules and technologies. Editors may (or may not) verify their statements by checking sources, but the power of Wikipedia’s claim-making practice lies in the norm of questioning unsourced claims using the “citation needed” tag and in the ability of any other editor to remove claims that they believe to be incorrect. This, of course, does not guarantee that every claim on Wikipedia is factually correct, but it does enable the dynamic labelling of unverified claims and the ability to set verification tasks in an iterative fashion.

Many people in academia view Wikipedia as an unreliable source and do not encourage students to use it. What do you think of this?

Academic use of sources is a very contextual practice. We refer to sources in our own papers and publications not only when we are supporting the claims they contain, but also when we dispute them. That’s the first point: even if Wikipedia were generally unreliable, that would not be a good reason for denying its use. The second point is that Wikipedia can be a very reliable source for particular types of information. Affirming the claims made in a particular article, if that was our goal in using it, would require verifying the information that we are reinforcing through citation and citing the particular version (the “oldid” in Wikipedia terms) that we are referring to. Wikipedia can be used very soundly by academics and students – we just need to do so carefully and with an understanding of the context of citation – something we should be doing generally, not only on Wikipedia.
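For readers who have not cited a specific version before, here is a minimal sketch of how a link to one revision can be put together. It assumes the standard `index.php?title=…&oldid=…` permanent-link form that Wikipedia exposes; the function name `permalink` and the revision number in the usage note are placeholders of my own, not real citations.

```python
from urllib.parse import urlencode


def permalink(title, oldid, lang="en"):
    """Build a link to one fixed revision of a Wikipedia article.

    `title` is the article title, `oldid` the revision identifier shown in
    the article's history, and `lang` the language edition (assumed "en").
    """
    # Wikipedia titles use underscores where readers see spaces.
    query = urlencode({"title": title.replace(" ", "_"), "oldid": oldid})
    return f"https://{lang}.wikipedia.org/w/index.php?{query}"
```

Calling `permalink("Teapot", 123456)` with that made-up revision number gives a URL ending in `title=Teapot&oldid=123456`, which keeps pointing at the same text even after the live article changes.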

You work in a highly social media savvy field, what is the general attitude of your colleagues toward Wikipedia as a research resource? Do you think it differs from the attitudes of other academics?

I would say that Wikipedia is widely recognized by academics, including those of my colleagues who don’t specifically conduct Wikipedia research, as a source that is fine to visit but not to cite.

What did you think of this particular paper overall?

I thought that it was a really good paper. Excellent research design and very solid analysis. The only weakness, I would argue, would be that there are quite different results for chemistry and econometrics and that those differences aren’t adequately accounted for. More on that below.

The authors were attempting a causational study by adding Wikipedia articles (while leaving some written but unadded) and looking at how the phrases translated to the scientific literature six months later. Is this a long enough period of time?

This seems to be an appropriate amount of time to study, but there are probably quite important differences between fields of study that might influence results. The volume of publication (social scientists and humanities scholars tend to produce much lower volumes of publications, and their publications thus tend to be more spread out over time than in natural science and engineering subjects, for example), the volume of explanatory or definitional material in publications (requiring greater use of the literature), the extent to which academics in the particular field consult and contribute to Wikipedia – all might affect how different fields of study influence and are influenced by Wikipedia articles.

Do you think the authors achieved evidence of causation here?

Yes. But again, causation in a single field i.e. chemistry.

Is it important to know whether Wikipedia is influencing the scientific literature? Why or why not?

Yes. It is important to know whether Wikipedia is influencing scientific literature – particularly because we need to know where power to influence knowledge is located (in order to ensure that it is being fairly governed and maintained for the development of accurate and unbiased public knowledge).

Do you think papers like this will impact how scientists view and use Wikipedia?

As far as I know, this is the first paper that attributes a strong link between what is on Wikipedia and the development of science. I am sure that it will influence how scientists and other academics view and use Wikipedia – particularly in driving initiatives where scientists contribute to Wikipedia either directly or via initiatives such as PLoS’s Topic Pages.

Is there anything especially important to emphasize?

The most important thing is to emphasize the differences between fields, which I think need to be better explained. I definitely think that certain types of academic research are more in line with Wikipedia’s way of working, forms and styles of publication and epistemology and that it will not have the same influence on other fields.

How Wikipedia’s silent coup ousted our traditional sources of knowledge

[Reposted from The Conversation, 15 January 2016]

As Wikipedia turns 15, volunteer editors worldwide will be celebrating with themed cakes and edit-a-thons aimed at filling holes in poorly covered topics. It’s remarkable that a user-editable encyclopedia project that allows anyone to edit has got this far, especially as the website is kept afloat through donations and the efforts of thousands of volunteers. But Wikipedia hasn’t just become an important and heavily relied-upon source of facts: it has become an authority on those facts.

Through six years of studying Wikipedia I’ve learned that we are witnessing a largely silent coup, in which traditional sources of authority have been usurped. Rather than discovering what the capital of Israel is by consulting paper copies of Encyclopedia Britannica or geographical reference books, we source our information online. Instead of learning about thermonuclear warfare from university professors, we can now watch a YouTube video about it.

The ability to publish online cheaply has led to an explosion in the number and range of people putting across facts and opinion, far beyond what was traditionally delivered through largely academic publishers. But rather than this leading to an increase in the diversity of knowledge and the democratisation of expertise, the result has actually been greater consolidation in the number of knowledge sources considered authoritative. Wikipedia, particularly in terms of its alliance with Google and other search engines, now plays a central role.

Infoboxes and cleanup tags: Artifacts of Wikipedia newsmaking

Infobox from the first version of the 2011 Egyptian Revolution (then ‘protests’) article on English Wikipedia, 25 January, 2011

My article about Wikipedia infoboxes and cleanup tags and their role in the development of the 2011 Egyptian Revolution article has just been published in the journal ‘Journalism: Theory, Practice and Criticism’ (a pre-print is available on Academia.edu). The article forms part of a special issue of the journal edited by C W Anderson and Juliette De Maeyer, who organised the ‘Objects of Journalism’ pre-conference at the International Communication Association conference in London that I attended last year. The issue includes a number of really interesting articles from a variety of periods in journalism’s history – from pica sticks to interfaces, timezones to software, some of which we covered in the August 2013 edition of ethnographymatters.net.

My article is about infoboxes and cleanup tags as objects of Wikipedia journalism, objects that have important functions in the coordination of editing and writing by distributed groups of editors. Infoboxes are summary tables on the right hand side of an article that enable readability and quick reference, while cleanup tags are notices at the head of an article warning readers and editors of specific problems with articles. When added to an article, both tools simultaneously notify editors about missing or weak elements of the article and add articles to particular categories of work.

The article contains an account of the first 18 days of the protests that resulted in the resignation of then-president Hosni Mubarak based on interviews with a number of the article’s key editors as well as traces in related articles, talk pages and edit histories. Below is a selection from what happened on day 1:

Day 1: 25 January, 2011 (first day of the protests)

The_Egyptian_Liberal published the article on English Wikipedia on the afternoon of what would become a wave of protests that would lead to the unseating of President Hosni Mubarak. A template was used to insert the ‘uprising’ infobox to house summarised information about the event including fields for its ‘characteristics’, the number of injuries and fatalities. This template was chosen from a range of other infoboxes relating to history and events on Wikipedia, but has since been deleted in favor of the more recently developed ‘civil conflict’ infobox with fields for ‘causes’, ‘methods’ and ‘results’.

The first draft included the terms ‘demonstration’, ‘riot’ and ‘self-immolation’ in the ‘characteristics’ field and was illustrated by the Latuff cartoon of Khaled Mohamed Saeed and Hosni Mubarak with the caption ‘Khaled Mohamed Saeed holding up a tiny, flailing, stone-faced Hosni Mubarak’. Khaled Mohamed Saeed was a young Egyptian man who was beaten to death, reportedly by Egyptian security forces, and the subject of the Facebook group ‘We are all Khaled Said’ moderated by Wael Ghonim that contributed to the growing discontent in the weeks leading up to 25 January, 2011. This space would ideally have been filled by a photograph of the protests, but the cartoon was used because the article was uploaded so soon after the first protests began. It also has significant emotive power and clearly represented the perspective of the crowd of anti-Mubarak demonstrators in the first protests.

Upon publishing, three prominent cleanup tags were automatically appended to the head of the article. These included the ‘new unreviewed article’ tag, the ‘expert in politics needed’ tag and the ‘current event’ tag, warning readers that information on the page may change rapidly as events progress. These three lines of code that constituted the cleanup tags initiated a complex distribution of tasks to different groups of users located in work groups throughout the site: page patrollers, subject experts and those interested in current events.

The three cleanup tags automatically appended to the article when it was published at UTC 13:27 on 25 January, 2011

Looking at the diffs in the first day of the article’s growth, it becomes clear that the article is by no means a ‘blank slate’ that editors fill progressively with prose. Much of the activity in the first stage of the article’s development consisted of editors inserting markers or frames in the article that acted to prioritize and distribute work. Cleanup tags alerted others about what they believed to be priorities (to improve weak sections or provide political expertise, for example) while infoboxes and tables provided frames for editors to fill in details iteratively as new information became available.

By discussing the use of these tools in the context of Bowker and Star’s theories of classification (2000), I argue that these tools are not only material but also conceptual and symbolic. They facilitate collaboration by enabling users to fill in details according to a pre-defined set of categories and by catalyzing notices that alert others to the work that they believe needs to be done on the article. Their power, however, cannot only be seen in terms of their functional value. These artifacts are deployed and removed as acts of social and strategic power play among Wikipedia editors who each want to influence the narrative about what happened and why it happened. Infoboxes and tabular elements arise as clean, simple, well-referenced numbers out of the messiness and conflict that gave rise to them. When cleanup tags are removed, the article develops an implicit authority, appearing to rise above uncertainty, power struggles and the impermanence of the compromise that it originated from.

This categorization practice enables editors to collaborate iteratively with one another because each object signals work that needs to be done by others in order to fill in the gaps of the current content. In addition to this functional value, however, categorization also has a number of symbolic and political consequences. Editors are engaged in a continual practice of iterative summation that contributes to an active construction of the event as it happens rather than a mere assembling of ‘reliable sources’. The deployment and removal of cleanup tags can be seen as an act of power play between editors that affects readers’ evaluation of the article’s content. Infoboxes are similar sites of struggle whose deployment and development result in an erasure of the contradictions and debates that gave rise to them. These objects illuminate how this novel journalistic practice has important implications for the way that political events are represented.

Wikipedia and breaking news: The promise of a global media platform and the threat of the filter bubble

I gave this talk at Wikimania in London yesterday. 

In the first years of Wikipedia’s existence, many of us said that, as an example of citizen journalism and journalism by the people, Wikipedia would be able to avoid the gatekeeping problems faced by traditional media. The theory was that because we didn’t have the burden of shareholders and the practices that favoured elite viewpoints, we could produce a media that was about ‘all of us’ and not just ‘some of us’.

Dan Gillmor (2004) wrote that Wikipedia was an example of a wave of citizen journalism projects initiated at the turn of the century in which ‘news was being produced by regular people who had something to say and show, and not solely by the “official” news organizations that had traditionally decided how the first draft of history would look’ (Gillmor, 2004: x).

Yochai Benkler (2006) wrote that projects like Wikipedia enable ‘many more individuals to communicate their observations and their viewpoints to many others, and to do so in a way that cannot be controlled by media owners and is not as easily corruptible by money as were the mass media.’ (Benkler, 2006: 11)

I think that at that time we were all really buoyed by the idea that Wikipedia and peer production could produce information products that were much more representative of “everyone’s” experience. But the idea that Wikipedia could avoid bias completely, I now believe, is fundamentally wrong. Wikipedia presents a particular view of the world while rejecting others. Its bias arises both from its dependence on sources which are themselves biased and from its own policies and practices that favour particular viewpoints. Although Wikipedia is as close to a truly global media product as we have probably ever come*, like every media product it is a representation of the world and is the result of a series of editorial, technical and social decisions made to prioritise certain narratives over others.

Full disclosure: Diary of an internet geography project #3

Reblogged from ‘Connectivity, Inclusivity and Inequality’

In this series of blog posts, we are documenting the process by which a group of computer and social scientists are working together on a project to understand the geography of Wikipedia citations. Our aim is not only to better understand how far Wikipedia has come to representing ‘the sum of all human knowledge’ but to do so in a way that lays bare the processes by which ‘big data’ is selected and visualized. In this post, I outline the way we initially thought about locating citations and Dave Musicant tells the story of how he has started to build a foundation for coding citation location at scale. It includes feats of superhuman effort including the posting of letters to a host of companies around the world (and you thought that data scientists sat in front of their computers all day!)

Many articles about places on Wikipedia include a list of citations and references linked to particular statements in the text of the article. Some of the smaller language Wikipedias have fewer citations than the English, Dutch or German Wikipedias, and some have very, very few, but the source of information about places can still act as an important signal of ‘how much information about a place comes from that place’.

When Dave, Shilad and I did our overview paper (‘Getting to the Source’) looking at citations on English Wikipedia, we manually looked up the whois data for a set of 500 randomly collected citations for articles across the encyclopedia (not just about places). We coded citations according to their top-level domain so that if the domain was a country code top-level domain (such as ‘.za’), then we coded it according to the country (South Africa), but if it was using a generic top-level domain such as .com or .org, we looked up the whois data and entered the country for the administrative contact (since the technical contact is often the domain registration company, which is often located in a different country). The results were interesting, but perhaps unsurprising. We found that the majority of publishers were from the US (at 56% of the sample), followed by the UK (at 13%) and then a long tail of countries including Australia, Germany, India, New Zealand, the Netherlands and France at either 2 or 3% of the sample.
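The coding rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure we actually ran (the lookups for the paper were done by hand): the function name `code_citation`, the tiny country table and the `whois_admin_country` callback are all hypothetical stand-ins, and a real run would need the full list of country code domains and an actual whois lookup.

```python
from urllib.parse import urlparse

# Illustrative subset only (assumption); the full country code list is much longer.
CCTLD_TO_COUNTRY = {
    "za": "South Africa",
    "uk": "United Kingdom",
    "au": "Australia",
    "de": "Germany",
}

# Generic top-level domains that say nothing about location on their own.
GENERIC_TLDS = {"com", "org", "net"}


def code_citation(url, whois_admin_country):
    """Return a country for a citation URL, or None if it cannot be coded.

    `whois_admin_country` is a hypothetical callable that takes a hostname
    and returns the country of the whois administrative contact.
    """
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    if tld in CCTLD_TO_COUNTRY:
        # Country code domain: code directly by country.
        return CCTLD_TO_COUNTRY[tld]
    if tld in GENERIC_TLDS:
        # Generic domain: fall back to the whois administrative contact.
        return whois_admin_country(host)
    return None
```

Passing the whois step in as a function keeps the sketch honest about what it does not know: how that lookup is performed, and how reliable its answer is, is exactly the part that needed manual work.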

Geographic distribution of English Wikipedia sources, grouped by country and continent. Ref: ‘Getting to the Source: Where does Wikipedia get its information from?’ Ford, Musicant, Sen, Miller (2013).

This was useful to some extent, but we also knew that we needed to extend this to capture more citations and to do this across particular types of article in order for it to be more meaningful. We were beginning to understand that local citation practices (local in the sense of the type of article and the language edition) dictated particular citation norms and that we needed to look at particular types of article in order to better understand what was happening in the dataset. This is a common problem besetting many ‘big data’ projects when the scale is too large to get at meaningful answers. It is this deeper understanding that we’re aiming at with our Wikipedia geography of citations research project. Instead of just a random sample of English Wikipedia citations, we’re going to be looking up citation geography for millions of articles across many different languages, but only for articles about places. We’re also going to be complementing the quantitative analysis with some deep dive qualitative analysis into citation practice within articles about places, and doing the analysis across many language versions, not just English. In the meantime, though, Dave has been working on the technical challenge of how to scale up location data for citations using the whois lookups as a starting point.

Full disclosure: Diary of an internet geography project #2

Reblogged from ‘Connectivity, Inclusivity and Inequality’

In this series of blog posts, Heather Ford documents the process by which a group of computer and social scientists are working together in a project to understand the geography of Wikipedia citations. Their aim is not only to better understand how far Wikipedia has come to representing ‘the sum of all human knowledge’ but to do so in a way that lays bare the processes by which ‘big data’ is selected and visualized. In this post, Heather discusses how the group are focusing their work on a series of exploratory research questions.

In last week’s call, we had a conversation about articulating the initial research questions that we’re trying to answer. At its simplest level, we decided that what we’re interested in is:

‘How much information about a place on Wikipedia comes from that place?’

In the English Wikipedia article about Guinea Bissau, for example, how many of the citations originate from organisations or authors in Guinea Bissau? In the Spanish Wikipedia article about Argentina, what proportion of editors are from Argentina? Cumulatively, can we see any patterns among different language versions that indicate that some language versions contain more ‘local’ voices than others? We think that these are important questions because they point to the extent to which Wikipedia projects can be said to be a reflection of how people from a particular place see the world; they also point to the importance of particular countries in shaping information about certain places from outside their borders. We think it makes a difference to the content of Wikipedia that the US’s Central Intelligence Agency (CIA) is responsible for such a large proportion of the citations, for example.

Past research from Brendan Luyt and Tan (2010, PDF) is instructive here. In 2010, Luyt and Tan took a random sample of national history articles on Wikipedia (English) and found that 17% of the sources cited were government sites and, of those 17%, four of the top five sites were US government sites including the US Department of State and the CIA World Fact Book. The authors write that this is problematic because ‘the nature of the institutions producing these documents makes it difficult for certain points of view to be included. Being part of the US government, the CIA World Fact Book, for example, is unlikely to include a reading of Guatemalan history that stresses the role of US intervention as an explanation for that country’s long and brutal civil war.’ (p719) Instead of Luyt and Tan’s history articles, we’re looking at country articles and we’re zeroing in on citations and trying to ‘locate’ those citations in different ways. While we were talking on Skype, Shilad drew a really great diagram to show how we seem to be looking at this question of information geography.

In this research, we seem to be looking at locating all three elements (the location of the article, the sources/citations and the editors) and then establishing the relationships between them, i.e.

RQ1a What proportion of editors from a place edit articles about that place?

RQ1b What proportion of sources in an article about a place come from that place?

RQ1c What proportion of sources from particular places are added by editors from that place?
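Each of these questions boils down to a simple proportion once every source (or editor) has been assigned a place. As a rough sketch of RQ1b only, and assuming we already have one country label per citation, the calculation might look like the following. The function name and the example labels are hypothetical, and leaving unlocated sources out of the denominator is a design choice we have not settled on.

```python
def local_source_share(article_place, source_places):
    """Share of located sources that come from the place the article is about.

    `source_places` holds one country label per citation, or None where
    the source could not be located. Returns None if nothing was located.
    """
    located = [place for place in source_places if place is not None]
    if not located:
        return None
    local = sum(1 for place in located if place == article_place)
    return local / len(located)


# Hypothetical labels for an article about a place in Kenya.
labels = ["Kenya", "United States", None, "Kenya", "Poland"]
print(local_source_share("Kenya", labels))  # 2 of the 4 located sources: 0.5
```

The same ratio would serve for RQ1a and RQ1c by swapping in editor locations, which is why so much rides on how ‘location’ is assigned in the first place.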

We started out by taking the address of the administrative contact contained in a source’s domain registration as the signal for the source’s location, but we’ve come up against a number of issues as we’ve discussed the initial results. A precursor to the questions above seems to be how we define ‘location’ in the context of a citation contained within an article about a particular place. There are numerous signals that we might use to associate a citation with a particular place: the HQ of the publisher, for example, or the nationality of the author; the place in which the article/paper/book is set, or the place in which the publishers are located. An added complexity has to do with the fact that websites sometimes host content produced elsewhere. Are we using ‘author’ or ‘publisher’ when we attempt to locate a citation? If we attribute the source to the HQ of the website and not the actual text, are we still accurately portraying the location of the source?

In order to understand which signals to use in our large-scale analysis, then, we’ve decided to try to get a better understanding of both the shape of these citations and the context in which those citations occur by looking more closely at a random sample of citations from articles about places and asking the questions:

RQ0a To what extent might signals like ‘administrative contact of the domain registrant’ or ‘country domain’ accurately reflect the location of authors of Wikipedia sources about places?

RQ0b What alternative signals might more accurately capture the locations of sources?

Already in my own initial analysis of the English Wikipedia article on Mombasa, I noticed that citations to some articles written by locals were hosted on domains such as wordpress.com and wikidot.com that are registered in the US and Poland respectively. There was also a citation to the Kenyan 2009 census authored by the Kenya National Bureau of Statistics hosted by Scribd.com, and a link to an article about Mombasa written by a Ugandan on a US-based travel blog. All this means that we are going to under-represent East Africans’ participation in the writing of this place-based article about Mombasa if we use signals like domain registration.
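One way to put a number on RQ0a is to hand-code the author’s location for each citation in the sample and check how often the domain signal agrees with it. The sketch below does this for three of the Mombasa examples mentioned above. The row structure and function names are assumptions, and I have coded the ‘locals’ as Kenyan for the purposes of illustration; it is not a description of how the final analysis will work.

```python
from collections import Counter

# Three Mombasa citations, coded by hand. "signal" is the country implied
# by the domain registration; "author" is the hand-coded author location
# (an assumption for this sketch).
sample = [
    {"host": "wordpress.com", "signal": "United States", "author": "Kenya"},
    {"host": "wikidot.com", "signal": "Poland", "author": "Kenya"},
    {"host": "US-based travel blog", "signal": "United States", "author": "Uganda"},
]


def signal_agreement(rows):
    """Fraction of citations where the domain signal matches the author."""
    if not rows:
        return None
    matches = sum(1 for row in rows if row["signal"] == row["author"])
    return matches / len(rows)


def country_counts(rows, key):
    """Tally citations per country under a given coding ('signal' or 'author')."""
    return dict(Counter(row[key] for row in rows))


print(signal_agreement(sample))          # 0.0: the signal is wrong every time
print(country_counts(sample, "signal"))  # no East African countries at all
print(country_counts(sample, "author"))  # Kenya and Uganda appear
```

Three rows prove nothing statistically, but they show the shape of the problem: under the domain signal, the East African authors disappear from the tally entirely.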

We can, of course, ‘solve’ each of these problems by removing hosting sites like WordPress from our analysis, but the concern is whether this will negatively affect the representation of efforts by those few in developing countries who are doing their best to produce local content on Wikipedia. Right now, though, we’re starting with the micro level instances and developing a deeper understanding that way, rather than the other way around. And that I really appreciate.

Full disclosure: Diary of an internet geography project #1

Reblogged from ‘Connectivity, Inclusivity and Inequality’

OII research fellow, Mark Graham and DPhil student, Heather Ford (both part of the CII group) are working with a group of computer scientists including Brent Hecht, Dave Musicant and Shilad Sen to understand how far Wikipedia has come to representing ‘the sum of all human knowledge’. As part of the project, they will be making explicit the methods that they use to analyse millions of data records from Wikipedia articles about places in many languages. The hope is that by experimenting with a reflexive method of doing a multidisciplinary ‘big data’ project, others might be able to use this as a model for pursuing their own analyses in the future. This is the first post in a series in which Heather outlines the team’s plans and processes.

It was a beautiful day in Oxford and we wanted to show our Minnesotan friends some Harry Pottery architecture, so Mark and I sat on a bench in the Balliol gardens while we called Brent, Dave and Shilad who are based in Minnesota for our inaugural Skype meeting. I have worked with Dave and Shilad on a paper about Wikipedia sources in the past, and Mark and Brent know each other because they both have produced great work on Wikipedia geography, but we’ve never all worked together as a team. A recent grant from Oxford University’s John Fell Fund provided impetus for the five of us to get together and pool efforts in a short, multidisciplinary project that will hopefully catalyse further collaborative work in the future.

In last week’s meeting, we talked about our goals and timing and how we wanted to work as a team. Since we’re a multidisciplinary group who really value both quantitative and qualitative approaches, we thought that it might make sense to present our goals as consisting of two main strands: 1) to investigate the origins of knowledge about places on Wikipedia in many languages, and 2) to do this in a way that is both transparent and reflexive.

In her eight ‘big tent’ criteria for excellent qualitative research, Sarah Tracy (2010, PDF) includes self-reflexivity and transparency in her conception of researcher ‘sincerity’. Tracy believes that sincerity is a valuable quality that relates to researchers being earnest and vulnerable in their work and ‘considering not only their own needs but also those of their participants, readers, coauthors and potential audiences’. Despite the focus on qualitative research in Tracy’s influential paper, we think that practicing transparency and reflexivity can have enormous benefits for quantitative research as well, but one of the challenges is finding ways to pursue transparency and reflexivity as a team rather than as individual researchers.

Transparency

Tracy writes that transparency is about researchers being honest about the research process.

‘Transparent research is marked by disclosure of the study’s challenges and unexpected twists and turns and revelation of the ways research foci transformed over time.’

She writes that, in practice, transparency requires a formal audit trail of all research decisions and activities. For this project, we’ve set up a series of Google docs folders for our meeting agendas, minutes, Skype calls, screenshots of our video call as well as any related spreadsheets and analyses produced during the week. After each session, I clean up the meeting minutes that we’ve co-produced on the Google doc while we’re talking, and write a more narrative account about what we did and what we learned beneath that.

Although we’re co-editing these documents as a team, it’s important to note that, as the documenter of the process, it’s my perspective that is foregrounded and I have to be really mindful of this as I reflect on what happened. Our team meetings are occasions for discussion of the week’s activities, challenges and revelations which I try to document as accurately as possible, but I will probably also need to conduct interviews with individual members of the team further along in the process in order to capture individual responses to the project and the process that aren’t necessarily accommodated in the weekly meetings.

Reflexivity

According to Tracy, self-reflexivity involves ‘honesty and authenticity with one’s self, one’s research and one’s audience’. Apart from the focus on interrogating our own biases as researchers, reflexivity is about being frank about our strengths and weaknesses, and, importantly, about examining our impact on the scene and asking for feedback from participants.

Soliciting feedback from participants is something quite rare in quantitative research but we believe that gaining input from Wikipedians and other stakeholders can be extremely valuable for improving the rigor of our results and for providing insight into the humans behind the data.

As an example, a few years ago when I was at a Wikimedia Kenya meetup, I asked what editors thought about Mark Graham’s Swahili Wikipedia maps. One respondent was immediately able to explain the concentration of geolocated articles from Turkey because he knew the editor who was known as a specialist of Turkey geography stubs. Suddenly the map took on a more human form: a reflection of the relationships between real people trying to represent their world. More recently, a Swahili Wikipedian contacted Mark about the same maps and engaged him in a conversation about how they could be made better. Inspired by these engagements, we want to really encourage those conversations and invite people to comment on our process as it evolves. To do this, we’ll be blogging about the progress of the project and inviting particular groups of stakeholders to provide comments and questions. We’ll then discuss those comments and questions in our weekly meetings and try to respond to as many of them as possible in thinking about how we move the analysis forward.

In conclusion, transparency and reflexivity are two really important aspects of researcher sincerity. The challenge with this project is trying to put this into practice in a quantitative rather than qualitative project, a project driven by a team rather than an individual researcher. Potential risks are that I inaccurately report on what we’re doing, or expose something about our process that is considered inappropriate. What I’m hoping is that we can mark these entries clearly as my initial, necessarily incomplete reflections on our process and that this can feed into the team’s reflections going forward. Knowing the researchers in the team and having worked with all of them in the past, my goal is to reflect the ways in which they bring what Tracy values in ‘sincere’ researchers: the empathy, kindness, self-awareness and self-deprecation that I know all of these team members display in their daily work.