A Small Text Analysis Mission to a Comet

Last week I watched, fascinated, as scientists at the European Space Agency entered a critical phase of their Rosetta Mission. They separated the Philae Lander from the main Rosetta craft and waited as it fell towards Comet 67P/Churyumov-Gerasimenko and attempted to land and affix itself to the surface in order to take scientific measurements. This was no Hollywood blockbuster. It had taken over 20 years of vision, ambition and quiet, unheralded dedication to reach this point. The craft themselves had spent 10 years travelling the 510 million km to their destination. Signals between Rosetta and the control centre in Germany took 28 minutes to arrive. There was no live visualisation of the events unfolding in space, but over 7 tense hours and 3 intriguing days there was data (firstly telemetry and subsequently scientific measurements), there were many words across news outlets and social media, there were comics and there was live video streaming of anxious scientists peering intently at screens, waiting to decipher the telemetry and work out if their work had been successful. Then there was jubilation that not even the scientific instinct towards caution and circumspection about their data could contain.

Source: xkcd.com Web Comic #1446 ‘Landing’ by Randall Munroe (CC BY-NC 2.5) via http://xkcd1446.org

This was a great moment for science, far beyond what I am able to comprehend, but also an interesting moment for a social historian and information science student. The human and social reception of such moments, and how they fit into broader narratives, are the lifeblood of a historian, and the evolving nature of documents, data and information communication is my current object of study.

Our DITA Mission

Our DITA mission this week was to explore text analysis. Nothing too formal or deep, but a fun introduction based on the idea of surfing, stumbling and ‘screwing around’ with a text corpus. Our aim was to begin to understand how textual analysis can illuminate by playing around with some tools (Wordle, Voyant Tools and Many Eyes) and small data sets (using data captured from our work on Altmetrics and Twitter Analysis). So not a huge corpus, and only one research question, provided by Stephen Ramsay in The Hermeneutics of Screwing Around; or What You Do With a Million Books:

“could we imagine a world in which ‘Here is an ordered list of the books you should read,’ gives way to, ‘Here is what I found. What did you find?’”

My Comet Corpus

I’d used TAGS to create several Twitter archives starting just after Philae had separated from Rosetta. I was interested in the amount of interest #CometLanding was generating. I was also interested in the use of social media by ESA to communicate their mission’s progress. Given the prospect of failure, this was exposing the mission very publicly at such a crucial moment. Even individual instruments had their own Twitter feeds, and personalities, to contribute to the narration of an event that was mostly told through words, because there were few visuals and our access to the raw data was mediated and interpreted by the mission team (see my Twitter list for a collection of ESA Rosetta accounts). I couldn’t help wondering why a scientific mission had chosen to give this moment such a human voice.

Maybe to encourage conversation and to help us explore what science means to humanity?

An archive of the hashtag #CometLanding would obviously have become unwieldy for TAGS very quickly. So, it seemed, did a query looking for mentions of @Philae2014, which broke down very quickly, though I did manage to collect 75,280 tweets between 12 November and 13 November (an 87 MB CSV file). I would have liked to create a TAGS archive for my Twitter list but this API method isn’t supported in TAGS at this time. So I also created a TAGS archive for the tweets from Philae itself (my second favourite anthropomorphic robotic device after Wall-E), a set of 736 tweets between 27 October 2014 and 15 November 2014. From these two sources I created a merged corpus combining the voice of Philae throughout its mission with about a day of Twitter interaction with the @Philae2014 Twitter account (about 76,000 tweets in total).
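For anyone wanting to repeat the merge outside a spreadsheet, here is a minimal sketch in Python. It is not how I built my corpus, just one way it could be done. It assumes each TAGS export is a CSV with a column named `text` holding the tweet; the function names and the sample rows are invented for illustration.

```python
import csv
import io


def read_tweet_texts(handle, text_column="text"):
    """Return the tweet text from each row of one TAGS-style CSV export.

    The column name "text" is an assumption about the export layout.
    """
    reader = csv.DictReader(handle)
    return [row[text_column] for row in reader if row.get(text_column)]


def merge_corpus(handles, text_column="text"):
    """Concatenate the tweet texts from several exports, in the order given."""
    merged = []
    for handle in handles:
        merged.extend(read_tweet_texts(handle, text_column))
    return merged


# Tiny in-memory stand-ins for the two archives (invented sample rows).
philae_csv = io.StringIO(
    "from_user,text\n"
    "Philae2014,example tweet from the lander\n"
)
mentions_csv = io.StringIO(
    "from_user,text\n"
    "someone,RT @Philae2014: example tweet from the lander\n"
    "someone_else,wow @Philae2014\n"
)

corpus = merge_corpus([philae_csv, mentions_csv])
print(len(corpus), "tweets in the merged corpus")
```

With real files you would pass handles from `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` stand-ins, then write the merged texts out one per line for upload to a text analysis tool.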

Visualising the Comet Corpus

The best tool for analysing this corpus was Voyant Tools. I tried Wordle, and whilst I got some interesting data and visualisation on word frequency, it could not use stop words to filter out the ‘Twitter grammar’ (hashtags, ‘RT’, ‘http’, Twitter handles and the Twitter ‘t.co’ link shortener URL). These terms dominated, so it was hard to look for other interesting patterns in meaningful words. Many Eyes from IBM was slow and buggy. It failed to cope with more than the smallest amount of free text data and I got the impression it would have been happier working with nicely wrangled structured data from a spreadsheet rather than a text corpus. Voyant Tools was the most focused on picking up patterns in text. I also realised Voyant Tools could handle the upload of multiple documents, creating a multi-document corpus for a different analysis.

Initially Voyant Tools suffered from the same Twitter grammar dominance as Wordle, but it provides a stop list feature that allows common and specified words to be excluded. I used a multilingual set of common words as my starting list (this is a European project) and then added in the top Twitter-specific terms that were skewing the data. I still left in rosetta, #67p and 67p, though I could have excluded these too given that the name of the mission and the handle for the comet are expected in the corpus and are not the most interesting terms. Excluding them would have made touchdown the most used term, with a frequency of 11,292.
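The stop list idea is simple enough to sketch in a few lines of Python. This is only an illustration of the principle, not how Voyant Tools tokenises or counts internally: the regular expressions, the tiny stop list and the sample tweets are all my own assumptions.

```python
import re
from collections import Counter

# A word, optionally prefixed by # or @ so hashtags and handles stay whole.
TOKEN = re.compile(r"[#@]?\w+")
URL = re.compile(r"https?://\S+")


def term_frequencies(tweets, stop_words):
    """Count terms after dropping URLs, Twitter handles and stop words."""
    counts = Counter()
    for tweet in tweets:
        text = URL.sub(" ", tweet.lower())  # removes t.co links and 'http'
        for token in TOKEN.findall(text):
            if token.startswith("@") or token in stop_words:
                continue
            counts[token] += 1
    return counts


# Invented sample tweets and a deliberately tiny stop list.
sample_tweets = [
    "RT @Philae2014: touchdown confirmed http://t.co/abc",
    "Wow, touchdown! #67P",
    "The harpoons did not fire",
]
stop_list = {"rt", "the", "did", "not"}

freqs = term_frequencies(sample_tweets, stop_list)
print(freqs.most_common(3))
```

Adding `rosetta`, `#67p` and `67p` to `stop_list` is the equivalent of the further exclusion described above.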

Cleaned up version of Voyant Tools with Twitter ‘grammar’ removed allows more of a focus on specific words.

As expected, the predominant language used was quite technical and related to key terms in the mission, for example Rosetta (the mission name), comet and touchdown. Status and progress type words predominate, with some unexpected peaks such as ‘Harpoons’, which were much speculated about before and after they failed to fire on landing as expected. This mission language was however mixed with words expressing the emotions generated by the successful attempt, such as great, wow, awesome, amazing and wonderful.

The present tense words evoked exploration: landing, trying, stretching, receiving, feeling and floating. Many of these words in the ‘as it happened’ tense also emphasise the anthropomorphic nature of Philae’s journey as represented via Twitter. The past tense words were slightly more status focused: imaged, deployed, confirmed, established, landed. Landed was however not as favoured as the more dramatic term touchdown for the point of contact. Landing was the most confusing word, as it was used as a gerund throughout the corpus and picked up after the landing had happened, when there was much analysis of where and how Philae had landed, resulting in a conclusion that there were three separate landings. For example, in one of its peaks it was used in a much retweeted post about landing gear rather than the act of landing.
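Telling ‘landing gear’ apart from the act of landing means looking at the words on either side, which is what a keyword-in-context view does. Below is a rough Python sketch of the idea. It is my own simplification rather than the Voyant Tools implementation, and the function name, window width and sample tweets are invented.

```python
import re


def keyword_in_context(tweets, keyword, width=3):
    """Return (left, keyword, right) word windows for each use of keyword."""
    hits = []
    for tweet in tweets:
        words = re.findall(r"[#@]?\w+", tweet.lower())
        for i, word in enumerate(words):
            if word == keyword:
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + 1:i + 1 + width])
                hits.append((left, word, right))
    return hits


# Invented sample tweets showing two different uses of the same word.
sample = [
    "My landing gear is deployed",
    "Analysis of the first landing continues",
]

hits = keyword_in_context(sample, "landing", width=2)
for left, word, right in hits:
    print(f"{left:>12} [{word}] {right}")
```

Reading down the right-hand column makes it easy to spot when a peak is really about gear, sites or bounces rather than the moment of contact.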

Macroscopic Reading

Some of this information and interpretation I knew from following the Philae landing. I monitored Twitter and other channels in real time and clipped and saved tweets and articles, so I already knew the trajectory of this journey before undertaking this distant reading approach. Textual analysis gave me different reading tools, techniques and methods: approaching the same corpus as a dataset rather than a stream certainly helped me locate different patterns. That said, even though I was able to find traces of emotion in the corpus, it still didn’t quite convey the wonder of the moment in the same way as close reading did. Like last week with altmetrics, new methods such as distant reading and textual analysis should extend our reading palette rather than replace well honed methods. Professor Tim Hitchcock made this point very eloquently at the recent British Library Labs keynote Big Data, Small Data and Meaning. Like its place in space, the story of Philae is best understood not just in the large but also in the small.

Comets and the Humanities

I can only imagine and look forward to seeing how the scientific results from the Rosetta Mission are analysed and disseminated through scholarly publishing (and hopefully social impact channels that can be traced by Altmetrics). However, I also hope that those of us in information science, data science or the digital humanities use our methods to trace social, political and cultural patterns using the corpus of comment and reaction this moment generated. For example, a macroscopic study using distant and close reading methods could compare and contrast the Rosetta mission with the space programmes of the 1960s across a diverse corpus of social and cultural texts to trace attitudes towards science, space and the unknown. The adventures out there with this material are not just scientific.


View the Corpus

Merged Corpus

  • All Data (provides access to Cirrus, Corpus Summary, Corpus and Document Term Frequency, Corpus Reader, Corpus Grid, Term Frequencies Chart, Document Keyword in Context and Reader tools)
  • Bubblelines (filtered for landing terms)
  • Bubblelines (filtered for emotion terms)

Two Document Corpus

  • All Data (provides access to Cirrus, Corpus Summary, Corpus and Document Term Frequency, Corpus Reader, Corpus Grid, Term Frequencies Chart, Document Keyword in Context and Reader tools)

Tools


Sadly for all their engagement the few Philae photos released by ESA to Flickr are all copyrighted so thanks NASA for releasing this image of the ESA Johannes Kepler under a CC licence.

Featured Image: Kepler on the Horizon (NASA, International Space Station, 06/20/11) by NASA’s Marshall Space Flight Center. Source: Flickr. (CC BY-NC 2.0)

3 Comments

  1. This is amazing! Forget the DITA assignment; I’m sure there’s a dissertation in this as well!

  2. Fantastic use of text analysis Alison! Really interesting exploration of the changes in language used on Twitter with the different stages of the mission. I agree with Dom – definitely could be a dissertation in here.
