This lab follows on from our experiments in text analysis and involves exploring two examples of using data mining in the digital humanities. Case 1 uses the Old Bailey Online as the source for a practical experiment. This experiment pulls together several topics we’ve already covered, as we used the Old Bailey Online API to extract data and Voyant Tools to analyse it. Case 2 involves analysing a project from the Utrecht University Digital Humanities Lab.
Being late to complete the lab meant that classmates had already uncovered some niggles using the Old Bailey Online with Voyant Tools. Given that both of these are free services run on a best-efforts basis, both teams were incredibly helpful and responsive in answering queries and troubleshooting the problems caused by a class of 40 enthusiastic students using their tools at the same time. It also raises questions about the ongoing viability of tools and resources created by funded research projects and maintained by enthusiasts out of love. How can these be provided sustainably? Is it fair to expect free services to cope with huge workloads, or should these be commercial services to guarantee reliability and development, in which case who pays? Should open access cover data and tools, not just scholarly publications? Big questions at every turn.
Case 1: The Old Bailey Online
I have previous with studying crime. My Medieval Latin undergraduate project was on violent crime in England 1153–1309. It involved heading to the Institute of Historical Research, finding suitable rape, assault and murder cases in old court proceedings, photocopying them, typing crime, verdict and punishment data into Excel for analysis, and cutting and pasting extracts into my final essay using glue.
In the modern data-rich world there are bulletins with data tables and resulting analysis to pore over. How long will these sources last? In hundreds of years’ time, will these have been preserved so that students can pore over them as we do? In between Medieval Latin tomes and databases there are resources like the Old Bailey Online that have successfully transitioned earlier sources into digital resources.
The Old Bailey Online is an excellent resource that provides digital access to over 197,000 trials held at the Old Bailey between 1674 and 1913. The resource was created via a project directed by Professor Clive Emsley, Professor Tim Hitchcock and Professor Robert Shoemaker and funded by several bodies between 2000 and 2011. The trial information has been digitised from the Old Bailey Proceedings and the Ordinary of Newgate’s Account. Also included are biographical details of many of those executed. Whilst an unfortunate end for them, many years in the future it gives us a useful insight into the lives of people not normally included in recorded information. The website includes both digitised images and XML-encoded texts and is searchable.
You can get an idea of how influential the project has been by looking at some of the digital projects that build upon the data, the long list of publications that cite the resource and the awards it has received.
I was particularly interested in the effort put into the original digitisation process. Digital images were scanned from microfilms of the original sources. The text was transcribed using various forms of manual rekeying rather than automatic recognition, giving an accuracy of over 99%. The transcriptions were then marked up with Text Encoding Initiative (TEI) XML schemas, using a combination of automated and manual markup.
In our RECS class Ernesto had also taught us about coding texts as a qualitative research method. I was intrigued to see the coding structures that had been used to classify parts of the text. Some of the coding picks out entities such as names and locations. Of most interest to me was the coding of crimes, verdicts and punishments, where a hierarchical classification of general categories and specific types was used.
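To get a feel for how this hierarchical coding can be exploited, here is a minimal sketch in Python that extracts offence and verdict codes from a TEI-style fragment. The element names, attributes and values below are simplified illustrations in the spirit of the markup described, not the project’s actual schema.

```python
import xml.etree.ElementTree as ET

# Illustrative TEI-style fragment -- element and attribute names are
# assumptions, not the Old Bailey Online's actual encoding.
tei = """
<div type="trialAccount">
  <rs type="offenceDescription">
    <interp type="offenceCategory" value="sexual"/>
    <interp type="offenceSubcategory" value="bigamy"/>
  </rs>
  <rs type="verdictDescription">
    <interp type="verdictCategory" value="guilty"/>
  </rs>
</div>
"""

root = ET.fromstring(tei)

# Pull out the hierarchical offence coding: general category plus specific type
offence = {i.get("type"): i.get("value")
           for i in root.findall(".//rs[@type='offenceDescription']/interp")}
verdict = root.find(".//rs[@type='verdictDescription']/interp").get("value")

print(offence["offenceCategory"], offence["offenceSubcategory"], verdict)
```

Because the categories are machine-readable attributes rather than free text, counting and cross-tabulating cases becomes a matter of simple dictionary lookups rather than close reading.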
Querying Old Bailey Online
Whilst the Old Bailey Online represents a much smaller and very different sample, I wanted to see if I could get data to compare sexual offences across the centuries, using the years 09–11 of each century, to compare with the Ministry of Justice analysis.
There are two main routes to querying the Old Bailey Online corpus: using a web-based search built on a MySQL database or using a provided API to extract data. The main search form allows me to search by offence and by date. The date selection is via a drop-down list of possible years, though with no option to type in a year, which makes it laborious to scroll through all the years to select a particular one.
The search returns a list of text sections that contain any sexual offence between the selected years. By default the search returns 10 results. If there are more, pressing a “calculate total” link returns the full number of cases.
The data isn’t tabulated and can’t be exported, but because of the coding used, the defendant, offence and date are clearly visible as the title of each entry. Even counting the number of offences of each type becomes laborious once more than 10 results are returned. The information might be on a screen rather than on paper, but I wouldn’t see my methodology being that different to my undergraduate project. The search interface speeds up the finding of suitable cases, but compiling data to perform analysis, and then close reading of cases and copying extracts, would still be a very manual process and require several external tools.
I next tried the statistics search form. This has a similar search interface to the main form but adds more options for generating tabulated rather than list output (with the option to include a chart). Clicking a number in the table takes you through to the list of results again. This option provides a better synthesis of quantitative and qualitative analysis options.
This enabled me to do simple statistical analysis online: looking at a breakdown of offence sub-categories within the totals and then cross-referencing with verdict, using the search to first retrieve tabulated data and then clicking through to the digitised texts for closer reading. This interface would have made my undergraduate analysis quicker by providing a fast way to tabulate data and look for patterns using the coded topics. This simple statistical analysis helps suggest avenues for further analysis.
For example, even the headline offending counts show an increasing number of cases and a shift in verdicts:
1709–1711 6 Sexual Offences (Guilty 1 Not Guilty 5)
1809–1811 36 Sexual Offences (Guilty 17 Not Guilty 21)
1909–1911 389 Sexual Offences (Guilty 251 Not Guilty 84)
Deeper analysis could look at possible reasons for this. It could simply be that there are more people, or it could be due to changes in the legal system, or evidence of increasing criminality, and so on. It would be interesting to look at whether offences change and whether verdicts change over time. For example, bigamy and sodomy are the two most frequent sexual offences across these three periods but are no longer the focus of the analysis in the 21st century; keeping a brothel appears as an offence only in 1909–1911. There is also a noticeable headline shift in the proportion of cases found guilty. Investigating the story behind headline trends requires deeper mining of the data and closer reading of the cases and contexts.
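That shift in the proportion of guilty verdicts can be made explicit with a few lines of Python, using the headline counts above. Note that for 1909–1911 the guilty and not guilty figures do not sum to the total, presumably because other verdict categories exist, so the proportions below are of decided (guilty or not guilty) cases only.

```python
# Headline counts from the three sampled periods: (total, guilty, not guilty)
periods = {
    "1709-1711": (6, 1, 5),
    "1809-1811": (36, 17, 21),
    "1909-1911": (389, 251, 84),
}

rates = {}
for period, (total, guilty, not_guilty) in periods.items():
    decided = guilty + not_guilty
    rates[period] = guilty / decided
    print(f"{period}: {guilty}/{decided} decided cases guilty "
          f"({rates[period]:.0%})")
```

The guilty rate climbs from roughly a sixth of decided cases in 1709–1711 to around three quarters in 1909–1911, which is exactly the kind of headline pattern that invites deeper mining.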
An API was also developed, as part of the Digging into Data project Criminal Intent, that demonstrates how the Old Bailey Online can be used with the Zotero reference manager (for collecting a custom corpus) and Voyant Tools (for analysing a corpus).
The API search demonstrator is slightly different to the main search form. It assumes you are searching within trials, so doesn’t give the option to select a source. It breaks categories and subcategories into separate facets. It also allows you to search by the gender of both defendant and victim. It does not allow you to search by name or for a specific reference, and the dates are a more specific term list, which made searching within a precise date range even more difficult. The API documentation provides a full list of options for querying the API directly rather than via the demonstrator.
Initially the search returns a list of hits (text sections) and a count. Unlike the main search, there is then the option to break down the results to add a frequency count. So I was able to select offence subcategory to see the breakdown of offences, or select verdict category to see the breakdown of guilty and not guilty, but unlike with the statistics search I couldn’t tabulate these together using the API.
The search results display the list of text results on the same screen, though they list only the text reference, not the helpful defendant, offence and date title of the full search results. Clicking on a reference displays the text on the same screen, allowing a reading of the case alongside exploring the data quantitatively. The API results also give the option to download the XML for each case, though this has to be done in chunks. There is also a link for the Query URL. This gives access to the raw API and so displays the data returned as JSON. The JSON data could then be saved or passed to another tool for further manipulation.
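As a sketch of what that further manipulation might look like, here is a minimal Python example that parses a saved API response and tabulates a frequency breakdown. The JSON shape, field names and values below are illustrative assumptions, not the actual response format documented for the API.

```python
import json
from collections import Counter

# An assumed, illustrative shape for a saved Query URL response --
# the real API's field names and structure may differ.
response_text = """
{
  "total": 36,
  "hits": ["t18090412-1", "t18090517-23", "t18100214-7"],
  "breakdown": [
    {"term": "bigamy", "doc_count": 14},
    {"term": "sodomy", "doc_count": 12},
    {"term": "rape", "doc_count": 10}
  ]
}
"""

data = json.loads(response_text)

# Tabulate the frequency breakdown, most common first
counts = Counter({b["term"]: b["doc_count"] for b in data["breakdown"]})
for term, n in counts.most_common():
    print(f"{term}: {n}")
```

Once the data is in this form it is trivial to save it out as CSV, merge several date ranges, or hand it over to R for deeper statistical work.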
The API search returns different results than the main search interface. In their white paper the project team explain that this is because the search is based on trials rather than defendant, offence or victim. So the API is returning trials whilst the main search form is, I think, returning offences (there may be multiple offences in a trial).
As an online search interface it is somewhere between the main search form and the statistics search form, which would be my preferred search option of the three. The advantage of the API is better access to the data, to get it out in other forms.
The Criminal Intent project investigated two integrations that may appeal to historians: getting data into Zotero for reference management and getting data into Voyant Tools for textual analysis. There is also a data warehousing interface for visualising the data, but I couldn’t get this to work. I also struggled to extract data into Zotero. The translator didn’t activate on the API demonstrator page in either Safari or Firefox. Strangely, the translator did allow me to save search result URLs and some XML text extracts from standard search pages. I also couldn’t get the Zotero plugin that sends data from Zotero to Voyant Tools (or any other URL that might point to an analytical tool) to work. It’s possible these translators haven’t been maintained, or that my configuration and usage of them in the time available was incorrect.
Fortunately the link-up with Voyant Tools from the API demonstrator was, for me (I know some classmates had problems), much smoother. As we know, Voyant Tools provides text analytics for a corpus of documents. Clicking a URL in the API demonstrator sends either 10, 50 or 100 results to Voyant to analyse as a corpus. What I’d like to do, and couldn’t figure out how to do, is send my results from all three centuries (1709–1711, 1809–1811 and 1909–1911) for analysis together. Had I been able to get Zotero working as an intermediary, I could have assembled my three result sets as a single collection and then used the ToURL plugin to send them to a locally running copy of Voyant, to work with a bigger corpus or combine documents into a smaller number of large documents.
Voyant Tools struggled with the large volume of documents from 1909–1911, which was in any case incomplete. With the 1709–1711 corpus being quite small, I found the ideal sample for exploring with Voyant to be the 1809–1811 data. It reveals interesting subtleties in language. For example, even though there are 13 rape offences listed statistically, the word only appears in one document in the corpus; in 12 other documents the phrase “ravish and carnally know” is used. This would be particularly important to know if the cases had not already been coded. It’s also not great at understanding verdicts, as the bigram “not guilty” is used more often than the unigram “acquitted”, and searching for the unigram “guilty” will also find “not guilty”.
If I were to use these tools in an active research project, I would probably first use the statistics search with the coded topics to look for some headline trends and break these down. I would then use the API to send smaller subsets of data to Voyant Tools for distant reading and collect data via JSON for deeper statistical analysis in something like R. I would then use this analysis to help select texts for close reading.
Case 2: Circulation of Knowledge and Learned Practices in 17th-Century Dutch Republic
My second case study is a Utrecht University Digital Humanities Lab project looking at how knowledge circulated in the 17th-century Dutch Republic by analysing digitised correspondence to and from selected scholars (known as CKCC for short).
I was interested in this project because of the overlap with the time period of the Old Bailey Online and because, as an information science student, I’m also interested in how information circulates and gives rise to new knowledge. The project aims to interrogate a key period in information evolution, as a so-called scientific revolution increased the amount of information available and drove the formation of intellectual communities and scholarly practices.
The project aims to analyse a corpus of correspondence, about 20,000 letters centred on 9 scholars, both quantitatively and qualitatively, and to provide a digital repository of the letters for further scholarship. The methodology includes topic modelling, keyword analysis, named entity recognition and visualisations. This is a somewhat similar approach to the Old Bailey Online, where the topic modelling and coding allow statistical analysis whilst making digital texts available allows more qualitative techniques. Whilst the Old Bailey Online allows searching by name, CKCC goes further in identifying not just entities but the network structures between them. It’s worth noting, though, that the Old Bailey approach has allowed other projects to build upon it, such as Locating London’s Past, which allows locations mentioned in the proceedings to be visualised on contemporary maps.
The CKCC correspondence is available as a digital repository known as the ePistolarium. Like the Old Bailey Online proceedings, the letters are marked up using TEI and also include metadata, mainly about senders and recipients. As the corpus is made up of several collections of letters, the project had to deal with a variety of digitisation formats, metadata and coding, and had to normalise these to create a consolidated resource.
The main ePistolarium search interface uses faceted keyword searching. This is slightly more flexible than the Old Bailey Online. Results are presented as a list with a temporal graph. The result list can be exported as CSV. It’s also possible to select one of the visualisations for the result set and download its data as JSON. There isn’t an obvious API for the data.
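The CSV export is what makes offline analysis practical. As a sketch, here is how an exported result list might be counted by year in Python; the column names, senders and dates below are invented for illustration and are not the actual ePistolarium export format.

```python
import csv
import io
from collections import Counter

# Invented rows in the rough shape of a correspondence export --
# column names and values are illustrative assumptions only.
exported = """sender,recipient,date
Christiaan Huygens,Johannes Hevelius,1661-03-14
Christiaan Huygens,Henry Oldenburg,1661-07-02
Anthoni van Leeuwenhoek,Henry Oldenburg,1674-09-07
"""

rows = list(csv.DictReader(io.StringIO(exported)))

# Count letters per year, mirroring the temporal graph in the interface
by_year = Counter(row["date"][:4] for row in rows)
print(dict(by_year))
```

The same tabulation could feed a plotting library or R to reproduce, and extend, the temporal graph the interface provides.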
Using the similarity search function requires knowledge of the languages used in the letters and topic models (Dutch, French or Latin). The resource is perhaps a bit harder to get into without some idea of what you are looking for, or deeper knowledge of the subject matter, as acknowledged in the project’s consideration of the possible use cases and limitations of the tool.
This is a newer project than the Old Bailey Online, so hopefully the publications coming out of it will lead to future practical applications and use in research. The presentation The Digital Republic of Letters and Explorations of Future E-Humanities Research by Charles van den Heuvel, for example, is an interesting exploration of the project, its visualisations and methodological possibilities.
It’s worth noting that this kind of excavation is only possible because of the huge amount of work that has gone into preserving, digitising, encoding and then publishing the texts, and then developing user interfaces, search interfaces, APIs to access raw data and analytical tools to sift data. Only then can I begin to dig.
I’d also be interested in understanding more about how archaeologists approach excavation, to see if there is anything that could be learnt about approaching digital excavations into data. Let me know if anyone has any references!