Reading 35,000 Books
Reading 35,000 Books: UCD CS faculty collaborate with the UCD School of English to help humanities scholars better explore our cultural past
Dr. Derek Greene is an Assistant Professor in the UCD School of Computer Science and a Funded Investigator at the SFI Insight and VistaMilk research centres. Prof. Gerardine Meaney is Professor of Cultural Theory in the UCD School of English, Drama and Film.
In recent years, the potential for collaboration between data science and other disciplines to develop new research methods has been increasingly recognised. This is particularly evident in the development of cultural analytics in the field of Digital Humanities, where available datasets and other digital resources for humanities research have expanded rapidly in the last decade. Since 2011 there has been an active collaboration between the School of Computer Science and the School of English. Using advanced data science techniques, we have approached literary sources with new questions and come up with some surprising findings.
Initially, our work focused on developing network analysis models to represent the associations between characters in 19th and early 20th century Irish and British fiction (http://www.nggprojectucd.ie). In this case, the data analysed was a corpus of annotated full texts of 46 novels, with over 9,000 named entities, a very large data set for literary studies. For a novel like Pride and Prejudice, we were able to identify ways in which seemingly minor characters were in fact crucial in bringing about major plot devices. Understanding this, we can now revisit these characters and increase our knowledge of the society in which Jane Austen wrote.
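To make the idea of a character network concrete, here is a minimal sketch in Python. It assumes that an association means two characters appearing in the same scene, which is only one possible definition and not necessarily the one used in the project. The function names and the toy scenes are invented for illustration and are not taken from the annotated corpus.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_network(scenes):
    """Count how often each pair of characters appears in the same scene.

    `scenes` is a list of lists of character names (an assumed input format).
    Returns a Counter mapping (name_a, name_b) pairs to edge weights.
    """
    edges = Counter()
    for characters in scenes:
        # Deduplicate and sort so ("A", "B") and ("B", "A") are the same edge.
        for pair in combinations(sorted(set(characters)), 2):
            edges[pair] += 1
    return edges


def weighted_degree(edges):
    """Sum the weights of all edges attached to each character."""
    degree = Counter()
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return degree


# Toy scenes, invented purely for illustration.
toy_scenes = [
    ["Elizabeth", "Darcy", "Bingley"],
    ["Elizabeth", "Jane"],
    ["Elizabeth", "Darcy"],
    ["Jane", "Bingley"],
]

edges = build_cooccurrence_network(toy_scenes)
degree = weighted_degree(edges)

for name, score in degree.most_common():
    print(name, score)
```

Weighted degree only measures how connected a character is. Spotting minor characters who link otherwise separate groups would call for a different measure, such as betweenness centrality, which is not shown here.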
More recently, as part of the IRC-funded Contagion project (http://www.contagion.ie/), our focus has shifted to analysing historical trends at a larger scale. Through a collaboration with the British Library Labs (https://www.bl.uk/projects/british-library-labs), we have access to a much larger corpus from the British Library, covering 35,918 English-language fiction and non-fiction books dating from 1700 to 1899. This is equivalent to over 12 million individual pages of printed text. For this project, we wanted to explore historical understandings of disease, contagion and migration, in order to better understand current public health challenges.
A project like this, where a huge corpus is available, presents significant challenges to humanities researchers who are studying a very specific theme. To grasp how society understood these themes, the more discussions of them we can analyse, the better. This means looking at texts where they are only mentioned briefly, fiction books where they are used as plot devices, and non-fiction texts where they are the central theme. Analytics can also be used to establish how these themes are discussed.
The Data
The books were originally digitised to image format and then converted to plain text via optical character recognition (OCR). As a result, the quality and formatting of the text vary considerably, particularly in the case of older books. The British Library also provided metadata for the corpus, including information such as author, edition and place of publication for each book.
Our Approach
In February 2019, we organised a workshop at the British Library in London to showcase the Curatr platform, a web-based interface which we developed to make the British Library corpus more accessible and useful to a wider group of researchers. The platform indexes all of this text and the associated metadata, allowing the corpus to be browsed, searched, and filtered by author, title, and year. The interface also incorporates a digitised version of the topical classification index of volumes used by the British Library from 1823-1985, which allows the texts to be further filtered by categories such as “fiction”, “drama”, and “geography”.
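The kind of metadata filtering described above can be sketched in a few lines of Python. This is an illustration only, not Curatr's actual implementation: the `Volume` record, the `filter_volumes` function, and the placeholder catalogue entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    """Hypothetical metadata record for one digitised book."""
    title: str
    author: str
    year: int
    category: str


def filter_volumes(volumes, author=None, year_from=None, year_to=None, category=None):
    """Return the volumes matching every filter that was supplied."""
    results = []
    for v in volumes:
        # Author filter is a case-insensitive substring match.
        if author is not None and author.lower() not in v.author.lower():
            continue
        if year_from is not None and v.year < year_from:
            continue
        if year_to is not None and v.year > year_to:
            continue
        if category is not None and v.category != category:
            continue
        results.append(v)
    return results


# Placeholder entries, not real catalogue records.
catalogue = [
    Volume("Placeholder Novel", "Author A", 1815, "fiction"),
    Volume("Placeholder Play", "Author B", 1790, "drama"),
    Volume("Placeholder Travels", "Author A", 1850, "geography"),
]

fiction_1800s = filter_volumes(catalogue, year_from=1800, year_to=1899, category="fiction")
print([v.title for v in fiction_1800s])
```

A linear scan like this is fine for a toy list. A platform covering tens of thousands of books would more plausibly rely on a search index, but the source does not say which technology is used.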
Curatr incorporates functionality to build word lexicons. These are lists of thematically related keywords, which are used to locate niche research topics within little-known or long, unwieldy texts. To reduce the manual effort required to build new word lexicons, we provide users with automatic keyword recommendations, generated using word embeddings. Word embeddings are a set of machine learning techniques, based on neural networks, which "map" the words in a corpus vocabulary to a numeric representation. In this new representation, words which frequently appear together in the original corpus will appear to be similar to one another, while words which do not frequently appear together will be dissimilar. So, for example, for the input word “influenza”, we could automatically recommend similar words such as “pneumonia” and “bronchitis”. In this way, researchers can quickly build lexicons of related word