Thanks to the digitization of millions of documents, including newspapers, books, and handwritten sources, Melissa Terras and the team at the Centre for Data, Culture and Society (CDCS) at the University of Edinburgh are bringing context to large-scale data processing and HPC.
“Humanities looks at history to better understand the current issues facing our society,” explains Melissa Terras, Director of the CDCS at the University of Edinburgh, “so why not take this humanities approach and look deeper into the history of the data we collect?”
Melissa and the CDCS focus on the digitization of cultural heritage. Working with researchers, they help to build experiments and drive innovation around the potential of data-rich and applied digital research. From images to texts, recordings, and video, their work encompasses large-scale processing of the ‘data’ of society, allowing researchers to ask questions and explore topics that have a global impact.
“Newspapers, for example, are a readily available source of ‘data,’” Melissa notes. “You can gather together newspapers from around the world, focus them on a particular day or event, and start to interrogate what factors linked the stories together. For example, when the eruption of Krakatoa happened in 1883 there were few copyright laws, so newspapers would copy and paste one story into their format. We can take that original story and see how it changed as it moved from paper to paper. From there we can ask why those changes happened and go deeper to find the answers.”
Seeing how news spreads around the world is just one of the many projects Melissa and the CDCS are involved in. “You have Oceanic Exchanges, which looks at the way news is reported around the world. But we also know that while these events are unfolding at the macro, or public, level, there are private notes, letters, and documents that can give you valuable micro-level information.”
Transkribus, a comprehensive platform for the digitization, AI-powered text recognition, transcription, and searching of historical documents, came together to tackle millions of handwritten documents and bring them into the fold of large-scale research. “Historically, a handwritten document was captured as an image. It took a lot of time as a researcher to work through these images – it really limited any significant research scope. Ten years ago, we started working on AI/ML algorithms to tackle this issue, two years ago we launched as a non-profit co-op subscription service, and today we have a 95% accuracy rating and over 50,000 users.”
“How we study the data of the past impacts how we gather data in the future.”
Bringing supercomputing methodology to historical texts represents only a portion of what Melissa’s team does. “Because the ability to look at large-scale data sets like these is still relatively new, we are working hand-in-hand with researchers to take their domain knowledge and scale it to the levels of HPC. We’re a part of the ARCHER UK National Supercomputing Service and offer bursary schemes to get important HPC research in the humanities underway. We have collaborations with groups such as The Alan Turing Institute, the National Library of Scotland, and the British Library, as well as with institutions all over the world who contribute to our shared goals.”
Melissa and her team see broader impact in the work they do around data. “How we study the data of the past impacts how we gather data in the future. When you think about it, data is never really live. From the moment a piece of data is captured it becomes a historical record. It is up to us to make sure we leave as complete and inclusive records as possible for future research, and future generations, to benefit from.”
- Transcribing Handwritten Documents with AI
- Oceanic Exchanges
- Data-centric Review of the Industrial Revolution
- Centre for Data, Culture and Society
Cristin Merritt, SC21 Inclusivity Liaison for Communications