Encyclopaedia Britannica Data Visualisation

2020 National Library of Scotland
Overview
This project was developed as part of the University of Edinburgh’s Data Science for Design course, in collaboration with the National Library of Scotland. Our dataset comprised eight OCR-scanned editions of the Encyclopaedia Britannica, published between 1768 and 1860. As a historical body of general knowledge, the Encyclopaedia provides a compelling lens through which to examine how information, and the way we organise it, have evolved over time.
As a UI designer, I worked alongside a data scientist and a game designer to translate our research into an interactive, accessible platform. The goal was to create an engaging entry point for exploring the Encyclopaedia’s content and structure across its various editions.

Challenge

01

The structure of the data made it nearly impossible to craft RegEx queries that didn’t also capture irrelevant noise.

02

Character recognition was inconsistent: letters were often misread as other letters or punctuation marks (see the sketch after this list).

03

Headers were poorly detected across the text files, making it difficult to segment the content accurately.

04

Images didn’t consistently align with the corresponding content in the same file.
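To make the first two challenges concrete, here is a minimal, hypothetical sketch. The sample lines and the pattern are invented for illustration, but they show how common OCR confusions (an O read as a zero, a comma read as a period or semicolon, a long s scanned as "f") defeat a naive headword pattern:

```python
import re

# Hypothetical OCR lines: the first is clean, the others contain typical
# misreads (O -> 0, comma -> period/semicolon, long s scanned as "f").
lines = [
    "BOTANY, the science which treats of plants.",
    "B0TANY. the fcience which treats of plants,",
    "ANAT0MY; the art of differing parts of animal bodies.",
]

# A naive pattern: an all-caps headword followed by a comma and a space.
naive = re.compile(r"^([A-Z]{2,}),\s+(.+)$")

for line in lines:
    match = naive.match(line)
    print(match.groups() if match else f"missed: {line!r}")
```

Only the clean line matches; loosening the pattern enough to catch the noisy lines tends to capture irrelevant noise instead, which is exactly the trade-off described in challenge 01.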

Solution
Because the data consisted of unstructured OCR text, we needed to impose our own structure on it. We focused on two specific patterns: simple entries of the form “TERM, definition” and cross-references of the form “See x”, and used the number of cross-references as a proxy for a topic’s relative popularity across editions.

Working with unclean OCR data presented challenges: regular expressions couldn’t consistently extract entries without noise or omissions. Because these inconsistencies appeared fairly uniformly across editions, however, we were still able to draw proportional insights from the processed data. Identifying the most referenced or longest terms required manual review of the top entries, since the data was too noisy to be parsed reliably by automation alone.
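As a rough sketch of this extraction step (the patterns and sample text below are illustrative assumptions, not the project’s actual code), a couple of regular expressions are enough to pull out headword entries and tally “See x” references:

```python
import re
from collections import Counter

# Illustrative sample text in the "TERM, definition" / "See x" style.
sample = """\
ABACUS, an instrument for computing. See ARITHMETIC.
ANATOMY, the art of dissecting. See SURGERY.
ALGEBRA, a method of computation. See ARITHMETIC.
"""

ENTRY = re.compile(r"^([A-Z][A-Z'-]+),\s+(.+)$", re.MULTILINE)  # "TERM, definition"
SEE_REF = re.compile(r"\bSee\s+([A-Z][A-Za-z'-]+)")             # "See x"

entries = ENTRY.findall(sample)  # (term, definition) pairs
counts = Counter(m.group(1).upper() for m in SEE_REF.finditer(sample))

print(entries[0])            # ('ABACUS', 'an instrument for computing. See ARITHMETIC.')
print(counts.most_common())  # [('ARITHMETIC', 2), ('SURGERY', 1)]
```

Run per edition, counts like these give the reference tallies we used as a popularity proxy, with manual review of the top entries to filter out OCR noise.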

Target Users
Our audience is the general public: curious individuals who are interested in the Encyclopaedia Britannica but may feel overwhelmed by the sheer volume and density of its content.

Final Output

We created a website to visualise our investigation. The platform highlights five frequently referenced subjects in the Encyclopaedia: Anatomy, Architecture, Agriculture, Botany, and Chemistry.

Each field is represented by an image. Hovering over an image reveals the number of references to that subject across different editions, offering a glimpse into how interest in certain topics changed over time.

In the lower-left corner, we included a comparative visualisation of topic popularity. This element was refined based on feedback from our data holder, who noted that the concepts of “popularity” and “reference count” were originally separated across pages and could lead to misinterpretation if not visually integrated.
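To give a concrete sense of what the comparative view encodes, here is a minimal matplotlib sketch. The counts are placeholder values for illustration only; the website itself used its own front-end and the figures actually extracted from the OCR text:

```python
import matplotlib.pyplot as plt

# Placeholder reference counts per subject, for two of the eight editions.
subjects = ["Anatomy", "Architecture", "Agriculture", "Botany", "Chemistry"]
editions = ["1st ed. (1768)", "8th ed. (1860)"]
counts = {
    "1st ed. (1768)": [12, 9, 15, 7, 5],     # placeholder values
    "8th ed. (1860)": [60, 41, 38, 55, 70],  # placeholder values
}

width = 0.35
positions = range(len(subjects))
fig, ax = plt.subplots()
for i, edition in enumerate(editions):
    # Offset each edition's bars so the groups sit side by side.
    ax.bar([p + i * width for p in positions], counts[edition],
           width, label=edition)

ax.set_xticks([p + width / 2 for p in positions])
ax.set_xticklabels(subjects)
ax.set_ylabel("Cross-reference count")
ax.legend()
plt.show()
```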

Open Resource

Let's dig into the code a bit  💻
This project was also published on the
All the code used for processing the data is openly available on