Narrative and Structured Historical Data

Our hypothesis is that new Large Language Model (LLM) AI services can extract structured data from historical narrative with an accuracy comparable to a human and at a lower cost.

This article is based on a Poster Session presentation at the November 2023, 1st international conference on artificIAl Intelligence and applied MAthematics for History and Archaeology (IAMAHA) in Nice, France.

Note that this analysis was run in 2023 using OpenAI GPT-3 and GPT-4. More recent models from OpenAI, such as GPT-4o, are showing higher accuracy and lower cost than these 2023 results. Our original hypothesis looks promising for the future if there are further model improvements each year.

Video

thumbnail of video

Poster

thumbnail of poster

Overview

Digital history tools use structured data to create models of historical environments, but a very large fraction of historical data is in narrative format. Building a large set of structured data requires identifying individual factoids from within historical narratives. Recent advances in Artificial Intelligence and Machine Learning (AI/ML) have led to innovative neural networks known as the Large Language Models (LLMs) that can follow a train of thought in written work and then answer questions about that work. The Running Reality digital history desktop application has been upgraded with an experimental feature to interface with LLMs to import data from narrative text. Running Reality breaks up the text into single-topic sections, provides the section to the LLM, then asks the LLM a predefined set of questions. Running Reality has predefined sets of questions for text whose subject may be a city or a person, to determine if the text contains basic data such as founding or birth dates, alternative names, as well as locations over time. The OpenAI ChatGPT version 3.5 LLM is able to work with text within a 4096 token (or approximately 3000 word) look-back attention buffer, so Running Reality tries to keep section text to within this limit. The results of the experimental feature show that a combination of Running Reality and an LLM promises to be able to build large structured historical datasets.

Methodology

For a human to extract structured data in a uniform format takes time and tooling beyond just reading the text. The accuracy of humans depends on skill level. Crowdsourcing approaches in other fields have relied on large numbers of volunteers to cross-check one another, extensive support tooling, and expert review of results. Even higher-skilled paid humans would require tooling to produce uniform results, i.e. validation and linking of dates, names, event wording, and locations.

Most historical data is in narrative form and existing structured data sets have been built at great cost and, as a consequence, can carry usage or license restrictions.

Running Reality adapted its existing data source processor that can ingest, transform, and reformat structured historical data. The Running Reality app calls the LLM-as-a-service known as OpenAI GPT via its Application Programming Interface (API). The app sends blocks of narrative wrapped with guidance instructions 1) to only use the text provided and 2) to produce JavaScript Object Notation (JSON) output and a series of questions about whether the text references historical events, such as whether a city experienced an earthquake. This capability is experimental, but is currently available to all users of the app. Running Reality is characterizing the performance of this experimental approach by assessing against narrative data sources of value to Running Reality.

Results

Data SourcePagesQuestionsTokens ModelFactoidsFully Correct %Price
An Inventory of Archaic and Classical Poleis 20 8 per section 377755 GPT3.5 46 52% $0.38USD
GPT4 19 100% $3.82USD
Wikipedia 2 8 per section 143099 GPT3.5 46 46% $0.14USD
GPT4 16 88% $1.45USD

Fully correct factoids had all data correct and could be used in the Running Reality world history model. Mostly correct factoids had a minor error, such as a formatting error. Partially correct factoids had some data correct, such as a date or event subject or event object but had some data incorrect, such as mistaking the outcome of an event. Incorrect factoids were unusable, with no traceability to the source text.

Assessment

The results show promise, yet human supervision remains critical.

Considerations

Next Steps

The next steps will test additional kinds data sources and improve the RR interface with OpenAI's GPT.

Acknowledgements

Running Reality collaborative assessment of “An Inventory of Archaic and Classical Poleis” is in partnership with Valentina Mignosa of the Greek History project Mapping Ancient Sicily funded by the University Ca' Foscari of Venice (PI Stefania De Vido).