Data Transformation

Work with your data in spreadsheets, RDF linked open data, PDFs, and databases. Extract data in a semantically aware way and save it to new files in a different format.

Overview

Running Reality has a data transformer built with feedback from researchers. It is enhanced to work within digital history tool workflows, with structured and unstructured data, and with geospatial, geo-temporal and narrative data. It is used by Running Reality itself to build its world history model.

A data layout tells Running Reality what type of data is in the data source. Without a layout, it would not know which data columns or properties represent dates, names, locations, events, or relationships.

Running Reality can sometimes suggest a data layout, for example if it finds a date column, but you will have to confirm the historical context. Even if Running Reality can auto-detect that a column holds dates, is that the date of the founding of a city, a birth date, the citation publication date, or another date? Similarly, many data sources have location data that is a named location (such as a ship's port of origin) or linked data (such as a geocoder reference ID).

To have this data source appear as a map layer, a minimal layout that identifies the location data is all that is needed. If start and end dates and a name are identified by the data layout, then the map layer can have additional nuance.

Data Layout

A data layout is a list of data fields, each of which correlates a data field within the data source to a type of historical data. Most data sources consist of sets of "records," each for a specific historical subject. Examples of data fields include names, dates, locations, and events. Examples of subjects include cities, people, buildings, coins, or artifacts. In some data sources, the records are easy to identify, such as a single row of a spreadsheet. In other sources, such as a large PDF, you may need to describe how to identify a record, such as by a paragraph break, page break, or section header.
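
As a rough illustration of this idea, a layout can be pictured as a list of field descriptions, each pairing a position in the source with a kind of historical data. The sketch below is hypothetical: the FieldSpec class, its attribute names, and the sample row are assumptions made for illustration and do not reflect the actual RRLayout file format.

```python
from dataclasses import dataclass


@dataclass
class FieldSpec:
    """Hypothetical description of one data field in a layout."""
    field_type: str  # kind of historical data, e.g. "Name" or "Date"
    position: str    # where to find it in each record, e.g. a column letter


# A made-up layout for a table of cities: column A holds names,
# column B a date, and column C a "lat,lon" pair.
layout = [
    FieldSpec("Name", "A"),
    FieldSpec("Date", "B"),
    FieldSpec("Location", "C"),
]


def apply_layout(row: dict[str, str], layout: list[FieldSpec]) -> dict[str, str]:
    """Pick out the cells named by the layout and key them by field type."""
    return {spec.field_type: row[spec.position] for spec in layout}


# One illustrative record, keyed by column letter.
row = {"A": "Athens", "B": "-500", "C": "37.98,23.73"}
typed = apply_layout(row, layout)
```

Without the layout, the row is just three strings; with it, each value carries a historical meaning that later steps can act on.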

A Running Reality world is comprised of user layers, user factoids, and the baseline factoids.

A layout is used by Running Reality to know the historical meaning of the data. First, the data importer uses the layout to extract factoids. Each data field in the layout might result in one (name) or more (movements, altnames) factoids. Second, the layout can enable an in-map layer where each record that has a location can be a point in the map. If the layout also identifies names and dates, those can also be used to render the point only on certain dates and with a name label.

Here is an example of a CSV file from GreekCoinage.org that lists all coin mints in the classical Mediterranean world. This example will be used throughout this tutorial. The file is here if you would like to download a copy yourself:

  GreekCoinage.csv

Here is an example of an RRLayout file with the example data layout used throughout this tutorial.

  GreekCoinage.rrlayout

You can import this data layout when you are in the transformation window for the CSV file above.

Suggestions

Running Reality can suggest data fields to help you create a layout from scratch. It can be intimidating to start from a completely empty list. Automatic suggestions are shown when fields are needed for name and type or could be beneficial for dates and locations. There is also a "Suggest" button that will attempt to analyze the data source to find data fields in an easily identifiable format.

For data tables, you can also start a data field by clicking the header row. This makes it easier to define the record position. For example, clicking the Column C header pre-fills that the data is in column C.

Record Fields

The transformation of data is done by small units called data fields. A field defines a type of data and provides options for transforming it. It might take data from the data source as its input, such as from a spreadsheet column, or it might take another field as its input, such as a "built" event drawing from a date column. A field might generate an intermediate output, such as that date field, or it might generate an output for your transformed file, such as a geographic coordinate. A field's input can be narrowed to a specific JSON data "key" from a field that generates JSON output.
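
The chaining described here can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the column assignments, and the dictionary shapes are hypothetical and are not Running Reality's internal API. A date field reads a spreadsheet column and produces an intermediate value, and a "built" event field takes that value as its input by key.

```python
def parse_year(text: str) -> int:
    """Parse text such as '438 BC' or '200 AD' into a signed year (BC is negative)."""
    number, _, era = text.strip().partition(" ")
    year = int(number)
    return -year if era.upper() in ("BC", "BCE") else year


def date_field(record: dict[str, str]) -> dict[str, int]:
    """Intermediate field: reads column B from the data source."""
    return {"year": parse_year(record["B"])}


def built_event_field(record: dict[str, str], date_output: dict[str, int]) -> dict:
    """Output field: takes the date field's output as input, narrowed to its 'year' key."""
    return {"subject": record["A"], "event": "built", "year": date_output["year"]}


# One illustrative row, keyed by column letter.
record = {"A": "Parthenon", "B": "438 BC"}
intermediate = date_field(record)
event = built_event_field(record, intermediate)
```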

To add a new field to a layout, click the "add" button. The field starts out as a draft field, then you can select a type. The list of available data fields is shown below. Each field has a field type, which is also its default label when you add it to your layout, but you can configure the label to make it easier to understand its meaning. For example, you might have three date columns, and it helps to give them labels so they don't all say just "Date."

Some fields call out to other services to perform their transformation. For example, an AI field calls out to OpenAI's GPT large language model to ask a question about a block of text and return a piece of structured data in JSON format that can be used as an input by another field. There is more about using AI fields in the advanced section below.

Each field is described in its own block. (Note there is a checkbox at the bottom to show or hide descriptions so that you can better see the fields' parameters to edit them.) For instance, a "Name" field expects its input data to be a plain text name, like a table cell with "Julius Caesar." There are multiple location field types for the cases when the latitude and longitude are in different data fields, or when they are in the same field separated by commas. There are also location field types for named locations, like a table cell with "Athens," where the field will use a geocoder to translate the data from text to latitude and longitude.
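
To make the difference between the location field types concrete, here is a minimal sketch of the three cases. The function names are hypothetical, and the small lookup table is only a stand-in for a real geocoder, which Running Reality would call instead.

```python
from typing import Optional

# Stand-in for a geocoder: a tiny hypothetical lookup table with approximate values.
HYPOTHETICAL_GAZETTEER = {"athens": (37.98, 23.73)}


def location_from_pair(lat_text: str, lon_text: str) -> tuple[float, float]:
    """Latitude and longitude arrive in two separate data fields."""
    return float(lat_text), float(lon_text)


def location_from_combined(cell: str) -> tuple[float, float]:
    """Latitude and longitude share one field, separated by a comma."""
    lat_text, lon_text = cell.split(",")
    return float(lat_text), float(lon_text)


def location_from_name(name: str) -> Optional[tuple[float, float]]:
    """A named location is looked up; None means the name was not found."""
    return HYPOTHETICAL_GAZETTEER.get(name.strip().lower())


separate = location_from_pair("37.98", "23.73")
combined = location_from_combined("37.98, 23.73")
named = location_from_name("Athens")
```

All three produce the same coordinate pair, which is why the choice of field type depends only on how the source happens to store its locations.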

Each field has detailed configuration parameters. These let you operate on data that may not be already clean or which might need formatting adjustments. Most text fields have parameters to exclude certain text, such as parenthetical remarks. Name fields have options for how to handle blank names or "NA" or to create anonymous names. Date fields can have special formatting to enforce strict date formats or to be more permissive, or filters to exclude dates outside of a range.
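
The kinds of parameters mentioned above can be pictured as small cleaning steps applied to each value. The following is a sketch only, with hypothetical function and parameter names; the actual parameters are the ones shown in each field's block.

```python
import re


def strip_parentheticals(text: str) -> str:
    """Remove '(...)' remarks and collapse any leftover whitespace."""
    without = re.sub(r"\([^)]*\)", "", text)
    return " ".join(without.split())


def clean_name(text: str, anonymous: str = "Anonymous") -> str:
    """Treat blank names and 'NA' markers as anonymous; otherwise clean the text."""
    cleaned = strip_parentheticals(text)
    if cleaned == "" or cleaned.upper() in ("NA", "N/A"):
        return anonymous
    return cleaned


def year_in_range(year: int, earliest: int, latest: int) -> bool:
    """Date filter: keep only years inside the inclusive range."""
    return earliest <= year <= latest


name_a = clean_name("Athens (Attica)")
name_b = clean_name("NA")
keep = year_in_range(-438, earliest=-800, latest=-30)
```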

The number of data fields is always growing, as is the list of parameters available. Running Reality uses these fields for transforming contributed data into its own format, so we are actively adding fields as needed. If you need an additional field type not yet included, please reach out to ask.

Inputs

The data field input abstracts the data source so that the same transformation algorithms can operate on a wide range of types, whether narrative text or a spreadsheet or an RDF node graph. Inputs can either be another field's output, or a record position within the data source. This section will talk about inputs that are record positions.

Each type of data source may have records in a different format and have a different way to identify data fields within records. Here are examples:

A simple table (such as a CSV or XLS file or a SQL database) has records in rows and data fields in columns. So, the record position is a simple column letter.
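
For readers who want to see how a column letter maps onto a table, here is a small sketch using Python's standard csv module. The sample data is made up for illustration, and column_index is a hypothetical helper, not part of Running Reality.

```python
import csv
import io


def column_index(letters: str) -> int:
    """Convert a spreadsheet column letter to a zero-based index (A -> 0, AA -> 26)."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


# A tiny made-up table: each row after the header is one record.
sample = "name,lat,lon\nAthens,37.98,23.73\n"
rows = list(csv.reader(io.StringIO(sample)))
header, first_record = rows[0], rows[1]

# The record position "C" selects the third cell of every record.
value = first_record[column_index("C")]
```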

An RDF file consists of statements in "tuples" (also called triples), where a tuple is a subject → predicate → object. All statements about the same subject become a record, and the record position is the name of the predicate. For example, in Mint of Athens → temporal → 200AD, the record position is the predicate "temporal."
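
The grouping of statements into records can be sketched as follows. This is an illustration with plain Python tuples rather than a real RDF library, and apart from the Mint of Athens statement taken from the text, the sample statements are made up.

```python
def group_by_subject(triples):
    """Collect all statements about the same subject into one record.

    Each record maps a predicate (the record position) to a list of objects,
    since a subject may have several statements with the same predicate.
    """
    records: dict[str, dict[str, list[str]]] = {}
    for subject, predicate, obj in triples:
        records.setdefault(subject, {}).setdefault(predicate, []).append(obj)
    return records


triples = [
    ("Mint of Athens", "temporal", "200AD"),
    ("Mint of Athens", "label", "Athens"),   # made-up statement
    ("Mint of Corinth", "label", "Corinth"), # made-up statement
]
records = group_by_subject(triples)
temporal_values = records["Mint of Athens"]["temporal"]
```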

For text documents (such as PDF, TXT, or HTML), the narrative has no defined records or data fields. An AI data field can pass a block of text, together with a question, to a machine learning model such as OpenAI's ChatGPT large language model (LLM). You can delineate the breaks in the text that should be the "records," such as PDF section headers or an HTML <h1> header tag. Each AI field's question will be asked about each "record" section.

One type of position, "Global," can be used in all layouts. A global value is a value that is set for all records, regardless of the record data. If every row in a table is a city, and there is no data column that says "city," then you can set a global data field that sets the type for all records as "City."
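
In code terms, a global value behaves like a constant merged into every record. The sketch below is an assumption about the behavior described above, with hypothetical names, not the transformer's actual implementation.

```python
def with_globals(records: list[dict], global_values: dict) -> list[dict]:
    """Apply the same global values to every record, regardless of its own data."""
    return [{**record, **global_values} for record in records]


# Made-up rows with no column that says what kind of subject each row is.
rows = [{"name": "Athens"}, {"name": "Corinth"}]
typed_rows = with_globals(rows, {"type": "City"})
```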

The record position is the central concept that allows the transformer to operate on such a wide range of data sources.

Outputs

The final output of all the data fields is your transformed data. You can select the output format to be either the native Running Reality factoid format or various flavors of GeoJSON. The transformer will auto-transform the first three records when you modify the data layout to make it quicker and easier to see the effects of edits like adding a field or modifying a field's parameter.
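
As an illustration of what a GeoJSON output can look like, the sketch below builds a standard GeoJSON Feature for one record. The property names ("name", "start_year") are assumptions made for this example and may not match the flavors Running Reality produces. Note that GeoJSON lists coordinates as longitude first, then latitude.

```python
import json


def to_geojson_feature(name: str, lat: float, lon: float, start_year=None) -> dict:
    """Build a GeoJSON Point feature; property names here are hypothetical."""
    properties = {"name": name}
    if start_year is not None:
        properties["start_year"] = start_year
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": properties,
    }


feature = to_geojson_feature("Athens", lat=37.98, lon=23.73, start_year=-500)
collection = {"type": "FeatureCollection", "features": [feature]}
geojson_text = json.dumps(collection, indent=2)
```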

If a layout has fields that call out to external services, such as an AI field, or especially if such a call to an external service incurs a cost, the transformer will not auto-transform the first three records. You will have to explicitly click the "transform" button to call out to these services.
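
The preview rule described in this section can be summarized in a short sketch. The "external" flag and the function names are hypothetical; they only illustrate the decision, not how the transformer is implemented.

```python
from itertools import islice

PREVIEW_COUNT = 3  # the first three records are previewed


def auto_preview(records, fields, transform):
    """Return transformed preview records, or None if any field calls an
    external service and an explicit "transform" click is required."""
    if any(field.get("external", False) for field in fields):
        return None
    return [transform(record) for record in islice(records, PREVIEW_COUNT)]


def title_name(row: dict) -> str:
    """A trivial local transformation used for the example."""
    return row["A"].title()


rows = [{"A": "athens"}, {"A": "corinth"}, {"A": "sparta"}, {"A": "thebes"}]
local_fields = [{"type": "Name"}]
ai_fields = [{"type": "Name"}, {"type": "AI", "external": True}]

local_preview = auto_preview(rows, local_fields, title_name)
ai_preview = auto_preview(rows, ai_fields, title_name)
```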

Advanced

Very complex layouts can be created to handle very complex data. It can be an advantage to not change the complex data source because it might be needed for interoperability with other tools. So, rather than change the data source, you might need to use more complex layouts to describe the historical context of the data for Running Reality to use it.

Fields can depend on other fields. The most common case is date fields. A date field might be a standalone field, but it implies a date for other data fields that might be in different columns. To mark a location, you might have the location data (consisting of just the latitude and longitude) depend on a labeled date field. To label one field for use by another, you set the label parameter. A labeled field uses its label in the layout list instead of the default label which is the field type.

Another interdependency is for movements. A movement is a compound field, linked to two or more other labeled fields. A movement requires a labeled origin location, a labeled destination location, and a labeled start date. An example is a ship movement where the origin and destination are the names of the two port cities.
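
A movement's dependence on labeled fields can be sketched as a lookup by label. The labels, values, and function name below are made up for illustration and are not Running Reality's internal representation.

```python
def build_movement(labeled_fields: dict, origin: str, destination: str, start_date: str) -> dict:
    """Resolve a movement from three labeled fields, failing if a label is missing."""
    required = {"origin": origin, "destination": destination, "start_date": start_date}
    missing = [label for label in required.values() if label not in labeled_fields]
    if missing:
        raise ValueError(f"movement needs labeled fields that are missing: {missing}")
    return {role: labeled_fields[label] for role, label in required.items()}


# Hypothetical outputs of three labeled fields for one ship record.
labeled = {
    "Departure Port": "Piraeus",
    "Arrival Port": "Alexandria",
    "Sailing Date": "-250",
}
movement = build_movement(labeled, "Departure Port", "Arrival Port", "Sailing Date")
```

Giving each field a distinct label is what lets the movement pick the right location for each role, even when a record has several location columns.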

For text documents, you may need to specify the page range and the section break that divides records. A long PDF might have many pages without data, such as tables of contents or a bibliography, so a PageRange global field can narrow the pages being processed. Similarly, a global SectionBreak field can specify the regular expression (regex) pattern used to identify each document section that corresponds to a single record for a single subject. For example, a section might start with a number and then a city name, such as "75. Athens," to denote that the next few paragraphs are data about Athens.
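
To show how a section-break pattern divides text into records, here is a sketch using Python's re module. The pattern is one possible way to match headers like "75. Athens" and is an assumption for illustration; the exact regex dialect that the SectionBreak field accepts is not specified here, and the sample text is placeholder content.

```python
import re

# Matches a line such as "75. Athens": a number, a period, then a name.
SECTION_BREAK = re.compile(r"^(\d+)\.[ \t]+(.+)$", re.MULTILINE)


def split_sections(text: str) -> list[tuple[str, str]]:
    """Return (subject, body) pairs, one per section header found in the text."""
    matches = list(SECTION_BREAK.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        # A section's body runs until the next header, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        sections.append((match.group(2).strip(), body))
    return sections


sample_text = (
    "75. Athens\n"
    "Placeholder paragraph about Athens.\n"
    "\n"
    "76. Corinth\n"
    "Placeholder paragraph about Corinth.\n"
)
sections = split_sections(sample_text)
```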

For text documents, some assistance is provided to define useful questions to ask an LLM about the text. To reduce "hallucinations," the AI field will send an instruction to the LLM to use only the provided text to respond to the question. To enable downstream processing of the answer, the AI field will send an instruction to respond in JSON format using the JSON template you provide with explicitly defined JSON "keys." LLMs have been trained on extensive JSON and will generally respond only in JSON when so instructed. This is what makes it possible to generate structured data from an unstructured or narrative input.
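
The two instructions described above can be sketched as a prompt builder plus a parser for the reply. This is an illustration only: the wording of the instructions, the function names, and the template keys are assumptions, and no call to an actual LLM service is made. The reply string stands in for what a model might return.

```python
import json


def build_prompt(question: str, template: dict, text: str) -> str:
    """Combine the grounding instruction, the JSON template, the question, and the text."""
    return (
        "Use only the provided text to answer the question.\n"
        "Respond only in JSON using this template:\n"
        f"{json.dumps(template)}\n\n"
        f"Question: {question}\n\n"
        f"Text:\n{text}"
    )


def parse_response(raw: str, template: dict) -> dict:
    """Parse the reply and keep only the keys defined in the template."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object in the reply")
    return {key: data.get(key) for key in template}


template = {"city": "", "founded": ""}
prompt = build_prompt("Which city is described?", template, "75. Athens ...")

# A simulated reply, standing in for the LLM's answer.
simulated_reply = '{"city": "Athens", "founded": null, "extra": 1}'
answer = parse_response(simulated_reply, template)
```

Restricting the parsed answer to the template's keys is one way to keep a downstream field's input predictable, since that field can then be narrowed to a known key.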

Running Reality provides a list of sample questions and associated JSON response templates from which you can select and modify.

We have run experiments on the accuracy of LLMs in understanding text and being able to extract structured data.

Using this capability in this way also makes it possible to generate RDF "tuples" or triples from narrative text. Currently, you would need to transform the text into the native Running Reality factoid format, and then transform the factoids into RDF, but we are developing a one-step RDF output. A Running Reality factoid is similar to an RDF tuple, but contains five elements. A tuple contains a subject → predicate → object. A factoid contains the subject, relationship (i.e. predicate), object, date, and fidelity (i.e. citation).
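
The relationship between the two structures can be sketched with a small data class. The class and attribute names are hypothetical and the values are placeholders; this is not the native factoid file format, only a picture of the five elements and how three of them line up with an RDF tuple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Factoid:
    """Hypothetical five-element factoid, mirroring the description above."""
    subject: str
    relationship: str  # plays the role of the RDF predicate
    obj: str
    date: str
    fidelity: str      # i.e. the citation

    def to_triple(self) -> tuple[str, str, str]:
        """Drop the date and fidelity to get subject -> predicate -> object."""
        return (self.subject, self.relationship, self.obj)


# Placeholder values for illustration only.
factoid = Factoid(
    subject="Mint of Athens",
    relationship="located in",
    obj="Athens",
    date="200AD",
    fidelity="example citation",
)
triple = factoid.to_triple()
```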

For very complex layouts, you may want to save them to reuse later or to share with other team members who are also operating on the same data source. You can use the "import" and "export" buttons to transfer data fields between layouts.