GENOME is a dataset of geopolitical events automatically extracted from English-language newswire articles using large language models. It implements PLOVER's 16-type event ontology and extends its Actor–Recipient schema with a Third Party role to capture multilateral interactions.
This repository accompanies the paper:
Alessandro Dell'Orto and Jesse Kommandeur. GENOME: A New Geopolitical Event Methodology and Dataset using Large Language Models.
Status: companion repository for an in-press paper. The citation entry below will be updated with the full bibliographic record (venue, year, DOI) once the paper is officially published.
This repository distributes the GENOME event data used in the comparison with POLECAT reported in the paper. It contains the extracted event records only — the extraction pipeline, prompts, and model artifacts are not included.
| Property | Value |
|---|---|
| Time range | 2024-02-01 – 2024-06-30 |
| Format | CSV (UTF-8, comma-separated, quoted strings) |
| License | CC BY 4.0 |
Two files are provided, corresponding to the GENOME variants evaluated in the paper:
| File | Description |
|---|---|
| data/GENOME_non_dedup.csv | Output of the pipeline after entity normalization, before deduplication. Corresponds to the GENOME (article_date) and GENOME (event_date) variants in the paper; the only difference between the two is whether article_date or event_date is used as the event timestamp. |
| data/GENOME_dedup.csv | Output of the pipeline after the deduplication step. Corresponds to the GENOME (dedup) variant in the paper, with event_date as the timestamp. |
POLECAT is not redistributed here — see data/POLECAT_PLACEHOLDER.md for download instructions.
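Because the two non-deduplicated variants differ only in the date column used as the timestamp, switching between them amounts to choosing a column. The following is a minimal sketch using only the standard library; the two sample rows and the helper name `monthly_counts` are made up for illustration and do not come from the dataset.

```python
import csv
import io
from collections import Counter

# Two made-up rows in the shape of data/GENOME_non_dedup.csv (illustrative values only).
SAMPLE = """id,event_date,article_date
a_1,2024-02-28,2024-03-01
b_1,2024-03-02,2024-03-02
"""


def monthly_counts(rows, timestamp_col):
    """Count events per YYYY-MM, using the chosen date column as the timestamp."""
    return Counter(row[timestamp_col][:7] for row in rows if row[timestamp_col])


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
by_event = monthly_counts(rows, "event_date")      # GENOME (event_date) view
by_article = monthly_counts(rows, "article_date")  # GENOME (article_date) view
```

In this toy example the first event is reported a day after it happened, so it falls in February under event_date and in March under article_date.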
```python
import pandas as pd

events = pd.read_csv("data/GENOME_dedup.csv")
```

List-valued fields are serialized as semicolon-separated strings (e.g. "Russia; Cyprus; Cyprus"). The article_ids column is serialized as a brace-enclosed, comma-separated set of MD5 hashes (e.g. "{hash1,hash2}"). Parse them back to Python lists with:
```python
list_cols = [
    "location_names", "location_countries", "location_countries_norm", "location_countries_ids",
    "actor_names", "actor_names_norm",
    "actor_countries", "actor_countries_norm", "actor_countries_ids",
    "recipient_names", "recipient_names_norm",
    "recipient_countries", "recipient_countries_norm", "recipient_countries_ids",
    "third_party_names", "third_party_names_norm",
    "third_party_countries", "third_party_countries_norm", "third_party_countries_ids",
]

# Split each semicolon-separated string into a list, dropping empty items.
for c in list_cols:
    events[c] = events[c].fillna("").astype(str).apply(
        lambda s: [x.strip() for x in s.split(";") if x.strip()]
    )

# Strip the enclosing braces, then split the comma-separated hashes.
events["article_ids"] = events["article_ids"].str.strip("{}").str.split(",")
```

Both files share the following columns:
| Field | Type | Description |
|---|---|---|
| id | string | Unique event identifier. In GENOME_non_dedup.csv it has the form <article_md5>_<event_num>; in GENOME_dedup.csv it is a short hex identifier assigned to the merged cluster. |
| article_ids | set[string] | MD5 hash(es) of the source article text(s), serialized as "{hash1,hash2,...}". Non-deduplicated rows always contain a single hash; deduplicated rows may contain multiple hashes when several articles were merged into one event. |
| event_num | int | 1-based index of the event within its source article. GENOME's extraction is prompted to return a single "core" event per article, so this is almost always 1. |
| event_date | date (YYYY-MM-DD) | Date of the event as inferred by the extraction model from the article text. |
| article_date | date (YYYY-MM-DD) | Publication date of the source article. |
| event_summary | string | Self-contained natural-language summary of the event, produced by the extraction model. |
| source_quote | string | Short verbatim snippet from the source article used to construct the event, reproduced under fair use / short-quotation exceptions for research purposes. Truncated to a maximum of 20 words; truncated snippets end with an ellipsis (…). |
| core_sentence | string | "Actor VERB Recipient" canonical form of the event, produced by the classification model. |
| event_type | enum | One of the 16 PLOVER event types (see below). Empty when the model could not assign a type. |
| category | enum | CONFLICT or COOPERATION, derived from event_type. Empty when event_type is empty. |
| location_names | list[string] | Raw geographic locations mentioned in the event. |
| location_countries | list[string] | Country / international-organization names inferred for each location. |
| location_countries_norm | list[string] | Lowercase normalized form of location_countries, matched against the curated reference list. |
| location_countries_ids | list[string] | Numeric IDs from the curated reference list (countries.csv), or io_<n> for international organizations. |
| actor_names | list[string] | Raw actor names extracted from the article. |
| actor_names_norm | list[string] | Canonical actor names after embedding-based clustering (see the methodology section of the paper). |
| actor_countries | list[string] | Country / international-organization names inferred for each actor. |
| actor_countries_norm | list[string] | Lowercase normalized form of actor_countries. |
| actor_countries_ids | list[string] | Numeric IDs / io_<n> codes from the curated reference list. |
| recipient_names, recipient_names_norm, recipient_countries, recipient_countries_norm, recipient_countries_ids | list[string] | Same structure as the actor fields, for the Recipient role. |
| third_party_names, third_party_names_norm, third_party_countries, third_party_countries_norm, third_party_countries_ids | list[string] | Same structure as the actor fields, for the Third Party role introduced by GENOME. |
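For rows from GENOME_non_dedup.csv, the id can be split back into its article hash and event index. This is a minimal sketch based only on the id format described in the table; the function name `split_event_id` is hypothetical, and the sample hash is simply the MD5 of the empty string, not a real article.

```python
def split_event_id(event_id):
    """Split a GENOME_non_dedup.csv id of the form <article_md5>_<event_num>.

    Returns (article_md5, event_num). Raises ValueError for ids that do not
    follow this pattern, such as the short hex ids used in GENOME_dedup.csv.
    """
    # Split on the last underscore: an MD5 hex digest contains no underscores.
    article_md5, sep, event_num = event_id.rpartition("_")
    if not sep or not event_num.isdigit():
        raise ValueError(f"not a non-dedup event id: {event_id!r}")
    return article_md5, int(event_num)


article_hash, num = split_event_id("d41d8cd98f00b204e9800998ecf8427e_1")
```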
countries.csv provides the mapping between the *_countries_ids columns and human-readable names. It contains the countries and international organizations that appear in this release (a subset of the full curated reference list — entities with no events in the release window are omitted).
| Field | Type | Description |
|---|---|---|
| id | string | Numeric ID for countries, io_<n> for international organizations |
| name | string | Canonical English name |
| type | enum | country or organization |
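The ID columns can be resolved to names with a simple lookup. The sketch below uses only the standard library and assumes that countries.csv has a header row with the three columns listed above; the sample rows, IDs, and helper names are made up for illustration.

```python
import csv
import io

# Made-up rows in the shape of countries.csv; real IDs and names will differ.
COUNTRIES_CSV = """id,name,type
1,Exampleland,country
io_1,Example Organization,organization
"""


def load_id_to_name(handle):
    """Build an {id: name} lookup from an open countries.csv-style file."""
    return {row["id"]: row["name"] for row in csv.DictReader(handle)}


def ids_to_names(serialized_ids, id_to_name):
    """Parse a semicolon-separated *_countries_ids cell and look up each ID."""
    ids = [x.strip() for x in serialized_ids.split(";") if x.strip()]
    # IDs missing from the lookup are returned unchanged rather than dropped.
    return [id_to_name.get(i, i) for i in ids]


id_to_name = load_id_to_name(io.StringIO(COUNTRIES_CSV))
names = ids_to_names("1; io_1; 99", id_to_name)
```

IDs are compared as strings here, so when loading with pandas it may be safer to read the id column as text to keep numeric IDs and io_<n> codes in one consistent type.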
GENOME_dedup.csv additionally contains the deduplication similarity scores:
| Field | Type | Description |
|---|---|---|
| score_total | float | Weighted composite similarity score of the merged pair (0–1). 0 for singleton events that were not merged. |
| score_semantic | float | Cosine similarity of event_summary embeddings (all-MiniLM-L6-v2). |
| score_actors | float | Jaccard similarity over actor and recipient name sets. |
| score_type | float | 1 if event_type matches, else 0. |
| score_meta | float | Location overlap. |
Pairs above the deduplication threshold were linked as duplicates and merged via connected components on the similarity graph; see the methodology section of the paper for full details.
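The pipeline itself is not part of this repository, so the following is only an illustrative sketch of the general technique: a weighted composite score, a threshold on pairs, and connected components found with union-find. The weights, the threshold value, and all function names are hypothetical and are not taken from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical weights and threshold, chosen only for illustration.
WEIGHTS = {"semantic": 0.5, "actors": 0.3, "type": 0.1, "meta": 0.1}
THRESHOLD = 0.8


def total_score(parts):
    """Weighted sum of the component scores (each assumed to lie in 0..1)."""
    return sum(WEIGHTS[k] * parts[k] for k in WEIGHTS)


def connected_components(n, edges):
    """Group nodes 0..n-1 into components using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up pairwise scores for four events.
pair_scores = {(0, 1): 0.91, (1, 2): 0.85, (0, 2): 0.40, (2, 3): 0.10}
edges = [pair for pair, score in pair_scores.items() if score > THRESHOLD]
clusters = connected_components(4, edges)
```

In this toy example events 0 and 2 end up in the same cluster even though their direct score is low, because both are linked to event 1. This transitivity is a general property of connected-component merging and is one reason results can be sensitive to the threshold.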
Event types and Quad mapping
The event_type column uses PLOVER's 16 root types. They map to the four PLOVER Quad categories as follows (the category column collapses Quad into a binary CONFLICT / COOPERATION axis):
- Verbal cooperation: CONSULT, AGREE, SUPPORT, CONCEDE
- Material cooperation: AID, COOPERATE, RETREAT
- Verbal conflict: ACCUSE, REQUEST, THREATEN, REJECT
- Material conflict: ASSAULT, SANCTION, COERCE, MOBILIZE, PROTEST
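The mapping above can be written as a lookup table, which is handy for recomputing the Quad class or checking the category column. This is a minimal sketch; the names `QUAD`, `EVENT_TYPE_TO_QUAD`, and `to_category` are illustrative and not part of the dataset.

```python
# Quad classes and their event types, as listed above.
QUAD = {
    "Verbal cooperation": ["CONSULT", "AGREE", "SUPPORT", "CONCEDE"],
    "Material cooperation": ["AID", "COOPERATE", "RETREAT"],
    "Verbal conflict": ["ACCUSE", "REQUEST", "THREATEN", "REJECT"],
    "Material conflict": ["ASSAULT", "SANCTION", "COERCE", "MOBILIZE", "PROTEST"],
}

# Invert to event_type -> Quad class.
EVENT_TYPE_TO_QUAD = {t: quad for quad, types in QUAD.items() for t in types}


def to_category(event_type):
    """Collapse an event type to CONFLICT / COOPERATION; empty if unknown."""
    quad = EVENT_TYPE_TO_QUAD.get(event_type)
    if quad is None:
        return ""
    return "COOPERATION" if quad.endswith("cooperation") else "CONFLICT"
```

For example, `to_category("AID")` gives "COOPERATION" and `to_category("PROTEST")` gives "CONFLICT", while an empty event_type yields an empty category, matching the column description.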
GENOME is built from articles obtained from a single commercial English-language newswire provider whose terms of use prohibit redistribution of the raw text. We therefore release only the extracted structured records, plus the MD5 hash of each source article in article_ids so that researchers with credentialed access to the same provider can verify which articles a given event was extracted from. Short verbatim snippets in source_quote are capped at a small number of words (truncated quotes end with …) and reproduced under fair use / short-quotation exceptions for the purposes of research and criticism.
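Researchers with access to the original articles could check provenance along the following lines. This is a sketch under an important assumption: the hash is taken over the UTF-8 bytes of the article text with no further normalization. The exact preprocessing used for GENOME is not documented here, so a mismatch may only mean that the text was prepared differently. The function names are hypothetical.

```python
import hashlib


def article_md5(text):
    """MD5 hex digest of the article text.

    Assumption: UTF-8 encoding and no whitespace or case normalization.
    """
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def parse_article_ids(cell):
    """Parse a serialized article_ids cell such as "{hash1,hash2}" into a list."""
    return [h.strip() for h in cell.strip("{}").split(",") if h.strip()]


def event_cites_article(article_ids_cell, text):
    """True if the hash of `text` appears among the event's article_ids."""
    return article_md5(text) in parse_article_ids(article_ids_cell)
```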
See the Limitations section of the paper for known limitations of the dataset, including English-language and Western-newswire bias, the absence of a manually annotated gold standard, the single-core-event extraction strategy, deduplication threshold sensitivity, and reliance on closed-source LLMs.
If you use GENOME in your research, please cite:
```bibtex
@misc{dellorto-kommandeur-genome,
  title = "{GENOME}: A New Geopolitical Event Methodology and Dataset using Large Language Models",
  author = "Dell'Orto, Alessandro and Kommandeur, Jesse",
}
```

A full citation entry will be added once the paper is officially published.
The GENOME dataset is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0). See LICENSE for the full notice. You are free to share and adapt the data for any purpose, including commercial use, provided you give appropriate credit by citing the paper above.