GENOME: Geopolitical Event News Observatory, Mapping, and Extraction

GENOME is a dataset of geopolitical events automatically extracted from English-language newswire articles using large language models. It implements PLOVER's 16-type event ontology and extends its Actor–Recipient schema with a Third Party role to capture multilateral interactions.

This repository accompanies the paper:

Alessandro Dell'Orto and Jesse Kommandeur. GENOME: A New Geopolitical Event Methodology and Dataset using Large Language Models.

Status: companion repository for an in-press paper. The citation entry below will be updated with the full bibliographic record (venue, year, DOI) once the paper is officially published.

Scope of this release

This repository distributes the GENOME event data used in the comparison with POLECAT reported in the paper. It contains the extracted event records only — the extraction pipeline, prompts, and model artifacts are not included.


Time range	2024-02-01 – 2024-06-30
Format	CSV (UTF-8, comma-separated, quoted strings)
License	CC BY 4.0

Two files are provided, corresponding to the GENOME variants evaluated in the paper:

File	Description
`data/GENOME_non_dedup.csv`	Output of the pipeline after entity normalization, before deduplication. Corresponds to the GENOME (article_date) and GENOME (event_date) variants in the paper — the only difference between the two is whether `article_date` or `event_date` is used as the event timestamp.
`data/GENOME_dedup.csv`	Output of the pipeline after the deduplication step. Corresponds to the GENOME (dedup) variant in the paper, with `event_date` as the timestamp.

POLECAT is not redistributed here; see `data/POLECAT_PLACEHOLDER.md` for download instructions.
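The two timestamp variants of the non-deduplicated file differ only in which date column is read. The following is a minimal sketch of that choice using only the standard library; the helper name `event_timestamp`, its `variant` argument, and the sample row are illustrative and not part of the release.

```python
from datetime import date

def event_timestamp(row, variant="event_date"):
    """Return the timestamp of one event row for the chosen GENOME variant.

    `row` is a mapping holding `event_date` and `article_date` strings in
    YYYY-MM-DD form. The function name and `variant` argument are
    illustrative; they are not defined by the dataset.
    """
    if variant not in ("event_date", "article_date"):
        raise ValueError(f"unknown variant: {variant!r}")
    return date.fromisoformat(row[variant])

# Invented example row (not taken from the dataset).
row = {"event_date": "2024-03-01", "article_date": "2024-03-02"}
by_event = event_timestamp(row, "event_date")
by_article = event_timestamp(row, "article_date")
```

With pandas, the same choice amounts to passing either column to `pd.to_datetime`.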

Loading

import pandas as pd

events = pd.read_csv("data/GENOME_dedup.csv")

List-valued fields are serialized as semicolon-separated strings (e.g. `"Russia; Cyprus; Cyprus"`). The `article_ids` column is serialized as a brace-enclosed, comma-separated set of MD5 hashes (e.g. `"{hash1,hash2}"`). Parse them back to Python lists with:

list_cols = [
    "location_names", "location_countries", "location_countries_norm", "location_countries_ids",
    "actor_names", "actor_names_norm",
    "actor_countries", "actor_countries_norm", "actor_countries_ids",
    "recipient_names", "recipient_names_norm",
    "recipient_countries", "recipient_countries_norm", "recipient_countries_ids",
    "third_party_names", "third_party_names_norm",
    "third_party_countries", "third_party_countries_norm", "third_party_countries_ids",
]
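# Split each semicolon-separated string into a list; missing values become [].
for c in list_cols:
    events[c] = events[c].fillna("").astype(str).apply(
        lambda s: [x.strip() for x in s.split(";") if x.strip()]
    )

# Drop the enclosing braces, then split the comma-separated hashes into a list.
events["article_ids"] = events["article_ids"].str.strip("{}").str.split(",")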
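The one-line `article_ids` parse above assumes every value is a non-empty, brace-enclosed string without stray whitespace. A slightly more defensive sketch for a single value is shown below. The helper name `parse_article_ids` is illustrative, and its tolerance of whitespace, empty braces, and missing values is an assumption about possible inputs, not something the release documents.

```python
def parse_article_ids(value):
    """Parse a "{hash1,hash2}" string into a list of hash strings."""
    if not isinstance(value, str):
        return []  # e.g. NaN or None from a missing cell
    # Remove surrounding whitespace, then the enclosing braces.
    inner = value.strip().strip("{}")
    # Split on commas and drop empty pieces (covers "{}" and trailing commas).
    return [h.strip() for h in inner.split(",") if h.strip()]
```

It would be applied to the raw column in place of the one-liner, for example `events["article_ids"].apply(parse_article_ids)`.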

Schema

Both files share the following columns:

Field	Type	Description
`id`	string	Unique event identifier. In `GENOME_non_dedup.csv` it has the form `<article_md5>_<event_num>`; in `GENOME_dedup.csv` it is a short hex identifier assigned to the merged cluster.
`article_ids`	set[string]	MD5 hash(es) of the source article text(s), serialized as `"{hash1,hash2,...}"`. Non-deduplicated rows always contain a single hash; deduplicated rows may contain multiple hashes when several articles were merged into one event.
`event_num`	int	1-based index of the event within its source article. GENOME's extraction is prompted to return a single "core" event per article, so this is almost always `1`.
`event_date`	date (`YYYY-MM-DD`)	Date of the event as inferred by the extraction model from the article text.
`article_date`	date (`YYYY-MM-DD`)	Publication date of the source article.
`event_summary`	string	Self-contained natural-language summary of the event, produced by the extraction model.
`source_quote`	string	Short verbatim snippet from the source article used to construct the event, reproduced under fair use / short-quotation exceptions for research purposes. Truncated to a maximum of 20 words; truncated snippets end with an ellipsis (`…`).
`core_sentence`	string	"Actor VERB Recipient" canonical form of the event, produced by the classification model.
`event_type`	enum	One of the 16 PLOVER event types (see below). Empty when the model could not assign a type.
`category`	enum	`CONFLICT` or `COOPERATION`, derived from `event_type`. Empty when `event_type` is empty.
`location_names`	list[string]	Raw geographic locations mentioned in the event.
`location_countries`	list[string]	Country / international-organization names inferred for each location.
`location_countries_norm`	list[string]	Lowercase normalized form of `location_countries`, matched against the curated reference list.
`location_countries_ids`	list[string]	Numeric IDs from the curated reference list (`countries.csv`), or `io_<n>` for international organizations.
`actor_names`	list[string]	Raw actor names extracted from the article.
`actor_names_norm`	list[string]	Canonical actor names after embedding-based clustering (see the methodology section of the paper).
`actor_countries`	list[string]	Country / international-organization names inferred for each actor.
`actor_countries_norm`	list[string]	Lowercase normalized form of `actor_countries`.
`actor_countries_ids`	list[string]	Numeric IDs / `io_<n>` codes from the curated reference list.
`recipient_names`, `recipient_names_norm`, `recipient_countries`, `recipient_countries_norm`, `recipient_countries_ids`	list[string]	Same structure as the actor fields, for the Recipient role.
`third_party_names`, `third_party_names_norm`, `third_party_countries`, `third_party_countries_norm`, `third_party_countries_ids`	list[string]	Same structure as the actor fields, for the Third Party role introduced by GENOME.
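Two of the conventions in the table lend themselves to small checks: the `<article_md5>_<event_num>` form of `id` in the non-deduplicated file, and the trailing ellipsis that marks a truncated `source_quote`. The sketch below assumes the hash is stored as the usual 32-character hexadecimal MD5 digest; both helper names are illustrative.

```python
def split_non_dedup_id(event_id):
    """Split "<article_md5>_<event_num>" into (article_md5, event_num)."""
    # Split on the last underscore so the hash part stays intact.
    article_md5, _, num = event_id.rpartition("_")
    # Assumption: the hash is a 32-character hex digest.
    if len(article_md5) != 32 or not num.isdigit():
        raise ValueError(f"not a non-dedup id: {event_id!r}")
    return article_md5, int(num)

def is_truncated_quote(source_quote):
    """True if the quote ends with the ellipsis that marks truncation."""
    return source_quote.rstrip().endswith("…")
```

For a non-deduplicated row, the returned `event_num` should equal the row's `event_num` column, and the returned hash should equal the single entry in `article_ids`.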

Country / organization reference table

`countries.csv` maps the `*_countries_ids` columns to human-readable names. It contains only the countries and international organizations that appear in this release; entities in the full curated reference list with no events in the release window are omitted.

Field	Type	Description
`id`	string	Numeric ID for countries, `io_<n>` for international organizations
`name`	string	Canonical English name
`type`	enum	`country` or `organization`
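As a sketch of how the reference table can be used, the snippet below builds an id-to-name lookup and resolves a parsed `*_countries_ids` list. The inline rows are invented placeholders standing in for `countries.csv`, and the helper name `ids_to_names` is illustrative.

```python
import csv
import io

# Invented stand-in for countries.csv; ids and names are placeholders.
sample = io.StringIO(
    "id,name,type\n"
    "1,Country A,country\n"
    "2,Country B,country\n"
    "io_1,Organization X,organization\n"
)

# Build an id -> name lookup; csv.DictReader keeps every id as a string.
id_to_name = {row["id"]: row["name"] for row in csv.DictReader(sample)}

def ids_to_names(ids, lookup):
    """Map a parsed *_countries_ids list to names; unknown ids map to None."""
    return [lookup.get(i) for i in ids]

names = ids_to_names(["2", "io_1", "99"], id_to_name)
```

When reading the table with pandas instead, passing `dtype={"id": str}` to `pd.read_csv` keeps numeric ids as strings so that they match the elements of the parsed list columns.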

Deduplication scores (dedup file only)

`GENOME_dedup.csv` additionally contains the deduplication similarity scores:

Field	Type	Description
`score_total`	float	Weighted composite similarity score of the merged pair (0–1). `0` for singleton events that were not merged.
`score_semantic`	float	Cosine similarity of `event_summary` embeddings (`all-MiniLM-L6-v2`).
`score_actors`	float	Jaccard similarity over actor and recipient name sets.
`score_type`	float	1 if `event_type` matches, else 0.
`score_meta`	float	Location overlap.
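To make the component scores concrete, here is a minimal sketch of a Jaccard similarity, a type-match indicator, and a weighted combination. It assumes that actor and recipient names are pooled into one set per event and that the composite is a weighted sum; the weights are placeholders, not the values used for GENOME, which are described in the paper.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

def type_match(type_a, type_b):
    """1.0 if both event types are equal, else 0.0."""
    return 1.0 if type_a == type_b else 0.0

# Placeholder weights, NOT the values used for GENOME.
WEIGHTS = {"semantic": 0.25, "actors": 0.25, "type": 0.25, "meta": 0.25}

def composite(scores, weights=WEIGHTS):
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    return sum(weights[k] * scores[k] for k in weights)

# Invented example: one shared name out of three distinct names.
actors_score = jaccard(["Russia", "Cyprus"], ["Cyprus", "Turkey"])
total = composite({"semantic": 0.8, "actors": 0.5, "type": 1.0, "meta": 0.5})
```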

Pairs above the deduplication threshold were linked as duplicates and merged via connected components on the similarity graph; see the methodology section of the paper for full details.
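The merge step can be illustrated with a small union-find over the similarity graph. This is a generic sketch of connected components, not the authors' implementation; the threshold and pairwise scores are invented placeholders.

```python
def connected_components(n, edges):
    """Group node indices 0..n-1 into components linked by `edges`."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # merge the two components

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

THRESHOLD = 0.5  # placeholder, not the threshold used for GENOME

# Invented pairwise scores between five events: (i, j, score_total).
pair_scores = [(0, 1, 0.9), (1, 2, 0.7), (3, 4, 0.2)]
edges = [(i, j) for i, j, s in pair_scores if s > THRESHOLD]
clusters = connected_components(5, edges)
```

Here events 0 and 2 end up in the same cluster even though they were never linked directly, because both are linked to event 1. This transitivity is a property of connected-component merging.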

Event types and Quad mapping

The `event_type` column uses PLOVER's 16 root types. They map to the four PLOVER Quad categories as follows (the `category` column collapses Quad into a binary `CONFLICT` / `COOPERATION` axis):

Verbal cooperation: CONSULT, AGREE, SUPPORT, CONCEDE
Material cooperation: AID, COOPERATE, RETREAT
Verbal conflict: ACCUSE, REQUEST, THREATEN, REJECT
Material conflict: ASSAULT, SANCTION, COERCE, MOBILIZE, PROTEST
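The mapping above can be written as a lookup table, from which the binary category follows directly. The Quad key spellings (`VERBAL_COOPERATION` and so on) and the helper name `category` are choices made for this sketch; the event types and their grouping are taken from the list above.

```python
QUAD = {
    "VERBAL_COOPERATION": ["CONSULT", "AGREE", "SUPPORT", "CONCEDE"],
    "MATERIAL_COOPERATION": ["AID", "COOPERATE", "RETREAT"],
    "VERBAL_CONFLICT": ["ACCUSE", "REQUEST", "THREATEN", "REJECT"],
    "MATERIAL_CONFLICT": ["ASSAULT", "SANCTION", "COERCE", "MOBILIZE", "PROTEST"],
}

# Invert the table: event type -> Quad category.
EVENT_TO_QUAD = {t: quad for quad, types in QUAD.items() for t in types}

def category(event_type):
    """Collapse an event type to CONFLICT or COOPERATION.

    Returns "" when the type is empty or not one of the 16 types.
    """
    quad = EVENT_TO_QUAD.get(event_type)
    if quad is None:
        return ""
    # The part after the last underscore is the binary axis.
    return quad.rsplit("_", 1)[1]
```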

Source articles

GENOME is built from articles obtained from a single commercial English-language newswire provider whose terms of use prohibit redistribution of the raw text. We therefore release only the extracted structured records, plus the MD5 hash of each source article in `article_ids`, so that researchers with credentialed access to the same provider can verify which articles a given event was extracted from. Short verbatim snippets in `source_quote` are capped at 20 words (truncated quotes end with `…`) and reproduced under fair use / short-quotation exceptions for the purposes of research and criticism.
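For researchers who hold the article text, the check could look like the sketch below. The exact bytes that were hashed (encoding, whitespace handling, inclusion of headline or metadata) are not specified here, so UTF-8 encoding of the unmodified text is an assumption; both helper names are illustrative.

```python
import hashlib

def article_md5(text):
    """MD5 hex digest of the article text, encoded as UTF-8 (assumed)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def matches_event(text, article_ids):
    """True if the article's hash is among an event's parsed article_ids."""
    return article_md5(text) in set(article_ids)
```

If hashes computed this way never match, the most likely cause is a difference in how the text was normalized before hashing, not a missing article.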

Limitations

See the Limitations section of the paper for known limitations of the dataset, including English-language and Western-newswire bias, the absence of a manually annotated gold standard, the single-core-event extraction strategy, deduplication threshold sensitivity, and reliance on closed-source LLMs.

Citation

If you use GENOME in your research, please cite:

@misc{dellorto-kommandeur-genome,
    title  = "{GENOME}: A New Geopolitical Event Methodology and Dataset using Large Language Models",
    author = "Dell'Orto, Alessandro and Kommandeur, Jesse",
}

A full citation entry will be added once the paper is officially published.

License

The GENOME dataset is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0). See LICENSE for the full notice. You are free to share and adapt the data for any purpose, including commercial use, provided you give appropriate credit by citing the paper above.
