HCSS-Data-Lab/Submission-GENOME

GENOME: Geopolitical Event News Observatory, Mapping, and Extraction

GENOME is a dataset of geopolitical events automatically extracted from English-language newswire articles using large language models. It implements PLOVER's 16-type event ontology and extends its Actor–Recipient schema with a Third Party role to capture multilateral interactions.

This repository accompanies the paper:

Alessandro Dell'Orto and Jesse Kommandeur. GENOME: A New Geopolitical Event Methodology and Dataset using Large Language Models.

Status: companion repository for an in-press paper. The citation entry below will be updated with the full bibliographic record (venue, year, DOI) once the paper is officially published.

Scope of this release

This repository distributes the GENOME event data used in the comparison with POLECAT reported in the paper. It contains the extracted event records only — the extraction pipeline, prompts, and model artifacts are not included.

Time range: 2024-02-01 – 2024-06-30
Format: CSV (UTF-8, comma-separated, quoted strings)
License: CC BY 4.0

Two files are provided, corresponding to the GENOME variants evaluated in the paper:

| File | Description |
|---|---|
| data/GENOME_non_dedup.csv | Output of the pipeline after entity normalization, before deduplication. Corresponds to the GENOME (article_date) and GENOME (event_date) variants in the paper. The only difference between the two is whether article_date or event_date is used as the event timestamp. |
| data/GENOME_dedup.csv | Output of the pipeline after the deduplication step. Corresponds to the GENOME (dedup) variant in the paper, with event_date as the timestamp. |
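
Since the two non-deduplicated variants differ only in which date column serves as the timestamp, switching between them can be sketched as below. The helper name with_timestamp, the added timestamp column, and the toy rows are illustrative assumptions, not part of the released files.

```python
import pandas as pd


def with_timestamp(events: pd.DataFrame, date_col: str = "event_date") -> pd.DataFrame:
    """Return a copy of `events` with a parsed `timestamp` column.

    `date_col` selects the variant: "event_date" or "article_date".
    The `timestamp` column name is a hypothetical choice for this sketch.
    """
    if date_col not in ("event_date", "article_date"):
        raise ValueError(f"unexpected date column: {date_col!r}")
    out = events.copy()
    # Unparseable or missing dates become NaT instead of raising.
    out["timestamp"] = pd.to_datetime(out[date_col], format="%Y-%m-%d", errors="coerce")
    return out


# Toy rows standing in for data/GENOME_non_dedup.csv (values are invented).
toy = pd.DataFrame(
    {
        "id": ["a_1", "b_1"],
        "event_date": ["2024-02-03", "2024-03-10"],
        "article_date": ["2024-02-05", "2024-03-10"],
    }
)

by_event = with_timestamp(toy, "event_date")      # GENOME (event_date) style
by_article = with_timestamp(toy, "article_date")  # GENOME (article_date) style
```

With the real data, the same helper would be applied to the frame read from data/GENOME_non_dedup.csv.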

POLECAT is not redistributed here — see data/POLECAT_PLACEHOLDER.md for download instructions.

Loading

import pandas as pd

events = pd.read_csv("data/GENOME_dedup.csv")

List-valued fields are serialized as semicolon-separated strings (e.g. "Russia; Cyprus; Cyprus"). The article_ids column is serialized as a brace-enclosed, comma-separated set of MD5 hashes (e.g. "{hash1,hash2}"). Parse them back to Python lists with:

list_cols = [
    "location_names", "location_countries", "location_countries_norm", "location_countries_ids",
    "actor_names", "actor_names_norm",
    "actor_countries", "actor_countries_norm", "actor_countries_ids",
    "recipient_names", "recipient_names_norm",
    "recipient_countries", "recipient_countries_norm", "recipient_countries_ids",
    "third_party_names", "third_party_names_norm",
    "third_party_countries", "third_party_countries_norm", "third_party_countries_ids",
]
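# Split each semicolon-separated string into a list, dropping empty items;
# missing values become empty lists.
for c in list_cols:
    events[c] = events[c].fillna("").astype(str).apply(
        lambda s: [x.strip() for x in s.split(";") if x.strip()]
    )

# Remove the enclosing braces, then split the comma-separated MD5 hashes.
events["article_ids"] = events["article_ids"].str.strip("{}").str.split(",")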
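
Once the list columns are parsed, one common next step is to expand them into one row per actor-recipient country pair. The sketch below is only one possible way to do this: pairing every actor country with every recipient country is an analyst's choice, not something the dataset prescribes, and the toy rows are invented.

```python
import pandas as pd

# Toy frame with already-parsed list columns (values are invented).
toy = pd.DataFrame(
    {
        "id": ["e1", "e2"],
        "actor_countries_norm": [["russia"], ["france", "germany"]],
        "recipient_countries_norm": [["cyprus", "cyprus", "greece"], []],
    }
)

dyads = (
    toy.explode("actor_countries_norm", ignore_index=True)
    .explode("recipient_countries_norm", ignore_index=True)
    # Empty lists explode to NaN; drop events lacking either side of the pair.
    .dropna(subset=["actor_countries_norm", "recipient_countries_norm"])
    # Repeated countries within one event would otherwise yield duplicate pairs.
    .drop_duplicates()
    .reset_index(drop=True)
)
```

Here event e2 has no recipient country, so it contributes no pairs, and the repeated recipient in e1 is collapsed to a single pair.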

Schema

Both files share the following columns:

| Field | Type | Description |
|---|---|---|
| id | string | Unique event identifier. In GENOME_non_dedup.csv it has the form <article_md5>_<event_num>; in GENOME_dedup.csv it is a short hex identifier assigned to the merged cluster. |
| article_ids | set[string] | MD5 hash(es) of the source article text(s), serialized as "{hash1,hash2,...}". Non-deduplicated rows always contain a single hash; deduplicated rows may contain multiple hashes when several articles were merged into one event. |
| event_num | int | 1-based index of the event within its source article. GENOME's extraction is prompted to return a single "core" event per article, so this is almost always 1. |
| event_date | date (YYYY-MM-DD) | Date of the event as inferred by the extraction model from the article text. |
| article_date | date (YYYY-MM-DD) | Publication date of the source article. |
| event_summary | string | Self-contained natural-language summary of the event, produced by the extraction model. |
| source_quote | string | Short verbatim snippet from the source article used to construct the event, reproduced under fair use / short-quotation exceptions for research purposes. Truncated to a maximum of 20 words; truncated snippets end with an ellipsis (…). |
| core_sentence | string | "Actor VERB Recipient" canonical form of the event, produced by the classification model. |
| event_type | enum | One of the 16 PLOVER event types (see below). Empty when the model could not assign a type. |
| category | enum | CONFLICT or COOPERATION, derived from event_type. Empty when event_type is empty. |
| location_names | list[string] | Raw geographic locations mentioned in the event. |
| location_countries | list[string] | Country / international-organization names inferred for each location. |
| location_countries_norm | list[string] | Lowercase normalized form of location_countries, matched against the curated reference list. |
| location_countries_ids | list[string] | Numeric IDs from the curated reference list (countries.csv), or io_<n> for international organizations. |
| actor_names | list[string] | Raw actor names extracted from the article. |
| actor_names_norm | list[string] | Canonical actor names after embedding-based clustering (see the methodology section of the paper). |
| actor_countries | list[string] | Country / international-organization names inferred for each actor. |
| actor_countries_norm | list[string] | Lowercase normalized form of actor_countries. |
| actor_countries_ids | list[string] | Numeric IDs / io_<n> codes from the curated reference list. |
| recipient_names, recipient_names_norm, recipient_countries, recipient_countries_norm, recipient_countries_ids | list[string] | Same structure as the actor fields, for the Recipient role. |
| third_party_names, third_party_names_norm, third_party_countries, third_party_countries_norm, third_party_countries_ids | list[string] | Same structure as the actor fields, for the Third Party role introduced by GENOME. |
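
For the non-deduplicated file, the id format <article_md5>_<event_num> can be split back into its two parts. The following is a minimal sketch: the function name split_event_id is hypothetical, and it assumes the MD5 is written as the usual 32-character hex digest, which the README does not state explicitly.

```python
def split_event_id(event_id: str) -> tuple[str, int]:
    """Split a non-dedup id of the form <article_md5>_<event_num>.

    Assumes the MD5 part is a 32-character hex digest (an assumption of
    this sketch). Dedup ids are short hex strings and are rejected here.
    """
    article_md5, _, event_num = event_id.rpartition("_")
    if len(article_md5) != 32 or not event_num.isdigit():
        raise ValueError(f"not a non-dedup GENOME id: {event_id!r}")
    return article_md5, int(event_num)


# Invented id, used only to illustrate the format.
md5_part, num_part = split_event_id("0123456789abcdef0123456789abcdef_1")
```

Splitting from the right with rpartition keeps the helper robust even if the left-hand part were ever to contain an underscore.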

Country / organization reference table

countries.csv provides the mapping between the *_countries_ids columns and human-readable names. It contains the countries and international organizations that appear in this release (a subset of the full curated reference list — entities with no events in the release window are omitted).

| Field | Type | Description |
|---|---|---|
| id | string | Numeric ID for countries, io_<n> for international organizations |
| name | string | Canonical English name |
| type | enum | country or organization |
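
Resolving the *_countries_ids values to names amounts to a dictionary lookup keyed on id. The sketch below uses an in-memory stand-in for countries.csv; the rows are invented, the header row id,name,type is assumed from the table above, and load_reference and resolve_ids are hypothetical helper names.

```python
import csv
import io

# Invented stand-in for countries.csv; the header row is an assumption.
TOY_COUNTRIES_CSV = """id,name,type
1,Exampleland,country
2,Samplestan,country
io_1,Example Organization,organization
"""


def load_reference(fh) -> dict:
    """Map each id (kept as a string) to its full reference row."""
    return {row["id"]: row for row in csv.DictReader(fh)}


def resolve_ids(ids, reference):
    """Return the canonical name for each id, or None when it is unknown."""
    return [reference[i]["name"] if i in reference else None for i in ids]


reference = load_reference(io.StringIO(TOY_COUNTRIES_CSV))
names = resolve_ids(["1", "io_1", "999"], reference)
```

Keeping ids as strings matters because numeric country ids and io_<n> codes share one column; when reading the file with pandas, passing dtype=str avoids mixed types.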

Deduplication scores (dedup file only)

GENOME_dedup.csv additionally contains the deduplication similarity scores:

| Field | Type | Description |
|---|---|---|
| score_total | float | Weighted composite similarity score of the merged pair (0–1). 0 for singleton events that were not merged. |
| score_semantic | float | Cosine similarity of event_summary embeddings (all-MiniLM-L6-v2). |
| score_actors | float | Jaccard similarity over actor and recipient name sets. |
| score_type | float | 1 if event_type matches, else 0. |
| score_meta | float | Location overlap. |

Pairs above the deduplication threshold were linked as duplicates and merged via connected components on the similarity graph; see the methodology section of the paper for full details.
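
To make the two mechanisms named above concrete, here is a small sketch of Jaccard similarity and of merging linked pairs through connected components. It is not the pipeline's code: the threshold value, the pair scores, and the convention for two empty sets are all assumptions made for illustration, and the paper remains the reference for the actual weights and threshold.

```python
def jaccard(a, b) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of names."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch
    return len(a & b) / len(a | b)


def connected_components(nodes, edges):
    """Group nodes linked by edges, using a simple union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_v] = root_u

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


THRESHOLD = 0.8  # hypothetical value, not the one used in the paper
pair_scores = {("e1", "e2"): 0.91, ("e2", "e3"): 0.85, ("e3", "e4"): 0.40}

# Keep only pairs above the threshold, then merge transitively.
edges = [pair for pair, score in pair_scores.items() if score > THRESHOLD]
clusters = connected_components(["e1", "e2", "e3", "e4"], edges)
actor_sim = jaccard({"russia", "nato"}, {"russia", "ukraine"})
```

Note that e1 and e3 end up in the same cluster without being directly linked: connected components merge transitively, which is one reason the result is sensitive to the threshold.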

Event types and Quad mapping

The event_type column uses PLOVER's 16 root types. They map to the four PLOVER Quad categories as follows (the category column collapses Quad into a binary CONFLICT / COOPERATION axis):

  • Verbal cooperation: CONSULT, AGREE, SUPPORT, CONCEDE
  • Material cooperation: AID, COOPERATE, RETREAT
  • Verbal conflict: ACCUSE, REQUEST, THREATEN, REJECT
  • Material conflict: ASSAULT, SANCTION, COERCE, MOBILIZE, PROTEST
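
The mapping above can be written as a lookup table, which also shows how the binary category follows from the Quad class. The Quad label strings and the helper name category_for are hypothetical; only the 16 event type names and the CONFLICT / COOPERATION values come from the dataset description.

```python
# Quad labels are naming choices for this sketch, not dataset values.
QUAD_TO_TYPES = {
    "VERBAL_COOPERATION": ["CONSULT", "AGREE", "SUPPORT", "CONCEDE"],
    "MATERIAL_COOPERATION": ["AID", "COOPERATE", "RETREAT"],
    "VERBAL_CONFLICT": ["ACCUSE", "REQUEST", "THREATEN", "REJECT"],
    "MATERIAL_CONFLICT": ["ASSAULT", "SANCTION", "COERCE", "MOBILIZE", "PROTEST"],
}

# Invert to a per-type lookup.
TYPE_TO_QUAD = {
    event_type: quad
    for quad, event_types in QUAD_TO_TYPES.items()
    for event_type in event_types
}


def category_for(event_type: str) -> str:
    """Collapse the Quad class to CONFLICT / COOPERATION; empty stays empty."""
    if not event_type:
        return ""
    quad = TYPE_TO_QUAD[event_type]
    return "COOPERATION" if quad.endswith("COOPERATION") else "CONFLICT"
```

For example, category_for("AID") gives "COOPERATION" and category_for("PROTEST") gives "CONFLICT", which can serve as a consistency check against the category column.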

Source articles

GENOME is built from articles obtained from a single commercial English-language newswire provider whose terms of use prohibit redistribution of the raw text. We therefore release only the extracted structured records, plus the MD5 hash of each source article in article_ids, so that researchers with credentialed access to the same provider can verify which articles a given event was extracted from. Short verbatim snippets in source_quote are capped at 20 words (truncated quotes end with an ellipsis, …) and reproduced under fair use / short-quotation exceptions for the purposes of research and criticism.
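
For researchers who do hold the article text, the check described above could look roughly like the sketch below. It assumes the hash is the hex MD5 of the UTF-8 encoded article text with no further normalization; the exact preprocessing applied before hashing is not documented here, so a mismatch may simply mean this assumption is wrong. The function names are hypothetical.

```python
import hashlib


def article_md5(text: str) -> str:
    """Hex MD5 of the text, assuming UTF-8 and no normalization (assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def parse_article_ids(raw: str) -> set:
    """Parse the serialized "{hash1,hash2}" form into a set of hashes."""
    return {h.strip() for h in raw.strip("{}").split(",") if h.strip()}


def event_cites_article(raw_article_ids: str, text: str) -> bool:
    """True if the hash of `text` is among the event's article_ids."""
    return article_md5(text) in parse_article_ids(raw_article_ids)


# Placeholder text standing in for a licensed article body.
sample_text = "hello"
sample_ids = "{" + article_md5(sample_text) + ",ffffffffffffffffffffffffffffffff}"
matched = event_cites_article(sample_ids, sample_text)
```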

Limitations

See the Limitations section of the paper for known limitations of the dataset, including English-language and Western-newswire bias, the absence of a manually annotated gold standard, the single-core-event extraction strategy, deduplication threshold sensitivity, and reliance on closed-source LLMs.

Citation

If you use GENOME in your research, please cite:

@misc{dellorto-kommandeur-genome,
    title  = "{GENOME}: A New Geopolitical Event Methodology and Dataset using Large Language Models",
    author = "Dell'Orto, Alessandro and Kommandeur, Jesse",
}

A full citation entry will be added once the paper is officially published.

License

The GENOME dataset is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0). See LICENSE for the full notice. You are free to share and adapt the data for any purpose, including commercial use, provided you give appropriate credit by citing the paper above.
