Metadata auto-detection does not detect semantic sdtype as foreign key (column_name_match algorithm) #2799

@gsheni

Description

Environment Details

  • SDV version: 1.33.0
  • Python version: 3.13
  • Operating System: macOS

Error Description

For multi-table datasets, metadata auto-detection is done with the detect_from_dataframes method.

This method infers the sdtypes, primary keys, and foreign keys. Foreign keys are determined by matching a column in one table to a primary key column with the same name in another table. This is the only algorithm available in SDV Community.

Currently, foreign key detection only works if the foreign key column has the id sdtype (and the primary key has the id sdtype as well). This is not the expected behavior. Foreign key detection should also work for semantic sdtypes (email, address, phone_number, etc.).
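To make the difference concrete, here is a minimal, self-contained sketch of a name-matching heuristic. It is not SDV's implementation: the function match_foreign_keys, its allowed_sdtypes parameter, and the simplified table layout are all hypothetical and exist only for illustration. Restricting the match to id primary keys mimics the behavior reported here, while leaving it unrestricted gives what this issue asks for.

```python
def match_foreign_keys(tables, allowed_sdtypes=None):
    """Hypothetical name-matching heuristic (NOT SDV's actual code).

    tables maps a table name to
        {'primary_key': str or None, 'columns': {column_name: sdtype}}.
    allowed_sdtypes optionally restricts which primary key sdtypes are
    considered; None means any sdtype is eligible.
    """
    relationships = []
    for parent_name, parent in tables.items():
        pk = parent.get('primary_key')
        if pk is None:
            continue
        pk_sdtype = parent['columns'][pk]
        if allowed_sdtypes is not None and pk_sdtype not in allowed_sdtypes:
            continue
        for child_name, child in tables.items():
            if child_name == parent_name:
                continue
            # Match on identical column name and sdtype, skipping columns
            # that are the child's own primary key.
            same_column = child['columns'].get(pk) == pk_sdtype
            if same_column and child.get('primary_key') != pk:
                relationships.append({
                    'parent_table_name': parent_name,
                    'parent_primary_key': pk,
                    'child_table_name': child_name,
                    'child_foreign_key': pk,
                })
    return relationships


# Simplified stand-in for the detected metadata of the two tables in this issue.
tables = {
    'parent': {'primary_key': 'email', 'columns': {'email': 'email'}},
    'child': {
        'primary_key': 'child_id',
        'columns': {'child_id': 'id', 'email': 'email'},
    },
}

id_only = match_foreign_keys(tables, allowed_sdtypes={'id'})  # reported behavior
any_sdtype = match_foreign_keys(tables)                       # requested behavior
print(id_only)
print(any_sdtype)
```

With the id-only restriction no relationship is found; without it, the parent/child email pair is matched.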

Steps to reproduce

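from sdv.metadata import Metadata
import pandas as pd

# The parent table's only column is an email column, which is detected as its primary key.
data = {
    'parent': pd.DataFrame({
        'email': ['sdv@sdv.dev', 'info@datacebo.com', 'info@gmail.com']
    }),
    'child': pd.DataFrame({
        'child_id': [1, 2],
        'email': ['sdv@sdv.dev', 'sdv@sdv.dev']
    }),
}
# Every child email also appears in the parent, so the columns are a valid PK/FK pair.
assert set(data['child']['email']).issubset(set(data['parent']['email']))
metadata = Metadata().detect_from_dataframes(
    data,
    foreign_key_inference_algorithm='column_name_match'
)
metadata

Observed output ("relationships" is empty even though parent.email and child.email share a name):
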
{
    "tables": {
        "parent": {
            "columns": {
                "email": {
                    "pii": true,
                    "sdtype": "email"
                }
            },
            "primary_key": "email"
        },
        "child": {
            "columns": {
                "child_id": {
                    "sdtype": "id"
                },
                "email": {
                    "sdtype": "email"
                }
            },
            "primary_key": "child_id"
        }
    },
    "relationships": [],
    "METADATA_SPEC_VERSION": "V1"
}
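The issue does not spell out the expected output. Presumably the "relationships" list should contain one entry linking parent.email to child.email. The sketch below checks a plain metadata dictionary for such an entry and could serve as the core of a regression test. The helper has_relationship is hypothetical, and the relationship key names are assumed to follow SDV's metadata JSON format.

```python
def has_relationship(metadata_dict, parent, child, key):
    """Return True if metadata_dict lists a parent -> child link on `key`.

    Hypothetical helper: assumes the same column name is used as the
    parent's primary key and the child's foreign key.
    """
    expected = {
        'parent_table_name': parent,
        'parent_primary_key': key,
        'child_table_name': child,
        'child_foreign_key': key,
    }
    return expected in metadata_dict.get('relationships', [])


# The output reported above has an empty relationships list.
observed = {'relationships': []}

# What the output would presumably look like once the bug is fixed (assumption).
expected_after_fix = {
    'relationships': [{
        'parent_table_name': 'parent',
        'parent_primary_key': 'email',
        'child_table_name': 'child',
        'child_foreign_key': 'email',
    }]
}

observed_ok = has_relationship(observed, 'parent', 'child', 'email')
fixed_ok = has_relationship(expected_after_fix, 'parent', 'child', 'email')
print(observed_ok, fixed_ok)
```

In a real test, the dictionary would come from the detected metadata (for example via metadata.to_dict()) instead of the hand-written literals used here.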

Labels

bug: Something isn't working
