After generating a synthetic dataset with DataDesigner, sharing it on Hugging Face Hub requires manual conversion steps:
```python
from datasets import Dataset

results = designer.create(...)
df = results.load_dataset()        # materializes the full dataset as a pandas DataFrame
dataset = Dataset.from_pandas(df)  # convert to a Hugging Face Dataset
dataset.push_to_hub("username/dataset-name")
```
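As a stopgap, the same upload can already be done without going through pandas by pointing `Dataset.from_parquet()` at the generated parquet artifacts. This is just a sketch: the artifact path attribute is an assumption based on the implementation suggestion below.

```python
from datasets import Dataset

# Workaround sketch: load the generated parquet shards directly, avoiding
# the intermediate pandas DataFrame. The path attribute
# (results.artifact_storage.final_dataset_path) is assumed here.
parquet_glob = str(results.artifact_storage.final_dataset_path / "*.parquet")
dataset = Dataset.from_parquet(parquet_glob)
dataset.push_to_hub("username/dataset-name")
```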
It might be nice to have a `push_to_hub()` method on `DatasetCreationResults`, e.g. something like:
```python
results = designer.create(...)
results.push_to_hub("username/my-synthetic-dataset")
```
This would load directly from the parquet files (memory-efficient for large datasets, since nothing is materialized as a pandas DataFrame) and handle the upload.
Additional context
A rough suggestion for an implementation:
```python
# In src/data_designer/interface/results.py
def push_to_hub(
    self,
    repo_id: str,
    *,
    private: bool = False,
    token: str | None = None,
    commit_message: str | None = None,
) -> str:
    """Push the generated dataset to Hugging Face Hub.

    Args:
        repo_id: Repository ID (e.g., "username/dataset-name").
        private: Whether the dataset repo should be private.
        token: Hugging Face token (uses the locally cached token if not provided).
        commit_message: Custom commit message for the upload.

    Returns:
        URL of the pushed dataset.
    """
    from datasets import Dataset

    # Load directly from the parquet shards - memory efficient for large
    # datasets, since the data is memory-mapped rather than copied through pandas.
    parquet_path = str(self.artifact_storage.final_dataset_path / "*.parquet")
    dataset = Dataset.from_parquet(parquet_path)

    dataset.push_to_hub(
        repo_id,
        private=private,
        token=token,
        commit_message=commit_message,
    )
    # Recent versions of `datasets` return a CommitInfo from push_to_hub,
    # so build the dataset URL from the repo id instead of returning it directly.
    return f"https://huggingface.co/datasets/{repo_id}"
```
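For illustration, usage might then look like this (a sketch; the keyword arguments simply mirror the signature above):

```python
results = designer.create(...)

# Push privately with a custom commit message; the token falls back to the
# locally cached Hugging Face credentials when not provided.
url = results.push_to_hub(
    "username/my-synthetic-dataset",
    private=True,
    commit_message="Initial synthetic dataset from DataDesigner",
)
print(url)  # e.g. https://huggingface.co/datasets/username/my-synthetic-dataset
```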
Happy to open a PR for this if it seems interesting!