Skip to content

OpenPhase-OPI: An instruction-tuning dataset for phase-field simulation configuration (3,398 examples, 7 formats)

Notifications You must be signed in to change notification settings

HeshamFS/OpenPhase-OPI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 

Repository files navigation

license language tags size_categories task_categories pretty_name dataset_info
mit
en
openphase
phase-field
materials-science
simulation
scientific-computing
instruction-tuning
alpaca
sharegpt
1K<n<10K
text-generation
question-answering
OpenPhase-OPI
features splits
name dtype
instruction
string
name dtype
input
string
name dtype
output
string
name dtype
system
string
name dtype
category
string
name num_examples
train
3398

OpenPhase-OPI Dataset

A high-quality instruction-tuning dataset for training LLMs to be experts in OpenPhase .opi configuration files. OpenPhase is a C++17 phase-field simulation library for modeling microstructure evolution in materials science.

Dataset Description

This dataset was created to train a specialized LLM (e.g., Qwen 2.5 3B) that can:

  • Explain OpenPhase configuration parameters with physical meaning, units, and typical ranges
  • Guide users in setting up phase-field simulations
  • Troubleshoot common simulation issues
  • Explain parameter relationships and stability conditions

Generation Process

  1. Extraction: Parameters extracted from both .opi example files AND C++ source code
  2. Labeling: Rich explanations generated using Claude Opus 4.5
  3. Quality Filtering: Removed fallback/failed responses, validated output quality
  4. Multi-format Export: Converted to 7 popular fine-tuning formats

Dataset Statistics

Metric Value
Total Examples 3,398
Unique Parameters 1,024
Unique Sections 61
Average Output Length 4,367 chars
Total Content ~15 MB text

Examples by Category

Category Count Description
parameter 3,072 Individual parameter explanations
section 183 Section overviews and parameter listings
troubleshooting 74 Common simulation issues and solutions
relationship 52 Parameter interdependencies (CFL, stability)
construction 9 Complete .opi file examples
best_practice 8 Configuration best practices

Data Formats

Multiple formats are provided for compatibility with all major fine-tuning frameworks:

File Format Compatible Frameworks
data/train.json Alpaca Qwen, LLaMA-Factory, Axolotl, Unsloth
data/train_sharegpt.json ShareGPT LLaMA-Factory, FastChat, Vicuna
data/train_openai.jsonl OpenAI/ChatML OpenAI API, Axolotl, Unsloth
data/train_axolotl.json Axolotl Chat Axolotl
data/train_completion.jsonl Completion Base models (GPT-2 style)
data/train_hf_chatml.jsonl HF ChatML Hugging Face TRL
data/train_universal.jsonl Universal JSONL Any framework (raw format)

Data Fields

Alpaca Format (Primary)

{
  "instruction": "Explain this OpenPhase configuration parameter.",
  "input": "$dx in the @GridParameters section",
  "output": "# `$dx` - Grid Spacing\n\n## Physical Meaning\n\nThe `$dx` parameter defines the spatial discretization...",
  "system": "You are an expert assistant for OpenPhase...",
  "category": "parameter"
}

Field Descriptions

Field Type Description
instruction string The task or question being asked
input string Additional context (parameter name, section, etc.)
output string Rich explanation with markdown formatting
system string System prompt defining the assistant's expertise
category string Example type: parameter, section, troubleshooting, etc.

Usage

With Hugging Face Datasets

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("heshamfsalama/OpenPhase-OPI")

# Access training examples
for example in dataset["train"]:
    print(example["instruction"])
    print(example["output"])

With LLaMA-Factory

# Add to LLaMA-Factory/data/dataset_info.json:
{
  "openphase_opi": {
    "file_name": "train.json",
    "formatting": "alpaca",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output",
      "system": "system"
    }
  }
}

# Fine-tune
llamafactory-cli train \
  --model_name_or_path Qwen/Qwen2.5-3B-Instruct \
  --dataset openphase_opi \
  --finetuning_type lora \
  --output_dir ./output/qwen_opi_expert

With Axolotl

datasets:
  - path: data/train_axolotl.json
    type: sharegpt
    conversation: chatml

With Unsloth

from unsloth import FastLanguageModel
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
dataset = load_dataset("heshamfsalama/OpenPhase-OPI")
# Use train_openai.jsonl format

Example Outputs

Parameter Explanation

Input: $IWidth in the @GridParameters section

Output:

$IWidth - Interface Width

Physical Meaning

The interface width parameter controls the diffuse interface thickness in grid points. This is a critical numerical parameter that balances accuracy and computational cost.

Typical Range

  • Minimum: 3-4 grid points (too thin → numerical artifacts)
  • Recommended: 5-7 grid points (good balance)
  • Maximum: 10+ grid points (very smooth but computationally expensive)

Stability Relationship

Must satisfy: IWidth * dx > physical_interface_width ...

Troubleshooting

Input: My simulation produces NaN values

Output:

Troubleshooting: NaN Values in Simulation

Common Causes

  1. Time step too large: Violates CFL condition (dt < dx²/D)
  2. Interface too thin: IWidth < 3 grid points
  3. Division by zero: Phase fraction approaching zero ...

Dataset Creation

Source Data

  • OpenPhase v4.x: C++ headers and source files
  • Example .opi files: 25+ simulation examples
  • Parameters: 797 from examples + 227 from C++ source

Annotation Process

  • Model: Claude Opus 4.5 (claude-opus-4-5-20250514)
  • Prompt Engineering: Domain-specific prompts for materials science
  • Quality Control: Automated filtering of failed/short responses
  • Validation: Manual review of sample outputs

Personal and Sensitive Information

This dataset contains only technical documentation about simulation parameters. No personal or sensitive information is included.

Considerations for Using the Data

Intended Use

  • Fine-tuning LLMs to assist with OpenPhase simulations
  • Educational purposes for learning phase-field methods
  • Research in scientific computing and materials science

Limitations

  • Focused specifically on OpenPhase library (may not generalize to other simulation tools)
  • Generated content may contain occasional inaccuracies
  • Best used with domain knowledge for validation

Biases

  • English language only
  • Biased toward common simulation types in OpenPhase examples
  • Parameter explanations reflect OpenPhase v4.x behavior

License

This dataset is released under the MIT License, consistent with the OpenPhase library license.

Citation

If you use this dataset, please cite:

@dataset{openphase_opi,
  title = {OpenPhase-OPI: An Instruction-Tuning Dataset for Phase-Field Simulation Configuration},
  author = {Hesham Salama},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/heshamfsalama/OpenPhase-OPI},
  note = {Generated using Claude Opus 4.5 from OpenPhase source code and examples}
}

Acknowledgments

Contact

For questions or issues, please open an issue on the repository or contact:

About

OpenPhase-OPI: An instruction-tuning dataset for phase-field simulation configuration (3,398 examples, 7 formats)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published