| license | language | tags | size_categories | task_categories | pretty_name | dataset_info | ||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mit |
|
|
|
|
OpenPhase-OPI |
|
A high-quality instruction-tuning dataset for training LLMs to be experts in OpenPhase .opi configuration files. OpenPhase is a C++17 phase-field simulation library for modeling microstructure evolution in materials science.
This dataset was created to train a specialized LLM (e.g., Qwen 2.5 3B) that can:
- Explain OpenPhase configuration parameters with physical meaning, units, and typical ranges
- Guide users in setting up phase-field simulations
- Troubleshoot common simulation issues
- Explain parameter relationships and stability conditions
- Extraction: Parameters extracted from both
.opiexample files AND C++ source code - Labeling: Rich explanations generated using Claude Opus 4.5
- Quality Filtering: Removed fallback/failed responses, validated output quality
- Multi-format Export: Converted to 7 popular fine-tuning formats
| Metric | Value |
|---|---|
| Total Examples | 3,398 |
| Unique Parameters | 1,024 |
| Unique Sections | 61 |
| Average Output Length | 4,367 chars |
| Total Content | ~15 MB text |
| Category | Count | Description |
|---|---|---|
parameter |
3,072 | Individual parameter explanations |
section |
183 | Section overviews and parameter listings |
troubleshooting |
74 | Common simulation issues and solutions |
relationship |
52 | Parameter interdependencies (CFL, stability) |
construction |
9 | Complete .opi file examples |
best_practice |
8 | Configuration best practices |
Multiple formats are provided for compatibility with all major fine-tuning frameworks:
| File | Format | Compatible Frameworks |
|---|---|---|
data/train.json |
Alpaca | Qwen, LLaMA-Factory, Axolotl, Unsloth |
data/train_sharegpt.json |
ShareGPT | LLaMA-Factory, FastChat, Vicuna |
data/train_openai.jsonl |
OpenAI/ChatML | OpenAI API, Axolotl, Unsloth |
data/train_axolotl.json |
Axolotl Chat | Axolotl |
data/train_completion.jsonl |
Completion | Base models (GPT-2 style) |
data/train_hf_chatml.jsonl |
HF ChatML | Hugging Face TRL |
data/train_universal.jsonl |
Universal JSONL | Any framework (raw format) |
{
"instruction": "Explain this OpenPhase configuration parameter.",
"input": "$dx in the @GridParameters section",
"output": "# `$dx` - Grid Spacing\n\n## Physical Meaning\n\nThe `$dx` parameter defines the spatial discretization...",
"system": "You are an expert assistant for OpenPhase...",
"category": "parameter"
}| Field | Type | Description |
|---|---|---|
instruction |
string | The task or question being asked |
input |
string | Additional context (parameter name, section, etc.) |
output |
string | Rich explanation with markdown formatting |
system |
string | System prompt defining the assistant's expertise |
category |
string | Example type: parameter, section, troubleshooting, etc. |
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("heshamfsalama/OpenPhase-OPI")
# Access training examples
for example in dataset["train"]:
print(example["instruction"])
print(example["output"])# Add to LLaMA-Factory/data/dataset_info.json:
{
"openphase_opi": {
"file_name": "train.json",
"formatting": "alpaca",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"system": "system"
}
}
}
# Fine-tune
llamafactory-cli train \
--model_name_or_path Qwen/Qwen2.5-3B-Instruct \
--dataset openphase_opi \
--finetuning_type lora \
--output_dir ./output/qwen_opi_expertdatasets:
- path: data/train_axolotl.json
type: sharegpt
conversation: chatmlfrom unsloth import FastLanguageModel
from datasets import load_dataset
model, tokenizer = FastLanguageModel.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
dataset = load_dataset("heshamfsalama/OpenPhase-OPI")
# Use train_openai.jsonl formatInput: $IWidth in the @GridParameters section
Output:
The interface width parameter controls the diffuse interface thickness in grid points. This is a critical numerical parameter that balances accuracy and computational cost.
- Minimum: 3-4 grid points (too thin → numerical artifacts)
- Recommended: 5-7 grid points (good balance)
- Maximum: 10+ grid points (very smooth but computationally expensive)
Must satisfy:
IWidth * dx > physical_interface_width...
Input: My simulation produces NaN values
Output:
- Time step too large: Violates CFL condition (
dt < dx²/D)- Interface too thin:
IWidth< 3 grid points- Division by zero: Phase fraction approaching zero ...
- OpenPhase v4.x: C++ headers and source files
- Example
.opifiles: 25+ simulation examples - Parameters: 797 from examples + 227 from C++ source
- Model: Claude Opus 4.5 (claude-opus-4-5-20250514)
- Prompt Engineering: Domain-specific prompts for materials science
- Quality Control: Automated filtering of failed/short responses
- Validation: Manual review of sample outputs
This dataset contains only technical documentation about simulation parameters. No personal or sensitive information is included.
- Fine-tuning LLMs to assist with OpenPhase simulations
- Educational purposes for learning phase-field methods
- Research in scientific computing and materials science
- Focused specifically on OpenPhase library (may not generalize to other simulation tools)
- Generated content may contain occasional inaccuracies
- Best used with domain knowledge for validation
- English language only
- Biased toward common simulation types in OpenPhase examples
- Parameter explanations reflect OpenPhase v4.x behavior
This dataset is released under the MIT License, consistent with the OpenPhase library license.
If you use this dataset, please cite:
@dataset{openphase_opi,
title = {OpenPhase-OPI: An Instruction-Tuning Dataset for Phase-Field Simulation Configuration},
author = {Hesham Salama},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/heshamfsalama/OpenPhase-OPI},
note = {Generated using Claude Opus 4.5 from OpenPhase source code and examples}
}- OpenPhase - The open-source phase-field simulation library
- Anthropic Claude - For generating high-quality explanations
- Hugging Face - For dataset hosting infrastructure
For questions or issues, please open an issue on the repository or contact:
- Email: info@heshamsalama.dev | hesham@autonomouslab.io
- Website: autonomouslab.io