OpenPhase-OPI Dataset

license

language

OpenPhase-OPI Dataset

A high-quality instruction-tuning dataset for training LLMs to be experts in OpenPhase .opi configuration files. OpenPhase is a C++17 phase-field simulation library for modeling microstructure evolution in materials science.

Dataset Description

This dataset was created to train a specialized LLM (e.g., Qwen 2.5 3B) that can:

Explain OpenPhase configuration parameters with physical meaning, units, and typical ranges
Guide users in setting up phase-field simulations
Troubleshoot common simulation issues
Explain parameter relationships and stability conditions

Generation Process

Extraction: Parameters extracted from both .opi example files AND C++ source code
Labeling: Rich explanations generated using Claude Opus 4.5
Quality Filtering: Removed fallback/failed responses, validated output quality
Multi-format Export: Converted to 7 popular fine-tuning formats

Dataset Statistics

Metric	Value
Total Examples	3,398
Unique Parameters	1,024
Unique Sections	61
Average Output Length	4,367 chars
Total Content	~15 MB text

Examples by Category

Category	Count	Description
`parameter`	3,072	Individual parameter explanations
`section`	183	Section overviews and parameter listings
`troubleshooting`	74	Common simulation issues and solutions
`relationship`	52	Parameter interdependencies (CFL, stability)
`construction`	9	Complete `.opi` file examples
`best_practice`	8	Configuration best practices

Data Formats

Multiple formats are provided for compatibility with all major fine-tuning frameworks:

File	Format	Compatible Frameworks
`data/train.json`	Alpaca	Qwen, LLaMA-Factory, Axolotl, Unsloth
`data/train_sharegpt.json`	ShareGPT	LLaMA-Factory, FastChat, Vicuna
`data/train_openai.jsonl`	OpenAI/ChatML	OpenAI API, Axolotl, Unsloth
`data/train_axolotl.json`	Axolotl Chat	Axolotl
`data/train_completion.jsonl`	Completion	Base models (GPT-2 style)
`data/train_hf_chatml.jsonl`	HF ChatML	Hugging Face TRL
`data/train_universal.jsonl`	Universal JSONL	Any framework (raw format)

Data Fields

Alpaca Format (Primary)

{
  "instruction": "Explain this OpenPhase configuration parameter.",
  "input": "$dx in the @GridParameters section",
  "output": "# `$dx` - Grid Spacing\n\n## Physical Meaning\n\nThe `$dx` parameter defines the spatial discretization...",
  "system": "You are an expert assistant for OpenPhase...",
  "category": "parameter"
}

Field Descriptions

Field	Type	Description
`instruction`	string	The task or question being asked
`input`	string	Additional context (parameter name, section, etc.)
`output`	string	Rich explanation with markdown formatting
`system`	string	System prompt defining the assistant's expertise
`category`	string	Example type: parameter, section, troubleshooting, etc.

Usage

With Hugging Face Datasets

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("heshamfsalama/OpenPhase-OPI")

# Access training examples
for example in dataset["train"]:
    print(example["instruction"])
    print(example["output"])

With LLaMA-Factory

# Add to LLaMA-Factory/data/dataset_info.json:
{
  "openphase_opi": {
    "file_name": "train.json",
    "formatting": "alpaca",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output",
      "system": "system"
    }
  }
}

# Fine-tune
llamafactory-cli train \
  --model_name_or_path Qwen/Qwen2.5-3B-Instruct \
  --dataset openphase_opi \
  --finetuning_type lora \
  --output_dir ./output/qwen_opi_expert

With Axolotl

datasets:
  - path: data/train_axolotl.json
    type: sharegpt
    conversation: chatml

With Unsloth

from unsloth import FastLanguageModel
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
dataset = load_dataset("heshamfsalama/OpenPhase-OPI")
# Use train_openai.jsonl format

Example Outputs

Parameter Explanation

Input: $IWidth in the @GridParameters section

Output:

$IWidth - Interface Width

Physical Meaning

The interface width parameter controls the diffuse interface thickness in grid points. This is a critical numerical parameter that balances accuracy and computational cost.

Typical Range

Minimum: 3-4 grid points (too thin → numerical artifacts)

Recommended: 5-7 grid points (good balance)

Maximum: 10+ grid points (very smooth but computationally expensive)

Stability Relationship

Must satisfy: IWidth * dx > physical_interface_width ...

Troubleshooting

Input: My simulation produces NaN values

Output:

Troubleshooting: NaN Values in Simulation

Common Causes

Time step too large: Violates CFL condition (dt < dx²/D)

Interface too thin: IWidth < 3 grid points

Division by zero: Phase fraction approaching zero ...

Dataset Creation

Source Data

OpenPhase v4.x: C++ headers and source files
Example .opi files: 25+ simulation examples
Parameters: 797 from examples + 227 from C++ source

Annotation Process

Model: Claude Opus 4.5 (claude-opus-4-5-20250514)
Prompt Engineering: Domain-specific prompts for materials science
Quality Control: Automated filtering of failed/short responses
Validation: Manual review of sample outputs

Personal and Sensitive Information

This dataset contains only technical documentation about simulation parameters. No personal or sensitive information is included.

Considerations for Using the Data

Intended Use

Fine-tuning LLMs to assist with OpenPhase simulations
Educational purposes for learning phase-field methods
Research in scientific computing and materials science

Limitations

Focused specifically on OpenPhase library (may not generalize to other simulation tools)
Generated content may contain occasional inaccuracies
Best used with domain knowledge for validation

Biases

English language only
Biased toward common simulation types in OpenPhase examples
Parameter explanations reflect OpenPhase v4.x behavior

License

This dataset is released under the MIT License, consistent with the OpenPhase library license.

Citation

If you use this dataset, please cite:

@dataset{openphase_opi,
  title = {OpenPhase-OPI: An Instruction-Tuning Dataset for Phase-Field Simulation Configuration},
  author = {Hesham Salama},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/heshamfsalama/OpenPhase-OPI},
  note = {Generated using Claude Opus 4.5 from OpenPhase source code and examples}
}

Acknowledgments

OpenPhase - The open-source phase-field simulation library
Anthropic Claude - For generating high-quality explanations
Hugging Face - For dataset hosting infrastructure

Contact

For questions or issues, please open an issue on the repository or contact:

Email: info@heshamsalama.dev | hesham@autonomouslab.io
Website: autonomouslab.io

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
data		data
.gitattributes		.gitattributes
README.md		README.md
dataset_infos.json		dataset_infos.json

HeshamFS/OpenPhase-OPI

Folders and files

Latest commit

History

Repository files navigation

OpenPhase-OPI Dataset

Dataset Description

Generation Process

Dataset Statistics

Examples by Category

Data Formats

Data Fields

Alpaca Format (Primary)

Field Descriptions

Usage

With Hugging Face Datasets

With LLaMA-Factory

With Axolotl

With Unsloth

Example Outputs

Parameter Explanation

$IWidth - Interface Width

Physical Meaning

Typical Range

Stability Relationship

Troubleshooting

Troubleshooting: NaN Values in Simulation

Common Causes

Dataset Creation

Source Data

Annotation Process

Personal and Sensitive Information

Considerations for Using the Data

Intended Use

Limitations

Biases

License

Citation

Acknowledgments

Contact

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

`$IWidth` - Interface Width

Packages