Hi, I exported the PTE model for on-device inference on SA8255 using the following command:

```shell
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -m SA8255 --compile_only --decoder_model qwen3-0_6b --system_prompt "你是指令优化专家,负责将用户的口语化、模糊或多意图指令转化为清晰、可独立执行的指令。" --prompt "/no_think开一下哔哩哔哩把左边车窗关上" --model_mode hybrid --max_seq_len 512 --prefill_ar_len 64 --temperature 0.8 --artifact ./qwen3_0_6b_sa8255_full_sft
```

(The system prompt translates to "You are an instruction-optimization expert responsible for turning users' colloquial, vague, or multi-intent instructions into clear, independently executable instructions." The prompt translates to "/no_think Open Bilibili and close the left car window.")
Now I want to improve inference speed on the car device. Besides adjusting model_mode (kv/hybrid/lookahead) and max_seq_len (1024/512/256), or using lower quantization precision (16a4w/16a8w/8a8w), are there other ways to speed up inference? My concern is that changing model_mode or lowering quantization precision degrades the model's output quality, so I'm looking for optimizations that don't hurt accuracy.
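For what it's worth, when comparing these knobs I measure decode throughput with a small harness like the sketch below. It is not part of the llama.py script; `generate(prompt)` is an assumed stand-in for whatever runner drives the exported PTE model on the device, returning the list of generated tokens:

```python
import time

def tokens_per_second(generate, prompt, warmup=1, runs=3):
    """Rough decode-throughput benchmark.

    `generate` is any callable taking a prompt string and returning the
    list of generated tokens -- here a hypothetical wrapper around the
    on-device PTE runner, not an actual ExecuTorch API.
    """
    # Warm-up runs so one-time costs (graph load, cache init) don't skew timing.
    for _ in range(warmup):
        generate(prompt)

    total_tokens = 0
    total_time = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        total_time += time.perf_counter() - start
        total_tokens += len(tokens)

    return total_tokens / total_time

# Example with a dummy generator standing in for the real runner:
rate = tokens_per_second(lambda p: ["tok"] * 100, "hello")
print(f"{rate:.1f} tokens/s")
```

Running each configuration (model_mode, max_seq_len, quantization) through the same harness makes the speed/quality trade-off concrete before committing to one.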
cc @cccclai @winskuo-quic @shewu-quic @haowhsu-quic @DannyYuyang-quic @cbilgin