✨File preview: Add file preview backend service #2470

Open
Stockton11 wants to merge 5 commits into ModelEngine-Group:develop from Stockton11:zwb/file_preview

Conversation

@Stockton11
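1. Add the file preview endpoint /preview/{object_name:path}.
2. Add the file preview logic: use LibreOffice to convert Office files to PDF and cache the converted PDF back to MinIO with a 7-day expiry; all other files are returned directly as a file stream.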

Copilot AI review requested due to automatic review settings February 9, 2026 08:01
Contributor

Copilot AI left a comment

Pull request overview

This PR adds a backend “file preview” capability: direct streaming for previewable types (PDF/images/text) and Office→PDF conversion via LibreOffice with MinIO-backed caching (intended TTL: 7 days).

Changes:

  • Add GET /file/preview/{object_name:path} endpoint that returns inline StreamingResponse.
  • Implement Office-to-PDF conversion with concurrency limits + per-file locking + MinIO cache write-back.
  • Add MinIO copy/existence helpers and extensive unit tests for preview, conversion, and headers.
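
The "concurrency limits + per-file locking" combination in the second bullet can be sketched roughly as follows. This is not the PR's code: `ConversionGate`, `is_cached`, and `convert` are hypothetical names, and the only value taken from the PR is the limit of 5. The idea is that a per-object lock keeps two requests for the same file from converting it twice, while a semaphore caps how many conversions run at once.

```python
import asyncio
from collections import defaultdict

MAX_CONCURRENT_CONVERSIONS = 5  # same value as the constant added in const.py


class ConversionGate:
    """Hypothetical helper: bounds total conversions and serializes work per object."""

    def __init__(self, limit=MAX_CONCURRENT_CONVERSIONS):
        self._semaphore = asyncio.Semaphore(limit)
        # One lock per object name, created lazily on first use.
        # Note: this sketch never evicts locks, so the dict grows with distinct names.
        self._locks = defaultdict(asyncio.Lock)

    async def run(self, object_name, is_cached, convert):
        async with self._locks[object_name]:
            # Re-check the cache after acquiring the lock: another request
            # may have finished the conversion while this one was waiting.
            if await is_cached(object_name):
                return "cache-hit"
            async with self._semaphore:
                await convert(object_name)
            return "converted"
```

With this shape, three simultaneous requests for the same document trigger one conversion; the other two wait on the lock and then find the cached PDF.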

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| backend/apps/file_management_app.py | Adds preview endpoint and inline Content-Disposition support |
| backend/services/file_management_service.py | Implements preview_file_impl() + conversion/caching workflow + PDF validation |
| backend/utils/file_management_utils.py | Adds convert_office_to_pdf() using LibreOffice |
| backend/database/attachment_db.py | Adds file_exists() and copy_object() helpers; extends MIME map (.md) |
| backend/database/client.py | Adds MinioClient.copy_object() wrapper |
| backend/consts/const.py | Adds office MIME list and conversion concurrency limit |
| docker/docker-compose.yml | Adds MinIO ILM expiry rule for converted/ prefix |
| test/backend/app/test_file_management_app.py | Adds tests for preview endpoint + inline headers |
| test/backend/services/test_file_management_service.py | Adds tests for preview routing, cache hit/miss, and PDF validation |
| test/backend/utils/test_file_management_utils.py | Adds tests for LibreOffice conversion helper |
| test/backend/database/test_client.py | Adds tests for MinioClient.copy_object() |
| test/backend/database/test_attachment_db.py | Adds tests for file_exists() / copy_object() |

Comment on lines +38 to +49
# Limit concurrent Office-to-PDF conversions
MAX_CONCURRENT_CONVERSIONS = 5

# Supported Office file MIME types for preview conversion
OFFICE_MIME_TYPES = [
    'application/msword',  # .doc
    'application/vnd.openxmlformats-officedocument.wordprocessingml.document',  # .docx
    'application/vnd.ms-excel',  # .xls
    'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',  # .xlsx
    'application/vnd.ms-powerpoint',  # .ppt
    'application/vnd.openxmlformats-officedocument.presentationml.presentation'  # .pptx
]
Copilot AI Feb 9, 2026

MAX_CONCURRENT_CONVERSIONS is hard-coded to 5 here. Since this directly controls how many LibreOffice processes can run concurrently (CPU/memory heavy), it likely needs to be configurable via env var (similar to other settings) so deployments can tune it per machine size.
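
One possible shape for this, as a sketch only: the environment variable name `MAX_CONCURRENT_CONVERSIONS` and the helper `read_positive_int` are assumptions, not something the PR or the project's existing settings define. The fallback keeps the current default of 5 when the variable is unset or invalid.

```python
import os


def read_positive_int(name: str, default: int) -> int:
    """Hypothetical helper: read a positive integer from the environment.

    Falls back to `default` when the variable is unset, not an integer,
    or not strictly positive, so a bad value cannot disable conversions.
    """
    raw = os.getenv(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


# Assumed variable name; 5 is the value currently hard-coded in the PR.
MAX_CONCURRENT_CONVERSIONS = read_positive_int("MAX_CONCURRENT_CONVERSIONS", 5)
```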

Comment on lines +602 to +610
return StreamingResponse(
    file_stream,
    media_type=content_type,
    headers={
        "Content-Disposition": content_disposition,
        "Cache-Control": "public, max-age=3600",
        "ETag": f'"{object_name}"',
    }
)
Copilot AI Feb 9, 2026

The ETag response header is being set to the literal object_name (e.g. "documents/test.pdf"), which is not a representation of the content. This can cause clients/proxies to incorrectly treat different versions of the same object as identical and serve stale previews. Either omit the ETag header, or populate it with the real object ETag/version from storage metadata (e.g., head_object / stat) and consider also handling If-None-Match for 304 responses.
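
A minimal sketch of the second option, assuming the storage layer can supply the object's ETag as a string (how it is fetched is left out here). `quote_etag` and `is_not_modified` are hypothetical helpers, not part of the PR; the comparison follows the weak-comparison rule HTTP uses for If-None-Match.

```python
from typing import Optional


def quote_etag(storage_etag: str) -> str:
    """Wrap a storage ETag in double quotes, without double-quoting it."""
    return '"' + storage_etag.strip('"') + '"'


def _strip_weak(value: str) -> str:
    # Weak comparison ignores the W/ prefix on either side.
    return value[2:] if value.startswith("W/") else value


def is_not_modified(if_none_match: Optional[str], etag: str) -> bool:
    """Return True when the request's If-None-Match matches the current ETag.

    Simplification: the header is split on commas, which is fine for
    hex-style storage ETags but not for arbitrary ETags containing commas.
    """
    if not if_none_match:
        return False
    candidates = [c.strip() for c in if_none_match.split(",")]
    if "*" in candidates:
        return True
    return _strip_weak(etag) in {_strip_weak(c) for c in candidates}
```

When `is_not_modified(...)` returns True, the endpoint would answer with 304 and no body instead of streaming the file again.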


# Unsupported file type
else:
    raise Exception(f"Unsupported file type for preview: {content_type}")
Contributor

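This function is long and its logic is fairly complex. Consider splitting it into several single-responsibility helper functions so that it is easier to maintain.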
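
As one illustration of such a split, the routing decision could live in its own small function, leaving streaming and conversion to separate helpers. This is only a sketch: `PreviewAction` and `choose_preview_action` are hypothetical names, the "PDF/images/text" rule is taken from the PR overview rather than from the actual code, and `ValueError` is used here in place of the bare `Exception` in the diff.

```python
from enum import Enum

# Copied from the constant added in backend/consts/const.py.
OFFICE_MIME_TYPES = [
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.ms-powerpoint",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
]


class PreviewAction(Enum):
    STREAM = "stream"    # return the stored object as-is
    CONVERT = "convert"  # convert to PDF (or serve the cached PDF)


def choose_preview_action(content_type: str) -> PreviewAction:
    """Decide how a file should be previewed, based only on its MIME type."""
    if content_type == "application/pdf" or content_type.startswith(("image/", "text/")):
        return PreviewAction.STREAM
    if content_type in OFFICE_MIME_TYPES:
        return PreviewAction.CONVERT
    raise ValueError(f"Unsupported file type for preview: {content_type}")
```

A function like this can be unit-tested without MinIO or LibreOffice, which is part of the maintainability benefit.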

cd ..

- name: Install LibreOffice
  run: sudo apt-get update && sudo apt-get install -y libreoffice
Contributor

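You install the LibreOffice dependency for the unit tests, but I could not find any corresponding installation step in the project's deployment files. Have you verified that the feature still works after a fresh deployment?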

uv pip install -e "../sdk[dev]"
cd ..

- name: Install LibreOffice
Contributor

@xuyaqist xuyaqist Feb 26, 2026

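  1. LibreOffice already exists in data-process, so it can be reused directly. data_process is a standalone service (running in a Docker container) that communicates bidirectionally with the main backend.
    (1) Define convert_office_to_pdf() in data_process_app.py, using subprocess to invoke libreoffice convert-to pdf.
    (2) Expose a POST /convert-to-pdf API that accepts an Office file and returns a PDF.
    (3) In file_management_utils.py, call the data_process service's API.
    The result: LibreOffice runs in the data_process container, and the backend calls it over HTTP.

  2. To keep the conversion of large files from consuming too much memory during preview, files above a certain size (100M) should not be supported for preview.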
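
Two pieces of this proposal are easy to sketch in isolation: the LibreOffice command line that the data_process side would pass to subprocess, and the size guard on the backend side. The helper names are hypothetical, "100M" is read here as 100 MiB (an assumption), and the HTTP call between the two services is left out.

```python
# Assumption: the reviewer's "100M" is interpreted as 100 MiB.
MAX_PREVIEW_SIZE_BYTES = 100 * 1024 * 1024


def build_convert_command(input_path: str, output_dir: str) -> list:
    """Build the argument list for a headless LibreOffice PDF conversion.

    Passing a list (not a shell string) to subprocess avoids shell
    interpretation of file names.
    """
    return [
        "libreoffice",
        "--headless",
        "--convert-to", "pdf",
        "--outdir", output_dir,
        input_path,
    ]


def is_within_preview_limit(size_bytes: int, limit: int = MAX_PREVIEW_SIZE_BYTES) -> bool:
    """Return True when a file is small enough to be converted for preview."""
    return 0 <= size_bytes <= limit
```

Checking the size from object metadata before downloading anything lets the backend reject oversized files without touching data_process at all.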

Comment on lines +343 to +348
async def convert_office_to_pdf(input_path: str, output_dir: str, timeout: int = 30) -> str:
    """
    Convert Office document to PDF using LibreOffice.

    Args:
        input_path: Path to input Office file
Contributor

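The current approach runs the LibreOffice command directly in the backend, which in principle means LibreOffice has to be packaged into the backend image. It would be better to run LibreOffice in data process and communicate with it over an API, so that LibreOffice does not need to be additionally packaged into the backend image.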
