diff --git a/site/security/index.md b/site/security/index.md index b28e50ce60..b6b006cc30 100644 --- a/site/security/index.md +++ b/site/security/index.md @@ -7,6 +7,33 @@ Apache ORC is a library rather than an execution framework and thus is less likely to have security vulnerabilities. However, if you have discovered one, please follow the process below. +## Threat Model + +We align our threat model with other foundational data format libraries like [Apache Arrow](https://arrow.apache.org/docs/dev/format/Security.html). When evaluating potential security vulnerabilities, especially those discovered via fuzzing tools, we apply the following principles to distinguish between normal robustness bugs and actual security vulnerabilities. + +### 1. Trusted vs. Untrusted Data Boundaries + +Apache ORC is a low-level format library designed to read and write data from trusted storage systems (e.g., internal data lakes, HDFS, S3). The parsing APIs assume that the underlying data originates from a trusted internal source. + +If an attacker is able to write arbitrary, maliciously crafted ORC files into your internal storage, the system is already compromised at the infrastructure or access control level (e.g., IAM permissions, API gateways). The `orc` library itself is not the appropriate security boundary to defend against compromised storage infrastructure. + +### 2. Robustness Issues vs. Security Vulnerabilities + +A crash, Out-Of-Memory (OOM), Out-Of-Bounds (OOB) read, or assertion failure caused by feeding a maliciously fuzzed file directly into the low-level parser is considered a **robustness issue** (a regular software bug), not a security vulnerability. + +Such issues are only treated as security vulnerabilities (CVEs) if they: +* Lead to Remote Code Execution (RCE). +* Bypass a defined security boundary or cause cross-tenant data leakage. +* Occur in a server component explicitly designed to process unverified, external data. + +Missing bounds checks or crashes during the parsing of malformed files do not typically lead to code execution or break isolation sandboxes. They are treated as regular software defects. + +### 3. Responsibilities of the Low-Level Library + +The primary goal of a low-level serialization/deserialization library like ORC is high performance. Mandating strict defensive programming and validation on every internal memory operation would severely degrade performance. + +Security validation and sanitization of untrusted inputs should occur at the data ingestion layer (e.g., before data is admitted into the data lake). It is outside the scope of the format parser to proactively defend against all possible artificially corrupted bits. + ## Reporting a Vulnerability We strongly encourage folks to report security vulnerabilities to our