Add draft threat model + SECURITY.md/AGENTS.md discoverability #3966
potiuk wants to merge 1 commit into
Conversation
Generated-by: Claude Code
rvesse
left a comment
Thanks @potiuk for the first pass at this, I think this looks like a pretty solid starting point.
I have gone through the model and made various suggested edits throughout (mostly confirming or clarifying things you'd marked as needing confirmation). I won't commit the edits yet, as I want to give the rest of the PMC a chance to review the initial draft as-is.
I have also provided initial answers for most of the questions. Some of those answers are just me pinging other PMC members with the relevant expertise in a particular area of the codebase to provide their input.
| ## §4 Trust boundaries and data flow | ||
| - **Primary boundary: the Fuseki SPARQL endpoint.** Queries arrive over HTTP from (by default) **anonymous** clients. The boundary question is what an anonymous/low-privilege SPARQL query can reach: read data it shouldn't, **write** (SPARQL Update / GSP) without authorisation, make Fuseki issue outbound requests (`SERVICE` → SSRF), read local files (`file:` URLs / FROM), execute code (ARQ custom/JavaScript functions if enabled), or exhaust resources. *(inferred; public-query default documented)* | ||
| - **Admin boundary:** the `/$/*` admin surface is localhost-only by default *(documented)*; exposing it to the network is an operator misconfiguration. |
| - **Admin boundary:** the `/$/*` admin surface is localhost-only by default *(documented)*; exposing it to the network is an operator misconfiguration. | |
| - **Admin boundary:** the `/$/*` admin surface is localhost-only by default *(documented)*; exposing it to the network (without configuring authentication/authorisation) is an operator misconfiguration. |
| - **Runtime:** JVM (Java; "old in places" per andy@). *(maintainer)* | ||
| - **Fuseki auth:** Apache Shiro via `$FUSEKI_BASE/shiro.ini`; changing it needs a restart *(documented — Fuseki security docs)*. | ||
| - **Store:** TDB1/TDB2 on the local filesystem, assumed private to the Fuseki/JVM process. *(inferred)* |
| - **Store:** TDB1/TDB2 on the local filesystem, assumed private to the Fuseki/JVM process. *(inferred)* | |
| - **Store:** TDB1/TDB2 on the local filesystem, private to the owning Fuseki/JVM process; access to a single store location by multiple processes is prevented by code. *(maintainer)* |
| - **Fuseki auth:** Apache Shiro via `$FUSEKI_BASE/shiro.ini`; changing it needs a restart *(documented — Fuseki security docs)*. | ||
| - **Store:** TDB1/TDB2 on the local filesystem, assumed private to the Fuseki/JVM process. *(inferred)* | ||
| - **Network:** TLS is the deployer's (reverse proxy); Fuseki's bundled example setup is plaintext *(documented — "no TLS, passwords in plain text")*. | ||
| - **Negative side-effects inventory** (inferred — wave-1/2 target): Fuseki listens on HTTP; ARQ can make **outbound** network requests via `SERVICE` (federation) and can read **`file:`/http: URLs** named in queries (FROM/FROM NAMED/SERVICE); RIOT parses untrusted RDF; ARQ may execute **custom/JavaScript functions** if the operator enabled them; TDB reads/writes the data directory. *(inferred — these are the load-bearing confirmations)* |
| - **Negative side-effects inventory** (inferred — wave-1/2 target): Fuseki listens on HTTP; ARQ can make **outbound** network requests via `SERVICE` (federation) and can read **`file:`/http: URLs** named in queries (FROM/FROM NAMED/SERVICE); RIOT parses untrusted RDF; ARQ may execute **custom/JavaScript functions** if the operator enabled them; TDB reads/writes the data directory. *(inferred — these are the load-bearing confirmations)* | |
| - **Negative side-effects inventory** (inferred — wave-1/2 target): Fuseki listens on HTTP; ARQ can make **outbound** network requests via `SERVICE` (federation), `SERVICE` can be disabled by operator in configuration; ARQ can read **`file:`/http: URLs** named in queries (FROM/FROM NAMED/SERVICE); RIOT parses untrusted RDF; ARQ may execute **custom/JavaScript functions** if the operator enabled them; TDB reads/writes the data directory. *(inferred — these are the load-bearing confirmations)* |
@afs On the point of reading file:/http URLs in queries/updates: isn't that partly a feature of which dataset implementation is used, i.e. is this something the operator can control or disable by choice of configuration?
| | Fuseki Shiro auth (`shiro.ini`) | SPARQL **query** public; admin `/$/*` **localhost-only** | *(documented)* Restricting query access requires Shiro `[urls]` ACLs. | | ||
| | Fuseki example user setup | `admin`/`pw`, plaintext, no TLS | *(documented)* explicitly "not recommended for production". Any "default admin/pw in prod" report → `OUT-OF-MODEL: non-default-build`. | | ||
| | SPARQL **Update** / Graph Store write | per-dataset (read-only vs read-write service) — **default to confirm** | *(inferred)* If a dataset ships update-enabled + unauthenticated, anonymous write is in-model; if read-only by default, anonymous write is not reachable. **Wave-1 question.** | | ||
| | `SERVICE` (federated query) | **to confirm** (enabled? restrictable allow-list?) | *(inferred)* SSRF surface; whether it can be disabled / allow-listed is the key §10 lever. | |
| | `SERVICE` (federated query) | **to confirm** (enabled? restrictable allow-list?) | *(inferred)* SSRF surface; whether it can be disabled / allow-listed is the key §10 lever. | | |
| | `SERVICE` (federated query) | may be disabled by operator config **(documented)** | *(inferred)* SSRF surface |
Documentation references:
| | Fuseki example user setup | `admin`/`pw`, plaintext, no TLS | *(documented)* explicitly "not recommended for production". Any "default admin/pw in prod" report → `OUT-OF-MODEL: non-default-build`. | | ||
| | SPARQL **Update** / Graph Store write | per-dataset (read-only vs read-write service) — **default to confirm** | *(inferred)* If a dataset ships update-enabled + unauthenticated, anonymous write is in-model; if read-only by default, anonymous write is not reachable. **Wave-1 question.** | | ||
| | `SERVICE` (federated query) | **to confirm** (enabled? restrictable allow-list?) | *(inferred)* SSRF surface; whether it can be disabled / allow-listed is the key §10 lever. | | ||
| | ARQ **JavaScript / custom functions** | **to confirm** (opt-in?) | *(inferred)* If enabled, SPARQL can execute code → by-design-if-operator-enabled, like a trusted extension. | |
| | ARQ **JavaScript / custom functions** | **to confirm** (opt-in?) | *(inferred)* If enabled, SPARQL can execute code → by-design-if-operator-enabled, like a trusted extension. | | |
| | ARQ **JavaScript / custom functions** | opt-in feature, requires explicit operator config of both Fuseki and JVM | *(inferred)* If enabled, SPARQL can execute code; executable JS functions are controlled by an explicit whitelist *(documented)*, and some JS functions, e.g. `eval()`, are explicitly blacklisted regardless of the whitelist → by-design-if-operator-enabled, like a trusted extension. Java custom functions require explicit operator configuration of the class path; if code is added to the class path, it is the operator's responsibility to verify that the function code is safe | |
| **Wave 2 — the high-value query surfaces (the Jena CVE classes):** | ||
| 4. **`SERVICE` federation (SSRF):** is it enabled by default, and can it be disabled / allow-listed? Is an SSRF via `SERVICE` from an anonymous query `VALID`? → §8/§9/§10. | ||
| 5. **`file:` / arbitrary-URI access** via FROM / FROM NAMED / SERVICE: is local-file read from an untrusted query prevented by default? → §8/§9. | ||
| 6. **ARQ JavaScript / custom functions:** opt-in? If enabled and reachable anonymously, is code execution `VALID` or by-design-operator-enabled? → §5a/§9/§11a. |
Yes, opt-in, with an explicit whitelist for permitted JS functions.
For custom Java functions the operator has to explicitly add code to their class path, so it's the operator's responsibility to verify that they trust the custom function code.
Code execution is by-design-operator-enabled.
| 4. **`SERVICE` federation (SSRF):** is it enabled by default, and can it be disabled / allow-listed? Is an SSRF via `SERVICE` from an anonymous query `VALID`? → §8/§9/§10. | ||
| 5. **`file:` / arbitrary-URI access** via FROM / FROM NAMED / SERVICE: is local-file read from an untrusted query prevented by default? → §8/§9. | ||
| 6. **ARQ JavaScript / custom functions:** opt-in? If enabled and reachable anonymously, is code execution `VALID` or by-design-operator-enabled? → §5a/§9/§11a. | ||
| 7. **RIOT / RDF-XML XXE:** are external entities (and `file:` fetches) disabled by default in the parsers? → §8. |
I believe so; this is @afs's area of expertise, having rewritten those parsers relatively recently.
RDF/XML - XXE is disabled (JenaXMLInput).
JSON-LD has its own version of the XXE concern.
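For readers less familiar with what "XXE is disabled" means in practice, here is a minimal sketch of generic StAX hardening using only the JDK. It is not Jena's `JenaXMLInput` code, and the class and method names (`XxeHardeningSketch`, `hardenedFactory`, `readText`) are hypothetical; it only illustrates the general technique of switching off DTD and external-entity support before parsing untrusted XML.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration only; not taken from the Jena codebase.
final class XxeHardeningSketch {

    /** Builds a StAX factory with DTD processing and external entities switched off. */
    static XMLInputFactory hardenedFactory() {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        return factory;
    }

    /**
     * Returns the character data of the document, or null if the hardened
     * parser rejects the input (for example because it relies on a DTD entity).
     */
    static String readText(String xml) {
        StringBuilder text = new StringBuilder();
        try {
            XMLStreamReader reader = hardenedFactory().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.CHARACTERS) {
                        text.append(reader.getText());
                    }
                }
            } finally {
                reader.close();
            }
            return text.toString();
        } catch (XMLStreamException e) {
            return null; // input rejected by the parser
        }
    }

    public static void main(String[] args) {
        String benign = "<r>hello</r>";
        // Entity declared in an internal DTD subset; a hardened parser should not expand it.
        String hostile = "<!DOCTYPE r [<!ENTITY x \"SECRET\">]><r>&x;</r>";
        System.out.println("benign  -> " + readText(benign));
        System.out.println("hostile -> " + readText(hostile));
    }
}
```

The hostile document uses an internal entity rather than a `SYSTEM` one so the sketch does not depend on any file on the machine; the point is only that the entity text never reaches the caller.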
| 7. **RIOT / RDF-XML XXE:** are external entities (and `file:` fetches) disabled by default in the parsers? → §8. | ||
| **Wave 3 — resources, API, meta:** | ||
| 8. **Resource/DoS line** (your volume concern): is an expensive SPARQL query or huge RDF body a bug, or operator-tuned via query-timeout/result-limits? Where's the line? → §8/§11a. |
I don't think we can treat these as bugs; these are known issues in the wider RDF/SPARQL community, and it's the operator's responsibility to apply configuration (e.g. query timeout), request size limits via a reverse proxy, etc.
We ought to couple this with needing to enable SERVICE in Fuseki.
(The volume concern I mentioned is the number of requests - it can flood a server.)
| **Wave 3 — resources, API, meta:** | ||
| 8. **Resource/DoS line** (your volume concern): is an expensive SPARQL query or huge RDF body a bug, or operator-tuned via query-timeout/result-limits? Where's the line? → §8/§11a. | ||
| 9. Confirm the **in-process Java API** is modeled as trusted-caller (embedding-app SPARQL injection is the app's bug), and that parameterised queries are the recommended pattern. → §3/§9. |
Yes, for in-process use it's the app's responsibility to verify untrusted inputs and apply any appropriate hardening.
Parameterised queries are the recommended pattern.
Fuseki itself does not have parameterised queries.
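To make the embedding-application point concrete, here is a minimal JDK-only sketch of why binding a value as an escaped literal differs from string concatenation. It does not use Jena; the class and helpers (`SparqlParamSketch`, `quoteLiteral`, `unsafeQuery`, `boundQuery`) and the `http://example.org/name` property are hypothetical. In a real application the library's own parameterised-query support (for example `ParameterizedSparqlString` in jena-arq) would be the natural choice rather than hand-rolled escaping.

```java
// Hypothetical illustration only; a real application should prefer the
// library's parameterised-query support over hand-written escaping.
final class SparqlParamSketch {

    /** Escapes a Java string so it can sit inside a double-quoted SPARQL literal. */
    static String quoteLiteral(String value) {
        StringBuilder out = new StringBuilder("\"");
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '"':  out.append("\\\""); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:   out.append(c);
            }
        }
        return out.append('"').toString();
    }

    /** Naive concatenation: untrusted input can close the literal and alter the pattern. */
    static String unsafeQuery(String name) {
        return "SELECT ?s WHERE { ?s <http://example.org/name> \"" + name + "\" }";
    }

    /** Value bound as an escaped literal: the input stays data. */
    static String boundQuery(String name) {
        return "SELECT ?s WHERE { ?s <http://example.org/name> " + quoteLiteral(name) + " }";
    }

    public static void main(String[] args) {
        String hostile = "x\" . ?s ?p ?o } #";
        System.out.println(unsafeQuery(hostile));
        System.out.println(boundQuery(hostile));
    }
}
```

With the concatenated form, the quote in the input ends the literal early and the rest is parsed as query syntax; with the bound form the same input remains a single string literal.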
| **Wave 3 — resources, API, meta:** | ||
| 8. **Resource/DoS line** (your volume concern): is an expensive SPARQL query or huge RDF body a bug, or operator-tuned via query-timeout/result-limits? Where's the line? → §8/§11a. | ||
| 9. Confirm the **in-process Java API** is modeled as trusted-caller (embedding-app SPARQL injection is the app's bug), and that parameterised queries are the recommended pattern. → §3/§9. | ||
| 10. Any other recurring scanner/fuzzer false positives to seed §11a? → §11a. |
Probably worth reviewing the TDB FAQs page; the following two in particular come to mind as recurring topics on the mailing lists:
- Does Fuseki/TDB have a memory leak? - Unbounded memory growth under continuous read/write load is a known issue, and our use of a WAL ensures no data is lost should the process crash/restart because of it.
- Why is the database much larger on disk than my input? - This is two-fold. Firstly, we use sparse files, so depending on how disk usage is inspected (and the filesystem in use) different usage metrics can be reported. Secondly, TDB2 uses MVCC trees, so each write transaction potentially creates new blocks in the trees, orphaning the old blocks (once any active read transactions on the old tree state have completed). This is expected behaviour, and we provide a compaction operation that we recommend operators run periodically to reclaim disk space.
|
Thanks @rvesse — genuinely useful, detailed review. Understood you're holding your own commits so the rest of the PMC can review the as-is draft first, so I won't push anything over that. For when you're ready: I've staged a revision incorporating all your suggestions — SSRF via |
|
I flew over the files and have nothing to add so far. (Yesterday, I realized I have been overworked for a few days and need some time off to be able to think straight again) |
|
Thanks @arne-bdt — appreciate the look. And genuinely, take the time off you need; there's zero rush here, the draft will keep. Be well. 🙂 |
| - **Reporting cross-reference:** §8-property violations → report privately per ASF process (`security@apache.org` → `private@jena.apache.org`); §3/§9 findings are closed citing this document. | ||
| - **Provenance legend:** *(documented)* = Jena's own docs/repo; *(maintainer)* = confirmed by a Jena PMC member through this process (andy@ has ratified destination + the help-with-model request); *(inferred)* = reasoned from architecture, not yet confirmed — each has a matching §14 open question. | ||
| - **Draft confidence:** ~12 documented / ~2 maintainer / ~34 inferred. | ||
| - **What Jena is:** Apache Jena is a Java framework for building Semantic-Web / linked-data applications over RDF. It provides an in-process API to RDF data held in memory or in a native store (TDB), the ARQ SPARQL query/update engine, RIOT parsers/serialisers for RDF syntaxes (Turtle, RDF/XML, JSON-LD, N-Triples, …), and **Fuseki** — a standalone HTTP server exposing SPARQL query, SPARQL Update, and the Graph Store Protocol over the network. *(documented — README, jena.apache.org; maintainer — andy@ 2026-06-01: "an HTTP-based data server (Fuseki) and a Java API to RDF data stored in memory and in a custom database")* |
JSON-LD is provided by a dependency.
There is work in the current JSON-LD W3C Working Group to document and provide mitigation for the issue that JSON-LD reads remote files.
It is safer than XML External Entities but, nevertheless, it's an issue.
| - **authenticated user / admin** — gated by Apache Shiro (`shiro.ini`); admin functions (`/$/*`) restricted to localhost by default *(documented)*. | ||
| - **operator/deployer** — configures Shiro, datasets, TDB location, and which endpoints are read-only vs updatable. **Trusted.** *(inferred)* | ||
| - **embedding application** (Java API) — trusted; supplies queries/RDF to the library. *(inferred)* | ||
Jena also provides a Lucene-based text index component, which is included in Fuseki.
Should that be included here?
| | Fuseki HTTP server | `jena-fuseki2` — SPARQL query / Update / Graph Store Protocol, admin `/$/*` | network (listens) | **In — primary boundary** *(documented)* | | ||
| | SPARQL engine (ARQ) | `jena-arq` — query/update eval, `SERVICE` federation, custom functions | network out (SERVICE), file (file: URLs) | **In — high value** *(inferred)* | | ||
| | RDF I/O (RIOT) | `jena-arq`/`jena-core` parsers (RDF/XML, Turtle, JSON-LD, …) | parses untrusted RDF | **In — XXE / parser-DoS surface** *(inferred)* | | ||
| | Stores | `jena-tdb1`, `jena-tdb2`, `jena-db` | filesystem | **In (engine's use); on-disk store is operator-trusted** *(inferred)* | |
| | Stores | `jena-tdb1`, `jena-tdb2`, `jena-db` | filesystem | **In (engine's use); on-disk store is operator-trusted** *(inferred)* | | |
| | Stores | `jena-tdb1`, `jena-tdb2`, `jena-text` | filesystem | **In (engine's use); on-disk store is operator-trusted** *(inferred)* | |
jena-db is the module that provides a database framework. TDB2 uses it. TDB2 is the actual database.
TDB1 predates jena-db.
jena-text (Lucene-based) is also a persistent, read-write storage component.
Should it be listed here rather than later?
| ## §2 Scope and intended use | ||
| - **Two deployment shapes** *(maintainer — andy@)*: | ||
| - **Fuseki** — a long-running **HTTP server** that answers SPARQL over the network. The primary network trust surface. |
For completeness, list the different parts of the SPARQL family.
| - **Fuseki** — a long-running **HTTP server** that answers SPARQL over the network. The primary network trust surface. | |
| - **Fuseki** — a long-running **HTTP server** that answers SPARQL Query and SPARQL Update as well as the SPARQL Graph Store Protocol (read and read-write forms) over the network. The primary network trust surface. |
| - **Fuseki** — a long-running **HTTP server** that answers SPARQL over the network. The primary network trust surface. | ||
| - **The Jena Java API** — `jena-core`/`jena-arq`/TDB embedded **in-process** in another application. Trusted caller; the bytes/queries it feeds Jena are that application's responsibility. | ||
| - **Caller roles** (Fuseki is a network service — the role splits): | ||
| - **anonymous SPARQL client** — issues SPARQL queries over HTTP. **Default-public for query** *(documented — Fuseki security docs: "SPARQL endpoints are open to the public but administrative functions are limited to localhost")*. |
| - **anonymous SPARQL client** — issues SPARQL queries over HTTP. **Default-public for query** *(documented — Fuseki security docs: "SPARQL endpoints are open to the public but administrative functions are limited to localhost")*. | |
| - **anonymous SPARQL client** — issues SPARQL queries over HTTP. **Default-public for SPARQL query** *(documented — Fuseki security docs: "SPARQL endpoints are open to the public but administrative functions are limited to localhost")*. |
SPARQL Update and Graph Store Protocol update are not the default for TDB2 stored data.
They are default for the in-memory configuration if the server is started with no initial data.
| **Wave 2 — the high-value query surfaces (the Jena CVE classes):** | ||
| 4. **`SERVICE` federation (SSRF):** is it enabled by default, and can it be disabled / allow-listed? Is an SSRF via `SERVICE` from an anonymous query `VALID`? → §8/§9/§10. | ||
| 5. **`file:` / arbitrary-URI access** via FROM / FROM NAMED / SERVICE: is local-file read from an untrusted query prevented by default? → §8/§9. |
In default configurations, FROM / FROM NAMED are URIs used as names of graphs in the dataset - data already accessible via GRAPH - even for file:
file: does not read local storage.
What this is
A draft threat model for Apache Jena, proposed by the ASF Security team for the Jena PMC to review, correct, or reject — drafted by the Security team's threat-model tooling from Jena's public docs and repository, following the ASF Security threat-model rubric. It was requested by the PMC (andy@) as a starting point.
This PR:
- `THREAT_MODEL.md`: the draft model;
- `SECURITY.md`: a short security policy linking the threat model;
- `AGENTS.md` with a `## Security` section, so the chain `AGENTS.md → SECURITY.md → THREAT_MODEL.md` is mechanically discoverable by automated security scanners.
How to read it
Every claim is provenance-tagged: (documented) (from Jena's docs/repo), (inferred) (reasoned from architecture, not yet confirmed), (maintainer) (confirmed by the PMC). This v0 is ~12 documented / ~34 inferred. The §14 Open questions section collects every inferred claim into waves — that is where review time is best spent. The model treats Fuseki's SPARQL endpoint as the primary boundary (public query, localhost-only admin by default are documented) and flags the high-value query surfaces for confirmation:
- `SERVICE` federation (SSRF), `file:`/FROM local-file read, ARQ JavaScript/custom functions (code exec), and RDF/XML XXE in RIOT: is each prevented/restrictable by default? (wave 2);
Nothing here is a requirement; the model is for the PMC to own. Comment inline, edit the branch, or reply on the email thread.