Commit c3001a6

shixish and claude committed
Add ontology navigation to the SDK core for DQL authoring
Promote ontology introspection out of the CLI layer into a storage-agnostic diffbot.Ontology class so library consumers (e.g. langchain) can build valid DQL on the fly, without depending on the disk-backed CLI cache.

- New diffbot/ontology.py: Ontology with types/composites/enums/taxonomies, fields_for, filter_fields, taxonomy_values, enum_values, find_named, format_field, plus from_json/from_path. Pure; imposes no caching policy.
- Client: dql_fetch_ontology() (sync + async) downloads and returns an Ontology; add async DiffbotAsync.dql_parallel() to mirror the sync one.
- cli/ontology.py now delegates to Ontology, keeping its ~/.diffbot disk cache and existing module API unchanged.
- tests/test_ontology.py covers the core, the HTTP fetch, and async parallel.

Also includes a pre-existing, in-progress credential-resolution refactor that was already present in the working tree (diffbot/_auth.py and its consumers in cli/_common.py, cli/dql.py, conftest, tests, README); __init__.py carries both sets of changes, so they commit together.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 1dc266f commit c3001a6

14 files changed

Lines changed: 524 additions & 125 deletions

README.md

Lines changed: 40 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -18,12 +18,38 @@ pip install -e ".[dev]"
1818
## Usage
1919

2020
### Authentication
21-
Set your Diffbot API token in your environment or .env.
21+
22+
The CLI and the library can share a single credential. The token always has to be
23+
passed to the client explicitly, but `resolve_token()` gives you the same lookup the
24+
CLI uses, in this order:
25+
26+
1. An explicit token passed to `resolve_token(token)`.
27+
2. The `DIFFBOT_API_TOKEN` environment variable.
28+
3. A `DIFFBOT_API_TOKEN=...` line in `~/.diffbot/credentials`.
29+
30+
Set it once and it works for both the CLI and your scripts. Either export it:
2231

2332
```bash
2433
export DIFFBOT_API_TOKEN=<TOKEN>
2534
```
2635

36+
…or write it to the shared credentials file (handy for keeping it out of your shell environment):
37+
38+
```bash
39+
mkdir -p ~/.diffbot
40+
printf 'DIFFBOT_API_TOKEN=%s\n' '<TOKEN>' > ~/.diffbot/credentials
41+
chmod 600 ~/.diffbot/credentials
42+
```
43+
44+
With either in place, resolve the token and pass it to the client:
45+
46+
```python
47+
from diffbot import Diffbot, resolve_token
48+
49+
db = Diffbot(token=resolve_token()) # from env var or ~/.diffbot/credentials
50+
data = db.extract("https://www.example.com")
51+
```
52+
2753
### Extract structured content
2854
```python
2955
from diffbot import Diffbot
@@ -166,7 +192,15 @@ asyncio.run(main())
166192

167193
## CLI
168194

169-
This library also includes a CLI.
195+
This library also includes a CLI exposed as the `db` command.
196+
197+
To make `db` available from anywhere, install it as an isolated tool with [uv](https://docs.astral.sh/uv/):
198+
199+
```bash
200+
uv tool install .
201+
```
202+
203+
This drops a `db` executable into `~/.local/bin` (ensure it is on your `PATH`). Use `--force` to reinstall or upgrade after changes, or `--editable` to have source edits take effect immediately. Alternatively, a plain `pip install .` (or `pip install -e .`) also installs the `db` entry point into the active environment.
170204

171205
```bash
172206
export DIFFBOT_API_TOKEN=your-token-here
@@ -189,7 +223,9 @@ Run the mock test suite:
189223
python -m pytest
190224
```
191225

192-
Run live integration tests against the real API (requires a valid token):
226+
Run live integration tests against the real API (requires a valid token).
227+
The token is resolved the same way as everywhere else — the `DIFFBOT_API_TOKEN`
228+
environment variable or `~/.diffbot/credentials`:
193229
```bash
194-
DIFFBOT_TOKEN=your_token python -m pytest -m live
230+
DIFFBOT_API_TOKEN=your_token python -m pytest -m live
195231
```

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -71,5 +71,5 @@ include = [
7171
]
7272

7373
[tool.pytest.ini_options]
74-
markers = ["live: marks tests as live integration tests requiring a real DIFFBOT_TOKEN"]
74+
markers = ["live: marks tests as live integration tests requiring a real DIFFBOT_API_TOKEN"]
7575
addopts = "-m 'not live'"

src/diffbot/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -4,6 +4,7 @@
44

55
__version__ = "0.1.0"
66

7+
from ._auth import resolve_token
78
from .client import Diffbot, DiffbotAsync
89
from .crawl import CrawlEvent, CrawlEventType
910
from .errors import (
@@ -14,12 +15,15 @@
1415
RateLimitError,
1516
ValidationError,
1617
)
18+
from .ontology import Ontology
1719

1820
__all__ = [
1921
"Diffbot",
2022
"DiffbotAsync",
23+
"resolve_token",
2124
"CrawlEvent",
2225
"CrawlEventType",
26+
"Ontology",
2327
"DiffbotError",
2428
"AuthError",
2529
"ExtractionError",

src/diffbot/_auth.py

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
1+
"""Shared Diffbot credential resolution for both the library and the CLI.
2+
3+
The same lookup chain is used everywhere so a single credential works for the
4+
``db`` CLI and any Python script that constructs a client:
5+
6+
1. An explicit token passed to the client / function.
7+
2. The ``DIFFBOT_API_TOKEN`` environment variable.
8+
3. A ``DIFFBOT_API_TOKEN=...`` line in ``~/.diffbot/credentials``.
9+
"""
10+
11+
import os
12+
import pathlib
13+
from typing import Optional
14+
15+
TOKEN_ENV_VAR = "DIFFBOT_API_TOKEN"
16+
CREDENTIALS_PATH = pathlib.Path.home() / ".diffbot" / "credentials"
17+
18+
19+
def _read_credentials_file() -> str:
20+
if not CREDENTIALS_PATH.exists():
21+
return ""
22+
for line in CREDENTIALS_PATH.read_text().splitlines():
23+
line = line.strip()
24+
if line.startswith(f"{TOKEN_ENV_VAR}="):
25+
return line[len(TOKEN_ENV_VAR) + 1:].strip()
26+
return ""
27+
28+
29+
def resolve_token(token: Optional[str] = None) -> str:
30+
"""Resolve a Diffbot API token from the explicit argument, env var, or file.
31+
32+
Returns an empty string if no token can be found.
33+
"""
34+
if token and token.strip():
35+
return token.strip()
36+
37+
env_token = os.environ.get(TOKEN_ENV_VAR, "").strip()
38+
if env_token:
39+
return env_token
40+
41+
return _read_credentials_file()
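
Review note: the lookup order in `_auth.py` can be exercised in isolation with a standalone sketch. This is not the module above. `resolve_token_sketch` and its `env` and `credentials_path` parameters are hypothetical, introduced only so the precedence can be checked without touching the real environment or `~/.diffbot/credentials`.

```python
import pathlib
import tempfile
from typing import Mapping, Optional

TOKEN_ENV_VAR = "DIFFBOT_API_TOKEN"


def resolve_token_sketch(
    token: Optional[str] = None,
    *,
    env: Mapping[str, str],
    credentials_path: pathlib.Path,
) -> str:
    """Hypothetical standalone copy of the chain: argument, env var, then file."""
    if token and token.strip():
        return token.strip()
    env_token = env.get(TOKEN_ENV_VAR, "").strip()
    if env_token:
        return env_token
    if not credentials_path.exists():
        return ""
    for line in credentials_path.read_text().splitlines():
        line = line.strip()
        if line.startswith(f"{TOKEN_ENV_VAR}="):
            return line[len(TOKEN_ENV_VAR) + 1:].strip()
    return ""


with tempfile.TemporaryDirectory() as tmp:
    creds = pathlib.Path(tmp) / "credentials"
    creds.write_text("DIFFBOT_API_TOKEN=from-file\n")
    env = {TOKEN_ENV_VAR: "from-env"}

    # An explicit argument wins over everything and is stripped.
    explicit = resolve_token_sketch(" from-arg ", env=env, credentials_path=creds)
    # Without an argument, the environment mapping is consulted next.
    from_env = resolve_token_sketch(env=env, credentials_path=creds)
    # With an empty environment, the credentials file is the last resort.
    from_file = resolve_token_sketch(env={}, credentials_path=creds)
    # Nothing anywhere yields an empty string rather than an exception.
    missing = resolve_token_sketch(env={}, credentials_path=creds.with_name("absent"))
```

Injecting the environment and path is a choice made for this sketch only; the committed function reads `os.environ` and the module-level `CREDENTIALS_PATH` directly.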

src/diffbot/cli/_common.py

Lines changed: 8 additions & 23 deletions
@@ -1,35 +1,20 @@
1-
import os
2-
import pathlib
3-
41
import click
52

6-
from diffbot import Diffbot
7-
8-
CREDENTIALS_PATH = pathlib.Path.home() / ".diffbot" / "credentials"
9-
10-
11-
def resolve_token() -> str:
12-
"""Return the Diffbot API token from the env var, falling back to ~/.diffbot/credentials."""
13-
token = os.environ.get("DIFFBOT_API_TOKEN", "").strip()
14-
if token:
15-
return token
16-
17-
if CREDENTIALS_PATH.exists():
18-
for line in CREDENTIALS_PATH.read_text().splitlines():
19-
line = line.strip()
20-
if line.startswith("DIFFBOT_API_TOKEN="):
21-
return line[len("DIFFBOT_API_TOKEN="):].strip()
22-
23-
return ""
3+
from diffbot import Diffbot, resolve_token
4+
from diffbot._auth import CREDENTIALS_PATH, TOKEN_ENV_VAR
245

256

267
def get_client() -> Diffbot:
8+
"""Build a Diffbot client using the shared credential resolution chain.
9+
10+
Looks at the DIFFBOT_API_TOKEN env var, then ~/.diffbot/credentials.
11+
"""
2712
token = resolve_token()
2813
if not token:
2914
click.echo(
3015
"Error: no Diffbot API token found.\n"
31-
" Set a DIFFBOT_API_TOKEN environment variable, or\n"
32-
f" write 'DIFFBOT_API_TOKEN=YOUR_TOKEN' to {CREDENTIALS_PATH}",
16+
f" Set a {TOKEN_ENV_VAR} environment variable, or\n"
17+
f" write '{TOKEN_ENV_VAR}=YOUR_TOKEN' to {CREDENTIALS_PATH}",
3318
err=True,
3419
)
3520
raise click.Abort()
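
Review note: `get_client()` turns an empty token into a CLI error, but `resolve_token()` itself returns `""` when nothing is found, so a script that passes the result straight to a client gets no such check. A minimal sketch of a library-side guard follows. `require_token` and `MissingTokenError` are hypothetical names, not part of the package, and the resolver is passed in as a callable so the example runs without `diffbot` installed.

```python
from typing import Callable


class MissingTokenError(RuntimeError):
    """Hypothetical error for scripts; not part of the diffbot package."""


def require_token(resolve: Callable[[], str]) -> str:
    """Return the resolved token, or raise instead of handing "" to a client."""
    token = resolve()
    if not token:
        raise MissingTokenError(
            "no Diffbot API token found: set DIFFBOT_API_TOKEN or write "
            "'DIFFBOT_API_TOKEN=YOUR_TOKEN' to ~/.diffbot/credentials"
        )
    return token


# Stand-in resolvers so the sketch is self-contained.
found = require_token(lambda: "abc123")
try:
    require_token(lambda: "")
    failed = False
except MissingTokenError:
    failed = True
```

In a real script the callable would presumably be the exported `resolve_token`, for example `require_token(resolve_token)`.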

src/diffbot/cli/dql.py

Lines changed: 3 additions & 1 deletion
@@ -15,7 +15,9 @@
1515
from diffbot import DiffbotError
1616

1717
from . import ontology
18-
from ._common import get_client, resolve_token
18+
from diffbot import resolve_token
19+
20+
from ._common import get_client
1921

2022

2123
class _DqlGroup(click.Group):

src/diffbot/cli/ontology.py

Lines changed: 31 additions & 87 deletions
@@ -1,14 +1,24 @@
1+
"""CLI-side ontology access: a disk cache over the storage-agnostic core.
2+
3+
The navigation logic lives in :mod:`diffbot.ontology` (the `Ontology` class).
4+
This module adds the CLI's caching policy on top: the ontology is read once from
5+
``~/.diffbot/ontology.json`` (populated by `db dql init`) and held in
6+
``_CACHE``. The module-level functions preserve the historical CLI surface and
7+
simply delegate to an `Ontology` built from the cached document.
8+
"""
9+
110
import json
211
import pathlib
3-
import re
4-
from typing import Any, Dict, List, Optional
12+
from typing import Any, Dict, List, Optional, Tuple
13+
14+
from diffbot.ontology import Ontology
515

616
ONTOLOGY_PATH = pathlib.Path.home() / ".diffbot" / "ontology.json"
717

818
_CACHE: Dict[str, Any] = {}
919

1020

11-
def load() -> Dict[str, Any]:
21+
def _data() -> Dict[str, Any]:
1222
if "data" not in _CACHE:
1323
if not ONTOLOGY_PATH.exists():
1424
raise FileNotFoundError(
@@ -18,113 +28,47 @@ def load() -> Dict[str, Any]:
1828
return _CACHE["data"]
1929

2030

31+
def _ontology() -> Ontology:
32+
return Ontology(_data())
33+
34+
2135
def list_types() -> List[str]:
22-
return sorted(load().get("types", {}).keys())
36+
return _ontology().types()
2337

2438

2539
def list_composites() -> List[str]:
26-
return sorted(load().get("composites", {}).keys())
40+
return _ontology().composites()
2741

2842

2943
def list_enums() -> List[str]:
30-
return sorted(load().get("enums", {}).keys())
44+
return _ontology().enums()
3145

3246

3347
def list_taxonomies() -> List[str]:
34-
return sorted(load().get("taxonomies", {}).keys())
35-
36-
37-
def _fields_of(container: Dict[str, Any], type_name: str) -> Dict[str, Any]:
38-
entry = container.get(type_name)
39-
if entry is None:
40-
raise KeyError(f"Unknown name: {type_name}")
41-
return entry.get("fields", {})
48+
return _ontology().taxonomies()
4249

4350

4451
def fields_for(type_name: str) -> Dict[str, Any]:
45-
data = load()
46-
types = data.get("types", {})
47-
composites = data.get("composites", {})
48-
if type_name in types:
49-
return _fields_of(types, type_name)
50-
if type_name in composites:
51-
return _fields_of(composites, type_name)
52-
raise KeyError(f"{type_name} is not a known entity type or composite")
52+
return _ontology().fields_for(type_name)
5353

5454

5555
def format_field(name: str, meta: Dict[str, Any]) -> str:
56-
t = meta.get("type", "?")
57-
if t == "LinkedEntity":
58-
le = meta.get("leType") or []
59-
if le:
60-
t = f"LinkedEntity ({le[0]})"
61-
flags = []
62-
if meta.get("isList"):
63-
flags.append("isList")
64-
if meta.get("isComposite"):
65-
flags.append("isComposite")
66-
if meta.get("isEnum"):
67-
flags.append("isEnum")
68-
if meta.get("isDeprecated"):
69-
flags.append("DEPRECATED")
70-
suffix = "".join(f" [{f}]" for f in flags)
71-
return f"{name}: [{t}]{suffix}"
72-
73-
74-
def filter_fields(fields: Dict[str, Any], search: Optional[str], include_deprecated: bool = False) -> List[tuple]:
75-
pattern = re.compile(search, re.IGNORECASE) if search else None
76-
out = []
77-
for name, meta in fields.items():
78-
if not include_deprecated and meta.get("isDeprecated"):
79-
continue
80-
if pattern and not pattern.search(name):
81-
continue
82-
out.append((name, meta))
83-
return out
56+
return Ontology.format_field(name, meta)
8457

8558

86-
def taxonomy_values(name: str, search: Optional[str] = None) -> List[str]:
87-
data = load()
88-
tax = data.get("taxonomies", {}).get(name)
89-
if tax is None:
90-
raise KeyError(f"Unknown taxonomy: {name}")
91-
pattern = re.compile(search, re.IGNORECASE) if search else None
92-
out: List[str] = []
59+
def filter_fields(
60+
fields: Dict[str, Any], search: Optional[str], include_deprecated: bool = False
61+
) -> List[Tuple[str, Dict[str, Any]]]:
62+
return Ontology.filter_fields(fields, search, include_deprecated=include_deprecated)
9363

94-
def walk(node: Dict[str, Any]) -> None:
95-
n = node.get("name")
96-
if n and (pattern is None or pattern.search(n)):
97-
out.append(n)
98-
for child in node.get("children", []) or []:
99-
walk(child)
10064

101-
for cat in tax.get("categories", []) or []:
102-
walk(cat)
103-
return out
65+
def taxonomy_values(name: str, search: Optional[str] = None) -> List[str]:
66+
return _ontology().taxonomy_values(name, search)
10467

10568

10669
def enum_values(name: str) -> List[str]:
107-
data = load()
108-
enum = data.get("enums", {}).get(name)
109-
if enum is None:
110-
raise KeyError(f"Unknown enum: {name}")
111-
return list(enum.get("values", []))
70+
return _ontology().enum_values(name)
11271

11372

11473
def find_named(search: str) -> List[str]:
115-
pattern = re.compile(search, re.IGNORECASE)
116-
found = set()
117-
118-
def walk(node: Any) -> None:
119-
if isinstance(node, dict):
120-
n = node.get("name")
121-
if isinstance(n, str) and pattern.search(n):
122-
found.add(n)
123-
for v in node.values():
124-
walk(v)
125-
elif isinstance(node, list):
126-
for v in node:
127-
walk(v)
128-
129-
walk(load())
130-
return sorted(found)
74+
return _ontology().find_named(search)
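
Review note: `diffbot/ontology.py` is not shown in this part of the diff, so the following is only a sketch of what a storage-agnostic wrapper might look like, reconstructed from the helper bodies removed above. `OntologySketch` is a hypothetical name, it covers a subset of the listed methods, and the toy document (`Widget`, `Color`, and their fields) is made up for illustration.

```python
import re
from typing import Any, Dict, List, Optional, Tuple


class OntologySketch:
    """Hypothetical stand-in: wraps an already-loaded ontology document."""

    def __init__(self, data: Dict[str, Any]) -> None:
        # No file or network access here; the caller decides where data comes from.
        self._data = data

    def types(self) -> List[str]:
        return sorted(self._data.get("types", {}).keys())

    def fields_for(self, type_name: str) -> Dict[str, Any]:
        # Entity types are checked before composites, as in the removed code.
        for section in ("types", "composites"):
            entry = self._data.get(section, {}).get(type_name)
            if entry is not None:
                return entry.get("fields", {})
        raise KeyError(f"{type_name} is not a known entity type or composite")

    @staticmethod
    def filter_fields(
        fields: Dict[str, Any],
        search: Optional[str],
        include_deprecated: bool = False,
    ) -> List[Tuple[str, Dict[str, Any]]]:
        # Case-insensitive regex match on the field name; deprecated fields
        # are hidden unless explicitly requested.
        pattern = re.compile(search, re.IGNORECASE) if search else None
        return [
            (name, meta)
            for name, meta in fields.items()
            if (include_deprecated or not meta.get("isDeprecated"))
            and (pattern is None or pattern.search(name))
        ]

    def enum_values(self, name: str) -> List[str]:
        enum = self._data.get("enums", {}).get(name)
        if enum is None:
            raise KeyError(f"Unknown enum: {name}")
        return list(enum.get("values", []))


# Made-up document in the shape the removed helpers expected.
doc = {
    "types": {
        "Widget": {
            "fields": {
                "name": {"type": "String"},
                "legacyName": {"type": "String", "isDeprecated": True},
                "weight": {"type": "Integer"},
            }
        }
    },
    "enums": {"Color": {"values": ["red", "green"]}},
}

onto = OntologySketch(doc)
fields = onto.fields_for("Widget")
names = [n for n, _ in OntologySketch.filter_fields(fields, "name")]
all_names = [
    n for n, _ in OntologySketch.filter_fields(fields, "name", include_deprecated=True)
]
```

Because the class only holds the parsed document, the CLI's `_CACHE` and `ONTOLOGY_PATH` can stay in `cli/ontology.py` while a library consumer builds the same object from a freshly fetched document.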
