BrentLab · cmatKhan · Mar 4, 2026 · Feb 18, 2026 · Feb 18, 2026 · Mar 4, 2026
diff --git a/docs/tutorials/virtual_db_tutorial.ipynb b/docs/tutorials/virtual_db_tutorial.ipynb
diff --git a/docs/virtual_db.md b/docs/virtual_db.md
@@ -23,6 +23,66 @@ For comparative analysis datasets, VirtualDB creates:
 See the [configuration guide](virtual_db_configuration.md) for setup details
 and the [tutorial](tutorials/virtual_db_tutorial.ipynb) for usage examples.
 
+## Advanced Usage
+
+After any public method is called (e.g. `vdb.tables()`), the underlying DuckDB
+connection is available as `vdb._db`. You can use `_db` to execute arbitrary SQL
+on the database, e.g. to create additional views or an in-memory table.
+
+Custom **views** created this way appear in `tables()`, `describe()`, and
+`get_fields()` automatically because those methods query DuckDB's
+`information_schema`. Custom **tables** do not appear in `tables()` (which
+only lists views), but are fully queryable via `vdb.query()`.
+
+Call at least one public method first to ensure the connection is initialized
+before accessing `_db` directly.
+
+Example -- create a materialized analysis table:
+
+    # Trigger view registration
+    vdb.tables()
+
+    # Materialize a complex query as an in-memory table (it lasts only as long as the connection).
+    # This example selects one "best" Hackett-2020 sample per regulator
+    # using a priority system: ZEV+P > GEV+P > GEV+M.
+    vdb._db.execute("""
+        CREATE OR REPLACE TABLE hackett_analysis_set AS
+        WITH regulator_tiers AS (
+            SELECT
+                regulator_locus_tag,
+                CASE
+                    WHEN BOOL_OR(mechanism = 'ZEV' AND restriction = 'P') THEN 1
+                    WHEN BOOL_OR(mechanism = 'GEV' AND restriction = 'P') THEN 2
+                    ELSE 3
+                END AS tier
+            FROM hackett_meta
+            WHERE regulator_locus_tag NOT IN ('Z3EV', 'GEV')
+            GROUP BY regulator_locus_tag
+        ),
+        tier_filter AS (
+            SELECT
+                h.sample_id, h.regulator_locus_tag, h.regulator_symbol,
+                h.mechanism, h.restriction, h.date, h.strain, t.tier
+            FROM hackett_meta h
+            JOIN regulator_tiers t USING (regulator_locus_tag)
+            WHERE
+                (t.tier = 1 AND h.mechanism = 'ZEV' AND h.restriction = 'P')
+                OR (t.tier = 2 AND h.mechanism = 'GEV' AND h.restriction = 'P')
+                OR (t.tier = 3 AND h.mechanism = 'GEV' AND h.restriction = 'M')
+        )
+        SELECT DISTINCT
+            sample_id, regulator_locus_tag, regulator_symbol,
+            mechanism, restriction, date, strain
+        FROM tier_filter
+        WHERE regulator_symbol NOT IN ('GCN4', 'RDS2', 'SWI1', 'MAC1')
+        ORDER BY regulator_locus_tag, sample_id
+    """)
+
+    df = vdb.query("SELECT * FROM hackett_analysis_set")
+
+Tables and views created this way are in-memory only and do not persist across
+VirtualDB instances. They exist for the lifetime of the DuckDB connection.
+
 ## API Reference
 
 ::: tfbpapi.virtual_db.VirtualDB
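Review note on the view/table distinction in the Advanced Usage section above: custom tables are missing from `tables()` because that method lists only views, yet they stay queryable. The sketch below illustrates the idea with Python's standard-library `sqlite3` as a stand-in for DuckDB. This is an assumption made so the example runs without extra dependencies; SQLite exposes its catalog as `sqlite_master` rather than `information_schema`. The helper `list_views` and the table and view names are hypothetical and are not part of `tfbpapi`.

```python
import sqlite3

# Stand-in for the DuckDB connection behind `vdb._db` (assumption: sqlite3
# is used here only so the sketch runs with the standard library).
con = sqlite3.connect(":memory:")

con.execute("CREATE TABLE samples (sample_id INTEGER, mechanism TEXT)")
con.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [(1, "ZEV"), (2, "GEV"), (3, "ZEV")],
)

# A custom view and a custom materialized table built from the same query.
con.execute("CREATE VIEW zev_view AS SELECT * FROM samples WHERE mechanism = 'ZEV'")
con.execute("CREATE TABLE zev_table AS SELECT * FROM samples WHERE mechanism = 'ZEV'")


def list_views(connection):
    """Hypothetical analogue of a catalog listing restricted to views."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'view' ORDER BY name"
    )
    return [name for (name,) in rows]


# The view is listed; the table is not, because the listing filters on views.
views = list_views(con)

# The table is still fully queryable by name.
table_rows = con.execute(
    "SELECT sample_id FROM zev_table ORDER BY sample_id"
).fetchall()
```

With DuckDB the same filter would presumably be applied to `information_schema.tables`, which distinguishes views from base tables in its `table_type` column.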

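Review note on the tag hierarchy documented in the `docs/virtual_db_configuration.md` changes below: the override rule (dataset-level tags win over repo-level tags with the same key) amounts to a shallow dictionary merge. The following is a minimal sketch of that rule only; `merge_tags` is a hypothetical helper, not the actual `tfbpapi` implementation, and the tag values are copied from the YAML example in the diff.

```python
def merge_tags(repo_tags, dataset_tags):
    """Hypothetical helper: combine tags so dataset-level keys take precedence."""
    merged = dict(repo_tags)      # start from the repo-level tags
    merged.update(dataset_tags)   # dataset-level values override matching keys
    return merged


# Values taken from the BrentLab/harbison_2004 example in the YAML below.
repo_level = {"assay": "binding", "organism": "yeast"}
dataset_level = {"assay": "chip-chip"}

harbison_tags = merge_tags(repo_level, dataset_level)

# A dataset with no tags of its own simply inherits the repo-level tags.
kemmeren_tags = merge_tags({"assay": "perturbation", "organism": "yeast"}, {})
```

Copying the repo-level dictionary before updating keeps the inputs unchanged, so the same repo-level tags can be reused for every dataset in the repository.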
diff --git a/docs/virtual_db_configuration.md b/docs/virtual_db_configuration.md
@@ -10,8 +10,10 @@ levels.
 repositories:
   # Each repository defines a "table" in the virtual database
   BrentLab/harbison_2004:
-    # REQUIRED: Specify which field is the sample identifier. At this level, it means
-    # that all datasets have a field `sample_id` that uniquely identifies samples.
+    # REQUIRED: Specify which column is the sample identifier. The `field`
+    # value is the actual column name in the parquet data. At the repo level,
+    # it applies to all datasets in this repository. If not specified at
+    # either level, the default column name "sample_id" is assumed.
     sample_id:
       field: sample_id
     # Repository-wide properties (apply to all datasets in this repository)
@@ -47,8 +49,9 @@ repositories:
       kemmeren_2014:
         # optional -- see the note for `db_name` in harbison above
         db_name: kemmeren
-        # REQUIRED: If `sample_id` isn't defined at the repo level, then it must be
-        # defined at the dataset level for each dataset in the repo
+        # REQUIRED: If `sample_id` isn't defined at the repo level, it must be
+        # defined at the dataset level. The `field` value is the actual column
+        # name in the parquet data (does not need to be literally "sample_id").
         sample_id:
           field: sample_id
         # Same logical fields, different physical paths
@@ -144,6 +147,62 @@ during metadata extraction and query filtering.
 2. **Type consistency**: When source data might be extracted with incorrect type
 3. **Performance**: Helps with query optimization and prevents type mismatches
 
+## Tags
+
+Tags are arbitrary string key/value pairs for annotating datasets. They follow
+the same hierarchy as property mappings: repo-level tags apply to all datasets
+in the repository, dataset-level tags apply only to that dataset, and
+dataset-level tags override repo-level tags with the same key.
+
+```yaml
+repositories:
+  BrentLab/harbison_2004:
+    # Repo-level tags apply to all datasets in this repository
+    tags:
+      assay: binding
+      organism: yeast
+    dataset:
+      harbison_2004:
+        sample_id:
+          field: sample_id
+        # Dataset-level tags override repo-level tags with the same key
+        tags:
+          assay: chip-chip
+
+  BrentLab/kemmeren_2014:
+    tags:
+      assay: perturbation
+      organism: yeast
+    dataset:
+      kemmeren_2014:
+        sample_id:
+          field: sample_id
+```
+
+Access merged tags via `vdb.get_tags(db_name)`, identifying datasets by
+their name as it appears in `vdb.tables()`:
+
+```python
+from tfbpapi.virtual_db import VirtualDB
+
+vdb = VirtualDB("datasets.yaml")
+
+# Returns {"assay": "chip-chip", "organism": "yeast"}
+# (dataset-level assay overrides repo-level)
+vdb.get_tags("harbison")
+
+# Returns {"assay": "perturbation", "organism": "yeast"}
+vdb.get_tags("kemmeren")
+```
+
+The underlying `MetadataConfig` (available as `vdb.config`) exposes the same
+data via `(repo_id, config_name)` pairs for programmatic or developer use:
+
+```python
+# Equivalent to vdb.get_tags("harbison") above
+vdb.config.get_tags("BrentLab/harbison_2004", "harbison_2004")
+```
+
 ## Comparative Datasets
 
 Comparative datasets differ from other dataset types in that they represent
@@ -152,9 +211,10 @@ Each row relates 2+ samples from other datasets.
 
 ### Structure
 
-Comparative datasets use `source_sample` fields instead of a single `sample_id`:
+Comparative datasets use `source_sample` fields instead of a single sample
+identifier column:
 - Multiple fields with `role: source_sample`
-- Each contains composite identifier: `"repo_id;config_name;sample_id"`
+- Each contains a composite identifier: `"repo_id;config_name;sample_id_value"`
 - Example: `binding_id = "BrentLab/harbison_2004;harbison_2004;42"`
 
 ### Fields
@@ -206,10 +266,11 @@ build on each other. Using `harbison` as an example primary dataset and
 
 **1. Metadata view**
 
-One row per unique `sample_id`. Derived columns from the configuration
-(e.g., `carbon_source`, `temperature_celsius`) are resolved here using
-datacard definitions, factor aliases, and missing value labels. This is
-the primary view for querying sample-level metadata.
+One row per unique sample identifier (the column configured via
+`sample_id: {field: <column_name>}`). Derived columns from the
+configuration (e.g., `carbon_source`, `temperature_celsius`) are resolved
+here using datacard definitions, factor aliases, and missing value labels.
+This is the primary view for querying sample-level metadata.
 
 **2. Raw data view**
 
@@ -239,7 +300,7 @@ or filter by source dataset without parsing composite IDs in SQL.
 ```
 __harbison_parquet  (raw parquet, not directly exposed)
   |
-  +-> harbison_meta  (deduplicated, one row per sample_id,
+  +-> harbison_meta  (deduplicated, one row per sample identifier,
   |                   with derived columns from config)
   |
   +-> harbison  (full parquet joined to harbison_meta)