This repository was archived by the owner on Jan 22, 2026. It is now read-only.

Commit c93f85d

Refactor batch size and snapshot interval handling to use configurable environment variables
1 parent 41c9518 commit c93f85d

4 files changed, with 49 additions and 18 deletions

docs/internals.md

Lines changed: 9 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -19,7 +19,7 @@ The schema has six main tables:
1919
- `dependency_changes` records every add, modify, or remove event
2020
- `dependency_snapshots` stores full dependency state at intervals
2121

22-
Snapshots exist because replaying thousands of change records to answer "what dependencies existed at commit X?" would be slow. Instead, we store the complete dependency set every 20 commits (`SNAPSHOT_INTERVAL`). Point-in-time queries find the nearest snapshot and replay only the changes since then.
22+
Snapshots exist because replaying thousands of change records to answer "what dependencies existed at commit X?" would be slow. Instead, we store the complete dependency set every 50 commits by default. Point-in-time queries find the nearest snapshot and replay only the changes since then.
2323

2424
## Git Access
2525

@@ -47,12 +47,12 @@ When you run `git pkgs init` (see [`commands/init.rb`](../lib/git/pkgs/commands/
4747
2. Switches to bulk write mode (WAL, synchronous off, large cache)
4848
3. Walks commits chronologically
4949
4. For each commit with manifest changes, calls `analyzer.analyze_commit`
50-
5. Batches inserts in transactions of 100 commits
51-
6. Creates dependency snapshots every 20 commits that changed dependencies
50+
5. Batches inserts in transactions of 500 commits
51+
6. Creates dependency snapshots every 50 commits that changed dependencies
5252
7. Creates indexes after all data is loaded
5353
8. Switches back to normal sync mode
5454

55-
Deferring index creation until the end speeds things up considerably. The batch size of 100 is a balance between transaction overhead and memory usage.
55+
Deferring index creation until the end speeds things up considerably. Both batch size and snapshot interval are configurable via environment variables (see Performance Notes below).
5656

5757
## Incremental Updates
5858

@@ -139,6 +139,9 @@ ActiveRecord models live in [`lib/git/pkgs/models/`](../lib/git/pkgs/models/). T
139139

140140
## Performance Notes
141141

142-
Typical init speed is around 300 commits per second. The main bottlenecks are git blob reads and bibliothecary parsing. The blob OID cache helps a lot: if a Gemfile hasn't changed in 50 commits, we parse it once and reuse the result. The manifest path regex filter also helps by skipping commits that only touch source files.
142+
Typical init speed is around 75-300 commits per second depending on the repository. The main bottlenecks are git blob reads and bibliothecary parsing. The blob OID cache helps a lot: if a Gemfile hasn't changed in 50 commits, we parse it once and reuse the result. The manifest path regex filter also helps by skipping commits that only touch source files.
143143

144-
For repositories with long histories, the database file can grow to tens of megabytes. The periodic snapshots trade storage for query speed. You could tune `SNAPSHOT_INTERVAL` if you care more about one than the other.
144+
For repositories with long histories, the database file can grow to tens of megabytes. The periodic snapshots trade storage for query speed. Two environment variables let you tune this:
145+
146+
- `GIT_PKGS_BATCH_SIZE` - Number of commits per database transaction (default: 500). Larger batches reduce transaction overhead but use more memory.
147+
- `GIT_PKGS_SNAPSHOT_INTERVAL` - Store full dependency state every N commits with changes (default: 50). Lower values speed up point-in-time queries but increase database size.
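
The snapshot paragraph in this file describes point-in-time queries as "find the nearest snapshot and replay only the changes since then". A minimal sketch of that idea follows. It is not the project's query code: the `Snapshot` and `Change` structs, the `position` field (a commit's index in history), the `:added`/`:modified`/`:removed` labels, and `dependencies_at` are all hypothetical stand-ins for the `dependency_snapshots` and `dependency_changes` tables.

```ruby
# Hypothetical in-memory stand-ins for the two tables; the real schema
# is not shown in this diff.
Snapshot = Struct.new(:position, :deps) # deps: { name => requirement }
Change   = Struct.new(:position, :type, :name, :requirement)

# Rebuild the dependency set as of the commit at `position`.
def dependencies_at(position, snapshots, changes)
  # Nearest snapshot at or before the target commit, if one exists.
  base  = snapshots.select { |s| s.position <= position }.max_by(&:position)
  deps  = base ? base.deps.dup : {}
  start = base ? base.position : -1

  # Replay only the changes recorded after that snapshot.
  changes
    .select { |c| c.position > start && c.position <= position }
    .sort_by(&:position)
    .each do |c|
      case c.type
      when :added, :modified then deps[c.name] = c.requirement
      when :removed          then deps.delete(c.name)
      end
    end

  deps
end

snapshots = [Snapshot.new(50, { "rails" => "~> 7.0", "pg" => ">= 1.0" })]
changes = [
  Change.new(10, :added, "rails", "~> 7.0"),
  Change.new(20, :added, "pg", ">= 1.0"),
  Change.new(55, :modified, "rails", "~> 7.1"),
  Change.new(60, :removed, "pg", nil)
]

p dependencies_at(57, snapshots, changes)
```

In this sketch a query at position 57 starts from the snapshot at 50 and replays one change instead of three. A smaller interval shortens that replay at the cost of storing more full snapshots, which is the trade-off `GIT_PKGS_SNAPSHOT_INTERVAL` exposes.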

lib/git/pkgs.rb

Lines changed: 9 additions & 1 deletion
@@ -44,19 +44,27 @@ class NotInitializedError < Error; end
4444
class NotInGitRepoError < Error; end
4545

4646
class << self
47-
attr_accessor :quiet, :git_dir, :work_tree, :db_path
47+
attr_accessor :quiet, :git_dir, :work_tree, :db_path, :batch_size, :snapshot_interval
4848

4949
def configure_from_env
5050
@git_dir ||= presence(ENV["GIT_DIR"])
5151
@work_tree ||= presence(ENV["GIT_WORK_TREE"])
5252
@db_path ||= presence(ENV["GIT_PKGS_DB"])
53+
@batch_size ||= int_presence(ENV["GIT_PKGS_BATCH_SIZE"])
54+
@snapshot_interval ||= int_presence(ENV["GIT_PKGS_SNAPSHOT_INTERVAL"])
5355
end
5456

5557
def reset_config!
5658
@quiet = false
5759
@git_dir = nil
5860
@work_tree = nil
5961
@db_path = nil
62+
@batch_size = nil
63+
@snapshot_interval = nil
64+
end
65+
66+
def int_presence(value)
67+
value && !value.empty? ? value.to_i : nil
6068
end
6169

6270
def presence(value)
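
One thing worth noting about `int_presence` as added above: `String#to_i` returns `0` for non-numeric input such as `"abc"`, and also for `"0"`. A zero snapshot interval would reach `dependency_commit_count % snapshot_interval` and raise `ZeroDivisionError`, while a zero batch size would make `pending_commits.size >= batch_size` true on every commit. The sketch below shows one possible guard. It is not part of this commit, and `positive_int_presence` is a hypothetical name.

```ruby
# Hypothetical stricter variant of int_presence: returns a positive Integer,
# or nil so that callers fall back to their DEFAULT_* constants.
def positive_int_presence(value)
  return nil if value.nil? || value.strip.empty?

  # Kernel#Integer with exception: false returns nil instead of raising
  # when the string is not a valid base-10 integer.
  parsed = Integer(value.strip, 10, exception: false)
  parsed if parsed && parsed.positive?
end

# What the current helper would produce for a bad value:
p "abc".to_i

# What the stricter helper produces for a few inputs:
["500", "abc", "0", "-5", "", nil].each do |raw|
  puts "#{raw.inspect} -> #{positive_int_presence(raw).inspect}"
end
```

Returning `nil` for invalid values keeps the existing `Git::Pkgs.batch_size || DEFAULT_BATCH_SIZE` fallback working unchanged. Whether to fall back silently or warn the user is a design choice this sketch leaves open.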

lib/git/pkgs/commands/branch.rb

Lines changed: 12 additions & 4 deletions
@@ -6,8 +6,16 @@ module Commands
66
class Branch
77
include Output
88

9-
BATCH_SIZE = 100
10-
SNAPSHOT_INTERVAL = 20
9+
DEFAULT_BATCH_SIZE = 500
10+
DEFAULT_SNAPSHOT_INTERVAL = 50
11+
12+
def batch_size
13+
Git::Pkgs.batch_size || DEFAULT_BATCH_SIZE
14+
end
15+
16+
def snapshot_interval
17+
Git::Pkgs.snapshot_interval || DEFAULT_SNAPSHOT_INTERVAL
18+
end
1119

1220
def initialize(args)
1321
@args = args
@@ -247,7 +255,7 @@ def bulk_process_commits(commits, branch, analyzer, total, repo)
247255

248256
snapshot = result[:snapshot]
249257

250-
if dependency_commit_count % SNAPSHOT_INTERVAL == 0
258+
if dependency_commit_count % snapshot_interval == 0
251259
snapshot.each do |(manifest_path, name), dep_info|
252260
pending_snapshots << {
253261
sha: rugged_commit.oid,
@@ -262,7 +270,7 @@ def bulk_process_commits(commits, branch, analyzer, total, repo)
262270
end
263271
end
264272

265-
flush.call if pending_commits.size >= BATCH_SIZE
273+
flush.call if pending_commits.size >= batch_size
266274
end
267275

268276
if snapshot.any?
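
To get a feel for what the new defaults change, the loop above can be approximated with a small simulation. This is only a sketch under stated assumptions: every commit is treated as a dependency-changing commit that gets queued, leftovers are flushed once at the end, and the final snapshot stored after the loop is not counted. `simulate` is a hypothetical helper, not a method in the codebase.

```ruby
# Count transaction flushes and interval snapshots for a run, mirroring the
# `% snapshot_interval` and `>= batch_size` checks in bulk_process_commits.
def simulate(dependency_commits:, batch_size:, snapshot_interval:)
  pending = 0
  flushes = 0
  snapshots = 0

  (1..dependency_commits).each do |count|
    pending += 1
    snapshots += 1 if (count % snapshot_interval).zero?

    if pending >= batch_size
      flushes += 1
      pending = 0
    end
  end

  flushes += 1 if pending.positive? # assumed final flush for leftovers
  { flushes: flushes, snapshots: snapshots }
end

old_defaults = simulate(dependency_commits: 1000, batch_size: 100, snapshot_interval: 20)
new_defaults = simulate(dependency_commits: 1000, batch_size: 500, snapshot_interval: 50)

puts "old defaults: #{old_defaults[:flushes]} flushes, #{old_defaults[:snapshots]} snapshots"
puts "new defaults: #{new_defaults[:flushes]} flushes, #{new_defaults[:snapshots]} snapshots"
```

Under these assumptions, 1000 dependency-changing commits go from 10 transactions and 50 snapshots with the old constants to 2 transactions and 20 snapshots with the new defaults.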

lib/git/pkgs/commands/init.rb

Lines changed: 19 additions & 7 deletions
@@ -6,8 +6,16 @@ module Commands
66
class Init
77
include Output
88

9-
BATCH_SIZE = 100
10-
SNAPSHOT_INTERVAL = 20 # Store snapshot every N dependency-changing commits
9+
DEFAULT_BATCH_SIZE = 500
10+
DEFAULT_SNAPSHOT_INTERVAL = 50
11+
12+
def batch_size
13+
Git::Pkgs.batch_size || DEFAULT_BATCH_SIZE
14+
end
15+
16+
def snapshot_interval
17+
Git::Pkgs.snapshot_interval || DEFAULT_SNAPSHOT_INTERVAL
18+
end
1119

1220
def initialize(args)
1321
@args = args
@@ -35,15 +43,17 @@ def run
3543

3644
info "Analyzing branch: #{branch_name}"
3745

46+
print "Loading commits..." unless Git::Pkgs.quiet
3847
walker = repo.walk(branch_name, @options[:since])
3948
commits = walker.to_a
4049
total = commits.size
50+
print "\r#{' ' * 20}\r" unless Git::Pkgs.quiet
4151

4252
stats = bulk_process_commits(commits, branch, analyzer, total)
4353

4454
branch.update(last_analyzed_sha: repo.branch_target(branch_name))
4555

46-
print "\rCreating indexes..." unless Git::Pkgs.quiet
56+
print "\rCreating indexes...#{' ' * 20}" unless Git::Pkgs.quiet
4757
Database.create_bulk_indexes
4858
Database.optimize_for_reads
4959

@@ -52,7 +62,7 @@ def run
5262
info "\rDone!#{' ' * 20}"
5363
info "Analyzed #{total} commits"
5464
info "Found #{stats[:dependency_commits]} commits with dependency changes"
55-
info "Stored #{stats[:snapshots_stored]} snapshots (every #{SNAPSHOT_INTERVAL} changes)"
65+
info "Stored #{stats[:snapshots_stored]} snapshots (every #{snapshot_interval} changes)"
5666
info "Blob cache: #{cache_stats[:cached_blobs]} unique blobs, #{cache_stats[:blobs_with_hits]} had cache hits"
5767

5868
unless @options[:no_hooks]
@@ -135,9 +145,11 @@ def bulk_process_commits(commits, branch, analyzer, total)
135145
pending_snapshots.clear
136146
end
137147

148+
progress_interval = [total / 100, 10].max
149+
138150
commits.each do |rugged_commit|
139151
processed += 1
140-
print "\rProcessing commit #{processed}/#{total}..." if !Git::Pkgs.quiet && (processed % 50 == 0 || processed == total)
152+
print "\rProcessing commit #{processed}/#{total}..." if !Git::Pkgs.quiet && (processed % progress_interval == 0 || processed == total)
141153

142154
next if rugged_commit.parents.length > 1 # skip merge commits
143155

@@ -191,7 +203,7 @@ def bulk_process_commits(commits, branch, analyzer, total)
191203
snapshot = result[:snapshot]
192204

193205
# Store snapshot at intervals
194-
if dependency_commit_count % SNAPSHOT_INTERVAL == 0
206+
if dependency_commit_count % snapshot_interval == 0
195207
snapshot.each do |(manifest_path, name), dep_info|
196208
pending_snapshots << {
197209
sha: rugged_commit.oid,
@@ -206,7 +218,7 @@ def bulk_process_commits(commits, branch, analyzer, total)
206218
end
207219
end
208220

209-
flush.call if pending_commits.size >= BATCH_SIZE
221+
flush.call if pending_commits.size >= batch_size
210222
end
211223

212224
# Always store final snapshot for the last processed commit
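
The progress output in this file now scales with repository size instead of printing every 50 commits. The arithmetic behind `progress_interval = [total / 100, 10].max` can be checked in isolation. The method wrapper below is a hypothetical convenience for illustration, since the commit uses a local variable.

```ruby
# Same expression as in bulk_process_commits: roughly 100 updates for large
# histories, but never more often than every 10 commits.
def progress_interval(total)
  [total / 100, 10].max
end

[250, 1_000, 50_000].each do |total|
  puts "#{total} commits -> update every #{progress_interval(total)} commits"
end
# 250 commits -> update every 10 commits
# 1000 commits -> update every 10 commits
# 50000 commits -> update every 500 commits
```

Because `total / 100` is integer division, any history of up to 1,099 commits updates every 10 commits, and larger histories settle at about 100 updates in total.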
