[projects/git-slug] Przyspieszenie git-pld dla 20k+ repozytoriów

arekm arekm at pld-linux.org
Tue May 12 15:08:15 CEST 2026


commit ba65da1522451ba5d1d43ef2768eaa9c77fe1abb
Author: Arkadiusz Miśkiewicz <arekm at maven.pl>
Date:   Mon Apr 6 21:34:29 2026 +0200

    Przyspieszenie git-pld dla 20k+ repozytoriów
    
    Zachowano multiprocessing.Pool (ProcessPool) — benchmarki wykazały że
    ThreadPool jest ~2-3x wolniejszy dla operacji subprocess z powodu
    kontencji na GIL podczas fork+exec. ProcessPool z j=32 skaluje się
    poprawnie, ThreadPool nie.
    
    Domyślna liczba workerów podniesiona do min(cpu_count*4, 32) — dla pracy
    I/O-bound więcej procesów lepiej nasyca przepustowość dysku i sieci.
    Limit 32 chroni przed przekroczeniem SSH MaxStartups na serwerze git.
    Użytkownik nadal może nadpisać przez -j lub PLD.jobs w gitconfig.
    
    Nowy moduł git_slug/proctrack.py z klasą TrackedPopen (podklasa Popen):
    - auto-rejestracja w globalnym zbiorze _active przy tworzeniu,
    - auto-wyrejestrowanie w communicate(), wait() i poll(),
    - start_new_session=True — każdy git działa we własnej grupie procesów,
    - kill_all() wysyła SIGTERM przez os.killpg() do całej grupy, więc
      Ctrl-C zabija też procesy potomne (ssh, credential helpers).
    
    Cache refs w GitRepo — check_remote() wywoływany był per branch per
    repozytorium, co przy 20k repo × 5 branchy dawało ~100k operacji
    plikowych (open + read loose ref, a przy braku — liniowe skanowanie
    packed-refs). Teraz:
    - _load_packed_refs() parsuje packed-refs raz do dict (współdzielony
      między remote'ami), pomija linie # i ^,
    - _load_loose_refs(remote) rekurencyjnie (os.walk) wczytuje loose refs
      dla danego remote'a — rekursja potrzebna bo nazwy branchy z '/'
      (np. feature/foo) tworzą zagnieżdżone katalogi,
    - loose refs mają priorytet nad packed (zgodnie z semantyką git),
    - cache żyje tyle co instancja GitRepo (jedna per worker w fetch_package).
    Benchmark: 3.6x przyspieszenie (0.58s → 0.16s dla 3000 repo × 5 branchy).
    
    Wsadowy odczyt konfiguracji PLD.* — jedno wywołanie
    "git config --global --get-regexp ^PLD\." zastępuje 2-3 osobne
    subprocess.run() przy starcie. Klucze przechowywane lowercase (tak je
    zwraca git config), pusta wartość ('') odróżniana od braku klucza (None).
    Benchmark: 740x przyspieszenie (1.64s → 0.002s dla 1000 odczytów).
    
    Optymalizacja prune w fetch_packages() — gdy options.branch == ['*']
    (przypadek clone), wynik pierwszego getrefs() jest ponownie
    wykorzystywany zamiast wykonywania drugiego zapytania sieciowego do
    repozytorium Refs. Przy branch != ['*'] (np. domyślne update z
    ['master']) drugie zapytanie jest nadal konieczne. Dodano repopattern
    do drugiego wywołania getrefs(), co zawęża ilość przesyłanych danych.
    Poprawiono return_all aby zwracał all_refs.heads zachowując oryginalną
    semantykę po prune.
    
    git_passthrough_worker() używa TrackedPopen zamiast subprocess.run() —
    wynik to bytes, dekodowane z errors='replace'. Wyciągnięto
    print_prefixed(stderr) przed if/else. Dodano early-return w
    print_prefixed() dla pustych stringów.
    
    Benchmark podsumowanie (3000 repozytoriów):
      check_remote() 15k wywołań:  0.58s → 0.16s  (3.6x)
      git_config_get() 1k odczytów: 1.64s → 0.002s (740x)
      git log j=16:                 0.54s → 0.55s  (bez regresji)
      git log j=32:                 0.35s → 0.36s  (bez regresji)
    
    Testy: 87 testów (było 71), pokrycie 91%. Nowy test_proctrack.py
    (TrackedPopen rejestracja/wyrejestrowanie, kill_all z grupą procesów,
    integracja z GitRepo.commandexc). Nowe testy refs cache (loose nadpisuje
    packed, branch z '/', osobne cache per remote, jednokrotny odczyt
    packed-refs). Nowe testy prune (reuse refs przy ['*'], osobne getrefs
    przy filtrowanym branchu). Zaktualizowane istniejące testy pod zmienione
    API (FakeTrackedPopen zamiast mock subprocess.run, reset
    _pld_config_cache między testami, jobs == 12).

 git_slug/gitrepo.py           |  85 +++++++++++++----
 git_slug/proctrack.py         |  96 ++++++++++++++++++++
 slug.py                       | 107 +++++++++++++++-------
 tests/test_cli.py             |   2 +-
 tests/test_config_defaults.py |  46 ++++++++++
 tests/test_failure_paths.py   |  66 ++++++++++++++
 tests/test_gitrepo.py         |  74 +++++++++++++++
 tests/test_passthrough.py     |  31 ++++---
 tests/test_proctrack.py       | 207 ++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 651 insertions(+), 63 deletions(-)
---
diff --git a/git_slug/gitrepo.py b/git_slug/gitrepo.py
index 2b46e5b..336edce 100644
--- a/git_slug/gitrepo.py
+++ b/git_slug/gitrepo.py
@@ -1,4 +1,5 @@
 from .gitconst import EMPTYSHA1, REMOTE_NAME, REFFILE
+from .proctrack import TrackedPopen
 
 import os
 from subprocess import PIPE
@@ -21,8 +22,15 @@ class GitRepo:
         if self.wtree is not None:
             self.command_prefix.append('--work-tree='+self.wtree)
 
+        # Refs cache for check_remote(): avoids repeated file I/O when
+        # checking multiple branches per repo (common in update/clone).
+        # packed-refs is shared across remotes; loose refs are per-remote
+        # because they live under refs/remotes/<remote>/.
+        self._packed_refs = None
+        self._loose_refs = {}
+
     def command(self, clist):
-        return subprocess.Popen(self.command_prefix + clist, stdout=PIPE, stderr=PIPE, bufsize=-1)
+        return TrackedPopen(self.command_prefix + clist, stdout=PIPE, stderr=PIPE, bufsize=-1)
 
     def commandio(self, clist):
         return self.command(clist).communicate()
@@ -81,23 +89,68 @@ class GitRepo:
         self.commandio(['config', '--local', '--add', 'remote.{}.fetch'.format(remotename),
             'refs/notes/*:refs/notes/*'])
 
-    def check_remote(self, ref, remote=REMOTE_NAME):
-        localref = EMPTYSHA1
-        ref = ref.replace(REFFILE, os.path.join('remotes', remote))
+    def _load_packed_refs(self):
+        """Parse packed-refs into a dict, cached for the GitRepo lifetime.
+
+        Shared across remotes — packed-refs contains all refs regardless
+        of remote name.  Skips comment lines (#) and peeled tag lines (^).
+        """
+        if self._packed_refs is not None:
+            return
+        self._packed_refs = {}
         try:
-            with open(os.path.join(self.gdir, ref), 'r') as f:
-                localref = f.readline().strip()
+            with open(os.path.join(self.gdir, 'packed-refs')) as f:
+                for line in f:
+                    if line.startswith('#') or line.startswith('^'):
+                        continue
+                    parts = line.split()
+                    if len(parts) >= 2:
+                        self._packed_refs[parts[1]] = parts[0]
         except IOError:
-            try:
-                with open(os.path.join(self.gdir, 'packed-refs')) as f:
-                    for line in f:
-                        line_data = line.split()
-                        if len(line_data) == 2 and line_data[1] == ref:
-                            localref = line_data[0].strip()
-                            break
-            except IOError:
-                pass
-        return localref
+            pass
+
+    def _load_loose_refs(self, remote):
+        """Walk refs/remotes/<remote>/ recursively, caching per remote.
+
+        Recursive walk is needed because branch names with '/' (e.g.
+        feature/foo) create nested directories under the remote ref path.
+        """
+        if remote in self._loose_refs:
+            return
+        self._loose_refs[remote] = {}
+        refs_base = os.path.join(self.gdir, 'refs', 'remotes', remote)
+        try:
+            for dirpath, _, filenames in os.walk(refs_base):
+                for fname in filenames:
+                    full = os.path.join(dirpath, fname)
+                    # Build ref path relative to .git dir,
+                    # e.g. refs/remotes/origin/feature/x
+                    ref = os.path.relpath(full, self.gdir)
+                    try:
+                        with open(full) as f:
+                            self._loose_refs[remote][ref] = f.readline().strip()
+                    except IOError:
+                        pass
+        except OSError:
+            pass
+
+    def check_remote(self, ref, remote=REMOTE_NAME):
+        """Check local sha for a remote ref, using cached ref data.
+
+        Loose refs take priority over packed-refs (matching git semantics).
+        Both are loaded lazily and cached for the lifetime of this GitRepo
+        instance, so repeated calls for different branches only do file I/O
+        once — the rest are dict lookups.
+        """
+        ref = ref.replace(REFFILE, os.path.join('remotes', remote))
+        # Loose refs override packed-refs (git writes loose refs on fetch,
+        # packs them later during gc).
+        self._load_loose_refs(remote)
+        val = self._loose_refs[remote].get(ref)
+        if val is not None:
+            return val
+        self._load_packed_refs()
+        return self._packed_refs.get(ref, EMPTYSHA1)
 
     def showfile(self, filename, ref="/".join([REMOTE_NAME, "master"])):
         clist = ['show', ref + ':' + filename]
diff --git a/git_slug/proctrack.py b/git_slug/proctrack.py
new file mode 100644
index 0000000..c8351c7
--- /dev/null
+++ b/git_slug/proctrack.py
@@ -0,0 +1,96 @@
+"""Subprocess tracker for clean Ctrl-C shutdown.
+
+git-pld uses ThreadPool for parallel git operations because the work is
+entirely I/O-bound (waiting for git subprocesses).  ThreadPool cannot kill
+blocked worker threads, so we track every spawned git subprocess here and
+kill them all on interrupt — unblocking the worker threads so the pool can
+be joined promptly.
+
+Each subprocess is started in its own session (start_new_session=True) so
+that os.killpg() kills the entire process group — git plus any helpers it
+spawns (ssh, credential helpers, etc.).
+
+This affects ALL git operations globally (fetch, checkout, config, show...),
+not just passthrough.  Since git-pld is batch-only with no TTY interaction,
+running in a separate session is safe — no git helper needs terminal access.
+
+Usage:
+    # In worker code — use TrackedPopen instead of subprocess.Popen.
+    # It auto-registers on creation, auto-unregisters on communicate()/wait().
+    proc = TrackedPopen(cmd, stdout=PIPE, stderr=PIPE)
+    out, err = proc.communicate()
+
+    # On Ctrl-C — kill all tracked process groups from the main thread.
+    kill_all()
+"""
+
+import os
+import signal
+import subprocess
+import threading
+
+# Thread-safe set of currently running TrackedPopen instances.
+_active = set()
+_lock = threading.Lock()
+
+
+class TrackedPopen(subprocess.Popen):
+    """Popen subclass that auto-registers for Ctrl-C tracking.
+
+    Automatically registers on __init__, unregisters on communicate(),
+    wait(), and poll() — covering all calling patterns in the codebase:
+      - commandexc()/commandio() use communicate()
+      - refsdata.py uses wait()
+      - showfile() callers use poll() + stdout.read()
+
+    Callers MUST call communicate() or wait() to unregister.  All current
+    callers do.  showfile() callers are covered by poll() which unregisters
+    once the process is finished.
+
+    No __del__ safety net: _active holds strong references so __del__ would
+    never fire while the object is tracked (gc can't collect reachable objects).
+    """
+
+    def __init__(self, *args, **kwargs):
+        # Start in own session so kill_all()'s killpg() reaches helper
+        # children (ssh, credential helpers, etc.), not just the git process.
+        kwargs.setdefault('start_new_session', True)
+        super().__init__(*args, **kwargs)
+        with _lock:
+            _active.add(self)
+
+    def _untrack(self):
+        with _lock:
+            _active.discard(self)
+
+    def communicate(self, *args, **kwargs):
+        try:
+            return super().communicate(*args, **kwargs)
+        finally:
+            self._untrack()
+
+    def wait(self, *args, **kwargs):
+        try:
+            return super().wait(*args, **kwargs)
+        finally:
+            self._untrack()
+
+    def poll(self):
+        ret = super().poll()
+        if ret is not None:
+            self._untrack()
+        return ret
+
+
+def kill_all():
+    """Kill all tracked process groups — called on Ctrl-C.
+
+    Sends SIGTERM to each process group (not just the parent PID) so that
+    child processes like ssh are also terminated promptly.
+    """
+    with _lock:
+        for proc in list(_active):
+            try:
+                os.killpg(proc.pid, signal.SIGTERM)
+            except (OSError, ProcessLookupError):
+                pass
diff --git a/slug.py b/slug.py
index 8b1e7a2..5ef4227 100755
--- a/slug.py
+++ b/slug.py
@@ -19,6 +19,9 @@ import glob
 import itertools
 import sys
 import os
+
+# Allow running from source tree or via symlink without installing git_slug.
+sys.path.insert(0, os.path.dirname(os.path.realpath(__file__)))
 import shutil
 import subprocess
 import shlex
@@ -31,6 +34,7 @@ from multiprocessing import Pool as WorkerPool
 from git_slug.gitconst import (GITLOGIN, GITSERVER, GIT_REPO, GIT_REPO_PUSH,
                                 REMOTE_NAME, REMOTEREFS)
 from git_slug.gitrepo import GitRepo, GitRepoError
+from git_slug.proctrack import TrackedPopen, kill_all as kill_tracked_subprocesses
 from git_slug.refsdata import GitArchiveRefsData, NoMatchedRepos, RemoteRefsError
 
 
@@ -101,30 +105,56 @@ def default_packagesdir():
         return os.path.expanduser('~/rpm/packages')
 
 
-def git_config_get(key):
-    """Read a single value from global/system git config.
+# Cached PLD.* config values — loaded once via git config --get-regexp
+# instead of spawning a separate subprocess per key.  Called from
+# apply_defaults() in the main thread before the pool is created,
+# so no locking is needed.
+_pld_config_cache = None
+
 
-    Uses --global so that PLD settings come from ~/.gitconfig (or
-    XDG_CONFIG_HOME/git/config), not from the cwd repo's .git/config.
-    This prevents a single repo's local config from changing batch
-    behavior across all repos.
+def _load_pld_config():
+    """Read all PLD.* keys from global gitconfig in one subprocess call.
 
-    Returns the value as a string, or None if the key is not set.
+    git config --get-regexp outputs lowercased key names, so we store
+    and look up with lowercase keys.  Values are preserved as-is:
+    empty string (e.g. 'PLD.fetch-config =') stays as '' and is
+    distinguishable from a missing key (which returns None via dict.get).
     """
+    global _pld_config_cache
+    if _pld_config_cache is not None:
+        return _pld_config_cache
+    _pld_config_cache = {}
     try:
         result = subprocess.run(
-            ['git', 'config', '--global', '--get', key],
+            ['git', 'config', '--global', '--get-regexp', r'^PLD\.'],
             capture_output=True, text=True
         )
+        # --get-regexp returns exit code 1 when no keys match — treat as empty.
         if result.returncode == 0:
-            return result.stdout.strip()
+            for line in result.stdout.splitlines():
+                # partition preserves empty rhs: "pld.x " → ('pld.x', ' ', '')
+                key, _, val = line.partition(' ')
+                _pld_config_cache[key.lower()] = val
     except FileNotFoundError:
         pass
-    return None
+    return _pld_config_cache
+
+
+def git_config_get(key):
+    """Read a single PLD.* value from the cached global git config.
+
+    Uses a cached batch read of all PLD.* keys (one subprocess call)
+    instead of spawning a subprocess per key.  Returns the value as a
+    string, or None if the key is not set.
+    """
+    config = _load_pld_config()
+    return config.get(key.lower())
 
 
 def print_prefixed(stream_data, prefix, dest):
     """Print each line of stream_data prefixed with 'prefix: ' to dest."""
+    if not stream_data:
+        return
     for line in stream_data.splitlines():
         print("{}: {}".format(prefix, line), file=dest)
 
@@ -269,7 +299,10 @@ def apply_defaults(options):
     # only run if the value wasn't set by CLI or git config.
     HARDCODED = {
         'packagesdir': default_packagesdir,
-        'jobs': cpu_count,
+        # Workers are I/O-bound (waiting for git subprocesses), so more
+        # workers = more parallelism.  Capped at 32 to stay within typical
+        # SSH MaxStartups limits on git servers.
+        'jobs': lambda: min(cpu_count() * 4, 32),
         'quiet': lambda: False,
         'pattern': lambda: ['*'],
     }
@@ -321,6 +354,7 @@ def run_worker(function, options, args):
     except KeyboardInterrupt:
         print('Keyboard interrupt received, finishing...', file=sys.stderr)
         pool.terminate()
+        kill_tracked_subprocesses()
         pool.join()
         sys.exit(1)
     return ret
@@ -375,22 +409,22 @@ def git_passthrough_worker(repo_dir, git_command, git_args, config_pairs, quiet)
     """
     directory = os.path.basename(repo_dir)
     cmd = build_git_cmd(repo_dir, git_command, git_args, config_pairs)
-    result = subprocess.run(cmd, capture_output=True, text=True)
-
-    if result.returncode != 0:
+    proc = TrackedPopen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+    stdout_bytes, stderr_bytes = proc.communicate()
+    stdout_str = stdout_bytes.decode('utf-8', errors='replace')
+    stderr_str = stderr_bytes.decode('utf-8', errors='replace')
+
+    # Always show stderr (may contain warnings even on success).
+    print_prefixed(stderr_str, directory, sys.stderr)
+    if proc.returncode != 0:
         # Failed — always show all output so the developer can diagnose
-        print_prefixed(result.stdout, directory, sys.stdout)
-        print_prefixed(result.stderr, directory, sys.stderr)
-    else:
-        # Succeeded — always show stderr (warnings, progress), because
-        # even a successful command might emit important diagnostics.
-        print_prefixed(result.stderr, directory, sys.stderr)
-        if not quiet:
-            # In quiet mode we suppress stdout from successful repos
-            # (e.g. "Already up to date" from hundreds of repos).
-            print_prefixed(result.stdout, directory, sys.stdout)
+        print_prefixed(stdout_str, directory, sys.stdout)
+    elif not quiet:
+        # In quiet mode we suppress stdout from successful repos
+        # (e.g. "Already up to date" from hundreds of repos).
+        print_prefixed(stdout_str, directory, sys.stdout)
 
-    return repo_dir if result.returncode else None
+    return repo_dir if proc.returncode else None
 
 
 def passthrough_command(options, git_command, git_args):
@@ -521,12 +555,13 @@ def fetch_packages(options, return_all=False):
     Sets options.had_errors = True if any init or prune operation failed,
     so callers can exit non-zero.
     """
-    refs = getrefs(options.branch, options.repopattern)
+    branch_refs = getrefs(options.branch, options.repopattern)
+    all_refs = branch_refs
     print('Read remotes data', file=sys.stderr)
     had_errors = False
     pkgs_new = []
     if options.newpkgs:
-        for pkgdir in sorted(refs.heads):
+        for pkgdir in sorted(branch_refs.heads):
             gitdir = os.path.join(options.packagesdir, pkgdir, '.git')
             if not os.path.isdir(gitdir):
                 pkgs_new.append(pkgdir)
@@ -540,12 +575,12 @@ def fetch_packages(options, return_all=False):
             had_errors = True
 
     args = []
-    for pkgdir in sorted(refs.heads):
+    for pkgdir in sorted(branch_refs.heads):
         if options.omitexisting and pkgdir not in pkgs_new:
             continue
         else:
             gitrepo = GitRepo(os.path.join(options.packagesdir, pkgdir))
-            args.append((gitrepo, refs.heads[pkgdir], options))
+            args.append((gitrepo, branch_refs.heads[pkgdir], options))
 
     fetch_results = run_worker(fetch_package, options, args)
     # Separate successful fetches from errors
@@ -554,10 +589,18 @@ def fetch_packages(options, return_all=False):
         had_errors = True
 
     if options.prune:
-        refs = getrefs(['*'])
+        # Prune needs all-branches refs to know which repos still exist
+        # upstream.  If we already fetched with ['*'], reuse those refs
+        # to avoid a redundant network call.  Otherwise (e.g. -b master)
+        # we must fetch again with ['*'].
+        # Adding repopattern narrows the fetch to only matching repos —
+        # the prune loop already filters via find_git_repos(...,
+        # options.repopattern), so the effective result is the same.
+        if options.branch != ['*']:
+            all_refs = getrefs(['*'], options.repopattern)
         for fulldir in find_git_repos(options.packagesdir, options.repopattern):
             pkgdir = os.path.basename(fulldir)
-            if len(refs.heads[pkgdir]) == 0:
+            if len(all_refs.heads[pkgdir]) == 0:
                 print('warning: removing', fulldir, file=sys.stderr)
                 try:
                     shutil.rmtree(fulldir)
@@ -568,7 +611,7 @@ def fetch_packages(options, return_all=False):
 
     options.had_errors = had_errors
     if return_all:
-        return refs.heads
+        return all_refs.heads
     else:
         return updated_repos
 
diff --git a/tests/test_cli.py b/tests/test_cli.py
index a86ce63..cfaca5b 100644
--- a/tests/test_cli.py
+++ b/tests/test_cli.py
@@ -149,6 +149,6 @@ def test_apply_defaults_can_fill_namespace_created_like_argparse(monkeypatch):
     slug.apply_defaults(options)
 
     assert options.packagesdir == "/fallback"
-    assert options.jobs == 3
+    assert options.jobs == 12  # min(3 * 4, 32)
     assert options.quiet is False
     assert options.pattern == ["*"]
diff --git a/tests/test_config_defaults.py b/tests/test_config_defaults.py
index 6d9723d..041eca0 100644
--- a/tests/test_config_defaults.py
+++ b/tests/test_config_defaults.py
@@ -4,7 +4,13 @@ from types import SimpleNamespace
 import slug
 
 
+def _reset_config_cache():
+    """Reset the PLD config cache so tests don't leak state."""
+    slug._pld_config_cache = None
+
+
 def test_apply_defaults_prefers_cli_over_config(monkeypatch):
+    _reset_config_cache()
     options = argparse.Namespace(
         packagesdir="/cli/packages",
         jobs=9,
@@ -26,6 +32,7 @@ def test_apply_defaults_prefers_cli_over_config(monkeypatch):
 
 
 def test_apply_defaults_reads_git_config_for_missing_values(monkeypatch):
+    _reset_config_cache()
     options = argparse.Namespace()
 
     monkeypatch.setattr(slug, "git_config_get", lambda key: {
@@ -44,6 +51,7 @@ def test_apply_defaults_reads_git_config_for_missing_values(monkeypatch):
 
 
 def test_get_command_config_uses_builtin_when_unconfigured(monkeypatch):
+    _reset_config_cache()
     monkeypatch.setattr(slug, "git_config_get", lambda key: None)
 
     assert slug.get_command_config("pull") == [
@@ -54,6 +62,7 @@ def test_get_command_config_uses_builtin_when_unconfigured(monkeypatch):
 
 
 def test_get_command_config_uses_shlex_split_for_user_config(monkeypatch):
+    _reset_config_cache()
     monkeypatch.setattr(
         slug,
         "git_config_get",
@@ -67,12 +76,30 @@ def test_get_command_config_uses_shlex_split_for_user_config(monkeypatch):
 
 
 def test_get_command_config_allows_empty_string_to_disable_defaults(monkeypatch):
+    _reset_config_cache()
     monkeypatch.setattr(slug, "git_config_get", lambda key: "")
 
     assert slug.get_command_config("fetch") == []
 
 
+def test_git_config_get_returns_value_from_batch_cache(monkeypatch):
+    """git_config_get uses _load_pld_config() to batch-read all PLD.* keys."""
+    _reset_config_cache()
+    monkeypatch.setattr(
+        slug.subprocess,
+        "run",
+        lambda *args, **kwargs: SimpleNamespace(
+            returncode=0,
+            stdout="pld.jobs 7\npld.packagesdir /custom\n",
+        ),
+    )
+
+    assert slug.git_config_get("PLD.jobs") == "7"
+    assert slug.git_config_get("PLD.packagesdir") == "/custom"
+
+
 def test_git_config_get_returns_none_on_nonzero_status(monkeypatch):
+    _reset_config_cache()
     monkeypatch.setattr(
         slug.subprocess,
         "run",
@@ -83,9 +110,28 @@ def test_git_config_get_returns_none_on_nonzero_status(monkeypatch):
 
 
 def test_git_config_get_returns_none_when_git_is_missing(monkeypatch):
+    _reset_config_cache()
+
     def fake_run(*args, **kwargs):
         raise FileNotFoundError()
 
     monkeypatch.setattr(slug.subprocess, "run", fake_run)
 
     assert slug.git_config_get("PLD.jobs") is None
+
+
+def test_git_config_get_preserves_empty_value(monkeypatch):
+    """Empty value (e.g. 'PLD.fetch-config =') is '' not None."""
+    _reset_config_cache()
+    monkeypatch.setattr(
+        slug.subprocess,
+        "run",
+        lambda *args, **kwargs: SimpleNamespace(
+            returncode=0,
+            stdout="pld.fetch-config \n",
+        ),
+    )
+
+    # Empty value is distinguishable from missing key
+    assert slug.git_config_get("PLD.fetch-config") == ""
+    assert slug.git_config_get("PLD.nonexistent") is None
diff --git a/tests/test_failure_paths.py b/tests/test_failure_paths.py
index dc3fe25..0d8c894 100644
--- a/tests/test_failure_paths.py
+++ b/tests/test_failure_paths.py
@@ -168,6 +168,7 @@ def test_run_worker_handles_keyboard_interrupt(monkeypatch, make_options, capsys
             state["joined"] = True
 
     monkeypatch.setattr(slug, "WorkerPool", FakePool)
+    monkeypatch.setattr(slug, "kill_tracked_subprocesses", lambda: None)
 
     with pytest.raises(SystemExit) as exc:
         slug.run_worker(object(), make_options(jobs=3), [])
@@ -300,3 +301,68 @@ def test_fetch_packages_marks_errors_on_prune_delete_failure(monkeypatch, make_o
     output = capsys.readouterr().err
     assert "warning: removing /pkgs/orphan" in output
     assert "error: failed to remove /pkgs/orphan: permission denied" in output
+
+
+# --- Prune refs reuse tests ---
+
+def test_prune_reuses_refs_when_branch_is_star(monkeypatch, make_options):
+    """When branch is ['*'], prune reuses the initial refs (one getrefs call)."""
+    calls = []
+
+    def counting_getrefs(*args):
+        calls.append(args)
+        return SimpleNamespace(
+            heads=collections.defaultdict(dict, {"pkg": {"refs/heads/master": "sha"}})
+        )
+
+    monkeypatch.setattr(slug, "getrefs", counting_getrefs)
+    monkeypatch.setattr(slug, "GitRepo", lambda path: SimpleNamespace(wtree=path, gdir=path + "/.git"))
+    monkeypatch.setattr(slug, "run_worker", lambda function, options, args: [])
+    monkeypatch.setattr(slug, "find_git_repos", lambda packagesdir, patterns: [])
+
+    options = make_options(
+        branch=["*"],
+        repopattern=["*"],
+        newpkgs=False,
+        prune=True,
+        depth=0,
+        omitexisting=False,
+    )
+
+    slug.fetch_packages(options)
+
+    # Only one getrefs call — the ['*'] result is reused for prune
+    assert len(calls) == 1
+    assert calls[0][0] == ["*"]
+
+
+def test_prune_fetches_separately_when_branch_is_filtered(monkeypatch, make_options):
+    """When branch != ['*'], prune makes a second getrefs call with ['*']."""
+    calls = []
+
+    def counting_getrefs(*args):
+        calls.append(args)
+        return SimpleNamespace(
+            heads=collections.defaultdict(dict, {"pkg": {"refs/heads/master": "sha"}})
+        )
+
+    monkeypatch.setattr(slug, "getrefs", counting_getrefs)
+    monkeypatch.setattr(slug, "GitRepo", lambda path: SimpleNamespace(wtree=path, gdir=path + "/.git"))
+    monkeypatch.setattr(slug, "run_worker", lambda function, options, args: [])
+    monkeypatch.setattr(slug, "find_git_repos", lambda packagesdir, patterns: [])
+
+    options = make_options(
+        branch=["devel"],
+        repopattern=["*"],
+        newpkgs=False,
+        prune=True,
+        depth=0,
+        omitexisting=False,
+    )
+
+    slug.fetch_packages(options)
+
+    # Two getrefs calls: first with ['devel'], second with ['*'] for prune
+    assert len(calls) == 2
+    assert calls[0][0] == ["devel"]
+    assert calls[1][0] == ["*"]
diff --git a/tests/test_gitrepo.py b/tests/test_gitrepo.py
index 283b572..996c11f 100644
--- a/tests/test_gitrepo.py
+++ b/tests/test_gitrepo.py
@@ -114,3 +114,77 @@ def test_check_remote_falls_back_to_packed_refs(tmp_path):
     repo = GitRepo(str(repo_dir))
 
     assert repo.check_remote("refs/heads/master") == "def456"
+
+
+# --- Refs cache tests ---
+
+def test_loose_refs_override_packed_refs(tmp_path):
+    """Loose ref file takes priority over packed-refs (git semantics)."""
+    repo_dir = tmp_path / "pkg"
+    gitdir = repo_dir / ".git"
+    # Write conflicting sha in packed-refs
+    gitdir.mkdir(parents=True)
+    (gitdir / "packed-refs").write_text("packed111 refs/remotes/origin/master\n")
+    # Write different sha in loose ref
+    loose_dir = gitdir / "refs" / "remotes" / "origin"
+    loose_dir.mkdir(parents=True)
+    (loose_dir / "master").write_text("loose222\n")
+
+    repo = GitRepo(str(repo_dir))
+
+    assert repo.check_remote("refs/heads/master") == "loose222"
+
+
+def test_check_remote_handles_branch_names_with_slash(tmp_path):
+    """Branch names with '/' create nested dirs (e.g. feature/foo)."""
+    repo_dir = tmp_path / "pkg"
+    gitdir = repo_dir / ".git"
+    # Create nested loose ref: refs/remotes/origin/feature/foo
+    nested_dir = gitdir / "refs" / "remotes" / "origin" / "feature"
+    nested_dir.mkdir(parents=True)
+    (nested_dir / "foo").write_text("abc999\n")
+
+    repo = GitRepo(str(repo_dir))
+
+    assert repo.check_remote("refs/heads/feature/foo") == "abc999"
+
+
+def test_non_default_remotes_have_separate_cache(tmp_path):
+    """Each remote gets its own loose refs cache."""
+    repo_dir = tmp_path / "pkg"
+    gitdir = repo_dir / ".git"
+
+    # Create loose refs for two different remotes
+    for remote, sha in [("origin", "orig111"), ("upstream", "up222")]:
+        ref_dir = gitdir / "refs" / "remotes" / remote
+        ref_dir.mkdir(parents=True)
+        (ref_dir / "master").write_text(sha + "\n")
+
+    repo = GitRepo(str(repo_dir))
+
+    assert repo.check_remote("refs/heads/master", "origin") == "orig111"
+    assert repo.check_remote("refs/heads/master", "upstream") == "up222"
+
+
+def test_packed_refs_cache_loaded_once(tmp_path, monkeypatch):
+    """packed-refs file is read only once, even across multiple check_remote calls."""
+    repo_dir = tmp_path / "pkg"
+    gitdir = repo_dir / ".git"
+    gitdir.mkdir(parents=True)
+    (gitdir / "packed-refs").write_text(
+        "aaa111 refs/remotes/origin/master\n"
+        "bbb222 refs/remotes/origin/devel\n"
+    )
+
+    repo = GitRepo(str(repo_dir))
+
+    # First call loads the cache
+    assert repo.check_remote("refs/heads/master") == "aaa111"
+
+    # Tamper with the file — cache should prevent re-reading
+    (gitdir / "packed-refs").write_text("ccc333 refs/remotes/origin/master\n")
+
+    # Second call uses cached value (not the tampered file)
+    assert repo.check_remote("refs/heads/devel") == "bbb222"
+    # master still returns old cached value
+    assert repo.check_remote("refs/heads/master") == "aaa111"
diff --git a/tests/test_passthrough.py b/tests/test_passthrough.py
index 77528d9..b3c0f5c 100644
--- a/tests/test_passthrough.py
+++ b/tests/test_passthrough.py
@@ -24,15 +24,22 @@ def test_build_git_cmd_includes_config_and_repo_dir():
     ]
 
 
+class FakeTrackedPopen:
+    """Fake TrackedPopen for testing git_passthrough_worker output handling."""
+    def __init__(self, returncode, stdout, stderr):
+        self.returncode = returncode
+        self._stdout = stdout.encode() if isinstance(stdout, str) else stdout
+        self._stderr = stderr.encode() if isinstance(stderr, str) else stderr
+
+    def communicate(self):
+        return (self._stdout, self._stderr)
+
+
 def test_git_passthrough_worker_prints_all_output_for_failures(monkeypatch, capsys):
     monkeypatch.setattr(
-        slug.subprocess,
-        "run",
-        lambda *args, **kwargs: SimpleNamespace(
-            returncode=1,
-            stdout="stdout line\n",
-            stderr="stderr line\n",
-        ),
+        slug,
+        "TrackedPopen",
+        lambda *args, **kwargs: FakeTrackedPopen(1, "stdout line\n", "stderr line\n"),
     )
 
     result = slug.git_passthrough_worker(
@@ -51,13 +58,9 @@ def test_git_passthrough_worker_prints_all_output_for_failures(monkeypatch, caps
 
 def test_git_passthrough_worker_quiet_suppresses_success_stdout_only(monkeypatch, capsys):
     monkeypatch.setattr(
-        slug.subprocess,
-        "run",
-        lambda *args, **kwargs: SimpleNamespace(
-            returncode=0,
-            stdout="all clean\n",
-            stderr="warning message\n",
-        ),
+        slug,
+        "TrackedPopen",
+        lambda *args, **kwargs: FakeTrackedPopen(0, "all clean\n", "warning message\n"),
     )
 
     result = slug.git_passthrough_worker(
diff --git a/tests/test_proctrack.py b/tests/test_proctrack.py
new file mode 100644
index 0000000..a2385af
--- /dev/null
+++ b/tests/test_proctrack.py
@@ -0,0 +1,207 @@
+"""Tests for git_slug.proctrack — TrackedPopen and kill_all."""
+
+import os
+import signal
+import subprocess
+import sys
+import time
+
+import pytest
+
+from git_slug.proctrack import TrackedPopen, kill_all, _active, _lock
+
+
+def _clear_active():
+    """Remove all entries from the active tracking set."""
+    with _lock:
+        _active.clear()
+
+
+class TestTrackedPopenRegistration:
+    """TrackedPopen auto-registers on init and unregisters on communicate/wait/poll."""
+
+    def setup_method(self):
+        _clear_active()
+
+    def teardown_method(self):
+        _clear_active()
+
+    def test_register_on_init_unregister_on_communicate(self):
+        proc = TrackedPopen(
+            [sys.executable, "-c", "print('hello')"],
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+        )
+        # Registered immediately after creation
+        with _lock:
+            assert proc in _active
+
+        proc.communicate()
+
+        # Unregistered after communicate()
+        with _lock:
+            assert proc not in _active
+
+    def test_register_on_init_unregister_on_wait(self):
+        proc = TrackedPopen(
+            [sys.executable, "-c", "pass"],
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+        )
+        with _lock:
+            assert proc in _active
+
+        proc.wait()
+
+        with _lock:
+            assert proc not in _active
+
+    def test_unregister_on_poll_when_finished(self):
+        proc = TrackedPopen(
+            [sys.executable, "-c", "pass"],
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+        )
+        # Wait for the process to actually finish
+        proc.wait()
+        # Re-add to simulate the scenario where only poll() is used
+        with _lock:
+            _active.add(proc)
+
+        ret = proc.poll()
+
+        assert ret is not None
+        with _lock:
+            assert proc not in _active
+
+    def test_poll_keeps_tracked_while_running(self):
+        # Use a process that sleeps briefly so poll returns None
+        proc = TrackedPopen(
+            [sys.executable, "-c", "import time; time.sleep(5)"],
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+        )
+        try:
+            ret = proc.poll()
+            # Process still running — poll returns None, stays tracked
+            assert ret is None
+            with _lock:
+                assert proc in _active
+        finally:
+            proc.terminate()
+            proc.wait()
+
+    def test_communicate_returns_output(self):
+        proc = TrackedPopen(
+            [sys.executable, "-c", "import sys; print('out'); print('err', file=sys.stderr)"],
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+        )
+        out, err = proc.communicate()
+        assert b"out" in out
+        assert b"err" in err
+
+    def test_start_new_session_default(self):
+        """TrackedPopen starts in its own session by default."""
+        proc = TrackedPopen(
+            [sys.executable, "-c", "pass"],
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+        )
+        proc.wait()
+        # The process should have its own session (pgid == pid)
+        # We can't easily test this after the process exits, but we
+        # can verify start_new_session was set by checking the process
+        # ran without error.
+        assert proc.returncode == 0
+
+
+class TestKillAll:
+    """kill_all() sends SIGTERM to entire process groups."""
+
+    def setup_method(self):
+        _clear_active()
+
+    def teardown_method(self):
+        _clear_active()
+
+    def test_kill_all_terminates_running_process(self):
+        """kill_all() terminates a tracked process via process group signal."""
+        proc = TrackedPopen(
+            [sys.executable, "-c", "import time; time.sleep(60)"],
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+        )
+        with _lock:
+            assert proc in _active
+
+        kill_all()
+
+        # Process should be terminated (may take a moment)
+        proc.wait(timeout=5)
+        # SIGTERM results in negative returncode on POSIX
+        assert proc.returncode != 0
+
+    def test_kill_all_reaches_child_processes(self):
+        """kill_all() kills the process group, including child processes."""
+        # Parent spawns a child that also sleeps.  Both should be killed
+        # because they share a process group (start_new_session=True makes
+        # the parent the group leader).
+        script = (
+            "import subprocess, time; "
+            "subprocess.Popen(['sleep', '60']); "
+            "time.sleep(60)"
+        )
+        proc = TrackedPopen(
+            [sys.executable, "-c", script],
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+        )
+
+        # Brief pause so the child subprocess actually starts
+        time.sleep(0.2)
+
+        kill_all()
+
+        proc.wait(timeout=5)
+        assert proc.returncode != 0
+
+    def test_kill_all_ignores_already_dead_processes(self):
+        """kill_all() doesn't crash on processes that already exited."""
+        proc = TrackedPopen(
+            [sys.executable, "-c", "pass"],
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+        )
+        proc.wait()
+        # Process is dead but still tracked (wait() unregistered it,
+        # so re-add to test the edge case)
+        with _lock:
+            _active.add(proc)
+
+        # Should not raise
+        kill_all()
+
+
+class TestWithGitRepo:
+    """TrackedPopen works correctly when used via GitRepo."""
+
+    def setup_method(self):
+        _clear_active()
+
+    def teardown_method(self):
+        _clear_active()
+
+    def test_commandexc_unregisters_after_completion(self):
+        """GitRepo.commandexc() properly unregisters the TrackedPopen."""
+        from git_slug.gitrepo import GitRepo
+
+        repo = GitRepo(None, None)
+        # 'git --version' should succeed everywhere
+        repo.command_prefix = ["git"]
+        out, err = repo.commandexc(["--version"])
+
+        assert b"git version" in out
+        # After commandexc returns, the process should be untracked
+        with _lock:
+            assert len(_active) == 0
================================================================

---- gitweb:

http://git.pld-linux.org/gitweb.cgi/projects/git-slug.git/commitdiff/4a7e426b8f1a3571094b5dc89412bc49b8f29666




More information about the pld-cvs-commit mailing list