Better garbage collecting

Dominik Krenn 2025-11-05 10:10:51 +01:00
parent 443f9f4f4b
commit 89746b1076
5 changed files with 197 additions and 7 deletions

WORK.md Normal file

@@ -0,0 +1,78 @@
# What's going on inside `mathstream`

Figured I'd write down how the whole thing is wired; this is more of a brain dump than polished docs. If you're diving in to fix something, this should save you a bunch of spelunking.

## Directory map (so you know where to poke)
```
mathstream/
    __init__.py      # re-export central
    engine.py        # arithmetic guts
    exceptions.py    # custom errors
    number.py        # StreamNumber, manual GC, watcher
    utils.py         # sqlite junk drawer
test.py              # smoke/integration script
```
## StreamNumber: the heart of it

`mathstream/number.py` owns the `StreamNumber` class. The class does a couple of jobs:

- wraps a file of digits (either you give us a path or a literal; literals get canonicalised and dropped into `instance/log/literal_<hash>.txt`).
- streaming happens via `.stream(chunk_size)` so we never load the whole thing; every time we read we call `touch_log_file` so the usage timestamp keeps moving.
- when a new stream gets created we check if its file lives under `LOG_DIR`. If yes, we register it with the sqlite tracker (`register_log_file`) and also bump a ref counter via `register_reference`.
- there's a weakref finaliser plus a global `_ACTIVE_COUNTER` that keeps tabs on Python-side references. If the object falls out of scope we run `_finalize_instance`, which decrements the counter and, if it was the last one, calls `release_reference` (that may nuke the file instantly).
- explicit `free()` exists for people who want deterministic cleanup. It's basically like `free()` in C: drop the ref count and optionally delete the file now. There's an alias `free_stream`, and the class is a context manager, so `with StreamNumber(...) as sn:` cleans up automatically.

So any time you've got a staged result hanging around in memory, the watcher knows about it. Once you ditch it, either by `free()` or by just letting the object die, the sqlite ref count drops.
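
Rough sketch of the finaliser-plus-counter idea, not the real `number.py`. `Handle`, `active` and `_on_finalize` are made-up names for illustration. The real class detaches the finaliser and decrements by hand; this toy version simply invokes the finaliser object, which the standard library runs at most once.

```python
import weakref
from collections import Counter

# Hypothetical stand-in for the module-level counter described above.
active = Counter()


def _on_finalize(key: str) -> None:
    # Drop one in-memory reference; remove the key once nothing points at it.
    if active[key] <= 1:
        active.pop(key, None)
    else:
        active[key] -= 1


class Handle:
    """Toy stand-in for StreamNumber: counts live handles per path."""

    def __init__(self, key: str) -> None:
        self.key = key
        active[key] += 1
        # The callback must not hold a reference to self,
        # otherwise the object could never be collected.
        self._finalizer = weakref.finalize(self, _on_finalize, key)

    def free(self) -> None:
        # Calling the finalizer runs the callback at most once, so a later
        # garbage collection of this object will not decrement again.
        self._finalizer()


a = Handle("x.txt")
b = Handle("x.txt")
a.free()
a.free()  # second call is a no-op
del b     # on CPython the finalizer fires once the last reference is gone
```

The point of the design is that manual `free()` and "just let it die" share one code path, so the count cannot be decremented twice for the same object.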
## Engine: maths without ints

Living in `mathstream/engine.py`. All the operators (`add/sub/mul/div/mod/pow`) pull chunks from the `StreamNumber` inputs, normalise them into sign + digit strings, run grade-school algorithms, then write the result back into `LOG_DIR`.

- `_write_result` is the important bit: writes to disk, calls `register_log_file`, then wraps the file in a new `StreamNumber`. Because of that call, every staged result is tracked automatically.
- We're careful about signs: division and modulo follow Python's floor division rules. Divide-by-zero is intercepted and converted into `DivideByZeroError`.
- `clear_logs()` wipes the folder and calls `wipe_log_records()` to empty sqlite so the next run isn't polluted.
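
To make "grade-school algorithms on digit strings" concrete, here is a minimal sketch of column-by-column addition. It is not the code in `engine.py`: `add_digit_strings` is a hypothetical helper, it only handles non-negative inputs, and it works on whole strings rather than streamed chunks.

```python
def add_digit_strings(a: str, b: str) -> str:
    """Add two non-negative decimal strings column by column, right to left."""
    result = []
    carry = 0
    i, j = len(a) - 1, len(b) - 1
    while i >= 0 or j >= 0 or carry:
        # Only single digits are ever converted to ints, never the whole number.
        da = ord(a[i]) - ord("0") if i >= 0 else 0
        db = ord(b[j]) - ord("0") if j >= 0 else 0
        carry, digit = divmod(da + db + carry, 10)
        result.append(chr(digit + ord("0")))
        i -= 1
        j -= 1
    # Digits were produced least-significant first; strip leading zeros.
    return "".join(reversed(result)).lstrip("0") or "0"


print(add_digit_strings("999", "1"))
```

For the sign rules, the reference behaviour is plain Python: `-7 // 2` is `-4` and `-7 % 2` is `1`, so the remainder takes the sign of the divisor.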
## Exceptions
`mathstream/exceptions.py` just defines `MathStreamError` and the more specific `DivideByZeroError`. Nothing fancy, just so we don't leak raw `ZeroDivisionError`.
## SQLite watcher (`mathstream/utils.py`)
This is the garbage-collection HQ. On import we run `_ensure_db(reset=True)` so every run starts from a clean DB (no migrations, no surprises). Two tables:

- `logs` → metadata about every staged file: created time, last access, access count.
- `refs` → current reference count (think “how many StreamNumber instances think they own this file”).

Important functions:

- `register_log_file(path)`: ensures both tables have a row (initial ref count 0).
- `register_reference(path)`: increments the ref count and updates last access, access count etc. Called whenever a new `StreamNumber` points at the staged file.
- `touch_log_file(path)`: called from `.stream()` so we know the file is being read.
- `release_reference(path, delete_file=True)`: the inverse of register. Decrements the count; if it hits zero we remove the DB rows and (optionally) delete the file right away.
- `collect_garbage(score_threshold)`: this is the periodic sweeper. Computes `score = age / ((ref_count + 1) * (access_count + 1))`. Bigger score means older + less used. If score >= threshold the file gets unlinked and removed from the DB. Negative thresholds blow up on purpose.
- `tracked_files()`: dumb helper that dumps `{path: ref_count}` out of the DB.
- `wipe_log_records()`: nukes both tables; used by `clear_logs`.
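
The scoring formula is easier to tune with numbers in front of you, so here is a small sketch. `gc_score` and `should_collect` are hypothetical names, measuring age in seconds is an assumption, and raising `ValueError` for a negative threshold is a guess at what "blow up" means in `utils.py`.

```python
def gc_score(age_seconds: float, ref_count: int, access_count: int) -> float:
    """score = age / ((ref_count + 1) * (access_count + 1)), as described above."""
    return age_seconds / ((ref_count + 1) * (access_count + 1))


def should_collect(score: float, threshold: float) -> bool:
    # Assumption: the real sweeper rejects negative thresholds in a similar way.
    if threshold < 0:
        raise ValueError("score_threshold must be non-negative")
    return score >= threshold


# A 10-minute-old file that nobody references or has read scores its full age.
print(gc_score(600, ref_count=0, access_count=0))
# The same file with one live reference and two reads scores much lower.
print(gc_score(600, ref_count=1, access_count=2))
```

With these inputs the second score is 600 / (2 * 3) = 100, so a threshold of 100 would still collect it while anything higher would leave it alone. A threshold of 0 collects everything, since a score is never negative.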
## How cleanup flows
1. You run an operation (`add`, `mul`, whatever). Result file lands in `LOG_DIR`, gets registered, comes back as a `StreamNumber`.
2. You stream it or create more streams from it; metadata keeps getting updated via `touch_log_file`/`register_reference`.
3. When you're done, call `.free()` or just drop references. Manual free is immediate. Otherwise the weakref finaliser catches it eventually.
4. `release_reference` is what actually removes the sqlite entries and unlinks the data file when there are no logical references left.
5. If you still have detritus (e.g. you crashed before refs hit zero), run `collect_garbage(threshold)` to sweep anything whose age outweighs usage.
6. `active_streams()` reports what's still alive in Python land; `tracked_files()` shows what the DB thinks is referenced.
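
The register/release half of that flow can be sketched against an in-memory database. This is an illustration only: the `refs` table with `path` and `ref_count` columns matches the queries in `utils.py`, but the column types, the `register` and `release` helpers, and the return value are assumptions, and no files are touched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: only the table and column names come from the real queries.
conn.execute("CREATE TABLE refs (path TEXT PRIMARY KEY, ref_count INTEGER NOT NULL)")


def register(path: str) -> None:
    """Make sure a row exists, then bump its reference count."""
    conn.execute("INSERT OR IGNORE INTO refs (path, ref_count) VALUES (?, 0)", (path,))
    conn.execute("UPDATE refs SET ref_count = ref_count + 1 WHERE path = ?", (path,))


def release(path: str) -> bool:
    """Drop one reference; return True when the last one went and the row was removed."""
    row = conn.execute("SELECT ref_count FROM refs WHERE path = ?", (path,)).fetchone()
    if row is None:
        return False
    if row[0] > 1:
        conn.execute("UPDATE refs SET ref_count = ref_count - 1 WHERE path = ?", (path,))
        return False
    conn.execute("DELETE FROM refs WHERE path = ?", (path,))
    return True


register("result.txt")
register("result.txt")
print(release("result.txt"))  # one reference left, row stays
print(release("result.txt"))  # last reference gone, row removed
```

In the real module the point where this sketch returns `True` is where the data file would be unlinked as well.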
## Example run (`test.py`)
`test.py` is half regression, half reference script. It:
- seeds some numbers, runs every operation, checks results.
- makes sure `DivideByZeroError` fires.
- frees every staged number to prove files vanish on the spot.
- runs `collect_garbage(0)` just to make sure nothing else lingers.
- dumps `active_streams()` and `tracked_files()` so you can see python vs sqlite state.

If the logs ever seem suspicious, run that script. It'll tell you immediately whether something's still referenced or if the GC is forgetting to clean up.


@@ -23,6 +23,7 @@ from mathstream import (
    pow,
    is_even,
    is_odd,
    free_stream,
    collect_garbage,
)
@@ -39,6 +40,9 @@ print("power =", "".join(pow(a, e).stream()))
print("a is even?", is_even(a))
print("b is odd?", is_odd(b))
# drop staged artifacts immediately when you are done
free_stream(b)
# reclaim space for files whose age outweighs their use
collect_garbage(0.5)
```
@@ -52,8 +56,9 @@ Each arithmetic call writes its result back into `instance/log` (configurable vi
- **Automatic staging**: Outputs are stored under `LOG_DIR` with hashes based on input file paths, letting you compose operations without manual bookkeeping.
- **Sign-aware**: Addition, subtraction, multiplication, division (`//` behavior), modulo, and exponentiation (non-negative exponents) all respect operand sign. Division/modulo follow Python's floor-division rules.
- **Utilities**: `clear_logs()` wipes prior staged results so you can start fresh.
- **Manual freeing**: Call `stream.free()` (or `free_stream(stream)`) once you are done with a staged number to release its reference immediately. Logger metadata keeps per-path reference counts so the final free removes the backing file on the spot.
- **Parity helpers**: `is_even` and `is_odd` inspect the streamed digits without materializing the integer.
- **Garbage collection**: `collect_garbage(score_threshold)` computes a score from file age, access count, and reference count (tracked in `instance/mathstream_logs.sqlite`, freshly truncated each run). Files whose score meets or exceeds the threshold are deleted, letting you tune how aggressively to reclaim space. Both staged results and literal caches participate.
- **Garbage collection**: `collect_garbage(score_threshold)` computes a score from file age, access count, and reference count (tracked in `instance/mathstream_logs.sqlite`, freshly truncated each run). Files whose score meets or exceeds the threshold are deleted, letting you tune how aggressively to reclaim space. Both staged results and literal caches participate. Use `tracked_files()` or `active_streams()` to inspect current state.

Divide-by-zero scenarios raise the custom `DivideByZeroError` so callers can distinguish mathstream issues from Python's native exceptions.


@@ -1,11 +1,12 @@
from .engine import clear_logs, add, sub, mul, div, mod, pow, is_even, is_odd
from .exceptions import MathStreamError, DivideByZeroError
from .number import StreamNumber
from .utils import collect_garbage
from .number import StreamNumber, free_stream, active_streams
from .utils import collect_garbage, tracked_files
__all__ = [
    "clear_logs",
    "collect_garbage",
    "tracked_files",
    "add",
    "sub",
    "mul",
@@ -15,6 +16,8 @@ __all__ = [
    "is_even",
    "is_odd",
    "StreamNumber",
    "free_stream",
    "active_streams",
    "MathStreamError",
    "DivideByZeroError",
]


@@ -1,8 +1,15 @@
import hashlib
import weakref
from collections import Counter
from pathlib import Path
from typing import Optional, Union
from typing import Dict, Optional, Union
from .utils import register_log_file, register_reference, touch_log_file
from .utils import (
    register_log_file,
    register_reference,
    touch_log_file,
    release_reference,
)
LOG_DIR = Path("./instance/log")
@@ -60,11 +67,19 @@ class StreamNumber:
        if not self.path.exists():
            raise FileNotFoundError(self.path)
        self.hash = hashlib.sha1(str(self.path).encode()).hexdigest()[:10]
        self._normalized_path = str(self.path.resolve())
        self._released = False
        _increment_active(self.path)
        if _is_in_log_dir(self.path):
            register_log_file(self.path)
            register_reference(self.path)
        self.hash = hashlib.sha1(str(self.path).encode()).hexdigest()[:10]
        self._finalizer = weakref.finalize(
            self, _finalize_instance, self._normalized_path
        )

    def __repr__(self):
        return f"<StreamNumber {self.path.name}>"
@@ -85,3 +100,52 @@ class StreamNumber:
f.write(data.encode())
register_log_file(stage_file)
return stage_file

    def free(self, *, delete_file: bool = True) -> None:
        """Release this stream's reference and optionally delete the staged file."""
        if self._released:
            return
        self._released = True
        if self._finalizer.alive:
            self._finalizer.detach()
        _decrement_active(Path(self._normalized_path), delete_file=delete_file)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.free()


_ACTIVE_COUNTER: Counter[str] = Counter()


def _increment_active(path: Path) -> None:
    key = str(path.resolve())
    _ACTIVE_COUNTER[key] += 1


def _decrement_active(path: Path, delete_file: bool = True) -> None:
    key = str(path.resolve())
    current = _ACTIVE_COUNTER.get(key, 0)
    if current <= 1:
        _ACTIVE_COUNTER.pop(key, None)
    else:
        _ACTIVE_COUNTER[key] = current - 1
    if _is_in_log_dir(path):
        release_reference(path, delete_file=delete_file)


def _finalize_instance(path_str: str) -> None:
    _decrement_active(Path(path_str))


def free_stream(number: StreamNumber, *, delete_file: bool = True) -> None:
    """Convenience helper mirroring manual memory management semantics."""
    number.free(delete_file=delete_file)


def active_streams() -> Dict[str, int]:
    """Return the active StreamNumber paths mapped to in-memory reference counts."""
    return dict(_ACTIVE_COUNTER)


@@ -1,7 +1,7 @@
import sqlite3
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterable, List
from typing import Iterable, List, Dict
LOG_DB_PATH = Path("./instance/mathstream_logs.sqlite")
@@ -178,3 +178,43 @@ def collect_garbage(score_threshold: float) -> list[Path]:
    _delete_records(removed)
    return removed


def release_reference(path: Path, delete_file: bool = True) -> bool:
    """Decrease the reference count and optionally delete the file when it hits zero."""
    normalized = _normalize_paths([path])[0]
    _ensure_db()
    with sqlite3.connect(LOG_DB_PATH) as conn:
        row = conn.execute(
            "SELECT ref_count FROM refs WHERE path = ?", (normalized,)
        ).fetchone()
        if row is None:
            return False
        current = row[0] or 0
        new_count = max(current - 1, 0)
        if new_count > 0:
            conn.execute(
                "UPDATE refs SET ref_count = ? WHERE path = ?", (new_count, normalized)
            )
            conn.commit()
            return False
        conn.execute("DELETE FROM refs WHERE path = ?", (normalized,))
        conn.execute("DELETE FROM logs WHERE path = ?", (normalized,))
        conn.commit()
    removed = False
    if delete_file and path.exists():
        try:
            path.unlink()
            removed = True
        except OSError:
            removed = False
    return removed


def tracked_files() -> Dict[str, int]:
    """Return a mapping of tracked file paths to their reference counts."""
    _ensure_db()
    with sqlite3.connect(LOG_DB_PATH) as conn:
        rows = conn.execute("SELECT path, ref_count FROM refs").fetchall()
    return {path: ref_count for path, ref_count in rows}