better readme for mathstream

2025-11-05 14:20:36 +01:00 · 60568933e6
commit 60568933e6
parent 3f8a4a5bf0
1 changed file with 83 additions and 69 deletions
--- a/mathstream/README.md
+++ b/mathstream/README.md
@@ -1,8 +1,31 @@
-# Mathstream Library
+# Mathstream

-`mathstream` offers streamed, string-based arithmetic for very large integers that you may not want to load entirely into memory. Instead of parsing numbers into Python `int` values, you work with digit files on disk via `StreamNumber` and call math operations that operate chunk-by-chunk.
+*Scalable streamed arithmetic for ultra-large integers, chunked from disk.*

-## Quick Start
+Do math on numbers too big for RAM by streaming digits off disk. Feed the library paths or literals, compose operations, and keep memory flat while outputs land in `instance/log`.
+
+## Why?
+
+Python’s built-in `int` is arbitrary-precision, but the whole value still has to fit in memory. `mathstream` trades RAM for disk by working on streamed digit files, letting you handle absurdly large integers on ordinary machines.
+
+Perfect for:
+- Iterative transformations like Collatz walks
+- Memory-constrained pipelines
+- Low-level big-number experimentation
+- Long-running experiments where deterministic cleanup beats GC guesswork
+
+## Quick Demo
+
+```python
+from mathstream import StreamNumber, add
+
+a = StreamNumber(literal="999999999999999999")
+b = StreamNumber(literal="1")
+
+print("sum =", "".join(add(a, b).stream()))
+```
+
+## Installation

 ```bash
 python -m venv venv
@@ -10,90 +33,81 @@ source venv/bin/activate
 pip install -e .
 ```

-Create digit files anywhere you like (the examples below use `instance/log`), or supply ad-hoc literals, then construct `StreamNumber` objects and call the helpers:
+## Usage

 ```python
-from mathstream import (
-    StreamNumber,
-    add,
-    sub,
-    mul,
-    div,
-    mod,
-    pow,
-    is_even,
-    is_odd,
-    free_stream,
-    collect_garbage,
-)
+from mathstream import mul, StreamNumber

-a = StreamNumber("instance/log/huge.txt")
-b = StreamNumber(literal="34567")
-e = StreamNumber(literal="3")
+a = StreamNumber("path/to/big.txt")
+b = StreamNumber(literal="1337")

-print("sum =", "".join(add(a, b).stream()))
-print("difference =", "".join(sub(a, b).stream()))
-print("product =", "".join(mul(a, b).stream()))
-print("quotient =", "".join(div(a, b).stream()))
-print("modulo =", "".join(mod(a, b).stream()))
-print("power =", "".join(pow(a, e).stream()))
-print("a is even?", is_even(a))
-print("b is odd?", is_odd(b))
-
-# drop staged artifacts immediately when you are done
-free_stream(b)
-
-# reclaim space for files whose age outweighs their use
-collect_garbage(0.5)
+result = mul(a, b)
+print("".join(result.stream()))
 ```

-Each arithmetic call writes its result back into `instance/log` (configurable via `mathstream.number.LOG_DIR`) so you can stream the digits later or reuse them in further operations.
+Available operations:
+- Core arithmetic: `add`, `sub`, `mul`, `div`, `mod`, `pow`
+- Introspection & helpers: `is_even`, `is_odd`, `free_stream`, `active_streams`, `tracked_files`
+- Lifecycle control: `collect_garbage`, `set_manual_free_only`, `manual_free_only_enabled`
+- Environment helpers: `clear_logs`, `StreamNumber.write_stage`, `engine.LOG_DIR`

-## Core Concepts
+## How It Works

- **StreamNumber(path | literal=...)** – Wraps a digit text file or creates one for an integer literal inside `LOG_DIR`. Literal operands are persisted as `literal_<hash>.txt`, so repeated runs reuse the same staged file (note that `clear_logs()` removes these cache files too).
- **`.stream(chunk_size)`** – Yields strings of digits with the provided chunk size. Operations in `mathstream.engine` consume these streams to avoid loading the entire number at once.
- **Automatic staging** – Outputs are stored under `LOG_DIR` with hashes based on input file paths, letting you compose operations without manual bookkeeping.
- **Sign-aware** – Addition, subtraction, multiplication, division (`//` behavior), modulo, and exponentiation (non-negative exponents) all respect operand sign. Division/modulo follow Python’s floor-division rules.
- **Utilities** – `clear_logs()` wipes prior staged results so you can start fresh.
- **Manual freeing** – Call `stream.free()` (or `free_stream(stream)`) once you are done with a staged number to release its reference immediately. Logger metadata keeps per-path reference counts so the final free removes the backing file on the spot.
- **GC toggle** – Need total control over when files disappear? Flip `mathstream.set_manual_free_only(True)` so automatic finalizers stop unlinking staged files; they will persist until you call `free()` (or `collect_garbage`). Use `mathstream.manual_free_only_enabled()` to inspect the current setting.
- **Parity helpers** – `is_even` and `is_odd` inspect the streamed digits without materializing the integer.
- **Garbage collection** – `collect_garbage(score_threshold)` computes a score from file age, access count, and reference count (tracked in `instance/mathstream_logs.sqlite`, freshly truncated each run). Files whose score meets or exceeds the threshold are deleted, letting you tune how aggressively to reclaim space. Both staged results and literal caches participate. Use `tracked_files()` or `active_streams()` to inspect current state.
-
-Divide-by-zero scenarios raise the custom `DivideByZeroError` so callers can distinguish mathstream issues from Python’s native exceptions.
+- **Streamed operands** – `StreamNumber` wraps either a user-supplied digit file or an integer literal materialised in `LOG_DIR`. Data is read linearly in configurable chunks, never promoted to a Python `int`.
+- **Staging directory** – Every operation writes results into `mathstream.number.LOG_DIR` (default `instance/log`). File names include hashes of input paths so repeated calls reuse the same staged copies.
+- **Bookkeeping database** – `mathstream/utils.py` keeps `instance/mathstream_logs.sqlite`, recording creation time, last access, total access count, and reference counts. This powers GC decisions and makes it trivial to audit what’s on disk.
+- **Reference counting** – Every `StreamNumber` bumps a ref count in sqlite and in-process counters. Dropping the last reference (or calling `free_stream`) decrements counts and optionally unlinks the file immediately.
+- **Manual-only mode** – Call `set_manual_free_only(True)` when you want absolute control over lifecycle. The weakref finaliser stops deleting staged files, so outputs persist until you call `.free()` or `collect_garbage()`.
+- **Zero-copy chaining** – Since staged files stay on disk, you can pass `StreamNumber` handles between processes or reuse them in later runs without recomputing.
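
To make the "read linearly in chunks, never promoted to a full `int`" idea concrete, here is a minimal sketch of chunked addition over two digit files. It is an illustration only, not mathstream's engine: `chunks_from_end`, `add_streamed`, and `render` are hypothetical names, and the sketch assumes both files hold nothing but ASCII digits (no sign, no trailing newline).

```python
import os


def chunks_from_end(path, chunk_size=4):
    """Yield digit chunks of a file, least-significant chunk first."""
    end = os.path.getsize(path)
    with open(path, "rb") as fh:
        while end > 0:
            start = max(0, end - chunk_size)
            fh.seek(start)
            yield fh.read(end - start).decode("ascii")
            end = start


def add_streamed(path_a, path_b, chunk_size=4):
    """Yield chunks of a + b, least-significant first (non-negative inputs only)."""
    base = 10 ** chunk_size
    it_a = chunks_from_end(path_a, chunk_size)
    it_b = chunks_from_end(path_b, chunk_size)
    carry = 0
    while True:
        ca = next(it_a, None)
        cb = next(it_b, None)
        if ca is None and cb is None:
            break
        # Only one small chunk per operand is ever converted to int.
        total = int(ca or "0") + int(cb or "0") + carry
        carry, rest = divmod(total, base)
        yield str(rest).zfill(chunk_size)
    if carry:
        yield str(carry)


def render(chunks_lsf):
    """Join least-significant-first chunks into one string (in memory, demo only)."""
    text = "".join(reversed(list(chunks_lsf))).lstrip("0")
    return text or "0"
```

Only `render` holds the whole result in memory; a disk-backed variant would write each chunk out as it is produced and reverse the chunk order afterwards.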

 ## Performance Tips

- **Reuse literal streams** – `StreamNumber(literal=...)` persists a hashed copy under `LOG_DIR`. Reuse those objects (or their filenames) across operations instead of recreating them every call. Repeated literal construction churns the filesystem: you pay the cost to rewrite identical data, poll the logger database, and spike disk I/O. Hang on to the staged literal or memoize it so it can be streamed repeatedly without rewriting.
- **Free aggressively** – When a staged result or literal copy is no longer needed, call `free_stream()` (or use `with StreamNumber(...) as n:`) so the reference count drops immediately. This keeps the cache tidy and reduces the chance that stale literal files pile up between runs.
+- Reuse literal `StreamNumber` objects to avoid rewriting identical data.
+- Call `free_stream(...)` or use context managers to drop staged results quickly.
+- Run `collect_garbage(score_threshold)` to purge stale intermediates.
+- Keep an eye on disk space in `instance/log`—streaming shifts the pressure from RAM to storage.
+- For huge literals (10⁶+ digits), generate them directly on disk and wrap the path instead of passing `literal=...`.
+- Tweak `StreamNumber.stream(chunk_size)` to balance syscalls vs. memory: large chunks speed up CPU-bound math, smaller chunks play nicer with slow disks.
+- If you are scripting long sessions, snapshot `tracked_files()` periodically; it’s an easy indicator of leaked references.
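
For the tip about generating huge literals directly on disk, a sketch along these lines may help. `write_repeated_digits` is a hypothetical helper, not part of mathstream; it writes a digit file one block at a time so the full number never sits in memory.

```python
def write_repeated_digits(path, pattern, total_digits, chunk_size=65536):
    """Write `total_digits` digits to `path` by repeating `pattern` block by block.

    Returns the number of digits written.
    """
    if not (pattern.isascii() and pattern.isdigit()) or pattern[0] == "0":
        raise ValueError("pattern must be ASCII digits without a leading zero")
    # Keep the block length a multiple of the pattern so the sequence stays periodic.
    reps = max(1, chunk_size // len(pattern))
    block = pattern * reps
    written = 0
    with open(path, "w", encoding="ascii") as fh:
        while written < total_digits:
            piece = block[: total_digits - written]
            fh.write(piece)
            written += len(piece)
    return written
```

The resulting path can then be wrapped with `StreamNumber(path)` instead of passing the digits through `literal=...`.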

-## Example Script
+### Common Pitfalls & Recoveries

-`test.py` in the repository root demonstrates a minimal workflow:
+- **Accidentally freed files** – Automatic finalizers may delete staged outputs while you still hold the path elsewhere. Fix: call `set_manual_free_only(True)` at the start of long-lived workflows, or pass `delete_file=False` to `free_stream` when you need to keep the digits around manually.
+- **Literal churn** – Recreating the same `StreamNumber(literal="123")` millions of times hammers the filesystem. Fix: stash the first instance, or cache the `.path` and rely on `StreamNumber(existing_path)` in hot loops.
+- **GC too aggressive** – Running `collect_garbage(0)` after every operation removes recently written files. Fix: raise the threshold (e.g., `collect_garbage(1000)`) or run GC only after you’ve freed all references.
+- **Malformed input** – Some editors save files with BOMs or commas. `_normalize_stream` will raise `ValueError("Non-digit characters found...")`. Fix: sanitise input files so they contain only ASCII digits with an optional leading sign.
+- **Disk exhaustion** – Terabyte-scale runs fill `instance/log`. Fix: relocate `engine.LOG_DIR` to a larger volume or run periodic `collect_garbage` sweeps and archive intermediate files.
+- **Concurrency surprises** – Multiple processes writing to the same `LOG_DIR` share the sqlite tracker. Ensure each writer calls `free_stream` and `collect_garbage` responsibly, or isolate runs by changing `LOG_DIR` per worker.
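
Input files can be checked before they reach the library. The sketch below is a hypothetical pre-flight helper (`first_bad_offset` is not a mathstream function): it scans a file in chunks and reports where the first non-digit byte sits. It treats every byte other than `0`-`9` and one leading sign as offending, including a trailing newline; whether mathstream itself tolerates trailing whitespace is not stated here, so that strictness is an assumption.

```python
def first_bad_offset(path, chunk_size=65536):
    """Return the byte offset of the first byte that is not an ASCII digit.

    One optional leading '+' or '-' is allowed. Returns None when no
    offending byte is found.
    """
    digits = frozenset(b"0123456789")
    offset = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            for i, byte in enumerate(chunk):
                if offset == 0 and i == 0 and byte in b"+-":
                    continue  # optional leading sign
                if byte not in digits:
                    return offset + i
            offset += len(chunk)
    return None
```

A UTF-8 BOM shows up as offset `0`, and a thousands separator as the offset of the first comma.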

-1. Writes sample operands to `tests/*.txt`.
-2. Calls every arithmetic primitive plus the modulo/parity helpers.
-3. Asserts that the streamed outputs match known values (helpful for quick regression checks).
+## Tools and Experiments

-Run it via:
-
-```bash
-python test.py
-```
+- `test.py` – Regression smoke test covering all arithmetic helpers.
+- `collatz.py` / `collatz_ui/` – Curses dashboard that streams Collatz sequences.
+- `seed_start.py` – Seeds `start.txt` via streamed additions from various sources.
+- `find_my.py` + `pi_finder/` – Nilakantha-based π explorer that writes results to `found.pi`.
+- `WORK.md` – Deep dive into architecture (Logger DB schema, reference lifetimes, cleanup flow).
+- `collatz_ui/views.py` – Reference implementation of a threaded worker that coordinates streamed math and curses rendering without blocking.
+- `pi_finder/engine.py` – Example of building high-precision algorithms (Nilakantha π) purely via streamed primitives, including manual caching of million-digit scale factors.

 ## Extending

- To hook into other storage backends, implement your own `StreamNumber` variant with the same `.stream()` interface.
- Need modulo or gcd? Compose the existing primitives (e.g., repeated subtraction or using `div` + remainder tracking inside `_divide_abs`) or add new helpers following the same streamed pattern.
- For more control over output locations, override `LOG_DIR` before using the operations:
+You can:
+- Implement custom storage backends (e.g., S3-backed digit files).
+- Compose primitives to build new helpers (gcd, factorial, etc.).
+- Point `engine.LOG_DIR` at your own staging directory before running operations.
+- Add new operations by mirroring the pattern in `mathstream/engine.py`: normalise inputs with `_normalize_stream`, perform chunk-based math, then `_write_result(...)`.
+- Build higher-level services (REST APIs, workers, dashboards) by sharing staged file paths instead of raw numbers.
+- Layer parity or divisibility checks by reading the streamed digits lazily; there’s no requirement to materialise entire outputs unless you need them.
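
As an example of a lazy divisibility check, the remainder modulo a small integer can be folded over the digit chunks from most significant to least significant. This is a sketch under assumptions: `mod_small` is a hypothetical helper, and it expects an iterable of unsigned digit strings such as the chunks a `.stream()` call yields for a non-negative number.

```python
def mod_small(chunks, modulus):
    """Return the streamed number modulo `modulus` without joining the chunks.

    `chunks` yields digit strings, most-significant chunk first.
    """
    if modulus <= 0:
        raise ValueError("modulus must be a positive integer")
    remainder = 0
    for chunk in chunks:
        if not chunk:
            continue  # ignore empty chunks
        # Shift the running remainder left by len(chunk) digits, then add the chunk.
        shift = pow(10, len(chunk), modulus)
        remainder = (remainder * shift + int(chunk)) % modulus
    return remainder
```

Memory use is bounded by the chunk size and the modulus, so a check such as `mod_small(n.stream(), 3) == 0` never needs the full number.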

-```python
-from mathstream import engine
-from pathlib import Path
+## Contributing

-engine.LOG_DIR = Path("/tmp/my_mathstage")
-engine.clear_logs()
-```
+Open to PRs for:
+- New streamed math operations and optimizations.
+- Smarter garbage collection / tooling around the sqlite tracker.
+- Experiments that showcase creative uses (Collatz encoding, π spigots, etc.).

-With these building blocks, you can manipulate arbitrarily large integers while keeping memory usage constant. Happy streaming!
+Please lint with `ruff` and follow the existing streaming patterns.
+
+## License
+
+MIT. Use it, remix it, but keep backups—massive streamed math can chew through SSDs fast. 😅