mathy/mathstream/README.md
2025-11-05 16:35:15 +01:00

# Mathstream
*Scalable streamed arithmetic for ultra-large integers, chunked from disk.*
Do math on numbers too big for RAM by streaming digits off disk. Feed the library paths or literals, compose operations, and keep memory flat while outputs land in `instance/log`.
## Why?
Traditional `int` types break when your numbers don't fit in memory. `mathstream` trades RAM for disk by using streamed digit files, letting you work with absurdly large integers on normal machines.
Perfect for:
- Iterative transformations like Collatz walks
- Memory-constrained pipelines
- Low-level big-number experimentation
- Long-running experiments where deterministic cleanup beats GC guesswork
## Quick Demo
```python
from mathstream import StreamNumber, add
a = StreamNumber(literal="999999999999999999")
b = StreamNumber(literal="1")
print("sum =", "".join(add(a, b).stream()))
```
## Installation
```bash
python -m venv venv
source venv/bin/activate
pip install -e .
```
## Usage
```python
from mathstream import mul, StreamNumber
a = StreamNumber("path/to/big.txt")
b = StreamNumber(literal="1337")
result = mul(a, b)
print("".join(result.stream()))
# The same helpers are available via Python operators.
total = a + b      # calls mathstream.add under the hood
ratio = total / 2  # literal coercion is automatic
# The context manager calls .free() on temp when the block exits.
with StreamNumber(literal="10") as temp:
    product = temp * ratio
```
Available operations:
- Core arithmetic: `add`, `sub`, `mul`, `div`, `mod`, `pow`
- Introspection & helpers: `is_even`, `is_odd`, `free_stream`, `active_streams`, `tracked_files`
- Lifecycle control: `collect_garbage`, `set_manual_free_only`, `manual_free_only_enabled`
- Environment helpers: `clear_logs`, `StreamNumber.write_stage`, `engine.LOG_DIR`
- Python sugar: `StreamNumber` implements `+`, `-`, `*`, `/`, `%`, `**`, and their reflected counterparts. Raw `int`, `str`, or `pathlib.Path` operands are coerced automatically.
- Context manager support: `with StreamNumber(...) as sn:` ensures `.free()` is called at exit.
- Module entry point: `python -m mathstream` launches the interactive streamed-math REPL from `stream_repl.py`.
## How It Works
- **Streamed operands:** `StreamNumber` wraps either a user-supplied digit file or an integer literal materialised in `LOG_DIR`. Data is read linearly in configurable chunks and is never promoted to a Python `int`.
- **Staging directory:** Every operation writes its result into `mathstream.number.LOG_DIR` (default `instance/log`). File names include hashes of the input paths, so repeated calls reuse the same staged copies.
- **Bookkeeping database:** `mathstream/utils.py` maintains `instance/mathstream_logs.sqlite`, which records creation time, last access, total access count, and reference counts. This drives GC decisions and makes it easy to audit what's on disk.
- **Reference counting:** Every `StreamNumber` bumps a ref count both in sqlite and in in-process counters. Dropping the last reference (or calling `free_stream`) decrements the counts and optionally unlinks the file immediately.
- **Manual-only mode:** Call `set_manual_free_only(True)` when you want absolute control over lifecycle. The weakref finaliser stops deleting staged files, so outputs persist until you call `.free()` or `collect_garbage()`.
- **Zero-copy chaining:** Because staged files stay on disk, you can pass `StreamNumber` handles between processes or reuse them in later runs without recomputing.
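
To make the bookkeeping and reference-counting bullets concrete, here is a minimal standalone sketch of a sqlite-backed ref counter. It does not reproduce the code in `mathstream/utils.py`: the table name, column names, and the `open_tracker`, `acquire`, `release`, and `ref_count` helpers are all assumptions made for illustration (the actual schema is described in `WORK.md`).

```python
import sqlite3
import time

# Hypothetical schema, for illustration only; see WORK.md for the real one.
SCHEMA = """
CREATE TABLE IF NOT EXISTS staged_files (
    path         TEXT PRIMARY KEY,
    created_at   REAL NOT NULL,
    last_access  REAL NOT NULL,
    access_count INTEGER NOT NULL DEFAULT 0,
    ref_count    INTEGER NOT NULL DEFAULT 0
)
"""


def open_tracker(db_path=":memory:"):
    """Open (or create) the tracking database and make sure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def ref_count(conn, path):
    """Return the current reference count for a path (0 if it is untracked)."""
    row = conn.execute(
        "SELECT ref_count FROM staged_files WHERE path = ?", (path,)
    ).fetchone()
    return row[0] if row else 0


def acquire(conn, path):
    """Register one more live reference to a staged file; return the new count."""
    now = time.time()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO staged_files (path, created_at, last_access) "
            "VALUES (?, ?, ?)",
            (path, now, now),
        )
        conn.execute(
            "UPDATE staged_files SET ref_count = ref_count + 1, "
            "access_count = access_count + 1, last_access = ? WHERE path = ?",
            (now, path),
        )
    return ref_count(conn, path)


def release(conn, path):
    """Drop one reference; return True once no references remain."""
    with conn:
        conn.execute(
            "UPDATE staged_files SET ref_count = MAX(ref_count - 1, 0) "
            "WHERE path = ?",
            (path,),
        )
    return ref_count(conn, path) == 0


tracker = open_tracker()
acquire(tracker, "instance/log/example.txt")
if release(tracker, "instance/log/example.txt"):
    # At this point a real tracker could unlink the staged file.
    print("no references left")
```

In this sketch the caller decides what to do when `release` returns `True`, which loosely mirrors the split between automatic deletion and manual-only mode described above.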
## Performance Tips
- Reuse literal `StreamNumber` objects to avoid rewriting identical data.
- Call `free_stream(...)` or use context managers to drop staged results quickly.
- Run `collect_garbage(score_threshold)` to purge stale intermediates.
- Keep an eye on disk space in `instance/log`—streaming shifts the pressure from RAM to storage.
- For huge literals (10⁶+ digits), generate them directly on disk and wrap the path instead of passing `literal=...`.
- Tweak `StreamNumber.stream(chunk_size)` to balance syscalls vs. memory: large chunks speed up CPU-bound math, smaller chunks play nicer with slow disks.
- If you are scripting long sessions, snapshot `tracked_files()` periodically; it's an easy indicator of leaked references.
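
The tip about generating huge literals directly on disk can be sketched without touching the library at all. The helpers below, `write_repeated_digits` and `iter_chunks`, are hypothetical names and not part of `mathstream`; they only show how a digit file might be written and read back in fixed-size chunks so that memory use stays bounded by `chunk_size`.

```python
import os
import tempfile


def write_repeated_digits(path, pattern, total_digits, chunk_size=1 << 16):
    """Write total_digits digits to path by repeating pattern, one block at a time."""
    if not (pattern.isascii() and pattern.isdigit()):
        raise ValueError("pattern must be a non-empty string of ASCII digits")
    # The block is a whole number of pattern repeats, so the pattern
    # continues seamlessly from one write to the next.
    block = pattern * max(1, chunk_size // len(pattern))
    written = 0
    with open(path, "w", encoding="ascii") as fh:
        while written < total_digits:
            piece = block[: total_digits - written]
            fh.write(piece)
            written += len(piece)


def iter_chunks(path, chunk_size=1 << 16):
    """Yield the file's contents in pieces of at most chunk_size characters."""
    with open(path, "r", encoding="ascii") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            yield chunk


# Build a one-million-digit number on disk without holding it in memory.
big_path = os.path.join(tempfile.mkdtemp(), "big.txt")
write_repeated_digits(big_path, "9", 1_000_000)
total = sum(len(chunk) for chunk in iter_chunks(big_path))
print("digits on disk:", total)
```

A file produced this way can then be wrapped by path, as in the Usage section (`StreamNumber("path/to/big.txt")`), instead of being passed through `literal=...`.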
### Common Pitfalls & Recoveries
- **Accidentally freed files:** Automatic finalizers may delete staged outputs while you still hold the path elsewhere. Fix: call `set_manual_free_only(True)` at the start of long-lived workflows, or pass `delete_file=False` to `free_stream` when you need to keep the digits around manually.
- **Operator coercion surprises:** Arithmetic operators turn `int`, `str`, or `Path` operands into streamed numbers. If a string happens to be a *file path* instead of a literal, the actual file will be wrapped. Fix: be explicit (`StreamNumber(literal="...")`) when in doubt.
- **Literal churn:** Recreating the same `StreamNumber(literal="123")` millions of times hammers the filesystem. Fix: stash the first instance, or cache the `.path` and rely on `StreamNumber(existing_path)` in hot loops.
- **GC too aggressive:** Running `collect_garbage(0)` after every operation removes recently written files. Fix: raise the threshold (e.g., `collect_garbage(1000)`) or run GC only after you've freed all references.
- **Chunk mismatch:** Some editors save files with BOMs or commas, in which case `_normalize_stream` will raise `ValueError("Non-digit characters found...")`. Fix: sanitise input files so they contain only ASCII digits with an optional leading sign.
- **Disk exhaustion:** Terabyte-scale runs fill `instance/log`. Fix: relocate `engine.LOG_DIR` to a larger volume, or run periodic `collect_garbage` sweeps and archive intermediate files.
- **Concurrency surprises:** Multiple processes writing to the same `LOG_DIR` share the sqlite tracker. Ensure each writer calls `free_stream` and `collect_garbage` responsibly, or isolate runs by changing `LOG_DIR` per worker.
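
For the BOM and comma case, one possible pre-processing step is to stream the raw file through a small filter before handing it to the library. The sketch below is not part of `mathstream`; `sanitise_digit_file` and `SEPARATORS` are hypothetical names, and the set of characters treated as harmless separators is an assumption you may want to tighten.

```python
import os
import tempfile

# Characters silently dropped by this sketch (an assumption, adjust as needed).
SEPARATORS = set(", _\t\r\n")


def sanitise_digit_file(src, dst, chunk_size=1 << 16):
    """Copy src to dst, keeping only an optional leading sign and ASCII digits.

    Raises ValueError on any other character. If that happens, dst may be
    left partially written and should be discarded.
    """
    seen_any = False  # True once the sign or the first digit has been kept
    # "utf-8-sig" transparently drops a UTF-8 byte-order mark if present.
    with open(src, "r", encoding="utf-8-sig") as fin, \
            open(dst, "w", encoding="ascii") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            kept = []
            for ch in chunk:
                if ch in SEPARATORS:
                    continue
                if ch in "+-" and not seen_any:
                    kept.append(ch)  # a sign is only valid before any digit
                elif "0" <= ch <= "9":
                    kept.append(ch)
                else:
                    raise ValueError(f"unexpected character {ch!r} in {src}")
                seen_any = True
            fout.write("".join(kept))


# Example: a file saved with a BOM, thousands separators, and a newline.
workdir = tempfile.mkdtemp()
raw = os.path.join(workdir, "raw.txt")
clean = os.path.join(workdir, "clean.txt")
with open(raw, "w", encoding="utf-8-sig") as fh:
    fh.write("-1,234,567\n")
sanitise_digit_file(raw, clean)
```

The per-character loop favours clarity over speed; for very large inputs a `str.translate` pass followed by a single validation per chunk would likely be faster.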
## Tools and Experiments
- `test.py`: Regression smoke test covering all arithmetic helpers.
- `collatz.py` / `collatz_ui/`: Curses dashboard that streams Collatz sequences.
- `seed_start.py`: Seeds `start.txt` via streamed additions from various sources.
- `find_my.py` + `pi_finder/`: Nilakantha-based π explorer that writes results to `found.pi`.
- `stream_repl.py` / `python -m mathstream`: Interactive REPL for streamed math (supports `save <var> <path>`, `:show`, `:purge`, `:cleanmode`, `:stats`, and `exit -s` to keep staging files).
- `WORK.md`: Deep dive into architecture (Logger DB schema, reference lifetimes, cleanup flow).
- `collatz_ui/views.py`: Reference implementation of a threaded worker that coordinates streamed math and curses rendering without blocking.
- `pi_finder/engine.py`: Example of building high-precision algorithms (Nilakantha π) purely via streamed primitives, including manual caching of million-digit scale factors.
## Extending
You can:
- Implement custom storage backends (e.g., S3-backed digit files).
- Compose primitives to build new helpers (gcd, factorial, etc.).
- Point `engine.LOG_DIR` at your own staging directory before running operations.
- Add new operations by mirroring the pattern in `mathstream/engine.py`: normalise inputs with `_normalize_stream`, perform chunk-based math, then `_write_result(...)`.
- Build higher-level services (REST APIs, workers, dashboards) by sharing staged file paths instead of raw numbers.
- Layer parity or divisibility checks by reading the streamed digits lazily; there's no requirement to materialise entire outputs unless you need them.
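
As an illustration of the lazy parity and divisibility idea, the sketch below works directly on a plain digit file rather than through `mathstream`. `is_even_file` and `mod_small` are hypothetical helpers, not library functions, and they assume a non-negative number stored as ASCII digits with at most surrounding whitespace. Parity needs only the last digit, while the remainder modulo a small integer can be accumulated chunk by chunk with Horner's rule.

```python
import os
import tempfile


def is_even_file(path):
    """Decide parity from the last digit only, scanning back past any newline."""
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        pos = fh.tell()
        while pos > 0:
            pos -= 1
            fh.seek(pos)
            byte = fh.read(1)
            if byte.isdigit():
                return byte in b"02468"
    raise ValueError(f"no digits found in {path}")


def mod_small(path, modulus, chunk_size=1 << 16):
    """Remainder of the number in path modulo a small positive int.

    Horner's rule: r = (r * 10 + digit) % modulus for each digit from the
    most significant end, so only one chunk is in memory at a time.
    """
    if modulus <= 0:
        raise ValueError("modulus must be positive")
    remainder = 0
    with open(path, "r", encoding="ascii") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            for ch in chunk.strip():
                remainder = (remainder * 10 + int(ch)) % modulus
    return remainder


demo_path = os.path.join(tempfile.mkdtemp(), "n.txt")
with open(demo_path, "w", encoding="ascii") as fh:
    fh.write("1234567890123456789\n")
print("even:", is_even_file(demo_path))
print("mod 7:", mod_small(demo_path, 7))
```

The same accumulate-per-chunk shape could be applied to digits yielded by `StreamNumber.stream(chunk_size)`, assuming that method yields digit strings as the Quick Demo suggests.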
## Contributing
Open to PRs for:
- New streamed math operations and optimizations.
- Smarter garbage collection / tooling around the sqlite tracker.
- Experiments that showcase creative uses (Collatz encoding, π spigots, etc.).
Please lint with `ruff` and follow the existing streaming patterns.
## License
MIT. Use it, remix it, but keep backups—massive streamed math can chew through SSDs fast. 😅