better readme for mathstream

Dominik Krenn 2025-11-05 14:20:36 +01:00
parent 3f8a4a5bf0
commit 60568933e6


@ -1,8 +1,31 @@
# Mathstream Library
# Mathstream
`mathstream` offers streamed, string-based arithmetic for very large integers that you may not want to load entirely into memory. Instead of parsing numbers into Python `int` values, you work with digit files on disk via `StreamNumber` and call math operations that operate chunk-by-chunk.
*Scalable streamed arithmetic for ultra-large integers, chunked from disk.*
## Quick Start
Do math on numbers too big for RAM by streaming digits off disk. Feed the library paths or literals, compose operations, and keep memory flat while outputs land in `instance/log`.
## Why?
Traditional `int` types break when your numbers don't fit in memory. `mathstream` trades RAM for disk by using streamed digit files, letting you work with absurdly large integers on normal machines.
Perfect for:
- Iterative transformations like Collatz walks
- Memory-constrained pipelines
- Low-level big-number experimentation
- Long-running experiments where deterministic cleanup beats GC guesswork
## Quick Demo
```python
from mathstream import StreamNumber, add
a = StreamNumber(literal="999999999999999999")
b = StreamNumber(literal="1")
print("sum =", "".join(add(a, b).stream()))
```
## Installation
```bash
python -m venv venv
source venv/bin/activate
pip install -e .
```
Create digit files anywhere you like (the examples below use `instance/log`), or supply ad-hoc literals, then construct `StreamNumber` objects and call the helpers:
## Usage
```python
from mathstream import (
StreamNumber,
add,
sub,
mul,
div,
mod,
pow,
is_even,
is_odd,
free_stream,
collect_garbage,
)
from mathstream import mul, StreamNumber
a = StreamNumber("instance/log/huge.txt")
b = StreamNumber(literal="34567")
e = StreamNumber(literal="3")
a = StreamNumber("path/to/big.txt")
b = StreamNumber(literal="1337")
print("sum =", "".join(add(a, b).stream()))
print("difference =", "".join(sub(a, b).stream()))
print("product =", "".join(mul(a, b).stream()))
print("quotient =", "".join(div(a, b).stream()))
print("modulo =", "".join(mod(a, b).stream()))
print("power =", "".join(pow(a, e).stream()))
print("a is even?", is_even(a))
print("b is odd?", is_odd(b))
# drop staged artifacts immediately when you are done
free_stream(b)
# reclaim space for files whose age outweighs their use
collect_garbage(0.5)
result = mul(a, b)
print("".join(result.stream()))
```
Each arithmetic call writes its result back into `instance/log` (configurable via `mathstream.number.LOG_DIR`) so you can stream the digits later or reuse them in further operations.
Available operations:
- Core arithmetic: `add`, `sub`, `mul`, `div`, `mod`, `pow`
- Introspection & helpers: `is_even`, `is_odd`, `free_stream`, `active_streams`, `tracked_files`
- Lifecycle control: `collect_garbage`, `set_manual_free_only`, `manual_free_only_enabled`
- Environment helpers: `clear_logs`, `StreamNumber.write_stage`, `engine.LOG_DIR`
## Core Concepts
## How It Works
- **StreamNumber(path | literal=...)** Wraps a digit text file or creates one for an integer literal inside `LOG_DIR`. Literal operands are persisted as `literal_<hash>.txt`, so repeated runs reuse the same staged file (note that `clear_logs()` removes these cache files too).
- **`.stream(chunk_size)`** Yields strings of digits with the provided chunk size. Operations in `mathstream.engine` consume these streams to avoid loading the entire number at once.
- **Automatic staging** Outputs are stored under `LOG_DIR` with hashes based on input file paths, letting you compose operations without manual bookkeeping.
- **Sign-aware** Addition, subtraction, multiplication, division (`//` behavior), modulo, and exponentiation (non-negative exponents) all respect operand sign. Division/modulo follow Python's floor-division rules.
- **Utilities** `clear_logs()` wipes prior staged results so you can start fresh.
- **Manual freeing** Call `stream.free()` (or `free_stream(stream)`) once you are done with a staged number to release its reference immediately. Logger metadata keeps per-path reference counts so the final free removes the backing file on the spot.
- **GC toggle** Need total control over when files disappear? Flip `mathstream.set_manual_free_only(True)` so automatic finalizers stop unlinking staged files; they will persist until you call `free()` (or `collect_garbage`). Use `mathstream.manual_free_only_enabled()` to inspect the current setting.
- **Parity helpers** `is_even` and `is_odd` inspect the streamed digits without materializing the integer.
- **Garbage collection** `collect_garbage(score_threshold)` computes a score from file age, access count, and reference count (tracked in `instance/mathstream_logs.sqlite`, freshly truncated each run). Files whose score meets or exceeds the threshold are deleted, letting you tune how aggressively to reclaim space. Both staged results and literal caches participate. Use `tracked_files()` or `active_streams()` to inspect current state.
Divide-by-zero scenarios raise the custom `DivideByZeroError` so callers can distinguish mathstream issues from Python's native exceptions.
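
The parity and sign bullets above can be illustrated without `mathstream` itself. The sketch below is an assumption-laden illustration, not the library's code: `parity_from_chunks` is a hypothetical helper that decides evenness from the final digit of a chunked stream, and the assertions spell out the Python floor-division convention the division and modulo bullets refer to.

```python
from typing import Iterable


def parity_from_chunks(chunks: Iterable[str]) -> bool:
    """Return True if the streamed decimal number is even.

    Hypothetical helper, not part of mathstream: it remembers only the
    most recent non-empty chunk, so memory use is bounded by chunk size.
    """
    last = ""
    for chunk in chunks:
        if chunk:
            last = chunk
    if not last or last[-1] not in "0123456789":
        raise ValueError("stream did not end in an ASCII digit")
    return int(last[-1]) % 2 == 0


# Python's floor division rounds toward negative infinity, and the
# remainder takes the sign of the divisor.
assert -7 // 2 == -4 and -7 % 2 == 1
assert 7 // -2 == -4 and 7 % -2 == -1

print(parity_from_chunks(["-1234", "5678"]))  # True: the last digit is 8
print(parity_from_chunks(["999", "1"]))       # False: the last digit is 1
```

Only the last digit matters for parity in base 10, which is why the whole number never has to be materialised.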
- **Streamed operands** `StreamNumber` wraps either a user-supplied digit file or an integer literal materialised in `LOG_DIR`. Data is read linearly in configurable chunks, never promoted to a Python `int`.
- **Staging directory** Every operation writes results into `mathstream.number.LOG_DIR` (default `instance/log`). File names include hashes of input paths so repeated calls reuse the same staged copies.
- **Bookkeeping database** `mathstream/utils.py` keeps `instance/mathstream_logs.sqlite`, recording creation time, last access, total access count, and reference counts. This powers GC decisions and makes it trivial to audit what's on disk.
- **Reference counting** Every `StreamNumber` bumps a ref count in sqlite and in-process counters. Dropping the last reference (or calling `free_stream`) decrements counts and optionally unlinks the file immediately.
- **Manual-only mode** Call `set_manual_free_only(True)` when you want absolute control over lifecycle. The weakref finaliser stops deleting staged files, so outputs persist until you call `.free()` or `collect_garbage()`.
- **Zero-copy chaining** Since staged files stay on disk, you can pass `StreamNumber` handles between processes or reuse them in later runs without recomputing.
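
To make the chunk-by-chunk idea concrete, here is a minimal standalone sketch that uses only the standard library. It is not how `mathstream.engine` is implemented. `stream_digits` and `mod_small` are hypothetical names: the first reads a digit file linearly in chunks, the second reduces a non-negative streamed number modulo a small integer while holding one chunk at a time.

```python
import os
import tempfile
from typing import Iterable, Iterator


def stream_digits(path: str, chunk_size: int = 4096) -> Iterator[str]:
    """Yield successive chunks of a digit file, most significant first."""
    with open(path, "r", encoding="ascii") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def mod_small(chunks: Iterable[str], modulus: int) -> int:
    """Reduce a streamed non-negative decimal number modulo a small int."""
    if modulus <= 0:
        raise ValueError("modulus must be positive")
    remainder = 0
    for chunk in chunks:
        chunk = chunk.strip()  # tolerate a trailing newline
        if not chunk:
            continue
        # Shift the running remainder left by len(chunk) decimal places,
        # then fold in the new chunk; only small ints are ever created.
        shift = pow(10, len(chunk), modulus)
        remainder = (remainder * shift + int(chunk)) % modulus
    return remainder


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "big.txt")
    with open(path, "w", encoding="ascii") as handle:
        handle.write("9" * 10_000)  # the number 10**10000 - 1
    print(mod_small(stream_digits(path, chunk_size=64), 7))
```

The chunk size bounds the largest intermediate value, which is the same trade-off the `chunk_size` argument of `.stream()` exposes.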
## Performance Tips
- **Reuse literal streams** `StreamNumber(literal=...)` persists a hashed copy under `LOG_DIR`. Reuse those objects (or their filenames) across operations instead of recreating them every call. Repeated literal construction churns the filesystem: you pay the cost to rewrite identical data, poll the logger database, and spike disk I/O. Hang on to the staged literal or memoize it so it can be streamed repeatedly without rewriting.
- **Free aggressively** When a staged result or literal copy is no longer needed, call `free_stream()` (or use `with StreamNumber(...) as n:`) so the reference count drops immediately. This keeps the cache tidy and reduces the chance that stale literal files pile up between runs.
- Reuse literal `StreamNumber` objects to avoid rewriting identical data.
- Call `free_stream(...)` or use context managers to drop staged results quickly.
- Run `collect_garbage(score_threshold)` to purge stale intermediates.
- Keep an eye on disk space in `instance/log`—streaming shifts the pressure from RAM to storage.
- For huge literals (10⁶+ digits), generate them directly on disk and wrap the path instead of passing `literal=...`.
- Tweak `StreamNumber.stream(chunk_size)` to balance syscalls vs. memory: large chunks speed up CPU-bound math, smaller chunks play nicer with slow disks.
- If you are scripting long sessions, snapshot `tracked_files()` periodically; it's an easy indicator of leaked references.
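
For the tip about generating huge literals directly on disk, one possible approach is sketched below. `write_repeated_digits` is a hypothetical helper, not part of `mathstream`; it writes a long digit string in bounded chunks so the full literal never exists in memory. The resulting path could then be wrapped with `StreamNumber(path)` as described above.

```python
import os
import tempfile


def write_repeated_digits(path: str, pattern: str, total_digits: int,
                          chunk_size: int = 1 << 16) -> None:
    """Write `total_digits` digits to `path` by repeating `pattern`.

    Only one block of roughly `chunk_size` characters is held in memory.
    """
    if not pattern or any(ch not in "0123456789" for ch in pattern):
        raise ValueError("pattern must be a non-empty string of ASCII digits")
    # Make each block a whole number of pattern repeats so that
    # consecutive blocks continue the pattern without a seam.
    block = pattern * max(1, chunk_size // len(pattern))
    remaining = total_digits
    with open(path, "w", encoding="ascii") as handle:
        while remaining > 0:
            piece = block[:remaining]
            handle.write(piece)
            remaining -= len(piece)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "huge.txt")
    write_repeated_digits(path, "123456789", 1_000_000)
    print(os.path.getsize(path))  # 1000000 bytes, one per digit
```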
## Example Script
### Common Pitfalls & Recoveries
`test.py` in the repository root demonstrates a minimal workflow:
- **Accidentally freed files** Automatic finalizers may delete staged outputs while you still hold the path elsewhere. Fix: call `set_manual_free_only(True)` at the start of long-lived workflows, or pass `delete_file=False` to `free_stream` when you need to keep the digits around manually.
- **Literal churn** Recreating the same `StreamNumber(literal="123")` millions of times hammers the filesystem. Fix: stash the first instance, or cache the `.path` and rely on `StreamNumber(existing_path)` in hot loops.
- **GC too aggressive** Running `collect_garbage(0)` after every operation removes recently written files. Fix: raise the threshold (e.g., `collect_garbage(1000)`) or run GC only after you've freed all references.
- **Chunk mismatch** Some editors save files with BOMs or commas. `_normalize_stream` will raise `ValueError("Non-digit characters found...")`. Fix: sanitise input files (only ASCII digits with optional leading sign).
- **Disk exhaustion** Terabyte-scale runs fill `instance/log`. Fix: relocate `engine.LOG_DIR` to a larger volume or run periodic `collect_garbage` sweeps and archive intermediate files.
- **Concurrency surprises** Multiple processes writing to the same `LOG_DIR` share the sqlite tracker. Ensure each writer calls `free_stream` and `collect_garbage` responsibly, or isolate runs by changing `LOG_DIR` per worker.
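
For the input-sanitising fix above, a pre-flight check can catch BOMs and separators before an operation raises. The sketch below is a standalone assumption, not the library's `_normalize_stream`: `first_invalid_offset` is a hypothetical helper that scans a file in binary chunks and reports the first byte that is neither an ASCII digit nor a single leading sign. It is deliberately strict, so it also flags a trailing newline, and it does not check that at least one digit is present.

```python
import os
import tempfile
from typing import Optional


def first_invalid_offset(path: str, chunk_size: int = 1 << 16) -> Optional[int]:
    """Return the offset of the first disallowed byte, or None if clean.

    Allowed bytes: ASCII '0'-'9' anywhere, plus '+' or '-' at offset 0.
    """
    offset = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return None
            for index, byte in enumerate(chunk):
                position = offset + index
                if 48 <= byte <= 57:  # ASCII codes for '0'..'9'
                    continue
                if position == 0 and byte in b"+-":
                    continue
                return position
            offset += len(chunk)


with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "clean.txt": b"-98765",
        "bom.txt": b"\xef\xbb\xbf123",  # UTF-8 byte order mark
        "commas.txt": b"1,234",
    }
    for name, payload in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as handle:
            handle.write(payload)
        print(name, first_invalid_offset(path))
```

Reading in binary mode matters here: a text-mode read with a BOM-aware encoding would silently hide the very bytes the check is meant to find.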
1. Writes sample operands to `tests/*.txt`.
2. Calls every arithmetic primitive plus the modulo/parity helpers.
3. Asserts that the streamed outputs match known values (helpful for quick regression checks).
## Tools and Experiments
Run it via:
```bash
python test.py
```
- `test.py` Regression smoke test covering all arithmetic helpers.
- `collatz.py` / `collatz_ui/` Curses dashboard that streams Collatz sequences.
- `seed_start.py` Seeds `start.txt` via streamed additions from various sources.
- `find_my.py` + `pi_finder/` Nilakantha-based π explorer that writes results to `found.pi`.
- `WORK.md` Deep dive into architecture (Logger DB schema, reference lifetimes, cleanup flow).
- `collatz_ui/views.py` Reference implementation of a threaded worker that coordinates streamed math and curses rendering without blocking.
- `pi_finder/engine.py` Example of building high-precision algorithms (Nilakantha π) purely via streamed primitives, including manual caching of million-digit scale factors.
## Extending
- To hook into other storage backends, implement your own `StreamNumber` variant with the same `.stream()` interface.
- Need modulo or gcd? Compose the existing primitives (e.g., repeated subtraction or using `div` + remainder tracking inside `_divide_abs`) or add new helpers following the same streamed pattern.
- For more control over output locations, override `LOG_DIR` before using the operations:
You can:
- Implement custom storage backends (e.g., S3-backed digit files).
- Compose primitives to build new helpers (gcd, factorial, etc.).
- Point `engine.LOG_DIR` at your own staging directory before running operations.
- Add new operations by mirroring the pattern in `mathstream/engine.py`: normalise inputs with `_normalize_stream`, perform chunk-based math, then `_write_result(...)`.
- Build higher-level services (REST APIs, workers, dashboards) by sharing staged file paths instead of raw numbers.
- Layer parity or divisibility checks by reading the streamed digits lazily; there's no requirement to materialise entire outputs unless you need them.
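
As an illustration of composing primitives into a new helper, Euclid's algorithm needs nothing beyond a modulo operation and a zero test. The sketch below is an assumption, not library code: `gcd_via` is a hypothetical function that takes both as parameters, because this README does not specify how to test a `StreamNumber` for zero. Plain `int` stand-ins keep the example runnable on its own; with `mathstream` you would pass its `mod` and your own zero check instead.

```python
from typing import Callable, TypeVar

N = TypeVar("N")


def gcd_via(mod: Callable[[N, N], N],
            is_zero: Callable[[N], bool],
            a: N, b: N) -> N:
    """Euclid's algorithm written purely in terms of a modulo primitive.

    Intended for non-negative operands; each step replaces (a, b) with
    (b, a mod b) until b becomes zero.
    """
    while not is_zero(b):
        a, b = b, mod(a, b)
    return a


# Stand-ins using built-in ints, so the sketch runs without mathstream.
print(gcd_via(lambda x, y: x % y, lambda x: x == 0, 1071, 462))  # 21
```

With streamed operands each iteration stages a new intermediate file, so freeing the previous remainder once it is no longer needed keeps disk use bounded.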
```python
from mathstream import engine
from pathlib import Path

engine.LOG_DIR = Path("/tmp/my_mathstage")
engine.clear_logs()
```
## Contributing
Open to PRs for:
- New streamed math operations and optimizations.
- Smarter garbage collection / tooling around the sqlite tracker.
- Experiments that showcase creative uses (Collatz encoding, π spigots, etc.).
With these building blocks, you can manipulate arbitrarily large integers while keeping memory usage constant. Happy streaming!
Please lint with `ruff` and follow the existing streaming patterns.
## License
MIT. Use it, remix it, but keep backups—massive streamed math can chew through SSDs fast. 😅