Architecture

Phases

Phase 1: Core tensor ops

  • Implemented a struct Tensor wrapping mir-algorithm's Slice!(float*, N).

  • Implemented matmul (matrix multiplication) with both a naive triple-loop implementation and a cache-friendly tiled (blocked) one.

  • SIMD support (e.g. core.simd) is planned.
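The repository itself is in D, but the difference between the two matmul variants can be sketched in C (the function names and tile size below are illustrative, not the repo's API):

```c
#include <stddef.h>

/* Naive triple-loop matmul: C = A * B, row-major,
 * A is M x K, B is K x N, C is M x N. */
void matmul_naive(const float *A, const float *B, float *C,
                  size_t M, size_t K, size_t N) {
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}

/* Tiled (blocked) variant: walk the matrices in T x T tiles so the
 * working set of A and B stays cache-resident across the inner loops. */
enum { T = 32 };  /* illustrative tile size, tuned per target cache */
void matmul_tiled(const float *A, const float *B, float *C,
                  size_t M, size_t K, size_t N) {
    for (size_t i = 0; i < M * N; i++)
        C[i] = 0.0f;
    for (size_t ii = 0; ii < M; ii += T)
        for (size_t kk = 0; kk < K; kk += T)
            for (size_t jj = 0; jj < N; jj += T)
                for (size_t i = ii; i < ii + T && i < M; i++)
                    for (size_t k = kk; k < kk + T && k < K; k++) {
                        const float a = A[i * K + k];
                        for (size_t j = jj; j < jj + T && j < N; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

The tiled version performs the same arithmetic in a different loop order; it becomes the natural place to add SIMD later, since the innermost loop is a contiguous scale-and-accumulate.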

Phase 2: GGUF parsing

  • GGUF reader in source/densor/format/gguf.d.

  • Map GGUF tensor types (F16, Q4_0, Q8_0) to internal storage.

  • Dequantization kernels (Q4_0 → F32) for inference.
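A sketch of the Q4_0 → F32 step, assuming ggml's Q4_0 layout (blocks of 32 weights: one f16 scale plus 16 bytes of packed 4-bit values, each dequantized as scale * (q - 8)). C for illustration; the struct and function names are hypothetical, not the repo's API:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Minimal IEEE half -> float conversion, so the sketch is self-contained. */
float f16_to_f32(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t mant = h & 0x3FF;
    uint32_t bits;
    if (exp == 0) {
        if (mant == 0) {
            bits = sign;                         /* signed zero */
        } else {                                  /* subnormal: renormalize */
            exp = 127 - 15 + 1;
            while (!(mant & 0x400)) { mant <<= 1; exp--; }
            mant &= 0x3FF;
            bits = sign | (exp << 23) | (mant << 13);
        }
    } else if (exp == 0x1F) {
        bits = sign | 0x7F800000u | (mant << 13); /* inf / NaN */
    } else {
        bits = sign | ((exp - 15 + 127) << 23) | (mant << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

#define QK4_0 32

typedef struct {
    uint16_t d;             /* f16 scale for the block */
    uint8_t  qs[QK4_0 / 2]; /* 4-bit quants, two per byte */
} block_q4_0;

/* Expand nblocks Q4_0 blocks into nblocks * 32 floats.
 * Byte j holds element j in its low nibble and element j+16 in its high one. */
void dequantize_q4_0(const block_q4_0 *b, float *y, size_t nblocks) {
    for (size_t i = 0; i < nblocks; i++) {
        const float d = f16_to_f32(b[i].d);
        for (int j = 0; j < QK4_0 / 2; j++) {
            const int x0 = (b[i].qs[j] & 0x0F) - 8;
            const int x1 = (b[i].qs[j] >> 4)   - 8;
            y[i * QK4_0 + j]             = x0 * d;
            y[i * QK4_0 + j + QK4_0 / 2] = x1 * d;
        }
    }
}
```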

Phase 3: CLIP

  • Layers for CLIP (Vision Transformer): Conv2d, LayerNorm, MultiHeadAttention, MLP.

Phase 4: Validation

  • Compare outputs layer by layer against the Python transformers library (Hugging Face) with a fixed seed and input.
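One minimal way to make that comparison concrete is a max-absolute-difference check between activation buffers dumped from both implementations (a hypothetical C helper; the reference buffer would come from the Python side):

```c
#include <stddef.h>
#include <math.h>

/* Largest elementwise |a[i] - b[i]| over n floats; compare the result
 * against a per-layer tolerance (e.g. looser for dequantized weights). */
float max_abs_diff(const float *a, const float *b, size_t n) {
    float m = 0.0f;
    for (size_t i = 0; i < n; i++) {
        const float d = fabsf(a[i] - b[i]);
        if (d > m)
            m = d;
    }
    return m;
}
```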