                 Changes in version 0.4.1 (2026-05-26)                  

CRAN Resubmission Fixes

  - Stderr references in compiled objects (CRAN auto-check NOTE on
    Debian): the previous CRAN cleanup (commit d8870bd) added stdio
    suppression to 7 upstream files but missed ggml/ggml.c and
    ggml/ggml-opt.cpp. Both now include the same #ifdef USING_R macro
    block that neutralizes printf, fprintf, fputs, fflush, stderr, and
    stdout. These calls were diagnostic-only and were already silent at
    runtime via the installed log callback; now the symbols never reach
    the compiled object files either.

                        Changes in version 0.4.0                        

Structured Output, Embeddings, RAG, and API Server

New Features

  - Grammar-constrained generation (edge_grammar_completion()): Force
    model output to conform to a GBNF grammar specification. Ensures
    valid, parseable structured output (JSON, enums, numbers, etc.)
    using llama.cpp's native grammar sampler.

  - JSON schema helper (edge_json_grammar()): Convert a simple R list
    schema into a GBNF grammar string. Supports string, number, integer,
    boolean fields and enum (character vector) constraints.

  - Structured data extraction (edge_extract()): High-level function
    that combines prompt construction with grammar-constrained
    generation to extract structured data from text. Returns a parsed R
    list (requires jsonlite).

  - Text classification (edge_classify()): Classify text into predefined
    categories using grammar constraints. Supports single text and batch
    (vectorized) classification. Output is guaranteed to be one of the
    specified categories.

  - Text embeddings (edge_embeddings()): Extract dense vector embeddings
    from any loaded model. Returns a numeric matrix (n_texts x n_embd)
    suitable for clustering, semantic search, similarity computation,
    and RAG pipelines. Supports optional L2 normalization.

  - Cosine similarity (edge_similarity(), edge_similarity_matrix()):
    Compute pairwise cosine similarity between embedding vectors. The
    matrix version computes all-pairs similarity efficiently with a
    single matrix multiplication over normalized vectors.

  - Embedding dimension query (edge_model_n_embd()): Query the embedding
    dimension of a loaded model.

  - Batch processing (edge_map()): Apply a prompt template over a vector
    of texts with progress reporting. Supports both string templates
    with {text} placeholder and custom prompt functions. Optional
    grammar constraint for structured batch output.

  - Batch extraction (edge_extract_batch()): Extract structured data
    from multiple texts, returning a data frame with one row per input.

  - RAG document indexing (edge_index_documents()): Build a semantic
    embedding index from a directory of text files or a character
    vector. Automatic chunking with configurable size and overlap.

  - RAG semantic search (edge_search()): Find the most relevant text
    chunks for a query using cosine similarity over the embedding index.

  - RAG question answering (edge_ask()): Retrieval-augmented generation
    that retrieves relevant context from an index and generates a
    grounded answer. Supports custom system prompts and optional context
    return for debugging/transparency.

  - Plumber API server (edge_serve()): Serve a model as a local
    OpenAI-compatible REST API. Endpoints: /v1/completions,
    /v1/chat/completions, /v1/embeddings, /v1/models, /health. Supports
    optional API key authentication and CORS. Requires plumber.

  - Qwen3 model family in edge_list_models(): Added Qwen3-0.6B, 1.7B,
    4B, and 8B pre-configured entries from the unsloth GGUF repository.

  - Friendly names in edge_download_model(): Now accepts model names
    from edge_list_models() (e.g., edge_download_model("Qwen3-0.6B")) in
    addition to HuggingFace repo IDs. Filename is auto-resolved from the
    model registry.

  - httr download fallback: .robust_download() now tries httr::GET
    before R's download.file, improving reliability on corporate
    networks with custom SSL certificates or proxy configurations.

  - SIMD optimization warning: On package load, warns if running without
    SIMD (generic mode) and suggests reinstalling from source with
    EDGEMODELR_SIMD=NATIVE for faster inference.
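
The chunking with configurable size and overlap mentioned for
edge_index_documents() can be pictured with the minimal sketch below.
chunk_text() is a hypothetical helper, not part of the package, and it
counts chunk size and overlap in words; the package itself may count
characters or tokens instead.

```r
# Hypothetical helper (not the package's implementation): split a text
# into word-based chunks where consecutive chunks share `overlap` words.
chunk_text <- function(text, chunk_size = 200L, overlap = 50L) {
  stopifnot(chunk_size > 0, overlap >= 0, overlap < chunk_size)
  words <- strsplit(trimws(text), "\\s+")[[1]]
  words <- words[nzchar(words)]
  n <- length(words)
  if (n == 0L) return(character(0))

  step <- chunk_size - overlap
  starts <- seq(1, n, by = step)
  # Drop a start position when the previous chunk already reached the
  # last word, so the final chunk is never a pure repeat.
  starts <- starts[starts == 1 | (starts + overlap - 1) < n]

  vapply(
    starts,
    function(s) paste(words[s:min(s + chunk_size - 1, n)], collapse = " "),
    character(1)
  )
}

txt <- paste(paste0("w", 1:10), collapse = " ")
chunk_text(txt, chunk_size = 4, overlap = 1)
# "w1 w2 w3 w4"  "w4 w5 w6 w7"  "w7 w8 w9 w10"
```

Overlap keeps a sentence that straddles a chunk boundary retrievable
from at least one chunk, at the cost of embedding some words twice.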

Bug Fixes

  - Fixed grammar-constrained generation failures (issue #41):
    edge_grammar_completion(), edge_extract(), and edge_extract_batch()
    were unusable due to two bugs. First, edge_json_grammar() emitted
    rule names containing underscores, such as field_1, which
    llama.cpp's grammar parser rejects (only [a-zA-Z0-9-] is allowed in
    rule identifiers); the rules are now named field-1 and so on.
    Second, llama_sampler_accept() throws "Unexpected empty grammar
    stack" when a token fully satisfies the grammar; the binding now
    catches this error and terminates generation cleanly, the same way
    it handles end-of-generation.

  - Fixed crash from silent context size override (issue #40 item 11):
    Removed the auto-reduction of n_ctx for small models, which silently
    changed the user's requested context size and caused segfaults when
    prompts exceeded the reduced context. The requested n_ctx is now
    used as-is. Minimum n_ctx lowered from 512 to 128 for short-task use
    cases.

  - Fixed prompt echo in completion output (issue #40 item 1):
    edge_completion() previously returned prompt + generated_text. Now
    returns only the generated text, matching user expectations.

  - Added prompt length validation: All completion functions now
    validate that the tokenized prompt fits within the model's context
    window before calling llama_decode(). Exceeding the context now
    raises a clear R error instead of crashing the process.

  - Model-native chat templates (issue #40 item 7): New
    edge_chat_completion() function reads the model's chat template from
    GGUF metadata (via llama_chat_apply_template) and formats messages
    correctly for each model architecture (ChatML, Llama, Gemma, etc.).
    build_chat_prompt() updated to accept an optional ctx parameter for
    native template formatting, with ChatML as the generic fallback
    (replacing the old Human:/Assistant: format).
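
The rule-name restriction behind the issue #41 fix can be checked in
plain R. The sketch below is illustrative only: is_valid_rule_name(),
sanitize_rule_name(), and field_rule_names() are hypothetical helpers,
not functions exported by the package, and the pattern is the
[a-zA-Z0-9-] character class quoted in the fix above.

```r
# TRUE when a name uses only letters, digits, and hyphens.
is_valid_rule_name <- function(x) grepl("^[a-zA-Z0-9-]+$", x)

# Hypothetical helper: replace every disallowed character with a hyphen.
sanitize_rule_name <- function(x) gsub("[^a-zA-Z0-9-]", "-", x)

# Hypothetical helper: one positional rule name per schema field.
field_rule_names <- function(schema) {
  rule_names <- paste0("field-", seq_along(schema))
  stopifnot(all(is_valid_rule_name(rule_names)))
  rule_names
}

is_valid_rule_name(c("field_1", "field-1"))
# FALSE  TRUE
sanitize_rule_name("field_1")
# "field-1"
field_rule_names(list(name = "string", role = "string"))
# "field-1" "field-2"
```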

Use Cases Unlocked

  - Sentiment analysis: edge_classify(ctx, text, c("positive",
    "negative", "neutral"))
  - Entity extraction: edge_extract(ctx, text, list(name = "string",
    role = "string"))
  - Data labeling: Batch classify thousands of rows with guaranteed
    valid labels
  - Semantic search: Embed documents and queries, find nearest neighbors
  - Document clustering: Compute similarity matrices, feed to
    hclust/kmeans
  - RAG foundations: Embed corpus, retrieve relevant context for
    generation
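
As a rough illustration of the document clustering use case, the sketch
below computes an all-pairs cosine similarity matrix with a normalized
matrix multiply and feeds it to hclust(). The emb matrix is made-up toy
data standing in for real embeddings (n_texts x n_embd), and
cosine_similarity_matrix() is a hypothetical base-R helper, not the
package's edge_similarity_matrix().

```r
# Toy stand-in for an embedding matrix: one row per document.
emb <- rbind(
  doc_a = c(1.0, 0.0, 0.0),
  doc_b = c(0.9, 0.1, 0.0),
  doc_c = c(0.0, 1.0, 0.0),
  doc_d = c(0.0, 0.9, 0.2)
)

# Hypothetical helper: L2-normalize each row, then multiply.
cosine_similarity_matrix <- function(m) {
  norms <- sqrt(rowSums(m^2))
  unit <- m / norms      # recycling divides row i by norms[i]
  tcrossprod(unit)       # same as unit %*% t(unit)
}

sim <- cosine_similarity_matrix(emb)

# hclust() expects dissimilarities, so convert similarity to distance.
tree <- hclust(as.dist(1 - sim), method = "average")
clusters <- cutree(tree, k = 2)
clusters
# doc_a and doc_b share one label, doc_c and doc_d the other
```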

                        Changes in version 0.3.0                        

CUDA GPU Support and Qwen3 Tokenizer Fix

New Features

  - CUDA GPU acceleration (Windows): New edge_install_cuda() and
    edge_install_cuda_toolkit() functions set up GPU inference
    automatically.
    
      - edge_install_cuda() downloads the matching ggml-cuda dynamic
        backend from llama.cpp releases and extracts the companion
        ggml-base.dll / ggml.dll runtime libraries.
      - edge_install_cuda_toolkit() copies nvcudart_hybrid64.dll from
        the Windows DriverStore (already on any NVIDIA-driver machine,
        no download required) and fetches cublas64 / cublasLt64 from
        NVIDIA's redistrib server.
      - edge_reload_cuda() activates the CUDA backend in the current R
        session without restarting R.
      - edge_cuda_info() reports whether CUDA is installed and active.
      - Pass n_gpu_layers = -1L to edge_load_model() for full GPU
        offload.
      - Tested on NVIDIA RTX 5070 Ti (Blackwell sm_120, CUDA 13.1, 12 GB
        VRAM): Qwen3-14B loads in 3.4 s with full VRAM offload.

  - Updated llama.cpp to build b8179 (GGML 0.9.7): Brings all upstream
    model architecture updates, sampler improvements, and quantization
    fixes.

Bug Fixes

  - Qwen3 / QWEN2 tokenizer 40-minute load time (8000× speedup): The
    QWEN2 byte-level regex pattern caused GCC's std::regex to spend 40+
    minutes in exponential backtracking. Added a hand-written fast path
    unicode_regex_split_custom_qwen2() in unicode.cpp, matching the
    logic of the existing llama-3 fast path. Qwen3-14B now loads in 0.3
    s on CPU (3.4 s on GPU including VRAM transfer). Covers QWEN2 and
    QWEN3.5 variants.

CRAN Compliance

  - Replaced abort() in ggml_abort() with raise(SIGABRT) under #ifdef
    USING_R, and replaced the abort() token in ggml.cpp with
    std::terminate().
  - Guarded ggml_print_backtrace() body and fflush(stdout) /
    fprintf(stderr, …) in ggml_abort() with #ifndef USING_R to remove
    _Exit, stdout, and stderr symbol references from ggml.o on macOS.
  - Added #define _GNU_SOURCE to ggml-cpu.c (required for SCHED_BATCH,
    CPU_ZERO, pthread_setaffinity_np on Linux).
  - CXX_STD = CXX17 replaces -std=c++17 in PKG_CXXFLAGS in both Makevars
    and Makevars.win.
  - -fno-builtin-printf added to GGML_CFLAGS to suppress printf → puts
    optimizations.
  - Man pages added for edge_install_cuda, edge_install_cuda_toolkit,
    edge_reload_cuda, edge_cuda_info.

                 Changes in version 0.2.0 (2026-02-25)                  

SIMD Optimizations for Faster CPU Inference

New Features

  - Flash attention support: Enabled by default in edge_load_model() via
    flash_attn = TRUE. Reduces memory usage and improves attention
    computation speed on CPU.

  - Full hardware thread utilization: Removed the 4-thread cap for small
    contexts. edge_load_model() now uses all available CPU threads by
    default, with n_threads_batch set to the maximum for prompt
    processing.

  - User-configurable threading: New n_threads parameter in
    edge_load_model() allows explicit control over CPU thread count.
    Pass NULL (default) for auto-detect or an integer to limit cores.

  - Apple Accelerate framework (macOS): Automatically links the
    Accelerate framework on macOS builds, enabling hardware-accelerated
    vDSP vector operations for faster matrix math.

  - Compiler auto-vectorization: Added -ftree-vectorize to GGML
    compilation flags on all platforms, allowing GCC/Clang to generate
    SIMD instructions for eligible loops beyond the hand-tuned GGML
    kernels.
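
The NULL-or-integer convention described for n_threads could resolve to
a concrete thread count along the lines below. resolve_n_threads() is a
hypothetical helper written for illustration; how the package actually
detects the core count is not stated here, and parallel::detectCores()
is only an assumption.

```r
# Hypothetical helper: NULL means auto-detect, an integer means "use
# exactly this many threads".
resolve_n_threads <- function(n_threads = NULL) {
  if (is.null(n_threads)) {
    detected <- parallel::detectCores(logical = TRUE)
    # detectCores() may return NA on unusual platforms; fall back to 1.
    return(if (is.na(detected)) 1L else as.integer(detected))
  }
  stopifnot(is.numeric(n_threads), length(n_threads) == 1L, n_threads >= 1)
  as.integer(n_threads)
}

resolve_n_threads()     # all detected logical cores
resolve_n_threads(2)    # explicit limit of 2 threads
```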

Existing Features

  - SIMD-optimized build system: Replaced generic scalar fallback with
    architecture-aware SIMD detection in both Makevars (Unix) and
    Makevars.win (Windows)
    
      - x86_64: Enables SSE4.2 baseline by default (universal since
        Intel Nehalem 2008)
      - aarch64/arm64: NEON support built into the ABI (no extra flags
        needed)
      - Other architectures: Automatic generic fallback

  - User-configurable SIMD levels: Set EDGEMODELR_SIMD environment
    variable before install to select optimization level:
    
      - GENERIC: Scalar fallback (maximum compatibility)
      - SSE42: SSE4.2 baseline (default on x86_64)
      - AVX: AVX + F16C (Intel Sandy Bridge 2011+)
      - AVX2: AVX2 + FMA + F16C (Intel Haswell 2013+, recommended)
      - AVX512: AVX-512 (Intel Skylake-X 2017+)
      - NATIVE: Uses -march=native for maximum performance on the build
        machine

  - edge_simd_info(): New function to query compile-time SIMD status
    including architecture, compiler features, and GGML optimization
    flags

  - x86 architecture-specific quantization: Enabled optimized x86
    quantization kernels (arch/x86/quants.c, arch/x86/repack.cpp) with
    SIMD-accelerated dot products and matrix operations
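
Selecting a SIMD level before a source install might look like the
sketch below. set_simd_level() is a hypothetical convenience wrapper
that only validates the value against the levels listed above and sets
the environment variable; the package name in the commented-out install
call is an assumption inferred from the EDGEMODELR_SIMD variable name.

```r
simd_levels <- c("GENERIC", "SSE42", "AVX", "AVX2", "AVX512", "NATIVE")

# Hypothetical helper: validate the level, then export it so that a
# source install started from this session can read it.
set_simd_level <- function(level) {
  level <- toupper(level)
  if (!level %in% simd_levels) {
    stop("Unknown SIMD level '", level, "'; expected one of: ",
         paste(simd_levels, collapse = ", "))
  }
  Sys.setenv(EDGEMODELR_SIMD = level)
  invisible(level)
}

set_simd_level("avx2")
Sys.getenv("EDGEMODELR_SIMD")
# "AVX2"

# Then reinstall from source in the same session (package name assumed):
# install.packages("edgemodelr", type = "source")
```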

Performance

  - 15-40% faster inference on x86_64 with SSE4.2 baseline vs generic
    scalar
  - Up to 2-3x faster with AVX2 for quantized model operations
  - SSSE3-accelerated integer multiply-accumulate for quantized dot
    products

                 Changes in version 0.1.5 (2026-01-28)                  

CRAN Policy Fixes

Bug Fixes

  - Fixed donttest examples: Changed resource-intensive examples from
    \donttest{} to \dontrun{} to prevent downloading multi-GB models
    during CRAN checks

  - Fixed M1 Mac compiler warnings: Added explicit static_cast<> for:
    
      - double to float conversions for temperature/top_p parameters
      - size_type to int32_t conversions for buffer size parameters

  - Fixed connection handling: Replaced on.exit() with tryCatch/finally
    for proper connection cleanup in loops (thanks @eddelbuettel)
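
The connection-handling fix follows a general R pattern: an on.exit()
handler registered inside a loop runs only when the enclosing function
returns, and without add = TRUE each call replaces the previous
handler, so earlier connections can stay open. A minimal sketch of the
tryCatch/finally alternative, using a hypothetical read_first_lines()
helper and temporary files rather than the package's own code:

```r
# Hypothetical helper: read the first line of each file, closing every
# connection at the end of its own loop iteration.
read_first_lines <- function(paths) {
  out <- character(length(paths))
  for (i in seq_along(paths)) {
    con <- file(paths[i], open = "r")
    out[i] <- tryCatch(
      readLines(con, n = 1L),   # assumes each file has at least one line
      finally = close(con)      # runs whether or not readLines() fails
    )
  }
  out
}

paths <- replicate(3, tempfile(fileext = ".txt"))
for (i in seq_along(paths)) writeLines(paste("file", i), paths[i])
read_first_lines(paths)
# "file 1" "file 2" "file 3"
unlink(paths)
```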

                 Changes in version 0.1.4 (2026-01-22)                  

Performance Optimizations for Small Language Models

New Features

  - Small Model Configuration Helper: New edge_small_model_config()
    function provides optimized settings for small models (1B-3B
    parameters)
    
      - Device-specific presets: mobile, laptop, desktop, and server
      - Adaptive configuration based on model size and available RAM
      - Built-in performance tips and recommendations
      - Automatic parameter tuning for optimal inference speed

  - Adaptive Batch Processing: Intelligent batch size optimization based
    on context length
    
      - Small contexts (≤512): Uses up to full context for batching
      - Medium contexts (512-2048): Uses 1/2 context for optimal
        throughput
      - Large contexts (2048-4096): Uses 1/4 context to balance speed
        and memory
      - Very large contexts (>4096): Caps at 2048 tokens for stability

  - Smart Thread Allocation: Context-aware CPU thread management
    
      - Small models automatically limit threads to avoid overhead
      - Reduces CPU contention on resource-constrained devices
      - Improves inference speed for models with contexts ≤2048 tokens

  - Automatic Context Optimization: Model size-based context tuning
    
      - Small models (<1GB): Optimized to 1024 tokens for faster
        inference
      - Medium models (1-2GB): Set to 1536 tokens for balanced
        performance
      - Large models (>2GB): Maintains 2048+ tokens for quality
      - User override available via n_ctx parameter
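
The adaptive batch thresholds listed above for version 0.1.4 can be
written as a small function. adaptive_batch_size() is a hypothetical R
sketch of that rule, not the package's C++ code; the listed ranges share
their endpoints, so treating 512, 2048, and 4096 as upper-inclusive
bounds is an assumption.

```r
# Hypothetical sketch of the batch-size rule described for 0.1.4.
adaptive_batch_size <- function(n_ctx) {
  stopifnot(is.numeric(n_ctx), length(n_ctx) == 1L, n_ctx >= 1)
  n_batch <- if (n_ctx <= 512) {
    n_ctx            # small context: batch up to the full context
  } else if (n_ctx <= 2048) {
    n_ctx %/% 2      # medium context: half the context
  } else if (n_ctx <= 4096) {
    n_ctx %/% 4      # large context: a quarter of the context
  } else {
    2048             # very large context: fixed cap
  }
  as.integer(n_batch)
}

vapply(c(256, 1024, 4096, 8192), adaptive_batch_size, integer(1))
# 256  512 1024 2048
```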

Performance Improvements

  - Faster Small Model Inference: 15-30% speed improvement for small
    models through optimized batch and thread settings
  - Reduced Memory Footprint: Better memory efficiency for
    resource-constrained environments
  - Lower Latency: Optimized thread allocation reduces context switching
    overhead
  - Better Scalability: Adaptive configurations scale from mobile
    devices to servers

Examples and Documentation

  - Small Model Optimization Example: Comprehensive example
    demonstrating all optimization features
    
      - Configuration comparison across device types
      - Performance benchmarking workflow
      - Best practices for different model sizes
      - Manual tuning guidelines

  - Enhanced Testing: New test suite for small model configuration
    
      - Tests for all device target configurations
      - Validation of adaptive parameter adjustments
      - Safety checks for edge cases

Technical Details

  - Improved C++ bindings with adaptive batch size calculations
  - Enhanced R API with intelligent parameter defaults
  - Better integration between model size detection and configuration
  - Comprehensive documentation for optimization features

                        Changes in version 0.1.2                        

Major New Features

Ollama Integration

  - Native Ollama Support: Complete integration with Ollama models
    through automatic model discovery and SHA-256 hash-based loading
  - edge_find_ollama_models() - Discover all locally available Ollama
    models across platforms (Windows, macOS, Linux)
  - edge_load_ollama_model() - Load Ollama models using convenient
    SHA-256 hash prefixes instead of full file paths
  - test_ollama_model_compatibility() - Built-in compatibility testing
    for Ollama models
  - Cross-platform Model Detection: Robust model discovery supporting
    standard installations, snap packages (Linux), and various Windows
    configurations
  - Windows OneDrive Compatibility: Enhanced path detection that
    properly handles Windows OneDrive document folder redirections
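
Loading by SHA-256 hash prefix comes down to finding the single model
file whose hash starts with the given characters. The sketch below
shows that matching step only. match_hash_prefix() is a hypothetical
helper, the blob names are invented, and the "sha256-<hex>" file naming
is an assumption about how Ollama stores model blobs on disk.

```r
# Hypothetical helper: return the one blob whose hash starts with
# `prefix`, or fail when zero or several blobs match.
match_hash_prefix <- function(prefix, blob_names) {
  hashes <- sub("^sha256[-:]", "", blob_names)
  hits <- blob_names[startsWith(tolower(hashes), tolower(prefix))]
  if (length(hits) == 0L) {
    stop("No model blob matches prefix '", prefix, "'")
  }
  if (length(hits) > 1L) {
    stop("Prefix '", prefix, "' is ambiguous; add more characters")
  }
  hits
}

# Invented blob names for illustration (64 hex characters each).
blobs <- c(
  paste0("sha256-", "6a07", strrep("0", 60)),
  paste0("sha256-", "6a1f", strrep("0", 60)),
  paste0("sha256-", "b3c9", strrep("0", 60))
)

match_hash_prefix("b3", blobs)     # unique match: the third blob
match_hash_prefix("6a07", blobs)   # unique match: the first blob
# match_hash_prefix("6a", blobs) would stop with an ambiguity error
```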

Comprehensive Examples Suite

  - Structured Learning Path: Complete examples directory with
    progressive difficulty levels (Beginner → Intermediate → Advanced)
  - 01_basic_usage.R: Fundamental operations including model loading,
    text generation, parameter tuning, and error handling
  - 02_ollama_integration.R: Complete Ollama workflow with model
    discovery, hash-based loading, and compatibility testing
  - 03_streaming_generation.R: Real-time streaming text generation with
    interactive chat interfaces and callback processing
  - 04_performance_optimization.R: Advanced performance tuning including
    GPU acceleration, benchmarking, memory management, and batch
    processing
  - examples/README.md: Comprehensive documentation with learning paths,
    troubleshooting guide, and customization instructions

Package Structure Improvements

  - Organized File Structure: Consolidated all examples into structured
    examples/ directory with consistent formatting
  - Enhanced Documentation: Improved inline documentation and example
    comments throughout

                        Changes in version 0.1.1                        

Bug Fixes and Improvements

Compilation Fixes

  - macOS Boolean Conflicts: Completely resolved Boolean enum conflicts
    by avoiding problematic system headers and using direct function
    declarations
  - Filesystem Compatibility: Added comprehensive fallback
    implementation for disabled std::filesystem on macOS builds
  - Header Protection: Implemented robust cross-platform header
    inclusion strategy that works with R, Rcpp, and system headers
  - System Header Workarounds: Replaced <mach-o/dyld.h> inclusion with
    direct function declarations to avoid enum conflicts
  - Format Attribute Warnings: Suppressed unsupported printf format
    attribute warnings on macOS Apple Clang compiler
  - CRAN Compliance: Removed non-portable optimization flags
    (-march=native, -mtune=native, etc.) from Makevars for CRAN
    compatibility
  - Cross-platform Build: Enhanced Makevars configuration for better
    macOS compatibility with R package requirements

Demo and Documentation Updates

  - Modern UI: Updated streaming chat demo with modern bslib interface
    for enhanced user experience
  - Documentation: Improved documentation for edge_clean_cache()
    function
  - Examples: Enhanced streaming chat example with better UI components

Technical Improvements

  - Build System: Updated Makevars files for improved compilation on
    Windows and Unix systems
  - Core Bindings: Enhanced C++ bindings for better performance and
    stability

                 Changes in version 0.1.0 (2025-09-22)                  

Initial CRAN Release

New Features

  - Local LLM Inference: Complete R interface for running large language
    models locally using llama.cpp and GGUF model files
  - Model Management: Built-in functions for downloading and managing
    popular models from Hugging Face
  - Text Generation: Support for both blocking and streaming text
    completion
  - Interactive Chat: Real-time streaming chat interface with
    conversation history
  - Privacy-First: All processing happens locally without external API
    calls

Core Functions

  - edge_load_model() - Load GGUF model files for inference
  - edge_completion() - Generate text completions
  - edge_stream_completion() - Stream text generation with real-time
    callbacks
  - edge_chat_stream() - Interactive chat session with streaming
    responses
  - edge_free_model() - Memory management and cleanup
  - is_valid_model() - Model context validation

Model Management

  - edge_list_models() - List pre-configured popular models
  - edge_download_model() - Download models from Hugging Face Hub
  - edge_quick_setup() - One-line model download and setup

System Support

  - Self-contained: Includes complete llama.cpp implementation
  - Cross-platform: Works on Windows, macOS, and Linux
  - CPU optimized: Runs efficiently on standard hardware
  - Memory efficient: Support for quantized models

Documentation

  - Comprehensive getting started vignette
  - Complete API documentation with examples
  - README with extensive usage examples
  - Test coverage for all major functionality

Technical Implementation

  - C++17 integration via Rcpp
  - Real-time token streaming with callback support
  - Automatic memory management with RAII
  - Robust error handling and validation
  - Thread-safe model operations

This release provides a complete, production-ready engine for local
large language model inference in R, enabling private, offline text
generation workflows.