Changes in version 0.4.1 (2026-05-26)

CRAN Resubmission Fixes

- Stderr references in compiled objects (CRAN auto-check NOTE on Debian): the previous CRAN cleanup (commit d8870bd) added stdio suppression to 7 upstream files but missed ggml/ggml.c and ggml/ggml-opt.cpp. Both now include the same #ifdef USING_R macro block that neutralizes printf, fprintf, fputs, fflush, stderr, and stdout. These calls were diagnostic-only and were already silent at runtime via the installed log callback; now the symbols never reach the compiled object files either.

Changes in version 0.4.0

Structured Output, Embeddings, RAG, and API Server

New Features

- Grammar-constrained generation (edge_grammar_completion()): Force model output to conform to a GBNF grammar specification. Ensures valid, parseable structured output (JSON, enums, numbers, etc.) using llama.cpp's native grammar sampler.
- JSON schema helper (edge_json_grammar()): Convert a simple R list schema into a GBNF grammar string. Supports string, number, integer, boolean fields and enum (character vector) constraints.
- Structured data extraction (edge_extract()): High-level function that combines prompt construction with grammar-constrained generation to extract structured data from text. Returns a parsed R list (requires jsonlite).
- Text classification (edge_classify()): Classify text into predefined categories using grammar constraints. Supports single text and batch (vectorized) classification. Output is guaranteed to be one of the specified categories.
- Text embeddings (edge_embeddings()): Extract dense vector embeddings from any loaded model. Returns a numeric matrix (n_texts x n_embd) suitable for clustering, semantic search, similarity computation, and RAG pipelines. Supports optional L2 normalization.
- Cosine similarity (edge_similarity(), edge_similarity_matrix()): Compute pairwise cosine similarity between embedding vectors.
  Matrix version efficiently computes all-pairs similarity using normalized matrix multiply.
- Embedding dimension query (edge_model_n_embd()): Query the embedding dimension of a loaded model.
- Batch processing (edge_map()): Apply a prompt template over a vector of texts with progress reporting. Supports both string templates with {text} placeholder and custom prompt functions. Optional grammar constraint for structured batch output.
- Batch extraction (edge_extract_batch()): Extract structured data from multiple texts, returning a data frame with one row per input.
- RAG document indexing (edge_index_documents()): Build a semantic embedding index from a directory of text files or a character vector. Automatic chunking with configurable size and overlap.
- RAG semantic search (edge_search()): Find the most relevant text chunks for a query using cosine similarity over the embedding index.
- RAG question answering (edge_ask()): Retrieval-augmented generation that retrieves relevant context from an index and generates a grounded answer. Supports custom system prompts and optional context return for debugging/transparency.
- Plumber API server (edge_serve()): Serve a model as a local OpenAI-compatible REST API. Endpoints: /v1/completions, /v1/chat/completions, /v1/embeddings, /v1/models, /health. Supports optional API key authentication and CORS. Requires plumber.
- Qwen3 model family in edge_list_models(): Added Qwen3-0.6B, 1.7B, 4B, and 8B pre-configured entries from the unsloth GGUF repository.
- Friendly names in edge_download_model(): Now accepts model names from edge_list_models() (e.g., edge_download_model("Qwen3-0.6B")) in addition to HuggingFace repo IDs. Filename is auto-resolved from the model registry.
- httr download fallback: .robust_download() now tries httr::GET before R's download.file, improving reliability on corporate networks with custom SSL certificates or proxy configurations.
- SIMD optimization warning: On package load, warns if running without SIMD (generic mode) and suggests reinstalling from source with EDGEMODELR_SIMD=NATIVE for faster inference.

Bug Fixes

- Fixed grammar-constrained generation failures (issue #41): edge_grammar_completion(), edge_extract(), and edge_extract_batch() were unusable due to two bugs. First, edge_json_grammar() emitted rule names like field_1 containing underscores, which llama.cpp's grammar parser rejects (only [a-zA-Z0-9-] is allowed in rule identifiers). Renamed to field-1. Second, llama_sampler_accept() throws "Unexpected empty grammar stack" when a token fully satisfies the grammar; the binding now catches this and terminates cleanly, same as end-of-generation handling.
- Fixed crash from silent context size override (issue #40 item 11): Removed the auto-reduction of n_ctx for small models that silently changed the user's requested context size. This caused segfaults when prompts exceeded the reduced context. Context is now used as-is. Minimum n_ctx lowered from 512 to 128 for short-task use cases.
- Fixed prompt echo in completion output (issue #40 item 1): edge_completion() previously returned prompt + generated_text. Now returns only the generated text, matching user expectations.
- Added prompt length validation: All completion functions now validate that the tokenized prompt fits within the model's context window before calling llama_decode(). Exceeding the context now raises a clear R error instead of crashing the process.
- Model-native chat templates (issue #40 item 7): New edge_chat_completion() function reads the model's chat template from GGUF metadata (via llama_chat_apply_template) and formats messages correctly for each model architecture (ChatML, Llama, Gemma, etc.). build_chat_prompt() updated to accept an optional ctx parameter for native template formatting, with ChatML as the generic fallback (replacing the old Human:/Assistant: format).
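
The prompt length validation described in the bug fixes above can be sketched as follows. This is a minimal illustration, not the package's actual binding code: check_prompt_fits(), its n_reserved parameter, and the error wording are all hypothetical. It assumes an Rcpp-style binding, where a C++ exception thrown inside an exported function surfaces as an R error rather than terminating the process.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not the package's real code): reject a prompt whose
// token count cannot fit in the context window before any decode call runs.
// n_reserved is an illustrative extra that keeps room for generated tokens.
void check_prompt_fits(std::size_t n_prompt_tokens,
                       std::size_t n_ctx,
                       std::size_t n_reserved = 0) {
    // Test n_reserved first so the unsigned subtraction below cannot wrap.
    if (n_reserved > n_ctx || n_prompt_tokens > n_ctx - n_reserved) {
        throw std::length_error(
            "Prompt has " + std::to_string(n_prompt_tokens) +
            " tokens but the context window holds " + std::to_string(n_ctx) +
            " (" + std::to_string(n_reserved) + " reserved for generation). " +
            "Shorten the prompt or increase n_ctx.");
    }
}

// Usage sketch: a 100-token prompt fits a 128-token context, 200 does not.
void usage_sketch_prompt() {
    check_prompt_fits(100, 128);  // fits, returns normally
    bool rejected = false;
    try {
        check_prompt_fits(200, 128);  // too long, throws
    } catch (const std::length_error&) {
        rejected = true;
    }
    assert(rejected);
}
```

Running the check before decoding, rather than relying on the decode call to fail, is what turns an out-of-bounds crash into a recoverable error the caller can act on.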

Use Cases Unlocked

- Sentiment analysis: edge_classify(ctx, text, c("positive", "negative", "neutral"))
- Entity extraction: edge_extract(ctx, text, list(name = "string", role = "string"))
- Data labeling: Batch classify thousands of rows with guaranteed valid labels
- Semantic search: Embed documents and queries, find nearest neighbors
- Document clustering: Compute similarity matrices, feed to hclust/kmeans
- RAG foundations: Embed corpus, retrieve relevant context for generation

Changes in version 0.3.0

CUDA GPU Support and Qwen3 Tokenizer Fix

New Features

- CUDA GPU acceleration (Windows): New edge_install_cuda() and edge_install_cuda_toolkit() functions set up GPU inference automatically.
  - edge_install_cuda() downloads the matching ggml-cuda dynamic backend from llama.cpp releases and extracts the companion ggml-base.dll / ggml.dll runtime libraries.
  - edge_install_cuda_toolkit() copies nvcudart_hybrid64.dll from the Windows DriverStore (already on any NVIDIA-driver machine, no download required) and fetches cublas64 / cublasLt64 from NVIDIA's redistrib server.
  - edge_reload_cuda() activates the CUDA backend in the current R session without restarting R.
  - edge_cuda_info() reports whether CUDA is installed and active.
  - Pass n_gpu_layers = -1L to edge_load_model() for full GPU offload.
  - Tested on NVIDIA RTX 5070 Ti (Blackwell sm_120, CUDA 13.1, 12 GB VRAM): Qwen3-14B loads in 3.4 s with full VRAM offload.
- Updated llama.cpp to build b8179 (GGML 0.9.7): Brings all upstream model architecture updates, sampler improvements, and quantization fixes.

Bug Fixes

- Qwen3 / QWEN2 tokenizer 40-minute load time (8000× speedup): The QWEN2 byte-level regex pattern caused GCC's std::regex to spend 40+ minutes in exponential backtracking. Added a hand-written fast path unicode_regex_split_custom_qwen2() in unicode.cpp, matching the logic of the existing llama-3 fast path. Qwen3-14B now loads in 0.3 s on CPU (3.4 s on GPU including VRAM transfer).
  Covers QWEN2 and QWEN3.5 variants.

CRAN Compliance

- Replaced abort() in ggml_abort() with raise(SIGABRT) under #ifdef USING_R; replaced the abort() token in ggml.cpp with std::terminate().
- Guarded ggml_print_backtrace() body and fflush(stdout) / fprintf(stderr, …) in ggml_abort() with #ifndef USING_R to remove _Exit, stdout, and stderr symbol references from ggml.o on macOS.
- Added #define _GNU_SOURCE to ggml-cpu.c (required for SCHED_BATCH, CPU_ZERO, pthread_setaffinity_np on Linux).
- CXX_STD = CXX17 replaces -std=c++17 in PKG_CXXFLAGS in both Makevars and Makevars.win.
- -fno-builtin-printf added to GGML_CFLAGS to suppress printf → puts optimizations.
- Man pages added for edge_install_cuda, edge_install_cuda_toolkit, edge_reload_cuda, edge_cuda_info.

Changes in version 0.2.0 (2026-02-25)

SIMD Optimizations for Faster CPU Inference

New Features

- Flash attention support: Enabled by default in edge_load_model() via flash_attn = TRUE. Reduces memory usage and improves attention computation speed on CPU.
- Full hardware thread utilization: Removed the 4-thread cap for small contexts. edge_load_model() now uses all available CPU threads by default, with n_threads_batch set to max for prompt processing.
- User-configurable threading: New n_threads parameter in edge_load_model() allows explicit control over CPU thread count. Pass NULL (default) for auto-detect or an integer to limit cores.
- Apple Accelerate framework (macOS): Automatically links the Accelerate framework on macOS builds, enabling hardware-accelerated vDSP vector operations for faster matrix math.
- Compiler auto-vectorization: Added -ftree-vectorize to GGML compilation flags on all platforms, allowing GCC/Clang to generate SIMD instructions for eligible loops beyond the hand-tuned GGML kernels.
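
The n_threads behavior described above (auto-detect by default, or an explicit limit) might be resolved along these lines. This is a sketch under stated assumptions, not the package's implementation: resolve_n_threads() is a hypothetical name, std::nullopt stands in for R's NULL, and clamping an oversized request down to the detected thread count is an illustrative choice that the changelog does not specify.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <thread>

// Hypothetical helper: map an optional user request to a usable thread count.
// std::nullopt plays the role of R's NULL (auto-detect).
int resolve_n_threads(std::optional<int> requested = std::nullopt) {
    // hardware_concurrency() may return 0 when the count cannot be determined,
    // so fall back to a single thread in that case.
    const unsigned hw = std::thread::hardware_concurrency();
    const int available = hw > 0 ? static_cast<int>(hw) : 1;

    if (!requested) {
        return available;  // default: use every detected hardware thread
    }
    // Assumption for this sketch: keep explicit requests within [1, available].
    return std::clamp(*requested, 1, available);
}

// Usage sketch: the default uses all threads; explicit values act as a limit.
void usage_sketch_threads() {
    const int all = resolve_n_threads();
    assert(all >= 1);
    assert(resolve_n_threads(1) == 1);
    assert(resolve_n_threads(all + 8) == all);
}
```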

Existing Features

- SIMD-optimized build system: Replaced generic scalar fallback with architecture-aware SIMD detection in both Makevars (Unix) and Makevars.win (Windows)
  - x86_64: Enables SSE4.2 baseline by default (universal since Intel Nehalem 2008)
  - aarch64/arm64: NEON support built into the ABI (no extra flags needed)
  - Other architectures: Automatic generic fallback
- User-configurable SIMD levels: Set EDGEMODELR_SIMD environment variable before install to select optimization level:
  - GENERIC: Scalar fallback (maximum compatibility)
  - SSE42: SSE4.2 baseline (default on x86_64)
  - AVX: AVX + F16C (Intel Sandy Bridge 2011+)
  - AVX2: AVX2 + FMA + F16C (Intel Haswell 2013+, recommended)
  - AVX512: AVX-512 (Intel Skylake-X 2017+)
  - NATIVE: Uses -march=native for maximum performance on the build machine
- edge_simd_info(): New function to query compile-time SIMD status including architecture, compiler features, and GGML optimization flags
- x86 architecture-specific quantization: Enabled optimized x86 quantization kernels (arch/x86/quants.c, arch/x86/repack.cpp) with SIMD-accelerated dot products and matrix operations

Performance

- 15-40% faster inference on x86_64 with SSE4.2 baseline vs generic scalar
- Up to 2-3x faster with AVX2 for quantized model operations
- SSSE3-accelerated integer multiply-accumulate for quantized dot products

Changes in version 0.1.5 (2026-01-28)

CRAN Policy Fixes

Bug Fixes

- Fixed donttest examples: Changed resource-intensive examples from \donttest{} to \dontrun{} to prevent downloading multi-GB models during CRAN checks
- Fixed M1 Mac compiler warnings: Added explicit static_cast<> for:
  - double to float conversions for temperature/top_p parameters
  - size_type to int32_t conversions for buffer size parameters
- Fixed connection handling: Replaced on.exit() with tryCatch/finally for proper connection cleanup in loops (thanks @eddelbuettel)

Changes in version 0.1.4 (2026-01-22)

Performance Optimizations for Small Language Models

New Features

- Small Model Configuration Helper: New edge_small_model_config() function provides optimized settings for small models (1B-3B parameters)
  - Device-specific presets: mobile, laptop, desktop, and server
  - Adaptive configuration based on model size and available RAM
  - Built-in performance tips and recommendations
  - Automatic parameter tuning for optimal inference speed
- Adaptive Batch Processing: Intelligent batch size optimization based on context length
  - Small contexts (≤512): Uses up to full context for batching
  - Medium contexts (512-2048): Uses 1/2 context for optimal throughput
  - Large contexts (2048-4096): Uses 1/4 context to balance speed and memory
  - Very large contexts (>4096): Caps at 2048 tokens for stability
- Smart Thread Allocation: Context-aware CPU thread management
  - Small models automatically limit threads to avoid overhead
  - Reduces CPU contention on resource-constrained devices
  - Improves inference speed for models with contexts ≤2048 tokens
- Automatic Context Optimization: Model size-based context tuning
  - Small models (<1GB): Optimized to 1024 tokens for faster inference
  - Medium models (1-2GB): Set to 1536 tokens for balanced performance
  - Large models (>2GB): Maintains 2048+ tokens for quality
  - User override available via n_ctx parameter

Performance Improvements

- Faster Small Model Inference: 15-30% speed improvement for small models through optimized batch and thread settings
- Reduced Memory Footprint: Better memory efficiency for resource-constrained environments
- Lower Latency: Optimized thread allocation reduces context switching overhead
- Better Scalability: Adaptive configurations scale from mobile devices to servers

Examples and Documentation

- Small Model Optimization Example: Comprehensive example demonstrating all optimization features
  - Configuration comparison across device types
  - Performance benchmarking workflow
  - Best practices for different model sizes
  - Manual tuning guidelines
- Enhanced Testing: New test
  suite for small model configuration
  - Tests for all device target configurations
  - Validation of adaptive parameter adjustments
  - Safety checks for edge cases

Technical Details

- Improved C++ bindings with adaptive batch size calculations
- Enhanced R API with intelligent parameter defaults
- Better integration between model size detection and configuration
- Comprehensive documentation for optimization features

Changes in version 0.1.2

Major New Features

Ollama Integration

- Native Ollama Support: Complete integration with Ollama models through automatic model discovery and SHA-256 hash-based loading
  - edge_find_ollama_models() - Discover all locally available Ollama models across platforms (Windows, macOS, Linux)
  - edge_load_ollama_model() - Load Ollama models using convenient SHA-256 hash prefixes instead of full file paths
  - test_ollama_model_compatibility() - Built-in compatibility testing for Ollama models
- Cross-platform Model Detection: Robust model discovery supporting standard installations, snap packages (Linux), and various Windows configurations
- Windows OneDrive Compatibility: Enhanced path detection that properly handles Windows OneDrive document folder redirections

Comprehensive Examples Suite

- Structured Learning Path: Complete examples directory with progressive difficulty levels (Beginner → Intermediate → Advanced)
  - 01_basic_usage.R: Fundamental operations including model loading, text generation, parameter tuning, and error handling
  - 02_ollama_integration.R: Complete Ollama workflow with model discovery, hash-based loading, and compatibility testing
  - 03_streaming_generation.R: Real-time streaming text generation with interactive chat interfaces and callback processing
  - 04_performance_optimization.R: Advanced performance tuning including GPU acceleration, benchmarking, memory management, and batch processing
  - examples/README.md: Comprehensive documentation with learning paths, troubleshooting guide, and customization instructions

Package Structure Improvements

- Organized File Structure: Consolidated all examples into structured examples/ directory with consistent formatting
- Enhanced Documentation: Improved inline documentation and example comments throughout

Changes in version 0.1.1

Bug Fixes and Improvements

Compilation Fixes

- macOS Boolean Conflicts: Completely resolved Boolean enum conflicts by avoiding problematic system headers and using direct function declarations
- Filesystem Compatibility: Added comprehensive fallback implementation for disabled std::filesystem on macOS builds
- Header Protection: Implemented robust cross-platform header inclusion strategy that works with R, Rcpp, and system headers
- System Header Workarounds: Replaced system header inclusion with direct function declarations to avoid enum conflicts
- Format Attribute Warnings: Suppressed unsupported printf format attribute warnings on macOS Apple Clang compiler
- CRAN Compliance: Removed non-portable optimization flags (-march=native, -mtune=native, etc.)
  from Makevars for CRAN compatibility
- Cross-platform Build: Enhanced Makevars configuration for better macOS compatibility with R package requirements

Demo and Documentation Updates

- Modern UI: Updated streaming chat demo with modern bslib interface for enhanced user experience
- Documentation: Improved documentation for edge_clean_cache() function
- Examples: Enhanced streaming chat example with better UI components

Technical Improvements

- Build System: Updated Makevars files for improved compilation on Windows and Unix systems
- Core Bindings: Enhanced C++ bindings for better performance and stability

Changes in version 0.1.0 (2025-09-22)

Initial CRAN Release

New Features

- Local LLM Inference: Complete R interface for running large language models locally using llama.cpp and GGUF model files
- Model Management: Built-in functions for downloading and managing popular models from Hugging Face
- Text Generation: Support for both blocking and streaming text completion
- Interactive Chat: Real-time streaming chat interface with conversation history
- Privacy-First: All processing happens locally without external API calls

Core Functions

- edge_load_model() - Load GGUF model files for inference
- edge_completion() - Generate text completions
- edge_stream_completion() - Stream text generation with real-time callbacks
- edge_chat_stream() - Interactive chat session with streaming responses
- edge_free_model() - Memory management and cleanup
- is_valid_model() - Model context validation

Model Management

- edge_list_models() - List pre-configured popular models
- edge_download_model() - Download models from Hugging Face Hub
- edge_quick_setup() - One-line model download and setup

System Support

- Self-contained: Includes complete llama.cpp implementation
- Cross-platform: Works on Windows, macOS, and Linux
- CPU optimized: Runs efficiently on standard hardware
- Memory efficient: Support for quantized models

Documentation

- Comprehensive getting started vignette
- Complete API documentation with examples
- README with extensive usage examples
- Test coverage for all major functionality

Technical Implementation

- C++17 integration via Rcpp
- Real-time token streaming with callback support
- Automatic memory management with RAII
- Robust error handling and validation
- Thread-safe model operations

This release provides a complete, production-ready local large language model inference engine for R, enabling private, offline text generation workflows.
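
The RAII item under Technical Implementation can be illustrated with a small sketch. Everything here is a stand-in rather than the package's code: FakeModel, fake_model_load(), fake_model_free(), and the live-handle counter are hypothetical and exist only to show a C-style handle being released automatically when its owner goes out of scope.

```cpp
#include <cassert>
#include <memory>

// Stand-in for a C-style model handle; a real binding would wrap the handle
// type of its inference library instead.
struct FakeModel { int id; };

// Demo-only counter of handles that have been loaded but not yet freed.
static int g_live_models = 0;

FakeModel* fake_model_load(int id) { ++g_live_models; return new FakeModel{id}; }
void fake_model_free(FakeModel* m) { --g_live_models; delete m; }

// Deleter that routes unique_ptr destruction through the C-style free call.
struct ModelDeleter {
    void operator()(FakeModel* m) const { fake_model_free(m); }
};
using ModelPtr = std::unique_ptr<FakeModel, ModelDeleter>;

// Factory returning an owning handle; no manual free is needed by callers.
ModelPtr load_model(int id) { return ModelPtr(fake_model_load(id)); }

// Usage sketch: the handle is freed when `model` leaves the inner scope.
int live_models_after_scope() {
    {
        ModelPtr model = load_model(1);
        assert(g_live_models == 1);
    }  // ~unique_ptr runs ModelDeleter here
    return g_live_models;
}
```

Tying the free call to a destructor means early returns and exceptions release the handle as well, which is the usual reason to prefer this pattern over paired load/free calls.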