ggml/ggml.c and
ggml/ggml-opt.cpp. Both now include the same #ifdef USING_R
macro block that neutralizes printf, fprintf, fputs, fflush,
stderr, and stdout. These calls were diagnostic-only and were
already silent at runtime via the installed log callback; now the
symbols never reach the compiled object files either.
Grammar-constrained generation (edge_grammar_completion()): Force model
output to conform to a GBNF grammar specification. Ensures valid, parseable
structured output (JSON, enums, numbers, etc.) using llama.cpp's native
grammar sampler.
JSON schema helper (edge_json_grammar()): Convert a simple R list
schema into a GBNF grammar string. Supports string, number, integer,
boolean fields and enum (character vector) constraints.
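The schema-to-grammar conversion described above can be sketched in base R. This is a minimal illustration, not the package's edge_json_grammar() implementation: the helper name schema_to_gbnf() is hypothetical, and it assumes that a length-one value naming a primitive type is a type and that any other character vector is an enum. Field names and enum values are assumed to contain no quotes or backslashes.

```r
# Hypothetical helper: build a GBNF grammar string for a flat JSON object.
schema_to_gbnf <- function(schema) {
  stopifnot(is.list(schema), length(schema) > 0, !is.null(names(schema)))

  # GBNF literal matching the JSON string "x", i.e. "\"x\""
  json_lit <- function(x) paste0('"\\"', x, '\\""')

  # Shared rules for the primitive types named in the changelog entry
  primitives <- c(
    string  = 'string ::= "\\"" [^"\\\\]* "\\""',
    number  = 'number ::= "-"? [0-9]+ ("." [0-9]+)?',
    integer = 'integer ::= "-"? [0-9]+',
    boolean = 'boolean ::= "true" | "false"'
  )

  field_rules <- character(length(schema))
  used <- character(0)
  for (i in seq_along(schema)) {
    spec <- schema[[i]]
    stopifnot(is.character(spec), length(spec) >= 1)
    if (length(spec) == 1L && spec %in% names(primitives)) {
      value <- spec                      # reference a primitive rule
      used <- union(used, spec)
    } else {
      # enum: alternation of quoted JSON string literals
      value <- paste0("(", paste(json_lit(spec), collapse = " | "), ")")
    }
    # Rule names use hyphens only (field-1, field-2, ...)
    field_rules[i] <- paste0("field-", i, " ::= ",
                             json_lit(names(schema)[i]), ' ws ":" ws ', value)
  }

  members <- paste0("field-", seq_along(schema))
  root <- paste0('root ::= "{" ws ',
                 paste(members, collapse = ' ws "," ws '), ' ws "}"')

  paste(c(root, field_rules, unname(primitives[used]), "ws ::= [ \\t\\n]*"),
        collapse = "\n")
}

cat(schema_to_gbnf(list(name = "string",
                        sentiment = c("positive", "negative"))))
```

The sketch emits fields in schema order and only includes the primitive rules that are actually referenced.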
Structured data extraction (edge_extract()): High-level function that
combines prompt construction with grammar-constrained generation to extract
structured data from text. Returns a parsed R list (requires jsonlite).
Text classification (edge_classify()): Classify text into predefined
categories using grammar constraints. Supports single text and batch
(vectorized) classification. Output is guaranteed to be one of the
specified categories.
Text embeddings (edge_embeddings()): Extract dense vector embeddings
from any loaded model. Returns a numeric matrix (n_texts x n_embd) suitable
for clustering, semantic search, similarity computation, and RAG pipelines.
Supports optional L2 normalization.
Cosine similarity (edge_similarity(), edge_similarity_matrix()):
Compute pairwise cosine similarity between embedding vectors. Matrix version
efficiently computes all-pairs similarity using normalized matrix multiply.
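The "normalized matrix multiply" approach can be illustrated in base R. This is a sketch under the assumption that embeddings are stored one per row, as in the n_texts x n_embd matrix described above; the function name cosine_similarity_matrix() is hypothetical and is not the package's edge_similarity_matrix().

```r
# Hypothetical helper: all-pairs cosine similarity for row-wise embeddings.
cosine_similarity_matrix <- function(emb) {
  stopifnot(is.matrix(emb), is.numeric(emb))
  norms <- sqrt(rowSums(emb^2))   # L2 norm of each row
  norms[norms == 0] <- 1          # avoid division by zero for all-zero rows
  unit <- emb / norms             # recycling divides row i by norms[i]
  tcrossprod(unit)                # unit %*% t(unit): every pairwise dot product
}

emb <- rbind(c(1, 0), c(0, 2), c(3, 3))
sim <- cosine_similarity_matrix(emb)
```

Normalizing once and then taking a single cross product avoids recomputing norms for every pair.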
Embedding dimension query (edge_model_n_embd()): Query the embedding
dimension of a loaded model.
Batch processing (edge_map()): Apply a prompt template over a vector
of texts with progress reporting. Supports both string templates with
{text} placeholder and custom prompt functions. Optional grammar
constraint for structured batch output.
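The two prompt styles mentioned above (a string template with a {text} placeholder, or a custom prompt function) can be sketched as follows. The helper build_prompts() is hypothetical and only shows how prompts might be assembled before generation; it does not call the model and is not how edge_map() is necessarily implemented.

```r
# Hypothetical helper: turn a vector of texts into a vector of prompts.
build_prompts <- function(texts, prompt) {
  if (is.function(prompt)) {
    # Custom prompt function: called once per text
    return(vapply(texts, prompt, character(1), USE.NAMES = FALSE))
  }
  # String template: substitute the literal {text} placeholder
  vapply(texts,
         function(t) sub("{text}", t, prompt, fixed = TRUE),
         character(1), USE.NAMES = FALSE)
}

build_prompts(c("great product", "arrived late"), "Summarize: {text}")
build_prompts(c("great product", "arrived late"),
              function(t) paste0("Review: ", t, "\nSentiment:"))
```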
Batch extraction (edge_extract_batch()): Extract structured data from
multiple texts, returning a data frame with one row per input.
RAG document indexing (edge_index_documents()): Build a semantic
embedding index from a directory of text files or a character vector.
Automatic chunking with configurable size and overlap.
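Chunking with a configurable size and overlap can be sketched in base R. This example assumes sizes are measured in characters, which the changelog does not specify; chunk_text() is a hypothetical helper, not the package's internal chunker.

```r
# Hypothetical helper: split one string into overlapping character windows.
chunk_text <- function(text, chunk_size = 200L, overlap = 50L) {
  stopifnot(length(text) == 1L, overlap >= 0, chunk_size > overlap)
  n <- nchar(text)
  if (n == 0) return(character(0))
  step <- chunk_size - overlap                 # distance between chunk starts
  starts <- seq.int(1L, n, by = step)
  # Drop trailing starts whose text is already fully covered by the
  # previous chunk (the previous chunk ends at start + overlap - 1).
  keep <- starts == 1L | (starts + overlap - 1L) < n
  starts <- starts[keep]
  substring(text, starts, pmin(starts + chunk_size - 1L, n))
}

chunk_text("abcdefghij", chunk_size = 4L, overlap = 2L)
```

Each chunk shares its first `overlap` characters with the end of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.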
RAG semantic search (edge_search()): Find the most relevant text
chunks for a query using cosine similarity over the embedding index.
RAG question answering (edge_ask()): Retrieval-augmented generation
that retrieves relevant context from an index and generates a grounded
answer. Supports custom system prompts and optional context return for
debugging/transparency.
Plumber API server (edge_serve()): Serve a model as a local
OpenAI-compatible REST API. Endpoints: /v1/completions,
/v1/chat/completions, /v1/embeddings, /v1/models, /health.
Supports optional API key authentication and CORS. Requires plumber.
Qwen3 model family in edge_list_models(): Added Qwen3-0.6B, 1.7B,
4B, and 8B pre-configured entries from the unsloth GGUF repository.
Friendly names in edge_download_model(): Now accepts model names from
edge_list_models() (e.g., edge_download_model("Qwen3-0.6B")) in addition
to HuggingFace repo IDs. Filename is auto-resolved from the model registry.
httr download fallback: .robust_download() now tries httr::GET before
R's download.file, improving reliability on corporate networks with custom
SSL certificates or proxy configurations.
SIMD optimization warning: On package load, warns if running without SIMD
(generic mode) and suggests reinstalling from source with
EDGEMODELR_SIMD=NATIVE for faster inference.
Fixed grammar-constrained generation failures (issue #41):
edge_grammar_completion(), edge_extract(), and edge_extract_batch() were
unusable due to two bugs. First, edge_json_grammar() emitted rule names like
field_1 containing underscores, which llama.cpp's grammar parser rejects
(only [a-zA-Z0-9-] is allowed in rule identifiers). Renamed to field-1.
Second, llama_sampler_accept() throws "Unexpected empty grammar stack" when
a token fully satisfies the grammar; the binding now catches this and
terminates cleanly, same as end-of-generation handling.
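The rule-name constraint behind the first bug can be checked with a short base R sketch. Both helper names are hypothetical; the character set comes from the entry above.

```r
# Hypothetical helpers: validate and repair GBNF rule identifiers.
is_valid_rule_name <- function(x) grepl("^[a-zA-Z0-9-]+$", x)

# Replace every disallowed character (such as "_") with a hyphen
to_rule_name <- function(x) gsub("[^a-zA-Z0-9-]", "-", x)

is_valid_rule_name("field_1")   # underscore is not allowed
to_rule_name("field_1")
```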
Fixed crash from silent context size override (issue #40 item 11):
Removed the auto-reduction of n_ctx for small models that silently changed
the user's requested context size. This caused segfaults when prompts exceeded
the reduced context. Context is now used as-is. Minimum n_ctx lowered from
512 to 128 for short-task use cases.
Fixed prompt echo in completion output (issue #40 item 1):
edge_completion() previously returned prompt + generated_text. Now returns
only the generated text, matching user expectations.
Added prompt length validation: All completion functions now validate that
the tokenized prompt fits within the model's context window before calling
llama_decode(). Exceeding the context now raises a clear R error instead of
crashing the process.
Model-native chat templates (issue #40 item 7): New
edge_chat_completion() function reads the model's chat template from GGUF
metadata (via llama_chat_apply_template) and formats messages correctly for
each model architecture (ChatML, Llama, Gemma, etc.). build_chat_prompt()
updated to accept an optional ctx parameter for native template formatting,
with ChatML as the generic fallback (replacing the old Human:/Assistant:
format).
edge_classify(ctx, text, c("positive", "negative", "neutral"))
edge_extract(ctx, text, list(name = "string", role = "string"))
CUDA GPU acceleration (Windows): New edge_install_cuda() and
edge_install_cuda_toolkit() functions set up GPU inference automatically.
edge_install_cuda() downloads the matching ggml-cuda dynamic backend
from llama.cpp releases and extracts the companion ggml-base.dll /
ggml.dll runtime libraries.
edge_install_cuda_toolkit() copies nvcudart_hybrid64.dll from the
Windows DriverStore (already on any NVIDIA-driver machine, no download
required) and fetches cublas64 / cublasLt64 from NVIDIA's redistrib
server.
edge_reload_cuda() activates the CUDA backend in the current R session
without restarting R.
edge_cuda_info() reports whether CUDA is installed and active.
Pass n_gpu_layers = -1L to edge_load_model() for full GPU offload.
Updated llama.cpp to build b8179 (GGML 0.9.7): Brings all upstream model architecture updates, sampler improvements, and quantization fixes.
std::regex to spend 40+
minutes in exponential backtracking. Added a hand-written fast path
unicode_regex_split_custom_qwen2() in unicode.cpp, matching the logic
of the existing llama-3 fast path. Qwen3-14B now loads in 0.3 s on CPU
(3.4 s on GPU including VRAM transfer). Covers QWEN2 and QWEN3.5 variants.
Replaces abort() in ggml_abort() with raise(SIGABRT) under
#ifdef USING_R; replaces abort() token in ggml.cpp with
std::terminate().
Wraps ggml_print_backtrace() body and fflush(stdout) /
fprintf(stderr, …) in ggml_abort() with #ifndef USING_R to remove
_Exit, stdout, and stderr symbol references from ggml.o on macOS.
Adds #define _GNU_SOURCE to ggml-cpu.c (required for SCHED_BATCH,
CPU_ZERO, pthread_setaffinity_np on Linux).
CXX_STD = CXX17 replaces -std=c++17 in PKG_CXXFLAGS in both
Makevars and Makevars.win.
-fno-builtin-printf added to GGML_CFLAGS to suppress
printf → puts optimizations.
edge_install_cuda, edge_install_cuda_toolkit,
edge_reload_cuda, edge_cuda_info.
Flash attention support: Enabled by default in edge_load_model() via flash_attn = TRUE. Reduces memory usage and improves attention computation speed on CPU.
Full hardware thread utilization: Removed the 4-thread cap for small contexts. edge_load_model() now uses all available CPU threads by default, with n_threads_batch set to max for prompt processing.
User-configurable threading: New n_threads parameter in edge_load_model() allows explicit control over CPU thread count. Pass NULL (default) for auto-detect or an integer to limit cores.
Apple Accelerate framework (macOS): Automatically links the Accelerate framework on macOS builds, enabling hardware-accelerated vDSP vector operations for faster matrix math.
Compiler auto-vectorization: Added -ftree-vectorize to GGML compilation flags on all platforms, allowing GCC/Clang to generate SIMD instructions for eligible loops beyond the hand-tuned GGML kernels.
SIMD-optimized build system: Replaced generic scalar fallback with architecture-aware SIMD detection in both Makevars (Unix) and Makevars.win (Windows)
User-configurable SIMD levels: Set EDGEMODELR_SIMD environment variable before install to select optimization level:
GENERIC: Scalar fallback (maximum compatibility)
SSE42: SSE4.2 baseline (default on x86_64)
AVX: AVX + F16C (Intel Sandy Bridge 2011+)
AVX2: AVX2 + FMA + F16C (Intel Haswell 2013+, recommended)
AVX512: AVX-512 (Intel Skylake-X 2017+)
NATIVE: Uses -march=native for maximum performance on the build machine
edge_simd_info(): New function to query compile-time SIMD status including architecture, compiler features, and GGML optimization flags
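Selecting a SIMD level at install time might look like the following. This is a configuration sketch, not a verified recipe: it assumes the package is named edgemodelr (the changelog only shows the EDGEMODELR_SIMD variable) and that it is built from source so the compiler flags take effect.

```r
# Assumption: the package name is "edgemodelr" and a source build is used,
# so the environment variable is visible to the Makevars at compile time.
Sys.setenv(EDGEMODELR_SIMD = "AVX2")   # or GENERIC, SSE42, AVX, AVX512, NATIVE
install.packages("edgemodelr", type = "source")

# In a fresh session, confirm which SIMD features were compiled in
library(edgemodelr)
edge_simd_info()
```

NATIVE targets the build machine's CPU, so a binary built that way may not run on older hardware.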
x86 architecture-specific quantization: Enabled optimized x86 quantization kernels (arch/x86/quants.c, arch/x86/repack.cpp) with SIMD-accelerated dot products and matrix operations
Fixed donttest examples: Changed resource-intensive examples from \donttest{} to \dontrun{} to prevent downloading multi-GB models during CRAN checks
Fixed M1 Mac compiler warnings: Added explicit static_cast<> for:
double to float conversions for temperature/top_p parameters
size_type to int32_t conversions for buffer size parameters
Fixed connection handling: Replaced on.exit() with tryCatch/finally for proper connection cleanup in loops (thanks @eddelbuettel)
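The connection-handling pattern can be illustrated with a small base R sketch. The function read_first_lines() is hypothetical and unrelated to the package's own code. The point is that on.exit() handlers run only when the enclosing function returns, so connections opened in a loop stay open until then, while tryCatch(..., finally = ) closes each one at the end of its own iteration.

```r
# Hypothetical helper: read the first line of each file, closing every
# connection as soon as its iteration finishes, even on error.
read_first_lines <- function(paths) {
  out <- rep(NA_character_, length(paths))
  for (i in seq_along(paths)) {
    con <- file(paths[i], open = "r")
    lines <- tryCatch(
      readLines(con, n = 1L, warn = FALSE),
      finally = close(con)        # runs per iteration, not at function exit
    )
    if (length(lines) > 0) out[i] <- lines[1]
  }
  out
}

paths <- c(tempfile(), tempfile())
writeLines(c("alpha", "beta"), paths[1])
writeLines("gamma", paths[2])
read_first_lines(paths)
```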
Small Model Configuration Helper: New edge_small_model_config() function provides optimized settings for small models (1B-3B parameters)
Adaptive Batch Processing: Intelligent batch size optimization based on context length
Smart Thread Allocation: Context-aware CPU thread management
Automatic Context Optimization: Model size-based context tuning
Small Model Optimization Example: Comprehensive example demonstrating all optimization features
Enhanced Testing: New test suite for small model configuration
edge_find_ollama_models() - Discover all locally available Ollama models across platforms (Windows, macOS, Linux)
edge_load_ollama_model() - Load Ollama models using convenient SHA-256 hash prefixes instead of full file paths
test_ollama_model_compatibility() - Built-in compatibility testing for Ollama models
std::filesystem on macOS builds
<mach-o/dyld.h> inclusion with direct function declarations to avoid enum conflicts
-march=native, -mtune=native, etc.) from Makevars for CRAN compatibility
edge_clean_cache() function
edge_load_model() - Load GGUF model files for inference
edge_completion() - Generate text completions
edge_stream_completion() - Stream text generation with real-time callbacks
edge_chat_stream() - Interactive chat session with streaming responses
edge_free_model() - Memory management and cleanup
is_valid_model() - Model context validation
edge_list_models() - List pre-configured popular models
edge_download_model() - Download models from Hugging Face Hub
edge_quick_setup() - One-line model download and setup
This release provides a complete, production-ready local large language model inference engine for R, enabling private, offline text generation workflows.