Cortex

Configure how Cortex.cpp runs AI models. Access these settings through Settings > Core Extensions > Cortex.

Performance Settings

Continuous Batching

Enables processing multiple requests simultaneously
Improves throughput for concurrent operations, which usually improves performance

Continuous Batching is turned on by default.
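To make the throughput benefit concrete, here is a minimal sketch that compares static batching with continuous batching in a toy scheduler. It is not Cortex.cpp's scheduler: the function names, the `slots` parameter, and the request lengths are hypothetical, and it assumes every active request produces one token per step.

```python
import heapq


def static_batching_steps(lengths, slots):
    """Each batch must wait for its longest request before the next batch starts."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total


def continuous_batching_steps(lengths, slots):
    """A waiting request takes over a slot as soon as that slot becomes free."""
    free_at = [0] * slots  # min-heap: the step at which each slot becomes free
    heapq.heapify(free_at)
    finish = 0
    for n in lengths:
        start = heapq.heappop(free_at)  # earliest available slot
        end = start + n
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


# Hypothetical request lengths in tokens; one long request among short ones.
lengths = [10, 200, 15, 20, 30, 25, 40, 35]
print(static_batching_steps(lengths, slots=4))      # 240
print(continuous_batching_steps(lengths, slots=4))  # 200
```

In this toy model the second group of requests no longer waits behind the 200-token request, so all work finishes in 200 steps instead of 240.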

Parallel Operations

Number of prompts to run in parallel
Affects model inference speed

By default, Parallel Operations is 4.

CPU Threads

Number of CPU threads to use when running without a GPU
Adjust based on your system's capabilities

Higher thread counts may improve performance but increase CPU usage.

Leave CPU Threads blank to use the system default.
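If you prefer to set a value explicitly, one common starting point is to leave a core free for the rest of the system. The sketch below is only a heuristic, not how Cortex.cpp picks its default; `suggest_cpu_threads` and its `reserve` parameter are hypothetical names.

```python
import os


def suggest_cpu_threads(reserve=1):
    """Heuristic: use all logical CPUs except `reserve`, but never fewer than 1."""
    logical = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, logical - reserve)


print(suggest_cpu_threads())
```

Note that `os.cpu_count()` reports logical CPUs, which can be higher than the number of physical cores on machines with simultaneous multithreading, so a lower value may be worth testing as well.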

Memory Settings

Flash Attention

Optimizes attention computation
Reduces memory usage
Recommended for most cases

Flash Attention is turned on by default.

Caching

Enable to store recent prompts and responses
Improves response time for repeated prompts

Caching is turned on by default.
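The general idea of keeping recent prompts and responses can be illustrated with a small least-recently-used cache. This is a conceptual sketch only and makes no claim about how Cortex.cpp stores its cache internally; the `PromptCache` class and its `capacity` parameter are hypothetical.

```python
from collections import OrderedDict


class PromptCache:
    """Keeps the most recently used prompt/response pairs, up to `capacity`."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, prompt):
        if prompt not in self._items:
            return None  # cache miss: the model would have to run
        self._items.move_to_end(prompt)  # mark as recently used
        return self._items[prompt]

    def put(self, prompt, response):
        self._items[prompt] = response
        self._items.move_to_end(prompt)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


cache = PromptCache(capacity=2)
cache.put("Hello", "Hi there!")
print(cache.get("Hello"))    # Hi there!
print(cache.get("Goodbye"))  # None
```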

KV Cache Type

KV cache implementation type; controls memory usage and precision trade-off.
Options:
- f16 (most stable)
- q8_0 (balanced)
- q4_0 (lowest memory)

⚠️ Changing the KV cache type from the default (f16) may affect model stability.

By default, KV Cache Type is f16.
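The memory side of this trade-off can be estimated with simple arithmetic. The sketch below assumes the option names refer to the GGML formats of the same names (f16 uses 2 bytes per value, q8_0 packs 32 values into 34 bytes, q4_0 packs 32 values into 18 bytes). The model dimensions are hypothetical and the function name `kv_cache_bytes` is made up for illustration.

```python
# Approximate bytes per cached value for each type (assumed GGML block layouts).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(cache_type, n_layers, n_kv_heads, head_dim, context_len):
    """Estimate KV cache size: keys and values for every layer and position."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len  # 2 = K and V
    return int(elements * BYTES_PER_ELEMENT[cache_type])


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 8192-token context.
for cache_type in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(cache_type, 32, 8, 128, 8192) / 2**30
    print(f"{cache_type}: {gib:.2f} GiB")
# f16: 1.00 GiB
# q8_0: 0.53 GiB
# q4_0: 0.28 GiB
```

Actual usage depends on the model architecture and context length, so treat these figures as rough estimates rather than measurements.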

MMAP

Enables memory-mapped model loading
Reduces memory usage
Recommended for large models

MMAP is turned on by default.
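Memory mapping in general lets a program read parts of a file on demand instead of copying the whole file into memory first. The following sketch demonstrates the technique with Python's standard `mmap` module on a temporary file. It illustrates the concept only, not Cortex.cpp's loader; `read_slice_mmap` and `demo` are hypothetical helpers.

```python
import mmap
import os
import tempfile


def read_slice_mmap(path, offset, length):
    """Read `length` bytes at `offset` without loading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]  # only the touched pages are read


def demo():
    # Stand-in for a large model file: a short header followed by 1 MiB of zeros.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"HEADER" + bytes(1024 * 1024))
        path = tmp.name
    try:
        return read_slice_mmap(path, 0, 6)
    finally:
        os.remove(path)


print(demo())  # b'HEADER'
```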

Best Practices:

Start with default settings

Adjust based on your hardware capabilities

Monitor system performance

Test changes with your specific use case

Performance impact varies by hardware, model size, and usage patterns.
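One simple way to test a change is to time the same workload before and after adjusting a setting. The sketch below is a generic timing helper, not a Cortex.cpp tool; `median_latency` and `fake_workload` are hypothetical, and in practice `run_once` would send a representative prompt to your model.

```python
import statistics
import time


def median_latency(run_once, repeats=5):
    """Run `run_once` several times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def fake_workload():
    # Placeholder for a real request to the model.
    sum(i * i for i in range(100_000))


print(f"median latency: {median_latency(fake_workload):.4f} s")
```

Using the median rather than the mean keeps a single slow run, such as the first request after a model load, from dominating the comparison.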
