Cortex

Configure how Cortex.cpp runs AI models. To access these settings, go to Settings > Core Extensions > Cortex.

Performance Settings

Continuous Batching

  • Processes multiple requests at the same time
  • Improves throughput when several requests run concurrently

Continuous Batching is turned on by default.


Parallel Operations

  • Number of prompts to run in parallel
  • Affects model inference speed

By default, Parallel Operations is set to 4.


CPU Threads

  • Number of CPU cores to use when running without a GPU
  • Adjust based on your system's capabilities

Higher thread counts may improve performance but increase CPU usage.

Leave CPU Threads blank to use the system default.
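When picking an explicit value, it helps to know how many logical cores the machine reports. The following is a minimal Python sketch that is independent of Cortex: the helper name `suggest_cpu_threads` and the rule of leaving one core free are illustrative assumptions, not how Cortex chooses its default.

```python
import os


def suggest_cpu_threads(reserve: int = 1) -> int:
    """Return a starting value for CPU Threads.

    Uses the number of logical cores reported by the OS, minus `reserve`
    cores kept free for other work, and never returns less than 1.
    """
    logical_cores = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, logical_cores - reserve)


print(suggest_cpu_threads())
```

Note that `os.cpu_count()` counts logical cores, so on CPUs with simultaneous multithreading the number of physical cores may be lower.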


Memory Settings

Flash Attention

  • Optimizes attention computation
  • Reduces memory usage
  • Recommended for most cases

Flash Attention is turned on by default.


Caching

  • Enable to store recent prompts and responses
  • Improves response time for repeated prompts

Caching is turned on by default.


KV Cache Type

  • Sets the data type used for the KV cache, trading memory usage against precision
  • Options:
    • f16 (most stable)
    • q8_0 (balanced)
    • q4_0 (lowest memory)
⚠️ Changing the KV cache type from the default (f16) may affect model stability.

By default, KV Cache Type is f16.
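To get a feel for the memory side of this trade-off, the sketch below estimates the KV cache size for each option. It assumes the option names map to the ggml storage formats of the same name (2 bytes per value for f16, 34-byte blocks of 32 values for q8_0, 18-byte blocks of 32 values for q4_0). The function name and the model shape are hypothetical and chosen only for illustration.

```python
# Bytes per cached value, assuming ggml's storage formats:
# f16 stores 2 bytes per value; q8_0 and q4_0 pack 32 values into
# 34-byte and 18-byte blocks (quantized values plus one f16 scale).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(cache_type, n_layers, n_kv_heads, head_dim, context_len):
    """Estimate KV cache size: keys and values for every layer and position."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len  # 2 = K and V
    return int(elements * BYTES_PER_ELEMENT[cache_type])


# Hypothetical model shape, not taken from any specific model.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)

for cache_type in BYTES_PER_ELEMENT:
    mib = kv_cache_bytes(cache_type, **shape) / 2**20
    print(f"{cache_type}: {mib:.0f} MiB")
# f16: 1024 MiB
# q8_0: 544 MiB
# q4_0: 288 MiB
```

Under these assumptions, q8_0 roughly halves the cache and q4_0 cuts it to a bit over a quarter of the f16 size. The actual savings depend on the model architecture and context length.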


MMAP

  • Enables memory-mapped model loading
  • Reduces memory usage
  • Recommended for large models

MMAP is turned on by default.


Best Practices:

  • Start with default settings
  • Adjust based on your hardware capabilities
  • Monitor system performance
  • Test changes with your specific use case
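
One way to follow these practices is to record the defaults listed on this page and track which values you have changed during testing. The sketch below is a note-taking aid under stated assumptions: the class and field names are hypothetical and do not correspond to a Cortex configuration file or API. Only the default values come from this page.

```python
from dataclasses import dataclass
from typing import Optional

KV_CACHE_TYPES = ("f16", "q8_0", "q4_0")


@dataclass
class CortexSettings:
    # Field names are hypothetical; values are the defaults listed on this page.
    continuous_batching: bool = True
    parallel_operations: int = 4
    cpu_threads: Optional[int] = None  # None = left blank, use the system default
    flash_attention: bool = True
    caching: bool = True
    kv_cache_type: str = "f16"
    mmap: bool = True

    def __post_init__(self) -> None:
        # Reject values that the settings page would not accept.
        if self.kv_cache_type not in KV_CACHE_TYPES:
            raise ValueError(f"kv_cache_type must be one of {KV_CACHE_TYPES}")
        if self.parallel_operations < 1:
            raise ValueError("parallel_operations must be at least 1")
        if self.cpu_threads is not None and self.cpu_threads < 1:
            raise ValueError("cpu_threads must be blank (None) or at least 1")

    def changed_from_defaults(self) -> dict:
        """Return only the fields that differ from the documented defaults."""
        defaults = CortexSettings()
        return {k: v for k, v in vars(self).items() if getattr(defaults, k) != v}


trial = CortexSettings(kv_cache_type="q8_0", cpu_threads=8)
print(trial.changed_from_defaults())
# {'cpu_threads': 8, 'kv_cache_type': 'q8_0'}
```

Keeping the list of changed values short makes it easier to attribute a performance or stability change to a single setting.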

Performance impact varies by hardware, model size, and usage patterns.
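
Because results differ between machines, it can help to time the same workload before and after changing a setting. The following is a generic timing sketch and not part of Cortex: `median_seconds` and `sample_task` are hypothetical names, and `sample_task` is a stand-in that you would replace with a call that sends your usual prompt to the model.

```python
import statistics
import time


def median_seconds(task, runs=5):
    """Run `task` several times and return the median wall-clock duration."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def sample_task():
    # Stand-in workload; replace with a call that sends your usual prompt.
    sum(i * i for i in range(100_000))


print(f"median: {median_seconds(sample_task):.4f} s")
```

Using the median over several runs reduces the influence of one-off slowdowns such as model loading or background activity.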