# Performance Tuning

This page describes AutoMem’s performance optimization strategies, including embedding batching, relationship count caching, query time tracking, and structured logging. These optimizations reduce API costs by 40-50%, speed up consolidation by 80%, and improve monitoring capabilities. For operational monitoring strategies, see Health Monitoring.

AutoMem implements performance optimizations in four key areas:

```mermaid
graph TB
    subgraph "Optimization 1: Embedding Batching"
        EmbedRequest["Memory POST requests<br/>app.py:/memory"]
        EmbedQueue["embedding_queue<br/>ServiceState.embedding_queue"]
        BatchCollect["Batch Collection<br/>Size: EMBEDDING_BATCH_SIZE=20<br/>Timeout: EMBEDDING_BATCH_TIMEOUT_SECONDS=2.0"]
        BatchProcess["_process_embedding_batch()<br/>Single API call for 20 items"]

        EmbedRequest --> EmbedQueue
        EmbedQueue --> BatchCollect
        BatchCollect --> BatchProcess
    end

    subgraph "Optimization 2: LRU Caching"
        ConsolRequest["Consolidation cycles<br/>Decay/Creative/Cluster"]
        CacheImpl["_get_relationship_count_cached_impl()<br/>@lru_cache(maxsize=10000)"]
        CacheKey["Cache key: memory_id + hour_timestamp<br/>Automatic hourly invalidation"]

        ConsolRequest --> CacheImpl
        CacheImpl --> CacheKey
    end

    subgraph "Optimization 3: Async Processing"
        MemoryPOST["/memory endpoint<br/>app.py"]
        SyncWrite["FalkorDB write<br/>Immediate, blocking"]
        AsyncEmbed["Embedding generation<br/>Queued for background"]
        AsyncEnrich["Enrichment<br/>Queued for background"]
        Return["Return 200 OK<br/>100-150ms total"]

        MemoryPOST --> SyncWrite
        MemoryPOST --> AsyncEmbed
        MemoryPOST --> AsyncEnrich
        SyncWrite --> Return
    end

    subgraph "Optimization 4: Structured Logging"
        APIRequest["API Request"]
        PerfCounters["time.perf_counter()<br/>Start/End timestamps"]
        StructuredLog["logger.info()<br/>extra={'query': ..., 'latency_ms': ..., 'results': ...}"]
        QueryTimeField["Response JSON<br/>query_time_ms field"]

        APIRequest --> PerfCounters
        PerfCounters --> StructuredLog
        PerfCounters --> QueryTimeField
    end
```

These optimizations were implemented in version 0.6.0 with an estimated ROI of 200-300% in year 1.


## Embedding Batching

Prior to optimization, embeddings were generated one at a time via `_generate_real_embedding()`: each memory creation triggered a separate OpenAI API call, resulting in:

  • 1000 API calls per 1000 memories
  • High request overhead (~50ms per call)
  • Annual cost of $20-30 for typical usage

The embedding worker accumulates memories in a batch queue and processes them together using OpenAI’s bulk embedding API.

The worker uses a size- and timeout-based accumulation strategy, sketched below:

  1. Pop item from embedding_queue
  2. Add to batch list
  3. Check if len(batch) >= EMBEDDING_BATCH_SIZE or timeout elapsed
  4. If batch ready, call _process_embedding_batch()
  5. Otherwise, continue accumulating with 0.1s sleep intervals
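
A minimal sketch of this loop, assuming a standard `queue.Queue` and the OpenAI v1 Python client; the model string, queue item shape, and `stop_event` are illustrative assumptions, not the exact app.py implementation:

```python
import queue
import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EMBEDDING_BATCH_SIZE = 20
EMBEDDING_BATCH_TIMEOUT_SECONDS = 2.0


def embedding_worker(embedding_queue: queue.Queue, stop_event) -> None:
    """Accumulate queued memories and flush on batch size or timeout."""
    batch: list[dict] = []
    batch_started: float | None = None
    while not stop_event.is_set():
        try:
            batch.append(embedding_queue.get(timeout=0.1))  # ~0.1s poll interval
            batch_started = batch_started or time.monotonic()
        except queue.Empty:
            pass
        timed_out = (
            batch_started is not None
            and time.monotonic() - batch_started >= EMBEDDING_BATCH_TIMEOUT_SECONDS
        )
        if batch and (len(batch) >= EMBEDDING_BATCH_SIZE or timed_out):
            _process_embedding_batch(batch)
            batch, batch_started = [], None


def _process_embedding_batch(batch: list[dict]) -> None:
    """One bulk API call replaces len(batch) individual calls."""
    response = client.embeddings.create(
        model="text-embedding-3-small",  # illustrative; the input accepts a list of texts
        input=[item["content"] for item in batch],
    )
    for item, record in zip(batch, response.data):
        _store_embedding_in_qdrant(item["id"], record.embedding)  # per the table below
```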
| Function | Purpose | Location |
| --- | --- | --- |
| `embedding_worker()` | Main worker loop with batch accumulation | app.py:2405-2503 |
| `_process_embedding_batch()` | Processes accumulated batch | app.py:2443-2495 |
| `_generate_real_embeddings_batch()` | Calls OpenAI API with multiple texts | app.py:2329-2367 |
| `_store_embedding_in_qdrant()` | Stores single embedding in Qdrant | app.py:2369-2403 |

| Variable | Default | Range | Description | Tuning Guidance |
| --- | --- | --- | --- | --- |
| `EMBEDDING_BATCH_SIZE` | 20 | 1-2048 | Maximum items per batch | High-volume: 50-100; low-latency: 10 |
| `EMBEDDING_BATCH_TIMEOUT_SECONDS` | 2.0 | 0.1-60.0 | Max wait time for partial batch | Cost-optimized: 5-10s; low-latency: 1s |

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| API calls per 1000 memories | 1000 | 50-100 | 40-50% decrease |
| Annual embedding cost | $20-30 | $12-18 | $8-15 saved |
| Request overhead | 50ms/memory | 2.5-5ms/memory | 90% decrease |

## Relationship Count Caching

During consolidation, `calculate_relevance_score()` queries the graph for relationship counts to determine each memory's preservation weight. With 10,000 memories, this resulted in:

  • 10,000 graph queries per consolidation run
  • ~5 minute execution time for decay task
  • O(N) query complexity

An LRU cache with hourly invalidation reduces graph queries while maintaining fresh data.

The cache uses `functools.lru_cache` with an hour-based cache key for automatic invalidation: the wrapper function derives an `hour_key` from the current timestamp truncated to the hour and combines it with the `memory_id` to form the cache key.
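
A minimal sketch of the pattern (the cached function's name and cache size match the source; `_query_relationship_count_from_graph` is a placeholder for the actual graph call):

```python
import time
from functools import lru_cache


@lru_cache(maxsize=10000)
def _get_relationship_count_cached_impl(memory_id: str, hour_key: int) -> int:
    # hour_key participates in the cache key, so entries from a previous hour
    # are simply never hit again; LRU eviction reclaims them over time.
    return _query_relationship_count_from_graph(memory_id)  # placeholder graph query


def get_relationship_count(memory_id: str) -> int:
    hour_key = int(time.time() // 3600)  # constant within any given hour
    return _get_relationship_count_cached_impl(memory_id, hour_key)
```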

The `hour_key` approach balances:

  • Freshness: Data refreshes every 60 minutes
  • Performance: 80% cache hit rate during consolidation runs
  • Simplicity: No manual cache clearing required
| Component | Purpose | Invalidation |
| --- | --- | --- |
| `memory_id` | Unique identifier | Per-memory granularity |
| `hour_key` | Timestamp bucket | Automatic hourly refresh |
| LRU policy | Memory management | Evicts least-recently-used entries |

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Graph queries (10k memories) | 10,000 | ~2,000 | 80% decrease |
| Decay consolidation time | ~5 min | ~1 min | 80% faster |
| Cache hit rate | N/A | 80% | - |

| Metric | Value | Notes |
| --- | --- | --- |
| LRU cache max size | 10,000 entries | Sufficient for 10k memories |
| Cache hit rate (consolidation) | ~80% | Measured during decay runs |
| Cache invalidation | Hourly | Automatic via `hour_key` |
| Memory overhead | ~1-2 MB | Negligible for typical usage |

## Query Time Tracking

All API endpoints track query execution time using `time.perf_counter()` and include it in responses.
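
A sketch of the pattern, assuming a Flask handler (`run_recall_query` is a placeholder for the actual lookup):

```python
import time

from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/recall")
def recall():
    start = time.perf_counter()            # monotonic, high-resolution timer
    memories = run_recall_query()          # placeholder for the real graph/vector lookup
    elapsed_ms = (time.perf_counter() - start) * 1000
    return jsonify({
        "memories": memories,
        "count": len(memories),
        "query_time_ms": round(elapsed_ms, 1),
    })
```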

| Endpoint | Location | Metric Name |
| --- | --- | --- |
| GET /recall | app.py:2844-2961 | `query_time_ms` |
| POST /memory | app.py:1896-1964 | `query_time_ms` |
| GET /health | app.py:3043-3097 | `query_time_ms` |
| GET /analyze | app.py:3243-3331 | `query_time_ms` |

`/recall` endpoint:

```json
{
  "memories": [...],
  "count": 5,
  "query_time_ms": 42.3
}
```

`POST /memory` endpoint:

```json
{
  "memory_id": "abc-123",
  "status": "stored",
  "query_time_ms": 112.7
}
```

## Structured Logging

Structured logging provides machine-parseable performance data in log output, enabling:

  • Automated performance analysis
  • Bottleneck identification
  • Production debugging
  • Metrics dashboard integration

Logs use the `extra={}` parameter to include structured data alongside log messages:

```python
logger.info("Recall completed", extra={
    "query": query,
    "results": len(memories),
    "latency_ms": elapsed_ms,
    "vector_enabled": qdrant_available,
    "vector_matches": vector_hit_count,
    "has_time_filter": bool(time_filter),
    "has_tag_filter": bool(tag_filter),
    "limit": limit,
})
```
Fields logged for `/recall`:

| Field | Type | Description |
| --- | --- | --- |
| `query` | string | Query text |
| `results` | int | Number of results returned |
| `latency_ms` | float | Query execution time |
| `vector_enabled` | bool | Qdrant availability |
| `vector_matches` | int | Semantic search hits (if applicable) |
| `has_time_filter` | bool | Temporal filtering active |
| `has_tag_filter` | bool | Tag filtering active |
| `limit` | int | Result limit |

Fields logged for `POST /memory`:

| Field | Type | Description |
| --- | --- | --- |
| `memory_id` | string | Unique identifier |
| `type` | string | Memory classification |
| `importance` | float | User-defined priority |
| `tags_count` | int | Number of tags |
| `content_length` | int | Content size in bytes |
| `latency_ms` | float | Store operation time |
| `embedding_status` | string | "queued" or "provided" |
| `qdrant_status` | string | "queued", "stored", or "disabled" |
| `enrichment_queued` | bool | Enrichment pipeline status |
```bash
# Find slow recall queries (>500ms)
railway logs | grep "Recall completed" | jq 'select(.latency_ms > 500)'

# Count results by query pattern
railway logs | grep "Recall completed" | jq '.query' | sort | uniq -c

# Monitor embedding queue status
railway logs | grep "embedding_status" | jq '{status: .embedding_status, id: .memory_id}'
```

## Health Endpoint Metrics

The `/health` endpoint includes enrichment queue metrics without requiring authentication:

```json
{
  "status": "healthy",
  "enrichment": {
    "status": "running",
    "queue_depth": 3,
    "pending": 2,
    "inflight": 1,
    "processed": 1247,
    "failed": 2
  }
}
```
| Metric | Description | Alert Threshold |
| --- | --- | --- |
| `status` | "running", "stopped", or "error" | Alert if not "running" |
| `queue_depth` | Jobs waiting in queue | Alert if > 100 for 5+ min |
| `pending` | Unprocessed memories | Monitor for backlog |
| `inflight` | Currently processing | Usually 0-3 |
| `processed` | Total successful enrichments | Trend analysis |
| `failed` | Total failures | Alert if increasing |
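
These thresholds translate directly into a polling check. A hedged sketch using `requests` (the base URL is a placeholder, and a real monitor would also track how long `queue_depth` stays elevated):

```python
import requests


def check_enrichment_health(base_url: str) -> list[str]:
    """Poll /health once and return alert messages per the thresholds above."""
    health = requests.get(f"{base_url}/health", timeout=5).json()
    enrichment = health.get("enrichment", {})
    alerts = []
    if enrichment.get("status") != "running":
        alerts.append(f"enrichment status is {enrichment.get('status')!r}, expected 'running'")
    if enrichment.get("queue_depth", 0) > 100:
        # The guidance above is "> 100 for 5+ min"; sustained-duration tracking is omitted here.
        alerts.append(f"queue_depth {enrichment['queue_depth']} exceeds 100")
    return alerts
```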

## Configuration Reference

| Variable | Default | Range | Purpose | Performance Impact |
| --- | --- | --- | --- | --- |
| `EMBEDDING_BATCH_SIZE` | 20 | 1-2048 | Max items per embedding batch | Higher = fewer API calls, higher latency |
| `EMBEDDING_BATCH_TIMEOUT_SECONDS` | 2.0 | 0.1-60.0 | Max wait time for partial batch | Higher = better batching, higher latency |
| `CONSOLIDATION_DECAY_INTERVAL_SECONDS` | 3600 | 60-86400 | Seconds between decay runs | Lower = fresher scores, more CPU |
| `CONSOLIDATION_CREATIVE_INTERVAL_SECONDS` | 3600 | 600-604800 | Seconds between creative association discovery | Lower = more associations, more CPU |
| `CONSOLIDATION_CLUSTER_INTERVAL_SECONDS` | 21600 | 3600-2592000 | Seconds between clustering runs | Lower = fresher clusters, more CPU |

## Tuning Scenarios

High-Volume Scenario (>5000 memories/day):

```bash
EMBEDDING_BATCH_SIZE=100
EMBEDDING_BATCH_TIMEOUT_SECONDS=5
CONSOLIDATION_DECAY_INTERVAL_SECONDS=7200
```

Trade-offs:

  • Pros: Maximum cost efficiency, reduced API load
  • Cons: 5s max latency for embeddings, less frequent score updates

Low-Latency Scenario (interactive applications):

```bash
EMBEDDING_BATCH_SIZE=10
EMBEDDING_BATCH_TIMEOUT_SECONDS=1
CONSOLIDATION_DECAY_INTERVAL_SECONDS=1800
```

Trade-offs:

  • Pros: 1s max latency, fresher relevance scores
  • Cons: Higher API costs, more frequent consolidation overhead

Cost-Optimized Scenario (can tolerate delays):

```bash
EMBEDDING_BATCH_SIZE=50
EMBEDDING_BATCH_TIMEOUT_SECONDS=10
CONSOLIDATION_DECAY_INTERVAL_SECONDS=86400
```

Trade-offs:

  • Pros: Minimum API costs, lowest overhead
  • Cons: 10s max latency, stale scores possible

## Performance Impact Summary

| Metric | Before v0.6.0 | After v0.6.0 | Improvement |
| --- | --- | --- | --- |
| OpenAI API calls/day | 1000 | 50-100 | 40-50% decrease |
| API request overhead | 50s/day | 5-10s/day | 80-90% decrease |
| Annual embedding cost | $20-30 | $12-18 | $8-15 saved |
| Consolidation time (10k memories) | ~5 min | ~1 min | 80% faster |
| /memory POST latency | 250-400ms | 100-150ms | 60% faster |

## Trade-offs and Mitigations

Trade-off: Batching introduces latency (up to `EMBEDDING_BATCH_TIMEOUT_SECONDS`) for partial batches.

Mitigation:

  • Set the timeout based on use case (1-2s for interactive workloads, 5-10s for batch processing)
  • Track `queue_depth` via the /health endpoint
  • Alert when queue depth exceeds 50, which indicates batching may be too aggressive

Trade-off: Hourly cache invalidation means relationship counts may be up to 1 hour stale.

Impact: Minimal for consolidation use case, as relevance scores change slowly.

Alternative: For real-time applications, consider invalidating the cache when a relationship is created (requires code modification in consolidation.py:152-176); a sketch follows.
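
`functools.lru_cache` only exposes whole-cache clearing via `cache_clear()`, so this hypothetical hook trades hit rate for immediate freshness:

```python
def on_relationship_created(source_id: str, target_id: str) -> None:
    # Hypothetical hook, called wherever relationships are written: dropping the
    # whole cache is coarse, but guarantees the next read sees a fresh count.
    _get_relationship_count_cached_impl.cache_clear()
```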

## Resource Overhead

Embedding Queue: Each queued memory holds ~1KB (content + metadata)

  • Max queue depth: ~1000 items
  • Max memory: ~1MB

LRU Cache: Each cached count holds ~100 bytes (memory_id + count + hour_key)

  • Max cache size: 10,000 entries
  • Max memory: ~1MB

Total overhead: <5MB for typical deployments

Query Time Tracking: time.perf_counter() adds <0.1ms per request (negligible)

Structured Logging: Log volume increases by ~200 bytes per request

  • Impact: Minimal for most log aggregation services
  • Mitigation: Configure log retention policies appropriately

## Future Optimizations

Potential: Reduce embedding dimensions from 768 to 512 for an additional 33% cost reduction

Implementation: Modify _generate_real_embedding() to pass dimensions=512 to OpenAI API

Trade-off: Slightly lower semantic search accuracy (~1-2% reduction)
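
A hedged sketch of that change, assuming a model from OpenAI's text-embedding-3 family (which accepts a `dimensions` parameter); note that the Qdrant collection would need to be recreated at the new vector size and existing memories re-embedded:

```python
from openai import OpenAI

client = OpenAI()


def _generate_real_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",  # assumption: a dimensions-capable model
        input=text,
        dimensions=512,                  # down from the current 768
    )
    return response.data[0].embedding
```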

Potential: Batch all relationship queries in consolidation into a single Cypher query for a ~95% speedup

Complexity: Requires rewriting decay logic to fetch all counts at once, then compute scores

Estimated effort: 4 hours implementation + testing
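
A hedged sketch of what the single-pass query might look like; the `Memory` label and `id` property are assumptions about the graph schema:

```python
# One query returns every memory's relationship count in a single pass.
COUNT_ALL_RELATIONSHIPS = """
MATCH (m:Memory)
OPTIONAL MATCH (m)-[r]-()
RETURN m.id AS memory_id, count(r) AS rel_count
"""


def fetch_all_relationship_counts(graph) -> dict:
    """graph is a FalkorDB Graph object; result rows are (memory_id, rel_count)."""
    result = graph.query(COUNT_ALL_RELATIONSHIPS)
    return {memory_id: rel_count for memory_id, rel_count in result.result_set}
```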

Potential: Export structured metrics via /metrics endpoint for Grafana dashboards

Benefits: Real-time performance monitoring, historical trend analysis

Libraries: prometheus_client for Python
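
A minimal sketch with `prometheus_client`, assuming it is wired into the existing Flask `app`; the metric names are illustrative:

```python
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

RECALL_LATENCY = Histogram("automem_recall_latency_seconds", "Recall query latency")
MEMORIES_STORED = Counter("automem_memories_stored_total", "Total memories stored")


@app.route("/metrics")
def metrics():
    # Prometheus scrapes this endpoint; Grafana dashboards then read from Prometheus.
    return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}
```

Handlers would then call `RECALL_LATENCY.observe(...)` and `MEMORIES_STORED.inc()` at the appropriate points.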


## Disabling Optimizations

To revert embedding batching to synchronous-equivalent behavior, set the batch size to 1:

```bash
EMBEDDING_BATCH_SIZE=1
EMBEDDING_BATCH_TIMEOUT_SECONDS=0.1
```

Effect: Each embedding is processed immediately (pre-v0.6.0 behavior).

The relationship-count cache cannot be fully disabled without code changes, since its hourly `hour_key` invalidation is built in. However, setting `CONSOLIDATION_DECAY_INTERVAL_SECONDS` to a high value minimizes consolidation frequency, and with it how often the cache is exercised.

To exclude enrichment metrics from the health response, modify app.py:3043-3097 to omit the enrichment key from the response JSON.