Troubleshooting
This page covers common operational issues and their solutions when deploying and running AutoMem. For configuration reference, see Railway Deployment. For monitoring setup, see Health Monitoring. For backup and recovery procedures, see Backup & Recovery.
Connection Failures
ECONNREFUSED Errors
The most common deployment issue is services unable to connect to each other, manifesting as ECONNREFUSED errors in logs.
```mermaid
graph TB
subgraph external["External Clients"]
Client["HTTP Client"]
end
subgraph railway["Railway Internal Network (IPv6)"]
subgraph mcp["mcp-sse-server"]
MCPServer["Node.js Server<br/>Binds to ::<br/>PORT from env"]
end
subgraph api["memory-service"]
FlaskApp["Flask App<br/>app.py<br/>Binds to ::<br/>PORT from env"]
end
subgraph db["falkordb"]
FalkorDB["FalkorDB<br/>Port 6379<br/>FALKOR_PASSWORD"]
end
end
Client -->|"HTTPS<br/>Public Domain"| MCPServer
Client -->|"HTTPS<br/>Public Domain"| FlaskApp
MCPServer -->|"HTTP<br/>memory-service.railway.internal:8001"| FlaskApp
FlaskApp -->|"Redis Protocol<br/>falkordb.railway.internal:6379"| FalkorDB
```
Symptom: Connection Refused to Memory Service
Error in logs:
```
Error: connect ECONNREFUSED fd12:ca03:42be:0:1000:50:1079:5b6c:8001
[AutoMem] GET http://memory-service.railway.internal:8001/health failed
```
Root Cause 1: Missing PORT Environment Variable
Problem: Flask defaults to port 5000 when PORT is not set, but other services expect port 8001.
Solution:
- Go to Railway Dashboard → memory-service → Variables
- Add: PORT=8001
- Redeploy the service
Code reference: app.py:1-10 — Flask app initialization reads PORT from environment.
Root Cause 2: IPv4-Only Binding (Fixed in v0.7.1+)
Problem: Older versions bound to 0.0.0.0 (IPv4 only), but Railway's internal network uses IPv6 addresses.
Check startup logs:
```
# Expected output (v0.7.1+):
* Running on http://[::]:8001

# Bad output (old versions):
* Running on http://0.0.0.0:8001
```
Solution: Update to AutoMem v0.7.1 or later. Flask now binds to :: (dual-stack IPv6/IPv4) via the host="::" parameter in app.py.
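For illustration, a minimal sketch of what the fixed entry point looks like, assuming a standard Flask startup (the actual code is in app.py; the default port here is only an example):
```python
# Sketch of a dual-stack Flask entry point (illustrative, not the exact app.py).
import os

from flask import Flask

app = Flask(__name__)

if __name__ == "__main__":
    # Railway injects PORT; default to 8001 here so local runs match the docs.
    port = int(os.environ.get("PORT", "8001"))
    # "::" binds IPv6 and, on dual-stack hosts, IPv4 as well.
    app.run(host="::", port=port)
```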
Root Cause 3: Wrong Internal Hostname
Problem: Using public domain or incorrect internal hostname.
Solution: Ensure AUTOMEM_API_URL uses the internal hostname format:
```
# Correct (internal):
AUTOMEM_API_URL=http://memory-service.railway.internal:8001

# Incorrect (public domain):
AUTOMEM_API_URL=https://your-service.up.railway.app
```
Internal hostnames follow the pattern {service-name}.railway.internal.
Variable Resolution Failures
Symptom: Variables Show Literal ${{...}} in Logs
Error example:
```
FALKORDB_HOST=${{falkordb.RAILWAY_PRIVATE_DOMAIN}}
```
Root Cause: Template Syntax in Manual Configuration
Problem: Railway's ${{...}} variable reference syntax only works in Railway templates, not manual service configuration.
```mermaid
flowchart TD
Start["Variable not resolving?"]
Start --> Check1{"Using Railway<br/>template deploy?"}
Check1 -->|Yes| Check2{"Variable shows<br/>literal ${{...}}?"}
Check1 -->|No| UseHardcoded["Use hardcoded values<br/>NOT template syntax"]
Check2 -->|Yes| Bug["Template bug<br/>Report to Railway"]
Check2 -->|No| Working["Working correctly"]
UseHardcoded --> SetVars["Set variables:<br/>FALKORDB_HOST=falkordb.railway.internal<br/>FALKORDB_PASSWORD=(copy from falkordb)"]
SetVars --> Verify["Test connection:<br/>railway run redis-cli ping"]
Bug --> Workaround["Workaround:<br/>Use hardcoded values"]
Workaround --> SetVars
```
Solution: Use hardcoded values for manual deployments:
```
# Set these directly in Railway Dashboard → memory-service → Variables
FALKORDB_HOST=falkordb.railway.internal
FALKORDB_PORT=6379
FALKORDB_PASSWORD=(copy exact value from falkordb service variables)
```
Benefits:
- More stable and predictable
- Easier to debug
- Works across redeployments
- Clear in logs
Authentication Issues
Invalid API Token Errors
Symptom: 401 Unauthorized
Error response:
```json
{"error": "Unauthorized", "message": "Invalid or missing API token"}
```
Diagnosis Flow
```mermaid
flowchart TD
Error["401 Unauthorized Error"]
Error --> Check1{"Which endpoint<br/>failing?"}
Check1 -->|"/admin/*"| AdminToken["Check ADMIN_API_TOKEN"]
Check1 -->|"Other endpoints"| StdToken["Check AUTOMEM_API_TOKEN"]
AdminToken --> Verify1["Compare token in request<br/>vs Railway variables"]
StdToken --> Verify2["Compare token in request<br/>vs Railway variables"]
Verify1 --> Match1{"Tokens match?"}
Verify2 --> Match2{"Tokens match?"}
Match1 -->|No| Rotate1["Update token in:<br/>1. Railway variables<br/>2. Client config<br/>3. MCP bridge"]
Match2 -->|No| Rotate2["Update token in:<br/>1. Railway variables<br/>2. Client config<br/>3. MCP bridge"]
Match1 -->|Yes| CheckMethod1["Check auth method"]
Match2 -->|Yes| CheckMethod2["Check auth method"]
CheckMethod1 --> Methods["Supported methods:<br/>1. Authorization: Bearer TOKEN<br/>2. X-API-Key: TOKEN<br/>3. ?api_key=TOKEN"]
CheckMethod2 --> Methods
```
Solution: Verify Token Configuration
Step 1: Get current tokens
```
# Check token in Railway Dashboard → memory-service → Variables
# Look for AUTOMEM_API_TOKEN or ADMIN_API_TOKEN
```
Step 2: Test token
```bash
curl -H "Authorization: Bearer your-token" \
  https://your-service.up.railway.app/health
```
Step 3: If tokens don't match, update everywhere:
- Railway variable (source of truth): Update AUTOMEM_API_TOKEN in Railway Dashboard
- MCP bridge (if using): Update AUTOMEM_API_TOKEN in the mcp-sse-server service variables
- Client configuration (Claude Desktop, Cursor, etc.): Update .env or platform config file with the copied token value
Code reference: app.py:1-50 — require_api_token middleware validates tokens.
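As a rough illustration, here is a minimal sketch of token validation assuming Flask and the three auth methods listed in the diagram above; the real require_api_token implementation in app.py may differ:
```python
# Illustrative token check (not the actual app.py code): accepts the three
# documented auth methods in priority order.
import os

from flask import Flask, abort, request

app = Flask(__name__)

@app.before_request
def require_api_token():
    expected = os.environ.get("AUTOMEM_API_TOKEN", "")
    auth = request.headers.get("Authorization", "")
    token = (
        auth[len("Bearer "):].strip()           # method 1: Authorization: Bearer
        if auth.startswith("Bearer ")
        else request.headers.get("X-API-Key")   # method 2: X-API-Key header
        or request.args.get("api_key")          # method 3: ?api_key= query param
    )
    if not expected or token != expected:
        abort(401)
```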
TCP Proxy Configuration
When TCP Proxy is Required
TCP Proxy enables external access to FalkorDB for:
- GitHub Actions automated backups
- Local development against production
- Manual database inspection
Not needed for:
- Service-to-service communication (use *.railway.internal)
- Normal API operations
```mermaid
graph TB
subgraph external["External Network"]
GHA["GitHub Actions<br/>Backup Workflow"]
DevMachine["Developer Machine<br/>redis-cli"]
end
subgraph railway["Railway Project"]
subgraph services["Internal Network"]
API["memory-service<br/>memory-service.railway.internal:8001"]
FalkorInternal["falkordb<br/>falkordb.railway.internal:6379"]
end
subgraph networking["Railway Networking"]
TCPProxy["TCP Proxy<br/>monorail.proxy.rlwy.net:XXXXX<br/>RAILWAY_TCP_PROXY_DOMAIN<br/>RAILWAY_TCP_PROXY_PORT"]
end
end
API -->|"Internal connection<br/>No proxy needed"| FalkorInternal
GHA -->|"MUST use TCP Proxy<br/>External connection"| TCPProxy
DevMachine -->|"MUST use TCP Proxy<br/>External connection"| TCPProxy
TCPProxy -->|"Forwards to"| FalkorInternal
```
Symptom: GitHub Actions Backup Failing
Error in GitHub Actions logs:
```
❌ ERROR: Cannot connect to FalkorDB
Error: [Errno 104] Connection reset by peer
```
Root Cause: Using Internal Hostname from External Runner
GitHub Actions runners are outside Railway's network and cannot resolve *.railway.internal hostnames. The workflow at .github/workflows/backup.yml validates that FALKORDB_HOST does not contain .railway.internal.
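That validation amounts to a simple guard. A minimal sketch in Python, under the assumption that it just inspects the environment (the actual check lives in the workflow file):
```python
# Illustrative guard equivalent to the workflow's hostname validation.
import os
import sys

host = os.environ.get("FALKORDB_HOST", "")
if ".railway.internal" in host:
    sys.exit("FALKORDB_HOST must be the TCP proxy domain; "
             "internal Railway hostnames are unreachable from GitHub Actions")
```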
Solution: Enable and Configure TCP Proxy
Step 1: Enable TCP Proxy
- Railway Dashboard → falkordb service
- Settings → Networking → Enable TCP Proxy
- Note the public endpoint (e.g., monorail.proxy.rlwy.net:12345)
Step 2: Update GitHub Secrets
In your GitHub repository → Settings → Secrets and variables → Actions:
```
FALKORDB_HOST = monorail.proxy.rlwy.net
FALKORDB_PORT = 12345
FALKORDB_PASSWORD = (same password as the Railway service variable)
```
Step 3: Test Connectivity
```bash
# From local machine with redis-cli
redis-cli -h monorail.proxy.rlwy.net -p 12345 -a your-password ping
```
Debug checklist:
- TCP Proxy is enabled in Railway Dashboard
- FALKORDB_HOST secret uses the TCP proxy domain (not .railway.internal)
- FALKORDB_PORT secret uses the TCP proxy port (not 6379)
- FALKORDB_PASSWORD matches the value in the FalkorDB service variables
- Test connection with redis-cli from a local machine succeeds
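If redis-cli is not installed locally, the same check can be run from Python with the redis package (host, port, and password below are placeholders):
```python
# External connectivity check through the TCP proxy; prints True on success.
import redis

client = redis.Redis(
    host="monorail.proxy.rlwy.net",  # placeholder: your proxy domain
    port=12345,                      # placeholder: your proxy port
    password="your-password",
    socket_timeout=5,
)
print(client.ping())
```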
Data Synchronization Issues
Health Monitor Drift Detection
Section titled “Health Monitor Drift Detection”Symptom: Drift Alerts or Health Check Warnings
Log output:
```
⚠️ WARNING: Drift detected - FalkorDB: 850 memories, Qdrant: 892 memories (4.7% drift)
```
Understanding Drift Thresholds
```mermaid
flowchart TD
Monitor["Health Monitor<br/>health_monitor.py"]
Monitor --> Count["Count memories:<br/>FalkorDB vs Qdrant"]
Count --> Calc["Calculate drift %:<br/>(|falkor - qdrant| / qdrant) * 100"]
Calc --> Check{"Drift level?"}
Check -->|"< 1%"| Normal["Normal<br/>In-flight writes<br/>No action needed"]
Check -->|"1-5%"| Warning["Warning<br/>HEALTH_MONITOR_DRIFT_THRESHOLD<br/>Monitor but OK"]
Check -->|"5-50%"| Alert["Alert<br/>Investigation needed<br/>Webhook notification"]
Check -->|"> 50%"| Critical["Critical<br/>HEALTH_MONITOR_CRITICAL_THRESHOLD<br/>Data loss event<br/>Trigger recovery"]
Alert --> Manual["Manual review:<br/>Check logs<br/>Verify writes"]
Critical --> Auto["Auto-recovery (if enabled):<br/>recover_from_qdrant.py"]
```
Default thresholds:
- HEALTH_MONITOR_DRIFT_THRESHOLD=5 (warning level: alert if drift exceeds 5%)
- HEALTH_MONITOR_CRITICAL_THRESHOLD=50 (critical level: trigger recovery if drift exceeds 50%)
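To make the formula in the diagram concrete, a worked example (pure illustration; the real check lives in health_monitor.py):
```python
# Drift percentage as defined above: (|falkor - qdrant| / qdrant) * 100
def drift_percent(falkor_count: int, qdrant_count: int) -> float:
    if qdrant_count == 0:
        return 0.0
    return abs(falkor_count - qdrant_count) / qdrant_count * 100

# The example log line: 850 vs 892 memories -> ~4.7% drift (warning band)
print(round(drift_percent(850, 892), 1))  # 4.7
```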
Common Causes of Drift
1. In-Flight Writes (Normal, <1% drift)
- Memory being written during health check
- Embedding worker processing queue
- Enrichment pipeline updating metadata
2. Failed Writes (5-10% drift)
Qdrant writes have try/except wrappers in automem/stores/qdrant.py. If the Qdrant service is unreachable, writes to FalkorDB succeed but Qdrant writes fail silently, causing drift.
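This failure mode is the classic dual-write problem. A minimal sketch of the pattern, with hypothetical names (the actual wrappers are in automem/stores/qdrant.py):
```python
# Dual-write sketch: the graph write is authoritative, the vector write is
# best-effort, so a Qdrant outage silently accrues drift.
import logging

logger = logging.getLogger("automem")

def store_memory(memory, falkor_store, qdrant_store):
    falkor_store.write(memory)  # raises on failure: the request fails loudly
    try:
        qdrant_store.upsert(memory)  # wrapped: failure is logged, not raised
    except Exception as exc:
        logger.warning("Qdrant upsert failed; drift will accrue: %s", exc)
```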
3. Data Loss Event (>50% drift)
- FalkorDB restart without persistent volume
- Qdrant collection deleted or reset
- Manual data manipulation
Solution: Recover from Drift
Option 1: Auto-Recovery (if enabled)
```
# Set on health monitor service
HEALTH_MONITOR_AUTO_RECOVER=true
```
Option 2: Manual Recovery
```bash
python scripts/recover_from_qdrant.py
```
Expected recovery time:
- 5-10 minutes for 1000 memories
- 99.7% data recovery rate
- Relationships automatically rebuilt
For complete recovery procedures, see Backup & Recovery.
Embedding Provider Issues
API Key Configuration Errors
Section titled “API Key Configuration Errors”Symptom: Embeddings Skipped or Placeholder Used
Log output:
```
⚠️ No OPENAI_API_KEY or VOYAGE_API_KEY found - using placeholder embeddings
Semantic search will not work correctly
```
Embedding Provider Selection Flow
```mermaid
flowchart TD
Request["Embedding Request<br/>EmbeddingProvider.get_instance()"]
Request --> CheckConfig{"EMBEDDING_PROVIDER<br/>config?"}
CheckConfig -->|"auto (default)"| Priority["Auto-selection priority"]
CheckConfig -->|"voyage"| ForceVoyage["Force Voyage<br/>Require VOYAGE_API_KEY"]
CheckConfig -->|"openai"| ForceOpenAI["Force OpenAI<br/>Require OPENAI_API_KEY"]
CheckConfig -->|"local"| ForceLocal["Force FastEmbed<br/>Local ONNX model"]
CheckConfig -->|"placeholder"| ForcePlaceholder["Force Placeholder<br/>Hash-based"]
Priority --> Try1{"VOYAGE_API_KEY<br/>set?"}
Try1 -->|Yes| Voyage["VoyageEmbeddingProvider<br/>voyage-4: 1024d"]
Try1 -->|No| Try2{"OPENAI_API_KEY<br/>set?"}
Try2 -->|Yes| OpenAI["OpenAIEmbeddingProvider<br/>text-embedding-3-small: 768d"]
Try2 -->|No| Try3{"FastEmbed<br/>available?"}
Try3 -->|Yes| FastEmbed["FastEmbedProvider<br/>BAAI/bge-base-en-v1.5: 768d"]
Try3 -->|No| Placeholder["PlaceholderEmbeddingProvider<br/>No semantic meaning"]
Voyage --> Validate["Validate dimension<br/>matches VECTOR_SIZE"]
OpenAI --> Validate
FastEmbed --> Validate
Placeholder --> Validate
ForceVoyage --> Validate
ForceOpenAI --> Validate
ForceLocal --> Validate
ForcePlaceholder --> Validate
Validate --> Store["Store in Qdrant<br/>or log warning"]
```
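In code form, the auto-selection priority above reduces to something like the following sketch (illustrative only; the real logic is EmbeddingProvider.get_instance()):
```python
# Auto-selection priority: Voyage > OpenAI > FastEmbed > placeholder.
import os

def pick_provider() -> str:
    if os.getenv("VOYAGE_API_KEY"):
        return "voyage"        # voyage-4, 1024 dims
    if os.getenv("OPENAI_API_KEY"):
        return "openai"        # text-embedding-3-small, 768 dims
    try:
        import fastembed  # noqa: F401  (local ONNX fallback)
        return "local"         # BAAI/bge-base-en-v1.5, 768 dims
    except ImportError:
        return "placeholder"   # hash-based, no semantic meaning
```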
Solution: Configure API Keys
Step 1: Choose provider
| Provider | Cost | Quality | Key Required |
|---|---|---|---|
| Voyage AI | $0.00012/1K tokens | Best | VOYAGE_API_KEY |
| OpenAI | $0.00002/1K tokens | Good | OPENAI_API_KEY |
| FastEmbed | Free | Good (local ONNX) | None |
| Placeholder | Free | No semantic meaning | None (testing only) |
Step 2: Verify provider is working
```bash
# Check which provider is active
curl https://your-service.up.railway.app/health | jq '.embedding_provider'
```
Step 3: Re-embed existing memories (if needed)
If you switch providers, existing memories may have incorrect-dimension vectors. Trigger re-embedding:
```bash
curl -X POST https://your-service.up.railway.app/admin/reembed \
  -H "Authorization: Bearer your-admin-token"
```
Vector Dimension Mismatch
Symptom: Qdrant Upsert Errors
Error in logs:
```
ValueError: Vector dimension mismatch: expected 768, got 1024
Failed to upsert memory to Qdrant
```
Root Cause: Changed Provider or Model
Section titled “Root Cause: Changed Provider or Model”- Scenario 1: Switching from OpenAI (768d) to Voyage (1024d)
- Scenario 2: Changing EMBEDDING_MODEL to a model with different dimensions
- Scenario 3: Existing Qdrant collection has different dimensions
Solution: Reconfigure or Reset Collection
Option 1: Auto-detect existing dimensions (safe)
The auto-detection logic in automem/stores/qdrant.py:100-150 checks the existing collection schema. Set EMBEDDING_PROVIDER=auto and restart the service — it will detect the existing dimensions.
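To confirm what dimension the existing collection actually has, the Qdrant client can be queried directly (URL, key, and collection name are placeholders; a single unnamed-vector collection is assumed):
```python
# Inspect the configured vector size of an existing Qdrant collection.
from qdrant_client import QdrantClient

client = QdrantClient(url="https://your-qdrant-url", api_key="your-qdrant-key")
info = client.get_collection("memories")
print(info.config.params.vectors.size)  # e.g. 768 or 1024
```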
Option 2: Reset Qdrant collection (data loss)
```bash
# Delete and recreate the collection
curl -X DELETE https://your-qdrant-url/collections/memories \
  -H "api-key: your-qdrant-key"
# Then restart memory-service to recreate with correct dimensions
```
Option 3: Create new collection and migrate
```bash
# Create a new collection with target dimensions
# Migrate by re-embedding all memories from FalkorDB
curl -X POST https://your-service.up.railway.app/admin/reembed \
  -H "Authorization: Bearer your-admin-token"
```
Performance Issues
High Memory Usage
Symptom: OOM Kills or Slow Performance
Railway logs:
```
Error: Process killed (OOM)
Memory usage: 1.2GB / 1GB limit
```
Memory Usage by Component
| Component | Typical RAM | High Load | Config |
|---|---|---|---|
| Flask API | 100-200MB | 300-500MB | - |
| FalkorDB | 200-400MB | 800MB-2GB | --maxmemory |
| Enrichment Queue | 50-100MB | 200MB | ENRICHMENT_QUEUE_SIZE |
| Embedding Queue | 50-100MB | 200MB | EMBEDDING_BATCH_SIZE |
| Total recommended | 512MB-1GB | 1.5-2GB | Railway service size |
Solution: Optimize Memory Configuration
FalkorDB memory limits:
```
# In Railway Dashboard → falkordb → Variables
REDIS_ARGS=--maxmemory 512mb --maxmemory-policy allkeys-lru --save 60 1 --appendonly yes
```
Queue size limits:
```
# Reduce queue sizes if experiencing memory pressure
EMBEDDING_BATCH_SIZE=10
ENRICHMENT_QUEUE_SIZE=100
```
Graceful degradation check:
```bash
# Check if service is degrading gracefully
curl https://your-service.up.railway.app/health | jq '.enrichment.queue_depth'
```
Slow Query Performance
Symptom: High Latency on /recall
Response includes timing:
```json
{
  "memories": [...],
  "query_time_ms": 1247.3
}
```
Performance Optimization Checklist
1. Check relationship cache:
The LRU cache in automem/consolidation.py:200-250 should have an 80% hit rate during normal operation. If consolidation is running frequently, the cache may be continuously invalidated.
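If that cache is a standard functools.lru_cache (an assumption; see automem/consolidation.py for the real structure), its hit rate can be sampled like this:
```python
# Hypothetical: read the hit rate of an lru_cache-backed lookup.
from functools import lru_cache

@lru_cache(maxsize=1024)
def related_memories(memory_id: str) -> tuple:
    ...  # stand-in for the real graph lookup

related_memories("mem-1")  # populate the cache
info = related_memories.cache_info()
hit_rate = info.hits / max(info.hits + info.misses, 1)
print(f"hit rate: {hit_rate:.0%}")  # healthy steady state is around 80%
```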
2. Verify indexes:
Ensure the Qdrant keyword index is set up correctly. The index setup is in automem/stores/qdrant.py:1-100.
3. Limit expansion:
High RECALL_LIMIT values increase graph traversal time:
```
# Reduce if high latency is observed
RECALL_LIMIT=10  # default is 20
```
4. Check batch processing:
A high queue_depth in /health indicates the embedding worker is backlogged. Increase EMBEDDING_BATCH_SIZE or reduce EMBEDDING_BATCH_TIMEOUT_SECONDS.
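A quick way to watch the backlog from a script (endpoint and token as documented above; the alert threshold of 100 is an arbitrary example):
```python
# Poll the health endpoint and flag an embedding backlog.
import requests

resp = requests.get(
    "https://your-service.up.railway.app/health",
    headers={"Authorization": "Bearer your-token"},
    timeout=10,
)
depth = resp.json()["enrichment"]["queue_depth"]
if depth > 100:
    print(f"Embedding worker backlogged: queue_depth={depth}")
```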
Diagnostic Commands
Quick Health Check
```bash
# Full health status
curl https://your-service.up.railway.app/health | jq .

# Check specific components
curl https://your-service.up.railway.app/health | jq '{
  status: .status,
  falkordb: .statistics.falkordb,
  qdrant: .statistics.qdrant,
  enrichment: .enrichment.status,
  queue: .enrichment.queue_depth
}'
```
Connection Testing
```bash
# Test memory service
curl https://your-service.up.railway.app/health

# Test FalkorDB via TCP proxy (external)
redis-cli -h monorail.proxy.rlwy.net -p 12345 -a your-password ping

# Test Qdrant
curl https://your-qdrant-url/collections/memories
```
Log Analysis
```bash
# Railway CLI log streaming
railway logs --follow --service memory-service

# Filter for errors
railway logs | grep -E "ERROR|CRITICAL|Exception"

# Filter for performance issues
railway logs | grep "latency_ms" | jq 'select(.latency_ms > 500)'

# Check drift warnings
railway logs | grep "Drift detected"
```
Service Status
```bash
# List all Railway services in project
railway status

# Check specific service variables
railway variables --service memory-service

# Trigger manual redeploy
railway redeploy --service memory-service
```
Getting Help
If issues persist after following this guide:
- Check logs first:
  ```bash
  railway logs --service memory-service | tail -100
  ```
- Verify configuration:
  - All required environment variables set (see Railway Deployment)
  - Tokens match between services
  - Hostnames use internal .railway.internal domains
  - PORT=8001 is set on memory-service
- Test minimal configuration:
  - Deploy with only FalkorDB (no Qdrant)
  - Use placeholder embeddings (no OpenAI)
  - Disable MCP bridge temporarily
- Open an issue:
  - GitHub: https://github.com/verygoodplugins/automem/issues
  - Include: error logs, configuration (redact secrets), Railway service logs
  - Mention which troubleshooting steps you've already tried