
Health Monitoring

This page covers AutoMem’s health monitoring system, which continuously tracks the status of FalkorDB and Qdrant databases, detects data synchronization drift, and can automatically trigger recovery when data loss is detected. It also documents the monitoring and introspection endpoints that provide visibility into AutoMem’s operational status, database connectivity, queue health, and memory statistics.

For backup strategies and disaster recovery procedures, see Backup & Recovery. For deployment configuration, see Railway Deployment.

AutoMem’s health monitoring system operates at multiple levels to ensure data integrity and service availability.

graph TB
    subgraph "Health Monitoring System"
        HealthEndpoint["/health Endpoint<br/>app.py"]
        HealthMonitor["health_monitor.py<br/>Continuous Monitoring"]
        GHAWorkflow[".github/workflows/backup.yml<br/>Scheduled Checks"]
    end

    subgraph "Data Stores"
        FalkorDB["FalkorDB<br/>Graph Database<br/>Port 6379"]
        Qdrant["Qdrant<br/>Vector Database<br/>Port 6333"]
    end

    subgraph "Monitoring Outputs"
        Dashboard["Railway Dashboard<br/>Service Metrics"]
        Webhooks["Slack/Discord<br/>Alert Webhooks"]
        AutoRecovery["recover_from_qdrant.py<br/>Auto-Recovery Script"]
    end

    subgraph "External Tools"
        UptimeRobot["UptimeRobot<br/>External Monitoring"]
        Grafana["Grafana Cloud<br/>Observability"]
    end

    HealthEndpoint -->|"Check Status"| FalkorDB
    HealthEndpoint -->|"Check Status"| Qdrant

    HealthMonitor -->|"Poll /health"| HealthEndpoint
    HealthMonitor -->|"Count Memories"| FalkorDB
    HealthMonitor -->|"Count Vectors"| Qdrant
    HealthMonitor -->|"Calculate Drift"| DriftCheck{"Drift > 5%?"}

    GHAWorkflow -->|"Scheduled Check"| HealthEndpoint

    DriftCheck -->|"Warning"| Webhooks
    DriftCheck -->|"Critical > 50%"| Webhooks
    DriftCheck -->|"Critical + Auto-Recover"| AutoRecovery

    HealthMonitor -->|"Metrics"| Dashboard

    UptimeRobot -->|"HTTP Checks"| HealthEndpoint
    Grafana -->|"Metrics Aggregation"| HealthEndpoint

    AutoRecovery -->|"Rebuild Graph"| FalkorDB
    AutoRecovery -->|"Source Data"| Qdrant

The health endpoint provides real-time service status, database connectivity checks, and enrichment pipeline metrics. This endpoint does not require authentication and is designed for automated health monitoring systems.

| Property | Value |
| --- | --- |
| Path | /health |
| Method | GET |
| Authentication | None (public endpoint) |
| Response Type | JSON |
| Timeout | 100 seconds (Railway default) |
Example response:

{
  "status": "healthy",
  "falkordb": "connected",
  "qdrant": "connected",
  "memory_count": 884,
  "qdrant_count": 884,
  "graph": "memories",
  "timestamp": "2025-10-20T14:30:00Z",
  "enrichment": {
    "status": "running",
    "queue_depth": 2,
    "pending": 2,
    "inflight": 0,
    "processed": 1247,
    "failed": 0,
    "last_success": "2025-10-20T14:29:45Z",
    "last_error": null,
    "worker_active": true
  }
}

| Field | Type | Description |
| --- | --- | --- |
| status | string | Overall health: "healthy" or "degraded" |
| falkordb | string | FalkorDB status: "connected", "unknown", or "error: ..." |
| qdrant | string | Qdrant status: "connected", "not_configured", or "error: ..." |
| memory_count | integer or null | Total memories in FalkorDB (null if query fails) |
| qdrant_count | integer or null | Total points in Qdrant collection (null if unavailable) |
| enrichment | object | Enrichment queue metrics (see below) |
| graph | string | FalkorDB graph name (FALKORDB_GRAPH env variable) |
| timestamp | string | ISO 8601 timestamp of health check |

The enrichment object provides visibility into the background enrichment pipeline:

| Field | Type | Description |
| --- | --- | --- |
| status | string | Worker state: "running", "idle", or "stopped" |
| queue_depth | integer | Total jobs in queue (pending + inflight) |
| pending | integer | Jobs waiting to be processed |
| inflight | integer | Jobs currently being processed |
| processed | integer | Total jobs completed since service start |
| failed | integer | Total jobs that failed permanently |
| last_success | string or null | Timestamp of most recent successful enrichment |
| last_error | string or null | Most recent error message (if any) |
| worker_active | boolean | Whether enrichment worker thread is alive |

The top-level status fields can take the following values:

| Field | Possible Values | Meaning |
| --- | --- | --- |
| status | healthy, degraded, unhealthy | Overall service status |
| falkordb | connected, disconnected, error | FalkorDB connection state |
| qdrant | connected, disconnected, unavailable | Qdrant connection state (optional service) |
| enrichment.status | running, stopped, error | Background enrichment worker state |
sequenceDiagram
    participant Client
    participant API as "app.py<br/>/health"
    participant Falkor as "FalkorDB<br/>redis_ping()"
    participant Qdrant as "Qdrant<br/>client.get_collections()"

    Client->>API: GET /health

    API->>Falkor: Test connection<br/>redis_ping()
    alt FalkorDB Available
        Falkor-->>API: PONG
        API->>API: Set falkordb: connected
    else FalkorDB Unavailable
        Falkor-->>API: Exception
        API->>API: Set falkordb: unavailable
        API-->>Client: 503 Service Unavailable
    end

    API->>Qdrant: Test connection<br/>get_collections()
    alt Qdrant Available
        Qdrant-->>API: Collections list
        API->>API: Set qdrant: connected
    else Qdrant Unavailable
        Qdrant-->>API: Exception
        API->>API: Set qdrant: unavailable<br/>(continue anyway)
    end

    API->>Falkor: Count memories<br/>MATCH (m:Memory) RETURN count(m)
    Falkor-->>API: count

    API->>API: Check enrichment queue<br/>ServiceState.enrichment_queue

    API-->>Client: 200 OK<br/>{status, falkordb, qdrant,<br/>memory_count, enrichment}
Example requests:
# Basic health check
curl https://your-project.up.railway.app/health
# With jq formatting
curl -s https://your-project.up.railway.app/health | jq .
# Check just the status field
curl -s https://your-project.up.railway.app/health | jq .status

AutoMem continues operating even when components are unavailable:

  • Qdrant unavailable: status remains "healthy", qdrant shows "not_configured" or error
  • FalkorDB unavailable: status becomes "degraded", HTTP 503 returned
  • Enrichment worker stopped: Service remains healthy but enrichment pipeline stops processing

| Status Code | Meaning | Action |
| --- | --- | --- |
| 200 | Service healthy | Continue normal operation |
| 503 | Service degraded | FalkorDB or Qdrant unreachable |
| 500 | Service unhealthy | Critical failure in health check itself |
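
For automated checks outside Railway, these status codes map naturally to exit codes. The following is a minimal client-side sketch, assuming the requests package and a placeholder service URL; it is not part of AutoMem itself.

```python
# Minimal health poller (illustrative sketch, not part of AutoMem).
# Assumes the `requests` package; AUTOMEM_URL is a placeholder.
import sys

import requests

AUTOMEM_URL = "https://your-project.up.railway.app"


def check_health(timeout: float = 10.0) -> int:
    """Return 0 if healthy, 1 if degraded (503), 2 if the check itself failed."""
    try:
        resp = requests.get(f"{AUTOMEM_URL}/health", timeout=timeout)
    except requests.RequestException as exc:
        print(f"health check failed: {exc}")
        return 2

    if resp.status_code == 200:
        body = resp.json()
        print(f"healthy: {body.get('memory_count')} memories, "
              f"enrichment={body.get('enrichment', {}).get('status')}")
        return 0
    if resp.status_code == 503:
        print("degraded: FalkorDB or Qdrant unreachable")
        return 1
    print(f"unexpected status {resp.status_code}")
    return 2


if __name__ == "__main__":
    sys.exit(check_health())
```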

Railway’s health check configuration monitors the /health endpoint to determine service availability. If the endpoint returns non-2xx status or fails to respond within 100 seconds, Railway marks the service as unhealthy and may restart it.

Railway health check configuration in railway.json:

{
  "healthcheckPath": "/health",
  "healthcheckTimeout": 100
}

The analyze endpoint provides comprehensive statistics about the memory graph, including type distributions, entity frequencies, temporal patterns, and relationship counts.

| Property | Value |
| --- | --- |
| Path | /analyze |
| Method | GET |
| Authentication | Required (API token via Bearer, X-API-Key, or api_key query parameter) |
| Response | HTTP 200 with JSON analytics, or HTTP 401 if unauthorized |

The /analyze endpoint executes 7 independent Cypher queries against FalkorDB:

  1. Total Memory Count: MATCH (m:Memory) RETURN count(m)
  2. Type Distribution: Groups memories by m.type field
  3. Entity Frequency: Unwinds m.entities array and counts occurrences (top 20)
  4. Confidence Distribution: Buckets m.confidence scores by 0.1 intervals
  5. Activity by Hour: Extracts hour from m.timestamp and counts memories
  6. Tag Frequency: Unwinds m.tags array and counts occurrences (top 20)
  7. Relationship Counts: Counts all edges by relationship type

Each query is wrapped in a try-except block — if a query fails, the corresponding field is set to null, {}, or [] depending on the expected type.
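
The error-isolation pattern can be pictured with a small sketch. This is an illustration, not AutoMem's actual code: it assumes the falkordb Python client with placeholder connection settings and simplified Cypher, and it returns raw result rows rather than the reshaped analytics fields.

```python
# Illustrative per-query error isolation (not AutoMem's actual code).
# Assumes the `falkordb` Python client; connection settings are placeholders.
from falkordb import FalkorDB

graph = FalkorDB(host="localhost", port=6379).select_graph("memories")


def safe_query(cypher, default):
    """Run one analytics query; fall back to a typed default if it fails."""
    try:
        return graph.query(cypher).result_set
    except Exception as exc:
        print(f"analytics query failed, using fallback: {exc}")
        return default


analytics = {
    # Scalar count: null on failure.
    "total_memories": safe_query("MATCH (m:Memory) RETURN count(m)", default=None),
    # Grouped distribution: empty mapping on failure.
    "memories_by_type": safe_query(
        "MATCH (m:Memory) RETURN m.type, count(m)", default={}),
    # Top-N list: empty list on failure.
    "top_entities": safe_query(
        "MATCH (m:Memory) UNWIND m.entities AS e "
        "RETURN e, count(*) AS c ORDER BY c DESC LIMIT 20",
        default=[]),
}
```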

sequenceDiagram
    participant Client
    participant API as "app.py<br/>/analyze"
    participant FalkorDB

    Client->>API: GET /analyze<br/>Authorization: Bearer TOKEN

    API->>API: Validate token

    API->>FalkorDB: MATCH (m:Memory) RETURN count(m)
    FalkorDB-->>API: total_count

    API->>FalkorDB: Group by m.type
    FalkorDB-->>API: type_distribution

    API->>FalkorDB: Unwind entities, count occurrences
    FalkorDB-->>API: top_entities (top 20)

    API->>FalkorDB: Bucket confidence scores
    FalkorDB-->>API: confidence_distribution

    API->>FalkorDB: Extract hour from timestamp
    FalkorDB-->>API: activity_by_hour

    API->>FalkorDB: Unwind tags, count occurrences
    FalkorDB-->>API: top_tags (top 20)

    API->>FalkorDB: Count edges by type
    FalkorDB-->>API: relationship_counts

    API-->>Client: 200 OK {analytics object}
Example requests:
# Get memory analytics
curl -H "Authorization: Bearer YOUR_TOKEN" \
https://your-project.up.railway.app/analyze
# Check relationship distribution
curl -s -H "Authorization: Bearer YOUR_TOKEN" \
https://your-project.up.railway.app/analyze | jq .relationships

| Use Case | Relevant Fields |
| --- | --- |
| Identify memory class imbalance | memories_by_type |
| Find frequently discussed projects/tools | top_entities |
| Assess memory quality | confidence_distribution |
| Understand activity patterns | activity_by_hour |
| Audit tagging consistency | top_tags |
| Verify enrichment pipeline results | relationships["SIMILAR_TO"], relationships["EXEMPLIFIES"] |
| Detect temporal validity issues | relationships["INVALIDATED_BY"], relationships["EVOLVED_INTO"] |

The startup recall endpoint returns a curated set of memories suitable for initializing AI agent context. It prioritizes high-importance memories and falls back to recent memories.

| Property | Value |
| --- | --- |
| Path | /startup-recall |
| Method | GET |
| Authentication | None required |
| Query Parameters | None |
| Response | HTTP 200 with JSON memory list, or HTTP 503 if FalkorDB unavailable |

The startup recall endpoint uses a two-phase retrieval strategy:

  1. Phase 1: Trending Memories (primary)
    • Queries: MATCH (m:Memory) ORDER BY m.importance DESC, m.timestamp DESC LIMIT 10
    • Returns high-importance memories regardless of recency
  2. Phase 2: Recent Memories (fallback)
    • Only triggered if Phase 1 returns fewer than 10 memories
    • Queries: MATCH (m:Memory) ORDER BY m.timestamp DESC LIMIT (10 - phase1_count)
    • Fills remaining slots with most recent memories
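
A rough sketch of this fallback logic, assuming a FalkorDB-client-style graph.query() and ignoring de-duplication between the two phases:

```python
# Sketch of the two-phase startup recall (assumes a FalkorDB-style client `graph`;
# de-duplication between phases is omitted for brevity).
LIMIT = 10


def startup_recall(graph, limit: int = LIMIT):
    # Phase 1: high-importance ("trending") memories, regardless of recency.
    trending = graph.query(
        "MATCH (m:Memory) RETURN m "
        "ORDER BY m.importance DESC, m.timestamp DESC LIMIT $limit",
        {"limit": limit},
    ).result_set

    if len(trending) >= limit:
        return trending

    # Phase 2: fill the remaining slots with the most recent memories.
    remaining = limit - len(trending)
    recent = graph.query(
        "MATCH (m:Memory) RETURN m ORDER BY m.timestamp DESC LIMIT $limit",
        {"limit": remaining},
    ).result_set
    return trending + recent
```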

The startup recall endpoint is designed for AI agent initialization. An agent can call this endpoint at the start of a session to load relevant context before processing user requests.

# Retrieve startup context
curl https://your-project.up.railway.app/startup-recall | jq '.memories | length'

The health_monitor.py script provides continuous monitoring with drift detection, alerting, and optional auto-recovery.

sequenceDiagram
    participant HM as health_monitor.py
    participant API as Flask API /health
    participant FK as FalkorDB
    participant QD as Qdrant
    participant WH as Webhook<br/>(Slack/Discord)
    participant REC as recover_from_qdrant.py

    Note over HM: Check Interval<br/>(300s default)

    HM->>API: GET /health
    API-->>HM: {"status": "healthy", ...}

    HM->>FK: MATCH (m:Memory) RETURN count(m)
    FK-->>HM: falkor_count: 884

    HM->>QD: GET /collections/memories
    QD-->>HM: qdrant_count: 884

    HM->>HM: Calculate drift %<br/>|FK - QD| / max(FK, QD)

    alt Drift < 5%
        Note over HM: Normal - No action
    else Drift 5-50%
        HM->>WH: POST warning alert
        Note over WH: "Warning: 12% drift detected"
    else Drift > 50%
        HM->>WH: POST critical alert
        Note over WH: "Critical: 52% drift detected"

        alt Auto-Recover Enabled
            HM->>REC: Execute recovery script
            REC->>QD: Read all vectors + payloads
            REC->>FK: Rebuild graph structure
            REC-->>HM: Recovery complete
            HM->>WH: POST success notification
        else Manual Mode
            Note over HM: Log critical event<br/>Wait for manual intervention
        end
    end

Drift percentage is calculated as:

drift_percent = |falkordb_count - qdrant_count| / max(falkordb_count, qdrant_count) * 100
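
For example, with the default thresholds (5% warning, 50% critical) the calculation and classification could be expressed as the following sketch; the function names are illustrative, not the monitor's actual API.

```python
# Drift calculation and classification with the default 5% / 50% thresholds
# (function names are illustrative, not the monitor's actual API).
def drift_percent(falkordb_count: int, qdrant_count: int) -> float:
    denominator = max(falkordb_count, qdrant_count)
    if denominator == 0:
        return 0.0  # both stores empty: nothing to compare
    return abs(falkordb_count - qdrant_count) / denominator * 100


def classify(drift: float, warn: float = 5.0, critical: float = 50.0) -> str:
    if drift > critical:
        return "critical"  # likely data loss: alert and optionally auto-recover
    if drift > warn:
        return "warning"   # possible failed writes: send alert webhook
    return "normal"        # in-flight writes: no action


# Example: 884 memories in FalkorDB vs 778 vectors in Qdrant -> ~12% drift -> "warning".
print(classify(drift_percent(884, 778)))
```
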
graph LR
    subgraph "Monitoring Services"
        SyncWorker["Sync Worker Thread<br/>automem/workers/sync.py<br/>SYNC_CHECK_INTERVAL"]
        HealthMonitor["health_monitor.py<br/>scripts/<br/>External monitoring"]
        HealthEndpoint["/health endpoint<br/>automem/api/health.py<br/>Public access"]
    end

    subgraph "Data Stores"
        FalkorDB[("FalkorDB<br/>MATCH (m:Memory) RETURN count(m)")]
        Qdrant[("Qdrant<br/>client.count()")]
    end

    subgraph "Detection Thresholds"
        Normal["Normal Drift<br/>< 5%<br/>In-flight writes"]
        Warning["Warning Drift<br/>5% - 50%<br/>Failed writes"]
        Critical["Critical Drift<br/>> 50%<br/>Data loss event"]
    end

    subgraph "Response Actions"
        Log["Log Metrics<br/>sync_last_result"]
        Alert["Webhook Alert<br/>Slack/Discord"]
        AutoRecover["Auto Recovery<br/>recover_from_qdrant.py<br/>Optional"]
    end

    SyncWorker --> FalkorDB
    SyncWorker --> Qdrant
    HealthMonitor --> FalkorDB
    HealthMonitor --> Qdrant
    HealthEndpoint --> SyncWorker

    FalkorDB --> DriftCalc["Drift Calculation<br/>abs(falkor - qdrant) / max(falkor, qdrant)"]
    Qdrant --> DriftCalc

    DriftCalc --> Normal
    DriftCalc --> Warning
    DriftCalc --> Critical

    Normal --> Log
    Warning --> Log
    Warning --> Alert
    Critical --> Alert
    Critical --> AutoRecover

| Threshold | Percentage | Action | Meaning |
| --- | --- | --- | --- |
| Normal | < 5% | None | Acceptable in-flight writes |
| Warning | 5-50% | Alert webhook | Possible failed writes to one store |
| Critical | > 50% | Alert webhook + optional auto-recovery | Data loss event likely |

Common Causes of Drift:

  • < 1% drift: Normal — in-flight writes during health check
  • 5-10% drift: Failed writes to one database during temporary outage
  • > 50% drift: Critical data loss — one database was cleared/corrupted
flowchart TD
    Monitor["Health Monitor<br/>health_monitor.py"]

    Monitor --> Count["Count memories:<br/>FalkorDB vs Qdrant"]

    Count --> Calc["Calculate drift %:<br/>(|falkor - qdrant| / max(falkor, qdrant)) * 100"]

    Calc --> Check{"Drift level?"}

    Check -->|"< 1%"| Normal["Normal<br/>In-flight writes<br/>No action needed"]
    Check -->|"1-5%"| Warning["Warning<br/>HEALTH_MONITOR_DRIFT_THRESHOLD<br/>Monitor but OK"]
    Check -->|"5-50%"| Alert["Alert<br/>Investigation needed<br/>Webhook notification"]
    Check -->|"> 50%"| Critical["Critical<br/>HEALTH_MONITOR_CRITICAL_THRESHOLD<br/>Data loss event<br/>Trigger recovery"]

    Alert --> Manual["Manual review:<br/>Check logs<br/>Verify writes"]
    Critical --> Auto["Auto-recovery (if enabled):<br/>recover_from_qdrant.py"]

Option 1: Railway Service (Recommended for Production)

Deploy as a dedicated Railway service for continuous monitoring.

Environment Variables for health-monitor service:

AUTOMEM_URL=http://memory-service.railway.internal:8001
HEALTH_MONITOR_WEBHOOK=https://hooks.slack.com/...
HEALTH_MONITOR_AUTO_RECOVER=false
HEALTH_MONITOR_CHECK_INTERVAL=300

Pros:

  • Continuous monitoring 24/7
  • Isolated from main service
  • Railway restart policies apply
  • Separate logging and metrics

Cons:

  • Additional Railway service cost (~$2/month)
  • Requires Pro plan resources

Option 2: GitHub Actions (Recommended for Free Tier)

Use GitHub Actions for scheduled health checks without consuming Railway resources.

Pros:

  • Free (2000 minutes/month on free tier)
  • No Railway resource consumption
  • GitHub Actions alerting built-in
  • Simple to set up

Cons:

  • No drift detection (only endpoint checks)
  • No auto-recovery
  • 5-minute minimum interval

Option 3: Local Script (Cron)

Run periodic checks via Railway CLI or local cron.

Pros:

  • No additional deployment complexity
  • Direct access to recovery scripts
  • Easy to test and debug

Cons:

  • Requires active terminal/process
  • No automatic restart on failure
  • Not suitable for production

The health monitor sends JSON payloads to configured webhooks:

{
  "alert_type": "drift_warning",
  "drift_percent": 12.3,
  "falkordb_count": 884,
  "qdrant_count": 778,
  "timestamp": "2025-10-20T14:30:00Z",
  "service_url": "https://your-project.up.railway.app"
}

Slack webhook configuration:

HEALTH_MONITOR_WEBHOOK=https://hooks.slack.com/services/T.../B.../...

Discord webhooks use the same format as Slack:

HEALTH_MONITOR_WEBHOOK=https://discord.com/api/webhooks/...

Implement your own webhook receiver to handle health alerts by accepting POST requests with the JSON payload format above.
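
As a starting point, the receiver can be a few lines of Flask. The endpoint path and handling below are assumptions for illustration; adapt them to your own alerting pipeline.

```python
# Minimal custom webhook receiver (illustrative; the path and handling are assumptions).
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/automem-alert", methods=["POST"])
def automem_alert():
    payload = request.get_json(force=True, silent=True) or {}
    # Fields follow the payload shown above: alert_type, drift_percent, counts, timestamp.
    print(
        f"[{payload.get('alert_type')}] drift={payload.get('drift_percent')}% "
        f"falkordb={payload.get('falkordb_count')} qdrant={payload.get('qdrant_count')}"
    )
    return jsonify({"ok": True}), 200


if __name__ == "__main__":
    # Point HEALTH_MONITOR_WEBHOOK at http://<host>:9000/automem-alert
    app.run(port=9000)
```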

When critical drift is detected (> 50% by default), AutoMem can automatically rebuild the FalkorDB graph from Qdrant vectors.

Enable via command line:

python scripts/health_monitor.py --auto-recover

Enable via environment variable:

HEALTH_MONITOR_AUTO_RECOVER=true

The recovery script performs these steps:

  1. Read all vectors from Qdrant — Retrieves payloads containing memory data
  2. Clear FalkorDB graph — Removes corrupted/incomplete data
  3. Rebuild memory nodes — Creates Memory nodes with all properties
  4. Restore relationships — Rebuilds graph relationships from metadata
  5. Verify counts — Confirms successful recovery
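
In outline, the rebuild resembles the sketch below. It assumes the qdrant-client and falkordb Python packages, placeholder connection settings, and simplified payload fields; the real recover_from_qdrant.py handles relationship restoration and verification in more detail.

```python
# Simplified outline of a Qdrant -> FalkorDB rebuild (a stand-in for
# recover_from_qdrant.py; payload fields and Cypher are illustrative).
from falkordb import FalkorDB
from qdrant_client import QdrantClient

qdrant = QdrantClient(url="http://localhost:6333")                       # placeholder
graph = FalkorDB(host="localhost", port=6379).select_graph("memories")   # placeholder

# Step 2: clear the corrupted/incomplete graph before rebuilding.
graph.query("MATCH (n) DETACH DELETE n")

# Steps 1 and 3: scroll through every Qdrant point and recreate Memory nodes.
restored = 0
offset = None
while True:
    points, offset = qdrant.scroll(
        collection_name="memories", limit=256,
        with_payload=True, with_vectors=False, offset=offset,
    )
    for point in points:
        payload = point.payload or {}
        graph.query(
            "MERGE (m:Memory {id: $id}) "
            "SET m.content = $content, m.timestamp = $timestamp",
            {"id": str(point.id),
             "content": payload.get("content", ""),
             "timestamp": payload.get("timestamp", "")},
        )
        restored += 1
    if offset is None:
        break

# Step 5: verify counts (step 4, relationship restoration, is omitted here).
print(f"restored {restored} memories")
```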

For detailed recovery procedures, see Backup & Recovery.

| Variable | Default | Description |
| --- | --- | --- |
| HEALTH_MONITOR_DRIFT_THRESHOLD | 5 | Warning threshold (percentage) |
| HEALTH_MONITOR_CRITICAL_THRESHOLD | 50 | Critical threshold (percentage) |
| HEALTH_MONITOR_WEBHOOK | None | Slack/Discord webhook URL |
| HEALTH_MONITOR_AUTO_RECOVER | false | Enable automatic recovery |
| HEALTH_MONITOR_CHECK_INTERVAL | 300 | Seconds between health checks |
| HEALTH_MONITOR_LOG_LEVEL | INFO | Logging level (DEBUG, INFO, WARNING, ERROR) |
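
A monitor process would typically read these variables with the documented defaults; the snippet below is a sketch of that pattern, not the script's actual variable names.

```python
# Sketch of reading the monitor's environment variables with their documented defaults.
import os

DRIFT_THRESHOLD = float(os.getenv("HEALTH_MONITOR_DRIFT_THRESHOLD", "5"))
CRITICAL_THRESHOLD = float(os.getenv("HEALTH_MONITOR_CRITICAL_THRESHOLD", "50"))
WEBHOOK_URL = os.getenv("HEALTH_MONITOR_WEBHOOK")            # None disables alerts
AUTO_RECOVER = os.getenv("HEALTH_MONITOR_AUTO_RECOVER", "false").lower() == "true"
CHECK_INTERVAL = int(os.getenv("HEALTH_MONITOR_CHECK_INTERVAL", "300"))
LOG_LEVEL = os.getenv("HEALTH_MONITOR_LOG_LEVEL", "INFO")
```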

Personal Use:

  • Health checks: Every 5 minutes (alert-only)
  • Drift monitoring: Every 10 minutes
  • Auto-recovery: Disabled (manual trigger)

Team Use:

  • Health checks: Every 2 minutes
  • Drift monitoring: Every 5 minutes
  • Auto-recovery: Enabled (>50% drift)

Production Use:

  • Health checks: Every 30 seconds
  • Drift monitoring: Every 1 minute
  • Auto-recovery: Enabled (>50% drift) with immediate alerting

Railway + GitHub Actions (Free Tier):

  • GitHub Actions for scheduled /health endpoint checks
  • UptimeRobot for external HTTP monitoring (free tier, 5-minute checks)

Railway Pro (Production):

  • Dedicated health-monitor Railway service
  • Slack/Discord webhook alerts
  • Optional Grafana Cloud integration

| Configuration | Monthly Cost | Services |
| --- | --- | --- |
| Basic | ~$15 | Memory service + FalkorDB (no monitoring) |
| Standard | ~$18 | + Health monitor service (alert-only) |
| Production | ~$20 | + Health monitor (auto-recovery) + Backup service |

| Configuration | Cost | Trade-offs |
| --- | --- | --- |
| Railway + GitHub Actions | ~$15/month | Free health checks, but no drift detection |
| Railway + UptimeRobot | ~$15/month | Free HTTP monitoring, but no database checks |
| Railway Pro + Grafana | ~$15/month | Advanced metrics, but requires configuration |

Symptoms:

⚠️ Warning: 12% drift detected
FalkorDB: 884 memories
Qdrant: 778 vectors

Solutions:

  1. < 5% drift: Normal — in-flight writes during check
  2. 5-10% drift: Possible failed writes — check logs for errors
  3. > 50% drift: Run recovery script manually:
    python scripts/recover_from_qdrant.py

Causes:

  • FalkorDB connection failed
  • Qdrant connection failed (if configured)
  • Service still starting up

Solutions:

  • Check FalkorDB container is running: docker ps | grep falkordb
  • Verify FalkorDB environment variables: FALKORDB_HOST, FALKORDB_PORT, FALKORDB_PASSWORD
  • Wait 30-60 seconds if service was just deployed

Symptoms:

  • Critical drift detected
  • Alert sent to webhook
  • But recovery script not running

Solutions:

  • Verify HEALTH_MONITOR_AUTO_RECOVER=true is set
  • Check that scripts/recover_from_qdrant.py is accessible
  • Review health monitor logs for permission errors

All three monitoring endpoints emit structured logs for observability:

  • /health — logs connection check results and memory counts
  • /analyze — logs query execution times for each analytics query
  • /startup-recall — logs number of memories returned by each phase

The structured log format uses Python’s extra={} parameter to include machine-parseable fields alongside log messages, enabling automated performance analysis and metrics dashboard integration. See Performance Tuning for details on the logging fields and how to parse them.
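
As an illustration, a structured log call might look like the following; the field names are examples rather than AutoMem's exact logging schema (see Performance Tuning for the real fields).

```python
# Illustrative structured log call using logging's extra={} fields
# (field names are examples, not AutoMem's exact schema).
import logging

logger = logging.getLogger("automem.api.health")

logger.info(
    "health check completed",
    extra={
        "falkordb_status": "connected",
        "qdrant_status": "connected",
        "memory_count": 884,
        "qdrant_count": 884,
        "duration_ms": 42,
    },
)
# With the default formatter these fields are attached to the LogRecord but not printed;
# a JSON formatter (or log aggregator) can emit them as machine-parseable keys.
```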