Backup & Recovery

This page describes the backup strategies and disaster recovery procedures available for AutoMem deployments. It covers the three-layer backup architecture, automated backup methods, configuration options, backup formats, and four recovery paths for different failure scenarios. For monitoring backup health and detecting data drift, see Health Monitoring.

AutoMem implements a defense-in-depth backup strategy with three independent layers:

graph TB
    subgraph "Layer 1: Persistent Volumes"
        RailwayVol["Railway Volume Snapshots<br/>Automatic every 24h<br/>Recovery: 5 minutes"]
        VolumeFeatures["- One-click restore<br/>- Railway Dashboard access<br/>- Included with Railway Pro<br/>- FalkorDB only"]
    end

    subgraph "Layer 2: Dual Storage Redundancy"
        FalkorDB["FalkorDB<br/>Canonical record<br/>Graph + metadata"]
        Qdrant["Qdrant<br/>Backup record<br/>Vectors + payloads"]
        RecoverQdrant["recover_from_qdrant.py<br/>Recovery: 10 minutes<br/>Success rate: 99.7%"]

        FalkorDB <-->|"Redundancy"| Qdrant
        Qdrant --> RecoverQdrant
        RecoverQdrant --> FalkorDB
    end

    subgraph "Layer 3: Automated Backups"
        GitHubWorkflow[".github/workflows/backup.yml<br/>Every 6 hours<br/>Free tier: 2000 min/month"]
        BackupScript["backup_automem.py<br/>Compressed JSON exports"]
        LocalBackups["./backups/<br/>Last 7-14 backups"]
        S3Backups["S3 Storage<br/>Cross-region replication<br/>Recovery: 30 minutes"]

        GitHubWorkflow --> BackupScript
        BackupScript --> LocalBackups
        BackupScript --> S3Backups
    end

    Failure["Data Loss Event"] --> Layer1Check{"Layer 1<br/>Available?"}
    Layer1Check -->|Yes| RailwayVol
    Layer1Check -->|No| Layer2Check{"Layer 2<br/>Available?"}
    Layer2Check -->|Yes| RecoverQdrant
    Layer2Check -->|No| Layer3Check{"Layer 3<br/>Available?"}
    Layer3Check -->|Yes| S3Backups
    Layer3Check -->|No| DataLoss["Complete data loss<br/>Start fresh"]

| Layer | Mechanism | Recovery Speed | Scope | Platform Lock |
| --- | --- | --- | --- | --- |
| Infrastructure | Railway volume snapshots | Instant | FalkorDB only | Yes (Railway) |
| Dual Storage | Real-time dual writes | Immediate | Both databases | No |
| Application | Script exports | Minutes | Both databases | No |
| Automated | Scheduled execution | N/A (prevention) | Both databases | No |

Railway provides automatic volume backups for the FalkorDB persistent volume configured in the deployment template.

The FalkorDB service uses a persistent volume mounted at /var/lib/falkordb/data. The Redis persistence settings ensure data durability with the following configuration:

  • RDB snapshots every 60 seconds if at least 1 write
  • AOF (Append-Only File) for write-ahead logging
  • fsync every second for durability
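
These settings map to standard Redis persistence directives. A minimal sketch of the corresponding configuration (the deployment template's exact values may differ):

```
# RDB: write a snapshot every 60 seconds if at least 1 key changed
save 60 1
# AOF: enable the append-only write-ahead log
appendonly yes
# fsync the AOF once per second (bounded data loss, low overhead)
appendfsync everysec
```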

Accessing and restoring snapshots:

  1. Railway Dashboard → FalkorDB service
  2. “Backups” tab shows snapshot history
  3. One-click restore from any snapshot

Limitations:

  • Only covers FalkorDB (Qdrant not included)
  • Cannot export or download backups
  • Platform-locked to Railway
  • Best for quick recovery from recent failures

The core backup script at scripts/backup_automem.py exports data from both FalkorDB and Qdrant to compressed JSON files.

FalkorDB Export Structure:

The FalkorDB export captures the entire Redis keyspace including:

  • Memory nodes with all properties
  • Relationship edges
  • Metadata and indices
  • Graph structure information

Qdrant Export Structure:

The Qdrant export includes:

  • Vector embeddings (768-dimensional or 1024-dimensional depending on provider)
  • Payload data (memory content, metadata, tags)
  • Point IDs mapped to memory IDs
  • Collection configuration

Both exports are compressed with gzip (.json.gz extension), typically achieving a 70-80% compression ratio.
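
The export-and-compress step can be sketched as follows; `export_data` and the helper name are illustrative placeholders, while the `*_YYYYMMDD_HHMMSS.json.gz` naming mirrors the backup layout shown below:

```python
import gzip
import json
from datetime import datetime, timezone

def write_compressed_export(export_data: dict, prefix: str, out_dir: str = ".") -> str:
    """Serialize an export to JSON and gzip it, using timestamped file names."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    path = f"{out_dir}/{prefix}_{stamp}.json.gz"
    raw = json.dumps(export_data).encode("utf-8")
    with gzip.open(path, "wb") as fh:
        fh.write(raw)
    return path
```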

Terminal window
# Basic backup - creates timestamped backups in ./backups/falkordb/ and ./backups/qdrant/
python scripts/backup_automem.py
# With retention policy - deletes backups older than 7 days after creating new ones
python scripts/backup_automem.py --cleanup --keep 7
# Custom directory
python scripts/backup_automem.py --output /path/to/backups
# With S3 upload - requires boto3 and AWS credentials set via environment variables
python scripts/backup_automem.py --s3-bucket automem-backups

Backups are written to timestamped subdirectories:

backups/
├── falkordb/
│   ├── falkordb_20251020_143000.json.gz
│   ├── falkordb_20251020_203000.json.gz
│   └── ...
└── qdrant/
    ├── qdrant_20251020_143000.json.gz
    ├── qdrant_20251020_203000.json.gz
    └── ...

The --cleanup --keep N flag removes backups older than N days based on filename timestamp parsing.

s3://automem-backups/
├── falkordb/
│   ├── falkordb_20251020_143000.json.gz
│   └── ...
└── qdrant/
    ├── qdrant_20251020_143000.json.gz
    └── ...

S3 Cost Estimation:

| Component | Formula | Example |
| --- | --- | --- |
| Storage | $0.023/GB/month | 100MB backup = $0.0023/month |
| PUT requests | $0.005/1000 requests | 4 backups/day = $0.60/year |
| GET requests (restore) | $0.0004/1000 requests | Negligible |
| Total (100MB, every 6h) | - | ~$0.30/month |
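
The ~$0.30/month total reflects accumulated storage rather than a single backup: at 4 backups per day, a retention window fills up with many compressed files. A quick sanity check (the 30-day retention window here is an assumption for illustration):

```python
def monthly_s3_storage_cost(backup_gb: float, backups_per_day: int,
                            retention_days: int, price_per_gb: float = 0.023) -> float:
    """Approximate steady-state S3 storage cost once retention is saturated."""
    retained_files = backups_per_day * retention_days
    return retained_files * backup_gb * price_per_gb

# 100MB backups every 6 hours, kept for 30 days: 120 files, ~12 GB stored
cost = monthly_s3_storage_cost(backup_gb=0.1, backups_per_day=4, retention_days=30)
```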

The recommended automation method uses GitHub Actions to run backups on a schedule without consuming Railway resources.

sequenceDiagram
    participant GHA as "GitHub Actions<br/>backup.yml"
    participant TCP as "Railway TCP Proxy<br/>monorail.proxy.rlwy.net"
    participant FDB as "FalkorDB<br/>:6379"
    participant QDR as "Qdrant Cloud<br/>HTTPS API"
    participant Script as "backup_automem.py"
    participant S3 as "S3 Bucket<br/>automem-backups"

    Note over GHA: Triggered every 6 hours<br/>or manually via workflow_dispatch

    GHA->>GHA: Checkout code
    GHA->>GHA: Install dependencies<br/>requirements.txt + boto3

    Note over GHA,TCP: Connectivity check (critical)
    GHA->>GHA: Validate FALKORDB_HOST<br/>Must NOT be *.railway.internal
    GHA->>TCP: Test TCP connection<br/>timeout 10s
    TCP->>FDB: Forward connection
    FDB-->>TCP: Connection OK
    TCP-->>GHA: ✅ Connectivity verified

    Note over GHA,S3: Backup execution
    GHA->>Script: Execute with env vars<br/>FALKORDB_*, QDRANT_*, AWS_*
    Script->>TCP: Connect via TCP Proxy
    TCP->>FDB: Redis protocol commands
    FDB-->>TCP: Export graph data
    TCP-->>Script: Graph JSON

    Script->>QDR: HTTPS GET /collections/memories/points
    QDR-->>Script: Vector data with payloads

    Script->>Script: Compress to .json.gz<br/>backups/falkordb/<br/>backups/qdrant/

    alt S3 Upload Enabled
        Script->>S3: Upload via boto3<br/>s3://automem-backups/
        S3-->>Script: Upload complete
    end

    Script-->>GHA: Exit 0 (success)
    GHA->>GHA: Log backup summary<br/>File sizes and timestamps

The workflow is defined in .github/workflows/backup.yml and triggers every 6 hours or manually via workflow_dispatch.

| Secret | Purpose | Example | Used By |
| --- | --- | --- | --- |
| FALKORDB_HOST | Railway TCP proxy domain | monorail.proxy.rlwy.net | redis.Redis() connection |
| FALKORDB_PORT | Railway TCP proxy port | 12345 | redis.Redis() connection |
| FALKORDB_PASSWORD | FalkorDB authentication | Generated by Railway | redis.Redis(password=) |
| QDRANT_URL | Qdrant endpoint | https://xyz.qdrant.io | QdrantClient(url=) |
| QDRANT_API_KEY | Qdrant authentication | API key from Qdrant Cloud | QdrantClient(api_key=) |
| AWS_ACCESS_KEY_ID | S3 upload (optional) | AWS credentials | boto3.client('s3') |
| AWS_SECRET_ACCESS_KEY | S3 upload (optional) | AWS credentials | boto3.client('s3') |
| AWS_DEFAULT_REGION | S3 region (optional) | us-east-1 | boto3.client('s3', region_name=) |

The TCP proxy endpoint is found in Railway Dashboard → FalkorDB service → Settings → Networking → TCP Proxy.

For users who prefer Railway-hosted backups, scripts/Dockerfile.backup provides a containerized backup service that runs continuously.

The Dockerfile defines a Python 3.11 Alpine container that installs dependencies, copies the backup script, creates the output directory, and runs an infinite loop with backup and sleep cycles.

Railway deployment configuration:

  • Builder: Dockerfile
  • Dockerfile Path: scripts/Dockerfile.backup
  • Root Directory: / (project root)
  • Environment Variables: Same as memory-service: FALKORDB_HOST, FALKORDB_PORT, FALKORDB_PASSWORD, QDRANT_URL, QDRANT_API_KEY, plus optional AWS credentials

Resource usage: Approximately $1-2/month on Railway Pro (minimal CPU/memory during sleep cycles).


| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| FALKORDB_HOST | Yes | - | FalkorDB hostname or IP |
| FALKORDB_PORT | Yes | 6379 | FalkorDB Redis port |
| FALKORDB_PASSWORD | Yes | - | FalkorDB authentication password |
| FALKORDB_GRAPH | No | memories | Graph database name |
| QDRANT_URL | Yes* | - | Qdrant endpoint URL |
| QDRANT_API_KEY | Yes* | - | Qdrant API authentication |
| QDRANT_COLLECTION | No | memories | Qdrant collection name |
| AWS_ACCESS_KEY_ID | No | - | For S3 upload |
| AWS_SECRET_ACCESS_KEY | No | - | For S3 upload |
| AWS_DEFAULT_REGION | No | us-east-1 | S3 region |

*Qdrant variables are optional only if the system runs without vector storage.
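
A sketch of how a backup script can resolve these variables with the documented defaults (the function name is illustrative, not the script's actual API):

```python
import os

def load_backup_config() -> dict:
    """Resolve backup configuration from the environment, applying documented defaults."""
    required = ["FALKORDB_HOST", "FALKORDB_PASSWORD"]
    missing = [name for name in required if not os.getenv(name)]
    if missing:
        raise SystemExit(f"Missing required environment variables: {', '.join(missing)}")
    return {
        "falkordb_host": os.environ["FALKORDB_HOST"],
        "falkordb_port": int(os.getenv("FALKORDB_PORT", "6379")),
        "falkordb_password": os.environ["FALKORDB_PASSWORD"],
        "falkordb_graph": os.getenv("FALKORDB_GRAPH", "memories"),
        "qdrant_url": os.getenv("QDRANT_URL"),  # optional: vector backup skipped if unset
        "qdrant_collection": os.getenv("QDRANT_COLLECTION", "memories"),
        "aws_region": os.getenv("AWS_DEFAULT_REGION", "us-east-1"),
    }
```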

| Use Case | Backup Frequency | Retention Period | Storage Location | Estimated Cost |
| --- | --- | --- | --- | --- |
| Personal/Development | Every 24 hours | 7 days | Local only | $0 |
| Team/Small Production | Every 6 hours | 14 days | Local + S3 | ~$0.50/month |
| Production | Every 1-6 hours | 30 days | S3 with versioning | ~$2-5/month |
| Enterprise | Every 1 hour | 90 days + archive | S3 + cross-region | ~$10-20/month |

| Method | Scope | Speed | Automation | Platform Lock | Cost | Best For |
| --- | --- | --- | --- | --- | --- | --- |
| Railway Volumes | FalkorDB only | Instant | Automatic | Yes | Included | Quick recovery |
| GitHub Actions | Both databases | 5-10 min | Scheduled | No | Free | Most users |
| Railway Service | Both databases | 5-10 min | Continuous | Partial | $1-2/mo | Railway-centric |
| Manual Script | Both databases | 5-10 min | Manual | No | Free | Development |

| Failure Scenario | Data Available | Recovery Method | Primary Tool | RTO |
| --- | --- | --- | --- | --- |
| FalkorDB data loss | Qdrant intact | Qdrant-based rebuild | recover_from_qdrant.py | 2-5 min |
| FalkorDB persistence disabled | Qdrant intact | Qdrant-based rebuild | recover_from_qdrant.py | 2-5 min |
| Qdrant data loss | FalkorDB intact | Background re-embedding | Enrichment queue | 30-60 min |
| Both databases corrupted | Backup files (S3/local) | File restoration | restore_from_backup.py + recovery | 10-20 min |
| Railway volume failure | Railway snapshots | Volume restore | Railway dashboard | 5-10 min |
| Drift detected (5-50%) | Both databases available | Selective sync | health_monitor.py --auto-recover | 1-2 min |

flowchart TD
    Start["Data Loss Detected"] --> Assess["Assess Damage"]

    Assess --> FalkorLost{"FalkorDB<br/>lost?"}
    Assess --> QdrantLost{"Qdrant<br/>lost?"}

    FalkorLost -->|Yes| QdrantIntact{"Qdrant<br/>intact?"}
    FalkorLost -->|No| NoAction1["No action needed"]

    QdrantIntact -->|Yes| RecoverQdrant["Use recover_from_qdrant.py<br/>⚡ 10 min / 99.7% success"]
    QdrantIntact -->|No| BothLost["Both databases lost"]

    QdrantLost -->|Yes| FalkorIntact{"FalkorDB<br/>intact?"}
    QdrantLost -->|No| NoAction2["No action needed"]

    FalkorIntact -->|Yes| RebuildQdrant["Rebuild Qdrant from FalkorDB<br/>POST /admin/reembed<br/>⚡ 15-20 min"]
    FalkorIntact -->|No| BothLost

    BothLost --> RailwayBackup{"Railway volume<br/>backups?"}
    RailwayBackup -->|Yes| RailwayRestore["Railway Dashboard restore<br/>⚡ 5 min / FalkorDB only<br/>Then rebuild Qdrant"]
    RailwayBackup -->|No| S3Backup{"S3 or local<br/>backups?"}

    S3Backup -->|Yes| S3Restore["1. Restore Qdrant from S3<br/>2. recover_from_qdrant.py<br/>⏱️ 30 min total"]
    S3Backup -->|No| CompleteFailure["Complete data loss<br/>Start fresh"]

    RecoverQdrant --> Verify["Verify Integrity<br/>Compare counts<br/>Check sample memories"]
    RebuildQdrant --> Verify
    RailwayRestore --> Verify
    S3Restore --> Verify

Method 1: Recover from Qdrant

This is the fastest and most reliable recovery method: it uses Qdrant's vector payloads to rebuild the entire FalkorDB graph structure.

When to use:

  • FalkorDB lost all data (container restart, persistence misconfiguration)
  • FalkorDB corrupted but Qdrant intact
  • Health monitor detects >50% drift with Qdrant having more data

Recovery process details:

| Function | Purpose | Code Location |
| --- | --- | --- |
| qdrant_client.scroll() | Fetch all vectors with payloads | Qdrant SDK call |
| _filter_reserved_fields() | Remove type, confidence from metadata | scripts/recover_from_qdrant.py |
| MERGE (m:Memory) | Rebuild memory nodes | Cypher query in recovery loop |
| Relationship extraction | Parse metadata.relationships array | Recovery loop logic |
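
The payload-filtering step matters because Qdrant payloads mix memory properties with bookkeeping fields that must not become node properties. A sketch of that step and the upsert query (the RESERVED_FIELDS set and query shape here are illustrative, based on the table above, not the script's exact code):

```python
# Illustrative: fields the recovery script strips before writing node properties.
# The actual set in scripts/recover_from_qdrant.py may be larger.
RESERVED_FIELDS = {"type", "confidence"}

def filter_reserved_fields(payload: dict) -> dict:
    """Drop bookkeeping fields so they don't become Memory node properties."""
    return {k: v for k, v in payload.items() if k not in RESERVED_FIELDS}

def build_merge_query() -> str:
    """Parameterized Cypher used to upsert one memory node during recovery."""
    return (
        "MERGE (m:Memory {id: $id}) "
        "SET m.content = $content, m.tags = $tags, m.importance = $importance"
    )
```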

Execution steps:

Terminal window
# Step 1: Verify Qdrant availability
curl https://your-qdrant-url/collections/memories
# Step 2: Run recovery script with database environment variables
FALKORDB_HOST=falkordb.railway.internal \
FALKORDB_PORT=6379 \
FALKORDB_PASSWORD=your-password \
QDRANT_URL=https://your-qdrant-url \
QDRANT_API_KEY=your-key \
python scripts/recover_from_qdrant.py
# Step 3: Monitor recovery progress in stdout

Known limitations:

  • 99.7% recovery rate: In testing, 2/780 memories failed due to malformed data
  • Relationship loss: If relationships were stored only in FalkorDB (not in Qdrant payload), they won’t be recovered
  • Recent writes: Memories written in the last 2 seconds (before embedding completes) may be missing from Qdrant

Method 2: Restore from Backup Files

Use this method when both databases are lost or corrupted. It restores data from the compressed JSON backups stored locally or in S3.

When to use:

  • Both FalkorDB and Qdrant lost
  • Recovery from specific point in time needed
  • Testing disaster recovery procedures

Backup file structure:

backups/
├── falkordb/
│   ├── falkordb_20251020_143000.json.gz
│   └── falkordb_20251020_083000.json.gz
└── qdrant/
    ├── qdrant_20251020_143000.json.gz
    └── qdrant_20251020_083000.json.gz

Each Qdrant backup contains an array of point objects with id, vector (a 768- or 1024-dimensional float array, depending on the embedding provider), and payload (content, memory_id, tags, importance, type, created_at).
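
Given that structure, a backup file can be sanity-checked offline before attempting a restore. A sketch (field names per the description above; the vector dimension is a parameter since it depends on the provider):

```python
import gzip
import json

REQUIRED_PAYLOAD_KEYS = {"content", "memory_id", "tags", "importance", "type", "created_at"}

def validate_qdrant_backup(path: str, expected_dim: int = 768) -> int:
    """Check every point has an id, a vector of the expected dimension, and a full payload.

    Returns the number of valid points; raises ValueError on the first malformed point.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        points = json.load(fh)
    for i, point in enumerate(points):
        if "id" not in point:
            raise ValueError(f"point {i}: missing id")
        if len(point.get("vector", [])) != expected_dim:
            raise ValueError(f"point {i}: bad vector dimension")
        missing = REQUIRED_PAYLOAD_KEYS - set(point.get("payload", {}))
        if missing:
            raise ValueError(f"point {i}: payload missing {sorted(missing)}")
    return len(points)
```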

sequenceDiagram
    participant Operator
    participant S3 as S3 Bucket / Local
    participant Restore as restore_from_backup.py
    participant Qdrant
    participant Recovery as recover_from_qdrant.py
    participant FalkorDB

    Operator->>S3: Download backup<br/>aws s3 cp or local copy
    Operator->>Restore: Execute with backup path

    Restore->>Restore: Decompress JSON.gz<br/>gzip.open()
    Restore->>Restore: Parse JSON array<br/>json.load()

    loop For each point
        Restore->>Qdrant: Upsert point<br/>qdrant_client.upsert()
    end

    Restore-->>Operator: Qdrant restoration complete<br/>780 points restored

    Operator->>Recovery: python recover_from_qdrant.py
    Recovery->>Qdrant: Scroll all points
    Recovery->>FalkorDB: Rebuild graph
    Recovery-->>Operator: Full recovery complete

Execution steps:

Terminal window
# Step 1: Download backup files from S3
aws s3 cp s3://automem-backups/qdrant/qdrant_20251020_143000.json.gz ./
# Step 2: Decompress backup
gunzip qdrant_20251020_143000.json.gz
# Step 3: Restore to Qdrant
python scripts/restore_from_backup.py --file qdrant_20251020_143000.json
# Step 4: Rebuild FalkorDB from Qdrant
python scripts/recover_from_qdrant.py

| Database Size | Decompress | Qdrant Upload | FalkorDB Rebuild | Total |
| --- | --- | --- | --- | --- |
| 100 memories | <1 sec | 5 sec | 10 sec | ~15 sec |
| 1,000 memories | 2 sec | 30 sec | 60 sec | ~2 min |
| 10,000 memories | 10 sec | 5 min | 10 min | ~15 min |
| 100,000 memories | 30 sec | 30 min | 60 min | ~90 min |

Method 3: Railway Volume Restore

This method uses Railway's built-in volume snapshots for instant recovery of FalkorDB data.

When to use:

  • FalkorDB volume corruption
  • Need to rollback to specific snapshot
  • Quick recovery without external dependencies

Restoration steps:

  1. Log in to Railway dashboard, navigate to FalkorDB service, click “Volumes” tab
  2. View available snapshots (sorted by date), note timestamp and size
  3. Click “Restore” next to chosen snapshot and confirm (irreversible action)
  4. Railway stops the FalkorDB service, replaces volume with snapshot data, restarts the service
  5. Verify service health with curl https://your-automem-url/health

Limitations:

  • Railway-locked: Cannot export snapshots outside Railway platform
  • FalkorDB only: Does not restore Qdrant data
  • Snapshot frequency: Default 24-hour intervals (may lose up to 24 hours of data)
  • Retention policy: Depends on Railway plan (Pro plan: 30 days)

Method 4: Selective Sync via Health Monitor

This method uses the health monitor to detect and automatically repair inconsistencies between FalkorDB and Qdrant without a full recovery.

When to use:

  • Health monitor detects 5-50% drift
  • Both databases online but inconsistent
  • Partial write failures suspected
  • Preventive maintenance

Terminal window
# Enable auto-recovery on health monitor service
HEALTH_MONITOR_AUTO_RECOVER=true
# Or trigger manual recovery
python scripts/health_monitor.py --auto-recover
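
The drift thresholds used across this page can be summarized as a small decision function (under 5% is normal, 5-50% triggers selective sync, over 50% calls for a full rebuild; this function is illustrative, not the monitor's actual code):

```python
def drift_action(falkordb_count: int, qdrant_count: int) -> str:
    """Map the count mismatch between the two stores to a recovery action."""
    larger = max(falkordb_count, qdrant_count)
    if larger == 0:
        return "none"
    drift = abs(falkordb_count - qdrant_count) / larger
    if drift < 0.05:
        return "none"            # within normal write latency
    if drift <= 0.50:
        return "selective_sync"  # health_monitor.py --auto-recover
    return "full_recovery"       # recover_from_qdrant.py or re-embedding
```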

After any recovery procedure, verify system integrity before returning to production.

graph TB
    Start[Recovery Complete]

    Start --> V1{1. Database<br/>Connectivity}
    V1 -->|Pass| V2{2. Memory<br/>Count Match}
    V1 -->|Fail| Fix1[Check connection strings<br/>Verify credentials]

    V2 -->|Pass| V3{3. Relationship<br/>Integrity}
    V2 -->|Fail| Fix2[Re-run recovery<br/>Check for errors]

    V3 -->|Pass| V4{4. Embedding<br/>Coverage}
    V3 -->|Fail| Fix3[Investigate missing edges<br/>Check entity extraction]

    V4 -->|Pass| V5{5. Search<br/>Functionality}
    V4 -->|Fail| Fix4[Re-run embedding worker<br/>Verify Qdrant indexes]

    V5 -->|Pass| Complete[✓ Validation Complete<br/>Resume Operations]
    V5 -->|Fail| Fix5[Test vector search<br/>Rebuild Qdrant collection]

    Fix1 --> V1
    Fix2 --> V2
    Fix3 --> V3
    Fix4 --> V4
    Fix5 --> V5

1. Database Connectivity:

Terminal window
curl https://your-automem-url/health

2. Memory Count Verification:

Terminal window
# Check FalkorDB count matches Qdrant count
curl https://your-automem-url/health | jq '.statistics'

3. Relationship Integrity:

Terminal window
# Use analyze endpoint to check relationship distribution
curl https://your-automem-url/analyze | jq '.relationship_types'

4. Embedding Coverage:

Terminal window
# Verify all memories have embeddings
curl https://your-automem-url/analyze | jq '.embedding_coverage'

5. Search Functionality:

Terminal window
# Test recall with a known query
curl -X GET "https://your-automem-url/recall?query=test" \
-H "Authorization: Bearer your-api-token"

Monitor the system for 24 hours after recovery:

| Metric | Check Frequency | Threshold | Action if Exceeded |
| --- | --- | --- | --- |
| Drift percentage | Every 5 min | >5% | Investigate write failures |
| Enrichment queue depth | Every 15 min | >100 | Check worker health |
| Embedding queue depth | Every 15 min | >500 | Verify OpenAI API key |
| API error rate | Every 5 min | >1% | Check logs for errors |
| Response time (p95) | Every 15 min | >2s | Investigate slow queries |

| Scenario | Method | Detection | Execution | Validation | Total RTO |
| --- | --- | --- | --- | --- | --- |
| FalkorDB lost, Qdrant intact | Qdrant recovery | Immediate | 2-3 min | 1 min | 3-4 min |
| FalkorDB persistence disabled | Qdrant recovery | Immediate | 2-3 min | 1 min | 3-4 min |
| Both databases corrupted | Backup restore | Varies | 5-10 min | 2 min | 7-12 min |
| Railway volume failure | Volume restore | Immediate | 3-5 min | 1 min | 4-6 min |
| 5-20% drift detected | Selective sync | Auto (5 min) | 1-2 min | 1 min | 7-8 min |
| Qdrant lost, FalkorDB intact | Re-embedding | Immediate | 30-60 min | 5 min | 35-65 min |

Recovery Point Objective (RPO):

  • Best case: 0 seconds (Qdrant recovery, dual storage)
  • Typical case: 2 seconds (embedding queue latency)
  • Worst case: 6 hours (last automated backup)

Recovery Script Fails with Connection Error

Problem: recover_from_qdrant.py exits with “Failed to connect to FalkorDB”

Solution: Verify FALKORDB_HOST, FALKORDB_PORT, and FALKORDB_PASSWORD environment variables match the FalkorDB service configuration. For Railway deployments, use the internal hostname falkordb.railway.internal (or TCP proxy endpoint for external access).

Problem: Script reports “Recovered 0/780 memories”

Solutions:

  • If collection doesn’t exist: Use backup restoration method (Method 2)
  • If payloads are missing: Qdrant was used for vectors only; recovery not possible without payloads

Problem: /analyze shows invalid memory types like “str”, “int”, “boolean”

Cause: Using an old version of recover_from_qdrant.py without RESERVED_FIELDS filtering

Solution: Update to AutoMem v0.5.0 or later. If the corruption has already occurred, run scripts/cleanup_memory_types.py to fix the invalid types.

Problem: Health monitor still reports >5% drift after running recovery

Solutions:

  • If FalkorDB has more: Recent Qdrant writes may have failed; check Qdrant API key
  • If Qdrant has more: FalkorDB may be read-only; check REDIS_ARGS includes --save and --appendonly
  • If inconsistent: May need bidirectional sync; open a GitHub issue

Regular recovery testing ensures procedures work when needed.

| Environment | Test Frequency | Test Type | Data Source |
| --- | --- | --- | --- |
| Development | Weekly | Full recovery | Test data |
| Staging | Monthly | Full recovery | Production replica |
| Production | Quarterly | Validation only | Verify backup integrity |

Terminal window
# Check backup files exist and have reasonable sizes
ls -lh backups/falkordb/ backups/qdrant/
# Validate memory counts from backup
zcat backups/qdrant/qdrant_latest.json.gz | jq 'length'
# Test decompression
gunzip -t backups/falkordb/falkordb_latest.json.gz && echo "OK"