Files

kelin 12df3a1ae2 Round 6: Fix deployment script reliability and credential handling

- Add pre-flight validation checks (internet, disk space, Docker availability)
- Fix Authelia password hash extraction (handle 'Digest:' prefix format)
- Improve credential flow between setup and deploy scripts
- Save plain password for user reference in ADMIN_PASSWORD.txt
- Add cleanup for directory/file conflicts on re-runs
- Add automatic Authelia database cleanup for encryption key mismatches
- Add error recovery guidance with cleanup trap
- Display credentials prominently after deployment
- Update step numbering (now 10 steps with pre-flight)
- Update documentation to Round 6

Tested on fresh Debian 12 installation - both scripts now complete successfully.

2026-01-13 19:57:45 -05:00

14 KiB

Raw Blame History

Round 6 Testing - Agent Instructions for Script Optimization

Mission Context

Test and optimize the AI-Homelab deployment scripts (setup-homelab.sh and deploy-homelab.sh) on a test server environment. Focus on improving repository code quality and reliability, not test server infrastructure.

Current Status - Round 6

Test Environment: Fresh Debian installation
Testing Complete: ✅ Both scripts now working successfully
Issues Fixed:
- Password hash extraction from Authelia output (added proper parsing)
- Credential flow between setup and deploy scripts (saved plain password)
- Directory vs file conflicts on re-runs (added cleanup logic)
- Authelia database encryption key mismatches (added auto-cleanup)
- Password display for user reference (now shows and saves password)
Previous Rounds: 1-5 focused on permission handling, deployment order, and stability

Primary Objectives

1. Script Reliability Analysis

Identify why setup-homelab.sh failed during Authelia password hash generation
Review error handling and fallback mechanisms
Check Docker availability before container-based operations
Validate all external dependencies (curl, openssl, docker)

2. Repository Improvements (Priority Focus)

Enhance error messages with actionable debugging steps
Add pre-flight validation checks before critical operations
Improve script idempotency (safe to run multiple times)
Add progress indicators for long-running operations
Document assumptions and prerequisites more clearly

3. Testing Methodology

Create test checklist for each script function
Document expected vs. actual behavior at failure point
Test edge cases (partial runs, re-runs, missing dependencies)
Verify scripts work on truly fresh Debian installation

Critical Operating Principles

DO Focus On (Repository Optimization)

✅ Script robustness: Better error handling, validation, recovery ✅ User experience: Clear messages, progress indicators, helpful errors ✅ Documentation: Update docs to match actual script behavior ✅ Idempotency: Scripts can be safely re-run without breaking state ✅ Dependency checks: Validate requirements before attempting operations ✅ Rollback capability: Provide recovery steps when operations fail

DON'T Focus On (Test Server Infrastructure)

❌ Test server setup: Don't modify the test server's base configuration ❌ System packages: Test scripts should work with standard Debian ❌ Hardware setup: Focus on software reliability, not hardware optimization ❌ Performance tuning: Prioritize reliability over speed

Specific Issues to Address (Round 6)

Issue 1: Authelia Password Hash Generation Failure

Problem: Script failed while generating password hash using Docker container

Investigation Steps:

Check if Docker daemon is running before attempting container operations
Verify Docker image pull capability (network connectivity)
Add timeout handling for Docker run operations
Provide fallback mechanism if Docker-based hash generation fails

Proposed Solutions:

# Option 1: Pre-validate Docker availability
if ! docker info &> /dev/null 2>&1; then
    log_error "Docker is not available for password hash generation"
    log_info "This requires Docker to be running. Please ensure:"
    log_info "  1. Docker daemon is started: sudo systemctl start docker"
    log_info "  2. User can access Docker: docker ps"
    exit 1
fi

# Option 2: Add timeout and error handling
log_info "Generating password hash (this may take a moment)..."
if ! PASSWORD_HASH=$(timeout 30 docker run --rm authelia/authelia:4.37 authelia crypto hash generate argon2 --password "$ADMIN_PASSWORD" 2>&1 | grep '^\$argon2'); then
    log_error "Failed to generate password hash"
    log_info "Error output:"
    echo "$PASSWORD_HASH"
    log_info ""
    log_info "Troubleshooting steps:"
    log_info "  1. Check Docker is running: docker ps"
    log_info "  2. Test image pull: docker pull authelia/authelia:4.37"
    log_info "  3. Check network connectivity"
    exit 1
fi

# Option 3: Provide manual hash generation instructions
if [ -z "$PASSWORD_HASH" ]; then
    log_error "Automated hash generation failed"
    log_info ""
    log_info "You can generate the hash manually after setup:"
    log_info "  docker run --rm authelia/authelia:4.37 authelia crypto hash generate argon2"
    log_info "  Then edit: /opt/stacks/core/authelia/users_database.yml"
    log_info ""
    read -p "Continue without setting admin password now? (y/N): " -n 1 -r
    echo ""
    if [[ ! $REPLY =~ ^[Yy]$ ]]; then
        exit 1
    fi
    # Set placeholder that requires manual configuration
    PASSWORD_HASH="\$argon2id\$v=19\$m=65536,t=3,p=4\$CHANGEME"
fi

Issue 2: Script Exit Codes and Error Recovery

Problem: Script exits on first error without cleanup or recovery guidance

Improvements:

# Add at script beginning
set -e  # Exit on error
set -u  # Exit on undefined variable
set -o pipefail  # Exit on pipe failures

# Add trap for cleanup on error
cleanup_on_error() {
    local exit_code=$?
    log_error "Script failed with exit code: $exit_code"
    log_info ""
    log_info "Partial setup may have occurred. To resume:"
    log_info "  1. Review error messages above"
    log_info "  2. Fix the issue if possible"
    log_info "  3. Re-run: sudo ./setup-homelab.sh"
    log_info ""
    log_info "The script is designed to be idempotent (safe to re-run)"
}

trap cleanup_on_error ERR

Issue 3: Pre-flight Validation

Add comprehensive checks before starting:

# New Step 0: Pre-flight validation
log_info "Step 0/9: Running pre-flight checks..."

# Check internet connectivity
if ! ping -c 1 8.8.8.8 &> /dev/null && ! ping -c 1 1.1.1.1 &> /dev/null; then
    log_error "No internet connectivity detected"
    log_info "Internet access is required for:"
    log_info "  - Installing packages"
    log_info "  - Downloading Docker images"
    log_info "  - Accessing Docker Hub"
    exit 1
fi

# Check DNS resolution
if ! nslookup google.com &> /dev/null; then
    log_warning "DNS resolution may be impaired"
fi

# Check disk space
AVAILABLE_SPACE=$(df / | tail -1 | awk '{print $4}')
REQUIRED_SPACE=5000000  # 5GB in KB
if [ "$AVAILABLE_SPACE" -lt "$REQUIRED_SPACE" ]; then
    log_error "Insufficient disk space"
    log_info "Available: $(($AVAILABLE_SPACE / 1024))MB"
    log_info "Required: $(($REQUIRED_SPACE / 1024))MB"
    exit 1
fi

log_success "Pre-flight checks passed"
echo ""

Testing Checklist for Round 6

setup-homelab.sh Testing

Clean run: Fresh Debian 12 system with no prior setup
Re-run safety: Run script twice, verify idempotency
Interrupted run: Stop script mid-execution, verify recovery
Network failure: Simulate network issues during package installation
Docker unavailable: Test behavior when Docker daemon isn't running
User input validation: Test invalid inputs (weak passwords, invalid emails)
NVIDIA skip: Test skipping GPU driver installation
NVIDIA install: Test GPU driver installation path (if hardware available)

deploy-homelab.sh Testing

Missing .env: Test behavior when .env file doesn't exist
Invalid .env: Test with incomplete or malformed .env values
Port conflicts: Test when ports 80/443 are already in use
Network issues: Simulate Docker network creation failures
Core stack failure: Test recovery if core services fail to start
Certificate generation: Monitor Let's Encrypt certificate acquisition
Re-deployment: Test redeploying over existing infrastructure
Partial deployment: Stop mid-deployment, verify recovery

Documentation Updates Required

1. Update docs/getting-started.md

Document actual script behavior observed in testing
Add troubleshooting section for common failure scenarios
Include recovery procedures for partial deployments
Update Round 4 → Round 6 references

2. Update AGENT_INSTRUCTIONS.md

Change "Round 4" references to current round
Add learnings from Round 5 and 6 testing
Document new error handling patterns
Update success criteria

3. Create Testing Guide

Document test environment setup
Provide test case templates
Include expected vs actual result tracking
Create test automation scripts if beneficial

Agent Action Workflow

Phase 1: Analysis (30 minutes)

Review last terminal output to identify exact failure point
Read relevant script sections around the failure
Check for similar issues in previous rounds (git history)
Document root cause and contributing factors

Phase 2: Solution Design (20 minutes)

Propose specific code changes to address issues
Design error handling and recovery mechanisms
Create validation checks to prevent recurrence
Draft user-facing error messages

Phase 3: Implementation (40 minutes)

Apply changes to scripts using multi_replace_string_in_file
Update related documentation
Add comments explaining complex error handling
Update .env.example if new variables needed

Phase 4: Validation (30 minutes)

Review script syntax: bash -n scripts/*.sh
Check for common shell script issues (shellcheck if available)
Verify all error paths have recovery guidance
Test modified sections in isolation if possible

Phase 5: Documentation (20 minutes)

Update CHANGELOG or version notes
Document breaking changes if any
Update README if testing procedure changed
Create/update this round's testing summary

Success Criteria for Round 6

Must Have

✅ setup-homelab.sh completes successfully on fresh Debian 12
✅ Script provides clear error messages when failures occur
✅ All error conditions have recovery/troubleshooting guidance
✅ Scripts can be safely re-run without breaking existing setup
✅ Docker availability validated before Docker operations

Should Have

✅ Pre-flight validation catches issues before they cause failures
✅ Progress indicators show user what's happening
✅ Timeout handling prevents infinite hangs
✅ Network failures provide helpful next steps
✅ Documentation updated to match current behavior

Nice to Have

⭐ Automated testing framework for scripts
⭐ Rollback capability for partial deployments
⭐ Log collection for debugging
⭐ Summary report at end of execution
⭐ Colored output supports colorless terminals

Key Files to Modify

Primary Targets

scripts/setup-homelab.sh - Lines 1-491
- Focus areas: Steps 7 (Authelia secrets), error handling
scripts/deploy-homelab.sh - Lines 1-340
- Focus areas: Pre-flight checks, error recovery
AGENT_INSTRUCTIONS.md - Round references
docs/getting-started.md - Testing documentation

Supporting Files

.env.example - Validate all variables documented
README.md - Update if quick start changed
docs/quick-reference.md - Add troubleshooting entries

Constraints and Guidelines

Follow Repository Standards

Maintain existing code style and conventions
Use established logging functions (log_info, log_error, etc.)
Keep user-facing messages clear and actionable
Preserve idempotency where it exists

Avoid Scope Creep

Don't rewrite entire scripts - make targeted improvements
Don't add features - focus on fixing reliability issues
Don't optimize performance - focus on correctness
Don't modify test server - fix repository code

Safety First

All changes must maintain backward compatibility
Don't remove existing functionality
Add validation, don't skip steps
Provide rollback/recovery for all operations

Communication with User

Regular Updates

Report progress through testing phases
Highlight blockers or unexpected findings
Ask for clarification when needed
Summarize changes made

Final Report Format

# Round 6 Testing Summary

## Issues Identified
1. [Issue name] - [Brief description]
   - Root cause: ...
   - Impact: ...

## Changes Implemented
1. [File]: [Change description]
   - Why: ...
   - Testing: ...

## Testing Results
- ✅ [What works]
- ⚠️ [What needs attention]
- ❌ [What still fails]

## Next Steps for Round 7
1. [Remaining issue to address]
2. [Enhancement to consider]

Immediate Actions for Agent

Investigate current failure:
- Review terminal output showing exit code 1
- Identify line in setup-homelab.sh where failure occurred
- Determine if Docker was available for password hash generation
Propose specific fixes:
- Add Docker availability check before Authelia password hash step
- Add timeout handling for docker run command
- Provide fallback if automated hash generation fails
Implement improvements:
- Update setup-homelab.sh with error handling
- Add pre-flight validation checks
- Enhance error messages with recovery steps
Update documentation:
- Document Round 6 findings
- Update troubleshooting guide
- Revise Round 4 references to current round
Prepare for validation:
- Create test checklist
- Document expected behavior
- Note any manual testing steps needed

Generated for Round 6 testing of AI-Homelab deployment scripts Focus: Repository optimization, not test server modification

14 KiB Raw Blame History