Round 7 Summary: ✅ Safe reset script validated - no system crashes ✅ Full deployment workflow successful ✅ Password hash corruption fixed (heredoc issue) ✅ All containers running healthily ✅ System stability confirmed Round 8 Preparation: - Focus on production readiness - Comprehensive integration testing - Edge case validation - SSL certificate generation testing - Documentation accuracy verification - User experience polish Testing scenarios defined for: - Fresh installation (happy path) - Re-run detection (idempotency) - Password reset workflows - Clean slate between tests - Service access validation - SSL certificate validation - Edge cases (invalid config, network issues, resource constraints) Success criteria: Scripts ready for public release (v1.0.0-rc1)
7.9 KiB
7.9 KiB
Round 7 Testing - Preparation and Safety Guidelines
Mission Context
Test AI-Homelab deployment scripts with focus on safe cleanup and recovery procedures.
Round 7 Status: ✅ COMPLETE - All objectives met, no system crashes occurred
Issues Found and Fixed:
- Password hash corruption - Heredoc variable expansion was mangling
$characters in argon2 hashes- Fixed: Use quoted heredoc ('EOF') with placeholders, then sed replacement
- Committed:
ee8a359
Critical Safety Requirements - NEW for Round 7
⚠️ SYSTEM CRASH PREVENTION
Issue from Round 6: Aggressive cleanup operations caused system crashes requiring power cycles and BIOS recovery attempts.
Root Causes Identified:
- Removing directories while Docker containers were actively using them
- Aggressive
rm -rfoperations on Docker volumes while containers running - No graceful shutdown sequence before cleanup
- Docker volume operations causing filesystem corruption
Safe Testing Procedure
Before Each Test Run
- Always use the new reset script instead of manual cleanup
- Never run cleanup commands while containers are running
- Always stop containers gracefully before removing files
- Monitor system resources during operations
Using the Safe Reset Script
cd ~/AI-Homelab/scripts
sudo ./reset-test-environment.sh
This script:
- ✅ Stops all containers gracefully (proper shutdown)
- ✅ Waits for containers to fully stop
- ✅ Removes Docker volumes safely
- ✅ Cleans directories only after containers stopped
- ✅ Preserves system packages and settings
- ✅ Does NOT touch Docker installation
- ✅ Does NOT modify system files
What NOT to Do (Dangerous Operations)
# ❌ NEVER do these while containers are running:
rm -rf /opt/stacks/core/traefik # Can corrupt active containers
rm -rf /var/lib/docker/volumes/* # Filesystem corruption risk
docker volume rm $(docker volume ls -q) # Removes volumes containers need
find /var/lib/docker -exec rm -rf {} + # EXTREMELY DANGEROUS
# ❌ NEVER force remove running containers:
docker rm -f $(docker ps -aq) # Can cause state corruption
# ❌ NEVER use pkill on Docker processes:
pkill -9 docker # Can corrupt Docker daemon state
Safe Cleanup Sequence (Manual)
If you need to clean up manually:
# 1. Stop services gracefully
cd /opt/stacks/core
docker compose down # Waits for clean shutdown
# 2. Wait for full stop
sleep 5
# 3. Then and only then remove files
rm -rf /opt/stacks/core/traefik
rm -rf /opt/stacks/core/authelia
# 4. Remove volumes after containers stopped
docker volume rm core_authelia-data
Round 7 Objectives
Primary Goals
- ✅ Verify safe reset script works without system crashes
- ✅ Test full deployment after reset (round-trip testing)
- ✅ Validate no file system corruption occurs
- ✅ Ensure containers start cleanly after reset
- ✅ Document any remaining edge cases
Testing Checklist
Pre-Testing Setup
- System is stable and responsive
- All previous containers stopped cleanly
- Disk space sufficient (5GB+ free)
- No filesystem errors:
dmesg | grep -i error - Docker daemon healthy:
systemctl status docker
Round 7 Test Sequence
-
Clean slate using reset script
sudo ./scripts/reset-test-environment.sh- Script completes without errors
- System remains responsive
- No kernel panics or crashes
- All volumes removed cleanly
-
Fresh deployment with improved scripts
sudo ./scripts/setup-homelab.sh- Completes successfully
- No permission errors
- Password hash generated correctly
- Credentials saved properly
-
Deploy infrastructure
sudo ./scripts/deploy-homelab.sh- Containers start cleanly
- No file conflicts
- Authelia initializes properly
- Credentials work immediately
-
Verify services
- Traefik accessible and routing
- Authelia login works
- Dockge UI accessible
- SSL certificates generating
-
Test reset again (idempotency)
sudo ./scripts/reset-test-environment.sh- Stops everything gracefully
- No orphaned containers
- No volume leaks
- System stable after reset
Changes Made for Round 7
New Files
scripts/reset-test-environment.sh- Safe cleanup script with proper shutdown sequence
Modified Files
scripts/deploy-homelab.sh:- Added graceful container stop before config file operations
- Removed automatic database cleanup (now use reset script instead)
- Added safety checks before rm operations
- Better warnings about existing databases
Removed Dangerous Operations
# REMOVED from deploy-homelab.sh:
docker compose up -d authelia 2>&1 | grep -q "encryption key" && {
docker compose down authelia
sudo rm -rf /var/lib/docker/volumes/core_authelia-data/_data/*
}
# This was causing crashes - containers couldn't handle abrupt volume removal
# REMOVED blind directory removal:
rm -rf /opt/stacks/core/traefik /opt/stacks/core/authelia
# Now checks if containers are running first
System Health Monitoring
Before Each Test Run
# Check system health
free -h # Memory available
df -h # Disk space
dmesg | tail -20 # Recent kernel messages
systemctl status docker # Docker daemon health
docker ps # Running containers
During Testing
- Monitor system logs:
journalctl -f - Watch Docker logs:
docker compose logs -f - Check resource usage:
htoportop
After Issues
# If system becomes unresponsive:
# 1. DO NOT hard power off immediately
# 2. Try to SSH in from another machine
# 3. Gracefully stop Docker: systemctl stop docker
# 4. Wait 30 seconds for disk writes to complete
# 5. Then reboot: systemctl reboot
# Check for filesystem corruption after boot:
sudo dmesg | grep -i error
sudo journalctl -xb | grep -i error
Recovery Procedures
If Deploy Script Hangs
# In another terminal:
cd /opt/stacks/core
docker compose ps # See what's running
docker compose logs authelia # Check for errors
docker compose down # Stop gracefully
# Then re-run deploy
If Authelia Won't Start (Encryption Key Error)
# Use the reset script:
sudo ./scripts/reset-test-environment.sh
# Then start fresh deployment
If System Crashed During Testing
# After reboot:
# 1. Check Docker state
systemctl status docker
docker ps -a # Look for crashed containers
# 2. Clean up properly
cd /opt/stacks/core
docker compose down --remove-orphans
# 3. Remove corrupted volumes
docker volume prune -f
# 4. Start fresh with reset script
sudo ./scripts/reset-test-environment.sh
Success Criteria for Round 7
Must Have
- ✅ Reset script completes without system crashes
- ✅ Can deploy and reset multiple times safely
- ✅ No filesystem corruption after any operation
- ✅ System remains responsive throughout testing
- ✅ All containers stop gracefully
Should Have
- ✅ Clear warnings before destructive operations
- ✅ Confirmation prompts for cleanup
- ✅ Progress indicators during long operations
- ✅ Health checks before and after operations
Nice to Have
- ⭐ Automatic backup before reset
- ⭐ Rollback capability
- ⭐ System health validation
- ⭐ Detailed logging of all operations
Emergency Contacts / References
- Docker best practices: https://docs.docker.com/config/daemon/
- Linux filesystem safety:
man sync,man fsync - systemd service management:
man systemctl
Post-Round 7 Review
Document These
- Any new crash scenarios discovered
- System resource usage patterns
- Time required for clean operations
- Any remaining unsafe operations
- User experience improvements needed
Remember: System stability is more important than testing speed. Always wait for clean shutdowns and never force operations.