Round 7 Summary: ✅ Safe reset script validated - no system crashes ✅ Full deployment workflow successful ✅ Password hash corruption fixed (heredoc issue) ✅ All containers running healthily ✅ System stability confirmed Round 8 Preparation: - Focus on production readiness - Comprehensive integration testing - Edge case validation - SSL certificate generation testing - Documentation accuracy verification - User experience polish Testing scenarios defined for: - Fresh installation (happy path) - Re-run detection (idempotency) - Password reset workflows - Clean slate between tests - Service access validation - SSL certificate validation - Edge cases (invalid config, network issues, resource constraints) Success criteria: Scripts ready for public release (v1.0.0-rc1)
274 lines
7.9 KiB
Markdown
274 lines
7.9 KiB
Markdown
# Round 7 Testing - Preparation and Safety Guidelines
|
|
|
|
## Mission Context
|
|
Test AI-Homelab deployment scripts with focus on **safe cleanup and recovery** procedures.
|
|
|
|
**Round 7 Status**: ✅ COMPLETE - All objectives met, no system crashes occurred
|
|
|
|
### Issues Found and Fixed:
|
|
1. **Password hash corruption** - Heredoc variable expansion was mangling `$` characters in argon2 hashes
|
|
- **Fixed**: Use quoted heredoc ('EOF') with placeholders, then sed replacement
|
|
- **Committed**: ee8a359
|
|
|
|
## Critical Safety Requirements - NEW for Round 7
|
|
|
|
### ⚠️ SYSTEM CRASH PREVENTION
|
|
**Issue from Round 6**: Aggressive cleanup operations caused system crashes requiring power cycles and BIOS recovery attempts.
|
|
|
|
**Root Causes Identified**:
|
|
1. Removing directories while Docker containers were actively using them
|
|
2. Aggressive `rm -rf` operations on Docker volumes while containers running
|
|
3. No graceful shutdown sequence before cleanup
|
|
4. Docker volume operations causing filesystem corruption
|
|
|
|
### Safe Testing Procedure
|
|
|
|
#### Before Each Test Run
|
|
1. **Always use the new reset script** instead of manual cleanup
|
|
2. **Never** run cleanup commands while containers are running
|
|
3. **Always** stop containers gracefully before removing files
|
|
4. **Monitor** system resources during operations
|
|
|
|
#### Using the Safe Reset Script
|
|
```bash
|
|
cd ~/AI-Homelab/scripts
|
|
sudo ./reset-test-environment.sh
|
|
```
|
|
|
|
This script:
|
|
- ✅ Stops all containers gracefully (proper shutdown)
|
|
- ✅ Waits for containers to fully stop
|
|
- ✅ Removes Docker volumes safely
|
|
- ✅ Cleans directories only after containers stopped
|
|
- ✅ Preserves system packages and settings
|
|
- ✅ Does NOT touch Docker installation
|
|
- ✅ Does NOT modify system files
|
|
|
|
#### What NOT to Do (Dangerous Operations)
|
|
```bash
|
|
# ❌ NEVER do these while containers are running:
|
|
rm -rf /opt/stacks/core/traefik # Can corrupt active containers
|
|
rm -rf /var/lib/docker/volumes/* # Filesystem corruption risk
|
|
docker volume rm $(docker volume ls -q) # Removes volumes containers need
|
|
find /var/lib/docker -exec rm -rf {} + # EXTREMELY DANGEROUS
|
|
|
|
# ❌ NEVER force remove running containers:
|
|
docker rm -f $(docker ps -aq) # Can cause state corruption
|
|
|
|
# ❌ NEVER use pkill on Docker processes:
|
|
pkill -9 docker # Can corrupt Docker daemon state
|
|
```
|
|
|
|
#### Safe Cleanup Sequence (Manual)
|
|
If you need to clean up manually:
|
|
```bash
|
|
# 1. Stop services gracefully
|
|
cd /opt/stacks/core
|
|
docker compose down # Waits for clean shutdown
|
|
|
|
# 2. Wait for full stop
|
|
sleep 5
|
|
|
|
# 3. Then and only then remove files
|
|
rm -rf /opt/stacks/core/traefik
|
|
rm -rf /opt/stacks/core/authelia
|
|
|
|
# 4. Remove volumes after containers stopped
|
|
docker volume rm core_authelia-data
|
|
```
|
|
|
|
## Round 7 Objectives
|
|
|
|
### Primary Goals
|
|
1. ✅ Verify safe reset script works without system crashes
|
|
2. ✅ Test full deployment after reset (round-trip testing)
|
|
3. ✅ Validate no file system corruption occurs
|
|
4. ✅ Ensure containers start cleanly after reset
|
|
5. ✅ Document any remaining edge cases
|
|
|
|
### Testing Checklist
|
|
|
|
#### Pre-Testing Setup
|
|
- [ ] System is stable and responsive
|
|
- [ ] All previous containers stopped cleanly
|
|
- [ ] Disk space sufficient (5GB+ free)
|
|
- [ ] No filesystem errors: `dmesg | grep -i error`
|
|
- [ ] Docker daemon healthy: `systemctl status docker`
|
|
|
|
#### Round 7 Test Sequence
|
|
1. **Clean slate** using reset script
|
|
```bash
|
|
sudo ./scripts/reset-test-environment.sh
|
|
```
|
|
- [ ] Script completes without errors
|
|
- [ ] System remains responsive
|
|
- [ ] No kernel panics or crashes
|
|
- [ ] All volumes removed cleanly
|
|
|
|
2. **Fresh deployment** with improved scripts
|
|
```bash
|
|
sudo ./scripts/setup-homelab.sh
|
|
```
|
|
- [ ] Completes successfully
|
|
- [ ] No permission errors
|
|
- [ ] Password hash generated correctly
|
|
- [ ] Credentials saved properly
|
|
|
|
3. **Deploy infrastructure**
|
|
```bash
|
|
sudo ./scripts/deploy-homelab.sh
|
|
```
|
|
- [ ] Containers start cleanly
|
|
- [ ] No file conflicts
|
|
- [ ] Authelia initializes properly
|
|
- [ ] Credentials work immediately
|
|
|
|
4. **Verify services**
|
|
- [ ] Traefik accessible and routing
|
|
- [ ] Authelia login works
|
|
- [ ] Dockge UI accessible
|
|
- [ ] SSL certificates generating
|
|
|
|
5. **Test reset again** (idempotency)
|
|
```bash
|
|
sudo ./scripts/reset-test-environment.sh
|
|
```
|
|
- [ ] Stops everything gracefully
|
|
- [ ] No orphaned containers
|
|
- [ ] No volume leaks
|
|
- [ ] System stable after reset
|
|
|
|
## Changes Made for Round 7
|
|
|
|
### New Files
|
|
- **`scripts/reset-test-environment.sh`** - Safe cleanup script with proper shutdown sequence
|
|
|
|
### Modified Files
|
|
- **`scripts/deploy-homelab.sh`**:
|
|
- Added graceful container stop before config file operations
|
|
- Removed automatic database cleanup (now use reset script instead)
|
|
- Added safety checks before rm operations
|
|
- Better warnings about existing databases
|
|
|
|
### Removed Dangerous Operations
|
|
```bash
|
|
# REMOVED from deploy-homelab.sh:
|
|
docker compose up -d authelia 2>&1 | grep -q "encryption key" && {
|
|
docker compose down authelia
|
|
sudo rm -rf /var/lib/docker/volumes/core_authelia-data/_data/*
|
|
}
|
|
# This was causing crashes - containers couldn't handle abrupt volume removal
|
|
|
|
# REMOVED blind directory removal:
|
|
rm -rf /opt/stacks/core/traefik /opt/stacks/core/authelia
|
|
# Now checks if containers are running first
|
|
```
|
|
|
|
## System Health Monitoring
|
|
|
|
### Before Each Test Run
|
|
```bash
|
|
# Check system health
|
|
free -h # Memory available
|
|
df -h # Disk space
|
|
dmesg | tail -20 # Recent kernel messages
|
|
systemctl status docker # Docker daemon health
|
|
docker ps # Running containers
|
|
```
|
|
|
|
### During Testing
|
|
- Monitor system logs: `journalctl -f`
|
|
- Watch Docker logs: `docker compose logs -f`
|
|
- Check resource usage: `htop` or `top`
|
|
|
|
### After Issues
|
|
```bash
|
|
# If system becomes unresponsive:
|
|
# 1. DO NOT hard power off immediately
|
|
# 2. Try to SSH in from another machine
|
|
# 3. Gracefully stop Docker: systemctl stop docker
|
|
# 4. Wait 30 seconds for disk writes to complete
|
|
# 5. Then reboot: systemctl reboot
|
|
|
|
# Check for filesystem corruption after boot:
|
|
sudo dmesg | grep -i error
|
|
sudo journalctl -xb | grep -i error
|
|
```
|
|
|
|
## Recovery Procedures
|
|
|
|
### If Deploy Script Hangs
|
|
```bash
|
|
# In another terminal:
|
|
cd /opt/stacks/core
|
|
docker compose ps # See what's running
|
|
docker compose logs authelia # Check for errors
|
|
docker compose down # Stop gracefully
|
|
# Then re-run deploy
|
|
```
|
|
|
|
### If Authelia Won't Start (Encryption Key Error)
|
|
```bash
|
|
# Use the reset script:
|
|
sudo ./scripts/reset-test-environment.sh
|
|
# Then start fresh deployment
|
|
```
|
|
|
|
### If System Crashed During Testing
|
|
```bash
|
|
# After reboot:
|
|
# 1. Check Docker state
|
|
systemctl status docker
|
|
docker ps -a # Look for crashed containers
|
|
|
|
# 2. Clean up properly
|
|
cd /opt/stacks/core
|
|
docker compose down --remove-orphans
|
|
|
|
# 3. Remove corrupted volumes
|
|
docker volume prune -f
|
|
|
|
# 4. Start fresh with reset script
|
|
sudo ./scripts/reset-test-environment.sh
|
|
```
|
|
|
|
## Success Criteria for Round 7
|
|
|
|
### Must Have
|
|
- ✅ Reset script completes without system crashes
|
|
- ✅ Can deploy and reset multiple times safely
|
|
- ✅ No filesystem corruption after any operation
|
|
- ✅ System remains responsive throughout testing
|
|
- ✅ All containers stop gracefully
|
|
|
|
### Should Have
|
|
- ✅ Clear warnings before destructive operations
|
|
- ✅ Confirmation prompts for cleanup
|
|
- ✅ Progress indicators during long operations
|
|
- ✅ Health checks before and after operations
|
|
|
|
### Nice to Have
|
|
- ⭐ Automatic backup before reset
|
|
- ⭐ Rollback capability
|
|
- ⭐ System health validation
|
|
- ⭐ Detailed logging of all operations
|
|
|
|
## Emergency Contacts / References
|
|
|
|
- Docker best practices: https://docs.docker.com/config/daemon/
|
|
- Linux filesystem safety: `man sync`, `man fsync`
|
|
- systemd service management: `man systemctl`
|
|
|
|
## Post-Round 7 Review
|
|
|
|
### Document These
|
|
- [ ] Any new crash scenarios discovered
|
|
- [ ] System resource usage patterns
|
|
- [ ] Time required for clean operations
|
|
- [ ] Any remaining unsafe operations
|
|
- [ ] User experience improvements needed
|
|
|
|
---
|
|
|
|
**Remember**: System stability is more important than testing speed. Always wait for clean shutdowns and never force operations.
|