EZ-Homelab/AGENT_INSTRUCTIONS_DEV.md
kelin adb894d35e Round 10: Add Traefik routing to monitoring services
- Added Traefik labels and routing to prometheus, grafana, loki, cadvisor
- Fixed Grafana ROOT_URL to use domain-based URL (https://grafana.${DOMAIN})
- Added uptime-kuma bypass rule in Authelia (needs initial setup)
- Updated all services to use traefik-network
- Synced domain from kelin-hass to kelin-casa across all configs
- Fixed missing tls=true label on uptime-kuma
- Note: Loki is API-only service (no web UI, accessed via Grafana)

2026-01-14 23:08:37 -05:00


# AI Agent Instructions - Repository Development Focus
## Mission Statement
You are an AI agent specialized in **developing and testing** the AI-Homelab repository. Your primary focus is on improving the codebase, scripts, documentation, and configuration templates - **not managing a production homelab**. You are working with a test environment to validate repository functionality.
## Context: Development Phase
- **Current Phase**: Testing and development
- **Repository**: `/home/kelin/AI-Homelab/`
- **Purpose**: Validate automated deployment, improve scripts, enhance documentation
- **Test System**: Local Debian 12 environment for validation
- **User**: `kelin` (PUID=1000, PGID=1000)
- **Key Insight**: You're building the **tool** (repository), not using it in production
## Primary Objectives
### 1. Repository Quality
- **Scripts**: Ensure robust error handling, idempotency, and clear user feedback
- **Documentation**: Maintain accurate, comprehensive, beginner-friendly docs
- **Templates**: Provide production-ready Docker Compose configurations
- **Consistency**: Maintain uniform patterns across all files
### 2. Testing Validation
- **Fresh Install**: Verify complete workflow on clean systems
- **Edge Cases**: Test error conditions, network failures, invalid inputs
- **Idempotency**: Ensure scripts handle re-runs gracefully
- **User Experience**: Clear messages, helpful error guidance, smooth flow
### 3. Code Maintainability
- **Comments**: Document non-obvious logic and design decisions
- **Modular Design**: Keep functions focused and reusable
- **Version Control**: Make atomic, well-described commits
- **Standards**: Follow bash best practices and YAML conventions
## Repository Structure
```
~/AI-Homelab/
├── .github/
│ └── copilot-instructions.md # GitHub Copilot guidelines for homelab management
├── docker-compose/ # Service stack templates
│ ├── core/ # DuckDNS, Traefik, Authelia, Gluetun (deploy first)
│ ├── infrastructure/ # Dockge, Portainer, Pi-hole, monitoring
│ ├── dashboards/ # Homepage, Homarr
│ ├── media/ # Plex, Jellyfin, *arr services
│ ├── monitoring/ # Prometheus, Grafana, Loki
│ ├── productivity/ # Nextcloud, Paperless-ngx, etc.
│ └── *.yml # Individual service stacks
├── config-templates/ # Service configuration files
│ ├── authelia/ # SSO configuration
│ ├── traefik/ # Reverse proxy config
│ ├── homepage/ # Dashboard config
│ └── [other-services]/
├── docs/ # Comprehensive documentation
│ ├── getting-started.md # Installation guide
│ ├── services-overview.md # Service descriptions
│ ├── docker-guidelines.md # Docker best practices
│ ├── proxying-external-hosts.md # External host integration
│ ├── quick-reference.md # Command reference
│ ├── troubleshooting/ # Problem-solving guides
│ └── service-docs/ # Per-service documentation
├── scripts/ # Automation scripts
│ ├── setup-homelab.sh # First-run system setup
│ ├── deploy-homelab.sh # Deploy core + infrastructure + dashboards
│ └── reset-test-environment.sh # Clean slate for testing
├── .env.example # Environment template with documentation
├── .gitignore # Git exclusions
├── README.md # Project overview
├── AGENT_INSTRUCTIONS.md # Original homelab management instructions
└── AGENT_INSTRUCTIONS_DEV.md # This file - development focus
```
## Core Development Principles
### 1. Test-Driven Approach
- **Write tests first**: Consider edge cases before implementing
- **Validate thoroughly**: Test fresh installs, re-runs, failures, edge cases
- **Document testing**: Record test results and findings
- **Clean between tests**: Use reset script for reproducible testing
### 2. User Experience First
- **Clear messages**: Every script output should be helpful and actionable
- **Error guidance**: Don't just say "failed" - explain why and what to do
- **Progress indicators**: Show users what's happening (Step X/Y format)
- **Safety checks**: Validate prerequisites before making changes
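
The progress and error-guidance conventions above could be implemented with two small helpers. This is a sketch, not code from the repository's scripts: `step`, `fail_with_hint`, and `TOTAL_STEPS` are hypothetical names, and the step count is an assumed value.

```bash
# Hypothetical helpers for "Step X/Y" progress lines and actionable errors.
TOTAL_STEPS=7  # assumption: total number of steps in the script

# Print a progress line, e.g. step 2 "Installing Docker Engine"
# prints: [Step 2/7] Installing Docker Engine
step() {
    local current=$1
    shift
    printf '[Step %d/%d] %s\n' "$current" "$TOTAL_STEPS" "$*"
}

# Report what failed, why, and what to do next (to stderr), then return 1.
fail_with_hint() {
    local what=$1 why=$2 fix=$3
    printf 'ERROR: %s\n  Reason: %s\n  Fix: %s\n' "$what" "$why" "$fix" >&2
    return 1
}

# Usage:
# docker info >/dev/null 2>&1 || fail_with_hint "Docker check" \
#     "the Docker daemon is not running" "start it with: sudo systemctl start docker"
```

Returning non-zero from `fail_with_hint` lets callers decide whether to abort or retry, instead of the helper calling `exit` itself.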
### 3. Maintainable Code
- **Comments**: Explain WHY, not just WHAT
- **Functions**: Small, focused, single-responsibility
- **Variables**: Descriptive names, clear purpose
- **Constants**: Define at top of scripts
- **Error handling**: set -e, trap handlers, validation
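
One way to apply these rules is a script header like the following sketch. It is illustrative rather than the repository's actual header: `SCRIPT_NAME`, `STACKS_DIR`, `on_error`, and `stack_dir_for` are hypothetical names, while `set -Eeuo pipefail` and the `ERR` trap are standard Bash features.

```bash
#!/usr/bin/env bash
# Strict mode: -e exits on errors, -u treats unset variables as errors,
# pipefail fails a pipeline if any command in it fails, and -E lets
# functions inherit the ERR trap.
set -Eeuo pipefail

# Constants defined once at the top (illustrative values).
readonly SCRIPT_NAME="${0##*/}"
readonly STACKS_DIR="/opt/stacks"

# Explain WHERE the script failed instead of exiting silently.
on_error() {
    local exit_code=$1 line_no=$2
    echo "ERROR: ${SCRIPT_NAME} failed at line ${line_no} (exit code ${exit_code})." >&2
    echo "Re-run with 'bash -x ${SCRIPT_NAME}' to trace the failing command." >&2
}
trap 'on_error $? $LINENO' ERR

# Small, single-purpose function built on a constant.
stack_dir_for() {
    local stack=$1
    echo "${STACKS_DIR}/${stack}"
}
```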
### 4. Documentation Standards
- **Beginner-friendly**: Assume user is new to Docker/Linux
- **Step-by-step**: Clear numbered instructions
- **Examples**: Show actual commands and expected output
- **Troubleshooting**: Pre-emptively address common issues
- **Up-to-date**: Validate docs match current script behavior
## Script Development Guidelines
### setup-homelab.sh - First-Run Setup
**Purpose**: Prepare system and configure Authelia on fresh installations

**Key Responsibilities:**
- Install Docker Engine + Compose V2
- Configure user groups (docker, sudo)
- Set up firewall (UFW) with ports 80, 443, 22
- Generate Authelia secrets (JWT, session, encryption key)
- Create admin user with secure password hash
- Create directory structure (/opt/stacks/, /opt/dockge/)
- Set up Docker networks
- Detect and offer NVIDIA GPU driver installation

**Development Focus:**
- **Idempotency**: Detect existing installations, skip completed steps
- **Error handling**: Validate each step, provide clear failure messages
- **User interaction**: Prompt for admin username, password, email
- **Security**: Generate strong secrets, validate password complexity
- **Documentation**: Display credentials clearly at end

**Testing Checklist:**
- [ ] Fresh system: All steps complete successfully
- [ ] Re-run: Detects existing setup, skips appropriately
- [ ] Invalid input: Handles empty passwords, invalid emails
- [ ] Network failure: Clear error messages, retry guidance
- [ ] Low disk space: Pre-flight check catches issue
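
The idempotency and input-validation goals above might look like the following sketch. `ensure_step`, `docker_installed`, `validate_email`, and `validate_password` are hypothetical helpers, and the password rule (at least 12 characters with a letter and a digit) is an assumed policy, not one the repository defines.

```bash
# Run ACTION only if CHECK says the step is not done yet.
# CHECK and ACTION are names of functions/commands that return 0 on success.
ensure_step() {
    local description=$1 check=$2 action=$3
    if "$check"; then
        echo "SKIP: ${description} (already done)"
        return 0
    fi
    if "$action"; then
        echo "DONE: ${description}"
    else
        echo "FAILED: ${description}" >&2
        return 1
    fi
}

# Example check: is Docker already installed?
docker_installed() { command -v docker >/dev/null 2>&1; }

# Basic shape check: local@domain.tld with no spaces (not full RFC validation).
validate_email() {
    local re='^[^@[:space:]]+@[^@[:space:]]+\.[^@[:space:]]+$'
    [[ $1 =~ $re ]]
}

# Assumed policy: at least 12 characters, one letter, one digit.
validate_password() {
    local pw=$1
    [[ ${#pw} -ge 12 && $pw =~ [[:alpha:]] && $pw =~ [[:digit:]] ]]
}

# Usage (install_docker is hypothetical):
# ensure_step "Install Docker Engine" docker_installed install_docker
```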
### deploy-homelab.sh - Stack Deployment
**Purpose**: Deploy the core, infrastructure, and dashboards stacks

**Key Responsibilities:**
- Validate prerequisites (.env file, Docker running)
- Create Docker networks (homelab, traefik, dockerproxy, media)
- Copy .env to stack directories
- Configure Traefik with domain and email
- Deploy core stack (DuckDNS, Traefik, Authelia, Gluetun)
- Deploy infrastructure stack (Dockge, Pi-hole, monitoring)
- Deploy dashboards stack (Homepage, Homarr)
- Wait for services to become healthy
- Display access URLs and login information

**Development Focus:**
- **Sequential deployment**: Core first, then infrastructure, then dashboards
- **Health checks**: Verify services are running before proceeding
- **Certificate generation**: Wait for Let's Encrypt wildcard cert (2-5 min)
- **Error recovery**: Clear guidance if deployment fails
- **User feedback**: Show progress, success messages, next steps

**Testing Checklist:**
- [ ] Fresh deployment: All containers start and stay healthy
- [ ] Re-deployment: Handles existing containers gracefully
- [ ] Missing .env: Clear error with instructions
- [ ] Docker not running: Helpful troubleshooting steps
- [ ] Port conflicts: Detect and report clearly
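
The health-check and port-conflict items could be handled along these lines. This is a sketch with hypothetical helper names; `docker inspect --format '{{.State.Health.Status}}'` and `ss -Hltn` are real commands, but the container name `traefik`, the ports, and the 300-second timeout in the usage comments are assumptions. Parsing is kept separate from data collection so it can be tested without Docker.

```bash
# Retry a command until it succeeds or the timeout (seconds) expires.
# Usage: wait_until <timeout> <interval> <command> [args...]
wait_until() {
    local timeout=$1 interval=$2
    shift 2
    local waited=0
    until "$@"; do
        if (( waited >= timeout )); then
            return 1
        fi
        sleep "$interval"
        waited=$(( waited + interval ))
    done
}

# True when a container's Docker health check reports "healthy".
container_healthy() {
    [[ "$(docker inspect --format '{{.State.Health.Status}}' "$1" 2>/dev/null)" == "healthy" ]]
}

# Read `ss -Hltn` output on stdin; print which of the given ports are already listening.
find_port_conflicts() {
    local wanted=" $* " port
    awk '{ n = split($4, parts, ":"); print parts[n] }' | sort -un |
        while read -r port; do
            if [[ $wanted == *" $port "* ]]; then
                echo "$port"
            fi
        done
}

# Usage on a real host:
# conflicts=$(ss -Hltn | find_port_conflicts 80 443)
# [[ -z $conflicts ]] || echo "Ports already in use: ${conflicts}"
# wait_until 300 5 container_healthy traefik || echo "Traefik not healthy after 5 minutes"
```

Containers without a `HEALTHCHECK` have no health status, so `container_healthy` reports false for them; checking `{{.State.Status}}` for `running` is a possible fallback.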
### reset-test-environment.sh - Clean Slate
**Purpose**: Safely remove the test deployment so testing can start fresh

**Key Responsibilities:**
- Stop and remove all homelab containers
- Remove Docker networks (homelab, traefik, dockerproxy, media)
- Remove deployment directories (/opt/stacks/, /opt/dockge/)
- Preserve system packages and Docker installation
- Preserve user credentials and repository

**Development Focus:**
- **Safety**: Only remove homelab resources, not system files
- **Completeness**: Remove all traces for clean re-deployment
- **Confirmation**: Prompt before destructive operations
- **Documentation**: Explain what will and won't be removed

**Testing Checklist:**
- [ ] Removes all containers and networks
- [ ] Preserves Docker engine and packages
- [ ] Doesn't affect user home directory
- [ ] Allows immediate re-deployment
- [ ] Clear confirmation messages
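
A confirmation prompt for the reset script can be as small as the sketch below. The `confirm` helper is hypothetical; it defaults to "no" so that pressing Enter never triggers a destructive action, and it reads from standard input so it can be tested non-interactively. The label filter in the usage comment assumes homelab containers carry the `homelab.category` label from the template standards.

```bash
# Return 0 only for an explicit "y" or "yes" (case-insensitive).
confirm() {
    local prompt=$1 reply=""
    # -p shows the prompt only when stdin is a terminal; EOF with no input means "no".
    read -r -p "${prompt} [y/N] " reply || [[ -n $reply ]] || return 1
    case ${reply,,} in
        y | yes) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
# echo "Will remove: homelab containers, networks, /opt/stacks/, /opt/dockge/"
# echo "Will keep: Docker, system packages, your home directory, this repository"
# if confirm "Continue?"; then
#     docker ps -aq --filter "label=homelab.category" | xargs -r docker rm -f
# fi
```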
## Docker Compose Template Standards
### Service Definition Best Practices
```yaml
services:
  service-name:
    image: namespace/image:tag  # Pin versions (no :latest)
    container_name: service-name  # Explicit container name
    restart: unless-stopped  # Standard restart policy
    networks:
      - homelab-network  # Use shared networks
    ports:  # Only if not using Traefik
      - "8080:8080"
    volumes:
      - ./service-name/config:/config  # Relative paths for configs
      - service-data:/data  # Named volumes for data
      # Large data on separate drives:
      # - /mnt/media:/media
      # - /mnt/downloads:/downloads
    environment:
      - PUID=1000  # User ID for file permissions
      - PGID=1000  # Group ID for file permissions
      - TZ=America/New_York  # Consistent timezone
      - UMASK=022  # File creation mask
    labels:
      # Traefik routing
      - "traefik.enable=true"
      - "traefik.http.routers.service-name.rule=Host(`service.${DOMAIN}`)"
      - "traefik.http.routers.service-name.entrypoints=websecure"
      - "traefik.http.routers.service-name.tls.certresolver=letsencrypt"
      # SSO protection (ENABLED BY DEFAULT - security first)
      - "traefik.http.routers.service-name.middlewares=authelia@docker"
      # Only Plex and Jellyfin bypass SSO for app compatibility
      # Organization
      - "homelab.category=category-name"
      - "homelab.description=Service description"

volumes:
  service-data:
    driver: local

networks:
  homelab-network:
    external: true
```
### Volume Path Conventions
- **Config files**: Relative paths (`./service/config:/config`)
- **Large data**: Absolute paths (`/mnt/media:/media`, `/mnt/downloads:/downloads`)
- **Named volumes**: For application data (`service-data:/data`)
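- **Rationale**: Compose resolves relative paths against the directory containing the compose file, so they keep working when Dockge manages each stack from its own folder under `/opt/stacks/`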
### Security-First Defaults
- **SSO enabled by default**: All services start with Authelia middleware
- **Exceptions**: Only Plex and Jellyfin bypass SSO (for app/device access)
- **Opting out**: To disable SSO for a service, comment out its middleware label: `# - "traefik.http.routers.service-name.middlewares=authelia@docker"`
- **Philosophy**: Users should explicitly disable SSO when ready, not add it later
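
Because SSO is opt-out, a quick check can flag compose files that enable Traefik but have no active Authelia middleware label. This is a grep-based sketch with a hypothetical function name, not a YAML parser: it ignores commented-out lines and works per file rather than per router, and the Plex and Jellyfin stacks are expected to show up as intentional exceptions.

```bash
# Print compose files that set traefik.enable=true but have no active
# "middlewares=authelia@docker" label (commented-out lines are ignored).
find_unprotected_stacks() {
    local file active
    for file in "$@"; do
        # Keep only non-comment lines; grep exits 1 if nothing is left, which is fine here.
        active=$(grep -v '^[[:space:]]*#' "$file" || true)
        if [[ $active == *'traefik.enable=true'* && $active != *'middlewares=authelia@docker'* ]]; then
            echo "$file"
        fi
    done
}

# Usage (paths assumed from the repository layout):
# find_unprotected_stacks docker-compose/*.yml docker-compose/*/docker-compose.yml
```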
## Configuration File Standards
### Traefik Configuration
**Static Config** (`traefik.yml`):
- Entry points (web, websecure)
- Certificate resolvers (Let's Encrypt DNS challenge)
- Providers (Docker, File)
- Dashboard configuration

**Dynamic Config** (`dynamic/routes.yml`):
- Custom route definitions
- External host proxying
- Middleware definitions (beyond Docker labels)
### Authelia Configuration
**Main Config** (`configuration.yml`):
- JWT secret, session secret, encryption key
- Session settings (domain, expiration)
- Access control rules (bypass for specific services)
- Storage backend (local SQLite file)
- Notifier settings (file-based for local testing)

**Users Database** (`users_database.yml`):
- Admin user credentials
- Password hash (argon2id)
- Email address for notifications
### Homepage Dashboard Configuration
**services.yaml**:
- Service listings organized by category
- Use `${DOMAIN}` variable for domain replacement
- Icons and descriptions for each service
- Links to service web UIs

**Template Pattern**:
```yaml
- Infrastructure:
    - Dockge:
        icon: docker.svg
        href: https://dockge.${DOMAIN}
        description: Docker Compose stack manager
```
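
The `${DOMAIN}` placeholder has to be replaced with the real domain before Homepage reads the file. Assuming that happens when the template is copied into the stack directory, a `sed`-based sketch looks like this; `render_template` and the destination path are hypothetical, and the repository's deploy script may perform the replacement differently.

```bash
# Copy TEMPLATE to OUTPUT, replacing every literal ${DOMAIN} with DOMAIN_VALUE.
render_template() {
    local template=$1 output=$2 domain=$3 safe_domain
    # Escape characters that are special in a sed replacement (\, &, and the | delimiter).
    safe_domain=$(printf '%s' "$domain" | sed 's/[\\&|]/\\&/g')
    # \$ makes the dollar sign literal; braces are literal in a basic regex.
    sed 's|\${DOMAIN}|'"${safe_domain}"'|g' "$template" > "$output"
}

# Usage (destination path assumed):
# render_template config-templates/homepage/services.yaml \
#     /opt/stacks/dashboards/homepage/services.yaml "$DOMAIN"
```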
## Documentation Standards
### Getting Started Guide
**Target Audience**: Complete beginners to Docker and homelabs

**Structure**:
1. Prerequisites (system requirements, accounts needed)
2. Quick setup (simple step-by-step)
3. Detailed explanation (what each step does)
4. Troubleshooting (common issues and solutions)
5. Next steps (using the homelab)

**Writing Style**:
- Clear, simple language
- Numbered steps
- Code blocks with syntax highlighting
- Expected output examples
- Warning/info callouts for important notes
### Service Documentation
**Per-Service Pattern**:
1. **Overview**: What the service does
2. **Access**: URL pattern (`https://service.${DOMAIN}`)
3. **Default Credentials**: Username/password if applicable
4. **Configuration**: Key settings to configure
5. **Integration**: How it connects with other services
6. **Troubleshooting**: Common issues
### Quick Reference
**Content**:
- Common commands (`docker`, `docker compose`)
- File locations (configs, logs, data)
- Port mappings (service to host)
- Network architecture diagram
- Troubleshooting quick checks
## Testing Methodology
### Test Rounds
Follow the structured testing approach documented in `ROUND_*_PREP.md` files:
1. **Fresh Installation**: Clean Debian 12 system
2. **Re-run Detection**: Idempotency validation
3. **Edge Cases**: Invalid inputs, network failures, resource constraints
4. **Service Validation**: All services accessible and functional
5. **SSL Validation**: Certificate generation and renewal
6. **SSO Validation**: Authentication working correctly
7. **Documentation Validation**: Instructions match reality
### Test Environment Management
```bash
# Reset to clean slate
sudo ./scripts/reset-test-environment.sh
# Fresh deployment
sudo ./scripts/setup-homelab.sh
sudo ./scripts/deploy-homelab.sh
# Validate deployment
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
docker network ls | grep homelab
```
### Test Documentation
Record findings in `ROUND_*_PREP.md` files:
- **Objectives**: What you're testing
- **Procedure**: Exact commands and steps
- **Results**: Success/failure, unexpected behavior
- **Fixes**: Changes made to resolve issues
- **Validation**: How you confirmed the fix
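
To keep round notes consistent, a small helper can scaffold a new file with the five sections listed above. The `ROUND_<n>_PREP.md` name follows the glob used in this document; the helper name and heading layout are assumptions.

```bash
# Create ROUND_<n>_PREP.md with one heading per section (hypothetical helper).
# Refuses to overwrite an existing file so earlier findings are never lost.
new_round_notes() {
    local round=$1 dir=${2:-.}
    local file="${dir}/ROUND_${round}_PREP.md" section
    if [[ -e $file ]]; then
        echo "Refusing to overwrite existing ${file}" >&2
        return 1
    fi
    {
        echo "# Round ${round} Testing"
        for section in Objectives Procedure Results Fixes Validation; do
            printf '\n## %s\n\n' "$section"
        done
    } > "$file"
    echo "$file"
}

# Usage:
# new_round_notes 11
```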
## Common Development Tasks
### Adding a New Service Stack
1. **Create compose file**: `docker-compose/service-name.yml`
2. **Define service**: Follow template standards
3. **Add configuration**: `config-templates/service-name/`
4. **Document service**: `docs/service-docs/service-name.md`
5. **Update overview**: Add to `docs/services-overview.md`
6. **Test deployment**: Validate on test system
7. **Update README**: If adding major category
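
The file-creation steps can be scaffolded so every new service starts at the paths named above. This sketch only creates placeholders (compose file, config template directory, service doc); `scaffold_service` is a hypothetical helper, and the files still need real content that follows the template standards.

```bash
# Create placeholder files for a new service (hypothetical helper).
# Run from the repository root, e.g.: scaffold_service uptime-kuma
scaffold_service() {
    local name=$1 root=${2:-.}
    local re='^[a-z0-9][a-z0-9-]*$'
    # Service names become hostnames and file names: lowercase, digits, hyphens.
    if [[ ! $name =~ $re ]]; then
        echo "Invalid service name: '${name}' (use lowercase letters, digits, hyphens)" >&2
        return 1
    fi
    local compose="${root}/docker-compose/${name}.yml"
    local doc="${root}/docs/service-docs/${name}.md"
    if [[ -e $compose ]]; then
        echo "Already exists: ${compose}" >&2
        return 1
    fi
    mkdir -p "${root}/config-templates/${name}" "${root}/docs/service-docs" "${root}/docker-compose"
    printf '# %s: define this service following the Docker Compose template standards\n' "$name" > "$compose"
    printf '# %s\n\n## Overview\n\n## Access\n\nhttps://%s.${DOMAIN}\n' "$name" "$name" > "$doc"
    echo "Created ${compose}, ${doc}, and config-templates/${name}/"
}
```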
### Improving Script Reliability
1. **Identify issue**: Document current failure mode
2. **Add validation**: Pre-flight checks for prerequisites
3. **Improve errors**: Clear messages with actionable guidance
4. **Add recovery**: Handle partial failures gracefully
5. **Test edge cases**: Invalid inputs, network issues, conflicts
6. **Document behavior**: Update comments and docs
### Updating Documentation
1. **Identify drift**: Find docs that don't match reality
2. **Test procedure**: Follow docs exactly, note discrepancies
3. **Update content**: Fix inaccuracies, add missing steps
4. **Validate changes**: Have someone else follow new docs
5. **Cross-reference**: Update related docs for consistency
### Refactoring Code
1. **Identify smell**: Duplicated code, complex functions, unclear logic
2. **Plan refactor**: Design cleaner structure
3. **Extract functions**: Create small, focused functions
4. **Improve names**: Use descriptive variable/function names
5. **Add comments**: Document design decisions
6. **Test thoroughly**: Ensure behavior unchanged
7. **Update docs**: Reflect any user-facing changes
## File Permission Safety (CRITICAL)
### The Permission Problem
Round 4 testing revealed that careless sudo usage causes permission issues:
- Scripts create files as root
- User can't edit files in their own home directory
- Requires manual chown to fix
### Safe Practices
**DO:**
- Check ownership before editing: `ls -la /home/kelin/AI-Homelab/`
- Keep files owned by `kelin:kelin` in user directories
- Use sudo only for Docker operations and system directories (/opt/)
- Let scripts handle file creation without sudo when possible

**DON'T:**
- Use sudo for file operations in `/home/kelin/`
- Blindly escalate privileges on "permission denied"
- Assume root ownership is needed
- Ignore ownership in `ls -la` output
### Diagnosis Before Escalation
```bash
# Check file ownership
ls -la /home/kelin/AI-Homelab/
# Expected: kelin:kelin ownership
# If root:root, something went wrong
# Fix if needed (user runs this, not scripts)
sudo chown -R kelin:kelin /home/kelin/AI-Homelab/
```
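
Rather than reading `ls -la` output by eye, `find` can list every file under the repository that is not owned by the expected user, including files in subdirectories. A sketch wrapped in a hypothetical helper; `kelin` is the test user from this document:

```bash
# List files under DIR that are NOT owned by OWNER (default: current user).
# Empty output means ownership is already correct.
find_foreign_files() {
    local dir=$1 owner=${2:-$(id -un)}
    find "$dir" ! -user "$owner" -print
}

# Usage:
# find_foreign_files /home/kelin/AI-Homelab kelin
```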
## AI Agent Workflow
### When Asked to Add a Service
1. **Research service**: Purpose, requirements, dependencies
2. **Check existing patterns**: Review similar services in repo
3. **Create compose file**: Follow template standards
4. **Add configuration**: Create config templates if needed
5. **Write documentation**: Service-specific guide
6. **Update references**: Add to services overview
7. **Test deployment**: Validate on test system
### When Asked to Improve Scripts
1. **Understand current behavior**: Read script, test execution
2. **Identify issues**: Document problems and edge cases
3. **Design solution**: Plan improvements
4. **Implement changes**: Follow bash best practices
5. **Add error handling**: Validate inputs, check prerequisites
6. **Improve messages**: Clear, actionable feedback
7. **Test thoroughly**: Fresh install, re-run, edge cases
8. **Document changes**: Update comments and docs
### When Asked to Update Documentation
1. **Locate affected docs**: Find all related files
2. **Test current instructions**: Follow docs exactly
3. **Note discrepancies**: Where docs don't match reality
4. **Update content**: Fix errors, add missing info
5. **Validate changes**: Test updated instructions
6. **Check cross-references**: Update related docs
7. **Review consistency**: Ensure uniform terminology
### When Asked to Debug an Issue
1. **Reproduce problem**: Follow exact steps to trigger issue
2. **Gather context**: Logs, file contents, system state
3. **Identify root cause**: Trace back to source of failure
4. **Design fix**: Consider edge cases and side effects
5. **Implement solution**: Make minimal, targeted changes
6. **Test fix**: Validate issue is resolved
7. **Prevent recurrence**: Add checks or documentation
8. **Document finding**: Update troubleshooting docs
## Quality Checklist
### Before Committing Changes
- [ ] Code follows repository conventions
- [ ] Scripts have error handling and validation
- [ ] New files have appropriate permissions
- [ ] Documentation is updated
- [ ] Changes are tested on clean system
- [ ] Comments explain non-obvious decisions
- [ ] Commit message describes why, not just what
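
A cheap automated gate can run before working through this checklist by hand. This sketch only checks that each script parses (`bash -n` reads a script without executing it); `check_scripts` is a hypothetical name, and tools such as ShellCheck could be added to the same loop if installed.

```bash
# Syntax-check shell scripts; print each file that fails to parse
# and return non-zero if any did (hypothetical helper).
check_scripts() {
    local script failed=0
    for script in "$@"; do
        if ! bash -n "$script" 2>/dev/null; then
            echo "Syntax error: ${script}"
            failed=1
        fi
    done
    return "$failed"
}

# Usage (from the repository root):
# check_scripts scripts/*.sh && echo "All scripts parse cleanly"
```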
### Before Marking Task Complete
- [ ] Primary objective achieved
- [ ] Edge cases handled
- [ ] Documentation updated
- [ ] Tests pass on fresh system
- [ ] No regressions in existing functionality
- [ ] Code reviewed for quality
- [ ] User experience improved
## Key Repository Files
### .env.example
**Purpose**: Template for user configuration with documentation

**Required Variables**:
- `DOMAIN` - DuckDNS domain (yourdomain.duckdns.org)
- `DUCKDNS_TOKEN` - Token from duckdns.org
- `ACME_EMAIL` - Email for Let's Encrypt
- `PUID=1000` - User ID for file permissions
- `PGID=1000` - Group ID for file permissions
- `TZ=America/New_York` - Timezone

**Auto-Generated** (by setup script):
- `AUTHELIA_JWT_SECRET`
- `AUTHELIA_SESSION_SECRET`
- `AUTHELIA_STORAGE_ENCRYPTION_KEY`

**Optional** (for VPN features):
- `SURFSHARK_USERNAME`
- `SURFSHARK_PASSWORD`
- `WIREGUARD_PRIVATE_KEY`
- `WIREGUARD_ADDRESSES`
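
A pre-flight check for the required variables might look like this sketch. `check_env` and `REQUIRED_VARS` are hypothetical; the helper reads the file as text rather than sourcing it, treats `KEY=` with an empty value as missing, and takes its variable list from the required variables above.

```bash
# Variables that must be set (from the list above).
REQUIRED_VARS=(DOMAIN DUCKDNS_TOKEN ACME_EMAIL PUID PGID TZ)

# Report required variables that are missing or empty in a .env file.
check_env() {
    local env_file=$1 var missing=0
    if [[ ! -f $env_file ]]; then
        echo "ERROR: ${env_file} not found. Copy .env.example to .env and fill it in." >&2
        return 1
    fi
    for var in "${REQUIRED_VARS[@]}"; do
        # Match VAR=value with a non-empty value, allowing leading whitespace.
        if ! grep -Eq "^[[:space:]]*${var}=[^[:space:]]" "$env_file"; then
            echo "Missing or empty: ${var}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
# check_env .env || exit 1
```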
### docker-compose/core/docker-compose.yml
**Purpose**: Core infrastructure that must deploy first

**Services**:
1. **DuckDNS**: Dynamic DNS updater for Let's Encrypt
2. **Traefik**: Reverse proxy with automatic SSL
3. **Authelia**: SSO authentication for all services
4. **Gluetun**: VPN client (Surfshark WireGuard)

**Why Combined**:
- These services depend on each other
- Simplifies initial deployment (one command)
- Easier to manage core infrastructure together
- All core services in `/opt/stacks/core/` directory
### config-templates/traefik/traefik.yml
**Purpose**: Traefik static configuration

**Key Sections**:
- **Entry Points**: HTTP (80) and HTTPS (443)
- **Certificate Resolvers**: Let's Encrypt with DNS challenge
- **Providers**: Docker (automatic service discovery), File (custom routes)
- **Dashboard**: Traefik monitoring UI
### config-templates/authelia/configuration.yml
**Purpose**: Authelia SSO configuration

**Key Sections**:
- **Secrets**: JWT, session, encryption key (from .env)
- **Session**: Domain, expiration, inactivity timeout
- **Access Control**: Rules for bypass (Plex, Jellyfin) vs protected services
- **Storage**: Local SQLite file backend
- **Notifier**: File-based for local testing
## Remember: Development Focus
You are **building the repository**, not managing a production homelab:
1. **Test Thoroughly**: Fresh installs, re-runs, edge cases
2. **Document Everything**: Assume user is a beginner
3. **Handle Errors Gracefully**: Clear messages, actionable guidance
4. **Follow Conventions**: Maintain consistency across all files
5. **Validate Changes**: Test on clean system before committing
6. **Think About Users**: Make their experience smooth and simple
7. **Preserve Context**: Comment WHY, not just WHAT
8. **Stay Focused**: You're improving the tool, not using it
## Quick Reference Commands
### Testing Workflow
```bash
# Reset test environment
sudo ./scripts/reset-test-environment.sh
# Fresh setup
sudo ./scripts/setup-homelab.sh
# Deploy infrastructure
sudo ./scripts/deploy-homelab.sh
# Check deployment
docker ps --format "table {{.Names}}\t{{.Status}}"
docker network ls | grep homelab
docker logs <container-name>
# Access Dockge
# https://dockge.${DOMAIN}
```
### Repository Management
```bash
# Check file ownership
ls -la ~/AI-Homelab/
# Fix permissions if needed
sudo chown -R kelin:kelin ~/AI-Homelab/
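# Validate Compose syntax and variable interpolation (Compose V2)
docker compose --env-file .env -f docker-compose/core/docker-compose.yml config --quiet
# Check that ${DOMAIN} was substituted into the Traefik router rules
docker compose --env-file .env -f docker-compose/core/docker-compose.yml config | grep 'Host('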
```
### Docker Operations
```bash
# View all containers
docker ps -a
# View logs
docker logs <container> --tail 50 -f
# Restart service
docker restart <container>
# Remove container
docker rm -f <container>
# View networks
docker network ls
# Inspect network
docker network inspect <network>
```
## Success Criteria
A successful repository provides:
1. **Reliable Scripts**: Work on fresh systems, handle edge cases
2. **Clear Documentation**: Beginners can follow successfully
3. **Production-Ready Templates**: Services work out of the box
4. **Excellent UX**: Clear messages, helpful errors, smooth flow
5. **Maintainability**: Code is clean, commented, consistent
6. **Testability**: Easy to validate changes on test system
7. **Completeness**: All necessary services and configs included

Your mission: Make AI-Homelab the best automated homelab deployment tool possible.