Wiki major update

updated with recent documentation
kelinfoxy
2026-01-21 19:18:39 -05:00
parent 30e0481685
commit ef55974b50
98 changed files with 32244 additions and 10 deletions


@@ -0,0 +1,293 @@
# Common Issues and Solutions
## Installation Issues
### Docker Group Permissions
**Symptom:** `permission denied while trying to connect to the Docker daemon socket`
**Solution:**
```bash
# After running setup script, you must log out and back in
exit # or logout
# Or without logging out:
newgrp docker
```
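To confirm whether the current session already has the group applied, a check along these lines may help. It is only a sketch: `has_group` is a hypothetical helper, not part of the setup script.

```bash
# has_group GROUP NAME...: succeeds when GROUP is one of the listed names.
# (Hypothetical helper, not part of the homelab scripts.)
has_group() {
  wanted=$1
  shift
  for name in "$@"; do
    [ "$name" = "$wanted" ] && return 0
  done
  return 1
}

# `id -nG` lists the groups of the current session, which is what matters
# for access to the Docker socket.
if has_group docker $(id -nG); then
  echo "docker group is active in this session"
else
  echo "docker group not active yet: log out and back in, or run 'newgrp docker'"
fi
```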
### Password Hash Generation Timeout
**Symptom:** Password hash generation takes longer than 60 seconds
**Causes:**
- High CPU usage from other processes
- Slow system (argon2 is computationally intensive)
**Solutions:**
```bash
# Check system resources
top
# or
htop
# Authelia chooses the argon2 parameters itself, so there is nothing to tune here.
# If the system is slow, just wait: on very slow systems hashing may take up to 2 minutes.
```
### Port Conflicts
**Symptom:** `bind: address already in use`
**Solution:**
```bash
# Check what's using the port
sudo lsof -i :80
sudo lsof -i :443
# Common culprits:
# - Apache: sudo systemctl stop apache2
# - Nginx: sudo systemctl stop nginx
# - Another container: docker ps (find and stop it)
```
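On hosts without `lsof`, `ss` from iproute2 reports the same listeners. The sketch below assumes the usual `ss -ltn` layout, with the local address in the fourth column; `port_in_use` is a hypothetical helper name.

```bash
# port_in_use PORT: reads `ss -ltn` output on stdin and succeeds when some
# socket is listening on PORT. Assumes the local address is column 4.
port_in_use() {
  awk -v port="$1" '
    NR > 1 {                       # skip the header row
      n = split($4, parts, ":")    # works for 0.0.0.0:80 and [::]:80
      if (parts[n] == port) found = 1
    }
    END { exit !found }
  '
}

# Usage: ss -ltn | port_in_use 80 && echo "port 80 is taken"
```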
## Deployment Issues
### Authelia Restart Loop
**Symptom:** Authelia container keeps restarting
**Common causes:**
1. **Password hash corruption** - Fixed in current version
2. **Encryption key mismatch** - Changed .env after initial deployment
**Solution:**
```bash
# Check logs
sudo docker logs authelia
# If encryption key error, reset Authelia database:
sudo ./scripts/reset-test-environment.sh
# Then run setup and deploy again
```
### Watchtower Issues
**Status:** Temporarily disabled due to Docker API compatibility
**Issue:** Docker 29.x requires a minimum API version of 1.44, and the available Watchtower versions are not compatible with it
**Current state:** Commented out in infrastructure.yml with documentation
**Manual updates instead:**
```bash
# Update all images in a stack
cd /opt/stacks/stack-name/
docker compose pull
docker compose up -d
```
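While Watchtower is disabled, the same two commands can be looped over every stack. This is a sketch under assumptions: it expects one compose file per direct subdirectory of `/opt/stacks`, and both `update_all_stacks` and the `COMPOSE` override are hypothetical names introduced here.

```bash
# Command used to talk to Compose; override it (for example COMPOSE=echo)
# for a dry run.
COMPOSE="${COMPOSE:-docker compose}"

# update_all_stacks [ROOT]: pull and recreate every stack under ROOT.
# Hypothetical helper; assumes one compose file per direct subdirectory.
update_all_stacks() {
  root=${1:-/opt/stacks}
  for dir in "$root"/*/; do
    # Skip directories that do not hold a compose file
    [ -f "${dir}docker-compose.yml" ] || [ -f "${dir}compose.yaml" ] || continue
    echo "== ${dir}"
    # Subshell keeps the cd local; a failed stack does not stop the loop
    ( cd "$dir" && $COMPOSE pull && $COMPOSE up -d ) \
      || echo "update failed in ${dir}" >&2
  done
}

# Usage: update_all_stacks /opt/stacks
```

Running it once with `COMPOSE=echo` prints the commands for each stack without touching any container.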
### Homepage Not Showing Correct URLs
**Symptom:** Homepage shows `{{HOMEPAGE_VAR_DOMAIN}}` instead of actual domain
**Cause:** Old deployment script version
**Solution:**
```bash
# Re-run deployment script (safe - won't affect running services)
sudo ./scripts/deploy-homelab.sh
# Or manually fix:
cd /opt/stacks/dashboards/homepage
sudo find . -name "*.yaml" -exec sed -i "s/{{HOMEPAGE_VAR_DOMAIN}}/yourdomain.duckdns.org/g" {} \;
```
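To see which files still carry the unrendered placeholder, before or after the manual `sed`, a search like the following may be useful. `list_unrendered` is a hypothetical helper; like the command above, it only looks at `*.yaml` files.

```bash
# list_unrendered [DIR]: print YAML files under DIR that still contain the
# literal placeholder. Prints nothing when every file has been rendered.
list_unrendered() {
  find "${1:-.}" -type f -name '*.yaml' \
    -exec grep -lF '{{HOMEPAGE_VAR_DOMAIN}}' {} +
}

# Usage: list_unrendered /opt/stacks/dashboards/homepage
```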
### Services Not Accessible via HTTPS
**Symptom:** Can't access services at https://service.yourdomain.duckdns.org
**Solutions:**
1. **Check Traefik is running:**
```bash
sudo docker ps | grep traefik
sudo docker logs traefik
```
2. **Verify DuckDNS is updating:**
```bash
sudo docker logs duckdns
# Should show "Your IP has been updated"
```
3. **Check ports are open:**
```bash
sudo ufw status
# Should show 80/tcp and 443/tcp ALLOW
```
4. **Verify domain resolves:**
```bash
nslookup yourdomain.duckdns.org
# Should return your public IP
```
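The firewall check in step 3 can be scripted instead of read by eye. The sketch assumes rules appear in `ufw status` as `80/tcp ALLOW ...`; rules written as application profiles, port ranges, or bare port numbers will not match. `ufw_allows` is a hypothetical helper name.

```bash
# ufw_allows PORT: reads `ufw status` output on stdin and succeeds when a
# rule of the form "PORT/tcp ALLOW ..." is present.
ufw_allows() {
  awk -v rule="$1/tcp" '
    $1 == rule && $2 == "ALLOW" { found = 1 }
    END { exit !found }
  '
}

# Usage:
# sudo ufw status | ufw_allows 443 || echo "443/tcp is not allowed"
```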
## Service-Specific Issues
### Gluetun VPN Not Connecting
**Symptom:** Gluetun shows connection errors
**Solutions:**
```bash
# Check credentials in .env (the output contains secrets; do not paste it publicly)
grep SURFSHARK ~/AI-Homelab/.env
# Check Gluetun logs
sudo docker logs gluetun
# Common fixes:
# 1. Wrong server region
# 2. Invalid credentials
# 3. WireGuard not supported by provider
```
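To check that the credentials are present without printing them, a helper like this one can report only the names of missing or empty variables. It is a sketch: `missing_env_keys` is hypothetical, and the variable names in the usage line are placeholders for whatever your `.env` actually defines. A quoted empty value such as `KEY=""` still counts as set.

```bash
# missing_env_keys FILE KEY...: print each KEY that is absent from FILE or
# has an empty value. Values are never printed.
missing_env_keys() {
  file=$1
  shift
  for key in "$@"; do
    # Match KEY= followed by at least one character
    grep -Eq "^${key}=.+" "$file" || echo "$key"
  done
}

# Usage (variable names are placeholders; use the ones from your .env):
# missing_env_keys ~/AI-Homelab/.env SURFSHARK_USER SURFSHARK_PASSWORD
```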
### Pi-hole DNS Not Working
**Symptom:** Devices can't resolve DNS through Pi-hole
**Solutions:**
```bash
# Check Pi-hole is running
sudo docker ps | grep pihole
# Verify port 53 is available
sudo lsof -i :53
# If systemd-resolved is holding port 53, stop and disable it in one step:
sudo systemctl disable --now systemd-resolved
# The host itself then needs another resolver configured in /etc/resolv.conf
```
### Dockge Shows Empty
**Symptom:** No stacks visible in Dockge
**Cause:** Stacks not copied to /opt/stacks/
**Solution:**
```bash
# Check what exists
ls -la /opt/stacks/
# Re-run deployment to copy stacks
sudo ./scripts/deploy-homelab.sh
```
## Performance Issues
### Slow Container Start Times
**Causes:**
- First-time image pulls
- Slow disk (not using SSD/NVMe)
- Insufficient RAM
**Solutions:**
```bash
# Pre-pull images
cd /opt/stacks/stack-name/
docker compose pull
# Check disk performance
sudo hdparm -Tt /dev/sda # Replace with your disk
# Check RAM usage
free -h
# Move /opt/stacks to faster disk if needed
```
### High CPU Usage from Authelia
**Normal:** Argon2 password hashing is intentionally CPU-intensive for security
**If persistent:**
```bash
# Check what's causing load
sudo docker stats
# If Authelia constantly high:
sudo docker logs authelia
# Look for repeated authentication attempts (possible attack)
```
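Repeated attempts are easier to spot when grouped by source address. The sketch below assumes Authelia's text log format, where failed logins contain the word "Unsuccessful" and each request line carries a `remote_ip=` field; if your log format differs, pass a different pattern. `attempts_by_ip` is a hypothetical helper name.

```bash
# attempts_by_ip [PATTERN]: reads log lines on stdin, keeps those matching
# PATTERN (case-insensitive), and counts them per remote_ip field.
attempts_by_ip() {
  grep -iE "${1:-unsuccessful}" \
    | grep -oE 'remote_ip=[^ ]+' \
    | sort | uniq -c | sort -rn
}

# Usage: sudo docker logs authelia 2>&1 | attempts_by_ip
```

The busiest address is listed first; a single address with a large count is the case worth investigating.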
## Reset and Recovery
### Complete Reset (Testing Only)
**Warning:** This is destructive!
```bash
# Use the safe reset script
sudo ./scripts/reset-test-environment.sh
# Then re-run setup and deploy
sudo ./scripts/setup-homelab.sh
sudo ./scripts/deploy-homelab.sh
```
### Partial Reset (Single Stack)
```bash
# Stop and remove specific stack
cd /opt/stacks/stack-name/
docker compose down -v # -v removes volumes (data loss!)
# Redeploy
docker compose up -d
```
### Backup Before Reset
```bash
# Backup important data
sudo tar czf ~/homelab-backup-$(date +%Y%m%d).tar.gz /opt/stacks/
# Back up a specific named volume (replace stack_volume with the volume name)
docker run --rm \
  -v stack_volume:/data \
  -v "$(pwd)":/backup \
  busybox tar czf /backup/volume-backup.tar.gz /data
```
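Because the reset steps are destructive, it may be worth confirming that an archive is readable before relying on it. A minimal sketch, assuming GNU or BusyBox `tar`; `backup_dir` is a hypothetical helper, not one of the repository scripts.

```bash
# backup_dir SRC DEST: archive SRC into DEST (a .tar.gz path), then list the
# archive to confirm it is readable before anything destructive is run.
backup_dir() {
  src=$1
  dest=$2
  # -C keeps paths in the archive relative to the parent of SRC
  tar czf "$dest" -C "$(dirname "$src")" "$(basename "$src")" || return 1
  tar tzf "$dest" > /dev/null || { echo "archive unreadable: $dest" >&2; return 1; }
  echo "backup ok: $dest"
}

# Usage (run as a user that can read /opt/stacks):
# backup_dir /opt/stacks ~/homelab-backup-$(date +%Y%m%d).tar.gz
```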
## Getting Help
1. **Check container logs:**
```bash
sudo docker logs container-name
sudo docker logs -f container-name # Follow logs
```
2. **Use Dozzle for real-time logs:**
Access at https://dozzle.yourdomain.duckdns.org
3. **Check the AI assistant:**
Ask Copilot in VS Code for specific issues
4. **Verify configuration:**
```bash
# Check .env file
cat ~/AI-Homelab/.env
# Check compose file
cat /opt/stacks/stack-name/docker-compose.yml
```
5. **Docker system info:**
```bash
docker info
docker version
docker system df # Disk usage
```


@@ -0,0 +1,223 @@
# SSL Certificate Issues with DuckDNS DNS Challenge
## Issue Summary
Wildcard SSL certificate acquisition via DuckDNS DNS-01 challenge consistently fails due to network connectivity issues with DuckDNS authoritative nameservers.
## Root Cause Analysis
### Why Both Domain and Wildcard are Required
The certificate requested here lists two names as SANs (Subject Alternative Names), and Let's Encrypt validates every name on a certificate:
- `kelin-hass.duckdns.org` (apex domain)
- `*.kelin-hass.duckdns.org` (wildcard)
A wildcard does not match the apex domain itself, so a wildcard-only certificate would leave `https://kelin-hass.duckdns.org` uncovered. Both names are therefore requested together, and each one must pass its own DNS-01 validation.
### Technical Root Cause: Unreachable Authoritative Nameservers
**Problem**: DuckDNS authoritative nameservers (ns1-ns9.duckdns.org) are **unreachable** from the test system's network.
**Evidence**:
```bash
# Direct ping to DuckDNS nameservers - 100% packet loss
ping -c 2 ns1.duckdns.org # FAIL: 100% packet loss
ping -c 2 99.79.143.35 # FAIL: 100% packet loss (direct IP)
# DNS queries to authoritative servers - timeout
dig @99.79.143.35 kelin-hass.duckdns.org # FAIL: timeout
dig @35.182.183.211 kelin-hass.duckdns.org # FAIL: timeout
dig @3.97.58.28 kelin-hass.duckdns.org # FAIL: timeout
# Queries to recursive resolvers - SUCCESS
dig @8.8.8.8 kelin-hass.duckdns.org # SUCCESS
dig @1.1.1.1 kelin-hass.duckdns.org # SUCCESS
# Traceroute analysis
traceroute 99.79.143.35
# Shows traffic reaching hop 5 (74.41.143.193) then black hole
# DuckDNS nameservers are hosted on Amazon AWS
# Suggests AWS security groups or ISP blocking
```
**Why This Matters**:
Traefik's ACME client (lego library) requires verification against authoritative nameservers after setting TXT records. Even though:
- DuckDNS API successfully sets TXT records ✅
- TXT records propagate to public DNS (8.8.8.8, 1.1.1.1) ✅
- Recursive DNS queries work ✅
By default, the lego library also queries the authoritative nameservers directly to verify propagation, and this step fails because those servers are unreachable from this network.
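The reachability evidence above can be re-collected in one pass with a small loop. This is a sketch: `probe_servers` is a hypothetical helper, and it relies on `dig` exiting non-zero when no reply arrives within the timeout.

```bash
# probe_servers NAME SERVER...: ask each DNS server for the TXT record of
# NAME and report whether any reply came back at all.
probe_servers() {
  name=$1
  shift
  for server in "$@"; do
    # One attempt, 3 second timeout; dig exits non-zero when no reply arrives
    if dig +time=3 +tries=1 +short "@${server}" "$name" TXT > /dev/null 2>&1; then
      echo "${server} reachable"
    else
      echo "${server} unreachable"
    fi
  done
}

# Usage:
# probe_servers _acme-challenge.kelin-hass.duckdns.org 1.1.1.1 8.8.8.8 99.79.143.35
```

If the recursive resolvers answer and the authoritative addresses do not, the network shows the same split described in this section.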
## Attempted Solutions
### Configuration Optimizations Tried
1. **Increased propagation delay** - `delayBeforeCheck: 300` (5 minutes)
- Result: Delay worked, but authoritative NS check still failed
2. **Extended timeout** - `DUCKDNS_PROPAGATION_TIMEOUT=600` (10 minutes)
- Result: Longer timeout observed, but same NS unreachability issue
3. **LEGO environment variables**:
```yaml
- LEGO_DISABLE_CNAME_SUPPORT=true
- LEGO_EXPERIMENTAL_DNS_TCP_SUPPORT=true
- LEGO_DNS_TIMEOUT=60
- LEGO_DNS_RESOLVERS=1.1.1.1:53,8.8.8.8:53
- LEGO_DISABLE_CP=true
```
- Result: Forced use of recursive resolvers for some queries, but SOA lookups still failed
4. **Explicit Docker DNS configuration**:
```yaml
dns:
- 1.1.1.1
- 8.8.8.8
```
- Result: Container used correct resolvers, but lego still attempted authoritative NS queries
5. **VPN routing test** (through Gluetun container)
- Result: DuckDNS nameservers also unreachable through VPN
### Error Messages Observed
**Phase 1: Direct authoritative nameserver timeout**
```
propagation: time limit exceeded: last error: authoritative nameservers:
DNS call error: read udp 172.19.0.2:53666->3.97.58.28:53: i/o timeout
[ns=ns6.duckdns.org.:53, question='_acme-challenge.kelin-hass.duckdns.org. IN TXT']
```
**Phase 2: SOA record query failure**
```
propagation: time limit exceeded: last error: could not find zone:
[fqdn=_acme-challenge.kelin-hass.duckdns.org.]
unexpected response for 'kelin-hass.duckdns.org.'
[question='kelin-hass.duckdns.org. IN SOA', code=SERVFAIL]
```
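When reading long Traefik logs, the two phases can be told apart by the fixed phrases in the messages above. A minimal sketch; `classify_acme_errors` is a hypothetical helper and matches only the two strings quoted in this section.

```bash
# classify_acme_errors: reads Traefik log text on stdin and prints which of
# the two failure phases appear in it. Exits non-zero when neither is found.
classify_acme_errors() {
  log=$(cat)
  status=1
  if printf '%s\n' "$log" | grep -q 'authoritative nameservers'; then
    echo "phase 1: authoritative nameserver timeout"
    status=0
  fi
  if printf '%s\n' "$log" | grep -q 'could not find zone'; then
    echo "phase 2: SOA lookup failure"
    status=0
  fi
  return "$status"
}

# Usage: sudo docker logs traefik 2>&1 | classify_acme_errors
```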
## Working Configuration (Self-Signed Certificates)
Current deployment is **fully functional** with self-signed certificates:
- All services accessible via HTTPS ✅
- Can proceed through browser certificate warnings ✅
- Traefik routing works correctly ✅
- Authelia SSO functional ✅
- All stacks deployed successfully ✅
## Recommended Solutions for Next Test Run
### Option 1: Switch to Cloudflare DNS (RECOMMENDED)
**Pros**:
- Cloudflare nameservers are highly reliable and globally accessible
- Supports wildcard certificates via DNS-01 challenge
- Better performance and propagation times
- Well-tested with Traefik
**Steps**:
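1. Use a domain whose DNS is hosted on Cloudflare (free tier sufficient). A `duckdns.org` subdomain cannot be moved to Cloudflare, so this requires a domain you own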
2. Obtain Cloudflare API token (Zone:DNS:Edit permission)
3. Update `traefik.yml`:
```yaml
dnsChallenge:
  provider: cloudflare
  delayBeforeCheck: 30 # Cloudflare propagates quickly
  resolvers:
    - "1.1.1.1:53"
    - "1.0.0.1:53"
```
4. Update `docker-compose.yml`:
```yaml
environment:
- CF_DNS_API_TOKEN=${CF_DNS_API_TOKEN}
```
### Option 2: Investigate Network Blocking
**Diagnostic Steps**:
1. Test from different network (mobile hotspot, different ISP)
2. Contact ISP to check if AWS IP ranges are blocked
3. Check router/firewall for DNS filtering or AWS blocking
4. Test with different VPN provider
**If network is the issue**:
- May need to use VPN or proxy for Traefik container
- Consider hosting Traefik on different network segment
### Option 3: HTTP-01 Challenge (Non-Wildcard)
**Pros**:
- More reliable in this environment (no DNS-01 propagation check)
- Works with current DuckDNS setup
- No authoritative nameserver queries from the Traefik host
**Cons**:
- ❌ No wildcard certificate (must specify each subdomain)
- Requires port 80 accessible from internet
- Separate certificate for each subdomain
**Steps**:
1. Update `traefik.yml`:
```yaml
httpChallenge:
  entryPoint: web
```
2. Remove wildcard domain label from Traefik service:
```yaml
# Remove this line:
- "traefik.http.routers.traefik.tls.domains[0].sans=*.${DOMAIN}"
```
3. Add explicit TLS configuration to each service's labels
### Option 4: Use Alternative DNS Provider with DuckDNS
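Keep DuckDNS for dynamic IP updates, but use a different DNS provider for certificates. This requires a domain you own, since records under `duckdns.org` cannot be managed in Cloudflare: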
1. Use Cloudflare for DNS records
2. Keep DuckDNS container for IP updates
3. Create CNAME in Cloudflare pointing to DuckDNS
4. Use Cloudflare for certificate challenge
## Files to Update in Repository
### ~/AI-Homelab/stacks/core/traefik/traefik.yml
Document both HTTP and DNS challenge configurations with clear comments.
### ~/AI-Homelab/stacks/core/docker-compose.yml
Ensure wildcard domain configuration is correct (it is currently):
```yaml
- "traefik.http.routers.traefik.tls.domains[0].main=${DOMAIN}"
- "traefik.http.routers.traefik.tls.domains[0].sans=*.${DOMAIN}"
```
**This is correct** - keep both apex and wildcard.
### ~/AI-Homelab/docs/service-docs/traefik.md
Add troubleshooting section for DuckDNS DNS challenge issues.
## Success Criteria for Next Test
### Must Have:
- [ ] Valid wildcard SSL certificate obtained
- [ ] Certificate automatically renews
- [ ] No browser certificate warnings
- [ ] Documented working configuration
### Should Have:
- [ ] Certificate acquisition completes in < 5 minutes
- [ ] Reliable across multiple test runs
- [ ] Clear error messages if failure occurs
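The "no browser certificate warnings" criterion can be checked from the command line by comparing the certificate's subject and issuer. A sketch under assumptions: `is_self_signed` is a hypothetical helper that expects the two-line output of `openssl x509 -noout -subject -issuer`, and the host name in the usage comment is a placeholder.

```bash
# is_self_signed: reads the output of `openssl x509 -noout -subject -issuer`
# on stdin and succeeds when subject and issuer are identical.
is_self_signed() {
  awk '
    /^subject=/ { sub(/^subject= */, ""); subject = $0 }
    /^issuer=/  { sub(/^issuer= */, "");  issuer = $0 }
    END { exit !(subject != "" && subject == issuer) }
  '
}

# Usage (host name is a placeholder):
# openssl s_client -connect traefik.yourdomain.duckdns.org:443 \
#   -servername traefik.yourdomain.duckdns.org < /dev/null 2>/dev/null \
#   | openssl x509 -noout -subject -issuer | is_self_signed \
#   && echo "still self-signed"
```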
## Timeline Analysis
**First Test Run**: Certificates reportedly worked
**Current Test Run**: Consistent failures
**Possible Explanations**:
1. DuckDNS infrastructure changes (AWS security policies)
2. ISP routing changes
3. Increased AWS security after abuse/attacks
4. Different network environment during first test
## Conclusion
**Current Status**: System is production-ready except for SSL certificate warnings.
**Blocking Issue**: DuckDNS authoritative nameservers unreachable from current network environment.
**Recommendation**: **Switch to Cloudflare DNS** for next test run. Of the options above it is the most reliable, and the Cloudflare provider is well tested with Traefik for automated certificate management.
**Alternative**: If staying with DuckDNS is required, investigate network connectivity issues with ISP and consider using HTTP-01 challenge (losing wildcard capability).