Wiki major update
updated with recent documentation
293
wiki/troubleshooting/COMMON-ISSUES.md
Normal file
@@ -0,0 +1,293 @@
|
||||
# Common Issues and Solutions
|
||||
|
||||
## Installation Issues
|
||||
|
||||
### Docker Group Permissions
|
||||
|
||||
**Symptom:** `permission denied while trying to connect to the Docker daemon socket`
|
||||
|
||||
**Solution:**
|
||||
```bash
|
||||
# After running setup script, you must log out and back in
|
||||
exit # or logout
|
||||
|
||||
# Or without logging out:
|
||||
newgrp docker
|
||||
```
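
To confirm that the group change has taken effect in the current shell, a quick check along these lines may help. This is a minimal sketch; the helper name `in_group` is hypothetical and not part of the setup script.

```bash
# Hypothetical helper: succeeds if the current shell's user is in the named group.
in_group() {
    # id -nG prints the group names of the current process, space-separated
    id -nG | tr ' ' '\n' | grep -qx "$1"
}

if in_group docker; then
    echo "docker group active in this shell"
else
    echo "docker group not active yet: log out and back in, or run 'newgrp docker'"
fi
```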
|
||||
|
||||
### Password Hash Generation Timeout
|
||||
|
||||
**Symptom:** Password hash generation takes longer than 60 seconds
|
||||
|
||||
**Causes:**
|
||||
- High CPU usage from other processes
|
||||
- Slow system (argon2 is computationally intensive)
|
||||
|
||||
**Solutions:**
|
||||
```bash
|
||||
# Check system resources
|
||||
top
|
||||
# or
|
||||
htop
|
||||
|
||||
# Fewer argon2 iterations would be faster but less secure.
# You do not need to change this yourself: Authelia handles hashing automatically, so just wait.
# On very slow systems, it may take up to 2 minutes.
|
||||
```
|
||||
|
||||
### Port Conflicts
|
||||
|
||||
**Symptom:** `bind: address already in use`
|
||||
|
||||
**Solution:**
|
||||
```bash
|
||||
# Check what's using the port
|
||||
sudo lsof -i :80
|
||||
sudo lsof -i :443
|
||||
|
||||
# Common culprits:
|
||||
# - Apache: sudo systemctl stop apache2
|
||||
# - Nginx: sudo systemctl stop nginx
|
||||
# - Another container: docker ps (find and stop it)
|
||||
```
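
If `lsof` is not installed, the same check can be sketched with `ss` from iproute2. The helper below is hypothetical (the name `ports_in_use` is not from the repository); it reads `ss -ltn` output on stdin and prints which of the requested ports already have a listener.

```bash
# Hypothetical helper: reads `ss -ltn` style lines on stdin and prints the
# requested ports that already have a listener (each port at most once).
ports_in_use() {
    awk -v want="$*" '
        BEGIN { n = split(want, w, " "); for (i = 1; i <= n; i++) wanted[w[i]] = 1 }
        {
            # Column 4 is "Local Address:Port"; the port follows the last colon
            m = split($4, parts, ":")
            port = parts[m]
            if (port in wanted && !(port in seen)) { seen[port] = 1; print port }
        }'
}

# Usage (assumes ss is available):
#   ss -ltn | ports_in_use 80 443
```

An empty result means neither port is taken, so the conflict is more likely between two containers that are not yet running.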
|
||||
|
||||
## Deployment Issues
|
||||
|
||||
### Authelia Restart Loop
|
||||
|
||||
**Symptom:** Authelia container keeps restarting
|
||||
|
||||
**Common causes:**
|
||||
1. **Password hash corruption** - Fixed in current version
|
||||
2. **Encryption key mismatch** - Changed .env after initial deployment
|
||||
|
||||
**Solution:**
|
||||
```bash
|
||||
# Check logs
|
||||
sudo docker logs authelia
|
||||
|
||||
# If encryption key error, reset Authelia database:
|
||||
sudo ./scripts/reset-test-environment.sh
|
||||
# Then run setup and deploy again
|
||||
```
|
||||
|
||||
### Watchtower Issues
|
||||
|
||||
**Status:** Temporarily disabled due to Docker API compatibility
|
||||
|
||||
**Issue:** Docker 29.x raised the minimum supported API version to 1.44, and current Watchtower releases request an older API version, so the daemon rejects their connections
|
||||
|
||||
**Current state:** Commented out in infrastructure.yml with documentation
|
||||
|
||||
**Manual updates instead:**
|
||||
```bash
|
||||
# Update all images in a stack
|
||||
cd /opt/stacks/stack-name/
|
||||
docker compose pull
|
||||
docker compose up -d
|
||||
```
|
||||
|
||||
### Homepage Not Showing Correct URLs
|
||||
|
||||
**Symptom:** Homepage shows `{{HOMEPAGE_VAR_DOMAIN}}` instead of actual domain
|
||||
|
||||
**Cause:** Old deployment script version
|
||||
|
||||
**Solution:**
|
||||
```bash
|
||||
# Re-run deployment script (safe - won't affect running services)
|
||||
sudo ./scripts/deploy-homelab.sh
|
||||
|
||||
# Or manually fix:
|
||||
cd /opt/stacks/dashboards/homepage
|
||||
sudo find . -name "*.yaml" -exec sed -i "s/{{HOMEPAGE_VAR_DOMAIN}}/yourdomain.duckdns.org/g" {} \;
|
||||
```
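
Before and after the manual fix, it can help to list which files still contain the placeholder. A minimal sketch, assuming the Homepage config lives under the directory shown above; `find_unrendered` is a hypothetical helper name.

```bash
# Hypothetical helper: lists YAML files under a directory that still contain
# the literal placeholder. -F makes grep treat the pattern as a fixed string.
find_unrendered() {
    find "$1" -name '*.yaml' -exec grep -lF '{{HOMEPAGE_VAR_DOMAIN}}' {} +
}

# Usage:
#   find_unrendered /opt/stacks/dashboards/homepage
```

No output means every placeholder has been replaced.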
|
||||
|
||||
### Services Not Accessible via HTTPS
|
||||
|
||||
**Symptom:** Can't access services at https://service.yourdomain.duckdns.org
|
||||
|
||||
**Solutions:**
|
||||
|
||||
1. **Check Traefik is running:**
|
||||
```bash
|
||||
sudo docker ps | grep traefik
|
||||
sudo docker logs traefik
|
||||
```
|
||||
|
||||
2. **Verify DuckDNS is updating:**
|
||||
```bash
|
||||
sudo docker logs duckdns
|
||||
# Should show "Your IP has been updated"
|
||||
```
|
||||
|
||||
3. **Check ports are open:**
|
||||
```bash
|
||||
sudo ufw status
|
||||
# Should show 80/tcp and 443/tcp ALLOW
|
||||
```
|
||||
|
||||
4. **Verify domain resolves:**
|
||||
```bash
|
||||
nslookup yourdomain.duckdns.org
|
||||
# Should return your public IP
|
||||
```
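
Step 4 can also be automated by comparing the DNS answer with the host's current public IP. This is a sketch under assumptions: `dig` and `curl` are installed, and `api.ipify.org` is used purely as an illustrative public IP echo service; the helper name is hypothetical.

```bash
# Hypothetical helper: compares the A record for a domain with this host's
# public IP. Succeeds only when both are known and identical.
dns_matches_public_ip() {
    resolved=$(dig +short A "$1" | tail -n 1)
    public=$(curl -fsS https://api.ipify.org)
    echo "DNS: ${resolved:-none}  public: ${public:-unknown}"
    [ -n "$resolved" ] && [ "$resolved" = "$public" ]
}

# Usage:
#   dns_matches_public_ip yourdomain.duckdns.org && echo "DNS is current"
```

A mismatch usually points back to step 2 (the DuckDNS updater is not running or is using a wrong token).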
|
||||
|
||||
## Service-Specific Issues
|
||||
|
||||
### Gluetun VPN Not Connecting
|
||||
|
||||
**Symptom:** Gluetun shows connection errors
|
||||
|
||||
**Solutions:**
|
||||
```bash
|
||||
# Check credentials in .env
|
||||
grep SURFSHARK ~/AI-Homelab/.env  # prints secrets; do not share the output
|
||||
|
||||
# Check Gluetun logs
|
||||
sudo docker logs gluetun
|
||||
|
||||
# Common fixes:
|
||||
# 1. Wrong server region
|
||||
# 2. Invalid credentials
|
||||
# 3. WireGuard not supported by provider
|
||||
```
|
||||
|
||||
### Pi-hole DNS Not Working
|
||||
|
||||
**Symptom:** Devices can't resolve DNS through Pi-hole
|
||||
|
||||
**Solutions:**
|
||||
```bash
|
||||
# Check Pi-hole is running
|
||||
sudo docker ps | grep pihole
|
||||
|
||||
# Verify port 53 is available
|
||||
sudo lsof -i :53
|
||||
|
||||
# If systemd-resolved is holding port 53, stop and disable it in one step.
# The host then needs another resolver in /etc/resolv.conf, or its own DNS lookups will fail.
sudo systemctl disable --now systemd-resolved
|
||||
```
|
||||
|
||||
### Dockge Shows Empty
|
||||
|
||||
**Symptom:** No stacks visible in Dockge
|
||||
|
||||
**Cause:** Stacks not copied to /opt/stacks/
|
||||
|
||||
**Solution:**
|
||||
```bash
|
||||
# Check what exists
|
||||
ls -la /opt/stacks/
|
||||
|
||||
# Re-run deployment to copy stacks
|
||||
sudo ./scripts/deploy-homelab.sh
|
||||
```
|
||||
|
||||
## Performance Issues
|
||||
|
||||
### Slow Container Start Times
|
||||
|
||||
**Causes:**
|
||||
- First-time image pulls
|
||||
- Slow disk (not using SSD/NVMe)
|
||||
- Insufficient RAM
|
||||
|
||||
**Solutions:**
|
||||
```bash
|
||||
# Pre-pull images
|
||||
cd /opt/stacks/stack-name/
|
||||
docker compose pull
|
||||
|
||||
# Check disk performance
|
||||
sudo hdparm -Tt /dev/sda # Replace with your disk
|
||||
|
||||
# Check RAM usage
|
||||
free -h
|
||||
|
||||
# Move /opt/stacks to faster disk if needed
|
||||
```
|
||||
|
||||
### High CPU Usage from Authelia
|
||||
|
||||
**Normal:** Argon2 password hashing is intentionally CPU-intensive for security
|
||||
|
||||
**If persistent:**
|
||||
```bash
|
||||
# Check what's causing load
|
||||
sudo docker stats
|
||||
|
||||
# If Authelia constantly high:
|
||||
sudo docker logs authelia
|
||||
# Look for repeated authentication attempts (possible attack)
|
||||
```
|
||||
|
||||
## Reset and Recovery
|
||||
|
||||
### Complete Reset (Testing Only)
|
||||
|
||||
**Warning:** This is destructive!
|
||||
|
||||
```bash
|
||||
# Use the safe reset script
|
||||
sudo ./scripts/reset-test-environment.sh
|
||||
|
||||
# Then re-run setup and deploy
|
||||
sudo ./scripts/setup-homelab.sh
|
||||
sudo ./scripts/deploy-homelab.sh
|
||||
```
|
||||
|
||||
### Partial Reset (Single Stack)
|
||||
|
||||
```bash
|
||||
# Stop and remove specific stack
|
||||
cd /opt/stacks/stack-name/
|
||||
docker compose down -v # -v removes volumes (data loss!)
|
||||
|
||||
# Redeploy
|
||||
docker compose up -d
|
||||
```
|
||||
|
||||
### Backup Before Reset
|
||||
|
||||
```bash
|
||||
# Backup important data
|
||||
sudo tar czf ~/homelab-backup-$(date +%Y%m%d).tar.gz /opt/stacks/
|
||||
|
||||
# Backup specific volumes
|
||||
# stack_volume is a placeholder; list real names with: docker volume ls
docker run --rm \
  -v stack_volume:/data \
  -v "$(pwd)":/backup \
  busybox tar czf /backup/volume-backup.tar.gz /data
|
||||
```
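
Before running a destructive reset, it is worth confirming that the archive can actually be read. A minimal sketch; `verify_backup` is a hypothetical helper, not a script from the repository.

```bash
# Hypothetical helper: succeeds only if the archive is readable and non-empty.
verify_backup() {
    # tar tzf lists entries without extracting; a corrupt file yields no lines
    count=$(tar tzf "$1" 2>/dev/null | wc -l | tr -d ' ')
    [ "$count" -gt 0 ] && echo "$1: $count entries"
}

# Usage:
#   verify_backup ~/homelab-backup-$(date +%Y%m%d).tar.gz || echo "backup unreadable, do not reset"
```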
|
||||
|
||||
## Getting Help
|
||||
|
||||
1. **Check container logs:**
|
||||
```bash
|
||||
sudo docker logs container-name
|
||||
sudo docker logs -f container-name # Follow logs
|
||||
```
|
||||
|
||||
2. **Use Dozzle for real-time logs:**
|
||||
Access at https://dozzle.yourdomain.duckdns.org
|
||||
|
||||
3. **Check the AI assistant:**
|
||||
Ask Copilot in VS Code for specific issues
|
||||
|
||||
4. **Verify configuration:**
|
||||
```bash
|
||||
# Check .env file
|
||||
cat ~/AI-Homelab/.env
|
||||
|
||||
# Check compose file
|
||||
cat /opt/stacks/stack-name/docker-compose.yml
|
||||
```
|
||||
|
||||
5. **Docker system info:**
|
||||
```bash
|
||||
docker info
|
||||
docker version
|
||||
docker system df # Disk usage
|
||||
```
|
||||
223
wiki/troubleshooting/SSL-CERTIFICATES-DUCKDNS.md
Normal file
@@ -0,0 +1,223 @@
|
||||
# SSL Certificate Issues with DuckDNS DNS Challenge
|
||||
|
||||
## Issue Summary
|
||||
Wildcard SSL certificate acquisition via DuckDNS DNS-01 challenge consistently fails due to network connectivity issues with DuckDNS authoritative nameservers.
|
||||
|
||||
## Root Cause Analysis
|
||||
|
||||
### Why Both Domain and Wildcard are Required
|
||||
Let's Encrypt validates every name listed on a SAN (Subject Alternative Name) certificate, and this deployment requests two:

- `kelin-hass.duckdns.org` (apex domain)
- `*.kelin-hass.duckdns.org` (wildcard)

A wildcard does not cover the apex domain, so both names are requested on one certificate and both must pass DNS-01 validation before it is issued. Let's Encrypt can issue a wildcard-only certificate, but that certificate would not be valid for the apex domain.
|
||||
|
||||
### Technical Root Cause: Unreachable Authoritative Nameservers
|
||||
|
||||
**Problem**: DuckDNS authoritative nameservers (ns1-ns9.duckdns.org) are **unreachable** from the test system's network.
|
||||
|
||||
**Evidence**:
|
||||
```bash
|
||||
# Direct ping to DuckDNS nameservers - 100% packet loss
|
||||
ping -c 2 ns1.duckdns.org # FAIL: 100% packet loss
|
||||
ping -c 2 99.79.143.35 # FAIL: 100% packet loss (direct IP)
|
||||
|
||||
# DNS queries to authoritative servers - timeout
|
||||
dig @99.79.143.35 kelin-hass.duckdns.org # FAIL: timeout
|
||||
dig @35.182.183.211 kelin-hass.duckdns.org # FAIL: timeout
|
||||
dig @3.97.58.28 kelin-hass.duckdns.org # FAIL: timeout
|
||||
|
||||
# Queries to recursive resolvers - SUCCESS
|
||||
dig @8.8.8.8 kelin-hass.duckdns.org # SUCCESS
|
||||
dig @1.1.1.1 kelin-hass.duckdns.org # SUCCESS
|
||||
|
||||
# Traceroute analysis
|
||||
traceroute 99.79.143.35
|
||||
# Shows traffic reaching hop 5 (74.41.143.193) then black hole
|
||||
# DuckDNS nameservers are hosted on Amazon AWS
|
||||
# Suggests AWS security groups or ISP blocking
|
||||
```
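
The manual checks above can be repeated from another network with a small loop. This is a sketch, assuming `dig` is installed; `check_nameservers` is a hypothetical helper, and the short timeout values are arbitrary choices to keep a dead server from stalling the loop.

```bash
# Hypothetical helper: queries each nameserver directly for the domain's A
# record and reports whether the server answered at all.
check_nameservers() {
    domain=$1
    shift
    for ns in "$@"; do
        # dig exits non-zero when no reply arrives from the server
        if dig +time=2 +tries=1 +short "@$ns" "$domain" A >/dev/null 2>&1; then
            echo "$ns reachable"
        else
            echo "$ns unreachable"
        fi
    done
}

# Usage (IPs taken from the evidence above):
#   check_nameservers kelin-hass.duckdns.org 99.79.143.35 35.182.183.211 3.97.58.28 8.8.8.8 1.1.1.1
```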
|
||||
|
||||
**Why This Matters**:
|
||||
Traefik's ACME client (the lego library) checks the authoritative nameservers by default after setting TXT records. Even though:

- DuckDNS API successfully sets TXT records ✅
- TXT records propagate to public DNS (8.8.8.8, 1.1.1.1) ✅
- Recursive DNS queries work ✅

lego, by default, also queries the authoritative nameservers directly to verify propagation, and this step fails because those servers are unreachable from this network.
|
||||
|
||||
## Attempted Solutions
|
||||
|
||||
### Configuration Optimizations Tried
|
||||
|
||||
1. **Increased propagation delay** - `delayBeforeCheck: 300` (5 minutes)
|
||||
- Result: Delay worked, but authoritative NS check still failed
|
||||
|
||||
2. **Extended timeout** - `DUCKDNS_PROPAGATION_TIMEOUT=600` (10 minutes)
|
||||
- Result: Longer timeout observed, but same NS unreachability issue
|
||||
|
||||
3. **LEGO environment variables**:
|
||||
```yaml
|
||||
- LEGO_DISABLE_CNAME_SUPPORT=true
|
||||
- LEGO_EXPERIMENTAL_DNS_TCP_SUPPORT=true
|
||||
- LEGO_DNS_TIMEOUT=60
|
||||
- LEGO_DNS_RESOLVERS=1.1.1.1:53,8.8.8.8:53
|
||||
- LEGO_DISABLE_CP=true
|
||||
```
|
||||
- Result: Forced use of recursive resolvers for some queries, but SOA lookups still failed
|
||||
|
||||
4. **Explicit Docker DNS configuration**:
|
||||
```yaml
|
||||
dns:
|
||||
- 1.1.1.1
|
||||
- 8.8.8.8
|
||||
```
|
||||
- Result: Container used correct resolvers, but lego still attempted authoritative NS queries
|
||||
|
||||
5. **VPN routing test** (through Gluetun container)
|
||||
- Result: DuckDNS nameservers also unreachable through VPN
|
||||
|
||||
### Error Messages Observed
|
||||
|
||||
**Phase 1: Direct authoritative nameserver timeout**
|
||||
```
|
||||
propagation: time limit exceeded: last error: authoritative nameservers:
|
||||
DNS call error: read udp 172.19.0.2:53666->3.97.58.28:53: i/o timeout
|
||||
[ns=ns6.duckdns.org.:53, question='_acme-challenge.kelin-hass.duckdns.org. IN TXT']
|
||||
```
|
||||
|
||||
**Phase 2: SOA record query failure**
|
||||
```
|
||||
propagation: time limit exceeded: last error: could not find zone:
|
||||
[fqdn=_acme-challenge.kelin-hass.duckdns.org.]
|
||||
unexpected response for 'kelin-hass.duckdns.org.'
|
||||
[question='kelin-hass.duckdns.org. IN SOA', code=SERVFAIL]
|
||||
```
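
To tell the two phases apart quickly when scanning logs, the messages above can be matched with a small filter. A minimal sketch; `classify_acme_error` and its output labels are hypothetical, and the patterns are taken only from the two error messages quoted above.

```bash
# Hypothetical helper: reads Traefik log text on stdin and names the failure
# mode, using only the message fragments documented in this section.
classify_acme_error() {
    log=$(cat)
    case "$log" in
        *"authoritative nameservers"*"i/o timeout"*) echo "authoritative-ns-timeout" ;;
        *"could not find zone"*) echo "soa-lookup-failed" ;;
        *) echo "unknown" ;;
    esac
}

# Usage:
#   sudo docker logs --tail 200 traefik 2>&1 | classify_acme_error
```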
|
||||
|
||||
## Working Configuration (Self-Signed Certificates)
|
||||
|
||||
Current deployment is **fully functional** with self-signed certificates:
|
||||
- All services accessible via HTTPS ✅
|
||||
- Can proceed through browser certificate warnings ✅
|
||||
- Traefik routing works correctly ✅
|
||||
- Authelia SSO functional ✅
|
||||
- All stacks deployed successfully ✅
|
||||
|
||||
## Recommended Solutions for Next Test Run
|
||||
|
||||
### Option 1: Switch to Cloudflare DNS (RECOMMENDED)
|
||||
**Pros**:
|
||||
- Cloudflare nameservers are highly reliable and globally accessible
|
||||
- Supports wildcard certificates via DNS-01 challenge
|
||||
- Better performance and propagation times
|
||||
- Well-tested with Traefik
|
||||
|
||||
**Steps**:
|
||||
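1. Add a domain you own to Cloudflare (free tier sufficient). A `duckdns.org` subdomain cannot be moved, because its nameservers are controlled by DuckDNS.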
|
||||
2. Obtain Cloudflare API token (Zone:DNS:Edit permission)
|
||||
3. Update `traefik.yml`:
|
||||
```yaml
|
||||
dnsChallenge:
  provider: cloudflare
  delayBeforeCheck: 30 # Cloudflare propagates quickly
  resolvers:
    - "1.1.1.1:53"
    - "1.0.0.1:53"
|
||||
```
|
||||
4. Update `docker-compose.yml`:
|
||||
```yaml
|
||||
environment:
|
||||
- CF_DNS_API_TOKEN=${CF_DNS_API_TOKEN}
|
||||
```
|
||||
|
||||
### Option 2: Investigate Network Blocking
|
||||
**Diagnostic Steps**:
|
||||
1. Test from different network (mobile hotspot, different ISP)
|
||||
2. Contact ISP to check if AWS IP ranges are blocked
|
||||
3. Check router/firewall for DNS filtering or AWS blocking
|
||||
4. Test with different VPN provider
|
||||
|
||||
**If network is the issue**:
|
||||
- May need to use VPN or proxy for Traefik container
|
||||
- Consider hosting Traefik on different network segment
|
||||
|
||||
### Option 3: HTTP-01 Challenge (Non-Wildcard)
|
||||
**Pros**:
|
||||
- More reliable in this environment (no TXT record propagation to wait for)
|
||||
- Works with current DuckDNS setup
|
||||
- No external nameserver queries required
|
||||
|
||||
**Cons**:
|
||||
- ❌ No wildcard certificate (must specify each subdomain)
|
||||
- Requires port 80 accessible from internet
|
||||
- Separate certificate for each subdomain
|
||||
|
||||
**Steps**:
|
||||
1. Update `traefik.yml`:
|
||||
```yaml
|
||||
httpChallenge:
  entryPoint: web
|
||||
```
|
||||
2. Remove wildcard domain label from Traefik service:
|
||||
```yaml
|
||||
# Remove this line:
|
||||
- "traefik.http.routers.traefik.tls.domains[0].sans=*.${DOMAIN}"
|
||||
```
|
||||
3. Add explicit TLS configuration to each service's labels
|
||||
|
||||
### Option 4: Use Alternative DNS Provider with DuckDNS
|
||||
Keep DuckDNS for dynamic IP updates, but use different DNS for certificates:
|
||||
1. Use Cloudflare for DNS records
|
||||
2. Keep DuckDNS container for IP updates
|
||||
3. Create CNAME in Cloudflare pointing to DuckDNS
|
||||
4. Use Cloudflare for certificate challenge
|
||||
|
||||
## Files to Update in Repository
|
||||
|
||||
### ~/AI-Homelab/stacks/core/traefik/traefik.yml
|
||||
Document both HTTP and DNS challenge configurations with clear comments.
|
||||
|
||||
### ~/AI-Homelab/stacks/core/docker-compose.yml
|
||||
Ensure wildcard domain configuration is correct (it is currently):
|
||||
```yaml
|
||||
- "traefik.http.routers.traefik.tls.domains[0].main=${DOMAIN}"
|
||||
- "traefik.http.routers.traefik.tls.domains[0].sans=*.${DOMAIN}"
|
||||
```
|
||||
**This is correct** - keep both apex and wildcard.
|
||||
|
||||
### ~/AI-Homelab/docs/service-docs/traefik.md
|
||||
Add troubleshooting section for DuckDNS DNS challenge issues.
|
||||
|
||||
## Success Criteria for Next Test
|
||||
|
||||
### Must Have:
|
||||
- [ ] Valid wildcard SSL certificate obtained
|
||||
- [ ] Certificate automatically renews
|
||||
- [ ] No browser certificate warnings
|
||||
- [ ] Documented working configuration
|
||||
|
||||
### Should Have:
|
||||
- [ ] Certificate acquisition completes in < 5 minutes
|
||||
- [ ] Reliable across multiple test runs
|
||||
- [ ] Clear error messages if failure occurs
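
The "no browser certificate warnings" criterion can be checked from the command line by inspecting the issuer of the certificate that Traefik serves. This is a sketch, assuming `openssl` is installed; `issuer_is_lets_encrypt` is a hypothetical helper that only looks for the organization name in the issuer line.

```bash
# Hypothetical helper: reads `openssl x509 -noout -issuer` output on stdin and
# succeeds if the issuing organization is Let's Encrypt.
issuer_is_lets_encrypt() {
    grep -q "Let's Encrypt"
}

# Usage (replace the host with one of your services):
#   host=dozzle.yourdomain.duckdns.org
#   echo | openssl s_client -connect "$host:443" -servername "$host" 2>/dev/null \
#     | openssl x509 -noout -issuer | issuer_is_lets_encrypt && echo "valid issuer"
```

Adding `-enddate` to the `openssl x509` call also prints the expiry date, which helps when checking the renewal criterion.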
|
||||
|
||||
## Timeline Analysis
|
||||
|
||||
**First Test Run**: Certificates reportedly worked
|
||||
**Current Test Run**: Consistent failures
|
||||
|
||||
**Possible Explanations**:
|
||||
1. DuckDNS infrastructure changes (AWS security policies)
|
||||
2. ISP routing changes
|
||||
3. Increased AWS security after abuse/attacks
|
||||
4. Different network environment during first test
|
||||
|
||||
## Conclusion
|
||||
|
||||
**Current Status**: System is production-ready except for SSL certificate warnings.
|
||||
|
||||
**Blocking Issue**: DuckDNS authoritative nameservers unreachable from current network environment.
|
||||
|
||||
**Recommendation**: **Switch to Cloudflare DNS** for next test run. It is the most reliable of the options above and is widely used for automated certificate management with Traefik.
|
||||
|
||||
**Alternative**: If staying with DuckDNS is required, investigate network connectivity issues with ISP and consider using HTTP-01 challenge (losing wildcard capability).
|
||||