Add docker-compose configurations and SSL troubleshooting docs
- Added compose files for core, infrastructure, and dashboards stacks
- Added Traefik, Authelia, and DuckDNS configuration files
- Added dockge.managed and dockge.url labels to all services
- Updated Watchtower to latest version with DOCKER_API_VERSION=1.44
- Created comprehensive SSL certificate troubleshooting guide for DuckDNS issues
223
docs/troubleshooting/SSL-CERTIFICATES-DUCKDNS.md
Normal file
@@ -0,0 +1,223 @@
# SSL Certificate Issues with DuckDNS DNS Challenge

## Issue Summary
Wildcard SSL certificate acquisition via the DuckDNS DNS-01 challenge consistently fails because the DuckDNS authoritative nameservers cannot be reached from the test network.

## Root Cause Analysis

### Why Both Domain and Wildcard are Required
Let's Encrypt validates every name listed on a SAN (Subject Alternative Name) certificate, and the request here lists two:
- `kelin-hass.duckdns.org` (apex domain)
- `*.kelin-hass.duckdns.org` (wildcard)

A wildcard entry does not cover the apex domain itself, so both names are requested and both must be validated before the certificate is issued. Both DNS-01 challenges use the same TXT record name, `_acme-challenge.kelin-hass.duckdns.org`.
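Both names map to the same DNS-01 TXT record, because the wildcard prefix `*.` is dropped before `_acme-challenge.` is prepended (RFC 8555). A minimal sketch of that derivation follows; the function name `acme_challenge_name` is hypothetical and only illustrates the naming rule:

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive the DNS-01 TXT record name for a certificate name.
# A leading wildcard label "*." is stripped, then "_acme-challenge." is prepended.
acme_challenge_name() {
  local name="$1"
  name="${name#\*.}"   # drop a leading "*." if present
  printf '_acme-challenge.%s\n' "$name"
}

# Print the challenge record for each name on the certificate.
for san in 'kelin-hass.duckdns.org' '*.kelin-hass.duckdns.org'; do
  printf '%s -> %s\n' "$san" "$(acme_challenge_name "$san")"
done
```

Both lines of output end in `_acme-challenge.kelin-hass.duckdns.org`, so the two validations share one record name.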

### Technical Root Cause: Unreachable Authoritative Nameservers

**Problem**: The DuckDNS authoritative nameservers (ns1-ns9.duckdns.org) are **unreachable** from the test system's network.

**Evidence**:
```bash
# Direct ping to DuckDNS nameservers - 100% packet loss
# (ICMP loss alone is not conclusive; the dig timeouts below show port 53 is unreachable too)
ping -c 2 ns1.duckdns.org                    # FAIL: 100% packet loss
ping -c 2 99.79.143.35                       # FAIL: 100% packet loss (direct IP)

# DNS queries to authoritative servers - timeout
dig @99.79.143.35 kelin-hass.duckdns.org     # FAIL: timeout
dig @35.182.183.211 kelin-hass.duckdns.org   # FAIL: timeout
dig @3.97.58.28 kelin-hass.duckdns.org       # FAIL: timeout

# Queries to recursive resolvers - SUCCESS
dig @8.8.8.8 kelin-hass.duckdns.org          # SUCCESS
dig @1.1.1.1 kelin-hass.duckdns.org          # SUCCESS

# Traceroute analysis
traceroute 99.79.143.35
# Shows traffic reaching hop 5 (74.41.143.193), then no further replies (black hole)
# DuckDNS nameservers are hosted on Amazon AWS
# Suggests AWS security groups or ISP blocking
```
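The manual checks above can be repeated in one pass. The following is a sketch under stated assumptions, not a definitive diagnostic: `probe_ns`, `probe_all`, and the `DIG_CMD` override are hypothetical names introduced here. It relies on `dig` exiting non-zero when no reply arrives, so any answer at all, even SERVFAIL, counts as reachable.

```bash
#!/usr/bin/env bash
# Sketch: report which DNS servers answer on port 53 at all.
# DIG_CMD is a hypothetical override so the functions can be exercised with a stub.
DIG_CMD="${DIG_CMD:-dig}"

# probe_ns SERVER NAME -> prints "SERVER reachable" or "SERVER unreachable".
# dig exits with status 9 when no reply arrives; any reply exits 0.
probe_ns() {
  local server="$1" name="$2"
  if "$DIG_CMD" "@${server}" "$name" SOA +time=3 +tries=1 >/dev/null 2>&1; then
    printf '%s reachable\n' "$server"
  else
    printf '%s unreachable\n' "$server"
  fi
}

# probe_all NAME SERVER... -> one result line per server.
probe_all() {
  local name="$1"
  shift
  local server
  for server in "$@"; do
    probe_ns "$server" "$name"
  done
}
```

For example, `probe_all kelin-hass.duckdns.org 99.79.143.35 35.182.183.211 3.97.58.28 8.8.8.8 1.1.1.1` would be expected to mark the three authoritative servers unreachable and the two recursive resolvers reachable on the affected network.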

**Why This Matters**:
Traefik's ACME client (the lego library) checks the authoritative nameservers after setting TXT records. Certificate acquisition fails even though:
- DuckDNS API successfully sets TXT records ✅
- TXT records propagate to public DNS (8.8.8.8, 1.1.1.1) ✅
- Recursive DNS queries work ✅

By default, lego also queries the authoritative nameservers directly to verify propagation before asking Let's Encrypt to validate, and this step fails because those servers are unreachable.

## Attempted Solutions

### Configuration Optimizations Tried

1. **Increased propagation delay** - `delayBeforeCheck: 300` (5 minutes)
   - Result: Delay worked, but authoritative NS check still failed

2. **Extended timeout** - `DUCKDNS_PROPAGATION_TIMEOUT=600` (10 minutes)
   - Result: Longer timeout observed, but same NS unreachability issue

3. **LEGO environment variables**:
   ```yaml
   - LEGO_DISABLE_CNAME_SUPPORT=true
   - LEGO_EXPERIMENTAL_DNS_TCP_SUPPORT=true
   - LEGO_DNS_TIMEOUT=60
   - LEGO_DNS_RESOLVERS=1.1.1.1:53,8.8.8.8:53
   - LEGO_DISABLE_CP=true
   ```
   - Result: Forced use of recursive resolvers for some queries, but SOA lookups still failed

4. **Explicit Docker DNS configuration**:
   ```yaml
   dns:
     - 1.1.1.1
     - 8.8.8.8
   ```
   - Result: Container used correct resolvers, but lego still attempted authoritative NS queries

5. **VPN routing test** (through Gluetun container)
   - Result: DuckDNS nameservers also unreachable through VPN

### Error Messages Observed

**Phase 1: Direct authoritative nameserver timeout**
```
propagation: time limit exceeded: last error: authoritative nameservers:
DNS call error: read udp 172.19.0.2:53666->3.97.58.28:53: i/o timeout
[ns=ns6.duckdns.org.:53, question='_acme-challenge.kelin-hass.duckdns.org. IN TXT']
```

**Phase 2: SOA record query failure**
```
propagation: time limit exceeded: last error: could not find zone:
[fqdn=_acme-challenge.kelin-hass.duckdns.org.]
unexpected response for 'kelin-hass.duckdns.org.'
[question='kelin-hass.duckdns.org. IN SOA', code=SERVFAIL]
```
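When scanning long Traefik logs, it can help to label each line with one of the two phases shown above. This is a rough sketch: the function names and the phase labels are made up for illustration, and the patterns match only the message fragments quoted in this document.

```bash
#!/usr/bin/env bash
# Sketch: label a Traefik/lego log line with the failure phase it belongs to.
# The labels are illustrative names, not lego terminology.
classify_acme_error() {
  local line="$1"
  case "$line" in
    *"authoritative nameservers"*|*"i/o timeout"*)
      echo "phase1-authoritative-ns-timeout" ;;
    *"could not find zone"*|*"code=SERVFAIL"*)
      echo "phase2-soa-lookup-failure" ;;
    *)
      echo "other" ;;
  esac
}

# Read log lines on stdin and count how often each label occurs.
summarize_acme_errors() {
  local line
  while IFS= read -r line; do
    classify_acme_error "$line"
  done | sort | uniq -c
}
```

A possible invocation, assuming the container is named `traefik`, is `docker logs traefik 2>&1 | grep -i propagation | summarize_acme_errors`.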

## Working Configuration (Self-Signed Certificates)

Current deployment is **fully functional** with self-signed certificates:
- All services accessible via HTTPS ✅
- Browser certificate warnings can be bypassed manually ✅
- Traefik routing works correctly ✅
- Authelia SSO functional ✅
- All stacks deployed successfully ✅
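To confirm from the command line which kind of certificate a service is actually serving, the issuer can be inspected with `openssl`. This is a sketch with hypothetical function names. It assumes the issuer line of a Let's Encrypt certificate contains the text `Let's Encrypt`, which also holds for staging certificates, so a match does not by itself prove browser trust.

```bash
#!/usr/bin/env bash
# Sketch: decide from an issuer string whether a certificate came from Let's Encrypt.
# Input is the single line printed by `openssl x509 -noout -issuer`.
issuer_is_lets_encrypt() {
  case "$1" in
    *"Let's Encrypt"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Fetch the issuer line of the certificate served for HOST on port 443.
fetch_issuer() {
  local host="$1"
  openssl s_client -connect "${host}:443" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -noout -issuer
}
```

For example: `issuer_is_lets_encrypt "$(fetch_issuer kelin-hass.duckdns.org)" && echo "issued by Let's Encrypt" || echo "self-signed or other CA"`.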

## Recommended Solutions for Next Test Run

### Option 1: Switch to Cloudflare DNS (RECOMMENDED)
**Pros**:
- Cloudflare nameservers are highly reliable and globally accessible
- Supports wildcard certificates via DNS-01 challenge
- Faster propagation times
- Well-tested with Traefik

**Steps**:
1. Move a domain you own to Cloudflare (free tier sufficient). A `duckdns.org` subdomain cannot be delegated to Cloudflare, so this option requires a separately registered domain
2. Obtain Cloudflare API token (Zone:DNS:Edit permission)
3. Update `traefik.yml`:
   ```yaml
   dnsChallenge:
     provider: cloudflare
     delayBeforeCheck: 30 # Cloudflare propagates quickly
     resolvers:
       - "1.1.1.1:53"
       - "1.0.0.1:53"
   ```
4. Update `docker-compose.yml`:
   ```yaml
   environment:
     - CF_DNS_API_TOKEN=${CF_DNS_API_TOKEN}
   ```
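Before switching the provider to `cloudflare`, it may be worth confirming that the domain really is delegated to Cloudflare's nameservers. The sketch below is one way to check this; `uses_cloudflare_ns` is a hypothetical helper, and it assumes Cloudflare-assigned nameservers have names ending in `.ns.cloudflare.com`.

```bash
#!/usr/bin/env bash
# Sketch: read NS records from stdin (one per line, as printed by
# `dig +short NS <domain>`) and succeed only if there is at least one
# record and every record is a Cloudflare-assigned nameserver.
uses_cloudflare_ns() {
  local ns count=0
  while IFS= read -r ns; do
    [ -z "$ns" ] && continue          # skip empty lines
    case "$ns" in
      *.ns.cloudflare.com.|*.ns.cloudflare.com) count=$((count + 1)) ;;
      *) return 1 ;;                  # any other nameserver fails the check
    esac
  done
  [ "$count" -gt 0 ]
}
```

Usage, with `example.com` standing in for the real domain: `dig +short NS example.com | uses_cloudflare_ns && echo "delegated to Cloudflare"`.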

### Option 2: Investigate Network Blocking
**Diagnostic Steps**:
1. Test from different network (mobile hotspot, different ISP)
2. Contact ISP to check if AWS IP ranges are blocked
3. Check router/firewall for DNS filtering or AWS blocking
4. Test with different VPN provider

**If network is the issue**:
- May need to use VPN or proxy for Traefik container
- Consider hosting Traefik on different network segment

### Option 3: HTTP-01 Challenge (Non-Wildcard)
**Pros**:
- More reliable here (no TXT records or propagation checks)
- Works with current DuckDNS setup
- No queries to the DuckDNS authoritative nameservers from Traefik

**Cons**:
- ❌ No wildcard certificate (must specify each subdomain)
- Requires port 80 accessible from internet
- Separate certificate for each subdomain

**Steps**:
1. Update `traefik.yml`:
   ```yaml
   httpChallenge:
     entryPoint: web
   ```
2. Remove wildcard domain label from Traefik service:
   ```yaml
   # Remove this line:
   - "traefik.http.routers.traefik.tls.domains[0].sans=*.${DOMAIN}"
   ```
3. Add explicit TLS configuration to each service's labels
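Since HTTP-01 needs one router configuration per subdomain, the per-service labels can be generated rather than typed by hand. This is a minimal sketch under loud assumptions: the resolver name `letsencrypt`, the service names `dockge` and `authelia`, and the one-subdomain-per-service naming are all placeholders. Use the resolver name defined under `certificatesResolvers` in `traefik.yml` and the real router names.

```bash
#!/usr/bin/env bash
# Sketch: print per-service Traefik labels for certificates issued via HTTP-01.
# RESOLVER and the service names below are assumptions, not values from this repo.
RESOLVER="${RESOLVER:-letsencrypt}"
DOMAIN="${DOMAIN:-kelin-hass.duckdns.org}"

# tls_labels SERVICE -> three compose label lines for that service's router.
tls_labels() {
  local service="$1"
  printf -- '- "traefik.http.routers.%s.rule=Host(`%s.%s`)"\n' "$service" "$service" "$DOMAIN"
  printf -- '- "traefik.http.routers.%s.tls=true"\n' "$service"
  printf -- '- "traefik.http.routers.%s.tls.certresolver=%s"\n' "$service" "$RESOLVER"
}

for svc in dockge authelia; do
  tls_labels "$svc"
done
```

The printed lines are meant to be pasted under the `labels:` key of the matching service in its compose file.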

### Option 4: Use Alternative DNS Provider with DuckDNS
Keep DuckDNS for dynamic IP updates, but use a different DNS provider for certificates. This requires a domain you own, because the certificates are then issued for names under that domain rather than under `duckdns.org`:
1. Use Cloudflare for DNS records
2. Keep DuckDNS container for IP updates
3. Create CNAME in Cloudflare pointing to the DuckDNS hostname (`kelin-hass.duckdns.org`)
4. Use Cloudflare for certificate challenge

## Files to Update in Repository

### ~/AI-Homelab/stacks/core/traefik/traefik.yml
Document both HTTP and DNS challenge configurations with clear comments.

### ~/AI-Homelab/stacks/core/docker-compose.yml
Ensure wildcard domain configuration is correct (it is currently):
```yaml
- "traefik.http.routers.traefik.tls.domains[0].main=${DOMAIN}"
- "traefik.http.routers.traefik.tls.domains[0].sans=*.${DOMAIN}"
```
**This is correct** - keep both apex and wildcard.

### ~/AI-Homelab/docs/service-docs/traefik.md
Add troubleshooting section for DuckDNS DNS challenge issues.

## Success Criteria for Next Test

### Must Have:
- [ ] Valid wildcard SSL certificate obtained
- [ ] Certificate automatically renews
- [ ] No browser certificate warnings
- [ ] Documented working configuration

### Should Have:
- [ ] Certificate acquisition completes in < 5 minutes
- [ ] Reliable across multiple test runs
- [ ] Clear error messages if failure occurs

## Timeline Analysis

**First Test Run**: Certificates reportedly worked
**Current Test Run**: Consistent failures

**Possible Explanations**:
1. DuckDNS infrastructure changes (AWS security policies)
2. ISP routing changes
3. Increased AWS security after abuse/attacks
4. Different network environment during first test

## Conclusion

**Current Status**: System is production-ready except for SSL certificate warnings.

**Blocking Issue**: DuckDNS authoritative nameservers unreachable from current network environment.

**Recommendation**: **Switch to Cloudflare DNS** for next test run. Of the options above, this is the most reliable, and it is a widely used setup for automated certificate management with Traefik.

**Alternative**: If staying with DuckDNS is required, investigate network connectivity issues with ISP and consider using HTTP-01 challenge (losing wildcard capability).