Beware the Resolver of Nginx

Fleeting

Context: I was debugging an issue with nginx sitting in front of an ECS load balancer. The load balancer was recreated and its dns record updated, but nginx never picked up the new ip address.

There are plenty of stack overflow discussions about using the resolver directive, and even the documentation is quite explicit about it. I was still surprised when I realized how it actually works.

creating the env to reproduce the issue

the network

I will take advantage of the internal dns of docker to simulate the moving dns entry.

As per the docker documentation, containers on the default bridge network cannot resolve each other by name.
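
For instance, a container started on the default bridge would fail to resolve another container by name (a sketch, not run here; some-container is a hypothetical name):

# without --network, the container lands on the default bridge, where
# container names are not resolved, so this would fail with a name resolution error
docker run --rm alpine ping -c 1 some-container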

Let’s create our own.

docker network create test
3e8c801c300fb53fe2f9962a53f4263357a37231557e9bceee01b8d595291ebc

the containers

Then, let’s run a simple http service in that network.

docker run --rm -ti --detach --network test --name service traefik/whoami
8c6766257ec39b4268e91c4af60dc9053b6fe1fffe4a4a61952fbc6691931648

Then, let’s ping it to see its ip address

docker run --rm --network test alpine ping -c 1 service.test
PING service.test (172.18.0.2): 56 data bytes
64 bytes from 172.18.0.2: seq=0 ttl=64 time=0.099 ms

--- service.test ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.099/0.099/0.099 ms

Now, let’s kill it and start a dummy sleep container in its place: the sleep container grabs the freed ip 2, so the recreated service ends up on another ip address.

docker run --rm -ti --detach --name sleep --network test alpine sleep 3600

echo "Using ip != 2"
docker kill service
docker run --rm -ti --detach --name sleep --network test alpine sleep 3600
docker run --rm -ti --detach --network test --name service traefik/whoami
docker run --rm --network test alpine ping -c 1 service.test
Using ip != 2
service
e2cba3c3478549287915bc2cfbeb79c828ed89e52f2083d5967469bf4ff3c4ff
PING service.test (172.18.0.3): 56 data bytes
64 bytes from 172.18.0.3: seq=0 ttl=64 time=0.235 ms

--- service.test ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.235/0.235/0.235 ms

To go back to the ip 2, it suffices to kill both containers and simply restart the service.

echo "Going back to ip 2"
docker kill service sleep
docker run --rm -ti --detach --network test --name service traefik/whoami
docker run --rm --network test alpine ping -c 1 service.test
Going back to ip 2
service
sleep
4d4eaab4b33d9bdd7ccc975a867d391afe4189ea74ede7b8f4986d46232b0e97
PING service.test (172.18.0.2): 56 data bytes
64 bytes from 172.18.0.2: seq=0 ttl=64 time=0.168 ms

--- service.test ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.168/0.168/0.168 ms

nginx

Now, let’s start an nginx container to proxy connections to the service.

cat <<EOF > /tmp/proxy_pass.conf
server {
    listen 80;
    server_name _;

    location / {
        proxy_pass http://service.test;
    }
}
EOF

docker run --rm -ti --detach --name nginx --network test --publish 8080:80 \
       --volume /tmp/proxy_pass.conf:/etc/nginx/conf.d/default.conf \
       nginx
f19b2d312a93b69fffc0fb62129806617fafb085ebacf88c75f2b04c159d8c42

And try it

curl localhost:8080/
Hostname: 4d4eaab4b33d
IP: 127.0.0.1
IP: ::1
IP: 172.18.0.2
RemoteAddr: 172.18.0.3:37966
GET / HTTP/1.1
Host: service.test
User-Agent: curl/8.10.1
Accept: */*
Connection: close

Great. We have access to the service, on the ip 2.

Let’s move the service container to another ip address, reusing the same block of commands as before.

Using ip != 2
service
9b7a6eb3c83fe853ad4c1837573017e92314b48f225b1cf09aeda7a469dd2739
e22c6c1d25fbcc9e81a883726ede70cc93d0c807c18a302e5137a06ca17d3de1
PING service.test (172.18.0.4): 56 data bytes
64 bytes from 172.18.0.4: seq=0 ttl=64 time=0.264 ms

--- service.test ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.264/0.264/0.264 ms

And curl nginx again.

<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.27.4</center>
</body>
</html>

This is the issue we are studying.
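
As an aside, the classic manual workaround is to reload nginx: re-reading the configuration makes it resolve the proxy_pass name again (a minimal sketch, not run here so as to keep the broken state):

# a reload re-parses the configuration, hence re-resolves service.test
docker exec nginx nginx -s reload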

using resolver with a validity period

According to the documentation, we could add a resolver directive to the nginx location block.

We can see the ip of the dns server in /etc/resolv.conf.

docker run --rm --network test alpine cat /etc/resolv.conf
# Generated by Docker Engine.
# This file can be edited; Docker Engine will not make further changes once it
# has been modified.

nameserver 127.0.0.11
options ndots:0

# Based on host file: '/etc/resolv.conf' (internal resolver)
# ExtServers: [192.168.1.1]
# Overrides: []
# Option ndots from: internal
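
If one wants to grab that ip programmatically rather than hard-coding it, something along these lines would do (a sketch):

# print the first nameserver listed in the container's /etc/resolv.conf
docker run --rm --network test alpine \
       awk '/^nameserver/ {print $2; exit}' /etc/resolv.conf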

Let’s make use of it in our nginx instance.

docker kill nginx
cat <<EOF > /tmp/proxy_pass.conf
server {
    listen 80;
    server_name _;

    location / {
        resolver 127.0.0.11 valid=2s;
        proxy_pass http://service.test;
    }
}
EOF

docker run --rm -ti --detach --name nginx --network test --publish 8080:80 \
       --volume /tmp/proxy_pass.conf:/etc/nginx/conf.d/default.conf \
       nginx
nginx
1ccae4b4cf250f92d5695aaf7a284afad97f5fae8d51b4746acf2c7f3216e07c

Now, let’s first check that it works.

Hostname: e22c6c1d25fb
IP: 127.0.0.1
IP: ::1
IP: 172.18.0.4
RemoteAddr: 172.18.0.3:46086
GET / HTTP/1.1
Host: service.test
User-Agent: curl/8.10.1
Accept: */*
Connection: close

Then, move it back to the ip address 2.

Going back to ip 2
service
sleep
66d053baac2412eeb01f984df300381c4244bfa85ee6bfb122509b6e0e97f786
PING service.test (172.18.0.2): 56 data bytes
64 bytes from 172.18.0.2: seq=0 ttl=64 time=0.094 ms

--- service.test ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.094/0.094/0.094 ms

Now, trying again

<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.27.4</center>
</body>
</html>

This is not good.

Actually, the resolver directive is not taken into account at all, as we can see by restarting nginx with a dumb resolver address.

docker kill nginx
cat <<EOF > /tmp/proxy_pass.conf
server {
    listen 80;
    server_name _;

    location / {
        resolver 1.2.3.4 valid=2s;
        proxy_pass http://service.test;
    }
}
EOF

docker run --rm -ti --detach --name nginx --network test --publish 8080:80 \
       --volume /tmp/proxy_pass.conf:/etc/nginx/conf.d/default.conf \
       nginx
nginx
d99097278e186c560a9c453ba68adb19e9b18a235e56ba24806b2958aa1ee6a9

Then, try it.

Hostname: 66d053baac24
IP: 127.0.0.1
IP: ::1
IP: 172.18.0.2
RemoteAddr: 172.18.0.3:42612
GET / HTTP/1.1
Host: service.test
User-Agent: curl/8.10.1
Accept: */*
Connection: close

Whereas it should have failed to resolve anything through 1.2.3.4. With a static name in proxy_pass, nginx resolves it only once, when the configuration is loaded, using the system resolver from /etc/resolv.conf; the resolver directive is never consulted.
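
A simple way to convince oneself that the resolution happens at configuration load time is to point proxy_pass at a name that does not exist: the configuration check itself refuses it (a sketch; does-not-exist.test is a hypothetical, unresolvable name):

cat <<EOF > /tmp/startup_check.conf
server {
    listen 80;
    server_name _;

    location / {
        proxy_pass http://does-not-exist.test;
    }
}
EOF

# expected to fail with a "host not found in upstream" error, because the
# static proxy_pass target is resolved while the configuration is parsed
docker run --rm --network test \
       --volume /tmp/startup_check.conf:/etc/nginx/conf.d/default.conf \
       nginx nginx -t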

with the backend as a variable

I read somewhere that using a variable helped.

docker kill nginx
cat <<"EOF" > /tmp/proxy_pass.conf
server {
    listen 80;
    server_name _;

    location / {
        resolver 127.0.0.11 valid=2s;
        set $backend "http://service.test";
        proxy_pass $backend;
    }
}
EOF

docker run --rm -ti --detach --name nginx --network test --publish 8080:80 \
       --volume /tmp/proxy_pass.conf:/etc/nginx/conf.d/default.conf \
       nginx
nginx
5dc5c585dfb04a357c16ae6772587608230aeb82b592d44b4cb49f57de57b03a
Hostname: 66d053baac24
IP: 127.0.0.1
IP: ::1
IP: 172.18.0.2
RemoteAddr: 172.18.0.3:46830
GET / HTTP/1.1
Host: service.test
User-Agent: curl/8.10.1
Accept: */*
Connection: close

Now, going to another ip address.

Using ip != 2
service
02b42d79c381e676eb74a6fb0cbe859d7b3ba898ac1c3dc020334d1d89ab967f
a1163e6d4131a3d3a855ced05e687883333d3d237b8f219a779298484682b999
PING service.test (172.18.0.4): 56 data bytes
64 bytes from 172.18.0.4: seq=0 ttl=64 time=0.108 ms

--- service.test ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.108/0.108/0.108 ms

And trying again

Hostname: a1163e6d4131
IP: 127.0.0.1
IP: ::1
IP: 172.18.0.4
RemoteAddr: 172.18.0.3:38646
GET / HTTP/1.1
Host: service.test
User-Agent: curl/8.10.1
Accept: */*
Connection: close

We indeed see that this made nginx react to the change in the dns record: since the target is now a variable, nginx cannot resolve it at configuration time and has to do so at request time.

Also, when nginx is restarted with a dumb resolver, the request now fails, showing that the resolver directive is actually used.

docker kill nginx
cat <<"EOF" > /tmp/proxy_pass.conf
server {
    listen 80;
    server_name _;

    location / {
        resolver 1.2.3.4 valid=2s;
        set $backend "http://service.test";
        proxy_pass $backend;
    }
}
EOF

docker run --rm -ti --detach --name nginx --network test --publish 8080:80 \
       --volume /tmp/proxy_pass.conf:/etc/nginx/conf.d/default.conf \
       nginx
nginx
c4d79e495af0f585a0f3be672dfeef28b17ad62ec559d28310ddaaed3740d738
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.27.4</center>
</body>
</html>

But unfortunately, it ignores the value of the valid field.

docker kill nginx
cat <<"EOF" > /tmp/proxy_pass.conf
server {
    listen 80;
    server_name _;

    location / {
        resolver 127.0.0.11 valid=3600s;
        set $backend "http://service.test";
        proxy_pass $backend;
    }
}
EOF

docker run --rm -ti --detach --name nginx --network test --publish 8080:80 \
       --volume /tmp/proxy_pass.conf:/etc/nginx/conf.d/default.conf \
       nginx
nginx
c7e069440db400a7200c2df2e23070f63522affac7177f03e73ba8b861903121
Hostname: a1163e6d4131
IP: 127.0.0.1
IP: ::1
IP: 172.18.0.4
RemoteAddr: 172.18.0.3:53524
GET / HTTP/1.1
Host: service.test
User-Agent: curl/8.10.1
Accept: */*
Connection: close
Going back to ip 2
service
sleep
541f00945d87e5430ccd90328a705e16ec737fb436bf778863ab5ec3bbd3b33e
PING service.test (172.18.0.2): 56 data bytes
64 bytes from 172.18.0.2: seq=0 ttl=64 time=0.276 ms

--- service.test ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.276/0.276/0.276 ms
Hostname: 541f00945d87
IP: 127.0.0.1
IP: ::1
IP: 172.18.0.2
RemoteAddr: 172.18.0.3:60376
GET / HTTP/1.1
Host: service.test
User-Agent: curl/8.10.1
Accept: */*
Connection: close

I would have expected this request to fail: it took me far less than 3600s to run those commands, so the cached ip address should still have been the old one.

In real life, with ECS, it looked like nginx followed the TTL of the record. In this simulation, the change appears to be picked up immediately. I don’t know why.
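
To dig further, one could watch the dns queries nginx actually emits, for instance from a sidecar container sharing its network namespace (a sketch; nicolaka/netshoot is just one image that happens to ship tcpdump):

# capture dns traffic from inside the nginx container's network namespace
docker run --rm -ti --network container:nginx nicolaka/netshoot \
       tcpdump -ni any port 53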

using upstream and resolve

docker kill nginx
cat <<EOF > /tmp/proxy_pass.conf
resolver 127.0.0.11 valid=3600s;

server {
    listen 80;
    server_name _;

    location / {
        proxy_pass http://backend;
    }
}

upstream backend {
   # I don't know what this zone means, but it complains if it's not there
   zone backend 64k;
   server service.test   resolve;
}
EOF

docker run --rm -ti --detach --name nginx --network test --publish 8080:80 \
       --volume /tmp/proxy_pass.conf:/etc/nginx/conf.d/default.conf \
       nginx
nginx
6e35209940ce6d1a46735d61d62e4049ab669613e1df87d779cb7feefb965e25

I can still reach the service.

Hostname: 541f00945d87
IP: 127.0.0.1
IP: ::1
IP: 172.18.0.2
RemoteAddr: 172.18.0.3:33520
GET / HTTP/1.1
Host: backend
User-Agent: curl/8.10.1
Accept: */*
Connection: close

And moving the service to another ip address before the validity period expires indeed gets a 502 error.

Using ip != 2
service
05e64556aa84144076a0a3c489841d848f629a87ab5ebd453955f6236fa17646
ff1554068d09ce633d0c3ba7813836a8774072eed6c7ebddc1c3c95158934cff
PING service.test (172.18.0.4): 56 data bytes
64 bytes from 172.18.0.4: seq=0 ttl=64 time=0.545 ms

--- service.test ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.545/0.545/0.545 ms
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.27.4</center>
</body>
</html>

Let’s now try it with a shorter validity period.

docker kill nginx
cat <<EOF > /tmp/proxy_pass.conf
resolver 127.0.0.11 valid=2s;

server {
    listen 80;
    server_name _;

    location / {
        proxy_pass http://backend;
    }
}

upstream backend {
   # I don't know what this zone means, but it complains if it's not there
   zone backend 64k;
   server service.test   resolve;
}
EOF

docker run --rm -ti --detach --name nginx --network test --publish 8080:80 \
       --volume /tmp/proxy_pass.conf:/etc/nginx/conf.d/default.conf \
       nginx
nginx
66918d3258b7b9eb6c0f589e5f96d7280c6b2e49c50754209c54b90c240e81e7
Hostname: ff1554068d09
IP: 127.0.0.1
IP: ::1
IP: 172.18.0.4
RemoteAddr: 172.18.0.3:58124
GET / HTTP/1.1
Host: backend
User-Agent: curl/8.10.1
Accept: */*
Connection: close
Going back to ip 2
service
sleep
7ffe6737eab0917d4da9225d77e85295356574b936635f62547117525b9d69e4
PING service.test (172.18.0.2): 56 data bytes
64 bytes from 172.18.0.2: seq=0 ttl=64 time=0.089 ms

--- service.test ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.089/0.089/0.089 ms
Hostname: 7ffe6737eab0
IP: 127.0.0.1
IP: ::1
IP: 172.18.0.2
RemoteAddr: 172.18.0.3:42180
GET / HTTP/1.1
Host: backend
User-Agent: curl/8.10.1
Accept: */*
Connection: close

This behaves exactly as intended. That’s a relief.

in kubernetes

In a standard cluster, the ip found in /etc/resolv.conf belongs to kube-dns.kube-system.svc.cluster.local, so it is better to use that name in the resolver directive rather than a hard-coded ip.

https://serverfault.com/questions/876308/kubernetes-dns-resolver-in-nginx ([2025-03-21 Fri])
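
Transposed to kubernetes, the final setup would look something like this (a sketch; my-service and my-namespace are hypothetical names, and the resolver name assumes the conventional kube-dns service):

# the resolver name itself is resolved once at startup through the system
# resolver, then used for the runtime re-resolution of the upstream
resolver kube-dns.kube-system.svc.cluster.local valid=10s;

upstream backend {
   zone backend 64k;
   server my-service.my-namespace.svc.cluster.local resolve;
}

server {
    listen 80;
    server_name _;

    location / {
        proxy_pass http://backend;
    }
}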