Topic: dnsmasq: DNS resolution broken after reboot with "dnsseccheckunsigned"

The content of this topic has been archived on 14 Apr 2018. There are no obvious gaps in this topic, but there may still be some posts missing at the end.

Page 1 of 1

Post #1

silentcreek

7 Jan 2016, 01:47

Hi,

I encountered a problem with reboots and dnsmasq. I build my own snapshots from the stable branch (15.05 Chaos Calmer) from time to time. The configuration (or diffconfig, to be precise) and packages stay the same, and I always keep my settings during a sysupgrade. I have now observed the issue several times; at first I couldn't isolate it, but eventually I did.

After a reboot dnsmasq cannot resolve any domain name anymore if I have

option dnsseccheckunsigned '1'

in my /etc/config/dhcp.

Everything works fine before the reboot (of course, my images have the package dnsmasq-full, so they support dnssec and the external dns server I use does, too). But right after a reboot dns resolution won't work anymore.

However, if I remove that option after rebooting the router, dns resolution works again. I can then re-add the option and reload dnsmasq, and dnsmasq will work again with dnsseccheckunsigned. It is only after a reboot that it breaks.

This does not seem to be caused by a regression in the sources. Even if I reflash the same image that was working before while keeping the exact same configuration, dns resolution in dnsmasq will break with dnsseccheckunsigned. If I remove the option dnsseccheckunsigned and re-add it afterwards, it works again. So it's something during reboot that causes the issue.

Now that I found the problem, it's easy to handle. But the more interesting question would be why that even happens. Does anybody have a clue?

Thanks,

Timo

Edit: Changed the title and replaced "sysupgrade" with "reboot" in this post after it became clear that sysupgrade is not the problem but rebooting in general. See the second post for further details.

(Last edited by silentcreek on 27 Jan 2016, 10:06)

Post #2

silentcreek

15 Jan 2016, 11:03

Update:

I think I know why that happens: It's not particularly due to the sysupgrade but rather a reboot in general. When the router reboots, it loses the correct time. And with an incorrect time the dnsmasq signature check will fail. Inevitably, without dns resolution, time cannot be updated via ntp either.
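
To make the failure mode concrete: DNSSEC signatures (RRSIG records) carry an inception and an expiration timestamp, and a validator rejects a signature when its own clock falls outside that window. The following is only a minimal sketch of that comparison with made-up epoch values; `sig_time_valid` is a hypothetical helper, not dnsmasq code, and real validators use wrap-around-safe serial arithmetic rather than a plain integer comparison.

```shell
#!/bin/sh
# sig_time_valid NOW INCEPTION EXPIRATION
# Hypothetical helper: succeeds (exit 0) only if NOW, in seconds since the
# epoch, lies inside the signature validity window.
sig_time_valid() {
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Illustrative values only: a validity window covering January 2016.
inception=1451606400    # 2016-01-01 00:00:00 UTC
expiration=1454284800   # 2016-02-01 00:00:00 UTC

# A clock that is far in the past (here: 1975-01-01) fails the check.
if sig_time_valid 157766400 "$inception" "$expiration"; then
    echo "signature accepted"
else
    echo "signature rejected: clock outside validity window"
fi
```

With a wrong clock every signed answer fails this kind of check, which would explain why all lookups break at once rather than only some of them.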

So it seems the option dnsseccheckunsigned should be used with care and come with a big warning unless you have a system with a hardware real time clock.

I also found a discussion that is related to this issue, but so far this hasn't been resolved:
https://patchwork.ozlabs.org/patch/521344/

I will see if I can come up with a small script to bypass the signature or time check until the system time has been synchronized successfully, or if I have to dismiss the option dnsseccheckunsigned completely.

(Last edited by silentcreek on 15 Jan 2016, 11:31)

Post #3

stangri

16 Jan 2016, 10:15

Thanks for the investigation man, a while ago I faced the same problem and decided to just keep it set to 0.

Would setting it to 0, restarting dnsmasq, sleeping for a few minutes and then setting it to 1 and restarting dnsmasq again (in /etc/rc.local) work as a temporary work-around?

Post #4

silentcreek

20 Jan 2016, 16:42

stangri wrote:

Thanks for the investigation man, a while ago I faced the same problem and decided to just keep it set to 0.

In absence of a better solution, I have done the same at the moment - until I have more time to work on a solution.

stangri wrote:

Would setting it to 0, restarting dnsmasq, sleeping for a few minutes and then setting it to 1 and restarting dnsmasq again (in /etc/rc.local) work as a temporary work-around?

It should work. It's not the cleanest solution, but it should work.

I'm still trying to make up my mind on how to approach this in a proper way. There are several options I had thought about so far:
1) Have a script run at start up that disables dnsseccheckunsigned, waits until time is in sync and enables it again
This should work, but then again dnsmasq has the option --dnssec-no-timecheck, which should be a bit safer since you still do dnssec validations, just without time checks.

2) Start dnsmasq with --dnssec-no-timecheck until time is in sync, then enable timechecks
This would probably be a cleaner solution. But unfortunately --dnssec-no-timecheck is not available via UCI, which means you probably have to hack the init scripts to make use of it. I haven't looked into the init scripts much yet, so I'll have to find time to look into it further to actually implement that.

3) Use a local ntp server to get the time that can be accessed by IP instead of a domain name
This would probably require the least effort, but it's not feasible for everybody. Since I have one server that runs 24/7, I might do that. But I don't know if there are requirements on the precision of an ntp server with regard to dnssec validation, so I'm not sure here either. Plus, it's also not bulletproof in case your server is offline temporarily and your router happens to reboot exactly during that downtime.

4) Use hardcoded ip addresses for ntp servers
Well, external IPs may change, so this has a catch, too.

5) Have the ntp daemon query a specific dns server (like Google Public DNS) directly and thus bypass dnsmasq
That would be quite nice, although I have no idea whether this would be possible at all.
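
Option 1 needs some way to decide that "time is in sync". One hedged possibility is to poll the clock until it passes a known-good lower bound (for instance the build date of the image) before re-enabling the option. `wait_for_clock` and the lower bound used in the usage comment are assumptions for illustration; the uci calls use the option name discussed in this thread.

```shell
#!/bin/sh
# wait_for_clock MIN_EPOCH TRIES DELAY
# Hypothetical helper: poll the system clock up to TRIES times, sleeping DELAY
# seconds between polls, until it is at or past MIN_EPOCH.
# Returns 0 once the clock looks plausible, 1 if it never does.
wait_for_clock() {
    min=$1 tries=$2 delay=$3
    while [ "$tries" -gt 0 ]; do
        if [ "$(date +%s)" -ge "$min" ]; then return 0; fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then sleep "$delay"; fi
    done
    return 1
}

# Possible use from /etc/rc.local (not executed here):
#   uci set dhcp.@dnsmasq[0].dnsseccheckunsigned=0
#   uci commit dhcp
#   /etc/init.d/dnsmasq restart
#   if wait_for_clock 1451606400 60 5; then
#       uci set dhcp.@dnsmasq[0].dnsseccheckunsigned=1
#       uci commit dhcp
#       /etc/init.d/dnsmasq restart
#   fi
```

A lower bound only catches clocks that are obviously in the past; it cannot tell whether the time is actually correct, so this is a heuristic rather than a real sync check.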

When I have more time, I will try to figure something out to address this issue. But at the moment it's not at the top of my list...

Post #5

stangri

21 Jan 2016, 07:07

Much appreciated, please keep us posted on your progress if you tackle the problem.

I wonder how it's done in other distributions and if we could ask dev team to adopt the same logic/code.

Post #6

stangri

24 Jan 2016, 05:48

Uhm, this seems to work:

uci add_list dhcp.@dnsmasq[0].server='/0.openwrt.pool.ntp.org/8.8.8.8'
uci add_list dhcp.@dnsmasq[0].server='/1.openwrt.pool.ntp.org/8.8.8.8'
uci add_list dhcp.@dnsmasq[0].server='/2.openwrt.pool.ntp.org/8.8.8.8'
uci add_list dhcp.@dnsmasq[0].server='/3.openwrt.pool.ntp.org/8.8.8.8'
uci set dhcp.@dnsmasq[0].dnsseccheckunsigned=1
uci commit dhcp

I rebooted right after and got a working dnsmasq.

PS. If you could edit the thread title not to implicate sysupgrade but rather dnssec and specifically dnsseccheckunsigned, that would be great.

(Last edited by stangri on 24 Jan 2016, 05:50)

Post #7

silentcreek

27 Jan 2016, 10:10

stangri wrote:

Uhm, this seems to work:

uci add_list dhcp.@dnsmasq[0].server='/0.openwrt.pool.ntp.org/8.8.8.8'

This is very interesting. I will do some experiments, maybe this weekend, to verify that. It seems inconsistent that using specific dns servers for certain hosts would lead to the dnssec options not being honored, but if that works, it might be an elegant solution to the problem.

stangri wrote:

PS. If you could edit the thread title not to implicate sysupgrade but rather dnssec and specifically dnsseccheckunsigned, that would be great.

Done.

Post #8

silentcreek

30 Jan 2016, 03:04

Ok, I tested what you suggested. Unfortunately, it did not work. If I add specific dns servers for selected domains, I still cannot look up those names if the system time is incorrect. (I just set my system time to some date in the seventies to emulate that.) I have a suspicion, though, why it worked for you. If dnsmasq realizes that the system time is bogus (which it determines by checking the last modification date of a special file), it will not enforce timestamp checks. When this is the case, it says so in the system log. In this case looking up domain names still works. But if dnsmasq assumes the time is correct when it is in fact wrong, dns resolution will break. One more thing that makes this hard to spot is that the dns cache doesn't seem to be reset when you restart dnsmasq (unless you do a reboot). So domains that you looked up before will still resolve even with an outdated system time.
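
The timestamp-file idea can be sketched roughly as follows: the clock is treated as bogus while it is still earlier than the modification time of a reference file. This only illustrates the comparison; `clock_behind_stamp` is a hypothetical helper, and the real logic lives inside dnsmasq (see the `--dnssec-timestamp` option in its man page).

```shell
#!/bin/sh
# clock_behind_stamp FILE
# Hypothetical helper: succeeds (exit 0) if the system clock is still earlier
# than FILE's modification time, i.e. the clock cannot be trusted yet.
clock_behind_stamp() {
    now_marker=$(mktemp) || return 2
    # The marker's mtime is "now"; -nt is true if FILE is newer than that.
    if [ "$1" -nt "$now_marker" ]; then
        rc=0
    else
        rc=1
    fi
    rm -f "$now_marker"
    return "$rc"
}

# Demo with a throwaway file whose mtime is set far in the future.
stamp=$(mktemp)
touch -t 203701010000 "$stamp"
if clock_behind_stamp "$stamp"; then
    echo "clock is behind the timestamp file: skip time checks"
else
    echo "clock is past the timestamp file: enforce time checks"
fi
rm -f "$stamp"
```

The weak spot matches what is described above: if the clock is wrong but already later than the file's timestamp, the check passes and time checks are enforced against a bad clock.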

Anyway, it would have been nice but doesn't work, unfortunately.

In the meantime, I also did some more research. I learned that out of the approaches I listed before, 2) and 5) are not feasible. Starting dnsmasq with --dnssec-no-timecheck until the system time is valid is not easily done in OpenWrt: when dnsmasq is started with that option, it will wait for a SIGHUP signal as a sign that the time is now correct and then restart with time checking enforced. The problem here is that on OpenWrt this signal is used for other purposes as well. So any of these triggers could cause dnsmasq to restart with timestamp checking enforced, even if the time is still incorrect. One would have to patch the dnsmasq sources to use a different trigger.
5) is not impossible in absolute terms, but you would have to insert a manipulated shared library into a program to do that, which is not easily done and may have other side effects.

So, I'll have to investigate the other options further.

Post #9

silentcreek

4 Feb 2016, 12:37

Hey again,

so, I found a nice and reliable solution.

Basically, the trick is to use nslookup to look up the IP addresses of the NTP servers' domain names. With nslookup you can bypass dnsmasq and dnssec validation. Try for example:

nslookup www.openwrt.org 208.67.222.222

It will look up www.openwrt.org using the DNS server 208.67.222.222. It also works when dnsmasq would not return any IP because of an incorrect system time.
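
For scripting, the answer has to be picked out of the nslookup output, which also lists the address of the server that was queried. A minimal sketch follows; `first_ipv4_answer` is a hypothetical helper, and it assumes that the answer section starts with a "Name:" line, which may not hold for every nslookup implementation or version.

```shell
#!/bin/sh
# first_ipv4_answer
# Hypothetical helper: reads nslookup output on stdin and prints the first
# IPv4 address that appears after the "Name:" line, so the address of the
# queried DNS server in the header is ignored.
first_ipv4_answer() {
    awk '
        /^Name:/ { in_answer = 1; next }
        in_answer && /^Address/ {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) {
                    print $i
                    exit
                }
            }
        }
    '
}

# Possible use (not executed here):
#   ip=$(nslookup 0.openwrt.pool.ntp.org 208.67.222.222 | first_ipv4_answer)
```

If the lookup fails, the helper prints nothing, so the caller should check for an empty result before handing the value to ntpd.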

With this neat workaround at hand, I wrote an init script that will keep the system time in sync when dnsseccheckunsigned is enabled. What it basically does is:

1) Ping the specified DNS servers to see if the internet connection is up and DNS servers available
2) Look up the IP addresses of the specified NTP servers using the specified DNS servers
3) Use the retrieved IP addresses with ntpd to sync the time
4) After a successful sync, restart dnsmasq to ensure it runs with DNSSEC validation enforced
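
Steps 3 and 4 can be sketched as follows. This is not the linked init script, only an illustration under assumptions: `ntpd_peer_args` is a hypothetical helper that turns already resolved IP addresses into `-p` arguments for the BusyBox ntpd, and the variables in the usage comment are placeholders.

```shell
#!/bin/sh
# ntpd_peer_args IP...
# Hypothetical helper: turn a list of IP addresses into "-p IP" pairs for
# BusyBox ntpd, skipping empty entries left over from failed lookups.
ntpd_peer_args() {
    args=""
    for ip in "$@"; do
        [ -n "$ip" ] || continue
        args="$args -p $ip"
    done
    # Print without the leading space.
    echo "${args# }"
}

# Possible use (not executed here; ip1 and ip2 hold the results of step 2):
#   peers=$(ntpd_peer_args "$ip1" "$ip2")
#   [ -n "$peers" ] && ntpd -q -n $peers && /etc/init.d/dnsmasq restart
```

Chaining with `&&` means dnsmasq is restarted only when at least one peer was resolved and ntpd reported a successful sync.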

In addition, the script does a lot of error handling:
- If any of the steps 1-3 fails, it will pause for a bit and retry until the retry limit is reached
- If it ultimately fails and reaches the retry limit (e.g. if the internet connection is down), it can enter a "fallback" mode, if that is enabled (option FALLBACK=1). Fallback mode means the script will disable dnsseccheckunsigned and restart dnsmasq. It will also create a temporary file to keep track of the fallback state. If the system is shut down and the temporary file is found, dnsseccheckunsigned will be reenabled for the next boot.
- I also implemented a command "enforce" to be used in conjunction with the fallback mode. The enforce command will check if the fallback mode was entered (checks for the temporary file that's created when fallback mode is entered), try to sync time and if successful, reenable dnsseccheckunsigned and restart dnsmasq (and remove the temporary file). A good idea would be to add something like this to your crontab:

0 * * * * [ -e /tmp/dnssectime-fallback ] && /etc/init.d/dnssectime enforce

This way, dnsseccheckunsigned will be reenabled as soon as possible (checked every hour) and not only when the system is shut down.
- Last but not least, there is the possibility to automatically reenable dnsseccheckunsigned during startup if it's found disabled (option FORCE_DNSSEC=1). This is useful in case the fallback mode was active just before a power loss or hard reset of the router.
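
The retry-then-fallback behaviour described in the list above could look roughly like this. It is a sketch, not an excerpt from the linked script: `retry` and `sync_time` are hypothetical names, and only the marker file path is taken from the crontab line above.

```shell
#!/bin/sh
# retry MAX DELAY COMMAND...
# Hypothetical helper: run COMMAND until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts. Returns 0 on success, 1 otherwise.
retry() {
    max=$1 delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then return 0; fi
        if [ "$attempt" -lt "$max" ]; then sleep "$delay"; fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Possible use (not executed here; sync_time stands for steps 1 to 3):
#   if ! retry 5 10 sync_time; then
#       touch /tmp/dnssectime-fallback
#       uci set dhcp.@dnsmasq[0].dnsseccheckunsigned=0
#       uci commit dhcp
#       /etc/init.d/dnsmasq restart
#   fi
```

Since POSIX sh has no local variables, `max`, `delay` and `attempt` are globals here; the command being retried should not reuse those names.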

All that should help to make this a really robust solution and to cover even unlikely errors and to keep dnsmasq running reliably.
Plus, this init script logs all actions and errors to make it easier to debug (or just to verify that everything is running well). You can use

logread | grep dnssectime

to view the log messages.
The script has a few configurable options in its upper section to make it portable and customizable.

The script can be found here: http://pastebin.com/SuYmKPB7
Please give it a try. I'd appreciate any feedback.

Limitations:
- Obviously, the code is not the most elegant or efficient. But it should just work. I might clean it up/streamline it later.
- At this point, the code only uses IPv4 addresses. Simply because almost everybody should have a working IPv4 connection, but not everybody might have a working IPv6 connection. Plus, I didn't even check how nslookup and ntpd handle IPv6 addresses.
- Another idea I have is to use uci to query the DNS and NTP servers in the router configuration, instead of having the user hardcode them in this init script. But that's not a priority, especially since not everybody defines DNS servers in their configuration (but receives them via DHCP instead).
- The script expects the option dnsseccheckunsigned to be found as dhcp.@dnsmasq[0].dnsseccheckunsigned in the uci system. I don't know if a different setup where the option would appear as e.g. dhcp.@dnsmasq[2].dnsseccheckunsigned is even possible. But if it is, the relevant uci calls would have to be adjusted. (Btw. if no occurrence of dnsseccheckunsigned is found in uci at all - i.e. it is neither 0 nor 1, but undefined - the script will skip execution and just print a notice to the system log.)

Regards,

Timo

(Last edited by silentcreek on 4 Feb 2016, 15:14)

Post #10

stangri

5 Feb 2016, 01:01

Hey Timo, thank you very much for the investigation and for coming up with the solution. I'll try to test it on my router over the weekend. One question tho -- have you considered writing a hotplug iface file instead?

Post #11

stangri

5 Feb 2016, 03:32

So borrowing heavily on your original idea, the following seems to work (created a new image and reflashed to test), I'm including the below in the uci-defaults and once router has booted up I'm online.

cat << 'EOF' > /etc/hotplug.d/iface/90-dnssec
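#!/bin/sh
# Only act when the wan interface comes up.
[ "$ACTION" = "ifup" ] && [ "$INTERFACE" = "wan" ] || exit 0

# Ping every configured DNS server; OUT stays 0 only if all of them answer.
OUT=0
for server in $(uci get network.wan.dns); do
    ping -c1 "$server" > /dev/null
    OUT=$((OUT+$?))
done

[ "$OUT" = 0 ] || exit 0

# Resolve each NTP peer directly via the last DNS server from the loop above
# (bypassing dnsmasq) and collect the addresses as ntpd peers.
args=""
for peer in $(uci get system.ntp.server); do
    ip=$(nslookup "$peer" "$server" | grep -v "$server" | grep -m1 -E "^Address [0-9]{1,2}: ([0-9]{1,3}\.){3}[0-9]{1,3} " | cut -d' ' -f3)
    # Skip peers that could not be resolved instead of passing an empty -p.
    [ -n "$ip" ] && args="$args -p $ip"
done
# Only try to sync if at least one peer address was found.
[ -n "$args" ] && ntpd -qn $args > /dev/null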
/etc/init.d/dnsmasq restart
exit 0
EOF

(Last edited by stangri on 18 Feb 2016, 05:34)

Post #12

silentcreek

5 Feb 2016, 23:02

Hey stangri,

nope, I haven't thought about a hotplug script yet. What are you hoping to gain from that? I don't know the hotplug system very well, but atm I only see one scenario in which it might be useful or better to use a hotplug script. This would be: If you lose the internet connection for a prolonged time and reboot the router during that time, a hotplug script might have the edge over an init script simply because it wouldn't try to sync time (or ping dns servers) before the internet connection is up. But then again, with the init script you could disable dnsseccheckunsigned in that scenario, and reenable it once you find the connection is up again.

As for hotplugd, I have one question: What exactly does ifup check? Does it only check if the link is up or does it really check if you have a working internet connection? For example, I have my OpenWrt router behind a cable gateway. So even if the cable gateway has no internet connection, the wan interface of the OpenWrt machine would be "up", but that doesn't mean global addresses would be reachable. Anyway, are there any other scenarios where a hotplug script would have advantages?

Post #13

stangri

6 Feb 2016, 02:16

I thought that the proper hotplug would be neater as it requires less code.

If I understand the https://wiki.openwrt.org/doc/techref/hotplug correctly, you're right -- in your scenario the wan will be up, even if the cable modem has no internet connection, unless the cable modem is smart enough to reset the port when it gets connected.

No doubt your script is more encompassing/universal, maybe the proper hotplug script can still be created to combat the problem. The little one above works for me tho.

Post #14

silentcreek

8 Feb 2016, 03:01

Well, for now, I stick with an init script, simply because that can be called again later (via cron) to re-enable dnsseccheckunsigned in case it was disabled during boot.
The point of all these "safety nets" in my code is that I need this to be reliable under all possible circumstances. I'd like to avoid any situation where e.g. my wife would find herself with a broken internet connection (or dns resolution) and I would have to explain to her on the phone how to ssh into the router, etc.

Nevertheless, I reworked the code so it's much cleaner and more efficient now (compared to my first attempt - yours is still smaller, of course). I moved some recurring stuff into functions and added for loops now, inspired by your code.
Here's the new version: http://pastebin.com/5rnsbk55

One noteworthy functional change is: Dnsmasq is only restarted now if the option dnsseccheckunsigned was changed. If not, it should not require a restart.

And one other side note that might be interesting for your hotplug script: You might wanna check the exit code of the ntpd command. When I first did my experiments, I noticed that ntpd might fail to sync time even if the server is reachable. I don't recall the exact error message, but it was something like the peer's time was considered not precise enough and therefore it refused to sync time. I'd assume, though, that the more peers you use, the rarer this scenario becomes, and I haven't observed it since.
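
Checking the exit code could be sketched like this. `restart_if_synced` is a hypothetical helper that takes the names of two shell functions; the `sync_time` and `restart_dns` wrappers in the usage comment are assumptions built from the commands in the hotplug script above.

```shell
#!/bin/sh
# restart_if_synced SYNC RESTART
# Hypothetical helper: run the SYNC function and call RESTART only if SYNC
# reported success; otherwise report the failure and pass its exit code on.
restart_if_synced() {
    if "$1"; then
        "$2"
    else
        rc=$?
        echo "time sync failed with exit code $rc, not restarting" >&2
        return "$rc"
    fi
}

# Possible use in a hotplug script (not executed here):
#   sync_time() { ntpd -qn $args > /dev/null; }
#   restart_dns() { /etc/init.d/dnsmasq restart; }
#   restart_if_synced sync_time restart_dns
```

This avoids restarting dnsmasq with validation enforced while the clock is still wrong.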

(Last edited by silentcreek on 8 Feb 2016, 08:23)

Post #15

stangri

8 Feb 2016, 21:43

Outstanding job on a workaround and thank you for the suggestion -- that hotplug script is not really error-proof.

Post #16

silentcreek

16 Feb 2016, 15:49

Thanks.

I'm posting my last and final take on this script: http://pastebin.com/0ghjTLSW

The functionality is still the same, but it's again a bit cleaner now. I added some more comments to make it easier to understand what's going on. I also added a prefix to all variables to make sure there's no mixup with any other variables that might be used on the system. But there's only one real functional change: In the previous version, the stop command (which is executed on shutdown) would restart Dnsmasq if dnsseccheckunsigned was re-enabled. This unnecessary restart is now avoided.

Post #17

stangri

16 Feb 2016, 20:53

Great job, script looks very neat!

I wonder how this issue is solved in big-name Linux distros tho.

Post #18

silentcreek

17 Feb 2016, 00:17

Well, I suppose most full-fledged Linux distros are run on devices with hardware clocks. Even some embedded devices do have hardware clocks (e.g. my LeMaker BananaPi has an RTC but no battery for it - as long as you don't unplug the board, the time will be kept during reboots). And then there is the question of whether they even enable DNSSEC by default. So the cases in which they'd need to worry are probably rare. My router is the first device I had such problems with, even though it's not the first device without a hardware clock I use.

Post #19

stangri

18 Feb 2016, 05:36

Ah, good call, didn't realize that.

BTW, ran into problems with my hotplug script, the nslookup part wasn't working to get an IP so I couldn't get the time set. I ended up changing it to:

cat << 'EOF' > /etc/hotplug.d/iface/90-dnssec
#!/bin/sh
[ "$ACTION" = "ifup" ] && [ "$INTERFACE" = "wan" ] || exit 0
uci set dhcp.@dnsmasq[0].dnsseccheckunsigned=0
uci commit dhcp
/etc/init.d/dnsmasq restart
args=""
for peer in $(uci get system.ntp.server); do args="$args -p $peer"; done
ntpd -qn $args > /dev/null
uci set dhcp.@dnsmasq[0].dnsseccheckunsigned=1
uci commit dhcp
/etc/init.d/dnsmasq restart
exit 0
EOF

Post #20

silentcreek

18 Feb 2016, 08:57

Interesting. I never experienced such problems. Maybe your internet connection is not fully up at that point? I remember a thread here a few days ago where someone complained about his hotplug script being called too early (link is going up but the internet connection not fully established). Adding a pause with the sleep command solved this. You could give it a shot.

Post #21

stangri

18 Feb 2016, 09:42

I've tried to run the script commands in the console and there are still issues with nslookup for some reason. Are you running your script on trunk or CC?

Post #22

silentcreek

18 Feb 2016, 10:14

I'm running it on Chaos Calmer. You could add some pipes or logs to catch the actual errors and see what's going on.

The discussion might have continued from here.
