OpenWrt Forum Archive

Topic: Multi-WAN Load Balancing

The content of this topic has been archived between 29 Mar 2018 and 3 May 2018. Unfortunately there are posts – most likely complete pages – missing.

daniel_bergamini wrote:

I'm going to try and lower all the 300 and +300 stuff to 200 and +200 which should still get it out of the way and maybe work... we'll see.

Yes indeed, a quick :%s/300/200/g and life is better, at least it's building out the table correctly.

root@OpenWrt:~# ip rule show
0:    from all lookup local 
200:    from all fwmark 0x1 lookup 200 
210:    from 1.2.3.4 lookup 210 
211:    from all fwmark 0x10 lookup 210 
220:    from 192.168.100.100 lookup 220 
221:    from all fwmark 0x20 lookup 220 
32766:    from all lookup main 
32767:    from all lookup default

I'm going to play around a little with regard to seeing how it behaves after a failover but I've been spot checking the connections and there is a lot less traffic from the wrong IP but still some. I have noticed that the ordering of the /tmp/resolv.conf.auto could be optimized a bit based on which connection is preferred.

For example, it wrote out wan's DNS servers first and then wan2's DNS servers, which means that all DNS traffic was going over wan, even though my default route was wan2. Fortunately it was all using the correct interface's IP, but still less than ideal. I think I could beat it by switching cables around and making wan the default route. Just something to think about...

Anyway, why is traffic still going out the wrong interface? I can only assume 256 is some magical boundary for routing table numbers, which sort of makes sense.

I think the problem may be kernel specific, as I seem to see no issue with tables above 256.
In any case, I've updated it to 1.0.14, which now puts the tables in the 170-190 range.

Let me know if you have any issues after that.
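To make the table-number discussion above easier to check, here is a minimal Python sketch that parses `ip rule show` output and flags numeric table IDs above a limit. The limit of 255 is an assumption taken from what the posters observed on the 2.4 kernel, not a verified kernel constant, and the function names are made up for illustration.

```python
import re

# Highest table ID assumed to work on the 2.4 kernel discussed in this thread
# (an assumption drawn from the posts, not a verified kernel constant).
ASSUMED_MAX_TABLE_ID = 255

RULE_RE = re.compile(r"^(\d+):\s+(.*?)\s*lookup\s+(\S+)\s*$")

def parse_ip_rules(text):
    """Return (priority, selector, table) tuples from `ip rule show` output."""
    rules = []
    for line in text.splitlines():
        m = RULE_RE.match(line.strip())
        if m:
            rules.append((int(m.group(1)), m.group(2).strip(), m.group(3)))
    return rules

def tables_over_limit(rules, limit=ASSUMED_MAX_TABLE_ID):
    """List numeric table IDs above the limit (named tables are skipped)."""
    return sorted({int(t) for _, _, t in rules if t.isdigit() and int(t) > limit})

SAMPLE = """\
0:    from all lookup local
200:    from all fwmark 0x1 lookup 200
210:    from 1.2.3.4 lookup 210
32766:    from all lookup main
"""

rules = parse_ip_rules(SAMPLE)
print(rules[1])                  # (200, 'from all fwmark 0x1', '200')
print(tables_over_limit(rules))  # []
print(tables_over_limit(parse_ip_rules("300:    from all fwmark 0x1 lookup 300")))  # [300]
```

Note that the first column is the rule priority and the last is the table; they happen to match in the 200-series output above, but they are independent values.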

SouthPawn wrote:
Priyantha Bleeker wrote:

Well, I don't need a specific traffic route, but I would like a sort of intelligent routing, so that it will choose the most logical and fastest way to the Internet.
Ideally, traffic for one specific address would go via WAN1 and traffic for another specific address via WAN2.

In your situation, set the weight for the 8mbit wan to 10 and the weight for the 4mbit wan to 5. As for specific addresses, that's exactly what the outbound rules (mwanfw) are for.

[cut]

Then simply restart the service /etc/init.d/multiwan restart.
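As a rough illustration of the weights suggested above, the sketch below computes the share of new connections each WAN would receive if the weights act as simple proportions. That proportional behaviour is an assumption about the balancer, and the WAN names are placeholders.

```python
from fractions import Fraction

def expected_shares(weights):
    """Map each WAN name to its expected fraction of new connections."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: Fraction(w, total) for name, w in weights.items()}

# Placeholder names for the 8mbit and 4mbit links from the post above.
shares = expected_shares({"wan_8mbit": 10, "wan_4mbit": 5})
for name, share in shares.items():
    print(f"{name}: {share} (~{float(share):.0%})")
# wan_8mbit: 2/3 (~67%)
# wan_4mbit: 1/3 (~33%)
```

Under this assumption, weights of 10 and 5 mirror the 2:1 bandwidth ratio of the two links.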

Hi, I did precisely what you said, but nothing changed.

Here is my config: http://www.priyantha.nl/multiwan
It also sets some preferred IPs to go over one specific WAN port, but that isn't really working.
For example, I have configured the IP "212.67.182.102" to go over WAN and not over WAN2, but according to this traceroute it is going over WAN2:

traceroute to 212.67.182.102 (212.67.182.102), 30 hops max, 38 byte packets
 1  192.168.1.254 (192.168.1.254)  40.215 ms  6.326 ms  100.003 ms
 2  s55906c01.adsl.wanadoo.nl (85.144.108.1)  222.204 ms  63.414 ms  43.953 ms
 3  V512.dr3-asd5.nl.euro.net (194.134.152.37)  42.056 ms  184.549 ms  214.280 ms
 4  PC18.cr1-asd8.nl.euro.net (194.134.161.21)  227.640 ms  267.367 ms  58.792 ms
 5  PC11.er1-asd8.nl.euro.net (194.134.161.11)  89.888 ms  43.395 ms  111.789 ms
 6  asd-nik-pr01.xb.nl (195.69.144.90)  143.488 ms  171.624 ms  133.986 ms
 7  atm4-0-0-102.29999r9pe.ams.iparix.net (212.67.161.149)  123.030 ms  43.593 ms  44.833 ms
 8  212.67.182.162 (212.67.182.162)  128.244 ms  305.025 ms  385.464 ms
 9  ams-host.nl (212.67.182.102)  351.739 ms  247.630 ms  284.073 ms

If I run the following before the traceroute: "ip route add 212.67.182.102 via 213.148.226.1"
then I get this:

traceroute to 212.67.182.102 (212.67.182.102), 30 hops max, 38 byte packets
 1  ict.18.254.concepts.nl (213.197.18.254)  27.887 ms  7.651 ms  7.785 ms
 2  ict.18.25.concepts.nl (213.197.18.25)  23.080 ms  8.158 ms  8.305 ms
 3  asd-nik-pr01.xb.nl (195.69.144.90)  8.556 ms  8.782 ms  9.245 ms
 4  atm4-0-0-102.29999r9pe.ams.iparix.net (212.67.161.149)  8.882 ms  8.767 ms  9.177 ms
 5  212.67.182.162 (212.67.182.162)  9.100 ms  8.965 ms  9.000 ms
 6  ams-host.nl (212.67.182.102)  9.057 ms  9.165 ms  9.035 ms

But then there is no failover anymore, which I always want to have if possible.
This is the output of "ip route show" AFTER adding the static route:

root@OpenWrt:~# ip route show
212.67.182.102 via 213.148.226.1 dev eth0.2 
192.168.3.0/24 dev br-lan  proto kernel  scope link  src 192.168.3.44 
192.168.1.0/24 dev eth0.3  proto kernel  scope link  src 192.168.1.67 
213.148.226.0/24 dev eth0.2  proto kernel  scope link  src 213.148.226.77 
default via 192.168.1.254 dev eth0.3 
default via 213.148.226.1 dev eth0.2

Maybe you or somebody else can help me with this?
Regards,

Priyantha
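One way to see why the static host route above bypasses failover: within a single routing table the most specific matching prefix wins, whatever the state of the link behind it. The Python sketch below models that with the routes from the post. It is a simplification (ties go to the first entry listed, metrics and policy rules are ignored), and `pick_route` is a made-up helper name.

```python
import ipaddress

def pick_route(routes, destination):
    """Return the matching route with the longest prefix; earlier entries win ties."""
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, gateway, dev in routes:
        net = ipaddress.ip_network(prefix)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, gateway, dev)
    return best

# Routes copied from the "ip route show" output above (LAN route omitted).
ROUTES = [
    ("212.67.182.102/32", "213.148.226.1", "eth0.2"),
    ("192.168.1.0/24", None, "eth0.3"),
    ("213.148.226.0/24", None, "eth0.2"),
    ("0.0.0.0/0", "192.168.1.254", "eth0.3"),
    ("0.0.0.0/0", "213.148.226.1", "eth0.2"),
]

print(pick_route(ROUTES, "212.67.182.102")[1:])  # ('213.148.226.1', 'eth0.2')
print(pick_route(ROUTES, "4.2.2.1")[1:])         # ('192.168.1.254', 'eth0.3')
```

Because the /32 entry always matches first, nothing moves that destination to the other WAN when eth0.2 goes down unless the route itself is removed.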

SouthPawn wrote:

I think the problem may be kernel specific, as I seem to see no issue with tables above 256.
In any case, I've updated it to 1.0.14, which now puts the tables in the 170-190 range.

Let me know if you have any issues after that.

I wonder if it's a 2.4 vs 2.6 kernel kind of thing? Interesting that you say 170-190, because I'm seeing it create rules 9-21. Not that I care as long as it's under 256.

root@OpenWrt:~# uname -a
Linux OpenWrt 2.4.35.4 #12 Tue Dec 29 15:30:20 UTC 2009 mips unknown
root@OpenWrt:~# ip rule show
0:    from all lookup local 
9:    from all fwmark 0x1 lookup LoadBalancer 
10:    from 1.2.3.4 lookup MWAN1 
11:    from all fwmark 0x10 lookup MWAN1 
20:    from 192.168.100.100 lookup MWAN2 
21:    from all fwmark 0x20 lookup MWAN2 
32766:    from all lookup main 
32767:    from all lookup default

Is this not right? Interestingly, with both connections in a working state, I am seeing DNS calls going out wan2 now, even though the /tmp/resolv.conf.auto has wan's DNS servers at the top of the file. I wonder if it was a timing thing last time.

I am still seeing wrong interface traffic, but at a much reduced rate, while both network connections are running properly. Once wan2 fails, however, dnsmasq reloads with only wan's nameserver entries, but wan2's IP is the only one making DNS calls out of wan's interface. It's really quite strange. If I do an 'ifdown wan2' it corrects itself.

Jun 26 12:03:25 OpenWrt user.notice root: [Multi-WAN Notice]: wan2 has failed and is currently offline.
Jun 26 12:03:25 OpenWrt user.info sysinit: ## Refreshing DNS Resolution and Tables ##
Jun 26 12:03:27 OpenWrt user.info sysinit: ## Refreshing Load Balancer ##
Jun 26 12:03:27 OpenWrt daemon.info dnsmasq[1019]: reading /tmp/resolv.conf.auto
Jun 26 12:03:27 OpenWrt daemon.info dnsmasq[1019]: using nameserver 4.2.2.3#53
Jun 26 12:03:27 OpenWrt daemon.info dnsmasq[1019]: using nameserver 4.2.2.1#53
Jun 26 12:03:27 OpenWrt daemon.info dnsmasq[1019]: using local addresses only for domain lan
root@OpenWrt:~# tcpdump -i eth0.1 -n net 4.2.2 and not icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0.1, link-type EN10MB (Ethernet), capture size 96 bytes
*** complete silence until wan2 fails and then; ***
12:03:41.685712 IP 192.168.100.100.12783 > 4.2.2.1.53: 19756+ A? mail.google.com. (33)
12:03:41.686740 IP 192.168.100.100.12783 > 4.2.2.3.53: 19756+ A? mail.google.com. (33)
12:03:46.690549 IP 192.168.100.100.12783 > 4.2.2.1.53: 19756+ A? mail.google.com. (33)
12:03:46.691108 IP 192.168.100.100.12783 > 4.2.2.3.53: 19756+ A? mail.google.com. (33)
12:03:51.698407 IP 192.168.100.100.37853 > 4.2.2.1.53: 27751+ A? mail.google.com. (33)
12:03:51.699196 IP 192.168.100.100.37853 > 4.2.2.3.53: 27751+ A? mail.google.com. (33)
12:03:56.700608 IP 192.168.100.100.37853 > 4.2.2.1.53: 27751+ A? mail.google.com. (33)
12:03:56.701336 IP 192.168.100.100.37853 > 4.2.2.3.53: 27751+ A? mail.google.com. (33)
daniel_bergamini wrote:

[cut]

I am still seeing wrong interface traffic, but at a much reduced rate, while both network connections are running properly. Once wan2 fails, however, dnsmasq reloads with only wan's nameserver entries, but wan2's IP is the only one making DNS calls out of wan's interface. It's really quite strange. If I do an 'ifdown wan2' it corrects itself.

[cut]

My guess here is that these are just requests that were being made as the routing tables got updated for the failover.
I'm thinking of changing this so that the default route on the default routing table is actually the loopback, which might help curb some of these.

I assume they drop off at some point, probably within minutes after the failover occurs.

The ip rules are listed by priority number (9 and up); the routing tables, however, are numbered 170-190.

Edit: Upon closer inspection, I believe you and Priyantha have stumbled upon an issue I overlooked: packets being generated on the router itself.

Namely the OUTPUT and POSTROUTING chains. I will look into this.

(Last edited by SouthPawn on 27 Jun 2010, 06:38)

SouthPawn wrote:

Edit: Upon closer inspection, I believe you and Priyantha have stumbled upon an issue I overlooked: packets being generated on the router itself.

Namely the OUTPUT and POSTROUTING chains. I will look into this.

This makes perfect sense: traffic going *through* the router appears to be working and failing over properly. If I set all my clients to go out to the net directly for DNS, I bet everything would work flawlessly. Since OpenWrt is using dnsmasq, the router is initiating the requests, and those are not failing over properly. Very good catch.
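The distinction between forwarded and router-originated traffic can be sketched in a few lines of Python. The chain order is the standard netfilter one, simplified to the two paths at issue; the assumption that the script placed its marking rules only in PREROUTING is hypothetical and used purely for illustration.

```python
# Netfilter chain order for IPv4 packets, simplified to the two paths at issue.
CHAINS = {
    "forwarded": ["PREROUTING", "FORWARD", "POSTROUTING"],  # LAN client -> internet
    "local_out": ["OUTPUT", "POSTROUTING"],                 # e.g. dnsmasq on the router
}

def is_marked(path, marking_chains):
    """True if a packet on `path` passes at least one chain that sets a mark."""
    return any(chain in marking_chains for chain in CHAINS[path])

# Hypothetical rule placement: marks applied only in PREROUTING.
prerouting_only = {"PREROUTING"}
print(is_marked("forwarded", prerouting_only))  # True
print(is_marked("local_out", prerouting_only))  # False

# Adding a rule in OUTPUT covers traffic the router itself originates.
print(is_marked("local_out", {"PREROUTING", "OUTPUT"}))  # True
```

Packets from dnsmasq never pass PREROUTING, so an fwmark-based `ip rule` has nothing to match on unless a mark is also set in OUTPUT.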

Would it be terribly hard to weight traffic types with a mod like this? Is this compatible with any of the existing qos scripts? Thanks again and I look forward to hearing what you decided to do with regard to the DNS issue. Please let me know if I can assist?

daniel_bergamini wrote:
SouthPawn wrote:

Edit: Upon closer inspection, I believe you and Priyantha have stumbled upon an issue I overlooked: packets being generated on the router itself.

Namely the OUTPUT and POSTROUTING chains. I will look into this.

This makes perfect sense: traffic going *through* the router appears to be working and failing over properly. If I set all my clients to go out to the net directly for DNS, I bet everything would work flawlessly. Since OpenWrt is using dnsmasq, the router is initiating the requests, and those are not failing over properly. Very good catch.

Would it be terribly hard to weight traffic types with a mod like this? Is this compatible with any of the existing qos scripts? Thanks again and I look forward to hearing what you decided to do with regard to the DNS issue. Please let me know if I can assist?

It's entirely compatible with qos-scripts. :)
I will let you know as soon as I figure it out, hopefully it won't take too long.

Thanks Daniel,
-Craig

Just thought you'd like to know I can add a route 256 now that I've upgraded to the 2.6 kernel based 10.03 (openwrt-brcm47xx-squashfs.trx). I just had to know if that would fix it. I assume .13 would work for me now, but I think it's better to have it universal for people still on 2.4.

daniel_bergamini wrote:

Just thought you'd like to know I can add a route 256 now that I've upgraded to the 2.6 kernel based 10.03 (openwrt-brcm47xx-squashfs.trx). I just had to know if that would fix it. I assume .13 would work for me now, but I think it's better to have it universal for people still on 2.4.

Just updated, 1.0.15.
I think I'll be keeping it at 170-190 for greater compatibility. Initially, moving it to 300 was meant to remove the cap on the number of WANs possible, but I hardly think anyone will have 20 wans anyhow. :)
Anyhow, I've updated it so that the default route now points back to a lan interface instead of adding both default routes.

Thanks Again,
-Craig

(Last edited by SouthPawn on 28 Jun 2010, 04:34)

SouthPawn wrote:

Anyhow, I've updated it so that the default route now points back to a lan interface instead of adding both default routes.

Thanks Again,
-Craig

No, thank you Craig, this has been a huge improvement for me and I really appreciate all the attention you've given me throughout the troubleshooting process.

Everything appears to be working normally, especially given that I have just recently reinstalled OpenWrt on this router. To the best of my knowledge I have not seen any wrong interface traffic since switching to .15. What I find interesting is that I am now seeing simultaneous DNS calls across both interfaces (to the appropriately configured DNS server for that interface). I'm pretty sure dnsmasq will just return the first result it receives, which is a pretty nice feature I think.


Do you know whether qos-scripts behaves as you would expect when multiple interfaces are defined? E.g., would I set up:

config 'interface' 'wan'
...
config 'interface' 'wan2'

If I configure their associated up/down speeds, would it respect that? The current luci-app-qos doesn't let you name the interface, so I didn't know if that was just a friendly name for the config file. Any other insight you can provide on using multi-wan with qos?

Thanks again Craig, I really appreciate everything!

daniel_bergamini wrote:

Do you know whether qos-scripts behaves as you would expect when multiple interfaces are defined? E.g., would I set up:

config 'interface' 'wan'
...
config 'interface' 'wan2'

If I configure their associated up/down speeds, would it respect that? The current luci-app-qos doesn't let you name the interface, so I didn't know if that was just a friendly name for the config file. Any other insight you can provide on using multi-wan with qos?

Thanks again Craig, I really appreciate everything!

Glad to hear it's working as expected! :)

Here's an example from my /etc/config/qos file:

config 'interface' 'wan'
        option 'classgroup' 'Default'
        option 'enabled' '1'
        option 'overhead' '1'
        option 'download' '10000'
        option 'upload' '1536'

config 'interface' 'wan2'
        option 'classgroup' 'Default'
        option 'enabled' '1'
        option 'overhead' '1'
        option 'download' '768'
        option 'upload' '389'

They will share the same qos rules, applied separately to each interface.
The multiwan script copies the tables and tc filters that qos-scripts creates and edits them to work within our multiwan environment.

(Last edited by SouthPawn on 28 Jun 2010, 06:43)

hi all, i managed to get this amazing script up & running without any networking issues, but now i have a different problem.

For example, when i log in to some websites/forums (phpbb3 for example), sometimes my session gets dropped because the traffic goes out from another ip. any workaround for this?

Best Regards

kadettgte wrote:

hi all, i managed to get this amazing script up & running without any networking issues, but now i have a different problem.

For example, when i log in to some websites/forums (phpbb3 for example), sometimes my session gets dropped because the traffic goes out from another ip. any workaround for this?

Best Regards

Generally speaking, if there is a compatibility issue, either create an outbound rule for that site to use the balancer that LuCI lists as Load Balancer (Compatibility), or designate that site to use a specific interface such as wan or wan2.

I still need to update the wiki... =P

(Last edited by SouthPawn on 28 Jun 2010, 18:14)

Hi again, and thanks for your fast reply hehe,

this is my configuration:

config 'multiwan' 'config'
        option 'default_route' 'balancer'
        option 'lan_if' 'lan'


config 'interface' 'wan'
        option 'weight' '10'
        option 'health_interval' '10'
        option 'icmp_hosts' 'dns'
        option 'timeout' '3'
        option 'health_fail_retries' '3'
        option 'health_recovery_retries' '5'
        option 'dns' 'auto'
        option 'failover_to' 'wan2'

config 'interface' 'wan2'
        option 'weight' '10'
        option 'health_interval' '10'
        option 'icmp_hosts' 'dns'
        option 'timeout' '3'
        option 'health_fail_retries' '3'
        option 'health_recovery_retries' '5'
        option 'dns' 'auto'
        option 'failover_to' 'balancer'

config 'mwanfw'
        option 'dst' '195.245.173.151'
        option 'wanrule' 'wan'

config 'mwanfw'
        option 'dst' '195.245.173.150'
        option 'wanrule' 'wan2'
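To make the failover settings in this config easier to follow, here is a small Python sketch that parses single-quoted UCI text and follows the `failover_to` options from one interface to the next. The parser handles only the quoting style shown in this thread, and the reading of `failover_to` as "where traffic goes when this interface fails" is an assumption based on the option name.

```python
import re

SECTION_RE = re.compile(r"^config\s+'([^']+)'(?:\s+'([^']+)')?\s*$")
OPTION_RE = re.compile(r"^option\s+'([^']+)'\s+'([^']*)'\s*$")

def parse_uci(text):
    """Parse single-quoted UCI text into a list of (type, name, options) tuples."""
    sections = []
    for raw in text.splitlines():
        line = raw.strip()
        m = SECTION_RE.match(line)
        if m:
            sections.append((m.group(1), m.group(2), {}))
            continue
        m = OPTION_RE.match(line)
        if m and sections:
            sections[-1][2][m.group(1)] = m.group(2)
    return sections

def failover_chain(sections, start):
    """Follow failover_to from `start` until a non-interface target or a loop."""
    targets = {name: opts.get("failover_to")
               for kind, name, opts in sections if kind == "interface"}
    chain = [start]
    while chain[-1] in targets and targets[chain[-1]] not in chain:
        nxt = targets[chain[-1]]
        if nxt is None:
            break
        chain.append(nxt)
    return chain

# Trimmed copy of the config posted above.
CONFIG = """
config 'interface' 'wan'
        option 'weight' '10'
        option 'failover_to' 'wan2'

config 'interface' 'wan2'
        option 'weight' '10'
        option 'failover_to' 'balancer'

config 'mwanfw'
        option 'dst' '195.245.173.151'
        option 'wanrule' 'wan'
"""

sections = parse_uci(CONFIG)
print(failover_chain(sections, "wan"))  # ['wan', 'wan2', 'balancer']
```

Read this way, traffic pinned to wan by the first mwanfw rule would fall back to wan2, and from there to the balancer.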


the problem is that it's not specific to one site. for example, at least 2 phpbb3 based sites have this issue, as well as some others that i don't recall right now. is there any way to maintain a tcp session for a while?

if there isn't any way, i'll just have to disable the balancer and use failover only until i find one (if i do, i'll post it here)
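For readers wondering what "keeping a session on one uplink" could look like in principle, the Python sketch below pins each (source, destination) pair to a WAN by hashing it onto a weighted slot list. This is a generic hash-based technique swapped in for illustration, not how the multiwan script works, and `pick_wan` plus the LAN address are hypothetical.

```python
import hashlib

def pick_wan(src_ip, dst_ip, weights):
    """Deterministically map a (source, destination) pair onto a weighted WAN list."""
    # Each WAN appears in the slot list as many times as its weight.
    slots = [name for name, w in sorted(weights.items()) for _ in range(w)]
    digest = hashlib.sha256(f"{src_ip}>{dst_ip}".encode()).digest()
    return slots[int.from_bytes(digest[:4], "big") % len(slots)]

weights = {"wan": 10, "wan2": 10}  # weights from the config above

first = pick_wan("192.168.1.50", "195.245.173.151", weights)
again = pick_wan("192.168.1.50", "195.245.173.151", weights)
print(first == again)  # True
```

Because the choice depends only on the address pair, a site that ties a login session to the client IP keeps seeing the same uplink for as long as the weights stay unchanged.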

Best Regards

(Last edited by kadettgte on 29 Jun 2010, 11:22)

See if you have any better results with 1.0.15-2. :)

damn, you are fast :D
i'm going to try it in about 1 hour or so, and then i'll post my results back.

Thanks a lot.

Best Regards

sorry for the late reply

well, your patch worked perfectly for most of the sites. i only have a problem with one, and i cannot find the cause

if i set the balancer to failover, it works perfectly; otherwise it fails even if i force it to go out from a specific uplink (which works for other sites/services/etc)

I think there isn't really a build-system

hi craig,

can you post version 1.0.14 on your ftp site again? that one works for me. i'm using a 2.4 kernel, and when i upgraded to 1.0.15-2 i couldn't make outgoing connections from the router. lan to internet works, but i can't wget, icmp, telnet, nslookup, etc. from the router. i need that for the scripts i have that run when the router boots up. i was not able to do a thorough diagnosis because i have work at home in 3 hours, and i panicked and reverted to the last mw version i had saved - 1.0j. (what a loser huh?)
anyway, the last things i can remember before i pushed the red button were:
1. route does not display multiple gateways.
2. ip rule is ok.
3. multiwan interface routing distribution works from lan. i go through the right interface for the sites i defined.
4. when i removed 1.0.15-2 everything worked again.

that's it. if 1.0.15-2 works for the 2.6 kernel users, maybe post 2 versions of the app on your site?
thanks in advance. sorry i could not be more help here.

andyballon wrote:

hi craig,

can you post version 1.0.14 on your ftp site again? that one works for me. i'm using a 2.4 kernel, and when i upgraded to 1.0.15-2 i couldn't make outgoing connections from the router. lan to internet works, but i can't wget, icmp, telnet, nslookup, etc. from the router. i need that for the scripts i have that run when the router boots up. i was not able to do a thorough diagnosis because i have work at home in 3 hours, and i panicked and reverted to the last mw version i had saved - 1.0j. (what a loser huh?)
anyway, the last things i can remember before i pushed the red button were:
1. route does not display multiple gateways.
2. ip rule is ok.
3. multiwan interface routing distribution works from lan. i go through the right interface for the sites i defined.
4. when i removed 1.0.15-2 everything worked again.

that's it. if 1.0.15-2 works for the 2.6 kernel users, maybe post 2 versions of the app on your site?
thanks in advance. sorry i could not be more help here.

I've put it back up, but let me ask you something: what's the name of your lan?
There's a new configuration option in .15 that uses the lan interface as the default gateway for the main table:
option 'lan_if' 'lan'

If that's not set to the actual name of your lan, it could create a problem. If you could let me know, it'd be very helpful for me. :)

oh. i think it was 'lan'. what should i put there? eth0.0? br-lan?
i still have time. i'll install .14 and tell you what it says.
THANKS!

It should be set to whatever your lan zone is named.

hmmm... it is set to 'lan' which is my lan zone.
and for some reason .14 stopped working for me as well.
can you upload .12 multiwan and its luci-app ipk?
i'll check if that works and try an upgrade to .15 and see if that works.
i'll report back after 6 hrs.

hi craig,
i tried version .14 again and it looks like the router's own routing is what's getting affected.
i'm getting a "no route to host" error when i ping a known host.
but forwarding is working, since i can get to the internet from lan.
not really sure, but outgoing traffic from the router to the internet does not work.

woah! i think you'll be interested in this:
root@culiat-wg:~# ip route list
192.168.2.0/24 dev eth0.2  proto kernel  scope link  src 192.168.2.214
192.168.1.0/24 dev br-lan  proto kernel  scope link  src 192.168.1.1
114.108.200.0/23 dev eth0.1  proto kernel  scope link  src 114.108.200.78
default via 192.168.1.1 dev br-lan

and here is what i have in config/multiwan:
root@culiat-wg:~# cat /etc/config/multiwan

config 'multiwan' 'config'
        option 'lan_if' 'lan'
        option 'default_route' 'fastbalancer'

doing some more investigation...
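One thing worth noticing in the route list above is that the only default route points at 192.168.1.1, which is the router's own br-lan address. The Python sketch below flags that pattern in `ip route list` output. Whether this is the cause of the "no route to host" error is only a plausible reading of the posts, and `self_pointing_defaults` is a made-up helper name.

```python
import re

def self_pointing_defaults(route_text):
    """Return default-route gateways that equal one of the host's own src addresses."""
    own = set(re.findall(r"\bsrc\s+(\d+\.\d+\.\d+\.\d+)", route_text))
    gateways = re.findall(r"^default\s+via\s+(\d+\.\d+\.\d+\.\d+)",
                          route_text, re.MULTILINE)
    return [gw for gw in gateways if gw in own]

# Route list copied from the post above.
ROUTE_LIST = """\
192.168.2.0/24 dev eth0.2  proto kernel  scope link  src 192.168.2.214
192.168.1.0/24 dev br-lan  proto kernel  scope link  src 192.168.1.1
114.108.200.0/23 dev eth0.1  proto kernel  scope link  src 114.108.200.78
default via 192.168.1.1 dev br-lan
"""

print(self_pointing_defaults(ROUTE_LIST))  # ['192.168.1.1']
```

If router-originated packets match none of the multiwan rules and fall through to this table, they would be handed back to the router itself, which fits the symptom of forwarding working while local pings fail.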

Sorry, posts 176 to 175 are missing from our archive.