#!/usr/local/bin/php
<?php
require_once("config.inc");
require_once("interfaces.inc");
require_once("util.inc");

$subsystem = !empty($argv[1]) ? $argv[1] : '';
$type = !empty($argv[2]) ? $argv[2] : '';

if ($type != 'MASTER' && $type != 'BACKUP') {
    log_error("Carp '$type' event unknown from source '{$subsystem}'");
    exit(1);
}

if (!strstr($subsystem, '@')) {
    log_error("Carp '$type' event triggered from wrong source '{$subsystem}'");
    exit(1);
}

$ifkey = 'wan';

if ($type === "MASTER") {
    log_error("enable interface '$ifkey' due CARP event '$type'");
    $config['interfaces'][$ifkey]['enable'] = '1';
    write_config("enable interface '$ifkey' due CARP event '$type'", false);
    interface_configure(false, $ifkey, false, false);
} else {
    log_error("disable interface '$ifkey' due CARP event '$type'");
    unset($config['interfaces'][$ifkey]['enable']);
    write_config("disable interface '$ifkey' due CARP event '$type'", false);
    interface_configure(false, $ifkey, false, false);
}
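The script only checks that the subsystem argument contains an '@'. As a minimal sketch, assuming the subsystem always has the form vhid@interface (as in the "200@vtnet1" values seen in the logs further down), a stricter check could look like the following. The function name parse_carp_subsystem is hypothetical and is not part of OPNsense.

```php
<?php
// Hypothetical helper (not an OPNsense function): split a CARP subsystem
// string such as "200@vtnet1" into its VHID and interface name.
function parse_carp_subsystem(string $subsystem): ?array
{
    // Assumed format: "<digits>@<interface name>"; anything else is rejected.
    if (!preg_match('/^(\d+)@([A-Za-z0-9_.]+)$/', $subsystem, $m)) {
        return null;
    }
    return ['vhid' => (int)$m[1], 'interface' => $m[2]];
}

// Example: a well-formed subsystem and a malformed one.
var_dump(parse_carp_subsystem('200@vtnet1'));
var_dump(parse_carp_subsystem('wan'));
```

Returning null for anything unexpected lets the caller log and exit early, the same way the script does for a missing '@'.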
I'm running OPNsense 24.7.10_2-amd64 and incorporated bits and pieces of code from here and there. I fixed the undefined function error for system_routing_configure() by requiring system.inc in the script; with that, interface_configure no longer crashes. I do still have CARP event issues, but they are unrelated to this.
require_once("config.inc");
require_once("interfaces.inc");
require_once("util.inc");
// Ensure system_routing_configure is included
require_once("system.inc");
.
.
.
So is this script considered stable on OPNsense 24.7.10_2 (with the possible need to require system.inc as mentioned directly above)?
Not sure. I barely got the whole script installed and troubleshot my installation. I figured I would share what I did to get past the crash. I have it running on one physical bare-metal machine and one Proxmox VM with 11 internal VIP VLANs. Stable? Not sure.
I upgraded today to 24.7.11_2. Adding:
require_once("system.inc");
does prevent the crashing issue. Nice find, huetruong.
I'm still having an issue with entering persistent maintenance mode not causing a failover: opnsense/core#7877
I've also not had enough time to find the best way to shut/no-shut the WAN interface so that a reboot of the active or passive device leaves the interface in a consistent, desired state based on the CARP status. (I don't want my backup/passive device to have its WAN interface enabled upon boot, requesting a DHCP lease while the active device is already handling traffic.)
Creative suggestions, MEntOMANdo. You could do that and probably achieve a workable situation, but I see potential problems with that approach for some users and ISPs.
In your VM example, though the interface will be "down" by default, I believe the interface will still be brought up by configuration during boot - if it's stored in the opnsense configuration for the interface to be up, it will be brought up during boot.
In your cron example, you may also run into a race condition: the WAN interface could still come up and do things like request a DHCP lease, and the cron job might not shut it down on the 'backup' device, depending on when during the boot process that cron entry actually executes.
Towards the end of boot, the interface configuration is read and then applied. So with either approach you risk the interface coming up in the first place, or not being shut down after the OPNsense scripts read the configuration and bring up the interface.
This is one reason why I said my workaround of using shell_exec to manually set the interfaces up or down is not very clean or ideal: I'm calling shell_exec in the first place (bad practice, a security no-no!), and the state of the interface will not persist across reboots.
IMO, it's better for the syshook.d CARP script to set the interface's configuration to be down and save this in the configuration, so that the WAN interface is only brought up when CARP's state changes to "master". This way, you don't have to change default interface behavior; the script handles this for you.
Thoughts?
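To illustrate the "save the state in the configuration" idea, here is a minimal sketch that works on a plain array standing in for the OPNsense config. The function name set_interface_enabled is made up for this example; in the real script the result would still have to be saved with write_config() and applied with interface_configure(), as the gist does.

```php
<?php
// Sketch only: $config is a plain array standing in for the OPNsense
// configuration. Nothing is written to disk and no interface is touched.
function set_interface_enabled(array $config, string $ifkey, string $carpType): array
{
    if ($carpType === 'MASTER') {
        // Master: persist the interface as enabled.
        $config['interfaces'][$ifkey]['enable'] = '1';
    } else {
        // Any other state (e.g. BACKUP): drop the flag so the saved
        // configuration keeps the interface down across reboots.
        unset($config['interfaces'][$ifkey]['enable']);
    }
    return $config;
}

// Example: a backup event removes the enable flag but keeps other settings.
$config = ['interfaces' => ['wan' => ['enable' => '1', 'if' => 'vtnet0']]];
$config = set_interface_enabled($config, 'wan', 'BACKUP');
var_dump(isset($config['interfaces']['wan']['enable']));
```

Because the desired state lives in the saved configuration rather than in a one-off ifconfig call, the boot-time interface setup reads the same answer the CARP hook last wrote.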
I reread your comments. I have to disable the WAN interface of the instance that is in backup state when I update and reboot so it doesn’t switch over.
This script works fine as an automatic failover if something goes wrong with the master.
Long story short, after finding out I couldn't unbridge my ONT -- I went about testing the WAN failover between my opnsense VMs again.
Either I haven't tested it in a long time or I was mistaken the last time I tested it. I had most of the issues that everyone mentioned .. most noticeably, the wan interface not disabling or enabling properly on the master/backup node respectively.
Also, on the backup/master node -- I noticed that it kept repeating master/backup node messages (as per the logging from 10-wancarp).
2024-12-27T01:38:05-05:00 Error opnsense /usr/local/etc/rc.syshook.d/carp/10-wancarp: enable interface 'wan' due CARP event 'MASTER'
2024-12-27T01:38:05-05:00 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member " (172.30.67.254) (40@vlan009)" has resumed the state "BACKUP" for vhid 40
2024-12-27T01:38:05-05:00 Error opnsense /usr/local/etc/rc.syshook.d/carp/10-wancarp: disable interface 'wan' due CARP event 'BACKUP'
2024-12-27T01:38:04-05:00 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member " (172.30.67.254) (40@vlan009)" has resumed the state "INIT" for vhid 40
2024-12-27T01:38:04-05:00 Error opnsense /usr/local/etc/rc.syshook.d/carp/10-wancarp: disable interface 'wan' due CARP event 'INIT'
2024-12-27T01:42:05-05:00 Notice configd.py [c8268658-528e-4180-9efb-b4465da3c196] Carp event on subsystem 200@vtnet1 for type MASTER
2024-12-27T01:40:05-05:00 Notice configd.py [75707303-20a1-468e-add3-97c31659f7cf] Carp event on subsystem 215@vlan09 for type MASTER
What I believe fixed the inconsistent master/backup status messages from 10-wancarp was noticing that the if-statement type check in 20-openvpn (in /usr/local/etc/rc.syshook.d/carp) was different. Thanks to everyone who posted their fixes.
https://gist.github.com/vc1cv1/f59273ce98fda57cf8000cca65193b6b
#!/usr/local/bin/php
<?php
// last updated for opnsense 24.7.11_2
require_once("config.inc");
require_once("interfaces.inc");
require_once("util.inc");
require_once("system.inc");

$subsystem = !empty($argv[1]) ? $argv[1] : '';
$type = !empty($argv[2]) ? $argv[2] : '';

if (!in_array($type, ['MASTER', 'BACKUP', 'INIT'])) {
    log_msg("Carp '$type' event unknown from source '{$subsystem}'");
    exit(1);
}

if (!strstr($subsystem, '@')) {
    log_error("Carp '$type' event triggered from wrong source '{$subsystem}'");
    exit(1);
}

$ifkey = 'wan';
$real_if = get_real_interface($ifkey);

// Since all my CARP IPs fail over together, I only want this to run when the
// CARP status change is for my LAN interface. Find your own subsystem by
// searching the debug log for 'carp', or comment out this if statement entirely.
if ($subsystem === "200@vtnet1") {
    if ($type === "MASTER") {
        log_error("enable interface '$ifkey' due CARP event '$type' on '$subsystem'");
        $config['interfaces'][$ifkey]['enable'] = '1';
        write_config("enable interface '$ifkey' due CARP event '$type'", false);
        interface_configure(false, $ifkey, false, false);
        sleep(2);
        shell_exec("/sbin/ifconfig {$real_if} up");
        log_msg("Issuing dhclient command on '$real_if' to request a DHCP lease");
        sleep(1);
        shell_exec("dhclient {$real_if}");
    } else {
        log_error("disable interface '$ifkey' due CARP event '$type' on '$subsystem'");
        unset($config['interfaces'][$ifkey]['enable']);
        write_config("disable interface '$ifkey' due CARP event '$type'", false);
        interface_configure(false, $ifkey, false, false);
        shell_exec("/sbin/ifconfig {$real_if} down");
    }
} // if subsystem
Agreed, it's better for the status of the interface to be saved. After testing my failovers, I saw nothing on my backup node at reboot indicating that the disabled 'wan' interface was being brought online or being disabled again by the CARP status.
Thank you for your efforts on this. I've got it set up and working when failing over. However, when the other device comes back online, I'm experiencing an issue. At that point, both firewalls are active and - since I duplicated the MAC address - competing for the IP address from the ISP. Has anyone else experienced this issue? How have you worked around it?
Which revision of the code are you using? Normally, the backup's interface should remain disabled unless the CARP status changes.
also, under HA -> settings -> "disable preempt" -- do you have that checked or unchecked? Mine is unchecked -- maybe you have this checked.
"When this device is configured as CARP master it will try to switch to master when powering up, this option will keep this one slave if there already is a master on the network. A reboot is required to take effect."
I'm using the one from above; I think you posted it "last week". I did update it to handle my second ISP (I have two ISPs, but neither provides a second IP). Preempt is disabled.
I THINK even though it will come up as a backup, it still tries to grab an IP address at bootup because CARP has not yet been initialized. I see an increase in loss (on the master WAN links) right as the (other, backup) system boots and when it gets to parts (during the boot) where it says something about configuring the WAN interfaces. This makes sense, since the backup does not yet have an awareness of CARP on those interfaces (since they're not configured for CARP) and should logically try to get an IP (with a duplicated MAC) and it is attempting to bring those interfaces up. I may try to spend some time in the other RC directories to see if there is a logical place to down the WAN interfaces until CARP is up and the system's role can be determined. I wasn't sure if others had seen the same issue and - if they had - what may have been done to work around it.
Has anyone tried this on 25.x yet? Either I'm being very dumb or there's a bug where additional scripts in /usr/local/etc/rc.syshook.d/carp/ are not executed. If I move the code to 20-openvpn it works. If I copy all the code from 20-openvpn into 10-wancarp it does not execute. Permissions should be correct.
Am I missing something obvious?
Been on 25.x for a couple of weeks.. took the plunge after taking a snapshot of both firewalls. Zero issues on this end.. scripts working as intended.
I'm also seeing the same issue on 25.1.8_1, did you ever find a solution?
Got this fixed. The #! has to be the first line in the script and I had a comment above it
Glad you were able to fix it. My problem was some encoding issue uploading through scp. Once I created and edited the files directly on the router things worked as expected.
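Both fixes above come down to what the first bytes of the script look like. As a rough sketch, the check below flags a first line that does not start with #!, plus two encoding problems that commonly break the shebang line. The byte order mark and CRLF checks are my assumptions about what an "encoding issue" might be; the thread does not say which one it was. The function name check_shebang is hypothetical.

```php
<?php
// Hypothetical checker (not an OPNsense tool): report common reasons a
// syshook script's #! line would not be recognised.
function check_shebang(string $contents): array
{
    $problems = [];
    if (strncmp($contents, "\xEF\xBB\xBF", 3) === 0) {
        // A UTF-8 byte order mark hides the "#!" from the kernel.
        $problems[] = 'file starts with a UTF-8 byte order mark';
    } elseif (strncmp($contents, '#!', 2) !== 0) {
        // Covers a comment or blank line placed above the shebang.
        $problems[] = 'first line is not a #! line';
    }
    // Only the first line matters for the interpreter lookup.
    $firstLine = explode("\n", $contents, 2)[0];
    if (substr($firstLine, -1) === "\r") {
        $problems[] = 'first line ends with a carriage return (CRLF)';
    }
    return $problems;
}

// Example: a comment above the shebang, as described in the fix above.
print_r(check_shebang("#last updated\n#!/usr/local/bin/php\n<?php\n"));
```

To check a real file, pass it the result of file_get_contents() for the script path, for example the 10-wancarp file under /usr/local/etc/rc.syshook.d/carp/.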
Heya lavacano,
Found this gist through searching and it's exactly the solution I've been looking for. I was implementing it and am stuck at this step:
"On both nodes, you must have a gateway defined for your LAN failover path. This gateway's IP address should be the LAN CARP VIP (e.g., 10.0.1.1). This gateway must have a higher priority (i.e., a lower numerical value) than the WAN gateway. For example, set the LAN VIP gateway priority to 250 and leave the WAN gateway at its default of 254. This ensures that when the script disables the WAN IP on the backup node, the system's routing engine will automatically select the LAN VIP gateway as the new default route. "
Can you please elaborate on how to set this up? Like others, I have a single WAN IP via DHCP connected to a 4x 1Gb and 2x 10Gb hub upstream. The 10Gb ports are connected to each OPNsense firewall as WAN. I have both WAN nodes MAC spoofed to match. Under Gateways Configuration I have one WAN_GW defined on each. Then I'm using CARP for all LAN VLANs, DMZ, etc.
Thank you for the updated script and help with this. Can't wait to get this implemented and working reliably. Thank you.
-PiXEL8

it stays on.
The script now (v2.3) does, or should do, the following: on finding out it is in the BACKUP CARP state, it disables the WAN IPv4/IPv6 gateways, disables the WAN and tunnelbroker interfaces, and tears down states.
Then when it becomes MASTER, it enables all of those things. Since there can only be one master and one backup using PF, there is no issue with flapping.
The WAN MAC address can be cloned (the same on both routers) with this approach without flapping your ISP connection, since everything is taken down on the backup.
Some assumptions: you are using a tunnelbroker gif IPv6 tunnel, and the names of the WAN and tunnelbroker gateways are currently hardcoded rather than keyed.
The reason we keep a non-upstream LAN interface gateway is so the backup can have internet access; that is the extent of its utility, as far as I know.
Routing takes care of the non-upstream LAN gateway automatically: once the WAN upstream gateway is enabled, traffic meant for the WAN stops flowing through the non-upstream LAN gateway.
--
Here's what Gemma said about my explanation:
The "LAN Failover Gateway": A Path for the Backup Node
The LAN_FAILOVER_GW is a clever routing trick. It is not a real, physical gateway.
Its only job is to give the backup firewall a path to the internet by routing its own management traffic (for updates, NTP, etc.) through the active master firewall. This keeps the backup node online and ready to take over at a moment's notice.
How to Configure It
Here is the step-by-step guide to create the LAN Failover Gateway, based on the setup in your screenshot.
- Navigate: Go to System > Gateways > Single.
- Add: Click the "+" button to add a new gateway.
- Configure the Fields:
  - Disabled: Leave this unchecked.
  - Interface: Select your primary internal network, which is LAN.
  - Address Family: IPv4.
  - Name: Give it a descriptive name, like LAN_FAILOVER_GW.
  - Gateway: Enter the LAN CARP VIP address. In your case, this is 10.10.10.1.
  - Priority: Set this to a value that is better (a lower number) than your main WAN gateway. Your WAN_STATIC gateway has a priority of 254, so setting this to 250 is perfect.
  - Disable Gateway Monitoring: Check this box. This is critical. You don't want OPNsense trying to ping this gateway, as it's just a local VIP address. A failed ping would incorrectly mark it as down.
  - Upstream Gateway: Leave this unchecked. This tells OPNsense that it is an internal, local gateway, not one that leads directly to the internet.
- Save and Apply Changes.
How It Works in Practice
The magic of this setup lies in how OPNsense's routing engine uses a combination of gateway priority and the "Upstream" flag.
On the MASTER Node:
- The WAN_STATIC gateway is enabled by the script and active.
- Even though LAN_FAILOVER_GW has a better priority number (250 vs 254), the system will always prefer the WAN_STATIC gateway for internet traffic because it is marked as an upstream gateway.
- Result: All internet-bound traffic correctly goes out the WAN interface. The LAN_FAILOVER_GW is ignored.
On the BACKUP Node:
- The failover script runs and disables the WAN_STATIC gateway.
- The routing engine sees that the only available, enabled gateway is now LAN_FAILOVER_GW.
- The system is forced to create a new default route for itself pointing to the IP of the LAN_FAILOVER_GW (10.10.10.1).
- Result: When the backup firewall tries to access the internet, it sends its traffic to 10.10.10.1. Since the master node currently holds that CARP VIP, the traffic is routed through the master's LAN, NAT'd, and sent out to the internet via the master's working WAN connection.
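The selection rule described above can be written down as a small model. This is only a sketch of the explanation as given (enabled gateways only, upstream before non-upstream, then lower priority number first). It is not OPNsense's routing code, and the array keys and the function name pick_default_gateway are made up for illustration.

```php
<?php
// Simplified model of the described rule; not OPNsense's implementation.
// Each gateway is an array with hypothetical keys: name, priority,
// upstream (bool) and disabled (bool).
function pick_default_gateway(array $gateways): ?string
{
    // Disabled gateways are never candidates.
    $candidates = array_values(array_filter($gateways, fn($gw) => empty($gw['disabled'])));
    if (!$candidates) {
        return null;
    }
    // Upstream gateways sort first; within each group the lower
    // priority number wins.
    usort($candidates, function ($a, $b) {
        return [!$a['upstream'], $a['priority']] <=> [!$b['upstream'], $b['priority']];
    });
    return $candidates[0]['name'];
}

// Master: both gateways enabled. Backup: the script has disabled WAN_STATIC.
$lan = ['name' => 'LAN_FAILOVER_GW', 'priority' => 250, 'upstream' => false, 'disabled' => false];
$wan = ['name' => 'WAN_STATIC', 'priority' => 254, 'upstream' => true, 'disabled' => false];
echo pick_default_gateway([$lan, $wan]), "\n";
echo pick_default_gateway([$lan, ['disabled' => true] + $wan]), "\n";
```

In this model the master picks WAN_STATIC despite its worse priority number, and the backup falls back to LAN_FAILOVER_GW once WAN_STATIC is disabled, which matches the two cases walked through above.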
Thank you for the quick reply. I have the LAN_FAILOVER_GW set up now on both. Since I'm using DHCP on WAN, should I change the configuration options like this, or leave $wan_ip_v4 = ''; instead of 'DHCP' as below? Also, I'm not using IPv6; should that config option be empty as well? I'm also not sure about the tbroker gateway setting since I'm not using IPv6. Thank you.
// #################### CONFIGURATION ####################
$ifkey = 'wan';
$wan_ip_v4 = 'DHCP';
$wan_subnet_v4 = 30;
// Names of the gateways to manage, as they appear in System > Gateways > Single
$wan_gw_name = 'WAN_GW';
$tbroker_gw_name = '';
// The CARP VIP on your LAN for gateway redirection on the backup node.
$lan_vip_v4 = '10.10.99.1';
$lan_vip_v6 = '2600:1337::1';
This is working perfectly, THANK YOU!!
Please add Unbound DNS restart after master failover. Ty
After testing and about 20 iterations of the script since 2.9, my conclusion is that it is a much, much better setup to block these ports for DNS and dhcpd on the non-VIP router IP addresses, since those services are not CARP aware (what a joke).
With v4.7.3-final-fixed, should I undo:
net.inet.carp.init_delay = 60
and
mkdir -p /usr/local/etc/rc.syshook.d/config
ln -s /usr/local/etc/rc.syshook.d/carp/10-wancarp /usr/local/etc/rc.syshook.d/config/20-service-check
With 3.x code I was having issues with traffic passing after failover, so I'm currently using only one firewall with the other disconnected to have a stable network.
Also, if possible could you add an option to include additional interfaces with WAN to be enabled/disabled at failover? I have a server with dual NICs (team with active-backup) connected to each firewall. With both firewall interfaces enabled it eventually floods the switch stack even though it's an active-backup configuration. Thank you.
-PiXEL8
Give this a go; the top one
I am running 24.7.9_1, and I see the same error mentioned by bitcoredotorg.
I also tried the recent development branch as of this writing, and it is the same.
Implementing @bitcoredotorg's fix seemed to work well enough, though I had to edit it slightly. The script with his workaround looks like this for me:
error stack: