A while ago I wrote a simple but useful script to detect upstream provider HSRP failover events via traceroute, which I'm sharing here. It can be used for all kinds of virtual IP routing failover like VRRP and Check Point ClusterXL, for actual routing protocols like BGP/OSPF, or for similar technologies where IP packets can be routed across multiple hops.
The script executes traceroutes to a given destination and checks whether the path is being routed over a certain hop, with the ability to send mail notifications if this is not the case.
You can get the most recent version of this script on my GitHub here. If you have any suggestions or improvements (and I'm sure there is plenty of room for them), feel free to leave a comment, open an issue or send a pull request on GitHub.
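The core check is easy to sketch. The following is not the script from GitHub but a minimal POSIX shell sketch of the same idea. It assumes a traceroute that supports -n, -q and -w, plus a working mail command for notifications. The function names path_uses_hop and check_path are made up for illustration.

```shell
#!/bin/sh
# Minimal sketch of a traceroute-based path check (not the original script).
# Assumption: traceroute is run with -n, so every hop is printed as a bare IP.

# path_uses_hop EXPECTED_HOP
# Reads `traceroute -n` output on stdin and succeeds (exit 0) if any hop
# line lists EXPECTED_HOP as one of its responding addresses.
path_uses_hop() {
    expected="$1"
    # Only lines starting with a hop number are inspected, so the
    # "traceroute to ..." header is ignored wherever it is printed.
    awk -v hop="$expected" '
        $1 ~ /^[0-9]+$/ { for (i = 2; i <= NF; i++) if ($i == hop) found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# check_path DESTINATION EXPECTED_HOP [MAIL_TO]
# Runs the traceroute. If the expected hop is missing, it prints a warning
# and optionally mails it (assumes a working `mail` command on the host).
check_path() {
    dest="$1"; hop="$2"; rcpt="$3"
    if traceroute -n -q 1 -w 2 "$dest" 2>/dev/null | path_uses_hop "$hop"; then
        echo "OK: path to $dest goes via $hop"
        return 0
    fi
    msg="WARNING: path to $dest no longer goes via $hop (failover?)"
    echo "$msg"
    if [ -n "$rcpt" ]; then
        echo "$msg" | mail -s "Path change to $dest" "$rcpt"
    fi
    return 1
}
```

For example, running check_path 203.0.113.9 198.51.100.7 noc@example.com from cron would warn whenever the path to 203.0.113.9 stops going via 198.51.100.7. All addresses here are documentation placeholders. Matching only lines that start with a hop number keeps the check independent of whether a given traceroute prints its header to stdout or stderr.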
A while ago we were hit by an amplification/reflection DDoS attack against our public-facing network. I was familiar with NTP- and DNS-based reflection DDoS attacks, but this one employed the Simple Service Discovery Protocol (SSDP) to flood our tubes. It was a name I'd heard before and occasionally seen in packet traces, but to be honest I hardly knew anything about it.
SSDP is a UDP-based protocol for service discovery and UPnP functionality with an HTTP-like syntax. It’s deployed by modern operating systems and embedded systems like home routers, where it is sometimes enabled even on their external interfaces, which makes this kind of attack possible.
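To illustrate the HTTP-like syntax: an SSDP discovery request is a plain-text M-SEARCH message sent over UDP to port 1900. The sketch below just prints such a request, and the function name ssdp_msearch is hypothetical. A single small request can trigger several larger responses, one per advertised device or service, which is what makes SSDP attractive for reflection.

```shell
#!/bin/sh
# Sketch: build an SSDP M-SEARCH discovery request. SSDP reuses HTTP-style
# request lines and headers but is sent over UDP to port 1900.

# ssdp_msearch [SEARCH_TARGET]
# Prints the request with CRLF line endings and a terminating empty line,
# as in HTTP. ST defaults to ssdp:all (ask for every device and service).
ssdp_msearch() {
    st="${1:-ssdp:all}"
    printf 'M-SEARCH * HTTP/1.1\r\n'
    printf 'HOST: 239.255.255.250:1900\r\n'
    printf 'MAN: "ssdp:discover"\r\n'
    printf 'MX: 1\r\n'
    printf 'ST: %s\r\n' "$st"
    printf '\r\n'
}
```

To check whether one of your own hosts answers SSDP from the outside, you could pipe the request into a UDP-capable netcat, e.g. ssdp_msearch | nc -u -w 2 192.0.2.10 1900. Here 192.0.2.10 is a placeholder, and netcat flags differ slightly between variants.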
The Shadowserver Foundation has a nice website with lots of information and statistics on public SSDP-enabled devices. While the number of open or vulnerable DNS and NTP servers is steadily going down, there are currently around 14 million IPs around the world that respond to SSDP requests, and that number is declining only very slowly.
Due to this we can expect that SSDP will be abused for DDoS attacks more often in the future.
Among many other new features, the vSphere 5.1 distributed vSwitch brought us the Network Health Check feature. The main purpose of this feature is to ensure that all ESXi hosts attached to a particular distributed vSwitch can access all VLANs of the configured port groups and with the same MTU. This is really useful in situations where you’re dealing with many VLANs and the roles of virtualization and network admin are strongly separated.
Unfortunately, like pretty much all newer networking features, the health check is only included in the distributed vSwitch which requires vSphere Enterprise+ licenses.
There are a couple of other (cumbersome) options, though:
If you have ESXi 5.5, you can use the pktcap-uw utility on the ESXi shell to check whether your host receives frames for a specific VLAN on an uplink port:
The following example command will capture receive-side frames tagged with VLAN 100 on uplink vmnic3:
# pktcap-uw --uplink vmnic3 --dir 0 --capture UplinkRcv --vlan 100
If systems are active in this VLAN, you should already see a few broadcasts or multicasts, meaning the host is able to receive frames for this VLAN on this NIC. Repeat that for every physical vmnic uplink and VLAN.
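Repeating the capture by hand for every uplink and VLAN gets tedious. The following hypothetical pktcap_plan helper only prints the capture command from above for every uplink/VLAN combination. Nothing is captured until you run the printed commands yourself. The uplink names and VLAN IDs in the usage example are placeholders.

```shell
#!/bin/sh
# Sketch: print one pktcap-uw receive-side capture command per uplink/VLAN
# pair so each combination can be checked in turn. It only prints the
# commands; nothing is executed or captured.

# pktcap_plan "UPLINKS" "VLANS"   (both are space-separated lists)
pktcap_plan() {
    for nic in $1; do
        for vlan in $2; do
            echo "pktcap-uw --uplink $nic --dir 0 --capture UplinkRcv --vlan $vlan"
        done
    done
}
```

For example, pktcap_plan "vmnic2 vmnic3" "100 200" prints four commands, one for each combination of the two uplinks and two VLANs.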
Another way to check connectivity is to create a vmkernel interface for testing on this VLAN and use vmkping. Since manually configuring a vmkernel interface for every VLAN seems like a huge PITA, I came up with a short ESXi shell script to automate that task.
Check out the script below. It uses a CSV-style list to configure vmkernel interfaces with specific VLAN and IP settings and pings a specified IP on each network with the given payload size to verify the MTU configuration. This should at least catch initial network-side configuration errors when building a new infrastructure or adding new hosts.
This script was tested successfully on ESXi 5.5 and should work on ESXi 5.1 as well. I’m not entirely sure about 5.0 but that should be ok too. (Please leave a comment in case you can confirm/refute that).
Introducing the ghetto-vSwitchHealthCheck.sh:
Update: I have moved my scripts to GitHub and updated some of them a bit. You can find the current version of this particular script here.
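The vmkernel/vmkping approach boils down to a handful of esxcli calls per VLAN. The following is a minimal dry-run sketch, not the actual ghetto-vSwitchHealthCheck.sh: it prints the commands for each CSV line instead of running them. Several details are assumptions: the CSV layout (portgroup,vlan,ip,netmask,target,payload), the standard switch vSwitch0, and vmk9 as an otherwise unused test interface name. Port group names must not contain spaces or commas. For a don't-fragment ping, the payload is the MTU minus 28 bytes of IPv4 and ICMP headers, i.e. 1472 for MTU 1500 and 8972 for MTU 9000.

```shell
#!/bin/sh
# Dry-run sketch: prints esxcli/vmkping commands for each CSV line instead of
# executing them. Hypothetical CSV format (not necessarily the real script's):
#   portgroup,vlan,ip,netmask,target,payload
# Assumed defaults: standard switch vSwitch0, unused test interface vmk9.

VSWITCH="${VSWITCH:-vSwitch0}"
VMK="${VMK:-vmk9}"

plan_vlan_checks() {
    # Split each line on commas; skip empty lines and # comments.
    while IFS=, read -r pg vlan ip mask target size; do
        case "$pg" in ''|\#*) continue ;; esac
        echo "esxcli network vswitch standard portgroup add -p $pg -v $VSWITCH"
        echo "esxcli network vswitch standard portgroup set -p $pg --vlan-id $vlan"
        echo "esxcli network ip interface add -i $VMK -p $pg"
        echo "esxcli network ip interface ipv4 set -i $VMK -I $ip -N $mask -t static"
        # -d sets don't-fragment, so an MTU mismatch makes the ping fail
        echo "vmkping -I $VMK -d -s $size -c 3 $target"
        # Clean up again so the same interface name can be reused per line
        echo "esxcli network ip interface remove -i $VMK"
        echo "esxcli network vswitch standard portgroup remove -p $pg -v $VSWITCH"
    done
}
```

Once the printed commands look right, plan_vlan_checks < vlans.csv | sh on the ESXi host would execute them (vlans.csv being a placeholder file name). Since the test interface is removed after each ping, every CSV line can reuse the same vmkernel interface name.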
A few of our mail gateway servers running postfix/policyd-weight/amavis/SpamAssassin generate a lot of DNS queries to our DNS servers at times.
I'm not particularly concerned about that myself, but there were some discussions about whether and how we should reduce the volume of DNS queries.
One suggestion for an easy and convenient solution was to simply install the Name Service Cache Daemon (nscd) to cache responses locally on the mail server. They implemented this quickly on one of the servers, but it didn't really seem to work: the server still generated loads of queries, and the statistics output of nscd -g always displayed a 0% cache hit ratio and 0 cache hits on positive entries.
So they just as quickly abandoned the idea without digging into it deeper and more or less forgot about the whole plan in general, as it wasn’t like we had any real issues in the first place.
Another option discussed was setting up dedicated caching-only resolvers on the hosts themselves, which wouldn't have been difficult either.
Fast forward a few months: the so-called "issue" of too many DNS queries recently came up again, and I decided to look into nscd myself and find out why it supposedly didn't work.
Last week, Check Point released the first major update to the Gaia-introducing R75.40 version. The release of R75.45 mainly brings enhancements and fixes for the new Gaia platform:
R75.45 Release Notes
R75.45 Resolved Issues
R75.45 Known Limitations
R75.45 includes this R75.40 hotfix.
Among the new features, 6in4 tunneling and policy-based routing support appear to be the most intriguing (at least to me).
The long list of fixes addresses important issues such as long policy install time, kernel memory leaks or IPS not exiting bypass mode again after returning to normal CPU utilization.
Only direct upgrades from R75.40 are supported, not from earlier versions.
I haven't tried this new release yet, but given the list of fixes and the potential immaturity of the still-young Gaia platform, I'll check it out sooner rather than later.
We have a couple of 802.1Q VLAN interfaces on our Check Point firewalls for a number of small public DMZ networks, providing full layer 2 isolation for usually only one or two related servers per network. Using a /29 subnet for each VLAN means that, minus the network and broadcast addresses, we can assign 6 (public) IP addresses in each of those networks. We also need an IP for the virtual ClusterXL interface, which the servers use as their gateway. Up until recently, we also used 2 more IPs of each subnet for the members' VLAN interfaces, so they could communicate via Check Point's Cluster Control Protocol (CCP). This left us with merely 3 public IPv4 addresses per /29 subnet for actual servers inside those VLANs.
Configuring Cluster Addresses on Different Subnets
Wasting IPs like this on interfaces that are transparent to your actual services is not really practical in times of scarce IPv4 address space, though. So during our upgrade to R75.40 and Gaia on new systems, we stopped using IP addresses from the public subnets for the VLAN interfaces of each cluster member. This configuration is fully supported (and has been for a long time) as per the ClusterXL Admin Guide:
Only one routable IP address is required in a ClusterXL cluster, for the virtual cluster interface that faces the Internet. All cluster member physical IP addresses can be non-routable.
Configuring different subnets for the cluster IP addresses and the member addresses is useful in order to:
– Enable a multi-machine cluster to replace a single-machine gateway in a pre-configured network, without the need to allocate new addresses to the cluster members.
– Allow organizations to use only one routable address for the ClusterXL Gateway Cluster. This saves routable addresses.
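The address arithmetic above can be written out as a small shell function (usable_server_ips is a made-up name for illustration). A /29 has 8 addresses, which leaves 6 after the network and broadcast addresses and 3 after the cluster VIP and two member IPs. Once the members use a separate non-routable subnet, 5 remain for servers.

```shell
#!/bin/sh
# Worked arithmetic for the /29 example: how many IPv4 addresses in a DMZ
# subnet remain for servers after network/broadcast, the ClusterXL virtual
# IP and (optionally) one address per cluster member are taken.

# usable_server_ips PREFIX_LEN MEMBER_IPS
usable_server_ips() {
    prefix="$1"; members="$2"
    total=$(( 1 << (32 - prefix) ))     # /29 -> 8 addresses
    hosts=$(( total - 2 ))              # minus network and broadcast -> 6
    echo $(( hosts - 1 - members ))     # minus cluster VIP and member IPs
}
```

With member IPs in the public subnet, usable_server_ips 29 2 yields 3. With the members moved to a non-routable subnet, usable_server_ips 29 0 yields 5.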