Open Source Windows service for reporting server load back to HAProxy (load balancer feedback agent).

In general, when you load balance a cluster you can spread connections evenly across the servers and get fairly consistent load distribution. However, with some applications, such as RDS (Microsoft Terminal Servers), just a few users doing heavy work can put very high load on a single server. The solution is to use some kind of server load feedback agent. We’ve had one of these in our product for a while, but now, with a lot of help from Simon Horman, we’ve managed to integrate the functionality into the main branch of HAProxy (well, soon anyway). We thought it would be a good idea to open source the previous work on Ldirectord/LVS, make it compatible with HAProxy, and release our Windows service code under the GPL.

Until the work is merged and tested in an official release of HAProxy, we’ve compiled a patched version of HAProxy (dev19-ish) here (http://downloads.loadbalancer.org/agent/haproxy-agent-check-20130813.tar.gz), or you can get the patches from the mailing list archive.

UPDATE: The Loadbalancer.org feedback agent code is now supported in HAProxy 1.5-dev21+

Download the latest Windows Feedback Agent Service here: http://downloads.loadbalancer.org/agent/loadbalanceragent.msi (v4.5.1)

v4.5.1 : Multiple performance improvements + a couple of feature enhancements – backwards compatible with all Loadbalancer.org appliances layer 4 & 7.

Example HAProxy configuration

Simply compile as usual and then modify your RDS cluster:

listen RDSTest
	bind 192.168.69.22:3389
	mode tcp
	balance leastconn
	persist rdp-cookie
	server backup 127.0.0.1:9081 backup  non-stick
	tcp-request inspect-delay 5s
	tcp-request content accept if RDP_COOKIE
	timeout client 12h
	timeout server 12h
	option tcpka
	option redispatch
	option abortonclose
	maxconn 40000
	server Win2008R2 192.168.64.50:3389 weight 100 check agent-check agent-port 3333 inter 2000  rise 2  fall 3 minconn 0  maxconn 0  on-marked-down shutdown-sessions

The important bit, agent-check agent-port 3333, tells HAProxy to constantly monitor each backend server in the cluster by opening a TCP connection to port 3333 (much like a telnet session) and reading the response, which will usually be a percentage idle value, e.g.:

80% – I am not very busy please increase my weight and send me more traffic
10% – I’m busy please decrease my weight and stop sending me so much traffic
drain – Set the weight to 0 and gradually drain the traffic from this server for maintenance
stop – Stop all traffic immediately, kill this backend server
up ready 20% – Force HAProxy to bring the server up and set weight to 20% (irrespective of how it was taken down)
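
As a rough illustration of how a client might interpret these replies, here is a minimal Python sketch. It is not HAProxy's actual parser: the function name `parse_agent_reply` and the layout of the returned dictionary are assumptions made for this example, and only the words listed above are handled.

```python
def parse_agent_reply(reply: str) -> dict:
    """Split an agent reply such as 'up ready 20%' into its parts.

    Illustrative only: handles just the words listed in this post.
    """
    result = {"weight_percent": None, "words": []}
    for token in reply.strip().lower().split():
        if token.endswith("%") and token[:-1].isdigit():
            # A percentage idle value, e.g. "80%"
            result["weight_percent"] = int(token[:-1])
        elif token in {"up", "ready", "drain", "stop"}:
            # A state keyword, kept in the order it was sent
            result["words"].append(token)
        else:
            raise ValueError(f"unrecognised agent keyword: {token!r}")
    return result


print(parse_agent_reply("80%\n"))
print(parse_agent_reply("up ready 20%"))
```

Keeping the weight and the state words separate mirrors the way a reply can carry either or both.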

 

If you have a Linux backend you could create a simple service calling the following script:

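#!/bin/bash
# Report a 1 second average of CPU idle as a percentage, e.g. "95%".
# The second vmstat sample is used because the first one is an average since boot.
# Column 15 is "id" (idle) in the default vmstat layout; check this on your system.
LOAD=$(/usr/bin/vmstat 1 2 | /usr/bin/tail -1 | /usr/bin/awk '{print $15}')
echo "$LOAD%"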

Save the script as /usr/bin/lb-feedback.sh and make sure that it is executable:

chmod +x /usr/bin/lb-feedback.sh


Add this line to /etc/services:

lb-feedback 3333/tcp # loadbalancer.org feedback daemon

Now create the following file, /etc/xinetd.d/lb-feedback:

# default: on
# description: lb-feedback socket server
service lb-feedback
{
 port = 3333
 socket_type = stream
 flags = REUSE
 wait = no
 user = nobody
 server = /usr/bin/lb-feedback.sh
 log_on_success += USERID
 log_on_failure += USERID
 disable = no
}

Then change permissions and restart xinetd:

chmod 644 /etc/xinetd.d/lb-feedback
/etc/init.d/xinetd restart

You can now test this service by using telnet:

telnet 127.0.0.1 3333
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
95%
Connection closed by foreign host.
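
If you would rather script this check than use telnet, the same test can be sketched in Python. This is only an illustration: the function name `read_agent_reply` and its default timeout are assumptions, and it simply reads whatever the agent sends until the agent closes the connection.

```python
import socket


def read_agent_reply(host: str, port: int = 3333, timeout: float = 2.0) -> str:
    """Connect to the feedback agent and return its reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        chunks = []
        while True:
            data = conn.recv(1024)
            if not data:  # the agent closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace").strip()
```

Calling `read_agent_reply("127.0.0.1")` against the xinetd service above should return a string such as `95%`.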

If you have a Windows server as your backend, you can use our open source monitor service. You can download the Loadbalancer.org Windows feedback agent here (http://downloads.loadbalancer.org/agent/loadbalanceragent.msi).

Source code is here together with the binary: CpuMonitor_4.5.1.zip

Source code is now available on GitHub: Windows Feedback Agent

Once you have installed the Loadbalancer.org feedback service, you should find the monitor.exe file in Program Files/LoadBalancer.org.

[Image: Feedback]

Simply hit the ‘start’ button and the agent should start responding to telnet on port 3333 (you may need to make an exception for that port in your Windows firewall).

You can change the ‘mode’ setting to drain then ‘apply settings and restart’ and HAProxy will then set the weight to 0 and status to drain (blue) i.e.:

drain

Or you can set the ‘mode’ to halt then ‘apply settings and restart’ and HAProxy will then immediately set the status to DOWN (yellow) i.e.:

down

When the agent is running in normal mode it will report back the percentage idle of the system based on the settings in the feedback agent XML file:
<xml>
<Cpu>
<ImportanceFactor value="1" />
<ThresholdValue value="100" />
</Cpu>
<Ram>
<ImportanceFactor value="1" />
<ThresholdValue value="100" />
</Ram>
<TCPService>
<Name value="HTTP" />
<IPAddress value="*" />
<Port value="80" />
<MaxConnections value="0" />
<ImportanceFactor value="0" />
</TCPService>
<ReadAgentStatusFromConfig value="True" />
<ReadAgentStatusFromConfigInterval value="5" />
<AgentStatus value="Normal" />
<Interval value="10" />
</xml>


Notice that you can control both the importance of CPU & RAM utilization and also a threshold, so the following logic is used:

If CPU importance = 0 then ignore, 0.5 means give 50% importance, 1 means 100% importance
If RAM importance = 0 then ignore etc.

If ThresholdValue is reached on any monitor then immediately go into DRAIN mode (a value of 0 means no threshold is set).

This can be very useful if you have a small number of RDP sessions using a lot of RAM: simply set a ThresholdValue of 85 and, as soon as memory usage crosses that threshold, no new users will be sent to that server.

Otherwise, the agent calculates utilization by adding up the weighted load of each monitor and dividing by the number of factors involved, and the percentage idle it reports is what is left over from 100%, i.e.

If you are using two services then:

utilization = utilization + cpuLoad * cpuImportance%;
utilization = utilization + ramOccupied * ramImportance%;
utilization = utilization / 2

So if importance was 1 for both CPU and RAM, you would only get 0% reported if both CPU and RAM were at 100% (actually it gets a bit weird if one is already reporting 0%, but let’s not go…).

And if the importance is zero then ignore completely i.e.

utilization = utilization + cpuLoad * cpuImportance%;
//utilization = utilization + ramOccupied * 0 (importance is zero so ignore)
utilization = utilization (one service only so don’t divide)
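
The calculation described above can be sketched in Python as follows. This is an interpretation of the description in this post, not the agent's actual source code: the function name `percent_idle`, its parameters, the rule that the reported idle value is 100 minus the averaged utilization, and the "100%" result when every monitor is ignored are all assumptions made for illustration.

```python
def percent_idle(monitors, thresholds=None):
    """Return 'drain' or an idle percentage string such as '40%'.

    monitors:   {name: (load_percent, importance)}, importance from 0 to 1
    thresholds: {name: threshold_percent}, 0 or missing means no threshold
    """
    thresholds = thresholds or {}
    # Any monitor that reaches its threshold forces drain mode.
    for name, (load, _importance) in monitors.items():
        limit = thresholds.get(name, 0)
        if limit and load >= limit:
            return "drain"
    # Monitors with zero importance are ignored completely.
    active = [(load, imp) for load, imp in monitors.values() if imp > 0]
    if not active:
        return "100%"  # assumption: nothing monitored means fully idle
    utilization = sum(load * imp for load, imp in active) / len(active)
    idle = max(0, min(100, round(100 - utilization)))
    return f"{idle}%"


print(percent_idle({"cpu": (80, 1), "ram": (40, 1)}))   # utilization 60, prints 40%
print(percent_idle({"cpu": (80, 1), "ram": (40, 0)}))   # RAM ignored, prints 20%
print(percent_idle({"cpu": (20, 1), "ram": (90, 1)}, {"ram": 85}))  # prints drain
```

Checking the thresholds before averaging matches the rule that reaching a ThresholdValue on any monitor takes priority over the reported percentage.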

Also the final section TCPService effectively lets you load balance on number of established connections to your server, so you could balance based on the number of RDP connections to port 3389.

For this setting it is important to specify MaxConnections, as otherwise the agent will have no idea how to calculate the load, i.e.

utilization = (number of current connections / MaxConnections) * 100 * importance%

In the following screenshot from a Loadbalancer.org appliance you can see that the Win2008R2 server is healthy and 99% idle, whereas the Linux server was busy at 43% idle before the Linux agent was put into maintenance mode and the server taken out of the group.

[Image: sysoverview]

Does that make sense? Have a play with the config file and let us know what you think….

53 thoughts on “Open Source Windows service for reporting server load back to HAProxy (load balancer feedback agent)”

  1. Felix,
    Full support for the new agent will be incorporated in the Loadbalancer.org ENTERPRISE VA v7.6. The new version will also include external health check scripts, ssl re-encryption etc. I’ll update this post as soon as we have a release date (which should be very soon).
    NB. Current version is v7.6.2….

  2. Hi,

    We currently testing with this custom version of HAProxy. The backend servers are multiple Windows 2008R2 server with the loadbalancer.org agent.

    The problem is that the weight value is not decreased during heavy load. The agent is reporting the correct value. Please could someone help me to fix this problem? My config file:

    global
    daemon
    stats socket /var/run/haproxy.stat mode 600 level admin
    pidfile /var/run/haproxy.pid
    maxconn 40000
    ulimit-n 81000
    tune.maxrewrite 1024
    defaults
    mode http
    balance roundrobin
    timeout connect 4000
    timeout client 42000
    timeout server 43000
    listen RDP_Test
    bind 172.17.20.8:3389
    mode tcp
    balance leastconn
    option tcpka
    tcp-request inspect-delay 5s
    tcp-request content accept if RDP_COOKIE
    option tcpka
    timeout client 12h
    timeout server 12h
    option redispatch
    option abortonclose
    maxconn 40000
    server SRV-TS01 172.17.20.5:3389 weight 100 check agent-port 3333 inter 2000 rise 2 fall 3 minconn 0 maxconn 0 on-marked-down shutdown-sessions
    server SRV-TS02 172.17.20.6:3389 weight 100 check agent-port 3333 inter 2000 rise 2 fall 3 minconn 0 maxconn 0 on-marked-down shutdown-sessions
    server SRV-TS03 172.17.20.7:3389 weight 100 check agent-port 3333 inter 2000 rise 2 fall 3 minconn 0 maxconn 0 on-marked-down shutdown-sessions
    listen stats :7777
    stats enable
    stats uri /
    option httpclose
    stats auth loadbalancer:loadbalancer

    Thanks!

  3. Hi there,

    I was looking for something about HAProxy and found that post, and it’s funny because we already worked on this topic. We’ve developed such a feature in the past year for HAProxy, relying on the http check “disable-on-404” feature. In fact, we have a light service (.NET service with listening socket) that will return 200 if load is OK, 404 if not. The load trigger is fully configurable and could be based on CPU load, RAM load or both. The period over which the load is monitored is also configurable, and all settings are stored in the registry to permit configuration deployment through GPO. When the load is OK, the returned page also contains the actual load of the server.

  4. Willem,

    Could you provide some more detail?
    Do the weights in HAProxy change correctly under normal load?
    Is it when the real server is under high load only that the problem occurs?

  5. I’m using Feedback Agent 4.3.0 for Windows with HAProxy 1.5.3, there is an issue.

    When I set agent status from NORMAL to DOWN/DRAIN, HAProxy stats then marked this server as DOWN/DRAIN forever, even though later I set agent back to NORMAL and restart agent service, but the server still marked as DOWN/DRAIN in HAProxy stats. Unless I do a HAProxy reload, then the server back to NORMAL in HAProxy stats.

  6. Ken,

    The behaviour of the various drain modes in HAProxy has been improved recently, unfortunately our Windows agent needs a change to be compatible.

    It needs to report something like:
    “up ready 100%” after it comes out of drain mode (for 60 seconds or so…)

    Our dev team just started looking into this very issue last week. They will make a decision on the best way to change the Windows agent to be compatible with the new HAProxy implementation of the agent in the next couple of days.

    (taken from the email to the haproxy mailing list, ‘health check hell’)
    “up” : declare the server up, don’t change the configured weight
    “up 50%” : declare the server up, set weight to 50%
    “50%” : don’t touch the server state, just set the weight to 50%
    “drain” : don’t touch the state, nor weight, just switch to drain mode.
    “maint” : force maintenance mode.
    “drain 20%” : drain mode, adjust weight to 20% (not used in this mode but will avoid complex logics in agent scripts)
    “ready 30%” : leave maint/drain modes, start at 30% weight.
    “up ready 40%” : the agent does the 3 things at once and says the service is OK.
    “stopped drain 10%” : the agent does the 3 things at once and indicates that the server is now down after drain mode.

      • Ken,
        We have a beta version here if you would like to test the new drain/maint functionality:
        http://downloads.loadbalancer.org/agent/beta/LBCPUMonInstallation.msi
        The important value in the XML config is:
        Interval value="10"
        This tells the system to respond ‘ready up’ for 10 seconds after every ‘up’ state change.
        We are still looking into the other xen performance issue before releasing this version.
        Update 19th Sept – We massively increased the performance/response of the Windows agent in 4.5 Beta2… should resolve most issues (it was a silly bug).

        • Hi Malcolm,

          Tried the beta version, UP/DOWN switching works fine! I have another issue, well not really issue, just a question to ask.

          I found in my environment run with agent is almost no different from without it, distribution of connections is smooth but distribution of loads is not, CPU usage rates are far from equal compares among servers.

          Could the option below, which I use in HAProxy, be causing this issue?

          balance source

          I read from HAProxy doc this ‘source’ algorithm make changing a server’s weight on the fly take no effect.

  7. Hi,

    does it support virtual host on Xenserver? I have noticed that the weights in HAProxy change correctly on physical server but are static on virtual host. I have done some digging and the agent does not send any data on a virtual machine?

    Many thanks
    Ys

  8. Hi,

    Me2 but it just doesn’t work. The firewall is disabled and the only difference between the two servers is physical vs virtual. I am sure there is something wrong because on the physical server I can see plenty of connections to HAProxy, and on the virtual one just one that disappears after 2 seconds. Also, when I use Process Explorer to see what is actually going on: on the physical server I can see the connection, then the loadbalancer agent reading the data, stopping and sending, whereas on the virtual one it gets to reading the data and then it does not send, it just quits. Very bizarre. I reckon it may be something to do with the virtual adapter, but was wondering if you have more insight on this.

    Thanks

  9. I have installed and configured the loadbalancer agent on two additional servers. Same as before, one physical and one virtual. Outcome is the same. Now, two physical are working perfectly fine where virtual are not. Any ideas would be greatly appreciated.

    Thanks

  10. Hi Malcolm, details about the version below.

    BUILD_NUMBER=’59235p’
    PRODUCT_BRAND=’XenServer’
    PRODUCT_VERSION_TEXT_SHORT=’6.1′
    COMPANY_NAME_SHORT=’Citrix’
    PLATFORM_NAME=’XCP’
    PLATFORM_VERSION=’1.6.10′
    KERNEL_VERSION=’2.6.32.43-0.4.1.xs1.6.10.734.170748xen’
    BRAND_CONSOLE=’XenCenter’
    COMPANY_NAME=’Citrix Systems, Inc.’
    XEN_VERSION=’4.1.3′
    PRODUCT_VERSION_TEXT=’6.1′
    MANAGEMENT_ADDRESS_TYPE=’IPv4′
    PRODUCT_NAME=’xenenterprise’
    PRODUCT_VERSION=’6.1.0′
    INSTALLATION_DATE=’2013-03-22 11:56:01.977608′

    The only agents installed is XenAgent and PHD Virtual Backup Agent. Windows version 2008 R2 with all the updates, role Terminal Server. Firewall (build-in) is disabled, antivirus disabled (sophos).

    Regards,
    Ys

    • Ys,

      We had another user complaining about the agent failing on a virtual host, we are just doing some testing relating to how fast the agent can get a response about CPU speed in a virtual host and we shall let you know asap.

    • Ys,

      We think we have found a potential problem where HAProxy is giving up on waiting for the response (the default timeout for agent check is 2 seconds).
      Could you try the following global setting:
      timeout check 5s
      and potentially the backend setting
      agent check 5s
      And restart the agent + haproxy?

      UPDATE: New agent v4.5.1 is MUCH faster at responding under heavy load, but if you are still experiencing issues you might want to set the CPU ThresholdValue @ 85 which will definitely prevent it getting unresponsive at very high loads…..

  11. Hi,
    We are using our own feedback agent, and have utilisation feeding back ok.
    But I can’t get the load balancer to recognise the other modes (drain/halt).
    Currently for those modes we have tried returning:
    “halt”
    “down”
    “drain”
    “maint”
    and also tried adding a trailing “\n”.
    But the lbadmin system overview page does not show the service in these other states (it just shows a weighting of ‘5’ for any of the above cases).
    regards
    Charly

  12. hello, recently we started using the loadbalancer agent (v4.5.1) for windows. It is working as described but I have a problem that if the server(in our case a windows 2008R2) is restarted the agent doesn’t automatically set the mode to normal. The servers are restarted on a schedule. thanks in advance

    • Alban, this is fixed in the latest version of HAProxy (1.6Dev-xxx), which will be included in v7.6.4 of our appliance, due for release soon. Currently, HAProxy marks a server as ‘agent down’ when you turn it off, when it should be just ‘down’. As a workaround, you can go to the agent’s XML config file and increase the Interval value from its default of 10s to ensure the agent responds with ‘up ready’ for longer, so that HAProxy can then bring the server back online. You could try doubling it to begin with and see how that works, then adjust from there.

  13. Hello.

    I am using Agent 4.5.1 with HAProxy 1.5.2

    And faced with problem: when I load the server(R1) CPU up to 100 percent, the agent still connect to it.
    In this time telnet to port 3333 shows value 0 – 15%

    howto fix it?

    configuration file:

    global
    #uid 99
    #gid 99
    daemon
    stats socket /var/run/haproxy.stat mode 600 level admin
    log 127.0.0.1 local4
    maxconn 40000
    ulimit-n 81001
    pidfile /var/run/haproxy.pid
    defaults
    log global
    mode http
    timeout connect 10000
    timeout client 1h
    timeout server 1h
    balance leastconn

    listen stats :7777
    stats enable
    stats uri /
    stats hide-version
    option httpclose
    frontend F1
    bind *:3389
    maxconn 40000
    default_backend B1
    mode tcp
    option tcplog
    backend B1
    mode tcp
    option tcpka
    balance leastconn
    tcp-request inspect-delay 5s
    tcp-request content accept if RDP_COOKIE
    persist rdp-cookie
    stick-table type string size 204800 expire 120m
    stick on rdp_cookie(mstshash)
    server R1 10.10.10.209:3389 weight 100 check agent-check agent-port 3333 inter 2000 rise 2 fall 3 minconn 0 maxconn 0 on-marked-down shutdown-sessions
    server R2 10.10.10.210:3389 weight 100 check agent-check agent-port 3333 inter 2000 rise 2 fall 3 minconn 0 maxconn 0 on-marked-down shutdown-sessions
    option redispatch
    option abortonclose

  14. Hi Slaine,

    I’m not sure if I understand your issue correctly.. Please correct me if I don’t.

    The load balancer will continue to connect to the agent on the real server whatever the current load.

    The load threshold can be adjusted within the agent by clicking the configuration button and changing the threshold settings and restarting the agent. Lowering the threshold should put the server into drain mode at a lower CPU loading and therefore prevent connections to the server. Also, when in drain mode, existing connections will still be able to connect but new connections will be rejected.

  15. A couple things.
    1. Thanks for making this open source! I noticed the source code may not be the 4.5.1 version. It says it in the text, but the link goes to 4.5.0.
    2. When I have a lower CPU threshold, say 25, once in a while for just a second or two, the feedback agent will report 0%, even though there is no load on the server. I wanted the source so I could track the bug down and fix it myself, plus add a few things for our specific environment. If I find the bug, I’d gladly provide the fix.

    Thanks!

    • Hi Aaron,

      I’ve updated the Blog to include the link to our GitHub account which should have all the latest source code available.

      Personally I’m not aware of any problems with it incorrectly reporting 0% but any help in finding the cause of this issue is very much appreciated.

      If it’s easily reproduceable can you give us the steps to do so and a copy of your config so I can get one of our guys to take a look.

      • After running some tests, it wasn’t a code problem. It’s an operator problem. 😛 It seems to work fine.

  16. Hi!

    Testing HAProxy 1.5-2 with FeedbackAgent, all works fine and there is no issues.
    I have a question about rdp_cookie: how to make possible to always send the same user to the same server?
    There is my config:
    global
    daemon
    stats socket /var/run/haproxy.stat mode 600 level admin
    pidfile /var/run/haproxy.pid

    defaults
    mode http
    timeout connect 4000
    timeout client 42000
    timeout server 43000

    listen RDP_Test
    bind *:3389
    mode tcp
    balance leastconn
    option tcpka
    tcp-request inspect-delay 5s
    tcp-request content accept if RDP_COOKIE
    timeout client 12h
    timeout server 12h
    option tcpka
    option redispatch
    option abortonclose
    maxconn 40000
    server TS1 10.64.0.209:3389 weight 100 check agent-check agent-port 3333 inter 2000 rise 2 fall 3 minconn 0 maxconn 0 on-marked-down shutdown-sessions
    server TS2 10.64.0.210:3389 weight 100 check agent-check agent-port 3333 inter 2000 rise 2 fall 3 minconn 0 maxconn 0 on-marked-down shutdown-sessions

    thanks.

    • Hi David,

      I think you are missing 2 lines, one for the stick table and one to tell it what to store in that table :

      stick-table type string size 10240k expire 30m
      stick on rdp_cookie(mstshash) upper

  17. HAProxy 1.6-dev1, CentOS6

    Getting a segfault when trying connect to port 3389.

    segfault at 0 ip (null) sp 00007fff18a41268 error 14 in haproxy[400000+a4000]

    Compiling with next options:
    make TARGET=linux26
    make install

  18. HAProxy 1.5-dev21 Feedback Agent 4.5.1.
    Have a server that is running Windows 2008R2. When the CPU usage of that server reaches 100 percent and is under a very heavy load,
    value of the agent is ignored and users can connect to the problematic server.
    All hosts running on Qemu/KVM virtual environment.

    • When the Windows server is heavily loaded the agent will respond slowly. So you either need to increase your timeouts in haproxy:

      timeout check 5s
      agent check 5s

      Or change the threshold in the agent so that it reports overloaded at a lower threshold i.e. 75%


      ThresholdValue value=75

      • Where put the agent check directive?
        I get an error at starting haproxy:
        parsing [/etc/haproxy/haproxy.cfg:47] : unknown keyword ‘agent’ in ‘backend’ section

        increase timeout check and changes threshold in agent not helps.

        global
        #uid 99
        #gid 99
        daemon
        stats socket /var/run/haproxy.stat mode 600 level admin
        log 127.0.0.1 local4
        maxconn 40000
        ulimit-n 81001
        pidfile /var/run/haproxy.pid

        defaults
        log global
        #mode http
        timeout connect 10000
        timeout client 1h
        timeout server 1h
        balance leastconn
        timeout check 5s

        listen stats :1936
        mode http
        stats enable
        stats hide-version
        stats realm Haproxy\ Statistics
        stats uri /
        stats auth test:test

        frontend F1
        bind *:3389
        maxconn 40000
        default_backend RDSTest
        mode tcp
        option tcplog

        backend RDSTest
        balance leastconn
        stick-table type string size 10240k expire 3m
        stick on rdp_cookie(mstshash) #upper
        persist rdp-cookie
        tcp-request inspect-delay 5s
        tcp-request content accept if RDP_COOKIE
        #balance rdp-cookie
        timeout client 1h
        timeout server 1h
        option tcpka
        option abortonclose
        agent check 5s
        server TS1 10.64.0.209:3389 weight 100 check agent-check agent-port 3333 inter 2000 rise 2 fall 3 minconn 0 maxconn 0 on-marked-down shutdown-sessions
        server TS2 10.64.0.210:3389 weight 100 check agent-check agent-port 3333 inter 2000 rise 2 fall 3 minconn 0 maxconn 0 on-marked-down shutdown-sessions

  19. Pingback: loadbalancer.org – Linux feedback agent « David Ramsden – Network engineer, general geek, petrol + drum and bass head

  20. Hi all.
    Is it possible for the mstshash cookie to be deleted immediately when a client logs off from the server?

  21. Hi David,

    The RDP client should send the cookie during the Connection Initiation phase of the RDP Connection Sequence, so it should be present each time a user connects. The load balancer will continue to track an rdp (mstshash) cookie in a stick table for the remaining persistence time after the user disconnects. Please note that there have been a few reliability issues with RDP cookies as noted in this blog: http://blog.loadbalancer.org/microsoft-drops-support-for-mstshash-cookies/
