Heavy / Light optimization part II. Self monitoring and Back To Light

This article sheds light on the “self monitoring” configuration and continues the earlier article on Heavy / Light optimization logic.

A few posts ago we talked about Heavy / Light optimization logic and identified three reasons why AVIcode may revert the heavy / light flag for entrypoints. These reasons were:

    • Restarting the application
    • Changing the alerting threshold
    • The mysterious “self monitoring” feature

Today we’ll look at the “self monitoring” feature in general and, in particular, at the conditions under which it may decide to go “Back to Light”.

If you take a look into the “PMonitor.config” file, you may notice the following section:

<ss:selfMonitoring>
    <ss:globalTimonlyPerSecondThreashold enable="true" value="20000" mode="light"/>
    <ss:timeonlyNoisePerEntrypoint enable="true" value="2"/>
    <ss:backToLightRecycle enable="true" value="1440" mode="light"/>
    <ss:backToLightThreshold enable="true" value="50" mode="light"/>
    <ss:eventGroupsSizeThreshold enable="true" value="100"/>
</ss:selfMonitoring>


Note: ss:eventGroupsSizeThreshold may be absent from your config file since it is not there out of the box, but you can find it in the corresponding “Monitor.xsd”. If you need to change the value, add this parameter to the ss:selfMonitoring section explicitly.

Remember, we also discussed that Heavy / Light optimization can backfire by marking certain entrypoints “heavy” under high system load. When the system load returns to normal, these entrypoints will generate unnecessary overhead unless, of course, we figure out how to revert them to “light” mode automatically.

Perhaps the simplest way to do it is to revert the entrypoints, say, once per day. This is exactly what the ss:backToLightRecycle parameter does. The value 1440 is the number of minutes, which (leave that calculator alone) is exactly 24 hours. After that period the Agent clears the Heavy / Light states for all monitored Event Groups.
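
To make the recycle idea concrete, here is a minimal Python sketch. It is only a toy model: the class, its method names and the “Default.aspx” event group are made up for illustration and say nothing about how the Agent is actually implemented. The only value taken from the config is 1440.

```python
RECYCLE_MINUTES = 1440  # ss:backToLightRecycle value from the config section


class HeavyLightStates:
    """Hypothetical model of the heavy / light flags kept per event group."""

    def __init__(self, recycle_minutes=RECYCLE_MINUTES):
        self.recycle_minutes = recycle_minutes
        self.last_recycle = 0   # minute at which the states were last cleared
        self.heavy = set()      # event groups currently marked "heavy"

    def mark_heavy(self, event_group):
        self.heavy.add(event_group)

    def tick(self, now_minutes):
        """Clear all heavy flags once the recycle period has elapsed."""
        if now_minutes - self.last_recycle >= self.recycle_minutes:
            self.heavy.clear()
            self.last_recycle = now_minutes
            return True
        return False


states = HeavyLightStates()
states.mark_heavy("Default.aspx")   # made-up event group name
print(RECYCLE_MINUTES / 60)         # 24.0 (hours)
print(states.tick(600))             # False: only 10 hours have passed
print(states.tick(1440))            # True: a full day has passed
print(len(states.heavy))            # 0: everything is back to light
```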

OK, but what if I have an entrypoint that is slow once in a while but is OK 99% of the time? This is typical, for example, when mid-tier web services warm up and cause web front responses to be slow. This is when ss:backToLightThreshold comes into play. If the ratio [number of events that exceed the alerting threshold] / [number of events below the alerting threshold] exceeds that parameter, AVIcode will revert that entrypoint to “light” mode.

Alright, but what if I have a HUGE number of functions under monitoring (which can easily happen if you enable, for example, monitoring of all namespaces)? Light mode or not, we at least need to measure the duration of these functions. And if we have a lot of them, and I mean A LOT, then just working the stopwatch can slow the system down significantly.

The next optimization technique is more like a safety switch. It monitors the time spent on duration measurements and either reduces the number of monitored functions or turns the time-only monitoring off completely. This is done by ss:globalTimonlyPerSecondThreashold and ss:timeonlyNoisePerEntrypoint. You know that these safety switches are in effect if at least one of the following events is in the Intercept event log:

Event log message 1: Due to an unusually high system load, the threshold for functions processed per second has been exceeded. This may have resulted from time-only monitoring of a namespace with a large number of the functions. Please change your configuration settings to prevent this situation. Execution time monitoring for namespaces will be disabled.

Event log message 2: Due to an unusually high system load, the threshold for functions processed per second has been exceeded. This may have resulted from time-only monitoring of a namespace with a large number of the functions. Please change your configuration settings to prevent this situation. All monitoring will be disabled.

Event log message 3: Intercept Studio detected problem with your configuration settings. Time-only function monitoring of a resource has exceeded the allowable threshold. Some time-only functions will not be reported in the event for that resource until the resource is called again.

The 3rd message is a rarity. If you see it in the event log, it means that one of the entrypoints generated too much noise. Noise, in AVIcode terms, is the monitoring overhead, i.e. the time spent measuring the function duration. To measure a function’s duration, we have to call the QueryPerformanceCounter method twice (once when we enter, once when we leave). Therefore, under normal conditions the “noise” for a time-only function is exactly twice the duration of a QueryPerformanceCounter call. Under some exotic conditions (like very fast entrypoints or something fishy with performance counters) that noise may theoretically exceed the duration of the function itself (i.e. operating the stopwatch took longer than the event we were measuring). This kind of measurement doesn’t make sense, so AVIcode stops the measurement. If, however, that happens for a reason, ss:timeonlyNoisePerEntrypoint can be adjusted to allow for more “noise”. It represents the “normal noise”, measured in durations of a QueryPerformanceCounter call. You normally tweak the threshold iteratively, incrementing it by 1 each time.
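
The “stopwatch costs more than the event” situation can be shown with a few lines of Python. This is only an arithmetic sketch: the function names and the tick values are invented, and it does not model how ss:timeonlyNoisePerEntrypoint is evaluated inside the Agent.

```python
def stopwatch_noise(qpc_call_ticks, reads_per_function=2):
    """Overhead of timing one function: one timer read on entry, one on exit."""
    return reads_per_function * qpc_call_ticks


def measurement_makes_sense(function_ticks, qpc_call_ticks):
    """A measurement is pointless when the noise exceeds the measured duration."""
    return stopwatch_noise(qpc_call_ticks) <= function_ticks


# Made-up numbers: one QueryPerformanceCounter call costs 3 ticks.
print(stopwatch_noise(3))               # 6
print(measurement_makes_sense(100, 3))  # True: 6 ticks of noise vs 100 ticks of work
print(measurement_makes_sense(4, 3))    # False: the stopwatch took longer than the function
```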

The 1st and the 2nd messages indicate the same problem caused by two different configuration settings: too many time-only functions are being processed per second. AVIcode usually does it for a reason; however, if you know what you are doing, you may tweak ss:globalTimonlyPerSecondThreshold. It is important not to change the setting drastically, because modifying it may affect the performance of the monitored application.
You can calculate the per second threshold based on either server characteristics or process characteristics.

Per second threshold: Tuning based on server characteristics
In the Intercept event log you should find the following message.

Pay attention to the counter frequency in the green square; it will help us estimate the per second threshold based on server characteristics.

In the screenshot above we have 2603916/21/10. This message means that QueryPerformanceCounter ticks every 1/2603916 of a second, and that it takes 21 ticks to accumulate 10 counts.
That gives us (2603916 / 21) * 10 = 1239960 calls per second.
AVIcode suggests a multiplier of 0.025 to calculate the per second threshold, which gives us 30999 (= 1239960 * 0.025) in this scenario.
The value 0.025 follows from two assumptions: there are 2 QueryPerformanceCounter calls per method, and AVIcode wants the performance overhead to stay within 5% (0.05 / 2 = 0.025).
Once the threshold value is calculated, use it to update “PMonitor.config” (ss:globalTimOnlyPerSecondThreshold).
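
If you prefer not to do this by hand, the same arithmetic fits into a short Python helper. The function and parameter names are mine, not part of any AVIcode tooling; the numbers are the ones from the event log message discussed above.

```python
def server_based_threshold(frequency, ticks, counts,
                           overhead_budget=0.05, qpc_calls_per_method=2):
    """Estimate the per second threshold from the frequency/ticks/counts triple."""
    # How many QueryPerformanceCounter calls fit into one second.
    calls_per_second = round(frequency / ticks * counts)
    # 5% overhead budget split across 2 timer calls per method gives 0.025.
    multiplier = overhead_budget / qpc_calls_per_method
    return calls_per_second, round(calls_per_second * multiplier)


calls, threshold = server_based_threshold(2603916, 21, 10)
print(calls)      # 1239960
print(threshold)  # 30999
```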

Per second threshold: Tuning based on process characteristics
Open “PMonitor.config” and turn on enableAgentDiagnostic:
<ss:enableAgentDiagnostic value="true"/>
An application restart is not required after this change.

In Windows Performance Monitor, check the values of the “Intercept Agent/Timeonly calls per entrypoint” and “Intercept Agent/Entrypoint calls / sec” counters for the specific process.

Multiply these values to calculate the average number of time-only functions being processed per second.
Set the calculated value in “PMonitor.config” (ss:globalTimOnlyPerSecondThreshold).
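
As a quick illustration, the multiplication looks like this in Python. The function name and the sample counter readings (120 and 35.5) are invented; substitute the values you actually see in Performance Monitor.

```python
def process_based_threshold(timeonly_calls_per_entrypoint, entrypoint_calls_per_sec):
    """Average number of time-only functions processed per second."""
    return round(timeonly_calls_per_entrypoint * entrypoint_calls_per_sec)


# Made-up readings of the two "Intercept Agent" counters.
print(process_based_threshold(120, 35.5))  # 4260
```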

Per second threshold values are not really scientific, and they are more for your information, since they are calculated based on the assumption that QueryPerformanceCounter is the slowest call. That’s not always true.
You may ask which threshold, the one calculated from server characteristics or the one calculated from process characteristics, you should set as “globalTimOnlyPerSecondThreshold”. Well, you can start with the lower of the two values, but it is going to take a few iterations before you find a value that works for your environment.

After changing the parameters in the config file, save it and restart the monitored applications.

That’s it. I hope I was able to explain this not so obvious, and not so easy to grasp, aspect of performance optimization.

And finally, a few words about ss:eventGroupsSizeThreshold. The Agent uses a hash table to cache exception event groups. When a new exception group appears, it is added to the hash table, which increases its size. The hash table has a size limit, and if the limit is exceeded, the table is cleared. After that, only newly appearing exception groups are cached, while the existing entries are lost. The default value for this threshold is 50 Mb. Of course, you may specify a higher value of your own, but keep in mind that this will increase the memory usage of the monitored process.
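
The “clear everything when the limit is hit” behavior can be sketched in Python as follows. This is a simplified illustration, not the Agent’s code: the class is hypothetical, and it limits the number of entries rather than the memory size that the real threshold refers to.

```python
class ExceptionGroupCache:
    """Hypothetical cache that is wiped completely when it outgrows its limit."""

    def __init__(self, max_entries):
        self.max_entries = max_entries  # stand-in for the size threshold
        self._table = {}
        self.resets = 0                 # how many times the table was cleared

    def add(self, group_key, info=None):
        """Cache a new exception group; return False if it is already known."""
        if group_key in self._table:
            return False
        if len(self._table) >= self.max_entries:
            # Limit reached: all existing entries are lost, not just the oldest.
            self._table.clear()
            self.resets += 1
        self._table[group_key] = info
        return True

    def __contains__(self, group_key):
        return group_key in self._table

    def __len__(self):
        return len(self._table)


cache = ExceptionGroupCache(max_entries=2)
for key in ("NullReference", "Timeout", "SqlError"):  # made-up group names
    cache.add(key)

print(len(cache))                # 1: only the newest group survived
print("NullReference" in cache)  # False: lost when the table was cleared
print(cache.resets)              # 1
```

Clearing the whole table is cheap and keeps memory bounded, at the cost of re-learning groups that were already known. That is why a higher threshold trades memory for fewer resets.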

You should now have a solid, comprehensive view of these AVIcode optimization options and be able to tweak the configuration in accordance with your load expectations. Correct optimization settings are the most direct way to adapt monitoring to your specific environment.

Good luck!
