Helix monitoring metrics are exposed as the MBeans attributes. The MBeans are registered based on instance role.

The easiest way to see the available metrics is using jconsole and point it at a running Helix instance. This will allow browsing all metrics with JMX.

Note that if not mentioned in the attribute name, all attributes are gauge by default.

Metrics on Both Controller and Participant

MBean ZkClientMonitor

ObjectName: “HelixZkClient:type=[client-type],key=[specified-client-key],PATH=[zk-client-listening-path]”

Attributes Description
ReadCounter Zk Read counter. Which could be used to identify unusually high/low ZK traffic
WriteCounter Same as above
ReadBytesCounter Same as above
WriteBytesCounter Same as above
StateChangeEventCounter Zk connection state change counter. Which could be used to identify ZkClient unstable connection
DataChangeEventCounter Zk node data change counter. which could be used to identify unusual high/low ZK events occurrence or slow event processing
PendingCallbackGauge Number of the pending Zk callbacks.
TotalCallbackCounter Number of total received Zk callbacks.
TotalCallbackHandledCounter Number of total handled Zk callbacks.
ReadTotalLatencyCounter Total read latency in ms.
WriteTotalLatencyCounter Total write latency in ms.
WriteFailureCounter Total write failures.
ReadFailureCounter Total read failures.
ReadLatencyGauge Histogram (with all statistic data) of read latency.
WriteLatencyGauge Histogram (with all statistic data) of write latency.
ReadBytesGauge Histogram (with all statistic data) of read bytes of single Zk access.
WriteBytesGauge Histogram (with all statistic data) of write bytes of single Zk access.

MBean HelixCallbackMonitor

ObjectName: “HelixCallback:Type=[callback-type],Key=[cluster-name].[instance-name],Change=[callback-change-type]”

Attributes Description
Counter Zk Callback counter for each Helix callback type.
UnbatchedCounter Unbatched Zk Callback counter for each helix callback type.
LatencyCounter Callback handler latency counter in ms.
LatencyGauge Histogram (with all statistic data) of Callback handler latency.

MBean MessageQueueMonitor

ObjectName: “ClusterStatus:cluster=[cluster-name],messageQueue=[instance-name]”

Attributes Description
MessageQueueBacklog Get the message queue size

Metrics on Controller only

MBean ClusterStatusMonitor

ObjectName: “ClusterStatus:cluster=[cluster-name]”

Attributes Description
DisabledInstancesGauge Current number of disabled instances
DisabledPartitionsGauge Current number of disabled partitions number
DownInstanceGauge Current down instances number
InstanceMessageQueueBacklog The sum of all message queue sizes for instances in this cluster
InstancesGauge Current live instances number
MaxMessageQueueSizeGauge The maximum message queue size across all instances including controller
RebalanceFailureGauge None 0 if previous rebalance failed unexpectedly. The Gauge will be set every time rebalance is done.
RebalanceFailureCounter The number of failures during rebalance pipeline.
Enabled 1 if cluster is enabled, otherwise 0
Maintenance 1 if cluster is in maintenance mode, otherwise 0
Paused 1 if cluster is paused, otherwise 0

MBean ClusterEventMonitor

ObjectName: “ClusterStatus:cluster=[cluster-name],eventName=ClusterEvent,phaseName=[event-handling-phase]”

Attributes Description
TotalDurationCounter Total event process duration for each stage.
MaxSingleDurationGauge Max event process duration for each stage within the recent hour.
EventCounter The count of processed event in each stage.
DurationGauge Histogram (with all statistic data) of event process duration for each stage.

MBean InstanceMonitor

ObjectName: “ClusterStatus:cluster=[cluster-name],instanceName=[instance-name]”

Attributes Description
Online This instance is Online (1) or Offline (0)
Enabled This instance is Enabled (1) or Disabled (0)
TotalMessageReceived Number of messages sent to this instance by controller
DisabledPartitions Get the total disabled partitions number for this instance

MBean ResourceMonitor

ObjectName: “ClusterStatus:cluster=[cluster-name],resourceName=[resource-name]”

Attributes Description
PartitionGauge Get number of partitions of the resource in best possible ideal state for this resource
ErrorPartitionGauge Get the number of current partitions in ERORR state for this resource
DifferenceWithIdealStateGauge Get the number of how many replicas' current state are different from ideal state for this resource
MissingTopStatePartitionGauge Get the number of partitions do not have top state for this resource
ExternalViewPartitionGauge Get number of partitions in ExternalView for this resource
TotalMessageReceived Get number of messages sent to this resource by controller
LoadRebalanceThrottledPartitionGauge Get number of partitions that need load rebalance but were throttled.
RecoveryRebalanceThrottledPartitionGauge Get number of partitions that need recovery rebalance but were throttled.
PendingLoadRebalancePartitionGauge Get number of partitions that have pending load rebalance requests.
PendingRecoveryRebalancePartitionGauge Get number of partitions that have pending recovery rebalance requests.
MissingReplicaPartitionGauge Get number of partitions that have replica number smaller than expected.
MissingMinActiveReplicaPartitionGauge Get number of partitions that have replica number smaller than the minimum requirement.
MaxSinglePartitionTopStateHandoffDurationGauge Get the max duration recorded when the top state is missing in any single partition.
FailedTopStateHandoffCounter Get the number of total top state transition failure.
SucceededTopStateHandoffCounter Get the number of total top state transition done successfully.
SuccessfulTopStateHandoffDurationCounter Get the total duration of all top state transitions.
PartitionTopStateHandoffDurationGauge Histogram (with all statistic data) of top state transition duration.

MBean PerInstanceResourceMonitor

ObjectName: “ClusterStatus:cluster=[cluster-name],instanceName=[instance-name],resourceName=[resource-name]”

Attributes Description
PartitionGauge Get number of partitions of the resource in best possible ideal state for this resource on specific instance

MBean JobMonitor

ObjectName: “ClusterStatus:cluster=[cluster-name],jobType=[job-type]”

Attributes Description
SuccessfulJobCount Get number of the succeeded jobs
FailedJobCount Get number of failed jobs
AbortedJobCount Get number of the aborted jobs
ExistingJobGauge Get number of existing jobs registered
QueuedJobGauge Get numbers of queued jobs, which are not running jobs
RunningJobGauge Get numbers of running jobs
MaximumJobLatencyGauge Get maximum latency of jobs running time. It will be cleared every hour
JobLatencyCount Get total job latency counter.

MBean WorkflowMonitor

ObjectName: “ClusterStatus:cluster=[cluster-name],workflowType=[workflow-type]”

Attributes Description
SuccessfulWorkflowCount Get number of succeeded workflows
FailedWorkflowCount Get number of failed workflows
FailedWorkflowGauge Get number of current failed workflows
ExistingWorkflowGauge Get number of current existing workflows
QueuedWorkflowGauge Get number of queued but not started workflows
RunningWorkflowGauge Get number of running workflows
WorkflowLatencyCount Get workflow latency count
MaximumWorkflowLatencyGauge Get maximum workflow latency gauge. It will be reset in 1 hour.

Metrics on Participant only

MBean StateTransitionStatMonitor

ObjectName: “CLMParticipantReport:Cluster=[cluster-name],Resource=[resource-name],Transition=[transaction-id]”

Attributes Description
TotalStateTransitionGauge Get the number of total state transitions
TotalFailedTransitionGauge Get the number of total failed state transitions
TotalSuccessTransitionGauge Get the number of total succeeded state transitions
MeanTransitionLatency Get the average state transition latency (from message read to finish)
MaxTransitionLatency Get the maximum state transition latency
MinTransitionLatency Get the minimum state transition latency
PercentileTransitionLatency Get the percentile of state transitions latency
MeanTransitionExecuteLatency Get the average execution latency of state transition (from task started to finish)
MaxTransitionExecuteLatency Get the maximum execution latency of state transition
MinTransitionExecuteLatency Get the minimum execution latency of state transition
PercentileTransitionExecuteLatency Get the percentile of execution latency of state transitions

MBean ThreadPoolExecutorMonitor

ObjectName: “HelixThreadPoolExecutor:Type=[threadpool-type]” (threadpool-type in Message.MessageType, BatchMessageExecutor, Task)

Attributes Description
ThreadPoolCoreSizeGauge Thread pool size is as configured. Aggregate total thread pool size for the whole cluster.
ThreadPoolMaxSizeGauge Same as above
NumOfActiveThreadsGauge Number of running threads.
QueueSizeGauge Queue size. Could be used to identify if too many HelixTask blocked in participant.

MBean MessageLatencyMonitor

ObjectName: “CLMParticipantReport:ParticipantName=[instance-name],MonitorType=MessageLatencyMonitor”

Attributes Description
TotalMessageCount Total message count
TotalMessageLatency Total message latency in ms
MessagelatencyGauge Histogram (with all statistic data) of message processing latency.

MBean ParticipantMessageMonitor

ObjectName: “CLMParticipantReport:ParticipantName=[instance-name]”

Attributes Description
ReceivedMessages Number of received messages
DiscardedMessages Number of discarded messages
CompletedMessages Number of completed messages
FailedMessages Number of failed messages
PendingMessages Number of pending messages to be processed

Back to top

Reflow Maven skin maintained by Olivier Lamy.

Apache Helix, Apache, the Apache feather logo, and the Apache Helix project logos are trademarks of The Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their respective owners.
Privacy Policy