How to measure resource usage at the partition level in Service Fabric?

故里飘歌, asked 2021-02-06 19:45

With Service Fabric we get the tools to create custom metrics and capacities. This way we can make our own resource models that the resource balancer acts on at runtime.

1 Answer
攒了一身酷, answered 2021-02-06 20:13

    Not only are service instances and replicas hosted in the same process, but they also share a thread pool by default in .NET! Every time you create a new service instance, the platform actually just creates an instance of your service class (the one that derives from StatefulService or StatelessService) inside the host process. This is great because it's fast, cheap, and you can pack a ton of services into a single host process and onto each VM or machine in your cluster.

    But it also means resources are shared, so how do you know how much each replica of each partition is using?

    The answer is that you report load on virtual resources rather than physical resources. The idea is that you, the service author, can keep track of some measurement about your service, and you formulate metrics from that information. Here is a simple example of a virtual resource that's based on physical resources:

    Suppose you have a web service. You run a load test on your web service and you determine the maximum requests per second it can handle on various hardware profiles (using Azure VM sizes and completely made-up numbers as an example):

    • A2: 500 RPS
    • D2: 1000 RPS
    • D4: 1500 RPS

    Now when you create your cluster, you set your capacities accordingly based on the hardware profiles you're using. So if you have a cluster of D2s, each node would define a capacity of 1000 RPS.
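    The balancer-side arithmetic this implies is roughly: a node can host another replica only if the sum of the loads its current replicas report, plus the new replica's load, stays within the node's capacity. A minimal sketch of that check (this is an illustration, not Service Fabric's actual placement algorithm; the 1000-RPS capacity is the D2 figure from the made-up numbers above):

```python
def can_place(node_loads, new_load, capacity=1000):
    """Return True if a replica reporting new_load fits on a node whose
    existing replicas report node_loads, given a per-node capacity
    (1000 RPS for the hypothetical D2 profile above)."""
    return sum(node_loads) + new_load <= capacity

# A D2 node already hosting replicas reporting 400 and 450 RPS:
can_place([400, 450], 100)   # fits: 950 <= 1000
can_place([400, 450], 200)   # does not fit: 1050 > 1000
```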

    Then each instance (or replica if stateful) of your web service reports an average RPS value. This is a virtual resource that you can easily calculate per instance/replica. It corresponds to some hardware profile, even though you're not reporting CPU, network, memory, etc. directly. You can apply this to anything that you can measure about your services, e.g., queue length, concurrent user count, etc.
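    Service Fabric's actual reporting API is C# (`ReportLoad` on the partition object), so the sketch below is only a Python illustration of the per-replica bookkeeping; the `report_load` callback is a stand-in for the real API, and the "RPS" metric name and 60-second window are made up for the example:

```python
import time
from collections import deque

class RpsMetric:
    """Tracks request timestamps and reports average RPS over a sliding
    window. report_load(name, value) stands in for the platform's
    load-reporting call."""

    def __init__(self, report_load, window_seconds=60, clock=time.monotonic):
        self._report_load = report_load
        self._window = window_seconds
        self._clock = clock
        self._timestamps = deque()

    def record_request(self):
        # Call this once per handled request.
        self._timestamps.append(self._clock())

    def report(self):
        now = self._clock()
        # Drop requests that fell out of the averaging window.
        while self._timestamps and now - self._timestamps[0] > self._window:
            self._timestamps.popleft()
        rps = round(len(self._timestamps) / self._window)
        self._report_load("RPS", rps)
        return rps
```

    Each instance or replica would call `record_request` on every request and `report` on a timer, so the balancer always sees a recent average rather than an instantaneous spike.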

    If you don't want to define a capacity as specific as requests per second, you can take a more general approach by defining physical-ish capacities for common resources, like memory or disk usage. But what you're really doing here is defining usable memory and disk for your services rather than total available. In your services you can keep track of how much of each capacity each instance/replica uses. But it's not a total value, it's just the stuff you know about. So for example if you're keeping track of data stored in memory, it wouldn't necessarily include runtime overhead, temporary heap allocations, etc.
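    A sketch of that bookkeeping, again in illustrative Python (the "MemoryKB" metric name is an assumption; the point is that only bytes the replica knowingly stores are counted):

```python
class TrackedMemoryMetric:
    """Tracks only the bytes this replica knowingly stores, not total
    process memory: runtime overhead and temporary allocations are
    deliberately excluded, so the reported value stays smooth."""

    def __init__(self, report_load):
        self._report_load = report_load
        self._bytes = 0

    def on_store(self, num_bytes):
        self._bytes += num_bytes

    def on_evict(self, num_bytes):
        self._bytes -= num_bytes

    def report(self):
        load_kb = self._bytes // 1024
        self._report_load("MemoryKB", load_kb)
        return load_kb
```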

    I have an example of this approach in a Reliable Collection wrapper I wrote that reports load metrics strictly on the amount of data you store, by counting bytes: https://github.com/vturecek/metric-reliable-collections. It doesn't report total memory usage, so you have to come up with a reasonable estimate of how much overhead you need and define your capacities accordingly. At the same time, by not reporting temporary heap allocations and other transient memory usage, the reported metrics should be much smoother and more representative of the data you're actually storing (you don't necessarily want to re-balance the cluster simply because the .NET GC hasn't run yet, for example).
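    In the spirit of that wrapper (which is written in C# against Reliable Collections), here is a hypothetical Python sketch of byte-counting on a plain dictionary: each write adjusts a running total by the serialized size of the key and value, so overwrites don't double-count and the total reflects only stored data, never transient allocations:

```python
import pickle

class MeteredDict:
    """Dictionary wrapper that keeps a running count of serialized bytes
    stored, per key. stored_bytes is the value you would feed into a
    load metric report."""

    def __init__(self):
        self._data = {}
        self._sizes = {}      # serialized size currently charged per key
        self.stored_bytes = 0

    def set(self, key, value):
        size = len(pickle.dumps(key)) + len(pickle.dumps(value))
        # Charge the delta so overwriting a key doesn't double-count it.
        self.stored_bytes += size - self._sizes.get(key, 0)
        self._sizes[key] = size
        self._data[key] = value

    def get(self, key):
        return self._data[key]

    def remove(self, key):
        del self._data[key]
        self.stored_bytes -= self._sizes.pop(key)
```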
