This article is best enjoyed after having read my piece on metrics. If you have not done so, please go back and give it a read before diving in here; it will all make a lot more sense.
What test are we trying to pass?
In the hypothetical case that you have beaten the odds: your safety requirements are translated into metrics, a mature incident review process is in place, and a rigorous data collection and management system is running. You are still left with one more hurdle: what constitutes a passing grade for these metrics? To be specific: what do you expect to see in the metrics to allow you to continue operating, or to increase the complexity or risk associated with your operations?
Assuming a well argued and well documented safety metrics dashboard is in place, various pieces are still missing to give you confidence to (further) deploy safely. From a system level perspective, where does the threshold between safe and unsafe lie? Does that threshold move, and under what conditions? Do you measure unknowns in your system? Do non-safety related metrics impact the threshold? How do metrics relate to one another? Can one well performing metric balance out a poorly performing one?
For lower level safety metrics, decisions must be made as well. How many instances of x are too many, and why? Can one metric offset another, by arguing something about exposure, for example? What is the process for changing the passing criteria? And how do all metrics feed into the overall system score? Is some data weighted more heavily, and why? Is uncertainty or noise in the data taken into account too?
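None of these questions answer themselves, but to make them a bit more concrete, here is a minimal, entirely hypothetical sketch of how lower level metrics could be weighted and rolled up into a single system score while accounting for noise. Every metric name, number, and weight in it is invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    value: float   # measured value, e.g. events per 1,000 km
    noise: float   # estimated uncertainty on that value
    limit: float   # documented maximum acceptable value
    weight: float  # relative importance in the overall score

def system_score(metrics: list[Metric]) -> float:
    """Weighted fraction of metrics that pass, where a metric fails
    if its value plus its noise estimate exceeds the limit."""
    total = sum(m.weight for m in metrics)
    passed = sum(m.weight for m in metrics if m.value + m.noise <= m.limit)
    return passed / total

metrics = [
    Metric("hard_braking_per_1000_km", value=2.1, noise=0.4, limit=3.0, weight=1.0),
    Metric("vru_close_passes_per_1000_km", value=0.3, noise=0.1, limit=0.2, weight=3.0),
]
print(system_score(metrics))  # 0.25 -- the heavily weighted metric fails
```

Note that in this sketch a failing metric can still be balanced out by enough passing ones; whether that is acceptable for your system is exactly the kind of decision that needs to be documented.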
I can propose only one approach to resolving most of these questions, and it is flawed.
How do you score the test?
My approach to this problem is to set passing criteria based on thoroughly documented intentions and definitions of safety for the company or project itself. If you are deploying robots intended to help around the house, what are the self-imposed and regulation-required levels of acceptable risk? From there, you can argue for the robustness and reliability of the metrics in predicting when those regulations or limits would be broken. Performing a Fault Tree Analysis or similar on your top level safety requirements may help highlight how seemingly unrelated lower level aspects of your system can still result in a serious issue, and it can further bolster the quality of your safety argument. It is important to take the accuracy and uncertainties of the metrics into account when setting acceptance thresholds. Build uncertainty buffers into your passing criteria where you can, and employ processes to further enhance the validity of those criteria. Software and hardware updates, as well as newly added features, should trigger a review of the passing criteria.
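To illustrate what an uncertainty buffer could look like in practice, here is a minimal sketch. The metric, the limit, and the width of the buffer are placeholders; in a real safety case each would have to be derived from your documented definitions of acceptable risk.

```python
def passes(measured: float, std_error: float, limit: float, buffer_sigmas: float = 2.0) -> bool:
    """A metric only passes if its measured value plus an uncertainty
    buffer stays below the documented acceptance limit."""
    return measured + buffer_sigmas * std_error < limit

# Hypothetical example: a documented limit of 3.0 hard-braking events
# per 1,000 km, measured at 2.4 with a standard error of 0.4.
print(passes(measured=2.4, std_error=0.4, limit=3.0))  # False: 2.4 + 0.8 >= 3.0
```

The specific formula matters less than the principle: the passing criterion itself encodes how much you trust the measurement, so a noisier metric has to perform further away from the limit before it counts as passing.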
A downside to this approach is that it requires a relatively simple operational design domain in order to document all risks and regulations associated with a deployment. When the environment becomes more complex, such as when a robot operates on public roads, the well defined list of safety definitions quickly becomes a confusing tangle. To solve this problem, consider defining safety limits and thresholds at a lower level, covering all parts of your operation. You can set a threshold on managing three-way stop intersections for your robot, for example, in addition to setting thresholds on the measured behavior around vulnerable road users.
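A sketch of what such lower level limits could look like is below. The scenarios, units, and numbers are again invented, and I have made the (debatable) choice that no scenario can offset another:

```python
# Hypothetical lower level limits, one per part of the operation.
scenario_limits = {
    "three_way_stop_negotiation": 1.0,  # failed negotiations per 1,000 encounters
    "vulnerable_road_user_gaps": 0.1,   # minimum-gap violations per 1,000 encounters
}

measured = {
    "three_way_stop_negotiation": 0.6,
    "vulnerable_road_user_gaps": 0.3,
}

# Every part of the operation must pass on its own; a good score in
# one scenario cannot offset a bad score in another.
all_pass = all(measured[s] <= limit for s, limit in scenario_limits.items())
print(all_pass)  # False: the vulnerable road user limit is exceeded
```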
An approach to safety thresholds that I have seen used, but do not recommend, is using a single example of operation as the safety bar. Essentially, the robot's performance on a good day is captured, and the measured values from that day become the benchmark against which other days are judged. This is problematic because metrics can shift for many different internal and external reasons, and it is unlikely that full root cause analyses are performed to determine whether a deviation in the metrics is safety critical or not.
No matter what approach you decide on for setting safety thresholds, I recommend you document your work on this topic really, really well.
Did you pass a test or did you graduate from school?
All of this takes an incredible amount of work, but what have you truly accomplished? Depending on the coverage and quality of your metrics and surrounding processes, the way your safety case is argued, and how the metrics feed into it, you may still not have all the information needed to deploy safely. It is critical to be honest and clear about where you are still uncertain about safety, and about which aspects undermine confidence in the data you do have.
This work is important and relevant, as metrics are being used to deploy systems on public roads today. Autonomous vehicle companies have been using data to publicly argue their safety performance, sometimes comparing themselves to average human drivers. This is a complex topic to cover, as there are many different opinions on the matter and few companies or regulators agree on anything in this field. If you are interested in reading more about this, I can recommend Phil Koopman’s recent article on statistical safety.
If you would like to learn more about how to build safer systems and get my articles in your inbox every week, please subscribe.