The Ethics of AI-Enabled Weapon Systems: Testing and Evaluating

Jovana Davidovic | Philosophy Department, University of Iowa; Stockdale Center, USNA; BABL AI


Abstract: Testing, evaluating, validating, and verifying (TEVV) AI-enabled weapons systems is a crucial step for assuring justice in war. Doing so, at a minimum, minimizes the chances of jus in bello violations. However, TEVV of AI-enabled weapons systems is complicated by the nature of the machine-learning algorithms that enable these weapons. This paper addresses the complexity of TEVV for AI-enabled weapons and offers potential solutions to mitigate the ethical risks that emerge from developing and using AI-enabled weapon systems. The integration of AI into weapon systems – whether for decision support or targeting – presents distinctive risks that must be addressed by any military force that includes these weapons in its arsenal. To minimize such risks, this paper argues that we should engage in rigorous testing, evaluation, validation, and verification (TEVV) of AI-enabled weapons. Such testing should be a) cradle to grave and b) modular and principled, and it should be followed by c) gradual fielding within d) clearly defined operational envelopes, with e) appropriate explainability; this should take place in parallel with the legal review of these weapons.

A crucial step in assuring just military preparedness is assuring the safety and reliability of weapons systems. Just development and deployment of weapon systems requires that a state engage in the so-called TEVV (testing, evaluation, validation, and verification) process before deploying such weapons systems.[1] Premature deployment of systems that have not been adequately tested for safety and reliability increases the risk of harm to innocent persons and increases the risk of jus in bello violations.[2]

Ordinarily, the TEVV process aims at assuring that a piece of machinery or technology works as anticipated; it aims at assuring predictable performance, which in turn serves as the basis for trusting that a system will operate as expected when it is deployed. But weapons systems are no ordinary technology. Weapons systems, by their nature, have the potential to cause significant harm. A weapon system's TEVV is thus not only meant to assure reliable performance, but also to provide assurances regarding the weapon's compliance with jus in bello rules. After all, the operational effectiveness of a weapon system depends on its ability to comply with jus in bello rules, like distinction.[3] A weapon system's TEVV is thus aimed at building the right kind of calibrated trust in commanders who decide to deploy the weapon system and operators who use it.[4] Such trust is calibrated when the warfighter's operational reliance aligns with the system's performance for the context.[5] Such trust is of the right kind when it is grounded both in the predictability of performance and in the values embedded in the weapon and the TEVV process. In other words, weapons system TEVV must provide grounds not only for instrumental predictability-based trust, but also for values-based trust.[6] Heather Roff and David Danks distinguish between these two senses of ‘trust’.[7] They define ‘predictability-based trust’ as the kind of trust that comes from predictable and reliable performance, whereas they define ‘values-based trust’ as the kind of trust that comes from one’s expectations of others’ values, beliefs, and commitments. Because weapons can cause such significant harm, weapons TEVV cannot simply aim at instrumental means-ends predictability-trust; it must also aim at values-based trust, building trust among operators and commanders that the weapon is constrained in ways compatible with key values (like jus in bello rules) and that the weapon can be used in ways compatible with those values. Relatedly, it is worthwhile distinguishing between trustworthiness as an inward value of a system and trust cultivation as an outward goal for a system. These distinctions matter because trustworthiness and the predictability dimension of trust are best met by a TEVV process that yields a high level of predictability about performance, whereas the value dimension of trust and trust cultivation are best promoted by specific process design and by transparency about the TEVV process. A TEVV process for a weapons system striving to serve jus ante bellum should try to be simultaneously trustworthy and trust cultivating; it should strive to build both predictability-based and values-based trust.[8]
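To make the notion of calibrated trust concrete, the following toy sketch compares an operator's rate of reliance on a system with the system's measured accuracy in each operational context and flags over- and under-trust. This is illustrative only; the class, function, threshold, and context names are hypothetical and are not drawn from any fielded TEVV suite.

```python
from dataclasses import dataclass

@dataclass
class ContextRecord:
    context: str            # e.g., "urban, night, heavy clutter"
    system_accuracy: float  # measured during T&E for this context, 0..1
    reliance_rate: float    # fraction of decisions the operator delegated, 0..1

def trust_calibration_report(records, tolerance=0.10):
    """Flag contexts where operator reliance diverges from measured performance.

    A positive gap suggests over-trust (reliance exceeds demonstrated accuracy);
    a negative gap suggests under-trust. 'tolerance' is an illustrative threshold.
    """
    report = {}
    for r in records:
        gap = r.reliance_rate - r.system_accuracy
        if gap > tolerance:
            report[r.context] = ("over-trust", round(gap, 2))
        elif gap < -tolerance:
            report[r.context] = ("under-trust", round(gap, 2))
        else:
            report[r.context] = ("calibrated", round(gap, 2))
    return report

if __name__ == "__main__":
    data = [
        ContextRecord("desert, day", 0.95, 0.97),
        ContextRecord("urban, night", 0.70, 0.92),   # reliance outstrips performance
        ContextRecord("maritime, fog", 0.88, 0.55),  # capability under-used
    ]
    for ctx, verdict in trust_calibration_report(data).items():
        print(ctx, verdict)
```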

AI-enabled weapons present a particular problem for the TEVV process because of their complexity, opaqueness, and brittleness.[9] In what follows, we examine what a TEVV process for AI-enabled weapons should look like, understanding that the aims of that process relative to jus ante bellum are assuring safety and operational effectiveness (including minimizing civilian casualties) and building the right kind of trust in commanders deploying the system and operators using it. As we will discuss later, a robust and transparent TEVV process can also serve as a confidence-building measure for potential adversaries, thus minimizing the likelihood that potential adversaries will resort to war prematurely.

Before turning to what the TEVV process ought to look like for AI-enabled weapons, a few words about what we mean by AI-enabled weapons. AI-enabled weapons, for our purposes here, are weapons with an autonomous or semi-autonomous mode, including any weapon that for its proper functioning utilizes machine learning (ML) (including deep learning (DL)) algorithms. Autonomous and semi-autonomous weapons are, in turn, weapons that use ML or DL algorithms to decide how to accomplish some task within the constraints of what, when, and why.[10] This might be a weapon that uses AI for object recognition and for discriminating between civilians and combatants, or it might be a weapon that is mounted on a tank that uses AI for navigation. It might be a weapon that uses AI to loiter while searching for and/or engaging targets, or one that uses an algorithm to identify incoming missiles. It could be a defensive or an offensive weapon, and it can use the algorithm in various aspects of its functioning. Circumscribing what makes an “AI-enabled weapon” is part of the difficulty in performing TEVV for such a weapon. But at a minimum, we know that machine learning and deep learning techniques make the TEVV process for AI-enabled weapons significantly more difficult, for a variety of reasons that will be discussed below. For now, we will proceed with this somewhat indeterminate and stipulative definition of AI-enabled weapons. For our purposes, it is less important that we define an AI-enabled weapon in general terms, and more important that we acknowledge that when a weapon relies for its functioning on a machine learning or deep learning algorithm, its testing, evaluation, validation, and verification is complicated and differs from the same process for a weapon without such an algorithm. We now turn to how AI-enabled weapons complicate the TEVV process aimed at jus ante bellum and just preparedness.
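The division of labor described in footnote 10 (humans specify the ‘what’, the ‘why’, and the ‘what not to do’; the system is empowered to decide the ‘how’) can be pictured with a minimal, purely illustrative sketch. The class, fields, and example values below are hypothetical and are not drawn from any actual doctrine or system.

```python
from dataclasses import dataclass, field

@dataclass
class TaskSpecification:
    """Schematic of the characterization used above: goals and intent
    ('what' and 'why') and constraints ('what not to do') are specified
    by humans; the 'how' is left to the system."""
    objective: str                                   # what to do
    commanders_intent: str                           # why it is being done
    constraints: list = field(default_factory=list)  # what not to do (ROE-like)

spec = TaskSpecification(
    objective="defend the perimeter of a forward operating base",
    commanders_intent="prevent a breach while avoiding escalation",
    constraints=[
        "engage only positively identified hostile actors",
        "no engagement within 200 m of protected structures",
    ],
)

# For our purposes, a system counts as autonomous or semi-autonomous if, given a
# specification like this, it is empowered to decide *how* to accomplish the
# objective within the stated constraints.
print(spec)
```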

a. Appropriate Unit of Analysis: First, it is harder to circumscribe the appropriate unit of analysis for the TEVV process. This is not simply because it is, as addressed above, hard to define what an AI-enabled weapon is, but also for several further reasons.

    i. First, the same algorithm can be utilized across a range of weapon system applications. Consider, for example, an object recognition algorithm. Such an algorithm can be put to several different purposes in several different weapon systems. It might be used to support autonomous navigation in a tank, or it could be used to classify an object as a weapon or non-weapon, and as such it could form part of a decision-making system bearing on targeting or in bello proportionality.

    ii. Second, algorithms often function within a system of systems. The term ‘system of systems’ here simply refers to the idea that several algorithms can provide inputs for one another. For example, and building on the example above, an object recognition algorithm could provide input to a decision-augmenting algorithm, which collates and processes a range of relevant information and presents alternative courses of action and likelihoods of success to the commander (a schematic sketch of such a chain follows this list).

    iii. Third, the success of these algorithms, and therefore their ability to function properly, depends greatly on so-called human-machine teaming, namely on the interaction between the operator or user and a particular algorithm or its output. This means that TEVV might need to test systems with differently trained operators in order to provide meaningful assurances regarding the system's performance.[11]

b. Fitness for purpose: Fitness for purpose presents a further issue for TEVV in the context of AI-enabled weapons. If predicting performance is one key function of TEVV, then the high dependence of performance on the operational environment, and on the suitability of the training data for that environment, means, first, that each new deployment might require a new evaluation and, second, that the TEVV process needs a clear way of defining what counts as a “new deployment” (one crude way of operationalizing this is sketched after this list).

c. Generalizing from testing: Relatedly, generalizing and extrapolating from test results is near impossible for many AI-enabled weapons systems. As Wojton et al.'s analysis of the relevant literature concludes, “[t]here is strong consensus that the state-space explosion resulting from the interaction of tasks’ and systems’ growing complexity will make it impossible, under any realistic assumptions, to exhaustively test all scenarios”.[12] In other words, these systems perform incredibly difficult tasks, they do so in unpredictable environments, and they provide “non-deterministic, dynamic responses to those environments”, making the range of potential scenarios to test immense, if not infinite (the back-of-the-envelope arithmetic after this list illustrates how quickly the scenario space grows).[13] Both AI-enabled and traditional weapons systems face this obstacle, since one can only ever test a fraction of the operational space; but with traditional weapon systems one can generalize across varied environments with more confidence.[14] Consider an autonomous base defense system tasked with responding to a variety of threats under current ROEs.[15] It will likely encounter many different scenarios, some of which the TEVV process cannot foresee and thus cannot test for. Flournoy et al. discuss similar issues under the rubric of “brittleness”, arguing that the “traditional TEVV approach is not well suited for ML/DL”, partly because “ML/DL system performance is difficult to characterize and bind, and the brittleness of such systems means they will require regular system updates and testing”.[16] So in addition to not being able to predict all potential scenarios and not being able to extrapolate from testing data, the need for such systems to be regularly updated adds another obstacle to TEVV. While this does not mean that we should expect constant full-blown TEVV processes, it does mean that the TEVV process cannot simply be done up-front; it should instead be integrated into the development and operation of a weapon. A better integrated TEVV process for AI-enabled weapons should also yield a more informed approach to appropriate operational environments for a particular AI-enabled weapon. More will be said about this below.

d. Unpredictable failures: Relatedly, the failures of an AI-enabled weapon are harder to predict and more difficult to understand, complicating TEVV's ability to faithfully predict how AI systems will perform and in which operational environments. The more opaque the system, the harder it is to predict which common mistakes the system is prone to and why. It is likewise harder to circumscribe the operational contexts within which the system is likely to perform better or worse and those within which it is likely to exhibit errors. Deep learning techniques and systems of systems are particularly likely to exhibit these sorts of opaqueness issues. The TEVV process might therefore require a certain level of explainability. In fact, as Flournoy et al. argue, for high-risk systems explainability may be a requirement for a successful TEVV and thus for fielding the weapon.[17] Opaqueness might also shape how TEVV informs certification schemes for AI-enabled weapons, for example by requiring ML experience for operators of some high-risk systems.[18] In addition, as Jane Pinelis, the Chief AI Assurance Officer for the U.S. Joint AI Center, argues, we might need to move away from complete risk avoidance and precise risk quantification and focus instead on making failures graceful (a minimal illustration of such graceful degradation follows this list).

e. Piecemeal approach to development: A further complexity around AI-enabled weapons systems is the fact that they are often built not-for-purpose and by the public sector. Unlike traditional weapon systems, AI-enabled weapon systems (and specifically the AI part of those systems) are more likely to come piecemeal from a variety of sources. This is because, first, much of the development of AI is happening in the public sector, and second, AI is often utilized to solve specific problems or design needs, and as such there is increased reliance on AI solutions to concrete problems across the OODA loop. It is more likely that an AI-enabled weapon is going to come not from a single weapons developer, as traditional weapon systems did, but from a range of sources.[19] This affects how some AI-enabled weapons can be tested. For example, for ML systems, validation and verification require never-before-seen testing data that is fit for that purpose and thus not publicly available. One of the biggest obstacles to building and meaningfully testing AI-enabled weapons is access to large data sets that are appropriately tagged and that are built and maintained for such testing.[20]

f. Dynamic and open AI systems: Finally, some machine learning solutions to warfighting problems will result in open systems: AI models that rely on iterative processes for dynamic situations, that is, models that take in data and update themselves as they are being used. That presents an obvious problem for the TEVV process, since each new version of the algorithm (the ML model) in one sense acts as a new weapon. A robust TEVV process needs to make clear at what level of change in the model the weapon becomes sufficiently different from its predecessor to trigger the TEVV process anew (one way of operationalizing such a trigger is sketched below).
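The system-of-systems point in (a)(ii) above can be made concrete with a minimal, purely illustrative pipeline in which one model's output becomes another model's input. All of the function names, labels, and thresholds below are hypothetical stand-ins for far more complex components; this is a sketch of the chaining pattern, not of any actual system.

```python
import random

def object_recognizer(frame):
    """Stand-in for a perception model: labels an object and reports a confidence."""
    label = random.choice(["civilian vehicle", "armored vehicle", "unknown"])
    confidence = round(random.uniform(0.4, 0.99), 2)
    return {"label": label, "confidence": confidence}

def decision_support(recognition, rules_of_engagement):
    """Stand-in for a decision-augmenting model that consumes the recognizer's output."""
    if (recognition["label"] == "armored vehicle"
            and recognition["confidence"] >= rules_of_engagement["min_confidence"]):
        return "present engagement option to commander"
    return "continue observation"

# Testing the recognizer alone says little about the behavior of the chain:
# a small drop in recognizer confidence propagates into a different downstream
# recommendation, which is why the chained system also needs end-to-end evaluation.
roe = {"min_confidence": 0.9}
for _ in range(3):
    rec = object_recognizer(frame=None)
    print(rec, "->", decision_support(rec, roe))
```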
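One crude way of operationalizing the fitness-for-purpose worry in (b), assumed here purely for illustration, is to compare summary statistics of the data the system was tested on with data sampled from a proposed deployment environment and to treat a large divergence as a trigger for renewed evaluation. Real programs would use more principled drift measures; the function names, feature, and threshold are hypothetical.

```python
import statistics

def feature_shift(reference, deployment):
    """Crude drift score: shift in means, scaled by the reference spread."""
    spread = statistics.pstdev(reference) or 1.0
    return abs(statistics.mean(deployment) - statistics.mean(reference)) / spread

def requires_new_evaluation(reference, deployment, threshold=1.0):
    """Illustrative rule: if deployment data drifts by more than 'threshold'
    reference standard deviations, treat this as a 'new deployment' for TEVV purposes."""
    return feature_shift(reference, deployment) > threshold

# e.g., a thermal-contrast feature measured on the test range vs. in a new theater
test_range = [0.82, 0.79, 0.85, 0.81, 0.84]
new_theater = [0.41, 0.44, 0.39, 0.47, 0.43]
print(requires_new_evaluation(test_range, new_theater))  # True -> re-test before fielding
```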
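The state-space explosion described in (c) can be illustrated with back-of-the-envelope arithmetic: even a modest number of scenario factors multiplies into far more combinations than any test program could exercise. The factor names and counts below are invented solely to show the arithmetic.

```python
import math

# Hypothetical scenario factors for an autonomous base-defense system
scenario_factors = {
    "threat type": 12,
    "approach vector": 36,
    "weather / visibility": 8,
    "time of day": 4,
    "clutter level": 5,
    "ROE posture": 3,
    "sensor degradation state": 6,
}

total_scenarios = math.prod(scenario_factors.values())
tests_per_day = 50  # optimistic live-test throughput
years_to_exhaust = total_scenarios / (tests_per_day * 365)

print(f"{total_scenarios:,} combinations")  # 1,244,160
print(f"~{years_to_exhaust:,.0f} years to test exhaustively at {tests_per_day} tests/day")  # ~68
```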
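The suggestion in (d) that failures be made graceful rather than merely rare can be illustrated with a simple wrapper: when a model's confidence falls below the level characterized during testing, the system defers to a human instead of acting. This is a hedged sketch; the threshold, interface, and toy classifier are hypothetical.

```python
def graceful(model, minimum_confidence=0.85):
    """Wrap a model so that low-confidence outputs degrade to human referral
    instead of propagating into downstream action."""
    def wrapped(observation):
        label, confidence = model(observation)
        if confidence < minimum_confidence:
            return {"action": "defer_to_operator",
                    "reason": f"confidence {confidence:.2f} below tested envelope"}
        return {"action": "recommend", "label": label, "confidence": confidence}
    return wrapped

# toy stand-in for an object classifier
def toy_classifier(observation):
    return ("armored vehicle", 0.62) if observation == "obscured" else ("armored vehicle", 0.97)

classify = graceful(toy_classifier)
print(classify("clear"))     # high confidence -> recommendation surfaced
print(classify("obscured"))  # low confidence -> graceful deferral to the operator
```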
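For the open, continually updated systems described in (f), a TEVV program needs an explicit rule for when an updated model counts as a sufficiently "new" weapon. The sketch below assumes, purely for illustration, a sequestered reference set and an agreed disagreement threshold; both are hypothetical.

```python
def behavioral_change(old_model, new_model, reference_inputs):
    """Fraction of sequestered reference inputs on which the updated model
    disagrees with the previously evaluated one."""
    disagreements = sum(1 for x in reference_inputs if old_model(x) != new_model(x))
    return disagreements / len(reference_inputs)

def retevv_required(old_model, new_model, reference_inputs, threshold=0.02):
    """Illustrative policy: more than 2% behavioral change re-triggers (parts of) TEVV."""
    return behavioral_change(old_model, new_model, reference_inputs) > threshold

# toy example: a threshold classifier whose decision boundary moved after an update
old = lambda x: x > 0.50
new = lambda x: x > 0.45
reference = [i / 100 for i in range(100)]
print(retevv_required(old, new, reference))  # True -> treat as a changed system
```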

Given all of these ways in which AI-enabled weapons differ from traditional weapons, there are key ways in which the TEVV process ought to shift to accommodate them. We focus here on the key changes that will serve the jus ante bellum aims mentioned above, namely changes that, by assuring safety, precision, and accuracy, will also assure that the development and deployment of AI-enabled weapons is compatible with jus in bello conditions and does not further promote conflict. With those aims in mind, the TEVV process for AI-enabled weapons ought to meet the following conditions:

a. Ongoing and integrated cradle to grave: First, as Flournoy et al., Pinelis, and others have suggested, the TEVV process for AI-enabled weapon systems ought to be ongoing.[21] This is partly because an integrated testing and evaluation process can bolster the quality of AI-enabled weapons by anticipating problems and by assuring compliance with both jus in bello and responsible AI principles. An ongoing TEVV process, integrated into development, can also address some of the issues of transparency and explainability mentioned above.

 

b. Principled and modular approach to TEVV: Relatedly, there should be a clear set of circumstances under which an otherwise “same” AI-enabled weapon needs to undergo the TEVV process anew, from the ground up, or needs to repeat some module of that process (hence the modular approach). This is for two reasons: first, insofar as the performance of AI-enabled weapons depends heavily on the operational environment, new operational environments require some new testing and evaluation; second, updates to AI-enabled weapons might trigger a new process. A robust TEVV process needs not only to assess performance in appropriate operational environments but also to define those environments, often in collaboration with those developing the AI or integrating it into weapon systems. Defining such operational environments for AI systems is harder than for traditional weapon systems. It might be, for example, that an object recognition model works very well in one climate or geographic region but not in another, for unanticipated reasons. Thus, both new training data and new operational environments ought to trigger the repetition of some elements of the TEVV process.

c. TEVV for AI-enabled weapons should serve the legal weapons review: As we acknowledged above, the TEVV process for AI-enabled weapons ought to be ongoing, and it should parallel the process of development. Relatedly, some legal scholars, most notably Tobias Vestner and Altea Rossi, have argued that the legal weapons review ought also to run in parallel with, and be incorporated into, the TEVV process. According to Vestner and Rossi, even though traditionally (and especially in the U.S.) testing and technical assessment come prior to the legal review and provide evidence for it, the nature of AI systems, which often translate legal requirements into technical specifications, means that the legal review needs to be undertaken in tandem with the TEVV process.[22] The legal weapons review needs, on their view, to be part of TEVV. This seems right, especially if we are focused on developing a TEVV process sensitive to jus ante bellum, including assuring that the way we develop and deploy weapons does not lead to violations of jus in bello conditions. A particular benefit of this approach is that the events that trigger renewed testing and evaluation might, in some cases, also trigger a renewed legal weapons review.

d. Better testing data sets and gradual fielding: Many algorithms relevant to weapons systems (like object recognition or decision-making warfighting algorithms) are trained in simulated environments. These environments are themselves often built on ML/DL algorithms and simulate controlled conditions. Simulation-based testing data will not be enough in cases where the risks of fielding a particular weapon are great. Better, real-life data sets are obviously preferable, as they would increase operators' ability to trust performance in a range of operational environments. To this end, it is crucial that AI-enabled weapons be tested in real-life environments, be fielded only partially, and, where possible, be released into the wild gradually. “[A] strategy of graded autonomy (slowly stepping up the permitted risks of unsupervised tasks, as with medical residents) and limited capability fielding (only initially certifying and enabling a subset of existing capabilities for fielding) could allow the services to get at least some useful functionality into warfighters hands while continuing the T&E process for features with a higher evidentiary burden (Porter et al., 2020)”.[23] This approach to the TEVV and fielding of AI-enabled weapons systems can play a particularly meaningful role in assuring allies and potential adversaries of responsible development, thus disincentivizing rash deployment of similar technology on their part (a schematic of such staged fielding follows this list).

e. Transparency and Explainability: A robust TEVV process will impose some transparency requirements on AI-enabled weapons, as well as some explainability requirements. The TEVV process itself also needs to be transparent, which can contribute to what we call confidence-building measures, discussed in a later section. Regarding the transparency and explainability of AI-enabled weapons, the needs of the TEVV process might not completely overlap with the transparency and explainability needed for the operation of an AI-enabled weapon. Simply put, defining the operating envelope (the set of conditions under which we expect the system to perform in expected ways) requires a level of transparency and explainability from a system that is different from the transparency and explainability an operator might need to responsibly operate it. Explainability serves not only the identification of problems, but also the ability to predict the behavior of a system in varied circumstances, the definition of the operating envelope, and trust. One way to build the needed transparency and explainability is recommended by Wojton et al. in their literature review: “If systems are recording data about their own decisions and internal processing, then stakeholders, including developers, testers, and even users, can gain more transparency into the system. From a TEV&V perspective, this instrumentation could be combined with safety middleware or disabled functionality to execute what some call “shadow testing”, where the complex system makes decisions about what it would do in the current situation without being allowed to implement or execute those actions (Templeton, 2019)”.[24] Related to the above need for better data from the wild, shadow testing could also provide meaningful and large data sets from the field (a minimal sketch of the shadow-testing pattern follows this list).

f. TEVV should define the operating envelope, and red lines for some technologies: TEVV is traditionally primarily about safety and accuracy. But as we have seen, when it comes to AI-enabled weapons, TEVV must include assessing operational environments and defining the operating envelope. This allows the operator to know whether and when to trust the weapon system, as well as some of the ways in which it might fail. While this is true of TEVV for traditional weapons as well, namely that TEVV is meant to provide insight into variable performance across a range of operational environments (the so-called operating envelope), the issue is significantly more complicated for AI-enabled systems, and TEVV therefore has a more significant and explicit role to play in defining appropriate operational environments for the use of an AI-enabled weapon. TEVV can also provide some insight into whether a particular algorithm should be used at all, instead of a human or non-AI alternative. To clarify, there might be times when an AI-enabled system that works relatively well is nonetheless not better than a human in some contexts. That in turn means that TEVV ought not to assess the safety and precision of a weapon in a vacuum, but with an eye to the available alternatives for similar functions in varied operational environments (a toy version of an envelope check against a human baseline follows this list).

g. TEVV should drive certification schemes: TEVV has to take place with capable operators, and in that way the iterative process of testing and evaluation can help guide the appropriate training, skills, and certification of operators. For example, the TEVV process that the U.S. Joint AI Center proposes includes four types of testing: algorithmic testing, human-machine testing, systems integration testing, and operational testing with real users in real scenarios.[25] The human-machine testing and the operational testing provide evidence not just for the evaluation of the weapon system, but also for appropriate human-machine teaming. While TEVV has always played a role (at least in the U.S. approach to weapons testing) in certification schemes for operators, the training content that TEVV can provide for certification schemes for AI-enabled weapons seems significantly greater.

h. TEVV should be undertaken in various configurations of systems and people: As mentioned above, AI-enabled weapons are often systems of systems: chains of algorithms in which one algorithm's output acts as input for another. In such cases it might not be possible, or desirable, to test just one algorithm at a time.[26] This in turn means that ML algorithms need to be tested in various configurations, that is, operating with, alongside, or in a chain with a number of other ML models or algorithms. Relatedly, scholars have argued that to assure good results, testing ought to proceed in various configurations of systems and people (a simple enumeration of such a configuration matrix follows this list).[27]
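The graded-autonomy and limited-capability-fielding strategy quoted in (d) can be pictured as a staged gate: each increment of permitted autonomy is unlocked only after a specified body of evidence has accumulated at the previous stage. The stage names, hour counts, and error thresholds below are invented for illustration and do not reflect any actual fielding policy.

```python
from dataclasses import dataclass

@dataclass
class FieldingStage:
    name: str
    permitted: str                   # capability enabled at this stage
    required_supervised_hours: int   # evidentiary burden before entering the stage
    max_observed_error_rate: float

STAGES = [
    FieldingStage("shadow", "log recommendations only", 0, 1.00),
    FieldingStage("advisory", "surface recommendations to operator", 500, 0.05),
    FieldingStage("supervised", "act with operator veto", 2000, 0.02),
    FieldingStage("bounded autonomous", "act within a narrow envelope", 8000, 0.005),
]

def eligible_stage(supervised_hours, observed_error_rate):
    """Return the highest stage whose evidentiary burden has been met."""
    current = STAGES[0]
    for stage in STAGES:
        if (supervised_hours >= stage.required_supervised_hours
                and observed_error_rate <= stage.max_observed_error_rate):
            current = stage
    return current

print(eligible_stage(supervised_hours=2500, observed_error_rate=0.015).name)  # "supervised"
```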
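The "shadow testing" pattern quoted in (e) from Wojton et al. (via Templeton 2019) amounts to letting the system compute and record the decision it would make while never allowing that decision to reach the actuation layer. A minimal sketch, with hypothetical names and a toy decision rule, follows.

```python
import json
import time

def shadow_mode(model, log_path="shadow_log.jsonl"):
    """Run a model in shadow: its decisions are logged for later T&E analysis
    but are never returned to the actuation layer."""
    def observe(situation):
        decision = model(situation)
        record = {"time": time.time(), "situation": situation, "would_have_done": decision}
        with open(log_path, "a") as f:
            f.write(json.dumps(record) + "\n")
        # Nothing is returned to the actuation layer: the fielded (non-AI) behavior
        # or the human operator remains in control while field data accumulates.
        return None
    return observe

# toy decision rule standing in for a complex model
shadow = shadow_mode(lambda s: "track and illuminate" if s["range_km"] < 5 else "ignore")
shadow({"range_km": 3.2, "sensor": "radar"})
```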
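The operating envelope discussed in (f) can be thought of as a machine-checkable artifact that TEVV publishes: the conditions under which the system's performance claims hold, alongside the measured performance of the human or non-AI alternative for the same function. The envelope fields, bounds, and accuracy figures below are hypothetical.

```python
# A hypothetical operating envelope emitted by the TEVV process
ENVELOPE = {
    "visibility_km": (2.0, None),      # at least 2 km visibility
    "target_speed_mps": (0.0, 25.0),   # up to 25 m/s
    "clutter_index": (0.0, 0.6),
}
SYSTEM_ACCURACY_IN_ENVELOPE = 0.96
HUMAN_BASELINE_ACCURACY = 0.91         # measured alternative for the same function

def within_envelope(conditions):
    for key, (low, high) in ENVELOPE.items():
        value = conditions.get(key)
        if value is None:
            return False               # unknown condition -> assume outside the envelope
        if (low is not None and value < low) or (high is not None and value > high):
            return False
    return True

def use_ai(conditions):
    """Use the AI component only inside the envelope *and* only where it beats the alternative."""
    return within_envelope(conditions) and SYSTEM_ACCURACY_IN_ENVELOPE > HUMAN_BASELINE_ACCURACY

print(use_ai({"visibility_km": 8.0, "target_speed_mps": 12.0, "clutter_index": 0.3}))  # True
print(use_ai({"visibility_km": 0.5, "target_speed_mps": 12.0, "clutter_index": 0.3}))  # False
```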
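Testing in various configurations of systems and people, as recommended in (h), amounts in practice to enumerating, or sampling from, a matrix of component versions, chain topologies, and operator profiles. The configuration values below are placeholders; the sketch only shows how quickly even a small matrix grows and how a subset might be sampled.

```python
import itertools
import random

perception_models = ["recognizer_v1", "recognizer_v2"]
decision_models = ["planner_a", "planner_b"]
operator_profiles = ["novice", "experienced", "ML-trained"]
chain_topologies = ["perception->planner", "perception->planner->human veto"]

full_matrix = list(itertools.product(perception_models, decision_models,
                                     operator_profiles, chain_topologies))
print(len(full_matrix), "configurations in the full matrix")  # 24

# Even a small matrix grows quickly, so programs often test a sampled or
# prioritized subset rather than every configuration.
random.seed(0)
for config in random.sample(full_matrix, 5):
    print("test configuration:", config)
```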

Ultimately, a TEVV process for AI-enabled weapons is meant not only to test for safety, but also to provide meaningful insights regarding appropriate human-machine interaction, as well as compliance with jus in bello. In fact, a robust TEVV process is a necessary step for a legal weapons review. We have argued that a TEVV process sensitive to the unique challenges of AI-enabled weapons can meet the conditions of just preparedness by embracing the above recommendations.

[1] Note about the term ‘TEVV’: Various sources have the ‘VV’ as validation and verification and others as verification and validation. We follow Flournoy et al. in referring to it as validation and verification. Michèle A. Flournoy, Avril Haines, & Gabrielle Chefitz, Adapting DOD’s Test & Evaluation, Validation & Verification (TEVV) Enterprise for Machine Learning Systems, including Deep Learning Systems (October 2020), https://cset.georgetown.edu/wp-content/uploads/Building-Trust-Through-Testing.pdf

[2] And potentially increases the risk of violations of jus ad bellum.

[3] Furthermore, TEVV can provide a basis for a legal weapons review, as it does in the U.S., and concomitantly TEVV can play a significant role in providing assurances to potential adversaries. Both of these issues are discussed below.

[4] Flournoy et al., 3: “The ultimate goal of any TEVV system should be to build trust – with a commander who is responsible for deploying a system and an operator who will decide whether to delegate a task to such system – by providing relevant, easily understandable data to inform decision-making”.

[5] Pinelis, Y., Presentation on Progress in Testing and Evaluation of AI-enabled Weapons Systems. https://www.youtube.com/watch?v=1eSKngsJvvo

[6] Roff, H.M., and Danks, D. (2018). Trust but verify: The difficulty of trusting autonomous weapons systems. J. Milit. Ethics 17, 2-20. doi: 10.1080/15027570.2018.1481907

[7] Ibid.

[8] Wojton, H., Porter, D., and Dennis, J. (2021). Test and Evaluation of AI-Enabled and Autonomous Systems: A Literature Review. Alexandria, VA: Institute for Defense Analyses. Available online at: https://testscience.org/wp-content/uploads/formidable/20/Autonomy-Lit-Review.pdf

[9] US Department of Defense (2021). DoD Instruction 5000.89: Test and Evaluation. Washington, DC: US Department of Defense. US Department of Defense (2020). AI Principles: Recommendations on the Ethical Use of Artificial Intelligence by the Department of Defense. Press Release. Washington, DC: US Department of Defense. “ENCLOSURE 2 V&V AND T&E OF AUTONOMOUS AND SEMI-AUTONOMOUS WEAPON SYSTEMS To ensure autonomous and semi-autonomous weapon systems function as anticipated in realistic operational environments against adaptive adversaries and are sufficiently robust to minimize failures that could lead to unintended engagements or to loss of control of the system, in accordance with subparagraph 4.a.(1) above the signature of this Directive: a. Systems will go through rigorous hardware and software V&V and realistic system developmental and operational T&E, including analysis of unanticipated emergent behavior resulting from the effects of complex operational environments on autonomous or semiautonomous systems. b. After initial operational test and evaluation (IOT&E), any further changes to the system will undergo V&V and T&E in order to ensure that critical safety features have not been degraded. (1) A regression test of the software shall be applied to validate critical safety features have not been degraded. Automated regression testing tools will be used whenever feasible. The regression testing shall identify any new operating states and changes in the state transition matrix of the autonomous or semi-autonomous weapon system. (2) Each new or revised operating state shall undergo integrated T&E to characterize the system behavior in that new operating state. Changes to the state transition matrix may require whole system follow-on operational T&E, as directed by the Director of Operational Test and Evaluation (DOT&E)”.

[10] Wojton et al., 3: “1. Specify the goals or objectives of the specific task and/or the overall mission, and possibly the larger reasons for those goals, like the commander’s intent (i.e., what to do and why), 2. Specify the constraints associated with the task, such as Rules of Engagement (ROEs, i.e., what not to do). 3. Not specify the methods to use or give explicit contingencies for every situation, like reacting to the adversary’s response (i.e., how to do the task). [But] Whether a system is empowered to make these ‘how’ decisions for a task is how this paper will differentiate autonomous from non-autonomous systems”.

[11] Flournoy et al.

[12] Wojton, et al., 5.

[13] Wojton, et al., 4.

[14] Pinelis.

[15] Wojton, et al., 1.

[16] Flournoy et al., 7.

[17] Flournoy et al.

[18] This might be especially the case when teaching “common mistakes” is less useful and understanding the way that the system works is necessary for anticipating/recognizing when the system is failing to operate as expected.

[19] Interview with Chief Ethics of AI Officer for the Air Force.

[20] Interview with TEVV Director for CDAO.

[21] Pinelis; Flournoy et al.

[22] It should be noted, however, that not everyone agrees that the TEVV process is necessarily prior to legal review; practitioners of legal review described in interviews with us that even for non-AI-enabled weapons the legal review process has a cradle-to-grave feature.

[23] Wojton, et al. 20.

[24] Wojton, et al., 20.

[25] Pinelis.

[26] Flournoy et al.

[27] Hand, D.J., and Khan, S. (2020). Validating and verifying AI systems. Patterns 1, 37. doi: 10.1016/j.patter.2020.100037