How we review and rate AI products

Technical excellence is not enough

At Common Sense, we know that successful artificial intelligence is built with responsibility, ethics, and inclusion by design. This means that technical excellence alone is not enough for AI systems—AI is sociotechnical, which means that the technology cannot be separated from the humans and the human-created processes that inform, shape, and develop its use.

That's why our AI product reviews are contextual, taking into account the societal landscape in which the products will be used and actively seeking what information might be missing or invisible to an AI system. Our ratings and reviews act as "nutrition" labels for AI. They describe a product's opportunities, limitations, and risks in a clear and consistent way, putting the information you need at your fingertips.

How Common Sense's AI reviews work

Our process stretches across two distinct stages: 1) participatory disclosures and 2) sociotechnical assessments. At its core, the process is built on the concept of model cards and datasheets, which can be thought of as "nutrition" labels for AI. This documentation organizes the essential facts about an AI model or system in a structured, repeatable way, surfacing critical information such as best uses, known limitations, ideal inputs, and performance metrics.

Our first step is to gather information about the product we're reviewing. This includes anything the organization has shared with us as a part of our participatory disclosures process, as well as any publicly available transparency reports, and a literature review.

Once we have gathered as much information as we can, we bring our review team together for a discussion. This is what we call our sociotechnical assessment. During this conversation we assess multiple dimensions of the product, including:

Common Sense AI Principles. Each product is assessed for potential opportunities, limitations, and harms according to the standards of each AI principle.
A Continuum of Risk. Different applications of advanced technologies may be riskier in some contexts. Higher-risk applications require increased scrutiny, auditing, and disclosure.
Ethical Risk Sweeping. There are significant differences between reviewing a multi-use tool like ChatGPT and a built-for-purpose application. This framework accounts for these distinctions while also educating users on the product's technologies, best uses, unique opportunities, and risk profiles.

Once we've completed our discussion, we finalize the product's ratings under each AI Principle. We then incorporate information from the rest of our sociotechnical assessment to determine the final product rating. We share the key things we discussed and the most important things for you to know, though we do not present any proprietary information that the company shared with us as part of the participatory disclosure process.. There's a lot of information in each of them, and that's by design! Our goal is to give you a clear picture of the details that are important to think about as you decide whether to use these products in your homes and schools—or how to regulate them.

Participatory Disclosures

The first stage in the Common Sense AI review process is the collection of data, model, and system information. This is a private disclosure process between the organization whose product we're reviewing and Common Sense. None of these disclosures are required. In other words, if the creators of any given AI product wish to keep all information private, this is up to them. The more information we have, the more robust our assessment—and ultimately our rating and review—can be. If there isn't enough information to inform our sociotechnical assessment, and publicly available transparency reports or other information does not exist, this factors into the final rating and review for any given product.

Who fills out the participatory disclosures? Common Sense recommends that each organization follow the roles outlined in this annotated model card template, which includes three distinct functions: 1) the developer, who writes the code and runs training; 2) the sociotechnic, who is skilled at analyzing the interaction of technology and society in the long term (this includes lawyers, ethicists, sociologists, or rights advocates); and 3) the project organizer, who understands the overall scope and reach of the model, is responsible for ensuring that each part of the card is filled out, and serves as a contact person.

What do we ask in our participatory disclosures?

Common Sense has adapted and expanded on frameworks such as model cards, datasheets, system cards, and other forms of AI transparency reporting. We cover questions in the following areas:

System Information: These questions are intended to gather general information about the product, including information about how it was developed.
Intended Uses: The answers to these questions should provide sufficient information for us to quickly grasp what the product should and should not be used for, and why it was created.
Performance Factors: This section is intended to provide a summary of product performance across a variety of relevant factors.
Metrics: This section captures information about how the organization measures the real-world impact of the product.
Evaluation Data: We ask which data sets and benchmarks an organization used to test its product. It is important that evaluation data sets include data sets that are publicly available for third-party use. As such, for any data sets referenced in this section, we ask for links or information that would provide visibility into the evaluation data's source and composition.
Training Data: Ideally, this section should capture as much information about the training data as the evaluation data. We recognize, however, this might not always be feasible (e.g., if the data is proprietary, etc). In these cases, we ask for organizations to share basic details about the distributions over groups in the data, as well as any other details that could help inform our readers about what the system has encoded.
Responsible AI: This section identifies ethical considerations that went into system development. We recognize that ethical analysis does not always lead to precise solutions, and in these scenarios, we aim to gather information on any ethical contemplation process(es) and outcome(s). We ask about a series of general responsible AI considerations, as well as a set of questions that are specific to children, teens, families, and education.
Data-Specific Questions: These sections focus more deeply on both evaluation and training data. The included questions are inspired by and modified from the standards identified in Datasheets for Datasets. As noted by the paper's authors, the "World Economic Forum suggests that all entities should document the provenance, creation, and use of machine learning data sets in order to avoid discriminatory outcomes." This type of transparency reporting will additionally be required for many use cases in the upcoming EU AI Act. In this section, we ask questions about
- Data set creation and motivation
- Data composition
- Data collection process
- Data & system maintenance

The complete list of questions in our participatory disclosures are available here.

Our sociotechnical assessment

The information provided through the participatory disclosure process and our own research provide the foundation for our responsible AI assessment. This evaluation interrogates and documents known opportunities and ethical risks, which then inform the review write-ups, the ratings for the product under each AI Principle, and the final rating for each product. Much of the foundation for this assessment is built on the seminal work of Santa Clara University's Markkula Center for Applied Ethics, and specifically the Ethics in Technology Practice.

Importantly, these reviews are not conducted individually, but by a team. This matters because no single person, type of background, or specific expertise can single-handedly recognize the ethical blind spots, technical limitations, or opportunities within a given product. Teamwork also matters because it helps to ensure that personal beliefs do not drive review outcomes; team members actively hold each other accountable to the process itself, not any specific outcome.

This also speaks to the importance of a repeatable process for any AI governance work. Any responsible AI evaluation deals with highly complex technologies that can have significant impact on people. Evaluating those impacts means that the topics that need to be discussed, while crucial, can be difficult. Having a repeated, dependable process with a team of people who all feel psychologically safe to discuss these topics allows us to repeatedly surface the right information to share in our final reviews in a way you can trust.

Our sociotechnical assessment breaks down into two main categories: Overview & ethical risk sweeping and AI Principles assessments.

Overview & Ethical Risk Sweeping

This part of our review lays the groundwork for the more specific AI Principles assessments. This section covers the following:

Product Overview, Benefits & Opportunities

This section covers what the product does, its intended uses, foreseeable users, and whether or not children and teens are using the product. We also assess how the product creates value for its users and the broader society, and how it might do more to positively impact children, teens, and education in the future.

Societal landscape assessment

This section aims to contextualize the product and look at how evenly its benefits are or aren't distributed. Questions in this section include:

Whose interests, desires, skills, experiences, and values might this product have simply assumed, rather than actually consulted? Why do we think this, and with what justification?
Who are the various stakeholders that will be directly affected by this product? Do their interests appear to have been protected? How do we know what their interests really are—have they been asked?
Which groups and individuals will be indirectly affected in significant ways? How have their interests been protected? How do we know what their interests really are—have they been asked?
Who might use this product that wasn't expected to use it, or for purposes that weren't expected? How does this expand or change the stakeholder picture?

Ethical risk sweeping

Leveraging the Markkula Center's definition, ethical risks are "choices that may cause significant harm to persons, or other entities/systems carrying a morally significant status (ecosystems, democratic institutions, water supplies, animal or plant populations, etc.), or are likely to spark acute moral controversy for other reasons." It is important to note that ethical risks are not always easy to identify, and we are always pushing ourselves to recognize common blind spots, including when:

We do not share the moral perspective of other stakeholders.
We fail to anticipate the likely causal interactions that will lead to harm.
We consider only material or economic causes or consequences of harm.
We fail to draw the distinction between conventional and moral norms.
The ethical risks are subtle, complex, or significant only in aggregate.
We misclassify ethical risks as legal, economic, cultural, or PR risks.

We conduct this risk sweeping by incorporating concepts of both post- and pre-mortem ethical risk assessments. We also include a series of questions designed to assess the worst possible scenarios, called the "think about the terrible people" assessment. Questions in this section include:

Have there been any known ethical failures or unaddressed concerns with this product? If so, what are they?
Do we know what combination or cascade of causes led to the ethical failure(s)? If not, what do we think might be the cause(s)?
Did the organization do anything differently as a result?
For all the risks identified, which are trivial? Which are urgent? Which are too remote to consider? Which are potentially remote but too serious to ignore?
In what situations might the product creator be tempted to hide information about risks of harm to children or education?
What shortcuts might the product creator take to optimize for profit and/or adoption that could cause harm to children or education?
In what ways could this product be abused, misinterpreted, or weaponized by its users?

AI Principles Assessments

For each of the eight Common Sense AI Principles, we ask a series of questions to help us assess how well a product aligns with each principle. At a high level, we are seeking to answer the following for each principle:

What we aim to address: Does the product respect human rights and children's rights, as well as identity, integrity, and human dignity? Does it support human agency with human-in-the-loop and adults (parents, caregivers, educators)-in-the-loop frameworks?