How we review and rate AI products
Technical excellence is not enough
At Common Sense, we know that successful artificial intelligence is built with responsibility, ethics, and inclusion by design. This means that technical excellence alone is not enough for AI systems—AI is sociotechnical, which means that the technology cannot be separated from the humans and the human-created processes that inform, shape, and develop its use.
That's why our AI product reviews are contextual, taking into account the societal landscape in which the products will be used and actively seeking what information might be missing or invisible to an AI system. Our ratings and reviews act as "nutrition" labels for AI. They describe a product's opportunities, limitations, and risks in a clear and consistent way, putting the information you need at your fingertips.
How Common Sense's AI reviews work
Our process stretches across two distinct stages: 1) participatory disclosures and 2) sociotechnical assessments. At its core, the process is built on the concept of model cards and datasheets, which can be thought of as "nutrition" labels for AI. This documentation organizes the essential facts about an AI model or system in a structured, repeatable way, surfacing critical information such as best uses, known limitations, ideal inputs, and performance metrics.
Our first step is to gather information about the product we're reviewing. This includes anything the organization has shared with us as a part of our participatory disclosures process, as well as any publicly available transparency reports, and a literature review.
Once we have gathered as much information as we can, we bring our review team together for a discussion. This is what we call our sociotechnical assessment. During this conversation we assess multiple dimensions of the product, including:
- Common Sense AI Principles. Each product is assessed for potential opportunities, limitations, and harms according to the standards of each AI principle.
- A Continuum of Risk. Different applications of advanced technologies may be riskier in some contexts. Higher-risk applications require increased scrutiny, auditing, and disclosure.
- Ethical Risk Sweeping. There are significant differences between reviewing a multi-use tool like ChatGPT and a built-for-purpose application. This framework accounts for these distinctions while also educating users on the product's technologies, best uses, unique opportunities, and risk profiles.
Once we've completed our discussion, we finalize the product's ratings under each AI Principle. We then incorporate information from the rest of our sociotechnical assessment to determine the final product rating. We share the key things we discussed and the most important things for you to know, though we do not present any proprietary information that the company shared with us as part of the participatory disclosure process.. There's a lot of information in each of them, and that's by design! Our goal is to give you a clear picture of the details that are important to think about as you decide whether to use these products in your homes and schools—or how to regulate them.
Participatory Disclosures
The first stage in the Common Sense AI review process is the collection of data, model, and system information. This is a private disclosure process between the organization whose product we're reviewing and Common Sense. None of these disclosures are required. In other words, if the creators of any given AI product wish to keep all information private, this is up to them. The more information we have, the more robust our assessment—and ultimately our rating and review—can be. If there isn't enough information to inform our sociotechnical assessment, and publicly available transparency reports or other information does not exist, this factors into the final rating and review for any given product.
Who fills out the participatory disclosures? Common Sense recommends that each organization follow the roles outlined in this annotated model card template, which includes three distinct functions: 1) the developer, who writes the code and runs training; 2) the sociotechnic, who is skilled at analyzing the interaction of technology and society in the long term (this includes lawyers, ethicists, sociologists, or rights advocates); and 3) the project organizer, who understands the overall scope and reach of the model, is responsible for ensuring that each part of the card is filled out, and serves as a contact person.
What do we ask in our participatory disclosures?
Common Sense has adapted and expanded on frameworks such as model cards, datasheets, system cards, and other forms of AI transparency reporting. We cover questions in the following areas:
- System Information: These questions are intended to gather general information about the product, including information about how it was developed.
- Intended Uses: The answers to these questions should provide sufficient information for us to quickly grasp what the product should and should not be used for, and why it was created.
- Performance Factors: This section is intended to provide a summary of product performance across a variety of relevant factors.
- Metrics: This section captures information about how the organization measures the real-world impact of the product.
- Evaluation Data: We ask which data sets and benchmarks an organization used to test its product. It is important that evaluation data sets include data sets that are publicly available for third-party use. As such, for any data sets referenced in this section, we ask for links or information that would provide visibility into the evaluation data's source and composition.
- Training Data: Ideally, this section should capture as much information about the training data as the evaluation data. We recognize, however, this might not always be feasible (e.g., if the data is proprietary, etc). In these cases, we ask for organizations to share basic details about the distributions over groups in the data, as well as any other details that could help inform our readers about what the system has encoded.
- Responsible AI: This section identifies ethical considerations that went into system development. We recognize that ethical analysis does not always lead to precise solutions, and in these scenarios, we aim to gather information on any ethical contemplation process(es) and outcome(s). We ask about a series of general responsible AI considerations, as well as a set of questions that are specific to children, teens, families, and education.
- Data-Specific Questions: These sections focus more deeply on both evaluation and training data. The included questions are inspired by and modified from the standards identified in Datasheets for Datasets. As noted by the paper's authors, the "World Economic Forum suggests that all entities should document the provenance, creation, and use of machine learning data sets in order to avoid discriminatory outcomes." This type of transparency reporting will additionally be required for many use cases in the upcoming EU AI Act. In this section, we ask questions about
- Data set creation and motivation
- Data composition
- Data collection process
- Data & system maintenance
The complete list of questions in our participatory disclosures are available here.
Our sociotechnical assessment
The information provided through the participatory disclosure process and our own research provide the foundation for our responsible AI assessment. This evaluation interrogates and documents known opportunities and ethical risks, which then inform the review write-ups, the ratings for the product under each AI Principle, and the final rating for each product. Much of the foundation for this assessment is built on the seminal work of Santa Clara University's Markkula Center for Applied Ethics, and specifically the Ethics in Technology Practice.
Importantly, these reviews are not conducted individually, but by a team. This matters because no single person, type of background, or specific expertise can single-handedly recognize the ethical blind spots, technical limitations, or opportunities within a given product. Teamwork also matters because it helps to ensure that personal beliefs do not drive review outcomes; team members actively hold each other accountable to the process itself, not any specific outcome.
This also speaks to the importance of a repeatable process for any AI governance work. Any responsible AI evaluation deals with highly complex technologies that can have significant impact on people. Evaluating those impacts means that the topics that need to be discussed, while crucial, can be difficult. Having a repeated, dependable process with a team of people who all feel psychologically safe to discuss these topics allows us to repeatedly surface the right information to share in our final reviews in a way you can trust.
Our sociotechnical assessment breaks down into two main categories: Overview & ethical risk sweeping and AI Principles assessments.
Overview & Ethical Risk Sweeping
This part of our review lays the groundwork for the more specific AI Principles assessments. This section covers the following:
Product Overview, Benefits & Opportunities
This section covers what the product does, its intended uses, foreseeable users, and whether or not children and teens are using the product. We also assess how the product creates value for its users and the broader society, and how it might do more to positively impact children, teens, and education in the future.
Societal landscape assessment
This section aims to contextualize the product and look at how evenly its benefits are or aren't distributed. Questions in this section include:
- Whose interests, desires, skills, experiences, and values might this product have simply assumed, rather than actually consulted? Why do we think this, and with what justification?
- Who are the various stakeholders that will be directly affected by this product? Do their interests appear to have been protected? How do we know what their interests really are—have they been asked?
- Which groups and individuals will be indirectly affected in significant ways? How have their interests been protected? How do we know what their interests really are—have they been asked?
- Who might use this product that wasn't expected to use it, or for purposes that weren't expected? How does this expand or change the stakeholder picture?
Ethical risk sweeping
Leveraging the Markkula Center's definition, ethical risks are "choices that may cause significant harm to persons, or other entities/systems carrying a morally significant status (ecosystems, democratic institutions, water supplies, animal or plant populations, etc.), or are likely to spark acute moral controversy for other reasons." It is important to note that ethical risks are not always easy to identify, and we are always pushing ourselves to recognize common blind spots, including when:
- We do not share the moral perspective of other stakeholders.
- We fail to anticipate the likely causal interactions that will lead to harm.
- We consider only material or economic causes or consequences of harm.
- We fail to draw the distinction between conventional and moral norms.
- The ethical risks are subtle, complex, or significant only in aggregate.
- We misclassify ethical risks as legal, economic, cultural, or PR risks.
We conduct this risk sweeping by incorporating concepts of both post- and pre-mortem ethical risk assessments. We also include a series of questions designed to assess the worst possible scenarios, called the "think about the terrible people" assessment. Questions in this section include:
- Have there been any known ethical failures or unaddressed concerns with this product? If so, what are they?
- Do we know what combination or cascade of causes led to the ethical failure(s)? If not, what do we think might be the cause(s)?
- Did the organization do anything differently as a result?
- For all the risks identified, which are trivial? Which are urgent? Which are too remote to consider? Which are potentially remote but too serious to ignore?
- In what situations might the product creator be tempted to hide information about risks of harm to children or education?
- What shortcuts might the product creator take to optimize for profit and/or adoption that could cause harm to children or education?
- In what ways could this product be abused, misinterpreted, or weaponized by its users?
AI Principles Assessments
For each of the eight Common Sense AI Principles, we ask a series of questions to help us assess how well a product aligns with each principle. At a high level, we are seeking to answer the following for each principle:
Put People First
What we aim to address: Does the product respect human rights and children's rights, as well as identity, integrity, and human dignity? Does it support human agency with human-in-the-loop and adults (parents, caregivers, educators)-in-the-loop frameworks?
Additional questions related to this principle include:
- Do the creators of this product make any commitments (e.g., in Terms of Service, Acceptable Use Policies, product documentation, marketing materials, etc.) regarding human rights or children's rights? If so, are these commitments enforceable (including contractually)?
- Does this product provide sufficient training and information for adults (and children, if relevant) to effectively use the system? Does this training and information address safe use, risk of harms, and how to protect from violation of rights?
- Does the product have the potential to improve the quality of an educator's day-to-day work? Are teachers experiencing less burden and more ability to focus and effectively teach their students due to the product?
- If this product is reducing teaching burden(s), might those burdens be simply shifted to others, or are they reduced completely?
- Are there monitoring systems in place to prevent overconfidence in or overreliance on the AI system?
Promote Learning
What we aim to address: Is the product centered on the needs of individual students, including linguistically diverse students and students with disabilities? Is it aligned with content standards? Does it enable and augment educators? Does it foster a love of learning?
Additional questions related to this principle include:
- Was the tool developed with kids in mind?
- To what extent does the product enable adaptation to students' strengths, not just deficits?
- How does this product support the whole learner, including social dimensions of learning such as enabling students to be active participants in small groups and collaborative learning?
- Does this product enable improved support for learners with disabilities and English language learners?
Prioritize Fairness
What we aim to address: Does the product prioritize equitable sharing of the benefits of artificial intelligence, with a goal of eliminating unfair bias in the development and use of AI systems? Does it respect social and cultural diversity, actively address inequities, and avoid creating or propagating harms, restriction of lifestyle choices, and the concentration of power?
Additional questions related to this principle include:
- What do we know from the participatory disclosures about any fairness evaluations, practices, mitigations, etc?
- Does the product documentation or its training process provide insight into potential bias in the data?
- Is the system accessible by everyone in the same way, without any barriers?
- Who might this product be unintentionally excluding?
- Could this product be damaging to someone or to some group, or unevenly beneficial to people? In what ways?
- Who might be at substantial risk of harm from this product, and how? Have the developers recognized, justified, and mitigated this risk specifically, and what has been done to procure the informed and meaningful consent of those at risk?
- Have the creators put any procedures in place to detect and deal with unfair bias or perceived inequalities that may arise broadly? Are there specific systems designed for use by children, teens, and students?
- Are there mechanisms for teachers to be able to exercise their voice and decision-making to improve equity, reduce bias, and increase cultural responsiveness while using this product? If so, does that impact local use only, or the product more broadly?
Help People Connect
What we aim to address: Does the product foster meaningful human contact and interpersonal connection? Does it create addition to or dependence on the AI system? It should not incite hatred against an individual or group, dehumanize individuals or groups, or employ racial, religious, misogynist, or other slurs and stereotypes that incite or promote hatred.
Additional questions related to this principle include:
- How might this product increase or enhance social connection?
- How might this product contribute to a stronger school learning community?
- Can students engage with the product collaboratively?
- Are there any circumstances in which this product might dehumanize an individual or group, incite hatred against an individual or group, or include racial, religious, misogynist, or other slurs or stereotypes that could do so?
- Does the AI system clearly signal that its social interaction is simulated and that it has no capacities of feeling, empathy, or thought?
- Does this product attempt to "build a relationship" with a child? If so, does it promote or cause use that could lead to addiction to or dependence on continued use?
Be Trustworthy
What we aim to address: Is the product built on sound science that embraces peer review, validated multidisciplinary research, and reproducibility? Does it actively protect children from open beta testing, either through exclusion or informed consent? Does the product perpetuate misinformation or disinformation? Does it avoid contradicting well-established expert consensus and the promotion of theories that are demonstrably false or outdated?
Additional questions related to this principle include:
- Did the product creators take into account multidisciplinary research, especially social science, and other information about the societal landscape when developing it? How do we know?
- Is high-quality research or evaluations about the impacts of using the AI system available? Do we know not only whether the system works, but for whom and under what conditions?
- Does the product’s creator employ a strategy to monitor and test if the AI system is meeting the goals, purposes, and intended applications?
- Is information available to assure children, teens, parents and caregivers, and educators of the system's technical robustness and safety?
Protect Our Privacy
What we aim to address: Does the product protect data? Does it provide clear policies and procedures? Notice and consent for use of data? Does it allow children—in accordance with their age and maturity—to access, securely share, understand the use of, and control and delete their data, and for parents, caregivers, and educators to do the same when appropriate?
Additional questions related to this principle include:
- Does the product feed user input back into its training? If so, does it clearly notify users? Does it provide an opt-out mechanism?
- Are there mechanisms to ensure that sensitive data is kept anonymous? Are there procedures in place to limit access to the data only to those who need it?
- Is access to learner data protected and stored in a secure location and used only for the purposes for which the data was collected?
- Is there a mechanism to allow teachers and school leaders to flag issues related to privacy or data protection?
- Is it possible to customize the privacy and data settings?
- Does the AI system comply with regulations and laws like the General Data Protection Regulation, FERPA, and COPPA?
Keep Kids & Teens Safe
What we aim to address: Does the product protect children's safety, health, and well-being, regardless of whether the product is intended to be used by them? Are there special protections for marginalized communities and sensitive data? Does the product create risks to mental health? Does it produce or surface content that could directly facilitate harm to people or place? Explicit how-to information about harmful activities? Promote or condone violence? Disparage or belittle victims of violence or tragedy? Deny an atrocity? Lack reasonable sensitivity toward a natural disaster, pandemic, atrocity, conflict, death, or other tragic events?
Additional questions related to this principle include:
- Is this a safe tool for kids?
- Has it been tuned or trained to keep users safe?
- Can the product be used, directly or indirectly, to bully, harass, blackmail, or exploit others?
- Are users exposed to content that is safe and free from misinformation, disinformation, harmful content, etc?
- Does use of the system create any harm or fear for individuals or for society?
- Does this product specifically support teachers and school leaders to evaluate student well-being, and if so, how is this being monitored in the product?
- Even if it does not specifically address well-being, how might the product affect the social and emotional well-being of learners and teachers?
Be Transparent & Accountable
What we aim to address: Does the product provide mechanisms for feedback, moderation tools for adults, or notification tools that flag potentially harmful content? Is there any transparency reporting, and if so, is it sufficient and easy to understand? Could the product have a direct and significant impact on people or place, and if so, is it subject to meaningful human control? Is it the primary source of information for decision-making?
Additional questions related to this principle include:
- In what ways does this product enable an adults-in-the-loop framework?
- What (if any) mechanisms are there for reporting, remediation, and control?
- Are parents, teachers, and school leaders aware of the AI methods and features being utilized by the system?
- Is it clear what aspects AI can take over within the system?
- How is the effectiveness and impact of the AI system being evaluated, and how does this evaluation consider children and/or key values of education?
Additional assessments for multi-use & generative AI products
For these types of products, we conduct additional testing across five areas: performance (how well the system performs on various tasks), robustness (how well the system reacts to unexpected prompts or edge cases), information security (how difficult it is to extract training data), truthfulness (to what extent a model can distinguish between the real world and possible worlds), and risk of representational and allocational harms. We recognize that as a third party, any results will be imperfect and directional.
There are a range of known data sets and benchmarks that can be used to help evaluate against these areas. We use a set of known benchmarks, and when needed modify them into prompts. It is important to note that no benchmark can cover all of the risks associated with these systems, and while a company can certainly improve performance against a certain benchmark, that does not mean that it is free of those types of harms.
Our prompt analyses assess the following areas and types of harm:
- Discrimination, hate speech, and exclusion. This includes social stereotypes and unfair discrimination, hate speech and offensive language, exclusionary norms, and lower performance for some languages and social groups.
- Information hazards. This includes whether a product can leak sensitive information or cause material harm by disseminating accurate information about harmful practices.
- Misinformation harms. This includes whether a product can disseminate false or misleading information or cause material harm by disseminating false or poor information.
How we categorize types of AI in our reviews
There are many types of AI out there, and almost just as many ways to describe them! We’re bucketing our AI Product reviews into three categories:

Multi-Use
These products can be used in many different ways, and are also called "foundation models." This category includes products like generative AI, such as chatbots and products that create images from text inputs, translation tools, or computer vision models that can examine images and detect objects like logos, flowers, dogs, or buildings.

Applied Use
These products are built for a specific purpose, but they aren't specifically designed for kids or education. Examples of this category include automated recommendations in your favorite streaming app, or the way an app sorts the faces in a group of photos so you can find pictures of your niece at a wedding.

Designed for Kids
This category is a subset of Applied Use products, and it covers products specifically built for use by kids and teens, either at home or in school. This category also includes education products designed for teachers or administrators (such as a virtual assistant for teachers) that are ultimately intended to benefit students in some way.