Training for Toxicity

Training for Toxicity is a three-part workshop series in which students build a working text classifier by fine-tuning a pre-trained language model, interrogating the assumptions and decisions embedded at each stage of the pipeline. Working with real posts from far-right Discord channels, students first develop their own taxonomies for different forms of toxic masculinity, then independently label a shared set of posts and confront the disagreements that emerge. In the final workshop, students upload their labeled data into a custom interface, train a machine learning model, and examine its behavior by testing new inputs, analyzing which training examples the model considers most similar, and toggling entire categories on and off to observe how the composition of training data shapes classification.

Across the three workshops, students confront a fundamental question about how AI systems process and categorize human language. The pipeline behind every AI classification system begins with human labor: people reading, interpreting, and assigning labels to individual examples, one at a time. That interpretive work is then compressed into training data that a model learns to reproduce at scale. Each stage of the series makes visible what that compression discards. Categories require agreement on meaning and boundaries, agreements that are always partial and always contestable. Labels flatten the complexity and ambiguity of human language into fixed categories, discarding the context, tone, and interpretive nuance that informed each decision. And once those labels enter the training pipeline, the judgment, debate, and uncertainty that produced them are no longer captured. The model inherits the conclusions without the reasoning, and it can only classify within the categories it was given, forcing everything it encounters into them.

Workshop 1: Building a Taxonomy

In this workshop, students work with a curated dataset of approximately 20,000 comments about gender drawn from far-right Discord servers leaked by the investigative journalism collective Unicorn Riot. These leaked channels, which were made public following the 2017 Unite the Right rally in Charlottesville, contain extensive discourse around masculinity, gender roles, and identity. The dataset provided for this workshop has been filtered to isolate comments related to gender and is available as a CSV download on the workshop website. Working in small groups, students read through a sample of these comments and develop a categorical framework for identifying different forms of toxic masculinity, determining what types of toxicity are present, how they differ from one another, and where the boundaries between categories should fall.

Students may develop their framework entirely from their own reading of the data, or they may adopt and modify a framework provided alongside the dataset. The provided framework includes categories such as Male Victimization/Marginalization (the view that systems are being manipulated to disadvantage men), Fixed Gender Roles in both traditional and manosphere variants, Anti-Feminism, Anti-LGBTQ+, and Male Superiority, among others. Whether students build their own taxonomy or work from the provided one, the central activity is the same: confronting the interpretive decisions that a category system requires and recognizing that those decisions are always contestable.

Download CSV for workshop.
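Instructors who want each group to read the same subset can draw a fixed sample from the downloaded CSV with a few lines of pandas. A minimal sketch, assuming a hypothetical file name (gender_comments.csv); the CSV's actual column names may differ:

    import pandas as pd

    # Hypothetical file name; check the downloaded CSV's actual schema.
    df = pd.read_csv("gender_comments.csv")
    print(f"{len(df)} comments")  # roughly 20,000 after the gender filter

    # Fix the random seed so every group reads the same sample.
    sample = df.sample(n=100, random_state=42)
    sample.to_csv("group_sample.csv", index=False)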

Workshop 2: Negotiating Categories  

Students independently label a shared set of 30–50 comments from the Discord corpus using the categorical framework developed or adopted in Workshop 1. Each student applies labels on their own before any group discussion, recording their decisions in a shared spreadsheet where each student has a separate label column for the same set of comments. This independent labeling is essential: it produces a visible record of where students agree and, more importantly, where they do not. Once labeling is complete, the group identifies the most contested comments and opens them for structured discussion, with each student explaining what they noticed in the text, which categories they considered, and what ultimately determined their decision.

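The spreadsheet work of finding contested comments can also be automated: counting how many distinct labels each comment received, and checking pairwise agreement between students, surfaces the cases worth discussing. A minimal sketch, assuming pandas and hypothetical column names for the shared sheet:

    from itertools import combinations

    import pandas as pd

    # Hypothetical layout: one "comment" column plus one label column per student.
    df = pd.read_csv("shared_labels.csv")
    annotators = ["student_a", "student_b", "student_c"]

    # Pairwise agreement: the fraction of comments two students labeled identically.
    for a, b in combinations(annotators, 2):
        agreement = (df[a] == df[b]).mean()
        print(f"{a} vs {b}: {agreement:.0%} agreement")

    # The most contested comments are those that received the most distinct labels.
    df["distinct_labels"] = df[annotators].nunique(axis=1)
    contested = df.sort_values("distinct_labels", ascending=False)
    print(contested[["comment"] + annotators].head(10))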

The disagreements that surface in this discussion are the point of the workshop, not a problem to be resolved. Students working across disciplines bring different analytical frameworks to the same text: a communications student may read a comment as rhetorical strategy, a sociology student as ideological expression, a gender studies student as structural harm. Each reading leads to a different label. These disciplinary differences mirror a deeper challenge that the discussion should surface: labeling requires collapsing multiple valid interpretations into a single category, and any training dataset represents one set of resolutions to disputes that could have gone differently. Instructors should close by noting that in industry this deliberation rarely happens at all: labels are typically assigned by individual annotators working in isolation, and disagreements are resolved through majority vote rather than through the kind of substantive engagement students have just practiced.
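That industry shortcut is worth making concrete: collapsing the spreadsheet into a single label column by majority vote takes one line, and it discards every disagreement the students just debated. A hedged sketch, reusing the hypothetical annotator columns from the previous snippet:

    # Majority vote per comment: the most frequent label wins; ties are broken
    # by pandas' mode(), which returns candidates in sorted order.
    df["final_label"] = df[annotators].mode(axis=1)[0]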

Download CSV for workshop.

Workshop 3: Interrogating the Model

In this workshop, students upload a CSV of labeled comments into the Training Data Classifier, a custom interface built for this workshop that connects to a DistilBERT language model running in a Google Colab notebook. Students select which columns contain their comment text and labels, train the model on their data, and then begin testing it by typing new comments and examining the results. The interface returns three forms of explanation for each classification: a confidence distribution showing the probability the model assigned to every category; the nearest training examples, meaning the labeled comments from their CSV that the model considers most similar to what they typed; and a word importance analysis that measures how much each word in the input affected the prediction. Together, these three views make the model's reasoning legible in a way that a simple label output would not.

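The Training Data Classifier's internals are not reproduced here, but its three explanation views can be approximated with standard Hugging Face components. A hedged sketch, assuming a DistilBERT checkpoint fine-tuned for sequence classification, a hypothetical category count of 6, and occlusion-based word importance; the interface's actual attribution method may differ:

    import torch
    import torch.nn.functional as F
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    CHECKPOINT = "distilbert-base-uncased"  # fine-tuned weights in practice
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    model = AutoModelForSequenceClassification.from_pretrained(
        CHECKPOINT, num_labels=6  # hypothetical: one logit per taxonomy category
    )
    model.eval()

    def confidence(text):
        """Confidence distribution: softmax probability for every category."""
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        return F.softmax(logits, dim=-1).squeeze(0)

    def embed(texts):
        """First-token ([CLS]) embeddings, used to compare comments."""
        inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            hidden = model.distilbert(**inputs).last_hidden_state
        return hidden[:, 0]

    def nearest_examples(query, train_texts, k=5):
        """Training comments most similar to the typed input, by cosine similarity."""
        sims = F.cosine_similarity(embed([query]), embed(train_texts))
        return [train_texts[int(i)] for i in sims.topk(k).indices]

    def word_importance(text):
        """Occlusion: drop each word and measure the change in the top-class score."""
        probs = confidence(text)
        target = int(probs.argmax())
        words = text.split()
        scores = []
        for i in range(len(words)):
            ablated = " ".join(words[:i] + words[i + 1:])
            scores.append((words[i], float(probs[target] - confidence(ablated)[target])))
        return scores

Occlusion is used here only because it is easy to read; a gradient-based attribution would serve the same pedagogical purpose.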

The interface also allows students to manipulate the training data itself. Each label category is displayed with a toggle that can be switched on or off, and the model can be retrained on any subset of categories. Students can remove the "non-toxic" category and observe the model classify everything as some form of toxic masculinity. They can isolate a single category and see what the model does when it has no other options. They can retrain with their own labeled data alongside the instructor-provided dataset, or compare results across different labeling schemes produced by different groups. Each manipulation makes concrete a principle that would otherwise remain abstract: that classification is always relative to a predetermined set of possibilities, that what is excluded from training data shapes the model as much as what is included, and that the model will reproduce whatever judgments it is given without questioning them.
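Behind each toggle is a simple filter: the model is retrained on only the rows whose labels remain active. A minimal sketch with hypothetical file, column, and category names, showing what switching off the "non-toxic" toggle does to the training data:

    import pandas as pd

    df = pd.read_csv("labeled_comments.csv")  # hypothetical file name

    # Categories currently toggled "on"; "non-toxic" has been switched off,
    # so the retrained model must force every input into some form of toxicity.
    active = {
        "Male Victimization/Marginalization",
        "Fixed Gender Roles",
        "Anti-Feminism",
        "Anti-LGBTQ+",
        "Male Superiority",
    }
    subset = df[df["label"].isin(active)]
    # Retrain on `subset`: the model can now classify only within these categories.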

Download Google Colab Notebook (.ipynb) for workshop.

Download Web interface for workshop.

Download instructions for Notebook and interface.