Training for Toxicity is a three-part workshop series in which students build a text classifier end to end by fine-tuning a pre-trained language model, interrogating the assumptions and decisions embedded at each stage of the pipeline. Working with real posts from far-right Discord channels, students first develop their own taxonomies for different forms of toxic masculinity, then independently label a shared set of posts and confront the disagreements that emerge. In the final workshop, students upload their labeled data to a custom interface, train a machine learning model, and examine its behavior by testing new inputs, analyzing which training examples the model considers most similar, and toggling entire categories on and off to observe how the composition of training data shapes classification.
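For readers curious what the interface abstracts away, here is a minimal sketch of the kind of pipeline it could wrap, written with the Hugging Face libraries. The model name, label set, and example posts are illustrative assumptions, not the workshop's actual configuration or data.

```python
# A minimal sketch, assuming a Hugging Face setup; labels and posts are
# hypothetical stand-ins for a student taxonomy and dataset.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

labels = ["not_toxic", "hostile_sexism"]  # illustrative category set
posts = Dataset.from_dict({
    "text": ["an innocuous example post", "a post the annotators flagged"],
    "label": [0, 1],
})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(labels)
)

# Tokenize the student-labeled posts into model inputs.
posts = posts.map(lambda batch: tokenizer(batch["text"], truncation=True),
                  batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="classifier", num_train_epochs=1),
    train_dataset=posts,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()

# "Testing new inputs": the trained model forces any text into one of the
# taxonomy's categories, however poorly it fits.
inputs = tokenizer("a brand-new post", return_tensors="pt").to(model.device)
prediction = model(**inputs).logits.argmax(dim=-1).item()
print(labels[prediction])
```

A nearest-training-example view like the one described above would typically compare this same model's embeddings of a new input against those of the labeled posts, though the sketch stops at classification and the interface's actual method isn't specified here.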
Across the three workshops, students confront a fundamental question about how AI systems process and categorize human language. The pipeline behind every AI classification system begins with human labor: people reading, interpreting, and assigning labels to individual examples, one at a time. That interpretive work is then compressed into training data that a model learns to reproduce at scale. Each stage of the series makes visible what that compression discards. Categories require agreement on meaning and boundaries, agreements that are always partial and always contestable. Labels flatten the complexity and ambiguity of human language into fixed categories, discarding the context, tone, and interpretive nuance that informed each decision. And once those labels enter the training pipeline, the judgment, debate, and uncertainty that produced them are no longer captured: the model inherits the conclusions without the reasoning, and it can only classify within the categories it was given, forcing everything it encounters into them.
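A toy illustration of that compression (mine, not the workshop's): when several annotators label the same posts, a common aggregation scheme such as majority vote collapses their disagreement into a single training label. The annotators, posts, and categories below are all hypothetical.

```python
# Toy illustration of how label aggregation discards disagreement.
from collections import Counter

# Three students independently label the same two posts.
annotations = {
    "post_1": ["misogyny", "misogyny", "misogyny"],   # unanimous
    "post_2": ["misogyny", "misogyny", "not_toxic"],  # contested
}

# Majority vote keeps only the winning label for each post.
training_labels = {
    post: Counter(votes).most_common(1)[0][0]
    for post, votes in annotations.items()
}

print(training_labels)
# {'post_1': 'misogyny', 'post_2': 'misogyny'}
# The 2-to-1 split on post_2, and the reasoning behind each vote, never
# reach the model: both posts look equally certain in the training data.
```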