TALK: "From Filtering to Fingerprints: Constructing Pretraining Datasets for LLMs and Measuring Biases in the Data"
VISITING SPEAKER: Reinhard Heckel, an associate professor of machine learning at the Technical University of Munich
WHEN: Friday, December 6, 2024 at 11 a.m.
LOCATION: 5105 Iribe Center, University of Maryland

ABSTRACT: In this talk, we first discuss how pretraining datasets for large language models (LLMs) are sourced from the web through heuristic and machine-learning-based filtering techniques. We then investigate biases in these pretraining datasets through dataset classification experiments. Building on prior work demonstrating the existence of biases in popular computer vision datasets, we analyze popular open-source pretraining text datasets derived from CommonCrawl, including C4, RefinedWeb, DolmaCC, RedPajama-V2, FineWeb and others. Although those datasets are obtained with similar filtering and deduplication steps, LLMs can classify surprisingly well which dataset a single text sequence belongs to, significantly better than a human can. This indicates that popular pretraining datasets have their own unique biases or fingerprints.

BIO: Reinhard Heckel is an associate professor of machine learning in the Department of Computer Engineering at the Technical University of Munich, and an adjunct faculty member at Rice University.
From 2017 to 2019, he was an assistant professor of electrical and computer engineering at Rice University. Before that, Heckel was a postdoctoral researcher in the Berkeley Artificial Intelligence Research Lab at UC Berkeley and a researcher at IBM Research Zurich. He completed his Ph.D. in 2014 at ETH Zurich and was a visiting Ph.D. student at Stanford University's Statistics Department. Heckel's work focuses on machine learning, artificial intelligence, and information processing. He specializes in developing algorithms and foundations for deep learning, particularly for medical imaging, establishing mathematical and empirical underpinnings for machine learning, and utilizing DNA as a digital information technology.

Co-sponsored by: UMD Center for Machine Learning
Speaker(s): Reinhard Heckel
Room: 5105, Bldg: Brendan Iribe Center for Computer Science and Engineering, 8125 Paint Branch Dr, College Park, Maryland, United States, 20740