MIT researchers unveiled a technique that turns pretrained computer-vision systems into models that justify their outputs using human-readable concepts, aiming to bolster trust in high-stakes uses such as medical diagnosis and autonomous driving. The method uses a sparse autoencoder to extract concepts the original model already relies on, then has a multimodal large language model describe and label those concepts in plain language. The resulting “concept bottleneck” forces each prediction to depend on a small set of at most five of the most relevant concepts, curbing information leakage from hidden features. In tests on bird-species identification and skin-lesion classification, the approach outperformed state-of-the-art concept bottleneck models on accuracy while delivering crisper explanations, though fully opaque models still led on raw performance. The work, by Antonio De Santis and colleagues at MIT CSAIL and the Polytechnic University of Milan, will be presented at the International Conference on Learning Representations. Funding came from Italian and EU programs and industry partners.
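The article does not include code, but the pipeline it describes can be sketched roughly as follows: a sparse autoencoder produces candidate concept activations from a frozen backbone's features, and a bottleneck keeps only the top five concepts before classification. Everything below is assumed for illustration (the dimensions, the `SparseAutoencoder` and `ConceptBottleneckHead` classes, the sparsity penalty); it is not the authors' implementation, and the LLM labeling step is only indicated in comments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

FEATURE_DIM = 512     # dimensionality of the frozen backbone's pooled features (assumed)
NUM_CONCEPTS = 2048   # overcomplete dictionary of candidate concepts (assumed)
NUM_CLASSES = 200     # e.g. bird species
TOP_K = 5             # cap on concepts allowed to influence each prediction

class SparseAutoencoder(nn.Module):
    """Learns a sparse, overcomplete code over frozen backbone features.
    Each latent unit is a candidate 'concept' to be named later by an LLM."""
    def __init__(self, feature_dim, num_concepts):
        super().__init__()
        self.encoder = nn.Linear(feature_dim, num_concepts)
        self.decoder = nn.Linear(num_concepts, feature_dim)

    def forward(self, features):
        # ReLU keeps activations non-negative; the L1 penalty below keeps them sparse.
        concepts = F.relu(self.encoder(features))
        reconstruction = self.decoder(concepts)
        return concepts, reconstruction

class ConceptBottleneckHead(nn.Module):
    """Classifies from at most TOP_K concept activations, zeroing out the rest
    so no extra information leaks through unnamed hidden features."""
    def __init__(self, num_concepts, num_classes, top_k=TOP_K):
        super().__init__()
        self.top_k = top_k
        self.classifier = nn.Linear(num_concepts, num_classes)

    def forward(self, concepts):
        # Keep only the top-k most active concepts per example.
        topk_vals, topk_idx = concepts.topk(self.top_k, dim=-1)
        bottleneck = torch.zeros_like(concepts).scatter(-1, topk_idx, topk_vals)
        return self.classifier(bottleneck), topk_idx

# Toy usage with random tensors standing in for a frozen vision backbone's features.
features = torch.randn(8, FEATURE_DIM)
sae = SparseAutoencoder(FEATURE_DIM, NUM_CONCEPTS)
head = ConceptBottleneckHead(NUM_CONCEPTS, NUM_CLASSES)

concepts, recon = sae(features)
sae_loss = F.mse_loss(recon, features) + 1e-3 * concepts.abs().mean()  # reconstruction + sparsity
logits, used_concepts = head(concepts)

# In the described method, a multimodal LLM would name each concept index
# (e.g. concept 1412 -> "striped wing pattern"), so 'used_concepts' reads as
# a human-readable justification for each prediction.
print(logits.shape, used_concepts[0].tolist())
```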
Related articles:
Concept Bottleneck Models
A Unified Approach to Interpreting Model Predictions (SHAP)
“Why Should I Trust You?”: Explaining the Predictions of Any Classifier (LIME)
Zoom In: An Introduction to Circuits
Toy Models of Superposition