Large Language Models Relearn Removed Concepts

robustness
Model editing via neuron pruning can remove targeted concepts from language models, but the models prove resilient, rapidly relearning the pruned concepts during retraining.
Authors

Michelle Lo

Shay B. Cohen

Fazl Barez

Published

January 3, 2024

Major Takeaways

  • Neuroplasticity: Large language models (LLMs) demonstrate the ability to quickly regain performance and redistribute pruned concepts after retraining.
  • Concept Redistribution: Pruned concepts originally present in later layers are remapped to neurons in earlier layers, demonstrating the resilience of LLMs.
  • Polysemantic Capacities: Neurons show polysemantic properties, capturing a blend of old and new concepts during relearning.

Abstract

The study investigates neuroplasticity in large language models (LLMs) by exploring their capacity to reacquire pruned concepts after editing. The findings suggest that models quickly regain performance after pruning by relocating advanced concepts to earlier layers and reallocating pruned concepts to primed neurons with similar semantics. The paper highlights the difficulty of permanently removing concepts to improve model safety, and underscores the importance of monitoring concept reemergence and developing techniques to mitigate the relearning of unsafe concepts.

Introduction

Large language models encode semantic concepts across different languages, architectures, and modalities. The primary objective when pruning such models is to eliminate redundant neurons while preserving the most crucial ones, which leads to the assumption that removing important “concept neurons” will disrupt the model’s structured internal representation of key concepts. However, the paper presents evidence that models regain high performance after pruning of either random or important neurons. This phenomenon, termed “neuroplasticity,” demonstrates a degree of adaptability in such models and has significant implications for model editing.

Problem Setting

The paper provides a formal definition of concept neurons, concept saliency, and concept similarity, and outlines the process for identifying and pruning top concept neurons in a language model to induce neuroplasticity.
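
The summary does not reproduce the paper’s formal definitions, so the following is one plausible formalisation for orientation only: the activation notation a_n(t), the token sets T_c and T, and the cosine-based similarity are editorial assumptions rather than the paper’s exact definitions.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Illustrative definitions (editorial assumptions, not quoted from the paper).
% a_n(t): activation of neuron n on token t; T_c: tokens expressing concept c; T: all tokens.
The saliency of neuron $n$ for concept $c$ contrasts its mean activation on
concept tokens with its mean activation overall:
\[
  s(n, c) = \frac{1}{|T_c|}\sum_{t \in T_c} a_n(t) - \frac{1}{|T|}\sum_{t \in T} a_n(t).
\]
The concept neurons $N_c$ are then the $k$ neurons with the largest $s(n, c)$.
The similarity between two neurons $n$ and $n'$ (for example, a pruned neuron and a
candidate relearned neuron) can be taken as the cosine similarity of their saliency
profiles over a concept set $C$:
\[
  \operatorname{sim}(n, n') =
  \frac{\sum_{c \in C} s(n, c)\, s(n', c)}
       {\sqrt{\sum_{c \in C} s(n, c)^2}\,\sqrt{\sum_{c \in C} s(n', c)^2}}.
\]
\end{document}
```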

Method

The researchers study neuroplasticity in a pretrained model by fine-tuning it on a specific task, identifying and pruning its concept neurons, and then tracking how concepts are redistributed over the course of retraining. Concept saliency and concept similarity are used to analyze this redistribution once the model has relearned.
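
As a concrete analogue of this procedure, the toy example below prunes the most “concept-salient” hidden units of a small classifier and retrains it. It is an editorial sketch on a synthetic task with a simple activation-difference saliency, not the authors’ code or their LLM setup.

```python
# Toy analogue of the prune-then-retrain procedure (editorial sketch,
# not the authors' code): train, prune the most concept-salient hidden
# units, retrain, and watch performance recover as other units take over.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic binary "concept": does the first input feature exceed 0?
X = torch.randn(2000, 16)
y = (X[:, 0] > 0).long()

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def accuracy():
    with torch.no_grad():
        return (model(X).argmax(dim=1) == y).float().mean().item()

def train(steps):
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()

def hidden_saliency():
    # Saliency proxy: mean hidden activation on concept inputs (y == 1)
    # minus mean hidden activation over all inputs.
    with torch.no_grad():
        h = torch.relu(model[0](X))
        return h[y == 1].mean(dim=0) - h.mean(dim=0)

train(300)
print("accuracy after fine-tuning:", accuracy())

# Prune the k most salient hidden units by zeroing their incoming and
# outgoing weights, then reset the optimizer so stale momentum cannot
# silently revive the pruned units.
k = 8
pruned_units = hidden_saliency().topk(k).indices
with torch.no_grad():
    model[0].weight[pruned_units] = 0.0
    model[0].bias[pruned_units] = 0.0
    model[2].weight[:, pruned_units] = 0.0
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
print("accuracy after pruning:", accuracy())

# Retrain: the pruned units receive zero gradient (their activations are
# stuck at 0), so any recovery must come from the remaining units.
train(300)
print("accuracy after retraining:", accuracy())
print("new most-salient units:", hidden_saliency().topk(k).indices.tolist())
```

Keeping the pruned units zeroed during retraining (here they stay dead because ReLU passes no gradient through a constant-zero activation) mirrors the question the paper asks: whether the remaining neurons reabsorb the removed concept.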

Experimental Setup

The study focuses on pruning a specific concept, location names, from several LLMs and analyzes the models across multiple retraining runs. The model architectures, training procedures, and evaluations are clearly described.
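
For readers who want to reproduce a similar setup, one simple way to assemble the set of location-name tokens against which neurons are scored is to run an off-the-shelf named-entity tagger over the evaluation corpus. The snippet below uses spaCy’s GPE/LOC entity labels as an illustrative assumption, not the paper’s exact procedure.

```python
# Collect location-name tokens from a corpus with spaCy (illustrative;
# the paper's exact token-selection procedure may differ).
# Setup: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def location_tokens(texts):
    """Return the set of tokens belonging to GPE/LOC entities in the texts."""
    tokens = set()
    for doc in nlp.pipe(texts):
        for ent in doc.ents:
            if ent.label_ in ("GPE", "LOC"):
                tokens.update(token.text for token in ent)
    return tokens

print(location_tokens(["She flew from Paris to Nairobi last week."]))
# e.g. {'Paris', 'Nairobi'}
```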

Results

The paper presents a detailed analysis of the rapid performance recovery after retraining, high-level concept redistribution, and the relocation of pruned concepts. It also delves into the polysemantic characteristics of neurons after retraining.

Conclusion

The findings contribute to a deeper understanding of how language models learn, adapt, and retain core conceptual representations, and they suggest potential research directions in model editing and transfer learning. The paper concludes by emphasizing the need to study the implications of neuroplasticity-induced polysemanticity in order to aid the development of interpretable models and the improved transfer of learned representations.

Critique

The paper provides valuable insights into neuroplasticity and concept reshaping in LLMs. However, the precise relationship between concept similarity and saliency and the generalizability of the findings to other LLMs require further investigation. Additionally, the paper acknowledges the potential wider impacts of its findings and emphasizes the importance of ethical and responsible AI research.

Appendix

Model: gpt-3.5-turbo-1106
Date Generated: 2024-02-26
Abstract: http://arxiv.org/abs/2401.01814v1
HTML: https://browse.arxiv.org/html/2401.01814v1
Truncated: False
Word Count: 12729