Mechanistic interpretability for AI safety–a review

How

Privileged basis vs. non-privileged basis: contrasting monosemantic and polysemantic neurons

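Systematizing the definition of features (Section 3.1):
- Defines a feature as an irreducible atom of a neural network's representation
- Covers not only features tied to input patterns but also abstract features that act as natural abstractions
- Acknowledges alien representations that lie beyond human interpretability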
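Analyzing the nature of representations (Section 3.2):
- Polysemantic neuron problem: a single neuron responds to a mixture of seemingly unrelated features
- Superposition hypothesis: a network encodes more features than it has dimensions by letting them overlap in activation space
- Linear representation hypothesis: high-level features are represented as linear directions in activation space
- Comparison of privileged and non-privileged bases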
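As a rough illustration of the superposition and linear representation hypotheses listed above, the following toy sketch packs five features into a two-dimensional activation space. The feature directions, the evenly spaced layout, and the names `directions`, `encode`, and `read_out` are all assumptions made for illustration; they are not taken from the review.

```python
import math

N_FEATURES = 5  # more features than dimensions (the activation space is 2-D)

# Hypothetical feature directions: unit vectors evenly spaced on a circle.
directions = [
    (math.cos(2 * math.pi * k / N_FEATURES), math.sin(2 * math.pi * k / N_FEATURES))
    for k in range(N_FEATURES)
]


def encode(feature_values):
    """Superpose features: hidden = sum over k of value_k * direction_k."""
    x = sum(v * d[0] for v, d in zip(feature_values, directions))
    y = sum(v * d[1] for v, d in zip(feature_values, directions))
    return (x, y)


def read_out(hidden):
    """Linear readout: project the hidden vector onto each feature direction."""
    return [hidden[0] * d[0] + hidden[1] * d[1] for d in directions]


# Sparse input: only feature 2 is active.
sparse_input = [0.0, 0.0, 1.0, 0.0, 0.0]
hidden = encode(sparse_input)
scores = read_out(hidden)

# The active feature has the largest projection; the other scores are
# non-zero interference, which is the price of superposition.
recovered = max(range(N_FEATURES), key=lambda k: scores[k])

# "Neuron 0" is the first coordinate axis. Every feature has a non-zero
# weight on it, so in the neuron basis it looks polysemantic.
neuron0_weights = [d[0] for d in directions]
features_driving_neuron0 = [k for k, w in enumerate(neuron0_weights) if abs(w) > 1e-9]

print("recovered feature:", recovered)
print("features that move neuron 0:", features_driving_neuron0)
```

In this sketch the feature is recoverable along its own direction even though no single coordinate axis corresponds to it, which is the intuition behind reading features as directions rather than as individual neurons.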
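Extracting computational mechanisms (Section 3.3):
- Circuit analysis: identifying the causal connections between neurons that give rise to a specific behavior
- Motifs: recurring computational patterns or sub-circuits
- Universality hypothesis: similar circuit structures emerge across different models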
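One simple way to probe for the causal connections that circuit analysis looks for is zero-ablation: silence one unit and measure how far the output moves. The sketch below does this on a tiny hand-set network. The weights and the names `W_IN`, `W_OUT`, and `forward` are hypothetical, and zero-ablation is only one of several intervention methods; it is not presented as the review's own procedure.

```python
def relu(v):
    return max(0.0, v)


# Hypothetical hand-set weights: 2 inputs -> 3 hidden units -> 1 output.
W_IN = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.5]]
W_OUT = [1.0, 0.0, -0.1]


def forward(x, ablate=None):
    """Run the toy network; optionally zero out one hidden unit."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W_IN]
    if ablate is not None:
        hidden[ablate] = 0.0  # zero-ablation of the chosen unit
    return sum(w * h for w, h in zip(W_OUT, hidden))


x = [1.0, 2.0]
baseline = forward(x)

# Causal effect of each hidden unit = change in output when it is removed.
effects = {unit: baseline - forward(x, ablate=unit) for unit in range(len(W_IN))}
most_important = max(effects, key=lambda u: abs(effects[u]))

print("baseline output:", baseline)
print("ablation effects:", effects)
print("unit with largest causal effect:", most_important)
```

Units whose removal leaves the output unchanged (unit 1 here) can be dropped from the candidate circuit for this input, while units with large effects are kept for closer inspection.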
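Understanding emergence (Section 3.4):
- Simulation hypothesis: a network trained on prediction comes to contain internal world models
- Prediction orthogonality: a model whose objective is prediction can simulate agents pursuing any objective
- Emergence of simulated internal agents and the resulting risk of misalignment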

Related papers worth reading

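Foundational work

On gradient-like explanation under a black-box setting: when black-box explanations become as good as white-box

A comprehensive review of mechanistic interpretability evaluation; it provides theoretical grounding for research on explaining black-box models.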

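Foundational work

Towards uncovering how large language model works: An explainability perspective

This review surveys recent theory on LLM explainability and interpretability together with technical safety, providing a foundation for the listed paper's analysis of LLM mechanisms.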

Foundational work

From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs

A survey of AI memory mechanisms as a whole; it helps deepen the interpretation of neural networks' internal structure.

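Foundational work

Language agents mirror human causal reasoning biases

Discusses intrinsic causal-reasoning biases in neural networks and their mechanistic analysis, and offers a methodology for studying cognitive biases in LLMs.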

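Foundational work

The hidden dimensions of llm alignment: A multi-dimensional safety analysis

The review Mechanistic interpretability for ai safety–a review provides the theoretical basis for analyzing how internal representations change under LLM alignment and the associated safety issues.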

Foundational work

Grammars of formal uncertainty: When to trust llms in automated reasoning tasks

The review Mechanistic interpretability for ai safety offers a theoretical approach essential to quantifying uncertainty and trust in LLM-driven automated reasoning, and grounds the PCFG-based uncertainty analysis.

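Foundational work

Mind the gap: Examining the self-improvement capabilities of large language models

Its in-depth account of AI safety and the limits of self-verification provides the theoretical basis for discussing the generation-verification gap (GV-gap).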

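Foundational work

Mechanistic Interpretability Tool for AI Weather Models

A comprehensive review of mechanism-level interpretability for AI models; it strengthens the theoretical background of interpretation tools for weather models.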

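Foundational work

Unsupervised protein language models learn patterns of enzyme function

This review treats mechanistic interpretation techniques in depth from an AI safety perspective and provides theoretical grounding for interpreting the listed paper's PLM embeddings and repeated-experiment design.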

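Foundational work

AlphaInterp: Probing AlphaFold 3's Internal Representations Reveals Evolutionary Determinants of Predicted Structure and Confidence

A recent review of mechanistic interpretation and internal-structure analysis of foundation models; it connects to the interpretation of AlphaFold 3's internal representations.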

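Alternative approaches

A practical review of mechanistic interpretability for transformer-based language models

Comprehensively analyzes interpretability tools and methods from an AI safety standpoint, allowing different perspectives on transformer interpretation frameworks to be compared.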

Alternative approaches

Can foundation models actively gather information in interactive environments to test hypotheses? arXiv preprint arXiv:2412.06438, 2024.

A related study that evaluates how well foundation models learn and adapt in interactive environments.

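Alternative approaches

Ecm: A unified electronic circuit model for explaining the emergence of in-context learning and chain-of-thought in large language model

This review surveys mechanistic interpretation from an AI system safety perspective, offering a point of contrast for weighing the strengths and limitations of the listed paper's circuit-based explanation.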

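Alternative approaches

InternAgent: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification

The InternAgent paper aims to realize a closed-loop scientific agent and shows, from a different angle, how the mechanistic interpretability issues raised in this review play out in actual research automation.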

Alternative approaches

Reverse predictivity for bidirectional comparison of neural networks and biological brains

The Mechanistic interpretability for AI safety review shows concretely how model interpretation differs from methods for evaluating biological fit.

Alternative approaches

From Disorder to Design: Physical Mechanisms Governing Generalization and Hallucination in Deep Learning for Imaging Through Scattering Media

A comprehensive review of mechanistic interpretability from a safety perspective; it broadens the discussion of the physical and interpretive limits of hallucination.

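Follow-up work

Causal learning for socially responsible ai

This review focuses on AI safety and the mechanistic interpretability of causal reasoning, linking the listed paper's discussion of socially responsible AI to concrete research on LLM safety and transparency.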

Follow-up work

Fimo: A challenge formal dataset for automated theorem proving

A survey of mechanistic interpretability for AI safety; it adds direct insight into assessing the reliability of automated theorem-proving systems.

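Follow-up work

A practical review of mechanistic interpretability for transformer-based language models

The paper A practical review of mechanistic interpretability for transformer-based language models discusses mechanistic interpretability methods for the transformer family in detail, complementing this review's theoretical treatment so that it can be applied in practice.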

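Follow-up work

Language agents achieve superhuman synthesis of scientific knowledge

This review's discussion of AI safety and interpretability is taken further in the listed paper, which examines the core issues in building agents that prevent LLM hallucination and ensure reliability.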

Follow-up work

From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs

The review Mechanistic interpretability for ai safety–a review relates LLMs' internal memory and interpretability mechanisms to safety, extending the listed paper's human-machine memory comparison to concrete risk cases.

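Applications

Introspective growth: Automatically advancing llm expertise in technology judgment

Causal analysis based on mechanistic interpretability can be applied to the listed paper's framework for diagnosing the gap between an LLM's latent knowledge and the knowledge it actually uses.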

Applications

Language agents mirror human causal reasoning biases

Language agents mirror… presents an analysis of causal-reasoning biases in LLMs as a concrete case, showing mechanistic interpretability applied in practice.

Applications

Mind the gap: Examining the self-improvement capabilities of large language models

By analyzing LLMs' self-improvement and verification capabilities, it shows what the interpretability and safety discussion means for real-world LLM use.

Applications

Mechanistic Interpretability Tool for AI Weather Models

The paper Mechanistic Interpretability Tool for AI Weather Models applies interpretability tools to a real scientific AI model (weather), providing an example of this review's concepts applied in a specific setting.
