Gemini 3 is widely regarded in the industry as a key breakthrough on the path toward Artificial General Intelligence (AGI) and Artificial Superintelligence (ASI), and as a stunning feat of human and machine collaboration. Yet, as Ilya Sutskever noted in his November 26 interview, the scaling law of large models, like Moore's law, will sooner or later fail due to physical limits. It has therefore become urgent to open up the alchemy furnace of large-model training, see clearly the basic principles behind the black box, and answer whether large models are already approaching the limits of their capabilities. Prior theoretical studies of large models, however, have remained confined to a single dimension, revealing only the tip of the iceberg of the principles behind them and yielding a one-sided understanding of the black box.
On November 3, we posted a paper on arXiv, Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs [1]. This work organically combines statistical physics, signal processing, and information theory, systematically summarizing our thinking on and understanding of the mathematical principles behind large models, in the hope of paving the way toward a complete picture of the first principles of LLMs. Over the past few weeks, we presented this work at the following academic meetings:
November 2: the 32nd Information Theory Annual Conference of the Chinese Institute of Electronics
November 15: the 3rd Conference on Mathematics of Information and Communications and Its Applications, China Society for Industrial and Applied Mathematics
11 月 17 日:The 2nd Conference-School on Tensor Methods in Mathematics and Artificial Intelligence Computing
The 2024 Nobel Prize in Physics was awarded to John Hopfield and Geoffrey Hinton, with the citation: "For foundational discoveries and inventions that enable machine learning with artificial neural networks." Many people found this hard to understand, and even some in the AI community felt the Nobel committee was chasing the hype. In fact, from the early Hopfield network onward, neural networks have had a very deep connection with statistical physics.
Hopfield is himself a physicist. In 1982 he proposed the Hopfield network, whose associative-memory capability astonished the world [2]. The breakthrough reignited large-scale research on neural networks and AI; it is fair to say that he made an indelible contribution to leading AI research out of its winter. Hinton, known as the "Godfather of AI," was the first computer scientist to recognize the enormous value of statistical-physics methods for neural networks. In 1985, he and two collaborators proposed the Boltzmann machine, whose key innovation was importing the energy-based model (EBM) from statistical physics [3][4]. Beyond the two laureates, the physicist Elizabeth Gardner also played a pivotal role. In three papers published in 1988 and 1989, Gardner systematically studied the memory capacity of Hopfield networks, that is, how many random patterns such a network can store [5][6][7]. This capacity later came to be called the Gardner capacity. Her tools were precisely the spin glass model and the replica method from statistical physics, and the replica method was developed to its full power by Giorgio Parisi, winner of the 2021 Nobel Prize in Physics [8][9]. We had a conversation with him this year (video link: https://weixin.qq.com/sph/AlRVrYjAi) that explored in depth the relationship between AI and statistical physics.
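To make Hopfield's associative memory concrete, here is a minimal sketch in Python (an illustration of the classical model, not code from any of the cited papers): it stores a handful of random ±1 patterns with the Hebbian rule and then recovers one of them from a corrupted cue. The sizes N and P and the 15% corruption rate are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 200, 20  # neurons, stored patterns; P/N = 0.1, below the Hebbian limit
patterns = rng.choice([-1, 1], size=(P, N)).astype(float)

# Hebbian storage rule: W = (1/N) * sum_mu xi^mu (xi^mu)^T, with zero self-coupling.
W = patterns.T @ patterns / N
np.fill_diagonal(W, 0.0)

def recall(state, max_sweeps=20):
    """Asynchronous sign updates; each flip lowers the energy E = -1/2 s^T W s,
    so the dynamics settle into a fixed point (ideally a stored pattern)."""
    state = state.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in range(N):
            s_i = 1.0 if W[i] @ state >= 0 else -1.0
            if s_i != state[i]:
                state[i] = s_i
                changed = True
        if not changed:  # fixed point reached
            break
    return state

# Corrupt 15% of the first stored pattern and let the network clean it up.
probe = patterns[0].copy()
flipped = rng.choice(N, size=int(0.15 * N), replace=False)
probe[flipped] *= -1

recovered = recall(probe)
print("overlap with stored pattern:", recovered @ patterns[0] / N)  # close to 1.0
```

Keeping P/N around 0.1 makes recall essentially perfect; pushing P past roughly 0.138N lets spurious minima take over, and that storage-capacity question is exactly what Gardner attacked with the spin glass model and the replica method.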
1. B. Bai, "Forget BIT, it is all about TOKEN: Towards semantic information theory for LLMs," arXiv:2511.01202, Nov. 2025.
2. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proceedings of the National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558, Apr. 1982.
3. D. Ackley, G. Hinton, and T. Sejnowski, "A learning algorithm for Boltzmann machines," Cognitive Science, vol. 9, no. 1, pp. 147–169, Jan. 1985.
4. G. Hinton, "A practical guide to training restricted Boltzmann machines," in Neural Networks: Tricks of the Trade, 2nd ed., Berlin, Germany: Springer, 2012, pp. 599–619.
5. E. Gardner, "The space of interactions in neural network models," Journal of Physics A: Mathematical and General, vol. 21, no. 1, pp. 257–270, Jan. 1988.
6. E. Gardner and B. Derrida, "Optimal storage properties of neural network models," Journal of Physics A: Mathematical and General, vol. 21, no. 1, pp. 271–284, Jan. 1988.
7. E. Gardner and B. Derrida, "Three unfinished works on the optimal storage capacity of networks," Journal of Physics A: Mathematical and General, vol. 22, no. 12, pp. 1983–1994, Jun. 1989.
8. M. Mézard, G. Parisi, and M. Virasoro, Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications. Singapore: World Scientific Publishing, 1987.
9. G. Parisi, In a Flight of Starlings: The Wonders of Complex Systems. New York, NY, USA: Penguin Press, 2023.
10. A. Vaswani et al., "Attention is all you need," in Proc. 31st Annual Conference on Neural Information Processing Systems ’17, Long Beach, CA, USA, Dec. 2017.
11. E. T. Jaynes, Probability Theory: The Logic of Science. New York, NY, USA: Cambridge University Press, 2003.
12. A. Gu and T. Dao, "Mamba: Linear-time sequence modeling with selective state spaces," arXiv:2312.00752, May 2024.
13. T. Dao and A. Gu, "Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality," arXiv:2405.21060, May 2024.
14. DeepSeek-AI, “DeepSeek-V3.2: Pushing the frontier of open large language models,” DeepSeek, Hangzhou, China, Dec. 2025.
15. T. Cover, "Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition," IEEE Transactions on Electronic Computers, vol. EC-14, no. 3, pp. 326–334, Jun. 1965.
16. M. Talagrand, Mean Field Models for Spin Glasses - Vol. 1: Basic Examples. Berlin, Germany: Springer, 2011.
17. M. Talagrand, Mean Field Models for Spin Glasses - Vol. 2: Advanced Replica-Symmetry and Low Temperature. Berlin, Germany: Springer, 2011.
18. H. Ramsauer et al., "Hopfield networks is all you need," arXiv:2008.02217, Apr. 2021.
19. M. Geva, R. Schuster, J. Berant, and O. Levy, "Transformer feed-forward layers are key-value memories," in Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP) ’21, Punta Cana, Dominican Republic, Nov. 2021, pp. 5484–5495.
20. J. Fang et al., "AlphaEdit: Null-space constrained knowledge editing for language models," arXiv:2410.02355, Apr. 2025.
21. W. Fei et al., "NeuralDB: Scaling knowledge editing in LLMs to 100,000 facts with neural KV database," arXiv:2507.18028, Jul. 2025.
22. X. Niu, B. Bai, L. Deng, and W. Han, "Beyond scaling laws: Understanding transformer performance with associative memory," arXiv:2405.08707, May 2024.
23. M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning, 2nd ed. Cambridge, MA, USA: The MIT Press, 2018.
24. C. Granger, "Testing for causality: A personal viewpoint," Journal of Economic Dynamics and Control, vol. 2, no. 1, pp. 329–352, Jan. 1980.
25. J. Pearl, Causality: Models, Reasoning, and Inference, 2nd ed. New York, NY, USA: Cambridge University Press, 2009.
26. T. Schreiber, "Measuring information transfer," Physical Review Letters, vol. 85, no. 2, pp. 461–464, Jul. 2000.
27. L. Barnett, A. B. Barrett, and A. K. Seth, "Granger causality and transfer entropy are equivalent for Gaussian variables," Physical Review Letters, vol. 103, no. 23, p. 238701, Dec. 2009.
28. J. Massey, "Causality, feedback and directed information," in Proc. International Symposium on Information Theory and Its Applications (ISITA) ’90, Waikiki, HI, USA, Nov. 1990.