1、2024年2月16日凌晨(美国时间2月15日),OpenAI发布了“文生视频”(text-to-video)的工具,Sora。整个世界再次被震撼了。人类用无数种语言,在全球的社交媒体上惊呼:现实,不存在了。
2、那么,Sora到底是什么?
3、这是一段咒语(Prompt):
A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.
翻译成中文(by ChatGPT)就是:
一位时尚的女士穿着黑色皮夹克、长红裙和黑色靴子,手拿黑色手袋,在东京一条灯光温暖、霓虹灯闪烁、带有动感城市标志的街道上自信而随意地行走。她戴着太阳镜,涂着红色口红。街道潮湿而有反光效果,色彩缤纷的灯光仿佛在地面上创造了镜面效果。许多行人在街上来往。
4、文生视频。文有了。现在,看视频。
5、看完之后,什么感觉?是不是感觉:这……不可能是AI生成的吧?你看她脸上,雀斑和瑕疵那么明显,不像是假的;镜头移动时,水里的倒影也在移动,不像是假的;旁边一起走的那些人,每个人心中有自己的故事,不像是假的;更重要的是,她的墨镜里还有街景的映射,不像是假的。
6、以前不相信是真的。现在不相信是假的。
7、OpenAI知道你会这么想,所以在官网留了一句话:
所有本页面上的视频都是直接由Sora生成,未经修改。
8、Sora的能耐,还不仅仅是“文生视频”。他还能把两个视频,连在一起,实现无缝过渡。比如这个,从现实的乡村,无缝过渡到虚幻的城市。
9、这样的视频,还有很多。Sora还能做很多很多其它事。你可能也已经看到了不少。我就不发了。网上到处都是。
10、比起这些震撼视频,其实我更想知道的是,Sora的出现,对人工智能的整体发展,到底意味着什么?于是,我专门找来OpenAI官方公布的Sora的技术文档,仔细看了一遍。
11、看完之后我发现,这份技术文档,比那些不可思议的视频,更让人震撼。
12、这篇技术文档,没有泄露太多Sora的技术细节。但还是介绍了它的基本原理。
13、简单来说,Sora通过学习视频,来理解现实世界的动态变化,并用计算机视觉技术模拟这些变化,从而创造出新的视觉内容。换句话说,Sora学习的不仅仅是视频,也不仅仅是视频里的画面、像素点,还在学习视频里面那个世界的“物理规律”。
14、听上去,很抽象。我解释一下。
15、比如,你咬一口食物,食物应该出现一个咬痕。这是“物理规律”。如果咬完之后,食物还是完整的,那就不符合“物理规律”。
16、大部分的视频软件,并不理解“物理规律”。他们处理的对象,只是画面。而不是画面里的食物和人。但是Sora,似乎理解。当Sora学习人咬食物的视频时,它记住的,不仅是食物和嘴在一起的“具体画面”,还有“咬就会有痕”这个“物理规律”。以后生成视频时,一旦有“咬”这个动作,Sora就会知道,下面应该出现一个咬痕了。
17、比如,下面这段。
18、用Sora生成的视频,并不总是能“咬就会有痕”。它“有时”也会出错。但这已经很厉害,很可怕了。因为“先记忆,再预测”,这种理解世界的方式,是人类理解世界的方式。这种方式有个名字,叫:世界模型。
19、什么是,世界模型?我举个例子。
20、你的“记忆”中,知道一杯咖啡的重量。所以当你想拿起一杯咖啡时,大脑准确“预测”了应该用多大的力。于是,杯子被顺利拿起来。你都没意识到。但如果,杯子里碰巧没有咖啡呢?你就会用很大的力,去拿很轻的杯子。你的手,立刻能感觉到不对。然后,你的“记忆”里会加上一条:杯子也有可能是空的。于是,下次再“预测”,就不会错了。你做的事情越多,大脑里就会形成越复杂的世界模型,用于更准确地预测这个世界的反应。这就是人类与世界交互的方式:世界模型。
21、关于世界模型,如果感兴趣,我建议你读一本书,叫《千脑智能》。
22、回到Sora。Sora的技术文档里有一句话:
Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
翻译成中文就是:
我们的结果表明,扩展视频生成模型是向着构建通用物理世界模拟器迈进的有希望的路径。
23、什么意思?意思就是说,OpenAI最终想做的,其实不是一个“文生视频”的工具,而是一个通用的“物理世界模拟器”。也就是世界模型,为真实世界建模。
24、而Sora,只是验证了,这条道路可行的一个里程碑。
25、如果从“视频”中,可以开始学习物理的规律了,那么,未来可以不可以从“摄像头”里学习呢?如果也可以的话,那么,给AI装一双“眼睛”,让他满世界跑,会发生什么?如果也可以的话,那么,把全世界的公共摄像头,都开放给OpenAI,会发生什么?
26、Sora的出现,可能意味着,通用人工智能(AGI),正在加速到来。
27、这才是OpenAI,真正想做的事情。
28、所以,这时你就能理解,为什么Sam Altman要筹集7万亿美金,重塑全球AI芯片的基础设施了。7万亿,相当于全球GDP的10%,能买2.5个微软,4个英伟达,或者11.5个特斯拉。为什么?因为,通往通用人工智能的道路上,需要大量、大量、大量的算力。
29、Sora来了,通用人工智能还会远吗?
30、这个世界正在发生着难以想象的变化。看似很远,但又瞬间近在眼前。
31、最后,要感谢Sam Altman,选择初六宣布此事。
32、不然,我们整个春节,都要用来见证历史了。

Translation:
Liu Run: What does OpenAI’s newly released Sora mean?
1. In the early morning of February 16, 2024 (February 15, U.S. time), OpenAI released Sora, a “text-to-video” tool. The world was stunned once again. In countless languages, people across the world’s social media exclaimed: reality no longer exists.
2. So, what exactly is Sora?
3. Here is an incantation (a prompt):
A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.
The Chinese translation (by ChatGPT), rendered back into English, reads:
A stylish woman in a black leather jacket, a long red dress, and black boots, carrying a black handbag, walks confidently and casually down a Tokyo street lit by warm lights, glowing neon, and animated city signs. She wears sunglasses and red lipstick. The street is damp and reflective, and the colorful lights seem to create a mirror effect on the ground. Many pedestrians come and go along the street.
4. Text-to-video. We have the text. Now, watch the video.
5. How did you feel after watching it? Perhaps: this... cannot have been generated by AI, can it? Look at her face: the freckles and blemishes are so visible that it does not look fake. When the camera moves, the reflections in the water move too, and that does not look fake. The people walking nearby each seem to carry a story of their own, and that does not look fake. Above all, her sunglasses reflect the street scene, and that does not look fake.
6. Before, we could not believe such things were real. Now, we cannot believe they are fake.
7. OpenAI knew you would think this, so it left a note on its website:
All videos on this page are generated directly by Sora and are unmodified.
8. Sora’s abilities go beyond text-to-video. It can also join two videos together with a seamless transition. Take this one, which moves seamlessly from a real countryside to a fantastical city.
9. There are many more videos like this, and Sora can do many other things. You have probably seen quite a few already, so I will not post them here. They are all over the internet.
10. More than these stunning videos, what I really want to know is what Sora’s arrival means for the overall development of artificial intelligence. So I tracked down the technical report on Sora that OpenAI published and read it carefully.
11. After reading it, I found the technical report even more astonishing than those incredible videos.
12. The report does not reveal many of Sora’s technical details, but it does describe the basic principles.
13. Put simply, Sora learns from videos to understand how the real world changes over time, and uses computer vision techniques to simulate those changes and so create new visual content. In other words, Sora is not just learning videos, or the frames and pixels in them. It is also learning the “physical laws” of the world inside the video.
14. It sounds abstract. Let me explain.
15. For example, if you take a bite of food, the food should show a bite mark. That is a “physical law.” If the food is still whole after the bite, that violates the “physical law.”
16. Most video software does not understand “physical laws.” What it processes is just the picture, not the food or the person in the picture. But Sora seems to understand. When Sora learns from videos of people biting food, it remembers not only the “specific picture” of food and mouth together, but also the “physical law” that a bite leaves a mark. Later, when it generates a video that contains a bite, Sora knows that a bite mark should appear next.
17. For example, the clip below.
18. Videos generated by Sora do not always get “a bite leaves a mark” right. It “sometimes” makes mistakes. But this is already impressive, even frightening, because “remember first, then predict” is the way humans understand the world. This approach has a name: the world model.
19. What is a world model? Let me give an example.
20. Your “memory” knows how much a cup of coffee weighs. So when you go to pick one up, your brain accurately “predicts” how much force to use, and the cup comes up smoothly without your even noticing. But what if the cup happens to be empty? You apply a lot of force to a very light cup, and your hand immediately feels that something is off. Your “memory” then adds an entry: the cup may also be empty. Next time, the “prediction” will not be wrong. The more you do, the more complex the world model in your brain becomes, and the more accurately it predicts how the world will respond. This is how humans interact with the world: through a world model.
21. If you are interested in world models, I recommend the book “A Thousand Brains.”
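The coffee-cup example in paragraph 20 can be sketched as a tiny "remember, then predict" loop. This is only a toy illustration of updating a memory from prediction error. It says nothing about how Sora or the brain is actually implemented, and the class name, the weights in grams, and the learning rate are all hypothetical.

```python
class CupModel:
    """Toy 'remember, then predict' loop for lifting a cup (illustrative only)."""

    def __init__(self, expected_weight, learning_rate=0.5):
        self.expected_weight = expected_weight  # the "memory", in grams
        self.learning_rate = learning_rate      # how strongly a surprise updates memory

    def predict(self):
        # The prediction is simply what memory currently expects.
        return self.expected_weight

    def update(self, actual_weight):
        # Compare the prediction with what the hand actually feels,
        # then move the memory part of the way toward the observation.
        error = actual_weight - self.predict()
        self.expected_weight += self.learning_rate * error
        return error


model = CupModel(expected_weight=300.0)   # hypothetical full cup
observed = [300.0, 100.0, 100.0, 100.0]   # the cup turns out to be empty from the second lift on
errors = [model.update(w) for w in observed]
print(errors)                 # the surprise is largest on the first empty cup, then shrinks
print(model.expected_weight)  # memory has drifted toward the empty-cup weight
```

The first lift matches memory, so there is no surprise. The first empty cup produces a large error, and each later lift produces a smaller one as the remembered weight adjusts.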
22. Back to Sora. Its technical report contains this sentence:
Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
The Chinese translation, rendered back into English, reads:
Our results show that scaling up video generation models is a promising path toward building general-purpose simulators of the physical world.
23. What does this mean? It means that what OpenAI ultimately wants to build is not a text-to-video tool but a general-purpose “simulator of the physical world.” That is, a world model: a model of the real world.
24. And Sora is just a milestone showing that this path is feasible.
25. If the laws of physics can begin to be learned from “videos,” could they be learned from “cameras” in the future? If so, what happens when you give AI a pair of “eyes” and let it roam the world? And if that works too, what happens when all the world’s public cameras are opened up to OpenAI?
26. Sora’s arrival may mean that artificial general intelligence (AGI) is arriving faster.
27. This is what OpenAI really wants to do.
28. So now you can understand why Sam Altman wants to raise $7 trillion to reshape the global AI chip infrastructure. $7 trillion is equivalent to 10% of global GDP and could buy 2.5 Microsofts, 4 Nvidias, or 11.5 Teslas. Why? Because the road to artificial general intelligence requires enormous, enormous amounts of computing power.
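The comparisons in paragraph 28 can be turned around to see what sizes they imply. The sketch below uses only the article's own figures (the $7 trillion total and the quoted multiples). It contains no market or GDP data, so the results are the values the comparison assumes, not measured ones.

```python
# All inputs are the article's own figures, in trillions of US dollars.
FUND = 7.0

# Multiples quoted in the article: $7T as a share of global GDP,
# and how many of each company $7T would buy.
article_claims = {
    "global GDP": 0.10,
    "Microsoft": 2.5,
    "Nvidia": 4.0,
    "Tesla": 11.5,
}

# Dividing the fund by each multiple gives the size the article implies.
implied = {name: FUND / multiple for name, multiple in article_claims.items()}

for name, value in implied.items():
    print(f"{name}: about ${value:.2f} trillion")
```

Comparing these implied sizes with published GDP and market-capitalization figures is a quick way to check how rough the article's comparison is.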
29. Sora is here. Can artificial general intelligence be far behind?
30. The world is changing in ways that are hard to imagine. It seems far away, and then suddenly it is right in front of you.
31. Finally, thanks to Sam Altman for choosing the sixth day of the Lunar New Year to announce this.
32. Otherwise, we would have spent the entire Spring Festival witnessing history.
This article was republished by 数字化转型网 (www.szhzxw.cn); source: Liu Run. Editing and translation: 宁檬树, 数字化转型网.

