“大厂垄断大模型”，会被开源终结吗？

谷歌和OpenAI在AI方面的种种积累，最终真的会败给一群隐藏在民间的“草头侠”吗？

最近，正在进行AI大战的各个大厂，被谷歌泄漏的一份内部文件，翻开了窘迫的一面。

这份泄露的内部文件声称：“我们没有‘护城河’，OpenAI 也没有。当我们还在争吵时，第三个方已经悄悄地抢了我们的饭碗——开源。”

这份文件认为，现在的一些开源模型，一直在照搬谷歌、微软这些大厂的劳动成果，并且双方差距正在以惊人的速度缩小。开源模型更快、可定制性更强、更私密，而且功能性也不落下风。

比如，这些开源模型可以用 100 美元外加 13B 参数，加上几个礼拜的时间就能出炉，而谷歌这样的大厂，要想训练大模型，则需要面对千万美元的成本和 540B 参数，以及长达数月的训练周期。

那么，事实是否真的像这份文件所说的那样，谷歌和OpenAI在AI方面的种种积累，最终真的会败给一群隐藏在民间的“草头侠”？

所谓“大厂垄断大模型”的时代，真的要终结了吗？

要回答这个问题，我们就得先了解下目前开源模型的生态，看看这些如雨后春笋般涌现的开源模型，究竟是如何一步步蚕食谷歌这些“正规军”的江山的。

一、异军突起的开源模型

其实，最早的开源模型，其诞生完全是一场“偶然”。

今年2月，Meta发布了自家的大型语言模型LLaMA，参数量从70亿到650亿不等，并仅用130亿的参数，就在大多数基准测试下超越了GPT-3。

但万万没想到的是，刚发布没几天，LLaMA的模型文件就被泄露了。

至此之后，开源模型的浪潮就如决堤一般，变得一发不可收拾。

如八仙过海一般的ChatGPT开源替代品——「羊驼家族」，随即粉墨登场。

与ChatGPT这类大模型相比，此类开源模型最显著的特点，就是训练成本与时间都极其低廉。

以LlaMA的衍生模型Alpaca为例，其训练成本仅用了52k数据和600美元。

然而，如果开源光靠低成本，还不足以让谷歌这类大厂感到威胁，重要的是，在极低的训练成本下，这些开源模型还能屡次达到和GPT-3.5匹敌的性能。

这下谷歌和OpenAI就坐不住了。

斯坦福研究者对GPT-3.5（text-davinci-003）和Alpaca 7B进行了比较，发现这两个模型的性能非常相似。Alpaca在与GPT-3.5的比较中，获胜次数为90对89。

重点来了：这些开源模型，究竟是怎么做到这点的？

斯坦福团队的答案是两点：1、一个强大的预训练语言模型；2、一个高质量的指令遵循数据。

在这里，我们将强大的预训练语言模型（如LlaMA或GPT-3），比喻为一位有着丰富知识和经验的老师。

对于自然语言处理领域的任务，强大的预训练语言模型，可以利用大规模的文本数据进行训练，学习到自然语言的模式和规律，并且可以帮助指令遵循等任务的模型更好地理解和生成文本，提高模型的表达和理解能力。

这就相当于学生使用老师的知识和经验，来提高语言能力，指令遵循等任务的模型可以使用预训练语言模型的知识和经验来提高自己的表现。

除了借助这位“老师”的知识外，开源模型的另一“利刃”，就是指令微调。

指令微调，或指令调优，是指现有的大语言模型生成指令遵循数据后，对数据进行优化的过程。

具体来说，指令微调是指在生成的指令数据中，对一些不合适或错误的指令进行修正，使其更符合实际应用场景。

而指令调优是指在生成的指令数据中，对一些重要、复杂或容易出错的指令进行加重或重复，以提高指令遵循模型对这些指令的理解和表现能力。

凭借着这样的“微调”，人们可以生成更准确、更有针对性的指令遵循数据，从而提高开源模型在特定任务上的表现能力。

如此一来，即使只用很少的数据，开源社区也能训练出性能匹敌ChatGPT的新模型。

然而，又一个问题是：面对自己辛苦打下的江山，被开源社区用“四两拨千斤”的方式步步蚕食，谷歌和OpenAI为何一直没有予以反制呢？

哪怕是如法炮制，以毒攻毒，推出同样快速迭代的小模型，也不失为一种破局之策啊。

二、骑虎难下

实际上，谷歌这样的头部企业，不是没有意识到开源的优势。

在那份泄漏的文件中，谷歌就提到：几乎任何人都能按照自己的想法实现模型微调，到时候一天之内的训练周期将成为常态。以这样的速度，微调的累积效应将很快帮助小模型克服体量上的劣势。

可问题是，身为AI领域巨头的谷歌和OpenAI，既不能，也不愿完全放弃训练成本高昂的大参数模型。

从某种程度上说，这是其保证自身优势地位的必要手段。

作为AI领域的巨头，谷歌和OpenAI需要不断提升自己的技术实力和创新能力。而传统的大参数训练模型，则是提供这一探索和创新的必经之路。

因为大模型的底层技术若想取得突破，AI领域的研究者和科学家，就需要更深入地理解模型和算法的基本原理，探索AI技术的局限性和发展方向，这需要进行大量的理论研究、实验验证和数据探索，而不仅仅是微调和优化。

例如，在训练大参数模型时，AI领域的科学家，可以探索模型的泛化能力和鲁棒性，在不同的数据集和场景下评估模型的性能和效果。谷歌的BERT模型，也正是在此过程中得到了不断强化。

同时，大参数模型的训练，还可以帮助科学家探索模型的可解释性和可视化，

例如，对今天的GPT来说至关重要的Transformer模型，虽然在性能上表现出色，但其内部结构和工作原理却相对复杂，不利于理解和解释。

通过大参数模型的训练，人们可以可视化Transformer模型的内部结构和特征，从而更好地理解模型是如何对输入进行编码和处理的，并进一步提高模型的性能和应用效果。

因此，开源和微调的方式，虽然可以促进AI技术的快速发展和优化，但不足以替代对AI基础问题的深入研究和探索。

但话说到这，一个十分尖锐的矛盾又摆了出来：一方面，谷歌和OpenAI不能放弃对大参数模型的研究，并坚持对其技术进行保密。但另一方面，免费、高质量的开源替代品，又让谷歌等大厂的“烧钱”策略难以为继。

因大模型耗费的巨大算力资源和数据，仅是在 2022 年，OpenAI 总计花费就达到了 5.4 亿美元，与之形成鲜明对比的，则是其产生的收入只有 2800 万美元。

与此同时，开源社区的具有的灵活性上的优势，也让谷歌等大厂感到难以匹敌。

在那份泄漏的文件中，谷歌就认为：开源阵营真正的优势在于“个人行为”。

相较于谷歌这些大厂，开源社区的参与者可以自由地探索和研究技术，不受任何限制和压力，从而有更多机会发现新的技术方向和应用场景。

而谷歌研究和开发新技术时，则必须考虑产品的商业可行性和市场竞争力。这就对人才的研究方向产生了一定的限制和约束。

此外，由于保密协议的存在，谷歌的人才也难以像开源社区那样，与外界充分地交流和分享技术研究的成果。如果说，低价、灵活的开源模型，终将成为一种不可阻挡的趋势，那么当谷歌等大厂面对这浩瀚的战场时，又该怎样在新时代生存下去呢？

三、另辟蹊径

倘若谷歌这样的头部企业，最终在开源阵营的攻势下，选择了“打不过就加入”的策略，那如何在开源的情况下，找到一条可行的商业路径，就成了一件头等大事。

毕竟，在目前的市场认知下，开源几乎就等于“人人皆可免费使用。”

之前，Stable Diffusion背后的明星公司——Stability AI，就因为在开源后，没有找到明确的盈利途径，目前正面临严重的财政危机，以至于到了快倒闭的地步。

不过，关于如何在开源的情况下实现盈利，业界也不是完全没有先例可循。

例如，之前谷歌对Android系统的开源，就是一个经典的案例。

当年，由谷歌主导开发和推广的Android系统开源后，谷歌仍然通过各种途径，从Android操作系统的设备制造商那里获取了收益。

具体来说，这些途径可分为以下几种：1.收取授权费用：当设备制造商希望在其设备上预装Google Play商店等谷歌应用和服务时，他们需要遵守谷歌的授权协议，并支付相应的授权费用。
2.推出定制设备：谷歌通过与设备制造商合作，推出一些定制的Android设备，如Google Pixel智能手机和Google Nexus平板电脑等，并从中获得收入。这些定制设备通常具有更高的价值和更好的性能，而且会预装谷歌的应用和服务。
3.销售应用：当设备使用者在Google Play商店中购买应用、游戏或媒体内容时，谷歌会从中提取一定的佣金。

虽然这些途径的收益，也许并不像谷歌的主业——搜索和广告那样让其赚得盆满钵满，但谷歌仍然从中获得了各种“隐性收益”。

因为Android 的存在，避免了某一家企业垄断移动平台的入口，只要互联网是开放的，谷歌就能通过吸引更多人使用Android上的应用，来收集用户的行为数据，对这些数据进行加工，从而使得广告投放可以更加精准。

由此可见，开源模式并非与商业化的盈利模式完全冲突，这对于谷歌和开源社区的参与者而言，都是一种好事。

因为只有通过商业化途径，源源不断地为自身“造血”，谷歌和OpenAI等大厂，才能继续承担起训练大参数模型所需的巨额成本。

而只有大参数模型的持续研发，各大开源社区，才能继续以高性能、高质量的预训练语言模型为基础，微调出种类更多，应用场景更为丰富的开源模型。

基于这样的关系，开源模型与封闭的大模型之间，其实不仅仅只是对立与竞争，同时也是一种互助共生的生态。

翻译：

Will the advantages that Google and OpenAI have built up in AI ultimately lose out to a band of grassroots outsiders?

A recently leaked internal Google document has exposed an awkward reality for the big companies now fighting the AI wars.

The leaked document claims: “We have no moat, and neither does OpenAI. While we’ve been squabbling, a third faction has been quietly eating our lunch: open source.”

The document argues that some of today’s open source models have been copying the work of giants such as Google and Microsoft, and that the gap between the two sides is closing at an astonishing speed. Open source models are faster, more customizable, more private, and no less capable.

For example, an open source model with 13B parameters can be produced for about $100 in a few weeks, whereas a company like Google that wants to train a large model faces costs in the tens of millions of dollars, 540B parameters, and training cycles lasting months.

So is the document right? Will the advantages that Google and OpenAI have built up in AI really be lost to a band of grassroots outsiders?

Is the era of big tech monopolizing large models really coming to an end?

To answer that question, we first need to understand the current open source model ecosystem and see how these rapidly multiplying models are, step by step, eating into the territory of “regular armies” like Google.

1. The sudden rise of open source models

In fact, the first of these open source models came about entirely by accident.

In February this year, Meta released its own large language model, LLaMA, with parameter counts ranging from 7 billion to 65 billion. The 13-billion-parameter version outperformed GPT-3 on most benchmarks.

But unexpectedly, just days after the release, LLaMA’s model files were leaked.

From then on, the wave of open source models surged like a burst dam, impossible to contain.

A varied crop of open source ChatGPT alternatives, the “alpaca family,” promptly took the stage.

Compared with large models such as ChatGPT, the most striking feature of these open source models is how little money and time they take to train.

Alpaca, a model derived from LLaMA, is one example: it was trained on just 52K examples at a cost of about $600.

Low cost alone, however, would not be enough to make the likes of Google feel threatened. What matters is that, despite such low training costs, these open source models have repeatedly achieved performance comparable to GPT-3.5.

That is what has unsettled Google and OpenAI.

Stanford researchers compared GPT-3.5 (text-davinci-003) with Alpaca 7B and found that the two models performed very similarly: in head-to-head comparisons, Alpaca won 90 times to GPT-3.5’s 89.
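
To put the 90-to-89 result in perspective, here is a minimal Python sketch of the arithmetic. It assumes the 179 head-to-head comparisons are independent and that ties are excluded, neither of which the article states. The variable names are illustrative.

```python
from math import comb

alpaca_wins = 90   # figures quoted in the article
gpt35_wins = 89
n = alpaca_wins + gpt35_wins

# Share of comparisons won by Alpaca.
share = alpaca_wins / n
print(f"Alpaca win share: {share:.1%}")

# Exact one-sided binomial tail under a fair coin (assumption: if the two
# models were equally good, each comparison would be an independent 50/50 draw).
tail = sum(comb(n, k) for k in range(alpaca_wins, n + 1))
p_at_least_90 = tail / 2**n
print(f"P(at least {alpaca_wins} wins by chance): {p_at_least_90:.2f}")
```

Under those assumptions, two equally strong models would produce a split at least this favorable to Alpaca about half the time, so the result is better read as a tie than as a win.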

So here is the key question: how exactly do these open source models manage this?

The Stanford team’s answer has two parts: (1) a strong pretrained language model, and (2) high-quality instruction-following data.

Here, we can think of a strong pretrained language model, such as LLaMA or GPT-3, as a teacher with a wealth of knowledge and experience.

In natural language processing, a strong pretrained language model is trained on large-scale text data and learns the patterns and regularities of natural language. It can then help models built for tasks such as instruction following to understand and generate text better, improving their expressive and comprehension abilities.

This is like a student drawing on a teacher’s knowledge and experience to improve their language skills: a model for a task such as instruction following can draw on the knowledge and experience of the pretrained language model to improve its own performance.

Besides borrowing this teacher’s knowledge, the other sharp weapon of open source models is instruction fine-tuning.

Instruction fine-tuning, also called instruction tuning, means further training a pretrained model on instruction-following data. For models like Alpaca, that data is generated by an existing large language model and then refined before it is used for training.

Specifically, the refinement involves correcting inappropriate or erroneous instructions in the generated data so that they better match real application scenarios.

It can also involve emphasizing or repeating instructions that are important, complex, or error-prone, so that the instruction-following model understands and handles them better.

With this kind of refinement, more accurate and more targeted instruction-following data can be produced, which improves an open source model’s performance on specific tasks.
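
The data side of this process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `instruction` / `input` / `output` record layout follows the format of the publicly released Alpaca dataset, while the `Record` class, the `is_usable` filter, and the exact prompt template are hypothetical stand-ins, not the Stanford team’s code.

```python
from dataclasses import dataclass


@dataclass
class Record:
    instruction: str   # what the model is asked to do
    input: str         # optional context; may be empty
    output: str        # the desired answer


def is_usable(rec: Record) -> bool:
    """Hypothetical cleanup rule: drop records missing an instruction or an answer."""
    return bool(rec.instruction.strip()) and bool(rec.output.strip())


def to_training_pair(rec: Record) -> tuple[str, str]:
    """Turn one record into (prompt, target) text for supervised fine-tuning."""
    prompt = f"### Instruction:\n{rec.instruction.strip()}\n\n"
    if rec.input.strip():
        prompt += f"### Input:\n{rec.input.strip()}\n\n"
    prompt += "### Response:\n"
    return prompt, rec.output.strip()


records = [
    Record("Translate to French.", "Good morning", "Bonjour"),
    Record("", "", "orphan answer"),        # no instruction: filtered out
    Record("Name a prime number.", "", "7"),
]

pairs = [to_training_pair(r) for r in records if is_usable(r)]
for prompt, target in pairs:
    print(prompt + target)
    print("---")
```

During fine-tuning, the pretrained model would be trained to produce each `target` given its `prompt`. The training loop itself is omitted here because it depends on the framework used.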

In this way, the open source community can train new models whose performance rivals ChatGPT’s, even with very little data.

This raises another question: watching the open source community chip away at their hard-won territory with so little effort, why have Google and OpenAI not fought back?

Even simply following suit, fighting fire with fire by releasing small, rapidly iterated models of their own, would be one way to break the deadlock.

2. Riding a tiger, unable to dismount

In fact, leading companies like Google are not unaware of the advantages of open source.

In the leaked document, Google noted that almost anyone can now fine-tune a model however they like, and that training cycles of a single day will become the norm. At that pace, the cumulative effect of fine-tuning will quickly help small models overcome their disadvantage in size.

The problem is that Google and OpenAI, as giants of the AI field, neither can nor want to completely abandon large-parameter models, costly as they are to train.

To some extent, these models are a necessary means of preserving their dominant position.

As giants in the field, Google and OpenAI need to keep improving their technical strength and capacity for innovation, and training conventional large-parameter models is a necessary path for that exploration and innovation.

For the underlying technology of large models to achieve breakthroughs, AI researchers and scientists need a deeper understanding of the basic principles of models and algorithms, and need to explore the limitations and future directions of AI technology. That takes extensive theoretical research, experimental validation, and data exploration, not just fine-tuning and optimization.

For example, when training large-parameter models, AI scientists can study a model’s generalization and robustness and evaluate its performance across different datasets and scenarios. Google’s BERT model was steadily strengthened through this kind of process.

At the same time, training large-parameter models can help scientists explore model interpretability and visualization.

For example, the Transformer architecture, which is essential to today’s GPT models, performs extremely well, but its internal structure and workings are relatively complex and hard to understand and explain.

By training large-parameter models, researchers can visualize the internal structure and features of Transformer models, better understand how the model encodes and processes its input, and further improve its performance and usefulness in applications.

So although open source and fine-tuning can drive rapid development and optimization of AI technology, they cannot replace in-depth research into AI’s fundamental problems.

But this exposes a sharp contradiction. On the one hand, Google and OpenAI cannot give up research on large-parameter models, and they insist on keeping that technology secret. On the other hand, free, high-quality open source alternatives make the big companies’ cash-burning strategy hard to sustain.

Because large models consume enormous computing resources and data, OpenAI spent a total of $540 million in 2022 alone, in stark contrast to revenue of just $28 million.
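
The scale of that gap is easy to check from the two figures quoted above. A small Python sketch follows; the variable names are illustrative, and the figures are the article’s and have not been independently verified here.

```python
spend_usd = 540_000_000     # OpenAI's 2022 spending, as quoted
revenue_usd = 28_000_000    # OpenAI's 2022 revenue, as quoted

ratio = spend_usd / revenue_usd
shortfall_usd = spend_usd - revenue_usd

print(f"Spending was {ratio:.1f}x revenue")
print(f"Shortfall: ${shortfall_usd / 1e6:.0f} million")
```

On these numbers, spending ran at roughly nineteen times revenue, leaving a shortfall of about $512 million for the year.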

At the same time, the flexibility of the open source community is something Google and other big companies find hard to match.

In the leaked document, Google argues that the real advantage of the open source camp lies in “individual behavior.”

Compared with big companies like Google, participants in the open source community can explore and research technology freely, without restrictions or pressure, and so have more chances to discover new technical directions and application scenarios.

Google, by contrast, must weigh commercial feasibility and market competitiveness when researching and developing new technology, which constrains the research directions its people can pursue.

In addition, because of confidentiality agreements, Google’s researchers cannot communicate and share their results with the outside world as freely as the open source community does.

If cheap, flexible open source models do become an unstoppable trend, how are Google and the other big companies to survive in this new era?

3. Finding another path

If leading companies like Google eventually respond to the open source offensive with an “if you can’t beat them, join them” strategy, then finding a viable business model under open source becomes the top priority.

After all, in the current market perception, open source is almost synonymous with “free for everyone.”

Stability AI, the star company behind Stable Diffusion, found no clear path to profitability after open-sourcing its model. It is now facing a serious financial crisis, to the point of being close to shutting down.

Still, the industry is not entirely without precedent for making money from open source.

Google’s open-sourcing of Android is a classic example.

After the Android system, whose development and promotion Google led, was open-sourced, Google still earned revenue from Android device makers in various ways.

Specifically, these fall into the following categories:

1. Licensing fees: When device manufacturers want to pre-install Google apps and services such as the Google Play Store on their devices, they must comply with Google’s licensing agreements and pay the corresponding licensing fees.
2. Custom devices: Google partners with device manufacturers to launch custom Android devices, such as the Google Pixel smartphone and the Google Nexus tablet, and earns revenue from them. These devices typically offer higher value and better performance, and come preloaded with Google apps and services.
3. App sales: When device users buy apps, games, or media content in the Google Play Store, Google takes a commission.

While these channels may not be as lucrative as Google’s core business of search and advertising, Google still gains a variety of “hidden benefits” from them.

Android’s existence prevents any single company from monopolizing the gateway to mobile platforms. As long as the internet stays open, Google can draw more people to apps on Android, collect data on user behavior, and process that data to make ad targeting more precise.

Open source, then, is not fundamentally at odds with a commercial, profit-making model, and that is good news for both Google and the participants in the open source community.

Only by generating their own steady stream of revenue through commercial channels can big companies such as Google and OpenAI keep bearing the enormous cost of training large-parameter models.

And only with continued development of large-parameter models can open source communities keep fine-tuning a wider variety of open source models, for a richer range of applications, on top of high-performance, high-quality pretrained language models.

Given this relationship, open source models and closed large models are not simply opponents and competitors. They also form a mutually supportive, symbiotic ecosystem.

本文由数字化转型网（www.szhzxw.cn）转载而成，来源于AI新智能；编辑/翻译：数字化转型网宁檬树。
