Stanford AI Team Apologizes For Plagiarizing Chinese University's Model
AsianFin--"'Fake it before you make it' is an ignoble product of Silicon Valley," said Christopher Manning, director of the Artificial Intelligence Laboratory at Stanford University, in a post on X on Tuesday, commenting on researchers at the university who had plagiarized work by institutions including China's Tsinghua University.
On May 29, a research team at Stanford University released a large model called Llama3-V, claiming it could match the performance of models such as GPT-4V with a pre-training cost of only US$500. The news spread widely on social media and in the artificial intelligence research community.
However, industry insiders soon suspected that the Stanford team plagiarized the MiniCPM-Llama3-V 2.5 large model released by Tsinghua University and other Chinese institutions.
Both Llama3-V and MiniCPM-Llama3-V 2.5 are based on the open-source Llama3 model. However, the Tsinghua team conducted unique training, including using the "Tsinghua Bamboo Slips," a collection of Chinese texts written on bamboo strips dating back to the Warring States Period (475-221 B.C.), to train the model to recognize ancient Chinese characters.
Tests showed that the model released by the Stanford team could also recognize characters from the "Tsinghua Bamboo Slips."
"We are quite sure that the Stanford team has plagiarized our big model research results," said Liu Zhiyuan, a tenured associate professor of the Department of Computer Science at Tsinghua University.
"The data we scanned and annotated word by word from the 'Tsinghua Bamboo Slips' has never been made public, and Llama3-V has shown the same ability to identify the 'Tsinghua Bamboo Slips', even the error examples are the same," said Liu, who is also a member of the Tsinghua big model team.
As doubts mounted, the Stanford team deleted the model's database and promotional posts from the Internet, Liu said, adding that "from the evidence and their reactions, the nature of plagiarism has been relatively confirmed."
Following Manning's criticism, two members of the Stanford team, Aksh Garg and Siddharth Sharma, formally apologized on social media.
"We've taken all references to Llama3-V down and we apologize once again for the inconvenience we may have caused," they said.
"China's AI research has an increasing influence," Liu said, noting the plagiarism incident reflects that "our innovative achievements are attracting international attention."
There is still a significant gap between China's overall research level and that of the U.S., but in some specific areas such as AI innovation, China has rapidly grown into an important driving force, he added.
There is currently no clear definition of "copycatting" for large models. More than 80% of China's large models are built on Meta's Llama series and trained on data from domestic and foreign search engines and internet platforms, so the boundaries of intellectual property and legal rights remain relatively vague.
The primary purpose of building open-source large models and communities is to promote the sharing and exchange of technology, thereby accelerating the development of AI. Building on and combining open-source models is fundamentally acceptable; however, the way the Stanford team and some others went about it was less than clever.
An industry expert told TMTPost that there are two main considerations when using open-source projects. First, it's essential to credit the original author, specify which project is being used, and adhere to the open-source project's license and author statements without modification or deletion. Second, it's crucial to assess the project's suitability for commercialization. Some open-source projects explicitly permit commercial use, while others prohibit commercialization or secondary development, requiring developers to communicate and evaluate accordingly.
William Wong, Managing Director of Weizhi Capital, noted, "Many startups now connect to ChatGPT on the backend, design a UI for the frontend, and then claim to be in the AIGC business after launching on the Apple App Store." He believes these AIGC projects are merely riding the hype, without solid business logic or a technical foundation.
A gap has opened between the U.S. and Chinese AI industries. On Monday, during a speech by NVIDIA CEO Jensen Huang, a global deployment map of NVIDIA ACE, a suite of technologies that brings digital humans to life with generative AI, showed that China is no longer included among NVIDIA's planned infrastructure regions.
As a result, open source has emerged as the best, and possibly the only, channel for China and the U.S. to align and collaborate on AI technology. This approach calls for more patience and understanding from developers rather than criticism, which brings no benefit to the industry. The lesson of "learning from mistakes" should apply not only to Stanford but also to Chinese large model enterprises.