The rise of open-source AI systems has been hailed as a milestone in the democratization of technology, one that fosters innovation and accelerates scientific progress. Yet a closer look reveals that many so-called open-source AI models are not as open as they claim to be. This phenomenon is known as “open-washing.” A research paper on these findings was published in the proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency.
Understanding True Open Source
When the source code of a piece of software is freely available for anyone to use, modify, and distribute, it is known as “open source.” In the context of AI, this notion extends to the release of model architectures, training data, and the algorithms used for training. However, many AI models marketed as open-source fall short of these standards. They often release only the model weights or provide limited access under restrictive licenses, leaving out crucial elements like training data and detailed documentation.
Complexity of Modern AI Systems
Modern AI systems, especially those based on deep learning, are inherently complex. They require massive datasets, significant computational resources, and intricate tuning processes. This intricacy poses challenges to achieving full transparency. For instance, disclosing the entire dataset used for training might not be feasible due to privacy concerns or sheer size. Similarly, the compute power required to reproduce some models can be prohibitive for many users.
Partial Openness in Prominent Models
The challenges of true openness are exemplified by OpenAI’s GPT-3 and GPT-4 models. While OpenAI has released open weights for some of its models, the complete training data and detailed documentation of the training processes behind GPT-3 and GPT-4 remain undisclosed. This partial openness limits the ability of researchers and developers to fully understand, reproduce, and improve upon these models.
Regulatory Implications and Open-Washing
The upcoming EU AI Act introduces regulations that treat open-source AI systems differently, potentially creating incentives for companies to exploit the term “open source” to evade stricter scrutiny. According to a study by Liesenfeld and Dingemanse, many models that claim to be open-source are in fact “open-weight” at best, offering only the final trained model weights without the accompanying transparency in data and processes.
Proposing an Evidence-Based Framework
To clarify the ambiguities surrounding open-source AI, Liesenfeld and Dingemanse propose an evidence-based framework that assesses AI models across 14 dimensions of openness. These dimensions include the availability of training data, transparency of scientific and technical documentation, licensing, and access methods. Their survey of over 45 generative AI systems found that the term “open source” is frequently misused, with many models falling short of full openness criteria.
Open-washing not only misleads users and regulators but also undermines the principles of the open-source movement. By selectively disclosing only parts of their systems, companies can gain the reputational benefits of openness without the associated responsibilities. Such practices can breed mistrust within the AI community and stifle genuine innovation by hindering meaningful scrutiny and collaboration.
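To make the idea of assessing openness dimension by dimension more concrete, here is a minimal Python sketch. It is not the authors' implementation: the three-level scale, the averaging rule, the label thresholds, and the example model are all assumptions made for illustration, and only the four dimensions named in the text are included rather than the full set of 14.

```python
from dataclasses import dataclass
from enum import Enum


class Openness(Enum):
    """Hypothetical three-level scale; the grading scheme is an assumption."""
    CLOSED = 0.0
    PARTIAL = 0.5
    OPEN = 1.0


# Illustrative subset: only the four dimensions named in the text.
# The framework itself covers 14; the rest are not listed here.
DIMENSIONS = ("training_data", "documentation", "licensing", "access_methods")


@dataclass
class Assessment:
    model_name: str
    ratings: dict  # maps dimension name -> Openness

    def score(self) -> float:
        """Mean openness across all dimensions; unrated ones count as closed."""
        total = sum(self.ratings.get(d, Openness.CLOSED).value for d in DIMENSIONS)
        return total / len(DIMENSIONS)

    def label(self) -> str:
        """Coarse label: 'open' only if every dimension is fully open."""
        if all(self.ratings.get(d) is Openness.OPEN for d in DIMENSIONS):
            return "open"
        if self.score() > 0:
            return "partially open"
        return "closed"


example = Assessment(
    model_name="example-model",  # hypothetical model, not a real system
    ratings={
        "licensing": Openness.PARTIAL,
        "access_methods": Openness.OPEN,
    },
)
print(example.model_name, example.score(), example.label())
```

In this sketch, a model that is accessible but says nothing about its training data or documentation ends up labeled "partially open" rather than "open", which mirrors the distinction between open-weight and fully open releases. Treating unrated dimensions as closed is a deliberate choice here: it puts the burden of evidence on the party making the openness claim.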
Fostering Trust and Innovation
Achieving genuine openness in AI requires more than just releasing model weights or source code. It involves a commitment to full transparency, including detailed documentation of data collection and processing, training methodologies, and model evaluation. This level of openness enables researchers to reproduce results, identify biases, and improve upon existing models.
Recommendations for Genuine Openness
To foster true openness in AI, the following measures are recommended:
- Comprehensive Documentation: AI developers should provide detailed documentation covering all aspects of the model development process, including data collection, preprocessing, training, and evaluation.
- Transparent Data Practices: Where possible, training datasets should be made available or, at a minimum, thoroughly described, including any preprocessing steps and data augmentation techniques used.
- Open Access to Models and Code: Models should be released with clear, permissive licenses that allow for modification and redistribution. Source code for training and inference should also be provided.
- Independent Audits: Independent third parties should be encouraged to audit AI models to verify claims of openness and transparency.
- Community Engagement: Developers should actively engage with the open-source community, seeking feedback and contributions to improve the models.
The notion of open-source AI holds great promise for advancing technology and fostering innovation. However, the current state of open-source AI is fraught with inconsistencies and misleading claims. By adopting more rigorous standards for transparency and openness, the AI community can ensure that the benefits of open source truly extend to all stakeholders and support a more collaborative and innovative future.