I’m pretty sure you have already heard of, or even used, Open Source AI models. In fact, as of today, there are more than 1 million models hosted on Hugging Face. But before talking about Open Source AI, I want to talk about Open Source Software.

Open Source Software has been around for over three decades, and by now, we’re all familiar with what it means in terms of licensing, access, and freedom. Open Source Software is defined by a few key principles: transparency, collaboration, community-driven development, and shared ownership. These characteristics ensure that users can inspect, modify, and distribute the code freely, fostering innovation, reducing costs, and promoting trust in software.

It is important to remember that there are organizations behind Open Source. These organizations help keep Open Source Software ethical, secure, and sustainable, while helping users navigate the complexities of licensing, attribution, and contribution guidelines.

So, how about AI? Like I mentioned before, if you browse Hugging Face you will see more than a million models, but are those models Open Source? What makes an AI model Open Source?

An Open Source AI model is defined by principles similar to those of Open Source Software, but with additional considerations specific to AI, such as:

  • Training methodology: This includes the algorithms, hyperparameters, frameworks, and optimization techniques used during training.
  • Training data provenance: This covers the sources, licenses, curation processes, and ethical considerations (bias, fairness, and data privacy) of the training data.
  • Documentation: Comprehensive documentation should accompany the model, including access to source code, pre-trained weights, and step-by-step instructions for reproducibility.
  • Ethical and legal compliance: The model must be released under an Open Source license such as MIT or Apache 2.0, and must address ethical concerns such as bias audits, data privacy, and transparency in model behavior.
  • Community and governance: Clear governance structures, maintainer roles, and collaboration processes (contribution guidelines, version control) should be established to ensure ongoing development and accountability.

In Open Source AI, the source code must be freely accessible and distributed under an Open Source license such as MIT, Apache 2.0, or similar. These models also require full disclosure of the training methodology, the origins and composition of the training data, and the ethical considerations behind them.

This is not just my own wish list. In October 2024 the Open Source Initiative published the Open Source AI Definition (OSAID), the first official attempt to say what “Open Source” actually means for an AI model. One detail worth calling out: OSAID does not demand that you release the entire training dataset. It asks for enough information about the data, its sources, and how it was processed that someone could recreate a similar model. That single choice is what makes the messy reality below possible.

Sounds great, right? Well, despite these definitions and requirements, the reality of Open Source AI today is different. Many models share their pre-trained weights, but the full source code is often not available. Training data is rarely disclosed, even when a model is labeled Open Source. This could be due to a combination of licensing constraints, legal risks (data privacy laws like GDPR), and ethical concerns like protecting sensitive or personal information in the data. Also, these models are trained on enormous datasets, terabytes or even petabytes of data, and as you can imagine, sharing datasets of that size is not practical.

One more thing to consider is that training these models isn’t simple. We are talking about tons of GPUs, which are super expensive and hard to come by, and about training runs that take a very long time. For big companies this is less of a problem, because they have the budget and the infrastructure. But for smaller teams, universities, or Open Source collaborations, it can be a real challenge.
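To make the gap between "Open Source" and "weights only" a bit more concrete, here is a minimal Python sketch. Everything in it is my own illustration: the `ModelRelease` class, its fields, the `classify` function, and the three example releases are hypothetical, and the rule is a rough simplification of the ideas above, not the official OSAID test.

```python
from dataclasses import dataclass


@dataclass
class ModelRelease:
    """What a (hypothetical) model release actually ships."""
    name: str
    weights: bool           # pre-trained weights can be downloaded
    training_code: bool     # the code used to train and run the model is published
    data_information: bool  # sources and processing of the training data are described
    open_license: bool      # released under an Open Source license (e.g. MIT, Apache 2.0)


def classify(release: ModelRelease) -> str:
    """Return a rough label for a release (a simplification, not the OSAID checklist)."""
    if not release.weights:
        return "closed"
    if release.training_code and release.data_information and release.open_license:
        return "open source"
    return "open weights"


# Made-up releases, only to show the three possible outcomes.
releases = [
    ModelRelease("everything-published", True, True, True, True),
    ModelRelease("weights-only", True, False, False, True),
    ModelRelease("api-only", False, False, False, False),
]

for release in releases:
    print(f"{release.name}: {classify(release)}")
```

A lot of what you can download today would land in the "open weights" bucket: you get the weights, but not the full picture of how the model was built.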

So, while the idea of Open Source AI sounds great, the reality is that people are stuck balancing transparency with the messy, real-world stuff: legal risks, data ethics, and the sheer size of the datasets. It’s tricky, but it’s also a chance to rethink how we do AI responsibly without sacrificing progress.

Open Source AI isn’t dead, and it’s definitely not bad. In fact, it’s out there, and it’s doing some seriously cool stuff. You can download models, run them on your local machine or a dedicated server, fine-tune them, or even use them with your own private data (like with RAG). It’s all about customization, and that’s where the magic happens.

Projects like Open WebUI, Hugging Face, Ollama, and others are making this possible. They act as the ‘bridge’ between the big, fancy AI models (that require mountains of GPUs and cash) and the rest of us. These tools let you pick and choose models that actually fit your needs, whether you’re a student, a small team, or a nonprofit.
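As a small taste of what that looks like in practice, here is a sketch of talking to a model served by Ollama from Python, using only the standard library. It assumes Ollama is running locally on its default port (11434) and that you have already pulled a model. The helper names `build_request` and `ask_local_model` are mine, and the model name in the usage note is only an example.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build the POST request for Ollama's /api/generate endpoint."""
    # "stream": False asks for one JSON object instead of a stream of chunks.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_model(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]
```

With the server running and a model pulled (for example `ollama pull llama3.2`, assuming that is the model you want), calling `ask_local_model("llama3.2", "What is Open Source AI?")` returns the answer as a string. Nothing leaves your machine, which is exactly the point.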

Check out my previous post about Running your local LLM on MacOS using Docker and Ollama.

If you want to know more about Open Source and Open Source AI:

That’s all for this post. I’m planning to keep writing about AI, so stay tuned.