Open Source AI

I’m pretty sure you have already heard of, or even used, Open Source AI models. In fact, as of today, there are more than 1 million models hosted on Hugging Face. But before talking about Open Source AI, I want to talk about Open Source Software.

Open Source Software has been around for over three decades, and by now, we’re all familiar with what it means in terms of licensing, access, and freedom. Open Source Software is defined by a few key principles: transparency, collaboration, community-driven development, and shared ownership. These characteristics ensure that users can inspect, modify, and distribute the code freely, fostering innovation, reducing costs, and promoting trust in software.

It is important to remember that there are organizations behind Open Source. These organizations help ensure that Open Source Software remains ethical, secure, and sustainable, while helping users navigate the complexities of licensing, attribution, and contribution guidelines.

So, how about AI? Like I mentioned before, if you go to Hugging Face you will see millions of models, but are those models Open Source? What defines an AI model as Open Source?

An Open Source AI model is defined by principles similar to those of Open Source Software, but with additional considerations specific to AI, such as:

Training methodology: This includes the algorithms, hyperparameters, frameworks, and optimization techniques used during training.
Training data provenance: This covers the sources, licenses, curation processes, and ethical considerations (bias, fairness, and data privacy) of the training data.
Documentation: Comprehensive documentation should accompany the model, including access to source code, pre-trained weights, and step-by-step instructions for reproducibility.
Ethical and legal compliance: The model must be released under an Open Source license such as MIT or Apache, and it should address ethical concerns through bias audits, data privacy safeguards, and transparency in model behavior.
Community and governance: Clear governance structures, maintainer roles, and collaboration processes (contribution guidelines, version control) should be established to ensure ongoing development and accountability.

In Open Source AI, the source code must be freely accessible and distributed under permissive Open Source Licenses such as MIT, Apache, or similar. Also, these models require full disclosure of the training methodology, the origins and composition of the training data, and ethical considerations.
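
To make these requirements a bit more concrete, here is a minimal sketch of how you could turn them into a checklist. Everything in it is an assumption for illustration: the ModelRelease class, its field names, the openness_gaps function, and the short list of permissive licenses are hypothetical and not part of any official definition or tool.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical, deliberately short list of permissive license identifiers.
PERMISSIVE_LICENSES = {"mit", "apache-2.0"}


@dataclass
class ModelRelease:
    """Illustrative description of what a model release makes available."""
    name: str
    license: str
    weights_available: bool
    source_code_available: bool
    training_method_documented: bool
    training_data_disclosed: bool
    governance_documented: bool


def openness_gaps(release: ModelRelease) -> list[str]:
    """Return the criteria from the list above that the release does not meet."""
    checks = {
        "permissive license": release.license.lower() in PERMISSIVE_LICENSES,
        "pre-trained weights": release.weights_available,
        "source code": release.source_code_available,
        "training methodology": release.training_method_documented,
        "training data provenance": release.training_data_disclosed,
        "community and governance": release.governance_documented,
    }
    return [label for label, ok in checks.items() if not ok]


# A made-up release that shares weights but not code or training data.
example = ModelRelease(
    name="example-model",
    license="Apache-2.0",
    weights_available=True,
    source_code_available=False,
    training_method_documented=True,
    training_data_disclosed=False,
    governance_documented=True,
)

print(openness_gaps(example))
```

An empty list would mean the release ticks every box on this simplified checklist. A real assessment needs human review of the license text and documentation, so treat this as a way to organize the questions, not to answer them.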

Sounds great, right? Well, despite these definitions and requirements, the reality of Open Source AI today is different. Many models share access to pre-trained weights, but the full source code is often not available. Training data is rarely disclosed, even when models are labeled Open Source. This could be due to a combination of licensing constraints, legal risks (data privacy laws like GDPR), and ethical concerns like protecting sensitive or personal information in the data. Also, those models can require petabytes of data, and as you can imagine, sharing such massive datasets is not practical. One more thing to consider is that training these models isn’t simple. We are talking about tons of GPUs, which are super expensive and hard to come by, and training runs that take forever. For big companies, this is less of a problem because they have the budget and the infrastructure. But for smaller teams, universities, or Open Source collaborations, this can be challenging.

So, while the idea of Open Source AI sounds great, the reality is that people are stuck balancing transparency with the messy, real-world stuff: legal risks, data ethics, and the size of the datasets. It’s tricky, but it’s also a chance to rethink how we do AI responsibly without sacrificing progress.

Open Source AI isn’t dead, and it’s definitely not bad. In fact, it’s out there, and it’s doing some seriously cool stuff. You can download models, run them on your local machine or a dedicated server, fine-tune them, or even use them with your own private data (like with RAG). It’s all about customization, and that’s where the magic happens.
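
If RAG is new to you, the idea is to look up the most relevant piece of your own data first and then hand it to the model together with the question. Below is a toy sketch of just that retrieval step. It makes some big assumptions: it ranks documents by simple word overlap (real setups normally use embeddings and a vector store), the notes are made up, and the helper names (tokenize, cosine, retrieve, build_prompt) are hypothetical. Sending the prompt to a local model is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, k=1):
    """Return the k documents whose words best match the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda d: cosine(q, Counter(tokenize(d))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Put the retrieved context in front of the question."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Made-up private notes standing in for your own data.
notes = [
    "The office VPN certificate expires every 90 days.",
    "Lunch is served in the cafeteria from noon to 2 pm.",
    "Backups of the file server run every night at 1 am.",
]

prompt = build_prompt("When do the backups run?", notes)
print(prompt)
```

The resulting prompt is what you would pass to whichever local model you are running. The private notes never leave your machine, which is the main appeal of doing this with a locally hosted model.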

Projects like Open WebUI, Hugging Face, Ollama, and others are making this possible. They’re acting like the ‘bridge’ between the big, fancy AI models (that require mountains of GPUs and cash) and the rest of us. These tools let you pick and choose models that actually fit your needs, whether you’re a student, a small team, or a nonprofit.

Check out my previous post about running your local LLM on macOS using Docker and Ollama.

If you want to know more about Open Source and Open Source AI:

That’s all for this post. I’m planning to keep writing posts related to AI, so stay tuned.