17 February 2022

More Than Open Source: A Hybrid Approach to Enterprise Data Science

Open source has gained a lot of popularity in many fields, especially data science. Rightfully so! I’m a huge supporter of open source (RapidMiner itself started out as an open-source project), but I’d like to take a holistic view of the topic in this post.

As of 2020, 94 percent of developers, data scientists, and technology managers preferred open-source software (OSS) to proprietary software. For one, most data scientists are intellectually curious. Popular open-source frameworks typically have extensive tutorials, Coursera courses, and informal forums that give them quick answers to their questions.

Data scientists are tinkerers and experimenters by nature, and open source lets them test, experiment, and iterate more freely than most proprietary software does. Open-source software can also keep up with today’s fast pace of innovation, a critical element in modern data science applications.

However, the biggest advantages of open-source tools are for coders, and for enterprise data science, 100% coding is not the right approach. While open-source software offers unparalleled flexibility, it has a few key limitations as well. We’ll walk through the pros and cons of open source to show you why a hybrid approach incorporating both commercial and open-source solutions is the best way forward.

The Biggest Benefits of Open-Source Software

As we mentioned, adoption of open source is growing quickly, and for good reason. Open-source software is free, widely available, and, well, open to the community. It gives data scientists access and flexibility that proprietary tools rarely match.

Here are a few of our favorite things about open source:

It’s software super glue

One of the best things about open-source software is that anyone can use its code for their own projects. And enterprises rely on that fact. Synopsys, a software company, reported that the number of open-source components per commercial application jumped from 84 in 2016 to 528 in 2021.

Many of us rely on open-source applications, too. WordPress, Mozilla Firefox, and VLC Media Player are just a few examples of OSS that have been integrated into our everyday lives. There are also plenty of open-source tools and projects from companies like Google, Apple, and Zapier that can be used for frictionless integrations, code review, model deployment, and more. Open source is, in many cases, the glue that holds essential software together.

It offers unprecedented customizability

Another major benefit of open source is that anyone can access the source code at any time. There are usually no surprises or hidden functions: users can not only see the source code, they can also use it and, if needed, adapt it to their specific use cases. As participation increases, so do the strength and usefulness of the project.

Because open source fosters customizability, coders especially feel more confident in the projects they’re working on. It’s easier to evaluate projects, contribute to them, and apply them when trust is established from the start.

It’s a modern innovation driver

Open source is more than software; it’s a community. Rather than competing, data scientists in the open-source community work together and share knowledge to create new software aimed at solving common pain points. We owe this collaborative, self-directed style of development, as opposed to siloed, controlled development, to open source.

Kubernetes, for example, is an open-source system that’s significantly contributed to enterprise-level innovation. The tool, created for Linux container orchestration, allows admins to deploy containers to clusters at scale. Companies like Spotify, Airbnb, and Pinterest use Kubernetes to decrease website load time and get new services into production faster.

The Hidden Costs of Open Source

The benefits and widespread adoption of open source are undeniable, but there are plenty of enterprise requirements that open source alone doesn’t cover. Though open source is ‘free’ on the surface, enterprises need to keep the hidden costs in mind—things like the time needed to build custom components and maintain enterprise models—and have a solution in place to mitigate them.

Here are a few negative ways that open source can impact your projects:

It requires more time building, coding, and maintaining

When an enterprise adopts an open-source project, the specific capabilities it needs usually aren’t available out of the box. If you’re an early adopter of an OSS project and the functionality you need isn’t in place, you’ll need to write the code on your own. At the other extreme, if the OSS project you’ve built substantial models on is no longer actively developed or maintained (see: Theano), you’ll spend precious time rebuilding them, work that a commercial software provider would otherwise take off your hands.

Flawless integrations, too, are up to your team. While integrating with open source is almost always possible, you may have to add your own code to get the integration to work the way you want it to. Additionally, when these OSS tools are updated (and that happens often with open-source solutions), you may need to review and rework your integration to ensure everything still works the way you need it to.

It creates a messy mix of components

Open-source software is generally built in layers. When you use OSS, you inherit the components used as well as any transitive or indirect dependencies. This requires you to track both direct and indirect components to ensure proper performance. If a defect is found in an underlying component, you need to be able to quickly identify and mitigate the problem.
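To make the tracking problem concrete, here is a minimal Python sketch. The package names and the `DEPENDENCIES` graph are hypothetical, invented purely for illustration; a real project would read this information from its package manager or a software composition analysis tool. The sketch walks the graph to collect every direct and indirect dependency, then reports which of your direct dependencies pull in a component with a known defect.

```python
from collections import deque

# Hypothetical dependency graph (illustrative names only):
# each package maps to the packages it depends on directly.
DEPENDENCIES = {
    "my_app": ["ml_framework", "data_loader"],
    "ml_framework": ["tensor_lib", "serializer"],
    "data_loader": ["serializer", "http_client"],
    "tensor_lib": [],
    "serializer": ["compression_lib"],
    "http_client": [],
    "compression_lib": [],
}


def transitive_dependencies(graph, package):
    """Return every package reachable from `package`, direct or indirect."""
    seen = set()
    queue = deque(graph.get(package, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited, so shared dependencies are counted once
        seen.add(current)
        queue.extend(graph.get(current, []))
    return seen


def affected_direct_dependencies(graph, package, defective):
    """Return the direct dependencies of `package` that pull in `defective`."""
    return sorted(
        dep
        for dep in graph.get(package, [])
        if dep == defective or defective in transitive_dependencies(graph, dep)
    )


if __name__ == "__main__":
    # A defect in a low-level component surfaces through two direct dependencies.
    print(affected_direct_dependencies(DEPENDENCIES, "my_app", "compression_lib"))
    # prints ['data_loader', 'ml_framework']
```

Even in this toy graph, `my_app` declares two dependencies but actually ships six components, and a defect three levels down reaches it through both of the packages it chose directly.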

You may also find that building your foundation requires a mix of free open-source tools and paid components. If there’s a capability you really need that your open-source tool doesn’t support, not only will you have to pay for it separately, you’ll also be the one responsible for making sure everything integrates properly. Data science is complex and precise work. If one of the components you’re using breaks, it can derail your entire tech stack, creating a Frankenstein’s-monster-esque liability.

It doesn’t come with reliable support

While some open-source software has robust developer communities that help support applications, other projects, especially minor ones, are less active or even dormant. If you have a question about an OSS tool, the support you need might not be readily available, and even after digging through forums, the answers you find aren’t always reliable. Many open-source users hire an external consultant for on-call assistance, adding another cost for convenience.

Approaching Open Source With a Hybrid Mindset

There’s no denying that open source is great for coders. But data science is a team sport, and for most of those team members, coding isn’t in their wheelhouse. To satisfy 100% of your team’s requirements, especially in an enterprise setting, a hybrid approach is the most robust, efficient, and cost-effective way to go.

If you’re building a data-intensive offering or application, it’s a good idea to pair your open-source framework with an enterprise-level data science platform. This allows you to benefit from enterprise robustness and productivity features from the get-go, while interfaces to open-source environments ensure a fast pace of innovation where needed.

Today, RapidMiner is a commercial platform, but we got our start as an open-source tool. We’ve carried over the flexibility, ease of integration, and open nature of open source into our platform without the hidden costs, so you can focus on outpacing your competition rather than worrying about your applications’ codebase.

Want to find out more about RapidMiner and how to use data science software to accelerate your organization’s data-driven transformation? Check out the RapidMiner-commissioned Forrester study on digital transformation today. 
