Using Notebooks in Databricks for Machine Learning
When a business wants to explore its data for bottlenecks and for the factors that contribute to customer happiness, it may try some Machine Learning models to find patterns. These models can be extremely valuable to a company, surfacing insights that a human might never have noticed. Generating these models requires data. Usually quite a bit of data.
Across the whole pipeline of building a useful model, several groups are involved: data engineers to pull in and standardize data, data scientists to create and test models, and software engineers to expose the model to new data. Other groups, such as managers and the IT department, also have a stake in the process. Finding a platform that unifies these groups and eases development is a hefty task. That's where Databricks comes in.
What is Databricks?
Databricks is a platform, built on top of Apache Spark, designed for processing large amounts of data for analysis. Databricks was founded by the creators of Apache Spark, and it remains the largest contributor to the open-source project. Apache Spark efficiently processes the data used in iterative algorithms by caching results in memory and minimizing disk usage. Machine Learning workloads are a prime example of what Spark is optimized for.
Notebooks are one of the services that Databricks offers. Think of a notebook as a document that you can inject code into. The document is broken up into cells of code, plain text, or graphs. Any data created in one code cell can be used in a later one. A simple use case would be pulling data in, cleaning it, and storing it in a variable or table in one cell; somewhere later in the document, that data can be fed into a model. The notebook must be attached to a cluster, which provides the compute that runs the notebook's code.
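As a minimal sketch of that workflow, here are two notebook cells in plain Python. The order data is made up for illustration; in Databricks it would more likely be read from a table or file.

```python
# Cell 1: ingest and clean raw data, storing the result in a variable.
# (Hypothetical order records; the empty amount simulates a bad row to drop.)
raw_orders = [
    {"customer": "a", "amount": "12.50"},
    {"customer": "b", "amount": ""},
    {"customer": "a", "amount": "7.25"},
]
clean_orders = [
    {"customer": o["customer"], "amount": float(o["amount"])}
    for o in raw_orders
    if o["amount"]  # drop rows with a missing amount
]

# Cell 2: later in the notebook, clean_orders is still in scope,
# so it can be aggregated or fed into a model without re-loading.
total_by_customer = {}
for o in clean_orders:
    total_by_customer[o["customer"]] = (
        total_by_customer.get(o["customer"], 0.0) + o["amount"]
    )
```

The key point is that `clean_orders` persists between cells: any teammate can open the notebook and build on the cleaned data without rerunning the ingestion step themselves.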
Having one notebook where the whole team works makes collaboration much simpler. It eliminates the need for emailing files or pulling code from a repository: everything is in one document that the team can access through a browser. Because the code added to the notebook is visible to everyone, the whole team can see the progress being made and catch errors early in development.
Databricks offers you a choice of language in the notebook. You can use SQL, Python, R, or Scala interchangeably; you're not locked into one language for the entire document. Each person involved in the project can use the language they are most comfortable with, putting their own skills to work without limiting others to that language.
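In a Databricks notebook, the language is switched per cell with a magic command at the top of the cell (`%sql`, `%python`, `%r`, `%scala`). A sketch of mixing SQL and Python, where the `orders` table is hypothetical:

```
%sql
-- One teammate aggregates with SQL in this cell
SELECT customer, SUM(amount) AS total
FROM orders
GROUP BY customer
```

```
%python
# Another continues in Python in the next cell, reading the same table
totals = spark.table("orders").groupBy("customer").sum("amount")
```

Each cell runs against the same attached cluster, so the SQL analyst and the Python developer are working on the same data side by side.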
Speed and Ability
Training a single model can take several hours to produce one result, and finding the right model means testing several different ones. The time needed to run all these tests quickly adds up if you can't run each one in a reasonable time. When you run a model through a notebook, you make use of a cluster that you have set up in Databricks. The cluster is optimized for these sorts of operations, and a model will often run significantly faster on the cluster than on a local computer. This gives teams the ability to test models quickly and even to try alternatives they previously wouldn't have had time for.
Some languages have basic Machine Learning models built in, like linear regression in R. If you need a more complex model, you'll probably need a package that provides it. Fortunately, notebooks allow third-party packages to be used. This gives you the full power of each language that you would expect outside of a notebook. It also lets you bring in your existing code and use it inside the notebook.
Ease of Understanding
One of the great challenges of building a model is not exploring the data and running tests on it. It's presenting the models to management in a meaningful way that they can digest. You can use the notebook itself to create a professional document for a presentation about the models you have built.
Notebooks natively support some basic graphs and images to highlight the data and models, but there isn't much customization available out of the box. To overcome this, you can use other packages, such as ggplot2, to add sleek graphs and visuals. Including some of these in a presentation helps the data stick with your audience and gives them something tangible that they can understand.
The notebooks offered by Databricks are a great way to make effective use of the platform. The only cost a notebook incurs is the compute time it takes to run each section of code. With fast runtimes, the time saved running each model can greatly outweigh the cost of the platform. Businesses should consider using notebooks for their models if they want quick development and a clear understanding of the models they build.