Databricks integrates TensorFlow into Apache Spark machine learning APIs to simplify and scale deep learning



Databricks on Tuesday announced Deep Learning Pipelines, a new library that integrates and scales out deep learning in Apache Spark. This open source package adds high-level, easy-to-use deep learning APIs for technologies such as TensorFlow to Apache Spark, making it possible for enterprises to scale deep learning across multiple nodes.

The package lets users call deep learning libraries from within existing Spark ML workflows, so Spark developers can use it immediately without learning a separate tool. It also supports seamless transfer learning of deep learning models via Spark MLlib Pipelines, combining the power of deep learning with Spark's data processing and machine learning capabilities.

The library further leverages Spark's distributed computation engine, integrated with TensorFlow and Keras, to train and productionize high-quality models at scale; empowers organizations to apply artificial intelligence by turning deep learning models into SQL functions for business and data analysts; and works with complex data such as images through a set of Spark-native utilities.
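
The SQL-function mechanism, for example, registers a deep learning model as a Spark SQL UDF that analysts can call from a query. Below is a minimal sketch in Python based on the registerKerasImageUDF helper described in the package's documentation; the model path, UDF name, and table name are hypothetical placeholders:

    # Sketch: register a Keras image model as a Spark SQL UDF.
    from keras.models import load_model
    from sparkdl import registerKerasImageUDF

    # "my_model_udf" and the model file path are hypothetical placeholders.
    registerKerasImageUDF("my_model_udf", load_model("/models/classifier.h5"))

    # Analysts can then score images directly from SQL; "my_images" is a
    # hypothetical table of images, and "spark" is the active SparkSession
    # (predefined in Databricks notebooks).
    spark.sql("SELECT image, my_model_udf(image) AS prediction FROM my_images")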

Previously, deep learning was out of reach for many because it depended on separate, low-level frameworks that require specialized skills. Furthermore, those frameworks do not scale well because they run only on a single node.

Apache Spark is fully open source, hosted at the vendor-independent Apache Software Foundation. Since its release, Apache Spark has seen rapid adoption by enterprises across a range of industries. Internet powerhouses such as Netflix, Yahoo, and eBay have deployed Spark at massive scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes. Spark has quickly built the largest open source community in big data, with over 1,000 contributors from more than 250 organizations. Together with the Spark community, Databricks continues to contribute to the Apache Spark project, through both development and community evangelism.

Deep Learning Pipelines builds on Apache Spark's ML Pipelines for training and on Spark DataFrames and SQL for deploying models. It includes high-level APIs that let common deep learning tasks be done in a few lines of code: image loading, applying pre-trained models as transformers in a Spark ML pipeline, transfer learning, distributed hyperparameter tuning, and deploying models in DataFrames and SQL.
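
The first of these, image loading, is exposed as a Spark-native utility that reads a directory of images into a DataFrame. A minimal sketch, assuming the sparkdl package is installed; the directory path is a hypothetical placeholder:

    # Sketch: load a directory of images into a Spark DataFrame.
    from sparkdl import readImages

    # Returns a DataFrame with a file path and image data for each picture.
    image_df = readImages("/data/images")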

Deep Learning Pipelines supports running pre-trained models in a distributed manner with Spark, in both batch and streaming data processing. It ships with several popular models, enabling users to start using deep learning without the costly step of training a model. For example, the following code creates a Spark prediction pipeline using InceptionV3, a state-of-the-art convolutional neural network (CNN) model for image classification, and predicts what objects are in the images that were just loaded.
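
A minimal sketch of that pipeline, based on the DeepImagePredictor API in the package's documentation, applied to the image_df DataFrame from the earlier sketch:

    # Sketch: apply the pre-trained InceptionV3 model to the loaded images.
    from sparkdl import DeepImagePredictor

    predictor = DeepImagePredictor(inputCol="image",
                                   outputCol="predicted_labels",
                                   modelName="InceptionV3",
                                   decodePredictions=True,  # map outputs to labels
                                   topK=10)                 # keep the 10 best guesses

    # image_df: the DataFrame of images loaded in the earlier sketch.
    predictions_df = predictor.transform(image_df)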

Pre-trained models are extremely useful when they suit the task at hand, but they are often not optimized for the specific dataset users are tackling. For example, InceptionV3 is optimized for image classification across a broad set of 1,000 categories, while the task at hand might be dog breed classification. A commonly used technique in deep learning is transfer learning, which adapts a model trained for a similar task to the task at hand. Compared with training a new model from the ground up, transfer learning requires substantially less data and fewer resources. This is why it has become the go-to method in many real-world use cases, such as cancer detection.
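
In Deep Learning Pipelines this pattern takes only a few lines: a pre-trained network extracts features, and a simple Spark MLlib classifier is trained on top of them. A minimal sketch using the package's DeepImageFeaturizer, assuming a hypothetical train_df DataFrame of labeled dog images:

    # Sketch of transfer learning: InceptionV3 acts as a fixed featurizer,
    # and a logistic regression head is trained on the extracted features.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from sparkdl import DeepImageFeaturizer

    featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                     modelName="InceptionV3")
    lr = LogisticRegression(maxIter=20, regParam=0.05, labelCol="label")

    # train_df: a hypothetical DataFrame of images with a "label" column.
    model = Pipeline(stages=[featurizer, lr]).fit(train_df)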

Databricks also unveiled Tuesday a new offering that simplifies the management of Apache Spark workloads in the cloud. Databricks Serverless, its fully managed computing platform for Apache Spark, allows teams to share a single pool of computing resources and automatically isolates users and manages costs. The new offering removes the complexity and cost of users managing their own Spark clusters.

Additional benefits of Databricks' Serverless offering include automatically configured and managed clusters; auto-scaling of local storage; adaptation to multiple users sharing the cluster; and security.

“As enterprises scale their use of Apache Spark, hundreds of data scientists, data engineers and business users need to use the platform. Traditional cloud and on-premise platforms require teams or individuals to manage their own Spark clusters in order to enforce data security, isolate workloads, and configure resource allocation. This approach is costly and highly complex, as every team must learn to manage its own clusters. With Databricks Serverless, organizations can use a single, automatically managed pool of resources and get best-in-class performance for all users at dramatically lower costs. Databricks is excited to announce this offering and to be the only company able to provide it,” said Ali Ghodsi, cofounder and chief executive officer at Databricks.
