Big Data Tools Apache Spark And Azure

Today's generation of line-of-business computer systems can generate terabytes of data every year simply by tracking sales and production through CRM and ERP.  This enormous flow of data will only continue to grow as you add the sensors of the industrial IoT, along with the data needed to support them.


Big data is usually unstructured and spread across many servers and databases.  Having data and knowing how to use it are two different things.  This is where big data tools such as Apache Spark come into play, distributing analytical workloads across clusters of computers.  Building on techniques developed for the MapReduce algorithm in tools like Hadoop, big data analysis tools are going further: supporting more database-like behaviors, working with data in memory at scale, using iterative processing to speed up queries, and offering a foundation for machine learning systems.

Although Apache Spark is very fast, Databricks is even faster.  Databricks is a cloud-optimized version of Spark from the company founded by Spark's original development team.  It takes advantage of public cloud services to scale quickly and uses cloud storage to host its data.  It also offers notebook tools that make it easier to explore your data, in a style that is accessible and familiar to anyone who has used Jupyter Notebooks.

Microsoft's support for Azure Databricks signals a new direction for its cloud services, bringing Databricks in as a partner rather than an acquisition.  From the Azure Portal, you can perform a one-click setup, making Azure Databricks even easier to use.  You can host multiple analytical clusters and use autoscaling to reduce the resources in use.  You can also clone and edit clusters, assigning them specific jobs or running different analyses on the same data.

Azure Databricks Configuration:

Microsoft's new service is a managed Databricks virtual appliance, built from containers that run on Azure Container Services.  You select the number of VMs for each cluster, and once it's configured, the service automatically handles the load, launching new VMs as needed to scale.

Databricks' tools interact directly with Azure Resource Manager, which adds a security group, a dedicated storage account, and a virtual network to your Azure subscription.  You can use any of Azure's VM types for your Databricks clusters; the newest GPU-based VMs are a good choice if you plan to use them for machine learning systems. If a particular VM type is not right for your situation, you can easily swap it out for another one: just clone a cluster and change the VM definitions.

Bringing Engineering To Data Science In Spark:

Spark has its own query language that's based on SQL and works with Spark DataFrames to handle both structured and unstructured data.  DataFrames are like relational tables, built on top of collections of distributed data held in different places. A relational table is a set of data values organized in a model of vertical columns and horizontal rows. Using identifiable named columns, you can build and manipulate DataFrames from languages such as R and Python, so both data scientists and developers can take full advantage of them.

DataFrames give you a domain-specific language for your data, one that extends the data analysis features of your chosen platform.  Using DataFrame libraries, you can build complex queries that take data from multiple sources and work across columns.

Azure Databricks is data-parallel, and queries are evaluated lazily: nothing runs until an action asks for results, which are then delivered quickly.  You can point Azure Databricks DataFrames and queries at existing data easily because Spark supports common data sources, either natively or through extensions.  This reduces the need to migrate data to take advantage of its capabilities.

Azure Databricks is a very useful tool for developers and data scientists to develop and explore new models, turning data science into data engineering.  With Databricks Notebooks you can create scratchpad views of your data, combining code and results in a single view.

The notebooks are shared resources, so anyone can use them to explore data and experiment with new queries.  Once a query is tested and turned into a regular job, its output can be presented as an element of a Power BI dashboard.  This makes Azure Databricks part of an end-to-end data architecture that allows for more complex reporting than SQL or NoSQL databases alone.

A New Platform For Azure Services – Microsoft Plus Databricks:

Microsoft has not yet announced details of its pricing for Azure Databricks.  It has stated that the service will improve performance and reduce costs by as much as 99% compared with operating your own Spark installation on Azure's infrastructure services.  If Microsoft's claims are confirmed, you could see significant savings.

Azure Databricks connects directly to Azure's storage services, including Azure Data Lake, with optimizations for queries and caching.  You will also have the option to use it with Cosmos DB, taking advantage of globally distributed data and various NoSQL data models, including MongoDB and Cassandra as well as Cosmos DB's graph APIs.  Combined with Azure's data streaming tools, this gives you the option of integrated real-time IoT analytics.

It makes a great deal of sense for Microsoft to partner with Databricks: Databricks has the experience and Microsoft has the platform. If the service turns into a success, it could set a new precedent for how Azure evolves in the future.
Reviewed by thanhcongabc on January 04, 2018
