Since the advent of distributed computing, many of its fundamental concepts have been successfully applied in a wide range of real-life applications. The distributed paradigm emerged as an alternative to expensive supercomputers for handling intensive data computation. A distributed computing system is a network of a large number of computation units connected through a fast network and designed to share resources among compute nodes. Communication and coordination between distributed compute nodes occur through message passing. The cluster of compute nodes coordinates to achieve a common goal, while the user perceives the collection of autonomous compute nodes as a single unit. Moreover, because multiple computers process the same data, a malfunction in one of them does not bring down the entire computing process.
The application of database technologies has been evolving with the increase in the size of data to store. The first generation of data growth started with the evolution of ERP software. The use of global applications across different platforms, operating on petabytes of data, introduced the next generation of data growth. The increasing complexity of data has changed the way we design databases. Because of changing technological trends and user demands, applications today need to be built on top of the database without a full understanding of the data to be stored or its structure. This need has caused many existing technologies, such as relational databases, to fall short for modern applications that demand scalability and intensive data computation. Similarly, there have been many changes in the software technologies used for building applications. The technologies developed to evolve with the ever-growing volume of data are called big data technologies.
Big data technologies leverage the fundamental concepts of distributed computing to achieve large-scale computation in a scalable and affordable way. Big data is a field in which large and complex data sets are analyzed systematically to extract insights that would otherwise be out of reach for traditional data-processing software. It is characterized by four key properties.
- Volume: The quantity of generated and stored data.
- Variety: The type and nature of data.
- Velocity: The speed at which the data is generated and processed to meet the demand.
- Veracity: The quality and value of the data.
Big data technologies have been used in many data-intensive applications like weather forecasting, natural disaster prediction, biomedical research, astronomical research, simulations, and so on.
The fundamental principles of distributed computing, which include data storage, data access, data transfer, data visualization, and predictive modeling using multiple low-cost commodity machines, are the key considerations that make big data analytics viable. In the age of information technology, harnessing big data and making data-driven decisions is crucial to stay competitive and meet technological demand. To this end, the paper is organized by explaining the distributed computing technologies used in big data analytics, categorized by area of application with examples of popular tools.
1. Distributed Database
A distributed database is a database in which data is stored on multiple interconnected computers, located in diverse physical locations, while being attached to a common processor. These processors are tightly coupled and constitute a single database system. The distribution of data in the database is transparent to the user, and the database is managed as if it were a single system. Similarly, database integrity across the multiple sites is maintained on each transaction, and the complexity of a transaction is hidden from the user. Since the data is stored on multiple computers, there are two approaches to storing data in a distributed database system.
- Replication: In this approach, the entire database is stored redundantly at two or more sites; replication simply maintains copies of the data. The advantage of this approach is that it increases the availability of data at different sites: if one site fails, another site is still available to serve the data. On the other hand, the replicas must be updated every time a change is made in the system, which can lead to inconsistency between the sites.
- Fragmentation: In this approach, the data in the system is logically divided and placed on different sites, with a master server maintaining the index of where each fragment is stored. While this approach maintains consistency, it does not create copies of the data, so each fragment is a single point of failure.
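The two approaches can be sketched in a few lines of Python; the site names and the hash-based fragment placement below are illustrative assumptions, not a prescription for a real system.

```python
# Toy sketch of the two distributed-database storage approaches.
class ReplicatedStore:
    """Replication: every site holds a full copy of the data."""
    def __init__(self, sites):
        self.sites = {s: {} for s in sites}

    def write(self, key, value):
        for copy in self.sites.values():   # every write goes to all sites
            copy[key] = value

    def read(self, key, site):
        return self.sites[site].get(key)   # any site can answer


class FragmentedStore:
    """Fragmentation: each record lives on exactly one site."""
    def __init__(self, sites):
        self.sites = {s: {} for s in sites}
        self.names = list(sites)

    def site_for(self, key):
        # stand-in for the master server's index of fragments
        return self.names[hash(key) % len(self.names)]

    def write(self, key, value):
        self.sites[self.site_for(key)][key] = value

    def read(self, key):
        return self.sites[self.site_for(key)].get(key)
```

The trade-off is visible in the code: the replicated store pays for every write at every site but survives losing one, while the fragmented store writes once but loses the record if its site fails.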
These approaches to data storage increase the scalability, reliability, and availability of the database system. Moreover, they provide horizontal scalability as an alternative to vertical scalability, improving speed by distributing the load across many computers instead of computing on a single machine with higher specifications. Since the databases can be located in a distributed manner, placing data near the site of greatest demand improves the performance of accessing it.
The scalability issue of a relational database with ACID (atomicity, consistency, isolation, durability) properties on a single server is alleviated by the distributed database system. However, the distributed database system has its own barriers, which are best described by the CAP theorem. The CAP theorem states that it is impossible for a distributed data store to simultaneously provide more than two of the following three guarantees.
- Consistency: Every read receives the most recent write, so all nodes yield the same result for the same query.
- Availability: Every request receives a response, even in the case of node failure.
- Partition Tolerance: The system continues to operate despite an arbitrary number of messages being dropped or delayed by the network between nodes.
The visualization of the CAP theorem with three properties is shown in the figure below.
Distributed systems allow us to achieve a level of computing power unmatched by the computer systems of the past, and they run on commodity hardware that is easily obtainable and configurable at an affordable cost. However, a distributed database system is complex and constrained by the trade-offs of the CAP theorem. Thus, it is necessary to select the right database design for the system's requirements in order to realize these advantages.
The relational database has long been the dominant model for database management systems. Non-relational databases, otherwise known as NoSQL databases, are increasingly being adopted by many organizations, mainly because of their schemaless model and high scalability. Traditionally, storing unstructured data in a relational database was challenging and difficult to manage. With the advent of NoSQL databases, this limitation on sparse data has been overcome: a NoSQL database can store both structured and unstructured data, in contrast to the relational database, which demands structured data.
Based on the structure of the data to be stored, NoSQL databases are categorized into four types, shown and described below.
- Graph Databases: In this database, data is stored as nodes and edges, following the structure of a graph. Typically, the data stored in a node is information about people, places, or things, while the data stored in an edge is a relationship attribute between nodes. This type of database is widely used to store data for social networks, recommendation engines, search engines, and so on. Examples: Neo4j, JanusGraph, etc.
- Key-value Databases: This is the simplest type of database, where each item is a key-value pair. Because of its simple structure, it can store a large amount of data and retrieve it with basic queries. It is mostly used to store user preferences and cached data. Examples: Redis, Memcached, DynamoDB, etc.
- Wide Column Databases: In this database, data is stored in tables, rows, and dynamic columns. It provides a lot of flexibility over relational databases, as each row is not required to have the same columns. These databases are widely used for storing internet-of-things data and user profile data. Examples: Cassandra, HBase, etc.
- Document Databases: In this database, data is stored as documents, typically in JSON-like structures, where each document can have its own set of fields. These databases are widely used for content management and product catalogs. Examples: MongoDB, CouchDB, etc.
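As a small illustration of the schemaless model these databases share, the sketch below (plain Python, with a hypothetical `find` helper) stores records with different fields in one collection, something a fixed relational schema would reject.

```python
# Schemaless storage sketch: records in one collection need not share fields.
users = []   # one "collection", no declared schema
users.append({"id": 1, "name": "ada"})
users.append({"id": 2, "name": "bob", "tags": ["admin"], "last_login": "2021-01-01"})

def find(collection, **criteria):
    """Return records whose fields match all given criteria."""
    return [r for r in collection
            if all(r.get(k) == v for k, v in criteria.items())]
```

A query like `find(users, name="bob")` works even though the two records have different shapes, which is the flexibility that makes NoSQL attractive for evolving applications.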
2. Distributed storage
A distributed storage system is an infrastructure that splits data across multiple physical servers, often spanning several data centers. It stores the data on a cluster of storage units which synchronize and coordinate with each other, allowing a large volume of data to be processed in parallel and improving performance drastically. This type of storage can hold several types of data, such as videos, images, documents, and objects. Traditionally, a single computer system was used to process data stored in a local file system. Processing a large volume of data on a single system is a bottleneck, as it can take several days to process terabytes of data. To solve this problem, distributed file systems were introduced, in which data is distributed across multiple local hard disks, each attached to a separate computing node. In such a system, the computation can be divided so that terabytes of data are processed in parallel within minutes.
The main motivation of a distributed storage system is to provide scalability, redundancy, and performance while making it possible to use cheaper commodity hardware to store large volumes of data at low cost. The limitations of a distributed storage system are defined by the CAP theorem: we must give up at least one of its three properties. Many existing distributed storage systems give up strict consistency while guaranteeing availability and partition tolerance, typically by providing eventual consistency instead.
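A minimal sketch of eventual consistency, assuming a last-write-wins policy with logical timestamps (one common reconciliation strategy, not the only one): replicas may disagree after independent writes, but a synchronization pass brings them to the same state.

```python
# Eventual consistency sketch: replicas tag values with a logical timestamp
# and reconcile by keeping the newest value (last-write-wins).
def write(replica, key, value, ts):
    cur = replica.get(key)
    if cur is None or ts > cur[1]:
        replica[key] = (value, ts)

def sync(a, b):
    """Anti-entropy pass: after syncing, both replicas agree on every key."""
    for key in set(a) | set(b):
        for src, dst in ((a, b), (b, a)):
            if key in src:
                write(dst, key, *src[key])
```

Between writes and the next `sync`, a reader may see stale data; that window of divergence is exactly the consistency the system has traded away.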
Distributed File System: HDFS
HDFS is a Java-based distributed file system that spans multiple computers and supports data-intensive applications with high fault tolerance and high throughput. It lets applications store and process large amounts of data across hundreds of low-cost commodity machines. The fundamental architecture of HDFS is shown below.
HDFS follows a master-slave architecture, and its clusters hold metadata and application data. The metadata is stored on a dedicated server called the Name Node, while application data is stored on Data Nodes. The Name Node is the central control of the entire file system: it manages and maintains the file system tree, and it is also responsible for receiving and processing client requests and for managing and distributing specific storage tasks. A Data Node has no individual failover mechanism; instead, file content is replicated across multiple Data Nodes for reliability. HDFS also keeps data local to the node where the computation is carried out, reducing the overhead of transferring data between nodes.
Unlike a conventional file system, HDFS provides an API that exposes the locations of file blocks, allowing distributed programming frameworks like Map-Reduce to process data locally on the node where it resides.
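The interplay of block splitting, replication, and the block-location API can be sketched as follows; the tiny block size, the replication factor, and the round-robin placement are simplifications for illustration, not how HDFS actually places replicas.

```python
# Simplified sketch of HDFS-style metadata: the name node records which
# data nodes hold each block of a file; data node names here are made up.
BLOCK_SIZE = 4      # bytes, tiny for illustration (HDFS defaults to 128 MB)
REPLICATION = 2

def put(namenode, datanodes, path, data):
    """Split data into blocks and place each replica on a data node."""
    nodes = sorted(datanodes)
    blocks = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        holders = [nodes[(i // BLOCK_SIZE + r) % len(nodes)]
                   for r in range(REPLICATION)]      # round-robin placement
        for n in holders:
            datanodes[n].append(block)               # store the block content
        blocks.append(holders)                       # metadata: who has block i
    namenode[path] = blocks

def block_locations(namenode, path):
    """The kind of API a framework queries to schedule computation near data."""
    return namenode[path]
```

A scheduler that calls `block_locations` can run each task on one of the nodes already holding that block, which is the data-locality idea described above.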
3. Distributed Computing Frameworks
Since distributed computation has to rely on synchronization and locking across multiple processors, it carries overhead, and this overhead grows further when data must be transported from a data node to the processing node (the node that processes the data). Distributed computing frameworks were introduced to address this issue. These frameworks facilitate concurrent processing by splitting petabytes of data into smaller chunks and processing them in parallel on processing nodes; the results are then aggregated from the multiple nodes and returned to the application.
Map-Reduce is a programming model for processing and generating big data sets with a parallel, distributed algorithm. A typical Map-Reduce program is composed of map and reduce procedures: the map procedure performs filtering and sorting of the data, while the reduce procedure performs a summary operation on the mapped data. The model is a specialization of the split-apply-combine strategy for data analysis, inspired by the map and reduce functions commonly used in functional programming, and it rests on the idea that data is not a single unit but a collection of multiple units. Further, to expedite processing, instead of sending the data to the computing nodes, the map-reduce logic can be executed on the node where the data already resides. The fundamental steps involved in the Map-Reduce process are shown below.
A Map-Reduce programming framework works in three steps:
- Map: Each compute node applies the map function to its local data and produces several chunks of intermediate data.
- Shuffle: The compute nodes redistribute data based on the output keys produced by the map function, so that all data belonging to one key ends up on the same compute node.
- Reduce: The compute nodes process each group of output data in parallel and produce a new set of output, which is stored in the distributed file system.
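The three steps above can be sketched in-process with the classic word-count example; this is a single-machine simulation of the model, not a distributed run.

```python
# In-process simulation of the Map, Shuffle, and Reduce steps (word count).
from collections import defaultdict
from itertools import chain

def map_fn(line):
    return [(word, 1) for word in line.split()]       # Map: emit (key, value)

def word_count(lines):
    mapped = chain.from_iterable(map_fn(l) for l in lines)
    groups = defaultdict(list)                        # Shuffle: group by key
    for key, value in mapped:
        groups[key].append(value)
    return {k: sum(vs) for k, vs in groups.items()}   # Reduce: summarize per key
```

In a real cluster, the map calls run on the nodes holding the input splits, the shuffle moves pairs across the network by key, and the reduces run in parallel on the grouped data.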
Apache Spark is a distributed computing framework designed to improve the performance of iterative workloads through in-memory computation. It can perform parallel computing operations faster by using memory primitives: a job can load data into either local memory or a cluster-wide shared memory and query it iteratively at great speed compared to the Map-Reduce framework. Because of this advantage, it is widely used in modern applications like streaming, machine learning, graph processing, data transformation, and so on.
Spark exploits memory capacity to avoid the repeated reading and writing of the map-reduce workflow. Its core abstraction, the Resilient Distributed Dataset (RDD), is a fault-tolerant collection of data partitioned across the memory of multiple computers but operated on as a single unit. The architecture of Apache Spark is shown below.
At the fundamental level, an Apache Spark application consists of two main components.
- Driver: This component converts the user’s code into multiple tasks that can be distributed across worker nodes.
- Executors: These components execute the tasks assigned to their nodes.
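The division of labor above can be simulated on one machine, using threads as stand-in executors; this is an illustrative sketch of the driver/executor split, not the Spark API.

```python
# Driver/executor sketch: the "driver" makes one task per data partition,
# and a pool of "executors" (threads here) runs them in parallel.
from concurrent.futures import ThreadPoolExecutor

def run_job(partitions, task):
    """Driver: ship one task per partition to the executors, collect results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as executors:
        return list(executors.map(task, partitions))

# Usage: sum the squares in each partition, then combine on the driver.
partitions = [[1, 2], [3, 4], [5]]
partial_sums = run_job(partitions, lambda part: sum(x * x for x in part))
total = sum(partial_sums)
```

The per-partition results come back to the driver, which performs the final combination, mirroring how a Spark action gathers executor output.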
4. Machine Learning Platforms
Machine learning algorithms involve complex mathematical operations, such as matrix algebra and optimization, on large-scale data. Building a complex model that generalizes well requires a large amount of training data, which in turn demands high computational capability. Traditionally, machine learning models were built by scaling up a single computer, adding cores, memory, and so on. Nowadays, in contrast, a large number of distributed computers are used to build machine learning models in parallel.
The evolution of machine learning was further expedited by the popularity of distributed computing frameworks like Spark. Using these frameworks, models can be built on large-scale data without sampling while still achieving accurate predictions. These tools are optimized for high performance, using fast computation, parallel distributed training, and in-memory compression to handle huge datasets. Some of the popular machine learning platforms are Mahout, H2O, Spark, Hadoop, DIANNE, MXNet, Petuum, and so on.
5. Search System
The world is moving toward big data with an ever-increasing demand for real-time delivery of information. Data is generated in the range of petabytes every day, and storage must keep pace with it. In today's world, real-time search is impossible to achieve with traditional search schemes that scan the data directly.
Big data is stored as both unstructured and structured textual data, and the search system operates on this huge amount of text. To make textual data searchable, an inverted index is created from it. In a forward index, a document is stored in the database and retrieved by its document ID. An inverted index is instead built from the keywords in the documents: each word is mapped to the list of documents that contain it. Without an indexing scheme, a search query would take a long time, scanning documents one by one. Since the number of words in a language is limited, the number of entries in the inverted index remains bounded even with a very large number of documents. Consequently, it is possible to hold the index in the memory of a single node or a cluster of nodes.
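A minimal sketch of building and querying an inverted index (whitespace tokenization and lowercasing are simplifying assumptions; real engines also do stemming, stop-word removal, and ranking):

```python
# Inverted index sketch: each word maps to the set of IDs of documents
# containing it, so a query becomes a set intersection instead of a scan.
from collections import defaultdict

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return IDs of documents containing every query word."""
    hits = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*hits) if hits else set()
```

A multi-word query intersects the posting sets of its terms, which is far cheaper than examining each document in turn.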
There are many distributed search systems that implement this indexing scheme. Elasticsearch is one of the most prominent search engines for big data and distributed systems. It is a RESTful search engine that supports HTTP and JSON-based search queries, and it offers full-text search for all types of data, including textual, numerical, geospatial, structured, and unstructured. It can be used in plenty of different applications, such as application search, website search, enterprise search, logging, monitoring, data analysis, security analysis, and business analysis.
Elasticsearch uses the concept of an inverted index for searching: a list of all the unique words that appear in the documents, with each word associated with the list of documents containing it. When a search query is dispatched, Elasticsearch looks up the required terms in the inverted index, and the output of the query is the list of documents that include the word or term. It also provides the ability to subdivide an index into multiple pieces called shards; an index is split into shards distributed across multiple nodes, and the Elasticsearch server automatically manages the arrangement of these shards.
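Conceptually, a document is routed to a shard by hashing its routing key modulo the number of primary shards. The sketch below uses MD5 purely for illustration; Elasticsearch itself uses a Murmur3 hash of the routing value.

```python
# Shard routing sketch: hash the routing key, take it modulo the shard count.
import hashlib

def shard_for(doc_id, num_shards):
    digest = hashlib.md5(str(doc_id).encode()).hexdigest()
    return int(digest, 16) % num_shards
```

Because the same key always hashes to the same shard, a document is indexed on, and later retrieved from, a single predictable shard; this is also why the number of primary shards cannot change without reindexing.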
6. Big Data Messaging
Every component in a big data system is a cluster of a large number of computing nodes that handle distributed data and computation, and it is challenging to create and manage one-to-one communication in such a system. While the components themselves can be robust, the communication between them can be a barrier. One critical problem is that the flow of data makes it difficult to estimate the infrastructure required for processing: there can be a mismatch between the rate at which data enters the processing unit and the rate at which it is processed and written to storage, resulting in large in-memory buffers in the data parsing and extraction process. Moreover, a failure in that process can lead to loss of data. To overcome these issues, big data messaging software has evolved to handle a large volume of messages by holding them temporarily, with failover and replication capabilities that prevent data loss. There are two main paradigms in distributed messaging systems, widely used across many applications.
- Publish/Subscribe Paradigm: This pattern corresponds to a mechanism where producers publish messages grouped into categories and consumers subscribe to the categories they are interested in. Example: Apache Kafka
- Messaging Queue Paradigm: In this pattern, producers send messages to a queue from which consumers consume them in order. It is asynchronous in that producers and consumers need not interact with the message queue simultaneously. This is also called point-to-point communication. Examples: RabbitMQ, ActiveMQ, ZeroMQ, etc.
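The publish/subscribe pattern can be sketched with a minimal in-memory broker; this illustrates the pattern itself, not any particular product's API.

```python
# Minimal publish/subscribe sketch: consumers register callbacks for a topic,
# and the broker fans each published message out to all of them.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers[topic]:   # fan-out to every subscriber
            callback(message)
```

The decoupling is the point: the producer calling `publish` knows nothing about who, if anyone, is subscribed, which is what lets components of a big data pipeline evolve independently.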
RabbitMQ is one of the leading message-passing systems and has been used across many software industries. Though it was initially built to run on a single server, it has incorporated a clustering architecture with the growing popularity of big data. It uses the Advanced Message Queuing Protocol and enables seamless asynchronous message-based communication between applications. It supports the popular publish/subscribe architecture, in which a group of producers publishes messages under subjects and a group of consumers consumes messages based on those subjects. The publish/subscribe architecture of RabbitMQ is shown below.
Apache Kafka is another popular distributed event streaming platform, based on the abstraction of a distributed commit log. While Kafka was initially a message queue system, it has evolved into a full-fledged event streaming platform; the underlying principle is still message passing in a pub/sub fashion. It is simple to use and provides high throughput and a robust replication feature. Its architecture comprises producers, consumers, brokers, ZooKeeper, logs, partitions, records, and topics. Records consist of a value and a timestamp, and topics are categories for streams of records. Producers generate streams of records that are put into topics, and consumers subscribe to the topics they are interested in.
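Kafka's core abstraction, the partitioned commit log with per-consumer offsets, can be sketched as follows; this is a toy model of the idea, not the Kafka client API.

```python
# Commit-log sketch: a topic partition is an append-only list of records,
# and each consumer remembers its own offset into that list.
class LogPartition:
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1      # offset assigned to the new record

    def read_from(self, offset):
        return self.records[offset:]      # consumers pull from their own offset
```

Because the log is never mutated in place, many consumers can read the same partition independently, each at its own pace, simply by tracking different offsets.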
7. Distributed Caching System
In computing, a cache is a high-speed data storage layer that stores a subset of data so that future requests for that data are served faster than is possible by accessing the primary storage location. Data in a cache is generally stored on fast-access hardware such as RAM (random-access memory) and can be used in conjunction with a software component as well. A cache is small but fast, holding copies of data from frequently used main memory locations. When the processor needs to read or write a location in main memory, it first checks for a corresponding entry in the cache: if it finds the location there, a cache hit has occurred; if not, a cache miss has occurred. Beyond caches inside the computer, a cache can also be a separate cache server placed at the shortest physical distance from the demand for the data.
A cache is an important component of a big-data system, helping to expedite real-time responses to requests. Following the power law, which holds that eighty percent of requests are served by twenty percent of the data, a cache provides high performance in a distributed system.
Memcached is a popular high-performance distributed caching system based on a key-value store for chunks of data, and it has grown over the last few years to serve multi-node clusters. Memcached is made up of four main components.
- Client software: It manages the list of available Memcached servers.
- Hashing algorithm: It chooses a server based on the key.
- Server software: It stores key-value pairs in an internal hash table.
- LRU: It determines when to throw out old data and reuse the memory.
These components allow the client and server to work together to deliver cached data as effectively as possible.
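The hashing and LRU components above can be sketched together: the client hashes a key to pick a server, and each server evicts its least-recently-used entry when full. This is a simplification; real Memcached clients typically use consistent hashing, and the tiny capacity here is only for illustration.

```python
# Sketch of client-side server hashing plus per-server LRU eviction.
from collections import OrderedDict

def server_for(key, servers):
    return servers[hash(key) % len(servers)]      # client-side hashing

class LRUCache:
    def __init__(self, capacity=2):               # tiny, for illustration
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                           # cache miss
        self.data.move_to_end(key)                # mark as recently used
        return self.data[key]

    def set(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)         # evict least recently used
```

Reading a key refreshes its position, so entries that keep being requested survive while cold entries are evicted first, which is what makes LRU a good fit for the eighty/twenty access pattern mentioned above.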
Redis is another popular in-memory non-relational database used as a key-value cache and store. It is blazing fast compared to a traditional database: because of its simple structure and in-memory storage, data operations in Redis typically complete in well under a millisecond. It offers many options for data storage, such as strings, lists, sets, and hashes, along with advanced functionality like publish/subscribe, master/slave replication, disk persistence, and scripting. It has built-in replication, high availability, and data partitioning features. Moreover, it allows atomic operations such as append, find, and sort in the in-memory store.
The architecture of the Application that uses Redis as a cache server is shown below.
8. Data Visualization
A collection of data in a database does not, by itself, provide any insight into the system. The data needs to be visualized with statistical visual elements such as charts, graphs, and maps to yield insight and knowledge. In the world of big data, data visualization tools and technologies are essential for analyzing massive amounts of information and making data-driven decisions. As big data grows, data visualization will increasingly become the weapon for making sense of the trillions of rows of data generated every day. It helps tell stories by curating data into easier-to-understand visualizations that highlight trends and outliers.
To communicate information clearly and efficiently, data visualization uses statistical graphs, plots, infographics, and other tools. Effective data visualization helps users analyze and reason about data and evidence. It makes complex data more accessible, understandable, and usable.
We have explained different distributed computing technologies that are being used to cope with the increasing trend of big data. Unlike a traditional single-computer system, a distributed system has a complex architecture and overheads. Despite these limitations, it has shown great promise in scaling and processing large amounts of big data in an inexpensive and reliable way. It is likely that distributed computing technologies will continue to expedite solutions for the increasing trend of big data.