1 Introduction
The Internet of Things (IoT) is a computing term that is difficult to define precisely. Various researchers and groups have defined it differently, although the main idea is the same. Each definition links the integration of the physical world with communication technology. Marshall McLuhan in his work Understanding Media states that with the help of electric media we are setting up a dynamic that will transform all previous technologies, including cities, into information systems. The Global Standards Initiative on Internet of Things (IoT-GSI) defines IoT as a global infrastructure with the aim of providing advanced services based on already available and still evolving interoperable communication and information technologies, interconnecting both physical and virtual things for the benefit of the information society. Generally, IoT is expected to provide cutting-edge connectivity of devices, systems and services going beyond machine-to-machine (M2M) communications, covering a wide variety of protocols, domains and applications [1].
One of the biggest challenges of IoT is that the information generated by devices is transmitted, shared and received at close to the speed of light over optical fiber and wireless networks. Information generation and collection rates are rapidly exceeding the limits of existing infrastructure. The massive amount of data generated by devices, known as Big Data, is complex and arrives and grows at exponential rates, which makes it difficult to capture, manage, process, plan, visualize and analyze the data by conventional techniques within a given time frame. Doug Laney, an analyst at META Group (now part of Gartner), an American research and advisory firm, defined the 3Vs model to describe Big Data and the challenges and opportunities associated with it. The 3Vs stand for volume, velocity and variety. Although this model was not originally intended to define Big Data, Gartner and many other enterprises, including IBM and some research departments of Microsoft, still use the 3Vs model to describe Big Data.

Figure 1 IBM's characterization of big data.
Efficient processing of large volumes of data requires technologies that can process the data within a tolerable time limit. One solution to the network congestion caused by Big Data generation is to adopt the latest mobile network generation, with much larger bandwidth and system capacity. The primary concern with current networks is that they are incapable of handling these tremendous amounts of data, as the 3Vs of the data are growing exponentially [2]. In the current fourth-generation (4G) telecommunication networks, periods of high traffic with data from sensors and cellular phones may cause problems such as dropped packets, queuing delays and retransmissions. Other difficulties include handover issues, congestion spots and ping-pong zones, which are identified through drive tests, reports from operation and maintenance centers or customer complaints [3]. There is a need for a new network technology that can handle trillions of devices reliably, with improved quality and higher transmission rates [4]. Fifth-generation mobile networks or wireless systems, abbreviated as 5G, are the proposed next telecommunications standard after the current 4G/IMT-Advanced standards [5].
The first generation of mobile telecommunications technology, known as 1G, was designed to connect people to people, while 5G aims to connect anything to anything. 5G is expected to provide a thousandfold increase in bandwidth per unit area, a one-millisecond end-to-end round-trip delay and around ten to one hundred times the number of connected devices.
The aim of Big Data, IoT and 5G is to connect anything to anything in the most efficient and reliable way while fulfilling the QoS and data transmission rate demands of users. It is predicted that by 2020 there will be 20.8 billion connected things with an average of 6.7 devices per person. 5G aims to provide the required bandwidth, device-to-device (D2D) and M2M communications, less delay and better reliability. However, with 5G we need disruptive technologies, which may lead to a revolution in future wireless mobile communication systems [6]. The problem with 4G is that it uses a reactive approach, i.e. it tackles problems after they have occurred [7]. We need insight from Big Data to optimize and automate the network. The throughput of 5G systems can be increased by intelligently optimizing the data from sensors and mobile phones.
In this paper we link IoT infrastructure to data analysis, with the aim of creating an end-to-end intelligent self-coordinating network for 5G. The advantages of this approach are efficient resource allocation and better spectrum utilization, and hence improved capacity, latency and quality of service (QoS).
This paper is organized as follows. Challenges faced in optimizing 5G networks using big data are described in Section 2. Our optimized network model is presented in Section 3. In Section 4, a case study is presented to support our work. Finally, in Section 5 we conclude the paper.
2 Challenges in Optimization
2.1 Multiple Radio Access Technology (MRAT)
Wireless networks today are highly heterogeneous, i.e. they consist of devices with different radio access technologies (RATs) that differ in range and QoS [8] as well as in setup time and energy consumption [9]. Working with different RATs, each with distinctive characteristics and requirements in terms of latency and reliability, is a major challenge for 5G systems.
2.2 Options of Enabling Technologies for 5G Radio
Currently many 5G test networks are being developed, each with a different focus. For example, the 5G Berlin test network mainly focuses on areas like 5G access technologies, core network technologies, optics and virtualization [10]. The 5G Dresden test network [11] covers the area of tactile Internet, i.e. the near-real-time interaction of people with objects (physical and virtual) [12]. New air interface technologies are the area of concern of the Surrey 5G Innovation Centre [13]. Merging all these viable options to produce one optimized network is still in progress. Some options of enabling technologies are the following.
2.2.1 Cloud Radio Access Network (CRAN)
Among all the viable options CRAN is one of the most prominent enabling network architectures for 5G systems. It offers better spectrum efficiency, capacity, energy efficiency and operational flexibility. It tries to resolve the problems associated with MRATs.
2.2.2 Network Function Virtualization (NFV)
Another option for improving system performance is NFV. NFV is a novel concept in networking research. Its aim is to virtualize network components such as NAT servers, firewalls and load balancers. Rather than running on dedicated hardware nodes, network functions are implemented in software on virtualized infrastructure.
2.2.3 Software Defined Network (SDN)
SDN and NFV are both being considered for future integration in the 5G architecture [14]. SDN is used for decoupling the data plane from the control plane. The SDN controller is responsible for forwarding the physical data to the SDN nodes and for remotely managing all the nodes.
In the Condense architecture for 5G IoT, the focus is that the network should participate actively in data analysis [15]. NFV and SDR are selected for this purpose in the 3GPP MTC architecture. An overview of CRAN, a key enabler for 5G, is presented by Rui Wang [16]. CRAN is aimed at meeting the ever-increasing capacity demands. The application of CRAN reduces the capital and operating expenditure burden that is constantly faced by operators.
Selecting viable technologies for 5G is still in the research phase. An optimized design needs to incorporate techniques that fulfill the goal of 5G, i.e. reduced latency and higher data rates.
2.3 Integrated Storage Model
Big Data cannot be analyzed by traditional tools and techniques. Instead, it requires more advanced techniques that make data retrieval, management and storage much faster. Among many other options, the most prominent storage models for capturing, managing, analyzing, optimizing and securing Big Data are SQL, NoSQL and NewSQL.
SQL databases have well-defined semantics. They offer interoperability and rich client-side transactions. Another advantage is their support for ad-hoc queries and reporting. On the downside, they do not provide a scale-out architecture. SQL databases are usually built for general purposes with a one-size-fits-all approach.
NoSQL databases are shaped by the kind of data that is provided to them. In the case of Big Data, data constantly arrive in different forms and from various sources, so the database has to scale quickly. NoSQL is ideal for this case.
NewSQL is a category of SQL databases that resolves the problems posed by traditional online transaction processing database management systems. It solves the performance and scalability issues, which are severe in SQL databases. However, one problem with NewSQL is that it is not as general-purpose as traditional SQL systems are.
Because of the large volume, value, variety and velocity of data in 5G architectures, selecting an optimal, scalable and secure storage system is one of the most important concerns of researchers.
2.4 Security and Privacy
Real-time monitoring of Big Data and providing an end-to-end secure architecture have always been a challenge. Although NoSQL databases were developed to handle the challenges posed by data analytics, data security was not part of the model. Sensitive data are routinely stored in the cloud without encryption. The problem with encrypting large data sets is that sharing and easy retrieval of data become difficult for users.
2.5 Key Performance Indicators (KPIs)
A unified set of KPIs needs to be identified by all the equipment vendors and network operators. The performance metrics for an efficient and optimized 5G network need to be quantified properly. A major drawback in the cellular communication sector is the lack of a standardized set of KPIs that can be used independently of hardware/software vendors, RATs and operators [17].
3 System Model
Figure 2 illustrates our proposed model for a self-optimized 5G network. The main idea of this design is to make the network intelligent by using NFV at the CRAN level, so that functions can be implemented at the baseband unit to separate the raw data of sensors and devices from the useful data. Because raw data occupy too much bandwidth, NFV extracts features from sensors and social networks. The extracted features are then encrypted and compressed to minimize the data bandwidth and prevent data leakage. Centralized CRAN supports resource pooling, coordination of radio resources, more cost-efficient processor sharing, scalability, more flexible hardware and simplified network management. Handover failure and interference problems are reduced and memory trunking gains are achieved.
The purpose of the SDN controller in the core network is to remotely manage all network nodes. It smoothly and flexibly controls traffic flows at various flow granularities and defines/redefines forwarding rules for data flows passing through the SDN network nodes. It allows better network management, policy enforcement and control. With the help of SDN, various network services can quickly adapt the virtual network topology by re-routing their data flows. Applications and services in the cloud and the user's mobile terminal can communicate with the help of SDN. Therefore, the network can be dynamically managed according to real-time needs, benefiting from resource virtualization.
Once the SDN in the core network has passed the data from the cells, subscribers, devices and networks to the cloud for optimization, the cloud needs time to process the data. The cloud environment for 5G therefore requires efficient techniques for query optimization and data-intensive processing. SQL is suited to structured, relational data, whereas NoSQL is useful for unstructured and semi-structured data. In a cloud environment, a combination of SQL and NoSQL is required to make the databases more powerful. Combining databases helps with on-demand scalability, database management, reliability, performance and elasticity. For the cloud, the two must be used in tandem to obtain the highest benefit.

Figure 2 Framework for self-optimized 5G.
In cellular networks, many network parameters are discarded after a short interval of time, whereas in 5G SON they are stored as part of Big Data in the cloud. Once in the cloud, the data are unified or diffused for analysis purposes. The right data are collected in the specific KPI set and unwanted data are filtered out. The values of the KPIs must satisfy the already set performance criteria so they match the targeted objectives. If there is drift or if there are anomalies in the values obtained, optimization of the data is performed using data analytical techniques. Heuristic, metaheuristic or efficient programming approaches are applied to the data for the purpose of optimization. After application of the data analytic techniques, the optimized/predicted values are passed back to the network as new network parameters for improving the performance of the overall system.
Hence, it can be stated that with CRAN, the predictive/optimized information in the centralized base band units (BBUs) has the capability to accurately tackle problems that occur or are about to occur without manual intervention.
4 Case Study
Next-generation wireless networks support several types of services with various QoS requirements. These services are confronted with severe issues because of the exponential growth of the number of mobile users and different types of applications [18]. In response to this, the Third Generation Partnership Project (3GPP) has adopted LTE. The fundamental objective of LTE is to minimize network congestion for different types of users while fulfilling QoS requirements. In LTE, traffic streams such as voice and video data fall under the class of guaranteed bit rate (GBR), which enforces certain resource demands to ensure guaranteed QoS [19].
In this case study, we looked at the resource allocation problem of LTE. Due to unplanned deployment of small cells, resource allocation and interference management in 4G networks are more problematic than in single-tier networks. Real-time interference management and resource allocation solutions are needed in 5G systems. The most popular solutions that have been proposed so far either require installation of hardware or manual reconfiguration of the device. However, our model does not require expensive or impractical techniques for load balancing.
In our model, the traffic values used to calculate the blocking probability, i.e. the probability that a customer is denied service because of a lack of resources in the LTE system, are captured using Wireshark.
We considered a 5 MHz LTE system with 25 resource blocks, as shown in Figure 3.

Figure 3 Traffic values for different scenarios.
The blocking probability for each cell is given by the Erlang B formula:
\[B = \frac{T^N / N!}{\sum_{k=0}^{N} T^k / k!}\] (1)
where
N = number of resource blocks utilized by the system
T = traffic of non-congested cells
B = blocking probability of congested cells
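As a worked illustration, the sketch below evaluates Equation (1) numerically. It assumes that Equation (1) is the standard Erlang B formula, in which the denominator sums T^k/k! for k = 0, ..., N. The function name `erlang_b` and the sample traffic values are hypothetical and chosen for illustration only; they are not the measured values of Figure 3.

```python
def erlang_b(traffic: float, channels: int) -> float:
    """Blocking probability B for offered traffic T (erlang) and N resource blocks."""
    if traffic < 0 or channels < 0:
        raise ValueError("traffic and channels must be non-negative")
    term = 1.0   # running value of T^k / k!, starting at k = 0
    total = 1.0  # running sum of T^k / k! for k = 0..N
    for k in range(1, channels + 1):
        term *= traffic / k
        total += term
    # After the loop, term holds T^N / N! (the numerator of Equation (1)).
    return term / total


if __name__ == "__main__":
    # Hypothetical traffic values for a 5 MHz LTE cell with 25 resource blocks.
    for t in (10.0, 17.5, 25.0):
        print(f"T = {t:5.1f} erlang -> B = {erlang_b(t, 25):.4f}")
```

Building each term from the previous one avoids computing large powers and factorials separately, which keeps the evaluation numerically well behaved for the cell sizes considered here.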
The blocking probabilities for each scenario are shown in Figure 4.

Figure 4 Blocking probabilities for different scenarios.
In the case of poor QoS, networks are configured and managed manually, which is costly and time-consuming. The manual solutions are also error-prone because of the exponential increase in the number of mobile users and network nodes.
Many techniques that have been proposed to solve the problem of resource allocation and load balancing are designed for a specific cellular system with unique MAC and physical layer specifications, making them impractical for newer generations such as LTE, LTE-A and 5G [20].
The solution we propose introduces self-organizing capabilities for network management with minimum human involvement. The blocking probabilities along with traffic values are saved as part of Big Data rather than being discarded after a short interval. The threshold values are set for normal behavior of the cells; a value above the threshold indicates an anomaly.
Two seven-cell clusters were considered, where the blocking probability is calculated for each cell. The reported KPI measurements (traffic values) are collected and stored in a database. With the help of an algorithm, the relationship between the KPI measurements is exploited. Cells whose blocking probability values lie above the threshold for a certain period, i.e. cells with very high traffic values and cells with very low traffic values, are forced to share resources until the blocking probability of the high-traffic cells drops to 0.02 (traffic value 17.5 erlang).
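The relationship between the 0.02 target and the 17.5 erlang figure can be checked numerically. The sketch below, assuming the Erlang B form of Equation (1) and 25 resource blocks, uses bisection to find the largest traffic that keeps blocking at or below a target and flags the cells whose blocking probability exceeds the threshold. The function names and the seven sample traffic values are hypothetical and are not taken from the measurements.

```python
def erlang_b(traffic: float, channels: int) -> float:
    """Erlang B blocking probability for traffic T (erlang) and N channels."""
    term = 1.0
    total = 1.0
    for k in range(1, channels + 1):
        term *= traffic / k
        total += term
    return term / total


def max_traffic(channels: int, target: float, tol: float = 1e-6) -> float:
    """Largest traffic (erlang) whose blocking probability stays <= target."""
    if not 0.0 < target < 1.0 or channels < 1:
        raise ValueError("need channels >= 1 and 0 < target < 1")
    lo, hi = 0.0, float(channels)
    # Grow the upper bound until it violates the target.
    while erlang_b(hi, channels) <= target:
        hi *= 2.0
    # Blocking increases with traffic, so bisection applies.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if erlang_b(mid, channels) <= target:
            lo = mid
        else:
            hi = mid
    return lo


def congested_cells(traffic_by_cell, channels: int, threshold: float):
    """Indices of cells whose blocking probability exceeds the threshold."""
    return [i for i, t in enumerate(traffic_by_cell)
            if erlang_b(t, channels) > threshold]


if __name__ == "__main__":
    cells = [24.0, 10.0, 12.0, 15.0, 9.0, 14.0, 21.0]  # hypothetical T1..T7
    print("traffic limit:", round(max_traffic(25, 0.02), 2), "erlang")
    print("congested cells:", congested_cells(cells, 25, 0.02))
```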
Whenever a cell experiences congestion, the code given below calculates how much traffic the cell should transfer and selects the cells for load balancing. In the algorithm, T1, T2, T3, T4, T5, T6 and T7 are the traffic values of the seven cells.
Pseudo code
// C#-style; requires "using System;" and "using System.Linq;".
// T1..T7 are placeholders for the measured traffic values of the seven cells.

// Traffic values and appropriate data collection
double[] arr = { T1, T2, T3, T4, T5, T6, T7 };
double avg = arr.Average();
double max = arr.Max();
double min = arr.Min();
double diff = 0;

// Calculation of differences between maximum, minimum and average traffic values
if (max - avg > avg - min)
{ diff = (max - avg) / arr.Length; }
else
{ diff = (avg - min) / arr.Length; }

// Load balancing: repeat until every cell lies within avg +/- diff
while (max > (avg + diff) || min < (avg - diff))
{
    // Absolute deviation of each cell from the average
    double[] temp = new double[arr.Length];
    Array.Copy(arr, temp, arr.Length);
    for (int i = 0; i < temp.Length; i++)
    { temp[i] = Math.Abs(temp[i] - avg); }
    double x = temp.Average();

    // Move x erlang from the most loaded cell to the least loaded cell
    Array.Sort(arr);
    arr[0] = Math.Round(arr[0] + x, 1);
    arr[arr.Length - 1] = Math.Round(arr[arr.Length - 1] - x, 1);
    min = arr.Min();
    max = arr.Max();
}
In our case the learning model constantly observes the behavior of the network and dynamically predicts it. Our model quickly guides the network to adjust and scale the load without human intervention.
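The load-balancing pseudo code above can be exercised with a small runnable sketch. The Python version below follows the same steps (average, tolerance `diff`, and repeated transfer of the mean absolute deviation from the most loaded to the least loaded cell). The function name `balance_load`, the iteration cap `max_iters` and the sample traffic values are assumptions added for illustration; the cap is a safeguard that is not part of the pseudo code.

```python
def balance_load(traffic, max_iters: int = 1000):
    """Return balanced traffic values following the steps of the pseudo code."""
    arr = list(traffic)  # work on a copy so the caller's data is untouched
    if not arr:
        raise ValueError("traffic must contain at least one cell")
    n = len(arr)
    avg = sum(arr) / n
    # Tolerance: the larger of (max - avg) and (avg - min), divided by the cell count.
    diff = max(max(arr) - avg, avg - min(arr)) / n

    for _ in range(max_iters):  # assumed safeguard against non-termination
        if max(arr) <= avg + diff and min(arr) >= avg - diff:
            break
        # Mean absolute deviation from the (fixed) average.
        x = sum(abs(t - avg) for t in arr) / n
        arr.sort()
        arr[0] = round(arr[0] + x, 1)    # least loaded cell receives traffic
        arr[-1] = round(arr[-1] - x, 1)  # most loaded cell sheds traffic
    return arr


if __name__ == "__main__":
    cells = [24.0, 10.0, 12.0, 15.0, 9.0, 14.0, 21.0]  # hypothetical T1..T7
    print("before:", cells)
    print("after: ", balance_load(cells))
```

Like the pseudo code, the sketch sorts the array on every pass, so positions in the result no longer correspond to the original cell indices; a deployment would need to track cell identifiers alongside the traffic values.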

Figure 5 Load balancing scenarios.

Figure 6 Blocking probabilities after balancing.
Table 1 shows different techniques used for resource allocation and load balancing, together with the problems associated with each technique.
Table 1 Comparison of different load balancing techniques.

| Authors | Techniques Discussed | Limitations |
|---|---|---|
| Aldaly et al. [21] | Installation of Wi-Fi access. | Requires installation of Wi-Fi access in every cell. |
| Huang et al. [22] | Algorithm and protocol based on Nash equilibrium. | Each base station must have complete information of the other base stations. |
| Honglin et al. [23] | The idea of load balancing is proposed by shifting the users on the edge of a cell with heavy traffic to adjacent cells. | This process may result in a huge number of handovers and delays. Furthermore, the traffic of adjacent cells is not known, which may result in degraded performance. |
| Sumita Mishra et al. [24] | Smart base station antennas, dynamically changing cellular coverage size and shapes per load distribution. | This approach is effective only when remote electrical tilt (RET) controllers are available. |
| Omar et al. [25] | Simulation studies carried out in a specific scenario. | These results may not be applicable to other network architectures. |
| Elshaer et al. [26] | A multi-tier network with sub-6-GHz macro cells and mm-wave small cells is analyzed. Load is characterized by using the average number of associated users in a cell. | A dynamic traffic model should be considered for realistic characterization. |
As can be seen from Table 1, various load-balancing methods have been proposed in the literature. However, some of these methods require installation of additional dedicated load balancers, and others require manual reconfiguration of the devices to handle newer services. The above-mentioned techniques are expensive, inefficient, time-consuming or impractical. Our proposed method overcomes the limitations listed in Table 1. It does not require installation of Wi-Fi access or remote electrical tilt controllers in every cell, and a base station does not require information about every other base station. Our method dynamically offloads highly loaded cells by shifting traffic to less loaded cells, which results in a fair redistribution of users and thus improves the average throughput per user.
5 Conclusion
Big Data driven schemes have potential benefits for mobile network optimization. In this work, we presented a CRAN-enabled 5G architecture integrated with Big Data analytics for network optimization. The objective of introducing intelligence into the network is to improve the quality of service experienced by the users and to leverage the investments in the entire network. A case study of Big Data driven optimization was presented to demonstrate how the solution improves the performance of mobile networks. As a next step, we plan to present the proposed data analytics using machine learning algorithms for 5G SON.
