Today, the volume, velocity, veracity, and variety of Big Data are the primary amplifiers of the security issues experienced in large-scale cloud infrastructures. The upsurge in security issues in Big Data installations has been driven predominantly by the overall increase in the volume and velocity of data. The sheer amount of data, however, is no longer the only factor creating new security challenges: dealing with the diversity of data sources (the variety of data) is quickly becoming another pressing concern, and one of the newest that Big Data infrastructures face.
The purpose of this article is to take a closer look at the variety of data, which together with volume, velocity, and veracity forms the four V’s of Big Data.
I outline my position on the importance of security in relation to the diversity of data sources and explain why large distributed cloud set-ups need to treat the variety of data as a likely source of security risk.
Data Variety and Data Security
What is Data Variety?
First, let’s briefly explain the ‘variety’ of data. Variety refers to the overall diversity of data sources handled by a Big Data solution. As we know, data “come in different formats and structures as well as different semantic models” (Big Data Security, 2014). The data are heterogeneous: structured and unstructured, text, multimedia, and other formats. They come from sensors and IoT devices, but also from customer-facing applications, database systems, operating systems, infrastructure logs, the public web, documents of all kinds, numerous media sources, and social networks.
It is therefore not uncommon for large businesses to gather data from a variety of sources and in an assortment of formats. Most organizations focus largely on correlating all of these sources into a single structured form, which is certainly one of the most important tasks when dealing with data variety. However, this concentration on unifying the data into a common format often overshadows the security challenges that come with the diversity of data, and security in general is frequently overlooked.
Data Variety Example
Let’s illustrate the problem of security and data variety with a very simple example. Assume a corporation that processes hundreds of terabytes of log files every day, collected from numerous operating systems and hardware devices, not all of which belong to the organization. The main challenge is, of course, to collect, aggregate, and transform the data before it is stored in a distributed Big Data solution, ready for processing, analysis, and visualization. What is often ignored, however, is the largely hidden issue of the security of the data collection process itself. “The data validation and filtering is a daunting challenge posed by untrusted input sources.” (Cloud Security Alliance, 2013).
The question becomes: how can we filter malicious input out of the data collection and aggregation process?
Data Variety and Data Security
One way to tackle the security problem is to run an ongoing filtering and audit process. Input validation is primarily about auditing information at the source, before it enters the central collection system. Allow me to outline some of the other necessary steps.
- Physical Security – the first step in ensuring the integrity of all data sources and sensitive information is securing the physical infrastructure against intruders, which highlights the importance of the data center (DC) and its physical security. This is achieved through monitoring and CCTV surveillance of the entire DC property and its surroundings, supervision of all access points, and restriction of access to selected, background-screened employees. On-site security guards, key cards, and fingerprint scanners need to become standard requirements for physical entry protection.
- Sensor Monitoring – refers to the security of IoT and sensor devices and the ability to control and audit the information these devices send. It also highlights the need to prevent adversaries from manipulating the data inputs of raw sensors.
- Data Encryption & Encrypted Communication – while it may not always be easy to manage multiple application and SSL certificates, this method usually pays off because it discourages intruders by dramatically increasing the complexity of an attack. It is therefore less a means of prevention than a deterrent.
- Variety and Data Security – a variety of schemes comes with a variety of requirements that must be met to ensure security. This speaks to the need to assess every data source from a security perspective.
- Disaster Recovery Security – as the variety of data increases, it becomes critical to have a solid backup and versioning system, which makes recovery from a data-related disaster simpler. The backup systems themselves must also be secured.
- Proactive Prevention – refers to the use of preventive measures that lower the probability of an attacker endangering the information contained in the data. This step concentrates mainly on preventing data contamination before the information enters the centrally managed collection system. The analysis methods used here primarily monitor for possible manipulation of the information sources by comparing incoming data with historical copies and trends. “Preventing an adversary from sending malicious input requires tamper-proof software and defenses against Sybil attacks. Research on the design and implementation of tamper-proof secure software has a very long history in both academia and industry.” (Cloud Security Alliance, 2013).
- Data Curation, AI, and Machine Learning – data curation can help address the issues related to the diversity of data sources, mainly by making it easier to navigate the numerous records that come from many Big Data sources. These tasks are increasingly automated, and machine learning and progressive learning algorithms are used to achieve higher confidence levels and thereby help ensure data quality.
- Data Filtering – the central filtering system acts as a continuous audit of the information contained in the data inserted into the central collection system. This includes scanning for malicious data added after the data collection phase.
These are just some of the methods and techniques traditionally employed to identify and remove liabilities from data, liabilities that are all magnified by the overall variety of data. While none of these methods is bulletproof and none alone can guarantee the safety of data against contamination, implementing at least the basic steps above decreases the likelihood of data contamination and increases the security of data.
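As a minimal sketch of the source-side validation and central filtering described above, the following Python example checks incoming log records against an allowlist of known sources and a simple per-record schema before they are admitted to the collection system. The record layout, the field names (`source`, `timestamp`, `level`, `message`), the `TRUSTED_SOURCES` set, and the length limit are all assumptions made for illustration; a real deployment would derive them from its own data sources.

```python
from datetime import datetime

# Hypothetical allowlist of sources permitted to submit records.
TRUSTED_SOURCES = {"web-01", "db-01", "sensor-17"}
# Assumed set of valid log levels and an assumed message size limit.
ALLOWED_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}
MAX_MESSAGE_LENGTH = 1024


def validate_record(record):
    """Return a list of problems found in one record (empty if acceptable)."""
    if not isinstance(record, dict):
        return ["record is not a mapping"]
    problems = []
    source = record.get("source")
    if not isinstance(source, str) or source not in TRUSTED_SOURCES:
        problems.append("unknown source")
    try:
        # Accept only ISO 8601 timestamps such as 2017-03-04T10:15:00.
        datetime.fromisoformat(str(record.get("timestamp")))
    except ValueError:
        problems.append("invalid timestamp")
    level = record.get("level")
    if not isinstance(level, str) or level not in ALLOWED_LEVELS:
        problems.append("invalid level")
    message = record.get("message")
    if not isinstance(message, str):
        problems.append("message is not text")
    elif len(message) > MAX_MESSAGE_LENGTH:
        problems.append("message too long")
    elif any(ord(ch) < 32 and ch != "\t" for ch in message):
        problems.append("control characters in message")
    return problems


def filter_records(records):
    """Split records into accepted ones and rejected ones with reasons."""
    accepted, rejected = [], []
    for record in records:
        problems = validate_record(record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected
```

Rejected records are kept together with the reasons for rejection rather than silently dropped, so that the filtering step can double as the audit trail the list above calls for.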
I want to conclude by saying that most enterprises should take the question of data variety very seriously. Introducing real-time audits before and after data collection, and monitoring the variety of data and the diversity of sources from which the business collects information, should be paramount for ensuring data security.
Auditing data sets and the information they contain needs to become one of the key tasks of the data collection process, as it can notify the organization whenever the information changes.
There is strong evidence that a general increase in data variety reduces data security and causes an overall decline in trust in the data. Without proper control and monitoring of data variety, the growing diversity of data will almost certainly make filtering and the prevention of malicious attacks progressively harder for an organization to accomplish.
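One possible way to picture such a change-detecting audit is to fingerprint every record and compare fingerprints between two points in time, for example before and after collection. The sketch below assumes that each record has a stable identifier and can be serialized as JSON; the function names `fingerprint`, `snapshot`, and `audit` are hypothetical and chosen only for this illustration.

```python
import hashlib
import json


def fingerprint(record):
    """Return a SHA-256 digest of a record's canonical JSON form."""
    # Sorting keys makes the digest independent of key order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def snapshot(dataset):
    """Map each record ID to the fingerprint of its record."""
    return {record_id: fingerprint(record) for record_id, record in dataset.items()}


def audit(previous, current):
    """Compare two snapshots and report added, removed, and changed IDs."""
    shared = previous.keys() & current.keys()
    return {
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
        "changed": sorted(k for k in shared if previous[k] != current[k]),
    }
```

A snapshot taken at the source and another taken after ingestion can then be passed to `audit`; any non-empty entry in the report indicates that information was added, lost, or altered in between and deserves a closer look.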
References
Big Data Security (2014) University of Liverpool. Available at: https://elearning.uol.ohecampus.com/bbcswebdav/institution/UKL1/201740JAN/MS_CKIT/CKIT_525/readings/UKL1_CKIT_525_Week08_LectureNotes.pdf (Accessed: 4 March 2017).
Ritchey, D. (2012) Big data, big security. Security, 49(7), pp. 28-30. (Accessed: 4 March 2017).
Cloud Security Alliance (2013) Top Ten big data security and privacy challenges. Available at: https://downloads.cloudsecurityalliance.org/initiatives/bdwg/Big_Data_Top_Ten_v1.pdf (Accessed: 4 March 2017).
Shacklett, M. (2016) How to cope with the big data variety problem. Available at: http://www.techrepublic.com/article/how-to-cope-with-the-big-data-variety-problem/ (Accessed: 5 March 2017).
Wen, M., Yu, S., Li, J., Li, H. and Lu, K. (2016) Big Data Storage Security. In: Big Data Concepts, Theories, and Applications. Springer International Publishing, pp. 237-255. (Accessed: 4 March 2017).
Sagiroglu, S. and Sinanc, D. (2013) Big data: A review. In: 2013 International Conference on Collaboration Technologies and Systems (CTS). IEEE, pp. 42-47. (Accessed: 5 March 2017).
Cardenas, A. A., Manadhata, P. K. and Rajan, S. P. (2013) Big data analytics for security. IEEE Security & Privacy, 11(6), pp. 74-76. (Accessed: 5 March 2017).