Thursday, 3 January 2013

Big Data – Security Implications!

Big Data is the buzzword these days. Gartner has listed Big Data as one of the top 10 technology trends for 2013 and beyond.

Big Data is an industry trend with several defining characteristics. The first is volume: data sizes measured in terabytes, petabytes, exabytes and beyond. To put it simply, the volume of data is several orders of magnitude larger than the traditional "small data" of the past, such as the data of a single enterprise. The second is velocity: the near real-time data that an organization collects formally and informally from various data sources. Velocity arises because data arrives from sources across geographies and time zones and, in quite a few cases, twenty-four hours a day. The third is variety, which in turn increases the velocity of data acquisition. Data variety includes social data gathered through formal channels such as blogs and feedback forms, as well as data coming in from social platforms such as Facebook and Twitter. All this data, when collected, aggregated and analyzed, constitutes the explosion of data in Big Data.

With Big Data comes the challenge of data security and privacy for organizations that handle this data and try to make sense of the information in it. We will try to uncover the security and privacy challenges organizations face with Big Data, with particular emphasis on organizations that use cloud infrastructure to power their business applications.

Big Data Security

Data security and data privacy are extremely important aspects to consider for any organization in an increasingly boundary-less, social and networked world. Big Data poses additional challenges because of the scale of data it presents to the enterprise.

Data that an organization collects can be classified according to the business objective it serves. Customer data that is essential for providing services to customers needs to be handled differently from social data that the organization collects formally or informally (for example, by monitoring tweets and Facebook messages). Customer data is typically data the customer creates directly by using an application or service that the organization provides. The organization typically uses and stores this data on behalf of the customer. Financial data and tax records, for example, are customer data that the customer pays a service to handle for various functional purposes. This data may be shared, fully or partially, with the organization that uses it on behalf of the customer, or it may remain private to the user while the organization uses it indirectly to provide some valuable service. The variations are many.

Social data is used more for data mining and analysis of user-provided data: gaining insights into user behavior and buying, or measuring user trends, to mention the important uses.

Secure Data Infrastructure 

With the advent of public cloud service providers (CSPs), data security takes on another dimension. How do CSPs secure data in their cloud infrastructure? The CSP needs to secure the data, and the applications that handle it, at the network level, the host level and the application level.

Network-level and host-level security are part of the SLAs that govern the data security agreement between an enterprise and the CSP. The CSP also needs to conform to industry compliance standards such as ISO 27001/27002 and audit standards such as SAS 70 and others.

Host-level security needs to take into account operating system versions, patches and known security vulnerabilities as published by the OS vendor. In addition, virtualization software and documented risks in virtual machines (Java VM, .NET, etc.) need to be factored in as well.

Application-level security compliance can be achieved by engineering web applications that conform to web security principles, such as the foundations and guidelines laid down by the Open Web Application Security Project (OWASP).

Secure Data Handling

Data also needs to be handled securely throughout its life cycle, depending on the priorities of how data is collected, stored, used, archived and disposed of. The data security life cycle needs to handle security at various stages:

•    Data transmission using secure transmission protocols
•    Data storage
•    Data processing, ensuring that data is handled securely while it is in an unencrypted state during processing.
•    Data lineage – to ensure that an audit trail is captured throughout the life cycle
•    Data provenance – to ensure that data is not only secure but also correct at any time.

All of the above security measures are a must for data stored in a third-party environment such as a public cloud operated by a CSP.
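
As an illustration of the data lineage and provenance points above, the sketch below keeps a hash-chained audit trail in Python: each entry stores the SHA-256 hash of the previous entry, so a later change to any recorded event breaks the chain. This is a minimal sketch, not a prescribed design. The function names, the event fields and the all-zero starting hash are assumptions made for the example, and a real system would also need to protect the trail itself (for instance by signing or externally anchoring the latest hash, since a chain alone cannot detect entries cut off from the end).

```python
import hashlib
import json

# Assumed placeholder used as the "previous hash" of the first entry.
GENESIS_HASH = "0" * 64


def _entry_hash(prev_hash, event):
    """Hash an event together with the hash of the preceding entry."""
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_event(trail, event):
    """Append a life-cycle event (a JSON-serializable dict) to the trail."""
    prev_hash = trail[-1]["hash"] if trail else GENESIS_HASH
    trail.append({
        "event": event,
        "prev": prev_hash,
        "hash": _entry_hash(prev_hash, event),
    })


def verify_trail(trail):
    """Return True if every entry still links correctly to the one before it."""
    prev_hash = GENESIS_HASH
    for entry in trail:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    trail = []
    append_event(trail, {"stage": "collected", "actor": "ingest-service"})
    append_event(trail, {"stage": "stored", "actor": "storage-service"})
    print(verify_trail(trail))
```

Editing or reordering a recorded entry makes `verify_trail` return False, which is the property an audit trail needs in order to support claims about where data came from and how it was handled.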

Data Access Identity Management
In a typical organization where applications are deployed internally or in private data centers, security is based on the organization’s trust boundary. The trust boundary encompasses the network, systems, and applications hosted in a private data center managed by the IT department (sometimes by third-party providers under IT supervision). Access to the network, systems, and applications is secured via network security controls including virtual private networks (VPNs), intrusion detection systems (IDSs), intrusion prevention systems (IPSs), and multifactor authentication.

However, in a cloud environment the organization’s trust boundary extends into the realm of the cloud service provider. This may already be the case for most large enterprises engaged in e-commerce, supply chain management, outsourcing, and collaboration with partners and communities. It is imperative for the organization to identify the identity management services offered by the cloud provider, to ensure that data access is controlled according to the organization’s defined access roles.
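
To make "access controlled by organization-defined roles" concrete, here is a minimal Python sketch of a role-based check. The role names, the permission strings and the `is_allowed` helper are all hypothetical and do not correspond to any particular cloud provider's identity service; in practice the mapping would live in the provider's identity and access management system rather than in application code.

```python
# Hypothetical role-to-permission mapping defined by the organization.
ROLE_PERMISSIONS = {
    "analyst": {"read:social_data"},
    "account_manager": {"read:customer_data"},
    "data_admin": {
        "read:customer_data",
        "write:customer_data",
        "read:social_data",
    },
}


def is_allowed(user_roles, permission):
    """Return True if any of the user's roles grants the permission.

    Unknown roles grant nothing, so access is denied by default.
    """
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in user_roles
    )


if __name__ == "__main__":
    print(is_allowed(["analyst"], "read:social_data"))
    print(is_allowed(["analyst"], "read:customer_data"))
```

The deny-by-default behavior matters here: a role that the organization has not defined should never widen access just because the data now sits outside the traditional trust boundary.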

Privacy Issues
Data privacy is a widely discussed and debated topic for any data that enterprises collect, formally or informally. There is no universal agreement across nations and cultures on what data is private and what is not. Privacy laws and rights govern how private data is collected, used, stored, interpreted and disposed of, yet there are many ambiguities in what constitutes PII (personally identifiable information). Data collected from user contributions and social media contains private data that can be traced back to the identity of a particular individual. Securing such data is part of data governance, through policy measures such as the removal of personal data related to race, gender, age, contact details, credit rating, and loan and credit card details. Data mining techniques aggregate personal data for meaningful analysis, for the purpose of predicting user behavior and testing hypotheses. At the same time, any restrictions that users place on how their data may be used and shared need to be strictly adhered to. The fine line between what is private and what is public in user-contributed data is difficult to ascertain in Big Data.
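
The policy measure of removing personal data before analysis might be sketched as follows in Python. The field names and the `user_id` key are assumptions chosen to mirror the categories listed above, not a standard schema. The sketch also replaces the user identifier with a keyed hash (HMAC-SHA256) so that records from the same user can still be aggregated; note that this is pseudonymization rather than full anonymization, since the remaining fields may still allow re-identification.

```python
import hashlib
import hmac

# Hypothetical field names; a real list would come from the governance policy.
SENSITIVE_FIELDS = {
    "race", "gender", "age", "contact",
    "credit_rating", "loan", "credit_card",
}


def pseudonymize(user_id, secret_key):
    """Replace a user identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(
        secret_key, user_id.encode("utf-8"), hashlib.sha256
    ).hexdigest()


def scrub_record(record, secret_key):
    """Return a copy without sensitive fields and with a pseudonymous id."""
    cleaned = {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}
    if "user_id" in cleaned:
        cleaned["user_id"] = pseudonymize(str(cleaned["user_id"]), secret_key)
    return cleaned


if __name__ == "__main__":
    key = b"demo-key-kept-outside-the-dataset"  # assumption: managed secret
    record = {"user_id": "u42", "age": 31, "contact": "x@example.com",
              "tweet": "hello"}
    print(sorted(scrub_record(record, key)))
```

Using a keyed hash rather than a plain hash means that someone who holds the scrubbed data but not the key cannot simply hash known user identifiers to reverse the mapping.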

Strictly safeguarding the privacy of data is virtually impossible when the data needs to be shared with government bodies, such as surveillance agencies, taxation authorities and others that need access to private data. The problem takes on a larger dimension with the size and scope of the data: the channels of data collection vary by source and are not easily manageable, and the lowest level of data comes from an individual, who may or may not agree with an organization’s view of what constitutes data privacy.

We have looked at the challenges of securing data as part of Big Data collection, and at the various dimensions of security that an organization needs to consider in order to use Big Data applications meaningfully.