Privacy Protection

Protecting data privacy is challenging: there is no “one-size-fits-all” privacy approach. The choice of approach depends on:

  • Intended data use
  • Accuracy constraints
  • Performance constraints
  • Adversarial model

The AMIS Privacy Framework is a comprehensive framework that encapsulates the most prominent privacy models:

  • Syntactic models (generalization/suppression)
  • Semantic models (differential privacy)
  • Cryptographic models (searchable encryption)

The framework supports a diverse set of data uses (e.g. operations, statistics, and data mining).

Functionality Overview

  1. Online
    1. operates on streaming data
    2. high-throughput, limited accuracy
  2. Offline
    1. queries repository with historical flow data
    2. complex strategies for increasing accuracy

Online Mode

This mode is suitable for network monitoring tasks. Users register for continuous queries, and the flow data are generalized and streamed to them. In this mode we use k-anonymity for online anonymization of the streams. The main thread of our program connects to a RabbitMQ exchange and receives the continuous streams from it. We apply Hilbert sorting (computing a Hilbert coordinate for each data stream), then anonymize the streams in buckets whose members present the same characteristics (each bucket holds between k and 2*k-1 values), and present those buckets to the user.
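The sorting and bucketing steps described above can be illustrated with a minimal sketch. It assumes that each item has already been reduced to two small non-negative integer attributes, which is a simplification, since real flows carry more attributes. The names hilbert_index, make_buckets and anonymize are hypothetical and are not taken from the AMIS code; generalizing each bucket to a (min, max) range per attribute is likewise only one possible choice.

```python
def hilbert_index(order, x, y):
    """Distance of cell (x, y) along a Hilbert curve over a 2**order x 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the next level sees a canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def make_buckets(items, k):
    """Split an ordered list into consecutive buckets of k to 2*k - 1 items."""
    if k < 1:
        raise ValueError("k must be at least 1")
    full = len(items) - len(items) % k
    buckets = [list(items[i:i + k]) for i in range(0, full, k)]
    if buckets and full < len(items):
        buckets[-1].extend(items[full:])  # leftovers join the last bucket
    return buckets  # empty when fewer than k items are available


def anonymize(points, k, order):
    """Sort points along the Hilbert curve, bucket them, and publish only ranges."""
    ordered = sorted(points, key=lambda p: hilbert_index(order, p[0], p[1]))
    result = []
    for bucket in make_buckets(ordered, k):
        xs = [p[0] for p in bucket]
        ys = [p[1] for p in bucket]
        result.append({"x": (min(xs), max(xs)),
                       "y": (min(ys), max(ys)),
                       "count": len(bucket)})
    return result


if __name__ == "__main__":
    sample = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 2), (3, 3), (2, 2)]
    for generalized in anonymize(sample, k=3, order=2):
        print(generalized)
```

Because points that are close on the Hilbert curve are also close in the original attribute space, consecutive buckets tend to produce narrow ranges. Letting the last bucket absorb the leftovers is what keeps every bucket size between k and 2*k-1.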

Offline Mode

The offline mode is suitable for data analysis, research, and forensics. Flow data is stored outside AMA in a data center (an optional pre-processing step could reduce the data size). In our approach we use Hadoop to query the repository, which may include historical data. Complex query strategies are needed, and longer query execution times are acceptable.

In this mode the data is saved on a Hadoop/HBase cluster, and we create differentially private answers to a set of pre-established queries (the queries that a group of users would be most interested in). The saved data is raw data (there is no privacy protection at this stage). We query the raw data to create a set of contingency tables that answer the most frequently asked questions, and we add noise to the results using the Laplace mechanism. When a user asks one of these queries, the user gets the differentially private answer back.
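As an illustration of the noise-addition step, the following is a minimal sketch of the Laplace mechanism applied to a contingency table of counts. It assumes that adding or removing one flow changes exactly one cell by 1 (sensitivity 1), so each cell receives noise drawn from a Laplace distribution with scale sensitivity/epsilon. The function names, the example attributes and the explicit cell domain are hypothetical; the sketch runs on an in-memory list rather than on Hadoop/HBase.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def contingency_table(flows, keys):
    """Count flows for every observed combination of the given attributes."""
    return Counter(tuple(flow[key] for key in keys) for flow in flows)


def private_table(table, domain, epsilon, rng, sensitivity=1.0):
    """Return noisy counts for every cell in the domain, including empty cells.

    Noise is added to unobserved cells as well, because publishing only the
    observed cells would reveal which combinations occur in the raw data.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    return {cell: table.get(cell, 0) + laplace_noise(scale, rng) for cell in domain}


if __name__ == "__main__":
    flows = [
        {"IP_PROTO": 6, "PORT_DST": 443},
        {"IP_PROTO": 6, "PORT_DST": 443},
        {"IP_PROTO": 17, "PORT_DST": 53},
    ]
    table = contingency_table(flows, ["IP_PROTO", "PORT_DST"])
    domain = [(6, 443), (17, 53), (6, 80)]  # assumed set of cells to publish
    print(private_table(table, domain, epsilon=0.5, rng=random.Random()))
```

A smaller epsilon gives stronger privacy and noisier answers. Since the noisy tables are computed once for the pre-established queries, every user who asks the same query can be given the same stored answer.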

NetFlow Schema

The NetFlow Protocol Version 9 standard developed by Cisco provides a representation of a network flow called a flow record, containing a number of fields describing the traffic [CISCO]. NetFlow v9 defines over 100 field types, but only a few are needed for our use case. We receive flows in the form of export packets (diagram shown in Figure 3.1) from various routers on the Internet2 network, and extract the necessary fields for collection. An export packet may contain multiple flow records, as well as templates that dictate which fields to extract from each record.
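To make the export packet layout more concrete, here is a minimal sketch that splits a NetFlow v9 export packet into its FlowSets. It relies on the published v9 layout: a 20-byte packet header followed by FlowSets, each starting with a 2-byte FlowSet ID and a 2-byte length, where ID 0 marks a template FlowSet, ID 1 an options template FlowSet, and IDs above 255 mark data FlowSets. The function name split_flowsets and the returned dictionary keys are hypothetical, and the sketch does not decode templates or records.

```python
import struct

# version, count, sysUpTime, UNIX secs, sequence number, source ID
PACKET_HEADER = struct.Struct("!HHIIII")
# FlowSet ID, FlowSet length (the length includes these 4 header bytes)
FLOWSET_HEADER = struct.Struct("!HH")


def flowset_kind(flowset_id):
    """Classify a FlowSet by its ID."""
    if flowset_id == 0:
        return "template"
    if flowset_id == 1:
        return "options_template"
    if flowset_id >= 256:
        return "data"
    return "reserved"


def split_flowsets(packet):
    """Return the parsed packet header and a list of (kind, id, body) FlowSets."""
    version, count, uptime, secs, sequence, source_id = PACKET_HEADER.unpack_from(packet, 0)
    if version != 9:
        raise ValueError("not a NetFlow v9 export packet")
    header = {"count": count, "sys_uptime": uptime, "unix_secs": secs,
              "sequence": sequence, "source_id": source_id}
    flowsets = []
    offset = PACKET_HEADER.size
    while offset + FLOWSET_HEADER.size <= len(packet):
        flowset_id, length = FLOWSET_HEADER.unpack_from(packet, offset)
        if length < FLOWSET_HEADER.size or offset + length > len(packet):
            raise ValueError("malformed FlowSet length")
        body = packet[offset + FLOWSET_HEADER.size:offset + length]
        flowsets.append((flowset_kind(flowset_id), flowset_id, body))
        offset += length
    return header, flowsets
```

A collector built along these lines would remember the templates it has seen per source ID and use them to decode the data FlowSets that reference them.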

The NetFlow attributes to be collected into HBase are:

Field Type   Length (bytes)   Description
IN_BYTES     4                Number of incoming bytes for a network flow.
IN_PKTS      4                Number of incoming packets for a network flow.
IP_PROTO     1                IP protocol associated with the flow (e.g. TCP, UDP, ICMP).
PORT_SRC     2                TCP/UDP Layer 4 source port.
IP_SRC       4                IPv4 source IP address.
PORT_DST     2                TCP/UDP Layer 4 destination port.
IP_DST       4                IPv4 destination IP address.
OUT_BYTES    4                Number of outgoing bytes for a network flow (we compute this).
OUT_PKTS     4                Number of outgoing packets for a network flow (we compute this).

In addition to the attributes above, we also save the "stamp_inserted" attribute of the flow in HBase.
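The field lengths listed above are enough to decode a data record once the field order is known. The sketch below assumes a hypothetical template in which the seven received fields appear in the order of the table; in practice the order comes from the template sent by the router. OUT_BYTES and OUT_PKTS are left out because they are computed rather than received, and "stamp_inserted" is left out because it is not part of the record bytes. The name decode_record is an assumption and not part of the AMIS code.

```python
import ipaddress

# Hypothetical field order; a real collector takes it from the v9 template.
TEMPLATE = [
    ("IN_BYTES", 4),
    ("IN_PKTS", 4),
    ("IP_PROTO", 1),
    ("PORT_SRC", 2),
    ("IP_SRC", 4),
    ("PORT_DST", 2),
    ("IP_DST", 4),
]
ADDRESS_FIELDS = {"IP_SRC", "IP_DST"}


def decode_record(data, template=TEMPLATE):
    """Decode one flow record into a dict, reading fields in template order."""
    record = {}
    offset = 0
    for name, length in template:
        raw = data[offset:offset + length]
        if len(raw) != length:
            raise ValueError("record is shorter than the template requires")
        if name in ADDRESS_FIELDS:
            record[name] = str(ipaddress.IPv4Address(raw))  # dotted-quad string
        else:
            record[name] = int.from_bytes(raw, "big")  # network byte order
        offset += length
    return record


if __name__ == "__main__":
    sample = (
        (1500).to_bytes(4, "big") + (3).to_bytes(4, "big") + bytes([6])
        + (443).to_bytes(2, "big") + bytes([10, 0, 0, 1])
        + (51000).to_bytes(2, "big") + bytes([192, 168, 1, 20])
    )
    print(decode_record(sample))
```

Keeping the template as data rather than hard-coding offsets means the same decoder can follow whatever field order a given router announces.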