I compiled the following notes to understand the fundamentals of federated learning:

Most current data mining and machine learning techniques (like process mining) work by gathering all datasets into a central site and then running various algorithms against that data. In many practical cases, privacy concerns prevent the building of such a centralised data warehouse. Training log data may thus be distributed among several custodians, none of whom are allowed to transfer their data to another site.

Researchers from multiple communities, such as the database, data mining, statistical disclosure control, and cryptography communities, have explored the field of distributed data mining and machine learning independently. Recently, federated learning and federated analytics have gained popularity for collaboratively learning a shared machine learning model from distributed data. Federated learning is defined as a machine learning setting where multiple entities (clients) collaborate in solving a machine learning problem under the coordination of a central server or service provider. Each client's raw data is stored locally and not exchanged or transferred; instead, focused updates intended for immediate aggregation are used to achieve the learning objective.

Federated learning enables a set of participants, working with a trusted server, to collaboratively learn a shared machine learning model while keeping each participant's data within its own local systems.

In federated learning, a server is an agent responsible for orchestrating the whole learning process. The goal is to collaboratively learn a shared global consensus model with a loose federation of participating nodes, coordinated by the central server, such that the final model generalizes to a test dataset $\mathcal{D}_{\text{test}}$ without compromising the privacy of the data in individual datasets. The server is the custodian of the shared model, which is initialized at random. The model consists of a global parameter vector $\mathbf{w}_{G} \in \mathbb{R}^{n}$, where $n$ is the dimensionality of the parameter space. Client nodes contribute compute for training and evaluation, for example using on-premise computing infrastructure or a private-cloud service.

The workhorse for optimization is Federated Averaging (FedAvg): we can perform distributed stochastic optimization for large-scale deep neural networks such as LSTMs by combining local stochastic gradient descent (SGD) on each client with a server that performs model averaging.
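
As a concrete toy illustration of this loop, here is a minimal FedAvg sketch in Python, assuming linear-regression clients on synthetic data; the learning rate, epoch/round counts, and client setup are illustrative assumptions, not part of the original notes.

```python
# Minimal FedAvg sketch (illustrative only): linear-regression clients on
# synthetic data. The learning rate, epoch/round counts, and data shapes
# are assumptions made for this example.
import numpy as np

def local_sgd(w, X, y, lr=0.01, epochs=5):
    """A few epochs of plain SGD on one client's private data (squared loss)."""
    w = w.copy()
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            grad = (xi @ w - yi) * xi        # gradient of 0.5 * (xi.w - yi)^2
            w -= lr * grad
    return w

def federated_averaging(clients, w_global, rounds=10):
    """Server loop: broadcast the global model, collect locally trained
    models, and average them weighted by each client's number of examples."""
    for _ in range(rounds):
        local_models, sizes = [], []
        for X, y in clients:                 # each client trains on-premise
            local_models.append(local_sgd(w_global, X, y))
            sizes.append(len(X))
        total = sum(sizes)
        w_global = sum((n / total) * w for n, w in zip(sizes, local_models))
    return w_global

# Toy run: 3 clients, global parameter vector w_G in R^2.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + 0.1 * rng.normal(size=50)
    clients.append((X, y))
w_G = np.zeros(2)                            # server-side initial model
print(federated_averaging(clients, w_G))     # should approach [2.0, -1.0]
```

Note that only model parameters travel between clients and server; the raw $(X, y)$ pairs never leave the client loop.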

Practical Challenges and Considerations

In practical settings, we face a number of challenges that a naive implementation of federated averaging might not be able to handle.

Data Challenges:

The first challenge is non-IID data, arising from non-identical client distributions, violations of independence, and dataset drift. The data distributions of participating clients can differ greatly and be highly skewed. To tackle this problem, several consensus-based and pluralistic solutions have been proposed. Mohri et al., for example, propose agnostic federated learning, a consensus solution in which the centralized model is optimized for any possible target distribution. Another promising direction is multi-task learning.
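
To make the notion of skewed client distributions concrete, the sketch below simulates label-skewed (non-IID) clients with a Dirichlet split over class labels, a common way to generate such partitions in federated learning experiments; the dataset, number of clients, and `alpha` value are illustrative assumptions.

```python
# Illustrative sketch: simulate label-skewed (non-IID) clients by drawing each
# client's share of every class from a Dirichlet distribution. Smaller `alpha`
# means more skew. The synthetic labels and client count are assumptions.
import numpy as np

def dirichlet_partition(labels, n_clients=5, alpha=0.3, seed=0):
    rng = np.random.default_rng(seed)
    n_classes = labels.max() + 1
    client_indices = [[] for _ in range(n_clients)]
    for c in range(n_classes):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        # Proportion of class c assigned to each client.
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        splits = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, chunk in zip(client_indices, np.split(idx, splits)):
            client.extend(chunk.tolist())
    return client_indices

labels = np.random.default_rng(1).integers(0, 10, size=1000)
for i, part in enumerate(dirichlet_partition(labels)):
    print(f"client {i}: label counts {np.bincount(labels[part], minlength=10)}")
```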

Communication:

Secondly, adapting centralized training workflows such as hyper-parameter tuning, neural architecture design, debugging, and interpretability tasks to the federated learning setting presents roadblocks to the widespread adoption of FL in practical settings, and hence constitutes an important set of open problems. Various model compression techniques have been proposed to reduce the communication cost of exchanging updates.
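
As one example of such a compression technique (chosen here purely for illustration, not named in the notes), the sketch below applies top-k sparsification to a client update, so only the k largest-magnitude coordinates and their indices are transmitted; the dimensions and k are assumptions.

```python
# Sketch of one simple update-compression technique: top-k sparsification.
# Each client sends only the k largest-magnitude entries of its model update
# (values plus indices) instead of the full dense vector. k is an assumption.
import numpy as np

def sparsify_topk(update, k):
    """Keep the k largest-magnitude coordinates of a flat update vector."""
    idx = np.argpartition(np.abs(update), -k)[-k:]
    return idx, update[idx]

def densify(idx, values, dim):
    """Server side: rebuild a dense vector from the sparse message."""
    dense = np.zeros(dim)
    dense[idx] = values
    return dense

dim, k = 10_000, 100                         # send roughly 1% of the coordinates
update = np.random.default_rng(0).normal(size=dim)
idx, values = sparsify_topk(update, k)
recovered = densify(idx, values, dim)
print("compression ratio:", dim / k)
print("kept energy:", np.linalg.norm(recovered) / np.linalg.norm(update))
```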

Privacy Considerations:

An important challenge in fully decentralized learning is to prevent any client from reconstructing the private data of another client from its shared updates while maintaining a good level of utility for the learned models.

Privacy Threats:

  • Identity disclosure (or re-identification): This is arguably the most notorious threat in publishing medical data. It occurs when an attacker can associate a patient with their record in a published dataset.
  • Membership disclosure: This threat occurs when an attacker can infer with high probability that an individual’s record is contained in the published data.
  • Attribute disclosure (or sensitive information disclosure): This threat occurs when an individual is associated with information about their sensitive attributes.

We want to use techniques that allow us to learn useful information about a population while preserving the privacy of individuals (e.g., by protecting against linkage attacks). Generally speaking, it is challenging for fully anonymised data to retain utility.

Secure Multi-Party Computation:

A popular method that guarantees privacy without compromising accuracy is secure multi-party computation (MPC). Using MPC, multiple parties collaborate to compute a common function of interest without revealing their private inputs to the other parties. An MPC protocol is considered secure if the parties learn only the final result and no other information.

Formally, consider $n$ parties $P_{1}, \ldots, P_{n}$ that hold private inputs $x_{1}, \ldots, x_{n}$ and wish to compute some arbitrary function $\left(y_{1}, \ldots, y_{n}\right)=f\left(x_{1}, \ldots, x_{n}\right)$, where the output of $P_{i}$ is $y_{i}$. MPC enables the parties to compute the function using an interactive protocol in which each party $P_{i}$ learns exactly $y_{i}$, and nothing else.

For example, a group of employees might want to compute their average salary without any employee revealing their individual salary to any other employee. This task can be completed using MPC, such that the only information revealed is the result of the computation (i.e. the average salary). If each pair of employees holds a large, arbitrary, shared number, such that one employee will add it to their salary and the other will subtract it, then the result of the computation will not change, but no one will know any employee’s real salary.
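
A toy numeric sketch of this pairwise-masking trick, with made-up salaries, is shown below: the aggregator only ever sees masked values, yet the masks cancel in the sum, so the average is unchanged.

```python
# Toy sketch of the pairwise-masking idea from the salary example. Each pair
# of employees shares a random mask; one adds it to their report, the other
# subtracts it, so the masks cancel in the sum while each report looks random.
import random

salaries = [52_000, 61_000, 48_000, 75_000]   # private inputs (made up)
n = len(salaries)
masked = list(salaries)

for i in range(n):
    for j in range(i + 1, n):
        mask = random.randrange(1, 10**9)     # shared secret between i and j
        masked[i] += mask                     # one party adds the mask
        masked[j] -= mask                     # the other subtracts it

# The aggregator only ever sees the masked values.
print("masked reports:", masked)
print("true average:  ", sum(salaries) / n)
print("masked average:", sum(masked) / n)     # identical: the masks cancel
```

The same masking can be applied coordinate-wise to model weight vectors, which is the intuition behind the secure weighted average protocol mentioned next.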

The same idea can be applied to federated learning by having the parties use a secure weighted average protocol, under which each client encrypts their model weights, but the server can still calculate the weighted average on the encrypted data.

MPC protects the computation inputs from exposure to the server, but the exact final result is revealed to all parties by design. Unfortunately, for some types of computation, the final result can be used to reveal information about the inputs. For example, in the case of employees computing their average salary, once the result is known, if all but one of the employees work together, they can easily determine the salary of the final employee given the output (average salary). A secure learning approach based only on MPC may not be ideal for these cases.
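
The collusion attack on the salary example reduces to simple arithmetic, as the snippet below illustrates with the same made-up salaries.

```python
# Why MPC alone is not enough: if all but one employee collude, the published
# average plus their own salaries pins down the remaining salary exactly.
salaries = [52_000, 61_000, 48_000, 75_000]   # made-up private inputs
average = sum(salaries) / len(salaries)       # the legitimately published output

colluders = salaries[:-1]                     # everyone except the last employee
recovered = average * len(salaries) - sum(colluders)
print(recovered == salaries[-1])              # True: the last salary is exposed
```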

By applying differential privacy on top of MPC, we can construct a federated learning system that protects from even this type of extreme collusion attack.

Differential Privacy:

Instead of attempting to scrub a dataset until it is “perfectly anonymized”, an impossible target, we could try a second approach: releasing just the result of a computation based on the private data, without releasing the dataset itself. This is considerably more private, and, if done carefully, can leak almost no individual information at all.

Differential privacy is a formal, mathematical conception of privacy preservation with roots in cryptography and information theory. It offers the following promise: “You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources are available.”

An algorithm is differentially private when it injects a precisely calibrated quantity of noise into any statistical query, masking the possible contribution of any one individual to the result. It provides a gold-standard definition of privacy protection for data scientists, including industry analysts, scientific researchers, and data-driven policy makers, who want to analyze data containing personal information that must remain private. It is then mathematically provable that no possible combination of queries or model results can tease out information specific to any individual data subject. This is a nontrivial guarantee: in the absence of sufficient noise, or when differential-privacy theory is not applied, adversaries can combine many aggregated statistics to infer sensitive attributes of specific individuals.

Formally, a randomized algorithm $f$ is $\varepsilon$-differentially private if, for all datasets $d_{1}$ and $d_{2}$ differing in a single record and all sets of outputs $S$,

\[ \operatorname{Pr}\left[f\left(d_{1}\right) \in S\right] \leq e^{\varepsilon} \cdot \operatorname{Pr}\left[f\left(d_{2}\right) \in S\right] \]

Differential privacy states that if two databases differ in only one element, the outputs of the algorithm on them are statistically indistinguishable. In particular, if an observer cannot tell whether the element is in the dataset or not, she will not be able to determine anything else about the element either. DP is a property of the algorithm: intuitively, we try to prove that no individual contributing data has a large impact on the output of the algorithm. In other words, DP ensures that the addition or removal of a single record does not substantially affect the outcome of any analysis, and it is therefore widely studied in federated learning research to prevent indirect leakage.

Achieving a desired level of differential privacy can require adding a great deal of accuracy-reducing noise. Several algorithms have been proposed for calibrating this noise.
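
As an example of one such calibration, the classic Laplace mechanism adds noise whose scale is the query's sensitivity divided by $\varepsilon$. The sketch below applies it to a counting query on a made-up dataset; the query, sensitivity, and $\varepsilon$ values are illustrative assumptions.

```python
# Sketch of the classic Laplace mechanism: to answer a query with sensitivity
# Δf under ε-differential privacy, add Laplace noise with scale Δf / ε.
# The dataset, query, and ε values below are illustrative assumptions.
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_answer + noise

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=1_000)       # pretend private dataset

# Counting query "how many people are over 65": adding or removing one person
# changes the count by at most 1, so the sensitivity is 1.
true_count = int((ages > 65).sum())
for eps in (0.1, 1.0, 10.0):
    noisy = laplace_mechanism(true_count, sensitivity=1, epsilon=eps, rng=rng)
    print(f"eps={eps}: true={true_count}, noisy={noisy:.1f}")
```

Smaller $\varepsilon$ gives stronger privacy but noisier answers, which is exactly the accuracy trade-off described above.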


Useful Papers: