Integration of PSI-CA and Differential Privacy for APPI Compliance

This note consolidates the future work items and memoranda for “Integration of PSI-CA and Differential Privacy for APPI Compliance”（個人情報保護法準拠に資するPSI-CAと差分プライバシーの統合設計）.

Future Work#

Mitigation against malicious adversaries
- Dummy validation
  
  This can be viewed as a form of input poisoning. If we address this issue, we may also need to validate whether the dataset $V_B$ provided by $P_B$ is well-formed and correct.
  Applicability of Merkle Trees
  Merkle-tree-based verification is insufficient for two reasons: (i) it does not hide size information, and (ii) a Merkle proof can certify only a narrow class of statements (essentially, membership in a committed set). Concretely, all of the following must be guaranteed:
  1. $P_A$ must not learn $\lvert C_B \rvert$ (effectively $\lvert u_1 \rvert$), the size of the subset $C_B$ that $P_B$ generates from the common dummy-identifier set.
  2. The set $C_B$ selected by $P_B$ must satisfy $C \supseteq C_B$.
  3. For each $c_i \in C_B$, the corresponding elements of $W_B$ must be proven to have actually been used within the PSI-CA protocol execution.
  Therefore, cryptographic guarantees (e.g., SNARKs) and/or assurances at another layer (e.g., TEEs or auditing) are required.
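To make the limitation concrete, the following is a minimal sketch of a Merkle inclusion proof over a common dummy-identifier set, assuming SHA-256 and a simple "duplicate the last node" rule for odd levels. The helper names (`build_levels`, `prove`, `verify`) and the byte-string identifiers are hypothetical and not part of the design above. The proof certifies only that one element belongs to the committed set.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(leaves):
    # Level 0 holds the leaf hashes; each higher level hashes adjacent pairs.
    level = [h(b"leaf:" + x) for x in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # assumption: duplicate last node on odd levels
        level = [h(b"node:" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def prove(levels, index):
    # Collect the sibling hash at every level below the root.
    path = []
    for level in levels[:-1]:
        sib = index ^ 1
        sibling = level[sib] if sib < len(level) else level[index]
        path.append((sibling, index % 2))  # flag 1: current node is a right child
        index //= 2
    return path

def verify(root, leaf, path):
    node = h(b"leaf:" + leaf)
    for sibling, is_right in path:
        node = h(b"node:" + sibling + node) if is_right else h(b"node:" + node + sibling)
    return node == root

# Hypothetical common dummy-identifier set C with five elements.
C = [b"c0", b"c1", b"c2", b"c3", b"c4"]
levels = build_levels(C)
root = levels[-1][0]
proof = prove(levels, 2)
```

In this sketch, `verify(root, b"c2", proof)` establishes only that `c2` is in the committed set, which covers requirement 2 for a single element. A verifier who receives one proof per element of $C_B$ learns $\lvert C_B \rvert$ from the number of proofs, and nothing in the proof ties the element to what was actually fed into the PSI-CA execution.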
Enable bidirectional outputs

This can be addressed by using two distinct common dummy-identifier sets. In that case, the protocol should be designed so that both parties obtain the PSI-CA output directly. (In the current design, only $P_A$ receives the output and may share it with $P_B$.)
Establish benchmarks

This includes comparing throughput/latency across models and identifying representative use cases.
Address the lack of a cryptographic guarantee that $P_A$ adds $u_0$ to the protocol output

Working Notes#

Why can't identifiability elimination be made a purely cryptographic, non-operational process?#

Could we not simply use an OPRF?

Assume an operational setting in which $P_A$ and $P_B$ jointly evaluate an oblivious pseudorandom function $F_{sk}(\cdot)$ over $V_A$ using a secret key $sk$ held by $P_B$. In this setting, $P_B$ learns nothing about the input $V_A$, and $P_A$ learns nothing about $P_B$’s secret key.

However, the secret key necessarily remains on $P_B$’s side for some period of time. If either (i) the keyed function $F_{sk}(\cdot)$ were leaked from $P_B$, or (ii) $P_B$’s key were leaked to $P_A$, then the resulting values could be readily cross-referenced with other data, thereby enabling the identification of a specific individual. Consequently, they may fall within the scope of personal data.

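Even so, an OPRF has merit: key deletion can be performed by $P_B$ alone, which reduces the operational burden, and the oblivious evaluation adds a layer of cryptographic assurance. Key deletion itself remains an operational step, which is why identifiability elimination cannot be made purely cryptographic.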
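The OPRF evaluation discussed in this subsection can be sketched as a blinded exponentiation, $F_{sk}(x) = H(x)^{sk}$. This is only an illustration under stated assumptions: the group is a toy-sized subgroup of $\mathbb{Z}_p^*$ that is far too small for real use, and the function names (`hash_to_group`, `blind`, `evaluate`, `unblind`) are hypothetical. It is not claimed to be the construction used in the design.

```python
import hashlib
import secrets

P = 2039  # toy safe prime, P = 2*Q + 1 (assumption: illustration only, insecure size)
Q = 1019  # prime order of the subgroup of squares modulo P

def hash_to_group(x: bytes) -> int:
    # Map x into the order-Q subgroup by squaring a nonzero residue.
    t = int.from_bytes(hashlib.sha256(x).digest(), "big") % (P - 1) + 1
    return pow(t, 2, P)

def blind(x: bytes):
    # Run by P_A: hide H(x) behind a random exponent r.
    r = secrets.randbelow(Q - 1) + 1
    return r, pow(hash_to_group(x), r, P)

def evaluate(sk: int, blinded: int) -> int:
    # Run by P_B: apply the secret key without seeing x.
    return pow(blinded, sk, P)

def unblind(r: int, evaluated: int) -> int:
    # Run by P_A: remove r to obtain H(x)^sk (needs Python 3.8+ for pow(r, -1, Q)).
    return pow(evaluated, pow(r, -1, Q), P)

sk = secrets.randbelow(Q - 1) + 1   # held only by P_B
x = b"alice@example.com"            # hypothetical identifier in V_A
r, blinded = blind(x)
tag = unblind(r, evaluate(sk, blinded))
```

Anyone who later obtains `sk` can recompute `pow(hash_to_group(candidate), sk, P)` for candidate identifiers and compare the result against stored tags. That is the cross-referencing risk described above, and it persists until $P_B$ deletes the key.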

Notes on Circuit PSI#

Secure Circuit Aggregation can be implemented concretely with Yao’s garbled circuits, GMW, or related MPC techniques.
See Section 2.3 of “Efficient Circuit-Based PSI with Linear Communication”.

Design considerations for DH-based and HE-based PSI-CA#

How to use noise shares and how to design the PSI-CA output#

First, the noise shares generated via Secure Sampling cannot be added directly to the raw intersection cardinality.

The reason is that encrypting (or otherwise masking) the values before adding noise does not help: the Personal Information Protection Commission’s FAQ indicates that “information relating to an individual” retains that status regardless of whether it is concealed by encryption or similar measures. Therefore, if $P_A$ and $P_B$ add their shares and there is any risk that the result could be decrypted or reconstructed, the value may still be treated as personal data.

Accordingly, the PSI-CA protocol must release the intersection cardinality only in a form that already includes noise.
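One possible reading of this requirement, based on the dummy-identifier notation used elsewhere in this note, is that the noise enters through the inputs rather than being added to a raw count afterwards. The sketch below is an assumption-laden illustration, not the actual protocol: `psi_ca` is a plaintext stand-in for the real PSI-CA execution, `noisy_cardinality` is a hypothetical helper, and $u_0$, $u_1$ are treated as non-negative integer noise shares.

```python
import secrets

def psi_ca(set_a, set_b):
    # Plaintext stand-in for the PSI-CA protocol: only the cardinality leaves it.
    return len(set_a & set_b)

def noisy_cardinality(v_a, v_b, common_dummies, u0, u1):
    rng = secrets.SystemRandom()
    # Assumption: P_B inserts a random subset C_B of the common dummies, |C_B| = u1.
    c_b = set(rng.sample(sorted(common_dummies), u1))
    # Assumption: P_A inserts every common dummy, so each element of C_B matches.
    out = psi_ca(v_a | common_dummies, v_b | c_b)
    # P_A adds its own share u0 before the value is used anywhere else.
    return out + u0

v_a = {"id-1", "id-2", "id-3"}
v_b = {"id-2", "id-3", "id-4"}
dummies = {f"dummy-{i}" for i in range(10)}
result = noisy_cardinality(v_a, v_b, dummies, u0=2, u1=3)
```

Under these assumptions the protocol output is already $\lvert V_A \cap V_B \rvert + u_1$, so the raw intersection cardinality never exists as a standalone value, encrypted or otherwise.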

Handling individual dummy identifier sets from the perspective of "hiding size information"#

For a given query, the dataset size $\lvert V_j \rvert$ may reveal to the counterparty how many records satisfy the query condition; it must therefore be treated as sensitive size information and concealed.

In Circuit PSI, this “size-hiding” property is achieved by default because bins are padded with dummy elements.

For DH-based and HE-based PSI-CA, the model design should similarly incorporate the hashing-to-bins paradigm and include padding with dummies to ensure size information is not revealed.
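The hashing-to-bins paradigm with dummy padding can be sketched as follows. This is a minimal illustration, assuming simple hashing with SHA-256, publicly agreed parameters `num_bins` and `capacity`, and hypothetical helper names; a real design would also need dummies that are indistinguishable from real elements after encoding and a capacity chosen so that overflow is negligible.

```python
import hashlib
import secrets

def bin_index(item: str, num_bins: int) -> int:
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest, "big") % num_bins

def hash_to_padded_bins(items, num_bins, capacity):
    bins = [[] for _ in range(num_bins)]
    for item in items:
        bins[bin_index(item, num_bins)].append(item)
    for b in bins:
        if len(b) > capacity:
            raise ValueError("bin overflow: increase capacity or num_bins")
        # Pad with fresh random dummies so every bin has exactly `capacity` slots.
        b.extend("dummy:" + secrets.token_hex(16) for _ in range(capacity - len(b)))
    return bins

small = [f"user{i}" for i in range(3)]
large = [f"user{i}" for i in range(10)]
bins_small = hash_to_padded_bins(small, num_bins=4, capacity=16)
bins_large = hash_to_padded_bins(large, num_bins=4, capacity=16)
```

Both inputs produce 4 bins of 16 slots each, so the amount of data sent to the counterparty depends only on the public parameters and not on $\lvert V_j \rvert$.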