Worknotes

Integration of PSI-CA and Differential Privacy for APPI Compliance

This note consolidates the future work items and memoranda for “Integration of PSI-CA and Differential Privacy for APPI Compliance” (個人情報保護法準拠に資するPSI-CAと差分プライバシーの統合設計).

Future Work

  • Mitigation against malicious adversaries

    • Dummy validation

      This issue can be viewed as a form of input poisoning. If we address it, we may also need to validate whether the dataset $V_B$ provided by $P_B$ is well-formed and correct.

      Therefore, cryptographic guarantees (e.g., SNARKs) and/or assurances at another layer (e.g., TEEs or auditing) are required.

  • Enable bidirectional outputs

    This can be addressed by using two distinct common dummy-identifier sets. In that case, the protocol should be designed so that both parties obtain the PSI-CA output directly. (In the current design, only $P_A$ receives the output and may share it with $P_B$.)

  • Establish benchmarks

    This includes comparing throughput/latency across models and identifying representative use cases.

  • Address the lack of cryptographic guarantees that $P_A$ will add $u_0$ to the protocol output.

Working Notes

Why can identifiability elimination not be made a non-operational and purely cryptographic process?

  • Could we not simply use an OPRF?

Assume an operational setting in which $P_A$ and $P_B$ evaluate an oblivious function $F_{sk}(\cdot)$ over $V_A$ using a secret key $sk$ held by $P_B$. In this setting, $P_B$ learns nothing about the input $V_A$, and $P_A$ learns nothing about $P_B$’s secret key.

However, the secret key necessarily remains on $P_B$’s side for some period of time. If either (i) the keyed function $F_{sk}(\cdot)$ were leaked from $P_B$, or (ii) $P_B$’s key were leaked to $P_A$, then the resulting values could become information that can be readily cross-referenced with other data, thereby enabling the identification of a specific individual. Consequently, they may fall within the scope of personal data.

An OPRF therefore does not remove the operational step of key deletion. It does, however, mean that key deletion can be performed by $P_B$ alone, which reduces the operational burden while also providing an additional layer of cryptographic assurance.
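To make the point about the key remaining on $P_B$’s side concrete, the following is a minimal sketch of a Diffie-Hellman-style blinded evaluation (hash, blind, exponentiate, unblind). It is an illustration only, not the construction used in this design: the toy group parameters `P` and `Q`, the function names, and the final hashing step are all assumptions introduced here, and the group is far too small for real use.

```python
import hashlib
import secrets

# Toy group (assumption): quadratic residues modulo the safe prime P = 2*Q + 1.
# These parameters are illustrative only and offer no security.
P = 2039
Q = 1019


def hash_to_group(x: bytes) -> int:
    """Map x to a non-identity element of the order-Q subgroup."""
    digest = int.from_bytes(hashlib.sha256(x).digest(), "big")
    h = 2 + digest % (P - 3)  # h in [2, P-2], so h is never 0, 1, or -1 mod P
    return pow(h, 2, P)       # squaring lands in the order-Q subgroup


def blind(x: bytes):
    """P_A: hide the input under a random exponent r before sending it."""
    r = 1 + secrets.randbelow(Q - 1)  # r in [1, Q-1], invertible mod Q
    return r, pow(hash_to_group(x), r, P)


def evaluate(sk: int, blinded: int) -> int:
    """P_B: apply the secret key to the blinded element (P_B never sees x)."""
    return pow(blinded, sk, P)


def unblind(r: int, evaluated: int) -> int:
    """P_A: remove r, leaving H(x)^sk without ever learning sk."""
    return pow(evaluated, pow(r, -1, Q), P)


def finalize(x: bytes, unblinded: int) -> str:
    """Hash the input together with the unblinded element to get the output."""
    width = (P.bit_length() + 7) // 8
    return hashlib.sha256(x + unblinded.to_bytes(width, "big")).hexdigest()


def keygen() -> int:
    """P_B: sample the secret key. It exists until P_B deletes it."""
    return 1 + secrets.randbelow(Q - 1)
```

While `sk` exists, anyone who obtains it can recompute `finalize(x, pow(hash_to_group(x), sk, P))` for any candidate identifier `x` and cross-reference the outputs. That is why deleting the key remains an operational step that only $P_B$ can carry out.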

Notes on Circuit PSI

  • Concrete implementations of Secure Circuit Aggregation can be realized using Yao’s garbled circuits, GMW, and related MPC techniques.
    See Section 2.3 of Efficient Circuit-Based PSI with Linear Communication.

Design considerations for DH-based and HE-based PSI-CA

How to use noise shares and how to design the PSI-CA output

First, the noise shares generated via Secure Sampling cannot be added directly to the raw intersection cardinality.

The reason is that, even if the values are encrypted (or otherwise masked) prior to adding noise, the Personal Information Protection Commission’s FAQ indicates that “information relating to an individual” remains such regardless of whether it is hidden by encryption or similar measures. Therefore, if $P_A$ and $P_B$ add their shares to the raw value and there exists any risk that the raw value could be decrypted or reconstructed, the value may still be treated as personal data.

Accordingly, the intersection cardinality output produced by the PSI-CA protocol must always be released only in a form that already includes noise.
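One hypothetical way to satisfy this requirement is to embed the noise in the protocol inputs as dummy identifiers, so that the raw cardinality never exists as a separate value. The sketch below is an assumption made for illustration and is not necessarily the construction used in this design: it assumes two disjoint common dummy pools (`dummies_1`, `dummies_2`) that cannot collide with real identifiers, non-negative integer noise shares `z_a` and `z_b`, and a plain set intersection standing in for the actual PSI-CA protocol.

```python
def noisy_inputs(v_a, v_b, dummies_1, dummies_2, z_a, z_b):
    """Build protocol inputs whose intersection already contains the noise.

    Assumptions (hypothetical): dummies_1 and dummies_2 are disjoint lists
    known to both parties, and no dummy equals a real identifier.
    """
    if not (0 <= z_a <= len(dummies_1) and 0 <= z_b <= len(dummies_2)):
        raise ValueError("noise share exceeds the size of its dummy pool")
    # P_A contributes z_a dummies from pool 1 and all of pool 2.
    in_a = set(v_a) | set(dummies_1[:z_a]) | set(dummies_2)
    # P_B contributes all of pool 1 and z_b dummies from pool 2.
    in_b = set(v_b) | set(dummies_1) | set(dummies_2[:z_b])
    return in_a, in_b


def psi_ca_stand_in(in_a, in_b):
    """Placeholder for the PSI-CA protocol: returns only the cardinality."""
    return len(in_a & in_b)
```

Under these assumptions the only value ever produced is $\lvert V_A \cap V_B \rvert + z_a + z_b$; there is no intermediate step at which the un-noised count is held, encrypted or otherwise.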

Handling individual dummy identifier sets from the perspective of "hiding size information"

For a given query, the dataset size $\lvert V_j \rvert$ may allow the counterparty to infer the size of the dataset satisfying the query condition; thus, it must be treated as sensitive size information that requires concealment.

In Circuit PSI, this “size-hiding” property is achieved by default because bins are padded with dummy elements.

For DH-based and HE-based PSI-CA, the model design should similarly incorporate the hashing-to-bins paradigm and include padding with dummies to ensure size information is not revealed.
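The hashing-to-bins paradigm with dummy padding can be sketched as follows. This is a minimal illustration under stated assumptions, not the design itself: the function names, the `dummy:` prefix (assumed never to occur in real identifiers), and the use of SHA-256 for bin assignment are all hypothetical. The layout hides $\lvert V_j \rvert$ only if `num_bins` and `bin_capacity` are fixed from a public upper bound rather than derived from the actual set size.

```python
import hashlib
import secrets


def bin_index(item: str, num_bins: int) -> int:
    """Assign an item to a bin using a public hash function."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bins


def hash_to_padded_bins(items, num_bins: int, bin_capacity: int):
    """Hash items into bins, then pad every bin to exactly bin_capacity."""
    bins = [[] for _ in range(num_bins)]
    for item in set(items):
        bins[bin_index(item, num_bins)].append(item)
    for b in bins:
        if len(b) > bin_capacity:
            raise ValueError("bin overflow: increase num_bins or bin_capacity")
        # Fresh random dummies: they match nothing on the counterparty's side.
        while len(b) < bin_capacity:
            b.append("dummy:" + secrets.token_hex(16))
    return bins
```

With fixed parameters, a 3-element set and a 12-element set both produce `num_bins` bins of `bin_capacity` entries each, so the counterparty sees the same layout in either case.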