Entity          Privacy requirement               Utility (redaction)   Utility (sanitization)
New York        (2.0, New York)-sanitized         20.1%                 58.3%
Homosexuality   (1.0, Homosexuality)-sanitized    92.9%                 97.5%
                (1.5, Homosexuality)-sanitized    55.3%                 81.3%
                (2.0, Homosexuality)-sanitized    50.4%                 77.2%
Catholicism     (1.0, Catholicism)-sanitized      96.3%                 98.1%
                (1.5, Catholicism)-sanitized      41.3%                 73.4%
                (2.0, Catholicism)-sanitized      30.9%                 65.8%
First, we observe that data utility preservation is a direct function of the detection
precision reported in Table 1. That is, the lower the precision, the higher the number of false
positives and, thus, the larger the number of terms that are unnecessarily removed or generalized.
Moreover, the decrease in utility preservation is sharper when terms are redacted than when they
are sanitized. Indeed, term sanitization (i.e., generalization) yields a significant increase in data
utility in all cases, which becomes more noticeable as the number of terms detected as sensitive
increases. This shows that exploiting one or several knowledge bases makes the output more
usable and also less obviously protected: terms to be hidden are generalized instead of being
blacked out, which would otherwise draw attention to their sensitivity.
In any case, the degree of preservation also depends on the generality of the concept to
protect. In order to fulfill the privacy guarantee, the most general concepts (e.g., New York, HIV),
that is, those with a lower baseline IC, usually require protecting a larger number of
entities and generalizing them (in the case of sanitization) to a higher level of abstraction than the
most specific ones (e.g., Los Angeles, STD).
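For reference, the baseline IC alluded to here is the usual web-based estimate IC(t) = -log2 p(t), with p(t) derived from search-engine hit counts. The minimal sketch below only fixes ideas: the corpus size and the hit counts are invented placeholders, not the figures measured in our experiments.

import math

# Illustrative only: the corpus size and the hit counts passed below are
# invented placeholders, not values measured in the experiments.
TOTAL_INDEXED_PAGES = 25e9  # assumed size of the web corpus used as grounding

def information_content(hits, total=TOTAL_INDEXED_PAGES):
    # IC(t) = -log2 p(t), with p(t) estimated from web search hit counts.
    return -math.log2(hits / total)

# A more general entity (one returning more hits) has a lower baseline IC,
# so fulfilling the privacy guarantee for it requires protecting more terms
# and generalizing them to a higher level of abstraction.
print(information_content(2.0e9))   # e.g., a general entity such as "New York"
print(information_content(4.0e8))   # e.g., a more specific one such as "Los Angeles"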
Conclusions
Because of the enormous number of unstructured sensitive documents that are exchanged
daily, methods and tools that automate the burdensome sanitization/redaction process are
needed. This is the purpose of the privacy model proposed in this paper, which defines the
theoretical framework needed to implement practical sanitization tools that can offer clear,
a priori privacy guarantees and an inherently semantic privacy protection that aims at
mimicking the reasoning of human sanitizers. By enforcing the abstract model in terms of
information theory and quantifying semantics according to the information content of terms, a
practical and general-purpose solution has been proposed. In comparison with other privacy
models available in the literature, our proposal provides a series of benefits, which include: i) an
intuitive and straightforward instantiation according to different privacy needs (i.e., anonymity or
confidentiality) and scenarios, by using the linguistic labels usually mentioned in current privacy
legislation; ii) a flexible adaptation of the degree of redaction/sanitization according to specific
privacy requirements, which can be applied to individual documents; iii) a priori privacy
guarantees on the kind of data protection applied to the output; iv) an automatic detection of
terms that may disclose sensitive data via semantic inferences, with inherent support for all
kinds of semantic relations (taxonomic and non-taxonomic); and v) support for both redaction
and sanitization (the latter requiring an appropriate KB).
As additional contributions of this work, we have i) characterized the semantics of
disclosure according to the type of semantic relationship between the entities to protect and the
terms appearing in the document, ii) discussed the relevant aspects that should be considered
when implementing methods and tools based on our method, iii) proposed a simple and scalable
implementation algorithm, and iv) shown the applicability and suitability of our proposal through
the evaluation, by a human expert, of the empirical results obtained in several realistic
sanitization scenarios.
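To make contribution iii) concrete, the following skeleton illustrates the general shape of such a detection loop: terms that reveal too much about the entity to protect, given the chosen α, are blacked out or generalized. This is only a sketch under our own simplifying assumptions: the disclosure test (comparing the term's pointwise mutual information with the entity, scaled by α, against the entity's IC), the get_hits query function and the generalize callback are hypothetical stand-ins, not a verbatim implementation of the criterion formalized in the paper.

import math

def pmi(term, entity, get_hits, total):
    # Pointwise mutual information between a document term and the entity to
    # protect, estimated from (co-)occurrence counts returned by get_hits.
    p_t = get_hits(term) / total
    p_c = get_hits(entity) / total
    p_tc = get_hits('"%s" "%s"' % (term, entity)) / total
    if p_t <= 0 or p_c <= 0 or p_tc <= 0:
        return 0.0
    return math.log2(p_tc / (p_t * p_c))

def sanitize(terms, entity, alpha, get_hits, total, generalize=None):
    # Illustrative detection loop (not the paper's exact criterion): flag a
    # term when the information it provides about the protected entity,
    # scaled by alpha, reaches the entity's own information content; then
    # generalize it if a knowledge base is available, or redact it otherwise.
    ic_entity = -math.log2(get_hits(entity) / total)
    protected = []
    for t in terms:
        if alpha * pmi(t, entity, get_hits, total) >= ic_entity:
            protected.append(generalize(t) if generalize else "[REDACTED]")
        else:
            protected.append(t)
    return protected

A larger α makes the test easier to trigger, so more terms are protected, which is consistent with the utility figures reported above.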
The proposed model and its information-theoretic enforcement open the door to the
development of theoretically sound redaction/sanitization methods. As future work, we plan to
develop more accurate implementations of our model that would consider, for example, i)
improvements in probability assessment by contextualizing web queries and, thus, minimizing
the semantic ambiguity inherent in raw corpora (D. Sánchez et al., 2010), ii) integration of the
results provided by several web search engines in order to minimize data sparseness and compile
more robust probabilities, as sketched below, or iii) semantic disambiguation (Roman, Hulin, Collins, & Powell,
2012) of sensitive terms in order to retrieve more appropriate generalizations. Moreover, as
shown by the empirical experiments, different α values are needed to provide an
optimal balance between data privacy and utility for each specific scenario. We plan to investigate
ways to automate the tuning of this parameter, so that α can be set to an appropriate value
according, for example, to the informativeness of the entity c and/or of the sensitive terms found
in the input document. Finally, we also plan to perform additional evaluations that consider
the perspective of potential attackers; that is, to what extent an external observer would or
would not be able to infer the sensitive entities from the sanitized output.
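Regarding item ii) above, one simple form such an integration could take is sketched here; the aggregation scheme, the engine callables and the corpus sizes are hypothetical and serve only to fix ideas, not to describe a planned implementation.

import math

def aggregated_probability(term, engines, corpus_sizes):
    # Hypothetical aggregation: average the occurrence probabilities that
    # several web search engines report for a term, so that a sparse or
    # missing count in one engine is smoothed by the others.
    probs = [engine(term) / size for engine, size in zip(engines, corpus_sizes)]
    return sum(probs) / len(probs)

def robust_ic(term, engines, corpus_sizes):
    # IC computed over the aggregated probability instead of a single engine.
    return -math.log2(aggregated_probability(term, engines, corpus_sizes))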
Disclaimer and acknowledgements
The authors are solely responsible for the views expressed in this paper, which do not
necessarily reflect the position of UNESCO nor commit that organization. This work was partly
supported by the European Commission under FP7 project Inter-Trust, by the Spanish Ministry
of Science and Innovation (through projects eAEGIS TSI2007-65406-C03-01, ICWT TIN2012-
32757, ARES-CONSOLIDER INGENIO 2010 CSD2007-00004 and BallotNext IPT-2012-0603-
430000) and by the Government of Catalonia (under grant 2009 SGR 1135).
References
Anandan, B., & Clifton, C. (2011). Significance of term relationships on anonymization. Paper
presented at the IEEE/WIC/ACM International Joint Conference on Web Intelligence and
Intelligent Agent Technology - Workshops, Lyon, France.
Anandan, B., Clifton, C., Jiang, W., Murugesan, M., Pastrana-Camacho, P., & Si, L. (2012). t-
plausibility: Generalizing words to desensitize text. Transactions on Data Privacy, 5,
505-534.
Batet, M., Erola, A., Sánchez, D., & Castellà-Roca, J. (2013). Utility preserving query log
anonymization via semantic microaggregation. Information Sciences, 242, 49-63.
Bier, E., Chow, R., Golle, P., King, T. H., & Staddon, J. (2009). The Rules of Redaction: identify,
protect, review (and repeat). IEEE Security and Privacy Magazine, 7(6), 46-53.
Bollegala, D., Matsuo, Y., & Ishizuka, M. (2009). A Relational Model of Semantic Similarity
between Words using Automatically Extracted Lexical Pattern Clusters from the Web.
Paper presented at the Conference on Empirical Methods in Natural Language
Processing, EMNLP 2009, Singapore, Republic of Singapore.
Cilibrasi, R. L., & Vitányi, P. M. B. (2006). The Google Similarity Distance. IEEE Transactions
on Knowledge and Data Engineering, 19(3), 370-383.
Cumby, C., & Ghani, R. (2011). A machine learning based system for semiautomatically
redacting documents. Paper presented at the Twenty-Third Conference on Innovative
Applications of Artificial Intelligence, San Francisco, CA, USA.