The Masking Package is a Python library designed to provide various masking operations on data columns. It supports hashing, fake data generation, and other masking techniques to ensure data privacy and security.
- Development:
To install the Masking Package, use the following command:
pip install masking@git+"https://github.com/d-one/masking"In the following we provide an example of how to use the masking module to hash the entire content of an entry.
Here is a basic example of how to use the package
import pandas as pd
from masking.mask.pipeline import MaskDataFramePipeline
from masking.mask.operations.operation_hash import HashOperation
# Sample DataFrame
data = {'name': ['Alice', 'Bob', 'Charlie'],
'ssn': ['123-45-6789', '987-65-4321', '555-55-5555']}
df = pd.DataFrame(data)
# Configuration for masking
config = {
"name": {'masking_operation': HashOperation(col_name="name",
secret="my_secret")},
}
# Create and apply the masking pipeline
pipeline = MaskDataFramePipeline(config)
masked_df = pipeline(df)
print(masked_df)The module makes use of Presidio, a framework introduced by Microsoft, to enable entity recognition. This latter leverages Spacy as a NLP framework to do entity recognition.
Before using the masking module, make sure to have downloaded the necessary spacy model: open up a terminal in your project and execute the following line of code
python -m spacy download en_core_web_trfYou can make sure to have installed the spacy model correctly by running the following code:
import spacy
# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_trf")
# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
"Google in 2007, few people outside of the company took him "
"seriously. “I can tell you very senior CEOs of major American "
"car companies would shake my hand and turn away because I wasn’t "
"worth talking to,” said Thrun, in an interview with Recode earlier "
"this week.")
doc = nlp(text)
# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])
# Find named entities, phrases and concepts
for entity in doc.ents:
print(entity.text, entity.label_)Once the model is correctly installed, one can make use of it for masking in the following manner:
import pandas as pd
from masking.mask.pipeline import MaskDataFramePipeline
from masking.mask.operations.operation_presidio import MaskPresidio
from masking.utils.presidio_handler import PresidioMultilingualAnalyzer
analyzer = PresidioMultilingualAnalyzer(
models={"en": "en_core_web_trf"}
).analyzer
# Sample DataFrame
data = {'description': ['James Bond was a very clever inspector.']}
df = pd.DataFrame(data)
# Configuration for masking
config = {
"name": {'masking_operation': MaskPresidio(col_name="name", secret="my_secret")},
}
# Create and apply the masking pipeline
pipeline = MaskDataFramePipeline(config)
masked_df = pipeline(df)
print(masked_df)