Personally Identifiable Information
Warning
Ask your Databricks administrators to set the environment variable PII_TABLE
before you get started.
Example: PII_TABLE=db_admin.gdpr.one_time_deletes
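Before creating or consuming requests, it can help to confirm that the variable is actually visible to your notebook or job. The following is a minimal sketch; the helper name `get_pii_table` is hypothetical and not part of delta_utils, and the library may read the variable differently internally.

```python
import os


def get_pii_table(env=None):
    """Return the configured PII table name, or raise if it is missing.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    table = env.get("PII_TABLE", "").strip()
    if not table:
        raise RuntimeError(
            "PII_TABLE is not set; ask your Databricks administrators to configure it."
        )
    return table


# Example with an explicit mapping instead of the real environment
print(get_pii_table({"PII_TABLE": "db_admin.gdpr.one_time_deletes"}))
```

Calling `get_pii_table()` without arguments checks the real process environment and fails early with a readable message if the variable is absent.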
Consumer
dataclass
Parameters:
Name | Type | Description | Default |
---|---|---|---|
spark | SparkSession | the active spark session | required |
consumer | str | the consuming catalog name | required |
Example
Get all the removal requests and mark one as completed
```python
from delta_utils.pii import Consumer

consumer = Consumer(spark, "beta_live")
consumer.get_removal_requests().display()
# After handling the deletion of a request
consumer.mark_as_completed("abc123")
```
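When there are several pending requests, the same two calls can be wrapped in a loop. The sketch below is only an illustration: `process_removal_requests` and `delete_rows` are hypothetical names, and the assumption that the request DataFrame exposes the UUID in a column called `id` should be checked against the actual schema returned by `get_removal_requests()`.

```python
def process_removal_requests(consumer, delete_rows, id_column="id"):
    """Handle every pending removal request, then mark it as completed.

    consumer    -- an object with get_removal_requests() and mark_as_completed(id)
    delete_rows -- a callable that performs the actual deletion for one request row
    id_column   -- assumed name of the column holding the request UUID
    """
    completed = []
    for request in consumer.get_removal_requests().collect():
        # Delete first, so a request is only marked once the deletion succeeded
        delete_rows(request)
        request_id = request[id_column]
        consumer.mark_as_completed(request_id)
        completed.append(request_id)
    return completed
```

If `delete_rows` raises, the loop stops and the failing request stays unmarked, so it shows up again on the next run.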
Source code in delta_utils/pii.py
get_removal_requests()
Get all removal requests
Returns:
Name | Type | Description |
---|---|---|
Dataframe | DataFrame | a dataframe that can be displayed with all the delete requests |
mark_as_completed(id)
Mark the removal request as completed
Parameters:
Name | Type | Description | Default |
---|---|---|---|
id | str | the UUID from the request | required |
Producer
dataclass
Parameters:
Name | Type | Description | Default |
---|---|---|---|
spark | SparkSession | the active spark session | required |
Example
Create a removal request
```python
from delta_utils.pii import Producer

producer = Producer(spark)
producer.create_removal_request(
    affected_table="beta_live.world.adults_only",
    source_table="alpha_live.world.people",
    source_identifying_attributes=[("id", "1")],
)
```
create_removal_request(*, affected_table, source_table, source_columns=None, source_identifying_attributes, when_to_delete=None)
Creates a personally identifiable information (PII) removal request.
This function generates a request to remove PII from the specified affected table, using the source table and its identifying attributes to locate the matching records. The request can optionally include a date and time at which the deletion should occur.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
affected_table | str | The name of the affected table from which PII needs to be removed. | required |
source_table | str | The name of the source table that contains the associated columns for identifying attributes. | required |
source_columns | Optional[List[str]] | A list of column names in the source table that hold PII. Defaults to None. | None |
source_identifying_attributes | List[Tuple[str, str]] | A list of tuples representing the identifying attributes to match the PII records in the affected table. Each tuple consists of a column name in the source table and the value of the PII records. | required |
when_to_delete | Optional[datetime] | An optional datetime object representing the specific date and time when the PII deletion should occur. Defaults to None. | None |
Raises:
Type | Description |
---|---|
ValueError | If the affected_table or source_table is not provided or if source_identifying_attributes is empty. |
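To illustrate the optional parameters, here is a hedged sketch that assembles keyword arguments for `create_removal_request`, including `source_columns` and a delayed `when_to_delete`. The helper `build_removal_request` is hypothetical and not part of delta_utils; it only mirrors the documented `ValueError` conditions as a pre-check. Whether the library expects a timezone-aware or naive datetime is not stated here, so the UTC-aware value is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple


def build_removal_request(
    affected_table: str,
    source_table: str,
    source_identifying_attributes: List[Tuple[str, str]],
    source_columns: Optional[List[str]] = None,
    delay: Optional[timedelta] = None,
) -> dict:
    """Validate the inputs and return keyword arguments for create_removal_request."""
    if not affected_table or not source_table:
        raise ValueError("affected_table and source_table must be provided")
    if not source_identifying_attributes:
        raise ValueError("source_identifying_attributes must not be empty")

    when_to_delete = None
    if delay is not None:
        # Assumption: a timezone-aware UTC datetime is acceptable for when_to_delete
        when_to_delete = datetime.now(timezone.utc) + delay

    return {
        "affected_table": affected_table,
        "source_table": source_table,
        "source_columns": source_columns,
        "source_identifying_attributes": source_identifying_attributes,
        "when_to_delete": when_to_delete,
    }


kwargs = build_removal_request(
    affected_table="beta_live.world.adults_only",
    source_table="alpha_live.world.people",
    source_identifying_attributes=[("id", "1")],
    source_columns=["name"],  # hypothetical PII column, for illustration only
    delay=timedelta(days=30),
)
# With a Producer available: producer.create_removal_request(**kwargs)
```

Because every parameter of `create_removal_request` is keyword-only, unpacking a dictionary with `**kwargs` matches the documented signature.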
Using it together with lineage
```python
# Instantiate a Lineage object
lineage = Lineage(
    databricks_workspace_url=dbutils.secrets.get("DATABRICKS", "URL"),
    databricks_token=dbutils.secrets.get("DATABRICKS", "TOKEN"),
)
# Instantiate the Producer object
producer = Producer(spark)
source_table = "alpha_live.world.people"
# Get downstream tables for the alpha_live.world.people table
downstream_tables = lineage.downstream_tables(source_table)
# Create one removal request per downstream table
for affected_table in downstream_tables:
    producer.create_removal_request(
        affected_table=affected_table,
        source_table=source_table,
        source_identifying_attributes=[("id", "1")],
    )
```