Secure Azure Databricks Deployment
Prerequisites
Take note of the Azure Databricks control plane endpoints for your workspace from here (map them based on the region of your workspace). We’ll need these details to configure Azure Firewall rules later.
| Name | Source | Destination | Protocol:Port | Purpose |
|---|---|---|---|---|
| databricks-webapp | Azure Databricks workspace subnets | Region-specific Webapp Endpoint | https:443 | Communication with Azure Databricks webapp |
| databricks-observability-eventhub | Azure Databricks workspace subnets | Region-specific Observability Event Hub Endpoint | https:9093 | Transit for Azure Databricks on-cluster service-specific telemetry |
| databricks-artifact-blob-storage | Azure Databricks workspace subnets | Region-specific Artifact Blob Storage Endpoint | https:443 | Stores Databricks Runtime images to be deployed on cluster nodes |
| databricks-dbfs | Azure Databricks workspace subnets | DBFS Blob Storage Endpoint | https:443 | Azure Databricks workspace root storage |
| databricks-sql-metastore (OPTIONAL; see Step 3 for External Hive Metastore below) | Azure Databricks workspace subnets | Region-specific SQL Metastore Endpoint | tcp:3306 | Stores metadata for databases and child objects in an Azure Databricks workspace |
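If you prefer to manage these values as code, a minimal Terraform sketch along the following lines can capture them once for reuse in the firewall rules later in this guide. The local names and placeholder values here are hypothetical; substitute the region-specific endpoints from the table above.

# Hypothetical locals capturing the region-specific control plane endpoints;
# replace each placeholder with the value for your workspace region.
locals {
  databricks_webapp_ips         = ["<region-webapp-ip>/32"]           # databricks-webapp
  databricks_eventhub_fqdn      = "<region-observability-eventhub>"   # databricks-observability-eventhub
  databricks_artifact_storage   = "<region-artifact-blob-storage>"    # databricks-artifact-blob-storage
  databricks_dbfs_storage       = "<workspace-dbfs-blob-endpoint>"    # databricks-dbfs
  databricks_sql_metastore_fqdn = "<region-sql-metastore>"            # optional; see Step 3
}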
Configure Azure Firewall Rules
With Azure Firewall, you can configure:
- Application rules that define fully qualified domain names (FQDNs) that can be accessed from a subnet.
- Network rules that define source address, protocol, destination port, and destination address.

Network traffic is subjected to the configured firewall rules when you route it to the firewall as the subnet default gateway.
Configure Application Rule
We first need to configure application rules to allow outbound access to the Log Blob Storage and Artifact Blob Storage endpoints in the Azure Databricks control plane, plus the DBFS root Blob Storage for the workspace.
- Go to the resource group, and select the firewall.
- On the firewall page, under Settings, select Rules.
- Select the Application rule collection tab.
- Select Add application rule collection.
- For Name, type databricks-control-plane-services.
- For Priority, type 200.
- For Action, select Allow.
- Configure the following in Rules -> Target FQDNs:

| Name | Source type | Source | Protocol:Port | Target FQDNs |
|---|---|---|---|---|
| databricks-spark-log-blob-storage | IP Address | Azure Databricks workspace subnets | https:443 | Refer to notes from Prerequisites above (for Central US) |
| databricks-audit-log-blob-storage | IP Address | Azure Databricks workspace subnets | https:443 | Refer to notes from Prerequisites above (for Central US). This is separate log storage only for US regions today |
| databricks-artifact-blob-storage | IP Address | Azure Databricks workspace subnets | https:443 | Refer to notes from Prerequisites above (for Central US) |
| databricks-dbfs | IP Address | Azure Databricks workspace subnets | https:443 | Refer to notes from Prerequisites above |
| Public repositories for Python and R libraries (OPTIONAL; if workspace users are allowed to install libraries from public repos) | IP Address | Azure Databricks workspace subnets | https:443 | *pypi.org, *pythonhosted.org, cran.r-project.org. Add any other public repos as desired |
| Used by Ganglia UI | IP Address | Azure Databricks workspace subnets | https:443 | cdnjs.com or cdnjs.cloudflare.com |
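For teams scripting this instead of clicking through the portal, the same collection can be expressed with the azurerm_firewall_application_rule_collection resource. This is a minimal sketch under stated assumptions: the firewall and resource group references and the var.databricks_subnet_cidrs variable are ours, and only the DBFS rule is spelled out.

resource "azurerm_firewall_application_rule_collection" "databricks_control_plane" {
  name                = "databricks-control-plane-services"
  azure_firewall_name = azurerm_firewall.this.name        # assumed firewall resource
  resource_group_name = azurerm_resource_group.this.name  # assumed resource group
  priority            = 200
  action              = "Allow"

  # databricks-dbfs rule from the table above; add equivalent rule blocks
  # for the spark-log, audit-log, and artifact blob storage endpoints.
  rule {
    name             = "databricks-dbfs"
    source_addresses = var.databricks_subnet_cidrs        # assumed variable holding workspace subnet CIDRs
    target_fqdns     = [local.databricks_dbfs_storage]    # from the Prerequisites locals sketch above
    protocol {
      type = "Https"
      port = 443
    }
  }
}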
Configure Network Rule
Some endpoints can’t be configured as application rules using FQDNs, so we’ll set those up as network rules instead, namely for the Observability Event Hub and the webapp.
- Open the resource group adblabs-rg, and select the firewall.
- On the firewall page, under Settings, select Rules.
- Select the Network rule collection tab.
- Select Add network rule collection.
- For Name, type databricks-control-plane-services.
- For Priority, type 200.
- For Action, select Allow.
- Configure the following in Rules -> IP Addresses.
| Name | Protocol | Source type | Source | Destination type | Destination Address | Destination Ports |
|---|---|---|---|---|---|---|
| databricks-webapp | TCP | IP Address | Azure Databricks workspace subnets | IP Address | Refer to notes from Prerequisites above (for Central US) | 443 |
| databricks-observability-eventhub | TCP | IP Address | Azure Databricks workspace subnets | IP Address | Refer to notes from Prerequisites above (for Central US) | 9093 |
| databricks-sql-metastore (OPTIONAL; see Step 3 for External Hive Metastore above) | TCP | IP Address | Azure Databricks workspace subnets | IP Address | Refer to notes from Prerequisites above (for Central US) | 3306 |
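The equivalent network rule collection can likewise be sketched in Terraform; again, the firewall, resource group, and subnet variable names are our assumptions.

resource "azurerm_firewall_network_rule_collection" "databricks_control_plane" {
  name                = "databricks-control-plane-services"
  azure_firewall_name = azurerm_firewall.this.name        # assumed firewall resource
  resource_group_name = azurerm_resource_group.this.name  # assumed resource group
  priority            = 200
  action              = "Allow"

  rule {
    name                  = "databricks-webapp"
    protocols             = ["TCP"]
    source_addresses      = var.databricks_subnet_cidrs   # assumed variable
    destination_addresses = local.databricks_webapp_ips   # region-specific webapp IPs from Prerequisites
    destination_ports     = ["443"]
  }

  rule {
    name                  = "databricks-observability-eventhub"
    protocols             = ["TCP"]
    source_addresses      = var.databricks_subnet_cidrs
    destination_addresses = ["<observability-eventhub-ip>"] # resolve the region endpoint noted in Prerequisites
    destination_ports     = ["9093"]
  }
}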
Below is a sample Terraform script (from a UK South deployment) that adds equivalent rules to an Azure Firewall policy.
# Priority range 14150 – 14159
resource "azurerm_firewall_policy_rule_collection_group" "data-archive" {
  count              = var.ENVIRONMENT == "npd" ? 1 : 0
  name               = "${module.names-group-data-office.standard["afw-policy-group"]}-data-archive"
  firewall_policy_id = data.azurerm_firewall_policy.this.id
  priority           = 14150

  # Collection 1
  application_rule_collection {
    name     = "${module.names-data-archive.standard["afw-rule-collection"]}-perimeter"
    priority = 3200
    action   = "Allow"

    # Rule 1
    rule {
      name = "${module.names-data-archive.standard["afw-rule"]}-allow-ado-agents-https-outbound"
      protocols {
        type = "Https"
        port = 443
      }
      terminate_tls    = true
      source_addresses = var.customers.data-archive.vmss_subnets
      destination_urls = [
        "raw.githubusercontent.com/databricks/*", # required for partner Terraform provider download (databricks)
        "github.com/databricks/*",                # required for partner Terraform provider download (databricks)
        "objects.githubusercontent.com/github-production-release-asset" # GitHub redirects to this; there is no way to make a more specific rule as the rest of the URL is a SAS token which changes every time
      ]
    }

    # Rule 2
    rule {
      # Allow Databricks subnets access to Databricks APIs
      name = "${module.names-data-archive.standard["afw-rule"]}-allow-agent-databricks-api-calls"
      protocols {
        type = "Https"
        port = 443
      }
      terminate_tls     = true
      source_addresses  = var.customers.data-archive.databrick_subnets
      destination_fqdns = [
        "*.azuredatabricks.net" # calling the Databricks API for Terraform creation of Databricks objects
      ]
    }

    # Rule 3
    rule {
      # Allow Databricks subnets access to Maven repo URLs – called by Databricks to install Java libraries needed by Spark
      name = "${module.names-data-archive.standard["afw-rule"]}-allow-databricks-maven-calls"
      protocols {
        type = "Https"
        port = 443
      }
      terminate_tls    = true
      source_addresses = var.customers.data-archive.databrick_subnets
      destination_urls = [
        "maven-central.storage-download.googleapis.com/maven2/*",
        "repo1.maven.org/maven2/*",
        "repos.spark-packages.org/*"
      ]
    }
  }

  # Collection 2
  network_rule_collection {
    # Needed for Databricks to work – step 4 of https://databricks.com/blog/2020/03/27/data-exfiltration-protection-with-azure-databricks.html
    # https://docs.microsoft.com/en-us/azure/databricks/administration-guide/cloud-configurations/azure/udr
    name     = "${module.names-data-archive.standard["afw-rule"]}-dataarchive-databricks-net"
    priority = 3100
    action   = "Allow"

    # Rule 1
    rule {
      # Allow Databricks subnets to access the Databricks webapp.
      # An application rule was tried first but failed, complaining that it needed Target FQDNs, Target URLs, FqdnTags or WebCategories.
      name                  = "${module.names-data-archive.standard["afw-rule"]}-allow-dbricks-webapp"
      protocols             = ["TCP"]
      source_addresses      = var.customers.data-archive.databrick_subnets
      destination_addresses = ["51.140.204.4/32"]
      destination_ports     = ["443"]
    }

    # Rule 2
    rule {
      # Allow Databricks subnets to access the observability Event Hub
      name              = "${module.names-data-archive.standard["afw-rule"]}-allow-dbricks-observability"
      protocols         = ["TCP"]
      source_addresses  = var.customers.data-archive.databrick_subnets
      destination_fqdns = ["prod-ukwest-observabilityeventhubs.servicebus.windows.net"] # observability address for uksouth for Databricks
      destination_ports = ["9093"]
    }
  }

  lifecycle {
    create_before_destroy = true
  }
}
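Note that the script above assumes surrounding wiring that isn’t shown: naming modules (module.names-group-data-office, module.names-data-archive), an existing firewall policy, and a var.customers variable. A rough sketch of the assumed declarations, for orientation only:

# Hypothetical declarations matching the references in the script above.
data "azurerm_firewall_policy" "this" {
  name                = "<firewall-policy-name>"   # the policy the rule collection group attaches to
  resource_group_name = "<policy-resource-group>"
}

variable "ENVIRONMENT" {
  type = string # e.g. "npd"; gates creation of the rule collection group
}

variable "customers" {
  # Only the attributes referenced by the script are sketched here.
  type = object({
    data-archive = object({
      vmss_subnets      = list(string)
      databrick_subnets = list(string)
    })
  })
}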
Create User Defined Routes (UDRs)
At this point, the majority of the infrastructure setup for a secure, locked-down deployment has been completed. We now need to route the appropriate traffic from the Azure Databricks workspace subnets to the Control Plane SCC Relay IP (see FAQ below) and the Azure Firewall set up earlier.
- On the Azure portal menu, select All services and search for Route Tables. Go to that section.
- Select Add.
- For Name, type firewall-route.
- For Subscription, select your subscription.
- For the Resource group, select adblabs-rg.
- For Location, select the same location that you used previously, i.e., Central US.
- Select Create.
- Select Refresh, and then select the firewall-route route table.
- Select Routes and then select Add.
- For Route name, add to-firewall.
- For Address prefix, add 0.0.0.0/0.
- For Next hop type, select Virtual appliance.
- For the Next hop address, add the Private IP address for the Azure Firewall that you noted earlier.
- Select OK.
Now add one more route for Azure Databricks SCC Relay IP.
- Select Routes and then select Add.
- For Route name, add to-central-us-databricks-SCC-relay-ip.
- For Address prefix, add the Control Plane SCC relay service IP address for Central US from here. Note that there could be more than one IP address for the relay service; in that case, add an additional route on the UDR for each. To get the SCC relay IP, run nslookup against the region-specific relay service endpoint.
- For Next hop type, select Internet. Although it says Internet, traffic between the Azure Databricks data plane and the Azure Databricks SCC relay service stays on the Azure network and does not travel over the public internet; for more details, please refer to this guide.
- Select OK.
The route table needs to be associated with both of the Azure Databricks workspace subnets.
- Go to the firewall-route route table.
- Select Subnets and then select Associate.
- Select Virtual network > azuredatabricks-spoke-vnet.
- For Subnet, select both workspace subnets.
- Select OK.
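The portal steps above can also be captured in Terraform. A minimal sketch, assuming hypothetical subnet resources and placeholder IPs (use the firewall private IP and SCC relay IP you noted):

resource "azurerm_route_table" "firewall_route" {
  name                          = "firewall-route"
  location                      = "centralus"
  resource_group_name           = "adblabs-rg"
  disable_bgp_route_propagation = true

  route {
    name                   = "to-firewall"
    address_prefix         = "0.0.0.0/0"
    next_hop_type          = "VirtualAppliance"
    next_hop_in_ip_address = "<firewall-private-ip>" # the Azure Firewall private IP noted earlier
  }

  route {
    name           = "to-central-us-databricks-SCC-relay-ip"
    address_prefix = "<scc-relay-ip>/32"             # resolved via nslookup as described above
    next_hop_type  = "Internet"
  }
}

# Associate the route table with both workspace subnets (hypothetical subnet resources).
resource "azurerm_subnet_route_table_association" "public" {
  subnet_id      = azurerm_subnet.databricks_public.id
  route_table_id = azurerm_route_table.firewall_route.id
}

resource "azurerm_subnet_route_table_association" "private" {
  subnet_id      = azurerm_subnet.databricks_private.id
  route_table_id = azurerm_route_table.firewall_route.id
}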
Below is the corresponding route table definition, a JSON fragment from the same UK South deployment as the Terraform script above:
"routeTable": {
"disableBgpRoutePropagation": true,
"routes": [
{
"name": "default-via-fw",
"addressPrefix": "0.0.0.0/0",
"nextHopIpAddress": "10.196.0.4",
"nextHopType": "VirtualAppliance"
},
{
"name": "to-uk-south-databricks-webapp",
"addressPrefix": "51.140.204.4/32",
"nextHopIpAddress": "",
"nextHopType": "Internet"
},
{
"name": "to-uk-south-databricks-scc-relay",
"addressPrefix": "51.141.103.193/32",
"nextHopIpAddress": "",
"nextHopType": "Internet"
},
{
"name": "to-uk-south-databricks-control-plane",
"addressPrefix": "51.140.203.27/32",
"nextHopIpAddress": "",
"nextHopType": "Internet"
},
{
"name": "to-uk-south-databricks-extended-infrastructure",
"addressPrefix": "51.141.64.128/28",
"nextHopIpAddress": "",
"nextHopType": "Internet"
}
]
}