Secure Azure Databricks Deployment

Prerequisites

Note the Azure Databricks control plane endpoints for your workspace, as listed in the Azure Databricks documentation (match them to the region of your workspace). We'll need these details to configure the Azure Firewall rules later.

| Name | Source | Destination | Protocol:Port | Purpose |
| --- | --- | --- | --- | --- |
| databricks-webapp | Azure Databricks workspace subnets | Region specific Webapp Endpoint | https:443 | Communication with Azure Databricks webapp |
| databricks-observability-eventhub | Azure Databricks workspace subnets | Region specific Observability Event Hub Endpoint | https:9093 | Transit for Azure Databricks on-cluster service specific telemetry |
| databricks-artifact-blob-storage | Azure Databricks workspace subnets | Region specific Artifact Blob Storage Endpoint | https:443 | Stores Databricks Runtime images to be deployed on cluster nodes |
| databricks-dbfs | Azure Databricks workspace subnets | DBFS Blob Storage Endpoint | https:443 | Azure Databricks workspace root storage |
| databricks-sql-metastore (OPTIONAL – please see Step 3 for External Hive Metastore below) | Azure Databricks workspace subnets | Region specific SQL Metastore Endpoint | tcp:3306 | Stores metadata for databases and child objects in an Azure Databricks workspace |

Configure Azure Firewall Rules

With Azure Firewall, you can configure:

    • Application rules that define fully qualified domain names (FQDNs) that can be accessed from a subnet.
    • Network rules that define source address, protocol, destination port, and destination address.

Network traffic is subjected to the configured firewall rules when you route your network traffic to the firewall as the subnet default gateway.

Configure Application Rule

We first need to configure application rules to allow outbound access to Log Blob Storage and Artifact Blob Storage endpoints in the Azure Databricks control plane plus the DBFS Root Blob Storage for the workspace.

    • Go to the resource group, and select the firewall.
    • On the firewall page, under Settings, select Rules.
    • Select the Application rule collection tab.
    • Select Add application rule collection.
    • For Name, type databricks-control-plane-services.
    • For Priority, type 200.
    • For Action, select Allow.
    • Configure the following in Rules -> Target FQDNs
| Name | Source type | Source | Protocol:Port | Target FQDNs |
| --- | --- | --- | --- | --- |
| databricks-spark-log-blob-storage | IP Address | Azure Databricks workspace subnets | https:443 | Refer notes from Prerequisites above (for Central US) |
| databricks-audit-log-blob-storage | IP Address | Azure Databricks workspace subnets | https:443 | Refer notes from Prerequisites above (for Central US). This is separate log storage only for US regions today |
| databricks-artifact-blob-storage | IP Address | Azure Databricks workspace subnets | https:443 | Refer notes from Prerequisites above (for Central US) |
| databricks-dbfs | IP Address | Azure Databricks workspace subnets | https:443 | Refer notes from Prerequisites above |
| Public Repositories for Python and R Libraries (OPTIONAL – if workspace users are allowed to install libraries from public repos) | IP Address | Azure Databricks workspace subnets | https:443 | *pypi.org, *pythonhosted.org, cran.r-project.org. Add any other public repos as desired |
| Used by Ganglia UI | IP Address | Azure Databricks workspace subnets | https:443 | cdnjs.com or cdnjs.cloudflare.com |

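
The portal steps above could also be expressed in Terraform. The following is a minimal sketch, assuming a classic (rules-based) Azure Firewall and an already configured azurerm provider. The resource label and every variable name are hypothetical, and the FQDN values are deliberately left as inputs because they must come from the Prerequisites notes for your region.

```hcl
# Hypothetical inputs: none of these variable names come from the walkthrough.
variable "firewall_name" {
  type        = string
  description = "Name of the existing Azure Firewall"
}

variable "firewall_resource_group_name" {
  type        = string
  description = "Resource group that holds the firewall"
}

variable "databricks_subnet_cidrs" {
  type        = list(string)
  description = "Address ranges of the Azure Databricks workspace subnets"
}

variable "control_plane_fqdns" {
  type        = map(list(string))
  description = "Rule name => target FQDNs taken from the Prerequisites notes"
}

resource "azurerm_firewall_application_rule_collection" "databricks_control_plane" {
  name                = "databricks-control-plane-services"
  azure_firewall_name = var.firewall_name
  resource_group_name = var.firewall_resource_group_name
  priority            = 200
  action              = "Allow"

  # One rule per map entry, e.g. databricks-artifact-blob-storage or databricks-dbfs.
  dynamic "rule" {
    for_each = var.control_plane_fqdns
    content {
      name             = rule.key
      source_addresses = var.databricks_subnet_cidrs
      target_fqdns     = rule.value

      protocol {
        port = 443
        type = "Https"
      }
    }
  }
}
```

Keeping the FQDNs in a map means a region change only touches the input values, not the rule definition.
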
Configure Network Rule

Some endpoints, namely the Observability Event Hub and Webapp, can't be configured as application rules using FQDNs, so we'll set those up as network rules.

    • Open the resource group adblabs-rg, and select the firewall.
    • On the firewall page, under Settings, select Rules.
    • Select the Network rule collection tab.
    • Select Add network rule collection.
    • For Name, type databricks-control-plane-services.
    • For Priority, type 200.
    • For Action, select Allow.
    • Configure the following in Rules -> IP Addresses.
| Name | Protocol | Source type | Source | Dest type | Dest Address | Dest Ports |
| --- | --- | --- | --- | --- | --- | --- |
| databricks-webapp | TCP | IP Address | Azure Databricks workspace subnets | IP Address | Refer notes from Prerequisites above (for Central US) | 443 |
| databricks-observability-eventhub | TCP | IP Address | Azure Databricks workspace subnets | IP Address | Refer notes from Prerequisites above (for Central US) | 9093 |
| databricks-sql-metastore (OPTIONAL – please see Step 3 for External Hive Metastore above) | TCP | IP Address | Azure Databricks workspace subnets | IP Address | Refer notes from Prerequisites above (for Central US) | 3306 |


Below is a Terraform script that adds rules to the Azure Firewall.

# Priority range: 14150 - 14159
resource "azurerm_firewall_policy_rule_collection_group" "data-archive" {
  count              = var.ENVIRONMENT == "npd" ? 1 : 0
  name               = "${module.names-group-data-office.standard["afw-policy-group"]}-data-archive"
  firewall_policy_id = data.azurerm_firewall_policy.this.id
  priority           = 14150

  # Col 1
  application_rule_collection {
    name     = "${module.names-data-archive.standard["afw-rule-collection"]}-perimeter"
    priority = 3200
    action   = "Allow"

    # Rule 1
    rule {
      name = "${module.names-data-archive.standard["afw-rule"]}-allow-ado-agents-https-outbound"
      protocols {
        type = "Https"
        port = 443
      }
      terminate_tls    = true
      source_addresses = var.customers.data-archive.vmss_subnets
      destination_urls = [
        "raw.githubusercontent.com/databricks/*", # required for partner terraform provider download (databricks)
        "github.com/databricks/*",                # required for partner terraform provider download (databricks)
        "objects.githubusercontent.com/github-production-release-asset*" # github redirects to this. There is no way to make a more specific rule, as the rest is a SAS token which changes every time
      ]
    }

    # Rule 2
    rule {
      # Allow databricks subnets access to databricks APIs
      name = "${module.names-data-archive.standard["afw-rule"]}-allow-agent-databricks-api-calls"
      protocols {
        type = "Https"
        port = 443
      }
      terminate_tls    = true
      source_addresses = var.customers.data-archive.databrick_subnets
      destination_fqdns = [
        "*.azuredatabricks.net" # Calling databricks API for terraform creation of databricks objects
      ]
    }

    # Rule 3
    rule {
      # Allow databricks subnets access to Maven repo URLs, called by Databricks to install Java libraries needed by Spark
      name = "${module.names-data-archive.standard["afw-rule"]}-allow-databricks-maven-calls"
      protocols {
        type = "Https"
        port = 443
      }
      terminate_tls    = true
      source_addresses = var.customers.data-archive.databrick_subnets
      destination_urls = [
        "maven-central.storage-download.googleapis.com/maven2/*",
        "repo1.maven.org/maven2/*",
        "repos.spark-packages.org/*"
      ]
    }
  }

  # Col 2
  network_rule_collection {
    # Needed for dbricks to work, see step 4 of https://databricks.com/blog/2020/03/27/data-exfiltration-protection-with-azure-databricks.html
    # https://docs.microsoft.com/en-us/azure/databricks/administration-guide/cloud-configurations/azure/udr
    name     = "${module.names-data-archive.standard["afw-rule"]}-dataarchive-databricks-net"
    priority = 3100
    action   = "Allow"

    # Rule 1
    rule {
      # Allow databricks subnets to access the databricks webapp
      # Tried to use an application rule but it failed, complaining that it needed Target Fqdns, Target Urls, FqdnTags or WebCategories.
      name                  = "${module.names-data-archive.standard["afw-rule"]}-allow-dbricks-webapp"
      protocols             = ["TCP"]
      source_addresses      = var.customers.data-archive.databrick_subnets
      destination_addresses = ["51.140.204.4/32"]
      destination_ports     = ["443"]
    }

    # Rule 2
    rule {
      # Allow databricks subnets to access the observability hub
      name              = "${module.names-data-archive.standard["afw-rule"]}-allow-dbricks-observability"
      protocols         = ["TCP"]
      source_addresses  = var.customers.data-archive.databrick_subnets
      destination_fqdns = ["prod-ukwest-observabilityeventhubs.servicebus.windows.net"] # observability address for uksouth for databricks
      destination_ports = ["9093"]
    }
  }

  lifecycle {
    create_before_destroy = true
  }
}

Create User Defined Routes (UDRs)

At this point, the majority of the infrastructure setup for a secure, locked-down deployment has been completed. We now need to route the appropriate traffic from the Azure Databricks workspace subnets to the Control Plane SCC Relay IP (see FAQ below) and to the Azure Firewall set up earlier.

    • On the Azure portal menu, select All services and search for Route Tables. Go to that section.
    • Select Add.
    • For Name, type firewall-route-table.
    • For Subscription, select your subscription.
    • For the Resource group, select adblabs-rg.
    • For Location, select the same location that you used previously, i.e. Central US.
    • Select Create.
    • Select Refresh, and then select the firewall-route-table route table.
    • Select Routes and then select Add.
    • For Route name, add to-firewall.
    • For Address prefix, add 0.0.0.0/0.
    • For Next hop type, select Virtual appliance.
    • For the Next hop address, add the Private IP address for the Azure Firewall that you noted earlier.
    • Select OK.

Now add one more route for Azure Databricks SCC Relay IP.

    • Select Routes and then select Add.
    • For Route name, add to-central-us-databricks-SCC-relay-ip.
    • For Address prefix, add the Control Plane SCC relay service IP address for Central US, as listed in the Azure Databricks documentation. Note that there could be more than one IP address for the relay service; in that case, add additional routes to the UDR accordingly. To get the SCC relay IP, run nslookup on the relay service endpoint.
    • For Next hop type, select Internet. Although it says Internet, traffic between the Azure Databricks data plane and the Azure Databricks SCC relay service IP stays on the Azure network and does not travel over the public internet (for more details, refer to the Azure documentation on virtual network traffic routing).
    • Select OK.
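
Because the relay service may resolve to more than one IP address, the lookup and the per-address routes could be automated. The following is a rough sketch, assuming the hashicorp/dns provider is acceptable in your environment and the azurerm provider is already configured. The relay hostname is not given here, so it is left as an input, and all variable names and the route name prefix are hypothetical.

```hcl
terraform {
  required_providers {
    azurerm = { source = "hashicorp/azurerm" }
    dns     = { source = "hashicorp/dns" }
  }
}

# Hypothetical inputs: supply the values for your own region and route table.
variable "scc_relay_fqdn" {
  type        = string
  description = "SCC relay service endpoint for the workspace region"
}

variable "relay_route_table_name" {
  type        = string
  description = "Name of the route table created in the steps above"
}

variable "relay_route_resource_group" {
  type        = string
  description = "Resource group that holds the route table"
}

# Equivalent of running nslookup on the relay endpoint: returns its IPv4 addresses.
data "dns_a_record_set" "scc_relay" {
  host = var.scc_relay_fqdn
}

# One /32 route per resolved address, sent straight to the Internet next hop.
resource "azurerm_route" "scc_relay" {
  for_each            = toset(data.dns_a_record_set.scc_relay.addrs)
  name                = "to-databricks-scc-relay-${replace(each.value, ".", "-")}"
  resource_group_name = var.relay_route_resource_group
  route_table_name    = var.relay_route_table_name
  address_prefix      = "${each.value}/32"
  next_hop_type       = "Internet"
}
```

The addresses are resolved each time Terraform plans, so a change in the relay's DNS records would show up as a route diff rather than a silent outage.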

The route table needs to be associated with both of the Azure Databricks workspace subnets.

    • Go to the firewall-route-table.
    • Select Subnets and then select Associate.
    • Select Virtual network > azuredatabricks-spoke-vnet.
    • For Subnet, select both workspace subnets.
    • Select OK.

Below is the corresponding route table definition used with the Terraform code (a JSON fragment):

    "routeTable": {
      "disableBgpRoutePropagation": true,
      "routes": [
        {
          "name": "default-via-fw",
          "addressPrefix": "0.0.0.0/0",
          "nextHopIpAddress": "10.196.0.4",
          "nextHopType": "VirtualAppliance"
        },
        {
          "name": "to-uk-south-databricks-webapp",
          "addressPrefix": "51.140.204.4/32",
          "nextHopIpAddress": "",
          "nextHopType": "Internet"
        },
        {
          "name": "to-uk-south-databricks-scc-relay",
          "addressPrefix": "51.141.103.193/32",
          "nextHopIpAddress": "",
          "nextHopType": "Internet"
        },
        {
          "name": "to-uk-south-databricks-control-plane",
          "addressPrefix": "51.140.203.27/32",
          "nextHopIpAddress": "",
          "nextHopType": "Internet"
        },
        {
          "name": "to-uk-south-databricks-extended-infrastructure",
          "addressPrefix": "51.141.64.128/28",
          "nextHopIpAddress": "",
          "nextHopType": "Internet"
        }
      ]
    }
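
The same route table could also be written directly as Terraform resources. The following is a minimal sketch, assuming an already configured azurerm provider. The resource labels and variable names are hypothetical, and the BGP route propagation setting is left out because its argument name differs between azurerm provider versions.

```hcl
# Hypothetical inputs: adjust names and values to your environment.
variable "route_table_resource_group" {
  type = string
}

variable "route_table_location" {
  type = string
}

variable "firewall_private_ip" {
  type        = string
  description = "Private IP address of the Azure Firewall"
}

variable "direct_routes" {
  type        = map(string)
  description = "Route name => control plane address prefix that bypasses the firewall"
}

variable "databricks_subnet_ids" {
  type        = map(string)
  description = "Static key => resource ID of each Azure Databricks workspace subnet"
}

resource "azurerm_route_table" "databricks" {
  name                = "firewall-route-table"
  location            = var.route_table_location
  resource_group_name = var.route_table_resource_group
}

# Default route: everything goes through the firewall.
resource "azurerm_route" "to_firewall" {
  name                   = "to-firewall"
  resource_group_name    = var.route_table_resource_group
  route_table_name       = azurerm_route_table.databricks.name
  address_prefix         = "0.0.0.0/0"
  next_hop_type          = "VirtualAppliance"
  next_hop_in_ip_address = var.firewall_private_ip
}

# Control plane prefixes (for example the SCC relay IP) that use the Internet next hop.
resource "azurerm_route" "direct" {
  for_each            = var.direct_routes
  name                = each.key
  resource_group_name = var.route_table_resource_group
  route_table_name    = azurerm_route_table.databricks.name
  address_prefix      = each.value
  next_hop_type       = "Internet"
}

# Associate the route table with both workspace subnets.
resource "azurerm_subnet_route_table_association" "databricks" {
  for_each       = var.databricks_subnet_ids
  subnet_id      = each.value
  route_table_id = azurerm_route_table.databricks.id
}
```

For example, setting direct_routes to { "to-uk-south-databricks-scc-relay" = "51.141.103.193/32" } would reproduce one of the routes from the JSON above.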