
Databricks Lakehouse

Overview

This destination syncs data to Delta Lake on Databricks Lakehouse. Each stream is written to its own Delta table.

Caution: You must use Unity Catalog with this connector.

Info: OAuth2 authentication is currently supported only in AWS deployments. If you run Databricks on GCP, you must use an access token.

This connector requires a JDBC driver to connect to the Databricks cluster. By using the driver and the connector, you agree to the JDBC ODBC driver license. Under that license, you can only use this connector to connect third-party applications to Apache Spark SQL within a Databricks offering using the ODBC and/or JDBC protocols.

Airbyte Setup

When setting up a Databricks destination, you need these pieces of information:

Server Hostname / HTTP Path / Port

  1. Open the workspace console.

  2. Open your SQL warehouse.

  3. Open the Connection Details tab and note the Server Hostname, HTTP Path, and Port.

  4. Finally, provide the Databricks Unity Catalog Path, which is the path to the database you wish to use within Unity Catalog. This is often the same as the workspace name.
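
The Connection Details values are what a JDBC client needs to reach the warehouse. As a rough sketch (not the connector's actual code), the snippet below shows how they could be combined into a Databricks JDBC URL. The helper name `build_jdbc_url` and the example hostname and path are hypothetical; the `ssl=1` setting mirrors the 0.3.1 changelog entry, and credentials are deliberately left out.

```python
def build_jdbc_url(server_hostname: str, http_path: str, port: int = 443) -> str:
    """Assemble a Databricks JDBC URL from the Connection Details values.

    Hypothetical helper for illustration only; it is not taken from the
    connector source and omits authentication properties.
    """
    host = server_hostname.strip()
    path = http_path.strip()
    if not host or not path:
        raise ValueError("server hostname and HTTP path are both required")
    if "://" in host:
        # The Connection Details tab shows a bare hostname, not a URL.
        raise ValueError("pass the bare hostname, without a scheme such as https://")
    # ssl=1 requests an encrypted connection.
    return f"jdbc:databricks://{host}:{port};httpPath={path};ssl=1"


if __name__ == "__main__":
    # Placeholder values; substitute the ones from your own warehouse.
    print(build_jdbc_url("example.cloud.databricks.com", "/sql/1.0/warehouses/abc123"))
```

A common mistake is pasting the full workspace URL into the hostname field, which is why the sketch rejects values containing a scheme.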

Authentication

To authenticate with OAuth2, follow the instructions in the Databricks documentation to generate a client ID and secret.

To authenticate with an access token instead:

  1. Open your workspace console.

  2. Click your icon in the top-right corner, go to Settings, then Developer, then click Manage under Access tokens.

  3. Enter a description for the token and how long it should remain valid (or leave the lifetime blank for a permanent token).

Other Options

  • Default Schema - The schema that will contain your data. You can later override this on a per-connection basis.
  • Purge Staging Files and Tables - Whether Airbyte should delete files after loading them into tables. Note: if deselected, Databricks will still delete your files after your retention period has passed (default - 7 days).

Sync Mode

| Feature | Support | Notes |
| --- | --- | --- |
| Full Refresh Sync | | Warning: this mode deletes all previously synced data in the configured bucket path. |
| Incremental - Append Sync | | |
| Incremental - Append + Deduped | | |
| Namespaces | | |
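
To make the difference between the sync modes concrete, here is a minimal in-memory sketch of their general semantics. It is an illustration under assumptions, not the connector's implementation: the function names, the `id` primary key, and the `updated_at` cursor field are all hypothetical.

```python
from typing import Any

Record = dict[str, Any]


def full_refresh(existing: list[Record], incoming: list[Record]) -> list[Record]:
    """Full Refresh: previously synced rows are discarded and replaced."""
    return list(incoming)


def append(existing: list[Record], incoming: list[Record]) -> list[Record]:
    """Incremental - Append: every incoming row is added; old versions remain."""
    return existing + incoming


def append_deduped(
    existing: list[Record],
    incoming: list[Record],
    primary_key: str,
    cursor: str,
) -> list[Record]:
    """Incremental - Append + Deduped: keep one row per primary key,
    choosing the row with the highest cursor value."""
    latest: dict[Any, Record] = {}
    for row in existing + incoming:
        key = row[primary_key]
        if key not in latest or row[cursor] >= latest[key][cursor]:
            latest[key] = row
    return list(latest.values())


if __name__ == "__main__":
    old = [{"id": 1, "updated_at": 1, "name": "a"}]
    new = [
        {"id": 1, "updated_at": 2, "name": "b"},
        {"id": 2, "updated_at": 2, "name": "c"},
    ]
    print(len(append(old, new)))                        # 3 rows, both versions of id 1
    print(append_deduped(old, new, "id", "updated_at"))  # one row per id
```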

Output Schema

Each table will have the following columns, in addition to whatever columns were in your data:

| Column | Type | Notes |
| --- | --- | --- |
| _airbyte_raw_id | string | A random UUID. |
| _airbyte_extracted_at | timestamp | Timestamp when the source read the record. |
| _airbyte_loaded_at | timestamp | Timestamp when the record was written to the destination. |
| _airbyte_generation_id | bigint | See the refreshes documentation. |
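
As an illustration of how these columns sit alongside your own data, the sketch below builds one output row in plain Python. The function name `to_final_row` and the sample record are hypothetical, and the code only mirrors the column descriptions above; it is not how the connector itself writes rows.

```python
import uuid
from datetime import datetime, timezone


def to_final_row(record: dict, extracted_at: datetime, generation_id: int) -> dict:
    """Return the source record plus the four Airbyte metadata columns."""
    return {
        **record,
        "_airbyte_raw_id": str(uuid.uuid4()),              # a random UUID
        "_airbyte_extracted_at": extracted_at,             # when the source read the record
        "_airbyte_loaded_at": datetime.now(timezone.utc),  # when the row was written
        "_airbyte_generation_id": generation_id,           # see the refreshes documentation
    }


if __name__ == "__main__":
    extracted = datetime(2024, 1, 1, tzinfo=timezone.utc)
    row = to_final_row({"id": 1, "name": "example"}, extracted, generation_id=0)
    for column, value in row.items():
        print(f"{column}: {value}")
```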

Airbyte will also produce "raw tables" (by default in the airbyte_internal schema). We do not recommend directly interacting with the raw tables, and their format is subject to change without notice.

Changelog

| Version | Date | Pull Request | Subject |
| --- | --- | --- | --- |
| 3.3.2 | 2024-12-18 | #49898 | Use a base image: airbyte/java-connector-base:1.0.0 |
| 3.3.1 | 2024-12-02 | #48779 | bump resource reqs for check |
| 3.3.0 | 2024-09-18 | #45438 | upgrade all dependencies. |
| 3.2.5 | 2024-09-12 | #45439 | Move to integrations section. |
| 3.2.4 | 2024-09-09 | #45208 | Fix CHECK to create missing namespace if not exists. |
| 3.2.3 | 2024-09-03 | #45115 | Clarify Unity Catalog Name option. |
| 3.2.2 | 2024-08-22 | #44941 | Clarify Unity Catalog Path option. |
| 3.2.1 | 2024-08-22 | #44506 | Handle uppercase/mixed-case stream name/namespaces |
| 3.2.0 | 2024-08-12 | #40712 | Rely solely on PAT, instead of also needing a user/pass |
| 3.1.0 | 2024-07-22 | #40692 | Support for refreshes and resumable full refresh. WARNING: You must upgrade to platform 0.63.7 before upgrading to this connector version. |
| 3.0.0 | 2024-07-12 | #40689 | (Private release, not to be used for production) Add _airbyte_generation_id column, and sync_id entry in _airbyte_meta |
| 2.0.0 | 2024-05-17 | #37613 | (Private release, not to be used for production) Alpha release of the connector to use Unity Catalog |
| 1.1.2 | 2024-04-04 | #36846 | (incompatible with CDK, do not use) Remove duplicate S3 Region |
| 1.1.1 | 2024-01-03 | #33924 | (incompatible with CDK, do not use) Add new ap-southeast-3 AWS region |
| 1.1.0 | 2023-06-02 | #26942 | Support schema evolution |
| 1.0.2 | 2023-04-20 | #25366 | Fix default catalog to be hive_metastore |
| 1.0.1 | 2023-03-30 | #24657 | Fix support for external tables on S3 |
| 1.0.0 | 2023-03-21 | #23965 | Added: Managed table storage type, Databricks Catalog field |
| 0.3.1 | 2022-10-15 | #18032 | Add SSL=1 to the JDBC URL to ensure SSL connection. |
| 0.3.0 | 2022-10-14 | #15329 | Add support for Azure storage. |
| | 2022-09-01 | #16243 | Fix Json to Avro conversion when there is field name clash from combined restrictions (anyOf, oneOf, allOf fields) |
| 0.2.6 | 2022-08-05 | #14801 | Fix multiply log bindings |
| 0.2.5 | 2022-07-15 | #14494 | Make S3 output filename configurable. |
| 0.2.4 | 2022-07-14 | #14618 | Removed additionalProperties: false from JDBC destination connectors |
| 0.2.3 | 2022-06-16 | #13852 | Updated stacktrace format for any trace message errors |
| 0.2.2 | 2022-06-13 | #13722 | Rename to "Databricks Lakehouse". |
| 0.2.1 | 2022-06-08 | #13630 | Rename to "Databricks Delta Lake" and add field orders in the spec. |
| 0.2.0 | 2022-05-15 | #12861 | Use new public Databricks JDBC driver, and open source the connector. |
| 0.1.5 | 2022-05-04 | #12578 | In JSON to Avro conversion, log JSON field values that do not follow Avro schema for debugging. |
| 0.1.4 | 2022-02-14 | #10256 | Add -XX:+ExitOnOutOfMemoryError JVM option |
| 0.1.3 | 2022-01-06 | #7622 #9153 | Upgrade Spark JDBC driver to 2.6.21 to patch Log4j vulnerability; update connector fields title/description. |
| 0.1.2 | 2021-11-03 | #7288 | Support Json additionalProperties. |
| 0.1.1 | 2021-10-05 | #6792 | Require users to accept Databricks JDBC Driver Terms & Conditions. |
| 0.1.0 | 2021-09-14 | #5998 | Initial private release. |