Check Data Drift in Databricks using Evidently and MLflow

Abhishek Sharma · Published in AWS Tip · 4 min read · Oct 3, 2023


Hey guys, in this blog we will see how we can check data drift in Databricks and log the results in an MLflow experiment. This is going to be a very interesting blog, so without further ado, let’s do it…

Read full article here — https://machinelearningprojects.net/check-data-drift-in-databricks/

Step 1 — Installing required libraries

## Installing and importing necessary libraries
# %pip gives a notebook-scoped install on Databricks
%pip install evidently

import json
import mlflow
import numpy as np
import pandas as pd
from datetime import datetime
from evidently.tests import *
from evidently.test_suite import TestSuite
from databricks.feature_store import FeatureStoreClient
pd.set_option('display.max_rows', 100)

Step 2 — Fetching Data from the Feature Store

## Accessing data from feature store

# Create a FeatureStoreClient
feature_store = FeatureStoreClient()

# Specify the name of the feature table to read from
table_name = "your.table.name"

# Read data from the feature store
feature_df = feature_store.read_table(table_name)

# converting to pandas df
feature_df_pds = feature_df.toPandas()
feature_df_pds.head()
  • After importing all the required libraries, we will fetch our data from the Feature Store.
  • To do that, we first need to create a FeatureStoreClient().
  • Then we need to specify the name of the table from which we want to fetch the data.
  • Finally, we will run the feature_store.read_table(table_name) command to read the table from the feature store. This command returns a Spark DataFrame, which we store in feature_df.
  • Then, for easier manipulation, we will convert this Spark DataFrame to a pandas DataFrame using the feature_df.toPandas() command (see the note and sketch just below this list).
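One caveat worth knowing: toPandas() collects the entire table onto the driver node. If your feature table is large, it can help to narrow it down in Spark first. Here is a minimal sketch, assuming a hypothetical timestamp column called event_date and hypothetical feature columns (adjust to your schema):

from pyspark.sql import functions as F

# Keep only recent rows before collecting to the driver
recent_df = feature_df.filter(F.col("event_date") >= "2023-09-01")  # hypothetical column

# Optionally keep only the columns you plan to test for drift
recent_df = recent_df.select("feature_a", "feature_b", "event_date")  # hypothetical columns

feature_df_pds = recent_df.toPandas()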

Output

[Image: output of feature_df_pds.head(), showing the first rows of the feature table]

Step 3 — Creating Reference and Current Data

## creating reference data and current data 
thres = int(0.9*len(feature_df_pds))

ref = feature_df_pds[:thres]
print(ref.shape)

cur = feature_df_pds[thres:]
print(cur.shape)
  • Now we will split this data into two sets: a Reference Dataset and a Current Dataset.
  • We will keep 90% of our data in the Reference Dataset and the remaining 10% in the Current Dataset.
  • We will name them ref and cur respectively (see the time-based variant sketched below).
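Note that this is a purely positional split. In a real monitoring setup the reference set is usually older data and the current set the most recent window. A minimal time-based variant, assuming the same hypothetical event_date timestamp column:

# Sort by time so the reference set is older data and the current set is the newest 10%
df_sorted = feature_df_pds.sort_values("event_date")  # hypothetical column

split_idx = int(0.9 * len(df_sorted))
ref = df_sorted.iloc[:split_idx]
cur = df_sorted.iloc[split_idx:]
print(ref.shape, cur.shape)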

Output

[Image: printed shapes of the ref and cur DataFrames]

Step 4 — Let’s Check Data Drift in Databricks

## Running Drift Test
tests = TestSuite(tests=[TestNumberOfDriftedColumns()])
tests.run(reference_data=ref, current_data=cur)

## Parsing the test results into a dict
res = json.loads(tests.json())

## Creating a DataFrame from the per-feature results for easy visualization
drift = res['tests'][0]['parameters']['features']
driftdf = pd.DataFrame(drift).transpose()
driftdf
  • Now in this step, we will create a TestSuite object and pass in a list of all the tests we want to perform on our data.
  • In our case, we just want to perform the TestNumberOfDriftedColumns() test.
  • We will run the test, passing in our reference and current datasets.
  • Then we simply convert the results into JSON so that we can create a DataFrame out of them for easier visualization.
  • And then we simply print our DataFrame.
  • You can find all the available tests in the Evidently documentation; the sketch below shows a suite with a few more of them.
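For reference, here is a minimal sketch of a suite combining a few more Evidently tests (same ref and cur DataFrames; the "target" column name is a hypothetical placeholder):

from evidently.test_suite import TestSuite
from evidently.tests import (
    TestNumberOfDriftedColumns,
    TestShareOfDriftedColumns,
    TestColumnDrift,
)

suite = TestSuite(tests=[
    TestNumberOfDriftedColumns(),           # how many columns drifted
    TestShareOfDriftedColumns(lt=0.3),      # fail if 30% or more of the columns drifted
    TestColumnDrift(column_name="target"),  # drift test for one specific (hypothetical) column
])
suite.run(reference_data=ref, current_data=cur)
suite.show()  # renders the interactive report inline in the notebook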

Output

[Image: driftdf, the per-feature drift test results]

Step 5 — Logging the results in an MLflow experiment

## Logging this drift report in an MLflow experiment
with mlflow.start_run() as run:
    mlflow.log_param('date', datetime.now().strftime('%y-%m-%d-%H:%M:%S'))
    mlflow.log_param('reference_data', 'ref')
    mlflow.log_param('current_data', 'cur')
    mlflow.log_param('n_features', len(driftdf))
    mlflow.log_param('features', list(driftdf.index))
    mlflow.log_param('n_drifted_features', len(driftdf[driftdf['detected']==True]))
    mlflow.log_param('drifted_features', list(driftdf[driftdf['detected']==True].index))
    mlflow.log_param('drifted_features_p_vals', driftdf[driftdf['detected']==True]['score'].values)
  • Now, finally, we will log all this information in an MLflow experiment.
  • We will start an MLflow run using with mlflow.start_run() as run and log parameters inside it. (A sketch for also logging the full report as an artifact follows the list below.)
  • The following are the parameters we are logging in this experiment:
  • date
  • reference_data
  • current_data
  • n_features
  • features
  • n_drifted_features
  • drifted_features
  • drifted_features_p_vals
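Beyond scalar parameters, you may also want to attach the full drift report to the run. A minimal sketch using the tests object from Step 4 (the file names here are arbitrary):

with mlflow.start_run() as run:
    # Log the raw JSON results as a run artifact
    mlflow.log_dict(json.loads(tests.json()), 'drift_report.json')

    # Evidently can also render the whole suite as a standalone HTML report
    tests.save_html('/tmp/drift_report.html')
    mlflow.log_artifact('/tmp/drift_report.html')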

Output

[Image: the logged parameters in the MLflow experiment UI]

So in this way, you can check data drift in Databricks and log the results in an MLflow experiment. That was all for this blog, guys; hope you enjoyed it…

Read my last article — Easiest way to Create a Databricks Feature Store
