Check Data Drift in Databricks using Evidently and MLflow
Oct 3, 2023
Hey guys, in this blog we will see how we can check Data Drift in Databricks and log the results in an MLflow experiment. This is going to be a very interesting blog, so without further ado, let’s do it…
Read full article here — https://machinelearningprojects.net/check-data-drift-in-databricks/
Step 1 — Installing required libraries
## Importing and installing necessary libraries
!pip install evidently
import json
import mlflow
import numpy as np
import pandas as pd
from datetime import datetime
from evidently.tests import *
from evidently.test_suite import TestSuite
from databricks.feature_store import FeatureStoreClient
pd.set_option('display.max_rows', 100)
Step 2 — Fetching Data from the Feature Store
## Accessing data from feature store
# Create a FeatureStoreClient
feature_store = FeatureStoreClient()
# Specify the name of the feature table to read from
table_name = "your.table.name"
# Read data from the feature store
feature_df = feature_store.read_table(table_name)
# converting to pandas df
feature_df_pds = feature_df.toPandas()
feature_df_pds.head()
- After importing all the required libraries, we will fetch our data from the Feature Store.
- To do that, we first need to create a FeatureStoreClient().
- Then we need to specify the name of the table we want to fetch the data from.
- Finally, we will run the feature_store.read_table(table_name) command to read the table from the feature store. This command returns a Spark DataFrame, which we store in feature_df.
- For easier manipulation, we will convert this Spark DataFrame to a pandas DataFrame using the feature_df.toPandas() method.
Output
Step 3 — Creating Reference and Current Data
## creating reference data and current data
thres = int(0.9*len(feature_df_pds))
ref = feature_df_pds[:thres]
print(ref.shape)
cur = feature_df_pds[thres:]
print(cur.shape)
- Now we will split this data into two sets: a Reference Dataset and a Current Dataset.
- We will keep 90% of our data in the Reference Dataset and the remaining 10% in the Current Dataset.
- We will name them ref and cur respectively.
Output
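As an aside, the 90/10 positional split above is the simplest option; in production drift monitoring you would more often compare an older time window (reference) against the newest one (current). Here is a minimal sketch of a time-based split, using a toy DataFrame with a hypothetical event_ts timestamp column (the column name and data are made up for illustration):

```python
import pandas as pd

# Toy stand-in for feature_df_pds; "event_ts" is a hypothetical timestamp column
df = pd.DataFrame({
    "event_ts": pd.date_range("2023-09-01", periods=100, freq=pd.Timedelta(hours=1)),
    "feature_a": range(100),
})

df = df.sort_values("event_ts")
cutoff = df["event_ts"].iloc[int(0.9 * len(df))]  # timestamp at the 90% mark
ref = df[df["event_ts"] < cutoff]                 # older 90% -> reference
cur = df[df["event_ts"] >= cutoff]                # newest 10% -> current
print(ref.shape, cur.shape)
```

The shapes come out the same as the positional split, but every row in cur is strictly newer than every row in ref, which is usually what "current data" should mean for drift checks.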
Step 4 — Let’s Check Data Drift in Databricks
## Running Drift Test
tests = TestSuite(tests=[TestNumberOfDriftedColumns()])
tests.run(reference_data=ref, current_data=cur)
## Converting to JSON
res = json.loads(tests.json())
## Creating a Dataframe of it for easy visualization
drift = res['tests'][0]['parameters']['features']
driftdf = pd.DataFrame(drift).transpose()
driftdf
- Now in this step, we will create a TestSuite object and pass it a list of all the tests we want to perform on our data.
- In our case, we just want to run the TestNumberOfDriftedColumns() test.
- We will run the test suite, passing in our reference and current datasets.
- Then we simply parse the results as JSON and build a DataFrame out of them for easier visualization.
- Finally, we print our DataFrame.
- You can find the full list of available tests in the Evidently documentation.
Output
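For intuition: under the hood, Evidently picks a statistical test per column (for example a Kolmogorov–Smirnov test for numerical features) and flags the column as drifted when the distributions differ significantly. Purely as an illustration, and not Evidently’s actual implementation, here is a hand-rolled KS statistic comparing a stable column against a shifted one:

```python
import numpy as np

def ks_statistic(a, b):
    """Maximum gap between the two samples' empirical CDFs (the KS statistic, no p-value)."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
same = ks_statistic(rng.normal(0, 1, 500), rng.normal(0, 1, 500))     # no drift
shifted = ks_statistic(rng.normal(0, 1, 500), rng.normal(2, 1, 500))  # mean shifted by 2
print(f"same-distribution KS: {same:.3f}, shifted KS: {shifted:.3f}")
```

The shifted pair produces a much larger statistic than the same-distribution pair; a drift test is essentially this kind of comparison plus a significance threshold.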
Step 5 — Logging the results in an MLflow experiment
## Logging this drift report in an MLflow experiment
with mlflow.start_run() as run:
    mlflow.log_param('date', datetime.now().strftime('%y-%m-%d-%H:%M:%S'))
    mlflow.log_param('reference_data', 'ref')
    mlflow.log_param('current_data', 'cur')
    mlflow.log_param('n_features', len(driftdf))
    mlflow.log_param('features', list(driftdf.index))
    mlflow.log_param('n_drifted_features', len(driftdf[driftdf['detected']==True]))
    mlflow.log_param('drifted_features', list(driftdf[driftdf['detected']==True].index))
    mlflow.log_param('drifted_features_p_vals', driftdf[driftdf['detected']==True]['score'].values)
- Now finally we will log all this information in an MLflow experiment.
- We will start an MLflow run using with mlflow.start_run() as run, and log our parameters inside it.
- The following are the parameters we are logging in this experiment:
- date
- reference_data
- current_data
- n_features
- features
- n_drifted_features
- drifted_features
- drifted_features_p_vals
Output
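The drifted-feature selections in the logging code above boil down to boolean filtering on driftdf’s detected column. A standalone sketch of that pattern with a toy driftdf (the feature names, flags, and scores below are made up):

```python
import pandas as pd

# Toy stand-in for driftdf; feature names, flags, and scores are made up
driftdf = pd.DataFrame(
    {"detected": [True, False, True], "score": [0.01, 0.40, 0.03]},
    index=["age", "income", "tenure"],
)

drifted = driftdf[driftdf["detected"]]  # keep only the columns flagged as drifted
params = {
    "n_features": len(driftdf),
    "features": list(driftdf.index),
    "n_drifted_features": len(drifted),
    "drifted_features": list(drifted.index),
    "drifted_features_p_vals": list(drifted["score"]),
}
print(params)
```

Each key/value pair in params corresponds to one mlflow.log_param call in the snippet above.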
So in this way, you can check Data Drift in Databricks and log the results in an MLflow experiment. This was all for this blog guys, hope you enjoyed it…
Read my last article — Easiest way to Create a Databricks Feature Store