Check Data Drift in Databricks using Evidently and MLflow
Oct 3, 2023
Hey guys, in this blog we will see how we can check Data Drift in Databricks and log the results in an MLflow experiment. This is going to be a very interesting blog, so without further ado, let’s do it…
Read full article here — https://machinelearningprojects.net/check-data-drift-in-databricks/
Step 1 — Installing required libraries
## Importing and installing necessary libraries
!pip install evidently
import json
import mlflow
import numpy as np
import pandas as pd
from datetime import datetime
from evidently.tests import *
from evidently.test_suite import TestSuite
from databricks.feature_store import FeatureStoreClient
pd.set_option('display.max_rows', 100)
Step 2 — Fetching Data from the Feature Store
## Accessing data from feature store
# Create a FeatureStoreClient
feature_store = FeatureStoreClient()
# Specify the name of the feature table to read from
table_name = "your.table.name"
# Read data from the feature store
feature_df = feature_store.read_table(table_name)
# converting to pandas df
feature_df_pds = feature_df.toPandas()
feature_df_pds.head()
- After importing all the required libraries, we will fetch our data from the Feature Store.
- To do that, we first need to create a FeatureStoreClient().
- Then we need to specify the name of the table we want to fetch the data from.
- Finally, we will run the feature_store.read_table(table_name) command to read the table from the feature store. This command returns a Spark DataFrame, which we store in feature_df.
- For easier manipulation, we will convert this Spark DataFrame to a pandas DataFrame using the feature_df.toPandas() method.
Output
Step 3 — Creating Reference and Current Data
## creating reference data and current data
thres = int(0.9*len(feature_df_pds))
ref = feature_df_pds[:thres]
print(ref.shape)
cur = feature_df_pds[thres:]
print(cur.shape)
- Now we will split this data into two sets: a Reference Dataset and a Current Dataset.
- We will keep 90% of our data in the Reference Dataset and the remaining 10% in the Current Dataset.
- We will name them ref and cur respectively.
Output
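As an aside, the 90/10 positional split above is the simplest option; in production drift monitoring you would more often compare an older time window (reference) against the newest one (current). Here is a minimal sketch of a time-based split, using a toy DataFrame with a hypothetical event_ts timestamp column (the column name and data are made up for illustration):

```python
import pandas as pd

# Toy stand-in for feature_df_pds; "event_ts" is a hypothetical timestamp column
df = pd.DataFrame({
    "event_ts": pd.date_range("2023-09-01", periods=100, freq=pd.Timedelta(hours=1)),
    "feature_a": range(100),
})

df = df.sort_values("event_ts")
cutoff = df["event_ts"].iloc[int(0.9 * len(df))]  # timestamp at the 90% mark
ref = df[df["event_ts"] < cutoff]                 # older 90% -> reference
cur = df[df["event_ts"] >= cutoff]                # newest 10% -> current
print(ref.shape, cur.shape)
```

The shapes come out the same as the positional split, but every row in cur is strictly newer than every row in ref, which is usually what "current data" should mean for drift checks.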
Step 4 — Let’s Check Data Drift in Databricks
## Running Drift Test
tests = TestSuite(tests=[TestNumberOfDriftedColumns()])
tests.run(reference_data=ref, current_data=cur)
## Converting to JSON
res = json.loads(tests.json())
## Creating a Dataframe of it for easy visualization
drift = res['tests'][0]['parameters']['features']
driftdf = pd.DataFrame(drift).transpose()
driftdf
- Now in this step, we will create a TestSuite object and pass it a list of all the tests we want to perform on our data.
- In our case, we just want to run the TestNumberOfDriftedColumns() test.
- We will run the test suite, passing in our reference and current datasets.
- Then we simply parse the results as JSON and build a DataFrame out of them for easier visualization.
- Finally, we print our DataFrame.
- You can find the full list of available tests in the Evidently documentation.
Output
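For intuition: under the hood, Evidently picks a statistical test per column (for example a Kolmogorov–Smirnov test for numerical features) and flags the column as drifted when the distributions differ significantly. Purely as an illustration, and not Evidently’s actual implementation, here is a hand-rolled KS statistic comparing a stable column against a shifted one:

```python
import numpy as np

def ks_statistic(a, b):
    """Maximum gap between the two samples' empirical CDFs (the KS statistic, no p-value)."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
same = ks_statistic(rng.normal(0, 1, 500), rng.normal(0, 1, 500))     # no drift
shifted = ks_statistic(rng.normal(0, 1, 500), rng.normal(2, 1, 500))  # mean shifted by 2
print(f"same-distribution KS: {same:.3f}, shifted KS: {shifted:.3f}")
```

The shifted pair produces a much larger statistic than the same-distribution pair; a drift test is essentially this kind of comparison plus a significance threshold.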
Step 5 — Logging the results in an MLflow experiment
## Logging this drift report in an MLflow experiment
with mlflow.start_run() as run:
    mlflow.log_param('date', datetime.now().strftime('%y-%m-%d-%H:%M:%S'))
    mlflow.log_param('reference_data', 'ref')
    mlflow.log_param('current_data', 'cur')
    mlflow.log_param('n_features', len(driftdf))
    mlflow.log_param('features', list(driftdf.index))
    mlflow.log_param('n_drifted_features', len(driftdf[driftdf['detected']==True]))
    mlflow.log_param('drifted_features', list(driftdf[driftdf['detected']==True].index))
    mlflow.log_param('drifted_features_p_vals', driftdf[driftdf['detected']==True]['score'].values)
- Now finally we will log all this information in an MLflow experiment.
- We will start an MLflow run using with mlflow.start_run() as run, and log our parameters inside it.
- The following are the parameters we are logging in this experiment:
- date
- reference_data
- current_data
- n_features
- features
- n_drifted_features
- drifted_features
- drifted_features_p_vals
Output
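The drifted-feature selections in the logging code above boil down to boolean filtering on driftdf’s detected column. A standalone sketch of that pattern with a toy driftdf (the feature names, flags, and scores below are made up):

```python
import pandas as pd

# Toy stand-in for driftdf; feature names, flags, and scores are made up
driftdf = pd.DataFrame(
    {"detected": [True, False, True], "score": [0.01, 0.40, 0.03]},
    index=["age", "income", "tenure"],
)

drifted = driftdf[driftdf["detected"]]  # keep only the columns flagged as drifted
params = {
    "n_features": len(driftdf),
    "features": list(driftdf.index),
    "n_drifted_features": len(drifted),
    "drifted_features": list(drifted.index),
    "drifted_features_p_vals": list(drifted["score"]),
}
print(params)
```

Each key/value pair in params corresponds to one mlflow.log_param call in the snippet above.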
So in this way, you can check Data Drift in Databricks and log the results in an MLflow experiment. This was all for this blog guys, hope you enjoyed it…
Read my last article — Easiest way to Create a Databricks Feature Store