utils module

Utils for ML

utils.about_dataframe(df)

Describe a DataFrame and show its information

Parameters:df (DataFrame) – Pandas DataFrame to describe and summarize
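
As a rough illustration of the kind of output to expect, the sketch below calls the two pandas methods the description alludes to. It is an assumption that about_dataframe wraps describe() and info(); the sample data is made up.

```python
import pandas as pd

# Made-up sample data for illustration only
df = pd.DataFrame({"age": [34, 51, 29], "gender": ["M", "F", "F"]})

# Summary statistics for every column, numeric or not
summary = df.describe(include="all")
print(summary)

# Column dtypes, non-null counts and memory usage (prints directly, returns None)
df.info()
```
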
utils.columns_info(df, cat_count_threshold, show_group_counts=False)

Prints and returns column info for a given dataframe

Parameters:
  • df (DataFrame) – Pandas DataFrame
  • cat_count_threshold (int) – If a column in the dataframe has unique value count less than this threshold then it will be tagged as ‘categorical’
  • show_group_counts (boolean) – If True then prints the individual group counts for each column

Example

>>> object_cat_cols, numeric_cat_cols, numeric_cols = utils.columns_info(
...     data,
...     cat_count_threshold=5,
...     show_group_counts=True)
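
The threshold rule can be sketched with plain pandas. This is not the module's implementation: classify_columns is a hypothetical helper, and treating every non-numeric column as categorical is an assumption made here so that the three returned lists match the names used in the example above.

```python
import pandas as pd


def classify_columns(df, cat_count_threshold):
    """Hypothetical helper: group columns by dtype and unique value count."""
    object_cat_cols, numeric_cat_cols, numeric_cols = [], [], []
    for name in df.columns:
        if not pd.api.types.is_numeric_dtype(df[name]):
            # Assumption: non-numeric columns are always treated as categorical
            object_cat_cols.append(name)
        elif df[name].nunique() < cat_count_threshold:
            # Numeric column with few distinct values: tagged as categorical
            numeric_cat_cols.append(name)
        else:
            numeric_cols.append(name)
    return object_cat_cols, numeric_cat_cols, numeric_cols


# Made-up sample data for illustration only
data = pd.DataFrame({
    "Gender": ["M", "F", "F", "M", "F", "M"],
    "Diabetes": [0, 1, 0, 0, 1, 1],
    "Age": [34, 51, 29, 62, 45, 38],
})
object_cat_cols, numeric_cat_cols, numeric_cols = classify_columns(
    data, cat_count_threshold=5)
print(object_cat_cols, numeric_cat_cols, numeric_cols)
```

Here Diabetes has 2 unique values (below the threshold of 5) and Age has 6, so they land in different groups.
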
utils.count_compare_plots(df1, df1_title, df2, df2_title, column, **kwargs)

Show count plots of two DataFrames for comparison

Can be used to compare how filling NA values affects the distribution of a column

Example

The example below uses the NHANES dataset.

>>> for each_column in object_cat_columns:
...     data[each_column] = data[each_column].fillna(
...         data.groupby(['Gender'])[each_column].ffill())
>>> for each_column in object_cat_columns:
...     str_count_of_nas = str(len(
...         data_raw.index[data_raw.isnull()[each_column]]))
...     str_count_of_nas = ' (Count of NAs:' + str_count_of_nas + ')'
...     utils.count_compare_plots(df1=data_raw,
...                               df1_title='Before Fill-NA' + str_count_of_nas,
...                               df2=data,
...                               df2_title='After Fill-NA',
...                               column=each_column,
...                               height=4,
...                               aspect=1.5,
...                               hue_column='Diabetes',
...                               split_plots_by='Gender')
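
The same before/after comparison can be checked numerically without plotting. The sketch below reuses the group-wise forward fill from the example above on a small made-up frame; the Smoker column and its values are hypothetical.

```python
import pandas as pd

# Made-up sample data for illustration only
data_raw = pd.DataFrame({
    "Gender": ["M", "M", "F", "F", "M", "F"],
    "Smoker": ["yes", None, "no", None, None, "no"],
})

data = data_raw.copy()
# Forward-fill missing values within each Gender group
data["Smoker"] = data["Smoker"].fillna(
    data.groupby("Gender")["Smoker"].ffill())

# Category counts before and after the fill, side by side
counts = pd.DataFrame({
    "before": data_raw["Smoker"].value_counts(),
    "after": data["Smoker"].value_counts(),
})
nas_before = data_raw["Smoker"].isnull().sum()
nas_after = data["Smoker"].isnull().sum()
print(counts)
print("NAs before:", nas_before, "NAs after:", nas_after)
```
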
utils.count_plots(df, columns, **kwargs)

Count Plots using seaborn

Display Count plots for the given columns in a DataFrame

Parameters:
  • df (DataFrame) – Pandas DataFrame
  • columns (array-like) – Columns for which count plot has to be shown
  • **kwargs – Keyword arguments.
Keyword Arguments:
 
  • hue_column (str) – Column used for color coding
  • split_plots_by (str) – Split seaborn FacetGrid by column, such as Gender
  • height (float) – Sets the height of the plot
  • aspect (float) – Determines the width of the plot based on height

Example

>>> utils.count_plots(data, object_cat_cols, height=4, aspect=1.5)
utils.dist_plots(df, columns, **kwargs)

Dist Plots using seaborn

Parameters:
  • df (DataFrame) – Pandas DataFrame.
  • columns ([str]) – Plot only for selected columns.
  • **kwargs – Keyword arguments.
Keyword Arguments:
 
  • hue_column (str) – Column used for color coding
  • split_plots_by (str) – Split seaborn facetgrid by column such as Gender
  • height (float) – Sets the height of plot
  • aspect (float) – Determines the width of the plot based on height

Example

>>> utils.dist_plots(data, numeric_cols, height=4, aspect=1.5,
>>>    hue_column='class', kde=False)

Returns: Nothing

utils.do_cross_validate(X, y, estimator_type, estimator, cv, **kwargs)

Cross Validate (sklearn)

Example

>>> from sklearn.model_selection import ShuffleSplit
>>> cv_iterator = ShuffleSplit(n_splits=2, test_size=0.2, random_state=31)
>>> cv_results = utils.do_cross_validate(X_train,
...                                      y_train,
...                                      'Classification',
...                                      'DecisionTreeClassifier',
...                                      cv=cv_iterator,
...                                      kernel='rbf',
...                                      C=1,
...                                      gamma=0.01)
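
For comparison, a cross-validation run written directly against scikit-learn might look like the sketch below. It is an assumption that do_cross_validate delegates to sklearn.model_selection.cross_validate; the synthetic dataset, the max_depth setting and the accuracy scorer are chosen here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for X_train / y_train
X_train, y_train = make_classification(
    n_samples=200, n_features=5, random_state=31)

# Two random 80/20 splits, as in the example above
cv_iterator = ShuffleSplit(n_splits=2, test_size=0.2, random_state=31)
estimator = DecisionTreeClassifier(max_depth=3, random_state=31)

cv_results = cross_validate(
    estimator, X_train, y_train, cv=cv_iterator, scoring="accuracy")
print(cv_results["test_score"])  # one accuracy value per split
```
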
utils.do_feature_selection(X, y, method, num_of_features=None)

utils.do_outlier_detection(df, target_column, outlier_classes, method, **kwargs)
utils.do_scaling(df, method, columns_to_scale=[])

Scale data using the specified method

Columns specified in the arguments will be scaled

Parameters:
  • df (DataFrame) – Pandas DataFrame
  • method – Scaling method to apply
  • columns_to_scale (array-like) – List of columns that will be scaled
Returns:

df (DataFrame)
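
The accepted values of method are not listed here. As a sketch of column-wise scaling in general, the hypothetical scale_columns below supports two made-up method names, "minmax" and "standard", and leaves all other columns untouched.

```python
import pandas as pd


def scale_columns(df, method, columns_to_scale):
    """Hypothetical stand-in: return a copy of df with the chosen columns scaled."""
    scaled = df.copy()
    for name in columns_to_scale:
        col = scaled[name].astype(float)
        if method == "minmax":
            # Rescale to the [0, 1] range
            scaled[name] = (col - col.min()) / (col.max() - col.min())
        elif method == "standard":
            # Zero mean, unit variance (population standard deviation)
            scaled[name] = (col - col.mean()) / col.std(ddof=0)
        else:
            raise ValueError(f"Unknown method: {method}")
    return scaled


# Made-up sample data for illustration only
df = pd.DataFrame({"age": [20, 30, 40], "label": ["a", "b", "c"]})
scaled = scale_columns(df, "minmax", ["age"])
print(scaled)
```
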

utils.encode_columns(df, method, columns=[])

utils.fill_null_values(df, column, value, row_index)

Fill null values in a dataframe column

Parameters:
  • df (DataFrame) – Pandas DataFrame that will be updated
  • column (str) – Column in the target dataframe that will be updated
  • value (Union[int, str, object]) – New value that will replace null values
  • row_index (Union[Index, array-like]) – Index of rows to be updated
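
One plausible reading of these parameters is a label-based assignment with DataFrame.loc. The sketch below is an assumption about the behavior, not the module's code; fill_nulls_at is a hypothetical name and the median is just one possible fill value.

```python
import numpy as np
import pandas as pd


def fill_nulls_at(df, column, value, row_index):
    """Hypothetical stand-in: write value into df[column] for the given rows, in place."""
    df.loc[row_index, column] = value


# Made-up sample data for illustration only
df = pd.DataFrame({"Age": [25.0, np.nan, 40.0, np.nan]})

# Index labels of the rows where Age is missing
null_rows = df.index[df["Age"].isnull()]
fill_nulls_at(df, "Age", df["Age"].median(), null_rows)
print(df)
```
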
utils.get_X_and_y(df, y_column)

Splits a pd.DataFrame into X (predictors) and y (response)

Parameters:
  • df (DataFrame) – Pandas DataFrame
  • y_column (str) – The response column name
Returns:

X (DataFrame): All columns except the response
y (Series): Only the response column from the dataframe
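
In plain pandas the split described here amounts to a drop and a column selection. The sketch below uses a hypothetical split_X_y helper and made-up data; it mirrors the documented return values but is not taken from the module.

```python
import pandas as pd


def split_X_y(df, y_column):
    """Hypothetical stand-in: separate predictors from the response column."""
    X = df.drop(columns=[y_column])  # every column except the response
    y = df[y_column]                 # the response as a Series
    return X, y


# Made-up sample data for illustration only
df = pd.DataFrame({
    "age": [34, 51, 29],
    "bmi": [22.1, 30.4, 25.3],
    "diabetes": [0, 1, 0],
})
X, y = split_X_y(df, "diabetes")
print(X.columns.tolist(), y.name)
```
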

utils.get_dataframe_from_array(data_array, columns)

Convert ndarray to pd.DataFrame for the given list of columns

Parameters:
  • data_array (ndarray) – Array to convert to pd.DataFrame
  • columns (array-like) – Column names for the pd.DataFrame
Returns:

pd.DataFrame
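
The conversion itself is a single pandas constructor call. The sketch below shows it on a made-up array; whether get_dataframe_from_array does anything beyond this is not stated here.

```python
import numpy as np
import pandas as pd

# Made-up array, e.g. the output of a transformer that drops column labels
data_array = np.array([[1.0, 2.0], [3.0, 4.0]])

# Attach column names to the array
frame = pd.DataFrame(data_array, columns=["a", "b"])
print(frame)
```
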

utils.kde_compare_plots(df1, df1_title, df2, df2_title, column, **kwargs)

utils.kde_plots(df, columns, **kwargs)

KDE Plots using seaborn

Parameters:
  • df (DataFrame) – DataFrame
  • columns ([str]) – Plot only for selected columns.
  • **kwargs – Keyword arguments.
Keyword Arguments:
 
  • hue_column – for color coding
  • split_plots_by – split seaborn FacetGrid by column, example: Gender
  • height – sets the height of plot
  • aspect – determines the width of the plot based on height

Example

>>> utils.kde_plots(data, numeric_cols, height=4, aspect=1.5,
>>>    hue_column='class')
utils.null_values_info(df)

Show null value information of a DataFrame

Parameters:df (DataFrame) – Pandas DataFrame for which null values should be displayed
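
A null-value summary of this kind can be built from isnull(). The exact columns that null_values_info reports are not documented here, so the count and percentage layout below is an assumption, and the data is made up.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration only
df = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan],
    "b": ["x", "y", None, "z"],
})

# Number of missing values per column, and the same as a share of all rows
null_counts = df.isnull().sum()
summary = pd.DataFrame({
    "null_count": null_counts,
    "null_percent": 100 * null_counts / len(df),
})
print(summary)
```
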
utils.plot_decision_boundary(x_axis_data, y_axis_data, response, estimator, x_axis_column=None, y_axis_column=None)

Plots the decision boundary

utils.plot_roc_curve_binary_class(y_true, y_pred)

utils.plot_roc_curve_multiclass(estimator, X_train, X_test, y_train, y_test, classes)

utils.print_confusion_matrix(y_true, y_pred)

Prints the confusion matrix with column and index labels

Parameters:
  • y_true (Union[ndarray, pd.Series]) – Actual Response
  • y_pred (Union[ndarray, pd.Series]) – Predicted Response
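
A labeled confusion matrix can be produced with pandas.crosstab, which is used below as a stand-in technique and is not necessarily what print_confusion_matrix does internally. The labels and the Actual/Predicted names are made up.

```python
import pandas as pd

# Made-up labels for illustration only
y_true = pd.Series([0, 1, 1, 0, 1], name="Actual")
y_pred = pd.Series([0, 1, 0, 0, 1], name="Predicted")

# Rows are actual classes, columns are predicted classes
matrix = pd.crosstab(y_true, y_pred)
print(matrix)
```

Note that crosstab only includes classes that occur in the data, so a class that is never predicted would have no column.
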
utils.print_func(value_to_print, mode=None)

Display or Print an object or string

Parameters:
  • value_to_print (Union[str, object]) – Value to print
  • mode (Optional[str]) – Defaults to None. Accepts either DISPLAY or HTML
utils.print_new_line()

Prints a new line

utils.print_separator()

Prints a separator line using 80 underscores