utils module

Utils for ML

utils.about_dataframe(df)

Describe a DataFrame and show its information

Parameters:	df (DataFrame) – Pandas DataFrame to describe and summarize
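The following is a minimal sketch of what "describe and show information" could look like in plain pandas. The function name about_dataframe_sketch and the sample data are hypothetical, and the real utils.about_dataframe may format its output differently.

```python
import io

import pandas as pd


def about_dataframe_sketch(df: pd.DataFrame) -> str:
    """Hypothetical stand-in: summary statistics followed by df.info() output."""
    buffer = io.StringIO()
    # DataFrame.info() writes to a buffer rather than returning a string.
    df.info(buf=buffer)
    report = df.describe(include="all").to_string() + "\n\n" + buffer.getvalue()
    print(report)
    return report


# Hypothetical sample data, used only for illustration.
data = pd.DataFrame({"age": [31, 45, 27], "gender": ["F", "M", "F"]})
report = about_dataframe_sketch(data)
```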

utils.columns_info(df, cat_count_threshold, show_group_counts=False)

Prints and returns column info for a given DataFrame

Parameters:	df (DataFrame) – Pandas DataFrame
	cat_count_threshold (int) – A column whose unique value count is less than this threshold is tagged as ‘categorical’
	show_group_counts (boolean) – If True, prints the individual group counts for each column

Example

>>> object_cat_cols, numeric_cat_cols, numeric_cols = utils.columns_info(
...     data,
...     cat_count_threshold=5,
...     show_group_counts=True)
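To make the threshold rule concrete, here is a rough sketch of how columns might be sorted into the three returned groups. The function classify_columns and the sample data are hypothetical, and the assumption that non-numeric columns always count as categorical is ours, not confirmed by the utils source.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def classify_columns(df, cat_count_threshold):
    """Hypothetical sketch of the tagging rule described above."""
    object_cat_cols, numeric_cat_cols, numeric_cols = [], [], []
    for name in df.columns:
        if not is_numeric_dtype(df[name]):
            # Assumption: non-numeric columns are treated as categorical.
            object_cat_cols.append(name)
        elif df[name].nunique() < cat_count_threshold:
            # Numeric column with fewer unique values than the threshold.
            numeric_cat_cols.append(name)
        else:
            numeric_cols.append(name)
    return object_cat_cols, numeric_cat_cols, numeric_cols


# Hypothetical sample data, used only for illustration.
data = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "smoker": [0, 1, 0, 0, 1, 1],
    "bmi": [21.5, 27.1, 30.2, 24.8, 26.0, 22.3],
})
object_cat_cols, numeric_cat_cols, numeric_cols = classify_columns(
    data, cat_count_threshold=5)
```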

utils.count_compare_plots(df1, df1_title, df2, df2_title, column, **kwargs)

Show count plots of two DataFrames for comparison

Can be used to compare how filling NA values affects the distribution of a column

Args:

Example

The example below uses the NHANES dataset.

>>> # Forward-fill missing values within each Gender group
>>> for each_column in object_cat_columns:
...     data[each_column] = data[each_column].fillna(
...         data.groupby(['Gender'])[each_column].ffill())
>>> # Plot each column before and after the fill, side by side
>>> for each_column in object_cat_columns:
...     str_count_of_nas = str(len(
...         data_raw.index[data_raw.isnull()[each_column]]))
...     str_count_of_nas = ' (Count of NAs:' + str_count_of_nas + ')'
...     utils.count_compare_plots(df1=data_raw,
...                               df1_title='Before Fill-NA' + str_count_of_nas,
...                               df2=data,
...                               df2_title='After Fill-NA',
...                               column=each_column,
...                               height=4,
...                               aspect=1.5,
...                               hue_column='Diabetes',
...                               split_plots_by='Gender')

utils.count_plots(df, columns, **kwargs)

Count Plots using seaborn

Display count plots for the given columns in a DataFrame

Parameters:	df (DataFrame) – Pandas DataFrame
	columns (array-like) – Columns for which count plots are shown
	**kwargs – Keyword arguments, listed below

Keyword Arguments:	hue_column (str) – Column used for color coding
	split_plots_by (str) – Column used to split the seaborn FacetGrid, such as Gender
	height (float) – Sets the height of the plot
	aspect (float) – Determines the width of the plot based on its height

Example

>>> utils.count_plots(data, object_cat_cols, height=4, aspect=1.5)

utils.dist_plots(df, columns, **kwargs)

Dist Plots using seaborn

Parameters:	df (DataFrame) – Pandas DataFrame
	columns ([str]) – Plot only for selected columns
	**kwargs – Keyword arguments, listed below

Keyword Arguments:	hue_column (str) – Column used for color coding
	split_plots_by (str) – Column used to split the seaborn FacetGrid, such as Gender
	height (float) – Sets the height of the plot
	aspect (float) – Determines the width of the plot based on its height

Example

>>> utils.dist_plots(data, numeric_cols, height=4, aspect=1.5,
>>>    hue_column='class', kde=False)

Returns: Nothing

utils.do_cross_validate(X, y, estimator_type, estimator, cv, **kwargs)

Cross Validate (sklearn)

Args:

Example

>>> from sklearn.model_selection import ShuffleSplit
>>> cv_iterator = ShuffleSplit(n_splits=2, test_size=0.2, random_state=31)
>>> cv_results = utils.do_cross_validate(X_train,
...                                      y_train,
...                                      'Classification',
...                                      'DecisionTreeClassifier',
...                                      cv=cv_iterator,
...                                      kernel='rbf',
...                                      C=1,
...                                      gamma=0.01)
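For comparison, the same kind of run can be sketched with scikit-learn alone. This assumes utils.do_cross_validate wraps scikit-learn's cross-validation helpers, which the documentation does not confirm; the iris dataset and the direct call to cross_validate are stand-ins chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; any feature matrix X and label vector y would do.
X, y = load_iris(return_X_y=True)

# Same splitter settings as the example above.
cv_iterator = ShuffleSplit(n_splits=2, test_size=0.2, random_state=31)

# cross_validate returns a dict with fit_time, score_time and test_score,
# each holding one entry per split.
cv_results = cross_validate(
    DecisionTreeClassifier(random_state=31), X, y, cv=cv_iterator)
print(cv_results["test_score"])
```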

utils.do_feature_selection(X, y, method, num_of_features=None)

utils.do_outlier_detection(df, target_column, outlier_classes, method, **kwargs)

utils.do_scaling(df, method, columns_to_scale=[])

Scale data using the specified method

Only the columns listed in columns_to_scale are scaled

Parameters:	df (DataFrame) – Pandas DataFrame
	method – Scaling method to apply
	columns_to_scale (array-like) – List of columns that will be scaled
Returns:	df (DataFrame)
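As an illustration of scaling only selected columns, here is a small pandas-only sketch. The function scale_columns and the method names "minmax" and "standard" are hypothetical; the values that utils.do_scaling actually accepts for method are not listed in this documentation.

```python
import pandas as pd


def scale_columns(df, method, columns_to_scale):
    """Hypothetical sketch: scale the listed columns and leave the rest alone."""
    scaled = df.copy()  # avoid mutating the caller's DataFrame
    for name in columns_to_scale:
        col = scaled[name].astype(float)
        if method == "minmax":
            # Rescale to the [0, 1] range (assumes the column is not constant).
            scaled[name] = (col - col.min()) / (col.max() - col.min())
        elif method == "standard":
            # Zero mean, unit variance (population standard deviation).
            scaled[name] = (col - col.mean()) / col.std(ddof=0)
        else:
            raise ValueError(f"Unknown scaling method: {method!r}")
    return scaled


# Hypothetical sample data, used only for illustration.
data = pd.DataFrame({"age": [20, 30, 40], "gender": ["F", "M", "F"]})
scaled = scale_columns(data, method="minmax", columns_to_scale=["age"])
```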

utils.encode_columns(df, method, columns=[])

utils.fill_null_values(df, column, value, row_index)

Fill null values in a DataFrame column

Parameters:	df (DataFrame) – Pandas DataFrame that will be updated
	column (str) – Column in the target DataFrame that will be updated
	value (Union[int, str, object]) – New value that will replace null values
	row_index (Union[Index, array-like]) – Index of rows to be updated
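A minimal sketch of the underlying pandas operation, assuming the function assigns value to the given rows of column in place. The helper fill_null_values_sketch and the sample data are hypothetical; the caller selects the null rows and passes their index, as in the count_compare_plots example.

```python
import numpy as np
import pandas as pd


def fill_null_values_sketch(df, column, value, row_index):
    """Hypothetical sketch: write `value` into `column` for the given rows."""
    df.loc[row_index, column] = value  # updates df in place


# Hypothetical sample data, used only for illustration.
data = pd.DataFrame({"bmi": [22.0, np.nan, 30.5, np.nan]})

# Index of the rows where the column is null.
null_rows = data.index[data["bmi"].isnull()]
fill_null_values_sketch(data, "bmi", 25.0, null_rows)
```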

utils.get_X_and_y(df, y_column)

Splits a pd.DataFrame into X (predictors) and y (response)

Parameters:	df (DataFrame) – Pandas DataFrame
	y_column (str) – The response column name
Returns:	X (DataFrame) – All columns except the response
	y (Series) – Only the response column from the DataFrame
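The split can be sketched in plain pandas as follows. The function split_X_and_y and the sample data are hypothetical and only illustrate the documented behaviour; they are not the utils implementation.

```python
import pandas as pd


def split_X_and_y(df, y_column):
    """Hypothetical sketch: predictors in X, response in y."""
    X = df.drop(columns=[y_column])  # every column except the response
    y = df[y_column]                 # the response column as a Series
    return X, y


# Hypothetical sample data, used only for illustration.
data = pd.DataFrame({
    "age": [31, 45, 27],
    "bmi": [22.0, 27.5, 30.1],
    "diabetes": ["no", "yes", "no"],
})
X, y = split_X_and_y(data, "diabetes")
```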

utils.get_dataframe_from_array(data_array, columns)

Convert an ndarray to a pd.DataFrame with the given list of columns

Parameters:	data_array (ndarray) – Array to convert to pd.DataFrame
	columns (array-like) – Column names for the pd.DataFrame
Returns:	pd.DataFrame
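In plain pandas the conversion is a single constructor call. The sketch below uses hypothetical column names and assumes nothing about the utils implementation beyond the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 array, for example the output of a scikit-learn transformer.
data_array = np.array([[1.0, 2.0], [3.0, 4.0]])

# One column name per array column.
frame = pd.DataFrame(data_array, columns=["x1", "x2"])
print(frame)
```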

utils.kde_compare_plots(df1, df1_title, df2, df2_title, column, **kwargs)

utils.kde_plots(df, columns, **kwargs)

KDE Plots using seaborn

Parameters:	df (DataFrame) – Pandas DataFrame
	columns ([str]) – Plot only for selected columns
	**kwargs – Keyword arguments, listed below

Keyword Arguments:	hue_column – Column used for color coding
	split_plots_by – Column used to split the seaborn FacetGrid, for example Gender
	height – Sets the height of the plot
	aspect – Determines the width of the plot based on its height

Example

>>> utils.kde_plots(data, numeric_cols, height=4, aspect=1.5,
>>>    hue_column='class')

utils.null_values_info(df)

Show null value information of a DataFrame

Parameters:	df (DataFrame) – Pandas DataFrame for which null values should be displayed
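One way to produce null value information with pandas is sketched below. The function null_values_summary, its output columns and the sample data are hypothetical; the actual layout printed by utils.null_values_info is not documented here.

```python
import numpy as np
import pandas as pd


def null_values_summary(df):
    """Hypothetical sketch: null count and percentage per column."""
    counts = df.isnull().sum()
    return pd.DataFrame({
        "null_count": counts,
        "null_percent": 100 * counts / len(df),
    })


# Hypothetical sample data, used only for illustration.
data = pd.DataFrame({
    "age": [31, 45, 27, 52],
    "bmi": [22.0, np.nan, 30.1, np.nan],
    "gender": ["F", None, "F", "M"],
})
summary = null_values_summary(data)
print(summary)
```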

utils.plot_decision_boundary(x_axis_data, y_axis_data, response, estimator, x_axis_column=None, y_axis_column=None)

Plots the decision boundary

Args:

utils.plot_roc_curve_binary_class(y_true, y_pred)

utils.plot_roc_curve_multiclass(estimator, X_train, X_test, y_train, y_test, classes)

utils.print_confusion_matrix(y_true, y_pred)

Prints the confusion matrix with column and index labels

Parameters:	y_true (Union[ndarray, pd.Series]) – Actual response
	y_pred (Union[ndarray, pd.Series]) – Predicted response
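A labelled confusion matrix can be sketched with pandas.crosstab, as below. This is an illustrative stand-in with hypothetical labels, not the utils implementation, which may build the matrix differently (for example with scikit-learn). Note that crosstab only lists classes that actually occur in the data.

```python
import pandas as pd

# Hypothetical actual and predicted labels.
y_true = pd.Series(["no", "yes", "yes", "no", "yes"], name="Actual")
y_pred = pd.Series(["no", "yes", "no", "no", "yes"], name="Predicted")

# Rows are actual classes, columns are predicted classes.
matrix = pd.crosstab(y_true, y_pred)
print(matrix)
```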

utils.print_func(value_to_print, mode=None)

Display or print an object or string

Parameters:	value_to_print (Union[str, object]) – Value to print
	mode (Optional[str]) – Accepts either DISPLAY or HTML. Defaults to None

utils.print_new_line()

Prints a new line

utils.print_separator()

Prints a separator line of 80 underscores
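For reference, an 80-underscore separator takes one line of Python. The function name print_separator_sketch is hypothetical, and returning the line is an addition made here for easy inspection.

```python
def print_separator_sketch():
    """Hypothetical sketch: print a line of 80 underscores and return it."""
    line = "_" * 80
    print(line)
    return line


separator = print_separator_sketch()
```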