In this project, we built a Tkinter GUI and used it to evaluate several machine learning models on a given dataset in order to identify the best-performing one. Comparing models this way matters in real-world problems because it helps us select the most suitable algorithm for a specific task. The GUI gives users an accessible interface to the machine learning workflow: they can load data, select models, start training, and visualize results without writing code or working on the command line. This makes the workflow usable by people with different levels of technical proficiency.
We began by loading and preprocessing the dataset, a fundamental step in any machine learning project. Proper data preprocessing involves tasks such as handling missing values, encoding categorical features, and scaling numerical attributes. These operations ensure that the data is in a format suitable for training and testing machine learning models.
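The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's Process_Data class: the toy DataFrame, its column names, and the preprocess() helper are assumptions made for the example. It imputes missing values, one-hot encodes the categorical column, and fits the scaler on the training split only so that no test-set statistics leak into training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "Income": [52000.0, None, 61000.0, 48000.0, 75000.0, 39000.0, 58000.0, 67000.0],
    "Education": ["PhD", "Master", None, "Graduation", "PhD", "Master", "Graduation", "PhD"],
    "Response": [1, 0, 0, 1, 1, 0, 0, 1],
})

def preprocess(df, target):
    """Impute missing values and one-hot encode categorical features."""
    X = df.drop(columns=[target]).copy()
    y = df[target]
    num_cols = X.select_dtypes(include="number").columns
    cat_cols = X.columns.difference(num_cols)
    # Numerical gaps get the column median, categorical gaps get a placeholder label
    X[num_cols] = X[num_cols].fillna(X[num_cols].median())
    X[cat_cols] = X[cat_cols].fillna("Unknown")
    # One column per category value
    X = pd.get_dummies(X, columns=list(cat_cols), dtype=float)
    return X, y

X, y = preprocess(toy, "Response")

# Split first, then fit the scaler on the training part only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X.columns.tolist())
```

The median and the "Unknown" label are only one possible imputation choice; a real dataset may call for something different.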
Once our data was ready, we moved on to the model selection phase. We evaluated multiple machine learning algorithms, each with its own strengths and weaknesses: Logistic Regression, Random Forest, K-Nearest Neighbors (KNN), Decision Trees, AdaBoost, Gradient Boosting, Extreme Gradient Boosting (XGBoost), Multi-Layer Perceptron (MLP), and Support Vector Classifier (SVC).
For each model, we used grid search with cross-validation to find the best hyperparameters. Grid search tries every combination in a predefined grid of hyperparameter values, scores each combination by its average accuracy across the cross-validation folds, and keeps the configuration with the highest cross-validated accuracy. Depending on the model, the hyperparameters included settings such as the number of estimators, the learning rate, and the kernel function.
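A grid search of this kind can be sketched with scikit-learn's GridSearchCV. The synthetic dataset, the small parameter grid, and the three-fold setting below are assumptions chosen to keep the example fast; they are not the grids used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the preprocessed training data (assumption)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid: 2 x 2 = 4 candidate configurations
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

# Stratified folds keep the class ratio the same in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",  # mean accuracy over the held-out folds
    cv=cv,
)
search.fit(X, y)

# best_estimator_ is refit on all of X, y with the winning configuration
best_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

Here best_score_ is the mean accuracy over the held-out folds, not the accuracy on the data the model was fitted to, which is why it is a fairer basis for choosing hyperparameters.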
After obtaining the best hyperparameters for each model, we trained them on our preprocessed dataset. This training process involved using the training data to teach the model to make predictions on new, unseen examples. Once trained, the models were ready for evaluation.
We assessed each model with four standard evaluation metrics: accuracy, precision, recall, and F1-score. Accuracy measures the proportion of all predictions that are correct. Precision is the proportion of true positives among all positive predictions, and recall is the proportion of true positives among all actual positives, which reflects a model's ability to find the positive cases. The F1-score is the harmonic mean of precision and recall, so it is high only when both are high.
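The four metrics follow directly from the counts of true and false positives and negatives. The sketch below computes them in plain Python for binary labels; the classification_metrics() helper and the toy label lists are illustrative assumptions, and in practice scikit-learn's accuracy_score, precision_score, recall_score, and f1_score give the same values.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels: 3 true positives, 2 true negatives, 2 false positives, 1 false negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
# accuracy = 5/8 = 0.625, precision = 3/5 = 0.6, recall = 3/4 = 0.75, f1 = 2/3
```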
To visualize each model's performance, we created several plots. Confusion matrices show the number of true positive, true negative, false positive, and false negative predictions, which makes the classification errors easy to read. We also generated Receiver Operating Characteristic (ROC) curves and computed area under the curve (AUC) scores, which summarize a model's ability to distinguish between classes: an AUC of 0.5 corresponds to random guessing, and values close to 1.0 indicate strong separation.
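As a small numeric illustration of the two diagnostics, the sketch below builds a confusion matrix and an ROC curve with scikit-learn. The four labels and scores are toy values assumed for the example, and the 0.5 decision threshold is likewise an assumption.

```python
import numpy as np
from sklearn.metrics import auc, confusion_matrix, roc_curve

# Toy ground truth and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Hard predictions at a 0.5 threshold
y_pred = (scores >= 0.5).astype(int)

# Rows are true classes, columns are predicted classes, both in sorted
# label order, so index 0 is class 0 and index 1 is class 1
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# The ROC curve sweeps the threshold; AUC is the area under that curve
fpr, tpr, thresholds = roc_curve(y_true, scores)
roc_auc = auc(fpr, tpr)

print(cm)
print("AUC:", roc_auc)
```

Because the matrix is laid out in sorted label order, axis tick labels for a heatmap of it should list the class-0 name first.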
Furthermore, we constructed true values versus predicted values diagrams to provide insights into how well our models aligned with the actual data distribution. Learning curves were also generated to observe a model's performance as a function of training data size, helping us assess whether the model was overfitting or underfitting.
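The learning-curve diagnostic can be sketched with scikit-learn's learning_curve function. The synthetic dataset, the LogisticRegression estimator, and the five-fold setting are assumptions for the example. A training score that stays well above the validation score suggests overfitting, while two low scores that sit close together suggest underfitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the training data (assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Evaluate the model at 5 training-set sizes, from 10% to 100% of each
# cross-validation training fold
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
)

# Average over the folds: one value per training-set size
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"{n:4d} samples  train={tr:.3f}  validation={va:.3f}  gap={tr - va:.3f}")
```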
Lastly, we presented the results in a clear and organized manner, saving each model's true and predicted test labels to CSV files (for example results_LR.csv) that can be opened in Excel for easy reference. This allowed us to compare the performance of different models and make an informed choice about which one to select for our specific task.
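Saving results for later comparison can be sketched as follows. This version writes CSV files, in line with the results_*.csv files that the source code reads back; the model names, toy predictions, and the temporary output directory are assumptions for the example.

```python
import os
import tempfile

import pandas as pd

# Toy test labels and predictions from two hypothetical models
y_test = [1, 0, 1, 1, 0, 0]
predictions = {
    "model_a": [1, 0, 1, 0, 0, 0],
    "model_b": [1, 1, 1, 1, 0, 1],
}

out_dir = tempfile.mkdtemp()
rows = []
for name, y_pred in predictions.items():
    # One file per model with the true and predicted labels side by side
    df = pd.DataFrame({"y_test": y_test, "y_pred": y_pred})
    df.to_csv(os.path.join(out_dir, f"results_{name}.csv"), index=False)
    accuracy = (df["y_test"] == df["y_pred"]).mean()
    rows.append({"model": name, "accuracy": accuracy})

# One summary table, best model first
summary = (pd.DataFrame(rows)
           .sort_values("accuracy", ascending=False)
           .reset_index(drop=True))
summary.to_csv(os.path.join(out_dir, "summary.csv"), index=False)

reloaded = pd.read_csv(os.path.join(out_dir, "results_model_a.csv"))
print(summary)
```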
In summary, this project was a comprehensive exploration of the machine learning model development and evaluation process. We prepared the data, selected and fine-tuned various models, assessed their performance using multiple metrics and visualizations, and ultimately arrived at a well-informed decision about the most suitable model for our dataset. This approach serves as a valuable blueprint for tackling real-world machine learning challenges effectively.
SOURCE CODE:
#main_class.py
import os
import tkinter as tk
from tkinter import *
from design_window import Design_Window
from process_data import Process_Data
from helper_plot import Helper_Plot
from machine_learning import Machine_Learning

class Main_Class:
    def __init__(self, root):
        #Passes root explicitly instead of relying on a module-level variable
        self.initialize(root)

    def initialize(self, root):
        self.root = root
        width = 1500
        height = 750
        self.root.geometry(f"{width}x{height}")
        self.root.title("TKINTER AND DATA SCIENCE")

        #Creates necessary objects
        self.obj_window = Design_Window()
        self.obj_data = Process_Data()
        self.obj_plot = Helper_Plot()
        self.obj_ML = Machine_Learning()

        #Places widgets in root
        self.obj_window.add_widgets(self.root)

        #Reads dataset
        self.df = self.obj_data.preprocess()

        #Categorizes dataset
        self.df_dummy = self.obj_data.categorize(self.df)

        #Extracts input and output variables
        self.cat_cols, self.num_cols = self.obj_data.extract_cat_num_cols(self.df)
        self.df_final = self.obj_data.encode_categorical_feats(self.df, self.cat_cols)
        self.X, self.y = self.obj_data.extract_input_output_vars(self.df_final)

        #Binds events
        self.binds_event()

        #Initially turns off combo4 and combo5 before data splitting is done
        self.obj_window.combo4['state'] = 'disabled'
        self.obj_window.combo5['state'] = 'disabled'

    def binds_event(self):
        #Binds button1 to shows_table() function
        #Shows table if user clicks LOAD DATASET
        self.obj_window.button1.config(command = lambda:self.obj_plot.shows_table(self.root, self.df, 1400, 600, "Dataset"))

        #Binds listbox to a function
        self.obj_window.listbox.bind("<<ListboxSelect>>", self.choose_list_widget)

        #Binds combobox1 to a function
        self.obj_window.combo1.bind("<<ComboboxSelected>>", self.choose_combobox1)

        #Binds combobox2 to a function
        self.obj_window.combo2.bind("<<ComboboxSelected>>", self.choose_combobox2)

        #Binds button2 to train_ML() function
        self.obj_window.button2.config(command=self.train_ML)

        #Binds combobox4 to a function
        self.obj_window.combo4.bind("<<ComboboxSelected>>", self.choose_combobox4)

    def choose_list_widget(self, event):
        selection = self.obj_window.listbox.curselection()
        #<<ListboxSelect>> also fires when the selection is cleared, so guards against an empty selection
        if not selection:
            return
        chosen = self.obj_window.listbox.get(selection[0])
        print(chosen)
        self.obj_plot.choose_plot(self.df, self.df_dummy, chosen,
            self.obj_window.figure1, self.obj_window.canvas1,
            self.obj_window.figure2, self.obj_window.canvas2)

    def choose_combobox1(self, event):
        chosen = self.obj_window.combo1.get()
        self.obj_plot.choose_category(self.df_dummy, chosen,
            self.obj_window.figure1, self.obj_window.canvas1,
            self.obj_window.figure2, self.obj_window.canvas2)

    def choose_combobox2(self, event):
        chosen = self.obj_window.combo2.get()
        self.obj_plot.choose_plot_more(self.df_final, chosen,
            self.X, self.y,
            self.obj_window.figure1,
            self.obj_window.canvas1, self.obj_window.figure2,
            self.obj_window.canvas2)

    def train_ML(self):
        #Reuses the saved split if it exists; otherwise creates it first
        file_path = os.getcwd()+"/X_train.pkl"
        if os.path.exists(file_path):
            self.X_train, self.X_test, self.y_train, self.y_test = self.obj_ML.load_files()
        else:
            self.obj_ML.oversampling_splitting(self.X, self.y)
            self.X_train, self.X_test, self.y_train, self.y_test = self.obj_ML.load_files()

        print("Loading files done...")

        #Turns on combo4 and combo5 after splitting is done
        self.obj_window.combo4['state'] = 'normal'
        self.obj_window.combo5['state'] = 'normal'
        self.obj_window.button2.config(state="disabled")

    def choose_combobox4(self, event):
        chosen = self.obj_window.combo4.get()
        self.obj_plot.choose_plot_ML(self.root, chosen, self.X_train, self.X_test,
            self.y_train, self.y_test, self.obj_window.figure1,
            self.obj_window.canvas1, self.obj_window.figure2,
            self.obj_window.canvas2)

if __name__ == "__main__":
    root = tk.Tk()
    app = Main_Class(root)
    root.mainloop()

#design_window.py
import tkinter as tk
from tkinter import ttk
from matplotlib.figure import Figure
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg

class Design_Window:
    def add_widgets(self, root):
        #Adds button(s)
        self.add_buttons(root)
        #Adds canvases
        self.add_canvas(root)
        #Adds labels
        self.add_labels(root)
        #Adds listbox widget
        self.add_listboxes(root)
        #Adds combobox widget
        self.add_comboboxes(root)

    def add_buttons(self, root):
        #Adds buttons
        self.button1 = tk.Button(root, height=2, width=30, text="LOAD DATASET")
        self.button1.grid(row=0, column=0, padx=5, pady=5, sticky="w")
        self.button2 = tk.Button(root, height=2, width=30, text="SPLIT DATA")
        self.button2.grid(row=9, column=0, padx=5, pady=5, sticky="w")

    def add_labels(self, root):
        #Adds labels
        self.label1 = tk.Label(root, text = "CHOOSE PLOT", fg = "red")
        self.label1.grid(row=1, column=0, padx=5, pady=1, sticky="w")
        self.label2 = tk.Label(root, text = "CHOOSE CATEGORIZED PLOT", fg = "blue")
        self.label2.grid(row=3, column=0, padx=5, pady=1, sticky="w")
        self.label3 = tk.Label(root, text = "CHOOSE FEATURES", fg = "black")
        self.label3.grid(row=5, column=0, padx=5, pady=1, sticky="w")
        self.label4 = tk.Label(root, text = "CHOOSE REGRESSORS", fg = "green")
        self.label4.grid(row=7, column=0, padx=5, pady=1, sticky="w")
        self.label5 = tk.Label(root, text = "CHOOSE MACHINE LEARNING", fg = "blue")
        self.label5.grid(row=10, column=0, padx=5, pady=1, sticky="w")
        self.label6 = tk.Label(root, text = "CHOOSE DEEP LEARNING", fg = "red")
        self.label6.grid(row=12, column=0, padx=5, pady=1, sticky="w")

    def add_canvas(self, root):
        #Adds canvas1 widget to root to display results
        self.figure1 = Figure(figsize=(6.2, 7), dpi=100)
        self.figure1.patch.set_facecolor("lightgray")
        self.canvas1 = FigureCanvasTkAgg(self.figure1, master=root)
        self.canvas1.get_tk_widget().grid(row=0, column=1, columnspan=1,
            rowspan=25, padx=5, pady=5, sticky="n")

        #Adds canvas2 widget to root to display results
        self.figure2 = Figure(figsize=(6.2, 7), dpi=100)
        self.figure2.patch.set_facecolor("lightgray")
        self.canvas2 = FigureCanvasTkAgg(self.figure2, master=root)
        self.canvas2.get_tk_widget().grid(row=0, column=2, columnspan=1,
            rowspan=25, padx=5, pady=5, sticky="n")

    def add_listboxes(self, root):
        #Adds list widget
        self.listbox = tk.Listbox(root, selectmode=tk.SINGLE, width=35)
        self.listbox.grid(row=2, column=0, sticky='n', padx=5, pady=1)

        #Inserts items into the list widget
        items = ["Marital Status", "Education", "Country", "Age Group",
            "Education with Response 0", "Education with Response 1",
            "Country with Response 0", "Country with Response 1",
            "Customer Age", "Income", "Amount of Wines",
            "Education versus Response", "Age Group versus Response",
            "Marital Status versus Response", "Country versus Response",
            "Number of Dependants versus Response",
            "Country versus Customer Age Per Education",
            "Num_TotalPurchases versus Education Per Marital Status"]
        for item in items:
            self.listbox.insert(tk.END, item)

        self.listbox.config(height=len(items))

    def add_comboboxes(self, root):
        #Creates comboboxes
        self.combo1 = ttk.Combobox(root, width=32)
        self.combo1["values"] = ["Categorized Income versus Response",
            "Categorized Total Purchase versus Categorized Income",
            "Categorized Recency versus Categorized Total Purchase",
            "Categorized Customer Month versus Categorized Customer Age",
            "Categorized Amount of Gold Products versus Categorized Income",
            "Categorized Amount of Fish Products versus Categorized Total AmountSpent",
            "Categorized Amount of Meat Products versus Categorized Recency",
            "Distribution of Numerical Columns"]
        self.combo1.grid(row=4, column=0, padx=5, pady=1, sticky="n")

        self.combo2 = ttk.Combobox(root, width=32)
        self.combo2["values"] = ["Correlation Matrix", "RF Features Importance",
            "ET Features Importance", "RFE Features Importance"]
        self.combo2.grid(row=6, column=0, padx=5, pady=1, sticky="n")

        self.combo3 = ttk.Combobox(root, width=32)
        self.combo3["values"] = ["Linear Regression", "RF Regression",
            "Decision Trees Regression", "KNN Regression",
            "AdaBoost Regression", "Gradient Boosting Regression",
            "XGB Regression", "LGB Regression", "CatBoost Regression",
            "SVR Regression", "Lasso Regression", "Ridge Regression"]
        self.combo3.grid(row=8, column=0, padx=5, pady=1, sticky="n")

        self.combo4 = ttk.Combobox(root, width=32)
        self.combo4["values"] = ["Logistic Regression", "Random Forest",
            "Decision Trees", "K-Nearest Neighbors", "AdaBoost",
            "Gradient Boosting", "Extreme Gradient Boosting",
            "Light Gradient Boosting", "Multi-Layer Perceptron",
            "Support Vector Classifier"]
        self.combo4.grid(row=11, column=0, padx=5, pady=1, sticky="n")

        self.combo5 = ttk.Combobox(root, width=32)
        self.combo5["values"] = ["LSTM", "Convolutional NN", "Recurrent NN",
            "Feed-Forward NN", "Artifical NN"]
        self.combo5.grid(row=13, column=0, padx=5, pady=1, sticky="n")

#helper_plot.py
from tkinter import *
import seaborn as sns
import numpy as np
from pandastable import Table
from process_data import Process_Data
from machine_learning import Machine_Learning
from sklearn.metrics import confusion_matrix, roc_curve, accuracy_score
from sklearn.model_selection import learning_curve

class Helper_Plot:
    def __init__(self):
        self.obj_data = Process_Data()
        self.obj_ml = Machine_Learning()

    def shows_table(self, root, df, width, height, title):
        frame = Toplevel(root) #new window
        self.table = Table(frame, dataframe=df, showtoolbar=True, showstatusbar=True)

        #Sets dimension of Toplevel
        frame.geometry(f"{width}x{height}")
        frame.title(title)
        self.table.show()

    #Defines function to create pie chart and bar plot as subplots
    def plot_piechart(self, df, var, figure, canvas, title=''):
        figure.clear()

        #Pie chart (top subplot)
        plot1 = figure.add_subplot(2,1,1)
        label_list = list(df[var].value_counts().index)
        colors = sns.color_palette("deep", len(label_list))
        _, _, autopcts = plot1.pie(df[var].value_counts(), autopct="%1.1f%%",
            colors=colors, startangle=30, labels=label_list,
            wedgeprops={"linewidth": 2, "edgecolor": "white"}, #Adds white edge
            shadow=True, textprops={'fontsize': 7})
        plot1.set_title("Distribution of " + var + " variable " + title, fontsize=10)

        #Bar plot (bottom subplot)
        plot2 = figure.add_subplot(2,1,2)
        ax = df[var].value_counts().plot(kind="barh", color=colors, alpha=0.8, ax = plot2)
        for i, j in enumerate(df[var].value_counts().values):
            ax.text(.7, i, j, weight="bold", fontsize=7)

        plot2.set_title("Count of " + var + " cases " + title, fontsize=10)

        figure.tight_layout()
        canvas.draw()

    def another_versus_response(self, df, feat, num_bins, figure, canvas):
        figure.clear()
        plot1 = figure.add_subplot(2,1,1)

        colors = sns.color_palette("Set2")
        df[df['Response'] == 0][feat].plot(ax=plot1, kind='hist', bins=num_bins,
            edgecolor='black', color=colors[0])
        plot1.set_title('Not Responsive', fontsize=15)
        plot1.set_xlabel(feat, fontsize=10)
        plot1.set_ylabel('Count', fontsize=10)
        data1 = []
        for p in plot1.patches:
            x = p.get_x() + p.get_width() / 2.
            y = p.get_height()
            plot1.annotate(format(y, '.0f'), (x, y), ha='center', va='center',
                xytext=(0, 10), weight="bold", fontsize=7,
                textcoords='offset points')
            data1.append([x, y])

        plot2 = figure.add_subplot(2,1,2)
        df[df['Response'] == 1][feat].plot(ax=plot2, kind='hist', bins=num_bins,
            edgecolor='black', color=colors[1])
        plot2.set_title('Responsive', fontsize=15)
        plot2.set_xlabel(feat, fontsize=10)
        plot2.set_ylabel('Count', fontsize=10)
        data2 = []
        for p in plot2.patches:
            x = p.get_x() + p.get_width() / 2.
            y = p.get_height()
            plot2.annotate(format(y, '.0f'), (x, y), ha='center', va='center',
                xytext=(0, 10), weight="bold", fontsize=7,
                textcoords='offset points')
            data2.append([x, y])

        figure.tight_layout()
        canvas.draw()

    #Puts label inside stacked bar
    def put_label_stacked_bar(self, ax,fontsize):
        #patches is everything inside of the chart
        for rect in ax.patches:
            #Finds where everything is located
            height = rect.get_height()
            width = rect.get_width()
            x = rect.get_x()
            y = rect.get_y()

            #The height of the bar is the data value and can be used as the label
            label_text = f'{height:.0f}'

            #ax.text(x, y, text)
            label_x = x + width / 2
            label_y = y + height / 2

            #Plots only when height is greater than specified value
            if height > 0:
                ax.text(label_x, label_y, label_text,
                    ha='center', va='center',
                    weight = "bold",fontsize=fontsize)

    #Plots one variable against another variable
    def dist_one_vs_another_plot(self, df, cat1, cat2, figure, canvas, title):
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)

        group_by_stat = df.groupby([cat1, cat2]).size()
        colors = sns.color_palette("Set2", len(df[cat1].unique()))
        stacked_data = group_by_stat.unstack()
        stacked_data.plot(kind='bar', stacked=True, ax=plot1, grid=True, color=colors)
        plot1.set_title(title, fontsize=12)
        plot1.set_ylabel('Number of Cases', fontsize=10)
        plot1.set_xlabel(cat1, fontsize=10)
        self.put_label_stacked_bar(plot1,7)

        #Sets font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=8)
        plot1.tick_params(axis='both', which='minor', labelsize=8)
        plot1.legend(fontsize=8)
        figure.tight_layout()
        canvas.draw()

    def box_plot(self, df, x, y, hue, figure, canvas, title):
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)

        #Creates boxplot of y versus x, grouped by hue
        sns.boxplot(data = df, x = x, y = y, hue = hue, ax=plot1)
        plot1.set_title(title, fontsize=14)
        plot1.set_xlabel(x, fontsize=10)
        plot1.set_ylabel(y, fontsize=10)
        figure.tight_layout()
        canvas.draw()

    def choose_plot(self, df1, df2, chosen, figure1, canvas1, figure2, canvas2):
        print(chosen)
        if chosen == "Marital Status":
            self.plot_piechart(df2, "Marital_Status", figure1, canvas1)

        elif chosen == "Education":
            self.plot_piechart(df2, "Education", figure2, canvas2)

        elif chosen == "Country":
            self.plot_piechart(df2, "Country", figure1, canvas1)

        elif chosen == "Age Group":
            self.plot_piechart(df2, "AgeGroup", figure2, canvas2)

        elif chosen == "Education with Response 0":
            self.plot_piechart(df2[df2.Response==0], "Education", figure1, canvas1, " with Response 0")

        elif chosen == "Education with Response 1":
            self.plot_piechart(df2[df2.Response==1], "Education", figure2, canvas2, " with Response 1")

        elif chosen == "Country with Response 0":
            self.plot_piechart(df2[df2.Response==0], "Country", figure1, canvas1, " with Response 0")

        elif chosen == "Country with Response 1":
            self.plot_piechart(df2[df2.Response==1], "Country", figure2, canvas2, " with Response 1")

        elif chosen == "Income":
            self.another_versus_response(df1, "Income", 32, figure1, canvas1)

        elif chosen == "Amount of Wines":
            self.another_versus_response(df1, "MntWines", 32, figure2, canvas2)

        elif chosen == "Customer Age":
            self.another_versus_response(df1, "Customer_Age", 32, figure1, canvas1)

        elif chosen == "Education versus Response":
            self.dist_one_vs_another_plot(df2, "Education", "Response", figure2, canvas2, chosen)

        elif chosen == "Age Group versus Response":
            self.dist_one_vs_another_plot(df2, "AgeGroup", "Response", figure1, canvas1, chosen)

        elif chosen == "Marital Status versus Response":
            self.dist_one_vs_another_plot(df2, "Marital_Status", "Response", figure2, canvas2, chosen)

        elif chosen == "Country versus Response":
            self.dist_one_vs_another_plot(df2, "Country", "Response", figure1, canvas1, chosen)

        elif chosen == "Number of Dependants versus Response":
            self.dist_one_vs_another_plot(df2, "Num_Dependants", "Response", figure2, canvas2, chosen)

        elif chosen == "Country versus Customer Age Per Education":
            self.box_plot(df1, "Country", "Customer_Age", "Education", figure1, canvas1, chosen)

        elif chosen == "Num_TotalPurchases versus Education Per Marital Status":
            self.box_plot(df1, "Education", "Num_TotalPurchases", "Marital_Status", figure2, canvas2, chosen)

    def choose_category(self, df, chosen, figure1, canvas1, figure2, canvas2):
        if chosen == "Categorized Income versus Response":
            self.dist_one_vs_another_plot(df, "Income", "Response", figure1, canvas1, chosen)

        if chosen == "Categorized Total Purchase versus Categorized Income":
            self.dist_one_vs_another_plot(df, "Num_TotalPurchases", "Income", figure2, canvas2, chosen)

        if chosen == "Categorized Recency versus Categorized Total Purchase":
            self.dist_one_vs_another_plot(df, "Recency", "Num_TotalPurchases", figure1, canvas1, chosen)

        if chosen == "Categorized Customer Month versus Categorized Customer Age":
            self.dist_one_vs_another_plot(df, "Dt_Customer_Month", "Customer_Age", figure2, canvas2, chosen)

        if chosen == "Categorized Amount of Gold Products versus Categorized Income":
            self.dist_one_vs_another_plot(df, "MntGoldProds", "Income", figure1, canvas1, chosen)

        if chosen == "Categorized Amount of Fish Products versus Categorized Total AmountSpent":
            self.dist_one_vs_another_plot(df, "MntFishProducts", "TotalAmount_Spent", figure2, canvas2, chosen)

        if chosen == "Categorized Amount of Meat Products versus Categorized Recency":
            self.dist_one_vs_another_plot(df, "MntMeatProducts", "Recency", figure1, canvas1, chosen)

    def plot_corr_mat(self, df, figure, canvas):
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)
        #Drops categorical columns before computing correlations
        categorical_columns = df.select_dtypes(include=['object', 'category']).columns
        df_removed = df.drop(columns=categorical_columns)
        corrdata = df_removed.corr()

        annot_kws = {"size": 5}
        sns.heatmap(corrdata, ax = plot1, lw=1, annot=True, cmap="Reds", annot_kws=annot_kws)
        plot1.set_title('Correlation Matrix', fontweight ="bold",fontsize=14)

        #Sets font for x and y labels
        plot1.set_xlabel('Features', fontweight="bold", fontsize=12)
        plot1.set_ylabel('Features', fontweight="bold", fontsize=12)

        #Sets font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=5)
        plot1.tick_params(axis='both', which='minor', labelsize=5)

        figure.tight_layout()
        canvas.draw()

    def plot_rf_importance(self, X, y, figure, canvas):
        result_rf = self.obj_data.feat_importance_rf(X, y)
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)
        sns.set_color_codes("pastel")
        ax=sns.barplot(x = 'Values',y = 'Features', data=result_rf, color="Blue", ax=plot1)
        plot1.set_title('Random Forest Features Importance', fontweight ="bold",fontsize=14)

        plot1.set_xlabel('Features Importance', fontsize=10)
        plot1.set_ylabel('Feature Labels', fontsize=10)

        #Sets font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=5)
        plot1.tick_params(axis='both', which='minor', labelsize=5)
        figure.tight_layout()
        canvas.draw()

    def plot_et_importance(self, X, y, figure, canvas):
        result_rf = self.obj_data.feat_importance_et(X, y)
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)
        sns.set_color_codes("pastel")
        ax=sns.barplot(x = 'Values',y = 'Features', data=result_rf, color="Red", ax=plot1)
        plot1.set_title('Extra Trees Features Importance', fontweight ="bold",fontsize=14)

        plot1.set_xlabel('Features Importance', fontsize=10)
        plot1.set_ylabel('Feature Labels', fontsize=10)

        #Sets font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=5)
        plot1.tick_params(axis='both', which='minor', labelsize=5)
        figure.tight_layout()
        canvas.draw()

    def plot_rfe_importance(self, X, y, figure, canvas):
        result_lg = self.obj_data.feat_importance_rfe(X, y)
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)
        sns.set_color_codes("pastel")
        ax=sns.barplot(x = 'Ranking',y = 'Features', data=result_lg, color="orange", ax=plot1)
        plot1.set_title('RFE Features Importance', fontweight ="bold",fontsize=14)

        plot1.set_xlabel('Features Importance', fontsize=10)
        plot1.set_ylabel('Feature Labels', fontsize=10)

        #Sets font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=5)
        plot1.tick_params(axis='both', which='minor', labelsize=5)
        figure.tight_layout()
        canvas.draw()

    def choose_plot_more(self, df, chosen, X, y, figure1, canvas1, figure2, canvas2):
        if chosen == "Correlation Matrix":
            self.plot_corr_mat(df, figure1, canvas1)

        if chosen == "RF Features Importance":
            self.plot_rf_importance(X, y, figure2, canvas2)

        if chosen == "ET Features Importance":
            self.plot_et_importance(X, y, figure1, canvas1)

        if chosen == "RFE Features Importance":
            self.plot_rfe_importance(X, y, figure1, canvas1)

    def plot_cm_roc(self, model, X_test, y_test, ypred, name, figure, canvas):
        figure.clear()

        #Plots confusion matrix
        plot1 = figure.add_subplot(2,1,1)
        cm = confusion_matrix(y_test, ypred)
        sns.heatmap(cm, annot=True, linewidth=3, linecolor='red', fmt='g',
            cmap="Greens", annot_kws={"size": 14}, ax=plot1)
        plot1.set_title('Confusion Matrix' + " of " + name, fontsize=12)
        plot1.set_xlabel('Y predict', fontsize=10)
        plot1.set_ylabel('Y test', fontsize=10)
        #confusion_matrix orders classes by sorted label, so class 0 (Not Responsive) comes first
        plot1.xaxis.set_ticklabels(['Not Responsive', 'Responsive'], fontsize=10)
        plot1.yaxis.set_ticklabels(['Not Responsive', 'Responsive'], fontsize=10)

        #Plots ROC
        plot2 = figure.add_subplot(2,1,2)
        Y_pred_prob = model.predict_proba(X_test)
        Y_pred_prob = Y_pred_prob[:, 1]
        fpr, tpr, thresholds = roc_curve(y_test, Y_pred_prob)
        plot2.plot([0,1],[0,1], color='navy', linestyle='--', linewidth=3)
        plot2.plot(fpr,tpr, color='red', linewidth=3)
        plot2.set_xlabel('False Positive Rate', fontsize=10)
        plot2.set_ylabel('True Positive Rate', fontsize=10)
        plot2.set_title('ROC Curve of ' + name , fontsize=12)
        plot2.grid(True)

        figure.tight_layout()
        canvas.draw()

    #Plots true values versus predicted values diagram and learning curve
    def plot_real_pred_val_learning_curve(self, model, X_train, y_train, X_test, y_test, ypred, name, figure, canvas):
        figure.clear()

        #Plots true values versus predicted values diagram
        plot1 = figure.add_subplot(2,1,1)
        acc=accuracy_score(y_test, ypred)
        plot1.scatter(range(len(ypred)),ypred,color="blue", lw=3,label="Predicted")
        plot1.scatter(range(len(y_test)), y_test,color="red",label="Actual")
        plot1.set_title("Predicted Values vs True Values of " + name, fontsize=12)
        plot1.set_xlabel("Accuracy: " + str(round((acc*100),3)) + "%")
        plot1.legend()
        plot1.grid(True, alpha=0.75, lw=1, ls='-.')

        #Plots learning curve
        train_sizes=np.linspace(.1, 1.0, 5)
        train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(model,
            X_train, y_train, cv=None, n_jobs=None, train_sizes=train_sizes,
            return_times=True)
        train_scores_mean = np.mean(train_scores, axis=1)
        train_scores_std = np.std(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)
        test_scores_std = np.std(test_scores, axis=1)

        plot2 = figure.add_subplot(2,1,2)
        plot2.fill_between(train_sizes, train_scores_mean - train_scores_std,
            train_scores_mean + train_scores_std, alpha=0.1, color="r")
        plot2.fill_between(train_sizes, test_scores_mean - test_scores_std,
            test_scores_mean + test_scores_std, alpha=0.1, color="g")
        plot2.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
        plot2.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
        plot2.legend(loc="best")
        plot2.set_title("Learning curve of " + name, fontsize=12)
        #The x-axis holds the training set sizes, not the fit times
        plot2.set_xlabel("Training examples")
        plot2.set_ylabel("Score")
        plot2.grid(True, alpha=0.75, lw=1, ls='-.')

        figure.tight_layout()
        canvas.draw()

    def choose_plot_ML(self, root, chosen, X_train, X_test, y_train, y_test, figure1, canvas1, figure2, canvas2):
        if chosen == "Logistic Regression":
            best_model, y_pred = self.obj_ml.implement_LR(chosen, X_train, X_test, y_train, y_test)

            #Plots confusion matrix and ROC
            self.plot_cm_roc(best_model, X_test, y_test, y_pred, chosen, figure1, canvas1)

            #Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train,
                X_test, y_test, y_pred, chosen, figure2, canvas2)

            #Shows table of result
            df_lr = self.obj_data.read_dataset("results_LR.csv")
            self.shows_table(root, df_lr, 450, 750, "Y_test and Y_pred of Logistic Regression")

        if chosen == "Random Forest":
            best_model, y_pred = self.obj_ml.implement_RF(chosen, X_train, X_test, y_train, y_test)

            #Plots confusion matrix and ROC
            self.plot_cm_roc(best_model, X_test, y_test, y_pred, chosen, figure1, canvas1)

            #Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train,
                X_test, y_test, y_pred, chosen, figure2, canvas2)

            #Shows table of result
            df_lr = self.obj_data.read_dataset("results_RF.csv")
            self.shows_table(root, df_lr, 450, 750, "Y_test and Y_pred of Random Forest")

        if chosen == "K-Nearest Neighbors":
            best_model, y_pred = self.obj_ml.implement_KNN(chosen, X_train, X_test, y_train, y_test)

            #Plots confusion matrix and ROC
            self.plot_cm_roc(best_model, X_test, y_test, y_pred, chosen, figure1, canvas1)

            #Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train,
                X_test, y_test, y_pred, chosen, figure2, canvas2)

            #Shows table of result
            df_lr = self.obj_data.read_dataset("results_KNN.csv")
            self.shows_table(root, df_lr, 450, 750, "Y_test and Y_pred of KNN")

        if chosen == "Decision Trees":
            best_model, y_pred = self.obj_ml.implement_DT(chosen, X_train, X_test, y_train, y_test)

            #Plots confusion matrix and ROC
            self.plot_cm_roc(best_model, X_test, y_test, y_pred, chosen, figure1, canvas1)

            #Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train,
                X_test, y_test, y_pred, chosen, figure2, canvas2)

            #Shows table of result
            df_lr = self.obj_data.read_dataset("results_DT.csv")
            self.shows_table(root, df_lr, 450, 750, "Y_test and Y_pred of Decision Trees")

        if chosen == "Gradient Boosting":
            best_model, y_pred = self.obj_ml.implement_GB(chosen, X_train, X_test, y_train, y_test)

            #Plots confusion matrix and ROC
            self.plot_cm_roc(best_model, X_test, y_test, y_pred, chosen, figure1, canvas1)

            #Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train,
                X_test, y_test, y_pred, chosen, figure2, canvas2)

            #Shows table of result
            df_lr = self.obj_data.read_dataset("results_GB.csv")
            self.shows_table(root, df_lr, 450, 750, "Y_test and Y_pred of Gradient Boosting")

        if chosen == "Extreme Gradient Boosting":
            best_model, y_pred = self.obj_ml.implement_XGB(chosen, X_train, X_test, y_train, y_test)

            #Plots confusion matrix and ROC
            self.plot_cm_roc(best_model, X_test, y_test, y_pred, chosen, figure1, canvas1)

            #Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train,
                X_test, y_test, y_pred, chosen, figure2, canvas2)

            #Shows table of result
            df_lr = self.obj_data.read_dataset("results_XGB.csv")
            self.shows_table(root, df_lr, 450, 750, "Y_test and Y_pred of Extreme Gradient Boosting")

        if chosen == "Multi-Layer Perceptron":
            best_model, y_pred = self.obj_ml.implement_MLP(chosen, X_train, X_test, y_train, y_test)

            #Plots confusion matrix and ROC
            self.plot_cm_roc(best_model, X_test, y_test, y_pred, chosen, figure1, canvas1)

            #Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train,
                X_test, y_test, y_pred, chosen, figure2, canvas2)

            #Shows table of result
            df_lr = self.obj_data.read_dataset("results_MLP.csv")
            self.shows_table(root, df_lr, 450, 750, "Y_test and Y_pred of Multi-Layer Perceptron")

        if chosen == "Support Vector Classifier":
            best_model, y_pred = self.obj_ml.implement_SVC(chosen, X_train, X_test, y_train, y_test)

            #Plots confusion matrix and ROC
            self.plot_cm_roc(best_model, X_test, y_test, y_pred, chosen, figure1, canvas1)

            #Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train,
                X_test, y_test, y_pred, chosen, figure2, canvas2)

            #Shows table of result
            df_lr = self.obj_data.read_dataset("results_SVC.csv")
            self.shows_table(root, df_lr, 450, 750, "Y_test and Y_pred of Support Vector Classifier")

        if chosen == "AdaBoost":
            best_model, y_pred = self.obj_ml.implement_ADA(chosen, X_train, X_test, y_train, y_test)

            #Plots confusion matrix and ROC
            self.plot_cm_roc(best_model, X_test, y_test, y_pred, chosen, figure1, canvas1)

            #Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train,
                X_test, y_test, y_pred, chosen, figure2, canvas2)

            #Shows table of result
            df_lr = self.obj_data.read_dataset("results_ADA.csv")
            self.shows_table(root, df_lr, 450, 750, "Y_test and Y_pred of AdaBoost Classifier")

    def plot_accuracy(self, history, name, figure, canvas):
        acc = history['accuracy']
        val_acc = history['val_accuracy']
        epochs = range(1, len(acc) + 1)

        #Clears and creates figure
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)

        #Plots training accuracy in red and validation accuracy in blue dashed line
        plot1.plot(epochs, acc, 'r', label='Training accuracy', lw=3)
        plot1.plot(epochs, val_acc, 'b--', label='Validation accuracy', lw=3)

        #Sets plot title and legend
        plot1.set_title('Training and validation accuracy of ' + name, fontsize=12)
        plot1.legend(fontsize=8)

        #Sets x-axis label and tick label font size
        plot1.set_xlabel("Epoch", fontsize=10)
        plot1.tick_params(labelsize=8)

        #Sets background color (plot1 is already an Axes, so gca() is not needed)
        plot1.set_facecolor('black')
        figure.tight_layout()
        canvas.draw()

    def plot_loss(self, history, name, figure, canvas):
        loss = history['loss']
        val_loss = history['val_loss']
        epochs = range(1, len(loss) + 1)

        #Clears and creates figure
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)

        #Plots training loss in red and validation loss in blue dashed line
        plot1.plot(epochs, loss, 'r', label='Training loss', lw=3)
        plot1.plot(epochs, val_loss, 'b--', label='Validation loss', lw=3)

        #Sets plot title and legend
        plot1.set_title('Training and validation loss of ' + name, fontsize=12)
        plot1.legend(fontsize=8)

        #Sets x-axis label and tick label font size
        plot1.set_xlabel("Epoch", fontsize=10)
        plot1.tick_params(labelsize=8)

        #Sets background color (plot1 is already an Axes, so gca() is not needed)
        plot1.set_facecolor('lightgray')
        figure.tight_layout()
        canvas.draw()

#machine_learning.py
import os
import numpy as np
import pandas as pd
import joblib
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score
#Note: plot_confusion_matrix was removed in scikit-learn 1.2; drop it from this import on newer versions
from sklearn.metrics import classification_report, f1_score, plot_confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from process_data import Process_Data

class Machine_Learning:
    def __init__(self):
        self.obj_data = Process_Data()

    def oversampling_splitting(self, X, y):
        #Balances the classes with SMOTE oversampling
        sm = SMOTE(random_state=42)
        X,y = sm.fit_resample(X, y.ravel())

        #Splits the data into training and testing
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
            random_state = 2021, stratify=y)

        #Uses Standard Scaler, fitted on the training data only
        scaler = StandardScaler()
        X_train_stand = scaler.fit_transform(X_train)
        X_test_stand = scaler.transform(X_test)

        #Saves into pkl files
        joblib.dump(X_train_stand, 'X_train.pkl')
        joblib.dump(X_test_stand, 'X_test.pkl')
        joblib.dump(y_train, 'y_train.pkl')
        joblib.dump(y_test, 'y_test.pkl')

    def load_files(self):
        X_train = joblib.load('X_train.pkl')
        X_test = joblib.load('X_test.pkl')
        y_train = joblib.load('y_train.pkl')
        y_test = joblib.load('y_test.pkl')

        return X_train, X_test, y_train, 
y_test def choose_feats_boundary(self, X, y): file_path = os.getcwd() X_train_feat_path = os.path.join(file_path, 'X_train_feat.pkl') X_test_feat_path = os.path.join(file_path, 'X_test_feat.pkl') y_train_feat_path = os.path.join(file_path, 'y_train_feat.pkl') y_test_feat_path = os.path.join(file_path, 'y_test_feat.pkl') if os.path.exists(X_train_feat_path): X_train_feat = joblib.load(X_train_feat_path) X_test_feat = joblib.load(X_test_feat_path) y_train_feat = joblib.load(y_train_feat_path) y_test_feat = joblib.load(y_test_feat_path) else: # Make sure feat_boundary contains valid column indices from your X array feat_boundary = [1, 2] # actual indices if all(idx < X.shape[1] for idx in feat_boundary): X_feature = X[:, feat_boundary] X_train_feat, X_test_feat, y_train_feat, y_test_feat = train_test_split(X_feature, y, test_size=0.2, random_state=2021, stratify=y) # Saves into pkl files joblib.dump(X_train_feat, X_train_feat_path) joblib.dump(X_test_feat, X_test_feat_path) joblib.dump(y_train_feat, y_train_feat_path) joblib.dump(y_test_feat, y_test_feat_path) else: raise ValueError("Indices in feat_boundary exceed the number of columns in X array") return X_train_feat, X_test_feat, y_train_feat, y_test_feat def train_model(self, model, X, y): model.fit(X, y) return model def predict_model(self, model, X, proba=False): if ~proba: y_pred = model.predict(X) else: y_pred_proba = model.predict_proba(X) y_pred = np.argmax(y_pred_proba, axis=1) return y_pred def run_model(self, name, model, X_train, X_test, y_train, y_test, proba=False): y_pred = self.predict_model(model, X_test, proba) accuracy = accuracy_score(y_test, y_pred) recall = recall_score(y_test, y_pred, average='weighted') precision = precision_score(y_test, y_pred, average='weighted') f1 = f1_score(y_test, y_pred, average='weighted') print(name) print('accuracy: ', accuracy) print('recall: ', recall) print('precision: ', precision) print('f1: ', f1) print(classification_report(y_test, y_pred)) return y_pred def 
logistic_regression(self, name, X_train, X_test, y_train, y_test): #Logistic Regression Classifier # Define the parameter grid for the grid search param_grid = { 'C': [0.01, 0.1, 1, 10], 'penalty': ['none', 'l2'], 'solver': ['newton-cg', 'lbfgs', 'liblinear', 'saga'], } # Initialize the Logistic Regression model logreg = LogisticRegression(max_iter=5000, random_state=2021) # Create GridSearchCV with the Logistic Regression model and the parameter grid grid_search = GridSearchCV(logreg, param_grid, cv=3, scoring='accuracy', n_jobs=-1) # Train and perform grid search grid_search.fit(X_train, y_train) # Get the best Logistic Regression model from the grid search best_model = grid_search.best_estimator_ #Saves model joblib.dump(best_model, 'LR_Model.pkl') # Print the best hyperparameters found print(f"Best Hyperparameters for LR:") print(grid_search.best_params_) return best_model def implement_LR(self, chosen, X_train, X_test, y_train, y_test): file_path = os.getcwd()+"/LR_Model.pkl" if os.path.exists(file_path): model = joblib.load('LR_Model.pkl') y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True) else: model = self.logistic_regression(chosen, X_train, X_test, y_train, y_test) y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True) #Saves result into excel file self.obj_data.save_result(y_test, y_pred, "results_LR.csv") print("Training Logistic Regression done...") return model, y_pred def random_forest(self, name, X_train, X_test, y_train, y_test): #Random Forest Classifier # Define the parameter grid for the grid search param_grid = { 'n_estimators': [100, 200, 300], 'max_depth': [10, 20, 30, 40, 50], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4] } # Initialize the RandomForestClassifier model rf = RandomForestClassifier(random_state=2021) # Create GridSearchCV with the RandomForestClassifier model and the parameter grid grid_search = GridSearchCV(rf, param_grid, cv=3, 
            scoring='accuracy', n_jobs=-1)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best RandomForestClassifier model from the grid search
        best_model = grid_search.best_estimator_

        #Saves model
        joblib.dump(best_model, 'RF_Model.pkl')

        # Print the best hyperparameters found
        print(f"Best Hyperparameters for RF:")
        print(grid_search.best_params_)

        return best_model

    def implement_RF(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/RF_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('RF_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)
        else:
            model = self.random_forest(chosen, X_train, X_test, y_train, y_test)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_RF.csv")

        print("Training Random Forest done...")
        return model, y_pred

    def knearest_neigbors(self, name, X_train, X_test, y_train, y_test):
        #KNN Classifier
        # Define the parameter grid for the grid search
        param_grid = {
            'n_neighbors': list(range(2, 10))
        }

        # Initialize the KNN Classifier
        knn = KNeighborsClassifier()

        # Create GridSearchCV with the KNN model and the parameter grid
        grid_search = GridSearchCV(knn, param_grid, cv=3,
            scoring='accuracy', n_jobs=-1)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best KNN model from the grid search
        best_model = grid_search.best_estimator_

        #Saves model
        joblib.dump(best_model, 'KNN_Model.pkl')

        # Print the best hyperparameters found
        print(f"Best Hyperparameters for KNN:")
        print(grid_search.best_params_)

        return best_model

    def implement_KNN(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/KNN_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('KNN_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)
        else:
            model = self.knearest_neigbors(chosen, X_train, X_test,
                y_train, y_test)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_KNN.csv")

        print("Training KNN done...")
        return model, y_pred

    def decision_trees(self, name, X_train, X_test, y_train, y_test):
        # Initialize the DecisionTreeClassifier model
        dt_clf = DecisionTreeClassifier(random_state=2021)

        # Define the parameter grid for the grid search
        param_grid = {
            'max_depth': np.arange(1, 51, 1),
            'criterion': ['gini', 'entropy'],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4],
        }

        # Create GridSearchCV with the DecisionTreeClassifier model and the parameter grid
        grid_search = GridSearchCV(dt_clf, param_grid, cv=3,
            scoring='accuracy', n_jobs=-1)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best DecisionTreeClassifier model from the grid search
        best_model = grid_search.best_estimator_

        #Saves model
        joblib.dump(best_model, 'DT_Model.pkl')

        # Print the best hyperparameters found
        print(f"Best Hyperparameters for DT:")
        print(grid_search.best_params_)

        return best_model

    def implement_DT(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/DT_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('DT_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)
        else:
            model = self.decision_trees(chosen, X_train, X_test, y_train, y_test)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_DT.csv")

        print("Training Decision Trees done...")
        return model, y_pred

    def gradient_boosting(self, name, X_train, X_test, y_train, y_test):
        #Gradient Boosting Classifier
        # Initialize the GradientBoostingClassifier model
        gbt = GradientBoostingClassifier(random_state=2021)

        # Define the parameter grid for the grid search
        param_grid = {
            'n_estimators': [100, 200, 300],
            'max_depth': [10, 20, 30],
            'subsample': [0.6, 0.8, 1.0],
            'max_features': [0.2, 0.4, 0.6, 0.8, 1.0],
        }

        # Create GridSearchCV with the GradientBoostingClassifier model and the parameter grid
        grid_search = GridSearchCV(gbt, param_grid, cv=3,
            scoring='accuracy', n_jobs=-1)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best GradientBoostingClassifier model from the grid search
        best_model = grid_search.best_estimator_

        #Saves model
        joblib.dump(best_model, 'GB_Model.pkl')

        # Print the best hyperparameters found
        print(f"Best Hyperparameters for GB:")
        print(grid_search.best_params_)

        return best_model

    def implement_GB(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/GB_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('GB_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)
        else:
            model = self.gradient_boosting(chosen, X_train, X_test, y_train, y_test)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_GB.csv")

        print("Training Gradient Boosting done...")
        return model, y_pred

    # def light_gradient_boosting(self, name, X_train, X_test, y_train, y_test):
    #     #LGBM Classifier
    #     # Define the parameter grid for grid search
    #     param_grid = {
    #         'max_depth': [10, 20, 30],
    #         'n_estimators': [100, 200, 300],
    #         'subsample': [0.6, 0.8, 1.0],
    #         'random_state': [2021]
    #     }
    #
    #     # Initialize the LightGBM classifier
    #     lgbm = LGBMClassifier()
    #
    #     # Create GridSearchCV with the LightGBM classifier and the parameter grid
    #     grid_search = GridSearchCV(lgbm, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
    #
    #     # Train and perform grid search
    #     grid_search.fit(X_train, y_train)
    #
    #     # Get the best LightGBM classifier model from the grid search
    #     best_model = grid_search.best_estimator_
    #
    #     #Saves model
    #     joblib.dump(best_model, 'LGB_Model.pkl')
    #
    #     # Print the best hyperparameters found
    #     print(f"Best Hyperparameters for LGB:")
    #     print(grid_search.best_params_)
    #
    #     return best_model

    # def implement_LGB(self, chosen, X_train, X_test, y_train, y_test):
    #     file_path = os.getcwd()+"/LGB_Model.pkl"
    #     if os.path.exists(file_path):
    #         model = joblib.load('LGB_Model.pkl')
    #         y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)
    #     else:
    #         model = self.light_gradient_boosting(chosen, X_train, X_test, y_train, y_test)
    #         y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)
    #
    #     #Saves result into CSV file
    #     self.save_result(y_test, y_pred, "results_LGB.csv")
    #
    #     print("Training Light Gradient Boosting done...")
    #     return model, y_pred

    def extreme_gradient_boosting(self, name, X_train, X_test, y_train, y_test):
        # Define the parameter grid for the grid search
        param_grid = {
            'n_estimators': [100, 200, 300],
            'max_depth': [10, 20, 30],
            'learning_rate': [0.01, 0.1, 0.2],
            'subsample': [0.6, 0.8, 1.0],
            'colsample_bytree': [0.6, 0.8, 1.0],
        }

        # Initialize the XGBoost classifier
        xgb = XGBClassifier(random_state=2021, use_label_encoder=False, eval_metric='mlogloss')

        # Create GridSearchCV with the XGBoost classifier and the parameter grid
        grid_search = GridSearchCV(xgb, param_grid, cv=3,
            scoring='accuracy', n_jobs=-1)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best XGBoost classifier model from the grid search
        best_model = grid_search.best_estimator_

        #Saves model
        joblib.dump(best_model, 'XGB_Model.pkl')

        # Print the best hyperparameters found
        print(f"Best Hyperparameters for XGB:")
        print(grid_search.best_params_)

        return best_model

    def implement_XGB(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/XGB_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('XGB_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)
        else:
            model = self.extreme_gradient_boosting(chosen, X_train, X_test, y_train, y_test)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_XGB.csv")

        print("Training Extreme Gradient Boosting done...")
        return model, y_pred

    def multi_layer_perceptron(self, name, X_train, X_test, y_train, y_test):
        # Define the parameter grid for the grid search
        param_grid = {
            'hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50), (100, 100)],
            'activation': ['logistic', 'relu'],
            'solver': ['adam', 'sgd'],
            'alpha': [0.0001, 0.001, 0.01],
            'learning_rate': ['constant', 'invscaling', 'adaptive'],
        }

        # Initialize the MLP Classifier
        mlp = MLPClassifier(random_state=2021)

        # Create GridSearchCV with the MLP Classifier and the parameter grid
        grid_search = GridSearchCV(mlp, param_grid, cv=3,
            scoring='accuracy', n_jobs=-1)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best MLP Classifier model from the grid search
        best_model = grid_search.best_estimator_

        #Saves model
        joblib.dump(best_model, 'MLP_Model.pkl')

        # Print the best hyperparameters found
        print(f"Best Hyperparameters for MLP:")
        print(grid_search.best_params_)

        return best_model

    def implement_MLP(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/MLP_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('MLP_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)
        else:
            model = self.multi_layer_perceptron(chosen, X_train, X_test, y_train, y_test)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_MLP.csv")

        print("Training Multi-Layer Perceptron done...")
        return model, y_pred

    def support_vector(self, name, X_train, X_test, y_train, y_test):
        #Support Vector Classifier
        # Define the parameter grid for the grid search
        param_grid = {
            'C': [0.1, 1, 10],
            'kernel': ['linear', 'poly', 'rbf'],
            'gamma': ['scale', 'auto',
                      0.1, 1],
        }

        # Initialize the SVC model
        model_svc = SVC(random_state=2021, probability=True)

        # Create GridSearchCV with the SVC model and the parameter grid
        grid_search = GridSearchCV(model_svc, param_grid, cv=3,
            scoring='accuracy', n_jobs=-1, refit=True)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best SVC model from the grid search
        best_model = grid_search.best_estimator_

        #Saves model
        joblib.dump(best_model, 'SVC_Model.pkl')

        # Print the best hyperparameters found
        print(f"Best Hyperparameters for SVC:")
        print(grid_search.best_params_)

        return best_model

    def implement_SVC(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/SVC_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('SVC_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)
        else:
            model = self.support_vector(chosen, X_train, X_test, y_train, y_test)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_SVC.csv")

        print("Training Support Vector Classifier done...")
        return model, y_pred

    def adaboost_classifier(self, name, X_train, X_test, y_train, y_test):
        # Define the parameter grid for the grid search
        param_grid = {
            'n_estimators': [50, 100, 150],
            'learning_rate': [0.01, 0.1, 0.2],
        }

        # Initialize the AdaBoost classifier
        adaboost = AdaBoostClassifier(random_state=2021)

        # Create GridSearchCV with the AdaBoost classifier and the parameter grid
        grid_search = GridSearchCV(adaboost, param_grid, cv=3,
            scoring='accuracy', n_jobs=-1)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best AdaBoost Classifier model from the grid search
        best_model = grid_search.best_estimator_

        #Saves model
        joblib.dump(best_model, 'ADA_Model.pkl')

        # Print the best hyperparameters found
        print(f"Best Hyperparameters for AdaBoost:")
        print(grid_search.best_params_)

        return best_model

    def implement_ADA(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/ADA_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('ADA_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)
        else:
            model = self.adaboost_classifier(chosen, X_train, X_test, y_train, y_test)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_ADA.csv")

        print("Training AdaBoost done...")
        return model, y_pred

#process_data.py
import os
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

class Process_Data:
    def read_dataset(self, filename):
        #Reads dataset
        curr_path = os.getcwd()
        path = os.path.join(curr_path, filename)
        df = pd.read_csv(path)

        return df

    def preprocess(self):
        df = self.read_dataset("marketing_data.csv")

        #Drops ID column
        df = df.drop("ID", axis = 1)

        #Renames column name and corrects data type
        df.rename(columns={' Income ':'Income'},inplace=True)
        df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format='%m/%d/%y')
        # regex=False treats "$" as a literal character, not as the end-of-string anchor
        df["Income"] = df["Income"].str.replace("$","", regex=False).str.replace(",","", regex=False)
        df["Income"] = df["Income"].astype(float)

        #Checks null values
        print(df.isnull().sum())
        print('Total number of null values: ', df.isnull().sum().sum())

        #Imputes Income column with median values
        df['Income'] = df['Income'].fillna(df['Income'].median())
        print(f'Number of Null values in "Income" after Imputation: {df["Income"].isna().sum()}')

        #Transforms Dt_Customer
        df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'])
        print(f'After Transformation:\n{df["Dt_Customer"].head()}')
        df['Customer_Age'] = df['Dt_Customer'].dt.year - df['Year_Birth']

        #Creates number of children/dependents in home by adding 'Kidhome' and 'Teenhome' features
        #Creates number of Total_Purchases by adding all the purchases features
        #Creates TotalAmount_Spent by adding all the Mnt* features
        df['Dt_Customer_Month'] = df['Dt_Customer'].dt.month
        df['Dt_Customer_Year'] = df['Dt_Customer'].dt.year
        df['Num_Dependants'] = df['Kidhome'] + df['Teenhome']

        purchase_features = [c for c in df.columns if 'Purchase' in str(c)]

        #Removes 'NumDealsPurchases' from the list above
        purchase_features.remove('NumDealsPurchases')
        df['Num_TotalPurchases'] = df[purchase_features].sum(axis = 1)

        amt_spent_features = [c for c in df.columns if 'Mnt' in str(c)]
        df['TotalAmount_Spent'] = df[amt_spent_features].sum(axis = 1)

        #Creates a categorical feature using the customer's age by binning them,
        #to help understanding purchasing behaviour
        print(f'Min. Customer Age: {df["Customer_Age"].min()}')
        print(f'Max. Customer Age: {df["Customer_Age"].max()}')
        df['AgeGroup'] = pd.cut(df['Customer_Age'], bins = [6, 24, 29, 40, 56, 75],
            labels = ['Gen-Z', 'Gen-Y.1', 'Gen-Y.2', 'Gen-X', 'BBoomers'])

        return df

    def categorize(self, df):
        #Creates a dummy dataframe for visualization
        df_dummy=df.copy()

        #Categorizes Income feature
        labels = ['0-20k', '20k-30k', '30k-50k','50k-70k','70k-700k']
        df_dummy['Income'] = pd.cut(df_dummy['Income'],
            [0, 20000, 30000, 50000, 70000, 700000], labels=labels)

        #Categorizes TotalAmount_Spent feature
        labels = ['0-200', '200-500', '500-800','800-1000','1000-3000']
        df_dummy['TotalAmount_Spent'] = pd.cut(df_dummy['TotalAmount_Spent'],
            [0, 200, 500, 800, 1000, 3000], labels=labels)

        #Categorizes Num_TotalPurchases feature
        labels = ['0-5', '5-10', '10-15','15-25','25-35']
        df_dummy['Num_TotalPurchases'] = pd.cut(df_dummy['Num_TotalPurchases'],
            [0, 5, 10, 15, 25, 35], labels=labels)

        #Categorizes Dt_Customer_Year feature
        labels = ['2012', '2013', '2014']
        df_dummy['Dt_Customer_Year'] = pd.cut(df_dummy['Dt_Customer_Year'],
            [0, 2012, 2013, 2014], labels=labels)

        #Categorizes Dt_Customer_Month feature
        labels = ['0-3', '3-6', '6-9','9-12']
        df_dummy['Dt_Customer_Month'] = pd.cut(df_dummy['Dt_Customer_Month'],
            [0, 3, 6, 9, 12], labels=labels)

        #Categorizes Customer_Age feature
        labels = ['0-30', '30-40', '40-50', '50-60','60-120']
        df_dummy['Customer_Age'] = pd.cut(df_dummy['Customer_Age'],
            [0, 30, 40, 50, 60, 120], labels=labels)

        #Categorizes MntGoldProds feature
        labels = ['0-30', '30-50', '50-80', '80-100','100-400']
        df_dummy['MntGoldProds'] = pd.cut(df_dummy['MntGoldProds'],
            [0, 30, 50, 80, 100, 400], labels=labels)

        #Categorizes MntSweetProducts feature
        labels = ['0-10', '10-20', '20-40', '40-100','100-300']
        df_dummy['MntSweetProducts'] = pd.cut(df_dummy['MntSweetProducts'],
            [0, 10, 20, 40, 100, 300], labels=labels)

        #Categorizes MntFishProducts feature
        labels = ['0-10', '10-20', '20-40', '40-100','100-300']
        df_dummy['MntFishProducts'] = pd.cut(df_dummy['MntFishProducts'],
            [0, 10, 20, 40, 100, 300], labels=labels)

        #Categorizes MntMeatProducts feature
        labels = ['0-50', '50-100', '100-200', '200-500','500-2000']
        df_dummy['MntMeatProducts'] = pd.cut(df_dummy['MntMeatProducts'],
            [0, 50, 100, 200, 500, 2000], labels=labels)

        #Categorizes MntFruits feature (first inner edge set to 10 to match the '0-10' label)
        labels = ['0-10', '10-30', '30-50', '50-100','100-200']
        df_dummy['MntFruits'] = pd.cut(df_dummy['MntFruits'],
            [0, 10, 30, 50, 100, 200], labels=labels)

        #Categorizes MntWines feature
        labels = ['0-100', '100-300', '300-500', '500-1000','1000-1500']
        df_dummy['MntWines'] = pd.cut(df_dummy['MntWines'],
            [0, 100, 300, 500, 1000, 1500], labels=labels)

        #Categorizes Recency feature
        labels = ['0-10', '10-30', '30-50', '50-80','80-100']
        df_dummy['Recency'] = pd.cut(df_dummy['Recency'],
            [0, 10, 30, 50, 80, 100], labels=labels)

        return df_dummy

    def extract_cat_num_cols(self, df):
        #Extracts categorical and numerical columns in dummy dataset
        cat_cols = [col for col in df.columns
            if (df[col].dtype == 'object') or (df[col].dtype.name == 'category')]
        num_cols = [col for col in df.columns
            if (df[col].dtype != 'object') and (df[col].dtype.name != 'category')]

        return cat_cols, num_cols

    def encode_categorical_feats(self, df, cat_cols):
        #Encodes categorical features in original dataset
        print(f'Features that need to be Label Encoded: \n{cat_cols}')
        for c in cat_cols:
            lbl = LabelEncoder()
            lbl.fit(list(df[c].astype(str).values))
            df[c] = lbl.transform(list(df[c].astype(str).values))
        print('Label Encoding done..')
        return df

    def extract_input_output_vars(self, df):
        #Extracts output and input variables
        y = df['Response'].values # Target for the model
        X = df.drop(['Dt_Customer', 'Year_Birth', 'Response'], axis = 1)

        return X, y

    def feat_importance_rf(self, X, y):
        names = X.columns
        rf = RandomForestClassifier()
        rf.fit(X, y)

        result_rf = pd.DataFrame()
        result_rf['Features'] = X.columns
        result_rf['Values'] = rf.feature_importances_
        result_rf.sort_values('Values', inplace = True, ascending = False)

        return result_rf

    def feat_importance_et(self, X, y):
        model = ExtraTreesClassifier()
        model.fit(X, y)

        result_et = pd.DataFrame()
        result_et['Features'] = X.columns
        result_et['Values'] = model.feature_importances_
        result_et.sort_values('Values', inplace=True, ascending =False)

        return result_et

    def feat_importance_rfe(self, X, y):
        model = LogisticRegression()
        #Creates the RFE model
        rfe = RFE(model)
        rfe = rfe.fit(X, y)

        result_lg = pd.DataFrame()
        result_lg['Features'] = X.columns
        result_lg['Ranking'] = rfe.ranking_
        result_lg.sort_values('Ranking', inplace=True , ascending = False)

        return result_lg

    def save_result(self, y_test, y_pred, fname):
        # Convert y_test and y_pred to pandas Series for easier handling
        y_test_series = pd.Series(y_test)
        y_pred_series = pd.Series(y_pred)

        # Calculate y_result_series (True where the prediction matches the label)
        y_result_series = pd.Series(y_pred - y_test == 0)
        y_result_series = y_result_series.map({True: 'True', False: 'False'})

        # Create a DataFrame to hold y_test, y_pred, and y_result
        data = pd.DataFrame({'y_test': y_test_series, 'y_pred': y_pred_series, 'result': y_result_series})

        # Save the DataFrame to a CSV file
        data.to_csv(fname, index=False)
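A note on the `proba` flag that `predict_model` receives in the listing above: on a plain Python `bool`, the bitwise operator `~` is not a logical negation. `~False` is `-1` and `~True` is `-2`, and both are truthy, so a test written as `if ~proba:` takes the same branch for either value, while `if not proba:` branches as intended. The sketch below isolates that behaviour with the standard library only. The function names are hypothetical and are not part of the project; `argmax_to_labels` is likewise an assumed helper showing one way to turn a row of class probabilities back into a label.

```python
import warnings


def choose_branch_bitwise(proba):
    """Branch on ~proba (bitwise NOT): always picks the 'predict' path."""
    with warnings.catch_warnings():
        # Newer Python versions warn about ~ on a bool; silence it for the demo.
        warnings.simplefilter("ignore")
        flag = ~proba  # ~False == -1, ~True == -2, both non-zero
    if flag:
        return "predict"
    return "predict_proba"


def choose_branch_logical(proba):
    """Branch on not proba (logical NOT): follows the flag."""
    if not proba:
        return "predict"
    return "predict_proba"


def argmax_to_labels(probabilities, classes):
    """Map each row of class probabilities to the label with the largest value."""
    labels = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best])
    return labels


if __name__ == "__main__":
    for flag in (False, True):
        print(flag, choose_branch_bitwise(flag), choose_branch_logical(flag))
    print(argmax_to_labels([[0.2, 0.8], [0.9, 0.1]], ["no", "yes"]))
```

Mapping the arg-max index through the list of classes, rather than using the index itself as the prediction, keeps the result correct when the labels are not the integers 0, 1, 2, and so on.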