Monday, October 9, 2023

TKINTER, DATA SCIENCE, AND MACHINE LEARNING

Dataset

Google Play Books

Amazon Kindle

Amazon Paperback

Kobo Store



In this project, we embarked on a comprehensive journey through machine learning and model evaluation. Our primary goal was to develop a Tkinter GUI and assess various machine learning models on a given dataset to identify the best-performing one. This process is essential in solving real-world problems because it helps us select the most suitable algorithm for a specific task. The Tkinter-powered GUI provides an accessible, user-friendly interface for working with machine learning models: users can load data, select models, initiate training, and visualize results without writing code or using the command line, which makes the workflow approachable for people with diverse levels of technical proficiency.


We began by loading and preprocessing the dataset, a fundamental step in any machine learning project. Proper data preprocessing involves tasks such as handling missing values, encoding categorical features, and scaling numerical attributes. These operations ensure that the data is in a format suitable for training and testing machine learning models.
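
As a minimal sketch of these preprocessing steps (the CSV file name below is a placeholder rather than the project's actual file; Income and Education are columns that appear elsewhere in this project):

# Hedged preprocessing sketch: missing values, categorical encoding, scaling
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv("marketing_data.csv")  # placeholder file name

# Fill missing values in a numeric column with the median
df["Income"] = df["Income"].fillna(df["Income"].median())

# Encode a categorical feature as integer codes
df["Education"] = LabelEncoder().fit_transform(df["Education"])

# Scale a numerical attribute
df[["Income"]] = StandardScaler().fit_transform(df[["Income"]])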


Once our data was ready, we moved on to the model selection phase. We evaluated multiple machine learning algorithms, each with its strengths and weaknesses. The models we explored included Logistic Regression, Random Forest, K-Nearest Neighbors (KNN), Decision Trees, AdaBoost, Gradient Boosting, Extreme Gradient Boosting (XGBoost), Multi-Layer Perceptron (MLP), and Support Vector Classifier (SVC).


For each model, we employed a systematic approach to find the best hyperparameters using grid search with cross-validation. This technique allowed us to explore different combinations of hyperparameters and select the configuration that yielded the highest accuracy on the training data. These hyperparameters included settings like the number of estimators, learning rate, and kernel function, depending on the specific model.
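
The snippet below sketches this idea with scikit-learn's GridSearchCV, using a Random Forest grid similar to the one in the source code further down (X_train and y_train are assumed to be the preprocessed training split):

# Grid search with 3-fold cross-validation (sketch)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 200, 300], "max_depth": [10, 20, 30]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=2021),
                           param_grid, cv=3, scoring="accuracy", n_jobs=-1)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_   # configuration with the highest CV accuracy
print(grid_search.best_params_)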


After obtaining the best hyperparameters for each model, we trained them on our preprocessed dataset. This training process involved using the training data to teach the model to make predictions on new, unseen examples. Once trained, the models were ready for evaluation.


We assessed the performance of each model using a set of well-established evaluation metrics. These metrics included accuracy, precision, recall, and F1-score. Accuracy measured the overall correctness of predictions, while precision quantified the proportion of true positive predictions out of all positive predictions. Recall, on the other hand, represented the proportion of true positive predictions out of all actual positives, highlighting a model's ability to identify positive cases. The F1-score combined precision and recall into a single metric, helping us gauge the overall balance between these two aspects.
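
These metrics can be computed directly with scikit-learn, as in this short sketch (y_test and y_pred are assumed to come from an already fitted classifier):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, average='weighted'))
print("recall   :", recall_score(y_test, y_pred, average='weighted'))
print("f1-score :", f1_score(y_test, y_pred, average='weighted'))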


To visualize the model's performance, we created key graphical representations. These included confusion matrices, which showed the number of true positive, true negative, false positive, and false negative predictions, aiding in understanding the model's classification results. Additionally, we generated Receiver Operating Characteristic (ROC) curves and area under the curve (AUC) scores, which depicted a model's ability to distinguish between classes. High AUC values indicated excellent model performance.
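
A condensed sketch of how these quantities are obtained (model, X_test, and y_test are assumed to be a fitted binary classifier and its held-out test split):

from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

y_prob = model.predict_proba(X_test)[:, 1]          # probability of the positive class
cm = confusion_matrix(y_test, model.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, y_prob)    # points of the ROC curve
auc = roc_auc_score(y_test, y_prob)
print(cm)
print("AUC:", auc)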


Furthermore, we constructed true values versus predicted values diagrams to provide insights into how well our models aligned with the actual data distribution. Learning curves were also generated to observe a model's performance as a function of training data size, helping us assess whether the model was overfitting or underfitting.
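
The learning curves follow scikit-learn's learning_curve utility, roughly as sketched here (model, X_train, and y_train are assumed to exist):

import numpy as np
from sklearn.model_selection import learning_curve

train_sizes, train_scores, valid_scores = learning_curve(
    model, X_train, y_train, train_sizes=np.linspace(0.1, 1.0, 5), cv=3)
print(train_scores.mean(axis=1))   # training score per training set size
print(valid_scores.mean(axis=1))   # cross-validation score per training set size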


Lastly, we presented the results in a clear and organized manner, saving them to CSV files for easy reference. This allowed us to compare the performance of different models and make an informed choice about which one to select for our specific task.
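
Persisting the predictions can be as simple as the following sketch (y_test and y_pred are assumed to exist; the file name mirrors the pattern used in the source code below):

import pandas as pd

results = pd.DataFrame({"y_test": y_test, "y_pred": y_pred})
results.to_csv("results_LR.csv", index=False)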


In summary, this project was a comprehensive exploration of the machine learning model development and evaluation process. We prepared the data, selected and fine-tuned various models, assessed their performance using multiple metrics and visualizations, and ultimately arrived at a well-informed decision about the most suitable model for our dataset. This approach serves as a valuable blueprint for tackling real-world machine learning challenges effectively.

SOURCE CODE:


#main_class.py
import os
import tkinter as tk
from tkinter import *
from design_window import Design_Window
from process_data import Process_Data
from helper_plot import Helper_Plot
from machine_learning import Machine_Learning

class Main_Class:
    def __init__(self, root):
        self.root = root
        self.initialize()

    def initialize(self):
        width = 1500
        height = 750
        self.root.geometry(f"{width}x{height}")
        self.root.title("TKINTER AND DATA SCIENCE")
        
        #Creates necessary objects
        self.obj_window = Design_Window()
        self.obj_data = Process_Data()
        self.obj_plot = Helper_Plot()
        self.obj_ML = Machine_Learning()

        #Places widgets in root
        self.obj_window.add_widgets(self.root)    
        
        #Reads dataset
        self.df = self.obj_data.preprocess()

        #Categorize dataset
        self.df_dummy = self.obj_data.categorize(self.df)

        #Extracts input and output variables
        self.cat_cols, self.num_cols = self.obj_data.extract_cat_num_cols(self.df)
        self.df_final = self.obj_data.encode_categorical_feats(self.df, self.cat_cols)
        self.X, self.y = self.obj_data.extract_input_output_vars(self.df_final)  

        #Binds event
        self.binds_event() 

        #Initially turns off combo4 and combo5 before data splitting is done
        self.obj_window.combo4['state'] = 'disabled'
        self.obj_window.combo5['state'] = 'disabled'

    def binds_event(self):
        #Binds button1 to shows_table() function
        #Shows table if user clicks LOAD DATASET 
        self.obj_window.button1.config(command = lambda:self.obj_plot.shows_table(self.root, self.df, 1400, 600, "Dataset"))  

        #Binds listbox to a function
        self.obj_window.listbox.bind("<<ListboxSelect>>", self.choose_list_widget)

        # Binds combobox1 to a function
        self.obj_window.combo1.bind("<<ComboboxSelected>>", self.choose_combobox1)

        # Binds combobox2 to a function
        self.obj_window.combo2.bind("<<ComboboxSelected>>", self.choose_combobox2)

        #Binds button2 to train_ML() function
        self.obj_window.button2.config(command=self.train_ML)

        # Binds combobox4 to a function
        self.obj_window.combo4.bind("<<ComboboxSelected>>", self.choose_combobox4)

    def choose_list_widget(self, event):
        chosen = self.obj_window.listbox.get(self.obj_window.listbox.curselection())
        print(chosen)
        self.obj_plot.choose_plot(self.df, self.df_dummy, chosen, 
            self.obj_window.figure1, self.obj_window.canvas1, 
            self.obj_window.figure2, self.obj_window.canvas2)

    def choose_combobox1(self, event):
        chosen = self.obj_window.combo1.get()
        self.obj_plot.choose_category(self.df_dummy, chosen, 
            self.obj_window.figure1, self.obj_window.canvas1, 
            self.obj_window.figure2, self.obj_window.canvas2)

    def choose_combobox2(self, event):
        chosen = self.obj_window.combo2.get()
        self.obj_plot.choose_plot_more(self.df_final, chosen, 
            self.X, self.y,
            self.obj_window.figure1, 
            self.obj_window.canvas1, self.obj_window.figure2, 
            self.obj_window.canvas2)

    def train_ML(self):
        file_path = os.getcwd()+"/X_train.pkl"
        if os.path.exists(file_path):
            self.X_train, self.X_test, self.y_train, self.y_test = self.obj_ML.load_files()
        else:
            self.obj_ML.oversampling_splitting(self.X, self.y)
            self.X_train, self.X_test, self.y_train, self.y_test = self.obj_ML.load_files()

        print("Loading files done...")

        #turns on combo4 and combo5 after splitting is done
        self.obj_window.combo4['state'] = 'normal'
        self.obj_window.combo5['state'] = 'normal'

        self.obj_window.button2.config(state="disabled")

    def choose_combobox4(self, event):
        chosen = self.obj_window.combo4.get()
        self.obj_plot.choose_plot_ML(self.root, chosen, self.X_train, self.X_test, 
            self.y_train, self.y_test, self.obj_window.figure1, 
            self.obj_window.canvas1, self.obj_window.figure2, 
            self.obj_window.canvas2)         

if __name__ == "__main__":
    root = tk.Tk()
    app = Main_Class(root)
    root.mainloop()

#design_window.py
import tkinter as tk
from tkinter import ttk
from matplotlib.figure import Figure
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg

class Design_Window:
    def add_widgets(self, root):
        #Adds button(s)
        self.add_buttons(root)

        #Adds canvasses
        self.add_canvas(root)

        #Adds labels
        self.add_labels(root)

        #Adds listbox widget
        self.add_listboxes(root)

        #Adds combobox widget
        self.add_comboboxes(root)

    def add_buttons(self, root):
        #Adds button
        self.button1 = tk.Button(root, height=2, width=30, text="LOAD DATASET")
        self.button1.grid(row=0, column=0, padx=5, pady=5, sticky="w")

        self.button2 = tk.Button(root, height=2, width=30, text="SPLIT DATA")
        self.button2.grid(row=9, column=0, padx=5, pady=5, sticky="w")

    def add_labels(self, root):
        #Adds labels
        self.label1 = tk.Label(root, text = "CHOOSE PLOT", fg = "red")
        self.label1.grid(row=1, column=0, padx=5, pady=1, sticky="w")

        self.label2 = tk.Label(root, text = "CHOOSE CATEGORIZED PLOT", fg = "blue")
        self.label2.grid(row=3, column=0, padx=5, pady=1, sticky="w")

        self.label3 = tk.Label(root, text = "CHOOSE FEATURES", fg = "black")
        self.label3.grid(row=5, column=0, padx=5, pady=1, sticky="w")

        self.label4 = tk.Label(root, text = "CHOOSE REGRESSORS", fg = "green")
        self.label4.grid(row=7, column=0, padx=5, pady=1, sticky="w")

        self.label5 = tk.Label(root, text = "CHOOSE MACHINE LEARNING", fg = "blue")
        self.label5.grid(row=10, column=0, padx=5, pady=1, sticky="w")

        self.label6 = tk.Label(root, text = "CHOOSE DEEP LEARNING", fg = "red")
        self.label6.grid(row=12, column=0, padx=5, pady=1, sticky="w")

    def add_canvas(self, root):
        #Adds canvas1 widget to root to display results
        self.figure1 = Figure(figsize=(6.2, 7), dpi=100)
        self.figure1.patch.set_facecolor("lightgray")
        self.canvas1 = FigureCanvasTkAgg(self.figure1, master=root)
        self.canvas1.get_tk_widget().grid(row=0, column=1, columnspan=1, rowspan=25, padx=5, pady=5, sticky="n")

        #Adds canvas2 widget to root to display results
        self.figure2 = Figure(figsize=(6.2, 7), dpi=100)
        self.figure2.patch.set_facecolor("lightgray")
        self.canvas2 = FigureCanvasTkAgg(self.figure2, master=root)
        self.canvas2.get_tk_widget().grid(row=0, column=2, columnspan=1, rowspan=25, padx=5, pady=5, sticky="n")

    def add_listboxes(self, root):
        #Adds list widget
        self.listbox = tk.Listbox(root, selectmode=tk.SINGLE, width=35)
        self.listbox.grid(row=2, column=0, sticky='n', padx=5, pady=1)

        # Inserts items into the list widget
        items = ["Marital Status", "Education", "Country", 
                 "Age Group", "Education with Response 0", "Education with Response 1",
                 "Country with Response 0", "Country with Response 1", 
                 "Customer Age", "Income", "Amount of Wines",
                 "Education versus Response", "Age Group versus Response",
                 "Marital Status versus Response", "Country versus Response",
                 "Number of Dependants versus Response",
                 "Country versus Customer Age Per Education",
                 "Num_TotalPurchases versus Education Per Marital Status"]
        for item in items:
            self.listbox.insert(tk.END, item)

        self.listbox.config(height=len(items)) 

    def add_comboboxes(self, root):
        # Create ComboBoxes
        self.combo1 = ttk.Combobox(root, width=32)
        self.combo1["values"] = ["Categorized Income versus Response", 
            "Categorized Total Purchase versus Categorized Income",
            "Categorized Recency versus Categorized Total Purchase",
            "Categorized Customer Month versus Categorized Customer Age",
            "Categorized Amount of Gold Products versus Categorized Income",
            "Categorized Amount of Fish Products versus Categorized Total AmountSpent",
            "Categorized Amount of Meat Products versus Categorized Recency",
            "Distribution of Numerical Columns"]
        self.combo1.grid(row=4, column=0, padx=5, pady=1, sticky="n")

        self.combo2 = ttk.Combobox(root, width=32)
        self.combo2["values"] = ["Correlation Matrix", "RF Features Importance",
            "ET Features Importance", "RFE Features Importance"]
        self.combo2.grid(row=6, column=0, padx=5, pady=1, sticky="n")

        self.combo3 = ttk.Combobox(root, width=32)
        self.combo3["values"] = ["Linear Regression", "RF Regression",
            "Decision Trees Regression", "KNN Regression",
            "AdaBoost Regression", "Gradient Boosting Regression",
            "XGB Regression", "LGB Regression", "CatBoost Regression",
            "SVR Regression", "Lasso Regression", "Ridge Regression"]
        self.combo3.grid(row=8, column=0, padx=5, pady=1, sticky="n")

        self.combo4 = ttk.Combobox(root, width=32)
        self.combo4["values"] = ["Logistic Regression", "Random Forest",
            "Decision Trees", "K-Nearest Neighbors",
            "AdaBoost", "Gradient Boosting",
            "Extreme Gradient Boosting", "Light Gradient Boosting", 
            "Multi-Layer Perceptron", "Support Vector Classifier"]
        self.combo4.grid(row=11, column=0, padx=5, pady=1, sticky="n")

        self.combo5 = ttk.Combobox(root, width=32)
        self.combo5["values"] = ["LSTM", "Convolutional NN", "Recurrent NN", "Feed-Forward NN", "Artifical NN"]
        self.combo5.grid(row=13, column=0, padx=5, pady=1, sticky="n")


#helper_plot.py
from tkinter import *
import seaborn as sns
import numpy as np 
from pandastable import Table
from process_data import Process_Data
from machine_learning import Machine_Learning
from sklearn.metrics import confusion_matrix, roc_curve, accuracy_score
from sklearn.model_selection import learning_curve

class Helper_Plot:
    def __init__(self):
        self.obj_data = Process_Data()
        self.obj_ml = Machine_Learning()

    def shows_table(self, root, df, width, height, title):
       frame = Toplevel(root) #new window
       self.table = Table(frame, dataframe=df, showtoolbar=True, showstatusbar=True)
       
       # Sets dimension of Toplevel
       frame.geometry(f"{width}x{height}")
       frame.title(title)
       self.table.show()

    # Defines function to create pie chart and bar plot as subplots   
    def plot_piechart(self, df, var, figure, canvas, title=''):
        figure.clear()

        # Pie Chart (top subplot)
        plot1 = figure.add_subplot(2,1,1)        
        label_list = list(df[var].value_counts().index)
        colors = sns.color_palette("deep", len(label_list))  
        _, _, autopcts = plot1.pie(df[var].value_counts(), autopct="%1.1f%%", colors=colors,
            startangle=30, labels=label_list,
            wedgeprops={"linewidth": 2, "edgecolor": "white"},  # Add white edge
            shadow=True, textprops={'fontsize': 7})
        plot1.set_title("Distribution of " + var + " variable " + title, fontsize=10)

        # Bar Plot (bottom subplot)
        plot2 = figure.add_subplot(2,1,2)
        ax = df[var].value_counts().plot(kind="barh", color=colors, alpha=0.8, ax = plot2) 
        for i, j in enumerate(df[var].value_counts().values):
            ax.text(.7, i, j, weight="bold", fontsize=7)

        plot2.set_title("Count of " + var + " cases " + title, fontsize=10)

        figure.tight_layout()
        canvas.draw()

    def another_versus_response(self, df, feat, num_bins, figure, canvas):
        figure.clear()
        plot1 = figure.add_subplot(2,1,1)

        colors = sns.color_palette("Set2")
        df[df['Response'] == 0][feat].plot(ax=plot1, kind='hist', bins=num_bins, edgecolor='black', color=colors[0])
        plot1.set_title('Not Responsive', fontsize=15)
        plot1.set_xlabel(feat, fontsize=10)
        plot1.set_ylabel('Count', fontsize=10)
        data1 = []
        for p in plot1.patches:
            x = p.get_x() + p.get_width() / 2.
            y = p.get_height()
            plot1.annotate(format(y, '.0f'), (x, y), ha='center',
                     va='center', xytext=(0, 10),
                     weight="bold", fontsize=7, textcoords='offset points')
            data1.append([x, y])

        plot2 = figure.add_subplot(2,1,2)
        df[df['Response'] == 1][feat].plot(ax=plot2, kind='hist', bins=num_bins, edgecolor='black', color=colors[1])
        plot2.set_title('Responsive', fontsize=15)
        plot2.set_xlabel(feat, fontsize=10)
        plot2.set_ylabel('Count', fontsize=10)
        data2 = []
        for p in plot2.patches:
            x = p.get_x() + p.get_width() / 2.
            y = p.get_height()
            plot2.annotate(format(y, '.0f'), (x, y), ha='center',
                     va='center', xytext=(0, 10),
                     weight="bold", fontsize=7, textcoords='offset points')
            data2.append([x, y])

        figure.tight_layout()
        canvas.draw()

    #Puts label inside stacked bar
    def put_label_stacked_bar(self, ax,fontsize):
        #patches is everything inside of the chart
        for rect in ax.patches:
            # Find where everything is located
            height = rect.get_height()
            width = rect.get_width()
            x = rect.get_x()
            y = rect.get_y()
    
            # The height of the bar is the data value and can be used as the label
            label_text = f'{height:.0f}'  
    
            # ax.text(x, y, text)
            label_x = x + width / 2
            label_y = y + height / 2

            # plots only when height is greater than specified value
            if height > 0:
                ax.text(label_x, label_y, label_text, \
                    ha='center', va='center', \
                    weight = "bold",fontsize=fontsize)
    
    #Plots one variable against another variable
    def dist_one_vs_another_plot(self, df, cat1, cat2, figure, canvas, title):
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)

        group_by_stat = df.groupby([cat1, cat2]).size()
        colors = sns.color_palette("Set2", len(df[cat1].unique()))
        stacked_data = group_by_stat.unstack()
        stacked_data.plot(kind='bar', stacked=True, ax=plot1, grid=True, color=colors)
        plot1.set_title(title, fontsize=12)
        plot1.set_ylabel('Number of Cases', fontsize=10)
        plot1.set_xlabel(cat1, fontsize=10)
        self.put_label_stacked_bar(plot1,7)
        # Set font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=8)
        plot1.tick_params(axis='both', which='minor', labelsize=8)    
        plot1.legend(fontsize=8)    
        figure.tight_layout()
        canvas.draw()

    def box_plot(self, df, x, y, hue, figure, canvas, title):
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)

        #Creates boxplot of Num_TotalPurchases versus Num_Dependants
        sns.boxplot(data = df, x = x, y = y, hue = hue, ax=plot1)
        plot1.set_title(title, fontsize=14)
        plot1.set_xlabel(x, fontsize=10)
        plot1.set_ylabel(y, fontsize=10)
        figure.tight_layout()
        canvas.draw()

    def choose_plot(self, df1, df2, chosen, figure1, canvas1, figure2, canvas2):
        print(chosen)
        if chosen == "Marital Status":
            self.plot_piechart(df2, "Marital_Status", figure1, canvas1)

        elif chosen == "Education":
            self.plot_piechart(df2, "Education", figure2, canvas2)

        elif chosen == "Country":
            self.plot_piechart(df2, "Country", figure1, canvas1)            

        elif chosen == "Age Group":
            self.plot_piechart(df2, "AgeGroup", figure2, canvas2)              

        elif chosen == "Education with Response 0":
            self.plot_piechart(df2[df2.Response==0], "Education", figure1, canvas1, " with Response 0")

        elif chosen == "Education with Response 1":
            self.plot_piechart(df2[df2.Response==1], "Education", figure2, canvas2, " with Response 1")

        elif chosen == "Country with Response 0":
            self.plot_piechart(df2[df2.Response==0], "Country", figure1, canvas1, " with Response 0")

        elif chosen == "Country with Response 1":
            self.plot_piechart(df2[df2.Response==1], "Country", figure2, canvas2, " with Response 1")       

        elif chosen == "Income":
            self.another_versus_response(df1, "Income", 32, figure1, canvas1) 

        elif chosen == "Amount of Wines":
            self.another_versus_response(df1, "MntWines", 32, figure2, canvas2) 

        elif chosen == "Customer Age":
            self.another_versus_response(df1, "Customer_Age", 32, figure1, canvas1) 

        elif chosen == "Education versus Response":
            self.dist_one_vs_another_plot(df2, "Education", "Response", figure2, canvas2, chosen) 

        elif chosen == "Age Group versus Response":
            self.dist_one_vs_another_plot(df2, "AgeGroup", "Response", figure1, canvas1, chosen)

        elif chosen == "Marital Status versus Response":
            self.dist_one_vs_another_plot(df2, "Marital_Status", "Response", figure2, canvas2, chosen)            

        elif chosen == "Country versus Response":
            self.dist_one_vs_another_plot(df2, "Country", "Response", figure1, canvas1, chosen)              

        elif chosen == "Number of Dependants versus Response":
            self.dist_one_vs_another_plot(df2, "Num_Dependants", "Response", figure2, canvas2, chosen) 

        elif chosen == "Country versus Customer Age Per Education":
            self.box_plot(df1, "Country", "Customer_Age", "Education", figure1, canvas1, chosen)

        elif chosen == "Num_TotalPurchases versus Education Per Marital Status":
            self.box_plot(df1, "Education", "Num_TotalPurchases", "Marital_Status", figure2, canvas2, chosen)

    def choose_category(self, df, chosen, figure1, canvas1, figure2, canvas2):  
        if chosen == "Categorized Income versus Response":
            self.dist_one_vs_another_plot(df, "Income", "Response", figure1, canvas1, chosen)       

        if chosen == "Categorized Total Purchase versus Categorized Income":
            self.dist_one_vs_another_plot(df, "Num_TotalPurchases", "Income", figure2, canvas2, chosen)      

        if chosen == "Categorized Recency versus Categorized Total Purchase":
            self.dist_one_vs_another_plot(df, "Recency", "Num_TotalPurchases", figure1, canvas1, chosen)    

        if chosen == "Categorized Customer Month versus Categorized Customer Age":
            self.dist_one_vs_another_plot(df, "Dt_Customer_Month", "Customer_Age", figure2, canvas2, chosen) 

        if chosen == "Categorized Amount of Gold Products versus Categorized Income":
            self.dist_one_vs_another_plot(df, "MntGoldProds", "Income", figure1, canvas1, chosen) 

        if chosen == "Categorized Amount of Fish Products versus Categorized Total AmountSpent":
            self.dist_one_vs_another_plot(df, "MntFishProducts", "TotalAmount_Spent", figure2, canvas2, chosen) 

        if chosen == "Categorized Amount of Meat Products versus Categorized Recency":
            self.dist_one_vs_another_plot(df, "MntMeatProducts", "Recency", figure1, canvas1, chosen) 

    def plot_corr_mat(self, df, figure, canvas):
        figure.clear()    
        plot1 = figure.add_subplot(1,1,1)  
        categorical_columns = df.select_dtypes(include=['object', 'category']).columns 
        df_removed = df.drop(columns=categorical_columns) 
        corrdata = df_removed.corr()

        annot_kws = {"size": 5}
        sns.heatmap(corrdata, ax = plot1, lw=1, annot=True, cmap="Reds", annot_kws=annot_kws)
        plot1.set_title('Correlation Matrix', fontweight ="bold",fontsize=14)

        # Set font for x and y labels
        plot1.set_xlabel('Features', fontweight="bold", fontsize=12)
        plot1.set_ylabel('Features', fontweight="bold", fontsize=12)

        # Set font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=5)
        plot1.tick_params(axis='both', which='minor', labelsize=5)

        figure.tight_layout()
        canvas.draw()

    def plot_rf_importance(self, X, y, figure, canvas):
        result_rf = self.obj_data.feat_importance_rf(X, y)
        figure.clear()    
        plot1 = figure.add_subplot(1,1,1)  
        sns.set_color_codes("pastel")
        ax=sns.barplot(x = 'Values',y = 'Features', data=result_rf, color="Blue", ax=plot1)
        plot1.set_title('Random Forest Features Importance', fontweight ="bold",fontsize=14)

        plot1.set_xlabel('Features Importance',  fontsize=10) 
        plot1.set_ylabel('Feature Labels',  fontsize=10) 
        # Set font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=5)
        plot1.tick_params(axis='both', which='minor', labelsize=5)
        figure.tight_layout()
        canvas.draw()

    def plot_et_importance(self, X, y, figure, canvas):
        result_rf = self.obj_data.feat_importance_et(X, y)
        figure.clear()    
        plot1 = figure.add_subplot(1,1,1)  
        sns.set_color_codes("pastel")
        ax=sns.barplot(x = 'Values',y = 'Features', data=result_rf, color="Red", ax=plot1)
        plot1.set_title('Extra Trees Features Importance', fontweight ="bold",fontsize=14)

        plot1.set_xlabel('Features Importance',  fontsize=10) 
        plot1.set_ylabel('Feature Labels',  fontsize=10) 
        # Set font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=5)
        plot1.tick_params(axis='both', which='minor', labelsize=5)
        figure.tight_layout()
        canvas.draw()        

    def plot_rfe_importance(self, X, y, figure, canvas):
        result_lg = self.obj_data.feat_importance_rfe(X, y)
        figure.clear()    
        plot1 = figure.add_subplot(1,1,1)  
        sns.set_color_codes("pastel")
        ax=sns.barplot(x = 'Ranking',y = 'Features', data=result_lg, color="orange", ax=plot1)
        plot1.set_title('RFE Features Importance', fontweight ="bold",fontsize=14)

        plot1.set_xlabel('Features Importance',  fontsize=10) 
        plot1.set_ylabel('Feature Labels',  fontsize=10) 
        # Set font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=5)
        plot1.tick_params(axis='both', which='minor', labelsize=5)
        figure.tight_layout()
        canvas.draw()   

    def choose_plot_more(self, df, chosen, X, y, figure1, canvas1, figure2, canvas2):  
        if chosen == "Correlation Matrix":
            self.plot_corr_mat(df, figure1, canvas1)

        if chosen == "RF Features Importance":
            self.plot_rf_importance(X, y, figure2, canvas2)

        if chosen == "ET Features Importance":
            self.plot_et_importance(X, y, figure1, canvas1)

        if chosen == "RFE Features Importance":
            self.plot_rfe_importance(X, y, figure1, canvas1)

    def plot_cm_roc(self, model, X_test, y_test, ypred, name, figure, canvas):
        figure.clear()    
        
        #Plots confusion matrix
        plot1 = figure.add_subplot(2,1,1)  
        cm = confusion_matrix(y_test, ypred, )
        sns.heatmap(cm, annot=True, linewidth=3, linecolor='red', fmt='g', cmap="Greens", annot_kws={"size": 14}, ax=plot1)
        plot1.set_title('Confusion Matrix' + " of " + name, fontsize=12)
        plot1.set_xlabel('Y predict', fontsize=10)
        plot1.set_ylabel('Y test', fontsize=10)
        #Class 0 = Not Responsive, class 1 = Responsive (matches confusion_matrix label order)
        plot1.xaxis.set_ticklabels(['Not Responsive', 'Responsive'], fontsize=10)
        plot1.yaxis.set_ticklabels(['Not Responsive', 'Responsive'], fontsize=10)

        #Plots ROC
        plot2 = figure.add_subplot(2,1,2)
        Y_pred_prob = model.predict_proba(X_test)
        Y_pred_prob = Y_pred_prob[:, 1]

        fpr, tpr, thresholds = roc_curve(y_test, Y_pred_prob)
        plot2.plot([0,1],[0,1], color='navy', linestyle='--', linewidth=3)
        plot2.plot(fpr,tpr, color='red', linewidth=3)
        plot2.set_xlabel('False Positive Rate', fontsize=10)
        plot2.set_ylabel('True Positive Rate', fontsize=10)
        plot2.set_title('ROC Curve of ' + name , fontsize=12)
        plot2.grid(True)

        figure.tight_layout()
        canvas.draw()   

    #Plots true values versus predicted values diagram and learning curve
    def plot_real_pred_val_learning_curve(self, model, X_train, y_train, X_test, y_test, ypred, name, figure, canvas):
        figure.clear()    
        
        #Plots true values versus predicted values diagram
        plot1 = figure.add_subplot(2,1,1)  
        acc=accuracy_score(y_test, ypred)
        plot1.scatter(range(len(ypred)),ypred,color="blue", lw=3,label="Predicted")
        plot1.scatter(range(len(y_test)), 
            y_test,color="red",label="Actual")
        plot1.set_title("Predicted Values vs True Values of " + name, fontsize=12)
        plot1.set_xlabel("Accuracy: " + str(round((acc*100),3)) + "%")
        plot1.legend()
        plot1.grid(True, alpha=0.75, lw=1, ls='-.')

        #Plots learning curve
        train_sizes=np.linspace(.1, 1.0, 5)
        train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(model, 
            X_train, y_train, cv=None, n_jobs=None, train_sizes=train_sizes, return_times=True)
        train_scores_mean = np.mean(train_scores, axis=1)
        train_scores_std = np.std(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)
        test_scores_std = np.std(test_scores, axis=1)

        plot2 = figure.add_subplot(2,1,2)
        plot2.fill_between(train_sizes, train_scores_mean - train_scores_std,
            train_scores_mean + train_scores_std, alpha=0.1, color="r")
        plot2.fill_between(train_sizes, test_scores_mean - test_scores_std,
            test_scores_mean + test_scores_std, alpha=0.1, color="g")
        plot2.plot(train_sizes, train_scores_mean, 'o-', 
            color="r", label="Training score")
        plot2.plot(train_sizes, test_scores_mean, 'o-', 
            color="g", label="Cross-validation score")
        plot2.legend(loc="best")
        plot2.set_title("Learning curve of " + name, fontsize=12)
        plot2.set_xlabel("fit_times")
        plot2.set_ylabel("Score")
        plot2.grid(True, alpha=0.75, lw=1, ls='-.')

        figure.tight_layout()
        canvas.draw()  

    def choose_plot_ML(self, root, chosen, X_train, X_test, y_train, y_test, figure1, canvas1, figure2, canvas2):  
        if chosen == "Logistic Regression":
            best_model, y_pred = self.obj_ml.implement_LR(chosen, X_train, X_test, y_train, y_test)

            #Plots confusion matrix and ROC
            self.plot_cm_roc(best_model, X_test, y_test, y_pred, chosen, figure1, canvas1)

            #Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train, 
                X_test, y_test, y_pred, chosen, figure2, canvas2)

            #Shows table of result
            df_lr = self.obj_data.read_dataset("results_LR.csv")
            self.shows_table(root, df_lr, 450, 750, "Y_test and Y_pred of Logistic Regression")

        if chosen == "Random Forest":
            best_model, y_pred = self.obj_ml.implement_RF(chosen, X_train, X_test, y_train, y_test)

            #Plots confusion matrix and ROC
            self.plot_cm_roc(best_model, X_test, y_test, y_pred, chosen, figure1, canvas1)

            #Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train, 
                X_test, y_test, y_pred, chosen, figure2, canvas2)
            
            #Shows table of result
            df_lr = self.obj_data.read_dataset("results_RF.csv")
            self.shows_table(root, df_lr, 450, 750, "Y_test and Y_pred of Random Forest")   

        if chosen == "K-Nearest Neighbors":
            best_model, y_pred = self.obj_ml.implement_KNN(chosen, X_train, X_test, y_train, y_test)

            #Plots confusion matrix and ROC
            self.plot_cm_roc(best_model, X_test, y_test, y_pred, chosen, figure1, canvas1)

            #Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train, 
                X_test, y_test, y_pred, chosen, figure2, canvas2)
            
            #Shows table of result
            df_lr = self.obj_data.read_dataset("results_KNN.csv")
            self.shows_table(root, df_lr, 450, 750, "Y_test and Y_pred of KNN")              

        if chosen == "Decision Trees":
            best_model, y_pred = self.obj_ml.implement_DT(chosen, X_train, X_test, y_train, y_test)

            #Plots confusion matrix and ROC
            self.plot_cm_roc(best_model, X_test, y_test, y_pred, chosen, figure1, canvas1)

            #Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train, 
                X_test, y_test, y_pred, chosen, figure2, canvas2)
            
            #Shows table of result
            df_lr = self.obj_data.read_dataset("results_DT.csv")
            self.shows_table(root, df_lr, 450, 750, "Y_test and Y_pred of Decision Trees")  

        if chosen == "Gradient Boosting":
            best_model, y_pred = self.obj_ml.implement_GB(chosen, X_train, X_test, y_train, y_test)

            #Plots confusion matrix and ROC
            self.plot_cm_roc(best_model, X_test, y_test, y_pred, chosen, figure1, canvas1)

            #Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train, 
                X_test, y_test, y_pred, chosen, figure2, canvas2)
            
            #Shows table of result
            df_lr = self.obj_data.read_dataset("results_GB.csv")
            self.shows_table(root, df_lr, 450, 750, "Y_test and Y_pred of Gradient Boosting") 

        if chosen == "Extreme Gradient Boosting":
            best_model, y_pred = self.obj_ml.implement_XGB(chosen, X_train, X_test, y_train, y_test)

            #Plots confusion matrix and ROC
            self.plot_cm_roc(best_model, X_test, y_test, y_pred, chosen, figure1, canvas1)

            #Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train, 
                X_test, y_test, y_pred, chosen, figure2, canvas2)
            
            #Shows table of result
            df_lr = self.obj_data.read_dataset("results_XGB.csv")
            self.shows_table(root, df_lr, 450, 750, "Y_test and Y_pred of Extreme Gradient Boosting") 

        if chosen == "Multi-Layer Perceptron":
            best_model, y_pred = self.obj_ml.implement_MLP(chosen, X_train, X_test, y_train, y_test)

            #Plots confusion matrix and ROC
            self.plot_cm_roc(best_model, X_test, y_test, y_pred, chosen, figure1, canvas1)

            #Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train, 
                X_test, y_test, y_pred, chosen, figure2, canvas2)
            
            #Shows table of result
            df_lr = self.obj_data.read_dataset("results_MLP.csv")
            self.shows_table(root, df_lr, 450, 750, "Y_test and Y_pred of Multi-Layer Perceptron") 

        if chosen == "Support Vector Classifier":
            best_model, y_pred = self.obj_ml.implement_SVC(chosen, X_train, X_test, y_train, y_test)

            #Plots confusion matrix and ROC
            self.plot_cm_roc(best_model, X_test, y_test, y_pred, chosen, figure1, canvas1)

            #Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train, 
                X_test, y_test, y_pred, chosen, figure2, canvas2)
            
            #Shows table of result
            df_lr = self.obj_data.read_dataset("results_SVC.csv")
            self.shows_table(root, df_lr, 450, 750, "Y_test and Y_pred of Support Vector Classifier")

        if chosen == "AdaBoost":
            best_model, y_pred = self.obj_ml.implement_ADA(chosen, X_train, X_test, y_train, y_test)

            #Plots confusion matrix and ROC
            self.plot_cm_roc(best_model, X_test, y_test, y_pred, chosen, figure1, canvas1)

            #Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train, 
                X_test, y_test, y_pred, chosen, figure2, canvas2)
            
            #Shows table of result
            df_lr = self.obj_data.read_dataset("results_ADA.csv")
            self.shows_table(root, df_lr, 450, 750, "Y_test and Y_pred of AdaBoost Classifier")


    def plot_accuracy(self, history, name, figure, canvas):
        acc = history['accuracy']
        val_acc = history['val_accuracy']
        epochs = range(1, len(acc) + 1)

        #Cleans and Creates figure
        figure.clear()    
        plot1 = figure.add_subplot(1,1,1)  
    
        # Plots training accuracy in red and validation accuracy in blue dashed line
        plot1.plot(epochs, acc, 'r', label='Training accuracy', lw=3)
        plot1.plot(epochs, val_acc, 'b--', label='Validation accuracy', lw=3)
    
        # Set plot title and legend
        plot1.set_title('Training and validation accuracy of ' + name, fontsize=12)
        plot1.legend(fontsize=8)
    
        # Set x-axis label and tick label font size
        plot1.set_xlabel("Epoch", fontsize=10)
        plot1.tick_params(labelsize=8)
    
        # Set background color
        plot1.set_facecolor('black')
    
        figure.tight_layout()
        canvas.draw() 

    def plot_loss(self, history, name, figure, canvas):
        loss = history['loss']
        val_loss = history['val_loss']
        epochs = range(1, len(loss) + 1)

        #Cleans and Creates figure
        figure.clear()    
        plot1 = figure.add_subplot(1,1,1) 
    
        # Plot training loss in red and validation loss in blue dashed line
        plot1.plot(epochs, loss, 'r', label='Training loss', lw=3)
        plot1.plot(epochs, val_loss, 'b--', label='Validation loss', lw=3)
    
        # Set plot title and legend
        plot1.set_title('Training and validation loss of ' + name, fontsize=12)
        plot1.legend(fontsize=8)
    
        # Set x-axis label and tick label font size
        plot1.set_xlabel("Epoch", fontsize=10)
        plot1.tick_params(labelsize=8)
    
        # Set background color
        plot1.set_facecolor('lightgray')
    
        figure.tight_layout()
        canvas.draw() 

#machine_learning.py
import numpy as np 
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score
from sklearn.metrics import classification_report, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
import os
import pandas as pd 
from process_data import Process_Data

class Machine_Learning:
    def __init__(self):
        self.obj_data = Process_Data()

    def oversampling_splitting(self, X, y):
        sm = SMOTE(random_state=42)
        X,y = sm.fit_resample(X, y.ravel())

        #Splits the data into training and testing
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2021, stratify=y)   

        #Use Standard Scaler
        scaler = StandardScaler()
        X_train_stand = scaler.fit_transform(X_train)
        X_test_stand = scaler.transform(X_test)    
    
        #Saves into pkl files
        joblib.dump(X_train_stand, 'X_train.pkl')
        joblib.dump(X_test_stand, 'X_test.pkl')
        joblib.dump(y_train, 'y_train.pkl')
        joblib.dump(y_test, 'y_test.pkl')  

    def load_files(self):
        X_train = joblib.load('X_train.pkl')
        X_test = joblib.load('X_test.pkl')
        y_train = joblib.load('y_train.pkl')
        y_test = joblib.load('y_test.pkl')
    
        return X_train, X_test, y_train, y_test

    def choose_feats_boundary(self, X, y):
        file_path = os.getcwd()
        X_train_feat_path = os.path.join(file_path, 'X_train_feat.pkl')
        X_test_feat_path = os.path.join(file_path, 'X_test_feat.pkl')
        y_train_feat_path = os.path.join(file_path, 'y_train_feat.pkl')
        y_test_feat_path = os.path.join(file_path, 'y_test_feat.pkl')

        if os.path.exists(X_train_feat_path):
            X_train_feat = joblib.load(X_train_feat_path)
            X_test_feat = joblib.load(X_test_feat_path)
            y_train_feat = joblib.load(y_train_feat_path)
            y_test_feat = joblib.load(y_test_feat_path)
        else:
            # Make sure feat_boundary contains valid column indices from your X array
            feat_boundary = [1, 2]  # actual indices
            if all(idx < X.shape[1] for idx in feat_boundary):
                X_feature = X[:, feat_boundary]
                X_train_feat, X_test_feat, y_train_feat, y_test_feat = train_test_split(X_feature, y, 
                    test_size=0.2, random_state=2021, stratify=y)  

                # Saves into pkl files
                joblib.dump(X_train_feat, X_train_feat_path)
                joblib.dump(X_test_feat, X_test_feat_path)
                joblib.dump(y_train_feat, y_train_feat_path)
                joblib.dump(y_test_feat, y_test_feat_path) 
            else:
                raise ValueError("Indices in feat_boundary exceed the number of columns in X array")

        return X_train_feat, X_test_feat, y_train_feat, y_test_feat

    def train_model(self, model, X, y):
        model.fit(X, y)
        return model

    def predict_model(self, model, X, proba=False):
        if not proba:
            y_pred = model.predict(X)
        else:
            y_pred_proba = model.predict_proba(X)
            y_pred = np.argmax(y_pred_proba, axis=1)

        return y_pred

    def run_model(self, name, model, X_train, X_test, y_train, y_test, proba=False):   
        y_pred = self.predict_model(model, X_test, proba)
    
        accuracy = accuracy_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred, average='weighted')
        precision = precision_score(y_test, y_pred, average='weighted')
        f1 = f1_score(y_test, y_pred, average='weighted')
    
        print(name)
        print('accuracy: ', accuracy)
        print('recall: ', recall)
        print('precision: ', precision)
        print('f1: ', f1)
        print(classification_report(y_test, y_pred)) 

        return y_pred

    def logistic_regression(self, name, X_train, X_test, y_train, y_test):
        #Logistic Regression Classifier
        # Define the parameter grid for the grid search
        param_grid = {
            'C': [0.01, 0.1, 1, 10],
            'penalty': ['none', 'l2'],
            'solver': ['newton-cg', 'lbfgs', 'liblinear', 'saga'],
        }

        # Initialize the Logistic Regression model
        logreg = LogisticRegression(max_iter=5000, random_state=2021)
    
        # Create GridSearchCV with the Logistic Regression model and the parameter grid
        grid_search = GridSearchCV(logreg, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
    
        # Train and perform grid search
        grid_search.fit(X_train, y_train)
    
        # Get the best Logistic Regression model from the grid search
        best_model = grid_search.best_estimator_

        #Saves model
        joblib.dump(best_model, 'LR_Model.pkl')    
    
        # Print the best hyperparameters found
        print(f"Best Hyperparameters for LR:")
        print(grid_search.best_params_)        

        return best_model

    def implement_LR(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/LR_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('LR_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True) 
        else:
            model = self.logistic_regression(chosen, X_train, X_test, y_train, y_test)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_LR.csv")

        print("Training Logistic Regression done...")
        return model, y_pred

    def random_forest(self, name, X_train, X_test, y_train, y_test):
        #Random Forest Classifier    
        # Define the parameter grid for the grid search
        param_grid = {
            'n_estimators': [100, 200, 300],
            'max_depth': [10, 20, 30, 40, 50],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }

        # Initialize the RandomForestClassifier model
        rf = RandomForestClassifier(random_state=2021)
    
        # Create GridSearchCV with the RandomForestClassifier model and the parameter grid
        grid_search = GridSearchCV(rf, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
    
        # Train and perform grid search
        grid_search.fit(X_train, y_train)
    
        # Get the best RandomForestClassifier model from the grid search
        best_model = grid_search.best_estimator_
    
        #Saves model
        joblib.dump(best_model, 'RF_Model.pkl')    
    
        # Print the best hyperparameters found
        print(f"Best Hyperparameters for RF:")
        print(grid_search.best_params_)        

        return best_model

    def implement_RF(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/RF_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('RF_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True) 
        else:
            model = self.random_forest(chosen, X_train, X_test, y_train, y_test)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_RF.csv")

        print("Training Random Forest done...")
        return model, y_pred

    def knearest_neighbors(self, name, X_train, X_test, y_train, y_test):
        #KNN Classifier
        # Define the parameter grid for the grid search
        param_grid = {
            'n_neighbors': list(range(2, 10))
        }

        # Initialize the KNN Classifier
        knn = KNeighborsClassifier()
    
        # Create GridSearchCV with the KNN model and the parameter grid
        grid_search = GridSearchCV(knn, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
    
        # Train and perform grid search
        grid_search.fit(X_train, y_train)
    
        # Get the best KNN model from the grid search
        best_model = grid_search.best_estimator_
    
        #Saves model
        joblib.dump(best_model, 'KNN_Model.pkl')    
    
        # Print the best hyperparameters found
        print(f"Best Hyperparameters for KNN:")
        print(grid_search.best_params_)        

        return best_model

    def implement_KNN(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/KNN_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('KNN_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True) 
        else:
            model = self.knearest_neighbors(chosen, X_train, X_test, y_train, y_test)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_KNN.csv")

        print("Training KNN done...")
        return model, y_pred

    def decision_trees(self, name, X_train, X_test, y_train, y_test):
        # Initialize the DecisionTreeClassifier model
        dt_clf = DecisionTreeClassifier(random_state=2021)
    
        # Define the parameter grid for the grid search
        param_grid = {
            'max_depth': np.arange(1, 51, 1),
            'criterion': ['gini', 'entropy'],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4],
        }
    
        # Create GridSearchCV with the DecisionTreeClassifier model and the parameter grid
        grid_search = GridSearchCV(dt_clf, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
    
        # Train and perform grid search
        grid_search.fit(X_train, y_train)
    
        # Get the best DecisionTreeClassifier model from the grid search
        best_model = grid_search.best_estimator_
    
        #Saves model
        joblib.dump(best_model, 'DT_Model.pkl')    
    
        # Print the best hyperparameters found
        print(f"Best Hyperparameters for DT:")
        print(grid_search.best_params_)        

        return best_model

    def implement_DT(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/DT_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('DT_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True) 
        else:
            model = self.decision_trees(chosen, X_train, X_test, y_train, y_test)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_DT.csv")

        print("Training Decision Trees done...")
        return model, y_pred

    def gradient_boosting(self, name, X_train, X_test, y_train, y_test):
        #Gradient Boosting Classifier      
        # Initialize the GradientBoostingClassifier model
        gbt = GradientBoostingClassifier(random_state=2021)

        # Define the parameter grid for the grid search
        param_grid = {
            'n_estimators': [100, 200, 300],
            'max_depth': [10, 20, 30],
            'subsample': [0.6, 0.8, 1.0],
            'max_features': [0.2, 0.4, 0.6, 0.8, 1.0],
        }
    
        # Create GridSearchCV with the GradientBoostingClassifier model and the parameter grid
        grid_search = GridSearchCV(gbt, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best GradientBoostingClassifier model from the grid search
        best_model = grid_search.best_estimator_
    
        #Saves model
        joblib.dump(best_model, 'GB_Model.pkl')    
    
        # Print the best hyperparameters found
        print(f"Best Hyperparameters for GB:")
        print(grid_search.best_params_)        

        return best_model

    def implement_GB(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/GB_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('GB_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True) 
        else:
            model = self.gradient_boosting(chosen, X_train, X_test, y_train, y_test)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_GB.csv")

        print("Training Gradient Boosting done...")
        return model, y_pred

    # def light_gradient_boosting(self, name, X_train, X_test, y_train, y_test):
    #     #LGBM Classifier
    #     # Define the parameter grid for grid search
    #     param_grid = {
    #         'max_depth': [10, 20, 30],
    #         'n_estimators': [100, 200, 300],
    #         'subsample': [0.6, 0.8, 1.0],
    #         'random_state': [2021]
    #     }

    #     # Initialize the LightGBM classifier
    #     lgbm = LGBMClassifier()
    
    #     # Create GridSearchCV with the LightGBM classifier and the parameter grid
    #     grid_search = GridSearchCV(lgbm, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

    #     # Train and perform grid search
    #     grid_search.fit(X_train, y_train)

    #     # Get the best LightGBM classifier model from the grid search
    #     best_model = grid_search.best_estimator_
    
    #     #Saves model
    #     joblib.dump(best_model, 'LGB_Model.pkl')    
    
    #     # Print the best hyperparameters found
    #     print(f"Best Hyperparameters for LGB:")
    #     print(grid_search.best_params_)        

    #     return best_model

    # def implement_LGB(self, chosen, X_train, X_test, y_train, y_test):
    #     file_path = os.getcwd()+"/LGB_Model.pkl"
    #     if os.path.exists(file_path):
    #         model = joblib.load('LGB_Model.pkl')
    #         y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True) 
    #     else:
    #         model = self.light_gradient_boosting(chosen, X_train, X_test, y_train, y_test)
    #         y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

    #     #Saves result into excel file
    #     self.save_result(y_test, y_pred, "results_LGB.csv")

    #     print("Training Light Gradient Boosting done...")
    #     return model, y_pred

    def extreme_gradient_boosting(self, name, X_train, X_test, y_train, y_test):
        # Define the parameter grid for the grid search
        param_grid = {
            'n_estimators': [100, 200, 300],
            'max_depth': [10, 20, 30],
            'learning_rate': [0.01, 0.1, 0.2],
            'subsample': [0.6, 0.8, 1.0],
            'colsample_bytree': [0.6, 0.8, 1.0],
        }

        # Initialize the XGBoost classifier
        xgb = XGBClassifier(random_state=2021, use_label_encoder=False, eval_metric='mlogloss')

        # Create GridSearchCV with the XGBoost classifier and the parameter grid
        grid_search = GridSearchCV(xgb, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best XGBoost classifier model from the grid search
        best_model = grid_search.best_estimator_
    
        #Saves model
        joblib.dump(best_model, 'XGB_Model.pkl')    
    
        # Print the best hyperparameters found
        print(f"Best Hyperparameters for XGB:")
        print(grid_search.best_params_)        

        return best_model

    def implement_XGB(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/XGB_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('XGB_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True) 
        else:
            model = self.extreme_gradient_boosting(chosen, X_train, X_test, y_train, y_test)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_XGB.csv")

        print("Training Extreme Gradient Boosting done...")
        return model, y_pred
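
    # Note: implement_XGB (like the other implement_* methods) reuses a previously
    # saved XGB_Model.pkl if one exists in the working directory. Delete that file,
    # or retrain deliberately, whenever the preprocessing or the train/test split
    # changes; otherwise a stale model is evaluated against the new split.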

    def multi_layer_perceptron(self, name, X_train, X_test, y_train, y_test):
        # Define the parameter grid for the grid search
        param_grid = {
            'hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50), (100, 100)],
            'activation': ['logistic', 'relu'],
            'solver': ['adam', 'sgd'],
            'alpha': [0.0001, 0.001, 0.01],
            'learning_rate': ['constant', 'invscaling', 'adaptive'],
        }

        # Initialize the MLP Classifier
        mlp = MLPClassifier(random_state=2021)

        # Create GridSearchCV with the MLP Classifier and the parameter grid
        grid_search = GridSearchCV(mlp, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best MLP Classifier model from the grid search
        best_model = grid_search.best_estimator_
    
        #Saves model
        joblib.dump(best_model, 'MLP_Model.pkl')    
    
        # Print the best hyperparameters found
        print(f"Best Hyperparameters for MLP:")
        print(grid_search.best_params_)        

        return best_model
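
    # Note: MLPClassifier converges far more reliably when its inputs are scaled.
    # If the numerical features are not already standardized upstream, wrapping the
    # estimator in a pipeline is one option (a sketch under that assumption, not
    # taken from the project; the grid keys would then need the 'mlpclassifier__' prefix):
    #   from sklearn.pipeline import make_pipeline
    #   from sklearn.preprocessing import StandardScaler
    #   mlp = make_pipeline(StandardScaler(), MLPClassifier(random_state=2021))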

    def implement_MLP(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/MLP_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('MLP_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True) 
        else:
            model = self.multi_layer_perceptron(chosen, X_train, X_test, y_train, y_test)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_MLP.csv")

        print("Training Multi-Layer Perceptron done...")
        return model, y_pred

    def support_vector(self, name, X_train, X_test, y_train, y_test):
        #Support Vector Classifier
        # Define the parameter grid for the grid search
        param_grid = {
            'C': [0.1, 1, 10],
            'kernel': ['linear', 'poly', 'rbf'],
            'gamma': ['scale', 'auto', 0.1, 1],
        }

        # Initialize the SVC model
        model_svc = SVC(random_state=2021, probability=True)

        # Create GridSearchCV with the SVC model and the parameter grid
        grid_search = GridSearchCV(model_svc, param_grid, cv=3, scoring='accuracy', n_jobs=-1, refit=True)
    
        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best SVC model from the grid search
        best_model = grid_search.best_estimator_
    
        #Saves model
        joblib.dump(best_model, 'SVC_Model.pkl')    
    
        # Print the best hyperparameters found
        print(f"Best Hyperparameters for SVC:")
        print(grid_search.best_params_)        

        return best_model
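
    # Note: probability=True is what makes predict_proba available for the ROC/AUC
    # plots; scikit-learn obtains these probability estimates through an internal
    # cross-validation (Platt scaling), so SVC training is noticeably slower than
    # with the default probability=False.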

    def implement_SVC(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/SVC_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('SVC_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True) 
        else:
            model = self.support_vector(chosen, X_train, X_test, y_train, y_test)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_SVC.csv")

        print("Training Support Vector Classifier done...")
        return model, y_pred

    def adaboost_classifier(self, name, X_train, X_test, y_train, y_test):
        # Define the parameter grid for the grid search
        param_grid = {
            'n_estimators': [50, 100, 150],
            'learning_rate': [0.01, 0.1, 0.2],
        }

        # Initialize the AdaBoost classifier
        adaboost = AdaBoostClassifier(random_state=2021)

        # Create GridSearchCV with the AdaBoost classifier and the parameter grid
        grid_search = GridSearchCV(adaboost, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best AdaBoost Classifier model from the grid search
        best_model = grid_search.best_estimator_
    
        #Saves model
        joblib.dump(best_model, 'ADA_Model.pkl')    
    
        # Print the best hyperparameters found
        print(f"Best Hyperparameters for AdaBoost:")
        print(grid_search.best_params_)        

        return best_model

    def implement_ADA(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/ADA_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('ADA_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True) 
        else:
            model = self.adaboost_classifier(chosen, X_train, X_test, y_train, y_test)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_ADA.csv")

        print("Training AdaBoost done...")
        return model, y_pred
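
Each implement_* method above follows the same load-or-train pattern, so driving one of them only requires the current train/test split. The snippet below is a minimal, hypothetical sketch of such a call: obj_ml stands for an instance of the model class defined above, X and y are the prepared features and target, and the split parameters are assumptions rather than values taken from the original project.

#example usage (illustrative sketch, not part of the original project)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2021, stratify=y)

model, y_pred = obj_ml.implement_XGB("Extreme Gradient Boosting",
    X_train, X_test, y_train, y_test)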


#process_data.py
import os
import numpy as np 
import pandas as pd 
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

class Process_Data:
    def read_dataset(self, filename):
        #Reads dataset
        curr_path = os.getcwd()
        path = os.path.join(curr_path, filename) 
        df = pd.read_csv(path)

        return df
    
    def preprocess(self):
        df = self.read_dataset("marketing_data.csv")

        #Drops ID column
        df = df.drop("ID", axis = 1)

        #Renames column name and corrects data type
        df.rename(columns={' Income ':'Income'},inplace=True)
        df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format='%m/%d/%y')  
        df["Income"] = df["Income"].str.replace("$","").str.replace(",","") 
        df["Income"] = df["Income"].astype(float)

        #Checks null values
        print(df.isnull().sum())
        print('Total number of null values: ', df.isnull().sum().sum())

        #Imputes Income column with median values
        df['Income'] = df['Income'].fillna(df['Income'].median())
        print(f'Number of Null values in "Income" after Imputation: {df["Income"].isna().sum()}')

        #Dt_Customer was already converted to datetime above; check it, then derive date-based features
        print(f'After Transformation:\n{df["Dt_Customer"].head()}')
        df['Customer_Age'] = df['Dt_Customer'].dt.year - df['Year_Birth']

        #Creates number of children/dependents in home by adding 'Kidhome' and 'Teenhome' features
        #Creates number of Total_Purchases by adding all the purchases features
        #Creates TotalAmount_Spent by adding all the Mnt* features
        df['Dt_Customer_Month'] = df['Dt_Customer'].dt.month
        df['Dt_Customer_Year'] = df['Dt_Customer'].dt.year
        df['Num_Dependants'] = df['Kidhome'] + df['Teenhome']    

        purchase_features = [c for c in df.columns if 'Purchase' in str(c)]
        #Removes 'NumDealsPurchases' from the list above
        purchase_features.remove('NumDealsPurchases')
        df['Num_TotalPurchases'] = df[purchase_features].sum(axis = 1)

        amt_spent_features = [c for c in df.columns if 'Mnt' in str(c)]
        df['TotalAmount_Spent'] = df[amt_spent_features].sum(axis = 1)  

        #Creates a categorical feature by binning the customer's age,
        #to help understand purchasing behaviour
        print(f'Min. Customer Age: {df["Customer_Age"].min()}')
        print(f'Max. Customer Age: {df["Customer_Age"].max()}')
        df['AgeGroup'] = pd.cut(df['Customer_Age'], bins = [6, 24, 29, 40, 56, 75], 
             labels = ['Gen-Z', 'Gen-Y.1', 'Gen-Y.2', 'Gen-X', 'BBoomers'])

        return df  

    def categorize(self, df):
        #Creates a dummy dataframe for visualization
        df_dummy=df.copy()

        #Categorizes Income feature
        labels = ['0-20k', '20k-30k', '30k-50k','50k-70k','70k-700k']
        df_dummy['Income'] = pd.cut(df_dummy['Income'], 
            [0, 20000, 30000, 50000, 70000, 700000], labels=labels)        

        #Categorizes TotalAmount_Spent feature
        labels = ['0-200', '200-500', '500-800','800-1000','1000-3000']
        df_dummy['TotalAmount_Spent'] = pd.cut(df_dummy['TotalAmount_Spent'], 
            [0, 200, 500, 800, 1000, 3000], labels=labels)

        #Categorizes Num_TotalPurchases feature
        labels = ['0-5', '5-10', '10-15','15-25','25-35']
        df_dummy['Num_TotalPurchases'] = pd.cut(df_dummy['Num_TotalPurchases'], 
            [0, 5, 10, 15, 25, 35], labels=labels)

        #Categorizes Dt_Customer_Year feature
        labels = ['2012', '2013', '2014']
        df_dummy['Dt_Customer_Year'] = pd.cut(df_dummy['Dt_Customer_Year'], 
            [0, 2012, 2013, 2014], labels=labels)

        #Categorizes Dt_Customer_Month feature
        labels = ['0-3', '3-6', '6-9','9-12']
        df_dummy['Dt_Customer_Month'] = pd.cut(df_dummy['Dt_Customer_Month'], 
            [0, 3, 6, 9, 12], labels=labels)

        #Categorizes Customer_Age feature
        labels = ['0-30', '30-40', '40-50', '50-60','60-120']
        df_dummy['Customer_Age'] = pd.cut(df_dummy['Customer_Age'], 
            [0, 30, 40, 50, 60, 120], labels=labels)

        #Categorizes MntGoldProds feature
        labels = ['0-30', '30-50', '50-80', '80-100','100-400']
        df_dummy['MntGoldProds'] = pd.cut(df_dummy['MntGoldProds'], 
            [0, 30, 50, 80, 100, 400], labels=labels)

        #Categorizes MntSweetProducts feature
        labels = ['0-10', '10-20', '20-40', '40-100','100-300']
        df_dummy['MntSweetProducts'] = pd.cut(df_dummy['MntSweetProducts'], 
            [0, 10, 20, 40, 100, 300], labels=labels)

        #Categorizes MntFishProducts feature
        labels = ['0-10', '10-20', '20-40', '40-100','100-300']
        df_dummy['MntFishProducts'] = pd.cut(df_dummy['MntFishProducts'], 
            [0, 10, 20, 40, 100, 300], labels=labels)

        #Categorizes MntMeatProducts feature
        labels = ['0-50', '50-100', '100-200', '200-500','500-2000']
        df_dummy['MntMeatProducts'] = pd.cut(df_dummy['MntMeatProducts'], 
            [0, 50, 100, 200, 500, 2000], labels=labels)

        #Categorizes MntFruits feature
        labels = ['0-10', '10-30', '30-50', '50-100','100-200']
        df_dummy['MntFruits'] = pd.cut(df_dummy['MntFruits'], 
            [0, 10, 30, 50, 100, 200], labels=labels)

        #Categorizes MntWines feature
        labels = ['0-100', '100-300', '300-500', '500-1000','1000-1500']
        df_dummy['MntWines'] = pd.cut(df_dummy['MntWines'], 
            [0, 100, 300, 500, 1000, 1500], labels=labels)

        #Categorizes Recency feature
        labels = ['0-10', '10-30', '30-50', '50-80','80-100']
        df_dummy['Recency'] = pd.cut(df_dummy['Recency'], 
            [0, 10, 30, 50, 80, 100], labels=labels)

        return df_dummy
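
    # Note: pd.cut with explicit edges builds right-closed intervals, e.g.
    # (0, 20000], (20000, 30000], ..., so a value equal to the lowest edge (such as
    # an income of exactly 0) or above the highest edge becomes NaN. Passing
    # include_lowest=True keeps the lower boundary, e.g. (a sketch of one column):
    #   df_dummy['Income'] = pd.cut(df_dummy['Income'],
    #       [0, 20000, 30000, 50000, 70000, 700000],
    #       labels=labels, include_lowest=True)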

    def extract_cat_num_cols(self, df):
        #Extracts categorical and numerical columns in dummy dataset
        cat_cols = [col for col in df.columns if 
            (df[col].dtype == 'object') or (df[col].dtype.name == 'category')]
        num_cols = [col for col in df.columns if 
            (df[col].dtype != 'object') and (df[col].dtype.name != 'category')]
        
        return cat_cols, num_cols
    
    def encode_categorical_feats(self, df, cat_cols):
        #Encodes categorical features in original dataset     
        print(f'Features that need to be label encoded: \n{cat_cols}')

        for c in cat_cols:
            lbl = LabelEncoder()
            lbl.fit(list(df[c].astype(str).values))
            df[c] = lbl.transform(list(df[c].astype(str).values))
        print('Label Encoding done..')  
        return df  
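
    # Note: LabelEncoder assigns integers by sorted string order, so the AgeGroup
    # labels defined in preprocess() encode as BBoomers=0, Gen-X=1, Gen-Y.1=2,
    # Gen-Y.2=3, Gen-Z=4 (any missing values, cast to the string 'nan', sort last).
    # The mapping carries no ordinal meaning, which tree-based models tolerate
    # better than linear ones.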

    def extract_input_output_vars(self, df): 
        #Extracts output and input variables
        y = df['Response'].values # Target for the model
        X = df.drop(['Dt_Customer', 'Year_Birth', 'Response'], axis = 1)  

        return X, y     

    def feat_importance_rf(self, X, y):
        names = X.columns
        rf = RandomForestClassifier()
        rf.fit(X, y)

        result_rf = pd.DataFrame()
        result_rf['Features'] = X.columns
        result_rf['Values'] = rf.feature_importances_
        result_rf.sort_values('Values', inplace = True, ascending = False)

        return result_rf
    
    def feat_importance_et(self, X, y):
        model = ExtraTreesClassifier()
        model.fit(X, y)

        result_et = pd.DataFrame()
        result_et['Features'] = X.columns
        result_et['Values'] = model.feature_importances_
        result_et.sort_values('Values', inplace=True, ascending =False)

        return result_et    
    
    def feat_importance_rfe(self, X, y):
        model = LogisticRegression()
        #Creates the RFE model
        rfe = RFE(model)
        rfe = rfe.fit(X, y)

        result_lg = pd.DataFrame()
        result_lg['Features'] = X.columns
        result_lg['Ranking'] = rfe.ranking_
        #Ranking 1 means a feature was selected, so sort ascending to list the strongest features first
        result_lg.sort_values('Ranking', inplace=True, ascending=True)

        return result_lg   

    def save_result(self, y_test, y_pred, fname):
        # Convert y_test and y_pred to pandas Series for easier handling
        y_test_series = pd.Series(y_test)
        y_pred_series = pd.Series(y_pred)
        
        # Calculate y_result_series
        y_result_series = pd.Series(y_pred - y_test == 0)
        y_result_series = y_result_series.map({True: 'True', False: 'False'})

        # Create a DataFrame to hold y_test, y_pred, and y_result
        data = pd.DataFrame({'y_test': y_test_series, 'y_pred': y_pred_series, 'result': y_result_series})

        # Save the DataFrame to a CSV file
        data.to_csv(fname, index=False)
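
Taken together, Process_Data forms a linear preprocessing pipeline. The following is a minimal sketch of how its pieces might be chained to produce the model inputs; the sequence of calls is an assumption based on the method names above, not code lifted from the original project.

#example usage (illustrative sketch)
obj_data = Process_Data()
df = obj_data.preprocess()                               #read, clean, and engineer features
cat_cols, num_cols = obj_data.extract_cat_num_cols(df)   #identify categorical columns
df = obj_data.encode_categorical_feats(df, cat_cols)     #label-encode them
X, y = obj_data.extract_input_output_vars(df)            #features and 'Response' target

The resulting X and y can then be split with train_test_split and passed to any of the implement_* methods sketched earlier.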








