I am Mehwish Alam, a first-year BCIT student working as a technical member of the AI Club for this tenure.
Like almost every student who joins a university, I went hunting for societies to find where my skills and heart fit best. I became a member of the AI Club, of course, but we were not limited to choosing just one society. I have had a keen interest in recommendation systems ever since I was introduced to them, and my personal struggle to find societies to join inspired the idea of building a Society Recommendation System (SRS for short) once work started in AIC.
The first step in any successful project is to collect data. After several months of collecting raw data, which is a fairly short time for this kind of work, I finally had enough to start working.
Before building any machine learning model, it is better to understand your data and what you are trying to achieve. Data exploration reveals hidden trends and insights, and data preprocessing makes the data ready for use by ML algorithms.
To clean and understand my data, I used the Pandas library. A screenshot of my code is attached below.
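For readers who cannot see the screenshot, here is a minimal sketch of what a cleaning and exploration pass with Pandas might look like. The column names (`department`, `interests`, `society`) and the sample rows are assumptions made up for illustration; the real survey data is not shown in this post.

```python
import pandas as pd

# Hypothetical survey responses (illustrative only, not the real dataset)
raw = pd.DataFrame({
    "department": ["CS", "cs ", "EE", None, "CS"],
    "interests": ["AI;Robotics", "Gaming", "AI", "Music", "AI;Robotics"],
    "society": ["AI Club", "Gaming Society", "AI Club", "Music Society", "AI Club"],
})

# Cleaning: tidy up text, then drop incomplete rows and exact duplicates
clean = raw.copy()
clean["department"] = clean["department"].str.strip().str.upper()
clean = clean.dropna().drop_duplicates().reset_index(drop=True)

# Exploration: size of the cleaned data and how often each society appears
print(clean.shape)
print(clean["society"].value_counts())
```

Counting values per column like this is a quick way to spot rare societies or misspelled categories before modelling.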
Since several columns allow multiple choices per response, we also have to take care of that before processing further.
After splitting the choices, every option holds a single value in each row.
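One possible way to do this split with Pandas is sketched below. It assumes the choices are stored in one cell separated by semicolons and turns each choice into its own row; the separator, the column names, and the one-row-per-choice layout are all assumptions, since the original code is not shown.

```python
import pandas as pd

# Hypothetical responses where "interests" holds several choices in one cell
df = pd.DataFrame({
    "student_id": [1, 2],
    "interests": ["AI;Robotics", "Gaming"],
    "society": ["AI Club", "Gaming Society"],
})

# Split each cell into a list of choices, then give every choice its own row
df["interests"] = df["interests"].str.split(";")
long_df = df.explode("interests").reset_index(drop=True)

print(long_df)
```

An alternative with the same goal is `str.get_dummies(sep=";")`, which creates one 0/1 column per choice instead of extra rows.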
Another problem I faced: since I was using an ensemble algorithm from sklearn, I first needed to convert my categorical data into numeric form. Several encoders are available; I used LabelEncoder, but you can also use OneHotEncoder or any other encoder you feel more comfortable with.
As you can see, every value is now an integer (type=int). I also separated the labels from the features and made a DataFrame that holds the encoded values given by LabelEncoder alongside the decoded values, so we can use it at the end to print the society name.
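The encoding step and the lookup table described above could be sketched as follows. The column names and rows are hypothetical, and using one LabelEncoder per feature column is an assumption about how the encoding was applied.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical cleaned data (illustrative only)
df = pd.DataFrame({
    "department": ["CS", "EE", "CS", "BBA"],
    "interest": ["AI", "Robotics", "Gaming", "Music"],
    "society": ["AI Club", "Robotics Society", "Gaming Society", "Music Society"],
})

# Features: encode each categorical column into integers
features = df[["department", "interest"]].copy()
for column in features.columns:
    features[column] = LabelEncoder().fit_transform(features[column])

# Labels: keep this encoder so predictions can be decoded later
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(df["society"])

# Lookup table of encoded value -> society name
lookup = pd.DataFrame({
    "encoded": range(len(label_encoder.classes_)),
    "society": label_encoder.classes_,
})
print(lookup)
```

Note that scikit-learn documents LabelEncoder as a tool for target labels; for feature columns, OrdinalEncoder or OneHotEncoder are the encoders designed for that job, so they are worth considering as well.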
Next, we split our data into train and test sets. I checked different sizes for the train and test sets to get a basic understanding of how the splitting is done. I put the train size at 60%, while the test set holds the remaining 40% of the original data.
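A 60/40 split with `train_test_split` might look like the sketch below. The toy arrays, the fixed `random_state`, and the use of `stratify` are assumptions added for illustration and reproducibility; the post does not say which options were used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples with 2 features each, and a balanced binary label
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Hold out 40% of the rows for testing; the remaining 60% are used for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Passing `stratify=y` keeps the class proportions the same in both sets, which can help when some societies have only a few members in the data.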
I then scaled the data and normalized it.
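As a rough illustration of the normalization step, the sketch below uses scikit-learn's `Normalizer` on made-up numbers. The arrays are assumptions, and the exact scaling settings used in the project are not shown in this post.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Toy feature matrices (illustrative only)
X_train = np.array([[3.0, 4.0], [1.0, 0.0]])
X_test = np.array([[6.0, 8.0]])

# Normalizer rescales every row to unit length (L2 norm by default)
normalizer = Normalizer()
X_train_norm = normalizer.fit_transform(X_train)
X_test_norm = normalizer.transform(X_test)

print(X_train_norm)
print(X_test_norm)
```

Keep in mind that `Normalizer` works row by row, whereas scalers such as `StandardScaler` or `MinMaxScaler` work column by column, so the two kinds of tools are not interchangeable.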
Now our data is ready for an ML algorithm (Random Forest Classifier). I fit the model, predicted the results on the test set, and used the score functionality to calculate the prediction score.
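The fit, predict, and score steps can be sketched like this. Synthetic data from `make_classification` stands in for the society dataset, and the hyperparameters (`n_estimators=100`, fixed `random_state`) as well as the macro-averaged F1 are assumptions, not the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the encoded society dataset
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Fit the forest on the training set, then predict on the held-out test set
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# score() reports mean accuracy; f1_score needs an averaging mode for multiclass
accuracy = model.score(X_test, y_test)
macro_f1 = f1_score(y_test, y_pred, average="macro")
print(f"accuracy={accuracy:.3f}, macro F1={macro_f1:.3f}")
```

Accuracy and F1 can differ noticeably when some classes are much rarer than others, so it is worth looking at both.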
Now we can predict societies using our model. The following are the results for an anonymous input.
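To show how a single new input could be run through the model and decoded back into a society name, here is a small end-to-end sketch. All rows and column names are hypothetical, and it swaps in `OrdinalEncoder` for the feature columns (rather than the LabelEncoder mentioned earlier) so that the same fitted encoder can be reused on new inputs.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical training data (illustrative only)
train = pd.DataFrame({
    "department": ["CS", "CS", "EE", "EE", "BBA", "BBA"],
    "interest": ["AI", "AI", "Robotics", "Robotics", "Music", "Music"],
    "society": ["AI Club", "AI Club", "Robotics Society",
                "Robotics Society", "Music Society", "Music Society"],
})

# Encode features and labels with encoders we keep for later reuse
feature_encoder = OrdinalEncoder()
X = feature_encoder.fit_transform(train[["department", "interest"]])
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(train["society"])

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Encode a new, anonymous student the same way, predict, then decode the label
new_student = pd.DataFrame({"department": ["CS"], "interest": ["AI"]})
encoded_prediction = model.predict(feature_encoder.transform(new_student))
society = label_encoder.inverse_transform(encoded_prediction)[0]
print("Recommended society:", society)
```

`inverse_transform` plays the same role as the encoded/decoded lookup DataFrame described earlier: it maps the predicted integer back to a readable name.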
LIBRARIES USED:
The following are all the libraries I used in building this model:
- Pandas: For cleaning and understanding the data.
- Numpy: For finding unique values and other things.
- Normalizer: To normalize the dataset. It rescales each sample (row) to unit norm.
- Label Encoder: To convert categorical data. It encodes target labels with values between 0 and n_classes-1.
- Random Forest: Random Forest Classifier is an ensemble algorithm. It builds multiple decision trees, each on ‘K’ data points picked from the dataset, and merges their predictions to get a more accurate and stable result.
- Train_test_split: For splitting the original dataset into train and test sets.
- F1_score: For finding the prediction score.
Written by: Mehwish Alam
Edited by: Syed Sannan Ali