The project aims to add a recommendation system to the CircuitVerse website. In this report, I have explained every decision and technology in simple words, to give a better understanding of why we did what we did. All my code files are here 💻🎉.
The data available to us (check out the database schema) didn't have a target variable (typically an X and a Y with which to train and test a model and improve its accuracy). Therefore, we had to rely on unsupervised learning algorithms to find the intrinsic patterns and hidden structures in the data and present them to the user.
Our model is a two-layered model: the first layer consists of CountVectorizer + LDA, and the second layer finds the top 50 closest projects using a K-D Tree, with stars and views used for re-ranking. The diagram below shows the model structure in detail:
A SQL query was used to extract the id, author_id, name, description, tag_names, views, and star count of approximately 180,000 projects, which were stored as a JSON file. We removed the following:
HTML tags
Non-ASCII characters
Punctuation (with the exceptions noted below).
Replaced NULL values and “Untitled” with empty strings.
Note: We did not remove punctuation such as &, *, +, etc., as these symbols carry meaning in logic circuits.
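As an illustration, here is a minimal sketch of this cleaning step, assuming a pandas workflow; the regexes, file name, column names, and the exact set of kept symbols are mine, not the production code:

```python
import re

import pandas as pd

# Symbols kept because they carry meaning in logic circuits (assumed set).
KEEP = "&*+|^~!"

def clean_text(text):
    """Strip HTML tags, non-ASCII characters, and most punctuation."""
    if not isinstance(text, str) or text.strip().lower() in ("", "untitled"):
        return ""                                   # NULL / "Untitled" -> ""
    text = re.sub(r"<[^>]+>", " ", text)            # drop HTML tags
    text = text.encode("ascii", "ignore").decode()  # drop non-ASCII characters
    text = re.sub(rf"[^\w\s{re.escape(KEEP)}]", " ", text)  # most punctuation
    return re.sub(r"\s+", " ", text).strip()

projects = pd.read_json("projects.json")  # id, name, description, views, stars...
projects["name"] = projects["name"].map(clean_text)
projects["description"] = projects["description"].map(clean_text)
```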
Feature extraction distills the important features from a huge collection of items based on the frequency/count of occurrence, reducing the dimensionality of the dataset to speed up processing. Our recommendation system uses CountVectorizer, which converts the collection of documents into a matrix of token counts, thus filtering out the important and relevant keywords.
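As a minimal sketch, assuming scikit-learn's CountVectorizer (the vocabulary size and frequency cut-offs here are illustrative, not the project's settings):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Name + description of each project, cleaned as above.
docs = (projects["name"] + " " + projects["description"]).tolist()

# max_features / min_df are illustrative choices.
vectorizer = CountVectorizer(stop_words="english", max_features=5000, min_df=5)
counts = vectorizer.fit_transform(docs)  # sparse (n_projects x n_tokens) counts
```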
LDA is used to assign the text in a document to particular topics. It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.
Simply put, LDA learns distributions of words and packs them together as topics (the number of topics is chosen by the user). When a new sentence is fed into the model, it is first converted to a vector of word counts (by CountVectorizer). LDA then runs on this project vector, giving us the distribution of the sentence over all the topics. This distribution becomes the new representation of the project, reducing the dimension of every project to the number of topics (in our case 10). It also reduces the time taken to build the K-D Tree and helps in finding better neighbors.
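Continuing the sketch above, a hedged example of the LDA layer; only the 10 topics come from the report, the other settings and the example query are illustrative:

```python
from sklearn.decomposition import LatentDirichletAllocation

# 10 topics, as chosen for this dataset; random_state is illustrative.
lda = LatentDirichletAllocation(n_components=10, random_state=0)
topic_dist = lda.fit_transform(counts)  # (n_projects x 10) topic distributions

# A new project's text becomes a 10-dimensional topic vector.
new_doc = vectorizer.transform(["4 bit ripple carry adder"])
print(lda.transform(new_doc))
```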
In our recommendation system, the log-likelihood score was optimal at 10 topics for the given dataset.
[Plots: log-likelihood vs. number of topics, 5 to 50, with a zoomed view for topics 5 to 20]
After getting the probability distribution of each project, we can compare items, which is what the second layer of the model does. Pairwise item distance, the best collaborative filtering approach here, gives us the most similar items. However, this is a computationally expensive method with a time complexity of O(N² log N), and training the model would take days. We also wouldn't be able to use it for new-project recommendations in the online mode because of the high runtime.
Therefore, the need for K-D Trees arose. A K-D Tree (also called a K-Dimensional Tree) is a binary search tree in which each node stores a K-dimensional point in space. Each node splits the space along one dimension: points on the left of the split are represented by the node's left subtree, and points on the right by its right subtree.
The advantages of using K-D Trees were:
Query time of O(log N) per lookup (hence it can be used online).
Build time of O(N log N) (faster training).
The top 50 nearest neighbors of each item are then taken and recommendations are computed, as in the sketch below.
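A sketch of this layer using scikit-learn's KDTree over the 10-dimensional topic distributions from above (the leaf size and the example query are assumptions):

```python
from sklearn.neighbors import KDTree

# Build the tree over the (n_projects x 10) topic distributions.
tree = KDTree(topic_dist, leaf_size=40)

i = 0                                         # example: the first project
# k=51 because a project is returned as its own nearest neighbor.
dist, idx = tree.query(topic_dist[i : i + 1], k=51)
neighbors = idx[0][1:]                        # indices of the 50 closest projects
```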
We take the top 50 nearest neighbors of the project, along with each neighbor's stars and views.
We split the 10 recommendations into 3 parts:
The first 2 recommendations are purely name-based (since some well-documented circuits like this had too many “important” words in their descriptions and hence were giving different results).
The next 4 recommendations are the top 4 nearest neighbors (based on name + description) that we got from the K-D Tree.
We then re-rank the remaining 46 neighbors by giving each one a popularity score:
Score = (5 × S) + V
Here, S is the number of stars and V is the number of views of a project.
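A minimal sketch of this re-ranking step, reusing the projects DataFrame and neighbors array from the earlier sketches; the column names are assumptions:

```python
# Column names are assumed; use whatever the projects table actually exposes.
stars = projects["stars"].to_numpy()
views = projects["views"].to_numpy()

def popularity_score(j):
    """Score = (5 * S) + V for the project at index j."""
    return 5 * stars[j] + views[j]

# The 46 neighbors left after the top 4 nearest are used directly;
# the most popular 4 of them complete the 10 recommendations.
remaining = neighbors[4:]
top_popular = sorted(remaining, key=popularity_score, reverse=True)[:4]
```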
Since we need to calculate the recommendations in real time, we don't compute the name-only recommendations separately online. The first 5 recommendations therefore take the name + description through feature extraction, lemmatization, and topic modeling, and show the top 5 K-D Tree neighbors. The next 5 recommendations come from re-ranking the remaining 45 neighbors with the formula above and displaying the results to the user.
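Tying the sketches together, a hedged sketch of this online path; the function and variable names are mine, reusing the pieces defined above:

```python
def recommend(name, description):
    """Online recommendations: 5 K-D Tree neighbors + 5 re-ranked by popularity."""
    text = clean_text(name + " " + description)
    vec = lda.transform(vectorizer.transform([text]))  # 10-dim topic vector
    _, idx = tree.query(vec, k=51)
    hits = list(idx[0][1:])       # drop the first hit: the project itself,
                                  # when it is already indexed in the tree
    nearest = hits[:5]            # top 5 nearest neighbors
    popular = sorted(hits[5:], key=popularity_score, reverse=True)[:5]
    return nearest + popular
```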
Uneven distribution of items
Since this project was the first recommendation system for the organization, there's tremendous scope for improvement, and we would love for you to contribute to anything that seems interesting🎉.
Some of the things are listed below:
My experience at CircuitVerse couldn't have been any better: I was given the right support, and the mentors were really helpful, cooperative, and understanding. My GSoC journey was filled with testing, implementation, research, and a lot of unexpected challenges, but my mentor Biswesh Mohapatra helped me throughout. I have learned a lot from the project and am super happy with the outcome. I also learned a lot from my fellow GSoCers. Kudos to the awesome team and the awesome community of CircuitVerse!
I am very grateful to my mentor Biswesh Mohapatra for helping me throughout the Google Summer of Code period and for making this project a success. Special thanks to all the core members of the organization, Satvik Ramaprasad, Aboobacker MK, Shivansh Srivastava, and Shreya Prasad, for looking over my work and guiding me throughout. Lastly, sincere gratitude to Google for offering me this great opportunity to contribute to this wonderful community.
Feel free to follow me on Medium and subscribe to my YouTube channel; I am sure you'll learn something new.
Thank you for reading. Happy Coding!! 💻🎉