open_innovation #70 (Open)

wants to merge 8 commits into base: master
35 changes: 35 additions & 0 deletions soumiksamanta/README.md
@@ -0,0 +1,35 @@
## Overview

![working demo](glugle.gif)

**glugle** is a search engine built using:

* Python to implement the core logic of the application
* the Flask web framework to route web requests
* MongoDB as the database to store the crawled data and maintain text indexes that facilitate fast searching

This project was made as part of the **10 Days of Code** event organized by the [GNU/Linux Users' Group, NIT Durgapur](https://github.com/lugnitdgp).

## Working

The web application consists of 4 basic parts:

1. **Crawling** - Web search engines gather their information by crawling from site to site. The crawler is given an entry-point URL, from which it starts collecting links and text data and storing them in the database.
2. **Indexing** - Indexing associates the data found on each web page with the domain it was found on and with the HTML fields it came from. The way data is stored in the database is a major contributor to the efficiency of the search engine.
3. **Searching** - Searching queries the database for results relevant to the search query (see the sketch after this list).
4. **Ranking** - Ranking orders the search results from the previous step by their relevance to the user. A better ranking system results in a better search experience.
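
Indexing and searching in this project are backed by a MongoDB text index and `$text` queries (see `crawler.py` and `main.py` in this PR). A minimal sketch of the idea, assuming a local MongoDB instance on the default port and the project's `glugledb` database:

```python
import pymongo

# connect to the local MongoDB instance used by the project (assumed to be on the default port)
client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
db = client.glugledb

# index the crawled fields so MongoDB can answer $text queries quickly
db.query_data.create_index(
    [("url", pymongo.TEXT), ("title", pymongo.TEXT), ("description", pymongo.TEXT)],
    name="query_data_index",
    default_language="english",
)

# search the index and sort the matches by MongoDB's relevance score
results = db.query_data.find(
    {"$text": {"$search": "linux users group"}},
    {"score": {"$meta": "textScore"}},
).sort([("score", {"$meta": "textScore"})])

for doc in results:
    print(doc["title"], doc["url"])
```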

There is also an **admin panel** at **/admin** to submit new domain URLs to crawl data from. [Part of open innovation]

## Future Ideas:

* [ ] Voice Search, by converting input speech to text and forming a search query from it
* [ ] Image Search, probably using the Google Vision API to get a summary of the input image and forming a search query from it
* [ ] Login system using Flask-Security
* [ ] Narrowing down search results based on the user's previous/recent searches

## To run the project

* Clone the repository
* `cd` into the project directory
* Run the command `python main.py`
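
Before the first run, the app also needs the NLTK corpora used for query preprocessing and a MongoDB instance listening on `127.0.0.1:27017`. A one-off setup sketch (these steps are assumptions based on the imports in `main.py`, not part of the PR):

```python
# one-off setup (assumed): download the NLTK data that main.py relies on
import nltk

nltk.download("punkt")      # tokenizer models used by word_tokenize
nltk.download("stopwords")  # stopword list used by stopwords.words("english")
```
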
Binary file added soumiksamanta/glugle.gif
154 changes: 154 additions & 0 deletions soumiksamanta/glugle/crawler.py
@@ -0,0 +1,154 @@
import re
import sys
import pymongo
import requests
import urllib.parse
from bs4 import BeautifulSoup


class Crawler():

"""
    The Crawler class crawls from an entry-point web URL, given as input by the user, up to a maximum depth also specified by the user.
"""


def __init__(self):

"""
Constructor for creating objects of the Crawler class.
        It creates a connection to the MongoDB database, or exits if it encounters an error.
"""

try:
self.client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
self.db = self.client.glugledb
except pymongo.errors.ConnectionFailure as e:
print("ERROR connecting to database", e)
            sys.exit(1)


def start_crawl(self, url, depth):

"""
Initiates the crawling process on the given URL and depth.

        Downloads the robots.txt file of the input URL,
        extracts all the disallowed links, and
        makes a call to the recursive crawl() function to start crawling.

##### Required parameters:
`url` : URL to crawl upon
        `depth` : max recursion depth to control the number of pages crawled

"""

disallowed_links = []

try:
# get the robots.txt file of the input URL
complete_url = urllib.parse.urljoin(url, '/robots.txt')
robots = requests.get(complete_url)
soup = BeautifulSoup(robots.text, 'lxml')

# extract the disallowed links
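            # robots.txt is plain text, so the lxml parser wraps it in a single <p> tag; find('p') returns the whole file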
our_robots = soup.find('p').text
disallowed_links = [link[10:] for link in re.findall("Disallow: /.*", our_robots)]

except requests.exceptions.ConnectionError as e:
print("ERROR Connecting to", complete_url, ":", e)
return

        # start crawling process
        self.crawl(url, depth, disallowed_links)

        # close the database connection once the whole crawl has finished
        self.client.close()


def crawl(self, url, depth, disallowed_links):

"""
        A recursive function to crawl a given URL. It extracts all the URLs available on the page and keeps following them until it reaches the maximum depth given as input.

##### Required parameters:
`url` : URL to crawl upon
        `depth` : max recursion depth to control the number of pages crawled
`disallowed_links` : links that are disallowed in robots.txt of the domain

"""

        # normalize the URL (urljoin with an empty string drops any fragment)
url = urllib.parse.urljoin(url, "")
print(f"Crawling {url} at depth {depth}")
title = ""
desc = ""
result = None

try:
# get the page data
result = requests.get(url)

try:
soup = BeautifulSoup(result.text, 'lxml')

try:
title = soup.find('title')
title = title.text
except:
title = ""

try:
desc_list = soup.find_all('p')
desc = " ".join(item.text.replace('\n', '') for item in desc_list)
except:
desc = ""

except Exception as e:
print("ERROR getting page details : ", e)

            # process and insert the page data into the database only if we did not encounter a 404 error
if result and result.status_code != 404:

query = {
'url':url,
'title': title,
'description': desc
}
print(query)

try:
self.db.query_data.insert_one(query)
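                    # create_index is a no-op when an index with the same spec already exists, so calling it on every insert is safe (if redundant)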
self.db.query_data.create_index(
[
('url', pymongo.TEXT),
('title', pymongo.TEXT),
('description', pymongo.TEXT)
],
name="query_data_index",
default_language="english"
)
except Exception as e:
print("ERROR inserting data : ", e)
exc_type, exc_val, tb_obj = sys.exc_info()
print(exc_type, "at", tb_obj.tb_lineno)

            # extract all links on the page and continue crawling, but only if the maximum allowed depth has not been reached
if depth != 0:
try:
links = [urllib.parse.urljoin(url, link.get('href')) for link in soup.find_all('a')]

for link in links:
if link not in disallowed_links:
self.crawl(link, depth-1, disallowed_links)

except Exception as e:
print("ERROR getting links : ", e)


except:
print("ERROR fetching", url)




# crl = Crawler()
# crl.start_crawl("https://www.wikipedia.org", 2)
161 changes: 161 additions & 0 deletions soumiksamanta/glugle/main.py
@@ -0,0 +1,161 @@
from flask import Flask, render_template, request, redirect, url_for
from flask_paginate import Pagination, get_page_args
from flask_admin import Admin
import pymongo
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from crawler import Crawler
import threading


app = Flask(__name__)
client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
db = client.glugledb
query_data = db.query_data


@app.route("/", methods=['GET', 'POST'])
def home():
if request.method == 'POST':
query = request.form['query']
return redirect(url_for("search", query=query))

return render_template("index.html")


# set optional bootswatch theme
app.config['FLASK_ADMIN_SWATCH'] = 'cerulean'
admin = Admin(app, template_mode='bootstrap4')
# Add administrative views here
@app.route("/admin", methods=['GET', 'POST'])
def admin():
submitted = False
if request.method == 'POST':
crawl_url = request.form['crawl_url']
print(crawl_url)
crawler = Crawler()
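        # run the crawl in a background daemon thread so the request can return immediately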
threading.Thread(target=crawler.start_crawl, args=(crawl_url, 2), daemon=True).start()
# crawler.start_crawl(crawl_url)
submitted = True

return render_template("admin/index.html", submitted=submitted)


def get_paginated_search_results(search_results, offset=0, per_page=10):
return search_results[offset: offset + per_page]


def get_query_keywords(query):
# lowercasing
query = query.lower()

# remove punctuations
translator = str.maketrans('', '', string.punctuation)
query = query.translate(translator)

# removing stopwords and tokenization
stop_words = set(stopwords.words("english"))
word_tokens = word_tokenize(query)
filtered_query = [word for word in word_tokens if word not in stop_words]

    # perform stemming on the filtered tokens, so the stopword removal above actually takes effect
    stemmer = PorterStemmer()
    stems = [stemmer.stem(word) for word in filtered_query]
    return stems


def query_database(query):
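    # relies on the text index created by the crawler; each match carries MongoDB's textScore relevance value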
return db.query_data.find(
{
'$text' : {
'$search' : query,
'$caseSensitive' : False,
}
},
{
'score': {
'$meta': "textScore"
}
}
).sort(
[
('score', {'$meta': 'textScore'}),
('_id', pymongo.DESCENDING)
]
)


def remove_duplicates(result_data):
search_results = []
for doc in result_data:
exist = False
for result in search_results:
if result['title'] == doc['title'] or result['url'] == doc['url']:
exist = True
break

        if not exist:
search_results.append(doc)
return search_results


def sort_rank(search_results, keywords):
    # boost MongoDB's textScore: +2 for a keyword match in the title, +1 for a match in the description
    for result in search_results:
        for word in keywords:
            if word in result['title']:
                result['score'] += 2
            if word in result['description']:
                result['score'] += 1
    return sorted(search_results, key=lambda result: result['score'], reverse=True)


@app.route("/search")
def search():

page, per_page, offset = get_page_args(page_parameter='page', per_page_parameter='per_page')

query = request.args.get('query', "")
search_results = []

# preprocess query
keywords = get_query_keywords(query)
processed_query = " ".join(keywords)

# search for data in database
search_results = query_database(processed_query)

# filter out duplicate search results
search_results = remove_duplicates(search_results)

# rank the retrieved search results
search_results = sort_rank(search_results, keywords)

    # total number of search results
total = len(search_results)

# paginate the search results
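    # note: the full ranked list is sliced in memory (see get_paginated_search_results) rather than paginated in the database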
pagination = Pagination(page=page,
total=total,
css_framework='bootstrap4',
per_page=per_page,
format_total=True,
format_number=True,
record_name='results',
alignment='center')

return render_template("search.html",
query=query,
search_results=get_paginated_search_results(search_results, offset, per_page),
total=total,
pagination=pagination
)

if __name__ == "__main__":
app.run(debug=True)
19 changes: 19 additions & 0 deletions soumiksamanta/glugle/sum.py
@@ -0,0 +1,19 @@
from flask import Flask, render_template, request, redirect, url_for

app = Flask(__name__)

@app.route("/", methods=['GET', 'POST'])
def addNumbers():
if request.method == 'POST':
a = request.form["a"]
b = request.form["b"]
return redirect(url_for("sum", a=a, b=b))

return render_template("sum_in.html")


@app.route("/sum")
def sum():
a = int(request.args.get('a', None))
b = int(request.args.get('b', None))
return render_template("sum_out.html", a=a, b=b, sum=(a+b))