open_innovation #70 (Open)

wants to merge 8 commits into base: master
35 changes: 35 additions & 0 deletions soumiksamanta/README.md
@@ -0,0 +1,35 @@
## Overview

![working demo](glugle.gif)

**glugle** is a search engine built using:

* Python to implement the core logic of the application
* the Flask web framework to route web requests
* MongoDB as the database to store the crawled data and maintain text indexes that facilitate fast searching

This project was made as part of the **10 Days of Code** event organized by the [GNU/Linux Users' Group, NIT Durgapur](https://github.com/lugnitdgp).

## Working

The web application consists of 4 basic parts:

1. **Crawling** - Web search engines gather their information by crawling from site to site. The crawler is given an entry-point URL, from which it starts collecting links and text data and storing them in the database.
2. **Indexing** - Indexing associates the data found on each web page with the domain it was found on and with the HTML fields it came from. The way data is stored in the database is a major contributor to the efficiency of the search engine.
3. **Searching** - Searching queries the database for results relevant to the search query (see the sketch after this list).
4. **Ranking** - Ranking orders the search results from the previous step by their relevance to the user. A better ranking system results in a better search experience.
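
Indexing and searching in this project are backed by a MongoDB text index and `$text` queries (see `crawler.py` and `main.py` in this PR). A minimal sketch of the idea, assuming a local MongoDB instance on the default port and the project's `glugledb` database:

```python
import pymongo

# connect to the local MongoDB instance used by the project (assumed to be on the default port)
client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
db = client.glugledb

# index the crawled fields so MongoDB can answer $text queries quickly
db.query_data.create_index(
    [("url", pymongo.TEXT), ("title", pymongo.TEXT), ("description", pymongo.TEXT)],
    name="query_data_index",
    default_language="english",
)

# search the index and sort the matches by MongoDB's relevance score
results = db.query_data.find(
    {"$text": {"$search": "linux users group"}},
    {"score": {"$meta": "textScore"}},
).sort([("score", {"$meta": "textScore"})])

for doc in results:
    print(doc["title"], doc["url"])
```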

There is also an **admin panel** at **/admin** to submit new domain URLs to crawl data from. [Part of open innovation]

## Future Ideas:

* [ ] Voice Search, by converting input speech to text and forming a search query from it
* [ ] Image Search, probably using the Google Vision API to get a summary of the input image and forming a search query from it
* [ ] Login system using Flask-Security
* [ ] Narrowing down search results based on the user's previous/recent searches

## To run the project

* Clone the repository
* `cd` into the project directory
* Run the command `python main.py`
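
Before the first run, the app also needs the NLTK corpora used for query preprocessing and a MongoDB instance listening on `127.0.0.1:27017`. A one-off setup sketch (these steps are assumptions based on the imports in `main.py`, not part of the PR):

```python
# one-off setup (assumed): download the NLTK data that main.py relies on
import nltk

nltk.download("punkt")      # tokenizer models used by word_tokenize
nltk.download("stopwords")  # stopword list used by stopwords.words("english")
```
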
Binary file added soumiksamanta/glugle.gif
154 changes: 154 additions & 0 deletions soumiksamanta/glugle/crawler.py
@@ -0,0 +1,154 @@
import re
import sys
import pymongo
import requests
import urllib.parse
from bs4 import BeautifulSoup


class Crawler():

"""
    The Crawler class crawls from an entry-point web URL, given as input by the user, up to a maximum depth also specified by the user.
"""


def __init__(self):

"""
Constructor for creating objects of the Crawler class.
        It creates a connection to the MongoDB database, or exits if it encounters an error.
"""

try:
self.client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
self.db = self.client.glugledb
except pymongo.errors.ConnectionFailure as e:
print("ERROR connecting to database", e)
            sys.exit(1)


def start_crawl(self, url, depth):

"""
Initiates the crawling process on the given URL and depth.

        Downloads the robots.txt file of the input URL,
        extracts all the disallowed links, and
        makes a call to the recursive crawl() function to start crawling.

##### Required parameters:
`url` : URL to crawl upon
        `depth` : max recursion depth to control the number of pages crawled

"""

disallowed_links = []

try:
# get the robots.txt file of the input URL
complete_url = urllib.parse.urljoin(url, '/robots.txt')
robots = requests.get(complete_url)
soup = BeautifulSoup(robots.text, 'lxml')

# extract the disallowed links
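            # robots.txt is plain text, so the lxml parser wraps it in a single <p> tag; find('p') returns the whole file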
our_robots = soup.find('p').text
disallowed_links = [link[10:] for link in re.findall("Disallow: /.*", our_robots)]

except requests.exceptions.ConnectionError as e:
print("ERROR Connecting to", complete_url, ":", e)
return

        # start crawling process
        self.crawl(url, depth, disallowed_links)

        # close the database connection once the whole crawl has finished
        self.client.close()


def crawl(self, url, depth, disallowed_links):

"""
        A recursive function to crawl a given URL. It extracts all the URLs available on the page and keeps following them until it reaches the maximum depth given as input.

##### Required parameters:
`url` : URL to crawl upon
        `depth` : max recursion depth to control the number of pages crawled
`disallowed_links` : links that are disallowed in robots.txt of the domain

"""

        # normalize the URL (urljoin with an empty string drops any fragment)
url = urllib.parse.urljoin(url, "")
print(f"Crawling {url} at depth {depth}")
title = ""
desc = ""
result = None

try:
# get the page data
result = requests.get(url)

try:
soup = BeautifulSoup(result.text, 'lxml')

try:
title = soup.find('title')
title = title.text
except:
title = ""

try:
desc_list = soup.find_all('p')
desc = " ".join(item.text.replace('\n', '') for item in desc_list)
except:
desc = ""

except Exception as e:
print("ERROR getting page details : ", e)

            # process and insert the page data into the database only if we did not encounter a 404 error
if result and result.status_code != 404:

query = {
'url':url,
'title': title,
'description': desc
}
print(query)

try:
self.db.query_data.insert_one(query)
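                    # create_index is a no-op when an index with the same spec already exists, so calling it on every insert is safe (if redundant)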
self.db.query_data.create_index(
[
('url', pymongo.TEXT),
('title', pymongo.TEXT),
('description', pymongo.TEXT)
],
name="query_data_index",
default_language="english"
)
except Exception as e:
print("ERROR inserting data : ", e)
exc_type, exc_val, tb_obj = sys.exc_info()
print(exc_type, "at", tb_obj.tb_lineno)

            # extract all links on the page and continue crawling, but only if the maximum allowed depth has not been reached
if depth != 0:
try:
links = [urllib.parse.urljoin(url, link.get('href')) for link in soup.find_all('a')]

for link in links:
if link not in disallowed_links:
self.crawl(link, depth-1, disallowed_links)

except Exception as e:
print("ERROR getting links : ", e)


except:
print("ERROR fetching", url)




# crl = Crawler()
# crl.start_crawl("https://www.wikipedia.org", 2)
161 changes: 161 additions & 0 deletions soumiksamanta/glugle/main.py
@@ -0,0 +1,161 @@
from flask import Flask, render_template, request, redirect, url_for
from flask_paginate import Pagination, get_page_args
from flask_admin import Admin
import pymongo
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from crawler import Crawler
import threading


app = Flask(__name__)
client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
db = client.glugledb
query_data = db.query_data


@app.route("/", methods=['GET', 'POST'])
def home():
if request.method == 'POST':
query = request.form['query']
return redirect(url_for("search", query=query))

return render_template("index.html")


# set optional bootswatch theme
app.config['FLASK_ADMIN_SWATCH'] = 'cerulean'
admin = Admin(app, template_mode='bootstrap4')
# Add administrative views here
@app.route("/admin", methods=['GET', 'POST'])
def admin():
submitted = False
if request.method == 'POST':
crawl_url = request.form['crawl_url']
print(crawl_url)
crawler = Crawler()
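        # run the crawl in a background daemon thread so the request can return immediately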
threading.Thread(target=crawler.start_crawl, args=(crawl_url, 2), daemon=True).start()
# crawler.start_crawl(crawl_url)
submitted = True

return render_template("admin/index.html", submitted=submitted)


def get_paginated_search_results(search_results, offset=0, per_page=10):
return search_results[offset: offset + per_page]


def get_query_keywords(query):
# lowercasing
query = query.lower()

# remove punctuations
translator = str.maketrans('', '', string.punctuation)
query = query.translate(translator)

# removing stopwords and tokenization
stop_words = set(stopwords.words("english"))
word_tokens = word_tokenize(query)
filtered_query = [word for word in word_tokens if word not in stop_words]

    # perform stemming on the filtered tokens, so the stopword removal above actually takes effect
    stemmer = PorterStemmer()
    stems = [stemmer.stem(word) for word in filtered_query]
    return stems


def query_database(query):
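    # relies on the text index created by the crawler; each match carries MongoDB's textScore relevance value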
return db.query_data.find(
{
'$text' : {
'$search' : query,
'$caseSensitive' : False,
}
},
{
'score': {
'$meta': "textScore"
}
}
).sort(
[
('score', {'$meta': 'textScore'}),
('_id', pymongo.DESCENDING)
]
)


def remove_duplicates(result_data):
search_results = []
for doc in result_data:
exist = False
for result in search_results:
if result['title'] == doc['title'] or result['url'] == doc['url']:
exist = True
break

        if not exist:
search_results.append(doc)
return search_results


def sort_rank(search_results, keywords):
    # boost MongoDB's textScore: +2 for a keyword match in the title, +1 for a match in the description
    for result in search_results:
        for word in keywords:
            if word in result['title']:
                result['score'] += 2
            if word in result['description']:
                result['score'] += 1
    return sorted(search_results, key=lambda result: result['score'], reverse=True)


@app.route("/search")
def search():

page, per_page, offset = get_page_args(page_parameter='page', per_page_parameter='per_page')

query = request.args.get('query', "")
search_results = []

# preprocess query
keywords = get_query_keywords(query)
processed_query = " ".join(keywords)

# search for data in database
search_results = query_database(processed_query)

# filter out duplicate search results
search_results = remove_duplicates(search_results)

# rank the retrieved search results
search_results = sort_rank(search_results, keywords)

    # total number of search results
total = len(search_results)

# paginate the search results
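    # note: the full ranked list is sliced in memory (see get_paginated_search_results) rather than paginated in the database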
pagination = Pagination(page=page,
total=total,
css_framework='bootstrap4',
per_page=per_page,
format_total=True,
format_number=True,
record_name='results',
alignment='center')

return render_template("search.html",
query=query,
search_results=get_paginated_search_results(search_results, offset, per_page),
total=total,
pagination=pagination
)

if __name__ == "__main__":
app.run(debug=True)
19 changes: 19 additions & 0 deletions soumiksamanta/glugle/sum.py
@@ -0,0 +1,19 @@
from flask import Flask, render_template, request, redirect, url_for

app = Flask(__name__)

@app.route("/", methods=['GET', 'POST'])
def addNumbers():
if request.method == 'POST':
a = request.form["a"]
b = request.form["b"]
return redirect(url_for("sum", a=a, b=b))

return render_template("sum_in.html")


@app.route("/sum")
def sum():
a = int(request.args.get('a', None))
b = int(request.args.get('b', None))
return render_template("sum_out.html", a=a, b=b, sum=(a+b))