Abstract—Personalized web page recommendation is increasingly in demand. Web page recommendation systems are generally implemented on Web servers. They use data obtained implicitly, as a collection of the users' Web browsing patterns, to recommend web pages.

The existing system collects Web logs, generates clusters of similar users, and recommends pages to the user by analysing the logs online. However, the time complexity of this online analysis is high. To reduce it and to improve the correctness of the recommendations, we propose a Firefly based algorithm for recommending Web pages, combined with Naive Bayes clustering. The Web logs are clustered offline using the Naive Bayes clustering technique. A Firefly algorithm based similarity measure is then used to find the similarity between the active user's query and the other users in the cluster. The proposed approach uses probability based clustering, which eliminates outlier records while forming clusters.

The Firefly algorithm thoroughly searches the web logs present in the active user's cluster and recommends the top pages. Because the Firefly algorithm is time-efficient, it is used for the online processing. Once pages are obtained, they are ranked, and the top pages most relevant to the query are recommended.

The efficiency of the system can be evaluated using measures such as precision, recall score, Matthews correlation coefficient and fallout rate. The proposed approach is expected to improve time utilization in the online process and to recommend more accurate web pages.

I. Introduction

A Web page recommendation system is a sub-domain of recommendation systems that recommends a set of Web pages to users based on their past browsing patterns. It does so by applying mining techniques to data previously gathered from the users, which discover and extract information from Web documents and services. The major concern is to find reliable and efficient recommendation algorithms. A recommendation system typically produces its results in one of two ways: collaborative filtering or content based filtering.

A. Collaborative Filtering

Most recommendation systems make wide use of collaborative filtering for recommending items. This method relies on collecting and processing information on users' behaviours or activities and then predicting items of interest based on their similarity with other users. Collaborative filtering builds a model from a user's past behaviour and the decisions of other similar users. This model is used to predict the items the user may be interested in. Since collaborative filtering is independent of machine analysable content, it is capable of accurately recommending complex items without “understanding” the item itself.
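As an illustration of the idea above, the following is a minimal sketch of user-based collaborative filtering in Python. It assumes ratings are held in a nested dictionary and uses cosine similarity over co-rated items; the function names `cosine` and `predict` and the data layout are hypothetical choices made for this example, not part of the proposed system.

```python
import math

def cosine(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`.

    `ratings` maps user -> {item: rating} (assumed layout).
    Returns None when no other user has rated the item.
    """
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        s = cosine(ratings[user], their)
        num += s * their[item]
        den += abs(s)
    return num / den if den else None
```

Note that the prediction never inspects what the item is; it depends only on how similar users rated it, which is the property described above.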

B. Content Based Filtering

Content based filtering is a widely used approach for designing recommendation systems. This technique is based on a description of the item and a profile of the user's preferences. In a content based recommendation system, keywords are taken to represent the user's interests. It uses a set of distinct properties of an item to find and recommend items with the same properties. The two approaches are often combined as Hybrid Recommendation Systems. Content based algorithms try to recommend items by examining the items that the user liked in the past or likes at present. In general, the items of a candidate set are compared with items previously rated by the user, and the best matching items are recommended.

II. Literature survey

Recommendation systems play a vital role in recommending personalized items to users based on their interests in a web service. The web contains rich and dynamic information. The amount of information on the web is growing rapidly, as are the number of web sites and the number of web pages per site. Predicting the needs of a web user as she visits web sites has therefore gained importance. Many web page recommendation systems have been developed in the past; since they compute recommendations online, their time utilization should be efficient. A system [4] that uses a support vector machine (SVM) learning based model was developed for computing the similarity between two items, and it performed better than the latent factor approach for group recommendations. Since a matrix representation was followed, the data sparsity problem was solved. However, the system did not scale stably when the size of the group increased dynamically.

Hybrid recommender systems that combine two or more recommendation techniques have been designed [5]. They eliminate the weaknesses that exist when only one recommender system is used. There are several ways in which the systems can be combined, such as a weighted hybrid recommender, where the score of a recommended item is computed from the results of all the recommendation techniques present in the system. However, data sparseness was still a problem: the system may generate weak recommendations if few users have rated the same items, and it does not overcome the cold start problem.

Hyperspectral sensors can acquire hundreds of contiguous bands over a wide electromagnetic spectrum for each pixel. The rich spectral information allows materials with subtle spectral discrepancies to be distinguished, but it usually leads to the “curse of dimensionality”. To address this, an improved firefly algorithm based band selection method [8] was used.

The Firefly algorithm is an evolutionary optimization algorithm proposed by Yang [13]. After the parameters are initialized, the brightness is calculated with the objective function (2.1), where t is the maximum number of iterations, α is the step size and γ is the light absorbance of the m fireflies. The moment states are then evaluated and the bands are selected. To avoid employing an actual classifier within the band searching process, and so greatly reduce computational cost, criterion functions that can gauge class separability are preferred, and these provided better results. The Firefly algorithm also converged faster even when the size of the data was larger.

To improve the accuracy of the similarity measure, a nature inspired algorithm based on the behaviour of fireflies was introduced [10]. It considers separate effects for the ratings of users with similar opinions and with conflicting opinions. To generate the initial population of fireflies, half of the population is randomly generated and the other half is also randomly generated. Mean absolute error, obtained as the difference between the predicted rating and the real rating, was chosen as the objective function to measure recommendation accuracy.

An optimal similarity measure, obtained via a simple linear combination of values and ratios of ratings for user-based collaborative filtering, provides better results. It increased the speed of finding the nearest neighbours of the active user and reduced the computation time. The similarity function based on the Firefly algorithm was simpler than the equations used in traditional metrics; therefore, the proposed method provided recommendations faster than traditional metrics.

Graph colouring problems are generally discrete, and algorithms for discrete problems are quite complex.

A new algorithm that discretizes the firefly algorithm directly, based on similarity and without any other hybrid algorithm, was developed [11]. It was adaptable to dynamic graph sizes. A system for assigning an electronic document to one or more predefined categories or classes based on its textual context, using an agglomerative clustering algorithm, was developed [6]. This type of clustering, with the sample correlation coefficient as the similarity measure, allowed a high indexing term space reduction factor together with a gain in classification accuracy.

To minimize noise and outlier data, a modified DBSCALE algorithm using Naïve Bayes has been designed [7]. This algorithm is basically a prospect based utility. The function is used to estimate the outlier cluster data and to increase the correctness rate of the algorithm for a given threshold value. Since Naïve Bayes is a probability based function, it removes outlier cluster data and increases the correctness rate according to the threshold value. It also computes the maximum posterior hypothesis for outlier data.

The memory based collaborative system uses matrix based computation and solves the data sparsity problem, but the scalability of the system is not stable when the size of the group increases dynamically. A hybrid system could help in overcoming the scalability issue, but it again leads to the cold start problem.

To eliminate outliers as well as to overcome the other two problems, Naïve Bayes clustering, a probability based method, has been used in the past. The Firefly algorithm converges faster and searches all possible subsets with better time utilization. Thus, to design an efficient recommendation system, the Naïve Bayes method can be followed for offline clustering. Since the time complexity of the online stage should be low, the Firefly algorithm, which is more efficient in terms of time utilization, can be used for calculating similarity online. Combining these two techniques might increase the accuracy of the recommendation system and result in efficient time utilization.

III. Overview of the proposed work

Initially, the web log files are obtained from America Online Inc. [1]. The log files consist of five fields: an anonymous ID for each individual user, the query of each user along with the query time, the URL that the user proceeded to, and its rank in the result. These logs are collected and grouped by anonymous ID. The URLs visited by all the users are obtained, and their content is downloaded and processed. The processing includes removal of stop words from the URL's data and keyword extraction. Similar users are clustered based on the fetched keywords using the Naïve Bayes clustering technique, which provides more efficient clusters than clustering by association rules. The created clusters are passed to the online component. In the online process, when an active user gives a query, the keywords are extracted from the query. The similarity between the extracted keywords and the other users in the active user's cluster is calculated using the Firefly similarity measure. The similarity values are sorted along with the web pages browsed by similar users in the cluster. The top k web pages are recommended to the active user as the result.
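The stop word removal and keyword extraction applied to the page content and to the active user's query can be sketched as follows. This is only an illustrative sketch: the stop word list is a tiny assumed sample, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemming algorithm.

```python
import re

# Tiny illustrative stop word list (assumption); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}

def crude_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_keywords(text):
    """Lowercase the text, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

The same function can be applied offline to downloaded page text and online to the query string, so that both sides are compared in the same keyword space.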

IV. The proposed work

The proposed system follows a linear process: the web logs are collected and processed, similar users are clustered by the Naïve Bayes clustering technique, and recommendations are finally generated based on a similarity measure from the firefly algorithm.

A. Preprocessing of Web Logs

The web logs are collected from AOL Inc. [1]. They consist of 20 million web queries from 650 thousand real users over 3 months. The data set includes the anonymous ID, query, query time, item rank and click URL. The log file contains many users along with the web pages visited by them. It is validated and separated by anonymous ID, so that each user is placed in an individual file. The content of the URLs is fetched and downloaded. The keywords undergo stop word removal and stemming, and the final keywords are then extracted. Features such as Keywords, Timing, Frequency, Click URL and Revisit are fetched. The user profile is constructed from these features, which are taken from the user log files.

· Timing: The time that the user spent on a particular URL

· Frequency: The number of times the user visited the URL

· Clickstream: The number of clickstreams visited by the user

· Revisit: Whether the user revisited the web page

· Keywords: The keywords extracted from the URL

The keywords are generated from the data fetched from the URL. The timing for each URL is estimated from the given date and time by calculating the difference between consecutive URLs searched in a single day, subject to some time constraints. Frequency is calculated as the number of times the user clicked the URL. The clickstreams are the links clicked by the user for additional information. The timing of a revisit is calculated in order to decide whether the user strongly preferred the page or not. The information from the URL is thus collected and processed to obtain the features of the user.
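The grouping by anonymous ID and the derivation of the Frequency, Timing and Revisit features can be sketched as below. The sketch assumes tab-separated log lines in the field order given above (anonymous ID, query, query time, item rank, click URL) and a timestamp format of `YYYY-MM-DD HH:MM:SS`. The 30-minute cap `MAX_GAP_SECONDS` is an assumed value for the unspecified "time constraints", and `build_profiles` is a hypothetical name.

```python
from collections import defaultdict
from datetime import datetime

MAX_GAP_SECONDS = 30 * 60  # assumed time constraint; the source does not give a value

def build_profiles(lines):
    """Group log lines by anonymous ID and derive per-URL features."""
    clicks = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 5 or not parts[4]:
            continue  # skip malformed rows and queries without a click
        anon_id, _query, query_time, _rank, url = parts
        try:
            when = datetime.strptime(query_time, "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # e.g. a header row
        clicks[anon_id].append((when, url))

    profiles = {}
    for anon_id, events in clicks.items():
        events.sort()
        feats = defaultdict(lambda: {"frequency": 0, "timing": 0.0, "revisit": False})
        for idx, (when, url) in enumerate(events):
            f = feats[url]
            f["frequency"] += 1
            f["revisit"] = f["frequency"] > 1
            if idx + 1 < len(events):
                nxt = events[idx + 1][0]
                gap = (nxt - when).total_seconds()
                # Timing: gap to the next click on the same day, within the cap
                if nxt.date() == when.date() and gap <= MAX_GAP_SECONDS:
                    f["timing"] += gap
        profiles[anon_id] = dict(feats)
    return profiles
```

Estimating dwell time from the gap to the next click is only a proxy, since the log records when a URL was clicked and not when the user left it.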

B. Naïve Bayes Clustering

Clustering, also known as unsupervised classification, is a descriptive task with many applications. Clustering is the decomposition or partition of a data set into groups such that the objects in one group are similar to each other but as different as possible from the objects in other groups. The three main approaches to clustering data are partition based clustering, hierarchical clustering and probabilistic model based clustering. Probabilistic model based clustering is a soft clustering, where an object can belong to several clusters following a probability distribution. A clustering is useful if it produces some interesting insight into the problem being analysed.

Naïve Bayes clustering is a probabilistic clustering technique based on Bayes' theorem with a strong independence assumption between features. The feature variables can be discrete or continuous. This probabilistic clustering handles nominal and numeric variables in the data set, and its novelty lies in the use of mixture of truncated exponential (MTE) densities to model the numeric variables. In Naïve Bayes clustering the class is the only root variable, and all the attributes are conditionally independent given the class. The clustering problem reduces to taking a data set of instances and a previously specified number of clusters (k), and working out each cluster's distribution and the population distribution between the clusters. The expectation maximization (EM) algorithm is used to obtain these parameters.

Since Naïve Bayes clustering is a probability based technique, an item belongs to a cluster if and only if it has a relation to it. This helps in eliminating outlier data during clustering. It also provides proper clustering with fewer computations. The given dataset is divided into two parts, one for training and the other for testing. For each record in the test and train databases, the distribution of the class variable is computed. According to the obtained distribution, a value for the class variable is simulated and inserted in the corresponding cluster. The log-likelihood of the new model is computed. If it is higher than that of the initial model, the process is repeated. Otherwise, the process is stopped and the obtained clusters are returned.
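The EM loop with a log-likelihood stopping test can be illustrated with a simplified sketch. It differs from the description above in several labelled ways: it assumes binary keyword-presence features modelled as Bernoulli variables instead of MTE densities, it uses soft responsibilities instead of simulating a class value for each record, and it seeds the clusters deterministically from evenly spaced rows. The name `nb_cluster` and all parameters are hypothetical.

```python
import math

def nb_cluster(rows, k, max_iter=50, tol=1e-6):
    """Cluster binary feature rows with a Bernoulli naive Bayes mixture via EM."""
    n, d = len(rows), len(rows[0])
    prior = [1.0 / k] * k
    # Assumed seeding: evenly spaced rows, softened so probabilities avoid 0 and 1
    seeds = [rows[c * n // k] for c in range(k)]
    theta = [[0.75 if v else 0.25 for v in s] for s in seeds]
    best_ll, resp = -math.inf, []
    for _ in range(max_iter):
        # E-step: class distribution of every record under the current model
        new_resp, ll = [], 0.0
        for x in rows:
            logp = []
            for c in range(k):
                lp = math.log(prior[c])
                for j in range(d):
                    p = theta[c][j]
                    lp += math.log(p if x[j] else 1.0 - p)
                logp.append(lp)
            m = max(logp)
            norm = m + math.log(sum(math.exp(v - m) for v in logp))
            ll += norm
            new_resp.append([math.exp(v - norm) for v in logp])
        if ll - best_ll <= tol:
            break  # log-likelihood no longer improves: stop and keep the clusters
        best_ll, resp = ll, new_resp
        # M-step: re-estimate parameters with Laplace smoothing
        for c in range(k):
            weight = sum(r[c] for r in resp)
            prior[c] = (weight + 1.0) / (n + k)
            for j in range(d):
                on = sum(r[c] for r, x in zip(resp, rows) if x[j])
                theta[c][j] = (on + 1.0) / (weight + 2.0)
    labels = [max(range(k), key=lambda c: r[c]) for r in resp]
    return labels, best_ll
```

The conditional independence assumption appears in the inner loop, where the per-feature log-probabilities are simply summed given the cluster.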

C. Optimisation Using Firefly Algorithm

The Firefly algorithm is an evolutionary algorithm based on the behaviour of fireflies. Fireflies live in colonies and cooperate for the survival of the colony. To model the behaviour of fireflies, three assumptions are generally made: all fireflies are homogeneous; the attractiveness of each firefly is related to its level of brightness; and the brightness of a firefly is determined by an exponential objective function. Each firefly emits a kind of light by which it attracts other fireflies. The amount of light received depends on parameters such as the distance and the absorption coefficient of the surroundings. The longer the distance, the less light is received. Likewise, in surroundings with a high light absorption coefficient, such as foggy weather, the intensity of light decreases. In every case, each firefly, regardless of its gender, is attracted to and moves toward the brighter firefly.
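The dependence of attractiveness on distance and absorption described above is commonly written, in the standard formulation of the Firefly algorithm, as beta(r) = beta0 * exp(-gamma * r^2), with a dimmer firefly moving a fraction beta(r) of the way toward a brighter one plus a small random step. The sketch below illustrates that standard update; it is not taken from the proposed system, and the function names and default parameter values are assumptions.

```python
import math
import random

def attractiveness(beta0, gamma, distance):
    """Attractiveness seen at `distance`: beta0 * exp(-gamma * distance^2)."""
    return beta0 * math.exp(-gamma * distance ** 2)

def move_towards(xi, xj, beta0=1.0, gamma=1.0, alpha=0.0, rng=random):
    """Move the firefly at position xi toward the brighter firefly at xj.

    alpha scales a uniform random perturbation in [-0.5, 0.5) per coordinate;
    alpha=0 gives a purely deterministic move.
    """
    r = math.dist(xi, xj)
    beta = attractiveness(beta0, gamma, r)
    return [a + beta * (b - a) + alpha * (rng.random() - 0.5) for a, b in zip(xi, xj)]
```

A larger `gamma` models a more absorbing medium, so that distant fireflies exert almost no pull, which matches the foggy-weather remark above.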

Each firefly has a light intensity of its own. The key concept is that a firefly with low light intensity is always attracted to a firefly with high light intensity. This concept can be used for calculating similarity. A firefly based similarity measure yields unique and distinguishable results, which is a useful feature for ranking. The algorithm can deal with highly non-linear, multi-modal optimization problems naturally and efficiently. It does not use velocities, so it avoids the problems associated with velocity in particle swarm optimization (PSO). It converges quickly, with a high probability of finding the global optimum. It can be flexibly integrated with other optimization techniques to form hybrid tools. It does not require a good initial solution to start its iteration process.

Each web page visited by user i is considered a firefly. The number of users who visited a particular page is taken as the light intensity of the firefly. The objective function is formulated from frequency and duration. Frequency is calculated as the ratio of the number of visits to a page to the average number of visits over all pages.

The duration is the ratio of the duration of a page to the total duration of all the pages visited by the user. Thus, the objective function can be defined as in equation (5.1):

\[ \mathrm{Interest}(i) = \frac{2 \cdot \mathrm{Frequency}(i) \cdot \mathrm{Duration}(i)}{\mathrm{Frequency}(i) + \mathrm{Duration}(i)} \tag{5.1} \]
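Equation (5.1) is the harmonic mean of the normalized frequency and duration, and can be computed as in the following sketch. The input layout (a dictionary mapping each page to a visit count and a duration in seconds) and the name `interest_scores` are assumptions made for the example.

```python
def interest_scores(pages):
    """Compute Interest(i) for each page of one user.

    `pages` maps page -> (visit_count, duration_seconds) (assumed layout).
    """
    if not pages:
        return {}
    mean_visits = sum(v for v, _ in pages.values()) / len(pages)
    total_duration = sum(d for _, d in pages.values())
    scores = {}
    for page, (visits, duration) in pages.items():
        # Frequency: visits to this page relative to the average over all pages
        freq = visits / mean_visits if mean_visits else 0.0
        # Duration: share of the user's total viewing time spent on this page
        dur = duration / total_duration if total_duration else 0.0
        scores[page] = 2 * freq * dur / (freq + dur) if freq + dur else 0.0
    return scores
```

For example, with page a visited 3 times for 60 seconds and page b visited once for 20 seconds, Frequency(a) = 1.5 and Duration(a) = 0.75, giving Interest(a) = 1.0, while Interest(b) = 1/3.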

The interest of all users in the cluster is calculated. The pages to be recommended are then found by applying the page rank algorithm [2] to the obtained result. The output of the page rank algorithm is given to the user as the recommended web pages.

D. Ranking the Web Pages

The resulting set of web pages should be ranked in the order in which the user is likely to have higher interest. Thus, they are sorted based on the interest of the active user. The association rule checks the maximum possible combinations, which provides more accurate pages.

E. Recommendation Process

The URLs to be recommended are identified based on ranking and the similarity measure. The similarity measure is calculated among the users by comparing their similar interests. From the obtained set of pages, the page rank algorithm is used to rank the pages most relevant to the user. The resulting URLs are then recommended to the user, so the recommended web pages will be more relevant. The use of Naïve Bayes clustering eliminates the outliers, and the Firefly based similarity calculation checks all the subsets of the clusters.
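The final selection of the top k pages can be sketched as follows. Since the exact form of the Firefly based similarity measure is not given here, the sketch substitutes Jaccard overlap between keyword sets as a stand-in, and it ranks pages by a similarity-weighted sum of interest scores instead of the page rank algorithm. The names `jaccard` and `recommend` and the data layout are hypothetical.

```python
def jaccard(a, b):
    """Keyword-set overlap; a stand-in for the Firefly based similarity measure."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(active_keywords, cluster_users, k=3):
    """Return up to k pages ranked by similarity-weighted interest.

    `cluster_users` maps user -> (keywords, {page: interest}) for the users
    in the active user's cluster (assumed layout).
    """
    scores = {}
    for keywords, page_interest in cluster_users.values():
        sim = jaccard(active_keywords, keywords)
        for page, interest in page_interest.items():
            scores[page] = scores.get(page, 0.0) + sim * interest
    # Sort by descending score; break ties by page name for a stable order
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [page for page, score in ranked[:k] if score > 0]
```

Because only the users in the active user's cluster are scanned, the online cost depends on the cluster size and not on the size of the full log.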