Data mining is used extensively in many domains, but it operates on data held in structured databases. Information such as logical and emotional data, business records held in insurance claims, or genomics data, which can be unstructured or semi-structured, therefore cannot be extracted with data mining alone. The Jakarta Government faced the problem of managing the huge volume of complaints and opinions submitted through its e-participation tool, which involves residents in the development planning of the province. It proved difficult to classify these submissions and distribute them into proper groups that would help the government prioritize its actions. Similarly, the statistics agency of East Java faced the problem of heavy traffic congestion and accidents, caused by a large population travelling without knowledge of the route, which results in delays.
Finding unknown patterns in such large, heterogeneous natural-language databases requires text mining. Several text mining algorithms, such as Naïve Bayes, k-Means, Support Vector Machine (SVM), and Generalized Linear Models, have been developed to classify, regenerate, and help extract or access information from language data.
Approximately 400 million tweets are posted per day, and they carry emotions along with the text. The … paper uses a Naïve Bayes model for classification. It first collects the data and pre-processes it by removing URLs from the texts, converting them to lower case, and removing mentions and redundant words. Among the different models generated by the data modelling, the most accurate one is selected, and testing is done in three parts: using 10-fold cross-validation on the training data, based on the number of training data, and based on unique words and the training dataset. A wheel of feelings is used to generalize all the emotions into six categories.
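The pre-processing and classification steps described above can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the stop-word list, the two emotion labels, and the sample tweets are all hypothetical, and the classifier is a plain multinomial Naïve Bayes with add-one smoothing written from scratch.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list standing in for the paper's "redundant words".
STOP_WORDS = {"a", "an", "the", "is", "am", "are", "to", "of", "and",
              "i", "so", "this", "my"}


def preprocess(tweet):
    """Remove URLs and mentions, lower-case, and drop redundant words."""
    text = re.sub(r"https?://\S+", " ", tweet)   # strip URLs
    text = re.sub(r"@\w+", " ", text)            # strip @mentions
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled tweets, for illustration only.
train = [
    ("I am so happy today! http://example.com", "joy"),
    ("What a wonderful happy surprise @friend", "joy"),
    ("This is terrible, I feel sad", "sadness"),
    ("So sad and lonely tonight", "sadness"),
]
model = NaiveBayes().fit([preprocess(t) for t, _ in train],
                         [label for _, label in train])
prediction = model.predict(preprocess("@bob feeling happy http://t.co/x"))
```

Working in log space avoids numeric underflow on longer texts, and the add-one smoothing keeps unseen words from zeroing out a class.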
In …, the SVM algorithm is applied using Statistica 10 software. SVM supports both regression and classification, and it achieves accurate classification by maximizing the distance between the nearest positive and negative data points. The acquired data takes the form of proposals or complaints. Pre-processing is applied first: the text is segmented into sentences, converted entirely to lower or upper case, stripped of punctuation, reduced to stemmed forms, and synonyms are grouped. Three different matrices are generated, on which four classification models are tested for the highest accuracy.
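The pre-processing pipeline described above (sentence segmentation, case folding, punctuation removal, stemming, synonym grouping) might look like the sketch below. The suffix-stripping stemmer is a crude stand-in for a real stemming algorithm, and the synonym table and sample complaint are hypothetical; the source does not specify which stemmer or synonym groups were used.

```python
import re
import string

# Hypothetical synonym groups, keyed by stemmed form.
SYNONYMS = {"street": "road", "highway": "road", "fix": "repair", "mend": "repair"}
# Crude suffix list used as a stand-in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def split_sentences(text):
    """Segment text into sentences on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(sentence):
    """Lower-case, remove punctuation, stem, then map synonyms to one term."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    stems = [stem(w) for w in cleaned.split()]
    return [SYNONYMS.get(s, s) for s in stems]


complaint = "The streets are flooded. Please fix the broken lamps!"
processed = [normalize(s) for s in split_sentences(complaint)]
```

Synonyms are mapped after stemming here so that one table entry covers every inflected form of a word.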
The … paper focuses on obtaining real-time updates on road conditions using social media and text mining, to address people's lack of awareness of the condition of the roads they choose to travel. A group admin posts the road condition to a Facebook page, from which the data is collected through an RSS (Rich Site Summary) feed onto a server, where it is pre-processed and then classified into major groups. The data is made available to the user as markers on a map, with the location given as latitude and longitude. The map of East Java is synced with Google Maps to facilitate this.
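The flow from RSS feed to map markers could be sketched as below, assuming a simple keyword-based grouping and a lookup table of place names. The category keywords, the gazetteer with its approximate coordinates, and the sample feed are all hypothetical; the source does not describe the actual classifier or how locations are resolved.

```python
import xml.etree.ElementTree as ET

# Hypothetical keyword groups; the source does not name the actual categories.
CATEGORIES = {
    "accident": {"accident", "crash", "collision"},
    "congestion": {"jam", "congestion", "traffic"},
    "flood": {"flood", "flooded"},
}
# Hypothetical gazetteer with approximate (latitude, longitude) pairs.
GAZETTEER = {"surabaya": (-7.2575, 112.7521), "malang": (-7.9666, 112.6326)}


def classify(text):
    """Return the first category whose keywords appear in the text."""
    words = set(text.lower().replace(",", " ").replace(".", " ").split())
    for category, keywords in CATEGORIES.items():
        if words & keywords:
            return category
    return "other"


def locate(text):
    """Return coordinates of the first known place mentioned, else None."""
    lowered = text.lower()
    for place, coords in GAZETTEER.items():
        if place in lowered:
            return coords
    return None


def feed_to_markers(rss_xml):
    """Turn RSS items into map markers; items with no known place are skipped."""
    markers = []
    for item in ET.fromstring(rss_xml).iter("item"):
        text = item.findtext("title", default="")
        coords = locate(text)
        if coords is not None:
            markers.append({"category": classify(text),
                            "lat": coords[0], "lon": coords[1], "text": text})
    return markers


SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>Heavy traffic jam near Surabaya toll gate</title></item>
<item><title>Road flooded in Malang, avoid the area</title></item>
<item><title>Nothing to report</title></item>
</channel></rss>"""

markers = feed_to_markers(SAMPLE_FEED)
```

Each marker dictionary carries the fields a map layer would need. A real deployment would fetch the feed over HTTP and pass the markers to a mapping API, which is outside the scope of this sketch.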
The evaluation of results in … found that when testing was based on the number of unique words and the training dataset, user emotions were classified more accurately than with 10-fold cross-validation and training data. This suggested that a larger training dataset leads to better classification of the emotions in the text. More accurate models might be found by testing the application with several other algorithms. Moreover, steps such as removing duplicate feeds and texts with no emotion could be added to the pre-processing. Including hashtags and punctuation such as exclamation marks could also enhance the patterns deduced for understanding human behaviour.
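For reference, the 10-fold cross-validation mentioned above can be sketched in a few lines. This is a generic illustration, not the paper's evaluation code: the function names are hypothetical, the folds are contiguous and unshuffled, and the usage example runs a majority-class baseline on made-up labels rather than the Naïve Bayes model.

```python
from collections import Counter


def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(samples, labels, train_fn, predict_fn, k=10):
    """Return overall accuracy: every sample is tested exactly once."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(predict_fn(model, samples[i]) == labels[i]
                       for i in test_idx)
    return correct / len(samples)


# Usage with a majority-class baseline on hypothetical labels.
labels = ["joy"] * 20 + ["sadness"] * 5
samples = list(range(len(labels)))
accuracy = cross_validate(
    samples, labels,
    train_fn=lambda xs, ys: Counter(ys).most_common(1)[0][0],
    predict_fn=lambda model, x: model,
    k=5,
)
```

In practice the data would be shuffled before splitting, because contiguous folds on sorted data (as in this toy example) leave some folds with only one class.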
The solution provided in … helped the government take more focused decisions and understand the factors affecting the proposals. The most accurate model, the stemming and synonym recognition model, reached an accuracy of 91.37%. The accuracy could be increased using spatio-temporal and time series analysis, which would capture the patterns more efficiently, and could be further optimized by using more diverse classification features.
Mapping the route with alert markers in … should help people save time by choosing alternate routes. The accuracy of 92% could be increased by further classification when common words conflict, for example a word that may denote both the colour of a vehicle and the name of a place. Failing to differentiate these would produce wrong information, which wastes time and can also cause undesired congestion. In addition, when certain routes are seriously blocked, alternative routes should be suggested to divert traffic properly.