
Wednesday, April 3, 2019

Internet of Things Paradigm

Introduction

According to a 2016 statistical forecast, there are almost 4.77 billion mobile phone users globally, and the number is expected to pass five billion by 2019 [1]. The main reason for this significant increase is the growing popularity of smartphones. In 2012, about a quarter of all mobile users were smartphone users; this share is expected to double by 2018, which means there will be more than 2.6 billion smartphone users. More than a quarter of these smartphone users use Samsung or Apple smartphones.

As of 2016, there were 2.2 million apps in Google's app store and 2 million in Apple's. Such explosive growth of apps brings potential benefits to developers and to companies. The mobile industry market accounts for about $88.3 billion in revenue. Leading figures in the IT industry estimated that the IoT paradigm will generate $1.7 trillion in value added to the global economy in 2019. By 2020 the Internet of Things device market will be more than double the size of the smartphone, PC, tablet, connected car, and wearable markets combined.

Technologies and services belonging to the Internet of Things generated global revenues of $4.8 trillion in 2012 and will reach $8.9 trillion by 2020, growing at a compound annual growth rate (CAGR) of 7.9%.

With this impressive market growth, malicious attacks have also increased dramatically. According to a Kaspersky Security Network (KSN) data report, there have been more than 171,895,830 malicious attacks from online resources worldwide. In the second quarter of 2016, KSN detected 3,626,458 malicious installation packages, 1.7 times more than in the first quarter of 2016.
The types of these attacks are broad, such as RiskTool, AdWare, Trojan-SMS, Trojan-Dropper, Trojan, Trojan-Ransom, Trojan-Spy, Trojan-Banker, Trojan-Downloader, Backdoor, etc.

http://resources.infosecinstitute.com/internet-things-much-exposed-cyber-threats/#gref

Unfortunately, the rapid diffusion of the Internet of Things paradigm is not accompanied by a rapid improvement of efficient security solutions for those smart objects, while the criminal ecosystem is exploring the technology as a new attack vector.

Technological solutions belonging to the Internet of Things are forcefully entering our daily life. Think, for example, of wearable devices or the SmartTV. The greatest problem for the development of the paradigm is the low perception of cyber threats and of their possible impact on privacy.

Cybercriminals are aware of the difficulties the IT community faces in defining a shared strategy to mitigate cyber threats, and for this reason it is plausible that the number of cyber attacks against smart devices will rapidly increase.

As long as there is money to be made, criminals will continue to take advantage of opportunities to pick our pockets. While the battle with cybercriminals can seem daunting, it is a fight we can win. We only need to break one link in their chain to stop them dead in their tracks. Some tips for success: deploy patches quickly; eliminate unnecessary applications; run as a non-privileged user; increase employee awareness; recognize our weak points; reduce the threat surface.

Currently, the two major app store companies, Google and Apple, take different positions on spam app detection. One takes an active approach and the other a passive one. There is strong global demand for malware detection.

Background (Previous Study)

The paper "Early Detection of Spam Mobile Apps" was published by Dr. Suranga Seneviratne and his colleagues at the 2015 International World Wide Web Conference.
At this conference, he emphasized the importance of early malware detection and introduced a novel idea for detecting spam apps. Every market operates its own policies for deleting applications from its store, and this is done through continuous human intervention. The authors wanted to find reasons and patterns in the deleted apps and use them to identify spam apps.

The diagram simply illustrates how they approach early spam detection using manual labelling.

Data Preparation

A new dataset was prepared from a previous study [53]. The initial seed of 94,782 apps was curated from the list of apps obtained from more than 10,000 smartphone users. Over about 5 months, the researchers collected metadata from the Google Play Store (application name, application description, and application category) for all the apps, and discarded apps with non-English descriptions from the metadata.

Sampling and Labelling Process

One strategic step of their research was manual labelling, which was the first methodology proposed and which makes it possible to identify the reason behind each removal.

Manual labelling took about 1.5 months with 3 reviewers at NICTA. Each reviewer labelled apps using heuristic checkpoints, and the reason chosen by majority vote was recorded, as shown in Graph 3. They identified 9 key reasons with heuristic checkpoints. The full list of checkpoints can be found in their technical report (http://qurinet.ucdavis.edu/pubs/conf/www15.pdf). In this report, we list only the checkpoints for the reason "spam".

Graph 3. Labelled spam data with checkpoint reason.

Checkpoint S1: Does the app description describe the app function clearly and concisely?

In the previous study, 100 word bigrams and trigrams that describe app functionality were compiled manually. Spam apps are likely to lack a clear description.
Each description was therefore checked against these 100 bigrams and trigrams, and the number of occurrences was counted.

Checkpoint S2: Does the app description contain too much detail, incoherent text, or unrelated text?

Writing style analysis, known as stylometry, was used to map checkpoint S2. The study used 16 features, listed in Table 2.

Table 2. Features associated with Checkpoint S2

No.  Feature
1    Total number of characters in the description
2    Total number of words in the description
3    Total number of sentences in the description
4    Average word length
5    Average sentence length
6    Percentage of upper case characters
7    Percentage of punctuation
8    Percentage of numeric characters
9    Percentage of common English words
10   Percentage of personal pronouns
11   Percentage of emotional words
12   Percentage of misspelled words
13   Percentage of words with alphabetic and numeric characters
14   Automated readability index (AR)
15   Flesch readability score (FR)

For classification, greedy feature selection was applied with a decision tree classifier of maximum depth 10. The performance was optimized using the asymmetric F-measure [55].

They found that features 2, 3, 8, 9, and 10 were the most discriminative, and that spam apps tend to have less wordy descriptions than non-spam apps. About 30% of spam apps had descriptions of fewer than 100 words.

Checkpoint S3: Does the app description contain a noticeable repetition of words or key words?

They used vocabulary richness to identify spam apps.

Vocabulary Richness (VR) = number of unique words / total number of words

The researchers expected a low VR for spam apps because of keyword repetition. However, the result was the opposite. Surprisingly, apps with a VR close to 1 were likely to be spam, and none of the non-spam apps had a high VR.
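The vocabulary richness measure for checkpoint S3 can be illustrated with a minimal Python sketch. It assumes VR is the ratio of unique words to total words, which matches the observations above (repetition lowers it, terse text pushes it towards 1). The function name and the tokenisation rule are assumptions made for illustration, not the previous study's code.

```python
import re

def vocabulary_richness(description):
    """Return unique words / total words, or 0.0 for an empty description."""
    # Assumption: words are lower-cased runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", description.lower())
    if not words:
        return 0.0
    return len(set(words)) / len(words)

# Illustrative descriptions, not taken from the dataset.
terse = "Best free wallpaper"
repetitive = "game game fun game fun game"
print(vocabulary_richness(terse))       # every word is unique
print(vocabulary_richness(repetitive))  # 2 unique words out of 6
```

A three-word description with no repeated word already reaches the maximum VR, which is one way a terse spam description can score higher than a long, legitimate one.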
The high VR of spam apps might be due to the terse style of their descriptions.

Checkpoint S4: Does the app description contain unrelated keywords or references?

A common spamming technique is to add unrelated keywords so that the app appears in more search results; the topics of these keywords can vary significantly. A new strategy was proposed to handle this limitation: counting the mentions of popular application names in an app's description.

In the previous research, the names of the top-100 apps were used for counting the mentions.

Only 20% of spam apps mentioned popular apps more than once in their description, whereas 40% to 60% of non-spam apps did. They found that many top apps have social media interfaces and fan pages to keep in touch with users. Therefore, this can be one identifier for discriminating spam from non-spam apps.

Checkpoint S5: Does the app description contain excessive references to other applications from the same developer?

The feature is the number of times a developer's other app names appear. Only 10 spam apps were assigned to this checkpoint, because the descriptions contained links to the applications rather than the app names.

Checkpoint S6: Does the developer have multiple apps with approximately the same description?

For this checkpoint, 3 features were considered: the total number of other apps developed by the same developer; the total number of those apps with an English description, used to measure description similarity; and whether the description has a cosine similarity of over 60%, 70%, 80%, and 90% with other descriptions from the same developer.

Pre-processing was required to calculate the cosine similarity. First, the words are converted to lower case and punctuation symbols are removed. Then each document is represented as a word frequency vector. The cosine similarity of two such vectors A and B is cos(A, B) = (A · B) / (|A| |B|); see http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/

They observed that the most discriminative feature was the similarity between app descriptions. Only 10% to 15% of the non-spam apps had a description similarity of 60% with 5 other apps developed by the same developer. On the other hand, more than 27% of the spam apps did. This evidence indicates the tendency of spam apps to have multiple clones with similar app descriptions.

Checkpoint S7: Does the app identifier (appid) make sense and have some relevance to the functionality of the application, or does it appear to be auto-generated?

The application identifier (appid) is a unique identifier in the Google Play Store, named according to the Java package naming convention. For example, the appid for Facebook is com.facebook.katana.

For 10% of the spam apps the average word length in the appid was higher than 10, which was the case for only 2% to 3% of the non-spam apps. None of the non-spam apps had more than 20% non-letter bigrams in the appid, whereas 5% of spam apps had.

Training and Result

Of 1,500 randomly sampled apps, 551 (36.73%) were suspected to be spam.

Methods

Automation

We used checkpoints S1 and S2 for data management because they are comparable and had the highest level of agreement among reviewers. Because access to app descriptions was limited, only 100 samples were used for testing.

We automated checkpoints S1 and S2 according to the following algorithms. A log transformation was applied to the collected data. This can be valuable both for making patterns in the data more interpretable and for helping to meet the assumptions of inferential statistics.

The most time-consuming part of writing the code was collecting the descriptions, which took more than two weeks to find and store. The raw data gave the description link for each appID. However, many of the links could not be found because the version was old or the app was no longer available. We therefore searched for this information manually on the web and saved each description we found as a file named after its appID. (Diagram.)
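The log transformation mentioned in the Automation section can be sketched in Python as follows. This is an illustration under assumptions: the source does not say which logarithm or offset was used, so the sketch uses log(1 + x) so that a count of zero stays defined, and the sample values are made up for the example.

```python
import math

def log_transform(values):
    """Apply log(1 + x) to each non-negative value (the +1 offset is an assumption)."""
    return [math.log1p(v) for v in values]

# Illustrative word counts with a long right tail, not the study's data.
word_counts = [9, 84, 121, 692]
print(log_transform(word_counts))
```

The transformation keeps the ordering of the values but compresses the large ones, which reduces the skew that a few very long descriptions would otherwise introduce.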
Storing each description under its appID allowed the automation code to retrieve it more efficiently.

S1 was automated using the 100 word bigrams and word trigrams identified as describing the functionality of applications. Because a spam app is likely not to have these phrases in its description, we counted the number of occurrences in each application's description. The full list of these bigrams and trigrams is given in Table 1.

Table 1. Bigrams and trigrams from the descriptions of top apps

play games, are available, is the game, app for android, you can, get notified, to find, learn how, get your, is used to, your phone, to search, port to, core functionality, a simple, match your, is a smartphone, available for, app for, to play, key features, stay in touch, this app, is available, that allows, to enchant, take care of, you have to, you to, can you beat, buy your, is perfunctory, its easy, to use, try to, allows you, keeps you, action game, take advantage, tap the, take a picture, save your, makes it easy, follow what, is the unembellished, is a global, brings together, choose from, is a free, light upon more, play as, on the go, more information, learn more, turns on, is an app, face the challenges, game from, in your pocket, your device, on your phone, make your life, with android, it helps, delivers the, offers essential, is a tool, full of features, for android, lets you, is a simple, it gives, support for, need your help, enables your, game of, how to play, at your fingertips, to discover, brings you, to learn, this game, play with, it brings, navigation app, makes mobile, is a fun, your answer, drives you, strategy game, is an easy, game on, your mien, app which, on android, application which, train your, game which, helps you, make your

S2 had the second highest level of agreement among the three reviewers in the previous study. Among the 551 identified spam apps, 144 were confirmed by S2: 63 by agreement of all 3 reviewers and 81 by agreement of 2 reviewers.

We knew from the previous research that the total number of words in the description, the percentage of numeric characters, the percentage of non-alphabet characters, and the percentage of common English words give the most distinctive features.
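The S1 automation described above, counting how often the Table 1 phrases occur in a description, could be sketched in Python as follows. The function name, the tokenisation rule, and the five-phrase subset are assumptions for illustration; the study used the full list of 100 phrases.

```python
import re

# A small subset of Table 1, used here only for illustration.
FUNCTIONAL_NGRAMS = ["play games", "allows you", "lets you", "stay in touch", "this app"]

def count_functional_ngrams(description, ngrams=FUNCTIONAL_NGRAMS):
    """Count occurrences of the listed word bigrams and trigrams in a description."""
    # Assumption: words are lower-cased runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", description.lower())
    targets = {tuple(ngram.split()) for ngram in ngrams}
    count = 0
    for size in (2, 3):  # slide a 2-word window, then a 3-word window
        for i in range(len(words) - size + 1):
            if tuple(words[i:i + size]) in targets:
                count += 1
    return count

print(count_functional_ngrams("This app lets you play games and stay in touch."))
```

Sliding a window over the word list, rather than searching the raw string, keeps matches aligned to whole words and counts repeated phrases separately.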
We automated two of the S2 features, the total number of words in the description and the percentage of common English words, using C++.

Algorithm 1. Counting the total number of bi/tri-grams in the description

In the literature, 16 features were used to characterize checkpoint S2. The classification was done with a wrapper method using a decision tree classifier, and it was found that 30% of spam apps had fewer than 100 words in their description, compared with only 15% of the most popular apps. We extracted a simple but key point from this result: the number of words in the description and the percentage of common English words. This was developed in C++ as follows.

Algorithm 2. Counting the total number of words in the description

    #include <sstream>
    #include <string>

    // Count the whitespace-separated words in a description.
    int count_Words(const std::string& input_text) {
        std::istringstream stream(input_text);
        std::string word;
        int number_of_words = 0;
        while (stream >> word)
            number_of_words++;
        return number_of_words;
    }

The percentage of common English words has not been implemented properly because it was difficult to select a standard word list. However, here is the code that we will develop in a future study.

Algorithm 3. Calculating the percentage of common English words (CEW) in the description

    #include <fstream>
    #include <set>
    #include <sstream>
    #include <string>

    // Count the words of input_text that appear in the common-word list.
    // readFile is expected to hold one common English word per line.
    int count_CEW(const std::string& input_text, std::ifstream& readFile) {
        std::set<std::string> common_words;
        std::string CEW;
        while (std::getline(readFile, CEW))
            common_words.insert(CEW);

        std::istringstream stream(input_text);
        std::string word;
        int number_of_words = 0;
        while (stream >> word)
            if (common_words.count(word))
                number_of_words++;
        return number_of_words;
    }

    // Floating-point division avoids truncating c_words / words to 0.
    double percentage(int c_words, int words) {
        return words == 0 ? 0.0 : 100.0 * c_words / words;
    }

Normalization

The S1 and S2 variables each ranged between a minimum and a maximum value. Because the data were highly skewed, normalization was strongly required. Here normalization means rescaling each feature to a common range, not the relational-database notion of organizing data into tables.

Using Excel, we normalized the data as shown in the following diagram. Through normalization, the transformed data fall between 0 and 1. The range of 0 to 1 was important for the later LVQ step.

Diagram. Excel spreadsheet of automated data (left) and normalized data (right)

After the transformation we wanted to test the data to show how the LVQ algorithm works with the modified attributes. Therefore, we sampled only 100 records from the modified data set. Even though the result was not significant, it was important to test, because after this step we can add more attributes in a future study and adjust the calibration. We randomly sampled 50 entries each from the top-100 ranked apps and from the pre-identified spam data. The top-100 ranked apps were assumed to be very likely non-spam.

Diagram.

Initial Results

We used Python to perform Learning Vector Quantization (LVQ).

LVQ is a prototype-based supervised classification algorithm which belongs to the field of Artificial Neural Networks. It can be applied to multi-class classification problems, and its prototypes are adjusted during the training process.

The information processing objective of the algorithm is to prepare a set of codebook (or prototype) vectors in the domain of the observed input data samples and to use these vectors to classify unseen examples.

An initially random pool of vectors is prepared and then exposed to training samples. A winner-take-all strategy is employed, where one or more of the vectors most similar to a given input pattern are selected and adjusted to be closer to the input vector and, in some cases, further away from the winner for runners-up. The repetition of this process results in a distribution of codebook vectors in the input space which approximates the underlying distribution of samples from the test dataset.

Our experiments used only this constructed sample because of the data size. We performed 10-fold cross-validation on the data.
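The winner-take-all adjustment described above can be sketched in Python as a single update step. This is a minimal illustration of the standard LVQ update rule and not necessarily the exact code used in this project; the function name `update_codebooks`, the learning rate, and the sample rows are assumptions. Each row is assumed to hold the feature values followed by the class label.

```python
def euclidean_distance(row1, row2):
    # The last element of each row is the class label, so it is skipped.
    return sum((a - b) ** 2 for a, b in zip(row1[:-1], row2[:-1])) ** 0.5

def update_codebooks(codebooks, row, learning_rate):
    """Move the best matching codebook towards or away from one training row."""
    bmu = min(codebooks, key=lambda c: euclidean_distance(c, row))
    for i in range(len(row) - 1):
        error = row[i] - bmu[i]
        if bmu[-1] == row[-1]:
            bmu[i] += learning_rate * error  # same class: pull closer
        else:
            bmu[i] -= learning_rate * error  # different class: push away
    return bmu

# Illustrative codebooks: [words (normalized), n-gram flag, label].
codebooks = [[0.2, 0.0, "b"], [0.3, 1.0, "g"]]
update_codebooks(codebooks, [0.1, 0.0, "b"], learning_rate=0.5)
print(codebooks[0])
```

Repeating this step over all training rows for several epochs, usually with a decaying learning rate, is what gradually spreads the codebook vectors over the input space.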
Cross-validation gave an average accuracy of 56%, which was quite high compared to the previous study, considering that only two attributes were used to separate spam from non-spam.

The LVQ program was built in 3 steps: Euclidean Distance, Best Matching Unit, and Training Codebook Vectors.

1. Euclidean Distance

The distance between two rows of the dataset was required; each row is a point in the multi-dimensional feature space. The formula for the distance between two rows x and y is

    d(x, y) = sqrt( sum for i = 1..p of (x_i - y_i)^2 )

where the difference between the two rows is taken, squared, and summed over the p variables.

    from math import sqrt

    def euclidean_distance(row1, row2):
        distance = 0.0
        # The last column holds the class label, so it is excluded.
        for i in range(len(row1) - 1):
            distance += (row1[i] - row2[i]) ** 2
        return sqrt(distance)

2. Best Matching Unit

Once the Euclidean distance from a row to every codebook vector has been calculated, the codebook vectors are sorted by distance and the closest one is returned.

    def get_best_matching_unit(codebooks, test_row):
        distances = list()
        for codebook in codebooks:
            dist = euclidean_distance(codebook, test_row)
            distances.append((codebook, dist))
        # Sort by distance and return the closest codebook vector.
        distances.sort(key=lambda tup: tup[1])
        return distances[0][0]

3. Training Codebook Vectors

Initial codebook vectors were constructed from random feature values in the training dataset.

    from random import randrange

    def random_codebook(train):
        n_records = len(train)
        n_features = len(train[0])
        # Each feature is copied from a randomly chosen training record.
        codebook = [train[randrange(n_records)][i] for i in range(n_features)]
        return codebook

Future work

While writing this report, I found that data collection from the Google Play Store can be automated using a Java client. This will increase the size of the dataset and may improve accuracy while saving a great deal of time. Because of the small number of attributes and of randomly sampled records, it is not appropriate to call the result of this research significant. However, a basic framework was developed on which accuracy can be improved.

Acknowledgement

Last summer, I did some research reading work under the supervision of Associate Professor Julian Jang-Jaccard. I received really great support from Julian and INMS.
Thanks to the financial support I received from INMS, I could focus fully on my academic research, and I benefited a great deal from this amazing opportunity.

The following is a general report of my summer research.

At the beginning of summer, I studied the paper "A Detailed Analysis of the KDD CUP 99 Data Set" by M. Tavallaee et al. This gave me a basic idea of how to handle machine learning techniques.

Approach of KNN and LVQ

The main project followed the paper "Why My App Got Deleted: Detection of Spam Mobile Apps" by Suranga Seneviratne et al.

I have tried my best to keep this report simple yet technically correct. I hope I have succeeded in my attempt.

Appendix

Modified Data

Number of words (thousands)   Bigram/trigram   Identified as spam (b) / not (g)
0.084   0   b
0.18    0   b
0.121   0   b
0.009   1   b
0.241   0   b
0.452   0   b
0.105   1   b
0.198   0   b
0.692   1   b
0.258   1   b
0.256   1   b
0.225   0   b
0.052   0   b
0.052   0   b
0.021   0   b
0.188   1   b
0.188   1   b
0.092   1   b
0.098   0   b
0.188   1   b
0.161   1   b
0.107   0   b
0.375   0   b
0.195   0   b
0.112   0   b
0.11    1   g
0.149   1   g
0.368   1   g
0.22    1   g
0.121   1   g
0.163   1   g
0.072   1   g
0.098   1   g
0.312   1   g
0.282   1   g
0.229   1   g
0.256   1   g
0.298   0   g
0.092   0   g
0.189   0   g
0.134   1   g
0.157   1   g
0.253   1   g
0.12    1   g
0.34    1   g
0.57    1   g
0.34    1   g
0.346   1   g
0.126   1   g
0.241   1   g
0.162   1   g
0.084   0   g
0.159   0   g
0.253   1   g
0.231   1   g
