To Catch a Fake: Curbing Deceptive Yelp Ratings and Venues

Mahmudur Rahman1, Bogdan Carbunar1*, Jaime Ballesteros2, and Duen Horng (Polo) Chau3†

1 The School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
2 Here, Nokia Inc., Chicago, IL, USA
3 School of Computational Science & Engineering, Georgia Tech, Atlanta, GA, USA

* Correspondence to: Bogdan Carbunar (carbunar@cs.fiu.edu)
† A preliminary version of this article has appeared in SIAM SDM 2014.
This research was supported in part by DoD grant W911NF-13-1-0142 and NSF grant EAGER-1450619.

Received 13 May 2014; revised 13 November 2014; accepted 3 March 2015
DOI:10.1002/sam.11264
Published online in Wiley Online Library (wileyonlinelibrary.com).
Abstract: The popularity and influence of reviews make sites like Yelp ideal targets for malicious behaviors. We present Marco, a novel system that exploits the unique combination of social, spatial and temporal signals gleaned from Yelp to detect venues whose ratings are impacted by fraudulent reviews. Marco increases the cost and complexity of attacks by imposing a tradeoff on fraudsters, between their ability to impact venue ratings and their ability to remain undetected. We contribute a new dataset to the community, which consists of both ground truth and gold standard data. We show that Marco significantly outperforms state-of-the-art approaches, achieving 94% accuracy in classifying reviews as fraudulent or genuine, and 95.8% accuracy in classifying venues as deceptive or legitimate. Marco successfully flagged 244 deceptive venues from our large dataset with 7,435 venues, 270,121 reviews and 195,417 users. Furthermore, we use Marco to evaluate the impact of Yelp events, organized for elite reviewers, on the hosting venues. We collect data from 149 Yelp elite events throughout the US. We show that two weeks after an event, twice as many hosting venues experience a significant rating boost rather than a negative impact. © 2015 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 0: 000–000, 2015

Keywords: fake review detection; deceptive venue detection; machine learning
1. INTRODUCTION

Online reviews are central to numerous aspects of people's daily online and physical activities. Which Thai restaurant has good food? Which mover is reliable? Which mechanic is trustworthy? People rely on online reviews to make decisions on purchases, services and opinions, among others. People assume these reviews are written by real patrons of venues and services, who are sharing their honest opinions about what they have experienced. But, is that really the case? Unfortunately, no. Reviews are sometimes fake, written by fraudsters who collude to write glowing reviews for what might otherwise be mediocre services or venues [1–4].

In this paper we focus on Yelp [5], a popular social networking and location based service that exploits crowdsourcing to collect a wealth of peer reviews concerning venues and services. Crowdsourcing has however exposed Yelp to significant malicious behaviors: up to 25% of Yelp reviews may be fraudulent [6].

While malicious behaviors may occasionally be performed by inexperienced fraudsters, they may also be professionally organized. For example, search engine optimization (SEO) companies tap into review writer markets [7–9] to offer review campaigns. Review campaigns act as "face lift" operations for business owners [10], manipulating venue ratings through multiple, coordinated artificial reviews.

For business owners, profit seems to be the main incentive to engage in such activities. Studies have shown that an extra half-star rating on Yelp causes a restaurant to sell out 19% more often [11], and a one-star increase leads to a 5–9% increase in revenue [12].

Furthermore, we study the impact of Yelp "elite" events on the ratings of hosting venues. Elite events are organized by Yelp for the benefit of "Elite," influential users who write popular reviews. Yelp attempts to prevent review "unfairness" by encouraging attendees to review the event instead of the venue. However, the ample warning offered to hosts, coupled with the inability of users to accurately follow directions, may be used by adversaries to transform Yelp events into review campaign tools.
Fig. 1 System overview of Marco. Marco relies on social, temporal and spatial signals gleaned from Yelp to extract novel features. The features are used by the venue classifier module to label venues (deceptive vs. legitimate) based on the collected data. Section 5 describes Marco in detail.

We propose Marco (MAlicious Review Campaign Observer), a novel system that leverages the wealth of spatial, temporal and social information provided by Yelp to detect venues that are targets of deceptive behaviors. Marco (see Fig. 1) exploits fundamental fraudster limitations (see Section 5.1) to identify venues with (i) abnormal review spikes, (ii) series of dissenting reviews, and (iii) impactful but suspicious reviews. Marco detects both venues that receive large numbers of fraudulent reviews, and venues that have insufficient genuine reviews to neutralize the effects of even small scale campaigns. Our major contributions include:

• We introduce a lower bound on the number of reviews required to launch a review campaign that impacts a target venue's rating, and prove that this bound renders such campaigns detectable. Our theoretical results force fraudsters to compromise between the impact and undetectability of their review campaigns (Section 5).

• We present Marco, a system that leverages novel social, spatial and temporal features gleaned from Yelp to detect venues whose ratings are impacted by fraudulent reviews.

• We contribute a novel dataset of reviews and venues, which consists of both ground truth (i.e., objectively correct) and gold standard instances (i.e., selected based on best available strategies), and a large collection of 7,435 venues, 270,121 reviews and 195,417 reviewer profiles (Section 4).

• We demonstrate that Marco is effective and fast; its classification accuracy is up to 94% for reviews and 95.8% for venues. It flags 244 of the 7,435 venues analyzed as deceptive; manual inspection revealed that they were indeed suspicious (Section 6).

• We collect data from 149 Yelp Elite events throughout the US and use it to study the short and long term impact of Yelp events on the rating of the hosting venues (Section 6.4).

Marco aims to complement legal actions against profitable, fraudulent review activities [10]. Organizations caught red-handed in setting up review campaigns have been shown to pay $1–$10 per fraudulent review. By making the cost of purchasing reviews approach the cost of products and services provided by hiring venues, Marco has the potential to act as an economic counter-incentive for rational venue owners.
2. RELATED WORK

2.1. Research in Detecting Fraudulent Reviews

Jindal and Liu [2] introduce the problem of detecting opinion spam for Amazon reviews. They proposed solutions for detecting spam, duplicate or plagiarized reviews, and outlier reviews. Jindal et al. [3] identify unusual, suspicious review patterns. In order to detect "review spam", Lim et al. [4] propose techniques that determine a user's deviation from the behavior of other users reviewing similar products. Mukherjee et al. [13] focus on fake reviewer groups; similar organized fraudulent activities were also found on online auction sites, such as eBay [14]. Mukherjee et al. [15] leverage the different behavioral distributions of review spammers to learn the population distributions of spammer and non-spammer clusters. Li et al. [16] exploit the "reviews of reviews" concept of Epinions to collect a review spam corpus, then propose a two view, semisupervised method to classify reviews.

Ott et al. [17] integrate work from psychology and computational linguistics to develop and compare several text-based techniques for detecting deceptive TripAdvisor reviews. To address the lack of ground truth, they crowdsourced the job of writing fraudulent reviews for existing venues.

Unlike previous research, we focus on the problem of detecting impactful review campaigns. Our approach takes advantage of the unique combination of social, spatial and temporal dimensions of Yelp. Furthermore, we do not break Yelp's terms of service to collect ground truth data. Instead, we take advantage of unique Yelp features (i.e., spelp sites, consumer alerts) to collect a combination of ground truth and gold standard review and venue datasets.

Feng et al. [18] seek to address the lack of ground truth data for detecting deceptive Yelp venues: they introduce three venue features and use them to collect gold standard sets of deceptive and legitimate venues. They show that an SVM classifier is able to classify these venues with an accuracy of up to 75%. In Section 6 we confirm their results on our datasets. We show that with an accuracy of 95.8%, Marco significantly outperforms the best strategy of Feng et al. [18].

Li et al. [19] and Ntoulas et al. [20] rely on the review content to detect review spam. Li et al. [19] exploit machine learning methods in their product review mining system. Ntoulas et al. [20] propose several heuristic methods for detecting content based spam and combine the most effective ones to improve results. Our work differs through its emphasis on the relationships among reviewers, friends and ratings in the context of Yelp's spatial and temporal dimensions.

Gao et al. [21] target asynchronous wall messages to detect and characterize spam campaigns. They model each wall post as a pair of text description and URL and apply semantic similarity metrics to identify large subgraphs representing potential social spam campaigns, and later incorporate threshold based techniques for spam detection. Instead, we focus on the temporal and geosocial review context, where reviewer activity and behavioral patterns are of significant importance.

Wang et al. [22] introduce the concept of heterogeneous review graphs and iterative methods exploring relationships among reviewers, reviews and stores to detect spammers. While we also consider social relations among reviewers, we differ in our focus on the temporal and spatial dimensions.

2.2. Research in Sybil Detection

Sybil accounts can be used to launch review campaigns, by enabling a single adversary to write multiple reviews for the same venue, each from a different account. Yelp identifies venues that receive multiple reviews from the same IP address (but different user accounts). Tools such as proxies [23] and anonymizers (e.g., Tor [24]) can however be used to avoid detection.

SybilInfer [25] detects Sybil nodes in social networks by using Bayesian inference and knowledge of the social network graph. Sybil tolerant solutions like DSybil exploit the heavy-tail distribution of the typical behavior of honest users and rely on user weights to identify whether the system needs more opinions or not. Similarly, SumUp [26] uses "adaptive vote flow aggregation" to limit the number of fake feedback provided by an adversary to the number of attack edges in the trust network, that is, the number of bi-directional trust edges the attacker is able to establish to other users. Molavi et al. [27] propose to associate weights with ratings and introduce the concept of "relative ratings" to defend against bought ratings and ratings from Sybil accounts. When given access to the perspective of the social network provider, Wang et al. [28] proposed an approach that detects Sybil accounts based on their click stream behaviors (traces of click-through events in a browsing session).

Our work aims to complement Sybil detection techniques. Reviews written from accounts detected to be Sybils may be classified as fraudulent. The number (or percentage) of reviews of a venue written from Sybil accounts can be used as a feature to detect "deceptive" venues. Conversely, user accounts with high numbers of posted fraudulent reviews may be used as candidates for further Sybil detection screenings.
Color Fig. 2 YCrawl system architecture. YCrawl relies on a pool of servers and proxies to issue requests. The scheduler relies on a request
10 queue to ensure there are no loops in the crawling process. 10
11 11
12 12
3. SYSTEM MODEL

3.1. Yelp's Review System

For this work, we focus on Yelp [5], a review centric geosocial network that hosts information concerning users and venues. Subscribed users ("yelpers") have accounts and can write reviews, befriend other subscribers, report locations and search for venues of interest. We use the term "venue" to represent a business or event with an associated location (e.g., restaurants, shops, offices, concerts).

Reviews have a star rating, an integer ranging from 1 to 5, with 5 being the highest mark. In Yelp, an average rating value is computed for each venue (rounded to the nearest half star), over the ratings of all the posted reviews. For a review R, let R.ρ denote its rating and R.τ denote the time when the review was posted. We say a review is "positive" if its rating is at least 4 stars and "negative" if its rating is 2 stars or fewer. In our analysis we do not consider 3 star reviews: their impact on the rating of the venue is likely to be small, and Yelp denotes a 3 star rating as "A-OK".

3.2. Influential and Elite Yelpers

Users can rate the reviews of others by clicking on associated buttons (e.g., "useful", "funny" or "cool"). They can upload photos taken at venues reviewed and perform "check-ins" to formally record their real-time presence at the venue. Yelp rewards "influential" reviewers (often peer-recommended) with a special, yearly "Elite" badge.

3.3. Fraudulent Reviews and Deceptive Venues

A review is fraudulent if it describes a fictitious experience. Otherwise, the review is genuine. We say a venue is deceptive if it has received a sufficient number of fraudulent reviews to impact its average rating by at least half a star. Otherwise, the venue is legitimate.

Yelp relies on proprietary algorithms to filter reviews it considers fraudulent. See ref. [29] for an attempt to reverse engineer Yelp's filter. Furthermore, Yelp has launched a "Consumer Alert" process, posting "alert badges" on the pages of venues (i) whose owners were caught red-handed buying fraudulent reviews, offering rewards or discounts for reviews, or (ii) that have a large number of reviews submitted from the same IP address. The consumer alert badge is displayed for 90 days.

3.4. Yelp Events

Yelp organizes special Elite events, at select venues, where only Elite badge holders are invited. For each event, Yelp creates a separate Yelp page, containing the name of the event and the name, address and information for the hosting venue. Attendees are encouraged to review the event account, which then lists the reviews, just like a regular venue.

4. COLLECTED YELP DATA

In this section we describe the Yelp datasets we collected using the YCrawl crawler that we developed. Our data consists of: (i) 90 deceptive and 100 legitimate venues; (ii) 426 fraudulent and 410 genuine reviews; and (iii) a large collection of 7,435 venues and their 270,121 reviews from 195,417 reviewers, from San Francisco, New York City, and Miami.

4.1. YCrawl

We have developed YCrawl, a crawling engine for automatically collecting data from Yelp user and venue pages. YCrawl consists of 1820 lines of Python code. It fetches the raw HTML pages of target Yelp user and venue accounts. Fig. 2 illustrates the system design of YCrawl.

Yelp keeps track of requests made from a single IP address and suppresses any IP making an exorbitant number of requests within a short time window (such IP addresses are blocked by Yelp's servers for a few weeks, or sometimes indefinitely). To overcome this limitation, YCrawl uses a pool of servers and IP proxies: for every request, YCrawl randomly picks a server and proxy. If the request is not successful, a new request is made using a different proxy. A centralized scheduler maintains a request queue to ensure there are no loops in the crawling process.
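To make the proxy-rotation and queue-based scheduling idea concrete, the following is a minimal sketch in the spirit of YCrawl, written with the standard requests library. The proxy list, retry limit and helper names are our assumptions (YCrawl's actual code is not public); the sketch only illustrates picking a random proxy per request, retrying failures on a different proxy, and using a request queue plus a visited set so the crawl contains no loops.

```python
import random
from collections import deque

import requests

PROXIES = ["http://203.0.113.10:3128", "http://203.0.113.11:3128"]  # placeholder proxy pool


def fetch(url, max_attempts=3):
    """Fetch a page, retrying with a different randomly chosen proxy on failure."""
    proxies = PROXIES[:]
    random.shuffle(proxies)
    for proxy in proxies[:max_attempts]:
        try:
            resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
            if resp.status_code == 200:
                return resp.text
        except requests.RequestException:
            continue  # try the next proxy
    return None


def crawl(seed_urls, extract_links):
    """Breadth-first crawl driven by a request queue; the visited set prevents loops."""
    queue, visited, pages = deque(seed_urls), set(seed_urls), {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(html):  # caller-supplied link extractor
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return pages
```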
At the time when we performed this data collection, Yelp's filtered reviews could only be accessed by solving a CAPTCHA. In order to collect filtered reviews, we used DeathByCaptcha [30] to programmatically retrieve the CAPTCHA protected reviews filtered by Yelp.

We used YCrawl to collect a seed dataset of random venue and user accounts, using a breadth-first crawling strategy and stratified sampling [31]. First, we selected a list of 10 major cities (e.g., New York, San Francisco, Los Angeles, Chicago, Seattle, Miami) in the United States and collected an initial random list of 100 venues from each of these cities as a seed dataset. We note that the strata venues are mutually exclusive, i.e., venues do not belong to two or more different cities. We then randomly selected 10,031 Yelp users who reviewed these venues, and collected their entire Yelp data (the html pages), including all their reviews, for a total of 646,017 reviews. This process enabled us to avoid bias toward high degree nodes (users with many friends, venues with many reviews), which is a common problem when crawling social networks [32]. We then randomly selected a list of 16,199 venues, reviewed by the previously collected 10,031 Yelp users, and collected the html pages of the selected venues, including all their reviews.

4.2. The Data

We use the term "ground truth" set to denote data objectively known to be correct. We use the term "gold standard" to denote data selected according to the best available strategies. We collect such data following several stringent requirements, often validated by multiple third parties.

Ground truth deceptive venues. We relied on Yelp's "Consumer Alert" feature to identify deceptive venues. We have used Yelp and Google to identify a snapshot of all the 90 venues that received consumer alerts during July and August, 2013.

Gold standard legitimate venues. We have used the collected list of 16,199 venues previously described to first select a preliminary list of venues with well known consistent quality, e.g., the "Ritz-Carlton" hotel. We have then manually verified each review of each venue, including their filtered reviews. We have selected only venues with at most one tenth of their reviews filtered by Yelp and whose filtered reviews include a balanced amount of positive and negative ratings. While Yelp tends to filter reviews received from users with few friends and reviews, Feng et al. [18] showed that this strategy is not accurate. In total, we selected 100 legitimate venues.

In addition to collecting the html pages of all the reviews of the selected deceptive and legitimate venues, we have also collected the html pages of all the users who wrote reviews for them, and the html pages of all the reviews written by these reviewers. This data enables us to extract the features that we introduce in the following sections. For the 90 deceptive venues we have collected their 10,063 reviews written by 7,258 reviewers. We have collected all the reviews (311,994 in total) written by the 7,258 reviewers of the 90 deceptive venues. In addition, we have collected the 9,765 reviews, written by 7,161 reviewers, of the 100 legitimate venues. We have then collected all the reviews written by these 7,161 reviewers, for a total of 530,408 reviews. Thus, for these 190 venues, we have collected more than 840,000 reviews. Note how the 90 deceptive venues have received more reviews than the 100 legitimate venues. However, the total number of reviews written by reviewers of legitimate venues significantly exceeds those written by the reviewers of deceptive venues.

Gold standard fraudulent reviews. We have used spelp (Spam + Yelp) sites (e.g., refs. [33,34]), forums where members, often "Elite" yelpers with ground truth knowledge, reveal and initiate the discussion on fraudulent Yelp reviews. While in theory such sites are ideal targets for fraudulent behavior, the high investment imposed on fraudsters, coupled with the low visibility of such sites, makes them unappealing options. Nevertheless, we have identified spelp reviews that (i) were discussed by and agreed upon by multiple other Yelp users, (ii) were written from accounts with no user photo or with web plagiarized photos (identified through Google's image search), and (iii) were short (less than 50 words). From this preliminary set, we have manually selected 426 generic reviews that provide no venue specific information [35]. Specifically, each "spelp" review we collected was posted by a Yelp user, and discussed and agreed upon by multiple other Yelp users.

Gold standard genuine reviews. Given the seed user and venue datasets previously described, we have extracted a list of 410 genuine reviews satisfying a stringent test that consists of multiple checkpoints. In a first check we used Google (text and image search) to eliminate reviews with plagiarized text and reviewer account photos. In a second check we discarded short (less than 50 words), generic reviews, lacking references to the venue. Third, we gave preference to reviews written by users who

• Reached the "Elite" member status at least once.

• Participated in forums, e.g., Yelp Talk.
• Garnered positive feedback on their reviews.

• Provided well thought out personal information on their profile.

We have collected the 54,213 reviews written by the writers of the 410 genuine reviews. We have also collected the 1,998 reviews written by the writers of the 426 fraudulent reviews.

Large Yelp data set. We have used YCrawl to collect the data of 7,435 car repair shops, beauty & spa centers and moving companies from San Francisco, New York City and Miami. The collection process took 3 weeks. Of the 7,435 venues, 1,928 had no reviews posted. We have collected all their 270,121 reviews and the data of their 195,417 reviewers (one user can review more than one of these venues). Table 8 shows the number of venues collected for each venue type and city. Yelp limits the results for a search to the first 1000 matching venues. Entries with values less than 1000 correspond to cities with fewer than 1000 venues of the corresponding type.

Yelp event collection. We have collected Yelp events from 60 major cities covering 44 states of the United States. The remaining states (WY, VT, SD, NE, WV, ND) had no significant Yelp events or activities. After identifying an Elite event, we identified the hosting venue through either its name or address. We used YCrawl to collect a majority of the available Yelp events and hosting venues, for a total of 149 pairs.

For each Yelp event and corresponding venue, we have collected their name, number of reviews, star rating and all their reviews. For each review, we have collected the date when it was written, the rating given and the available information about the reviewer, including the Elite status, number of friends and number of reviews written. In total, we have collected 24,054 event/hosting venue reviews. While we are unable to make these datasets public, due to possible legal action from Yelp, we encourage researchers to contact us with questions concerning this data.

5. MARCO: PROPOSED METHODS

We present Marco, a system for the automatic detection of fraudulent reviews, deceptive venues and impactful review campaigns. We begin with a description of the adversary and his capabilities.

5.1. Adversarial Model

We model the attacker following the corrupt SEO (Search Engine Optimization) model mentioned in the introduction. The attacker A receives a contract concerning a target venue V. A receives a finite budget, and needs to "adjust" the rating of V, i.e., either increase or decrease it by at least half a star.

We assume A controls a set of unique (IP address, Yelp Sybil account) pairs and has access to a market of review writers. Sybil accounts [36] are different Yelp identities controlled by A. A uses these resources to launch a "review campaign" to bias the rating of V: post one review from each controlled (IP address, Yelp Sybil account) pair and/or hire (remote) review writers, with valid Yelp accounts, to do it.

The number of reviews A can post is limited by the number of unique (IP address, Yelp Sybil account) pairs it controls, as well as by the budget received in the contract (minus A's fee) divided by the average cost of hiring a review writer.

5.2. Overview of Marco

Marco, whose functionality is illustrated in Fig. 1, consists of three primary modules. The Review Spike Detection (RSD) module relies on temporal, inter-review relations to identify venues receiving suspiciously high numbers of positive (or negative) reviews. The Aggregate Rating Disparity (ARD) module uses relations between review ratings and the aggregate rating of their venue, at the time of their posting, to identify venues that exhibit a "bipolar" review behavior. The Fraudulent Review Impact (FRI) module first classifies reviews as fraudulent or genuine based on their social, spatial and temporal features. It then identifies venues whose aggregate rating is significantly impacted by reviews classified as fraudulent. Each module produces several features that feed into a venue classifier, trained on the datasets of Section 4.2. Table 1 shows the notations used by Marco.

Table 1. Table of notations.
Notation       Definition
A              Adversary
V              Target venue
HV, T          V's timeline and active interval
ρV(T)          Rating of V at time T
δr             Desired rating increase by A
δt             Review campaign duration
q              Number of fraudulent reviews by A
R, R.ρ, R.τ    Review, its rating and its posting time
n              Number of genuine reviews of V
σ              Sum of ratings of all genuine reviews
p              Number of genuine positive reviews

The approach used in Marco leverages manually labeled data, including fraudulent and genuine reviews, as well as deceptive and legitimate venues, to classify reviews and venues. Marco does not require knowledge of all the data and can classify new data in an online manner.
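The following is a minimal sketch of how the module outputs could feed the venue classifier, in the spirit of Fig. 1. The dict keys, function names and Random Forest parameters are our assumptions, not Marco's actual code (the paper's classification step uses the statistical tool R, Section 6); the sketch only illustrates training once on labeled venues and then labeling new venues one at a time, in an online manner.

```python
from sklearn.ensemble import RandomForestClassifier

# A venue is assumed to be a dict holding the precomputed module outputs
# (Sections 5.3-5.5) plus the venue-level attributes of Table 3; the key
# names below are ours.
FEATURES = ["SC", "SAmp", "ARD", "FRI", "CF",
            "rating", "n_reviews", "checkin_reviews", "photo_reviews", "age_months"]

def venue_vector(venue):
    """Order a venue's features consistently for the classifier."""
    return [venue[name] for name in FEATURES]

def train_venue_classifier(labeled_venues):
    """labeled_venues: list of (venue_dict, label) pairs, label 1 = deceptive."""
    X = [venue_vector(v) for v, _ in labeled_venues]
    y = [label for _, label in labeled_venues]
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, y)
    return clf

# New venues can then be labeled without retraining:
# clf = train_venue_classifier(training_set)
# clf.predict([venue_vector(new_venue)])
```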
A drawback of this approach stems from the difficulty of acquiring ground truth and gold standard data. While it is also difficult to identify relevant features that are hard to bypass by adversaries, we note that Marco introduces a trade-off for attackers, between impact and detectability.

An alternative approach is to use unsupervised outlier detection solutions [37–41]. While such solutions do not require labeled data, they require knowledge of the entire dataset. This approach is thus suitable for the providers (i.e., Yelp). We note however that an adversary with sufficient knowledge of the data can attempt to bypass this approach, by determining and introducing fraudulent data that would not be classified as an outlier.

5.3. Review Spike Detection (RSD) Module

A review campaign needs to adjust (e.g., increase) the rating of its target venue, by posting (fraudulent) reviews that compensate the negative ratings of other reviews. The RSD module detects this behavior by identifying venues that receive higher numbers of positive (or negative) reviews than normal.

In the following, our first goal is to prove that review campaigns that impact the ratings of their target venues are detectable. For this, let q denote the total number of fraudulent reviews that A posts for the target venue V. We focus on the typical scenario where an attacker attempts to increase the rating of V (ballot stuffing). Attempts to reduce the rating of V (bad mouthing) are similar and omitted here for brevity.

A can follow any strategy, including (i) greedy, by posting all q reviews in a short time interval, and (ii) uniform, by spreading reviews over a longer time interval. While a greedy strategy is likely to quickly impact the venue, a uniform strategy seems more likely to pass unnoticed. However, we show in the following that, if the review campaign is successful, it becomes detectable.

Let Ts and Te denote the start and end times of the campaign, the times when the first and last fraudulent reviews initiated by A are posted. δt = Te − Ts is the campaign duration interval. Let n denote the number of genuine reviews V has at the completion of the campaign (time Te). We assume V receives fraudulent reviews only from A. We prove the following lower bound on the number of reviews that A needs to write in order to impact the rating of V.

Claim 1 The minimum number of reviews A needs to post in order to (fraudulently) increase the rating of V by half a star is q = n/7.

Proof: Let R1, R2, ..., Rn denote the n genuine reviews of V. Let σ = Σ_{i=1..n} Ri.ρ. According to Yelp semantics, Ri.ρ ∈ [1, 5], thus σ ∈ [n, 5n]. The "genuine" rating of V is ρVg = σ/n. In order to minimize q, A has to write only 5 star reviews. Let δr be the increase in the rating of V generated by A's review campaign. Note that δr ∈ [0.5, 4). Furthermore, σ/n + δr ≤ 5, as the final rating of V cannot exceed 5. Hence,

(σ + 5q)/(n + q) = σ/n + δr.

Thus, q = n²δr/(5n − σ − nδr). Given that σ ≥ n, we have q ≥ nδr/(4 − δr). When δr = 1/2, this results in q ≥ n/7. For δr = 1, q ≥ n/3; when δr = 2, q ≥ n; etc.
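To make the bound concrete, the following minimal sketch evaluates q = n²δr/(5n − σ − nδr) for a hypothetical venue; the numbers are illustrative only and follow directly from the formula above.

```python
import math

def min_fake_reviews(n, sigma, delta_r):
    """Minimum number of 5-star fraudulent reviews needed to raise the average
    rating of a venue with n genuine reviews (rating sum sigma) by delta_r stars,
    per Claim 1: q = n^2 * delta_r / (5n - sigma - n*delta_r)."""
    assert n <= sigma <= 5 * n and sigma / n + delta_r <= 5
    return math.ceil(n ** 2 * delta_r / (5 * n - sigma - n * delta_r))

# Illustrative venue: 70 genuine reviews averaging 3.5 stars (sigma = 245).
# A half-star boost needs 35 fake reviews; the attacker's best case
# (all genuine reviews are 1 star, sigma = n) still needs n/7 = 10.
print(min_fake_reviews(70, 245, 0.5))  # -> 35
print(min_fake_reviews(70, 70, 0.5))   # -> 10
```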
We say a review campaign is successful if it increases the rating of the target venue by at least half a star (δr ≥ 1/2). We introduce now the notion of venue timeline:

DEFINITION 1: The timeline of a venue V is the set of tuples HV = {(Ui, Ri) | i = 1...n}, the list of reviews Ri received by V from users Ui, chronologically sorted by the review post time, Ri.τ. Let T = Tc − T1 denote the active interval of the venue, where Tc denotes the current time and T1 = R1.τ.

Fig. 3 Timelines of positive reviews of three deceptive venues (see Section 4.2). Each venue has several significant spikes in its number of daily positive reviews.

Fig. 3 illustrates this concept, by showing the evolution of the positive review (four and five star) timelines of three venues selected from the ground truth deceptive venue dataset (see Section 4.2). Let p denote the number of positive reviews received by V during its active interval, T. We now show that:

Claim 2 Assuming a uniform arrival process for genuine positive reviews, the maximum number of genuine positive reviews in a δt interval is approximately (pδt/T)(1 + 1/√c), where c = pδt/(T log(T/δt)).

Proof: The distribution of reviews into δt intervals follows a balls and bins process, where p is the number of balls and T/δt is the number of bins. It is known (e.g., [42,43]) that given b balls and B bins, the maximum number of balls in any bin is approximately (b/B)(1 + 1/√c), where c = b/(B log B). Thus, the result follows.

We introduce now the following result.

THEOREM 1: If n > 49, a successful review campaign will exceed, during the attack interval, the maximum number of reviews of a uniform review distribution.

Proof: Let p denote the number of positive, genuine reviews received by the target venue at the end of the review campaign. p < n, where n is the total number of genuine reviews at the end of the campaign. According to Claim 1, a successful review campaign needs to contain at least n/7 positive (5 star) reviews. Then, since the expected number of positive genuine reviews to be received in a δt interval is pδt/T, following the review campaign, the expected number of (genuine plus fraudulent) positive reviews in the attack interval will be n/7 + pδt/T.

The maximum number of positive genuine reviews posted during an interval δt, assuming a uniform distribution, is, according to Claim 2, approximately pδt/T + √((pδt/T) log(T/δt)).

Thus, the number of positive reviews generated by a review campaign exceeds the maximum positive reviews of a uniform distribution if

n/7 + pδt/T > pδt/T + √((pδt/T) log(T/δt)),

that is, if n²/49 > (pδt/T) log(T/δt). Since n > p, it suffices that n > 49 (δt/T) log(T/δt). Since T > δt, we have that (δt/T) log(T/δt) < 1. Thus, the above inequality trivially holds for n > 49.

Theorem 1 introduces a tradeoff for attackers. Specifically, an attacker can choose to either (i) post enough reviews to impact the rating of a venue (launch a successful campaign) but then become detectable (exceed the maximum number of reviews of a uniform distribution), or (ii) remain undetected, but then not impact the rating of the venue.

Detect abnormal review activity. We exploit the above results and use statistical tools to retrieve ranges of abnormal review activities. In particular, our goal is to identify spikes, or outliers, in a venue's timeline. For instance, each venue in Fig. 3 has several significant review spikes. The RSD module of Marco uses the measures of dispersion of Box-and-Whisker plots [31] to detect outliers. Specifically, given a venue V, it first computes the quartiles and the inter-quartile range IQR of the positive reviews from V's timeline HV. It then computes the upper outer fence (UOF) value using the Box-and-Whiskers plot [31]. For each sub-interval d of set length (in our experiments |d| = 1 day) in V's active period, let Pd denote the set of positive reviews from HV posted during d. If |Pd| > UOF, the RSD module marks Pd, i.e., a spike has been detected. For instance, the "South Bay BMW" venue (see Fig. 3) has a UOF of 9 for positive reviews: any day with more than 9 positive reviews is considered to be a spike.

We note that a different empirical approach, proposed by Fei et al. [44], is to use Kernel Density Estimation (KDE) to estimate the probability distribution function of the reviews of a venue.

The RSD module outputs two features (see Table 3): SC(V), the number of spikes detected for a venue V, and SAmp(V), the amplitude of the highest spike of V, normalized to the average number of reviews posted for V during an interval d.

5.4. Aggregate Rating Disparity

A venue that is the target of a review campaign is likely to receive reviews that do not agree with its genuine reviews. Furthermore, following a successful review campaign, the venue is likely to receive reviews from genuine users that do not agree with the venue's newly engineered rating.

Let ρV(T) denote the average rating of a venue V at time T ∈ [T1, Tc]. We define the rating disparity of a review R written at time R.τ for V to be the divergence of R's rating from the average rating of V at the time of its posting, |R.ρ − ρV(R.τ)|. Let R1, ..., RN, N = n + q, be all the reviews received by V (both genuine and fraudulent) during its active interval T. We define the aggregate rating disparity (ARD) score of V to be the average rating disparity of all the reviews of V:

ARD(V) = (1/N) Σ_{i=1..N} |Ri.ρ − ρV(Ri.τ)|.
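A minimal sketch of the two computations just described, assuming a venue timeline given as a list of (day, rating) pairs: the RSD spike test uses the standard Box-and-Whiskers upper outer fence Q3 + 3·IQR, and the ARD score compares each review against the venue's running average at posting time. The helper names and data layout are ours.

```python
from collections import Counter
from statistics import quantiles

def spike_days(timeline):
    """RSD sketch: days whose positive-review count exceeds the upper outer fence.

    timeline: list of (day, rating) pairs, with day as an ordinal.
    Positive reviews are those rated 4 or 5 stars (Section 3.1).
    """
    daily = Counter(day for day, rating in timeline if rating >= 4)
    counts = sorted(daily.values())
    if len(counts) < 4:
        return []
    q1, _, q3 = quantiles(counts, n=4)   # quartiles of the daily positive counts
    uof = q3 + 3 * (q3 - q1)             # Box-and-Whiskers upper outer fence
    return [day for day, c in daily.items() if c > uof]

def ard_score(timeline):
    """ARD sketch: average |R.rho - rho_V(R.tau)| over all reviews, where
    rho_V(R.tau) is the venue's average rating just before R was posted.
    The first review has no prior average, so its disparity counts as 0."""
    ordered = sorted(timeline)           # chronological order
    disparity_sum, rating_sum = 0.0, 0.0
    for i, (_, rating) in enumerate(ordered):
        if i > 0:
            disparity_sum += abs(rating - rating_sum / i)
        rating_sum += rating
    return disparity_sum / len(ordered) if ordered else 0.0
```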
By influencing the average rating of a venue, a review campaign will increase the rating disparity of both fraudulent and genuine reviews. This is illustrated in Fig. 4, which plots the evolution in time of the average rating against the ratings of individual reviews received by the "Azure Nail & Waxing Studio" (Chicago, IL). The positive reviews (one day has a spike of 19 five-star reviews, shown in red in the upper right corner) disagree with the low rated reviews, generating a high ARD value. The ARD module contributes one feature, the ARD score, see Table 3.

Fig. 4 Evolution in time of the average rating of the venue "Azure Nail & Waxing Studio" of Chicago, IL, compared against the ratings assigned by its reviews. The values in parentheses denote the number of reviews that were assigned a corresponding rating (shown on the y-axis) during one day. The lack of consensus between the many low and high rated reviews raises a red flag.

We note that Jindal and Liu [2], Lim et al. [4], Mukherjee et al. [13] and Mukherjee et al. [15] proposed a feature similar to ARD. However, the ARD feature we introduce differs, in that the disparity is between the rating of a review and the rating of the venue at the time when the review was written. Previous work considers a formula where the disparity is computed at the present time.

5.5. FRI Module

Venues that receive few genuine reviews are particularly vulnerable to review campaigns (see also Theorem 1). Furthermore, long term review campaigns that post high numbers of fraudulent reviews can re-define the "normal" review posting behavior, flatten spikes and escape detection by the RSD module. They are also likely to drown the impact of genuine reviews on the aggregate rating of the venue. Thus, the ARD of the campaign's target venue will be small, controlled by the fraudulent reviews.

We propose to detect such behaviors through fraudulent reviews that significantly impact the aggregate rating of venues. For this, in a first step, the FRI module uses machine learning tools to classify the reviews posted for V as either fraudulent or genuine. It uses features extracted from each review, its writer and the relation between the review writer and the target venue (see Table 2). Specifically, let R denote a review posted for a venue V, and let U denote the user who wrote it. In addition to the friend and review count of U, we introduce the concept of the expertise of U around V. ExpU(V) is the number of reviews U wrote for venues in the vicinity (50 mile radius) of V. Furthermore, FRI uses the number of activities of U recorded at V, the feedback of R, counting the users who reacted positively to the review, and the age of U's account when R was posted, AgeU(R). Section 6.1 shows that the Random Forest tool achieves 94% accuracy when classifying fraudulent and genuine reviews.

Table 2. Features used to classify review R written by user U for venue V.
Notation       Definition
f(U)           The number of friends of U
r(U)           The number of reviews written by U
ExpU(V)        The expertise of U around V
cU(V)          The number of check-ins of U at V
pU(V)          The number of photos of U at V
feedback(R)    The feedback count of R
AgeU(R)        Age of U's account when R was posted

In a second step, the FRI module introduces the notion of FRI, to model the impact of fraudulent reviews on the final rating of the venue. Let ρVg = σ/n denote the genuine rating of V, computed as an average over its n genuine reviews. Then, FRI(V) = ρV(Tc) − ρVg, where ρV(Tc) is the average rating of V at the current time Tc. Note that FRI(V) can be negative, for a bad-mouthing campaign. The FRI module contributes two features, FRI(V), and the percentage of reviews classified as fraudulent for V, CF(V) (see Table 3).

5.6. Venue Classification

In addition to the features provided by the RSD, ARD and FRI modules, we also use the rating of V, ρV, its number of reviews N, its number of reviews with associated user check-ins, cir(V), and with uploaded photos, pr(V), and the current age of V, Age(V), measured in months since V's first review. Table 3 lists all the features we selected. Section 6.2 shows that these features enable the Random Forest classifier to achieve 95.8% accuracy when classifying the venue sets of Section 4.2.

6. EMPIRICAL EVALUATION

In this section we show that Marco is scalable as well as efficient in detecting fraudulent reviews and deceptive venues. We have implemented Marco using (i) Python, to extract data from parsed pages and compute the proposed features, and (ii) the statistical tool R, to classify reviews and venues. We used MySQL to store collected data and features.
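As an illustration of the Python feature-computation step, the sketch below computes the ExpU(V) expertise feature. The haversine formula and the roughly 3,959-mile Earth radius are standard; the data layout (lists of (latitude, longitude) coordinates for the venues of a user's reviews) is our assumption.

```python
from math import asin, cos, radians, sin, sqrt

def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (haversine formula, Earth radius ~3959 mi)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3959 * asin(sqrt(a))

def expertise(user_review_coords, venue_coord, radius_miles=50):
    """Exp_U(V): number of reviews U wrote for venues within 50 miles of V.

    user_review_coords: (lat, lon) of the venue of each review written by U;
    venue_coord: (lat, lon) of the target venue V.
    """
    vlat, vlon = venue_coord
    return sum(1 for lat, lon in user_review_coords
               if miles_between(lat, lon, vlat, vlon) <= radius_miles)
```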
Table 3. Features used to classify a venue V as either deceptive or legitimate.
Notation     Definition
SC(V)        The number of review spikes for V
SAmp(V)      The amplitude of the highest spike
ARD(V)       Aggregate rating disparity
FRI(V)       The FRI of V
CF(V)        Count of reviews classified fraudulent
ρV           The rating of V
N            The number of reviews of V
cir(V)       The number of reviews with check-ins
pr(V)        The number of reviews with photos
Age(V)       The age of V

6.1. Review Classification

We investigated the ability of the FRI module to classify reviews, when using five machine learning tools: Bagging, k-Nearest Neighbor (kNN), Random Forest (RF), Support Vector Machines (SVM) and C4.5 Decision Trees (DT). We used tenfold cross-validation over the 426 fraudulent and 410 genuine reviews of Section 4.2. Fig. 5(a) shows the receiver operating characteristic (ROC) curve for the top three performers: RF, Bagging and DT.

The overall accuracy, (TPR + TNR)/(TPR + TNR + FPR + FNR), of RF, Bagging and DT is 93.8%, 93.6% and 93.2%, respectively. TPR is the true positive rate, TNR the true negative rate, FPR the false positive rate and FNR the false negative rate. The (FPR, FNR) pairs for RF, Bagging and DT are (7.0%, 5.3%), (6.3%, 6.6%) and (5.1%, 8.6%), respectively (shown in Table 4). In the remaining experiments, the FRI module of Marco uses the RF classifier.

Fig. 5 (a) ROC plot of Random Forest (RF), Bagging and C4.5 Decision Tree (DT) for review classification (426 fraudulent, 410 genuine). RF performs best, at 93.83% accuracy. (b) Distribution of reviewers' review count: fraudulent versus genuine review sets. (c) Distribution of reviewers' expertise levels: fraudulent versus genuine sets. Note their symmetry: unlike genuine reviewers, fraudulent reviewers tend to have written only few reviews and have low expertise for the venues that they reviewed.
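A minimal sketch of this experiment with scikit-learn, assuming the Table 2 features have already been extracted into a matrix X (one row per review) with labels y (1 = fraudulent). The fold count matches the paper; the estimator parameters are our assumptions, and the paper's actual classification step uses R.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def evaluate_review_classifier(X, y, folds=10):
    """Tenfold cross-validation of the review classifier, reporting the
    accuracy, false positive rate and false negative rate used in Table 4."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    pred = cross_val_predict(clf, X, y, cv=cv)
    y = np.asarray(y)
    tp = np.sum((pred == 1) & (y == 1))
    tn = np.sum((pred == 0) & (y == 0))
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    return {
        "accuracy": (tp + tn) / len(y),
        "FPR": fp / (fp + tn),  # false positive rate
        "FNR": fn / (fn + tp),  # false negative rate
    }
```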
The top 2 most impactful features for RF are r(U) and ExpU(V). Fig. 5(b) compares the distribution of the r(U) feature for the 426 fraudulent and the 410 genuine reviews. We emphasize their symmetry: few fraudulent review writers posted a significant number of reviews, while few genuine review writers posted only a few reviews. Fig. 5(c) compares the distribution of the ExpU(V) measure. The distributions are also almost symmetric: most writers of genuine reviews have written at least 4 reviews for other venues in the vicinity of the venue of their selected review.

Table 4. Review classification: comparison of machine learning algorithms. RF performs best, at 93.83% accuracy.
Classifier       TPR (%)   FPR (%)   FNR (%)   Acc (%)
Random Forest    94.71     7.0       5.29      93.83
Bagging          93.45     6.28      6.55      93.59
Decision tree    91.44     5.07      8.56      93.22
SVM              89.92     9.66      6.04      92.11

Furthermore, we tested the null hypothesis that the classifiers used in review classification are equivalent, i.e., that the difference in performance metrics of different classifiers is not significant. As the classifiers are trained and tested on the same dataset, we used McNemar's test, which tabulates the outcomes of every two classifiers used for review classification. The results are shown in Table 5. With the exception of the test that compares Bagging and RF, all other tests produce a χ² value with one degree of freedom, highly significant with a confidence level of more than 95.0% (the p-value is <0.05). Thus, we reject the null hypothesis, which means that the differences in performance metrics of the DT, RF, Bagging and SVM models are statistically significant.

Table 5. Significance test: pairwise comparison of machine learning algorithms using McNemar's test. With the exception of the (Bagging, RF) pair, for all other pairs McNemar's test produces a χ² value with one degree of freedom, highly significant with a confidence level of more than 95.0%.
Compared Classifiers   χ² value   p-value
Bagging-DT             11.6452    0.0006437
RF-DT                  13.5       0.0002386
Bagging-RF             0.0476     0.8273
RF-SVM                 4.8983     0.0268
Bagging-SVM            5.2258     0.0222
DT-SVM                 5.1142     0.0237
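A minimal sketch of the pairwise comparison reported in Table 5, assuming boolean vectors that record, review by review, whether each classifier's prediction was correct. The continuity-corrected statistic below is the standard form of McNemar's test with one degree of freedom.

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_test(correct_a, correct_b):
    """McNemar's test for two classifiers evaluated on the same reviews.

    correct_a, correct_b: boolean arrays, True where classifier A (resp. B)
    labeled the review correctly. Returns the continuity-corrected chi-square
    statistic (1 degree of freedom) and its p-value.
    """
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    only_a = np.sum(a & ~b)   # A right, B wrong
    only_b = np.sum(~a & b)   # B right, A wrong
    if only_a + only_b == 0:
        return 0.0, 1.0       # the classifiers never disagree
    stat = (abs(only_a - only_b) - 1) ** 2 / (only_a + only_b)
    return stat, chi2.sf(stat, df=1)
```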
6.2. Venue Classification

We have used tenfold cross-validation to evaluate the ability of Marco to classify the 90 deceptive and 100 legitimate venues of Section 4.2. Fig. 6(a) shows the ROC curve for Marco when using the RF, Bagging and C4.5 DT classifiers on the features listed in Table 3. The overall accuracy for RF, Bagging and DT is 95.8%, 93.7% and 95.8%, respectively, with the corresponding (FPR, FNR) pairs being (5.55%, 3%), (8.88%, 4%) and (5.55%, 3%), respectively.

Fig. 6(b) shows the distribution of SC(V) for the 190 venues. Only one legitimate venue has a review spike, while several deceptive venues have more than 10 spikes. Furthermore, 26 deceptive venues have an FRI score larger than 1; only one legitimate venue has an FRI larger than 1.

Fig. 6 (a) ROC plot of RF, Bagging and C4.5 DT for the 90 deceptive/100 legitimate venue datasets. RF and DT are tied for best accuracy, of 95.8%. (b) Distribution of SC(V), for the 90 deceptive and 100 legitimate venues. 60 deceptive venues have at least one review spike. One legitimate venue has one spike.

Comparison with state-of-the-art. We compared Marco with the three deceptive venue detection strategies of Feng et al. [18]: avg, dist and peak↑. Table 6 shows the FPR, FNR and overall accuracy of Marco, avg, dist and peak↑. Marco achieves a significant accuracy improvement (95.8%) over dist, the best strategy of Feng et al. [18] (72.1%).

Table 6. Marco versus the three deceptive venue detection strategies of Feng et al. [18]. Marco shows over 23% accuracy improvement over dist.
Strategy    FPR            FNR             Accuracy
Marco/RF    5/90 = 0.055   3/100 = 0.03    95.8%
avg         33/90 = 0.36   31/100 = 0.31   66.3%
dist        28/90 = 0.31   25/100 = 0.25   72.1%
peak↑       41/90 = 0.45   37/100 = 0.37   58.9%

Marco performance for new venues. We have also evaluated the performance of Marco in classifying relatively new venues with few genuine reviews. Specifically, from our set of 90 deceptive and 100 legitimate venues, we selected 51 deceptive and 5 legitimate venues that had less than 10 genuine reviews when we collected them. The overall accuracy of RF, Bagging and DT on these 56 venues is 94.64%, 92.15% and 94.64%, respectively. The (FPR, FNR) pairs for RF, Bagging and DT are (20.0%, 3.92%), (20.0%, 5.88%) and (0.0%, 5.88%), respectively (Table 7).

Table 7. Marco performance on new, unpopular venues: comparison of machine learning algorithms. RF and DT perform the best.
Classifier       TPR (%)   FPR (%)   FNR (%)   Acc (%)
Random Forest    96.07     20.0      3.92      94.64
Bagging          94.12     20.0      5.88      92.15
Decision tree    94.12     0.0       5.88      94.64

6.3. Marco in the Wild

Marco takes only a few seconds to classify a venue, on a Dell laptop with 4 GB of RAM. Fig. 7(a) shows the per-module overhead of Marco (averages over 10 experiment runs), as a function of the review count of the venue classified. While the FRI module is the most time consuming, even for venues with 500 reviews the FRI overhead is below 2.3 s. The RSD and ARD modules impose only a few ms (6 ms for 500 reviews), while DB access and data retrieval take around 90 ms. Fig. 7(b) zooms in on the FRI overhead. For 500 reviews, the most time consuming components are computing the user expertise, ExpU(V) (≈1.1 s), computing all the other features (≈0.4 s) and classifying the reviews (≈0.8 s).

Fig. 7 (a) Marco's per-module overhead: FRI is the most expensive, but under 2.3 s even for venues with 500 reviews. (b) Zoom-in of FRI module overhead. Computing the ExpU(V) feature takes the most time.

In order to understand the ability of Marco to perform well when trained on small sets, we have trained it on 50 deceptive and 50 legitimate venues and tested it on the remaining 40 deceptive and 50 legitimate venues. On average over 10 random experiments, Marco achieved an FPR of 6.25% and an FNR of 3%.

We have used Marco to classify the 7,435 venues we collected from Miami, San Francisco and New York City. We have divided the set of 7,435 venues into subsets of 200 venues. We trained Marco on the 190 ground truth/gold standard venues and tested it separately on all subsets of 200 venues. Table 8 shows the total number of venues collected and, between parentheses, the number of venues detected to be deceptive. San Francisco has the highest concentration of deceptive venues: Marco flags almost 10% of its car repair and moving companies as suspicious, and upon our manual inspection, they indeed seemed to engage in suspicious review behaviors. While the FRI of San Francisco's collected genuine venues is at most 1, 60% of its deceptive venues have an FRI between 1 and 4.

Table 8. Collected venues organized by city and venue type. Values between parentheses show the number of venues detected by Marco to be deceptive. San Francisco has the highest percentage of deceptive venues.
City            Car Shop    Mover       Spa
Miami, FL       1000 (6)    348 (8)     1000 (21)
San Fran., CA   612 (59)    475 (45)    1000 (42)
NYC, NY         1000 (8)    1000 (27)   1000 (28)

6.4. Detecting Yelp Campaigns

We conjecture that Yelp events can be used as review campaigns. Our hypothesis is based on several observations. First, the process of choosing the venues hosting Yelp events is not public. Second, a venue hosting an event is given ample warning to organize the event. Third, only Elite yelpers attend this event. While the attendees are encouraged to review the event's Yelp account, we have identified Yelp events that impacted the ratings of the corresponding host venues. We call such events Yelp campaigns. Fig. 8(a) shows an example of venue and event timelines, correlated in time, for the venue "Pink Taco 2" (Los Angeles). Note how the venue's latest two spikes coincide with the spikes of the event.

To detect correlations between Yelp events and increased review activity concerning the venues hosting the events, we use Marco's RSD module as follows. Specifically, given a Yelp event and a time interval T (a system parameter), we determine if the hosting venue experiences a positive review spike within an interval T of the event's date.
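A minimal sketch of this check, reusing a spike detector such as the one sketched in Section 5.3; dates are day ordinals and the helper names are our assumptions.

```python
def spikes_near_events(events, window_days=14):
    """Count positive-review spikes that fall within `window_days` of each
    event's date (T = 14 days in Section 6.4).

    events: list of (event_day, spike_days) pairs, one per hosting venue,
    where spike_days comes from a spike detector like the RSD sketch of
    Section 5.3 and days are ordinals.
    """
    total_spikes = 0
    venues_with_spike = 0
    for event_day, spike_days in events:
        nearby = [d for d in spike_days if abs(d - event_day) <= window_days]
        total_spikes += len(nearby)
        venues_with_spike += bool(nearby)
    return total_spikes, venues_with_spike
```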
1 (a) (b) 1
2 2
33
44
55
66
77
88
99
Color
10 10
11 11
12 12
13 Fig. 8 (a) The timeline of “Pink Taco 2” (Los Angeles) and of the Yelp event for this venue. Note the correlation between the two. (b) 13
14 Yelp events: Positive review spike count as a function of T . 14
15 (b) 15
16 (a) 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 Fig. 9 (a) Distribution of the short term impact (2 weeks) of Yelp events on venue ratings. (b) Yelp events: Distribution of the improvement 26
27 due to Elite events. 27
28 28
For instance, when T is 14 days, Marco detected 36 spikes on the 149 venues. Some venues have more than one spike within the 14 days. The total number of venues with at least one spike is 24, accounting for around 17% of the venues. While for T = 35 Marco detected 47 spikes, we prefer a shorter interval: the correlation between the event and the spikes may fade over longer intervals. In the following, we use T = 14.

Furthermore, we focused on determining the influence of Yelp events on the overall rating of a venue. First, we computed the 2-week impact of the Yelp event on the venue. We define the 2-week impact as the difference between the rating of the venue two weeks after the event and the rating of the venue before the event. We compute the rating of a venue at any given time T as the average over the ratings of all the reviews received by the venue before time T. Fig. 9(a) shows the distribution of the 2-week impact of the Yelp event on the venue. While 55 (of the 149) venues show no impact, 60 venues show at least a 0.5 star improvement, with 3 at or above a 2 star improvement. 32 venues are negatively impacted. Thus, almost twice as many venues benefit from Yelp events as show a rating decay.
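A minimal sketch of the 2-week impact computation described above, assuming each review is a (date, stars) pair (an illustrative representation, not the dataset's actual schema):

```python
from datetime import date, timedelta

def rating_at(reviews, when):
    """Average star rating over all reviews posted before `when`."""
    stars = [s for d, s in reviews if d < when]
    return sum(stars) / len(stars) if stars else None

def two_week_impact(reviews, event_date):
    """Rating two weeks after the event minus the rating just before it."""
    before = rating_at(reviews, event_date)
    after = rating_at(reviews, event_date + timedelta(weeks=2))
    if before is None or after is None:
        return None
    return after - before

# Example with three (date, stars) reviews around a May 10 event.
reviews = [(date(2013, 4, 1), 3), (date(2013, 5, 12), 5), (date(2013, 5, 20), 5)]
print(two_week_impact(reviews, event_date=date(2013, 5, 10)))  # ~1.33 (4.33 - 3.0)
```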
This result raises the question of whether there exists a relation between the number of reviews of a venue and the short term impact an event has on the venue. The impact of an event is a categorical variable, as it is quantified in fractions of a star. The number of reviews, however, is a discrete (integer) variable. Therefore, we cannot use methods for linear or nonlinear association, e.g., the correlation coefficient. Instead, we tested the hypothesis of independence between the rating impact and the number of reviews, using a χ2 test [31]. The test produced χ2 = 58.6837 with 36 degrees of freedom, which is highly significant (the p-value is 0.009854). Thus, we reject the independence hypothesis.

Fig. 10(a) shows the mosaic plot depicting this relation. Each rectangle corresponds to a set of venues that fall in a certain review count range (the x axis) and have been impacted by a certain amount within two weeks of an event (the y axis). The shape and size of each rectangle depict the contribution of the corresponding variables: a large rectangle means a large count in the contingency table. Blue rectangles indicate counts that are more than two standard deviations above the expected counts. The figure shows that more than half of the (149) venues have more than 40 reviews. Moreover, we notice that the venues having more than 40 reviews set the trend of Fig. 9(a): while roughly one third of the venues show no impact, twice as many venues show a positive impact as a negative one.
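The independence test and the residuals behind the rectangle shading can be reproduced with a standard contingency-table analysis; the sketch below uses scipy on a small made-up table with simplified impact categories (the paper's actual counts are not reproduced here):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative contingency table: rows = 2-week impact categories,
# columns = review-count bins (0-4, 4-10, 10-20, 20-40, >40).
# The counts are made up for the example.
observed = np.array([
    [2, 3, 4, 6, 17],   # no impact
    [1, 2, 5, 9, 43],   # positive impact
    [3, 4, 6, 8, 11],   # negative impact
])

chi2, p_value, dof, expected = chi2_contingency(observed)
# Standardized (Pearson) residuals; cells beyond +/-2 drive the dependence.
residuals = (observed - expected) / np.sqrt(expected)

print(f"chi2 = {chi2:.4f}, dof = {dof}, p = {p_value:.4f}")
print(np.round(residuals, 2))
```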
[Fig. 10(a) and (b): mosaic plots of the magnitude of improvement (y axis) versus the number of reviews (x axis; bins 0−4, 4−10, 10−20, 20−40, >40), with rectangles shaded by standardized residuals ranging from <−4 to >4.]
Fig. 10 Mosaic plots: the standardized residuals indicate the importance of each rectangle in the χ2 test. (a) The dependency between the short term rating change of venues due to events and their number of reviews. (b) The dependency between the long term rating change of venues due to events and their number of reviews.
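One way to draw a comparable residual-shaded mosaic plot in Python is statsmodels' mosaic function; the counts below are made up, and the paper does not state which plotting tool it used:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

# Made-up counts keyed by (review-count bin, 2-week impact category).
counts = {
    ("0-4", "negative"): 3,  ("0-4", "none"): 2,  ("0-4", "positive"): 1,
    ("4-10", "negative"): 4, ("4-10", "none"): 3, ("4-10", "positive"): 2,
    (">40", "negative"): 11, (">40", "none"): 17, (">40", "positive"): 43,
}

# statistic=True shades each tile by how far its count deviates from the
# value expected under independence, mirroring the shading of Fig. 10.
mosaic(counts, statistic=True, title="2-week impact vs. number of reviews")
plt.show()
```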
Second, we study the long term impact of Yelp events. For this, we compare the current ratings of the 149 venues with their ratings before the events. Fig. 9(b) shows the distribution (over the 149 venues) of the difference between the current rating of the venues and their rating before the events. 78% of the venues show no improvement. Furthermore, we see a balance between the number of venues showing an improvement and those showing a negative impact (16 positive vs. 14 negative). However, we emphasize that the negative impact is only half a star, while the positive impact reaches up to 3.5 stars.

We conducted a χ2 test to verify whether the long term impact of events on venues depends on the number of reviews of the venues. The test was highly significant, with χ2 = 29.2038, 12 degrees of freedom, and a p-value of 0.003674. Fig. 10(b) shows the mosaic plot: the vast majority of the venues having more than 40 reviews show no long term impact. This shows that review spikes have a smaller impact on constantly popular venues.
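Both reported p-values follow directly from the χ2 survival function at the stated statistics and degrees of freedom; a quick check with scipy:

```python
from scipy.stats import chi2

# Short term test: chi2 = 58.6837 with 36 degrees of freedom.
print(chi2.sf(58.6837, df=36))   # ~0.0099, matching the reported 0.009854
# Long term test: chi2 = 29.2038 with 12 degrees of freedom.
print(chi2.sf(29.2038, df=12))   # ~0.0037, matching the reported 0.003674
```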
7. CONCLUSIONS

We presented Marco, a system for detecting deceptive Yelp venues and reviews, leveraging a suite of social, temporal, and spatial signals gleaned from Yelp reviews and venues. We also contribute a large dataset of over 7K venues and 270K reviews from 195K users, which also contains a few hundred ground-truth and gold-standard reviews (fraudulent/genuine) and venues (deceptive/legitimate). Marco is effective in classifying both reviews and venues, with accuracies exceeding 94%, significantly outperforming state-of-the-art strategies. Using Marco, we show that two weeks after an event, twice as many venues that host Yelp events experience a significant rating boost as experience a negative impact. Marco is also fast; it classifies a venue with 500 reviews in under 2.3 s.

REFERENCES

[1] D. Segal, A Rave, a Pan, or Just a Fake? The New York Times, www.nytimes.com/2011/05/22/your-money/22haggler.html, 2011.
[2] N. Jindal and B. Liu, Opinion spam and analysis, In Proceedings of the International Conference on Web Search and Web Data Mining, WSDM '08, New York, ACM, 219–230, 2008.
[3] N. Jindal, B. Liu, and E.-P. Lim, Finding unusual review patterns using unexpected rules, In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM '10, 1549–1552, 2010.
[4] E.-P. Lim, V.-A. Nguyen, N. Jindal, B. Liu, and H. W. Lauw, Detecting product review spammers using rating behaviors, In Proceedings of the International Conference on Information and Knowledge Management (CIKM), 939–948, 2010.
[5] Yelp, http://www.yelp.com.
[6] Yelp admits a quarter of submitted reviews could be fake, BBC, www.bbc.co.uk/news/technology-24299742.
[7] Sponsored Reviews, www.sponsoredreviews.com/, 2013.
[8] Posting Positive Reviews, www.postingpositivereviews.blogspot.com/, 2013.
[9] Pay Per Post, https://payperpost.com/, 2013.
[10] A. G. Schneiderman announces agreement with 19 companies to stop writing fake online reviews and pay more than $350,000 in fines, www.ag.ny.gov/press-release/ag-schneiderman-announces-agreement-19-companies-stop-writing-fake-online-reviews-and.
[11] M. Anderson and J. Magruder, Learning from the crowd: Regression discontinuity estimates of the effects of an online review database, Econ J, 122 (2012), 957–989.
[12] M. Luca, Reviews, reputation, and revenue: the case of Yelp.com, www.hbswk.hbs.edu/item/6833.html.
[13] A. Mukherjee, B. Liu, and N. Glance, Spotting fake reviewer groups in consumer reviews, In Proceedings of the International Conference on World Wide Web, 2012.
[14] S. Pandit, D. H. Chau, S. Wang, and C. Faloutsos, Netprobe: a fast and scalable system for fraud detection in online auction networks, In Proceedings of the International Conference on World Wide Web, ACM, New York, 201–210, 2007.
[15] A. Mukherjee, A. Kumar, B. Liu, J. Wang, M. Hsu, M. Castellanos, and R. Ghosh, Spotting opinion spammers using behavioral footprints, In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, 632–640, 2013.
[16] F. Li, M. Huang, Y. Yang, X. Zhu, and X. Zhu, Learning to identify review spam, In IJCAI, 2488–2493, 2011.
[17] M. Ott, Y. Choi, C. Cardie, and J. T. Hancock, Finding deceptive opinion spam by any stretch of the imagination, In Proceedings of the 49th Human Language Technologies (HLT), 2011.
[18] S. Feng, L. Xing, A. Gogar, and Y. Choi, Distributional footprints of deceptive product reviews, In Proceedings of the 6th International Conference on Weblogs and Social Media (ICWSM), 2012.
[19] F. Li, M. Huang, Y. Yang, and X. Zhu, Learning to identify review spam, In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, IJCAI '11, 2488–2493, 2011.
[20] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly, Detecting spam web pages through content analysis, In Proceedings of the 15th International Conference on World Wide Web, WWW '06, 83–92, 2006.
[21] H. Gao, J. Hu, C. Wilson, Z. Li, Y. Chen, and B. Y. Zhao, Detecting and characterizing social spam campaigns, In Proceedings of the 10th Annual Conference on Internet Measurement, IMC '10, 35–47, 2010.
[22] G. Wang, S. Xie, B. Liu, and P. S. Yu, Review graph based online store review spammer detection, In ICDM '11, 1242–1247, 2011.
[23] Hide My Ass! Free Proxy and Privacy Tools, http://www.hidemyass.com/.
[24] R. Dingledine, N. Mathewson, and P. F. Syverson, Tor: The second-generation onion router, In USENIX Security Symposium, 303–320, 2004.
[25] G. Danezis and P. Mittal, SybilInfer: Detecting sybil nodes using social networks, In Proceedings of the Network and Distributed System Security Symposium (NDSS), 2009.
[26] N. Tran, B. Min, J. Li, and L. Subramanian, Sybil-resilient online content voting, In NSDI '09: Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, Berkeley, CA, USENIX Association, 15–28, 2009.
[27] A. Molavi Kakhki, C. Kliman-Silver, and A. Mislove, Iolaus: Securing online content rating systems, In Proceedings of the Twenty-Second International World Wide Web Conference (WWW '13), Rio de Janeiro, Brazil, May 2013.
[28] G. Wang, T. Konolige, C. Wilson, X. Wang, H. Zheng, and B. Y. Zhao, You are how you click: Clickstream analysis for sybil detection, In Proceedings of USENIX Security, 2013.
[29] A. Mukherjee, V. Venkataraman, B. Liu, and N. Glance, What Yelp fake review filter might be doing, In Proceedings of the International Conference on Weblogs and Social Media, 2013.
[30] Death By Captcha, www.deathbycaptcha.com/.
[31] A. C. Tamhane and D. D. Dunlop, Statistics and Data Analysis: From Elementary to Intermediate, Upper Saddle River, NJ, Prentice Hall, 2000.
[32] M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou, Walking in Facebook: a case study of unbiased sampling of OSNs, In Proceedings of IEEE INFOCOM '10, San Diego, CA, 2010.
[33] Spelp, www.yelp.com/topic/boston-spelp-9.
[34] Flelp, www.yelp.com/topic/miami-flelp-we-rock.
[35] 3 tips for spotting fake product reviews - from someone who wrote them, MoneyTalksNews, www.moneytalksnews.com/2011/07/25/3-tips-for-spotting-fake-product-reviews–from-someone-who-wrote-them.
[36] J. R. Douceur, The Sybil attack, In Revised Papers from the First International Workshop on Peer-to-Peer Systems, 2002.
[37] V. J. Hodge and J. Austin, A survey of outlier detection methodologies, Artif Intell Rev 22 (2004), 85–126.
[38] A. Zimek, E. Schubert, and H.-P. Kriegel, A survey on unsupervised outlier detection in high-dimensional numerical data, Stat Anal Data Min 5 (2012), 363–387.
[39] K. Yamanishi, J.-I. Takeuchi, G. Williams, and P. Milne, On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms, In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '00, ACM, New York, 320–324, 2000.
[40] K. Zhang, S. Shi, H. Gao, and J. Li, Unsupervised outlier detection in sensor networks using aggregation tree, In Advanced Data Mining and Applications, New York, Springer, 158–169, 2007.
[41] Z. Ferdousi and A. Maeda, Unsupervised outlier detection in time series data, In Proceedings of the 22nd International Conference on Data Engineering Workshops, x121–x121, IEEE, 2006.
[42] P. Berenbrink, A. Brinkmann, T. Friedetzky, and L. Nagel, Balls into bins with related random choices, In Proceedings of the Symposium on Parallelism in Algorithms and Architectures (SPAA), 2010.
[43] M. Raab and A. Steger, Balls into bins: a simple and tight analysis, In Proceedings of Randomization and Approximation Techniques in Computer Science (RANDOM), 159–170, 1998.
[44] G. Fei, A. Mukherjee, B. Liu, M. Hsu, M. Castellanos, and R. Ghosh, Exploiting burstiness in reviews for review spammer detection, In ICWSM, 2013.