Download the PDF here.

Please note that this analysis is done for academic interest only.

Please do not parade the results as propaganda.

The media landscape of 2013 is a very new one. The 13th General Elections of Malaysia took on a life of their own on social media websites like Facebook and Twitter, with hashtags like #ubah and catchphrases like ‘ini kalilah’. This author watched the Malaysian General Election with a certain perverse obsession, despite having nothing to do with it.

Rhetoric abounded from supporters of both sides. Ridiculous claims and promises – objectively unsustainable ones – were made by the ruling party. Slogans were shouted in traditional media and on social media. There were *ceramah*s and there were concerts. Then came Election Day, by which point the situation had turned, in my view, fairly ugly. Look out for fraud, people were told. Look out for phantom voters – Bangladeshis, derogatorily called ‘Banglas’, allegedly hired by the ruling party, Barisan National, to vote. Many allegations and rumours of citizen’s arrests circulated on Facebook and Twitter.

As with any election came the counting. During the counting period there were, once again, many anecdotal stories: blackouts followed by sudden increases in ballot boxes; vote swapping; new ballot boxes being ferried in. Naturally, people cried foul over such activities, once again alleging fraud. The situation was exacerbated by the announcement that Barisan National had won the elections and would remain in government.

People were not happy, and for two or three days social media was flooded with “evidence” of fraud. In this author’s opinion, it was hardly evidence of fraud – merely anecdote. To quote Michael Shermer, Editor in Chief of Skeptic Magazine: *“Anecdotal thinking is natural. Science requires training.”*

And so, this author decided to perform some analysis to determine whether fraud had happened.

## Affiliation Disclaimer

This author has no affiliation with any political party in Malaysia. The analysis was done mainly out of academic curiosity. However, considering the rather racist and segregationist claims made by the leaders of Barisan National, the results of the analysis have presented an ethical dilemma to the author.

The ethical dilemma is this: should this analysis ever be discovered by a political party, it would most definitely be paraded around. The ruling party would misconstrue it as proof that the General Elections were conducted fairly; the opposition would misconstrue it as propaganda from the ruling party.

And yet, this author owes the enlightened people of Malaysia an analysis fraught with neither emotion nor partisanship: a factual analysis, so to speak.

As such, the data and source code used in this analysis will be open source and available to all.

## The 13th General Elections of Malaysia: A statistical analysis

In this study, we shall analyse the results of the 13th General Elections of Malaysia through the lens of a statistician, within a rough framework of answering the various questions of fraud that have been floating around social media. With the big question – **DID FRAUD OCCUR?** – in mind, we begin by investigating the allegations of how such fraud might have occurred. We return to the big question at the end of the analysis.

We acquired data from the official figures released by the SPR (both from The Star and from the compilation by James Chong).

The data from The Star and from James Chong were matched up, and no discrepancies were found between them.

### A General Overview

We begin our general overview of the question **Did fraud occur?** with a cursory glance at the numbers of the elections. A popular technique for discovering evidence of fraud is to apply a Benford’s Law analysis to the election’s numbers.

**Benford’s Law** refers to a specific frequency distribution of digits observed in many real-life data sets – often described as data from naturally occurring processes. The idea is that the first (and/or second) digit of numbers generated by such processes falls into this sort of distribution: ‘1’ appears in the first digit more often than ‘2’; ‘2’ more often than ‘3’; ‘3’ more often than ‘4’; and so on. Specifically, ‘1’ appears in the first digit about 30% of the time, and ‘9’ about 5% of the time. Mathematicians are still working out exactly when and why this happens.

We would expect that if a process generates its numbers naturally (i.e. the numbers have not been tampered with), those numbers will follow the Benford’s Law distribution. If the numbers have been tampered with, one would expect aberrations in the distribution, with spikes at other digits.
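The comparison described above is easy to sketch in code. The following is a minimal illustration (the vote counts here are made-up numbers, not election data): it computes the expected Benford first-digit frequencies and the observed first-digit frequencies of a list of counts.

```python
import math
from collections import Counter

def benford_expected():
    """Expected first-digit frequencies under Benford's Law: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit_freqs(counts):
    """Observed first-digit frequencies of a list of positive counts."""
    digits = [int(str(c)[0]) for c in counts if c > 0]
    tally = Counter(digits)
    return {d: tally.get(d, 0) / len(digits) for d in range(1, 10)}

# Hypothetical vote counts, for illustration only
votes = [1023, 1894, 2450, 312, 987, 1567, 2210, 4031, 118, 1765]
expected = benford_expected()
observed = first_digit_freqs(votes)
```

Plotting `observed` against `expected` (digit by digit) gives exactly the kind of chart discussed below.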

An election is a naturally occurring process that generates numbers – votes and turnouts. We would expect election numbers that have not been tampered with (through ballot stuffing, for example) to follow a distribution quite similar to Benford’s Law. It should be noted that some deviation is to be expected.

Without further ado, we analyse the distribution of the numbers generated by the 13th General Elections of Malaysia. The Benford’s Law distribution is plotted both for the vote counts of each party (BN vs PR) and for turnout vs registered voters.

This chart shows the distribution of first digits of votes for each party, compared with the Benford’s Law distribution (pink line). Note that both the PR and BN lines follow the Benford’s Law distribution quite closely (the fit is in fact quite good).

This chart shows the distribution of first digits of the turnout and of the number of registered voters, compared with the Benford’s Law distribution. Note that the registered-voters count is slightly off the Benford’s Law distribution at the digit 2.

**What does this imply?** It suggests that the election numbers are fairly natural and were generally not tampered with. The distribution of ‘2’ in the registered-voters counts could be concerning, but it is not much to stand on.

### Alleged Discrepancies

The use of Benford’s Law on election data has been widely disputed. Deckert et al. (2011) assert that it is like flipping a coin to determine whether fraud occurred, and ‘…at best a forensic tool’ – which is precisely how we treated the results. With a skeptical mind, we pursued further.

Perhaps one of the more easily verified allegations floating around social media is that the numbers do not add up (as in this picture). To verify this, we combed through the data for discrepancies.

We approached the discrepancy problem with a rather novel method, owing to the nature of the data. We noticed that the turnout numbers in both The Star’s data and James Chong’s were actually sums of the votes for each party and the number of rejected votes – not independently reported totals. A simple discrepancy analysis (summing the votes for each party plus the rejected votes, and comparing the result to the reported total) would therefore be a useless affair. Instead, a different method had to be used:

The election was split into two parts, and most people in most states had two ballots: one for the state level (N) and one for the parliament level (P). If there were any discrepancies, they would most likely show up in the differences between the state level and the parliament level, owing to the logistics involved in ballot stuffing.

We computed a table of the total number of votes for the N-level and P-level elections, and computed their discrepancies. We defined an acceptable error margin of 1% to account for human and systemic error (humans do make mistakes, both in counting and in entering data into a spreadsheet).
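The check just described can be sketched as follows. This is a minimal illustration with hypothetical state totals (not the actual SPR figures): each state’s P-level and N-level totals are compared, and the relative discrepancy is flagged against the 1% margin.

```python
# Per-state discrepancy between P-level and N-level vote totals,
# flagged against a 1% acceptable error margin.
MARGIN = 0.01

state_totals = {            # state: (P-level total votes, N-level total votes)
    "Johor":    (1_211_507, 1_210_992),   # hypothetical numbers
    "Selangor": (2_050_312, 2_049_876),   # hypothetical numbers
}

report = {}
for state, (p_total, n_total) in state_totals.items():
    discrepancy = abs(p_total - n_total) / max(p_total, n_total)
    report[state] = (discrepancy, discrepancy <= MARGIN)
```

Iterating this over all states (minus Sarawak and the Federal Territories, as explained below) yields the table shown here.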

Below is the resulting table:

As can be seen, the discrepancies are minute – well within the acceptable error margin. Were we to tighten the margin to 0.5%, all of the data would still fall within the acceptable range.

Observant readers will notice that Sarawak and the Federal Territories are missing from this list. This is because of how the discrepancies were counted: they require both N- and P-level vote counts. Owing to Sarawak’s unique history, its state elections are held much later, and the Federal Territories are not states and therefore contribute no N-level votes. Both were omitted from this analysis.

One might also notice that this does not directly answer the question of discrepancies raised in the allegation above. The reason is simple: a per-electorate turnout ratio was computed for the further analysis below, and no electorate was found to have a turnout rate higher than 91%. This completely dispels the allegations of higher-than-100% turnout/voting rates.

### Systemic Election Irregularities

Astute readers will have noticed that the phrase “ballot stuffing” has been thrown about a few times thus far. Indeed, the whole exercise of this analysis is to figure out whether fraud happened by ballot stuffing. The state-of-the-art method of detecting election fraud was developed by Klimek et al. (2012). In their paper, Klimek et al. define two forms of voting fraud: a) incremental fraud; b) extreme fraud. We have taken their approach and adapted it to the Malaysian general elections.

#### Incremental Fraud

Incremental fraud is defined as fraud that increases the vote count for the winning party. Ballot stuffing, described by Klimek et al. in their paper, is a common method. In the Malaysian context, we take the allegations of fraud and consider them one by one.

- **Phantom voters** – phantom voters are voters who do not exist on the electoral roll, yet have their votes counted. This is traditional ballot stuffing. A few ways to perform phantom-voter fraud: i) a batch of new ballots of unknown origin for the defrauding party is added to the ballot box before or during counting (after a blackout, for example); ii) after counting, the result count for the defrauding party is incremented per channel (*saluran*).
- **Dirty electoral roll** – a dirty, or tainted, electoral roll has people who are not supposed to be on the electoral roll registered and voting. A few ways to perform this fraud: i) pre-register a group of foreign workers as citizens eligible to vote – perhaps with financial incentives – and have them vote for the defrauding party; ii) have one person registered and voting at multiple electorates; iii) have one person vote multiple times per electorate (holding fake ICs and removing the indelible ink, for example).
- **Default votes** – default votes are votes that default to the defrauding party. An example of this kind of fraud: change all incoming postal/military/police votes to default to the defrauding party.

All of these fall under the purview of **incremental fraud**. In every case, it essentially robs the non-defrauding parties of votes.

According to Klimek et al., incremental fraud can be modeled as follows: ‘[W]ith probability *fi*, ballots are taken away from both the nonvoters and the opposition, and they are added to the [defrauding] party’s ballots.’

Detecting incremental fraud is then simple. If any of these methods were used, we would expect the total number of votes to increase relative to the actual number of voters. If the electoral roll is dirty, we would also expect the number of registered voters to increase.

Therefore, if incremental fraud happened, we should expect a correlation between the percentage of people who voted for the defrauding party and the percentage of people who turned up – in essence, because the extra people who turn up are expected to vote for the winning party.
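The mechanism can be made concrete with a toy simulation, loosely modelled on the Klimek et al. formulation quoted above (all parameters here are illustrative assumptions, not fitted values): with probability *fi*, a random fraction of the nonvoters’ and the opposition’s ballots is moved to the defrauding party, which pushes turnout and the winner’s share up together.

```python
import random

def simulate_unit(n_registered, p_turnout, p_win, fi, rng):
    """One electorate under a simplified incremental-fraud model: with
    probability fi, a random fraction x of the nonvoters' and the
    opposition's ballots is moved onto the defrauding party's pile."""
    turnout = int(n_registered * p_turnout)
    winner = int(turnout * p_win)
    opposition = turnout - winner
    nonvoters = n_registered - turnout
    if rng.random() < fi:
        x = rng.random()                                  # stolen fraction
        winner += int(x * (opposition + nonvoters))
        opposition = int(opposition * (1 - x))
        turnout = winner + opposition
    return turnout / n_registered, winner / turnout       # (turnout %, win %)

rng = random.Random(13)
fair = [simulate_unit(50_000, 0.85, 0.52, 0.0, rng) for _ in range(200)]
rigged = [simulate_unit(50_000, 0.85, 0.52, 0.9, rng) for _ in range(200)]
```

Scatter-plotting the `(turnout, share)` pairs for `fair` vs `rigged` shows the telltale drift toward the top-right corner that the analysis below looks for.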

#### Extreme Fraud

In the Klimek et al. paper, extreme fraud is characterized as “…[W]ith probability fe, almost all ballots from the nonvoters and the opposition are added to the winning party’s ballots.” Here we differ from the Klimek paper. Instead of defining extreme fraud as nearly all of the opposition’s votes being converted into votes for the defrauding party, we define extreme fraud as the swapping of count results, as per this allegation.

Although it is more than likely that the allegation was the result of clerical error, it is nonetheless interesting to simulate what would happen.

Extreme fraud in our case is modeled as follows: with probability *fe*, if the count of votes for the opposition party (or parties) is higher than the count for the defrauding party, the counts are switched so that the defrauding party receives the opposition’s count.
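The swap model just defined is simple enough to state directly in code. This is a minimal sketch with hypothetical counts; the function and variable names are our own, not from the Klimek paper.

```python
import random

def apply_extreme_fraud(results, fe, rng=random.Random(0)):
    """results: list of (defrauder_votes, opposition_votes) per electorate.
    With probability fe, a losing defrauder's count is swapped with the
    opposition's count."""
    out = []
    for d, o in results:
        if o > d and rng.random() < fe:
            d, o = o, d                     # swap the two counts
        out.append((d, o))
    return out

# Hypothetical clean counts; with fe = 1.0 every losing count is swapped.
clean = [(4_800, 5_200), (6_100, 3_900), (2_500, 7_500)]
rigged = apply_extreme_fraud(clean, fe=1.0)
```

Note that electorates the defrauder already wins are untouched, which is what distinguishes this swap model from Klimek et al.’s original formulation.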

Both Klimek et al.’s modeling of extreme fraud and this author’s own were performed. In the interest of brevity, however, only our modeling is shown here. The Klimek modeling of the election data is provided via a link next to each image’s caption. Interpretation is left as an exercise to the reader.

#### The Analysis

Now that **incremental fraud** and **extreme fraud**, along with examples of those fraudulent activities, have been defined, we proceed to detect irregularities. Because we are only concerned with Barisan National defrauding the election process to win government, we restrict our analysis to the P-level elections.

First, we look at the logarithmic vote rate for Barisan National at the P level. As in the Klimek paper, we assume that the vote rate can be represented by a Gaussian distribution, with mean and SD taken from actual samples.

The logarithmic vote rate. From this figure it can be observed that the vote rate is roughly Gaussian in nature, albeit not centered at 0, and is probably bimodal.

The skewness for Barisan National at the P-level elections is 0.697269, while the kurtosis is 4.237479. One data point (PASIR MAS) was removed because BN did not compete in that electorate.
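The moments quoted above can be computed as follows. This is a sketch under two assumptions: the logarithmic vote rate is taken as log(v / (V − v)) for party votes v out of V valid ballots (modelled on Klimek et al.; check their paper for the exact definition), and the kurtosis is the plain (non-excess) kurtosis, consistent with a Gaussian having kurtosis 3.

```python
import math

def log_vote_rate(party_votes, total_valid):
    """Assumed definition: log of the party's votes over everyone else's."""
    return math.log(party_votes / (total_valid - party_votes))

def skewness(xs):
    n = len(xs); m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 3 for x in xs) / n / s2 ** 1.5

def kurtosis(xs):                           # plain, not excess, kurtosis
    n = len(xs); m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 4 for x in xs) / n / s2 ** 2
```

Feeding each electorate’s BN vote count through `log_vote_rate` and then `skewness`/`kurtosis` reproduces the kind of summary statistics reported here.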

These numbers are broadly in line with data from countries with ‘cleaner’ elections such as Austria, Canada, and Finland. In fact, the distribution of logarithmic vote rates is remarkably similar to that of Sweden’s 2010 elections (also included in the Klimek et al. paper).

Next, we compare the distribution of the correlation between the winning ratio and the turnout ratio. To do this, we follow in the footsteps of Klimek et al. – see their paper for model details.

Let *fi* be the probability that incremental fraud had happened; and let *fe* be the probability that extreme fraud had happened. We start by simulating the General Elections with a variety of *fi* and *fe* values. We then compare the distribution of the simulated resultant matrix of Winning Ratio vs Turnout Ratio to the matrix of the actual results.

An *fi* and *fe* of 0 means the election is fair; an *fi* and *fe* of 1 means the election is extremely corrupted. The figure below shows the distribution of votes for Barisan National, compared with simulations at different values of *fi* and *fe*:

This figure shows the Winning Ratio vs Turnout Ratio at various levels of *fi* and *fe*. This is the result of our own model. Results following the original Klimek et al. model can be found here.

Here we return briefly to the Benford’s Law distribution. While Benford’s Law has been established as a poor measure for detecting election fraud, it is still interesting to note the first-digit distributions of fraudulent and non-fraudulent voting behaviour.

Benford’s Law on simulated election data. Note that even with fraud parameters of (0, 0), the simulations do not really follow Benford’s Law. They are, however, less irregular than simulations with high fraud parameters. While this author has some ideas as to why this is the case, it is left as an exercise to the reader.

Note in Figure 4 that the actual data look more like simulations with low fraud parameters than simulations with high *fi* and *fe*. This is true for both our model and the original Klimek et al. modeling. The main idea is to find the *fi* and *fe* values that best fit the original data. This process is repeated 1000 times to find the range of *fi* and *fe* that best fits the election data. (The original Klimek modeling was repeated only 500 times, due to time constraints.)

After the 1000 best-fit searches, we find the sector of (*fi*, *fe*) that appears most often. We can then say that these are most likely the ranges of (*fi*, *fe*) within which the Malaysian General Elections took place.

The best fit after 1000 iterations was: (fi, fe) = (0.03471, 0.01275). Here is the comparison between the simulated best fit and the actual data:

This figure shows the comparison between simulated and actual results. Results following the Klimek et al. model can be found here.

This means that in the best simulation we could produce, Barisan National engaged in incremental fraud with probability 0.03471, and in extreme fraud with probability 0.01275. A further analysis can be done, as the figures below show, on the distribution of votes for Barisan National. We expect that if the simulation results make sense, the distribution of simulated votes will closely match the distribution of actual votes for the winning party.

The figure of (*fi*, *fe*) = (0.03471, 0.01275) is the mean over the best fits of the 1000 simulations. Simply put, each round of simulation yields the best-fitting (*fi*, *fe*); repeating the simulation 1000 times yields 1000 such pairs, and we take the means of *fi* and *fe*, which are 0.03471 and 0.01275 respectively. Some variance remains, however, in the range that (*fi*, *fe*) can take. The figure below shows the ranges of *fi* and *fe* that best fit the actual election results over the 1000 simulations:

S is the sum-of-squares fit; the smaller S, the better. To plot this chart, we used a simple inverse of S to find the sectors with the greatest concentration of best fits.
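The fitting loop described above can be sketched as a grid search. This is a toy illustration: the stand-in simulator below just returns its own parameters, whereas the real one would return the simulated turnout-vs-winning-ratio matrix, flattened; the grid size and names are our own assumptions.

```python
import random

def score(sim, obs):
    """Sum of squared differences S between two equal-length summaries."""
    return sum((s - o) ** 2 for s, o in zip(sim, obs))

def best_fit(observed, simulate, grid_steps=10, rng=random.Random(0)):
    """Scan a (fi, fe) grid and return the pair with the smallest S."""
    best_params, best_s = None, float("inf")
    for i in range(grid_steps + 1):
        for j in range(grid_steps + 1):
            fi, fe = i / grid_steps, j / grid_steps
            s = score(simulate(fi, fe, rng), observed)
            if s < best_s:
                best_params, best_s = (fi, fe), s
    return best_params, best_s

observed = [0.03, 0.01]                      # toy "observed" summary
params, s = best_fit(observed, lambda fi, fe, rng: [fi, fe])
```

Repeating `best_fit` over many simulation rounds and averaging the winning `(fi, fe)` pairs is the procedure that yields the reported means.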

Finally, as mentioned by Klimek et al., a chart of the cumulative number of votes as a function of turnout is also a good way to spot fraud. According to the authors, it is plotted as follows: “…[f]or each turnout level, the total number of votes from [electorates] with this level or lower is shown.” The curves for Russia and Uganda did not plateau in such charts – behaviour indicative of fraud.

Here, we show a similarly plotted cumulative vote as a function of turnout for the Malaysian General Elections. Do note the plateau at a little bit past 90% turnout rate.
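The construction of this curve is straightforward. A minimal sketch, with hypothetical electorates: sort electorates by turnout ratio, then accumulate the winning party’s votes.

```python
def cumulative_votes(electorates):
    """electorates: list of (turnout_ratio, party_votes) pairs.
    Returns (turnout, cumulative votes) pairs, sorted by turnout, where
    each point sums votes over all electorates at that turnout or lower."""
    total, curve = 0, []
    for turnout, votes in sorted(electorates):
        total += votes
        curve.append((turnout, total))
    return curve

# Hypothetical electorates for illustration
curve = cumulative_votes([(0.82, 30_000), (0.75, 25_000), (0.88, 20_000)])
```

Plotted, a fair election’s curve flattens out (plateaus) as turnout approaches its maximum, which is exactly the plateau noted above at just past 90%.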

### Voter Growth

Another allegation concerned sudden increases in voter counts in various electorates. While such a factor is already considered in the previous analysis, this author decided to single out the issue for additional analysis. If fraud were committed by means of voter growth, we would expect to see a correlation between growth and votes for the winning party.

The figure below shows correlation between the proportion of population who voted for Barisan National and voter count growth per electorate. Both axes are in percentages.

A few negative-growth electorates were removed from the analysis, as was one electorate with a growth rate above 100% (PUTRAJAYA).

A few data points are interesting. Barisan National lost in about half of the fastest-growing electorates, which gives credence to the theory that the opposition, PR, managed to mobilize voters to their advantage there. The largest growth outside PUTRAJAYA was in SUBANG. Prime Minister Najib Tun Razak’s own electorate of Pekan, hotly debated as a prime location for fraud, had only the 11th-largest growth.

All in all, however, the data did not have any indication of suspicious activity.
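The growth-vs-share relationship examined in this section boils down to a correlation coefficient. A minimal Pearson-correlation sketch, with made-up growth and vote-share figures (not the actual per-electorate data):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

growth = [0.05, 0.12, 0.31, 0.08, 0.22]       # hypothetical voter growth
bn_share = [0.55, 0.48, 0.41, 0.61, 0.44]     # hypothetical BN vote share
r = pearson(growth, bn_share)
```

A strongly positive `r` on the real data would have supported the fraud-by-voter-growth allegation; the absence of such a correlation is what the section above reports.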

### Marginal Analysis

One final analysis that can be done is the same as above, except only performed with seats that were won by BN with a small margin (say, under 2%).

A cursory analysis indicated nothing suspicious. It must be admitted, however, that the analysis is incomplete for lack of time.

## Making Sense of All of This

What does this all mean? **This author has failed to find evidence of fraud.** The numbers and statistics alone indicate that the elections were quite clean and fair. Some very small amount of fraud likely did occur; it is, however, in this author’s belief, not significant enough to have changed the result of the election.

To manipulate the number of votes in favour of Barisan National without it showing up in a statistical analysis such as this one would require tremendous amounts of knowledge.

For example, to perform any of the incremental-fraud activities, the would-be defrauders would need perfect information about the position at every polling station in the country at the moment the extra votes were brought in. Any slight change tipping the scales in Barisan National’s favour would skew a) the Benford’s Law distribution (as shown above); and b) the distribution of turnout ratios and winning ratios.

If the would-be defrauders were to rig the count in one polling station, they would skew the distributions of the votes, leading to detection. To avoid detection, they would have to adjust the count at every polling station.

A better way to do it would be to rig the numbers on Borang 14 (again, with perfect information of what the other polling stations have reported).

Another method that was brought up was to have prepared the ballots in advance. Let us examine the two ways this can be done:

- Prepare additional ballot boxes with results in advance. Switch the ballot boxes before counting begins.
- Prepare two sets of ballots – one for BN and one for PR. Top up to the desired numbers.

The first method would be a logistical nightmare. The required number of pre-prepared ballot boxes would be very large, and to rig the vote counts at one station, the other stations and electorates would have to be rigged as well, lest the tampering be discovered by statistical techniques such as those above.

The second method appears more plausible, but would again require a network of constant communication across the country’s counting stations. Since the counting process is watched by observers, this too is unlikely.

There is one final method of fraud that would elude detection, and its implications are massive. It simply requires a group of highly sociopathic individuals who are very good at mathematics, whose job is to generate fake votes convincingly enough to elude statistical detection. It could be performed as an extension of method #1 above.

The implication, as previously mentioned, is massive: if that is happening, one’s vote no longer matters. There is consolation, however, in that the idea is so ludicrous it does not have a snowball’s chance in hell of happening.

## Further Analysis

No statistical analysis is without weaknesses. Here we list some of them, leaving them as suggestions for future work – an exercise for the reader.

- The resolution of the data is extremely poor. Higher levels of aggregation tend to mask irregularities at lower levels. In the Klimek paper, the data resolution goes down to the polling-station level; this cannot be done for Malaysia. However, Borang 14 data, should they be uploaded to the internet, could provide a lower level of aggregation.
- The analysis concerns itself with only P-level elections due to time constraints. Further analysis could be done, on the N level as well as a combined analysis.
- As stated above, marginal analysis could potentially be revealing, however not much was done. Future analysis should also be aware of the small sample sizes involved and take that into account.
- Proper variance analysis was also not done. One would expect binomial variance, and if the variability of votes for Barisan National were significantly less than binomial, it would be suggestive of fraud. Cursory analysis above, however, indicates that the variance is indeed binomial.
- Beber and Scacco’s (2008, 2012) hypothesis that human-generated numbers tend to end in 7s and 5s could also be used to test the distribution of vote counts.
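The last-digit idea in the final item above can be sketched as a chi-square test for uniformity: under honest counting, the last digits of vote counts should be roughly uniform, so a statistic far above the 5% critical value for 9 degrees of freedom (about 16.92) would hint at human-fabricated numbers. The function below is a minimal sketch of that test, not the published authors’ code.

```python
from collections import Counter

def last_digit_chi2(counts):
    """Chi-square statistic of the last digits of vote counts against a
    uniform distribution over 0..9 (9 degrees of freedom)."""
    digits = [c % 10 for c in counts]
    expected = len(digits) / 10
    tally = Counter(digits)
    return sum((tally.get(d, 0) - expected) ** 2 / expected
               for d in range(10))
```

Applied to perfectly uniform last digits the statistic is 0; a heavy excess of 5s and 7s would drive it up.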

## Conclusion

From the data, the 13th General Elections of Malaysia can be concluded to be quite fair. This author has failed to detect any irregularities by means of statistics. This is not to say that fraud did not happen, given leaked evidence of fraud in the form of communiqués between high-ranking officials. But if this election was fraught with fraud, it was not through incremental fraud (ballot stuffing, “bangla” voters, extra ballot boxes and the like) or extreme fraud (swapping of results).

There were allegations of voter intimidation and blackmail (invoking what is known as the 13th May event). This author is unable to account for such activities within this analysis, as the data arising from such events would probably fall in line with our model. This is left to a Royal Commission of sorts to figure out.

Here this author would also like to comment on malapportionment and gerrymandering. Both are very much tied to the bedrock of modern representational democracy, and can often be considered rules of the game. Fixing them would require a massive upheaval of the democracies we are used to. While this author has some ideas about what could be done (one idea is to do away with apportionment altogether and return to Greek-style democracy, but that is just crazy), it is very much outside the scope of this analysis and hence only a passing remark.

PR won the popular vote in this General Election. Were this author to give political advice, it would be this: stop chasing electoral fraud, and start campaigning on the issues that matter in seats that do not represent many people. Win by the fringes, just as Barisan National did.

Hi, I am not sure whether your conclusion will still be true if you confine your analysis to the seats BN won by small majority. PR has identified 27 seats today. What do you think? Thank you.

We have identified 17 low-margin electorates, but no conclusion can be made.

We do not take sides on this issue. This analysis was done as a matter of academic interest.

Thank you for your analysis and reply.

I did not mean to accuse you of taking sides. You made it clear at the beginning of your article.

The reason I asked is that PR has alleged that fraud is unlikely to occur in constituencies where BN or PR was likely to win by a big margin, which makes sense to me. So, as a matter of academic interest, would your conclusion as to the existence of fraud be different if you had taken into account only the 17 constituencies you identified as being won on low margins?

Thank you again.

Hi Abdullah,

No. In the section above on marginal analysis, we indicated that it was too low a sample size to make any conclusions meaningful. There are some statistical methods for handling small samples, but we have not done them (time reasons), although I have a feeling they will be equally inconclusive as to whether fraud has occurred.

Very academical… but a good read though. :)

~ OnDaStreet

Thank you. We hope you enjoyed it.

What about the cumulative regression analysis method reported here? https://iweb.cerge-ei.cz/pdf/gdn/RRCX_96_paper_01.pdf

I’m not sure if I buy that Benford’s Law or Klimek’s method would be able to detect the kind of fraud that it takes to win this election. Was your Benford’s Law analysis on leading digits or trailing digits? If it was on leading digits then it would not be able to detect ballot manipulation in marginal areas as such manipulation would probably not change the leading digits of data (e.g. 3400 BN – 3500 PR -> 3671 BN – 3500 PR).

Also, while I was initially interested in the Klimek approach, I was disappointed to see that the variance of turnout in our electoral data is very small, and in particular (as you correctly noted) there were no 100% turnout constituencies – and indeed I have not seen that claim anywhere on social media, so I’m not sure where you saw it. Thus, I doubt that any fraud would be detectable on the level of analysis that Klimek performs.

In any case I agree with you that data on the level of individual streams would be far more amenable to analysis; do you know if such data would include number of registered voters as well as actual votes counted, for the purpose of %turnout analysis?

Hey Shern Ren,

Regarding Vorobyev (2011): we were aware of the paper, and decided that, in the interests of time, the Klimek et al. (2012) approach was close enough. It should be noted that the Klimek et al. paper also takes similar things into account, such as right/left-handed variance, though the statistic used is a bit different (winning ratio and turnout ratio, instead of the winning-vs-losing ratio in Vorobyev). We would encourage anyone interested to also repeat Vorobyev’s regression.

Somehow that was left out of the Further Analysis section. Sorry bout that.

Anyway, good evening, it was a pleasure reading your critique Shern Ren, and thank you for that.

Nice work! In case you’re not already in this group, do participate and contribute there in the future, if you can. https://www.facebook.com/groups/641535625862722/

Good morning Yang Jerng,

Thank you for the invite. Others in the group may join, though I cannot speak for them – Facebook is not really my thing.

A small part of me is both slightly disappointed and relieved after going through your analysis. I’m not in the field of mathematics, but from all the data and analysis, combined with my small knowledge of statistics, I find your statistical analysis very, very convincing. Thank you for putting effort into all this.

I guess… it is time for us to grow up and accept that the fight is still a long way more to go.

I particularly liked the last 2 paragraphs of “Making Sense of All of This” section. It made me laugh which i haven’t done much when it comes to issues of GE13.

Keep up the good work.

Indeed, committing electoral fraud is one thing; making it statistically undetectable by VERY careful rigging is not something easy to do, especially when polling booths are not controlled by a network of statisticians wired to one another. It will be interesting if we also isolate the seats claimed by PR to have had fraud and analyze them separately; perhaps the results will show more. I also like the figure where BN wins in areas with lower voter growth. I guess BN did invest their “efforts” in the right places, where there are fewer voters to impress. When we look at the results, BN’s win this time was contributed by large gains in Sabah and Sarawak; in the Peninsula, the fight was very close. It would be interesting to have Sabah & Sarawak in this analysis, as this work omitted them and the Federal Territories. Thus, I don’t think it is fair to conclude “no fraud” as of yet, until we see the whole picture. The large presence of PACABAs too might have deterred some intended fraud. SPR, SPR, if the dakwat was kekal, many would have trusted you. A lot of people lost that trust once they washed their hands that night. Thank you for your nice paper, I enjoyed it a lot :)

Good evening roaccchz,

Sabah and Sarawak and the Federal Territories were included in the analysis of voter turnouts and winning ratios, since they do have P-level electorates being contested. Sabah, Sarawak and the Federal Territories were not included in the State/Parliament discrepancy calculations.

We do not make a claim that there was no fraud. It is a very small but important distinction that we found no evidence of fraud, because the amounts of irregularities found were insignificant.

We’re glad you enjoyed the paper.

Your variance/discrepancy for Johor’s registered voters at Parliament level is not 0.01% but is actually 0.1747%. I have cross-checked your data at PasteBin against SPR’s data on their website. With this difference, the variance/discrepancy for actual voter turnout is not 0.12% but 4.07%. So far I have checked and found discrepancies for one state; there are 14 more states and territories to go. If the discrepancies are cumulative, what are your graphs actually showing?

I am sorry, I counted wrong: it’s not 4.07%; it should be 0.15% instead.
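For anyone wishing to reproduce this kind of check, the discrepancy here is just a relative percentage difference between two counts. A minimal sketch with made-up counts (the real figures are in the PasteBin data and on SPR’s website):

```python
def pct_discrepancy(reported, official):
    """Relative difference between two counts, as a percentage of the official figure."""
    return abs(reported - official) / official * 100

# Hypothetical counts for illustration only, not the actual Johor figures.
print(round(pct_discrepancy(150_200, 150_000), 4))  # prints 0.1333
```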

Hey,

We’ll double-check in the morning, as it’s quite late and close to bedtime here.

I also agree with tayyz1990. Waited with bated breath as I read through your analysis, trying very hard to take my time to make sense of all the numbers and logic (since I’m no mathematician/statistician @.@). Feel slightly ashamed to have made certain claims, as your analysis proves me (and manyyyyyy others) wrong.

I also applaud your effort and the time invested in doing this. I, and many others who struggle to remain as unbiased as possible (which is very difficult to do, what with all the overwhelming “information” on social media and elsewhere), needed an unbiased and neutral commentary. Thank you. :D

We are glad you enjoyed it.

Thank you for your effort.

I applaud your effort. There are a couple of things I would be interested in:-

– Firstly, why did you choose to include results using Benford’s Law when you go on to say the following: ‘While the Benford Law distribution has been established as not a very good measure for detecting election frauds’? From the sound of it, it seems to be a disputed method for detecting election fraud. If you were to try to make a convincing argument using Benford’s Law, I would suggest including what the figures would look like if ‘fraud’ were introduced. For example, if you reversed the results of the 27 Parliamentary seats disputed by Pakatan Rakyat, how different would the resulting plot be from the Benford plot?

– As this is my first time hearing about Benford’s Law, I would just like to point out a couple of points which cause me concern. In the first two figures, the graphs closely match the Benford fit, while the ones you have simulated (all & 0,0) don’t. Do you have any insight as to why? If it is the case that Benford’s Law is not a proven reliable marker, I think you should put less emphasis on it in the article, to improve readability.
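For other readers meeting Benford’s Law for the first time: the expected frequency of leading digit d is log10(1 + 1/d). A minimal sketch, using toy vote counts rather than the GE13 data, of how observed leading digits are compared against that expectation:

```python
import math

def benford_expected():
    """Expected first-digit probabilities under Benford's Law: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit_freqs(values):
    """Observed relative frequency of each leading digit 1-9."""
    digits = [int(str(abs(v))[0]) for v in values if v != 0]
    return {d: digits.count(d) / len(digits) for d in range(1, 10)}

# Toy vote counts for illustration only, not the GE13 data.
votes = [1234, 1987, 2045, 3110, 1042, 5230, 1761, 9004, 2870, 1409]
observed, expected = first_digit_freqs(votes), benford_expected()
for d in range(1, 10):
    print(d, round(observed[d], 2), round(expected[d], 3))
```

Large deviations between the two columns are what a Benford-style test flags; as noted in the article, though, this is a disputed marker for election fraud.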

It’s a very interesting field, and I for one am new to it, so I apologize if that shows in my comments.

Best regards.

Good evening JC,

Lastly, we were all beginners in statistics once. We all make mistakes. Do not apologize for learning.

The reason for my comments and questions was not that I was particularly interested in the maths, but that I was trying to gauge which of the methods were more reliable. It seems you have answered with Klimek below. My only worry about this article is how accurate these tests are. Things like sensitivity and specificity deserve a better mention: you wouldn’t provide a test for cancer without quantifying how accurate it is. As an example, regarding the plot ‘Comparisons of actual election data with different values of fi and fe’ (and maybe I am interpreting it wrongly), I can think of a couple of instances where a false positive for fraud might occur. This is an extreme case just to illustrate the point, but let’s say a single party wins all the seats, yet half the country had poor turnout because of floods. Wouldn’t that cause a bimodal distribution, which from the looks of it seems to point to fraud? Obviously this is beyond my scope of expertise, but as always I look forward to hearing your feedback, and that of others.

Hi,

Sensitivity and specificity were not included for editorial reasons, which in retrospect was not a good idea. We thought we’d just put up a fun article on the General Elections and not bother with the tables you usually find in the appendices. Boy, you Malaysians take your politics and academics rather seriously. It’s a lame excuse, no doubt, and whether we rectify it by posting them, well, depends on time.

But yes, you are correct – we should have mentioned statistical power. We had not. Going by memory, they were okay and acceptable by normal standards.

And regarding floods, yes you are also correct. It would cause a bimodal distribution, but like the Canadian case, we would be sure to mention it.

I think that instead of just putting your work online, you should submit it to an international journal and see how it stands (for academic purposes). Post the reviewers’ comments online, and then we will see whether your conclusion stands or not.

Also, please do add a disclaimer saying that you are beginners and are learning this. I think it would also help if you covered the limitations of Benford’s Law more thoroughly. Thank you.

Good morning Kuhan,

While we are familiar with publishing (I see you’re a well-published astrobiologist as well), we do not intend for this to be published. This was done on a whim, and a lot more polishing needs to be done to even get it up to publishing standards. There were a lot more tables and such which we had omitted from this blog entry (error tables, etc.; boring stuff that is typically added to the appendix), as this was meant to be a quick and dirty job.

We’re not really beginners, though; statistics is kind of our day job. We’ve already identified the limitations of Benford’s Law. The detour to Benford’s Law is mainly an interest/curiosity thing; the meat of the analysis is the Klimek et al.-style analysis.

Thanks for the reply. One of your comments to JC indicated that you are “beginners and learning”. That’s the reason I suggested a disclaimer.

Well, the reason I suggested publication is simply for peer-review purposes. I think your work will stand the test of credibility better. Even if the paper is rejected, please don’t be disheartened, since it will only enhance your research work.

Good morning Kuhan,

I fear you may have misread us. I said “we were all beginners” to JC, assuring JC that it was absolutely fine to ask questions.

The reason why we have put up the source code and data is exactly for the reason you’ve mentioned: peer review. The code and data are up for anyone with competence to review. We have responded to various criticisms, but they mainly arose from poor understanding of our analysis, which we agree was not written as clearly or as lucidly as it could have been.

Whoever did this, I really respect you. Spending time and energy solely to study and to “meredakan gelombang perpecahan rakyat Malaysia” (calm the wave of division among the Malaysian people). Congratulations!

I don’t consider the author’s effort a waste of time… it is not merely a temporary calming; rather, I hope it becomes an example and a lesson for all Malaysians to avoid placing too much trust in rumours. Rational analysis needs to be brought to the fore to temper reactions that are overly emotional.

Haha,

You flatter us. We were just bored, really.

Sorry for asking, but Google Translate seems to say something about diffusing a split wave?

“Meredakan gelombang perpecahan rakyat Malaysia” means to reduce the split being seen among Malaysian citizens. Actually, I don’t think he meant you wasted your time; he said (in a crude way) that you used your time and energy for research and to reduce the split being seen among Malaysian citizens.

Election fraud should be settled in a court of law, not in or by an article, whatever its academic merit. A party in the dispute could present the article in court to support its case. BN or PR could start recruiting teams of researchers to do the job, if they do not already have such teams.

Well done! It is not easy to do statistical analysis. From what I understand, these statistics somehow show that, in terms of numbers, the irregularities in the recent election were indeed close to 0.

However, analysing these numbers will never be able to detect issues such as:

1. Vote-buying activities. The numbers will not be able to tell you in which areas this occurred, because if it did, those whose votes were bought are legitimate voters, and therefore their votes are valid.

2. Clones. I have read of several cases where someone went to vote and, on arriving at the polling station, found that their name had already been crossed off, because a ‘clone’ of them had already come and voted there. Once again, the numbers will not be able to detect this, because the same person will not vote twice at the same station.

3. One person voting in 2-3 different places. Once again, under the existing system, such a person’s votes are counted as valid, and the numbers will not be able to detect any irregularities.

4. People who have died, but whose names are still on the roll of valid voters, and who go and vote… ha ha ha… now that is a real ghost…

5. Fraud in postal votes. All sorts of fraud can occur here that the numbers cannot tell: threats of demotion, one person voting for 9-10 constituencies, etc. There are too many, because these numbers can only tell you that a person is a legitimate voter and that their vote is valid.

6. Fraud at the vote-tallying centres (where the votes are totalled).

7. Legitimate phantom voters! This, too, the numbers will not be able to detect, because under the system, whether they are from Bangladesh, Indonesia, Vietnam, etc., as long as their names are registered in the system, i.e. they are legitimate voters, their votes are valid.

If anyone else is interested in doing further research, I would suggest the following:

1. An analysis of majority votes vs spoilt votes

2. An analysis of abnormal increases in new voters

3. An analysis of voter turnout: why are some areas high and others low?

EDITED BY BLOG AUTHOR TO INCLUDE GOOGLE TRANSLATION (for easier replies):

Hi Nizam,

I have attached a Google Translated document to enable easier access when replying. I would like to respond to some of the issues you raised:

Vote-buying activities. You are correct that this analysis will not be able to pick up vote buying, mostly because vote buying will show up as legitimate votes. Here’s a question: what prevents a voter from taking the money, promising to vote for party 1, and then going into the booth and voting for party 2? This is of course a naive way of looking at vote buying; the more sophisticated methods would be to promise an entire village electricity or water supply if the party wins. We will not comment on these more sophisticated methods, as the line of delineation between vote-buying behaviours and election promises becomes fuzzy.

Incremental Fraud. For your points 2, 3, 4, 5 and 7, please consider this scenario with simplified numbers. Imagine an electorate with 2000 registered voters, and say 1500 people turned out to vote. Let’s call this turnout ratio the Expected Turnout Ratio: 0.75. That is to say, if no fraud had happened, we would expect the turnout ratio to be 0.75. Out of the 1500 people who voted, let’s say 1000 voted for the Opposition Party and 500 voted for the Government Party. We call the ratio of votes cast for the winning party the Expected Winning Ratio: 0.33. If there were no fraud, the Government Party would lose, since 500 is less than 1000.

So now, let’s assume that the Government Party wants to defraud the electoral system. They need more than 500 additional votes in order to win this electorate. For convenience, let’s just say they stuff the ballot boxes with 700 votes, to ensure they win with a large enough margin: 500 + 700 = 1200. From this alone, you can see that the voter turnout is now 1200 (total votes for the Government Party) + 1000 (votes for the Opposition Party) = 2200. The new turnout ratio is 2200/2000 = 1.1, a marked increase compared to the expected one. The Winning Ratio is now 1200/2200 = 0.54.

Let’s say the Government Party is more sophisticated than merely stuffing ballots: they are in cahoots with the Electoral Commission, whom they tell to junk 300 votes for the Opposition Party; declare them invalid votes, so to speak. The voter turnout is now 1200 + 700 = 1900 (the Opposition’s 1000 votes less the 300 marked as spoilt). The turnout ratio is now 1900/2000 = 0.95, and the Winning Ratio is now 1200/1900 = 0.63.
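For anyone who wants to play with the numbers, the worked example above can be sketched in a few lines of Python. This is only a restatement of the toy scenario, not the analysis code used for this article:

```python
def ratios(gov_votes, opp_votes, registered):
    """Return (turnout ratio, winning ratio) for a two-party electorate,
    where the Government Party is the (eventual) winner."""
    turnout = gov_votes + opp_votes
    return turnout / registered, gov_votes / turnout

# No fraud: 500 government votes, 1000 opposition votes, 2000 registered.
print(ratios(500, 1000, 2000))         # (0.75, 0.33...)

# Ballot stuffing: 700 extra government votes.
print(ratios(500 + 700, 1000, 2000))   # (1.1, 0.54...)

# Stuffing plus 300 opposition votes declared spoilt.
print(ratios(1200, 1000 - 300, 2000))  # (0.95, 0.63...)
```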

Here’s a quick summary:

– No fraud (expected): Turnout Ratio 0.75, Winning Ratio 0.33
– Ballot stuffing: Turnout Ratio 1.1, Winning Ratio 0.54
– Stuffing plus spoilt votes: Turnout Ratio 0.95, Winning Ratio 0.63

The takeaway conclusion is simply this: if Incremental Fraud is engaged in, both the Turnout Ratio and the Winning Ratio will increase compared to the expected values. As to how the expected values are computed, see the Klimek et al. paper.

Let us then consider point #2:

Being cloned. The translation is not clear (correct me if I’m wrong), but from what I understand, it means that there were people who had their identities stolen, and who discovered that they could not vote because someone else had already voted in their stead. Either way, if a person’s vote is cast twice (once when their actual vote is cast, and once when someone else votes in their stead), it is simple ballot stuffing. See the examples above on why it can be detected.

One person voting in 2-3 electorates. Also ballot stuffing. If this one person votes for any party any number of times, the turnout ratio and the winning ratio will increase compared to the expected values. It is precisely these kinds of behaviour that this analysis is good at detecting.

Legitimate phantom voters. Tainted electoral rolls are an interesting one to figure out as well. We did have a lot of fun discussing these particular scenarios when doing the analysis. Let us return to the example from #2 above, and assume that the government is even more sophisticated than ever: they now add 700 people to the electoral roll to account for the additional 700 votes they stuffed. The turnout ratio is now 0.7. Does this go against what was said in #2 above? Not really: if you consider the distribution of voter turnout, deviations can go both ways. One simply has to set a range of n sigmas to filter out the places with extreme turnouts. Granted, this is not something we put a lot of effort into; all we did was take a cursory look, and we found nothing of interest to warrant deeper analysis. It could be that the little blob in the top left corner of the actual data in this chart is indicative of something of this sort. However, we felt that if such an effort were to be carried out, it would require a lot of pre-planning and forethought, as well as plenty of logistical nightmares. While possible, it was improbable given the high amount of effort, knowledge and (near) perfect information required to keep it a secret. Of course, Wikileaks-style leaks could happen, and in our opinion, that would be fairly interesting to watch. Please feel free to perform your own variance analysis on this issue with varying (fi, fe) values.

Borang 14. Nonetheless, rigging would require near-perfect information.

Regarding your suggestions:

Please feel free to engage in your own analysis! We’ve spent a bit too much time on this already, and are ready to retire from the Malaysian elections analysis for other engagements.
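The n-sigma filtering mentioned a few replies up can be sketched as follows. This is only an illustrative outlier screen on made-up turnout ratios, not the screening actually used for this article:

```python
from statistics import mean, stdev

def turnout_outliers(turnouts, n_sigma=3.0):
    """Indices of electorates whose turnout ratio deviates from the mean
    by more than n_sigma sample standard deviations."""
    mu, sigma = mean(turnouts), stdev(turnouts)
    return [i for i, t in enumerate(turnouts) if abs(t - mu) > n_sigma * sigma]

# Toy turnout ratios; the 1.10 at the end is suspiciously high.
turnouts = [0.74, 0.76, 0.75, 0.73, 0.77, 0.75, 0.74, 0.76, 1.10]
print(turnout_outliers(turnouts, n_sigma=2))  # flags the last electorate
```

Anything this flags is only a candidate for closer inspection, not proof of fraud; unusually high turnout can have innocent explanations.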

After giving it much thought, my conclusion is that these methods are just indicators, and are by no means conclusive; in the same way, a normal heart rate is an indicator of good health but does not necessarily guarantee it. My only qualm is that this could have been made much clearer at the start of the article.

In addition (as expressed above), there is no mention of how accurate these methods are, and even if there were, I would have my doubts about how that accuracy was calculated. Maybe it is done differently in this field of science, but I would imagine that to calculate accuracy you would require ground-truth data sets of elections where no fraud occurred and elections where fraud occurred. In each fraud data set you would need to know the exact number of people who committed fraud, which is logically speaking impossible, since who is going to admit to committing fraud? Also, given that you can’t simulate fraud data, since you have no ground truth against which to check whether your simulation is correct, I just don’t see how accuracy can be calculated. Furthermore, there are so many variables which can affect the accuracy, such as demographics in terms of age/gender/ethnicity/wealth/religion, weather, etc. A method which works in one country might not necessarily work in another.

Since it is unreasonable for you to address everything that has been mentioned in the comments section, and since your last reply indicates a reasonable reluctance to do so, you should really at least mention at the start that these statistical methods are merely indicators and not conclusive proof.

Haha, by the way, it’s funny you should mention that Malaysians take politics and academics rather seriously, when it is you who took the effort to write the article :). Anyway, I am glad to see other people taking an interest in Malaysia. Have a good weekend.

Indicators, clues, signs, hints, educated guesses, whatever you want to call them, are better than nothing. Your talk about accuracy is merely beating around the bush… Maybe this analysis will trigger further discussion from more level-headed Malaysians? Social media is so stuffed with rumours from cybertroopers that reading this analysis gives us a breath of fresh, rational air.

Accuracy is important, I can assure you, and I am not merely beating around the bush. If a weather forecast with an accuracy of 1% predicted rain, would you bring an umbrella? If a test for cancer with a known accuracy of 1% came back positive, would you choose to believe it? My remarks are just to point out to readers that while this paints a picture, it is not the full picture; as with everything in life, a little thought and consideration can take you a long way.

Hi! I am a college student from the Philippines, and I’m doing a little research about Malaysian politics. Would you be interested in a small discussion about it? Let me know, thanks!