
How to Lie With Statistics


There are three kinds of lies: lies, damned lies, and statistics
Statistics, infographics, data analysis, data science: who isn’t doing them these days? Everyone knows how to do it right; all that is left is for someone to write about how you SHOULDN’T do it. This article tries to fill that gap.

image
(Robert Hazen, "Curve fitting". Science, 1978.)

Article structure:
  1. Lead
  2. Sampling Bias
  3. Well-chosen average
  4. 10 more failed experiments we have not written about yet
  5. Playing with scale
  6. Selecting 100%
  7. Hiding main numbers
  8. Visual metaphor
  9. Example of a good visualization
  10. Conclusion and what to read next

Sampling Bias


On election night of the 1948 US presidential election (Truman vs Dewey), the Chicago Tribune printed what is arguably its most famous headline: DEWEY DEFEATS TRUMAN (see photo). As soon as the polling stations closed, the newspaper ran a telephone poll of a large number of voters, and everything pointed to a landslide victory for Dewey. The photo shows a laughing Truman, the winner of the election. What went wrong?

image

The people called really were picked at random, but in 1948 a telephone was something the well-off had and the poor rarely did. The method itself therefore skewed the distribution of votes. The sample did not cover a broad enough layer of Truman voters (Democrats usually lead among poorer people), who had no phones. A sample skewed in this way is called biased, and the effect is called sampling bias.

Folklore about this phenomenon:
According to an internet poll, 100% of people use the internet
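The mechanism can be sketched with a little arithmetic. Every figure below (group sizes, phone ownership, support for Truman) is invented for illustration and is not 1948 data; `support_estimate` is a hypothetical helper, not anything the newspaper used.

```python
# Hypothetical electorate: group -> (share of voters, share owning a phone,
# share voting for Truman). All three figures are made-up assumptions.
GROUPS = {
    "well-off": (0.40, 0.90, 0.40),
    "poor":     (0.60, 0.20, 0.60),
}

def support_estimate(groups, phone_only):
    """Truman's share among all voters, or only among those reachable by phone."""
    reached = 0.0
    for_truman = 0.0
    for share, has_phone, truman in groups.values():
        # A phone poll can only weight a group by the part of it that owns a phone.
        weight = share * has_phone if phone_only else share
        reached += weight
        for_truman += weight * truman
    return for_truman / reached

true_share = support_estimate(GROUPS, phone_only=False)
poll_share = support_estimate(GROUPS, phone_only=True)
print(f"actual support: {true_share:.1%}")
print(f"phone poll:     {poll_share:.1%}")
```

With these assumed figures the candidate has a majority among all voters but a minority among those the poll can reach, although nobody was called dishonestly.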
Graduates’ salaries
Has anyone else been surprised that, whenever we hear about graduates' salaries, the numbers are fabulously large? In the USA it has even come to lawsuits in which graduates claim that the published salary data is artificially inflated.

image
(Source: How to Lie with Statistics)

This is quite an old problem: according to Darrell Huff, the question came up with the Yale class of 1924. In fact everyone is telling the truth, just not all of it. The statistics were collected by survey (in those years, by paper mail). Not everyone replies, only a minority of the university's graduates. Those who are doing well (which often means a good salary) are more likely to answer. So we see only the “good” part of the picture. This is what creates the sampling bias and makes the results of such surveys practically useless.
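A small simulation gives a feel for how strong this effect can be. It is only a sketch: the salary distribution and the rule that better-paid graduates reply more often are assumptions made up for the example, and `responds` is a hypothetical function.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical graduates: salaries from a skewed (log-normal) distribution.
salaries = [rng.lognormvariate(10.5, 0.6) for _ in range(10_000)]

def responds(salary):
    """Assumed behaviour: the higher the salary, the likelier a reply (capped at 90%)."""
    return rng.random() < min(0.9, salary / 100_000)

# Only the graduates who answer the mailed survey end up in the statistics.
replies = [s for s in salaries if responds(s)]

print(f"response rate:          {len(replies) / len(salaries):.0%}")
print(f"mean of all graduates:  {statistics.mean(salaries):,.0f}")
print(f"mean of those replying: {statistics.mean(replies):,.0f}")
```

Every reply is truthful, yet the mean of the replies comes out above the mean of the whole class, because the chance of replying grows with the salary.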

Well-chosen average


Imagine a company where the chief earns about $25,000, his deputy $7,600, top managers $5,500, middle managers $3,500, junior managers $2,500 and regular employees $1,400 per month.

Our task is to present the company in a favorable light. We can write that the average salary in the company is X. But what does “average” mean? Let us look at the options (see the diagram below).

image

The arithmetic mean of a finite set X = {xi} is the number m = mean(X) given by the equation:
image
From an employee’s point of view this is the least useful number: an average salary of $3,472. Where does such a large figure come from? From the high salaries of the management, which create the illusion that an ordinary employee will earn as much.

The median of a distribution P(X) (X = {xi}) is a value m that satisfies the following equation:
image
Put simply, half of the employees earn more than m and half earn less, so m is exactly the middle of the distribution. This statistic is very informative for the company's workers, because it shows how one employee's salary compares with what most employees earn.

So, depending on the situation, “average” can mean any of the values above. That is why it is very important to understand how an average was calculated.
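The gap between the two averages is easy to check. The text gives the salaries but not the number of people at each level, so the headcounts below are an assumption, chosen so that the mean reproduces the 3,472 figure quoted above.

```python
import statistics

# Monthly salaries from the example; the headcounts are NOT given in the
# text and are assumed here so that the mean matches the quoted figure.
STAFF = [
    (25_000, 1),   # chief
    (7_600, 1),    # deputy
    (5_500, 2),    # top managers
    (3_500, 5),    # middle management
    (2_500, 3),    # junior managers
    (1_400, 13),   # regular employees
]

# Expand the table into one salary per person.
salaries = [pay for pay, count in STAFF for _ in range(count)]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)
print(f"arithmetic mean: {mean_salary}")
print(f"median:          {median_salary}")
```

Under these assumed headcounts the mean is 3,472 while the median is 1,400: the same payroll, described by two very different "averages".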

10 more failed experiments we have not written about yet


Our studies show that Doake's toothpaste is 23% more effective than its competitors, and all of it thanks to Dr. Cornish's Tooth Powder! You may be surprised, but the studies really were conducted, and a technical report was even released. The experiment really did show that the toothpaste is 23% more effective than the competitors' (whatever that means). But is that the whole story?

In reality, the sample for the experiment consisted of a dozen people (according to Darrell Huff and the book mentioned above). That is exactly the kind of sample you want if you are after any result at all! Imagine that we toss a coin five times. What is the probability of getting heads all five times? (1/2)^5 = 1/32. If you got heads all five times, it cannot be a coincidence, can it? Now imagine that we repeat the experiment 50 times. The chance that at least one of the attempts succeeds is about 80%. That is the attempt we write up in our report, and all the others we simply leave out. This way we get purely random data that fits our purpose perfectly.
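The arithmetic behind the coin example can be checked in a few lines. This is a sketch that assumes a fair coin and independent runs; the variable names are illustrative.

```python
from fractions import Fraction

# Probability that one run of five fair tosses comes up heads every time.
p_five_heads = Fraction(1, 2) ** 5

# Probability that at least one of 50 independent runs is all heads:
# the complement of "every run fails".
runs = 50
p_at_least_one = 1 - (1 - p_five_heads) ** runs

print(p_five_heads)
print(f"{float(p_at_least_one):.2f}")
```

A single run succeeds once in 32 tries, but with 50 runs to choose from the odds of having at least one "impressive" result to publish are roughly four in five.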

Playing with scale


Suppose that tomorrow you have to show in a meeting that we are catching up with the competitors, but the numbers say something slightly different. What do we do? Let us shift the scale a little! Even the New York Times, famous for its careful handling of data, produced a highly confusing graph (note the jump from 800k to 1.5m in the middle of the scale).

image
(Source: Howard Wainer, How to Display Data Badly. The American Statistician, 1984.)
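A simpler relative of the broken scale is an axis that does not start at zero. The sketch below shows how the drawn ratio of two bars depends on where the axis starts; the sales figures and the `apparent_ratio` helper are made up for the example.

```python
def apparent_ratio(ours, theirs, axis_start=0.0):
    """Ratio of the drawn bar heights when the value axis starts at axis_start."""
    if axis_start >= min(ours, theirs):
        raise ValueError("axis must start below both values")
    return (ours - axis_start) / (theirs - axis_start)

# Made-up figures: we sell 90 units, the competitor sells 100.
honest = apparent_ratio(90, 100)                    # axis starts at zero
truncated = apparent_ratio(90, 100, axis_start=80)  # axis starts at 80
print(honest, truncated)
```

The numbers are identical in both cases; only the starting point of the axis changes, and our bar goes from nine tenths of the competitor's height to one half.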

Selecting 100%


Suppose that last year milk cost 10 cents per liter and bread cost 10 cents per loaf. This year the price of milk fell by 5 cents and the price of bread rose to 20 cents. The question is: what do we want to prove?

Let us take last year as 100%, the base for the calculation. Then milk now costs 50% and bread 200% of last year's price; the average is 125%, hence prices in general rose by 25%.

image

Let us try once more with the current year as 100%: last year milk cost 200% and bread 50% of today's price. This means that prices were 25% higher last year.

image
(Graphs and examples are from the chapter "How to Statisticulate" of How to Lie with Statistics)
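Both calculations can be reproduced with the same function by swapping the base year. The sketch takes bread at 20 cents this year, which is what the 200% figure implies; `average_percent` is an illustrative helper.

```python
last_year = {"milk": 10, "bread": 10}   # cents
this_year = {"milk": 5, "bread": 20}    # cents

def average_percent(prices, base):
    """Unweighted mean of each price as a percentage of its base price."""
    return sum(100 * prices[k] / base[k] for k in base) / len(base)

now_vs_then = average_percent(this_year, base=last_year)  # last year = 100%
then_vs_now = average_percent(last_year, base=this_year)  # this year = 100%
print(now_vs_then, then_vs_now)
```

Both calls give 125: prices "rose by 25%" and "were 25% higher last year" at the same time, which is exactly the contradiction that the choice of base makes possible.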

Hiding main numbers


The best way to hide something is to divert attention. For example, let us look at the number of private and public schools (in thousands) over the years. As the graph shows, the number of public schools is falling while the number of private schools is not changing.

image

In fact, the growth of private schools is hidden by plotting it alongside the number of public schools. The two differ by an order of magnitude, so any change in the smaller series is barely noticeable on a scale with such a large step. Redraw the number of private schools separately, and we clearly see the significant growth that was “hidden” on the previous graph.

image
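The effect of the shared axis can be put into numbers. The school counts below are invented (the real figures live only in the graphs), and both helper functions are hypothetical.

```python
# Invented counts in thousands; they only mimic the order-of-magnitude gap.
public = {1930: 250.0, 1980: 90.0}
private = {1930: 10.0, 1980: 20.0}

def relative_change(series, start, end):
    """Change between two years as a fraction of the starting value."""
    return (series[end] - series[start]) / series[start]

def share_of_axis(series, start, end, axis_max):
    """Fraction of a 0..axis_max axis that the change actually spans."""
    return abs(series[end] - series[start]) / axis_max

axis_max = max(public.values())  # a shared axis is sized by the larger series

private_growth = relative_change(private, 1930, 1980)
private_on_axis = share_of_axis(private, 1930, 1980, axis_max)
public_on_axis = share_of_axis(public, 1930, 1980, axis_max)
print(private_growth, private_on_axis, public_on_axis)
```

In this made-up example the private series doubles, yet on the shared axis that change covers only 4% of the height, while the decline of the public series covers 64%.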

Visual metaphor


If you have nothing to compare with but really want to confuse, it is time for inexplicable visual metaphors. For example, if we depict a quantity by area instead of length, any growth will look much more significant.
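The size of the exaggeration follows from simple geometry: if a picture is scaled so that its height matches the value, its width grows too, and the area grows with the square. A minimal sketch, with `drawn_area_ratio` as an illustrative name:

```python
def drawn_area_ratio(value_ratio):
    """Area ratio of two similar pictures whose heights differ by value_ratio."""
    return value_ratio ** 2

doubled = drawn_area_ratio(2)      # the value doubled
grew_half = drawn_area_ratio(1.5)  # the value grew by 50%
print(doubled, grew_half)
```

A value that doubles is drawn four times as large, and a 50% increase is drawn as a 125% increase in area.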

Let us look at beer consumption in the USA in 1970-1978. Take the market share of the Schlitz company (see the graph below), for example. It looks good and impressive, does it not?

image

Now let us get rid of all the “junk” on the graph and redraw it in a normal form. It no longer looks as impressive and serious as before.

image
(Graphs and examples are from John P. Boyd, lecture notes How to Graph Badly or What NOT to Do)

The first picture does not lie, all the numbers in it are correct, but it implicitly presents the data in an entirely different light.

image
(Picture is from How to Lie with Statistics)

Example of a good visualization


A good visualization puts the result first, avoids ambiguity and packs sufficient information into a compact form. Take Charles Joseph Minard's map of Napoleon's Russian campaign:
Absolutely everything here is excellent. Nobody takes the viewer for an idiot. The wide beige band shows the size of the army at each point of the campaign. In the upper right corner is Moscow, where the French army arrives and from where it begins its retreat, shown by the black band. For extra interest, time and temperature graphs are attached to the retreat route.

In the end, the astonished viewer compares the size of the army at the start with the size of what returned home. The viewer is overwhelmed: he has learned something new, he has felt the scale, he is mesmerized, he has realized that he learned nothing at school.

image
(Charles Joseph Minard: Napoleon's Retreat From Moscow)

Conclusion and what to read next

76% of all statistics are made up
This compilation does not cover every method that, knowingly or unknowingly, distorts data. Above all, the article shows that we have to look closely at the statistics we are given and at the conclusions drawn from them.

A short list of what to read next:
How to Lie with Statistics - Darrell Huff
How to Display Data Badly - Howard Wainer. The American Statistician, 1984
KlauS 30 june 2014, 14:35