How to avoid recomputing sums and averages every time
Let's imagine we have an electronic payment system with a table of operations, transfer. Say we want to find the average operation amount. That's easy, here's a query:
SELECT avg(amount) FROM transfer;
It ran for 3850 seconds.
Now imagine that the figure must always be up to date, that new rows are inserted every second, and that after a month there are millions of them. Aggregating the same data over and over is very expensive, and traditional databases do not offer such optimizations out of the box. What can we do?
Anyone who rides a bike may have wondered how a bike computer manages to show the average speed every second, over an arbitrarily long ride, without storing every speed reading. Modern bike computers have a micro SD card, but they worked fine without any such memory years ago.

The trick is simple: the computer stores only the mileage and the elapsed time. Every second it advances the time counter and adds the last second's distance to the odometer. That is just two values of 4 or 8 bytes, two additions per second, and one division to get the average.
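The same idea in code, a minimal running-average counter (my own sketch, not from the original article):

```python
class RunningAverage:
    """Keeps only two numbers: total distance and total time."""

    def __init__(self):
        self.distance = 0.0  # odometer, e.g. in meters
        self.time = 0.0      # elapsed time in seconds

    def tick(self, meters, seconds=1.0):
        # every second: add the distance covered and advance the clock
        self.distance += meters
        self.time += seconds

    def average_speed(self):
        # one division, regardless of how many observations went in
        return self.distance / self.time

avg = RunningAverage()
for meters in [5.0, 7.0, 6.0]:  # distance covered in three consecutive seconds
    avg.tick(meters)
print(avg.average_speed())  # 6.0
```

No matter how long the ride, the memory footprint stays constant.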
In the payment system we periodically need to add values to the sum or the average, and to remove them. Removal uses the same formula in reverse, provided we know which value to remove: to take a known value d out of an average of n values, compute (n·avg − d) / (n − 1).

If we do not know the removed value d, this does not work.

If a value changes, we do the same thing twice: we remove the old value from the average and add the new one.
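These three operations can be sketched in a few lines; the formulas are standard, the function names are mine:

```python
def add_value(avg, n, x):
    """Add x to an average over n values; return the new (avg, n)."""
    return (avg * n + x) / (n + 1), n + 1

def remove_value(avg, n, x):
    """Remove a known value x from an average over n values."""
    return (avg * n - x) / (n - 1), n - 1

def change_value(avg, n, old, new):
    """Replace old with new: remove the old value, then add the new one."""
    avg, n = remove_value(avg, n, old)
    return add_value(avg, n, new)

avg, n = 0.0, 0
for x in [10, 20, 30]:
    avg, n = add_value(avg, n, x)
# avg == 20.0, n == 3
avg, n = change_value(avg, n, 30, 60)
# avg == 30.0
```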
Anyone who knows some mathematics can factorize into such formulas statistics like the variance, a product, or even the product of every variable with every other one (the covariance). Here is an example.
Users fill out a questionnaire whose answers range from 1 to 100 (from a strong "yes" to a firm "no"). We need to measure how different two users are, namely the mean squared difference between their answers to corresponding questions:

D(i, j) = (1/m) · Σ_q (V_qi − V_qj)²

Here i and j are different users, q runs over the m questions, and V_qu is user u's answer to question q. We also need the difference between a user X and a group, and between a user and all other users.
Without optimization, the database query looked like this:

SELECT avg(power(my_answer.value - his_answer.value, 2))
FROM answer AS my_answer
INNER JOIN answer AS his_answer
    ON my_answer.question_id = his_answer.question_id
WHERE my_answer.user_id = X
  AND his_answer.user_id = Y
To compare against a group of users, we only need to change the condition:

his_answer.user_id IN (Z)
We can even remove the condition entirely if we compare the answers with everybody, but then, obviously, the whole table must be scanned. Of course, 50 questions times 10,000 questionnaires is only 500,000 rows, a few megabytes of memory, so it could all be cached. But users fill out questionnaires and look at the results all the time; we do not want to repeat unnecessary computations, so let's optimize.
Here is the distance formula, with the part that needs optimizing. X_q is user X's answer to question q, and V_qu is user u's answer to the same question q:

D = (1/(n·m)) · Σ_q Σ_u (X_q − V_qu)²

We total over users and questions. The questions are few (m ≈ 50), but the questionnaires are many (n ≈ 10,000). Expanding the square and isolating the sum over the 10,000 users gives the same formula in a different shape:

D = (1/(n·m)) · Σ_q (n·X_q² − 2·X_q·Σ_u V_qu + Σ_u V_qu²)

Now the data is grouped by question: each report only has to total 50 rows, while the sums over the 10,000 users are isolated and can be computed and stored ahead of time. If we label the sum of the answers to question q as S_q and the sum of their squares as R_q, we get a very compact formula:

D = (1/(n·m)) · Σ_q (n·X_q² − 2·X_q·S_q + R_q)
Sq and Rq can be calculated in advance and stored in a separate table.
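The compact formula translates directly into code. This sketch (names are mine) computes the distance between user X and the whole group from the per-question aggregates alone, and checks it against the direct computation on a toy dataset:

```python
def distance_to_group(x_answers, S, R, n):
    """Mean over questions and users of (X_q - V_qu)^2,
    computed from per-question aggregates only.
    x_answers, S, R: dicts question_id -> value; n: number of users."""
    total = 0.0
    for q, xq in x_answers.items():
        # sum over users of (xq - v)^2 == n*xq^2 - 2*xq*S[q] + R[q]
        total += n * xq * xq - 2 * xq * S[q] + R[q]
    return total / (n * len(x_answers))

# toy data: question -> all users' answers
answers = {1: [2, 4], 2: [1, 3]}
S = {q: sum(v) for q, v in answers.items()}
R = {q: sum(x * x for x in v) for q, v in answers.items()}
x = {1: 3, 2: 2}  # user X's answers

fast = distance_to_group(x, S, R, n=2)
direct = sum((x[q] - v) ** 2 for q in answers for v in answers[q]) / 4
# fast == direct == 1.0
```

Only the m summary rows are touched, never the n·m raw answers.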
To test the performance, I wrote a script that generates random answers from abstract users to abstract questions. The script fills only one table, answer; no other tables are required.
import sqlite3, random, itertools

USERS = range(100000)
QUESTIONS = range(50)

conn = sqlite3.connect('aggregation.sqlite3')
cur = conn.cursor()
cur.execute('drop table if exists answer')
cur.execute('create table answer (user_id integer, question_id integer, value integer)')
for u, q in itertools.product(USERS, QUESTIONS):
    cur.execute('insert into answer values (?,?,?)', (u, q, random.randint(1, 100)))
# build the indexes after data entry,
# so that every insert doesn't have to rebalance the tree
cur.execute('create index answer_user on answer (user_id)')
cur.execute('create index answer_question on answer (question_id)')
cur.execute('''
    create table answer_summary as
    select question_id, sum(value) value,
           sum(value * value) value_square, count(*) num
    from answer
    group by question_id
''')
cur.execute('create unique index answer_summary_question on answer_summary (question_id)')
conn.commit()
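To see the decomposition in action, here are the two equivalent queries side by side, run against a small in-memory copy of the same schema (a standalone sketch; the data is made up):

```python
import sqlite3, random, itertools

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('create table answer (user_id integer, question_id integer, value integer)')
random.seed(1)
for u, q in itertools.product(range(20), range(5)):
    cur.execute('insert into answer values (?,?,?)', (u, q, random.randint(1, 100)))
cur.execute('''create table answer_summary as
               select question_id, sum(value) value,
                      sum(value * value) value_square, count(*) num
               from answer group by question_id''')

# without optimization: join user X's answers against everyone's
naive, = cur.execute('''
    select avg((my.value - his.value) * (my.value - his.value))
    from answer my join answer his on my.question_id = his.question_id
    where my.user_id = 0''').fetchone()

# with optimization: only the summary rows, via n*X^2 - 2*X*S + R
fast, = cur.execute('''
    select avg((s.num * my.value * my.value
                - 2 * my.value * s.value + s.value_square) * 1.0 / s.num)
    from answer my join answer_summary s on my.question_id = s.question_id
    where my.user_id = 0''').fetchone()
# the two results agree
```

The first query reads n rows per question; the second reads one summary row per question.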
50 answers from each of 100,000 users make 5 million records. On my modest laptop (2 × 1.5 GHz and 2 GB of memory) the tables took about half an hour to build, and the database file came out between tens and hundreds of megabytes, depending on the number of records.
I then ran several queries for the sums and the sums of squares. Importantly, Linux had cached the database file entirely in memory, so the only things slowing the computation down were the index lookups and the additions themselves.
Here are the results:
| without optimization | with optimization |
If our program talks to a database and we need the sum of a large number of rows (thousands and more), it is better to let the database compute it than to ship all the rows to the program and sum them there, which only adds overhead.
However, if our data is aggregated in advance, as with the questionnaires in this example, we have only 50 + 50 rows. Such data can simply be selected as-is and the final numbers computed in the program, where the code will be much more concise.
The same goes for updating the sums: instead of writing complex UPDATE queries with joined tables, we can select the data, add to it, and write it back with INSERT OR REPLACE.
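With SQLite that pattern looks roughly like this when a new answer arrives (my sketch, assuming the answer_summary table from the script above):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('create table answer (user_id integer, question_id integer, value integer)')
cur.execute('create table answer_summary (question_id integer primary key, '
            'value integer, value_square integer, num integer)')

def record_answer(user_id, question_id, value):
    cur.execute('insert into answer values (?,?,?)', (user_id, question_id, value))
    # select the current aggregates, add the new answer in Python,
    # and write the row back -- no UPDATE with joins needed
    row = cur.execute('select value, value_square, num from answer_summary '
                      'where question_id = ?', (question_id,)).fetchone()
    s, r, n = row or (0, 0, 0)
    cur.execute('insert or replace into answer_summary values (?,?,?,?)',
                (question_id, s + value, r + value * value, n + 1))

record_answer(1, 7, 10)
record_answer(2, 7, 20)
# answer_summary now holds (7, 30, 500, 2)
```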
Where the approach does not work
Let's get back to the bike computer example. We have an average speed in kilometers per hour, and we can convert it to meters per second, because the conversion is linear: converting every reading from km/h to m/s first and aggregating afterwards would give the same result.
But if we keep only the average speed and later want the average air resistance (proportional to the square of the velocity), it will not work: the sum of squared values cannot be derived from the sum of the values. We would need the original observations.
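A two-line counterexample makes this concrete: two series with the same average but different averages of squares, so the square cannot be recovered from the stored average alone:

```python
a = [10, 10]  # average 10, average of squares 100
b = [0, 20]   # average 10, average of squares 200
assert sum(a) / 2 == sum(b) / 2 == 10
print(sum(x * x for x in a) / 2, sum(x * x for x in b) / 2)  # 100.0 200.0
```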
It is not that difficult
In fact, if we have no OLAP but do have an ORM, we do not need to write sheets of SQL that must then be maintained and, worst of all, understood by somebody else. Such optimizations can be kept in related models. Here is how the models could be organized in Django:
class Question(Model):
    text = CharField()

class Answer(Model):
    user = ForeignKey(User)
    question = ForeignKey(Question)
    value = IntegerField()

class AnswerAggregate(Model):
    question = ForeignKey(Question, related_name='aggregates')
    summ = IntegerField()
    summ_of_squares = IntegerField()
    answers_count = IntegerField()

When we create an answer, we add it to the aggregates. When we delete one, we simply subtract its values. When we change one, we subtract the old value and add the new one.
def add_to_aggregates(sender, instance, **kwargs):
    AnswerAggregate.objects.filter(question=instance.question).update(
        summ=F('summ') + instance.value,
        summ_of_squares=F('summ_of_squares') + instance.value ** 2,
        answers_count=F('answers_count') + 1,
    )

def remove_from_aggregates(sender, instance, **kwargs):
    AnswerAggregate.objects.filter(question=instance.question).update(
        summ=F('summ') - instance.value,
        summ_of_squares=F('summ_of_squares') - instance.value ** 2,
        answers_count=F('answers_count') - 1,
    )

# Django has no update signals: we add on save and subtract on delete;
# to change an answer, subtract the old value first (e.g. re-fetch it
# in a pre_save handler), then let post_save add the new one.
signals.post_save.connect(add_to_aggregates, sender=Answer)
signals.pre_delete.connect(remove_from_aggregates, sender=Answer)
As homework, write a query, through the ORM or in SQL, that computes the standard deviation using the formula above.
Cost of enhancements
It is reasonable to optimize these calculations, but that is a separate decision. In the questionnaire example, reports slowed down noticeably at just a couple of thousand users, a size reachable by a commercial enterprise and even by a small project. Regular databases today do not make such optimizations automatically; they are built into OLAP databases, which are expensive and cumbersome for small businesses. For them, small optimizations like these can be a good solution.
The cost of this optimization will be:
1. Deriving the formulas and understanding them requires a decent knowledge of mathematical statistics.
2. A trigger or an ORM procedure has to be written and debugged.
3. The system must be documented in detail, so that after you leave, a new employee understands it and does not fall back on the usual aggregate functions.
First, as we have seen, the main brake on aggregates (SUM, AVG) is the addition itself, not the disk reads.
Second, even complex aggregate functions can be factorized so that simple aggregates are extracted from them. I showed how the squared difference decomposes into components from which the sum of the values and the sum of their squares can be pulled out; those sums can be computed in advance and kept ready.
The cost of reporting then shrinks by a factor on the order of the number of observations, O(n). For systems that store gigabytes of data this is a significant speed-up, and reports can come back in fractions of a second.
Third, we can add and edit new data, and delete old data, without recounting the aggregates from scratch; the cost of a recalculation also drops by a factor of O(n).
Finally, when working with a small number of rows, namely the aggregates, the data is small enough to pull out of the database and process directly in application code, avoiding heavy SELECT and UPDATE queries.
I hope I have told you something new. Thank you for reading.