Managing Least to Greatest

A question I receive often can be characterized as "do I really need to do this for smaller tables? Isn't this only necessary for larger tables?"

 

And the question can run the gamut from things like when-to-use-bigint to how-to-distribute-tables. The interesting thing is that some of it really can be deferred until later, since Netezza has the power to turn-the-ship, so to speak. Other things we need to consider for entirely different reasons. For example, will the table always be small? Or can it grow without asking our permission?

 

Case in point - a discussion arises as to whether we need bigint surrogate keys for small lookup tables that will never exceed hundreds or even thousands of rows. The bigint type seems like overkill, and that reasoning makes some sense: we don't want overkill mechanics in the data model. Then someone notes, "and we'll save disk space too." Well, not really. On the smaller tables the additional bytes are negligible; add to this the compression ratio and the fact that every table has a minimum required allocation of space, and we see that the avoid-bigint-to-save-space argument isn't really valid at all.
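
For concreteness, here is a minimal sketch of such a lookup table. The table and column names are hypothetical, not from this discussion. At a few thousand rows, the four extra bytes a bigint costs over an integer amount to only tens of kilobytes before compression, well below the table's minimum allocation anyway.

    -- Hypothetical small lookup table; names are illustrative.
    -- At ~5,000 rows, bigint vs. integer for the key is roughly 20 KB of raw
    -- difference, before compression and the minimum space allocation.
    CREATE TABLE dim_status
    (
        status_key   BIGINT       NOT NULL,   -- surrogate key
        status_code  VARCHAR(10)  NOT NULL,   -- natural key
        status_desc  VARCHAR(100)
    )
    DISTRIBUTE ON (status_key);   -- the instinctive choice; revisited below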

 

The avoid-bigint-to-reduce-domain-size argument tends to make sense, until we get into the issues of actually filling the tables with data. If we want to use a hashN() from the toolkit, we'll have to coordinate between hash8() and hash4(), for example, to make sure we use the right one for each key type. If we plan to use sequence generators, we'll also have to coordinate their separate use - one for bigint and one for integer, and so on. In this light, the bigint (since it isn't costing anything in disk space) has higher value in its data-type portability and the portability of the code or SQL leveraging it. We reduce logistical complexity with no penalty. The bigint then takes on a more polymorphic flavor, as a pointer to data rather than as a numeric counter or index.
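
To make the coordination point concrete, a minimal sketch follows, assuming the hypothetical dim_status table above plus an illustrative staging table and sequence name, and assuming Netezza's CREATE SEQUENCE ... AS BIGINT and NEXT VALUE FOR forms. Standardizing on bigint means one key-generation style (or one toolkit hash) serves every table, large or small.

    -- One bigint sequence style for every surrogate key:
    CREATE SEQUENCE seq_status_key AS BIGINT START WITH 1 INCREMENT BY 1;

    INSERT INTO dim_status (status_key, status_code, status_desc)
    SELECT NEXT VALUE FOR seq_status_key, s.status_code, s.status_desc
    FROM   stage_status s;

    -- Or derive the key from the natural key with the toolkit's 8-byte hash,
    -- so the same call works everywhere (qualify the function name as needed
    -- for wherever the SQL toolkit is installed):
    -- SELECT hash8(s.status_code) AS status_key, ... FROM stage_status s;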

 

Another aspect is distribution. How do we want to distribute the smaller tables, and what should we choose? For the smaller tables, recall, if they are not distributed on the same key as the larger joining tables (usually the case), then they will automatically be broadcast to the larger table. As this is actually the expected behavior, is there anything that could change it? I have noted at some sites that when queries get complicated, the optimizer does not have enough information to make the right call on broadcasting, and can actually reverse this outcome - broadcasting the larger table toward the smaller one. Such problems are very difficult to find when the data sizes are marginal, and only rear their heads when the larger table hits a certain size threshold. Meaning: the query will work for a while, perhaps even a very long time, and then show signs of inexplicable drag.
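
A quick way to see which way the optimizer went is to read the plan. This sketch assumes the hypothetical dim_status table above and an illustrative fact_sales table; in the verbose plan output, the broadcast step should be associated with the small lookup table, not the large one.

    -- Sanity-check the join plan: the broadcast should apply to dim_status.
    EXPLAIN VERBOSE
    SELECT f.sale_amt, d.status_desc
    FROM   fact_sales f
    JOIN   dim_status d ON d.status_key = f.status_key;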

 

Of course, this can also happen if the statistics on the table are stale, so we always want to give the optimizer what it needs to make the right call. We could keep our stats updated, and we could make sure those complex queries never get that complicated in the first place. We could and should do both through ongoing maintenance, but we should also make sure we never get bitten in between.
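
Keeping the stats fresh is a one-liner per table with the standard GENERATE STATISTICS command; the table names below are the same illustrative ones used earlier.

    -- Refresh statistics so the optimizer has current row counts and dispersion:
    GENERATE STATISTICS ON fact_sales;
    GENERATE STATISTICS ON dim_status;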

 

To avoid pain and injury, the way to make the smaller tables always behave is to distribute them on random. The optimizer then has no basis to broadcast a larger table into the smaller one, because no distribution key exists. The smaller table will always broadcast to the larger (the behavior we expected, right?) and all is well. The beauty of the TwinFin is that this smaller table will be cached in memory at no additional charge, further boosting the query and any subsequent queries using it - a beautiful thing.
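
As a minimal sketch, again using the hypothetical dim_status table: since a table's distribution generally isn't altered in place, a common route is to recreate it (for example with a CTAS) distributed on random and then swap the names.

    -- Recreate the lookup table with random distribution:
    CREATE TABLE dim_status_r AS
    SELECT * FROM dim_status
    DISTRIBUTE ON RANDOM;

    -- ...verify, then swap names:
    -- DROP TABLE dim_status;
    -- ALTER TABLE dim_status_r RENAME TO dim_status;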
