Homepage of Boris Haase

#23: Extension Sorting and Searching on 17.09.2010

In all sorting procedures in which values of bit length l = q a are sorted, where a is the address length, the time complexity per value increases from O(1) to O(q). If r denotes the average of q over all values, then at least one O(n) in the time complexity of the sorting procedure is to be replaced by O(r n). If s is the average number of distinct (!) values that end up in a bucket after sorting up to bit length a and whose bit length is greater than a, then the time complexity of the sorting procedure is O((log s + r) n). These values are stored, bucket by bucket, in an AVL tree (or in a B-tree, if they reside in external memory that is slow relative to main memory). Each step works only with the currently significant part value, of address length, of a value.
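The bucket scheme just described can be sketched roughly as follows. This is only an illustration under stated assumptions, not the author's implementation: the function name and the default a = 8 are hypothetical, all values are taken to be non-negative integers below 2^(q a), and a sorted Python list with binary search stands in for the AVL tree (finding a position costs O(log s) comparisons, but an insertion into the list shifts elements, which a real AVL tree or B-tree would avoid).

```python
from bisect import insort

def bucket_sort_with_ordered_buckets(values, q, a=8):
    """Sort integers of bit length q*a: bucket by leading part, order the rest.

    Assumption: 0 <= v < 2**(q*a) for every v in values.
    """
    shift = (q - 1) * a                  # bits that follow the leading part value
    mask = (1 << shift) - 1
    buckets = [None] * (1 << a)          # one slot per leading part value
    for v in values:
        head, tail = v >> shift, v & mask
        if buckets[head] is None:
            # Sorted list of distinct tails (stand-in for the AVL tree)
            # plus a multiplicity count per distinct tail.
            buckets[head] = ([], {})
        keys, counts = buckets[head]
        if tail in counts:
            counts[tail] += 1            # duplicate: no new entry in the ordered structure
        else:
            insort(keys, tail)           # binary search for the position, then insert
            counts[tail] = 1
    out = []
    for head, bucket in enumerate(buckets):
        if bucket is None:
            continue
        keys, counts = bucket
        for tail in keys:
            out.extend([(head << shift) | tail] * counts[tail])
    return out

data = [0x0302, 0x0101, 0x03FF, 0x0101, 0x0001]
result = bucket_sort_with_ordered_buckets(data, q=2)
```

Only distinct values enter the ordered structure of a bucket, which is why s counts different values; duplicates merely raise a counter.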

If one does not require the values to be completely sorted after each individual value has been processed, then the time complexity O(r n) can be maintained for the complete sort by using two areas of main memory. The values are first sorted into buckets up to address length and then sorted further, address length by address length (i.e. part value by part value) and bucket by bucket. Processing all values that agree in the previously processed part value counts as one run. It is presupposed here that reading and writing the part values of a value and its address, taken over all runs, costs O(r) per value on average.

The second area of main memory is cleared after each run so that it is available for the next run, until the bucket has been processed. Only the addresses, in the current sort sequence, are written back into the first area before the values whose just-processed part value is equal are sorted further. Each value should be stored contiguously in memory, so that any of its part values can be accessed quickly. Admittedly, the time complexity is open to debate. The values of a bucket and all addresses should fit into main memory to avoid relatively long loading times.
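One possible reading of the two-area procedure is a most-significant-part-first distribution sort on addresses. The following is a minimal sketch under assumptions, not a definitive implementation: the names part and two_area_sort are hypothetical, list indices play the role of addresses, values are non-negative integers below 2^(q a), and the second area is simply overwritten instead of being cleared. Note that this simplified version spends O(2^a) on a counting table in every run, which a careful implementation would have to avoid for very small runs.

```python
def part(value, k, q, a):
    """Return the k-th part value (k = 0 is the most significant) of a q*a-bit value."""
    return (value >> ((q - 1 - k) * a)) & ((1 << a) - 1)

def two_area_sort(values, q, a=8):
    n = len(values)
    order = list(range(n))     # first area: addresses in the current sort sequence
    scratch = [0] * n          # second area: reused for every run
    runs = [(0, n, 0)]         # (start, end, index of the part value to sort by)
    while runs:
        lo, hi, k = runs.pop()
        if hi - lo < 2 or k == q:
            continue           # nothing left to distinguish in this run
        # Count how often each part value occurs in the run.
        counts = [0] * ((1 << a) + 1)
        for i in range(lo, hi):
            counts[part(values[order[i]], k, q, a) + 1] += 1
        # Prefix sums: counts[p] becomes the start offset of part value p.
        for p in range(1 << a):
            counts[p + 1] += counts[p]
        starts = counts[:]
        # Distribute the addresses into the second area (stable).
        for i in range(lo, hi):
            p = part(values[order[i]], k, q, a)
            scratch[lo + counts[p]] = order[i]
            counts[p] += 1
        # Write only the addresses back into the first area.
        order[lo:hi] = scratch[lo:hi]
        # Values with an equal part value form the next runs.
        for p in range(1 << a):
            s, e = lo + starts[p], lo + starts[p + 1]
            if e - s > 1:
                runs.append((s, e, k + 1))
    return [values[i] for i in order]

data = [0xABC, 0x123, 0xAB0, 0x123, 0x001, 0xFFF, 0xA00]
result = two_area_sort(data, q=3, a=4)
```

The values themselves never move; only their addresses are permuted, which matches the requirement that each value stays contiguous in memory.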

Both procedures can be combined well: first the O(r n) procedure for master data, then the O((log s + r) n) procedure for transaction data. Sorted data can be stored in O(n) in main memory and in an AVL tree or B-tree, respectively. In particular, indices for non-indexed columns can thus be built quickly in database queries. For large tables it makes a significant difference whether an index is built in O(r n) instead of O(r n log n), and whether a value can then be found on average in O(log s + r) instead of O(log n + r) or worse. With skilful use of a sufficiently large main memory, several indices can easily be kept simultaneously. For numbers, r is usually quite small, depending on the accuracy, and so is s for a suitable distribution. The same holds for (compressed) words, provided the memory size corresponds to that of computers commercially available today. Sorting by merging arbitrarily many sorted lists is also possible in O(r n). Implementing a suitable memory with relatively little built-in intelligence is admittedly simple and saves more complex processors. By using sufficiently many of these, or by parallel processing (e.g. parallel use of memory areas), the term log s + r in O((log s + r) n) can even be made less than 1, provided one uses only AVL trees or B-trees and merges the result lists afterwards in O(n).
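How an index might be filled from already sorted data and then searched can be sketched as follows. This is an illustration under assumptions only: build_index and contains are hypothetical names, values are non-negative integers below 2^(q a), and a sorted list per bucket with binary search again stands in for the AVL tree or B-tree.

```python
from bisect import bisect_left

def build_index(sorted_values, q, a=8):
    """Build a bucket index from ascending values in a single pass."""
    shift = (q - 1) * a
    mask = (1 << shift) - 1
    buckets = [[] for _ in range(1 << a)]    # directly addressed by the leading part value
    for v in sorted_values:
        tails = buckets[v >> shift]
        tail = v & mask
        # The input is sorted, so appending keeps each bucket ordered;
        # only distinct values are stored.
        if not tails or tails[-1] != tail:
            tails.append(tail)
    return buckets

def contains(buckets, value, q, a=8):
    """Look up a value: direct bucket access, then binary search among its tails."""
    shift = (q - 1) * a
    tails = buckets[value >> shift]
    tail = value & ((1 << shift) - 1)
    i = bisect_left(tails, tail)
    return i < len(tails) and tails[i] == tail

data = sorted([0x0302, 0x0101, 0x03FF, 0x0101, 0x0001])
index = build_index(data, q=2)
found = contains(index, 0x0101, q=2)
missing = contains(index, 0x0102, q=2)
```

The binary search only ever inspects the entries of one bucket, so its depth depends on the number of distinct values in that bucket rather than on n.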

To support sorting and searching in heterogeneous (distributed) databases (for instance, ones based on different database models), a database query language can be used that allows unified data manipulation by means of data structures generated dynamically from these databases (e.g. behind a uniform interface), as long as the records have a broadly similar structure. Based on the demands at hand (possibly in anticipation of them), it can decide dynamically which manipulation steps are to be performed. It only needs to be provided with the relevant storage information of the databases, together with the layout of the database structures.

Changes to the programme code and data at run time (such as the integration of further databases) can be supported if sufficient security mechanisms (including authorisation checks and locking) are available. Depending on their size, the indices may be built, for example, in parallel or at times of low load, using different data structures that are adapted to or optimised for the problem at hand. A temporary merge or parallel access may be sufficient. High-calibre standards can considerably facilitate the implementation of such a database language.

© 17.09.2010 by Boris Haase
