

6. SIZING THE DATA BANK

Final description lists have been received by the consultant for the following species:

-   Cattle

-   Buffalo

-   Pigs

-   Sheep

-   Goat

Estimates of field lengths for each of the attributes in the description list for buffalo have also been supplied to the consultant. These are considered to be overestimates rather than underestimates. The same lengths and number of attributes have been assumed for all species (including chicken, duck, turkey, rabbit and other poultry) in determining the possible size of the future data bank.

The data in the data bank can be broken down into two major categories. These have been termed Master and Slave record categories. The Master record occurs once only for each species and broadly describes the characteristics of the species. Slave records, however, may occur very many times. These will be derived from single documents (research papers, published and unpublished reports, theses etc.). A single document could generate several slave records, as it may deal with several breeds or several crossbred types of a breed.

Slave records may be further broken down into various categories, e.g. performance, management system and environment.
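The Master and Slave arrangement described above can be pictured as a simple one-to-many structure. The following is only an illustrative sketch: the class names, field names, document identifier and breed labels are assumptions made for the example and do not come from the description lists.

```python
from dataclasses import dataclass, field


@dataclass
class SlaveRecord:
    """One record derived from a single source document for one breed or cross."""
    source_document: str   # hypothetical identifier of the paper, report or thesis
    breed: str             # breed or crossbred type the record refers to
    category: str          # e.g. "performance", "management system", "environment"
    traits: dict = field(default_factory=dict)   # trait name -> recorded value


@dataclass
class MasterRecord:
    """Occurs once only per species and broadly describes that species."""
    species: str
    description: str = ""
    slaves: list = field(default_factory=list)   # zero or more SlaveRecord objects


buffalo = MasterRecord(species="Buffalo")

# A single document dealing with two breeds generates two slave records.
for breed in ("Breed A", "Breed B"):   # placeholder breed labels
    buffalo.slaves.append(SlaveRecord("document-001", breed, "performance"))

print(len(buffalo.slaves), "slave records for", buffalo.species)
```

The point of the sketch is only that the number of master records is bounded by the number of species, while the number of slave records grows with the number of documents and breeds covered.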

From estimates supplied to the consultant, the potential data bank is huge. Each master record could conceivably contain up to 8 000 characters; however, the number of species is small, so master records add little to the total. Each slave record could perhaps contain up to 30 000 characters. Estimates of the number of slave records vary. One possible distribution is shown in Table 1. If every slave record in this table were stored at full length, the global database would exceed 1 500 million characters. Obviously, though, this will not be the case.

Another estimate was that there could be of the order of 300 breeds, each with an average of 50 records of 1 000 characters on average. This would give a size of 15 million characters. If one assumed that only one percent of each slave record's possible character positions were used, then, assuming 30 000 characters per full record and the distribution of species in Table 1, each region would have a data bank with the following sizes:

Table 1. Estimates of potential size of data bank if all records were fixed length.
Assuming a record length of 30 000 characters for each species.
Figures represent estimated totals at time of writing.
No growth factors have been applied.

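Sizes are in millions of characters.

| Species | Asia: No. of records | Asia: Size | Africa: No. of records | Africa: Size | Latin America: No. of records | Latin America: Size | Total: No. of records | Total: Size |
|---|---|---|---|---|---|---|---|---|
| Cattle | 8 750 | 262 | 4 350 | 130.5 | 7 350 | 220.5 | 20 450 | 613 |
| Buffalo | 5 250 | 157.5 | 150 | 4.5 | 150 | 4.5 | 5 550 | 166.5 |
| Goat | 1 250 | 37.5 | 1 800 | 54 | 750 | 22.5 | 3 800 | 114 |
| Sheep | 1 000 | 30 | 2 700 | 81 | 1 500 | 45 | 5 200 | 156 |
| Pig | 3 750 | 75 | 2 250 | 67.5 | 1 500 | 45 | 7 500 | 187.5 |
| Chicken | 3 750 | 75 | 3 000 | 90 | 3 000 | 90 | 9 750 | 255 |
| Duck | 500 | 15 | 300 | 9 | 300 | 9 | 1 100 | 33 |
| Turkey | 125 | 3.75 | 75 | 2.25 | 75 | 2.25 | 275 | 8.25 |
| Rabbit and other Poultry | 625 | 18.75 | 375 | 11.25 | 375 | 11.25 | 1 375 | 41.25 |
| TOTAL | 25 000 | 674.5 | 15 000 | 450 | 15 000 | 450 | 55 000 | 1 574.5 |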

-   Asia: 6.75 MB (MB = million characters)

-   Africa: 4.5 MB

-   Latin America: 4.5 MB

This gives a combined size for the three regions of 15.75 million characters, close to the second estimate of 15 million characters given above. A one percent factor is not unrealistic in this case, as all field lengths have been overestimated and in most cases only a few traits will be recorded in the source document. It has also been suggested that the figure of 55 000 source documents in Table 1 is an overestimate.
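The arithmetic behind the two estimates can be retraced as follows. This is a minimal sketch that takes the regional totals exactly as printed in the TOTAL row of Table 1; the variable names are assumptions made for the example. Note that one percent of the Asia total is 6.745, which the text rounds to 6.75, so the combined figure computed here is 15.745 rather than 15.75.

```python
FULL_RECORD_CHARS = 30_000   # assumed length of a full slave record (from the text)
FILL_FACTOR = 0.01           # one percent of character positions actually used

# Regional sizes in millions of characters, as printed in the TOTAL row of Table 1.
table1_size_millions = {"Asia": 674.5, "Africa": 450.0, "Latin America": 450.0}

# Check one printed total against its record count: 15 000 full-length records for Africa.
africa_full_millions = 15_000 * FULL_RECORD_CHARS / 1_000_000

# Apply the one percent factor to each region, then sum over the regions.
regional_mb = {region: size * FILL_FACTOR for region, size in table1_size_millions.items()}
combined_mb = sum(regional_mb.values())

# Second estimate: 300 breeds x 50 records per breed x 1 000 characters per record.
second_estimate_mb = 300 * 50 * 1_000 / 1_000_000

for region, mb in regional_mb.items():
    print(f"{region}: {mb:.3f} million characters")
print(f"Combined: {combined_mb:.3f} million characters")
print(f"Second estimate: {second_estimate_mb:.1f} million characters")
```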

What the preceding figures do show, however, is that it is imperative to use a storage method that stores only the characters actually entered and does not waste vast amounts of space on blank (missing) information. This implies one of two approaches. The first is a system of codes, as devised in several of the pilot trials, where each trait entered is associated with a unique number for identification and trailing spaces are eliminated from the value of the trait. The second is a system with a very flexible storage algorithm that is not position dependent and that can detect both the presence and the absence of information: fields not entered have no information stored for them, and unnecessary spaces are removed from non-null traits.
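The difference between a fixed-length layout and a coded layout can be illustrated with a small sketch. It is not the scheme used in the pilot trials: the trait names, the code numbers and the field width of 40 characters are all hypothetical values chosen for the example.

```python
# Hypothetical trait codes; a real code list would come from the description lists.
TRAIT_CODES = {"breed_name": 1, "adult_weight_kg": 2, "milk_yield_kg": 3}
FIELD_WIDTH = 40   # assumed fixed field width, used for the comparison only


def encode_fixed(record):
    """Fixed-length layout: every trait occupies FIELD_WIDTH characters, entered or not."""
    return "".join(
        str(record.get(name, "")).ljust(FIELD_WIDTH)[:FIELD_WIDTH]
        for name in TRAIT_CODES
    )


def encode_sparse(record):
    """Coded layout: keep (trait code, value) pairs for entered traits only."""
    pairs = []
    for name, value in record.items():
        text = str(value).rstrip()   # eliminate trailing spaces from the value
        if text:                     # traits not entered have nothing stored
            pairs.append((TRAIT_CODES[name], text))
    return pairs


def decode_sparse(pairs):
    """Rebuild a trait-name -> value mapping from the stored pairs."""
    names = {code: name for name, code in TRAIT_CODES.items()}
    return {names[code]: value for code, value in pairs}


# One trait entered with trailing spaces, one left blank, one absent altogether.
record = {"breed_name": "Breed A   ", "milk_yield_kg": ""}

fixed = encode_fixed(record)
sparse = encode_sparse(record)
stored_chars = sum(len(value) for _, value in sparse)

print("fixed layout characters:", len(fixed))
print("sparse layout value characters:", stored_chars)
print("decoded:", decode_sparse(sparse))
```

Because each stored value carries its own trait code, the coded layout does not depend on position, and the absence of a code is itself the signal that no information was entered for that trait. The cost is the small overhead of storing the code alongside every value.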

