EXTRACTION AND PREPARATION OF DATA

1. INTRODUCTION

This chapter gives guidelines for extracting data on breed characteristics and for assembling them in an appropriate fashion for subsequent compilation into the approved Descriptor List. The person preparing data (compiler) is reminded of the role of the Data Bank (DB) and urged to keep in mind its value as a pool of information on breed characteristics within defined environments. The compiler should also keep in mind the needs of users for information relevant to the future utilization of animal genetic resources in other similar or dissimilar environments. Thus, this exercise of data extraction and presentation must include an exhaustive search of the published literature and other unpublished data sources, the evaluation of these sources and the extraction of valid genetic and associated environmental information and preparation of this information in a form suitable for entry into the Descriptor Lists.

2. WHERE TO FIND THE DATA

The data for the Data Bank will be derived from various published or unpublished sources. A Source is defined here as any document having authentic data which would add to the sum of knowledge about the genetic characteristics of a breed. The Source could have been written in any language. The likely types of Sources are listed below.

i) published scientific papers,
ii) papers presented in conferences with or without proceedings,
iii) specific reports or case studies,
iv) annual reports (livestock stations, research centres, government departments),
v) theses, graduate and undergraduate, and
vi) stores of unpublished data ('idle' data).

The Data Bank does not hold individual animal records. It holds performance statistics for groups of animals of known breed type, together with the conditions under which these statistics were measured. The data should be entered in English, using the Descriptor Lists in this publication. Similar Descriptor Lists are available in French and Spanish.

3. THE WORKING GROUP

All persons involved should understand the background, the objectives and the basic principles of data handling. The team leader should:

be an animal geneticist by training, with professional experience of the species being studied,
have a good general knowledge of animal production,
be able to extract relevant information selectively and to judge the authenticity of the source material, and
have some appreciation of statistics and computerization.

The assisting members of the team should preferably have a degree in Animal Science, Veterinary Science or Biological Sciences. Non-professional members could assist in restricted areas, such as compiling data on rainfall, environmental temperatures, etc. for the various stations covered by the Sources. It is emphasised that the team leader should be closely involved in training the team members and in all stages of the data extraction.

4. A NOTE OF CAUTION TO COMPILERS

The Descriptor List is comprehensive, covering all aspects of breed characteristics and almost all classes of livestock. It was derived from trials in different countries in Africa, Asia and Latin America, and covers all possible traits of interest and occurrence. As a result it is massive. The compiler should therefore first study the general pattern and contents of the Descriptor List. The recommended procedure is then to work from each Source, searching it for data on genetic characterization, rather than to work from the Descriptor List, checking each item in turn against the Source. From past experience, each Source is likely to provide data for only 5 to 40 percent of the options listed in the Descriptor List.

The Descriptor List should serve as a dictionary of genetic characteristics, and as the format for laying out the Source Data Sheet that the compiler prepares before the data are entered into the system (see section 9 of these guidelines).

5. GENERAL LAYOUT OF DESCRIPTOR LIST

The Descriptor List is divided into two components.

Master Record. This record describes the physical characteristics of the breed within the species. Descriptive features have been categorised, and the compiler may need to make a judgement, for instance on hump size (large, medium or small) or on the proportion of a colour. Within each species there is one Master Record for each breed or strain. The record need not be derived from a single Source; it may draw on a number of Sources and may also include additional information supplied by the compiler. The compiled Master Record thus forms one complete set of information on the physical characteristics of the breed or strain.

Slave Record. This consists of the performance characteristics of a group of animals of a breed or strain within a species. It also provides for environmental characteristics, if such details are given in the Source. Every Source will result in one Slave Record. If the Source reports performance characteristics of more than one breed, then it will provide one Slave Record for each breed; in this case the environmental details are repeated in each of these Slave Records, unless the breeds were raised differently. In exceptional circumstances, an author may have published two or more papers, each covering different traits, but all derived from the same group of animals maintained over the same time period. The information from these Sources could be pooled into a single Slave Record. If these papers compared several breeds, the resulting number of Slave Records will correspond to the total number of breeds in all these papers.

After a complete exercise, the end result is one Master Record and a larger number of Slave Records for each breed or crossbred. Each Slave Record derives from one Source (or from several, in the exceptional case where several Sources report on the same animals). Conversely, each Source contributes one Slave Record for each breed or crossbred type reported.
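The counting rule described above can be illustrated with a minimal sketch in Python. The names Source, animal_group, slave_record_keys and master_record_keys are hypothetical and are not part of the Data Bank system; animal_group is an assumed label that the compiler would assign so that Sources reporting on the same animals over the same period can be pooled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    reference: str     # bibliographical reference (item 6 of the Slave Record)
    breeds: tuple      # breeds or crossbred types reported in the Source
    animal_group: str  # hypothetical label: same label = same animals, same period


def slave_record_keys(sources):
    """Return one (animal_group, breed) key per Slave Record.

    Sources sharing an animal_group are pooled, so a breed reported in two
    papers on the same animals yields a single Slave Record.
    """
    keys = []
    for src in sources:
        for breed in src.breeds:
            key = (src.animal_group, breed)
            if key not in keys:
                keys.append(key)
    return keys


def master_record_keys(sources):
    """Return each breed once: every breed has a single Master Record."""
    breeds = []
    for src in sources:
        for breed in src.breeds:
            if breed not in breeds:
                breeds.append(breed)
    return breeds


# Hypothetical Sources: two papers on the same herd, one separate report.
sources = [
    Source("Paper A (growth traits)", ("Friesian", "Brown Swiss"), "herd-1"),
    Source("Paper B (milk traits)", ("Friesian", "Brown Swiss"), "herd-1"),
    Source("Annual report C", ("Friesian",), "station-2"),
]
slave_keys = slave_record_keys(sources)
master_keys = master_record_keys(sources)
```

In this invented case the two papers on the same herd are pooled, so the three Sources give three Slave Records (two for Friesian, one for Brown Swiss) and two Master Records.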

6. PROCEDURE FOR MASTER RECORDS

The Master Record is made up of breed descriptive data and is qualitative in nature. Attempts have been made in the Descriptor Lists to categorise descriptors such as body colours, horn shape and size, temperament and belly shape into fixed-format alternatives (e.g. straight vs. curved; short, medium or long; and colour percent). Compilers need to be consistent in their subjective evaluations. For other traits, for example resistance to diseases and parasites, free-format fields for word descriptions are allowed. Such descriptions should be precise and short.

Usually very few publications describe the physical features of a breed. Despite this lack of published data, the Master Record should be completed as far as possible, with added information based on personal experience. Visual examination of the animals may be necessary to reduce gaps in the record.

As some of the data in the Master Record are subjective measures, it is recommended that all Master Records for a group of breeds or crosses be completed within an uninterrupted period of time so as to ensure uniformity.

Experience shows that about three man-days are normally necessary to complete one Master Record for a breed if the breed is available in the station where the geneticist who is compiling the data is working.

7. PROCEDURE FOR SLAVE RECORDS

All Sources dated after 1960 should be used to develop the Data Bank. Earlier Sources may exceptionally be considered valuable, but a routine search for Sources before 1960 is not recommended. Each Source should first be reviewed. If it is found to be suitable, information can then be extracted for Data Bank use.

Review of Source: Each source needs to be studied carefully and the following points noted.

Reliability. The authenticity of the data in the Source needs to be judged and a value between 1 (most reliable) and 5 (least reliable) given (Item 8 in Slave Record). Various factors, such as statistical results (number of observations, standard deviations), management system, feeding standards and clear presentation of the experimental design or model, will serve as indicators.
Documentation vs. Evaluation. The distinction between these two should be made for each Source. Documentation is simply the collation of existing data, whereas Evaluation is a contemporary comparison of performance records of two or more breeds under the same environmental circumstances. Though each breed or strain within the Source will be presented in a separate Slave Record, the linkage between them will be maintained through the bibliographical reference field (Item 6 in Slave Record).
Bibliographical Reference. All Sources should be referenced even if some were found not useful. In such instances only item 6 of Slave Record will be filled. This will allow users to know the material was scanned but not used. The following sample formats need to be strictly followed in quoting the Source reference.

Journal:

Johnson, S.A., T. Killer and A. Victor. 1981. The relative performance of Friesian and Brown Swiss cattle in Nigeria. J. Anim. Sci. 51: 2222-2275.

Proceedings:

Nanda, K. and S. Singam. 1972. Growth rate and milk yield of Selembu cattle in Malaysia. Proc. Malaysian Society of Animal Production, 8th Ann. Conf., p. 197-200.

Annual Report:

Black, T. and M. White. 1965. Performance of Black and White cattle in South Africa. Ann. Rpt. No. 32. 1970, Agric. Res. Inst., London.

Mahendra, M. and V. Buva. 1982. Factors affecting performance of Friesian crossbred cattle in Sri Lanka. Ministry of Agriculture, Sri Lanka, No. 3, 56 pp.

Idle data:

Hoest, R. and M.E. Berg. 1985. Unpublished data. Livestock Department, Ministry of Agriculture, Kuala Lumpur, Malaysia.
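Because the reference formats must be followed strictly, a small helper can keep them consistent. The following is a minimal sketch for the journal format only, inferred from the sample entry above. The function names and the (surname, initials) input layout are assumptions made for illustration.

```python
def format_authors(authors):
    """Format a list of (surname, initials) pairs in the sample style.

    The first author is written 'Surname, Initials'; later authors are
    written 'Initials Surname', with 'and' before the last one.
    """
    first_surname, first_initials = authors[0]
    names = [f"{first_surname}, {first_initials}"]
    names += [f"{initials} {surname}" for surname, initials in authors[1:]]
    if len(names) == 1:
        return names[0]
    return ", ".join(names[:-1]) + " and " + names[-1]


def format_journal_reference(authors, year, title, journal, volume, pages):
    """Assemble 'Authors. Year. Title. Journal Volume: Pages.'"""
    # Drop a trailing full stop so a single author's initials do not
    # produce a doubled full stop before the year.
    author_text = format_authors(authors).rstrip(".")
    return f"{author_text}. {year}. {title}. {journal} {volume}: {pages}."


reference = format_journal_reference(
    [("Johnson", "S.A."), ("Killer", "T."), ("Victor", "A.")],
    1981,
    "The relative performance of Friesian and Brown Swiss cattle in Nigeria",
    "J. Anim. Sci.",
    51,
    "2222-2275",
)
```

With these inputs the helper reproduces the sample journal entry. The other Source types (proceedings, annual reports, idle data) would each need their own pattern.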

Extraction of data: As much relevant information as possible must be extracted from the Sources. The Slave Record Descriptor List needs to be referred to constantly, especially during the early stages. Generally, the extraction of data from the Sources may not be straightforward. Often a considerable amount of data editing is necessary. The following is a brief summary of the types of data:

Actual Data. This is the data taken directly from the Source and transferred on to Source Data Sheets (see Section 9 of this manual) such as breed average 305-day milk yield, yearling weight and the associated number of observations, standard deviation and ranges. These figures are as given in the text of the Source.
Summarised Data. Many authors give annual averages for a single trait, with standard deviations and numbers of observations, for each breed. Overall means and standard deviations need to be calculated, the latter from the pooled sums of squares. An example is given in Appendix 1. A similar procedure should be followed if data are presented by herds within farm or other similar groupings.
Transformed data. Some data such as those on feeding, management and adaptive characteristics are described in Sources. These data need to be summarised and transformed into defined alternatives suitable for the standardised format of the Slave Record. For instance, grazing management may be described along with concentrate feeding giving various components. These need to be clearly defined and entered into section 18 of Slave Record.
Additional Data. This refers to data pertaining to the Source but not given in the Source. The compiler should limit such supplementary data to some environmental characteristics such as meteorological records covering the period of study in the report. If accurate management characteristics such as type of housing, could be obtained from the station or from the author, they may be included. However, caution should be taken against extrapolation, guess work or searches that involve unwarranted time. Such additional data should be minimum and undertaken only if the compiler geneticist feels that such data are absolutely necessary for understanding the results.
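The calculation for Summarised Data can be sketched as follows. Appendix 1 is not reproduced here, so this is only one plausible reading of "pooled sums of squares". It assumes that each reported standard deviation is a sample standard deviation (divisor n - 1) and that the overall standard deviation should include the variation between years as well as within years. The function name combine_groups is hypothetical.

```python
import math


def combine_groups(groups):
    """Combine (n, mean, sd) triples into an overall (n, mean, sd).

    Assumes each sd is a sample standard deviation (divisor n - 1).
    """
    total_n = sum(n for n, _, _ in groups)
    overall_mean = sum(n * mean for n, mean, _ in groups) / total_n
    # Within-group sums of squares, recovered from each reported sd.
    within_ss = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    # Between-group sums of squares around the overall mean.
    between_ss = sum(n * (mean - overall_mean) ** 2 for n, mean, _ in groups)
    overall_sd = math.sqrt((within_ss + between_ss) / (total_n - 1))
    return total_n, overall_mean, overall_sd


# Hypothetical annual summaries: (number of observations, mean, sd).
annual = [(10, 100.0, 5.0), (20, 110.0, 6.0)]
n, mean, sd = combine_groups(annual)
```

If the variation within years alone is wanted, the between-group term is dropped and the divisor becomes the sum of the (n - 1) values. The compiler should follow whichever convention Appendix 1 prescribes.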

In the case of 'idle' data, the compiler is expected to conduct some minimum statistical analysis as required by the Slave Record. Environmental data with relevant and reliable details should also be provided.

All statistics should be given in the metric system. Conversions from inches, lb and Fahrenheit to cm, kg and Celsius, respectively, are given in Appendix 1.
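For compilers who process figures by computer, the three conversions can be sketched as below. Appendix 1 is not reproduced here; the factors used are the standard definitions (1 inch = 2.54 cm, 1 lb = 0.45359237 kg) and the function names are illustrative only.

```python
CM_PER_INCH = 2.54        # exact by definition
KG_PER_LB = 0.45359237    # exact by definition (avoirdupois pound)


def inches_to_cm(inches):
    """Convert a length in inches to centimetres."""
    return inches * CM_PER_INCH


def lb_to_kg(pounds):
    """Convert a weight in pounds to kilograms."""
    return pounds * KG_PER_LB


def fahrenheit_to_celsius(degrees_f):
    """Convert a temperature in degrees Fahrenheit to degrees Celsius."""
    return (degrees_f - 32.0) * 5.0 / 9.0
```

Standard deviations and ranges of weights and lengths are converted with the same factor as the mean. A standard deviation of temperature is multiplied by 5/9 only, without subtracting 32.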

During the process of data extraction, some common problems may be encountered, as follows:

Repeated data. There may be a few cases where part of the data in a Source is repeated in another. Only the first Source needs to be used.
Adjusted data. If both raw averages and adjusted data are given for the same traits, the adjusted data are recommended. The factors for which the data were adjusted need to be mentioned in section 7 of Slave Record. If only some traits were adjusted, these traits should also be named in the same section.
Feeding trials. If some useful breed information is available from Sources that are nutrition orientated, and if the sample sizes are greater than 20 head per breed, then they could be used.
Incomplete statistics. A few Sources, though of reliable origin, may report only averages for each trait without number of animals used and/or standard deviations. These sources should also be included, and the blank spaces in the Descriptor List will indicate the lack.
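Two of the rules above lend themselves to a simple check. The sketch below is an illustration only: the function names and input layouts are assumptions, and the threshold is read from the text as strictly more than 20 head, applied breed by breed.

```python
MIN_HEAD_PER_BREED = 20  # the text asks for MORE than 20 head per breed


def usable_breeds(head_per_breed):
    """Return the breeds in a nutrition-oriented Source that pass the rule.

    head_per_breed maps a breed name to its sample size in the trial.
    """
    return [breed for breed, head in head_per_breed.items()
            if head > MIN_HEAD_PER_BREED]


def preferred_mean(raw_mean, adjusted_mean=None):
    """Prefer the adjusted mean when the Source reports one.

    Returns (value, was_adjusted), so that the compiler remembers to
    name the adjustment factors in section 7 of the Slave Record.
    """
    if adjusted_mean is not None:
        return adjusted_mean, True
    return raw_mean, False


# Hypothetical feeding trial with two breeds.
breeds_to_record = usable_breeds({"Friesian": 25, "Brown Swiss": 20})
```

Here only the Friesian group qualifies, because a group of exactly 20 head does not exceed the threshold.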

8. RELEVANT DETAILS

The compiling geneticist is encouraged to be specific and accurate while transcribing data from Sources for the Data Bank. For example, if yields of a dairy herd were given and during the period of data recording the cows were herded for some days and strip grazed on other days, both of these should be indicated in Section 8.1.1 of Slave Record of Cattle Descriptors. In addition, if details are given, the compiler should include the proportion of time for each, e.g.

herded (20%)
strip grazed (80%)

9. PRESENTATION OF DATA FOR DATA ENTRY

The Master and Slave Records should be prepared separately. Any one Source will usually cover less than 40 percent of the characteristics listed in the Descriptor Lists. Completing a full Descriptor List for each Source would therefore produce bulky copies in which many items are left unfilled. Furthermore, direct entry of data from the Source into the computer system is not possible, because of the size of the Descriptor List, the need to review each Source before extracting the relevant data, the need to process some of the data, and the time lag in collecting climatic data. It is therefore suggested that the extracted data be written on a sheet of paper, the Source Data Sheet. Relevant climatic details are added to the sheet as they come in. To maintain the link between each data item and its descriptor heading, the corresponding descriptor number that appears on the left of the Descriptor List (e.g. 4.4.1.1.2) is written alongside the data on the Source Data Sheet as a tag number. The resulting Source Data Sheets are then ready for entry into the system. An example of a Source Data Sheet for a cattle Slave Record is given below.

Tag number	Source Data Sheet for a Source
1	Kedah-Kelantan
2	purebred
4	800112 - 830531
6	Mahatir, M. and S. Velu. 1970. Performance of Kedah-Kelantan cattle in Malaysia. J. Animal Sc. 32: 1-20.
8	3
9	Malaysia
9.1	Serdang
18.1.1.3	Tethered
18.1.2.2	improved
18.1.4.1	Bracharia decumbens			60%
18.1.4.2	Paspalum spp.			10%
18.1.5.1	Centrosema			30%
18.3.1.1	Rice bran			70%
18.3.1.2	Molasses			20%
18.3.1.3	Urea			3%
18.3.1.4	Mineral mixture			7%
18.3.2	4 kg per day per head for two weeks before calving, 3 kg. per day per head from calving to end of 100 days and 1 kg per day per head until end of lactation.
	-
	-
	-
22.1.1.1		300	18.5		3.2	16.1-20.5
22.3.2.3	12	-	113.2		7.5	109.0-118.2
		-
		-
		-
22.8.4.2		25	2.3		0.5	3.0-5.1
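If a Source Data Sheet such as the one above is held electronically before entry, the tag numbers can serve as keys. The following minimal sketch is not part of the Data Bank system and the name tag_key is hypothetical. It shows why tags should be compared level by level as numbers and not as plain text: a plain text sort would place 18.1.1.3 before 2.

```python
def tag_key(tag):
    """Turn a tag such as '18.1.4.1' into (18, 1, 4, 1) for numeric ordering."""
    return tuple(int(part) for part in tag.split("."))


# A few entries from the example sheet, deliberately out of order.
sheet = {
    "9.1": "Serdang",
    "1": "Kedah-Kelantan",
    "18.1.1.3": "Tethered",
    "9": "Malaysia",
    "2": "purebred",
    "8": "3",
}

# Order the tags as they appear in the Descriptor List.
ordered_tags = sorted(sheet, key=tag_key)
```

Sorting with tag_key returns the entries in Descriptor List order (1, 2, 8, 9, 9.1, 18.1.1.3), which makes a sheet easier to check against the list before entry.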

10. TIME FRAMEWORK

As a guide to compilers, a brief time framework is given in Appendix 2 for the various steps in the data search, extraction and presentation. This is based on experience from the two-year trials held in different countries in Africa, Asia and Latin America from 1983 to 1985.

11. SUMMARY

Various source materials published after 1960 will be scanned and breed or strain characteristics extracted and presented in a format (free as well as fixed) that could be easily entered into a computer system. The presentation will be separate for physical characteristics (in Master Records) and performance and environmental characteristics (in Slave Records). A summarised flow chart is given below for the data extraction and presentation. For each breed/strain represented in the country, there will be one Master Record and several Slave Records. The latter will depend on the number of publications available.