CHAPTER 17. DATA PROCESSING
 

The information included in this chapter is intended primarily for senior statisticians responsible for the organization and management of agricultural census data processing. The detailed technical issues of computer data processing are considered to be beyond the scope of this manual and are therefore not included. Other relevant topics are the Census Questionnaire (Chapter 8) and the Tabulation Plan (Chapter 9). 

Data processing relates to those activities normally undertaken during and after data collection. The data must be edited before they are summarized and published in tabular form. In many countries large numbers of questionnaires are collected and the processing is a lengthy and complex operation. Of course, it is not possible to utilize these data without checking, classifying and summarizing them.

This chapter discusses the concerns and issues which arise during an agricultural census for the various activities related to data processing. Owing to rapid improvements in data processing technology, it is especially difficult to make recommendations or even to generalize, because prior experience has little direct bearing on the activities and situations of today's agricultural census; no such attempt is therefore made here. Instead, technical details on data processing can be found in specialized literature. FAO has published a booklet, "Micro-computer-based data processing" (1987), illustrating how standard software packages like dBase, SAS and Lotus can be used in census applications; however, it should be read as a guide to processes and techniques rather than as a reference to specific hardware or software, as technology in this area is changing rapidly.

Prior experiences

17.1 The increased availability of computers in the 1970s and 1980s was expected to be of great assistance in rapidly producing accurate census results. In practice, many problems, objective and subjective, occurred, leading to major delays in data processing. Some of these problems related to frequent failures of early models of computer equipment, difficulties in maintenance, power failures, lack of qualified staff, etc.

17.2 Other problems concerned poor organization from lack of experience; for example, although computers can quickly tabulate large amounts of data, data entry and error checking present different kinds of problems. Perhaps the most important of these problems relates to difficulties in communication between statisticians and computer experts who are not familiar with each other's work. Typically, statisticians would forward questionnaires to computer sections without sufficient guidance and instructions. Since errors arising during data collection become visible when tabulations are prepared, the blame for these errors has often been attributed to the data processing operation.

17.3 The rapid improvement in electronic data processing hardware and software creates some difficulties in realistic planning. The proper choice of appropriate hardware, whether personal computers or micro-, mini- or mainframe computers, requires knowledgeable input. The use of optical readers for automatic data entry and hand-held computers for direct entry by enumerators in the field has not yet become practical for most agricultural statistical applications. On the other hand, progress achieved in the reliability of the hardware and the low cost of electronic data storage, plus the increasing availability of suitable software and trained computer experts, are expected to contribute to smoother data processing of agricultural censuses and surveys in the future. Because of these rapid changes in technology, FAO chooses not to recommend specific hardware or software, since today's optimum configuration will rapidly become out-dated.

17.4 It should be remembered that, with respect to data processing, the agricultural census will, in general, be a new experience, even if previous censuses were processed by computer. Technology has changed and little of the previous experience and know-how can be applied. Often, the persons involved in earlier censuses may now be involved in other activities and not be available for the current census. Many countries will have recent experience of processing population censuses and other large surveys and this should be used to develop data processing methods and procedures for the agricultural census.

Hardware

17.5 When considering hardware requirements, the main characteristics of agricultural census data processing should be kept in mind. These are: (i) large amounts of data to be entered in a short time, (ii) large amounts of data storage required although most data processing requires sequential access to data, (iii) relatively simple transactions, (iv) relatively large numbers of tables to be printed and (v) extensive use of raw data files which need to be on-line, if possible.


17.6 Basic hardware requirements consist, therefore, of many data entry stations (PCs, terminals or similar) and a relatively simple central processor. Arrangements should be made for regular backups of the data files and a security system for the storage devices must be maintained. Previously, magnetic tapes were used since they were relatively economical and satisfactory storage devices, and sequential processing of the data was most common; now most processing also utilizes direct access methods (see Frame 17.1), and direct-access storage devices (usually hard disks) are more readily available and more economical than in the past. Fast, high-resolution graphics printers capable of producing tables ready for distribution are also required.

17.7 Many developed countries take advantage of local area networks (LANs) for processing the census data (see Frame 17.2). However, it is important to realize that networks require substantial maintenance (trained staff and specialized hardware) and technical support for both hardware and software, involving also organizational and security issues. It is imperative that network problems do not prevent continued (although perhaps limited) processing of the data.


When information is stored in a file, it is usually located in "records" in the order in which it is entered. This file can then be sorted on the basis of a specific piece of information (key), so that the user can quickly identify the proper record. 

When software programs access records in a data file on the basis of the order stored, the process is called sequential access. Thus, the software must read the entire file to access the last record in the file. In this case the "order" of the records is physically and logically the same. 

If the software programs can read any record in the file directly (based on the key), then the process is called direct access. In this case the physical and logical "order" of the records need not be the same.

The access method depends not only on the software, but also on the hardware. Sequential access was the only method possible when data were stored on tape, because the tape would need to be rewound or wound forward in order to allow "jumps" in the reading. Hard disk storage allows reading without winding and rewinding.

Frame 17.1 File Access Methods
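
To illustrate the distinction, the following minimal sketch (in Python, with a hypothetical file name and fixed record length) shows how sequential access reads records in stored order, while direct access jumps straight to a record by computing its offset:

RECORD_SIZE = 80  # each record is assumed to occupy exactly 80 bytes

def read_sequential(path):
    """Sequential access: records are read one after another, in stored order."""
    with open(path, "rb") as f:
        while True:
            record = f.read(RECORD_SIZE)
            if len(record) < RECORD_SIZE:
                break  # end of file reached
            yield record

def read_direct(path, record_number):
    """Direct access: jump straight to a record via its computed offset."""
    with open(path, "rb") as f:
        f.seek(record_number * RECORD_SIZE)  # no winding or rewinding on disk
        return f.read(RECORD_SIZE)

Reading the last record of a large file thus costs a single seek with direct access, but a pass over the entire file with sequential access.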

17.8 In developing countries with frequent power failures, equipment ensuring a continuous and stable source of electricity is essential. An uninterruptible power supply (UPS) is very important and relatively inexpensive, not only to protect computer hardware but also to prevent loss of data or delays in data processing during power failures.

17.9 One of the major decisions is the choice between a mainframe computer, a mini-computer and micro-computers. The solution depends mainly on the organization of the work and the cost of the equipment. The fact that much data entry is involved, and only relatively simple processing, makes micro-computers suitable for this application. Furthermore, micro-computers are more flexible for subsequent uses (e.g., can be transferred to provincial offices) and may be easily used for other applications after the census.

17.10 When estimating the hardware requirements, the most important factor to be kept in mind in agricultural censuses is the amount of data collected, because of the time involved in data entry and verification. It is important, therefore, to estimate the number of data entry stations (terminals or micro-computers) required for this operation. This can be done based on an estimated number of keystrokes per census questionnaire, or by measuring the time required for entering data from test questionnaires obtained as part of a pre-test or pilot census. The number of required stations will also depend on the time planned to complete the whole data entry operation. An example of such calculations is shown in Frame 17.3.

17.11 When estimating the required number of stations it should be kept in mind that many of them will also be needed for other tasks: verification of data entry (often 100 percent verification is done; see the section on data entry and verification later in this chapter), correction of data errors discovered, programming and testing programs, etc. Possible delays because of power failures, organizational problems, human errors, etc., should also be taken into account; these problems, as a rule, are much greater than expected. On the other hand, the number of months required for data entry can be reduced by introducing additional shifts, overtime, etc.

A LAN (local area network) is a communication link allowing computers ranging from micro-computers to mainframes, and most peripherals (printers, modems, etc.), to access each other (data, programs, etc.), bypassing hierarchical structures. A LAN usually operates within one site (such as a central statistical office).

A WAN (wide area network) is a communication link between different LANs, either on the same site or in different places or countries (such as central and provincial statistical offices).

Frame 17.2 LAN and WAN

If t is the time (in minutes) required to enter data from a questionnaire, the number of questionnaires that can be entered in a month using one station is:

Q = ( 6 × 1 × 20 ) × 60 / t, or

Q = 7200 / t,

assuming respectively: 6 working hours per shift, 1 shift a day and 20 working days a month.

The number of stations required to enter N questionnaires in M months will then be:

S = N / ( M × Q ).

With the above assumptions, in a country where data entry is planned to be completed in 6 months (M=6) and 10 minutes are required to enter data from one questionnaire (t=10); for 100,000 questionnaires (N=100000) one can calculate:

Q = 720 and S = 100000/(6×720), or

S ≈ 23.1.

That is, 24 stations (rounding up) are needed for data entry only.

Frame 17.3 Calculation of the Number of  Data Entry Stations
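
The calculation in Frame 17.3 is simple enough to be expressed directly in code. The following sketch (in Python; the parameter names are illustrative) reproduces it, rounding the result up since a fraction of a station cannot be installed:

import math

def stations_needed(n_questionnaires, months, minutes_per_questionnaire,
                    hours_per_shift=6, shifts_per_day=1, days_per_month=20):
    """Number of data entry stations required, as in Frame 17.3."""
    minutes_per_month = hours_per_shift * shifts_per_day * days_per_month * 60
    q = minutes_per_month / minutes_per_questionnaire  # questionnaires per station-month
    s = n_questionnaires / (months * q)                # stations required
    return math.ceil(s)  # round up so the deadline can actually be met

print(stations_needed(100000, months=6, minutes_per_questionnaire=10))  # -> 24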

Software

17.12 As explained in more detail in the following sections, the main tasks of software in census data processing are: (i) data entry, (ii) checking data for consistency, (iii) automatic data correction (when applied), (iv) handling data files (sorting, checking for duplicates, direct access, etc.), (v) data tabulation and (vi) presentation of data for printing. In addition, increasing use of software for graphic presentation and mapping of census results has been observed in many countries. In the case of sample enumeration, data have to be expanded and software for calculating sampling errors is required.

17.13 Improvements and changes in computer hardware/peripherals, including significant advances in reducing both the physical size and cost of storage, have had a major impact on the development of all software and especially of data base and statistical analysis software.

17.14 Some software packages which were considered state-of-the-art only one or two years ago are now considered obsolete because more recent packages take advantage of quicker, more accurate processing techniques. Such techniques formerly required (and still require) large amounts of memory and storage space; the difference is that more computers now have the necessary memory and storage space.

17.15 Moreover, within a year, with potential changes in the operating systems and in the integration of the software into a multi-task environment, different versions and/or different packages will replace the older ones. The user is confronted with the enormous task of trying to maintain the quality of data processing while adjusting to rapid changes in the appearance and operation of the software.

17.16 The use of procedure-oriented software has already reduced the burden on programmers in the development of modules for generic types of statistical processes, like averages, variances, regressions, analyses of variance, scatter plots, etc. In the future it will be possible to "talk" to the computer and to accomplish many of these simple tasks. However, proper syntax and format will still be required to convey the commands to the computer.

17.17 Given this wide range of hardware and software, and the rapid changes in this area, it is not realistic to expect that one type of hardware and one type of software will serve for many years. As hardware and software are upgraded it becomes necessary to ensure that data can be moved from one software package to another (that data files are portable). Word processing files written in one package may not be readable in a different package unless the file is converted to the proper format. Data entered on a spreadsheet may not be readable by a statistical analysis package; data entered into one kind of database may not be readable by a spreadsheet or by another type of database software. Thus, it is usually preferable to use standard software which is maintained by the manufacturer and for which documentation and experienced experts are readily available. Portability of data files is important, not only within the statistical office but also to be able to provide data in a computer-readable form to external users (see Chapter 18).
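
One common safeguard is to keep a master copy of the raw data in a plain, self-describing format that virtually every package can read, such as delimited text (CSV). A minimal sketch (in Python; the column positions are hypothetical and would come from the actual record layout):

import csv

# Hypothetical fixed-width layout: holding id in columns 0-7,
# crop code in columns 8-10, area in hectares in columns 11-18.
FIELDS = [("holding_id", 0, 8), ("crop_code", 8, 11), ("area_ha", 11, 19)]

def fixed_width_to_csv(in_path, out_path):
    """Convert a fixed-width raw data file to CSV so that any package can read it."""
    with open(in_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow([name for name, _, _ in FIELDS])  # header row
        for line in src:
            writer.writerow([line[start:end].strip() for _, start, end in FIELDS])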

17.18 The improvements in laser printers, including the possibility of printing colour graphics, have made it possible to use standard micro-computer software for many publications. In the near future, one can expect the cost of these printers to be reduced significantly and the resolution to be much better. One major advantage is that many special characters can be incorporated into reports and graphs. Of particular value is the capability to print reports in most languages. Of course, this capability is directly related to the use of these same characters in software commands and text and in corresponding screen images.

17.19 Given all these considerations, it is not appropriate or possible for FAO to make any specific software recommendations. However, it can be said that the use of modern technology can expedite the processing and dissemination of agricultural census data.

Purpose of checking, editing and coding data

17.20 The effect of checking and/or editing questionnaires is (i) to achieve consistency within the data and consistency within the tabulations (within and between tables) and (ii) to detect and verify, correct or eliminate outliers, since extreme values are major contributors to errors in summaries (when sample expansion factors are applied, major errors in the data produce unrealistic values).

17.21 Editing involves revising or correcting the entries in the questionnaires. The need for revising recorded data arises, for example, from illegible recording by enumerators. It should be kept in mind that most of the "errors" detected during data entry occur because of illegible handwriting.

17.22 An important function of checking is to verify that completed questionnaires properly identify the holding as an agricultural holding meeting minimum requirements such as size of holding or livestock, or value of sales, as defined for a specific census.

17.23 Numerical codes replace words as a means of condensing the information to be stored. Thus, the words "spring wheat" are replaced by a number (usually 3 digits in most countries), reducing the number of characters used from 12 (including the blank) to 3, which reduces the possibility of mistyping (misspelling) the crop when information is keyed into the computer.
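
In software, such a code list is simply a lookup table. A minimal sketch (in Python; the names and codes shown are hypothetical, since actual code lists are fixed nationally and printed in the coding manuals):

# Hypothetical 3-digit crop codes.
CROP_CODES = {
    "spring wheat": "011",
    "winter wheat": "012",
    "maize": "020",
}

def code_for(crop_name):
    """Return the numerical code for a crop name; unknown names should be
    referred back to the coding staff rather than guessed."""
    return CROP_CODES.get(crop_name.strip().lower())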

17.24 Data checking, editing and coding is considered to be the most difficult phase of data processing. Most first-time census planners/statisticians can prepare a reasonably good table, but have great difficulty with the organization of data management. It is recommended, therefore, that this phase be planned early so that computer programs and related procedures can be prepared and tested to ensure that the overall approach is realistic and functional.

Data processing activities

17.25 The main activities in data processing are as follows:

  1. Monitoring and controlling of questionnaires.
  2. Checking (manual editing) and coding.
  3. Data entry and verification.
  4. Computer editing and coding.
  5. Storage and security.
  6. Tabulation.
  7. Calculation of sampling error and additional data analysis.

17.26 These activities are closely interrelated and must be coordinated within a well-planned timetable. Sufficient documentation must be prepared to enable everyone to understand the specific steps to be undertaken. Cooperation between the computer processing unit and the statistical unit is important to reduce the possibilities of misunderstanding and to clarify any issues which arise.

17.27 Countries where provincial offices are involved in the processing will have some of the activities listed above completed in the provincial offices. Provincial offices need to establish a control system to ensure receipt of questionnaires from every enumeration area. Generally, if there is a provincial office system, the provincial offices will also carry out the functions ensuring the completion of the enumeration process and the questionnaires. Provincial offices will reduce the processing workload of the central office. However, with a provincial office system, the central office will need to be prepared to track and verify the processing completed in the provincial offices and will have to provide technical assistance (instructions, software, hardware, training, etc.).

Monitoring and control of questionnaires

17.28 The agricultural census is a large operation usually involving thousands of questionnaires even if a small-scale sample enumeration is organized; in the case of complete enumeration in larger countries there may be millions of questionnaires. Obviously, special control measures are required to ensure that all questionnaires are received. Adequate physical storage space should be made available in time to avoid damage or misplacement of questionnaires. When the completed questionnaires are returned by the enumerators, they should be transferred through supervisors at different administrative levels to the designated processing centre (central or provincial offices). To simplify control measures, questionnaires should be grouped by geographical areas and identified by appropriate forms relevant to the filing system adopted.

17.29 During the processing, questionnaires are removed from storage many times: for manual editing, data entry and verification, checking of figures when computer editing detects potential errors, etc. Strict control during this phase is essential but difficult. It is important, therefore, to establish very rigid control over the flow of questionnaires and to make periodic reviews in order to detect misplaced questionnaires. Good organization in filing the questionnaires will greatly facilitate control. For example, in sample enumeration, an important practical advantage of selecting a fixed number of respondents in the secondary sampling units is that the folders are all of much the same size.

Checking (manual editing) and coding

17.30 The main task of checking or manual editing in data processing is to detect omissions, inconsistencies and other obvious errors in the returns and to correct them before subsequent processing stages or at least to reduce the level of errors to acceptable limits. Faulty questionnaires can be sent back to the field, or corrected in the office on the basis of instructions given to the editing personnel (e.g., using averages from the province or data from neighbouring holdings).

17.31 Manual editing should begin as soon as possible after data collection and as close to the source of the data as possible, such as in provincial, district or lower level offices. This procedure facilitates any necessary re-enumeration and has the advantage of utilizing personnel familiar with local conditions.

17.32 The errors discovered through internal and external consistency checks may be response errors; they may also result from recording the replies in the wrong place on the questionnaire or from faulty and illegible handwriting.

17.33 During manual editing it is beneficial to conduct a random review of the checking and coding operations, because many "editors" develop their own patterns for correcting errors and for interpreting difficult-to-read handwritten responses. Although such patterns may introduce some bias, it is also important that the corrections be made consistently.

17.34 Coding refers to the operation whereby original information from the questionnaire, as recorded by enumerators, is replaced by a numerical code required for processing. Typical examples are when names of crops, livestock, farm machinery, activities, etc., are replaced by a unique number (code) or when data expressed in local units are converted to a standard unit. The modern trend is either to enter the complete answer or to use fully precoded questionnaires, leaving the problem of local units to enumerators, who are expected to record in the questionnaires data that are ready for processing.

17.35 Furthermore, since manual editing is followed by computer editing, the two phases should be coordinated. Since the computer has the capability to implement instructions quickly, completely, consistently and accurately, some of the functions of editing, such as imputation of missing entries, if implemented, should be entrusted to computers rather than to manual editing.

17.36 As already mentioned, manual editing should be organized in the field as part of the supervisors' responsibility. Instructions for editing should be included in the supervisors' manual. It is important to prepare a detailed manual for central office editing, not only to instruct the editing clerks (who can also be trained verbally), but also to describe exactly, for the benefit of other staff, how the procedures are applied. For example, staff responsible for computer editing should know the exact rules for manual editing to avoid possible contradictions and the introduction of personal bias.

Data entry and verification

17.37 Data entry, which refers to the transfer of data from questionnaires to computer-readable media, is one of the phases of data processing that consumes the most time and resources.

17.38 This operation is normally done by data entry clerks who key in data from questionnaires to disks or tapes using keyboards of data entry stations (terminals, micro-computers or similar units). Work can be organized in different ways; all data for the questionnaire can be entered at the same time or the data can be entered section by section using data entry clerks specialized in the specific sections. The present trend is to enter data for the whole questionnaire at one time, using software which simulates parts of the questionnaire on the micro-computer monitor.

17.39 The speed of data entry in ideal cases is considered to be 8000 keystrokes/hour but it may be much less if the questionnaire is not designed for rapid data entry. In particular, interactive editing may slow the speed of data entry. This matter will be discussed later in the section on computer editing and coding.

17.40 For agricultural censuses based on a small sample of holdings, it is recommended that data entry be 100 percent verified. Verification should be done by a data entry clerk who alternates between data entry and verification of other operators' work, but who did not do the original data entry. Experience shows that when the second data entry clerk acts merely as a verifier who reviews and corrects the work done by the first clerk, the verifier tends to agree with what has already been keyed. This second method of verification should be avoided whenever possible.

17.41 In the case of larger censuses, complete verification should be done at the beginning of data entry, not only to identify errors but also to identify clerks with low performance. Subsequent verification on a sample basis may be sufficient to monitor performance. One-hundred percent verification may be reintroduced for clerks failing to maintain an adequate standard of work. Verification could be reduced as performance improves, but sample verification at some level should continue for all data entry clerks.
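
Proper verification amounts to keying the same batch twice, independently, and comparing the two versions record by record. A minimal sketch (in Python; the record format shown is hypothetical):

def verify(first_entry, second_entry):
    """Compare two independently keyed versions of the same batch and return
    the identifiers of questionnaires whose records differ; these are then
    checked against the paper questionnaire."""
    return [qid for qid, record in first_entry.items()
            if second_entry.get(qid) != record]

# Illustrative batch: one record string per questionnaire identifier.
batch_a = {"0001": "011;2.50", "0002": "020;1.10"}
batch_b = {"0001": "011;2.50", "0002": "020;1.01"}  # keying error in the area field
print(verify(batch_a, batch_b))  # -> ['0002']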

Data entry alternatives

17.42 Data entry through a keyboard is a time-consuming operation subject to human error. One alternative is to use automatic reading devices which are capable of scanning human-recorded documents and reading them into the computer directly without keying. Basically, there are two kinds of optical readers: (i) optical character readers, capable of reading numbers or letters written by hand in a strictly predetermined position on the questionnaire, and (ii) optical mark readers, which can recognize marks made by a special pencil against numbers or letters preprinted on very special questionnaires. These methods have been tried in many countries with relative success, but a general conclusion seems to be that they are not suitable for agricultural census and survey applications, which involve numbers with many digits. The conclusions from cost/benefit analyses advise against using these machines in developing countries. Nevertheless, among the infrequent cases of successful experiences with optical mark readers are the 1983/84 agricultural census of Bangladesh and the 1990 agricultural census of Japan.

17.43 Another approach used to speed up data entry is the field use of hand-held computers instead of (or in addition to) questionnaires, with data entry completed directly by the enumerators, who then send the data file over a telephone modem to a computer in the central office. This method has been used for a number of applications but still does not appear to be suitable for agricultural statistics on a cost/benefit basis, even in developed countries. Technological advances in the near future may reverse this evaluation.

Computer editing and coding

17.44 Computer editing is checking the general credibility of the data by computer with respect to (i) missing data, (ii) range tests, and (iii) logical and/or numerical consistency. Examples could be: (i) non-response (e.g., age of the holder not reported); (ii) improbable or impossible entries (e.g., yield is a hundred times higher than normal, age of the holder is less than 15 years); (iii) internal inconsistencies (e.g., wheat production reported but area not reported, pigs under 6 months plus pigs 6 months and over not equal to total pigs). In many cases these errors occur because of the failure to define the terms completely, or because the enumerators have not had sufficient training to detect incomplete information. And, of course, it is possible that the errors were created during the data entry phase.
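
The edit rules listed above translate naturally into simple program logic. A minimal sketch (in Python; the field names are hypothetical) applying the three kinds of checks from this paragraph to one record:

def edit_checks(rec):
    """Apply missing-data, range and consistency checks to one record
    (a dict); returns a list of error messages, empty if the record passes."""
    errors = []
    # (i) missing data
    if rec.get("holder_age") is None:
        errors.append("age of holder not reported")
    # (ii) range test: improbable or impossible entries
    elif rec["holder_age"] < 15:
        errors.append("age of holder is less than 15 years")
    # (iii) consistency: production reported but area not reported
    if rec.get("wheat_production", 0) > 0 and rec.get("wheat_area", 0) == 0:
        errors.append("wheat production reported but area not reported")
    # (iii) consistency: pig age groups must add up to total pigs
    if rec.get("pigs_under_6m", 0) + rec.get("pigs_6m_over", 0) != rec.get("pigs_total", 0):
        errors.append("pig age groups do not sum to total pigs")
    return errors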

17.45 Computer coding (or precoding) refers to assigning special codes to important classes of data, such as size class codes (codes 1, 2, 3, ...) for consecutive zones, in order to avoid repeated calculation by the computer. This classic approach is normally no longer used with fast modern computers (or PCs).

17.46 Computer editing is in fact a continuation of manual editing, with several differences. Firstly, computer editing is aimed at discovering not only errors in questionnaires, but also errors committed at the data entry stage. Secondly, computer editing has the important advantage that the computer is not subject to human error: the instructions given will be implemented repeatedly exactly as written, so that consistency is assured. Thirdly, computer editing is much faster than manual editing and, although detected errors may require manual intervention, the computer can immediately and accurately insert imputations and reduce the workload.

17.47 It should also be noted that manual editing may have some advantages. For example, manual editing can identify questionnaires which should have been returned for completion and initiate follow-up action. Similarly, with a quick review of the questionnaires, supervisory enumerators can detect poor enumeration and inconsistent responses and take corrective action.

17.48 It is important, therefore, to coordinate manual and computer editing so that advantages of each are fully utilized and, above all, that instructions given for the two kinds of editing are not contradictory. For example, irrigated land which is not cropped may exist in countries where pastures are sometimes irrigated. Statisticians preparing instructions for computer editing should, therefore, know exactly what kind of manual editing is to be done and how.

17.49 Computer editing can be done in two ways: (i) interactively at the data entry stage, or (ii) using batch processing after data entry, or some combination of both. Modern equipment and software facilitate checking during data entry and immediately provide error messages on the monitor and/or may reject the data unless they are corrected. This process is very useful in the case of simple mistakes such as keying errors, but may greatly slow down the data entry process in the case of errors which require consultation with supervisors. Interactive editing at the data entry stage is aimed mainly at discovering errors made in data entry, while more difficult cases, such as non-response, are left for a separate computer editing operation.

Imputation

17.50 Most detected errors cannot be corrected without re-interviewing the holder. When returning questionnaires to the field is not possible, a remedy (imputation) is available which consists of correcting inconsistent data or providing missing entries on the basis of knowledge available in the office. These inserted values may be averages for groups of holdings with similar characteristics, or may be logical conclusions based on other information available (e.g., missing age of the holder may be estimated from information on the age of his/her children). Missing data for a typical agricultural holding can be copied from another similar holding, without major effects on final results.

17.51 Whichever method is used, data correction or imputation is a delicate procedure that is difficult to implement. Imputation (see Frame 17.4 for examples of two methods) can be done manually or automatically by computer. Generally, manual corrections are recommended for smaller surveys and sample enumeration, particularly in developing countries. Manual imputations generally involve consulting the questionnaire for some additional information which may be useful, or simply correcting a keying error that has been discovered, often one due to illegible handwriting. One of the problems with manual imputations lies in the repetition of the editing process. Typically, many computer runs may be required before all errors are eliminated (e.g., out of, say, 800 errors discovered in a province during the first edit check, only 600 may be successfully corrected and 50 new ones made, so that during the second edit check 250 errors are detected, and so on). Furthermore, in order to avoid repeated edit checks of "good" records, they should be either stored separately or flagged; some organizational problems may be created in either case.


"Cold deck" and "hot deck" are the names of two procedures of imputation for missing or wrong values.

The "Cold deck" imputation consists of having pre-selected data for typical agricultural holdings, for each administrative area, and copying these data to replace non-response.

The "Hot deck" procedure consists of using data from a recently processed holding with similar characteristics instead of using data from pre-selected holdings.

"Cold deck" is obviously easier to implement but requires that the choice of replacement values should be perfect in order not to bias the data and artificially minimize the variability (particularly in case of intensive use). "Hot deck" avoids this risk but is more difficult to define and requires more powerful computers.

In any case, counts should be kept of the number of times each procedure is applied, by areas, regions, etc., and these numbers should remain within reasonable limits.

Frame 17.4 Methods of Imputation
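
A minimal sketch of the two procedures (in Python; the record structure, area codes and strata are hypothetical), including the counts of use recommended in Frame 17.4:

def cold_deck(record, field, cold_values, counts):
    """'Cold deck': fill a missing value from the pre-selected typical holding
    for the record's administrative area."""
    if record.get(field) is None:
        record[field] = cold_values[record["area_code"]][field]
        counts[record["area_code"]] = counts.get(record["area_code"], 0) + 1
    return record

class HotDeck:
    """'Hot deck': remember the last complete value seen for each stratum of
    similar holdings and use it to fill the next missing one."""
    def __init__(self):
        self.last_seen = {}
        self.count = 0
    def process(self, record, field, stratum):
        if record.get(field) is None:
            record[field] = self.last_seen.get(stratum)  # may be None early in the run
            self.count += 1
        else:
            self.last_seen[stratum] = record[field]
        return record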

17.52 Some of the above problems can be avoided by using automatic computer correction. However, this operation is very delicate and may change the values of the original data considerably. Some surveys have been ruined because programming errors have spoiled the data.

17.53 The philosophy of computer editing and imputation should take the following aspects into account: (i) the immediate goal in an agricultural census is to collect data of good quality; if only a few errors are discovered, any method of correcting them may be considered satisfactory; (ii) it is important to keep a record of the number of errors discovered and of the corrective action taken (by kind of correction); (iii) non-response can always be tabulated as such in a separate column; the data user, however, is generally less qualified than the statistician to guess what non-response means (say, age of holder not reported) and prefers not to see this category; and (iv) redundancy of information collected in the questionnaire is very useful to help detect response errors and assess the quality of data in general. For example, if data are collected on the total number of pigs as well as on pigs classified by age (under six months and six months and over), this is redundant information, which may reveal an inconsistency such as 5 = 2 + 2. It is difficult to correct such data unless the holding is visited again or data entry errors are discovered. Too much redundancy may, however, slow data processing considerably. It is considered, therefore, that a reasonable amount of redundancy, particularly for important data, is useful, but when there is too much redundant data some may have to be ignored.

Storage and security

17.54 Two levels of storage and security are required: first, at the questionnaire level, and second, at the level of data stored on computers. When data are collected, most countries emphasize the confidentiality of the responses. Thus, it is necessary to prevent unauthorized access to the data at each level. When data are entered into a computer or edited, the individuals involved should be aware of this requirement and of the penalties associated with disclosure. Passwords should be used to limit access to files and, at a certain stage, it may be worthwhile to encrypt the data so that unauthorized access does not permit direct reading of the data. Failure to do so may damage the credibility of the organization and lead to less accurate responses or even refusals in the future.

17.55 It is impossible to know when data will be destroyed unintentionally; natural disasters, fires, power failures and programming errors can all contribute to the loss of important data files. For this reason, it is always stressed that there should be backup copies of the data and that, as the processing continues, changes in the data require new backup copies. These copies could be both "on-line" and "off-line", but proper precautions must be taken so that all copies cannot be destroyed at the same time because they are stored on the same micro-computer and/or in the same room or building. For example, one copy of the data could be stored in a fire-proof safe, or a copy of sub-national data could be maintained in each of the sub-national offices.
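
Backup routines need not be elaborate to be effective. A minimal sketch (in Python, using only the standard library; the directory layout is hypothetical) that copies a data file into a dated backup directory, which should be located on a different machine or at least a different disk:

import datetime
import pathlib
import shutil

def backup(data_file, backup_dir):
    """Copy a data file into a directory named after today's date."""
    dest = pathlib.Path(backup_dir) / datetime.date.today().isoformat()
    dest.mkdir(parents=True, exist_ok=True)  # create the dated directory if needed
    return shutil.copy2(data_file, dest)     # copy2 also preserves timestamps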

Tabulation

17.56 The tabulation plan was described in Chapter 9. Tabulation is the main component of data processing; moreover, the tables produced are the most visible outcome of the whole census operation and its most used output. Nevertheless, all preparations (dummy tables, computer programs, etc.) must be completed and tested, and the data editing and corrections properly done, before the tabulation can become a reality. The main problems with final tables are mistakes committed in earlier phases of the census operations which may not become visible until tabulation. The need for correction of data and of processing programs, and for retabulation, at this stage can delay the final output considerably.

Calculation of sampling errors and other analysis

17.57 As pointed out in Chapter 9, Tabulation plan, the data collected by sample enumeration cannot be properly used and evaluated unless an indication of the sampling error is associated with the values obtained. These calculations are usually simple to carry out with statistical packages. However, one should be careful about calculating sampling errors without a good knowledge of sampling.
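
For illustration only, the following sketch (in Python) computes an expanded total and its standard error for the simplest case, a simple random sample of holdings drawn without replacement; actual census designs (stratified, multi-stage) require the formulas appropriate to the design:

import math

def expanded_total_and_se(sample_values, population_size):
    """Estimate a population total and its standard error from a simple
    random sample drawn without replacement."""
    n = len(sample_values)
    mean = sum(sample_values) / n
    s2 = sum((y - mean) ** 2 for y in sample_values) / (n - 1)  # sample variance
    total = population_size * mean                              # expansion
    # variance of the estimated total, with finite population correction
    var_total = population_size ** 2 * (1 - n / population_size) * s2 / n
    return total, math.sqrt(var_total)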

17.58 The data can be prepared not only in table format, but also as a data file which can be distributed to users (see also Chapter 18). In this case it is important to ensure the confidentiality of the data; even if questionnaire-level data are needed and can be legally supplied, names and addresses of holders should not be provided to users.
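
Removing identifying fields before distribution can be automated. A minimal sketch (in Python; the column names are hypothetical and the actual list of identifying fields depends on the questionnaire):

import csv

IDENTIFYING = {"holder_name", "address"}  # hypothetical identifying columns

def anonymized_copy(in_path, out_path):
    """Write a copy of a holding-level CSV file with identifying columns removed."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        kept = [c for c in reader.fieldnames if c not in IDENTIFYING]
        writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)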

Testing computer programmes

17.59 Considerable time is required to write computer programmes for error identification, automatic error correction (if applied), tabulation, calculation of sampling errors, etc., using available software. The computer programmes prepared should be tested, possibly with data from pretest surveys. Questionnaires used in the main data collection operation are likely to differ from questionnaires used for pretesting; in such a case, data on questionnaires referring to holdings enumerated in the pretest must be transferred to the census questionnaires. It may also be necessary to enter estimates on the census questionnaires for items not included in the pretest, as well as erroneous data designed to test the full range of error detection specified for the computer programmes. Computer printouts should list identified errors and corrections. Corrections should also be reviewed to determine whether all errors have been detected. If they have not been detected, additional specifications are required to correct the remaining errors or inconsistencies.

17.60 Computer programmes should be tested, normally by verifying results of both error detection and tabulations for a group of 100-500 questionnaires. Data used for such tests should be tabulated manually to check each item or its classification in the tabulations. Manual tabulation of 100-500 questionnaires is a time-consuming operation and requires qualified staff. When such staff are not available, the number of questionnaires used for testing may be reduced. In any case, it is best to conduct an initial test using questionnaires with artificial data in an attempt to cover all items in as few questionnaires as possible. If the data are well prepared, only 20-50 questionnaires may be sufficient.
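
Artificial test data of this kind can be generated so that each record deliberately violates one edit rule, ensuring that every branch of the error-detection programme is exercised at least once. A small sketch (in Python, reusing the hypothetical edit_checks() function outlined under paragraph 17.44):

# Each artificial record violates exactly one edit rule.
test_records = [
    {"holder_age": None, "pigs_under_6m": 0, "pigs_6m_over": 0, "pigs_total": 0},  # missing age
    {"holder_age": 12, "pigs_under_6m": 0, "pigs_6m_over": 0, "pigs_total": 0},    # age under 15
    {"holder_age": 40, "wheat_production": 800, "wheat_area": 0,
     "pigs_under_6m": 0, "pigs_6m_over": 0, "pigs_total": 0},                      # area missing
    {"holder_age": 40, "pigs_under_6m": 2, "pigs_6m_over": 2, "pigs_total": 5},    # 5 != 2 + 2
]

for i, rec in enumerate(test_records, 1):
    print(i, edit_checks(rec))  # every record should produce exactly one error message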

Suggested reading
FAO (1965). Sampling methods and censuses (by S.S. Zarkovich).
FAO (1987). Micro-computer-based data processing: 1990 World Census of Agriculture.
UN (1982). Survey data processing: A review of issues and procedures. NHSCP technical study.