DATA PROCESSING
The information included in this chapter is intended
primarily for senior statisticians responsible for the organization
and management of agricultural census data processing. The
detailed technical issues of computer data processing are
considered to be beyond the scope of this manual and are therefore
not included. Other relevant topics are the Census Questionnaire
(Chapter 8) and the Tabulation Plan
(Chapter 9).
Data processing relates to those activities normally
undertaken during and after data collection. The data must
be edited before they are summarized and published in tabular
form. In many countries large numbers of questionnaires are
collected and the processing is a lengthy and complex operation.
Of course, it is not possible to use these data without
checking, classifying and summarizing them.
This chapter discusses the concerns and issues which
arise during an agricultural census for the various activities
related to data processing. Owing to rapid improvements in
data processing technology, prior experience has little direct
bearing on the activities and situations of today's agricultural
census. It is therefore especially difficult to make recommendations,
or even to generalise, and so this is not attempted.
Instead, technical details on data processing can be found
in specialized literature. FAO has published a booklet "Micro-computer-based
data processing" (1987), illustrating how to use standard
software packages like dBase, SAS, and Lotus in census applications;
however, it should be read as a guide to processes and techniques
rather than as a reference to specific hardware or software,
as technology in this area is changing rapidly.
Prior experiences
17.1 The increased availability of computers in the 1970s
and 1980s was expected to be of great assistance in rapidly
producing accurate census results. In practice, many problems,
objective and subjective, occurred, leading to major delays
in data processing. Some of these problems related to frequent
failures of early models of computer equipment, difficulties
in maintenance, power failures, lack of qualified staff, etc.
17.2 Other problems concerned poor organization from lack
of experience; for example, although computers can quickly
tabulate large amounts of data, data entry and error checking
present different kinds of problems. Perhaps the most important
of these problems relates to difficulties in communication
between statisticians and computer experts who are not familiar
with each other's work. Typically, statisticians would forward
questionnaires to computer sections without sufficient guidance
and instructions. Since errors arising during data collection
become visible when tabulations are prepared, the blame for
these errors has often been attributed to the data processing
operation.
17.3 The rapid improvement in electronic data processing
hardware and software creates some difficulties in realistic
planning. The proper choice of appropriate hardware, whether
personal computers or micro-, mini- or mainframe computers,
requires knowledgeable input. The use of optical readers for
automatic data entry and hand-held computers for direct entry
by enumerators in the field has not yet become practical
for most agricultural statistical applications. On the other
hand, progress achieved in the reliability of the hardware
and the low cost of electronic data storage, plus increasing
availability of suitable software and trained computer experts
are expected to contribute to smoother data processing of
agricultural censuses and surveys in the future. Because of
these rapid changes in technology, FAO chooses not to recommend
specific hardware or software, since today's optimum configuration
will rapidly become out-dated.
17.4 It should be remembered that, with respect to data processing,
the agricultural census will, in general, be a new experience,
even if previous censuses were processed by computer. Technology
has changed and little of the previous experience and know-how
can be applied. Often, the persons involved in earlier censuses
may now be involved in other activities and not be available
for the current census. Many countries will have recent experience
of processing population censuses and other large surveys
and this should be used to develop data processing methods
and procedures for the agricultural census.
Hardware
17.5 When considering hardware requirements, the main characteristics
of agricultural census data processing should be kept in mind.
These are: (i) large amounts of data to be entered in a short
time, (ii) large amounts of data storage required although
most data processing requires sequential access to data, (iii)
relatively simple transactions, (iv) relatively large numbers
of tables to be printed and (v) extensive use of raw data
files which need to be on-line, if possible.
17.6 Basic hardware requirements consist, therefore, of many
data entry stations (PCs, terminals or similar) and a
relatively simple central processor. Arrangements should
be made for regular backups of the data files and a security
system for the storage devices must be maintained. Previously,
magnetic tapes were used since they were relatively economical
and satisfactory storage devices. Sequential processing
of the data was most common, but now most processing also
utilizes direct access methods (see Frame 17.1). These
types of storage devices (usually hard disks) are more
readily available and are more economical than in the
past. Fast, high-resolution graphics printers capable
of producing tables ready for distribution are also required.
17.7 Many developed countries take advantage of local
area networks (LANs) for processing the census data
(see Frame 17.2). However, it is important to realize
that networks require substantial maintenance (trained
staff and specialized hardware) and technical support
for both hardware and software, involving also organizational
and security issues. It is imperative that network problems
do not prevent continued (although perhaps limited)
processing of the data.
When information is stored in a
file, it is usually located in "records" in the
order in which it is entered. This file can then
be sorted on the basis of a specific piece of information
(key), so that the user can quickly identify the
proper record.
When software programs access records in a data
file on the basis of the order stored, the process
is called sequential access. Thus, the
software must read the entire file to access the
last record in the file. In this case the "order"
of the records is physically and logically the
same.
If the software programs can read directly any
record in the file (based on the key), then this
process is called direct access. In this
case the physical and logical "order" of the records
need not be the same.
The access method depends not only on the software,
but also the hardware. Sequential access was the
only method possible when data were stored on
a tape, because the tape would need to be rewound
or wound forward in order to allow "jumps" in
the reading. Hard disk storage allows reading
without winding and rewinding.
Frame 17.1 File Access Methods
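The distinction described in Frame 17.1 can be illustrated with a minimal Python sketch. The fixed record length, the record layout (a 6-digit key followed by a 10-character value) and all function names are assumptions made for this illustration only; an in-memory file stands in for a tape or disk file.

```python
import io

RECORD_SIZE = 16  # assumed fixed record length: 6-digit key + 10-character value

def make_file(records):
    """Write (key, value) pairs as fixed-length records, in the order given."""
    buf = io.BytesIO()
    for key, value in records:
        buf.write(f"{key:06d}{value:10d}".encode("ascii"))
    buf.seek(0)
    return buf

def sequential_find(f, key):
    """Read records in stored order until the key is found.
    Returns (value, number_of_records_read)."""
    f.seek(0)
    reads = 0
    while True:
        rec = f.read(RECORD_SIZE)
        if not rec:
            return None, reads
        reads += 1
        if int(rec[:6].decode("ascii")) == key:
            return int(rec[6:].decode("ascii")), reads

def build_index(f):
    """Scan the file once and map each key to the byte offset of its record."""
    f.seek(0)
    index, pos = {}, 0
    while True:
        rec = f.read(RECORD_SIZE)
        if not rec:
            return index
        index[int(rec[:6].decode("ascii"))] = pos
        pos += RECORD_SIZE

def direct_find(f, index, key):
    """Jump straight to the record for the key; only one record is read."""
    offset = index.get(key)
    if offset is None:
        return None
    f.seek(offset)
    return int(f.read(RECORD_SIZE)[6:].decode("ascii"))

# Hypothetical holdings: (holding number, value), stored in order of entry.
data_file = make_file([(3, 120), (1, 45), (2, 300)])
key_index = build_index(data_file)
found_sequentially = sequential_find(data_file, 2)   # reads all three records
found_directly = direct_find(data_file, key_index, 2)
```

In the sequential case the last record is reached only after every preceding record has been read; in the direct case the key index supplies the position and a single read suffices.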
17.8 In developing countries with frequent power failures,
equipment ensuring a continuous and stable source of electricity
is essential. An Uninterruptible Power Supply (UPS) is very
important and relatively inexpensive not only to protect
computer hardware but also to prevent loss of data or
delays in data processing during power failures.
17.9 One of the major decisions is the choice between
a mainframe computer, a mini-computer and micro-computers.
The solution depends mainly on the organization of the
work and the cost of the equipment. The fact that much
data entry is involved, and only relatively simple processing,
makes micro-computers suitable for this application.
Furthermore, micro-computers are more flexible for subsequent
uses (e.g., can be transferred to provincial offices)
and may be easily used for other applications after
the census.
17.10 When estimating the hardware requirements the
most important factor to be kept in mind in agricultural
censuses is the amount of data collected because of
the time involved in data entry and verification. It
is important, therefore, to estimate the number of data
entry stations (terminals or micro-computers) required
for this operation. This can be done based on an estimated
number of keystrokes per census questionnaire, or measuring
time required for entering data from test questionnaires
obtained as part of a pre-test or pilot census. The
number of required stations will also depend on the
time planned to complete the whole data entry operation.
An example of such calculations is shown in Frame 17.3.
17.11 When estimating the required number of stations
it should be kept in mind that many of them will be
applied to other uses: verification of data entry (often
100 percent verification is done; see the section on
data entry and verification later in this chapter),
correction of data errors discovered, programming and
testing programs, etc. Also, possible delays because
of power failures, organizational problems, human errors,
etc., should be taken into account. These problems,
as a rule, are much greater than expected. On the other
hand, the number of months required for data entry can
be reduced by introducing additional shifts, overtime,
etc.
LAN (local area network) is a communication
link allowing computers ranging from micro-computers
to mainframes, and most of the peripherals (printers,
modems, etc.) to access each other (data, programs,
etc.) bypassing hierarchical structures. LAN usually
operates within one site (such as a central statistical
office).
WAN (wide area network) is a communication link
between different LANs, either on the same site
or in different places or countries (such as central
and provincial statistical offices).
Frame 17.2 LAN and WAN
If t is the time (in minutes)
required to enter data from a questionnaire, the
number of questionnaires that can be entered in
a month using one station is:
Q = (6 × 1 × 20) × 60 / t, or
Q = 7200 / t,
assuming respectively: 6 working hours per shift,
1 shift a day and 20 working days a month.
The number of stations required to enter N
questionnaires in M months will then be:
S = N / ( M × Q ).
With the above assumptions, in a country where
data entry is planned to be completed in 6 months
(M=6) and 10 minutes are required to enter data
from one questionnaire (t=10); for 100,000 questionnaires
(N=100000) one can calculate:
Q = 720 and S = 100000 / (6 × 720) ≈ 23.1, or
S ≈ 23.
That is, 23 stations are needed for data entry
only.
Frame 17.3 Calculation of the Number of Data
Entry Stations
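The calculation in Frame 17.3 can be reproduced with a short Python sketch, which also makes it easy to vary the assumptions (shifts, working days, time per questionnaire). The function and parameter names are chosen for this illustration and are not part of the original frame.

```python
import math

def questionnaires_per_station_month(t, hours_per_shift=6, shifts_per_day=1,
                                     days_per_month=20):
    """Q: questionnaires one station can enter in a month,
    where t is the entry time per questionnaire in minutes."""
    return hours_per_shift * shifts_per_day * days_per_month * 60 / t

def stations_required(n, months, t, **assumptions):
    """S = N / (M x Q), returned unrounded."""
    return n / (months * questionnaires_per_station_month(t, **assumptions))

# Figures from Frame 17.3: N = 100000, M = 6, t = 10.
q = questionnaires_per_station_month(10)
s = stations_required(100_000, 6, 10)

# Same workload with two shifts a day instead of one.
s_two_shifts = stations_required(100_000, 6, 10, shifts_per_day=2)

# Rounding up guarantees that the workload fits within the planned months.
stations_rounded_up = math.ceil(s)
```

The unrounded result is slightly above 23, so a planner who needs the work to finish strictly within six months may prefer to round up rather than to the nearest whole station. A second shift halves the number of stations required.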
Software
17.12 As explained in more detail in the following sections,
the main tasks of software in census data processing are: (i)
data entry, (ii) checking data for consistency, (iii) automatic
data correction (when applied), (iv) handling data files (sorting,
checking for duplicates, direct access, etc.), (v) data tabulation
and (vi) presentation of data for printing. In addition, increasing
use of software for graphic presentation and mapping of census
results has been observed in many countries. In the case of
sample enumeration, data have to be expanded and software
for calculating sampling errors is required.
17.13 Improvements and changes in computer hardware/peripherals,
including significant advances in reducing both the physical
size and cost of storage, have had a major impact on the development
of all software and especially of data base and statistical
analysis software.
17.14 Some software packages which were considered state-of-the-art
only one or two years ago are now considered obsolete because
more current software packages take advantage of quicker,
more accurate processing techniques. These techniques required,
and still require, large amounts of memory and storage space;
the difference is that more computers now have the necessary
memory and storage space.
17.15 Moreover, in one year, with potential changes in the
operating systems and in the integration of the software into
a multi-task environment, different versions and/or different
packages will replace the older ones. The user is confronted
with an enormous task of trying to maintain the quality of
data processing, while adjusting to the rapid changes in its
appearance and application.
17.16 The use of procedure-oriented software has already
reduced the burden of programmers in the development of modules
for generic types of statistical processes, like averages,
variances, regressions, analysis of variances, scatter plots,
etc. In the future it will be possible to "talk" to the computer
and to accomplish many of these simple tasks. However, it
will still require proper syntax and format to convey the
commands to the computer.
17.17 Given this wide range of hardware and software, and
rapid progressive changes in this area, it is not realistic
to expect that one type of hardware and one type of software
will serve for many years. As hardware and software are upgraded
it becomes necessary to ensure that data can be moved from
one software package to another (that is, that data files are portable). Word processing
files written in one package may not be readable in a different
package unless the file is converted to the proper format.
Data entered on a spreadsheet may not be readable by a statistics
analysis package; data entered into one kind of database may
not be readable by a spreadsheet or by another type of database
software. Thus, it is usually preferable to use standard software
which is maintained by the manufacturer, for which documentation
is easy to find and for which experts with wide experience are available. Portability
of data files is important, not only within the statistical
office but also to be able to provide data in a computer readable
form to external users (see Chapter 18).
17.18 The improvements in laser printers, including the possibility
of printing colour graphics, have made it possible to use
standard micro-computer software for many publications. In
the near future, one can expect the cost of these printers
to be reduced significantly and the resolution to be much
better. One major advantage is that many special characters
can be incorporated into reports and graphs. Of particular
value is the capability to print reports in most languages.
Of course, this capability is directly related to the use
of these same characters in software commands and text and
in corresponding screen images.
17.19 Given all these considerations, it is not appropriate
or possible for FAO to make any specific software recommendations.
However, it can be said that the use of modern technology
can expedite the processing and dissemination of agricultural
census data.
Purpose of checking, editing and coding data
17.20 The effect of checking and/or editing questionnaires
is (i) to achieve consistency within the data and consistency
within the tabulations (within and between tables) and (ii)
to detect and verify, correct or eliminate outliers, since
extreme values are major contributors to errors in summaries
(major errors in data, when sample expansion factors are applied,
contribute to unrealistic values).
17.21 Editing involves revising or correcting the entries
in the questionnaires. The need for revising recorded data
occurs in cases of illegible entries by enumerators. It should
be kept in mind that most of the "errors" detected during
data entry occur because of illegible handwriting.
17.22 An important function of checking is to verify that
completed questionnaires properly identify the holding as
an agricultural holding meeting minimum requirements such
as size of holding or livestock, or value of sales, as defined
for a specific census.
17.23 Numerical coding replaces "words" with numbers as
a means of condensing information to be stored. Thus, the
words "spring wheat" are replaced by a number (usually 3 digits
in most countries), reducing the number of characters used
from 12 (including blanks) to 3, which reduces the possibility
of mistyping (misspelling) the crop when information is keyed
into the computer.
17.24 Data checking, editing and coding is considered to
be the most difficult phase of data processing. Most first-time
census planners/statisticians can prepare a reasonably good
table, but have great difficulty with the organization of
data management. It is recommended, therefore, that this phase
be planned early so that computer programs and related procedures
can be prepared and tested to ensure that the overall approach
is realistic and functional.
Data processing activities
17.25 The main activities in data processing are as follows:
- Monitoring and controlling of questionnaires.
- Checking (manual editing) and coding.
- Data entry and verification.
- Computer editing and coding.
- Storage and security.
- Tabulation.
- Calculation of sampling error and additional data analysis.
17.26 These activities are closely interrelated and must be
coordinated within a well-planned timetable. Sufficient documentation
must be prepared to enable everyone to understand the specific
steps to be undertaken. Cooperation between the computer processing
unit and the statistical unit is important to reduce the possibilities
of misunderstanding and to clarify any issues which arise.
17.27 Countries where provincial offices are involved in
the processing will have some of the activities listed above
completed in the provincial offices. Provincial offices need
to establish a control system to ensure receipt of questionnaires
from every enumeration area. Generally, if there is a provincial
office system, the provincial offices will also carry out
the functions ensuring the completion of the enumeration process
and the questionnaires. Provincial offices will reduce the
processing workload of the central office. However, with a
provincial office system, the central office will need to
be prepared to track and verify the processing completed in
the provincial offices and will have to provide technical
assistance (instructions, software, hardware, training, etc.).
Monitoring and control of questionnaires
17.28 The agricultural census is a large operation usually
involving thousands of questionnaires even if a small-scale
sample enumeration is organized; in the case of complete enumeration
in larger countries there may be millions of questionnaires.
Obviously, special control measures are required to ensure
that all questionnaires are received. Adequate physical storage
space should be made available in time to avoid damage or
misplacement of questionnaires. When the completed questionnaires
are returned by the enumerators, they should be transferred
through supervisors at different administrative levels to
the designated processing centre (central or provincial offices).
To simplify control measures, questionnaires should be grouped
by geographical areas and identified by appropriate forms
relevant to the filing system adopted.
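A control register of the kind described above can be sketched in a few lines of Python. The enumeration area identifiers, the counts and the function name are hypothetical; the sketch assumes that the expected number of questionnaires per enumeration area is known from the field organization.

```python
def control_report(expected, received):
    """Compare expected and received questionnaire counts by enumeration area.
    Both arguments map an area identifier to a number of questionnaires."""
    # Areas from which nothing has arrived yet.
    missing_areas = sorted(set(expected) - set(received))
    # Areas that arrived but are not in the register (possible miscoding).
    unexpected_areas = sorted(set(received) - set(expected))
    # Areas that arrived with fewer questionnaires than expected.
    shortfalls = {
        area: expected[area] - received[area]
        for area in expected
        if area in received and received[area] < expected[area]
    }
    return missing_areas, unexpected_areas, shortfalls

# Hypothetical register for three enumeration areas.
expected_counts = {"EA-001": 40, "EA-002": 35, "EA-003": 50}
received_counts = {"EA-001": 40, "EA-003": 47, "EA-009": 12}

missing, unexpected, short = control_report(expected_counts, received_counts)
```

Running such a comparison each time a batch is logged gives supervisors an immediate list of areas to follow up.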
17.29 During the processing, questionnaires are removed from
storage many times for manual editing, data entry and verification,
checking of figures when computer editing detects potential
errors, etc. Strict control during this phase is essential
but difficult. It is important, therefore, to establish very
rigid control over the flow of questionnaires and to make
periodic reviews in order to detect misplaced questionnaires.
Good organization in filing the questionnaires will greatly
facilitate control. For example, in sample enumeration, an
important practical advantage of having a fixed number of
respondents selected in the secondary sampling units is
that the size of folders is generally the same.
Checking (manual editing) and coding
17.30 The main task of checking or manual editing in data
processing is to detect omissions, inconsistencies and other
obvious errors in the returns and to correct them before subsequent
processing stages or at least to reduce the level of errors
to acceptable limits. Faulty questionnaires can be sent back
to the field, or corrected in the office on the basis of instructions
given to the editing personnel (e.g., using averages from
the province or data from neighbouring holdings).
17.31 Manual editing should begin as soon as possible after
data collection and as close to the source of the data as
possible, such as in provincial, district or lower level offices.
This procedure facilitates any necessary re-enumeration and
has the advantage of utilizing personnel familiar with local
conditions.
17.32 The errors, which might be discovered through internal
and external consistency checks, may be response errors; they
may also result from recording the replies in the wrong place
on the questionnaire or from faulty and illegible handwriting.
17.33 During manual editing it is beneficial to conduct a
random review of the checking and coding operations because
many "editors" develop a pattern for correcting errors and
for interpreting difficult-to-read hand-written responses.
Although some kind of bias may be introduced by these "editors",
it is also important that the corrections be done consistently.
17.34 Coding refers to the operation where original information
from the questionnaire, as recorded by enumerators, is replaced
by a numerical code required for processing. Typical examples
are when names of crops, livestock, farm machinery, activities,
etc., are replaced by a unique number (code) or when data
expressed in local units are converted to a standard unit.
The modern trend is either to enter the complete answer or
to use fully precoded questionnaires and to leave the problem
of local units to enumerators who are expected to enter in
the questionnaires data ready for processing.
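As an illustration of the two coding operations mentioned above, the following Python sketch replaces crop names by numerical codes and converts areas to a standard unit. The crop codes and the table of units are hypothetical examples; each census defines its own code lists and conversion factors.

```python
# Hypothetical 3-digit crop codes; a real census uses its own code list.
CROP_CODES = {"spring wheat": 101, "winter wheat": 102, "maize": 110}

# Conversion factors to hectares; local units would be added per country.
TO_HECTARES = {"ha": 1.0, "acre": 0.4047}

def code_crop(name):
    """Return the numerical code for a crop name as written on the
    questionnaire, or None if the name is not recognized and the
    questionnaire needs manual review."""
    return CROP_CODES.get(name.strip().lower())

def to_hectares(area, unit):
    """Convert an area reported in a known unit to hectares."""
    return area * TO_HECTARES[unit]

coded = code_crop("  Spring Wheat ")      # tolerant of case and blanks
unrecognized = code_crop("sprng wheat")   # misspelling is not guessed at
area_ha = to_hectares(10, "acre")
```

Returning None for an unrecognized name, rather than guessing, keeps doubtful entries visible to the editing staff.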
17.35 Furthermore, since manual editing is followed by computer
editing, the two phases should be coordinated. Since the computer
has the capability to implement instructions quickly, completely,
consistently and accurately, some of the functions of editing,
such as imputation of missing entries, if implemented, should
be entrusted to computers rather than to manual editing.
17.36 As already mentioned, manual editing should be organized
in the field as part of the supervisors' responsibility. Instructions
for editing should be included in the supervisors' manual.
It is important to prepare a detailed manual for central office
editing, not only to give instructions to editing clerks who
can be trained verbally, but for the benefit of other staff
to describe exactly how the procedures are applied. For example,
staff responsible for computer editing should know the exact
rules for manual editing to avoid possible contradictions
and the introduction of personal bias.
Data entry and verification
17.37 Data entry, which refers to transfer of data from questionnaires
to the computer-readable media, is one of the greatest time
and resource consuming phases of data processing.
17.38 This operation is normally done by data entry clerks
who key in data from questionnaires to disks or tapes using
keyboards of data entry stations (terminals, micro-computers
or similar units). Work can be organized in different ways;
all data for the questionnaire can be entered at the same
time or the data can be entered section by section using data
entry clerks specialized in the specific sections. The present
trend is to enter data for the whole questionnaire at one
time, using software which simulates parts of the questionnaire
on the micro-computer monitor.
17.39 The speed of data entry in ideal cases is considered
to be 8000 keystrokes/hour but it may be much less if the
questionnaire is not designed for rapid data entry. In particular,
interactive editing may slow the speed of data entry. This
matter will be discussed later in the section on computer
editing and coding.
17.40 It is recommended that data entry be 100 percent verified
for agricultural censuses based on a small sample of holdings.
Verification should be done by a data entry clerk who alternates
between data entry and verification of other operators' work,
but who did not do the original data entry. Experience shows
that when a second data entry clerk is just a verifier who
reviews/corrects the work done by the first data entry clerk,
the verifier tends to agree with what has already been completed.
This second method of verification should be avoided whenever
possible.
17.41 In the case of larger censuses, complete verification
should be done at the beginning of data entry, not only to
identify errors but also to identify clerks with low performance.
Subsequent verification on a sample basis may be sufficient
to monitor the performance. One-hundred percent verification
may be reintroduced for clerks failing to maintain an adequate
standard of work. Verification could be reduced as performance
improves, but a sample verification at some level should continue
for all data entry clerks.
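One common way of implementing verification is independent re-keying, in which a second clerk enters the same questionnaire without seeing the first entry and the two versions are compared by program. The text above does not prescribe this mechanism, so the following Python sketch should be read as one possible arrangement; the field names are hypothetical.

```python
def compare_keyings(first, second):
    """Return the fields on which two independent keyings of the same
    questionnaire differ, as {field: (first_value, second_value)}."""
    fields = sorted(set(first) | set(second))
    return {
        f: (first.get(f), second.get(f))
        for f in fields
        if first.get(f) != second.get(f)
    }

def mismatch_rate(batch_first, batch_second):
    """Share of keyed fields that differ between two keyings of a batch.
    Each batch maps a questionnaire id to a dict of keyed fields."""
    total = mismatched = 0
    for qid, first in batch_first.items():
        second = batch_second.get(qid, {})
        total += len(set(first) | set(second))
        mismatched += len(compare_keyings(first, second))
    return mismatched / total if total else 0.0

# Hypothetical questionnaire keyed twice; the second clerk transposed digits.
keying_a = {"holder_age": 45, "wheat_area": 12}
keying_b = {"holder_age": 45, "wheat_area": 21}

differences = compare_keyings(keying_a, keying_b)
rate = mismatch_rate({"Q1": keying_a}, {"Q1": keying_b})
```

A mismatch rate computed per clerk can serve as the performance measure that decides whether sample verification is sufficient or complete verification should be reintroduced.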
Data entry alternatives
17.42 Data entry through a keyboard is a time-consuming operation
subject to human error. One alternative is to use automatic
reading devices which are capable of scanning human-recorded
documents and reading them into the computer directly without
keying. Basically, there are two kinds of optical readers:
(i) optical character readers capable of reading numbers or
letters written by hand on a strictly predetermined position
on the questionnaire and (ii) optical mark readers which can
recognize marks made by a special pencil on numbers or letters
preprinted on very special questionnaires. These methods have
been tried in many countries with relative success, but a
general conclusion seems to be that they are not suitable
for agricultural census and survey applications which involve
large digit numbers. The conclusions from cost/benefit analyses
advise against using these machines in developing countries.
Nevertheless, the infrequent cases reporting successful
experiences with optical mark readers include the 1983/84
agricultural census of Bangladesh and the 1990 agricultural
census of Japan.
17.43 Another approach used to speed up data entry concerns
the field use of hand-held computers instead of (or in addition
to) questionnaires, with data entry completed directly by
the enumerators, who then send the data file using a telephone
modem to a computer in the central office. This method has
been used for a number of applications but still does not
appear to be suitable for agricultural statistics on a cost/benefit
basis, even in developed countries. Technological advances
in the near future may reverse this evaluation.
Computer editing and coding
17.44 Computer editing is checking the general credibility
of the data by computer with respect to (i) missing data,
(ii) range tests, and (iii) logical and/or numerical consistency.
Examples could be: (i) non-response (e.g., age of the holder
not reported); (ii) improbable or impossible entries (e.g.,
yield is a hundred times higher than normal, age of the holder
is less than 15 years); (iii) internal inconsistencies (e.g.,
wheat production reported but area not reported, pigs under
6 months plus pigs 6 months and over not equal to total pigs).
In many cases these errors occur because of the failure to
define the terms completely, or because the enumerators have
not had sufficient training to detect incomplete information.
And, of course, it is possible that the errors were created
during the data entry phase.
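The three kinds of test can be sketched in Python using the examples given above. The field names and the record layout (one dictionary per holding, with None for an item not reported) are assumptions made for this illustration.

```python
def edit_checks(rec, normal_yield=None):
    """Return a list of (kind, message) pairs for one keyed record."""
    errors = []

    # (i) missing data and (ii) range test on the age of the holder
    age = rec.get("holder_age")
    if age is None:
        errors.append(("missing", "age of the holder not reported"))
    elif age < 15:
        errors.append(("range", "age of the holder is less than 15 years"))

    # (iii) consistency between production and area, then (ii) range test on yield
    production, area = rec.get("wheat_production"), rec.get("wheat_area")
    if production and not area:
        errors.append(("consistency", "wheat production reported but area not reported"))
    elif production and area and normal_yield and production / area >= 100 * normal_yield:
        errors.append(("range", "wheat yield is a hundred times higher than normal"))

    # (iii) numerical consistency of the pig totals
    young, old, total = (rec.get("pigs_under_6m"), rec.get("pigs_6m_over"),
                         rec.get("pigs_total"))
    if None not in (young, old, total) and young + old != total:
        errors.append(("consistency", "pig age groups do not add up to total pigs"))

    return errors

# Hypothetical records: one with several errors, one clean.
faulty = {"holder_age": 12, "wheat_production": 30, "wheat_area": 0,
          "pigs_under_6m": 4, "pigs_6m_over": 3, "pigs_total": 8}
clean = {"holder_age": 52, "wheat_production": 30, "wheat_area": 10,
         "pigs_under_6m": 4, "pigs_6m_over": 3, "pigs_total": 7}

faulty_kinds = [kind for kind, _ in edit_checks(faulty, normal_yield=3)]
clean_errors = edit_checks(clean, normal_yield=3)
```

Because every error is returned with its kind, the same routine can also feed the counts of errors by type that are needed for monitoring data quality.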
17.45 Computer coding (or precoding) refers to assigning
special codes to important classes of data, such as size class
codes (codes 1,2,3,...) for consecutive zones, in order to
avoid repetition of required calculations by the computer.
This is a classic approach that is normally no longer
used with fast modern computers (or PCs).
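For illustration, a size class code can be derived from the reported area with a simple lookup, whether it is stored once on the record or computed each time a table is produced. The class limits below are hypothetical and the function name is chosen for this sketch only.

```python
from bisect import bisect_right

# Hypothetical lower limits (in hectares) of size classes 2, 3, 4 and 5;
# class 1 covers everything below the first limit.
SIZE_CLASS_LIMITS = [1.0, 2.0, 5.0, 10.0]

def size_class(area_ha):
    """Return the size class code (1, 2, 3, ...) for a holding area."""
    return bisect_right(SIZE_CLASS_LIMITS, area_ha) + 1

classes = [size_class(a) for a in (0.5, 1.0, 7.0, 10.0)]
```

Each limit belongs to the higher class, so a holding of exactly 1.0 hectare falls in class 2 under these assumed limits.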
17.46 Computer editing is in fact a continuation of manual
editing with several differences. Firstly, computer editing
is aimed at discovering not only errors in questionnaires,
but also errors committed at the data entry stage. Secondly,
computer editing has an important advantage in that the computer
is not subject to human errors, that is, instructions given
will be implemented repeatedly as written so that consistency
will exist. Thirdly, the process of computer editing is much
faster than manual editing and, although detected errors may
require manual intervention, it can immediately and accurately
insert imputations and reduce the workload.
17.47 It should also be noted that manual editing may have
some advantages. For example, manual editing can identify
questionnaires which should have been returned for completion
and initiate follow-up action. Similarly, with a quick review
of the questionnaires, supervisory enumerators can detect
poor enumeration and inconsistent responses and take corrective
action.
17.48 It is important, therefore, to coordinate manual and
computer editing so that advantages of each are fully utilized
and, above all, that instructions given for the two kinds
of editing are not contradictory. For example, irrigated land
which is not cropped may exist in countries where pastures
are sometimes irrigated. Statisticians preparing instructions
for computer editing should, therefore, know exactly what
kind of manual editing is to be done and how.
17.49 Computer editing can be done in two ways: (i) interactively
at the data entry stage, or (ii) using batch processing after
data entry, or some combination of both. Modern equipment
and software facilitate checking during data entry and immediately
provide error messages on the monitor and/or may reject the
data unless they are corrected. This process is very useful
in the case of simple mistakes such as keying errors, but
may greatly slow down the data entry process in the case of
errors which require consultation with supervisors. Interactive
editing at the data entry stage is aimed mainly at discovering
errors made in data entry, while more difficult cases, such
as non-response, are left for a separate computer editing
operation.
Imputation
17.50 Most detected errors cannot be corrected without re-interviewing
the holder. When returning questionnaires to the field is
not possible, a remedy (imputation) is available which consists
of correcting inconsistent data or providing missing entries
on the basis of knowledge available in the office. These inserted
values may be averages for groups of holdings with similar
characteristics, or may be logical conclusions based on other
information available (e.g., missing age of the holder may
be estimated from information on the age of his/her children).
Missing data for a typical agricultural holding can be copied
from another similar holding, without major effects on final
results.
17.51 Whichever method is used, data correction or imputation,
it is a delicate procedure that is difficult to implement. Imputation
(see Frame 17.4 for examples of two methods) can be done
manually or automatically by computer. Generally, manual
corrections are recommended for smaller surveys and sample
enumeration, particularly in developing countries. Manual
imputations generally involve consulting the questionnaire
for some additional information which may be useful, or
simply modifying a keying error that has been discovered,
often due to illegible handwriting. One of the problems
with manual imputations is in the repetition of the editing
process. Typically, many computer runs may be required
before all errors are eliminated (e.g., out of, say, 800
errors discovered in a province during the first edit
check, only 600 may be successfully corrected and 50 new
ones made so that during the second edit check, 250 errors
are detected, and so on). Furthermore, in order to avoid
repeated edit checks of "good" records, they should be
either stored separately, or flagged; some organizational
problems may be created in either case.
"Cold deck" and "hot deck" are the
names of two procedures of imputation for missing
or wrong values.
The "Cold deck" imputation consists of having
pre-selected data for typical agricultural holdings,
for each administrative area, and copying these
data to replace non-response.
The "Hot deck" procedure consists of using data
from a recently processed holding with similar
characteristics instead of using data from pre-selected
holdings.
"Cold deck" is obviously easier to implement
but requires that the choice of replacement values
should be perfect in order not to bias the data
and artificially minimize the variability (particularly
in case of intensive use). "Hot deck" avoids this
risk but is more difficult to define and requires
more powerful computers.
In any case, counts should be kept of the number
of accesses to each procedure, by areas, regions,
etc., and these numbers should remain within reasonable
limits.
|
Frame 17.4 Methods of Imputation |
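The two procedures described in Frame 17.4, together with the recommended counts of their use, might be sketched as below. This is an illustration under stated assumptions rather than a reference implementation: the choice to try the hot deck first and fall back on the cold deck, the definition of "similar" as "same size class", and all field names are hypothetical.

```python
from collections import Counter


def impute_decks(records, field, class_key, area_key, cold_deck=None):
    """Fill a missing `field` by hot deck, falling back on cold deck.

    Hot deck: reuse the value of the most recently processed holding in the
    same class. Cold deck: use a pre-selected value for the administrative
    area. Returns (filled_records, usage), where usage counts imputations
    by (method, area).
    """
    cold_deck = cold_deck or {}
    last_seen = {}      # class -> most recent reported value (the "hot deck")
    usage = Counter()
    filled = []
    for rec in records:
        out = dict(rec, imputed_by=None)
        cls, area = rec[class_key], rec[area_key]
        if rec[field] is not None:
            last_seen[cls] = rec[field]       # refresh the donor for this class
        elif cls in last_seen:
            out[field] = last_seen[cls]
            out["imputed_by"] = "hot"
            usage["hot", area] += 1
        elif area in cold_deck:
            out[field] = cold_deck[area]
            out["imputed_by"] = "cold"
            usage["cold", area] += 1
        # otherwise the value stays missing and is left for manual review
        filled.append(out)
    return filled, usage


# Hypothetical data: records are processed in the order given.
records = [
    {"id": 1, "area": "North", "size_class": "small", "goats": None},
    {"id": 2, "area": "North", "size_class": "small", "goats": 12},
    {"id": 3, "area": "North", "size_class": "small", "goats": None},
    {"id": 4, "area": "South", "size_class": "large", "goats": None},
]
filled, usage = impute_decks(
    records, "goats", "size_class", "area", cold_deck={"North": 8}
)
```

The `usage` counter gives the number of imputations by method and area, which can then be compared with whatever limit the census office considers reasonable.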
17.52 Some of the above problems can be avoided by using
automatic computer correction. However, this operation is
very delicate and may change the values of the original data
considerably. Some surveys have been ruined because programming
errors have spoiled the data.
17.53 The philosophy of computer editing and imputation may
consider the following aspects: (i) the immediate goal in
an agricultural census is to collect data of good quality.
If only a few errors are discovered, any method of correcting
them may be considered satisfactory; (ii) it is important
to keep a record of the number of errors discovered and the
corrective action (by kind of correction); (iii) non-response
can always be tabulated as such in a separate column. The
data user, however, is generally less qualified than the statistician
to guess what non-response means (say age of holder not reported)
and prefers not to see this category; and (iv) redundancy
of information collected in the questionnaire is very useful
for detecting response errors and assessing the quality of data in general.
For example, if data are collected on the total number of pigs
and also on pigs classified by age (under six months, and six months and over),
the total is redundant information, and the entries may prove
inconsistent (e.g., a total of 5 against age-group counts of 2 and 2).
Such data are difficult to correct unless the holding is
visited again or a data entry error is discovered. Too much
redundancy may, however, slow data processing considerably.
A reasonable amount of redundancy, particularly for important
data, is therefore considered useful, but when
there is too much redundant data some may have to be ignored.
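A consistency check that exploits this kind of redundancy can be sketched as follows. The field names are hypothetical, and the sketch only detects the inconsistency; as noted above, deciding which entry is wrong usually requires going back to the questionnaire or the holding.

```python
def find_inconsistent_totals(records, total_field, part_fields):
    """Return (id, reported total, sum of parts) for records that do not add up."""
    problems = []
    for rec in records:
        parts = [rec[f] for f in part_fields]
        if rec[total_field] is None or any(p is None for p in parts):
            continue  # missing entries are a non-response problem, handled elsewhere
        if rec[total_field] != sum(parts):
            problems.append((rec["id"], rec[total_field], sum(parts)))
    return problems


# Hypothetical records: holding 102 reports 5 pigs in total but 2 + 2 by age.
records = [
    {"id": 101, "pigs_total": 5, "pigs_under_6m": 2, "pigs_6m_plus": 3},
    {"id": 102, "pigs_total": 5, "pigs_under_6m": 2, "pigs_6m_plus": 2},
]
problems = find_inconsistent_totals(
    records, "pigs_total", ["pigs_under_6m", "pigs_6m_plus"]
)
print(problems)  # [(102, 5, 4)]
```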
Storage and security
17.54 Two levels of storage and security are required: first,
at the questionnaire level, and second, at the level of data
stored on computers. When data are collected, most countries
emphasize the confidentiality of the responses. Thus, it is
necessary to prevent unauthorized access to the data at each
level. When data are entered into a computer or edited, the
individuals involved should be aware of this requirement and
of the penalties associated with disclosure. Passwords should
be used to limit access to files and, at a certain stage,
it may be worthwhile to encrypt the data so that unauthorized
access does not permit direct reading of data. Failure
to protect confidentiality may damage the credibility of the
organization and lead to less accurate responses or even
refusals in the future.
17.55 It is impossible to know when data are going to be
destroyed unintentionally; natural disasters, fires, power
failures and programming errors can all contribute to the loss
of important data files. For this reason, backup copies of the
data should always be made, and new backup copies are needed
whenever the data change as processing continues.
These copies can be both "on-line" and "off-line",
but proper precautions must be taken to prevent the destruction
of all copies at the same time, which can happen when they are stored on
the same micro-computer and/or in the same room or building.
For example, one copy of the data could be stored in a fire-proof
safe or a copy of sub-national data could be maintained in
each of the sub-national offices.
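As a minimal sketch of such a backup routine, the following copies a data file to several locations and verifies each copy against a checksum of the original. The file and directory names are hypothetical, and the use of a SHA-256 checksum is an assumption of this example rather than a recommendation of the manual; in practice the destinations would be on separate media or in separate buildings.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_to(source, destinations):
    """Copy `source` into each destination directory and verify every copy."""
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for dest_dir in destinations:
        dest_dir = Path(dest_dir)
        dest_dir.mkdir(parents=True, exist_ok=True)
        target = dest_dir / source.name
        shutil.copy2(source, target)
        if sha256_of(target) != expected:
            raise IOError(f"backup copy {target} does not match the source")
        copies.append(target)
    return copies


if __name__ == "__main__":
    # Demonstration in a temporary directory; real copies belong on
    # physically separate storage.
    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "province_01.dat"
        data_file.write_text("holding_id,area_ha\n1,12.5\n")
        made = backup_to(data_file, [Path(tmp) / "safe", Path(tmp) / "regional"])
        print(len(made), "verified copies")
```

A new set of copies would be made after each processing stage that changes the data.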
Tabulation
17.56 The tabulation plan was described in Chapter
9. Tabulation is the main component of data processing, and
the tables produced are the most visible outcome of the
whole census operation and its most used output. Nevertheless,
all preparations (dummy tables, computer programmes, etc.) must
be completed and tested, and the data editing and corrections
properly done, before tabulation can begin.
The main problems with final tables stem from mistakes committed
in earlier phases of the census operation which may
not become visible until tabulation. The need to correct
data and processing programmes and to retabulate at this
stage can delay the final output considerably.
Calculation of sampling errors and other analysis
17.57 As pointed out in Chapter 9,
Tabulation plan, the data collected by sample enumeration
cannot be properly used and evaluated unless an indication
of the sampling error is associated with values obtained.
These calculations are usually simple to obtain with "statistical
packages". However, sampling errors should be calculated only by
staff with a good knowledge of sampling, since the result is
meaningful only if the calculation reflects the sample design actually used.
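As an illustration of what such a calculation involves in the simplest case, the sketch below estimates a total and its standard error under simple random sampling without replacement. That design is an assumption made for this example only; the stratified or multi-stage designs common in agricultural censuses require different formulas, and the figures are invented.

```python
import math
from statistics import mean, variance


def estimate_total_srs(sample_values, population_size):
    """Estimate a population total and its standard error.

    Assumes simple random sampling without replacement:
        total = N * ybar
        SE    = N * sqrt((1 - n/N) * s^2 / n)
    where s^2 is the sample variance (divisor n - 1).
    Returns (total, standard_error, cv_percent).
    """
    n = len(sample_values)
    if n < 2:
        raise ValueError("at least two sampled holdings are needed")
    if n > population_size:
        raise ValueError("sample cannot be larger than the population")
    ybar = mean(sample_values)
    s2 = variance(sample_values)
    fpc = 1 - n / population_size            # finite population correction
    total = population_size * ybar
    se = population_size * math.sqrt(fpc * s2 / n)
    cv = 100 * se / total if total else float("nan")
    return total, se, cv


# Hypothetical example: 4 sampled holdings out of 100, number of cattle each.
total, se, cv = estimate_total_srs([2, 4, 6, 8], population_size=100)
print(f"estimated total {total:.0f}, standard error {se:.1f}, CV {cv:.1f}%")
```

Publishing the standard error or coefficient of variation alongside each estimate lets users judge how much weight the figure can bear.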
17.58 The data can be prepared not only in table format,
but also as a data file which can be distributed to the users
(see also Chapter 18). In this case
it is important to ensure the confidentiality of data; even
if questionnaire-level data are needed and can be legally
supplied, names and addresses of holders should not be provided
to users.
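Removing identifying items before a data file is released can be sketched in a few lines. The field names are hypothetical, and the sketch covers only the names and addresses mentioned above.

```python
def strip_identifiers(records, identifying_fields=("holder_name", "holder_address")):
    """Return copies of the records without the identifying fields."""
    return [
        {key: value for key, value in rec.items() if key not in identifying_fields}
        for rec in records
    ]


# Hypothetical questionnaire-level record.
records = [
    {"id": 17, "holder_name": "A. Farmer", "holder_address": "Village X",
     "area_ha": 3.2, "cattle": 5},
]
public_file = strip_identifiers(records)
print(public_file)  # [{'id': 17, 'area_ha': 3.2, 'cattle': 5}]
```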
Testing computer programmes
17.59 Considerable time is required to write computer programmes
for error identification, automatic error correction (if applied),
tabulation, calculation of sampling errors, etc., using available
software. The computer programmes prepared should be tested,
possibly with data from pretest surveys. Questionnaires used
in the main data collection operation are likely to differ
from questionnaires used for pretesting; in such a case, data
on questionnaires referring to holdings enumerated in the
pretest must be transferred to the census questionnaires.
It may also be necessary to enter estimates on the census
questionnaires for items not included in the pretest, as well
as erroneous data designed to test the full range of error
detection specified for the computer programmes. Computer
printouts should list identified errors and corrections. Corrections
should also be reviewed to determine whether all errors have
been detected. If they have not been detected, additional
specifications are required to correct the remaining errors
or inconsistencies.
17.60 Computer programmes should be tested, normally by verifying
results of both error detection and tabulations for a group
of 100-500 questionnaires. Data used for such tests should
be tabulated manually to check each item or its classification
in the tabulations. Manual tabulation of 100-500 questionnaires
is a time-consuming operation and requires qualified staff.
When such staff are not available, the number of questionnaires
used for testing may be reduced. In any case, it is best to
conduct an initial test using questionnaires with artificial
data in an attempt to cover all items in as few questionnaires
as possible. If the data are well prepared, only 20-50 questionnaires
may be sufficient.
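The testing approach described above, artificial questionnaires with deliberately planted errors and a manual tabulation for comparison, might be organized along the following lines. The edit rules, error codes, age limits and field names are all hypothetical and stand in for whatever the census edit specifications actually prescribe.

```python
from collections import Counter


def detect_errors(q):
    """Hypothetical edit rules: return the error codes for one questionnaire."""
    errors = []
    if q["holder_age"] is None:
        errors.append("AGE_MISSING")
    elif not 10 <= q["holder_age"] <= 110:   # assumed plausible range
        errors.append("AGE_RANGE")
    if q["pigs_total"] != q["pigs_young"] + q["pigs_old"]:
        errors.append("PIG_SUM")
    return errors


def tabulate_by_area(questionnaires, field):
    """Sum `field` over questionnaires, by area."""
    table = Counter()
    for q in questionnaires:
        table[q["area"]] += q[field]
    return dict(table)


# Artificial test deck: each questionnaire is paired with the errors that
# were planted in it, so detection can be verified case by case.
TEST_DECK = [
    ({"area": "A", "holder_age": 45, "pigs_total": 5,
      "pigs_young": 2, "pigs_old": 3}, []),
    ({"area": "A", "holder_age": None, "pigs_total": 4,
      "pigs_young": 1, "pigs_old": 3}, ["AGE_MISSING"]),
    ({"area": "B", "holder_age": 150, "pigs_total": 5,
      "pigs_young": 2, "pigs_old": 2}, ["AGE_RANGE", "PIG_SUM"]),
]


def run_detection_test(deck):
    """Return the positions of questionnaires whose detected errors differ
    from the planted ones."""
    return [i for i, (q, expected) in enumerate(deck)
            if detect_errors(q) != expected]


failures = run_detection_test(TEST_DECK)

# Manual tabulation of the same deck, prepared independently by hand.
manual_table = {"A": 9, "B": 5}
program_table = tabulate_by_area([q for q, _ in TEST_DECK], "pigs_total")
tables_agree = program_table == manual_table
```

An empty `failures` list and agreement between the manual and programmed tables indicate that the programmes behave as specified for the cases covered; items not represented in the test deck remain untested.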
Suggested reading
FAO (1965). Sampling methods and censuses (by S.S. Zarkovich).
FAO (1987). Micro-computer-based data processing: 1990 World
Census of Agriculture.
UN (1982). Survey data processing: A review of issues and
procedures. NHSCP technical study.