Section IV
Information Needs and Computer Systems (continued)

Data Identification Considerations in International Interchange of Food Composition Data

John C. Klensin

INFOODS Secretariat, United Nations University, PO Box 500, Charles St Sta, Boston, MA 02114-0500, USA

Correct use of food composition tables and databases outside the country of origin requires identification of the values in those tables. In addition to the problems of adequate nomenclature, identification, and classification of foods, problems also exist in adequate description of laboratory samples, identification of the food components being reported, and identification of the accuracy, precision and representativeness of the data values themselves. This paper reviews the procedures for identifying food components developed by INFOODS in collaboration with IUNS and their increasing use around the world. The paper then discusses the issues associated with data value identification and, in particular, methods of reporting accuracy and precision that provide maximum information to sophisticated users and compilers of food composition tables.

When data are exchanged among countries, or even among researchers within a country, the recipients must have adequate identification, or at least description, of those data to make intelligent use of them. Some of that identification is provided implicitly, by the conventions of the field. For example, scientists doing cryogenic studies always use degrees Kelvin to report temperatures. The use of Fahrenheit, or even Celsius, degrees would be odd indeed, so the scale is almost never explicitly reported. A peculiarity of food composition data is that there are so many different aspects of the data that must be identified, and yet few established international conventions that would permit this implicitly.

Adequate identification of the data values depends on the purpose for which they will be used, but typically requires describing:

the food involved in terms of what it is called, since we typically want to match foods-that-are-analyzed with the foods-that-people-eat or report eating
the food involved in terms of its biological or recipe origins, since we often need to know how one food is related to another to compare values
how the food was sampled, stored, packaged, prepared, etc., since these factors can greatly affect magnitudes of nutrient values and the degree to which the values reported actually represent the quantities present in the food as eaten
how the food was handled after selection but prior to analysis, since this, too, can greatly affect the resulting values
the nutrient or other food component being reported, since a value given without indication of what it represents is useless
the analysis method used and how the “nutrient” was defined, since different methods and definitions, and even different conversion factors, where required (energy, protein, vitamins A and E, and so on), can produce different values that cannot be compared directly
the statistical (distributional) properties of the value reported, since it is useful to know both how similar the values are from different analyses and samples (precision or variance) and, to the degree possible, how closely the value is related to the nutrient levels that would be encountered in the food as found in nature (representativeness).

In addition to being issues of description, many of these items bear on data quality both the quality of the data values themselves and, in the presence or absence of appropriate description, the overall quality of the tables or databases in which they are embedded.

Many other papers, including some at this conference, have focused on the first of the above elements and particularly on the issues associated with attempting to describe or classify foods accurately. Accurate and standardized description of sampling methods has been discussed a great deal and identified as important (1, 2), but there are no known specific proposals for how to do this that are applicable to food composition data work as actually practiced. The last three elements in the list above the identification of the data values themselves are the topic of this paper.

• Identification of Food Components and Analysis Methods

Many of the nutrient values reported in food composition tables actually are the result of (sometimes local) standards for conversion factors, conventions about the relationship of one value to another, or differing assumptions about the relationship of measurable properties to bioavailability, rather than things that can be uniquely and unambiguously determined in the laboratory. For example, while energy measurement by putting people into calorimeters is well-understood, it is rarely done today. Instead, conversion factors are applied to other nutrients, but those conversion factors differ over time and from one country to another. So having a value in a food composition table labeled “energy” is rarely sufficient to permit comparing that value to others. Similar issues arise for a variety of other commonly-reported nutrients. For others, definitions have changed over time and sometimes remain controversial: a value that is simply identified as “fiber” may be nearly useless. And for still others, differences in methods of analysis produce differences in results, i.e., not exactly the same things are being analyzed.

Table I. Some recent additions to INFOODS Food Component Identification Tags

<BRD>	Bromide
<CYAN>	Cyanide
<F10D1>	Fatty acid 10:1
<F18D1N7>	Fatty acid 18:1 ω-7
<F18D1N7>	Fatty acid 18:1 ω-9
<F22D1>	Fatty acid 22:1
<F23D1>	Fatty acid 23:1
<F18D2>	Fatty acid 18:2
<F18D3>	Fatty acid 18:3
<F22D3>	Fatty acid 22:3
<F22D5N6>	Fatty acid 22:5 ω-6
<F24D6>	Fatty acid 24:6
<FIBADC>	Fiber, acid detergent method, Clancy modification
<FIBTSW>	Fiber, total dietary; Wenlock modification
<SB>	Antimony
<TOCPHT>	Total tocopherol
<F10D1F>	Fatty acid 10:1; expressed per quantity of total fatty acids
<F18D1N7F>	Fatty acid 18:1 ω-7; expressed per quantity of total fatty acids
<F18D1N9F>	Fatty acid 18:1 ω-9; expressed per quantity of total fatty acids
<F23D1F>	Fatty acid 23:1; expressed per quantity of total fatty acids
<F18D2F>	Fatty acid 18:2; expressed per quantity of total fatty acids
<F18D3F>	Fatty acid 18:3; expressed per quantity of total fatty acids
<F22D3F>	Fatty acid 22:3; expressed per quantity of total fatty acids
<F22D5N6F>	Fatty acid 22:5 ω-6; expressed per quantity of total fatty acids
<F24D6>	Fatty acid 24:6; expressed per quantity of total fatty acids
<NPR>	Nitrogen-protein ratio

These issues were examined from the standpoint of food component identification a few years ago. That work resulted in publication of a listing of food components and value-affecting methods for analysis that could be found in the various food composition tables and databases of the world (3). That list also contained abbreviated names for each component-method pair. These names can be used in electronic data interchange and abbreviated table headings and, using the terminology of the International Standard (4) on which the associated data interchange system (5) is based, are called “tags” or “tagnames”. The list is now being incrementally updated, using an electronic mail distribution list as the primary mechanism for suggesting and reviewing new proposals. To subscribe to that list, send Internet mail to [email protected]. Several new definitions, especially of fatty acids, have been added recently (see Table I). As additional nutrients of interest are identified and incorporated into tables, the list is likely to be extended further.

Figure 1. Small-sample normal distributions with 5% confidence intervals

It is important to note that these food component identification “tags” are not normative and are not associated with any concept of good or desirable practice. There are only two requirements for something being listed: (i) a national or regional food composition table compiler, somewhere, thought that the value was important enough to include in his or her table, and (ii) there is an adequate definition available. The second requirement was waived in the original publication for commonly-occurring under-identified values (e.g., “energy” with no further description, is tagged as <ener->), but future registrations are expected to be adequately defined.

At the other extreme from “unknown method”, some tags have provision, through sub-elements and keywords, for substantially more information than today's food tables and databases provide. This added detail is intended to provide a target for improvement so that no one assumes that the tags represent as much information as might be desired. It should also encourage the recording of more detailed information, as it becomes available and is appropriate in the view of table compilers, in databases.

• Data Value Description

Just as the choice of analytic methods can have a significant impact on the particular value that is produced for a nutrient, decisions about the statistic to use to represent the result of multiple analyses, estimates, or methods of imputation, may make a considerable difference in the value placed in the table. Means cannot be readily compared to medians and, especially with small sample sizes, different estimates of variability are even more difficult to compare in a reasonable way.

Figure 2. Small-sample normal distributions: medians and fences

To an even greater degree than with nutrient identification, the nature of numeric data values is typically not reported to a degree specific enough to make them usefully comparable (6). Values reported are typically not identified as to whether they are means, medians, or some other estimate of location, nor is the type of data censoring (e.g., “outlier elimination”) reported and discussed. Standard errors or variances are often reported with sample sizes as small as two or three. Even with normally-distributed data, such small sample sizes tend to yield confidence limits broad enough to make this type of variance reporting almost useless.

The relationships between sample size and confidence limits are illustrated in Figures 1 and 2. The plots show repeated samples, using a good random number generator, from a Gaussian distribution with the traditional mean of zero and standard deviation of one. Figure 1 shows “boxplots” from these successive samples, with the white bar representing the median and the shaded area representing the hinges or “fourths” (approximately quartiles—the middle half of the data). The “whiskers” on the two plots to the right extend out to the “fences” or outlier cutoffs, calculated as 1.5 time the hinge-spread past the hinges. These types of plots, widely introduced after Turkey's “orange book” (7) and explained in detail in Hoaglin et al. (8), usually provide a better overview of small-sample data than more traditional scatter plots or histograms.

It is interesting to observe with this group that the second and third random draws produced values all of which fell below the known population mean of zero. While this is clearly a random event, it illustrates the dangers of making statistical estimates that are designed for the large sample case with only two points. The three-point samples are better, as one would expect, but the medians fall well away from the expected mean (especially in the third case), and the hinge spreads are quite wide relative to the standard deviations one would hope for. Things begin to stabilize at 30 points (the value at the very top is an outlier when the fences are used to set the criteria) and the 100 point sample looks quite reasonable.

Figure 3. Two-hundred point sample from Normal (1,1) distribution

Figure 2 exhibits these same data and plots in more traditional confidence interval terms (shown by crosshatching): while some food composition tables report medians for small samples rather than means, none that INFOODS has discovered report hinge spreads or similar robust measures. The confidence intervals for the sample size of two are artificially small due to the nature of the computation. But those for sample sizes of three illustrate the problem: 5 per cent confidence intervals extending out past two standard deviations of the universe being sampled. It is nearly impossible to make statements about values with these types of confidence intervals: they could be used to “prove” almost anything. Things start to become acceptable at 30 points: the 5 per cent confidence intervals actually fall within the hinge spread.

Some food composition tables try to avoid the difficulties with small samples by reporting the range, i.e. the maximum and minimum values actually obtained. But, since they represent extremes, those values are exceptionally sensitive to sampling and experimental error: it is almost impossible to create a statistical estimate of the reliability of an extreme value.

Worse yet, empirical evidence is accumulating that the distributions of many nutrient values are asymmetric. Rand and Pennington discussed the issues two years ago (9); Pennington provided an update and some additional data in a more recent paper (10). Those efforts attempt to examine variability in foods, but, when quantities of nutrients are being measured that are close to the detection level of the instrumentation, inherent censoring of trace levels also causes asymmetry in the values actually obtained.

Instrumentation censoring occurs when nutrients exist in foods at levels below the detection thresholds of the measurement methods being used. For illustration, one possible situation was simulated by drawing 200 points at random from a Gaussian distribution with mean 1 and variance 1. When those points are sorted into ascending order to make an easy-to-understand plot, they appear as in Figure 3. The corresponding frequency histogram appears as Figure 4. In both cases, it is easy to observe that the distribution is approximately Gaussian with a mean at 1, as one would expect. (If the negative values are bothersome, mentally shift the graphs by adding about 3 to all of the “measurements” in Figure 3 and the “data values” in Figure 4. That shift, of course, has no impact on the analysis.)

Figure 4. Frequency histogram corresponding to 200 point Normal (1,1) sample: data values

Figure 5. Truncated sample from Normal (1,1) distribution: small values removed at Y=0.2

Figure 6. Frequency histogram corresponding to truncated sample: data values

Now suppose that the method involved is incapable of detecting any values smaller than 0.2 (marked as “presumed detection limit” in Figure 3). One would then observe plots that look more like Figures 5 and 6 instead of the “true” plots in Figures 3 and 4. The new histogram is especially interesting, since it shows not only significant asymmetry, but the mean of the values actually detected has shifted from 1.0 to about 1.4. A different assumption, that all the undetectable values were actually at the theoretical minimum (somewhat below -2 if one judges from the sample illustrated in Figure 3), would shift the mean considerably in the other direction.

The combination of distributions that represent asymmetric natural phenomena and instrumentation censoring is worse than additive in terms of the degree to which it tends to force measured distributions into non-normal form.

Since the mean value is very sensitive to extreme values and the shape of the distribution, it may not be very useful when the data are severely asymmetric or when trace-censoring eliminates very small values without a corresponding impact on the higher tail. The median is often considered a cure for fussy data problems, but asymmetry due to tracecensoring can distort it even more than the mean, moving it well away from the subjective “center” of the data.

Combination of means and medians in a single table, or comparison of them, is rarely appropriate, especially where central limit assumptions may not apply. Their use together is usually confusing. Neither of them can easily be compared with the more sophisticated measures of location that are appropriate for distributions that are known to be asymmetric. In particular, it is not possible to compute a “weighted average” of a mean from one report with a median from another, even if the sample sizes are known. It is, in general, not even possible to combine two medians this way since substantially all of the distributional information is discarded when half of the data are eliminated from each side, leaving only a single point. When data are to be re-used and re-evaluated by others, as in interchange situations and reference databases, and only the usual small numbers of data points have been determined by analysis or combining values, it is perhaps better to list the actual values themselves, rather than using marginally appropriate, or inappropriate, statistical summaries.

• Tagging the Data Values

As with nutrient identification, while doing things correctly is important, it may be even more important that whatever is done be identified accurately so that a recipient or evaluator of data can determine if they are suitable for his or her purposes. Just as it provides for identification of food components and methods by the use of “tags” with exact definitions, the INFOODS interchange system provides tags to identify data values and descriptions of variability. As with the nutrient tags, these tags provide more information than appears in any known food composition table today. At the same time, and again like the food component tags, the data tags are not normative: tags are provided for values that are reported in tables even if they are useless from a statistical point of view. The system for data values extends beyond labeling of simple measures of location (e.g., mean, median, trimmed mean, or the “X percent of RDI” values that appear on food labels in some countries) and variability (e.g., standard deviation, standard error, quartiles, range, or percentage points) to permit description of distribution-based statistical filtering procedures applied to the laboratory data and description of particular challenges encountered in analysis that might bias the results. If much of this type of information were provided, it would pose a serious challenge to database management systems, since few of those are designed to handle data with these types of interrelationships. However, the advantages to those trying to do serious evaluation or quality assessment of data values under consideration for use in studies, calculation of imputed food values, or for inclusion into other tables would make it worth the trouble.

• References

(1) Greenfield, H., & Southgate, D.A.T. (1992) Food Composition Data: Production, Management, and Use, Elsevier Applied Science, London

(2) Truswell, A.S., Bateson, D., Madafiglio, D., Pennington, J.A.T., Rand, W.M., & Klensin, J.C. (1991) J. Food Comp. Anal. 4, 18– 38.

(3) Klensin, J.C., Feskanich, D., Lin, V., Truswell, A.S., & Southgate, D.A.T. (1989) Identification of Food Components for INFOODS Data Interchange, UNU Press, Tokyo

(4) Standard Generalized Markup Language (1986) ISO 8879

(5) Klensin, J.C. (1993) INFOODS Food Composition Data Interchange Handbook, UNU Press, Tokyo

(6) Rand, W.M., Pennington, J.A.T., Murphy, S.P., & Klensin, J.C. (1991) Compiling Data for Food Composition Databases, UNU Press, Tokyo

(7) Turkey, J.W. (1977) Exploratory Data Analysis, Addison-Wesley, Reading, MA

(8) Hoaglin, D.C., Mosteller, F., & Turkey, J.W. (1983) Understanding Robust and Exploratory Data Analysis, John Wiley and Sons, New York, NY

(9) Rand, W.M., & Pennington, J.A.T. (1991) Proceedings of the 16th National Nutrient Databank Conference, The CBORD Group, Ithaca, NY, pp. 179–182

(10) Pennington, J.A.T. Albert, R.H., & Rand, W.M. (1993) Proceedings of the 18th National Nutrient DDatabank Conference, ILSI Press, Washington, DC, pp. 155–158

Food Data: Numbers, Words and Images

Barbara Burlingame, Fran Cook, Graham Duxfield, Gregory Milligan

Nutrition Programme, New Zealand Institute for Crop & Food Research, Private Bag 11030, Palmerston North, New Zealand

Food composition databases are generally collections of numeric and descriptive data in various formats with a variety of limitations related to proper documentation. Current technologies make it feasible for databases to go beyond words and numbers now to include images and graphical representations of foods. Presently there are over 130 food images in the New Zealand Food Composition Database, ranging in size from 25 KB to 1.3 MB each, and occupying a total of about 33 MB of disk space. The process at Crop & Food Research involves digitizing photographs of the actual food samples using an optical scanner at 400 dpi resolution. Advanced Revelation 3.0, the development environment system used, does not deal with images yet, but can call DOS-based programs which convert and display digitized images in several different formats such as PCX and GIF. To date, several important uses for food database images have emerged. These include sample validation where common name could relate to several different scientific names; data validation where intensity of the orange color led to accepting β- carotene values outside the expected range; food intake surveys where food descriptors were insufficient due to language or cultural differences or where children were subjects; and for international interchange of food composition data.

Many problems arise as a result of poor, incomplete or ambiguous descriptions of foods listed in databases and as a result of confusion over the interpretation of commonly used names for foods. Many solutions have been recommended to deal with these problems (1,2). These solutions typically rely on words, alphanumeric codes, position-specific facets, etc, and go some way toward alleviating some problems. These systems will never solve all the problems.

A picture is worth a thousand words, as the old adage goes, and technologies have advanced to the stage that all food descriptor files could contain a field or an accompanying file of digitized images or series of images so that barriers of language, culture, and the limitations and subjectivities of our vocabularies are minimized (e.g., how lean is lean meat? how do you describe the depth and intensity of the color of an apricot? and what is a muttonbird?).

• Documentation by Image

In the New Zealand Food Composition work, the process of documentation begins at the sample preparation stage. Food samples are collected and then prepared in the laboratory. Samples are photographed intact, raw and after consumer-type preparation (e.g. processed by cooking). Each sample is photographed as prepared for consumption and with a scale definition (e.g. metric ruler), and lately with a color index (see Figure 1). Food packaging and labels are also routinely photographed (see Figure 2). All this is done in addition to the recording of word descriptors and detailed text containing the standard documentation details (age of sample, date of sampling, geographic region, common and scientific name, physical state, processing, packaging, etc).

The photos are then digitized into PCX formats (IBM PC Paintbrush Picture File) using an optical scanner at 400 dpi (dots per inch) resolution. Much higher resolution is available, but there is a trade-off between resolution and space required to store the image. Presently, there are 130 PCX images in the NZ Food Composition Database, occupying about 33 MB (megabytes) of disk space. The size of the individual files ranges from 25 KB (kilobytes) to 1.3 MB each. Compression would significantly reduce amount of disk space required.

Disk space requirements vary depending on size of the image, number of colors, and the image resolution. Various manipulations can be done to achieve efficient storage. One NZ beverage record represents a composite of three different brands of powdered drink mix. The packaging scanned in 256 colors occupies 630 KB; this same file compressed with PKZIP (compression format by PKWare) occupies 416 KB; and as a GIF (Graphics Interchange Format) file, 93 KB. The same information contained on the packaging, when entered into the database as text, occupies a mere 30 bytes. Table I shows disk space required by other images and plain text.

More and more software products are allowing the inclusion of digitized images. Advanced Revelation 3.0 (ARev), the development environment used for the New Zealand Food Composition Database, does not incorporate graphics procedures. Presently, however, we associate scanned images of food via other software. ARev is programmed to call DOS-based programs — we use SVGA (Super Video Graphics Adaptor) and Color View — which can display digitized images stored under various formats.

Using a number of different software package and shareware, images stored in PCX format can be transferred to media as PCX or other less byte-consuming formats such as GIF. This is important because users will have different hardware and software products available to them. GIF and TIFF (Tag Image File Format) have become industry standards, and JPEG (Joint Photographic Experts Group) with the ISO (International Standards Organization) and CCITT (Consultative Committee on International Telegraph and Telephone) backing (3) is becoming popular for compressing still images for storage. Exchanging of images will be facilitated by having image format flexibility.

Table I. Disk space requirements for food record images in bytes compared with non-graphical text

Storage format	Bread bag	Powdered drink packaging	Powdered drink packaging; composite of 3
.PCX (color)^a	1,925,289	805,234	2,238,904
.PCX (grey)^a, ^b	704,053	277,092	780,405
.ZIP^a, ^c	1,240,687	471,347	1,327,914
.GIF^a, ^d	317,399	110,505	295,897
.JPG^a, ^e	116,613	52,925	116,716
Non-graphical text; database descriptors	250	401	946

^a scanned at 16 million colors; saved as Zsoft PC paintbrush format
^b converted and stored by PhotoMagic as greyscale Zsoft PC paintbrush format
^c compressed and stored by PKZIP
^d converted and stored by PhotoMagic as Graphic Interchange Format
^e converted and stored by PhotoMagic as JPEG format

• The Hardware

The ability to view images is dependent on the hardware available. Images require, as a minimum, a Super VGA (Video Graphics Adapter) monitor which can display 1024 × 768 pixels in at least 256 colors. Some images require a 1 MB video card capable of displaying 32,000 colors from a palette of over 16 million colors. Graphical printers are also now readily available, with as little as 300 dpi resolution and 24 bit color.

Flopticals

Flopticals have been used already in the exchange of images between New Zealand and INFOODS. Floptical drives are inexpensive and can use both floptical disks and normal 3.5" floppy disks. Floptical disks are 21 MB in size, compared to the 1.44 MB size of standard 3.5" disks. This capacity is important because some high resolution images can be 20 MB and would require fifteen standard floppies for a single image. Most of the images for the New Zealand Food Composition Database are between 25 KB and 1.3 MB each.

Other Media

Third party software will allow integration of compact disks and proprietary technologies such as Photo-CD with food composition databases. Many information systems have been developed using CD-ROM technology. Conventional information retrieval techniques including full-text searching and relational databases were integrated for accessing information stored on the CD-ROM for agricultural extension information (4).

• Lossy Compression

Lossy compression is so named because redundant or otherwise unnecessary data are deleted in the compression process. Two compression types that can be used for lossy compression are accepted as current standards for still images: JPEG (Joint Photographic Experts Group) and Fractal compression. JPEG was designed as a digital image compression standard for continuous-tone, gray scale and color still images (3). It is based on a generic mathematical function known as forward DCT (Discrete Cosine Transform), which basically transforms the image into a different form which takes up less space. Its compression is very fast, but the JPEG-compacted image files are larger for the same quality than files compressed by other methods. Fractal compression uses a mathematical transformation called an affine map which identifies all patterns that can be matched even if it means rotating, stretching or squashing the pattern. It is resolution-independent. Lossy compression used by both these compression types involves a trade off between information and compressed size. Both methods intentionally discard parts of the data (5, 6, 7).

• Limitations

There are some limitations and problems with using images in food composition databases. These include hardware and software restrictions related to storage, compression, decompression, image resolution and faithfulness to the original. Additionally, an image cannot be searched in the same way as text files. For example, the bread wrapper images will identify ingredients, one of which may be potassium bromate. However, the graphics files of bread wrappers cannot be searched for potassium bromate the way a descriptor text file or a Langual file can, and therefore will not substitute for documentation by words or alphanumeric codes.

• Uses of Images

Data Validation

Verification of information has become the most valuable use to date of the effort to document by images. Sometimes we have reason to question our own data, and images have on many occasions allowed us to make the decision about accepting or rejecting the results of some nutrient analyses. For example, we obtained some very high values for β-carotene in apricots in our 1989 work. In some later work we obtained values which were significantly lower. We examined details of methods, compared sampling and sample preparation protocols, and finally resolved the problem by comparing images of the actual samples used. The images showed that the earlier samples had a much deeper, darker, orange color than the more recent samples (see Figure 3). This, of course, raises more issues about the introduction and widespread adoption of modified cultivars, which is often done without consideration of the nutritional implications.

Food Intake Surveys

The NZ Food Composition Database has over 100 beef records. For many of these records, the word descriptors are identical up to the facet containing ratios of separable lean and separable fat (e.g., there are five records for beef, rump steak, grilled, having different ratios of separable lean and fat: 80:20, 85:15, 95:5, separable lean only and separable fat only). Once the database is searched for the words grilled rump steak, and four records are presented, a judgment is required which many people cannot make without the benefit of visual examples. It is far easier for most people, nutrition professionals and lay alike, to select a picture of meat which looks like what they would consume, rather than to say with confidence that their grilled rump steak was 95 per cent separable lean and 5 per cent separable fat.

Language differences present a challenge which is dealt with by including an alternative names facet in each food descriptor file. Still, with international interchange and international trade in agricultural products, some descriptors, however comprehensive and however many language translations are provided, will never be enough. For example, the New Zealand kumara, with the alternative name sweet potato, is quite unlike the North American sweet potato; the New Zealand pumpkin is unlike the typical North American pumpkin. The differences seen in the nutrient composition are not so surprising when the physical differences are shown with a picture of the food (see Figure 4).

Communication barriers exist within countries and with the rest of the world. Language, culture, age, are just a few. In a clinical setting, it is often necessary to determine the nutrient intake of patients. In New Zealand there are several Polynesian languages in use, as well as Maori and English. Children are often subjects in nutrient intake surveys. Food images can assist overcome these communication barriers.

Wildlife Feeding Programs

A recent project involves providing nutrient data to an aquarium in New Zealand. This organization will soon bring in penguins for exhibition which have been successfully bred and reared for many generations on fish from the Northern seas. We are assisting them in determining what locally available foods could substitute for their present diet. The task of matching the nutrient composition of our Antarctic finfish with Arctic finfish would be easier if the nutrient data, a plethora of which is available in the USDA Standard Reference 10 (8), were accompanied by images. This would be particularly useful where the sample numbers are only one or a few, where the information presented does not specify different stages of maturity, different seasons of the year and different catch areas. Image comparisons between the two databases would help us assess the physical similarities of the different species, for example, as size of the finfish would be relevant to the penguin's diet.

In the same area of wildlife nutrition, our supply of nutrient data, coupled with images, will assist others attempting to reproduce the dietary aspect of native habitats. The right nutrients in the right sorts of foods will improve the well-being of the wildlife, including enhancing the potential for reproduction (9). We experience problems even within New Zealand, where endangered bird species must be relocated from their native habitats in the South Island, to small off-shore island sanctuaries. Their traditional foods are not all available, so the nutrient content, as well as physical similarities of the native food, are considered when designing the supplementary feeding program.

International Interchange

What is a feijoa? What is a pukeko? What is a karaka berry? Most people outside of New Zealand would have no idea at all what these foods are. Even the alternative names would be useless, as these are (almost) uniquely New Zealand foods (see Figure 5).

INFOODS has considered the issue of images in food composition databases (10), and an image element is included in the interchange model (11). The structure for interchange using the INFOODS' model requires elements that indicate the picture encoding type as well as providing the actual image. A comment element may also be used. The images are subsidiary to the classification element, which is the first immediate subsidiary of the food element. Images associated with a cut of meat record might include a carcass diagram showing the position of the cut and a photograph of the cut itself. These would be included in an interchange file as follows (where cmt means comment):

<image><pcx/> the first image itself in PCX format </pcx/><cmt/>beef carcass diagram with cut sites identified</cmt/></image>

<image><gif/> the second image itself in GIF format </gif/>image of cut</cmt/></image>

• Acknowledgments

Funding for this work has come from the New Zealand Department of Health and Public Health Commission, and the Foundation for Research, Science and Technology. We acknowledge permission to publish the pukeko photograph from R.B. Morris, Department of Conservation, New Zealand.

• References

(1) Pennington, J.A.T., & Butrum, R.R. (1991) Trends Food Sci. Technol. 2, 285–288

(2) Truswell, A.S., Bateson, D.J., Madafiglio, K.C., Pennington, J.A.T., Rand, W.M., & Klensin, J.C. (1991) J. Food Comp. Anal. 4, 1, 18–38

(3) Wallace, G.K. (1991) Commun. ACM 4, 30–45

(4) Watson, D.G., Beck, H.W., & Jones, P.H. (1991) Am. Soc. Agric. Engin. 91, 7017–7024

(5) Carlson, W.E. (1991) Comp. Graph. 25, 67–75

(6) Simon, B. (1993) PC Magazine, June 29, pp. 305–313

(7) Simon, B. (1993) PC Magazine, July, pp. 371–382

(8) US Department of Agriculture (1993) Nutrient Database for Standard Reference, Release 10, USDA, Washington, DC

(9) James, K.A.C., Waghorn, G.C. Powlesland, R.G., & Lloyd, B.D. (1991) Proc. Nutr. Soc. NZ 16, 93– 102

(10) Klensin, J.C. (1991) Trends Food Sci. Technol. 2, 279–282

(11) Klensin, J.C. (1992) INFOODS Food Composition Data Interchange Handbook, UNU Press, Tokyo

• Other Key References on Image Compression

Barnsley, M.F. (1993) Fractal Image Compression, A.K. Peters, Wellesley

Storer, J.A. (1988) Data Compression Methods and Theory, Computer Science Press, Rockville, MD

Netravali, A.N. (1988) Digital Pictures: Representation and Compression, Plenum Publishing Corporation, New York, NY

Russ, J.C. (1992) The Image Processing Handbook, CRC Press, Boca Raton, FL


Figure 1. Gold berries (a new cultivar) photographed with a color index.		Figure 2. Food packaging for foods recorded in the New Zealand Food Composition database.

Figure 3. Apricots with different shades of orange


Figure 4. New Zealand pumkin (left): in shape, color and size is very unlike its North American counterpart (right).


Figure 5. Some (almost) unique New Zealand foods for which descriptors and/or codes would never suffice. From left to right: (top) feijoas, pukeko and (bottom) karaka berries.

Computer Construction of Recipes to Meet Nutritional and Palatability Requirements

Leslie R. Fletcher, Patricia M. Soden

Department of Mathematics and Computer Science, University of Salford, Salford, Lancs M5 4WT, UK

This paper describes a microcomputer package which carries out the inverse process to dietary analysis—that is, given a list of nutrient targets the software modifies a food list so that its nutritional analysis meets those targets. The initial aim of the work was the development of a decision support system to be used by dietitians, nutritionists and other medical personnel when giving dietary advice to patients with chronic diseases such as diabetes and renal failure. This paper contains a detailed example of another application of the same software, namely the formulation of recipes for acceptable versions of traditional dishes which also meet predetermined targets for some key nutrients.

George Stigler's solution (1) of the classical diet problem — ensuring adequate nutrition at minimum cost — is a celebrated example in optimization and is frequently mentioned in textbooks. However, it is of limited practical significance in human dietetics — the “optimal” solution contains only five foods — and, rather more importantly, the method used is rather inflexible. In particular, given the objective of minimum (monetary) cost and a range of foods from which to choose, the nutrient targets uniquely determine the solution. Adding non-nutritional constraints, limiting the quantities of particular foods in the optimal diet, for example, will ensure that the computed diet is more varied (2, 3). Nevertheless, there is still only one solution for each collection of targets and there is no convenient way of taking individual preferences into account.

We have developed, and implemented in microcomputer software, a different model of the diet problem (4, 5, 6). This generates, in a natural way, varied diets which meet the needs and wishes of individuals as well as nutritional targets. In this paper we describe an application of this same model to the modification of a recipe so that the resulting dish is not only palatable but also has a predetermined nutritional composition.

Solving the diet problem is the inverse of the familiar, and (mathematically) much simpler, process of nutritional, or dietary, analysis. The software implementation of our model is an extension to the dietary analysis package Microdiet (7), which is based, in turn, on the authoritative UK food analysis data (8, 9, 10, 11). The new software selects, from all the possible combinations of foods with a predetermined nutritional composition, one which is as close as possible to the wishes of a client or patient. This uses a standard variant of conventional linear programming (12, Chapter 14). Some algebraic details are given by Fletcher et al. (5) and we will report others elsewhere, particularly those relating to our expression of the basic optimization in dimensionless terms. This has proved to be an important technical device, allowing all the targets to be assessed relative to each other when seeking, for example, other ingredients to include in a recipe, and circumvents a possible difficulty mentioned in (2, p. 389). Careful formulation of the algebraic model has also ensured that the solution to the dual problem (12, Chapter 5) provides significant nutritional insight.

• Method

A recipe for lasagne verdi was taken from a domestic cookery book (13) and a nutritional analysis carried out. The recipe and some of the corresponding nutrient totals are shown in the indicated columns of Tables I and II, respectively. As a demonstration of the capabilities of the model it was decided to seek a modified version of the recipe which would produce a dish reasonably similar to lasagne verdi but with the modified nutritional composition shown in the column labeled “Target” in Table II. Although these targets are only illustrative, they are also intended to reflect recent expert advice to UK citizens (14) regarding desirable dietary modifications.

The other columns in Tables I and II show the various stages in the modification of the recipe until, at version F, an acceptable version was obtained. The test of “acceptability” was the willingness of the first author's family to consider eating the resulting dish. Ingredients were exchanged, introduced into, or removed from the recipe at various stages in the modification process. A blank entry, denoted by “-”, indicates that the particular ingredient was not considered at that stage in the optimization. The reference to “olive oil” in Table II indicates that a (lower) limit was placed on this quantity of this ingredient during the final stages; it is convenient to list this with other, nutrient, targets. There was no target on the quantity of fat in the recipe and it appears in Table II for illustration only. Had the fat content of the diet become too high (or too low) a further constraint could have been added to limit this.

Table I. Ingredients and quantities in lasagne recipe

Food name		Ingredient quantity (g) in recipe number
Food name		A (original)	B	C	D	E	F
Meat sauce
	Onions, raw	250	250	250	250	250	250
	Butter, salted Removed after stage C	40	0	0	-	-	-
	Olive oil	30	0	0	0	15	30
	Beef mince, raw Reduced to 100g after stage B	300	115	100	100	100	100
	Lentils, boiled Introduced after stage B	-	-	100	100	100	100
	Haricot beans, boiled Introduced after stage B	-	-	100	100	100	100
	Garlic, raw	5	5	5	5	5	5
	Mushrooms, raw	100	100	100	100	100	100
	Bay leaf, dried	2	2	2	2	2	2
	Tomatoes, canned	400	747	449	416	417	417
	Sugar, white	10	10	10	10	10	10
	Basil, fresh	5	5	5	5	5	5
White sauce
	Flour, plain white	25	25	25	25	25	25
	Butter, salted	25	0	0	25	6	13
	Milk, cows, whole Exchanged after stage C for	300	300	300	-	-	-
	Milk, cows, semi-skimmed Exchanged after stage E for	-	-	-	300	300	-
	Milk, cows, skimmed	-	-	-	-	-	300
Topping
	Cream, double Exchanged after stage C for	40	16	9	-	-	-
	Yoghurt, low fat, natural	-	-	-	40	40	40
	Cheese, cheddar type Exchanged after stage C for	50	50	50	-	-	-
	Cheese, reduced fat, cheddar-type	-	-	-	50	50	50
Lasagne, boiled		225	477	225	225	225	225

• Results

The steps in obtaining the displayed results (Table I, Table II) were as follows. Recipe A refers to the original recipe. When the targets were set a software alert pointed out that the fibre contents of garlic, bay leaf and basil were recorded as “unknown”. Recipe B represents the smallest change to the quantities of the ingredients in recipe A which will meet the nutrient targets set. Although these quantities do not constitute an acceptable recipe, these results and other subsidiary results from the linear programming show that other ingredients are required to complement those already there. The subsidiary result also enabled the pulses introduced thereafter to be selected from amongst the variety of possible new ingredients.

Table II. Nutrient targets and analyses for recipe

Nutrient	Target	Analysis for recipe number:
Nutrient	Target	A (original)	B	C	D	E	F
Fiber (g)	>20	13	20	25	25	25	25
Energy (kcal)	<1500	2475	1502	1342	1383	1377	1381
Sodium (mg)	<1100	1444	816	739	1008	843	907
Potassium (mg)	>3500	3453	3882	3501	3499	3499	3500
Iron (mg)	>16	17	16	17	16	16	16
% energy from fat	<35	66	35	35	35	35	35
Quantity of olive oil (g) (in stages E and F)	>15	-	-	-	-	15	15
Fat (g)	-	182	58	52	54	54	54

Recipe C shows the beneficial effect on the changes to the recipe of the new ingredients. The subsidiary results from this stage show that the limit on the percentage of energy from fat is causing most of the changes made by the program to the ingredient quantities. Exchanges of existing ingredients for the lower-fat alternatives were made. Recipe D shows the results of making these modifications to the starting recipe and recomputing the smallest changes which will enable the nutrient targets to be met. This still resulted in the removal of all the olive oil so a lower bound of 15 g was imposed on this ingredient. Recipe E represents the smallest changes to the modified recipe required to meet the nutrient targets with at least 15 g of olive oil. The consequent reduction in the remaining quantity of butter in the white sauce was judged to be unacceptable so semi-skimmed milk was replaced by skimmed milk. The investigation closed with recipe F which was deemed an acceptable version of the original recipe.

• Conclusions

We have demonstrated the use of a linear programming model of food and nutrition in updating a recipe to reflect current dietary expert opinion. The eventual recipe in the example discussed here had limits placed on the totals of various nutrients, on the percentage energy derived from fat and on the quantity of one of its ingredients. Other targets which the software can accommodate include the P/S ratio, the amino acid profile of the protein and the ratios of the quantities of two ingredients. This last target is available to ensure that, for example, the quantities of flour and milk in a computed recipe were appropriate for a white sauce.

The model also allows targeting of nutrient density rather than nutrient totals though some modification of the software would be required to implement this. However, in seeking to minimize the overall change to ingredient quantities in moving to a nutritionally acceptable recipe, the present software tends to maintain the total weight of the recipe approximately constant, leading to a stable relationship between nutrient totals and nutrient density.

• References

(1) Stigler, G.J. (1945) J. Farm Econ. 27, 303–314.

(2) Henson, S. (1991) J. Agric. Econ. 42, 380–393.

(3) Smith, V. E. (1963) Electronic Computation of Human Diets, Michigan State University Press, East Lansing, MI

(4) Fletcher, L. R., & Soden P. M. (1991) Diab. Nutr. Metab. 4 (S1), 169–174

(5) Fletcher, L. R., Soden P. M., & Zinober, A. S. I. (1994) J. Oper. Res. Soc. 45, 489–496

(6) Soden, P.M., & Fletcher, L.R. (1992) Br. J. Nutr. 68, 565–572

(7) Bassham, S., Fletcher, L. R., & Stanton, R.H.J. (1984) J. Microc. App. 7, 279–289.

(8) Paul, A. A. & Southgate, D. A. T. (1978) McCance and Widdowson's The Composition of Foods, 4th Ed., HMSO, London

(9) Paul, A. A., Southgate, D. A. T., & Russell, J. (1980) First Supplement to McCance and Widdowson's The Composition of Foods, HMSO, London

(10) Tan, S. P., Wenlock, R. W., & Buss, D. H. (1985) Immigrant Foods: Second Supplement to McCance and Widdowson's The Composition of Foods, HMSO, London

(11) Holland, B., Unwin, I. D. & Buss, D. H. (1991) McCance and Widdowson's The Composition of Foods, 5th Ed., Royal Society of Chemistry, Cambridge.

(12) Chvatal, V. (1983) Linear Programming, W. H. Freeman and Company, New York, NY

(13) Allison, S. (1977) The Dairy Book of Home Cookery, Milk Marketing Board of England and Wales, Thames Ditton

(14) Committee on Medical Aspects of Food Policy (1991) Dietary Reference Values for Food Energy and Nutrients for the United Kingdom, HMSO, London

Requirements for Applications Software for Computerized Databases in Research Projects

Dorothy Mackerras

Department of Public Health, University of Sydney, NSW 2006, Australia

Numerous programs have been written to access nutrient databases using words rather than numeric codes. In general, they have been directed towards the needs of clinical dietitians but many of their features, such as graphs of individual dietary intakes, are irrelevant to the needs of nutrition researchers. A recent dietary survey conducted on a Pacific island highlighted some of the data entry needs of researchers. Survey participants described food intakes using standard volumes and measures (fluid oz, oz, g, mL), household units (bowl, can, slice, tablespoon), small, medium, large (glasses, coconuts, pandanus, donuts, papaya) and locally developed measures (mountain table/teaspoon, small and large tuna steaks, cm of reef fish). Neither the teaspoon nor the tablespoon matched the metric or US standards.

The abilities of two programs, A and B, from two different countries to meet researchers' needs are described. Both these programs, or their earlier versions, have been available for a number of years and are widely used in their respective countries. Both programs were used on a Compaq Deskpro 486/33M computer with a math coprocessor and 8 MB RAM including 558 KB of available conventional memory. Program A was used for surveys involving food frequency questionnaires and diet records and Program B in a survey gathering 24-hour recall data (its food frequency capabilities have not been used).

Facilitating data entry is important. A major goal is to reduce the amount of coding required prior to entry. Every step that has to be coded will also require double coding on at least a proportion of forms to examine the error rate. If the software allows household or common measures as food descriptors, the weight of household measures of each food only needs to be “coded” once into the program and individual diets will not require conversion into grams prior to entry. Both programs allow standard measures (cup etc.) to be used. In addition, Program A allows three household measures to be defined per food and the user can choose from 51 different words to describe the serving. Program B allows only one household measure, called either serving, item, slice or piece, to be defined per food. Abbreviations used for data entry should be standard or intuitively obvious. Program A uses SI units (“g” for gram etc.) and abbreviations such as “oz” for ounce. By contrast, Program B uses “a” for gram, “b” for ounce etc., and this increases the likelihood of data entry errors. Both programs allow the household measures to be altered which is important in cross-cultural studies. However, Program B requires the operator to change each of the nutrient values in the database if the gram weight of the measure is changed whereas this is calculated automatically in Program A. Neither program appears to be capable of converting the assigned volume of cups, pints etc. between the metric, imperial and US systems or of allowing new words to describe serving sizes (e.g. mountain tablespoon) which would have been useful in the study.

After entry, data need to be checked and cleaned. It is also useful to have a code for an unknown food so that incomplete records are flagged until the relevant coding decisions are made. As diet records may contain 25 or more different food items per day it is useful if the program allows foods to be inserted anywhere in the diet list so that the printout matches the order of the original form. Program A has this feature but Program B does not. Neither program appears to have a range checking facility; this would be particularly useful for food frequency information when the list of foods can be pre-specified.

It is also useful to be able to enter some other data into the dietary program. Both programs allowed long names for the subject and the field will take numbers instead of letters. Thus items such as subject number, date of interview, interviewer code and household number could all be coded and used as the subject's “name”. Managing the database needs care, especially if the same program is being used to analyze data from several different surveys at the same time. The programs had different approaches to the data organization. Program A saves the diet files within the database and a separate database of food composition information can be made for each study. This means that some care is needed to prevent the file becoming too large to backup. Program B saves each diet file as a separate file. This makes backup easier, but means that alterations to the database (e.g. deletions) may make the files invalid.

Dietary data are often exported into a statistical program. This can be used for detecting errors in the data, such as outliers etc., and for analysis. Data may need to be cleaned and exported several times prior to the final analysis being done. Most surveys involve large numbers of people and many lines of data per respondent and so batch processing is needed to export a large number of diet files into a single file, generally with a rectangular ASCII format. Program A has this function which is clearly described in the manual. It took approximately ten minutes to export about 9000 lines of data from 45 food frequency files each containing about 200 lines of data. Program B does not have an inbuilt export function although the company will write a program on request. It took three hours to export 4160 lines of data from 369 files and required reconfiguring the computer to free all the conventional memory. These functions will take longer on slower computers.

Particular needs of research in developing countries therefore include:

programs which allow flexibility in which system of units is used (metric, imperial, US) and which also allow for a mixture of systems and units and for words which describe non-standard units
programs which allow new (local) serving descriptors to be specified
programs which can find foods with names of only one and two letters long
foods may need to have multiple spellings or entries in countries where spelling is not yet standardized.

Attention to some of these details would allow local people greater participation in all phases of the research, and improve the speed and quality of information processing and data output.

Section IV Information Needs and Computer Systems (continued)

Data Identification Considerations in International Interchange of Food Composition Data

Food Data: Numbers, Words and Images

Computer Construction of Recipes to Meet Nutritional and Palatability Requirements

Requirements for Applications Software for Computerized Databases in Research Projects

Section IV
Information Needs and Computer Systems (continued)