6 Nutrient Data Banks from the Point of View of the Computer Programmer

K.C. Day

6.1 Introduction

A nutrient data bank or food composition table is of necessity a long document, being a listing of foods, their names, components, and the component values. As national food tables continue to increase in size and complexity, such data lend themselves to computer handling. This discussion of nutrient data banks from the point of view of the computer programmer is based on experience with two sets of food composition tables: the nutrient data bank prepared by the Department of Health and Social Security which was based on the 3rd edition of the tables of McCance and Widdowson [1] and the 4th edition revised by Paul and Southgate [2].

6.2 Department of Health and Social Security Table

This table was by later standards a very simple table, with entries for about 660 food items each with 17 nutrients, the values of which were expressed in units per ounce of edible matter. The file was constructed in a similar way to the McCance and Widdowson tables. There were 34 food groups with gaps between the groups so that any additional food could be added to the appropriate group. Originally there were 16 values for each food item: energy expressed in kilo-calories, total and animal protein, fat, carbohydrate, calcium, iron, retinol equivalents, thiamin, riboflavin, nicotinic acid, vitamin C, vitamin D, pyridoxine, added sugars and water. To these, energy expressed in kilo-joules was added, but many items of interest to nutritionists today, such as electrolytes and fibre, were missing. The addition of fibre values to the table proved not to be simple. As there were only 660 codes on file, a single code often represented more than one food item, and although these items had the same composition with regard to the listed nutrients, they very often had different fibre compositions. Thus, as it was not possible to add fibre values to each record on the file, a separate fibre file was created, each record consisting of the food code, a suffix code, and the fibre values. Each food item in the main table was allocated a suffix. If there was no fibre component to the food, this suffix was zero. During calculation, if the suffix was zero, no action was taken, but if the suffix was greater than zero, the program searched the fibre file to extract the fibre values for the food in question. This technique was successful, but it did increase operating time.
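The suffix mechanism can be illustrated with a minimal sketch. It assumes that the fibre file can be represented as a mapping keyed on the pair (food code, suffix) and that the suffix accompanies the food code in the intake record; the names FIBRE_FILE and fibre_for, and all codes and values, are invented for illustration and are not taken from the original programs.

```python
# Hypothetical stand-in for the separate fibre file.
# Key: (food code, suffix); value: fibre value in the units of the main table.
# All codes and values are invented placeholders.
FIBRE_FILE = {
    (101, 1): 0.8,
    (101, 2): 2.4,
}


def fibre_for(code, suffix):
    """Return the fibre value for one food item.

    A suffix of zero means the food has no fibre component, so the
    fibre file is not searched at all.
    """
    if suffix == 0:
        return 0.0
    # Suffix greater than zero: search the fibre file for this item.
    return FIBRE_FILE[(code, suffix)]
```

Two food items sharing code 101 thus obtain different fibre values through their suffixes, while items with suffix zero cost no extra search.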

6.3 4th Edition of the McCance and Widdowson Tables

In 1978, the 4th edition of the tables of McCance and Widdowson was published and for the first time was made available on paper tape to facilitate computer handling. Basically, these tables are in three parts: (1) the main table listing up to 39 values for about 960 foods; (2) the amino acid composition table, and (3) the fatty acid composition table.

There were several problems with reading the tapes and storing the data. In the published book it was quite logical that only those constituents should be listed which were likely to be found in a food. Thus, a value for, say, lactose in non-milk foods would not be found. However, from the computing point of view, this presented a problem. The structure of the record for each food item varied according to the food group to which it belonged. On examination, it was found that for every food item 30 values could be found, but there were nine other values in the file, which varied in number and component from one food group to another. If this structure had been maintained, program time would have been increased by an unacceptable amount, because each food code would have been interrogated as to food group, and then the food group checked to see whether any of these values were required by the program. It was decided that every food would have a common format with a value for every component likely to be found in the table.

The 4th edition of the McCance and Widdowson tables is divided into 13 food groups with no free codes between the groups; thus, additions must be made at the end. As the highest code on the table is 969, it was decided that new food codes would begin at 1001. There were two options open: the addition of new codes to the end of the table without regard to food group, or the allocation of blocks of codes, one block for each food group. The second option would then necessitate deciding how big a block to allocate to each group. It was decided to add new foods to the end of the table without regard to the parent food group. This has not made selection of foods by group difficult, because the group is part of each record, and selection is not slower because the foods are recorded randomly.
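A minimal sketch of the common record format and of the allocation of new codes follows. The component names are placeholders (the real table has 30 common and nine group-specific values), and storing zero for a component not listed for a food group is an assumption, since the text does not state which value was used.

```python
# Placeholder component lists; the names are hypothetical.
COMMON = ["energy_kcal", "protein", "fat"]
GROUP_SPECIFIC = ["lactose", "alcohol"]
ALL_COMPONENTS = COMMON + GROUP_SPECIFIC

FIRST_NEW_CODE = 1001  # from the text: new food codes begin at 1001


def to_common_format(code, group, values):
    """Pad a group-dependent record so that every component has a value."""
    record = {"code": code, "group": group}
    for name in ALL_COMPONENTS:
        # Assumption: a component not listed for this food is stored as zero.
        record[name] = values.get(name, 0)
    return record


def next_new_code(existing_codes):
    """New foods go at the end of the table, regardless of food group."""
    return max(FIRST_NEW_CODE, max(existing_codes, default=0) + 1)


def select_group(records, group):
    """The group is part of each record, so selection is a simple filter."""
    return [r for r in records if r["group"] == group]
```

Because every record carries the same fields, a program never needs to ask which food group a code belongs to before reading a value.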

Some of the values in the book are recorded as ‘trace’ or ‘no data available’, and on the computer, these are recorded as -1 to distinguish them from zero, but for calculation, they are treated as zero. Some other values are marked as ‘estimated’ values, and these are taken as listed. Such values could have been indicated in the file, but this would have doubled its size. It was decided to store the nutrient values as integers, i.e., whole numbers, because whole numbers only occupy one computer word, whereas real numbers, those with decimal fractions, occupy two words. Each value is multiplied by a factor to eliminate decimal fractions, and the new values are recorded on the file. The maximum value for integers on the computer is 32767, and in a few cases, vitamin D in cod liver oil for instance, the value exceeds this limit. Thus, before storing such a number, it is first divided by 10 and then made negative. This means that a check has to be made for negative values of -3000 and less, and in such cases, the reverse transformation made to restore them to their correct values. There is the possibility of losing up to 9 parts in 33,000, but this is a very small amount.
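The storage scheme described in this paragraph can be sketched as a pair of functions. The limits 32767, -1 and -3000 come from the text; the function names, the use of None for a missing value, and the scaling factors in the usage example are assumptions made for illustration.

```python
MAX_INT = 32767         # largest integer held in one computer word
NO_VALUE = -1           # 'trace' or 'no data available'
OVERFLOW_LIMIT = -3000  # stored values at or below this were scaled down


def encode(value, factor):
    """Convert a nutrient value to a one-word integer.

    `value` is None for 'trace' or 'no data available'; `factor` is the
    multiplier that removes the decimal fraction for this nutrient.
    """
    if value is None:
        return NO_VALUE
    scaled = round(value * factor)
    if scaled > MAX_INT:
        # Too large for one word: divide by 10, then make negative.
        return -(scaled // 10)
    return scaled


def decode(stored, factor):
    """Reverse the transformation for use in calculations."""
    if stored <= OVERFLOW_LIMIT:
        return (-stored * 10) / factor  # last digit was lost on encoding
    if stored == NO_VALUE:
        return 0.0                      # treated as zero in calculation
    return stored / factor
```

For example, a scaled value of 45678 is stored as -4567 and restored as 45670, a loss of 8 parts in about 45,000.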

The fatty acids are recorded as grams per 100 g fatty acids. It was hoped to store the values as grams per 100 g food, but it was found that the maximum value for some of the fatty acids was too great to be accommodated by the method used for the basic nutrient table. The percentage values were multiplied by 100 and the fatty acids stored as integers, still as a fraction of the total fatty acids.

The fatty acid table presented two problems. Firstly, many foods were not represented directly, but by recipes with up to seven food constituents. For each such food, the fatty acid composition for each recipe component was calculated, and all added together to give the composition for the whole food. Secondly, there was a variation in the fatty acids listed for each food group. Again, the list was rationalized, every food having a record in the fatty acid file, each record having a value for each of the 40 possible fatty acids or fatty acid mixtures. Cholesterol values were also added to this file. The amino acid table is set out in a way similar to the fatty acid table in that many of the foods are cross-referenced to up to seven constituent foods. There were fewer problems with this table because there is no variation in the list of amino acids.
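One way to combine the constituents of a recipe is sketched below. The text does not give the weighting that was actually used; this sketch assumes that each constituent is weighted by the grams of fat it contributes to the recipe, which is the weighting needed when profiles are expressed per 100 g of total fatty acids. The function name and the data layout are hypothetical.

```python
def recipe_fatty_acids(constituents):
    """Combine constituent fatty acid profiles into one profile.

    constituents: list of (fat_grams, profile) pairs, at most seven in
    the tables. fat_grams is the fat contributed by the constituent
    (an assumed weighting); profile maps a fatty acid name to grams
    per 100 g total fatty acids.
    """
    total_fat = sum(fat for fat, _ in constituents)
    if total_fat == 0:
        return {}  # no fat in the recipe, so no profile to report
    combined = {}
    for fat, profile in constituents:
        for acid, per_100g in profile.items():
            combined[acid] = combined.get(acid, 0.0) + fat * per_100g
    # Re-express the sums per 100 g of the recipe's total fatty acids.
    return {acid: amount / total_fat for acid, amount in combined.items()}
```

A fatty acid missing from one constituent simply contributes nothing for that constituent, which is why a rationalized list with a value for every fatty acid simplifies the calculation.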

6.4 Calculation of Nutrient Intakes

Part of the research at the Dunn Nutrition Unit involves the calculation of nutrient intakes of volunteer subjects. This is done by a suite of programs which does more than just calculate these intakes. It is also used for checking incoming data, listing calculated results or the tables themselves, and for preparing results in such a way that they may be transmitted by telephone for storage or printing at a remote station.

The suite consists of about ten segments together with a short mainline program, which serves only as an entry point to the suite and makes the food tables available. Control is then passed to a ‘routing’ segment which presents a list of options. The required segment is loaded into the computer memory and control passed to it. On completion of its task, this segment returns control to the routing segment.
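The control flow of the suite can be pictured with a small dispatch sketch. The segment names and their return values are invented; in the original suite each segment was a separately loaded program rather than a function.

```python
def check_segment():
    return "checked"


def list_segment():
    return "listed"


# Hypothetical option list offered by the routing segment.
SEGMENTS = {"check": check_segment, "list": list_segment}


def route(choices):
    """Run each chosen segment in turn; control returns here after each."""
    results = []
    for choice in choices:
        segment = SEGMENTS.get(choice)
        if segment is None:
            results.append("unknown option: " + choice)
            continue
        results.append(segment())
    return results
```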

The checking segment checks that only valid food codes have been used and that the date and day of the week agree. A check is also made for missing records. At the Dunn Nutrition Unit the form for recording dietary intake data has been standardized. A subject records on the form the food eaten and the time. The appropriate food code and weight are added in the laboratory in spaces provided on the form. The first 21 spaces on the form hold the numbers of the survey and subject, the date, day of the week, whether the weights are in grams or ounces, and six spare characters which may be used to record other information (for example, sex, age, or religion). The remainder of the record contains the time of consumption, food code, and weight eaten for up to five foods. The times are recorded to the nearest 10 min. Weights may be in grams or ounces, but not both on the same record: ounces are converted to grams by the program. Each page on the form has space for up to four records, and the program can accommodate up to 20 records for 1 person for 1 day. These records are then all combined into a file which is submitted to the program for processing. The first 21 characters are only entered into the computer file for the first record of the day for any subject. This means that such a file cannot be sorted or merged with existing files. To overcome this, there is a ‘filling’ segment which restores the missing data to each record.
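The ‘filling’ step and the date check can be sketched as follows. The sketch assumes that records are fixed-width strings, that the 21 leading characters are left blank on all but the first record of a day, and that days of the week are numbered from 1 for Monday; none of these details is stated in the text.

```python
import datetime

HEADER_WIDTH = 21  # survey, subject, date, day of week, units, spare


def fill_headers(records):
    """Restore the 21-character header to every record of a subject-day."""
    filled = []
    current = None
    for rec in records:
        header = rec[:HEADER_WIDTH]
        if header.strip():
            current = header  # first record of a new subject-day
        elif current is None:
            raise ValueError("file starts with a record that has no header")
        filled.append(current + rec[HEADER_WIDTH:])
    return filled


def day_matches(year, month, day, day_of_week):
    """Check that the recorded day of the week agrees with the date.

    Assumption: 1 = Monday ... 7 = Sunday.
    """
    return datetime.date(year, month, day).isoweekday() == day_of_week
```

Once every record carries its own header, the file can be sorted or merged with existing files.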

When calculating intakes, the program can deal with up to 19 nutrients which may be selected as required. Kilo-calories, kilo-joules, protein, fat, and carbohydrate are always supplied. Intakes for up to six time periods during a day may be assessed. These periods may be of any length and do not have to be of equal length, form a continuous series or start at midnight. However, time periods may not overlap and may not cross the midnight boundary. Total intakes for the day are calculated independently of the intakes during time periods, so any food consumed outside a specified period is still accounted for in the overall daily intake. Results may be calculated from up to 100 selected foods, from all foods except those selected, or from all foods from a specific food group or collection of food groups. Results may be given in absolute amounts or expressed as ratios of kilo-calories or kilo-joules. The program can take into account any time discontinuity in the data file. Should a time discontinuity greater than a given maximum number of days be found during the listing of the results, the program will print the mean values and standard deviations for the nutrients before listing the results for the next time period. At the end of a run, the overall means and standard deviations, together with the ranges of the nutrients, are listed and, if required, the output can be limited to this overall summary. The food table can be listed in alphabetical or numerical order, with the values in absolute amounts or as ratios of the energies.
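The separation of daily totals from period totals can be sketched as follows. The sketch assumes a table expressed per 100 g, times held as minutes since midnight, and periods given as non-overlapping (start, end) pairs; the function names and the conversion factor of 28.35 g per ounce are illustrative, as the factor used by the program is not given.

```python
GRAMS_PER_OUNCE = 28.35  # approximate; assumed, not from the text


def to_grams(weight, unit):
    """Ounces are converted to grams before any calculation."""
    return weight * GRAMS_PER_OUNCE if unit == "oz" else weight


def intakes(items, table, periods):
    """Return (daily totals, list of per-period totals).

    items:   (minute of day, food code, grams eaten)
    table:   food code -> {nutrient: amount per 100 g}
    periods: up to six (start, end) pairs in minutes, end exclusive
    """
    daily = {}
    by_period = [{} for _ in periods]
    for minute, code, grams in items:
        for nutrient, per_100g in table[code].items():
            amount = per_100g * grams / 100.0
            # The daily total is independent of the time periods.
            daily[nutrient] = daily.get(nutrient, 0.0) + amount
            for i, (start, end) in enumerate(periods):
                if start <= minute < end:
                    totals = by_period[i]
                    totals[nutrient] = totals.get(nutrient, 0.0) + amount
    return daily, by_period
```

A food eaten at 23:00 with only a morning period specified therefore appears in the daily total but in no period total.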

In a study with a large amount of raw data (with about 30,000 records), the nutrient intakes can be calculated in about 2 h. Once calculated and listed, the results are lost to the computer. To save processing time should a second run be required, there is a segment to convert the results, which are in machine code, to a form which can be transmitted by telephone for storage on magnetic tape on any other computer. The results can be retrieved, converted back to machine code, and listed as required. The running time required for calculation in a large survey could be reduced if a substantial part of the food tables were read into the computer memory. Then, as each food is encountered in the data file, the memory version of the table would be searched, not the version on the disc file. The continual reading of the disc lengthens the running time of the program. Ideally, the whole table should be stored in the computer memory. With large computers this is feasible, but a large memory is required. With the 1,300 foods of the basic McCance and Widdowson table [2] plus foods added over 4 years, each record consisting of 41 words, computer storage of over 53,000 words is needed for the table alone. This is double the total memory size of the computer in the Dunn Nutrition Unit.

Fortunately, not all of the table need be held in core. From various studies carried out it has been found that of 822 foods occurring a total of 99,850 times, 9 foods accounted for 50% of the food items consumed, 50 foods for 70%, 100 foods for 80%, and 90% of all food items eaten could be accounted for by 200 foods. If these 200 most popular foods were stored in the computer memory, about 8,200 words (200 records of 41 words each) would be required, a much more reasonable storage requirement for a minicomputer. Extra programming would be necessary to determine whether data on a particular food item were stored in the memory or on disc, but any increase in program size would be more than compensated for by the reduction in program running time. At the Dunn Nutrition Laboratory, the extra coding amounted to about ten lines in one subroutine, a new subroutine of about 12 lines, and one extra line per segment. This reduced the running time on a file of 400 records by about 50 s and halved the running time to 20 min for one data file of 7,800 records.
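The idea of holding only the most popular foods in memory can be sketched as follows. A dictionary stands in for the disc file, and the class and function names are hypothetical; the original was a few lines added to existing subroutines, not a separate class.

```python
from collections import Counter

WORDS_PER_RECORD = 41  # from the text


def words_needed(n_foods):
    """Storage in computer words for n_foods records held in memory."""
    return n_foods * WORDS_PER_RECORD


def most_popular(codes_seen, n):
    """Return the n food codes occurring most often in past surveys."""
    return [code for code, _ in Counter(codes_seen).most_common(n)]


class CachedFoodTable:
    """Search the in-memory records first, then fall back to the disc."""

    def __init__(self, disc_table, popular_codes):
        self._disc = disc_table  # stand-in for the disc file
        self._core = {code: disc_table[code] for code in popular_codes}
        self.disc_reads = 0

    def lookup(self, code):
        record = self._core.get(code)
        if record is None:
            self.disc_reads += 1  # only unpopular foods cost a disc read
            record = self._disc[code]
        return record
```

With the frequencies quoted above, about nine lookups in ten would be answered from memory when the 200 most popular foods are cached.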

6.5 Problems Encountered when Using Microcomputers

Dietary analysis on a microcomputer presents a few problems. Their core sizes tend to be smaller, although memory is becoming cheaper, and microcomputers with larger and larger cores are becoming available. However, these machines tend to be slower in operation mainly because of the mass storage devices and the languages used. The most common mass storage device used on microcomputers is the floppy disc which tends to have a smaller capacity than a hard disc, and the rate of data transfer to and from the disc is slower. The combination of these two factors tends to increase data processing time. Further, there is no agreed standard for recording data, making transfer of program and data files very difficult, except between machines of the same manufacturer.

The language most commonly found on microcomputers is Basic, whereas larger machines use languages such as Fortran. A program written in Fortran is entered into the computer where it is ‘compiled’, i.e., it is converted into machine code and loaded into the memory, with any subroutines, and is ready to run. Any cross-referencing and interpreting of the coding has been done before data are submitted to the program. Basic, however, is not compiled; each line of coding is interpreted as it is encountered during the running of the program. Thus, in an iterative loop, each line of the program is interpreted, verified, and acted upon as it is met on each pass through the loop. Some microcomputers do have Fortran, but even so their operation, if it involves retrieving data from a floppy disc, will still be slow.

6.6 Considerations in the Design of Food Composition Tables and Nutrient Data Banks

From the point of view of a computer programmer, nutrient data banks should be as ‘general’ as possible. That is, every food item should have a direct reference in subsidiary tables, e.g., the amino acid and fatty acid tables, and every food item should have a value for every possible nutrient or component listed in the table. This will inevitably make the file larger, but this increase in size will be more than compensated for by the reduction in time, effort and duplication spent in programming, and the reduction in the size of software required to make use of the tables. Again from the point of view of the computer programmer, ease of data bank maintenance, that is, the correcting and updating of data held on the file, should be considered. While correction of data is really in the province of the individual user, the methods used will be influenced by the format of the data base. More of a problem is updating. There are two forms of update: the addition of new foods and the inclusion of additional data for constituents. The addition of new foods needs to be centralized in some way. As already stated, the Dunn Nutrition Laboratory has added about 300 new foods to the McCance and Widdowson tables [2], and, although these have been made available to other units, it has not been the intention that these become ‘official’ additions to the table. Other additions have also been made to the table, and it is in the interest of all users that these food items should be brought together. This should be the province of the producers of the original table, i.e., central government. The introduction of new data for existing food items should necessitate only some minor alteration to programs.

The object of this workshop is to move towards greater understanding among users of nutrient data banks. As each national data bank increases in size and becomes more complex and more comprehensive, such understanding and cooperation become more and more difficult without the use of computers.

Many European countries now have large immigrant populations. Each ethnic group brings its own culinary skills and traditions which must be taken into account in nutrient data banks. For example, in Britain in the past few years, there has been a large increase in the number of Chinese restaurants, most offering a take-away service which has made their food very popular among non-Chinese people. This increase in popularity has taken place since the last edition of the McCance and Widdowson tables [2] and thus is not reflected in the tables. There is the movement of workers between countries, especially between member countries of the European Community, and there is the large movement in holiday seasons. If such mobile populations maintain their eating habits to the best of their ability, then their native foods must eventually influence, to some degree, the eating habits of the host country. Therefore, such imported foods will have to be included in new editions of food tables for the host nation. The inclusion of ‘foreign’ foods in a national data bank could be accomplished by computer, and this could be made easier if cross-linkage tables could be set up to reference foods between national data banks. There would be at least two advantages in this. Firstly, foods not found in one data bank could be found in another, so that permanent additions to food composition tables would not be necessary. Secondly, the task of recompiling data banks to include such imported foods would be simpler, and this in turn would lead to a reduction of work duplication.
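A cross-linkage table of the kind suggested here might be sketched as follows. No such table is described in detail in the text, so the structure is purely hypothetical: bank names, codes and values are placeholders, and a link maps a code that is absent from the home bank to the bank and code where the food is actually held.

```python
# Placeholder national data banks: food code -> record.
BANK_A = {101: {"name": "food X", "energy_kcal": 100}}
BANK_B = {5007: {"name": "food Y", "energy_kcal": 250}}
BANKS = {"A": BANK_A, "B": BANK_B}

# Hypothetical cross-linkage table:
# (home bank, home-side code) -> (foreign bank, foreign code)
CROSS_LINKS = {("A", 9001): ("B", 5007)}


def find_food(bank, code):
    """Look in the home bank first, then follow a cross-link if one exists."""
    table = BANKS[bank]
    if code in table:
        return table[code]
    link = CROSS_LINKS.get((bank, code))
    if link is None:
        raise KeyError("no entry or cross-link for code %d in bank %s" % (code, bank))
    foreign_bank, foreign_code = link
    return BANKS[foreign_bank][foreign_code]
```

Under this arrangement the foreign food is never copied into the home bank, so no permanent addition to the home table is required.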

References

1 McCance, R.A.; Widdowson, E.M.: The chemical composition of foods; 3rd ed. Special Rep. Ser. MRC No. 297 (Her Majesty's Stationery Office, London 1960).

2 Paul, A.A.; Southgate, D.A.T.: McCance and Widdowson's the composition of foods; 4th ed. (Her Majesty's Stationery Office, London 1978).