doi:10.1016/j.foodqual.2005.09.015
Food Quality and Preference 18 (2007) 190–195
www.elsevier.com/locate/foodqual
Sensory difference tests: Overdispersion and warm-up
Ofelia Angulo a, Hye-Seong Lee b, Michael O'Mahony b,*
a Instituto Tecnologico de Veracruz, M.A. de Quevedo 2779, Veracruz, Ver, Mexico
b Department of Food Science and Technology, University of California, Davis, CA, United States
Received 14 February 2005; received in revised form 5 July 2005; accepted 28 September 2005
Available online 8 November 2005
Abstract
For sensory difference tests, one way, but not the only way, of dealing with the problem of overdispersion is to use a beta-binomial
analysis. Commonly, binomial statistical analyses are used for these methods and they assume that the sensitivity of the judges is uni-
form. However, judge sensitivity varies and this adds a problematical extra variance to the distribution. This is termed overdispersion
and renders simple binomial analysis prone to Type I error. The distribution of sensitivity of the judges is described by a beta-distribu-
tion. The analysis, combining beta and binomial distributions, gives an index, gamma. This ranges from zero, for no overdispersion, to
unity, for total overdispersion. A compact beta-distribution clustered around the mean of the binomial distribution, would add little
extra variance and elicit minimum distortion of the binomial distribution, yielding a zero or near zero gamma value. A more scattered
or even bimodal beta-distribution would have a substantial effect and yield a significant gamma value. One question that has been posed
is whether some test methods are more prone to overdispersion than others. Yet, a consideration of the reasons for overdispersion would
suggest that significant gamma values were more a result of obtaining a heterogeneous sample of sensitive and insensitive judges by
chance. To confirm this, less sensitive and more sensitive samples of judges performed 2-AFC and 3-AFC tests with resulting zero
gamma values, indicating no overdispersion. However, when the less and more sensitive groups were combined, significant gamma values
were obtained, indicating the presence of overdispersion. However, in a further experiment using 2-AFC tests, when the less sensitive
group had its sensitivity increased by a warm-up procedure, combination with the more sensitive group did not result in
overdispersion.
© 2005 Elsevier Ltd. All rights reserved.
Keywords: Overdispersion; Beta-binomial; Gamma; Warm-up; Difference tests; 2-AFC; 3-AFC; Triangle; Duo-trio
1. Introduction
Statistical tests based on the binomial distribution are
generally used to determine whether the proportion of
difference tests performed correctly is greater than chance,
thus indicating that the difference was significant. Yet,
such binomial statistics were designed to analyze the
results of tossing coins or dice when the probability of
getting a target result (heads or six) is constant for each
item (1/2 or 1/6). With a set of difference tests performed
by a single judge during a single experimental session, it
could be argued that the probability of getting a target
result (test correct) is also constant over repeated tests,
should there be no fatigue or adaptation effects over
replicate testings. Yet, if data from several judges were to be
combined, this situation would no longer hold; judges
have different sensitivities and thus their probabilities of
getting the target result (test correct) would vary. The
assumptions for the binomial test would be violated. This
violation can result in Type I errors, the declaration of a
significant difference when the results were probably due
to chance or guessing.
The variation in the sensitivity of the judges provides an
extra source of variance in the computation. There is more
variance than would be expected from a mere binomial
analysis. The problem of this extra variance has a name:
the problem of overdispersion.
* Corresponding author. Tel.: +1 530 756 5493; fax: +1 530 756 7320.
E-mail address: maomahony@ucdavis.edu (M. O'Mahony).
0950-3293/$ - see front matter © 2005 Elsevier Ltd. All rights reserved.
There are various solutions to the problem (Brockhoff,
2003; Brockhoff & Schlich, 1998; Kunert, 2001; Kunert &
Meyners, 1999). However, the purpose of this paper is
experimental; it is not to discuss the relative merits of the
various statistical approaches. The approach that will be
discussed here uses beta distributions to describe the distri-
butions of judge sensitivities encountered during difference
testing. The beta distributions are combined with the regu-
lar binomial distributions to give what are called beta-bino-
mial distributions. These are the basis for the beta-binomial
statistical analysis for difference tests. The beta-binomial
with the extra variance brought in by the beta distribution
will have a greater variance than the binomial distribution.
This greater variance means it will be more difficult to reject
the null hypothesis and declare a significant difference. In
this way, it can be seen that there is a risk of Type I errors
if a binomial analysis is used rather than a beta-binomial
analysis.
Harries and Smith (1982) studied the beta-binomial
analysis for triangle tests with demonstrations of the effects
of some beta distributions. More recently, the beta-bino-
mial test has been developed by Ennis, Bi and their
coworkers (Bi & Ennis, 1998, 1999a; Bi, Templeton-Janik,
Ennis, & Ennis, 2000; Ennis & Bi, 1998). They describe
overdispersion by an index called gamma (γ). A gamma
value of zero indicates no overdispersion while a gamma
value of unity indicates maximum overdispersion. Gener-
ally, values are intermediate between zero and unity. Bi
and Ennis (1999b) published tables indicating how the pro-
portions of tests required to be correct to declare a signif-
icant difference varied with gamma.
For experiments in which the data analysis is in terms of
d′, the beta-binomial approach fits conveniently into a
Thurstonian framework (Bi & Ennis, 1998). The required
increase in variance for the beta-binomial distribution is
obtained by using a multiplier, which is a function of
gamma. The same multiplier can be used whether the
beta-binomial represents the variation in the proportion
of tests correct or the variation of d′.
A further aspect of being able to correct for overdisper-
sion is the ability to increase the sample size by combining
judges and replicate testings to boost the power of the anal-
ysis. If a coin is tossed three times and three separate coins
are tossed once, it is possible to combine coins and repli-
cate tosses and have a sample size of six tosses. The same
is possible for dice but not for judges performing difference
tests. A judge performing three triangle tests is not equiva-
lent to three separate judges each performing a single trian-
gle test. Judges and replicate testings cannot be combined
using a binomial statistical analysis unless the gamma value
is zero. To do so would open the possibility of Type I
errors. Yet, with a beta-binomial analysis, judges and rep-
licate testings can be combined. Thus, if 10 judges each per-
form 10 tests, the data can be treated as a sample size of
10 × 10 = 100.
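The effect of gamma on such a pooled sample can be sketched numerically. The following is a minimal illustration, assuming that the variance multiplier takes the standard beta-binomial form 1 + (n - 1)γ, where n is the number of replicate tests per judge. The function names and the notion of an "effective sample size" (pooled tests divided by the multiplier) are introduced here for illustration only and are not taken from the cited papers.

```python
def variance_multiplier(n_replicates: int, gamma: float) -> float:
    """Beta-binomial variance inflation factor, assumed to be 1 + (n - 1) * gamma."""
    if n_replicates < 1:
        raise ValueError("n_replicates must be at least 1")
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie between 0 and 1")
    return 1.0 + (n_replicates - 1) * gamma


def effective_sample_size(n_judges: int, n_replicates: int, gamma: float) -> float:
    """Pooled number of tests deflated by the variance multiplier (illustrative label)."""
    pooled = n_judges * n_replicates
    return pooled / variance_multiplier(n_replicates, gamma)


if __name__ == "__main__":
    # Ten judges, ten replicate tests each, for a range of hypothetical gamma values.
    for gamma in (0.0, 0.036, 0.2, 1.0):
        m = variance_multiplier(10, gamma)
        n_eff = effective_sample_size(10, 10, gamma)
        print(f"gamma={gamma:.3f}  multiplier={m:.2f}  effective N={n_eff:.1f}")
```

Under this reading, a gamma of zero lets all 100 pooled tests count in full, whereas a gamma of unity leaves no more information than 10 independent tests, one per judge.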
In practice, sometimes overdispersion (a γ value significantly
greater than zero) is encountered; sometimes it is
not. Rousseau and O'Mahony (2001) working with orange
drinks found no overdispersion with triangle and same–different
tests. Braun, Rogeaux, Schneid, O'Mahony, and
Rousseau (2004) using 2-AFC tests, reported overdisper-
sion with a gamma value of 0.036, which was significantly
greater than zero. Yet, it could be argued that the value
was unimportantly small and only significant because of
the large sample size. Rousseau and O'Mahony (2000)
using triangle, dual pair, and same–different tests, with
orange flavored beverages, combined judges and replicate
testings but did not quote a gamma value in their analysis.
Ligget and Delwiche (2005) required judges to perform
paired preference and 2-AFC tests for fruit flavored bever-
ages, chips and cookies. Significant gamma values were
obtained for two of the five paired preference studies and
two of the five 2-AFC studies. Yet, these did not occur
for the same foods; there was no consistency over products.
Increasing the number of replicate testings also had little
effect. Comparing judges' performance on 2-AFC, 3-AFC,
triangle and duo-trio tests with cherry flavored
drinks, a significant gamma value was obtained only with
the duo-trio. Using cherry flavored drinks, once more,
and 2-AFC tests, the performance of judges over 10 days
was measured. Significant overdispersion occurred on three
of the days.
Considering these studies, there does not appear to be
any particular pattern that supports that one test may be
more prone to overdispersion than another. Yet, the evi-
dence, as yet, is still sparse. The first goal of this study
was to collect some further evidence.
To gain further insight, it is worth reconsidering how
values of gamma should be interpreted. Given that judges
are not clones, then, it may be asked why cases occur when
gamma values are not significant and there is no significant
overdispersion. It would only be possible if the addition of
a beta distribution to a binomial distribution had a mini-
mal effect on the shape and variance of the latter. As Har-
ries and Smith (1982) pointed out, a beta distribution that
was fairly compact and clustered around the mean of a
binomial distribution would have such a minimal effect.
It would mean that the sensitivities of the judges in the dis-
crimination tests were close to the mean. As they further
remarked, a beta distribution that was more scattered or
even bimodal would have a substantial effect on the bino-
mial distribution and thus produce a significant gamma
value. Bimodal distributions can occur in preference testing
when the judges are split on their preferences and also with
difference tests, if the sample contains a group of more sen-
sitive and a group of less sensitive judges. The more sensi-
tive and less sensitive judges under consideration may
come from a bimodal distribution, indicating differences
in the judges' sensory systems per se. Alternatively, they
might be drawn from a single distribution, and the terms
'more sensitive' and 'less sensitive' merely applied to
variation in judges' performance in that particular test.
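How compact or scattered a beta distribution is can be related to gamma directly. The sketch below assumes the common parameterisation in which judge sensitivities follow a Beta(a, b) distribution with mean a/(a + b) and gamma equal to 1/(a + b + 1). The function names and the example parameter values are hypothetical.

```python
def beta_mean_and_gamma(a: float, b: float) -> tuple[float, float]:
    """Return (mean, gamma) for a Beta(a, b) distribution of judge sensitivities.

    Assumes gamma = 1 / (a + b + 1), i.e. the beta variance divided by mean * (1 - mean).
    """
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive")
    return a / (a + b), 1.0 / (a + b + 1.0)


def beta_params(mean: float, gamma: float) -> tuple[float, float]:
    """Invert the mapping: recover (a, b) from a mean and a gamma strictly between 0 and 1."""
    if not 0.0 < mean < 1.0 or not 0.0 < gamma < 1.0:
        raise ValueError("mean and gamma must lie strictly between 0 and 1")
    total = 1.0 / gamma - 1.0  # a + b
    return mean * total, (1.0 - mean) * total


if __name__ == "__main__":
    # Compact distribution clustered around its mean: gamma = 1/101.
    print(beta_mean_and_gamma(50.0, 50.0))
    # U-shaped (bimodal) distribution: gamma = 1/2.
    print(beta_mean_and_gamma(0.5, 0.5))
```

In this parameterisation, large a + b gives a tightly clustered distribution and a gamma near zero, while a and b both below one give a U-shaped distribution and a gamma above one third.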
To support these considerations, firstly, it may be
hypothesized that if a sample of judges performing a
difference test were to be made up of such more sensitive
and less sensitive groups, then gamma values for each of
the two separate groups would be low. Yet, when the
two groups were combined, the beta distribution would
cause significant distortion of the binomial distribution
and produce a higher gamma value. The second goal of this
study was to test this hypothesis.
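This hypothesis can be illustrated with a small simulation. The sketch below is not the maximum-likelihood computation used later in this paper; it assumes a simple moment estimate of gamma derived from the beta-binomial variance, and it uses hypothetical probabilities of a correct response (0.90 and 0.60) and much larger groups than are practical in a sensory experiment, purely to keep the estimates stable.

```python
import random


def moment_gamma(correct_counts, n_tests):
    """Moment estimate of gamma from per-judge counts of correct tests, truncated to [0, 1]."""
    k = len(correct_counts)
    props = [x / n_tests for x in correct_counts]
    mu = sum(props) / k
    if k < 2 or n_tests < 2 or mu in (0.0, 1.0):
        return 0.0
    s = sum((p - mu) ** 2 for p in props)
    # Solve n * S / ((k - 1) * mu * (1 - mu)) = 1 + (n - 1) * gamma for gamma.
    raw = (n_tests * s / (mu * (1.0 - mu) * (k - 1)) - 1.0) / (n_tests - 1)
    return min(1.0, max(0.0, raw))


def simulate_group(rng, n_judges, n_tests, p_correct):
    """Each judge performs n_tests tests with the same probability of a correct response."""
    return [sum(rng.random() < p_correct for _ in range(n_tests))
            for _ in range(n_judges)]


if __name__ == "__main__":
    rng = random.Random(1)
    more = simulate_group(rng, 200, 20, 0.90)  # hypothetical 'more sensitive' group
    less = simulate_group(rng, 200, 20, 0.60)  # hypothetical 'less sensitive' group
    print("more sensitive:", round(moment_gamma(more, 20), 3))
    print("less sensitive:", round(moment_gamma(less, 20), 3))
    print("combined:      ", round(moment_gamma(more + less, 20), 3))
```

Each simulated group is homogeneous, so its estimate should stay near zero, while the combined sample should give a clearly positive estimate.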
A second hypothesis concerns what Pfaffmann (1954)
called 'warm-up', which increases the measured sensitivity
of judges. The term was borrowed from early studies
on paired associate learning (for example, Heron, 1928;
Thune, 1950) and skilled behavior acquisition (for example,
Ammons, 1947). For gustation, the phenomenon is
more akin to the latter, taste discrimination having many
of the attributes of skilled behavior. In current practice
(Dacremont, Sauvageot, & Duyen, 2000; O'Mahony, Thieme,
& Goldstein, 1988; Thieme & O'Mahony, 1990)
warm-up involves judges alternately testing the two stimuli
to be discriminated, knowing which is which, until they
have identified the difference in sensations elicited by the
two confusable stimuli. Often, with confusable taste stim-
uli, a judge will not, at first, perceive a difference. Yet, after
repeated testing, the differences between the stimuli begin
to appear, sometimes suddenly. The increase in sensitivity
induced by warm-up can be considered as a focusing of
attention: focusing on or amplifying the differences in input
from the two stimuli, while attenuating the similarities.
Consider the situation where there was a sensitive and
an insensitive group combined to give a high gamma value.
Then, if the less sensitive group performed the warm-up
procedure, it could be hypothesized that their increase in
sensitivity would render them as sensitive as the more sen-
sitive judges. In this way, the two groups of judges would
become more similar in sensitivity; gamma would decrease.
The third goal of this study was to test this hypothesis.
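The reasoning behind this hypothesis can be sketched analytically. The example below treats gamma as the between-judge variance of the probability of a correct response divided by μ(1 - μ), which is what the beta-binomial index reduces to, and applies it to two internally homogeneous groups. The function name and the probabilities used are hypothetical and serve only as an illustration.

```python
def mixture_gamma(p_more: float, p_less: float, share_more: float = 0.5) -> float:
    """Gamma implied by mixing two homogeneous groups of judges.

    p_more, p_less: probability of a correct test in each group.
    share_more: proportion of judges belonging to the more sensitive group.
    """
    if not 0.0 <= share_more <= 1.0:
        raise ValueError("share_more must lie between 0 and 1")
    mu = share_more * p_more + (1.0 - share_more) * p_less
    if not 0.0 < mu < 1.0:
        raise ValueError("the combined mean must lie strictly between 0 and 1")
    between = share_more * (1.0 - share_more) * (p_more - p_less) ** 2
    return between / (mu * (1.0 - mu))


if __name__ == "__main__":
    before = mixture_gamma(0.90, 0.60)  # groups far apart in sensitivity
    after = mixture_gamma(0.90, 0.80)   # less sensitive group after a hypothetical warm-up
    print(f"before warm-up: {before:.3f}")
    print(f"after warm-up:  {after:.3f}")
```

Because the implied gamma depends on the squared gap between the two groups, narrowing that gap lowers gamma quickly, and it reaches zero when the groups coincide.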
For all three goals, a model system was used to confine
the variance to judge variation rather than product
variation.
2. Experiment I
The goal of this experiment was to collect further data
comparing the proneness of various difference test methods
to overdispersion. The test methods studied were: the
triangle, duo-trio, 2-AFC and 3-AFC.
2.1. Materials and methods
2.1.1. Judges
One hundred and six judges, students, staff and friends
at the University of California, Davis, participated in the
experiment. None of the judges had consumed food and/or
beverages within an hour before starting the experiment.
Of the 106 judges tested, 13 had participated in taste
sensory experiments before. The groups tested were as follows:
2-AFC method (12M, 9F, age range 22–58 yrs), 3-AFC
(7M, 8F, 22–58 yrs), triangle (5M, 9F, 21–57 yrs), duo-trio
(6M, 10F, 21–58 yrs). A further group of 40 judges (21M,
19F, 21–75 yrs) used all four methods.
2.1.2. Stimuli
The stimuli to be discriminated were 3 mM vs. 5 mM
NaCl solutions for the group of 40 judges who performed
all four tests. For the judges who performed just one test,
3 mM vs. 5 mM NaCl solutions were used for the 2-AFC
and 3-AFC tests and 1 mM vs. 5 mM NaCl solutions for
the less powerful (Ennis, 1993) triangle and duo-trio tests.
The NaCl was reagent grade (Mallinckrodt, Inc., Paris, KY)
and the solvent was Milli-Q purified water (Millipore Corp,
Bedford, MA). The Milli-Q purified water had a specific
conductivity < 10⁻⁶ mho/cm and a surface tension
≥ 71 dynes/cm.
The purified water was also used for interstimulus rinsing.
Stimuli were dispensed in 10 ml aliquots using Oxford
Adjustable Dispensers (Lancer, St. Louis, MO) in plastic
cups (1 oz. portion cups, Solo Cup Co., Urbana, IL). All
stimuli were served at constant room temperature
(21–24 °C) on white plastic cutting trays.
2.1.3. Procedure
Both a related-samples and an independent-samples design
were used. For the related-samples design, judges were
required to perform all four test methods in a single session.
After taking demographic details, establishment of
rapport and experimental instructions, judges took seven
mouthrinses. They then proceeded to perform six 2-AFC,
six 3-AFC, six triangle and six duo-trio tests. No interstimulus
rinses were taken within tests; judges were able to rinse
ad lib between tests. The order of the four test methods and
the order of stimulus presentation within a method were
counterbalanced over judges. Judges gave their responses
verbally. Session lengths ranged from 12 to 30 min.
For the independent-samples design, separate samples
of judges (noted above) used only one test method. The
procedure was as above, except that each judge performed
12 tests in a session. Some judges performed in more than
one group. Session lengths ranged from 5 to 15 min.
2.2. Results
For each test method, for both the related-samples and
independent-samples designs, gamma values were derived
according to the beta-binomial computation (Bi & Ennis,
1998, 1999a, 1999b; Bi et al., 2000). The computations were
performed using IFPrograms software based on maximum
likelihood (Institute for Perception, Richmond, VA). Table
1 displays the gamma values along with probabilities of
getting values this large on the null hypothesis (γ = 0).
From the table, it can be seen that a significant gamma
value was obtained for the 2-AFC method (p = 0.01) with
the related-samples design. For the independent-samples
design, a near significant gamma value (p = 0.08) was
obtained for the triangle method. There was no consistent
trend for one particular method to be more prone to
overdispersion than another. This accorded with the previous
findings discussed above. Values of d′ (Ennis, 1993) are
not quoted in the table, but for the related-samples design
they were not significantly different, ranging from 0.8 to
1.2. For the independent-samples design, d′ values ranged
from 1.1 to 2.5 for separate sets of judges. Computations
of d′ were performed using the IFPrograms software.
Table 1
Gamma values for the 2-AFC, 3-AFC, duo-trio and triangle methods for
the related-samples and independent-samples designs for Experiment I
Experimental design               2-AFC    3-AFC    Duo-trio  Triangle
Related-samples design a (N = 40)
  γ                               0.12     0.04     0.00      0.00
  p                               0.01     0.35     1.00      1.00
Independent-samples design b      N = 21   N = 15   N = 16    N = 14
  γ                               0.03     0.01     0.05      0.07
  p                               0.34     0.75     0.17      0.08
a Each judge performed 6 tests for each method.
b Each judge performed 12 tests for each method.
3. Experiment II
This experiment was designed to test the first hypothesis
that two groups of judges of different sensitivity, each
displaying no overdispersion would, when combined, give
significant gamma values. For this, judges performed 2-AFC
and 3-AFC tests. On the basis of their d′ values, they were
divided into two separate groups: more sensitive and less
sensitive. Gamma values were computed for the two
groups separately and also for when the two were
combined.
3.1. Materials and methods
3.1.1. Judges
Twenty judges (3M, 17F, age range 16–63 yrs) performed
2-AFC tests and 16 judges (6M, 10F, 20–62 yrs) performed
3-AFC tests. These judges had been screened as
being sufficiently sensitive to be included in the experiment.
As before, the samples of judges were drawn from students,
staff and friends at the University of California, Davis. All
judges had fasted, except for water, for at least 1 h prior to
the experiment. All judges were naïve to sensory testing
except for four judges in the 2-AFC group and eight in
the 3-AFC group.
3.1.2. Stimuli
The stimuli to be discriminated were 3 mM vs. 5 mM
NaCl solutions as described for Experiment I.
3.1.3. Procedure
The procedure, including the rinsing protocols, was as
for Experiment I, with the following modifications. In a
single session, judges performed twenty 2-AFC (or 3-AFC)
tests. Session lengths ranged from 10 to 20 min.
3.2. Results
For both the 2-AFC and 3-AFC tests, judges were split
into more sensitive and less sensitive groups. For this, a
simple rule was adopted, despite different chance
probabilities. Judges who performed 16–20 tests correctly were
deemed more sensitive, while those performing 10–15 tests
correctly were categorized as less sensitive. Judges with
inferior or chance performance levels were considered not
to be sufficiently sensitive to the differences between the
stimuli to be included in the more sensitive or less sensitive
groups. They were eliminated during the screening
and their data are not recorded here.
For the 2-AFC method, gamma and d′ values, with
associated null probabilities and variances, were computed
as in Experiment I, using IFPrograms software. Separate
values were computed for the more sensitive and less
sensitive groups as well as for the two groups combined. The
same computation was performed for the 3-AFC method.
The data are shown in Table 2.
From the table, it can be seen that the more sensitive
groups have higher d′ values than the less sensitive groups,
while the combined values are intermediate, as expected. It
can also be seen that within each sensitivity group, gamma
values are zero, indicating no overdispersion. Yet, the
gamma values of the combined groups are finite and
significantly greater than zero (p ≤ 0.03), indicating significant
overdispersion. Thus, two groups of different sensitivity,
each having no overdispersion, when combined can
produce finite gamma values. This confirms the first
hypothesis.
It might be argued that the significant results obtained
for the combined samples owe their significance to the
larger sample sizes. This might indeed be true. However, the
fact that the more sensitive and less sensitive samples
had zero gamma values and the combined samples had
finite values indicates how overdispersion increased when
the two samples were combined.
To test consistency, the experiment was repeated on the
same judges and gave the same results. For the 2-AFC,
gamma values were 0.03 (more sensitive), 0.00 (less
sensitive) and 0.09 (combined). For the 3-AFC, the
corresponding values were 0.00, 0.00 and 0.06.
Table 2
Gamma values for the 2-AFC and the 3-AFC methods with more sensitive
and less sensitive judges for Experiment II
        2-AFC                               3-AFC
        More       Less                     More       Less
        sensitive  sensitive  Combined      sensitive  sensitive  Combined
        N = 6      N = 14     N = 20        N = 8      N = 8      N = 16
γ       0.00       0.00       0.04          0.00       0.00       0.06
p       1.00       1.00       0.03          1.00       1.00       0.01
d′      1.75       0.52       0.81          2.10       0.99       1.46
σ²      0.05       0.01       0.01          0.03       0.02       0.01
For both methods, each judge performed 20 tests.
4. Experiment III
This experiment was designed to confirm the second
hypothesis that, given a more sensitive and a less sensitive
group of judges, warm-up given to the less sensitive group
would increase its sensitivity to such an extent that the
combined group would not display overdispersion.
4.1. Materials and methods
4.1.1. Judges
Nineteen judges (3M, 16F, age range 16–63 yrs), students,
staff and friends at the University of California,
Davis, participated. All had fasted for at least 1 h before
the experiment took place. All, except seven, were naïve
to sensory testing.
4.1.2. Stimuli
The stimuli to be discriminated were 3 mM vs. 5 mM
NaCl solutions, as in Experiments I and II.
4.1.3. Procedure
The procedure, including rinsing protocols, was the
same as for Experiment II, judges performing twenty 2-AFC
tests in a single session. Depending on performance
in the session, judges were divided, as before, into a more
sensitive and a less sensitive group, using the same
performance criteria as in Experiment II. Those judges in the less
sensitive group repeated the experiment with the addition
of a prior warm-up procedure. For this group, after the
initial seven mouthrinses, judges went through the warm-up
before performing the twenty 2-AFC tests. For the
warm-up, judges were presented with a set of 3 mM and
5 mM NaCl solutions. Each set was labeled so that the
judge was aware of the stimuli. They tasted the 3 mM
and 5 mM stimuli alternately, until they felt that they could
distinguish the signals indicating the difference between the
two stimuli. Initially, six of each stimulus was presented
and this was found sufficient for warm-up. When judges
reported that they could distinguish between the two
stimuli, testing was started immediately. Some judges reported
that they did not think that they had warmed up, but they
proceeded with the experiment anyway. No interstimulus
rinses were taken during warm-up. Session lengths ranged
from 10 to 20 min.
4.2. Results
Gamma and d′ values, with associated null probabilities
and variances, were computed as in Experiment II for the
less sensitive, more sensitive and combined groups, and
in the former's case, before and after warm-up. These data
are displayed in Table 3.
From the table, it can be seen that the more sensitive
group had higher d′ values than the less sensitive group,
with the combined groups having intermediate values, as
expected. It can be seen that the more sensitive and less
sensitive groups, whether warmed up or not, were
sufficiently homogeneous to have zero gamma values.
However, when the two groups were combined, without any
warm-up, there was sufficient overdispersion to give a finite
and significant gamma value. Yet, if the less sensitive
judges were given a prior warm-up, their measured
sensitivity increased enough (d′ = 0.52 vs. 0.98) for the judges to be
sufficiently homogeneous with the more sensitive group.
This resulted in the combined group having a gamma value
not significantly different from zero (γ = 0.01, p = 0.50),
indicating no significant overdispersion and confirming the
second hypothesis.
Table 3
Gamma values for the 2-AFC method with more sensitive and less
sensitive judges before and after warm-up for Experiment III
        2-AFC
        Before warm-up                      After warm-up
        More       Less                     Less
        sensitive  sensitive  Combined      sensitive  Combined
        N = 10     N = 9      N = 19        N = 9      N = 19
γ       0.00       0.00       0.07          0.00       0.01
p       1.00       1.00       0.00          1.00       0.50
d′      1.70       0.52       1.05          0.98       1.31
σ²      0.03       0.02       0.01          0.02       0.01
Each judge performed 20 tests in each condition.
5. Discussion
In the first experiment, there was no particular trend for
one difference test to be more prone to overdispersion than
another. This corresponded to previous research discussed
above. It would appear that overdispersion was less the
result of the test method than the result of a particular
sampling of judges. From a consideration of the suggestion of
Harries and Smith (1982), it would appear that
overdispersion is the result of a particular distribution of judge
sensitivities, the beta distribution. A compact beta distribution
would have little effect on the binomial distribution, so that
the beta-binomial would resemble the binomial, resulting
in little or no overdispersion and non-significant gamma
values. This was confirmed by Experiments II and III,
where less sensitive groups of judges were combined with
more sensitive groups, both having zero gamma, to form a
combined group. The latter then demonstrated
overdispersion, with gamma values significantly greater than zero.
The effect was prevented by providing the less sensitive
group with a warm-up procedure, to match its sensitivity
to that of the more sensitive group.
Overdispersion is thus avoided if the sample of judges
happens to have sensitivities distributed close to the mean
of the binomial distribution. One may speculate regarding
when such a group of judges might be sampled. It might be
expected with regular long-term consumers who had become
uniformly highly sensitive to product differences over the
years. With casual consumers, variation in frequency of
use might have more of an effect on sensitivity and therefore
generate overdispersion. This is a testable hypothesis.