Research Note on Web Accessibility Metrics
W3C Editors' Draft 9 May 2012
This version:
Latest published version: none
Latest internal version:
Previous published version: none
Previous internal version:
Editors:
Markel Vigo, University of Manchester
Giorgio Brajnik, University of Udine
Joshue O Connor, NCBI Centre for Inclusive Technology
Copyright © 2012 W3C (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
Abstract
Web accessibility metrics are an invaluable tool for researchers, developers, governmental agencies and end users. Accessibility metrics help to better grasp the accessibility level of websites and are therefore helpful for making decisions based on the scores they produce. Recently, a plethora of metrics have been released; however, the validity and reliability of most of these metrics is unknown, and those making use of them run the risk of using inappropriate metrics. To overcome this situation, this note provides a framework that considers validity, reliability, sensitivity, adequacy and complexity as the main qualities that a metric should have.
A symposium was organised to observe how current practice is addressing these qualities. We found that work addressing the validity of metrics is scarce, although some efforts can be perceived as far as inter-tool reliability is concerned. This is something that the research community should be aware of, as we might be making futile efforts by using metrics whose validity and reliability are unknown. The research realm is perhaps not mature enough, or we do not have the right methods and tools. We therefore try to shed some light on the possible paths that could be taken so that we can reach a point of maturity.
Status of this document
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This 9 May 2012 Editors' Draft of Research Note on Web Accessibility Metrics is intended to be published and maintained as a W3C Working Group Note after review and refinement. The note provides an initial consolidated view of the outcomes of the Website Accessibility Metrics Online Symposium held on 5 December 2011.
The Research and Development Working Group (RDWG) invites discussion and feedback on this draft document by researchers and practitioners interested in metrics for web accessibility, in particular by participants of the online symposium. Specifically, RDWG is looking for feedback on:
Summaries of the extended abstracts contributed to the online symposium;
Discussion about the state-of-the-art and conclusions drawn in the document;
Related resources that may be useful to the discussion within the document.
Please send comments on this Research Note on Web Accessibility Metrics document by @@@ to @@@ (publicly visible mailing list archive).
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document has been produced by the Research and Development Working Group (RDWG), as part of the Web Accessibility Initiative (WAI) International Program Office. This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; this page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
Table of Contents
Introduction
1.1 Definition and background
1.2 The Benefits of Using Metrics
A Framework for Quality of Accessibility Metrics
2.1 Validity
2.2 Reliability
2.3 Sensitivity
2.4 Adequacy
2.5 Complexity
Current Research
3.1 Addressing Validity and Reliability
3.2 Tool Support for Metrics
3.3 Addressing Large-Scale Measurement
3.4 Targeting Particular Accessibility Issues
3.5 Novel Measurement Approaches
3.6 Beyond Conformance
3.7 Concluding Remarks
A Research Roadmap for Web Accessibility Metrics
4.1 Ensuring Metric Quality
4.2 Validity
4.3 Reliability
4.4 Other Qualities
4.4.1 Sensitivity
4.4.2 Adequacy
4.4.3 Complexity
A Corpus for Metrics Benchmarking
5.1 Credibility issues
5.2 User-tailored metrics
5.3 Dealing with dynamic content
Conclusions
Acknowledgements
References
Introduction
Definition and background
In the web engineering domain, a metric is a procedure for measuring a property of a web page or website. A metric can be the number of links, the size in KB of an HTML file, the number of users that click on a certain link, or the perceived ease of use of a web page. In the realm of web accessibility, amongst others, a metric can measure the following qualities (the sketch after this list illustrates the first of them):
The number of pictures without an alt attribute.
The number of Level A and AA success criteria violations.
The number of possible failure points where accessibility issues can potentially happen (such as the number of images in a page).
The severity of an accessibility barrier.
The time taken to conduct a task.
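As a rough illustration of the first quality above, here is a minimal Python sketch that counts img elements lacking an alt attribute; it uses only the standard library and is a simplification (real checkers also consider empty alt values, ARIA attributes, CSS images, and so on):

    from html.parser import HTMLParser

    class ImgAltCounter(HTMLParser):
        def __init__(self):
            super().__init__()
            self.images = 0       # potential failure points: all img elements
            self.missing_alt = 0  # violations: img elements without alt

        def handle_starttag(self, tag, attrs):
            if tag == "img":
                self.images += 1
                if "alt" not in dict(attrs):
                    self.missing_alt += 1

    counter = ImgAltCounter()
    counter.feed('<p><img src="a.png" alt="logo"><img src="b.png"></p>')
    print(counter.missing_alt, "of", counter.images, "images lack alt text")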
In order to measure more abstract qualities, more sophisticated metrics are built upon more basic ones. For instance, readability metrics [readability] take into account the number of syllables, words and sentences contained in a document in order to measure the complexity of a text. Similarly, metrics aiming at measuring web accessibility have been built on specific qualities, which can be inherent in a website (such as images with no alt attribute) or observed from human behaviour (e.g., user satisfaction ratings or performance indexes such as the number of errors). For instance, the failure-rate metric computes the ratio of the number of accessibility violations of a particular set of criteria to the number of failure points for the same criteria.
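The following sketch shows the failure-rate computation just described; the per-criterion counts are illustrative placeholders that would, in practice, come from an evaluation tool's report:

    def failure_rate(violations, failure_points):
        """Ratio of actual violations to potential failure points."""
        total_points = sum(failure_points.values())
        if total_points == 0:
            return 0.0  # nothing on the page can fail these criteria
        total_violations = sum(violations.get(c, 0) for c in failure_points)
        return total_violations / total_points

    # e.g. 3 of 10 images lack alt text; 1 of 4 form fields lacks a label
    print(failure_rate({"img-alt": 3, "form-label": 1},
                       {"img-alt": 10, "form-label": 4}))  # about 0.29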
As a result of the computation of accessibility
metrics, different types of data can be produced:
Ordinal values, like WCAG 2.0 conformance levels (AAA, AA,
A), or "accessible"/"non-accessible" scores; these conformance
levels can be computed by a metric defined as "a web page is
only accessible if all relevant success criteria are met,
otherwise it is inaccessible".
Quantitative ratio values such as 0, 175, -15 or
0.38.
Web accessibility can be viewed and defined in different ways [Brajnik08]. One way is to consider whether a web page or website is conformant to a set of principles such as WCAG 2.0 or Section 508. Even if WCAG 2.0 conformance levels are well specified and, as seen above, are ordinal values, other metrics could be defined on the basis of success criteria and their sufficient, advisory and failure techniques. We call these metrics, which are based on whether success criteria of given guidelines are met, conformance-based metrics.
Other metrics can be defined if one assumes that accessibility is a quality that differs from conformance. For example, Section 508 defines accessibility as the extent to which "a technology [...] can be used as effectively by people with disabilities as by those without it". Provided that effectiveness can be measured, such metrics could yield results that differ from conformance-based ones. By analogy with the notion of "quality in use" for software, we call these accessibility-in-use metrics, to emphasise that they try to measure performance indexes that can be exhibited by real users when using the website in specific situations. In addition, they do not require the notion of conformance with respect to a set of principles. Traditional usability metrics such as effectiveness, efficiency and satisfaction could be considered accessibility-in-use metrics. Also, any measure of the perceived accessibility of a web page by users is a metric belonging to this second group. Notice that this notion of accessibility covers not only accessibility of the content of web pages, but also accessibility of user agents and features of assistive technologies, and could even address different levels of expertise that users have with these resources.
Most of the existing metrics - see a review in [Vigo11a] - are of the former type because they are mainly built upon criteria implemented by automatic testing tools, such as the number of violations or their WCAG priority. Moreover, in order to overcome the lack of sensitivity and precision of ordinal metrics, conformance metrics often yield ratio scores. The main reason for the widespread use of these types of metrics lies in their low cost in terms of time and human resources, since they are based on automatic tools. Although no human intervention (experts' audits or user tests) is required in the process, this does not necessarily entail that only fully automated success criteria are to be considered. Some metrics estimate the violation rate of semi-automatic success criteria and purely manual ones, as in [Vigo07]; others adopt an optimistic vs. conservative approach to their violation rate [Lopes]. The error rate of these estimations, together with the reliance on testing tools, constitutes the major weakness of automatic conformance metrics. In fact, these metrics inherit tool shortcomings such as false positives and false negatives, which affect their outcome [Brajnik04].
A benchmarking survey on automatic conformance metrics concluded that existing metrics are quite divergent and most of them do not do a good job of distinguishing accessible pages from non-accessible pages [Vigo11a]. On the other hand, there are metrics that combine testing tool results with those produced by human review, with the goal of estimating such errors; one example is SAMBA [Brajnik07]. Other metrics do not rely on tools at all; an example is the evaluation done with the AIR method [AIR].
The Benefits of Using Metrics
There are several scenarios that could benefit from web accessibility metrics:
Quality assurance within web engineering can exploit metrics as a way for developers to precisely know the accessibility level of their artifacts throughout the development cycle.
Benchmarking can exploit metrics as a way to explore, at a large scale, the accessibility level of web pages, such as within a domain (like .gov) or within geographical areas (like different European states).
Information retrieval systems can implement metrics as one of the criteria to rank web pages. Users would therefore be able to retrieve not only pages that suit their information needs but also those that are accessible.
Adaptive hypermedia techniques can benefit from metrics to enhance the interface, to provide guidance, or as a criterion to perform adaptations.
A Framework for Quality of Accessibility Metrics
Several quality factors can be defined for web accessibility metrics, factors that can be used to assess how applicable a metric is in a certain scenario and, potentially, to characterize the risks that adopting a given metric entails. As discussed in [Vigo11a], validity, reliability, sensitivity, adequacy and complexity appear to be the most important factors.
Validity
This attribute is related to the extent to which the measurements obtained by a metric reflect the accessibility of the website to which it is applied, and this could depend on the notion of accessibility: conformance vs. accessibility in use. The former refers to how well a web document meets specific criteria (i.e., principles and guidelines), whereas the latter indicates how the interaction is perceived. These two perspectives are not necessarily aligned, as the following example illustrates: a picture without alternative text violates a guideline, making a web page non-conformant; however, the lack of alternative text may not be perceived as an obstacle if the goal of the user is to navigate or even purchase an item in an e-commerce site.
As discussed above, most existing conformance metrics are plagued by their reliance on automatic testing tools and do not provide means to estimate the error rate of tools. Furthermore, the way the metric itself is defined could lead to other sources of error, reducing its validity. For example, the failure rate should not be used as a measure of accessibility-in-use; using it as a measure of conformance is also controversial: it is sometimes claimed that it measures how well developers coped with accessibility features rather than providing an estimation of conformance [Brajnik11]. Validity with respect to accessibility-in-use should cope with the evaluator effect [Hornbæk] and the lack of agreement among users in their severity ratings [Petrie].
Validity is by far the most important quality attribute for accessibility metrics. Without it we would not know what a metric really measures. The risk of not being able to characterize the validity of metrics is that potential users of metrics would choose those that appear simple to apply and that provide seemingly plausible results. In a sense, people may therefore choose a metric because it is simple rather than because it is a good metric, with the unforeseen consequence that incorrect claims and decisions could be made regarding web pages and sites. These are important issues as they strike at the heart of our notions of conformance: we would be assessing a user interface without truly knowing whether our method of assessment is itself valid.
Reliability
This attribute is related to the reproducibility and consistency of scores, i.e. the extent to which they are the same when evaluations of the same web pages are carried out in different contexts (different tools, different people, different goals, different times). Reliability of a metric depends on several interconnected layers. These range from the underlying tools (what happens if we switch tools?), to the underlying guidelines (what happens if we switch guidelines?), to the evaluation process itself (if random choices are made, for example when scanning a large site). Unreliable metrics are problematic: they are inconsistent, they limit the ability of people to predict their behavior, and they limit the ability to comprehend them at a deeper level. However, reliability should not always be expected. For instance, if we switch guideline sets we should not expect similar results, as a different problem coverage is assumed.
It is worth noting that one of the aims of this research note is to help identify errors, or spot gaps, in current metrics. The idea is that we can thereby either confidently reject faulty metrics or improve them in order to halt a process of "devaluation". This devaluation happens in the mind of the end user, in terms of the perceived value of the "ideal" of conformance. This process can be a byproduct of poor metrics themselves or come from misunderstanding the output of metrics that are not clear or easy for end users to understand. In other words, if a metric is not stable, it is very difficult to effectively use it as a tool of either analysis or comprehension.
Sensitivity
Metric sensitivity is a measure of how changes in a given website are reflected in changes in the metric's output. Ideally we would like metrics not to be too sensitive, so that they are robust and do not over-react to small changes in web content. This is especially important when the metric is applied to highly dynamic websites, as we show later in this note.
Adequacy
This is a general quality, encompassing several properties of accessibility metrics, for instance: the type of data used to represent scores, the precision in terms of the resolution of a scale, normalization, and the span covered by actual values of the metric (distribution). These attributes determine whether the metric can be suitably deployed in a given scenario. For example, to be able to compare accessibility levels of different websites (as would happen in the large-scale scenario discussed above), metrics should provide normalized values, as otherwise comparisons are not viable. If the distribution of values of the metric is concentrated on a small interval (such as between 0.40 and 0.60, instead of [0, 1]), then even large changes in accessibility could lead to small changes in the metric, and round-off errors could influence the final outcomes.
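As a minimal sketch of the normalization this discussion assumes, the following rescales a set of raw scores onto the unit interval; the sample values are invented for illustration:

    def rescale(scores):
        """Min-max rescaling: map raw scores onto [0, 1] for comparability."""
        lo, hi = min(scores), max(scores)
        if hi == lo:
            return [0.0 for _ in scores]  # degenerate case: identical scores
        return [(s - lo) / (hi - lo) for s in scores]

    raw = [0.42, 0.55, 0.48, 0.60]  # concentrated in a narrow band
    print(rescale(raw))             # spread across the full unit interval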
Complexity
Depending on the type and quantity of data used to compute a metric and the algorithm on which it is based, the process can be more or less computationally demanding with respect to certain resources, such as time, processors, bandwidth and memory. The complexity of a metric therefore reflects the computational and human resources it requires, which may prevent stakeholders from embracing accessibility metrics. Some scenarios rely on the fact that metrics have to be relatively simple (such as when metrics are used for adaptations of the user interface and therefore have to be computed on the fly). However, some metrics may require high bandwidth to crawl large websites, large storage capacity or increased computing power. For those metrics that rely on human judgment, another complexity aspect is related to the workflow process that has to be established to resolve conflicts and synthesize a single value. As a result, these metrics may not suit particular application scenarios, budgets or resources.
Current Research
The papers that were presented at the symposium cover a broad span of issues, addressing the quality factors we outlined above to different extents. However, they provide new insights and raise new questions that help shape future research avenues (see section 4).
Addressing Validity and Reliability
Validity in terms of conformance was tackled by Vigo et al. [Vigo11b] by comparing automatic accessibility scores with those given by a panel of experts, obtaining a strong positive correlation. Inter-tool reliability of metrics was also addressed by comparing the behaviour of the WAQM metric assessing 1500 pages with two different tools (EvalAccess and LIFT). A very strong correlation was found when pages were ranked according to their scores; to obtain the same effect with ratio scores, though, the metric requires some ad-hoc adjustment. Finally, the authors investigated inter-guideline reliability between WCAG 1.0 and WCAG 2.0, again finding a very strong correlation between ordinal values, although this effect fades out when looking at ratio data.
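The kind of inter-tool comparison described here is commonly quantified with a rank correlation; the sketch below applies Spearman's rho to scores that two tools assign to the same pages (the scores are invented for illustration):

    from scipy.stats import spearmanr

    tool_a = [0.91, 0.45, 0.78, 0.33, 0.67]  # metric scores from one tool
    tool_b = [0.88, 0.51, 0.74, 0.29, 0.70]  # same pages scored with another

    rho, p_value = spearmanr(tool_a, tool_b)
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")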
Fernandes and Benavidez [JFernandes] addressed metric reliability (UWEM and web@X) by comparing two tools (eChecker and eXaminator) with different interpretations of success criteria and coverage, assessing the accessibility of about 300 pages. An initial experiment shows a moderate positive correlation between those tools.
Reliability of metrics very often rests on the reliability of the underlying testing tools, and it is well known that different tools produce different results on the same pages. A point raised during the webinar was that this problem could lead to situations where low credibility is attributed to tools and metrics; metrics would make it even more difficult to compare different outcomes and diagnose bad behavior. In addition, stakeholders could be tempted to adopt the metric that provides the best results on their pages, or the one that can be more easily interpreted and explained, regardless of whether it is related to accessibility. However, as mentioned previously, we should be cautious about when to expect reliable behaviour across tools, guidelines or domains.
Tool Support for Metrics
The availability of metrics in terms of publicly available algorithms, APIs or tools is a critical issue for the usage of metrics to gain momentum and for their adoption to be fostered. Providing such mechanisms will help facilitate a broader adoption of metrics by stakeholders - especially by those who, even if interested in using them, do not have the resources to operate and articulate them. There are some incipient proposals in this direction that implement a set of metrics: Naftali and Clúa [Naftali] presented a platform where failure-rate and UWEM are deployed. However, this entails that human intervention is required, as the system needs the input of experts to discard false positives. There are other tools that help to keep track of the accessibility level of websites over time [Battistelli11a]. These sorts of tools tend to target the accessibility monitoring of websites within determined geographical locations, normally municipalities or regional governments. The tool support provided by Fernandes et al. [NFernandes11a], QualWeb, incorporates into traditional accessibility testing tools a feature to detect templates; the novelty of this approach is that the metric employed uses the accessibility of the template as a baseline, and accessibility is measured from that starting point. If the accessibility problems of the template were repaired, these fixes would automatically spread to all the pages built upon the template. Therefore, the distance from a particular web page to the template (or baseline) can be used to estimate the effort required to fix that instance, which is very appropriate for quality assurance scenarios.
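A minimal sketch of this template-as-baseline idea follows; it does not reproduce the actual QualWeb algorithm, and the violation counts are invented. A page's own repair effort is approximated by the violations it adds on top of those already present in its template:

    def violations_beyond_template(page_violations, template_violations):
        """Count violations attributable to the page itself, not its template."""
        extra = {}
        for check, count in page_violations.items():
            baseline = template_violations.get(check, 0)
            if count > baseline:
                extra[check] = count - baseline
        return extra

    template = {"img-alt": 2, "contrast": 1}  # shared by all pages of the site
    product_page = {"img-alt": 5, "contrast": 1, "label": 2}
    print(violations_beyond_template(product_page, template))
    # -> {'img-alt': 3, 'label': 2}: the page-specific repair effort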
Addressing Large-Scale Measurement
Large-scale evaluation and measurement is required for websites that contain a great number of pages, or when a number of websites have to be evaluated. Managing these large volumes of data cannot be done without the help of automated tools. An example for large websites is provided by Fernandes et al. [NFernandes11a]. They present a method for template detection that aims at lessening the computing effort of evaluating large numbers of pages. This can be useful, for instance, for websites that rely massively on templates, such as online stores. In this case, the vast majority of the pages follow a determined template. In the online-store example, normally the only content that changes is the item to be sold and the related information; the layout and internal structure remain the same. One example that addresses measuring the accessibility of a large number of distinct websites is presented by Battistelli et al. [Battistelli11a] using the BIF metric; similarly, AMA is a platform that enables keeping track of a large number of websites and is used to measure how conformant to guidelines the sites of specific geographical locations are. Finally, Nietzio et al. [Nietzio] present a metric to measure WCAG 2.0 conformance in the context of a platform to keep track of the accessibility of Norwegian municipalities.
Targeting Particular Accessibility Issues
Battistelli et al. [Battistelli11b] present a metric to quantify the compliance of documents with respect to their DTDs. Instead of measuring this compliance as if it were a binary variable (conformant/non-conformant), compliance is measured as the distance of the current document from the ideal one. Although its relationship with accessibility is not very apparent, code compliance is one of the technical accessibility requirements according to the Italian regulation, and it also impacts those success criteria that call for the correct use of standards [see WCAG 2.0 SC 4.1.1 Parsing]. This approach could also be followed to measure accessibility. For instance, a web page could be improved until it is accessible according to guidelines, or until it provides an acceptable experience to end users. The accessibility level of the non-accessible page could then be computed as the effort required to build the ideal web page, in terms of lines of code, mark-up tags introduced or removed, or time. Another approach that tackles a particular accessibility problem is that of Rello and Baeza-Yates [Rello], who address the measurement of text legibility. This is something that affects the understandability of a document, a fundamental accessibility principle [see the WCAG Understandable principle]. The interesting contribution of this work is its reliance on a quantitative model of spelling errors automatically computed from a large set of pages handled by a search engine. Compliance with the DTD and legibility of a web document can be considered not only accessibility success criteria but also quality issues.
Novel Measurement Approaches
When it comes to innovative ways of measuring, the distance from a given document to a reference model can inspire similar approaches to measuring web accessibility. As suggested in [Battistelli11b], compliance can be measured considering the distance between a given document and an ideal (or acceptable) one. In this case the distance can be measured, for instance, in terms of missing hypertext tags or the effort required to accomplish changes. Another example is illustrated by measuring the distance from an instance document to a baseline template using a metric [NFernandes11a]. Another novel way of measuring accessibility is the use of a grading scale and an arbitration process, as proposed by Fischer and Wyatt [Fischer]: a five-point Likert scale aims at going beyond a binary accessible/non-accessible scoring scale. It would be interesting to see, in the future, how the final outcome of an evaluation depends on the original scores given by individual evaluators and what level of agreement exists between evaluators before arbitration takes place.
Vigo [Vigo11c] proposes a method for managing sets of checkpoints that, depending on the contextual requirements, either have to be met simultaneously or where the fulfillment of just one of them suffices. Nietzio et al. [Nietzio] suggest a stepwise method to measure conformance to WCAG 2.0, in which aspects such as success criteria applicability and tool support are considered. The method adapts to the specific testing procedures of WCAG 2.0 success criteria (SC) by providing a set of decision rules: first, the applicability of the SC is analysed; second, if applicable, the SC is tested; third, if a common failure is not found, the implementation of the sufficient techniques is checked; and finally, tool support is checked for the techniques identified in the previous step. The metric computed as a result of this process is a failure rate that also takes into account the logic underlying necessary, sufficient and counter-example techniques for each SC.
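The sketch below renders this stepwise decision procedure in code; it is an interpretation in the spirit of the description above, not the exact rules from the paper, and the two sample criteria are invented:

    def classify_sc(sc):
        """Classify one success criterion (SC) for a page under evaluation."""
        if not sc["applicable"]:
            return "not-applicable"       # step 1: applicability
        if sc["common_failure_found"]:
            return "fail"                 # step 2: a known failure pattern matched
        if sc["sufficient_technique_used"]:
            return "pass"                 # step 3: a sufficient technique is present
        if not sc["tool_supported"]:
            return "needs-manual-review"  # step 4: no tool support, defer to a human
        return "fail"

    checks = [
        {"applicable": True, "common_failure_found": False,
         "sufficient_technique_used": True, "tool_supported": True},
        {"applicable": True, "common_failure_found": True,
         "sufficient_technique_used": False, "tool_supported": True},
    ]
    results = [classify_sc(sc) for sc in checks]
    applicable = [r for r in results if r != "not-applicable"]
    print(results, "failure rate:", applicable.count("fail") / len(applicable))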
Beyond Conformance
Vigo [Vigo11c] proposes a method that considers not only guidelines when measuring accessibility conformance, but also the specific features of the accessing device (e.g., screen size, keyboard support) as well as the assistive technology operated by the users. Including these contextual characteristics of the interaction could lead to more faithful measurements of the experience. Finally, Sloan and Kelly [Sloan] claim that understanding accessibility as conformance to guidelines is risky in those countries (e.g., the UK) where accessibility assessment is not limited to guidelines but also focuses on the delivered service and user experience. They therefore encourage moving forward and embracing accessibility in terms of user experience, at a time when user experience is becoming so salient and prevalent, and thinking of conformance of the production process rather than conformance of a product (which constantly changes). This perspective is novel in that it looks beyond the current conformance paradigm and aims to tap more into the user experience, something that is not necessarily captured by current methods of technical validation or document conformance.
Concluding Remarks
The authors of the above papers were asked about several aspects of web accessibility metrics. The first aspect concerns the target users of metrics; the goal of this question was to ascertain whether metrics researchers have in mind application scenarios or the profile of the end user who will make decisions based on the scores provided by metrics. Our survey shows that the majority of respondents do not have a specific end user of metrics in mind, or their answers are too generic. However, three papers are focused on web accessibility benchmarking (see [Nietzio], [Battistelli11a], [JFernandes]) and some others can potentially be applied in that domain. This means that benchmarking is the application scenario with the broadest acceptance and where the application of metrics is taking off. In the remaining scenarios (quality assurance, information retrieval and the adaptive web) there are, again, potential applications, although the intent to apply metrics in these scenarios is not evident.
Secondly, we wanted to know whether accessibility metrics researchers are aware of the costs and risks of decisions made on the basis of wrong metric values. Most respondents consider that the validity and reliability of metrics should be guaranteed, although many regard this as future work. There is some tendency towards employing experts in such validations, although most agree that users will have the last word as far as validation is concerned. This is closely related to our last question, about the research community's point of view on measuring accessibility beyond conformance metrics. All the answers we received claimed that measuring accessibility in terms of user experience should be explored more thoroughly.
A Research Roadmap for Web Accessibility Metrics
This research note aims at highlighting current efforts in investigating accessibility metrics as well as uncovering existing challenges. Research on web accessibility metrics is taking off as the benefits of using them become apparent; however, their adoption is far from widespread. In addition to their relative novelty, this may be because (1) there is a plethora of metrics out there, and frameworks for metrics comparison that show their strengths and weaknesses are relatively recent [Vigo11a]; (2) quality frameworks require further investigation, as there are unexplored areas for each of the defined qualities - these areas are uncovered in section 4.1; and (3) the low validity of existing metrics calls for a standardized testbed to show how they perform with regard to metric quality. Setting up a corpus of web pages for benchmarking purposes could be the first step towards this goal. It would work in the same way that the Information Retrieval community tests the performance of its algorithms [see the Text Retrieval Conference, TREC] - see section 5. A side effect of the lack of validity and reliability of metrics is their lack of credibility. This could partially be tackled by the mentioned benchmarking corpus; however, the credibility problem goes beyond that - see section 5.1. Finally, some other issues, such as user-tailored metrics and dealing with dynamic content, require special attention from those who aim at conducting research on web accessibility metrics.
Ensuring Metric Quality
Focusing more precisely on accessibility metric quality, there are still many challenges to pursue. The way a metric satisfies the validity, reliability, sensitivity, adequacy and complexity qualities remains open and can be addressed through the following questions. Even if all qualities are important, we emphasize that the validity and reliability of metrics should be given priority: no matter how sensitive or adequate a metric is, it is of little use if we cannot ensure its reliability and, especially, its validity.
Validity
Studies of "validity with respect
to conformance" could focus on the following
research questions:
Does validity of the metric change when we change
guidelines?
Does validity change when we use a subset of the
guidelines?
Does validity depend on the genre of the
website?
Is validity dependent on the type of data being
provided by the testing tool?
Does validity change when we switch the tool used
to collect data? And what if we use data produced by
merging results of two or more tools, rather than
basing the metric on the data of a single tool?
Are there quick ways to estimate validity of a
metric?
The above questions could be addressed in the following ways:
By a panel of judges that would systematically evaluate all the pages using the same guidelines used by the tool(s).
By artificially seeding web pages with known accessibility problems (i.e. violations of guidelines), and systematically investigating how these known problems affect the metric scores (see the sketch after this list).
By exploring the impact on validity of manual tests when (1) they are excluded or (2) their effect is estimated.
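A minimal sketch of the seeding approach follows; the page, the injection mechanism and the metric are simplified placeholders (a real study would seed many kinds of violations into real pages):

    import re

    def strip_alt_attributes(html, n):
        """Inject n violations by removing alt attributes from the first n images."""
        if n == 0:
            return html  # guard: count=0 would mean "replace all" in re.sub
        return re.sub(r'(<img\b[^>]*?)\s+alt="[^"]*"', r"\1", html, count=n)

    def score(html):
        """Toy metric: share of img tags that still carry an alt attribute."""
        imgs = re.findall(r"<img\b[^>]*>", html)
        if not imgs:
            return 1.0
        return sum(1 for i in imgs if 'alt="' in i) / len(imgs)

    page = ('<img src="a.png" alt="a"><img src="b.png" alt="b">'
            '<img src="c.png" alt="c">')
    for n in range(4):
        print(f"{n} seeded problems -> score {score(strip_alt_attributes(page, n)):.2f}")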
Studies of "validity with respect
to accessibility in use" should overcome the
evaluator effect [
Hornbæk
] and
lack of agreement of users in their severity ratings
Petrie
] and could address the
following questions:
Which factors affect this type of validity?
Is it possible to estimate validity of the metric
from other information that can be easily
gathered?
Is validity with respect to accessibility in use
related to validity with respect to conformance?
Reliability
Some efforts to understand metric reliability could go in the following directions:
Study how results produced by different tools vary when applied to the same site.
Study the differences in metric scores when metrics are fed with data produced by the same tool on the same websites but applying different guidelines.
Analyse the effects of page sampling, a process that is necessary when dealing with large websites or highly dynamic ones.
See how reliability changes when merging the data produced by two or more evaluation tools applied to the same site.
Analyse how the reliability of a metric correlates with its validity.
Other Qualities
Sensitivity
Experiments could be set up to perform sensitivity
analysis: given a set of accessibility problems in a test
website, they could be systematically turned on or off,
and their effects on metric values could be analysed to
find out which kinds of problems had the largest effect
and under which circumstances. Provided that valid and
reliable metrics were used, this could tell us which
accessibility barriers would have a more or less strong
impact on conformance or use.
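The following sketch illustrates such a sensitivity analysis with invented problem effects and a toy metric; a real experiment would toggle seeded problems in actual test pages:

    from itertools import combinations

    PROBLEMS = {"missing-alt": 0.10, "low-contrast": 0.05, "no-labels": 0.20}

    def toy_metric(active_problems):
        """Score 1.0 for a clean page, reduced by each active problem."""
        return max(0.0, 1.0 - sum(PROBLEMS[p] for p in active_problems))

    baseline = toy_metric([])
    for p in PROBLEMS:
        delta = baseline - toy_metric([p])
        print(f"toggling {p!r} alone moves the score by {delta:.2f}")

    # Pairs of problems can reveal interaction effects as well.
    for pair in combinations(PROBLEMS, 2):
        print(pair, "->", toy_metric(pair))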
Adequacy
Provided that a metric is valid and reliable, research
directions about metric adequacy should analyse the
suitability and usefulness of its values for users in
different scenarios, as well as metric visualization and
presentation issues.
Complexity
The most important issue about metric complexity lies in its relationship with the rest of the qualities. In this regard we can pose the following questions:
Does greater complexity in a metric ensure more valid and reliable results? If so, could we pursue a compromise between the maximum acceptable complexity of a metric and its minimum required validity?
Can we find proxies (e.g. the number of pictures in a web page) to predict the accessibility of a web page? As a side effect we could dramatically reduce the complexity of metrics (see the sketch at the end of this section).
The role that metric complexity plays in the adoption and employment of metrics could also be another line to follow.
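As a sketch of the proxy idea, the following fits a simple linear model that predicts an expensive accessibility score from a cheap page feature; all the numbers are invented for illustration:

    import numpy as np

    n_images = np.array([2, 15, 40, 8, 25, 60])             # cheap proxy feature
    full_score = np.array([0.9, 0.7, 0.4, 0.8, 0.55, 0.3])  # costly full metric

    slope, intercept = np.polyfit(n_images, full_score, deg=1)

    def predict(x):
        """Estimate the full metric from the proxy alone."""
        return slope * x + intercept

    print(f"predicted score for a 30-image page: {predict(30):.2f}")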
A Corpus for Metrics Benchmarking
One option for having a common playground, so that the research community could shed some light on these challenges, would be to organise the same kind of competitions as the TREC experiments. Recently, some efforts have been directed towards this goal by the W3C and in the context of the BenToWeb project. There are several issues that need to be tackled:
How do we create test collections?
How do we select our test participants? (The metrics highly depend on the tester.)
Do we make use of existing web pages?
How do we inject accessibility defects into these pages?
Which criteria do we use to rank metrics?
How do we isolate the metric from the underlying testing tool?
Which factors should influence metrics (e.g., defects per page for a given criterion, defect repetition due to a single defect in a server-side web page template, WCAG severity level, etc.)?
How do we make these outputs accessible to "non-experts"?
How can we "dovetail" the user experience with metrics used in the wild?
How about comparing the results of user tests with those of accessibility evaluation tools? This would be very interesting for sites that are already borderline or considered inaccessible.
To start with, we could collect pages we know are accessible, and pages we know are not (because we injected faults into them or collected them from repositories such as www.fixtheweb.net), and then ask participants to apply their metrics to these pages and tell us how far apart the accessible pages score from the non-accessible ones. Another option would be to use pages from initiatives such as the one promoted by the WAI, the "BAD: Before and After Demonstration", where, for educational purposes, the process of transforming a non-accessible page into an accessible one is shown.
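One simple way to report "how far apart" the two groups score is the fraction of correctly ordered pairs, sketched below with invented scores (this is equivalent to the area under the ROC curve):

    def separation(accessible_scores, inaccessible_scores):
        """Fraction of (accessible, inaccessible) pairs the metric ranks correctly."""
        pairs = [(a, b) for a in accessible_scores for b in inaccessible_scores]
        return sum(1 for a, b in pairs if a > b) / len(pairs)

    accessible = [0.90, 0.85, 0.70, 0.95]    # scores for known-accessible pages
    inaccessible = [0.40, 0.60, 0.30, 0.75]  # scores for fault-injected pages
    print(f"pairwise separation: {separation(accessible, inaccessible):.2f}")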
Credibility issues
Accessibility scores are a great device for grasping the accessibility level of web pages. However, metrics can turn out to be a double-edged sword: while they enhance comprehension, they can also hide relevant information and details about the accessibility of a page. This side effect can lead end users to choose the most lenient scores among the available metrics. As a result, there is a risk of hindering the credibility of, and trust in, accessibility metrics.
The fact that different evaluation tools yield different results directly affects metric validity and, in particular, metric reliability. The poor reproducibility of evaluation reports and accessibility scores has a side effect on the perception of individuals, in that the web accessibility assessment process can come to be regarded as having low credibility.
User-tailored metrics
There is a challenge in the personalization of metrics, as not all success criteria impact all users in the same way. While some have tried to group guidelines according to their impact on determined user groups, user needs can be so specific that the effect of a given barrier is more closely related to the individual's abilities and cannot be inferred from user disability group membership. Individual needs may deviate considerably from group guidelines (e.g., a motor-impaired individual having more residual physical abilities than the group guidelines foresee). There are some research actions that could be taken to improve user-tailored metrics:
Users' interaction context could be considered in metrics, encompassing the Assistive Technology (AT) they are using, the specific browser, plug-ins and operating system platform. In this regard, capturing and encapsulating the user's context data in a profile would be a priority.
Quantifying guideline relevance: in order to tailor evaluation and measurement to the particular needs of users, accessibility barriers or checkpoint violations should be weighted according to the impact they have on a determined user group or individual (see the sketch after this list).
Reasoning over guidelines. This way, variables that metrics normally require (priorities, number of applied guidelines) can be easily extracted and automatically inferred from violated SC.
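A minimal sketch of the weighting idea follows: the same violations yield different scores for different user profiles. The profiles and weights are invented placeholders, not empirically derived impact values:

    PROFILE_WEIGHTS = {
        "screen-reader user": {"missing-alt": 1.0, "low-contrast": 0.0,
                               "small-targets": 0.2},
        "low-vision user":    {"missing-alt": 0.3, "low-contrast": 1.0,
                               "small-targets": 0.4},
    }

    def tailored_score(violations, profile):
        """Weighted violation count: higher means worse for this profile."""
        weights = PROFILE_WEIGHTS[profile]
        return sum(count * weights.get(barrier, 0.0)
                   for barrier, count in violations.items())

    found = {"missing-alt": 4, "low-contrast": 2, "small-targets": 1}
    for profile in PROFILE_WEIGHTS:
        print(profile, "->", tailored_score(found, profile))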
Dealing with dynamic content
Measuring something that changes over time can give different results depending on the magnitude of the changes. Web pages these days are no exception, as dynamic content causes updates in web documents. Web pages are alive and no longer inert, as these changes are not always a reaction to user interaction but also to other factors such as time or location. Especially in Rich Internet Applications, these updates are frequently provoked by scripting techniques that mutate web content. Therefore, the mark-up gives few hints for predicting the behaviour of a web document. Normally, the most appropriate way to assess the current instance of a dynamic web document is to retrieve and test its DOM; its subsequent mutations should then be monitored and tested. As expected, different instances of a document caused by updates show inconsistent accessibility evaluation results [NFernandes11b]. As a result, if a metric is sensitive enough, it should be able to reflect these updates.
This area calls for research on the frequency of testing: should pages be tested every time they update, or should they be retrieved at sampling intervals? Additionally, there are other questions: what would the accessibility score of a determined URL be if page updates entail changes in accessibility? Should an average over all instances be accumulated?
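The sketch below illustrates the sampling option: a page is re-scored at fixed intervals and the scores of its successive instances are aggregated. Fetching and scoring are stubbed out with placeholders (here the metric returns random values purely so the example runs):

    import random
    import statistics
    import time

    def fetch_dom(url):
        """Placeholder: would retrieve the page's current DOM, e.g. via a browser."""
        return "<html>...</html>"

    def score(dom):
        """Placeholder metric: random here, a real accessibility metric in practice."""
        return random.uniform(0.4, 0.9)

    def sampled_scores(url, interval_s, samples):
        results = []
        for _ in range(samples):
            results.append(score(fetch_dom(url)))
            time.sleep(interval_s)
        return results

    # One possible answer to "what is the score of a URL that changes?":
    # the mean and spread over sampled instances.
    scores = sampled_scores("http://example.org", interval_s=0, samples=10)
    print(f"mean {statistics.mean(scores):.2f}, spread {statistics.stdev(scores):.2f}")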
Conformance to WAI-ARIA and the accessibility elements subsumed by HTML5 could also be explored by future accessibility metrics.
Conclusions
This research note introduces web accessibility metrics: they have been defined and specified, the benefits of using them have been highlighted, and some possible application scenarios have been described. Spurred by the growing number of different metrics being released, we present a framework that encompasses the qualities that a good metric should have. As a result, metrics can be benchmarked according to their validity, reliability, sensitivity, adequacy and complexity. We believe this framework can help individuals make decisions on the adoption of existing metrics according to the qualities required of them. In this way, there will be no need to reinvent the wheel and design new metrics blindly if available metrics already fit one's needs.
A symposium was held in order to examine how metrics address the above-mentioned qualities and to keep track of current efforts targeting quality issues of accessibility metrics. The webinar provided a partial, but concrete, snapshot of most of the research activity around this topic. We found that tool reliability is a recurrent topic in this regard, whereas there is still a long way to go in the realm of methods and examples for metric validity, which are rare. The editors of this research note believe that more effort should be directed at investigating the validity and reliability of metrics. Employing metrics whose validity and reliability are at stake is a very risky practice that should be avoided. We therefore claim that accessibility metrics should be used and designed responsibly.
One way to hide the inherent complexity of metrics is to provide tools that facilitate their application in an automatic or semi-automatic way. This need for automation comes from the necessity of assessing large volumes of data and websites; that is why large-scale analysis of accessibility calls for metrics that can easily be deployed and implemented. Some other efforts target specific quality aspects of the Web, such as lexical quality or compliance with DTDs. Finally, an emerging trend aims at measuring accessibility not solely in terms of compliance. Since contextual factors play an important role in the user experience, accessibility measurement should be able to consider these factors, by collecting and including them in the measurement process or by observing the behaviour and performance of real users in real settings, à la usability testing. This perspective can be understood as an approach complementary to current accessibility measurement practice.
Based on the needs and gaps that hinder current accessibility measurement, we propose a number of research avenues that can help to boost the acceptance and quality of accessibility metrics. Above all, quality issues of metric validity and reliability need urgent action, but there are also other actions that can help to make metrics more credible and widespread. A common corpus for metrics benchmarking would be a good step in this direction, as it could potentially tackle quality and credibility issues at the same time. Dynamic content and user-tailoring aspects can open new research paths that can have a strong impact on the quality of assessment practices, methodologies and tools.
Acknowledgements
Some excerpts of this document are extracted from an initial brainstorming document at www.w3.org/WAI/RD/wiki/Benchmarking_Web_Accessibility_Metrics, which a number of members of the RDWG helped to populate. We are therefore grateful to Shadi Abou-Zahra, Mario Batusic, Simon Harper, Shawn Lawton Henry, Rui Lopes, Máté Pataki, Peter Thiessen and Yeliz Yesilada.
References
[AIR] Accessibility Internet Rally (AIR).
[Battistelli11a] M. Battistelli, S. Mirri, L.A. Muratori, P. Salomoni (2011) Measuring accessibility barriers on large scale sets of pages. W3C-RDWG Symposium on Website Accessibility Metrics, paper 2.
[Battistelli11b] M. Battistelli, S. Mirri, L.A. Muratori, P. Salomoni (2011) A metrics to make different DTDs documents evaluations comparable. W3C-RDWG Symposium on Website Accessibility Metrics, paper 4.
[Brajnik04] G. Brajnik (2004) Comparing accessibility evaluation tools: a method for tool effectiveness. Universal Access in the Information Society 3(3-4), 252-263. DOI: 10.1007/s10209-004-0105-y
[Brajnik07] G. Brajnik, R. Lomuscio (2007) SAMBA: a semi-automatic method for measuring barriers of accessibility. ASSETS 2007, 43-50. DOI: 10.1145/1296843.1296853
[Brajnik08] G. Brajnik (2008) Beyond Conformance: The Role of Accessibility Evaluation Methods. WISE Workshops 2008, 63-80. DOI: 10.1007/978-3-540-85200-1_9
[Brajnik11] G. Brajnik (2011) The troubled path of accessibility engineering: an overview of traps to avoid and hurdles to overcome. ACM SIGACCESS Accessibility and Computing Newsletter, Issue 100, June 2011.
[Fischer] D. Fischer, T. Wyatt (2011) The case for a WCAG-based evaluation scheme with a graded rating scale. W3C-RDWG Symposium on Website Accessibility Metrics, paper 7.
[Hornbæk] K. Hornbæk, E. Frøkjær (2008) A study of the evaluator effect in usability testing. Human-Computer Interaction 23(3), 251-277. DOI: 10.1080/07370020802278205
[JFernandes] J. Fernandes, C. Benavidez (2011) A zero in eChecker equals a 10 in eXaminator: a comparison between two metrics by their scores. W3C-RDWG Symposium on Website Accessibility Metrics, paper 8.
[Lopes] R. Lopes, D. Gomes, L. Carriço (2010) Web not for all: a large scale study of web accessibility. W4A 2010, article 10. DOI: 10.1145/1805986.1806001
[Naftali] M. Naftali, O. Clúa (2011) Integration of Web Accessibility Metrics into a Semi-Automatic evaluation process. W3C-RDWG Symposium on Website Accessibility Metrics, paper 1.
[NFernandes11a] N. Fernandes, R. Lopes, L. Carriço (2011) A Template-aware Web Accessibility metric. W3C-RDWG Symposium on Website Accessibility Metrics, paper 3.
[NFernandes11b] N. Fernandes, R. Lopes, L. Carriço (2011) On web accessibility evaluation environments. W4A 2011, article 4. DOI: 10.1145/1969289.1969295
[Nietzio] A. Nietzio, M. Eibegger, M. Goodwin, M. Snaprud (2011) Towards a score function for WCAG 2.0 benchmarking. W3C-RDWG Symposium on Website Accessibility Metrics, paper 11.
[Petrie] H. Petrie, O. Kheir (2007) Relationship between accessibility and usability of web sites. CHI 2007, 397-406. DOI: 10.1145/1240624.1240688
[readability] Readability test.
[Rello] L. Rello, R. Baeza-Yates (2011) Lexical Quality as a Measure for Textual Web Accessibility. W3C-RDWG Symposium on Website Accessibility Metrics, paper 5.
[Sloan] D. Sloan, B. Kelly (2011) Web Accessibility Metrics For A Post Digital World. W3C-RDWG Symposium on Website Accessibility Metrics, paper 10.
[Vigo07] M. Vigo, M. Arrue, G. Brajnik, R. Lomuscio, J. Abascal (2007) Quantitative metrics for measuring web accessibility. W4A 2007, 99-107. DOI: 10.1145/1243441.1243465
[Vigo11a] M. Vigo, G. Brajnik (2011) Automatic web accessibility metrics: where we are and where we can go. Interacting with Computers 23(2), 137-155. DOI: 10.1016/j.intcom.2011.01.001
[Vigo11b] M. Vigo, J. Abascal, A. Aizpurua, M. Arrue (2011) Attaining Metric Validity and Reliability with the Web Accessibility Quantitative Metric. W3C-RDWG Symposium on Website Accessibility Metrics, paper 6.
[Vigo11c] M. Vigo (2011) Context-Tailored Web Accessibility Metrics. W3C-RDWG Symposium on Website Accessibility Metrics, paper 9.