Thomas Baker

Organization: Dublin Core Metadata Initiative
Organization type: Other
Country: United States of America
Tom Baker is CIO of the Dublin Core Metadata Initiative (DCMI), chairs the Library Linked Data Incubator Group of the World Wide Web Consortium (W3C), and recently chaired W3C's Semantic Web Deployment Working Group. In these roles, Tom has been instrumental in the standardization of two of the top vocabularies used for Linked Data -- Dublin Core and Simple Knowledge Organization System (SKOS). Tom is an advocate for the use of Linked Data technologies, especially for publishing library and public-sector information. He serves as a member of the Semantic Web Coordination Group, the advisory forum for W3C's Semantic Web activity.

This member participated in the following Forums

Forum: "Building the CIARD Framework for Data and Information Sharing", April 2011

Question 4: What actions should now be facilitated by the CIARD Task Forces?

Submitted by Thomas Baker on Sat, 04/16/2011 - 23:24

Congratulations to FAO for the exciting news about AGROVOC, VocBench, and Agrotagger!

As defined in the "five-star" approach, the fourth star is about making your resources "citable" by identifying them with URLs, and the fifth star -- the summit of the Linked Data mountain -- is about "linking your data to other people's data to provide context".

As I see it, linking your data to others' data is about embedding your data into a rich web of cross-references -- pathways by which people can discover your data.

Some of those pathways may connect your resources with other resources -- "this research report is the basis for that article", or "this news item summarizes that conference paper".  Other pathways connect people to resources -- "Hugo wrote this report" or "Sanjay recommends that blog".  Others connect resources to "topics", as in "this research report is about maize (http://aims.fao.org/aos/agrovoc/c_12332)".

Focusing on simple connections suggests a way forward:

1) Ask: what Resources, People, and Topics are important enough to be linked to or cited?  Then aim at providing guidance on how to give those things URLs.

2) Then ask: What are the most important ways to link those things?  One could perhaps boil this down to a few types of statements such as those listed above.  Then aim at providing guidance on publishing simple metadata to make those connections.  The guidance would describe how to extract basic information from existing data.

3) Then ask: How can we pull these links together and make them searchable?  Some of these goals are already implicit in the CIARD Pathways to Research Uptake (http://www.ciard.net/pathways); the suggestion here simply adds a tighter focus on harvesting and querying the linked data.
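The three steps above can be sketched in a few lines of Python.  This is only an illustration, not a CIARD recommendation: the example.org URIs are made up, the choice of Dublin Core properties (dcterms:source, dcterms:creator, dcterms:subject) is an assumption, and the match() helper is hypothetical.  Only the AGROVOC URI for maize comes from the text above.

```python
# Step 1: give Resources, People, and Topics URIs.
# The example.org URIs are hypothetical placeholders.
REPORT = "http://example.org/reports/42"
ARTICLE = "http://example.org/articles/7"
HUGO = "http://example.org/people/hugo"
MAIZE = "http://aims.fao.org/aos/agrovoc/c_12332"  # AGROVOC concept for maize

# Dublin Core terms namespace (assumed vocabulary for this sketch).
DCT = "http://purl.org/dc/terms/"

# Step 2: publish simple statements (subject, predicate, object).
triples = [
    (ARTICLE, DCT + "source", REPORT),   # the report is the basis for the article
    (REPORT, DCT + "creator", HUGO),     # Hugo wrote this report
    (REPORT, DCT + "subject", MAIZE),    # this report is about maize
]


def match(triples, s=None, p=None, o=None):
    """Step 3: return all triples matching a pattern; None is a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


about_maize = match(triples, p=DCT + "subject", o=MAIZE)
by_hugo = match(triples, p=DCT + "creator", o=HUGO)
print(about_maize)
print(by_hugo)
```

A real service would harvest such statements from many sources into a triple store and query them with SPARQL; the list and the wildcard match stand in for that here.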

A colleague of mine experienced in "selling" linked data approaches to organizations tells me that the single most convincing demonstration of the utility of the new approach is when people see their own data linked and discoverable in a new context.

Question 2: What are the prospects for interoperability in the future?

Submitted by Thomas Baker on Mon, 04/11/2011 - 21:03

I recognize, with Diane, that part of the problem has indeed been the use of technologies pushed by IT departments because they lie within their comfort zones, which typically means XML and SQL.  (It should however be added that not all data needs to be exposed as linked data, and that managing data in XML or SQL may in many circumstances be the most practical solution.)

That being the case, the question becomes: How can this or that SQL or XML database be tweaked to expose linked data -- perhaps only an extract of the full data, or perhaps on-the-fly?  Data can be managed in XML or SQL and exposed as RDF. If a given XML or SQL database was originally designed with linked data in mind, or if it happens to map cleanly to linked-data structures, such transformations will be that much easier to implement.

The VIVO project has a lot to say about this, as much of their data is extracted and converted from the wide range of databases and formats used on their campuses.  In today's world, the (growing) diversity of data formats is a given.  It is precisely because the linked data approach does not require data to be managed in a particular format that it stands a chance of succeeding.
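As a rough sketch of what "managed in SQL, exposed as RDF" can look like, the following Python reads rows from a small SQLite table and writes them out as N-Triples.  The table layout, the example.org base URI, and the mapping of columns to Dublin Core properties are all assumptions made for illustration; a production setup would more likely use a dedicated mapping tool.

```python
import sqlite3

BASE = "http://example.org/reports/"   # hypothetical base URI for records
DCT = "http://purl.org/dc/terms/"      # Dublin Core terms namespace


def escape_literal(value):
    """Escape a string for use as an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(conn):
    """Expose each row of the (hypothetical) reports table as two triples."""
    lines = []
    query = "SELECT id, title, topic_uri FROM reports ORDER BY id"
    for report_id, title, topic_uri in conn.execute(query):
        subject = f"<{BASE}{report_id}>"
        lines.append(f'{subject} <{DCT}title> "{escape_literal(title)}" .')
        lines.append(f"{subject} <{DCT}subject> <{topic_uri}> .")
    return lines


# The data stays in SQL; only an extract is exposed, on the fly.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports (id INTEGER PRIMARY KEY, title TEXT, topic_uri TEXT)"
)
conn.execute(
    "INSERT INTO reports VALUES (?, ?, ?)",
    (1, "Maize yields 2010", "http://aims.fao.org/aos/agrovoc/c_12332"),
)
ntriples = rows_to_ntriples(conn)
print("\n".join(ntriples))
```

The cleaner the table maps onto "one row, one resource; one column, one property", the shorter such a transformation becomes.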

Submitted by Thomas Baker on Sat, 04/09/2011 - 19:21

san_jay writes about the Interoperability Triangle:
> It is good to see that some of us are trying to bring the human factor in
> interoperability. ...
> But if I summarise from everything from this thread, doesn't everything
> comes to people, processes and technology?

kbheenick writes:
> I feel that the concept of 'interoperability' needs to be considered ,
> ranging all the way from people collaborating to systems collaborating,
> with concepts and information interoperability being somewhere in
> between. ...
> People successfully interoperating means that there has been...

> an agreed set of communication protocols...

I like Sanjay's notion of an Interoperability Triangle
of "People, Processes, and Technology", and I also like
Krishan's point that "processes" have to do with "concepts"
and "communication".

One might summarize this as a triangle of "People --
Communication -- Technology".

PEOPLE

I enthusiastically agree with the emerging emphasis in this
discussion on the "human factor" in interoperability.  VIVO is
an excellent example, as the emphasis since its beginnings
some five years ago has been on "connecting people" and
"creating a community" [1].

COMMUNICATION

What makes Linked Data technology different from traditional
IT approaches is that it is analogous to the most familiar
of all communication technologies -- human language.

RDF is the grammar for a language of data.  The words of
that language are URIs -- URIs for naming both the things
described and the concepts used to describe those things, from
verb-like "properties" to noun-like "classes" and "concepts".
The sentences of that grammar -- RDF triples -- mirror the
simple three-part grammar of subject, predicate, and object
common to all natural languages.  It is a language designed
by humans for processing by machines.
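
To make the grammar concrete, here is a minimal Python sketch of
one such "sentence".  The Triple class and the example.org URI
are hypothetical; the predicate is the Dublin Core "subject"
property, and the output follows the N-Triples line format.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF 'sentence': every word in it is a URI."""

    subject: str    # URI of the thing being described
    predicate: str  # URI of a verb-like property
    obj: str        # URI of the thing or concept referred to

    def to_ntriples(self) -> str:
        """Render the triple as a single N-Triples line."""
        return f"<{self.subject}> <{self.predicate}> <{self.obj}> ."


# "This report is about maize."
sentence = Triple(
    subject="http://example.org/reports/42",  # hypothetical report URI
    predicate="http://purl.org/dc/terms/subject",
    obj="http://aims.fao.org/aos/agrovoc/c_12332",
)
line = sentence.to_ntriples()
print(line)
```

For simplicity the sketch only handles URIs in the object
position; real RDF also allows literal values such as titles
and dates there.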

The language of Linked Data does not itself solve the
difficulties of human communication any more than the
prevalence of English guarantees world understanding.
However, it does support communication across a similarly
broad spectrum.

When used with "core" vocabularies such as the fifteen-element
Dublin Core, the result may be a "pidgin" for the sort
of rudimentary but serviceable communication that occurs
between speakers of different languages.  When used with
richer vocabularies, it supports the precision needed for
communication among specialists.  And just as English provides
a basis for second-language communication among non-native
speakers, RDF provides a common second language into which
local data formats can be translated and exposed.

TECHNOLOGY

Given the speed of technical change, it is inevitable that the
software applications and user interfaces we use today will
soon be superseded.  The Linked Data approach acknowledges this
by addressing the problem on a level above specific formats and
software solutions, expressing data in a generic form designed
for ease of translation into different formats.  It is an
approach designed to make data available for unanticipated uses
-- uses unanticipated both in the present and for the future.

[1] http://www.dlib.org/dlib/july07/devare/07devare.html

Question 1: What are we sharing and what needs to be shared?

Submitted by Thomas Baker on Sun, 04/10/2011 - 15:51

Asad, are you saying that data should be validated in the sense of "schema validation" -- i.e., making sure the data conforms to a format and constraints understood by particular software applications? 

Or do you mean "validation" to refer to an evaluation of the quality of information or to verification that the information comes from a reliable source (or even that it has been vetted by experts)?

Both senses of validation are significant but would require different approaches.

Submitted by Thomas Baker on Fri, 04/08/2011 - 01:47

kbheenick writes:
> Does that mean that we need to look at our information with
> new 'lenses' and label it with appropriate keywords so they
> can be 'found'. Does it mean that we have to repackage our
> information into different modular formats such that they
> can fit into the the larger information systems; or can the
> technology do all that for us?

I once worked at an economic research institute
which found that people in the region found jobs less by
reading classified ads or visiting employment offices than
through the advice of friends or relatives. 

A few years later, a class of mine at the Asian Institute of
Technology in Bangkok found that members of the AIT faculty
each tended to identify with a specialized sub-field consisting
of some 100 colleagues spread over the globe.  To remain
current, these faculty members relied less on generalized
literature searches than on recommendations and advice from
their international colleagues.

The general point is that as we design information systems
to serve different audiences, we should also consider that people
like to find things by asking other people or looking to them
for recommendations.  Assembling information into coherent
packages for particular target audiences is not just a question
of formats but of enabling people to discover information
through following links from people they know or trust.

Submitted by Thomas Baker on Fri, 04/08/2011 - 01:08

jimcory wrote:
> I know from working with CrisisCommons that there are
> structured tweets, email chains and skype chats that are
> important to capture for future reference. Forums are
> perhaps more formal ways of capturing discussions, but in
> some cases the immediacy of chat is necessary. Do we rely on
> the conversation participants to capture the info into more
> traditional forms (wikis, summary papers) or do we need to
> somehow tap into live discussions?  What does this entail
> when older chats/emails may be archived?

RDF and OWL are great, but much of the utility of Linked
Data derives simply from its use of URIs as globally citable
identifiers for making cross-references between things.

W3C working groups provide a fine example of how URIs,
generated automatically and routinely by the software
environment in which its teleconferences are held, make it
easy to link from live discussions to other types of resources.

Consider, for example, a mailing-list posting of 16 February
[1], which refers to an ACTION recorded in the teleconference
minutes of 10 February [2] -- minutes which were, in turn,
generated automatically from the chat channel log [3].

To me, this is related to what makes a good Tweet -- being
able to: 1) provide a comment, 2) refer to a person (e.g.,
@jenit), 3) give the comment a subject (#tpac), and 4) link
to a document in a compact form that is easy to scan, as in:

    @jenit Core vocabularies - FOAF, DC, SKOS etc - reduce
    need for invention, provide focus for tools #tpac
    http://bit.ly/c1mqxn

Note that this tweet is itself citable with a URI [4].

Tweets and triples use URIs to tie things together.  The trick
is to make it easy for people to make these connections,
for example by making URI generation into something that just
happens in the underlying software -- and to make it easy for
people to leverage those URIs effectively when they search
for things.
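
As a small illustration of software doing this work for people,
the following Python sketch pulls the person, subject, and link
out of the tweet quoted above.  The parse_tweet() function and
its regular expressions are a hypothetical simplification, not
Twitter's actual parsing rules.

```python
import re


def parse_tweet(text):
    """Extract the people, subjects, and links a tweet ties together."""
    flat = " ".join(text.split())  # undo line wrapping
    return {
        # (?<!\S) requires the marker to start a token, so a "#"
        # inside a URL is not mistaken for a hashtag.
        "mentions": re.findall(r"(?<!\S)@(\w+)", flat),
        "hashtags": re.findall(r"(?<!\S)#(\w+)", flat),
        "links": re.findall(r"https?://\S+", flat),
    }


tweet = """@jenit Core vocabularies - FOAF, DC, SKOS etc - reduce
need for invention, provide focus for tools #tpac
http://bit.ly/c1mqxn"""

parts = parse_tweet(tweet)
print(parts)
```

Each extracted piece could then be mapped to a URI and recorded
as a triple, without the author of the tweet doing anything
beyond typing "@", "#", or a link.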

Tom

[1] http://lists.w3.org/Archives/Public/public-xg-lld/2011Feb/0034.html
[2] http://www.w3.org/2005/Incubator/lld/minutes/2011/02/10-lld-minutes.htm…
[3] http://www.w3.org/2011/02/10-lld-irc#T16-02-40
[4] http://twitter.com/#!/tombaker/status/1270560629727232

Submitted by Thomas Baker on Tue, 04/05/2011 - 15:42

Scientists will be more motivated to share when the benefits of doing so can be demonstrated -- not just to themselves but to their employers or funders.  Search engines that target Linked Data, perhaps for a specific domain such as "agricultural research", will be able to follow incoming links to a scientist's work in order to generate statistics and analytics, as Twitter engines do for "trending topics".
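A minimal sketch of such statistics, assuming the search engine has already harvested a list of (source, target) links: counting how often each target is linked to gives a simple ranking.  The example.org URIs and the count_incoming() helper are hypothetical.

```python
from collections import Counter


def count_incoming(links):
    """Count how many harvested links point at each target URI."""
    return Counter(target for _source, target in links)


# Hypothetical harvested links: (linking resource, resource linked to).
links = [
    ("http://example.org/blog/1", "http://example.org/reports/42"),
    ("http://example.org/articles/7", "http://example.org/reports/42"),
    ("http://example.org/blog/2", "http://example.org/reports/99"),
]

ranking = count_incoming(links).most_common()
for uri, count in ranking:
    print(count, uri)
```

Grouping the same counts by author or by institution would give the kind of evidence that employers and funders can act on.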

Submitted by Thomas Baker on Tue, 04/05/2011 - 04:27

Valeria writes:
> So in my opinion any potentially useful piece of information is worth sharing,
> possibly as small information units with metadata, like single records (e.g.
> the name and specialization of an expert, or scientific data on a gene), single
> electronic resources (e.g. pictures or videos), even single semantic units
> automatically extracted from an article...

Taking this to two interesting extremes...:

-- At the one extreme, "opinions" or "annotations" about resources, when exposed as Linked Data, can become part of an extended description of that resource -- statements as simple as the Facebook-like "Johannes likes this article".  Such annotations can become part of what we refer to in the W3C Library Linked Data Incubator Group as the "infinitely expandable description".  URIs are the anchors that pull widely distributed references together into such an enriched description.

-- At the other extreme, one can express opinions or make annotations about anything, in principle, that has identity (i.e., a URI).  One of my favorite examples is genomic research, where specific gene sequences, identified with URIs, can become the object of statements such as "disease A may be related to genetic sequence B".

By way of introduction...  I have been involved with the development of Semantic Web standards since before the notion of "Semantic Web" was popularized circa 2000-2001 -- starting with Dublin Core in 1996 and more recently SKOS.  Last year I completed an autoevaluation of AGROVOC and helped plan its publication as Linked Data [1].  I currently co-chair the W3C Library Linked Data Incubator Group [2], which is looking at the potential benefits of (and implementation obstacles to) enriching library data with URIs and making it available for linking on the Web.

[1] http://3roundstones.com/led_book/led-baker-et-al.html
[2] http://w3.org/2005/Incubator/lld/
