>> Good morning, everybody.
Thanks for joining us today for the NCI CBIIT Speaker Series.
I'm Tony Kerlavage, the chief
of the cancer informatics branch here at CBIIT.
I want to remind everybody that today's presentation is being recorded
and will be available on the CBIIT website at cbiit.nci.nih.gov,
and you can find information about future speakers on that site,
and also by following us on Twitter, and our Twitter handle is @nci_ncip.
Today I'm very happy to welcome Dr. Vivek Navale,
who is from NIH's Center for Information Technology,
and the title of this presentation today is Intelligent Biomedical Archives:
A Conceptual Architecture for Big Data Science.
With that, I will turn the floor over to Dr. Navale.
>> Thank you, Tony.
Good morning to all, and I will get going on it.
So my talk outline is motivated by a few questions that I had when I started on this topic of intelligent biomedical archives, especially in relation to big data, and so, as you can see, there are many drivers for big data today, both in the scientific and the business arenas.
I'll discuss that a little bit, and then I will discuss the Open Archival Information System model. This model originates from the space science community, from when they were dealing with large-scale big data, so it's [inaudible] standards; we'll discuss that, followed by the reference architecture that I've been working on with the National Institute of Standards and Technology and how that can be useful, and then I'll tie it to what intelligent archives really means, and then, of course, the [inaudible], can we develop intelligent biomedical archives and how,
and perhaps by the end of the presentation, you'll have more questions,
and I'll be back at the drawing board, redoing the presentation.
So with that, so here are some, you know, examples of the big data drivers,
which most of you are all familiar with,
but just to show you that there can be different types of data,
whether they are scientific, or business, or other types.
You can see that the data types can be broadly categorized as structured, text, audio, image, and video, and also unstructured data, which is implied within some of these, and their volumes.
So as you can see in this slide, when we talk about big data, you have heard the term the four Vs, or three Vs with a fourth added. For the first V, volume, the scale of big data can be clearly seen here: in the middle, towards the late 2000s, there was a significant increase in the volume of data, which used to be in the terabyte or sub-terabyte range, and now, by 2020, it will be far beyond that.
So these are some things which just about everybody has to contend with,
depending on what discipline they are and what mission they have.
Also, on the right -- on the other side -- you will see that it's not just the volume part and not just the velocity, but really the variety of the data and the complexity involved in processing it. For example, you can see that as you go from the bottom up, all the way to video, the expressiveness, the sophistication of the analysis you require, and the computational need all increase, which we know is demanding.
Not shown here but very important is genomic data, which generally falls in the unstructured data category, and which is also shifting this whole volume, as well as the sophistication of analysis required, towards the high end, into the [inaudible] range and beyond. So these are some facts which, for those of you familiar with [inaudible], might be useful to see, and the other aspect is that the source could be just about anything: healthcare could be one driver, weather imagery, and other satellite data.
[ Inaudible ]
Sure. So those lines that you have, it's just two,
and you have real models that are based --
they're wondering about the -- you know --
>> Yeah.
>> But you're saying that they would be kind of plateauing,
and if it's real [inaudible] or it's just --
>> These are best projections. The source was IBM; it was not data that I have created. This is based on what the data is and how the growth is collectively, so by no means does it try to give a precise figure, but it certainly gives you the scale, so use the figures more for the relative differences between the scales rather than for any precise mathematical or computational endeavor.
So with that, the next step is, of course, just to give you one example, and there can be many. These data are a lot more quantifiable because the source is NASA, who know their holdings very well; they have a pretty good idea, and they have been able to calculate and project it.
So what this shows is that for climate data alone, you will see that by 2030, you will easily be in the 350 petabyte range,
and also shown in this very graph, in these blue and yellow lines, is that it's not just the original data that you create, but also the model data, the modeling output that you reuse, and that, also by 2030, as you can see towards the end, will be as much as or even more than the original data. So you will have drivers that are not only the original data but also the reuse of data, and the same could hold good in the biological sciences, in the medical field, and in other fields, such as social services.
Just one example to illustrate that span.
So the question that comes up is, okay, we have this huge amount of data coming from all over, so how do we ensure long-term access? What kind of approaches, or what kind of methods or models, do we need to at least keep in mind? This problem of big data, as most of you know, the space science community experienced in the mid-1990s to early 2000s, because they were putting up more and more satellites and observing systems and collecting more. So, as a result, through the Consultative Committee for Space Data Systems, an international effort started somewhere in the mid-1990s with the space science community to come up with some kind of reference model, some way that could help people figure out how these data sets that are produced -- especially satellite data, which you cannot reproduce that easily -- can be made available for access over a longer period of time.
So that's just the background and history, but the community -- I'm sorry. The community was space science, but it was open to all other communities as well, and lots of models were developed. Here, open doesn't mean unrestricted. It just means that the model was developed with consensus, by way of engaging the community.
An open archival information system refers to a model that tries to figure out how the data that is being collected is going to be preserved, so the entities that I have shown you -- preservation storage, which we were discussing, archival storage, and data management -- all of these could be addressed by such a model, and there is now a lot written about this model. It's an [inaudible] standard, and there's [inaudible] book, and I will not be going into a lot of detail here, but let me highlight: it comprises two parts. One is the functional part, and the other is the informational part.
So let me show you what the functional part looks like. Basically, the functional part relates producers to consumers, and the stewards, the people who manage the data, are integral to this functional part. What does it look like?
Here is a schematic.
As you can see, the idea, or the concept, of this model is based on what we call information packages. In this diagram, you can see the SIP, the AIP, and the DIP. The SIP, the submission information package, relates to the data as it is being submitted; the AIP, the archival information package, at the other end, is where you keep and store the data as well as add value to it; and the DIP, the dissemination information package, reflects the data in terms of use. So this model focuses on the idea of having information packages, whether with the producer who generates the data, with whoever keeps it, or with whoever accesses it.
But in order to do that, if you have the information package idea that the model has shown here, what are the important functions and services you require? The big boxes inside reflect ingest, data management, archival storage, preservation planning, administration, and access, so these six entities were agreed upon as really critical parts of this model, providing the functions and services for keeping data in a way that can be used by a community.
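To make the information-package idea concrete, here is a minimal Python sketch of a SIP being ingested into an AIP and later disseminated as a DIP. This is my own illustration, not something shown in the talk; the class and function names are simplifying assumptions, not part of the OAIS standard itself.

```python
# A minimal sketch of the OAIS information-package flow: a SIP arrives from a
# producer, Ingest turns it into an AIP for Archival Storage, and Access later
# derives a DIP for a consumer.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class InformationPackage:
    content: bytes                                           # the data object being preserved
    metadata: Dict[str, str] = field(default_factory=dict)   # representation/descriptive info

class SIP(InformationPackage): pass    # Submission Information Package (from producer)
class AIP(InformationPackage): pass    # Archival Information Package (what is preserved)
class DIP(InformationPackage): pass    # Dissemination Information Package (to consumer)

def ingest(sip: SIP) -> AIP:
    """Ingest: accept the submission and add preservation metadata."""
    return AIP(content=sip.content, metadata=dict(sip.metadata, ingested="true"))

def access(aip: AIP, request: str) -> DIP:
    """Access: package the archived content for a consumer request."""
    return DIP(content=aip.content, metadata={**aip.metadata, "request": request})

# Producer -> archive -> consumer, in miniature:
sip = SIP(content=b"raw observations", metadata={"producer": "sensor-42"})
aip = ingest(sip)            # held in archival storage, managed over time
dip = access(aip, "subset")  # disseminated on request
print(dip.metadata)
```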
I also wanted to point out, when we have producers and consumers, that this OAIS model was developed for a scientific community, meaning it is true that you could use this model, but it does not imply that this one model will meet the needs of all communities. So it is important, for the information packages -- the submission information package, the AIP, as well as the DIP -- to keep in mind the context of the discipline in which it is being implemented.
So these are some aspects of the model.
So let me move on. This is the functional part; there is an information part to it as well. If those were all the functions and services, then from an operations aspect, from an information technology point of view, you would have the question of how it all relates. The model, OAIS or [inaudible], bridges the gap between the media layer, which we all relate to -- the base, the disks, and the infrastructure -- and the application layer on top, which is more for business, or for analysis, as well as display, which we use in our current world. It doesn't prescribe, but it explains this whole idea of data in streams and layers. It defines the transition from the infrastructure level all the way to where the data is meaningfully expressed or analyzed. It bridges the gap by presenting these ideas of layers.
So a layered model is what is shown. What is important to see here is that as you move up from the basic layer, where the binary digits, the ones and zeros, are being produced, more and more structure is added, and so you can see that the structure layer has a lot more value added to it in order to get an understanding or a representation, and then the object layer, which further relates to the application layer in terms of processes, relationships, content, and context. So the content of the data, the original data, and the context of what it all means really improve as you go upwards, from the media layer all the way to the application layer.
Also important, as I said, if you want to relate it to day-to-day systems: the tapes and disks are in the media layer; the string layer would be more like file paths or other such things; the structure layer would be more like, you know, the data types or records that we have; the object layer would be the libraries of data that you could have; and the application and analysis layer would be for the different applications.
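As a small illustration of this layering -- my own sketch, not a slide from the talk -- the snippet below walks one record from raw bytes up to an application-level interpretation; the field names and the threshold are made up for the example.

```python
# How structure and context are added as you move up the layers: raw bytes on
# media, a string, a structured record, a domain object, and finally analysis.
import json
from dataclasses import dataclass

raw_bytes = b'{"patient_id": "P001", "temp_c": 38.6}'   # media/stream layer: just bits
text = raw_bytes.decode("utf-8")                        # string layer: characters in a file
record = json.loads(text)                               # structure layer: typed fields

@dataclass
class Observation:                                      # object layer: a domain object
    patient_id: str
    temp_c: float

obs = Observation(**record)
# application layer: analysis that gives the data meaning in context
if obs.temp_c > 38.0:
    print(f"{obs.patient_id}: febrile")                 # interpretation, not just storage
```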
So again, the reason I spent some time on this is that I want to quickly show you applications of this, examples of OAIS, having spent some time on the concept behind it.
So let me show you.
So here is what NASA does.
NASA has utilized OAIS, and it doesn't mean that everything within OAIS has to be implemented, but you can see that NASA receives data, as I discussed, large-scale data from different sensors and satellites, and it goes through a progression of raw data, calibration, and different levels of data before they are able to produce meaningful data that can be used by the general public.
So you can see that the layered model I showed you just now is reflected here in terms of the different layers, going from the raw data all the way towards modeled outputs, and NASA also doesn't have just one data center, although I have represented it as [inaudible]. NASA has what are called Distributed Active Archive Centers, the DAACs. There are about eight of them, responsible for archiving and data management of all the satellite data coming from the satellites going around the world, and an important aspect is that all of them conform to the process that I showed you; they're also engaged with the scientific community, and the principal investigators were involved in processing as well as making these data more available.
So here is just another example from NASA. I'll quickly show you that the scale of the raw data coming from the satellites was large, and this was several years earlier, but as you move on, you can see that even though more volume of data is being added through processing, analysis, and information synthesis, the scales decrease, meaning that through calibration and transformation -- certainly accurate -- compared to the raw data they get in real time all the time, the scales decrease.
Another thing to note is that, compared to the raw data we receive, by the time we get to knowledge -- which means transforming information from the raw data all the way up -- the volume shrinks to only a few megabytes, meaning you may only need to know, as shown in this graph, the temperature, or the humidity, or the conditions in that region, but you have to go through a significant amount of processing to reach that point of extracting knowledge.
The same holds good if you relate this to a biomedical community that receives genomics data: it goes through the entire process -- raw data, FASTQ files, and if you go all the way forward, the [inaudible] -- and then you arrive at an understanding of which of those nucleotide variants may be related to the disease. So this whole process, as you can see, is not only time-intensive, but in order to extract that real knowledge, it requires not only all of the processing steps here but also management of the data.
So data management is a critical and important part, and added to this is the time dimension. If you recall the OAIS, if you want to keep any given large-scale data, whether satellite data or biomedical data, over time,
you would really have to figure out how you're going to preserve that data,
because constantly we are seeing migration of technologies,
constantly we are seeing the obsolescence of technology,
so that is a life cycle that is running forward,
and there is a data life cycle also running, and then sometimes,
they're not in sync, and that's the operational challenge
that most of us have dealt with.
So a quick operational example.
This is from NOAA, which most of you are familiar with.
They do deal with a lot of the data that's received from all the oceans all
over the world, even from the deep sea.
They also deal with satellite data.
They also deal with [inaudible] sensors,
which are all over the world, and what do they do?
How do they handle that kind of scale of the data, complex data variety?
So they do also implement OAIS, the reference model that I showed you, and here you can see they bridged the producer and the consumer aspects through this whole idea of the SIP, the AIP, and the DIP, which I discussed. Here, I just want to show you the operational point: they have had to keep the data -- and they have done that for at least the last several decades -- on these systems or in their file system, which is a large [inaudible]-based system. So this is an example of the operational aspect of how the reference model, OAIS, which might look abstract, can relate to a real-time operational system for managing data.
And then how do we provide access? You can see that data is really only good if the content of the data is made available, integrated, and [inaudible] explained. So metadata is commonly used, and there can be very many different types of metadata -- technical, descriptive, structural -- so there are many aspects of it, and they do capture that, either in their databases or as part of the ISO standard, and then, combining that with WAF, web-accessible folders, which they associate with their different portals -- for example, data.gov, or Google, or WIS -- they enable searches to take place via those to locate what the researchers are looking for.
So the access end, this part, is the DIP part, for the consumer to use these approaches, and for the data itself, of course, they have huge archives in different places, primarily now in the National Centers for Environmental Information, where they ensure that the data can be made available via different modes, like using FTP or SFTP, or different servers, and depending on the metadata and the data requested, they can provide downloads of those parts of the information. Hyrax is also a server; it uses what is called OPeNDAP. What it does is let you avoid downloading the entire data set: if you know exactly what you're looking for, you can extract just a small subset of that data. So they have many mechanisms there to provide different [inaudible].
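To illustrate the subsetting idea behind OPeNDAP and Hyrax -- an editorial sketch, not an example given in the talk -- the snippet below asks a server for only the slice of a dataset that is needed. It assumes xarray with a netCDF/OPeNDAP backend is installed, and the URL and variable/coordinate names are placeholders, not a real NOAA endpoint.

```python
# Request a small spatio-temporal subset from a (hypothetical) OPeNDAP server
# instead of downloading the whole dataset.
import xarray as xr

url = "https://example.noaa.gov/opendap/sst/monthly.nc"   # placeholder OPeNDAP URL
ds = xr.open_dataset(url)                                  # lazy: no bulk download yet

# Only the values selected here cross the wire.
subset = ds["sst"].sel(time="2020-06", lat=slice(20, 30), lon=slice(-90, -80))
print(subset.mean().values)
```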
Let me just move on.
This is more familiar to all of you, and the reason I wanted to show it is because the idea of information packages and the idea of the OAIS that I talked about are very much reflected here: the Genomic Data Commons, which most of you are familiar with, an NCI-related activity. You can see that the sources are very many systems, legacy systems, such as TCGA, TARGET, or ICGC.
From all of them, the information or the data has now been migrated to the GDC, and migration is just one step. This figure is from Bob Grossman, et al. You can see that there are so many things; that AIP is the archival information package that I referred to. There can be many, many activities. It's not just about description of the data itself: you can have standardization and harmonization of the data, shown in the GDC, as well as many other aspects of security and management implied, and towards the bottom of this GDC diagram, you can see that in order to make the accessible layer -- the DIPs on the right -- you have to have means and mechanisms, as shown in number five, enabling browsing, download, or analysis.
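As a concrete illustration of that access (DIP) side -- an editorial sketch, not part of the talk -- the snippet below queries the GDC's public REST API for a few open-access files. The endpoint, filter syntax, and field names follow the GDC API as I recall it, so treat them as assumptions and check the current documentation before relying on them.

```python
# Query the GDC files endpoint for a handful of open-access TCGA-LUAD files.
import json
import requests

filters = {
    "op": "and",
    "content": [
        {"op": "=", "content": {"field": "cases.project.project_id", "value": "TCGA-LUAD"}},
        {"op": "=", "content": {"field": "access", "value": "open"}},
    ],
}
params = {
    "filters": json.dumps(filters),
    "fields": "file_id,file_name,data_type,file_size",
    "size": "5",
}
resp = requests.get("https://api.gdc.cancer.gov/files", params=params, timeout=30)
for hit in resp.json()["data"]["hits"]:
    print(hit["file_name"], hit["data_type"], hit["file_size"])
```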
So I've given you three or four examples, and the purpose of giving them is to really emphasize that the OAIS model is content and technology agnostic, and it's also scale agnostic.
It does not get into any one of those,
but it gives you this idea of how a logical approach could be taken for managing
and preserving data over time.
So next, I want to move on to biomedical data challenges, which many of us are familiar with. You can see we have widely distributed, heterogeneous data and a broad, diverse community, and the community is not just researchers or clinicians. The community is not just patients either, but even people who don't have any issues at the moment and want to know about the state of their health, and so the community can range from someone who's very interested in some specific aspect of a nucleotide's presence, loss, or absence, to someone who just generally wants to know how different types of activities are affecting his or her own body over time. So that's a major challenge.
The other challenge, which you are very well aware of, is that the scales I showed you [inaudible] are enormous, and the human capacity to stay engaged by searching, querying, and so on will soon be outstripped, or already has been. So what we really end up with is what I call huge amounts of idle data, meaning data that is constantly getting accumulated for very good reasons but is really not active, or we are not getting meaning out of it fast enough to help us with real operational needs, or what we call, in the biomedical sense, patient care -- both the needs of individual people, you know, when I go to my physician, and collective studies, where large cohorts are being looked at, or clinical studies.
So if that knowledge is not coming in a faster time, then the work that is going on will not have the benefit of asking more precise questions and getting more and more varied answers, and that is the bottom line in terms of taking biomedical research to the care of the population, whether patients or general participants. So really, the transformation odyssey that I see, of going from data to information to knowledge, is going to be, or already is, the major challenge for now and onwards.
So let me switch gears a little bit here.
So you may ask this question.
Okay, we have all these challenges.
We have all these models.
So really, if you are an architect or you want to architect it,
how are we going to architect such a thing?
So in the next few slides, drawing on some of my work with the National Institute of Standards and Technology, I'm going to share some of my thinking. They are high-level conceptual slides, but really, this applies to any discipline you may think of.
So let me show you.
So for big data, as you can see, I showed the drivers of data, and I also talked about the OAIS reference model. You can see that the provider, or the producer, of the data and the consumers are the two ends as far as users go, but an equally important part is the framework: the framework that, from an operational point of view, enables the data-to-information-to-knowledge transformation, as well as its management, is a critical component. In the case of big data, if you ask the question, okay, what makes it different from what came before, it is the scalability issue; [inaudible] systems and other earlier things really had not had the scalability issues that have to be contended with now. They had other challenges to contend with at every level. At the infrastructure and platform level, as well as at the framework level, you require this scalability, both horizontal and vertical, and the scalability is not just hardware; it also applies to software and algorithms, so that's one major challenge, one of the things that differentiates big data. So any framework provider would have to address that.
Now, again, in showing you this, I am not implying one giant system or one entity. It's just a framework, and it could apply to any one of many systems as they are being developed, and the application layer, which resides on top of the framework, is the key layer, because, as in the layered model I showed you early on, it is the layer for a lot of the processes -- the collection, curation, visualization, and analytics, the access side -- and again, in big data, it is the analytics that really drives the transformation, the extraction from data to information to knowledge. So the value constantly being added to the data is driven by the sophistication and the combination of needs that drive this data analytics to provide other things, like visualization and analysis.
So what we did here, as in any architectural diagram, is we tried to simplify, to pick broad categories, and this is how we started. The system orchestrator and that green arrow show that it is the service that really is the driver, and it is always the consumer -- or, in the case of biomedical science, you know, the patient or an individual who wants to know his or her state of health. So it is the consumer which is driving; the request goes to a provider, and all of these processes, as well as the interactions, are then orchestrated by what we call the system orchestrator, which is not necessarily a system in the usual sense. We just wanted a term for something that coordinates, organizes, or engages all of this. So this is the first part.
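As an editorial aside, here is a very small sketch of how those roles fit together -- a consumer request driving a system orchestrator, which engages application activities running on a framework provider. The role names follow the NIST Big Data Reference Architecture as described here, but the classes, methods, and the "mean_temp" service are simplifying assumptions of mine, not a formal specification.

```python
# Toy model of the reference-architecture roles: data provider -> application
# provider (on a framework provider) -> consumer, coordinated by an orchestrator.
from typing import Callable, List

class FrameworkProvider:
    """Stands in for scalable storage/compute that activities run on."""
    def __init__(self):
        self.store: List[dict] = []
    def persist(self, record: dict): self.store.append(record)
    def scan(self) -> List[dict]: return list(self.store)

class ApplicationProvider:
    """Collection, curation, analytics, access: where value is added."""
    def __init__(self, fw: FrameworkProvider): self.fw = fw
    def collect(self, record: dict): self.fw.persist(record)
    def analyze(self, fn: Callable[[List[dict]], dict]) -> dict: return fn(self.fw.scan())

def system_orchestrator(request: str, app: ApplicationProvider) -> dict:
    """Coordinates which service runs in response to a consumer request."""
    if request == "mean_temp":
        return app.analyze(lambda rows: {"mean_temp": sum(r["temp"] for r in rows) / len(rows)})
    raise ValueError("unknown service request")

# Data provider pushes records in; the consumer's request pulls knowledge out.
app = ApplicationProvider(FrameworkProvider())
for t in (36.8, 38.1, 37.2):
    app.collect({"temp": t})
print(system_orchestrator("mean_temp", app))
```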
The second slide is --
>> Excuse me, Vivek.
Can you explain what horizontal and vertical mean in this context?
I presume that vertical means volume,
but perhaps that's not the right [inaudible].
>> Yeah. So horizontal and vertical -- thank you for that question. It's very important. In big data, one of the challenges is parallelization. The data load is so much that current systems, and the way we process -- where you collect and then try to do the analysis sequentially -- may not work, so you need to parallelize, and if you have parallelized, you would want horizontal scalability. I'm refraining from pointing to one specific technology, because there are many, but I'm just going to throw out a name, like [inaudible], and what it means is a distributed file system with many [inaudible]. So that is horizontal. Vertical would be computing speed: if you require a certain speed -- like any of our computers here versus a high-end machine; some years ago, IBM's Deep Blue used to be a big thing, but now there are many, many even faster computers. So that would be speed, processing speed, CPU.
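To make the distinction concrete -- my own toy example, not from the talk -- the snippet below does the same computation two ways: one worker (where only a faster CPU, i.e. vertical scaling, helps) and the work fanned out across several workers, with local processes standing in for cluster nodes (horizontal scaling).

```python
# Vertical vs. horizontal scaling in miniature.
from multiprocessing import Pool

def summarize(chunk):
    # stand-in for an expensive per-partition analysis step
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))

    # "Vertical": one worker; only a faster machine makes this finish sooner.
    single = summarize(data)

    # "Horizontal": partition the data and fan it out to several workers.
    chunks = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        distributed = sum(pool.map(summarize, chunks))

    assert single == distributed
    print(single)
```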
So the next slide is just this, but with the reference model mapped onto it, so it will be a little busy, but I'm sure you'll be able to see it. This is just the previous slide that I described, except that I took that whole description of the reference model and mapped it on. So let's look at what's most important here. What does a framework provider do -- we discussed all of the aspects of a framework provider. Preservation planning, data management, archival storage, as well as administration for providing the data for use, fall significantly on big data framework providers, so that would be the responsibility of the framework provider that you can see in this picture. You can also see that I mapped all of the earlier information management [inaudible]. They are all information packages moving between different states, and all of the submission, archival, and dissemination packages have been mapped to the application here. But in this view, what is actually most important is that the green arrow is what we discussed before: it drives what kind of service is needed, and the red arrow along with it engages what kind of tools, be it software or whatever other aspects you really require for it. The blue arrow, towards the consumer, is the data that you need. So this model here, this reference architecture, explains the flow of data in this framework, driven by the service needs as well as the tools that are associated with it.
Also important to see here: the top layer may give the impression that everything is done in real time and there is no such thing as what I call data accumulation. That isn't the case. In fact, real-time analysis may be significant, but large scales of data are not needed or acted upon right then, so with big data, you will operationally see more and more of this data being accumulated in the framework, be it your archival storage, or whatever repository you choose, or any infrastructure, where it will reside.
Another important thing to note is that this architecture is really based on the foundations of security and privacy. That is very, very critical. You cannot have any architecture without having that as a base, because we all know that the confidentiality as well as the privacy of data can vary significantly from discipline to discipline, especially in the case of the biomedical community, where I think the practical issues are a lot more stringent and harder than elsewhere -- at NOAA, for example, almost all of the data is almost completely open, versus in health science, where that is not that easy to achieve. So the challenges can vary, and then there is the management side of it, which we discussed; data management is also a critical part of long-term access to data. So these are a critical part of the fabric.
So what I really want to show is that the orange line shows that the information value increases as you move from the person, or system, or anything that produces the data, towards the consumer -- meaning the data itself, the radiology reports, or the EHR reports, or everything my doctor orders for me, is not worth so much by itself; what matters is what I as a patient or a person can really understand about what is happening to me. So, in the same light, it is at the consumer of the data where the value increases. That's one point, and from an IT point of view, if you look at the last graph, you can see it is true. I said that the framework provider is very important -- that's the backbone, which is critical, and you really require it -- but the value, in terms of technology or in terms of informatics, really comes in more as you move up from this base layer that I've shown you into the application layer, where analytics and interpretation of the data are what provide the most value to the consumer. So this whole reference architecture was to show you that.
So let me move on. Before I get into this, you may have the question: okay, I've shown you the architecture, and I've also said that with big data, what's going to happen, or is happening, is that a large part of the data will be idle, sitting there not acted upon. Well, I worked at the National Archives for a good period, and they were [inaudible] organizing their electronic records archive system, and we were facing a lot of -- I'm sure most of you can relate -- there were many manual processes, and we were trying to automate them, using business re-engineering in such a way that there would be more automation. So we developed this idea of virtual workspaces -- an idea, a concept.
Instead of everything being manual, paper-driven, faxes on top of faxes, we went to this idea of giving people workspaces, and so how do you go about it? The idea of workbenches then led to the whole development scheme that I showed you, but again, an important thing to note is that you always require some means of cataloguing the information or the data that you have. Without that -- which was not reflected in the previous slide -- whether it is for the information here, with [inaudible], or other biomedical activities, you really require those mechanisms to be able to provide access, as well as to have this whole OAIS information package approach really working.
So a quick story here: we applied the OAIS there, and we dealt with the scales that they were [inaudible]. Where we really had a major problem -- and I'll raise it just like that -- is that, very commonly, we were migrating different technologies to these online or virtual spaces, which underneath look much like the NOAA example I showed you, where all the disks are big. But what this showed was that the model was very static, meaning that unless somebody asked for something, the data was just sitting there. Nothing was happening. So that is still the case, and that's why I want to go forward.
So I'm having some glitch here.
I'm waiting for the slide to show up.
Like so then you go -- it's temporary.
So the critical part of that whole idea is that we need to move from a static archive to a dynamic archive, which would need to have intelligence as part of the whole process, as well as of the activity. Here, you can see why it is needed -- you might ask the question, which I tried to describe in the previous slide, of why it's needed. Really, we want to minimize the time between data generation and the analytics that you require, and the amount of data that sits unused, and then we want to improve the knowledge-building efficiency. Again, examples are the NASA example, as well as in the biomedical community: you can see that knowledge is being created, but we want to improve the efficiency, not just in one part of the work that's going on, but by being able to integrate different sources of data and come up with meaningful extraction that we are not able to do now, and then, of course, discovery and application -- basically, how do we foster innovation? That is also one of the reasons. So can we do this? These are some of the reasons why, and you can have many.
So I'm going to go forward.
Okay, you might ask: if we want to do that, then tell me some features of this intelligent archive. Some features you can see here: as I showed you with the architecture as well as the model, you would require intelligence at both the system and the service level, and that would allow you to transform data to information to knowledge. That's the transformation -- the "D" is data and the "K" is knowledge -- and that is where you really want to be. So what kinds of things could be useful to do that? Smart algorithms that can detect. Underneath are a few examples, and by no means do I imply that these few bullets address the all-encompassing needs of every discipline, but we certainly require intelligent data understanding methods, which would enrich data and metadata and then provide autonomous handling of holdings, and an important thing is to reduce the idle state.
So what does it look like? Let me spend some time here to describe the functional view of an intelligent archive. How would that be? Let's look at the data production systems: users are producing a lot of data, and then there could be cooperating systems. Just to give you an example, I showed you the GDC, where the sources are a lot of previous legacy systems, or all of the other systems, and then we have the genomic data [inaudible] as sort of a cooperating system. In this case, all the data migrated, but as the scale increases, that's not necessary. That's why I showed it in a bidirectional manner: this does not have to be a single system. It could be systems and interfaces, so that you can access those repositories, those systems that are already amassing large scales of data. Along with that, you would also require those tools we talked about -- if you recall, I showed you in the architecture -- and the science models, which are an integral part of producing and extracting knowledge. So these are some of the entities that could be part of either a cooperating system or an interface, and to relate it to genomics, the Genomic Data Commons would be one example, and there could be many others. That one is specific to cancer, but you could have one for many diseases, or something completely different, not necessarily disease-centric. You could have it for an individual's entirely different types of data. So it can vary.
This doesn't specify a disease, and it doesn't specify any of that, but the key aspect is that we really require automated processes and autonomous technology, which become the heart of an intelligent archive, and the intelligent systems I highlight here are smarter algorithms. What are those algorithms? Which of those algorithms will work, and where, is an area where significant research would be needed from the subject matter experts as well as the informatics and information technology specialists.
So here is a conceptual representation, but what I want to convey from this graph, from this display, is that for the data that is idle, or will become much more idle, what is important is finding ways, having means, and developing algorithms that would automate the processes and implement autonomous technology, meaning the data won't just sit there waiting for a human to act on it all the time. I'm not proposing a replacement of the entire set of activities [inaudible] by individuals. I'm just talking about identifying those processes, those activities, that can be carried out without constant human intervention or direction. Those approaches, in terms of developing smart algorithms, can be one way of really developing this intelligent biomedical archive, and I am showing the same knowledge building and operational data management. If we move in this direction, what we will see in an operational setting is faster, real-time, valuable insights coming from all of the data that resides in any given framework.
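As one small illustration of the kind of automation described here -- an editorial sketch under assumed catalog fields and actions, not a design from the talk -- a scheduled job could scan a holdings catalog, flag data that has sat idle too long, and queue an automated step (derive a product, move to a cheaper tier) instead of waiting for a human request.

```python
# Flag idle holdings and plan automated actions on them.
from datetime import datetime, timedelta

catalog = [
    {"dataset": "cohortA_fastq", "last_access": datetime(2018, 1, 10), "state": "raw"},
    {"dataset": "cohortA_vcf",   "last_access": datetime(2018, 5, 2),  "state": "derived"},
]

def plan_actions(catalog, now, idle_after=timedelta(days=90)):
    actions = []
    for item in catalog:
        if now - item["last_access"] > idle_after:
            if item["state"] == "raw":
                actions.append(("run_derivation_pipeline", item["dataset"]))
            else:
                actions.append(("move_to_cold_tier", item["dataset"]))
    return actions

for action, dataset in plan_actions(catalog, now=datetime(2018, 6, 1)):
    print(action, dataset)   # these would be handed to an execution service
```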
So in order to do that, we have to figure out how we can develop these smart, intelligent interfaces and cooperating systems that will engage with one another -- and again, by no means does this prescribe one entire system or one archive for everything. It's just a concept, and it could be applied depending on the scale of the problem.
Let me show you one scenario. Let's say you have biomedical multi-modal data -- you have omics data, radiology imaging, clinical data, pathology -- and you're a researcher, and you want a means by which the data is integrated together and provides you meaningful information. To show that, I have drawn a Venn diagram which shows the overlay of all of the different modalities or data types, and you would have to build, if you remember the previous slide, the intelligent systems as well as the interfaces, the cooperating systems. So the IBA, the intelligent biomedical archive, is the one that will engage with these different repositories, different sites, and integrated services. This is just one scenario, one example, for either a researcher, or a medical study, or an ongoing cohort study, and it's not necessary that you have an IBA, an intelligent biomedical archive, that has to integrate all of this. Your need could be different -- you could want to just integrate genomic data with imaging data. You could then apply this functional concept that I described, developing these interfaces where there is research going on in these areas of integration, in order to develop services that could be very useful on an operational basis.
You could also have a question related to imaging data. For example, just to give you an example, say you are doing cancer studies and you have a question: which regions of the tumor are undergoing active angiogenesis in response to hypoxia? There can be many questions you could ask, but take that one, and you have thousands of images, or many disparate sources, and you have some idea of what kind of [inaudible] condition you would impose. As you can see in this example, you would expect to find major regions where the blood vessel density is above some level, and the [inaudible] product, and if that is a viable defining condition, then you have an archive there with all of these images present, and you can then apply those algorithms, those smart techniques, that will help provide information for the specific question being asked. The same can apply to different fields as well.
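As a toy illustration of imposing a defining condition like that -- my own sketch, not an analysis from the talk -- the snippet below takes a per-tile vessel-density map derived from tumor images and flags tiles whose density exceeds a threshold as candidates; the array values and the threshold are made up, and real criteria would come from the study design.

```python
# Flag image tiles whose vessel density exceeds a defining threshold.
import numpy as np

vessel_density = np.array([          # density score per image tile (illustrative)
    [0.12, 0.55, 0.61],
    [0.08, 0.72, 0.40],
])
THRESHOLD = 0.5                      # assumed defining condition

candidate_tiles = np.argwhere(vessel_density > THRESHOLD)
for row, col in candidate_tiles:
    print(f"tile ({row}, {col}) flagged: density {vessel_density[row, col]:.2f}")
```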
So the most important thing that I want to emphasize is this idea of intelligent biomedical archives. Really, the needs driving their development are many more than just what I've listed, but critically important are patient care services, where you require precision medicine, improving personalized care, as well as supporting research.
So let's look at patient care services. Patient care, as I suggested: you could have real-time data coming from multiple centers when we are in a clinical setting, or when we go for periodic checks, so that data is coming in, and there are also the already-accumulated repositories. It would be useful to have the static data and the dynamic, real-time data integrated, either for a patient or if you're doing a time series of an individual over time. Those things are currently missing if we have repositories just growing and the data not being acted upon.
Precision medicine is another area: we have large cohorts and longitudinal studies, and having synthesis and integration services for the individuals in the cohort being studied over time is another place where this can be very useful. That is followed by personalized health care, which is closest to my heart, just because each time I go to the doctor, they look at my record -- earlier they were writing notes; now they're typing into a laptop or something like that -- and then the next time I go to them, they are typing again, and all of the information from before is either not looked over, or there is no way for me, as a patient, to have any insight into it, where I could search it or do some analysis. At the moment I see that I have access to some portals with my physicians or specialists, which show that different pieces exist there, but they don't give meaning, or give me the ability to spend my own time to figure out the state of my health, because a physician has only so much time, and it wouldn't hurt to have that.
So step one is to enable patient engagement, as well as intelligence behind the services that are being made available, and most important of all, different aspects of clinical research, as well as basic and translational research, could benefit if we have innovation hubs -- these intelligent biomedical service innovation hubs -- where you could have data from different domains being integrated, acted upon, and studied. So there is a lot of opportunity, and with that, I really want to emphasize the importance of having such a program.
The next slide is to leave you with a visual. As you can see from this graph, we really have two cycles, and usually it is these two: an information cycle, where data is being produced, and then the system cycle. There is some coupling today -- meaning, as throughout the rest of my talk, if you recall, systems are being used to produce the data, the data is being produced, and then, again, new systems are built. So this process is fine, but where it's lacking is the dynamic coupling between these two life cycles; those two life cycles are somewhat independent as well as [inaudible], so we require some kind of intelligent archive that will make the coupling strong, as well as turn this whole thing into what I call a learning system, so that instead of systems being just static entities, they could evolve over time based on the information that is being produced. That would be a fundamental improvement across the entire spectrum of system development as well as data management and, most importantly, a reduction in the costs of maintaining the systems. So the coupling interface is the most important aspect that comes out of this as a vision for the long term.
Also, as I referred to, there is a paradigm shift, even in health science. We need to move more towards what I call predictive, personalized activities, rather than always static or unclear approaches to doing a diagnosis.
In summary, I'd like to highlight a few points: the OAIS model and the big data reference architecture are both content and technology agnostic. You can apply them to any discipline. Data mining across modalities can enhance knowledge extraction. I showed you one example where you can get more if you have different repositories that interact.
One was the IBA, the intelligent biomedical archive.
And next, developing scenario-based intelligent data understanding algorithms can help us start or initiate this concept, and we can move toward the dynamic archive.
Finally, I'd like to acknowledge many people here: my colleague Ramapriyan at NASA, and folks at NOAA and NIST; my colleague Denis Von Kaeppler, who's here; Andrea Norris, Phil Bourne, and Leslie Biesecker, with many of whom I had many discussions; and Warren, for inviting me to present today, and Eve, for making it all happen.
So thank you.
Any questions?
>> Okay. Let's thank Dr. Navale.
[ Applause ]
We only have a couple of minutes left for questions.
Folks on [inaudible], if you have a question, use the raise hand feature in the WebEx dashboard or unmute your line, or, if you're in the room, just raise your hand.
I will make the comment first that this is a very timely presentation, since, you know, the NIH has announced upcoming funding opportunity announcements for a data commons, an NIH-wide data commons, and several of the ICs, including NCI and NHLBI and others, are making major investments in data commons, so all this background is very relevant to activities going on right now across the NIH.
I was curious about the OAIS and the big data reference models. Have the cloud service providers -- the commercial cloud providers -- actually implemented any of that architecture as part of their platform or software-as-a-service offerings, or is that something that is really up to the users to layer on top of what they already have in their offerings?
>> Great. It's a very good question. The quick answer is the latter: it is up to the discipline people, the subject area experts, the discipline specialists, who want to ensure that, for long-term access, these models and these architectures are applied. The cloud providers are more focused towards enabling -- if I may, in that whole layering of information, they are more focused on the information at [inaudible] some of the lower levels, to provide more services and options. They have not necessarily focused as much on -- although there are some papers -- they're not focused on a data-as-a-service model.
>> Okay. Right.
>> They are more [inaudible] on the software end, the services.
>> Right. Well, I do know that they've been placing more and more emphasis
on the data management aspect of it in terms of especially, you know,
when you're talking about preservation, and archives,
and being able to actually automatically move data through different layers
of access, depending upon how frequently they're used, for example,
and automating that process,
which seems to fit very nicely in the models you showed.
>> Right. So I want to qualify my earlier statement. There are some -- if not [inaudible], there are some technology providers for digital preservation who have, on the systems side, implemented some intelligence -- intelligence meaning tiering, like you just described -- and policy-based approaches that would allow for automated analysis. So there is some, but not necessarily from a discipline perspective or for end-to-end management of a data [inaudible] life cycle, no.
>> Great. Are there questions here in the room,
or anybody online have any questions?
>> No.
>> Okay. The other thing that really stuck
out to me was you mentioned the growth of idle data.
I mean I know we see that.
You know, in particular, we think about that in the context of data storage
like the GDC, for example, so there's a lot of that raw data that comes in
and goes through these steps of moving from data to knowledge,
and the question then becomes the value of, say, the raw BAM files
or FASTQ files versus the derived variants and other information that comes
from that. And I liked the concept you talked about, an intelligent archive that could further automate and speed up that process of managing more data -- in the GDC, what would be called harmonization -- of new data coming in.
>> Correct.
>> And it also links back to what we were just talking about,
that intelligent archive, as well as what data is of most use to the community.
It all has importance at some level, and there are always --
it's kind of like an 80/20 rule, or maybe it's a 90/10 rule in this case, that,
you know, there might be ten percent of researchers out there who really want to dive back into getting all of the raw data, versus having small slices of it
or looking at more of the tertiary data.
What is lacking today are very well-tuned processes for moving data
through that whole pipeline, and it has created a backlog.
As more and more people are contributing data,
it's creating a real backlog right now.
I don't know if there's a question in there.
Was it more of a comment?
I don't know if you have any thoughts about how we move this, you know, the biological analysis, through the sort of rigorous process that has been laid out in the NIST framework, for example, to get to that intelligent archive.
>> So I will quickly give two answers. One is, of course, that we have, as you are all familiar with, pipelines that carry out the different steps in bioinformatics processing, so one would have to address which areas are slowing things down and which areas are using certain -- when I use the word "algorithms" -- certain methods that are reasonably accepted in the community, and then accelerate that process. So one approach would be to address it that way. The other approach is policy-driven, which is also extremely important; some systems have it. So if FASTQ files are created, and after a time you do not need those files at all -- as an example, NASA is completely the opposite: they keep all of it, the raw data -- you could choose to let go of that data, but then it would be policy-based. So the intelligence comes in both at the system level, policy-based, as well as at the process level, and by one algorithm I did not mean that a single algorithm can do it all. You would have to have multiple algorithms of intelligence there, and what is the main key question [inaudible].
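To make that policy-based idea concrete -- an editorial sketch, not something prescribed in the talk -- a retention rule could mark raw FASTQ files as eligible for deletion only after a grace period and once derived products exist, or keep everything, NASA-style, if policy says so. The file states, period, and keep-everything flag are illustrative assumptions.

```python
# A toy retention policy for raw sequencing files.
from datetime import date, timedelta

KEEP_EVERYTHING = False                 # NASA-style policy: never discard raw data
RETENTION = timedelta(days=365)

def fastq_disposition(fastq_created: date, derived_exists: bool, today: date) -> str:
    if KEEP_EVERYTHING:
        return "retain"
    if derived_exists and today - fastq_created > RETENTION:
        return "eligible_for_deletion"
    return "retain"

print(fastq_disposition(date(2016, 3, 1), derived_exists=True, today=date(2018, 6, 1)))
```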
>> Great. Thanks.
I'm afraid we're out of time.
I just want to alert people that our next presentation will be June 21st, when Aviv Regev from the Broad Institute will be our speaker. Once again, thanks so much for joining today, and thanks to Vivek for his presentation.
>> Thank you all.
[ Applause ]
[ Background Conversations ]