Okay, so it is my great pleasure to introduce Chen Li,
whom I've known, of course, for many years.
He did his PhD at Stanford and is now a full professor at
UC Irvine, University of California Irvine.
And his area, like mine, has been around data management
systems, mostly query processing and text analytics.
And query processing means both execution and optimization.
He got the NSF CAREER award,
he was also our program co-chair here for VLDB 20-
>> 15. >> 15, yes.
And he also did a startup, so he's a full package.
So without much further introduction, Chen,
take it away.
And he's gonna talk about his research projects.
>> All right, Sergei, thanks for hosting my visit.
It has been a while for me.
Last time I visited here was more than maybe five years ago,
and it's good to see all the old friends,
also to see some new faces.
So the last few years, on sabbatical, I took an adventure
to try to do a startup, to commercialize some of my work.
It's a pretty eye-opening experience,
pretty interesting, and for people who have done
a startup before, like Adon, you know how it feels.
And I came back about two or three years ago,
and that experience taught me many things.
The one thing I learned was that, being in academia, or in
general in this research field, building systems is more exciting.
Even before I did a startup,
together with my colleague Mike Carey, we were building an open
source project called AsterixDB, which I'll talk about briefly.
And after I came back, I continued developing that
project and also building some other systems.
So in this talk, I want to use the time to give you an overview
of what I've been doing for the last few years.
And I want to keep it at a little bit of a high level.
Maybe in one or two places I will get technical, but
you're very welcome to ask about the technical details.
In addition, I'll talk about two systems, and
both systems have demos.
And roughly,
I want to spend about two-thirds of my time on the first system,
and then one-third of the time on the second system.
But we can obviously
talk offline about some of the issues.
So the first project is called Cloudberry.
And the motivation is about how to support big
queries in sub-seconds, okay?
And I know, I talked to some of the colleagues here;
I know there is a lot of work in this space.
And I'll tell you what we want to do in this space.
So at a high level,
Cloudberry is a general purpose middleware solution,
which can support interactive analytics and visualization.
And it supports different kinds of backends,
different databases.
Of course we have a bias toward supporting AsterixDB, but
it also supports other databases.
And we also support different kinds of front ends.
So even though the first demo is a kind of visualization,
I want to emphasize this part is not about visualization,
it's about supporting visualization.
For example, Tableau can be one of our front end users.
Okay, so I'm going to start the first part with a demo.
So the demo I'm gonna show here is, both demos are about social
media analysis, even though both solutions are general purpose.
So for this demo, we call it TwitterMap;
it's one application of Cloudberry.
I want to differentiate between Cloudberry and Twitter map,
because Cloudberry can support different kinds of applications.
For this Twitter map demo,
the backend has about this number of tweets,
close to 1 billion tweets, collected over 1 year and
10 months, starting from November 2015.
It is about, roughly, 1% of all the US tweets, okay.
Still, it's a small number, I know many people have seen
bigger scales, but it's a proof of concept.
And plus the backend is using parallel solutions, so
if you have more data,
we just need more hardware, the whole thing is scalable.
So the goal we want to achieve in this project is for this
amount of data, with textual, temporal and spatial conditions,
we wanna allow the user to be able to see the data from
different angles, by submitting different kinds of conditions.
Okay, so my example, currently,
we have this Hurricane Irma coming.
And here, we are lucky because we're in the Northwest,
[LAUGH] very far from Florida.
But let's say we want to see how the social media is talking
about hurricanes.
Okay, so we can just type in the keyword hurricane,
even though you can type in any keyword.
So the user types in this keyword, and
we want to see all the tweets; that is, an aggregation result:
the number of tweets mentioning hurricane per state.
And we show the map with the aggregation results.
We also show the histogram over different time periods.
So you can see some of sample tweets.
Here, currently, it is like 141.
And this one, 139. It's real, I didn't censor the tweets,
okay, so don't be offended by the words there.
It's very organic.
So we get some rough idea about the distribution.
And people may say, what if you look at the population of each
state, and do the division, do the normalization?
We do allow normalization,
to see, on average, how many tweets per person
are talking about the hurricane in each state.
You can still get the idea.
Still, you can see Texas and Florida, these are two states
that people are very concerned about this topic.
>> Virginia appears to be very concerned also.
>> [LAUGH] >> For whatever reason.
>> [LAUGH] >> Maybe [INAUDIBLE] everything.
>> It's interesting, you see that there are two peaks.
The second peak is not surprising.
There's a peak here; maybe for some reason last summer there
was some hurricane that was very troublesome.
So you can pick a range for the time dimension, and
then the results will be updated with those answers within that
time range.
>> And you can easily change your time window,
by changing the start time or ending time, or
you can even slide the window.
So as the user changes the conditions on the time
dimension, the system can responsively give the results
for this amount of data, okay?
And then let's say,
what if the user wants to go deeper into that data?
So let's say, in Texas, we wanna see within Texas, how the tweets
are geographically located, based on different counties.
So we zoom in, we show all the different regions,
different counties, and
then the color will indicate the density per county.
Okay, and you can go even further down to even the county
level and pick just one of them, Houston.
And then at this level,
now we show the city-level aggregate results.
So now, for this demo, we show three different levels, but
you can define your own hierarchy, and
we allow you to zoom in and zoom out in the space dimension,
and also in the text and time dimensions.
And the main thing is, for these about 1 billion records,
which is about 1 terabyte of data,
how we can do these kinds of queries within subseconds.
So this is what we want to achieve.
And on the backend, this is kind of a mini cluster we're using.
Now, we moved it into my office.
The backend is running a cluster of six NUC boxes.
Each one is about $800, with half a terabyte of SSD,
16 gigabytes of memory, and I think maybe two cores,
or four cores, I forget.
And we rely heavily on the disk, or
the flash drive, to make it scalable.
Since the whole infrastructure is parallel,
if you have more data just throw in more boxes, but
the total budget is less than $5,000, okay?
So within $5,000 we wanna deliver this kind of user experience.
So now let's see what's happening behind the scenes.
So this is what CloudBerry is capable of doing.
Architectural wise, CloudBerry is a middleware
that sits on top of your existing database.
And the one we saw just now, the backend, is using AsterixDB,
because we have a connector that talks to the database using
its native language.
Two languages, actually; we're bilingual:
AQL and SQL++.
And we also recently built one or
two more connectors:
one talking to MySQL, one talking to a [INAUDIBLE].
The requirement is that your backend, not necessarily a
database, should provide a query
language, either through a database [INAUDIBLE] or a RESTful API, so
that the middleware can talk to it by issuing queries.
We do make some assumptions.
One assumption is that the data, for now, is append-only:
once the data is inserted into your tables,
it cannot be modified anymore,
because some of the logic relies on that assumption.
And we know a lot of use cases satisfy this condition;
we have one more assumption I'll show you later.
And the Cloudberry middleware does a lot of optimization
techniques to deliver that kind of
user experience, which I'll explain later.
>> You allow join queries?
>> We allow join queries; we support all kinds of queries.
Some of [INAUDIBLE] >> [INAUDIBLE]
>> Okay, I'll talk about the
front-end library, the front-end API here; we use a RESTful API.
But the RESTful API carries a kind of SQL, in a JSON format.
Okay, I use this one here.
So for AsterixDB, I think I'll just talk about it briefly.
Mike and I have been building it, with
some other colleagues from Riverside and Couchbase.
We've been building this one for the last seven years.
At a high level, it combines techniques from
semistructured data management, parallel database systems,
and Hadoop and its friends into one system. It's open source, in
Java; essentially it's a parallel database with all the support:
storage, indexing, query processing, query optimization,
query language, the whole stack, okay?
It's a shared-nothing architecture;
we don't take the other approach, like Oracle is taking with
some of their products. We just assume people keep throwing in
commodity hardware easily to make it scalable.
These are some of the features AsterixDB has.
A semi-structured data model; we support B-trees,
R-trees, inverted indexes, all of them; that covers a lot of data types.
We use LSM storage for dynamic data,
similar to other systems like MongoDB.
And we have our own runtime engine called Hyracks.
We have our own query language.
Previously it was called AQL,
coming from XQuery.
Now, people don't like XQuery; people like SQL, so
we changed it to SQL++.
The plus-plus part supports the semistructured part, okay?
And we have been working with Couchbase very closely, and
the recent announcement from them says Couchbase decided to
use AsterixDB as the backend to support their analytics.
That's a one-minute overview of AsterixDB.
So let me focus on Cloudberry.
So let me first talk about this API here and
how the middleware talks to the front end.
I use a very simple example.
In an earlier demo we used hurricane, but
here uses Zika which is very similar.
So, from the front-end application perspective,
it talks to Cloudberry through a RESTful API, because we
know the web service architecture is more and more integratable.
Right, you can plug it into any place.
And plus, by doing this RESTful API, it can provide a
uniform interface for the front end on top of
this layer, which can talk to different databases.
Potentially this layer could even integrate data from multiple
databases, but
for now each instance only talks to a single database.
So this is the query.
In English, it's more like: I want to get
the number of tweets per state that mention the keyword Zika.
Okay?
And this is the SQL in a JSON format, okay,
but it uses a different language.
And we have many different ways to define
certain properties, like which dataset you are talking to,
and what kind of predicate you want to ask or
pose on that table.
We have the group-by: which attribute or
attributes you want to group by.
And when you do a group-by,
what kind of aggregation do you want to do?
Count, sum, min, max; they're all configurable.
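As a concrete illustration, a request in this style might look like the following minimal sketch; the dataset and field names are assumptions for illustration, not the exact Cloudberry API.

```python
# A hedged sketch of a Cloudberry-style JSON request; field names are
# illustrative assumptions, not the actual API schema.
import json

request = {
    "dataset": "twitter.ds_tweet",                  # which dataset to query
    "filter": [
        {"field": "text", "relation": "contains", "values": ["zika"]}
    ],
    "group": {
        "by": [{"field": "geo_tag.stateID", "as": "state"}],
        "aggregate": [{"field": "*", "apply": {"name": "count"}, "as": "count"}]
    },
}
print(json.dumps(request, indent=2))   # body POSTed to the RESTful endpoint
```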
>> [INAUDIBLE] >> User-defined aggregates,
so far no.
So we just do the standard aggregates.
But I wanna mention a few things the middleware
is doing to achieve that performance.
I mentioned some of them in the meetings this morning.
One thing is caching, not surprisingly.
For example, if a user types in hurricane,
the first time the user types in hurricane,
the engine would do the work: the middleware would
translate that query into a simple query to the backend, or
even a modified query, depending on the semantics.
And then the results of the hurricane query would be
stored as a view and materialized inside the database,
as one of the tables.
And from the database's perspective, it doesn't know what it is;
it's just one of the tables.
So it's up to the middleware to decide the meaning of this view
and how to maintain it.
This is the logic we have here.
And later we'll cover the case where we don't
have a view available there.
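Here is a minimal sketch of that first-time flow, assuming hypothetical table, column, and helper names (ftcontains stands in for whatever full-text predicate the backend offers):

```python
# First-time keyword query: materialize the matching records as a plain
# table ("view") in the backend. All names are illustrative assumptions.
def materialize_keyword_view(db, keyword):
    view_name = f"view_{keyword}"
    db.execute(
        f"CREATE TABLE {view_name} AS "
        f"SELECT * FROM tweets "
        f"WHERE ftcontains(text, '{keyword}')"   # assumed full-text predicate
    )
    return view_name
```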
And there's a second technique that we're doing there.
Now, at the high level,
this one looks a little bit overwhelming.
The main idea is, if the user has typed in a keyword, Zika or
hurricane, and the middleware has decided to put
the results into a view, into the database,
there are still a lot of issues we have to solve, such as: how
can we make the view consistent with the backend database?
Because our principle is, we want to support real-time query
processing, meaning when the user types in a query, we want
to give the user the latest results,
including data that came in maybe five minutes ago.
To do it, our current solution is that the developer,
the system admin, has to decide a frequency at which the
view and the base table should be synchronized.
So let's assume it's one hour; so
every hour the middleware will trigger logic to tell
the database to compute the latest tweets about hurricane,
the delta, and put the delta into the view.
Okay, this is done by the middleware.
The remaining issue is, even if the view is maintained and
synchronized with the base table every hour, what if, so
now, it is like 1:15, right?
The last time it was synchronized was at 1 o'clock.
There are still some records that came in during the last 15
minutes.
So we also want to include these latest records
that came in during the last 15 minutes in the results.
So what the middleware does is, if at this moment we
ask for hurricane tweets, it will first talk to the view, which
has the full set of hurricane tweets, to fetch all of them, which is
much smaller compared to the entire 1-billion-record table.
In addition, it will also talk to the base table
to get all the tweets that just came in during the last 15 minutes.
Then the middleware [INAUDIBLE] combination.
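A rough sketch of this view-plus-delta combination, with all names assumed for illustration:

```python
# Aggregate over the small materialized view, aggregate over only the
# records that arrived after the last synchronization, and merge the two.
def query_with_delta(db, view_name, keyword, last_sync_time):
    view_rows = db.execute(
        f"SELECT state, COUNT(*) AS cnt FROM {view_name} GROUP BY state")
    delta_rows = db.execute(
        f"SELECT state, COUNT(*) AS cnt FROM tweets "
        f"WHERE create_at > '{last_sync_time}' "
        f"AND ftcontains(text, '{keyword}') GROUP BY state")
    merged = {}
    for state, cnt in list(view_rows) + list(delta_rows):
        merged[state] = merged.get(state, 0) + cnt   # counts are additive
    return merged
```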
>> Yeah, so I have two questions.
>> Sure.
>> So, the first one is do [INAUDIBLE] storage agent for
the systems of the data,
the same to all the parts are value [INAUDIBLE].
>> This is the back end, this is the middleware.
The view is a table whose logic is maintained by the middleware.
From AsterixDB's perspective, it is just a table.
>> So, this way, AsterixDB is just storing the data-
>> Yes, just storing.
>> More like a storage engine?
>> Correct, correct.
>> The other thing is that you're merging the results
from the archived data and the new incoming data.
>> Right. >> Do you assume
the data is sort of append-[INAUDIBLE]?
>> Yes.
Yeah, that's why we made the assumption; we assume the data
that came in during the last 15 minutes does not modify earlier data.
>> Okay.
>> That's why we need this assumption.
Do you have a question [INAUDIBLE]?
>> I was just thinking that, for example,
in many cases, the delta from a streaming system,
the time for it to get into the cluster-
>> Into AsterixDB, yeah.
>> Right, and it's gonna be kind of too long.
If you have to pull from, let's say, I don't know, something
that's coming from an eventing system, Microsoft has stuff, for
example, any of this.
So, with an eventing system, then how quickly, that's one. And
the main question related to that is, if your query
is a dynamic query, not a registered standing query,
then many of the things you're talking about, the views,
are not really actionable.
So, how much flexibility is there on the fly?
You have a rich API, but that's different from saying,
can I write an ad hoc query just in time and
actually search history.
[INAUDIBLE] >> Yeah, for
the first one, about the [INAUDIBLE]:
in this architecture, we treat it as the problem of the database.
>> I see.
>> The data comes in through the back door.
As long as the data is there, in terms of architecture,
Cloudberry doesn't care about how it got there.
As long as it is visible through a query, it will get it.
Right, so it's the database's business to make sure that
if you have fast data that comes into the system,
it is visible to queries.
The transactional visibility should be guaranteed.
But Cloudberry doesn't care about the mechanism.
>> Right, I'm just thinking, for many of the newly coming event
scenarios,
how realistic is that assumption, I mean.
>> Yes, it's because when we developed this software,
we treated your database as a black box.
>> With a rich query interface.
>> Yeah, it's basically SQL, standard operations, for now.
This is what we need.
And for the second issue, what
you really mean is kind of a continuous query?
If a user subscribes a
query to the system, whether the system can periodically,
or continuously, give the user updates for that query.
Is that what you mean?
>> [CROSSTALK] To answer that question: you have
never seen this work before, right?
>> Right.
>> So, there's a distinction between a query
template versus the parameters of the query.
I'm talking about the query itself, the template itself,
not the parameters of the query.
>> So far, in the kernel, we have done the view materialization;
we assume each view has a keyword for now.
So we have a hurricane view, we could have an Irma view,
we could have a Zika view.
And then- >> Okay, do they have to be
known a priori?
>> No, no, no.
>> On the fly?
>> It's all on the fly.
That's the difference.
I think, in some of the earlier work,
people would have to do some kind of offline work, like
collecting histograms and similar information.
The approach we follow is, we start from scratch.
The views are created and maintained online.
We don't do offline datacubes.
We don't do offline workload analysis to
see which of the views should be materialized.
And one more thing that's very real for our case: at least in
this case, text is very common,
and the user can type in arbitrary keywords.
It's very hard to decide in advance which keywords to precompute.
So, we do everything on the fly, okay;
there's no offline preprocessing step.
>> Do you always maintain the view inside the same database
that the original data is in, or do you also provide for
being able to do that in a separate system?
>> Yeah, the question is really about whether we always put
the views into the same database.
Currently, we do.
We treat your backend database like a storage layer,
and when I put some results there, you store them.
In principle, we could have a separate database just to
store views.
That's doable, but we didn't do it.
Okay, other questions?
>> So, I'm just curious whether the keyword
search is exact keyword search, or more than that?
>> So far we haven't done anything advanced, like fuzzy matching
or expansion; [INAUDIBLE] those are things we have not done yet.
This is only kind of the first step.
I [INAUDIBLE] who is working on this [INAUDIBLE].
We know how to do it, it's just a matter of time.
The next thing I want to talk about is,
back to your question, right:
what if the views are not available?
The first time a user gives the query, hurricane, there's
nothing inside that database, in the views; how can we do it?
So, people here are experts on this topic.
When we first developed the prototype one year ago,
the first pain point
was the user experience, which was really troublesome.
Because whenever I gave a demo,
I always tried to show [LAUGH] a keyword I had used before.
But in the audience,
people would give me an arbitrary keyword,
and this is what I usually experienced.
Which is not nice.
So, what we decided to do is this approach,
which I'll show you now.
I wanna go a little bit deeper into this topic,
cuz I talked to some people in this audience.
The reason we take this approach is
we want to treat the backend database as a black box.
We do not want to make any change to the database.
We just assume you have a standard API,
maybe a SQL API, and I want to develop a solution
that does not have any prior knowledge about your histograms,
and can still deliver this kind of user experience.
Let me show you the example here.
So, let me show you the solution.
What we do is very simple.
It's one more assumption we make.
We assume that in your data, or table to be more specific, you can
pick one attribute using which you can slice your query.
This is the assumption we make.
So in this example, we
pick the time of the tweet as the dimension.
As I mentioned in the meetings this morning,
we have tweets for about two years, and if I do a query for
all the tweets, it takes a long time to finish.
Instead, I can first give you the results for
one week, then the second week, the third week.
Hopefully, by dividing this big query into these small queries,
which we call mini-queries, each of the mini-queries is much
more responsive.
At the same time, we want to progressively keep it updated.
One very simple solution is the following, and
that is, I do this week by week; so this is the behavior.
In our first implementation,
we used a fixed interval slice, like one week or one month.
I do one query to give me all the results for January, and
then the results for February, the results for March, and
piece them together.
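A minimal sketch of this naive fixed-interval slicing, assuming hypothetical table and column names:

```python
# One mini-query per fixed-width time slice; each yielded batch is streamed
# to the front end, which merges results progressively. Names are assumed.
from datetime import timedelta

def fixed_slicing(db, keyword, start, end, width=timedelta(weeks=1)):
    t = start
    while t < end:
        t_next = min(t + width, end)
        # Each mini-query covers one slice of the time dimension.
        yield db.execute(
            f"SELECT state, COUNT(*) AS cnt FROM tweets "
            f"WHERE create_at >= '{t}' AND create_at < '{t_next}' "
            f"AND ftcontains(text, '{keyword}') GROUP BY state")
        t = t_next   # the front end merges each batch as it arrives
```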
We see a lot of issues here. The first issue is skewness:
data is not evenly distributed.
Of course, we assume there is some kind of index structure on
this attribute.
this attribute.
By accessing that index structure,
the mini-query is much faster.
But still even in this case, since the distribution of that
dimension is not uniform, you get this behavior.
Sometimes it can be really fast, sometimes it
can be very slow, especially in the hurricane
example you can see last summer, there's a peak.
Once you go to that range,
then the engine is very busy processing that query,
the user waited for a long time.
This is one drawback. The other drawback is that, even for
the database itself, even if the distribution is uniform,
your database may still be busy with many other queries.
In a multi-tenant environment,
your virtual machine is probably serving lots of
different tenants.
So even the behavior
of the database can be very unpredictable.
>> So I think I'm missing the basic idea of slicing.
So from a user's point of view,
let's say I have an average query, average aggregate.
So underneath,
you're doing some slicing that I don't know about right?
>> Yes.
>> So you're giving me an answer for, let's say, some slice and
that has a certain average.
Now once you add the second slice,
that average may go up or down, change drastically.
So as a user, what do I see?
>> You see this.
So what we are essentially doing is giving the results
progressively.
>> No, but I'm saying that, okay,
if we look at something like online aggregation, there was in
some cases a confidence interval; here we don't have that.
>> This one, to some degree, you can say, is also one way to
do sampling; I'm doing the sampling very consistently.
And I don't have any assumptions about the execution;
that's why I cannot give you any confidence interval.
If you do something like random sampling, or
some of the work the team here is doing,
or you look at what Joe Hellerstein's team did 20 years ago
on online aggregation, some of those approaches have
the assumption that the database engine supports random sampling.
And we don't have that luxury here; we want to take
the approach where I just have a database sitting there,
I don't know what's inside it,
and I don't want to modify the database. Can I
do something in the middleware that is very self-adaptive?
So in the whole solution, when we start the system,
the Cloudberry system,
it doesn't know anything about the database, except of course the API.
And so the distribution along that dimension, I don't know,
especially since the user can type in any keyword.
I cannot build a histogram for
every keyword, or even combinations of keywords.
Even the behavior of the database engine can be
very unpredictable.
Suddenly it becomes faster; sometimes it can be slower.
The middleware should be able to adjust itself to give you
a very smooth experience; this is the approach we take.
And of course, which one is right depends on the scenario.
If you have access to the source code,
and you have the luxury to modify your source code to provide this
random-sampling interface, you should go with that approach.
But the approach we take assumes it's just a black box.
>> So the only assumption here is that the data is partitioned by
time, so you can query a specific time slice?
>> Right, yes, you have one attribute,
using which you can do query slicing.
>> That actually has to match the grouping attribute?
>> Not necessarily.
>> So any attribute works?
>> You can go deeper to see that.
I think you can come up with some requirements on the slicing
attribute, but in this query, it's grouped by state.
What we're doing is, first we get all the results for
January, and then do the count.
One requirement is that the results are aggregatable, summable,
or you can keep some small partial results and put them together.
But at a high level, the requirements for
the slicing attribute should be very minimal.
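A tiny sketch of what mergeability means in practice: counts and sums add up directly across mini-queries, while an average has to be carried as a (sum, count) pair.

```python
# COUNT and SUM combine by addition; AVG must keep (sum, count) per slice.
def merge_avg(partials):
    """partials: iterable of (sum, count) pairs, one per mini-query."""
    total_sum = sum(s for s, _ in partials)
    total_count = sum(c for _, c in partials)
    return total_sum / total_count if total_count else None
```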
>> So like you said, you're taking data week by week right?
>> Right.
>> So I'm just asking whether it's configurable, that
you were taking it week by week, or state by state,
because you could also do it state by state, right?
>> Yeah, so we try to make the middleware configurable
by allowing you to say, I do it week by week or month by month.
But I want to mention that you could take that approach, but
that experience is not very pleasant.
Exactly as I showed here: if you just do it on a fixed interval,
due to the behavior of the distribution and the database
engine, the times at which results are delivered are not very smooth.
We want to give the user a very smooth, or
what you could call rhythmic, experience,
and that's exactly the solution we came up with.
We just finished a paper maybe two weeks ago.
So the idea is the following: at a high level, how can the
middleware decide a slicing interval for
the backend to make sure the execution time of each mini-
query finishes within a fixed time?
Two seconds, suppose. When you're developing a system, you wanna
say the middleware should give the front end an update every
two seconds; suppose this is a configuration.
It's more of a dancing rhythm here: how can you make sure
every two seconds you give the front end an update,
instead of delivering different results at arbitrary times?
The key challenge is, how can the middleware
model the behavior of the data distribution, and
also model the behavior of the engine as a black box, and
then be adaptive by itself?
>> Maybe I'm missing something.
There is of course a time constraint of [INAUDIBLE]
seconds; so if you view that as a constraint, I get that.
Is there an optimizing function?
>> In this formulation,
we don't assume you have to finish by a certain deadline.
We don't have that assumption.
>> You don't have the assumption?
>> No, we don't have the assumption;
we just want to get the results, whatever time it takes.
We have to get the results.
We don't have a deadline notion here.
>> Okay. >> The focus of
this work is more about the rhythm.
How can I give you the results using these fixed,
two-second windows?
>> But what is the trade off?
I mean, can you make them every hour, right?
>> [CROSSTALK] >> The slice can be very small.
>> We assume this two seconds, this parameter,
is given by the system admin.
So in this scenario, when you develop a system, you say, I want
to give the user an update every two seconds.
Two seconds is a parameter;
it can be two seconds, it can be one second,
it could be five seconds, and- >> That's the update time, not the
slice time, right?
>> Exactly, so
the middleware has to pick the slicing interval.
That's exactly what we are trying to solve.
So the admin tells the system,
give the user an update every two seconds.
>> So the question I was getting at,
okay, you have to pick the slices.
Okay, I got it, and we'll talk about that.
But either this is over-constrained,
or it is a situation where multiple solutions are allowed;
I can slice in some 15 ways.
Then are there desirable properties of this?
Is there an optimizing function for
this that you're trying to get right?
There's the other issue I'm looking at:
since this is non-deterministic,
one run can get a certain behavior and another run a different one.
And, given that there is no statistical basis for it,
I don't know how you interpret it.
>> That's the purpose of this slide, okay?
So using a very simple example, let's say
we consider three different ways to deliver results, okay?
The first way, I just ask for the whole range; this is called
a schedule, and this schedule takes six seconds to finish.
But during those six seconds, the user has to wait.
Okay?
A second schedule, magically,
can give the user an update every two seconds, and
each of these four queries runs for the same time.
Of course, altogether it takes a longer time,
eight seconds, but the update is very regular, very rhythmic.
The third approach also issues four queries.
And the first query took one second,
the second query took three seconds,
the third took one second, the fourth took three seconds.
So these four queries
finish at different times.
Now, whenever you get a result, you give the result
back to the user.
In this schedule, the user sees the first result after one second,
and then the user waits; after two seconds, there's no update.
The user has to wait for one more second to get the results.
The third result the user gets immediately, after
one more second.
And for the fourth one, the user again needs to wait for
one second after the two-second deadline.
So back to your question: I have three different
ways to get results; how can I quantify the quality here?
We need to come up with a notion to say which is better.
Okay?
And when you come up with it, there is no single answer
here, but I'll give you our answer to the question.
We consider two things in this formulation.
One, total running time.
Ideally we want to make it as small as possible, right?
Because overall we don't want the user to wait for a long time.
The second thing we care about is the pauses.
If there's a pause, and I said two seconds,
that means every time the user gets an update,
you do not want the user to wait for longer than two seconds.
If the next query comes in after the two-second deadline,
the longer the user waits, the more penalty we have to pay.
So that's how we formulate the problem.
The second factor,
we call it the smoothness of the result delivery, which is:
suppose this is your previous mini-query that ends here;
to maintain this pace, we can still wait for some time for
the middleware to get the results,
and suppose this is the window.
Now, if the next mini-query took longer than expected, if it
finishes here,
then during this period the user has to wait.
So the second factor we have to consider in this whole
quality metric is how much delay the user has to endure.
Ideally we want to minimize the total running time.
We also want to minimize the total waiting
time for the user.
So the solution we came up with, or
the answer we have, is a combination of the two factors:
the total running time, which we call the cost T, and
the total delay, and we use a factor, a constant,
to decide how much penalty we want to give to a delay.
Of course, we can debate whether
this one is really a gold standard,
but the idea is there: a combination of those factors.
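One plausible way to write that combination down; the symbols here are ours, and the exact formulation in the paper may differ.

```latex
% Schedule quality: total running time plus a weighted penalty for each
% delay beyond the pace limit (a hedged reconstruction, not the paper's).
\[
  \mathrm{Cost}(S) \;=\; T(S) \;+\; \alpha \sum_{i=1}^{n} d_i,
  \qquad
  d_i = \max\bigl(0,\; t_i - (t_{i-1} + L)\bigr)
\]
% T(S): total running time of schedule S; t_i: finish time of the i-th
% mini-query; L: the pace (e.g., two seconds); alpha: the delay penalty weight.
```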
>> But if you want to be very smooth, and you use very small
slices, say one millisecond- >> Yes.
>> Then the delay is basically zero?
>> Yes, the delay will be very low.
>> I see, so the cost will be higher.
>> Because you are sending a lot of queries to the backend.
>> And how do you model that cost?
>> For now, in our approach, it's just measured in time.
Again, we don't know how many CPU cycles or
resources are available in the database system.
We don't have that access.
So either way, it's just measured in time.
Of course, if you have access to the database
resources, you can come up with a better way to model the cost.
>> So just a clarification again: these mini-queries,
are they being executed in parallel, some of them, or?
>> That's the previous question.
So far, in the current implementation,
we assume they are done sequentially.
We did an experiment where the middleware sends them out
at the same time.
Our finding is that you're kind of overdoing the work, right?
You're competing for resources with other people.
So you could do it, but for
our solution we assume we do it sequentially.
Okay.
>> So the metric that you are using is running time, or
is it something else?
>> This one?
>> Yeah.
>> It's total running time.
>> But when you're trying to optimize, to figure this out,
in your, quote unquote, planning phase,
you don't have access to that.
>> Right, we don't have it.
So this is more like a ruler to measure the quality.
At the beginning, we tried to formulate it as
kind of an optimization problem.
We found it's really hard.
Because it's an online problem,
everything changes as you move along the schedule.
So what we do is take a greedy approach,
meaning every time we generate the next mini-query,
based on this function, I try to be very greedy.
I don't claim any optimality.
At the end, once I have generated a schedule,
I use this function to measure the quality of my schedule.
So I don't solve it globally.
I'm just greedy.
Okay, so this is the framework of what we're doing.
We call it Drum, which stands for
Delivering Results Upon Met Milestones.
Okay, one of the reasons we use a drum is it's about the pace;
you have to meet the rhythm.
So the idea,
I mean, you have seen this structure in many other places.
The whole framework is very adaptive, meaning you have
a query, and the middleware will initially generate some very
conservative mini-queries by using very small intervals.
Okay, you must first get some results, and
then you send the mini-query to the backend, and
the backend returns the results.
The middleware also keeps track of the numbers, the performance.
And that information will be fed into this module
to model the behavior, and that information will be used
by the generator for the next mini-query.
>> So this adapts the width of
the slices that you're using to generate the mini-queries.
>> Correct.
>> Do you also do any adaptation about choosing a different
dimension along which to do the slicing? >> We don't do it.
The reason is, from the user-experience perspective,
the UI is fixed.
It's always time.
Because if you slice along the space dimension,
the UI doesn't match. So this is the idea, okay?
So, now let's talk about some details.
We want to model the relationship between the slice
interval and the running time.
So the one thing we do have is a regression model.
So [INAUDIBLE] regression, but
you can plug in any other regression model.
Any regression model cannot be perfect,
so we also have a model for the behavior of the errors.
I have spent almost 40 minutes, and
I think you get the main idea of what we're doing here.
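A simplified sketch of this adaptive loop, assuming a plain least-squares fit and a fixed safety margin in place of the real error model:

```python
# Fit runtime ~= a * width + b from observed mini-queries, then pick the next
# slice width conservatively so it likely finishes within the pace.
# Illustrative only; the actual Drum estimator is more elaborate.
class SliceEstimator:
    def __init__(self, pace=2.0, safety=0.8):
        self.pace = pace          # target update interval in seconds
        self.safety = safety      # stay below the pace to absorb errors
        self.history = []         # (width, runtime) observations

    def observe(self, width, runtime):
        self.history.append((width, runtime))

    def next_width(self):
        if len(self.history) < 2:
            return 1.0            # start with a small, conservative slice
        n = len(self.history)
        sw = sum(w for w, _ in self.history)
        st = sum(t for _, t in self.history)
        sww = sum(w * w for w, _ in self.history)
        swt = sum(w * t for w, t in self.history)
        den = n * sww - sw * sw
        if den == 0:
            return self.history[-1][0]
        a = (n * swt - sw * st) / den       # least-squares slope
        b = (st - a * sw) / n               # least-squares intercept
        if a <= 0:
            return self.history[-1][0]
        # Solve a * width + b = safety * pace for the next width.
        return max((self.safety * self.pace - b) / a, 0.1)
```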
So, if you don't mind, okay, I'll just wrap up this one and
move on to the next one here.
But you get the idea: we try to
use a middleware to make your queries responsive, without
making very strong assumptions about the backend database.
That is what we do here, yeah.
>> Just one question.
>> Sure.
>> Are there studies that show users prefer
this constant query response time?
>> We don't have one; we were talking to HCI folks,
to see who can do a user study. >> I think this is their expertise.
We haven't done it yet.
But from my biased perception, we see
the fixed-width approach doesn't look good.
>> So this is considering the time, right?
I was thinking the time might not be the only thing.
It might be that [INAUDIBLE] only for one second;
however, the amount of data that comes through in that one second
could be a lot. So shouldn't you consider this slicing
with some estimation of how much data will come in that slice, and
then tie them together?
>> Exactly.
When we decide the individual slice width,
we do consider that factor.
>> How do you make that amount-of-data estimation?
>> The answer is a combination of the regression and
the error model.
Intuitively, if I wanna finish a query
within two seconds, by using the regression model alone,
maybe I should use seven days.
But using a seven-day window has a risk of missing the deadline.
So I should be a little bit conservative, maybe by using
five days or four days.
That is already captured by the error model there.
So I'm not targeting to finish the whole thing in exactly
two seconds.
That's too risky.
I need to be a little bit conservative.
But that's- >> So given this discussion,
it seems a user study of this experience is kind of
crucial. [LAUGH] >> Yeah,
that's a real question right there.
>> Since it is so
much tied to the user- >> Yes, yes, yes.
>> And not so much- >> [LAUGH] [INAUDIBLE]
>> Sir, you are being recorded.
>> Okay, yeah, so to answer the question:
I think this is not the first time people have suggested that we
do user studies.
I think we should do it, we should do it.
>> Okay.
>> And it's recorded.
So I want to use the remaining time to talk about
a second system.
>> So, all of this aside, this is about slicing,
how much to slice or not.
In this case, as far as I could understand,
wouldn't a simple histogram help you here?
>> I don't have a histogram. >> And
is it particular to, because in relation to [CROSSTALK]
>> As we discussed this morning,
the user can type in any keywords. I can build a histogram for
some keywords, but for a keyword I don't know, I don't have one.
I start from scratch.
All right.
So I want to use this remaining time to talk about the second
system, which to me is also very exciting.
I did a lot of work in text processing.
I did a startup about text processing,
and I feel text is a very rich domain.
There are a lot of opportunities.
This new system I am building is called Texera;
I'll explain the name later.
At a high level, it is a system that supports
text processing. Okay, and I'm going to show you.
The reason is, everybody in this room can write code.
But some people without an IT background,
they cannot write code.
So how can we make it easier for
these people, including us, to do some text processing very
easily, without writing a single line of code, right?
If I can finish one thing within five minutes,
I don't wanna do it in one hour.
It's about the cost, okay?
So what we're building is a web service, Texera,
using which a user can easily drag and
drop some basic operators to formulate a workflow, okay?
You don't need to install any software;
Texera is web-based.
So here I'm also using Twitter as an example.
And so, I can just start with a scan.
Let me refresh the page.
And so, I can just look at scan.
Refresh your page.
So I want to allow a person to analyze tweets,
without writing a single line of code.
So I can say I do a scan.
I have a lot of tables there.
I pick one table called the data for last week.
I can just see the results.
We know this is [INAUDIBLE] with a limit.
Top readers.
I run a query, and
I should be able to see some of them within ten seconds to it.
That's not innovative.
Let's do more analysis.
Suppose I only wanted to look at all the Tweets mentioning
a particular keyword, say, hurricane.
I go here.
I do a keyword search.
I need to specify the attribute and
a keyword I used, like a hurricane, hurricane and
I gave the name of the results like a operator.
Search results, so it link us to operators, and
these two operators.
So this is a new query, so you can click each of them,
you see this one should have hurricane here.
Now, more analysis.
Suppose I want to do entity recognition.
Okay, so what I can do is I go to this menu option,
I pick entity recognition, I pick one of the attributes, and
I pick one of the entity types: noun, verb.
Suppose I want to look at locations, okay, and
I name it location results. Okay.
>> Sir,
are you looking for all the results?
Or, like, I want to see, is there a notion of ranking,
is there a notion of top ten, or, because-
>> We have limit.
You know, we database people are not good at ranking.
[LAUGH] We give you everything [INAUDIBLE] [LAUGH].
That's the difference; we're a database.
You know it.
>> So with entity [INAUDIBLE] >> In my experience,
that is an operator that is incredibly sensitive to context,
and whether I wanna recognize products or
historical figures and this kind of stuff.
How can you have a general-purpose, single entity
recognition operator?
>> This standard entity recognizer is
wrapping Stanford NLP.
We wrap Lucene. We wrap Stanford NLP. We wrap
NLTK. We wrap those guys.
>> Yes. >> So in a way
whatever you route >> Still has to
generalize and- >> Correct.
>> In your experience is that the case or
is there some way to customize the HTTP recognition [CROSSTALK]
>> It's both,
if you look at machine right,
Very powerful search engine, stemming, ranking, free search,
multi language support, they're all available.
We just wrap it and bring it back to you.
And some of the operators that you use on your machine,
standardized the rapid.
At the same time, if the user wants to add their own logic,
we should allow the user to do it.
>> So wait, so
there's one interpretation of this which is a query builder.
All I'm doing is, I have individual components and
I'm composing, not necessarily a plan, but
a workflow, kind of, right?
And it's just that it's a visual thing, and
the boxes are what they are.
What is the semantics of this workflow: for
each box, figure out what it is, and whatever the semantics is,
I take the results and put them in this pipe, and that's it, right?
Is this how I should view it, or is this something else?
>> It's for text.
>> What? >> It's invented for text.
>> Do we have a solution on the market?
I don't see it.
>> I didn't say I misunderstand everything [CROSSTALK]
>> When you work on
the [INAUDIBLE] that's fine,
but I don't see any solution on market.
There are solutions that are standalone software packages
like 9 [INAUDIBLE] AutoRx, these are a few solutions,
either open source or proprietary.
They are not web based, they're not cloud based.
And from a user perspective,
they do not want to install software.
>> Do you have any debugging support for this?
>> We will.
>> To ask the question in a different way: let's say
I firmly believe Vivek's entity extractor is far superior.
>> [LAUGH] >> He's very good at this.
>> What would it take for me to actually integrate one
specific entity recognizer into your software? Is that possible?
>> Yeah, it is possible. First of all, it took us about one and
a half years to reach this milestone.
Saying and doing are different;
it took a lot of time to reach this step.
A few things that we are doing;
I want to summarize the few things we are going to do.
One, accessibility: you have your own logic, or even some package
you want to wrap; we want to make the process very easy.
We want to support Python, we want to support Java and R,
these common languages.
How can we allow the developer to wrap their own logic into
this whole pipeline?
That's one; we have to do it. Second is debuggability:
when you have a long-running job on a very large number of
records, the whole execution can take a long time.
How can you give the user some kind of progress report
of where you are in the whole execution?
Allow the user to even pause one of the operators and
do some evaluation of the state of that
operator to see the intermediate results,
even do some lineage tracking to see where a record comes from.
We want to allow the user to debug it.
>> It's more than just debugging, right?
If I know, let's say, that all names that have
a middle name that's a single initial are misrecognized,
how do I actually then enable that feedback, right?
It's one thing to recognize that; it's another thing to
actually systematically modify my entity recognition component.
>> So far, we don't have the feedback mechanism;
we don't have it yet.
We talked about the [INAUDIBLE].
That kind of software is very powerful in terms of
allowing the users to highlight some of the places, and
then it can recommend some rules.
We are not there yet, but it's down the road.
Let me talk about the current focus:
accessibility, debuggability, and usability.
And one more thing we care about is
scalability.
So far the whole thing runs on a single engine, but
nothing prevents us from running one operator in
a parallel environment.
Because we know text processing can be very expensive;
it can take hours or days to finish.
So, if we can parallelize the execution of each operator
across a cluster, then that's even more powerful.
This has not been done yet;
this is only the initial prototype.
And the good news is, for
both systems I'm presenting here, they are used by people.
As in my talk abstract, we are working with
some UCI researchers; they're not from ICS.
They are working in public health and
politics, using social media to do their analysis.
And they love the first one because they
can easily see the data.
And they also use this one here, because they do not want to
write a program, they do not want to install software,
they do not want to apply patches.
And this is a big advantage of the cloud-based trend;
everybody here knows how important the trend toward the cloud is.
And we believe that if you take
the idea of either [INAUDIBLE] or [INAUDIBLE],
push it to the cloud, and make it very easy to use,
and make the execution detached from the front end, so
the user can access the execution from anywhere,
and even allow multiple users to share that execution interface,
this is very powerful.
That's why Google Docs is very easy to
use: multiple people can share the same document.
This infrastructure can allow multiple developers to share
the same workflow without using remote desktop, right?
So I believe this is a trend, and there are a lot of new
challenges we have to solve in this new architecture here.
Extensibility is certainly part of this:
you give me an idea, say around Lucene, NLTK,
Stanford NLP, maybe Vivek's future model, we can wrap it.
And then we wanna make it extensible by
allowing the user to write their own logic.
So those things are already happening here;
so where is the technical part?
You already got one technical part;
let me maybe give you one more slide here.
And one interesting thing is,
some people ask the question,
how is it different from a database engine?
At the high level, it's all about operators forming a DAG;
at that level they are the same.
But the main difference is, inside a database system,
people don't interact with operators.
People interact with the system using SQL, using a string. But
here you, the user, can interact with each operator, okay?
So the architecture is very different.
A second difference is, inside a database system,
how many operators do we have? Maybe 30 or 40 at most.
But in this text domain, there are so
many different operations you can do.
The number of operators is much, much larger.
I heard they claim to have more than 1,000 operators for
different purposes.
So we have this whole architecture that allows
a new user, a developer,
to contribute a new operator to the whole framework,
because the whole thing is very open, very extensible.
Now, when you develop such an engine,
how can we make it easy for a new operator to be implemented?
That's very different from the database case,
where everything is under control;
here it's more open-ended.
So what we do, in our current engine,
is every operator has a kind of descriptor in a JSON format.
The descriptor has all the information about this operator;
in addition, of course, you have to have the code,
the code logic.
The descriptor has how many inputs there are and what the output is,
even, for the front end, what kind of icon you want to use and
what kind of description you want to attach to this operator.
And from a developer's perspective, as long as you use the framework
to describe a new operator with the code and
the metadata, the whole thing will be integrated into
the whole system very easily.
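A sketch of what such a descriptor might contain; the talk does not show the actual schema, so every field here is an assumption.

```python
# Hypothetical operator descriptor, shown as a Python dict mirroring the
# JSON format described above. Field names are illustrative assumptions.
keyword_search_descriptor = {
    "operatorType": "KeywordSearch",
    "inputs": 1,                       # number of upstream operators
    "outputs": 1,
    "properties": {                    # rendered by the front end as a form
        "attribute": {"type": "string"},
        "keyword": {"type": "string"},
    },
    "ui": {
        "icon": "icons/keyword-search.png",
        "description": "Select records whose attribute contains the keyword",
    },
}
```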
>> So you're wrapping an arbitrary piece of somebody
else's code.
It may not have the same realization of the data model.
So who has to write that?
>> So first of all, we have not done that UDF part yet, right?
And I'm not saying every package is wrappable, right?
Some models might not be wrappable.
But there are certain protocols the person has to follow,
like what's the input, what's the output.
Right, so, I want to identify the commonality and
make it easier for the developer to write that piece of code.
But so far, the way we integrated NLTK, which is in
Python, is pretty interesting, because we tried different ways
to include Python in our Java engine.
One approach is to use a Java class to
interpret the Python code.
The other one is running Python as a separate process,
and the Java and Python processes use IPC to [INAUDIBLE].
We take the second approach.
But even for the second approach,
there's one question, which is how the JVM and
the Python process share information back and forth.
How can you minimize the overhead by doing batching?
Those things need to be figured out.
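A minimal sketch of that second approach, with a hypothetical worker script and newline-delimited JSON batches standing in for the real protocol:

```python
# Run the Python NLP code as a separate process and exchange batches of
# records over its stdin/stdout, amortizing the per-record IPC overhead.
import json
import subprocess

worker = subprocess.Popen(
    ["python3", "nltk_worker.py"],      # hypothetical NLTK wrapper script
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

def tag_entities(records, batch_size=512):
    results = []
    for i in range(0, len(records), batch_size):
        batch = records[i:i + batch_size]
        worker.stdin.write(json.dumps(batch) + "\n")   # one batch per line
        worker.stdin.flush()
        results.extend(json.loads(worker.stdout.readline()))
    return results
```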
>> It seems to me that [INAUDIBLE] something like
SQL Server Integration Services, which has a very similar sort of
workflow, in the sense that there are a bunch of built-in operators
which you as a user can connect for your tasks.
And they also have custom script operators, where you can
write your own scripts and put them into the thing, and
as long as they meet certain input/output criteria,
you can plug them in anywhere into the pipeline.
In a sense, those kinds of engines have been built, right?
So what do you see as the new challenges because it's text
processing?
>> I'm not saying this idea of formulating a workflow
using drag-and-droppable operators is new.
Microsoft did it.
[INAUDIBLE] did it.
I believe the uniqueness here is the cloud.
I believe if you pick some idea, you can make a similar argument:
why did Microsoft move from Office to Office 365?
You have to move everything to the cloud, and
that is an architecture shift.
Previously, you ran the whole thing on your single desktop.
I don't know the software you mention.
I saw some GUI to formulate a query.
I saw that interface.
I don't know the one you're talking about.
But in general, I believe the software you're talking about
is something running on your local machine.
>> Is this for the ETL scenario, primarily targeted at that?
>> Right, but
I don't know whether it's web-based; I don't know yet. But-
>> But
we can think about that. >> Yeah.
>> I guess the question is basically,
because it's cloud based [INAUDIBLE]?
>> I feel- >> As a functionality, we
are fine with the examples you gave of Office 365 and whatnot.
Makes sense, right, nobody is questioning that.
But from a technical perspective, what's changed?
[INAUDIBLE] >> I don't have a perfect
answer to your question, because we are also exploring.
I'm really going on gut feeling here.
I see people want this.
This is my short, or non-technical, answer.
If I go deeper, I believe, first, once you go to the cloud,
there is a very big potential to automatically scale out
the whole computation, right?
If you have a job that's very expensive to run,
then in the cloud, you have much more freedom to launch
multiple virtual machines to parallelize some of the operators.
The cloud gives you that opportunity.
I know they look similar, but
that's not what they're talking about.
>> The partitioning logic.
>> Yes, that one is a new opportunity you cannot explore
on a single desktop.
In addition, in terms of the user experience,
I think it is very different.
Of course, you may not call this technical, but when designing the
architecture, since we're using this service-based architecture,
the execution of your logic is detached
from your front end.
You can easily open a new browser and
attach it to the execution here.
Whether that's engineering or technical, I don't know.
But you have to think about this whole thing very differently.
>> You can say the idea has been implemented by RapidMiner.
But RapidMiner runs everything as a traditional program
on a single machine, and I believe software like
RapidMiner has spent a significant amount of effort on
the UIs that are there.
That's why it's very hard for
them to migrate to the web, to the cloud.
They can't.
They have a burden they have to carry with them.
For us, we start from scratch.
From day one, we do the web interface.
Maybe one or two years down the road,
I'll have a more technical answer to your question.
>> You're going to have to beat Unix pipes.
It's very interesting to listen to the questions.
Everybody here's a database person.
[LAUGH] >> You are also a database
person.
>> The thing is, Unix pipes would do this kind of thing.
Unix pipes were actually used exactly for
this, except it's not JSON.
It's a different data model.
It's comma- or
space-separated fields, carriage-return-separated records.
>> I would say that awk is probably [INAUDIBLE].
>> Exactly.
>> Yeah, but- >> And you talk about the cloud
in terms of scale-out, but probably the first thing
people would worry about is how to get parallelism on
a multi-core machine with shared memory, because there
are a lot of performance issues there.
We know, just for database processing,
query processing, how hard it is to get the most out of the machine.
I have a hard time thinking about text processing that is of
such a scale that you would go to multiple machines to get
there.
>> We have one billion tweets to process, and
the user does not want to wait for a long time.
So having the ability to run the whole thing on 100
machines will help for sure.
>> [INAUDIBLE] >> But back to your question,
I agree with you that even on a single machine there is a lot of
potential to parallelize the computation using multiple
cores. But, strategically, back to [INAUDIBLE] question,
I will not focus on that one yet,
because that issue also exists for single-desktop software.
>> So if I understand you right,
what you're really talking about is, I'm trying to see, in
terms of machine architecture in the cloud,
how does it look? Like, if we are busy moving your boxes,
I can sort of think of them as a bit like microservices.
I can say, hey, they have their own class, they have their
own logic, they run in their own VMs.
>> Right.
>> And your stuff runs, and
then if you [INAUDIBLE] service [INAUDIBLE] they send it back.
And you connect all these microservices.
If you really want to be kind of cloud-native, you should
think in terms of microservices in that case, right?
But then there are all kinds of issues, right?
In that case, your storage [INAUDIBLE] REST API through
which you're communicating, and that's pretty much it.
So it's somewhat a matter of view, right?
>> Yeah.
>> And there is managing which side of the network they are on.
They may not be on the same rack.
You can't make any assumptions about where those machines are.
So to get there,
the question is, to get the right performance architecturally,
what are you assuming?
So as you said, the model is not new.
We are going to the cloud.
But with that comes the question, what is running where?
What do they share?
Where is the network?
How much is being transferred across the network, and
what are the options?
>> Very good point.
There are a lot of questions about the architecture.
We could run each of the operators as
a microservice using some standard REST API.
Currently, we're looking at the architecture where we use,
I think it's called, the actor model [INAUDIBLE]
>> You should just talk to Phil.
>> [LAUGH] >> [INAUDIBLE] I'm next on
the agenda.
>> [LAUGH] >> I think that's a coincidence?
>> [LAUGH] >> So
he gets another half an hour extra.
>> [LAUGH] >> So, hopefully by the end of
this quarter, by the end of this year,
we're going to switch
the whole thing to the actor model.
Because the idea is each operator runs as
a thread with a separate queue.
All of the different operators communicate with each other by
sending messages to the queues.
A big advantage of this architecture is I can pause it.
Because each of the actors is running using a thread,
like a pooled thread.
Currently I'm using a single thread, the pull model.
So we plan to switch to the push model with all the actors running.
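A toy sketch of that actor-style operator, with each operator as a thread draining its own queue and a pause handled as a control message; the real engine's actors are more elaborate.

```python
# Each operator runs as a thread with its own mailbox; operators talk only
# by sending messages, so the engine can pause an operator mid-stream.
import queue
import threading

class OperatorActor(threading.Thread):
    def __init__(self, process_fn, downstream=None):
        super().__init__(daemon=True)
        self.mailbox = queue.Queue()
        self.process_fn = process_fn      # per-record operator logic
        self.downstream = downstream      # next actor in the workflow
        self.resume_signal = threading.Event()

    def run(self):
        while True:
            msg = self.mailbox.get()
            if msg == "PAUSE":            # control messages share the mailbox
                self.resume_signal.clear()
                self.resume_signal.wait() # block until resume() is called
            else:
                out = self.process_fn(msg)
                if self.downstream is not None:
                    self.downstream.mailbox.put(out)

    def resume(self):
        self.resume_signal.set()
```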
>> So I have a gut feeling, I don't have any ground
truth to support it, which is that microservices might be too expensive.
>> Okay.
>> That's being fair. >> That's my feeling.
I think, I think.
>> I'm not recommending microservices.
I'm just asking you, in this spectrum, where you sit.
>> I guess the actor model should be more efficient, because it is
in the same framework, using inter-thread communication.
That should be cheaper than HTTP communication.
That's my feeling.
So we can go with the actor model.
In fact, the Cloudberry middleware we saw earlier
is built on the actor model; there we're using Akka in Scala.
In this one, we're using Java.
All right, so I've reached the end of my talk, but
I wanna say one more thing, which is, in this text domain,
machine learning models are used very commonly.
And one common question people ask is, in the whole pipeline for
data analysis, machine learning is very important, so
where does machine learning fit into this whole architecture?
So the way I see it is, we have this backend, AsterixDB,
running as a database system to do ingestion.
And we use Cloudberry to do the visualization.
And then we use Texera, the older one we called TextDB,
to do the workflow-based formulation.
And this whole suite of solutions can be used
to help you do the preparation for machine learning.
You can use it to store data, visualize data, and
then analyze data; and once you use the whole suite
to prepare some
data or label the data, then you can train the model here.
You get what I mean here.
So you use the whole thing to train the model.
The model is more or less just a file, and
then this model can be integrated back into
Texera as one of the operators,
once we finish the UDF feature.
Or, as data is being ingested into the database,
this model can be used as a kind of UDF to do some offline
processing, or even at run time while you do the visualization.
For those tweets, you can even use a UDF to do online labeling.
So the short message is, this whole suite of solutions and
machine learning are kind of complementary;
we are focusing on the data preparation side.
Okay, the conclusion is here, and these are the acknowledgements;
thank you, especially for all the support.
[APPLAUSE]
>> [INAUDIBLE]
>> [LAUGH]
>> Excuse me, I have one question
about Cloudberry, if you can think of that.
So you mentioned that you don't have random access to the
database, and that's the reason you don't use samples
to answer the queries.
However, you have this database there, you have the data there,
and you have access to the schema of the database to translate
the query to the SQL query, is that correct?
>> Correct, the middleware has access to the data.
>> What I don't understand is, in a realistic setting, are you the
owner or a third party? When you are a third party, you get
this data and store it, and now you have access to this data.
So where exactly are you computing?
>> On some machines, the backend database has a large amount of
data, and the middleware knows the API, the schema,
but the middleware is more like an accelerator to some degree.
It's more like a data warehousing accelerator
that sits on top of the database.
The database is here, and Cloudberry is here.
So, previously the application layer talked to the database
directly to ask queries, but those queries can be slow.
By putting Cloudberry in between, which also knows
the database schema, this layer can use all the techniques to make
those queries from the application layer much faster.
This is the positioning.
>> Others do the same, right,
the [INAUDIBLE] the Twitter.
But they also do the crawling, they have the data, and
they also do the [INAUDIBLE] of those questions.
>> Yeah, Cloudberry is general-purpose software;
it's not just for Twitter.
The database can be anything, so it's not specific to Twitter.
>> Okay thanks a lot for
a very interesting [INAUDIBLE] >> [APPLAUSE]