Thursday, September 28, 2017

Watching daily Sep 29 2017

Hello friends welcome to Madras Samayal

Today let's see how to make Ennai Kathirikai

We already have a recipe for Ennai Kathirikai Kulambu

but this recipe is a little different from that

This is an excellent side dish for Biryani

and also goes well with variety rice and Chapathi

First let's make a special masala

roast 20 peanuts on a medium heat

Now to this add 1 tsp Fennel seeds

1 tsp Cumin

1/4 tsp Sesame seeds ( white or black)

finally add 10 - 12 fenugreek seeds

roast it for a minute on medium heat

Now let's transfer it to a mixie jar and grind well.

Now to make ennai kathirikai, let's fry the brinjals in oil

we need 4 tbsp of gingelly oil for this recipe

we need to choose small sized brinjals for this recipe

make a deep "X" cut at the bottom

place the brinjals in the oil and fry them until soft

If there is any water it may splutter, so pat them dry

Now the brinjals are nice and soft

let's take them out of the oil

Now in the same oil let's temper

1/4 tsp mustard seeds , 1 tsp urad dal

Now to this add 2 finely chopped onions

also add in required salt

Once the onions are golden, let's add the masalas

1 tsp Ginger garlic paste

1/4 tsp turmeric powder

1 tsp Coriander powder

2 -3 tsp chili powder

Also add in the freshly ground masala

Finally add 1 sprig of curry leaves

let it cook for a couple of minutes

In the meanwhile lets grind 2 tomatoes

Add the tomatoes as well

also add in the required water, and more salt if needed

let it cook for some time

Cook on medium until the oil oozes out

Now it has been 8 minutes and the oil has separated

Now let's add the tamarind water

(tamarind soaked in water for 10 minutes)

Now let's add the tamarind extract

Cook until the oil separates

Now after 10 -12 minutes lets add the fried brinjals

Mix it and let it cook for 4 - 5 more minutes

That's it, it's nice and ready

Tadaaa, our delicious ennai kathirikai recipe is ready to be served.

For more information >> Ennai Kathirikai for Biryani | Ennai Kathirikai Kulambu in Tamil | Brinjal Masala Curry - Duration: 4:36.

-------------------------------------------

Donald Trump Defends Hurricane Response While Puerto Ricans Wait For Aid | The 11th Hour | MSNBC - Duration: 5:05.


-------------------------------------------

Q2 Weather: 10 p.m. with Bob McGuire for Sept. 28, 2017 - Duration: 3:44.


-------------------------------------------

Autumn Dryness for the Elderly : How to Prevent Dry Cough - Duration: 3:00.

In outpatient service,

how do you treat these patients?

Ok, to prevent dry cough in the dry weather in autumn

I recommend a prescription, Erdong Soup.

It is easy to make,

as it contains only two medicinal ingredients.

President Wang, please tell us about Erdong Soup.

Here, President Wang.

Look at the table.

We have two traditional Chinese medicinal ingredients,

and two pots of plants.

Let`s first ask the guests and friends

to distinguish them.

Just now, President Wang talked about Erdong Soup

which contains Asparagus cochinchinensis

and lilyturf roots.

Who can tell them apart?

This is lilyturf roots

and this is asparagus cochinchinensis.

Actually, we already saw lilyturf roots

when we made the five-juice drink just now.

Who can tell from the two plants which is asparagus cochinchinensis,

and which is lilyturf root?

Good, please, take the microphone.

The one with flat leaves

is lilyturf root.

This is lilyturf root and this is asparagus cochinchinensis.

Is she right President Wang?

Right. You are great.

I have seen them before.

Asparagus cochinchinensis

and lilyturf root are roots of liliaceous plants.

Look at the asparagus cochinchinensis.

It is a climbing herbaceous plant.

It looks like fennel, right?

Their leaves are alike.

Lilyturf, also known as 'Yanjiecao' in Chinese,

is mostly used as a ground cover to make the land green.

It looks like leek, right?

We use its roots.

Asparagus cochinchinensis is bigger

and stronger.

Lilyturf roots are smaller and thinner.

Besides their difference in sizes,

Asparagus cochinchinensis is better

in improving yin energy.

Asparagus cochinchinensis is better in improving yin energy. Right.

Asparagus cochinchinensis can moisten the lung and the kidney.

Thus, these two medicines can

moisten the lung and kidney

and prevent coughing.

It`s simple to use.

The clinical effects are quite good. Yeah.

The prescription is simple,

but the effects are quite good.

Good. Thank you.

Just now, President Wang told us about

Asparagus cochinchinensis and lilyturf roots.

Director Liu will tell us how to cook Erdong Soup.

Actually, the prescription was from

'Secret Prescription for Health Improvement'.

Later, it was included in 'Zhang's Medical Prescription',

compiled by Zhang Lu, a doctor in the Qing Dynasty.

It was originally called Erdong Cream,

made from asparagus cochinchinensis and lilyturf roots

with white honey.

The white honey is actually the honey we eat in daily life.

Honey can help improve yin energy,

moisten dryness,

prevent weakness and moisten the lung.

the prescription shown in the video

can help us make

and take the soup.

For more information >> Autumn Dryness for the Elderly : How to Prevent Dry Cough - Duration: 3:00.

-------------------------------------------

What does randomization mean for research volunteers? - Duration: 7:26.

What does randomization mean for research volunteers?

[Intro] Hello!

The Federal Office for Human Research Protections, or OHRP, created these videos to help you

learn more about participating in research.

Deciding if you want to volunteer for a research study can be difficult, and this decision

can have important consequences.

Research that compares interventions or treatments commonly uses "randomization" as part

of the study design, which means that volunteers are assigned randomly to particular study

"arms," or groups.

Which intervention or treatment the volunteers receive depends on the study arm they are

assigned to.

This video provides some basic information about why researchers use randomization in

studies and what randomization means to you as a potential research volunteer.

[What does "random assignment" mean?]

When something happens "randomly," that means it happens completely by chance, and

that no one can predict or control the result.

Drawing numbers out of a hat to separate people into two teams is a random procedure.

So is flipping a coin to decide who goes first in a game.

Randomization is a commonly used procedure in clinical research.

Research volunteers may be randomized to different arms in a study.

This means that a volunteer's assignment to a particular study arm is by chance, and

that it is not planned or controlled by the researcher, the volunteer's doctor, or anyone

else.

Which study arm a volunteer ends up in is random, like whether a coin flip comes up

heads or tails, without any input from the study team.
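
To make the coin-flip idea concrete, here is a minimal sketch in Python of how such a chance-based assignment could be simulated; the volunteer IDs and the even split are illustrative assumptions, not part of the video.

```python
# A minimal sketch of random assignment to two study arms (illustrative
# only, not OHRP guidance): shuffle made-up volunteer IDs and split them.
import random

volunteers = ["P01", "P02", "P03", "P04", "P05", "P06", "P07", "P08"]
random.shuffle(volunteers)        # the resulting order is purely chance
half = len(volunteers) // 2
arm_a = volunteers[:half]         # e.g., receives the current drug
arm_b = volunteers[half:]         # e.g., receives the experimental drug
print("Arm A:", arm_a)
print("Arm B:", arm_b)
```

With enough volunteers, repeated runs of such a procedure tend to produce arms that are similar in composition, which is the point the next section makes.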

[Why is randomized assignment used in research?]

Researchers use randomized assignment to help get reliable answers to research questions.

Suppose researchers want to know if a new drug can help people fight an infection better

than one already being used.

They enroll volunteers who have the type of infection the drugs are supposed to treat.

Then they randomly assign volunteers to one of two study arms.

In one study arm, volunteers receive a drug that is currently prescribed by doctors.

Volunteers in the other study arm receive the new, experimental drug.

Then researchers collect information about how the volunteers in each group respond to

the different drugs.

If the researchers get to decide who gets which drug and don't use randomization,

they might unintentionally give people who seem sicker the new drug—perhaps because

they think the new drug might work better, or maybe they would give sicker volunteers

the commonly-used medicine, because they have more experience with it.

But if either of those things happened, the results of the study wouldn't tell researchers

whether one drug really works better than the other because the volunteers in each arm

are too different from each other.

A difference in results between the two study arms might occur just because one arm includes

sicker volunteers.

To make sure that any differences in results between the study arms are caused only by

the different drugs, the volunteer groups need to be similar in health and other characteristics.

Like the saying goes, it's important to compare apples with apples.

Randomization is supposed to help make the groups more similar.

When volunteers are assigned randomly to the study arms, no one controls which group a

volunteer will be in.

Therefore, as long as there are enough volunteers, the study arms should be similar.

In our example, each study arm would have roughly the same number of volunteers with

mild and serious infections, and be generally similar in other characteristics.

This way, the only thing that is different between the two groups is the drug they take.

The researchers can be more certain that any differences in the results are caused by the

drugs being studied and not the characteristics of the volunteers in the groups.

This is why randomized studies can produce more reliable results.

Sometimes researchers take additional steps to avoid unintentionally influencing the results.

For example, they may design the study so that volunteers won't know, or are "blinded"

to, which group they are in.

Other times, both the researchers and the volunteers don't know which group the volunteers

are in.

This is called a "double-blind" study.

It ensures that no one can intentionally or unintentionally influence the results.

Double-blind randomized studies are one of the best research designs and generally produce

the most reliable results.

[So what does it mean for research volunteers to be "randomized"?]

If you are asked to participate in a research study with a randomized design, here's what

you need to know: • Your assignment to a particular study

arm or group is done randomly, like a coin flip.

The research team cannot choose which group you end up in.

• Similarly, your doctor cannot choose which study arm you end up in, even if she or he

thinks that one group might be better for you than the other.

Your assignment to a study arm is entirely by chance.

• You also cannot choose which group you are in, and you may not get the one that you

want.

• It is possible that the researcher, your doctor, and you will not know which study

arm you are in, and won't be allowed to find out as long as the study is still going

on.

• It is important to remember that, unlike medical treatment, research is not designed

to specifically address your needs and interests as an individual patient.

The care that you receive in a research study does not necessarily put your individual interests

first, will not necessarily benefit you, and could even be harmful, even though there are

protections in place.

Research volunteers can help science answer specific medical or behavioral questions.

Researchers hope that these answers will contribute to a better understanding of human biology

and behavior, and lead to more effective medical treatments in the future.

[Closing]

This video was designed to answer some basic questions about randomization in research

and give you some things to think about.

Deciding whether to participate in research can be hard.

Don't be afraid to ask the research team for more information and talk with them about

your concerns.

It's their job to give you the information you need so you can make the most informed

decision about whether to participate.

OHRP has created a variety of resources to help you think about research participation.

For more information, check out our website at

www.hhs.gov/about-research-participation.

For more information >> What does randomization mean for research volunteers? - Duration: 7:26.

-------------------------------------------

Cloudberry for Interactive Big Queries and TextDB for Cloud-Based Text Analytics - Duration: 1:12:23.

Okay, so it is my great pleasure to introduce Chen Li,

whom I've known, of course, for many years.

He did his PhD at Stanford and is now a full professor at

UC Irvine, University of California Irvine.

And his area, like mine, has been around data management

system, mostly query processing, text analytics.

And query processing means both execution and optimization.

He got the NSF CAREER award,

he was also our program co-chair here for VLDB 20...?

>> 15. >> 15, yes.

And he also did a startup, so he's a full package.

So without much further introduction, Chen,

take it away.

And he's gonna talk about his research projects.

>> All right, Sergei, thanks for hosting my visit.

It has been a while for me.

Last time I visited here was more than maybe five years ago,

and it's good to see all the old friends,

also to see some new faces.

So for the last few years, as Sergei said, I took an adventure

to try to do a startup, to commercialize some of my work.

It's a pretty eye-opening experience,

pretty interesting and, for people who have done

a startup before, like Adon, you know how it feels.

And I came back about two or three years ago,

and that experience taught me many things.

The one thing I learned was being in academia or in general

in this research field, building systems is more exciting.

Even before I did a startup,

together with my colleague Mike Carey, we were building an open

source project called AsterixDB, which I'll talk about briefly.

And after I came back, I continued developing that

project and also building some other systems.

So in this talk, I want to use the time to give you an overview

of what I've been doing for the last few years.

And I want to keep it at a little bit of a high level.

Maybe in one or two places I will get technical, but

you're very welcome to ask about some technical details.

In addition, I talk about two systems, and

both systems have demos.

And roughly,

I want to spend about two-thirds of my time on the first system,

and then one-third of the time on the second system.

But we can obviously

talk offline about some of the issues.

So the first project is called Cloudberry.

And the motivation is about how to support big

queries in sub-seconds, okay?

And I know I talked to some of the colleagues here,

I know there are a lot of work in this space.

And I'm telling you what do we wanna do in this space.

So at a high level,

Cloudberry is a general purpose middleware solution,

which can support interactive analytics and visualization.

And it supports different kinds of backends,

different databases.

Of course we have the bias of supporting AsterixDB, but

it also support other databases.

And we also support different kinds of front ends.

So even though the first demo is a kind of visualization,

I want to emphasize this part is not about visualization,

it's about supporting visualization.

For example, Tableau can be one of our front end users.

Okay, so I'm going to start the first part with a demo.

So the demo I'm gonna show here, both demos, is about social

media analysis, even though both solutions are general purpose.

So for this demo, we call it Twitter map,

it's one application of Cloudberry.

I want to differentiate between Cloudberry and Twitter map,

because Cloudberry can support different kinds of applications.

For this Twitter map demo,

the backend has about this number of tweets,

close to 1 billion tweets, collected over 1 year and

10 months, starting from November 2015.

It is about, roughly, 1% of all the US tweets, okay.

Still, it's a small number, I know many people have seen

bigger scales, but it's a proof of concept.

And plus the backend is using parallel solutions, so

if you have more data,

we just need more hardware, the whole thing is scalable.

So the goal we want to achieve in this project is for this

amount of data, with textual, temporal and spatial conditions,

we wanna allow the user to be able to see the data from

different angles, by submitting different kinds of conditions.

Okay, so my example, currently,

we have this Hurricane Irma coming.

And here, we are lucky because we're in the Northwest,

[LAUGH] very far from Florida.

But let's say we want to see how the social media is talking

about hurricanes.

Okay, so we can just type in the keyword hurricane,

even though you can type in any keyword.

So the user types in this keyword, and

we want to see all the tweets, that is an aggregation results.

The number of tweets mentioning hurricane per state.

And we show the map with the aggregation results.

We also show the histogram over different time periods.

So you can see some of sample tweets.

Here, currently it is like 141.

And this one, 139, is real. I didn't censor the tweets,

okay, so don't be offended by the words there.

It's very organic.

So we get some rough idea about the distribution.

And people may say, what if you look at the population of one

state, and do the division or do the normalization?

We can allow to do normalization,

to see on the average how many tweets per person,

are talking the hurricane in each state.

You can still get the idea.

Still, you can see Texas and Florida, these are two states

that people are very concerned about this topic.

>> Virginia appears to be very concerned also.

>> [LAUGH] >> For whatever reason.

>> [LAUGH] >> Maybe [INAUDIBLE] everything.

>> It's interesting, you see that there are two peaks.

The second peak is not surprising.

There's a peak here, maybe for some reason last summer there

was some, maybe hurricane, that was very troublesome.

So you can pick a range for the time dimension, and

then the results will be updated with those answers within that

time range.

>> And you can easily change your time window,

by changing the start time or ending time, or

you can even slide the window.

So as the user changes the conditions on the time

dimension, and the system can responsively give the results

for this amount of data, okay?

And then let's say,

what if the user wants to go deeper into that data?

So let's say, in Texas, we wanna see within Texas, how the tweets

are geographically located, based on different counties.

So we zoom in, we show all the different regions,

different counties, and

then the color will indicate the density per county.

Okay, and you can go even further down to even the county

level and pick just one of them, Houston.

And then at this level,

now we show the city-level aggregate results.

So now for us, we show three different levels, but

you can define your own hierarchy, and

we allow you to zoom in and zoom out in the space dimension,

and the text and time dimensions.

And the main thing is, for these about 1 billion records,

which is about 1 terabyte of data,

how we can do these kinds of queries within subseconds.

So this is what we want to achieve.

And on the backend, this is kind of a mini cluster we're using.

Now, we moved it into my office.

The backend is running a cluster of six NUC

boxes.

Each one is about $800 with half a terabyte of SSD,

16 gigabytes of memory, and I think maybe two cores,

or four cores, I forgot.

And we heavily rely on the disk or

the flash drive to make it scalable.

Since the whole infrastructure is parallel,

if you have more data just throw in more boxes, but

the total budget is about less than $5,000, okay?

So within $5,000 we wanna deliver this kind of user experience.

So now let's see what's happening behind the scenes.

So this is what CloudBerry is capable of doing.

Architecture-wise, Cloudberry is a middleware

that sits on top of your existing database.

And, the one we saw just now, the back end is using AsterixDB.

Because we have a connector that talks to the database using

our native language.

Two languages, we're bilingual.

AQL and SQL++.

And we also recently just built one more connector or

two more connectors.

One talking to MySQL, one talking to a [INAUDIBLE].

Plus, the requirement is your database should provide, or

your backend, not database, your backend should provide a query

language, either through a database [INAUDIBLE] or a RESTful API, so

that the middleware can talk to you by issuing some queries.

The middleware makes some assumptions.

One assumption is the data, for now, is append only,

because some of the logic relies on this assumption:

once the data is already inserted into your tables,

we cannot modify it anymore.

And we know a lot of use cases satisfy this condition;

we have one more assumption I will show you later.

And the Cloudberry middleware has a lot of optimization

techniques to deliver that kind of

user experience, which I'll explain later.

>> You allow join queries.

>> We allow join queries, we support all kinds of queries.

Some of [INAUDIBLE] >> [INAUDIBLE]

>> Okay, I'll talk about the

front end library, the front-end API here, we use a RESTful API.

But the RESTful API is kind of a SQL, using a JSON format.

Okay, I use this one here.

So for Asterix DB, I think I'll just talk about briefly,

Mike and I have been building, and

some other colleagues from Riverside, other Couchbase.

We've been building this one for last seven years.

At the high level, it combines the techniques from

semistructure data management, parallel database systems,

and Hadoop and its friends into one system, it's open source in

Java; essentially it's a parallel database with all the support:

storage, indexing, query processing, query optimization,

query language, the whole stack, okay?

There's a shared-nothing architecture;

we don't take the other approach, like what Oracle is taking with

some of their products; we just assume people keep throwing in

commodity hardware easily to make it scalable.

These are some of the features AsterixDB has.

A semi-structured data model; it supports B-trees,

R-trees, inverted indexes, all of them; that covers a lot of data types.

We use LSM storage for dynamic data.

Similar to other systems like MongoDB.

And we have our own runtime engine called Hyracks.

We have our own query language.

Previously it was called AQL,

coming from XQuery.

Now, people don't like XQuery, people like SQL, so

we changed it to SQL++.

The plus-plus part supports the semi-structured part, okay?

And we have been working with Couchbase very closely, and

a recent announcement from them says Couchbase decided to

use AsterixDB as the backend to support their analytics.

That's a one-minute overview of AsterixDB.

So let me focus on Cloudberry.

So let me first talk about this API here and

how the middleware talks to the front end.

I use a very simple example.

In an earlier demo we used hurricane, but

here I use Zika, which is very similar.

So, from the front-end application perspective,

it would talk to Cloudberry through a RESTful API because we

know web service architecture is more and more integratable.

Right, you can plug into anyplace.

And plus, by doing this RESTful API, it can kind of provide a

uniform interface for the front end on top of

this layer, which can talk to different databases.

Potentially this layer could even integrate data from multiple

databases, but

for now each instance only talks to a single database.

So this is the query.

In English it says something like, I want to get

the number of tweets per state that mention the keyword Zika.

Okay?

And this is the SQL in a JSON format, okay,

but using a different language.

And we have many different ways to define

certain properties, like which dataset you're talking to,

and what kind of predicate you want to ask or

pose on that table.

We have the group by: on which attribute or

attributes you want to do the group by.

And when you do a group by,

what kind of aggregation do you want to do?

Count, sum, mean, max, they're all configurable.
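
As a rough illustration, a front-end request for that Zika query might look something like the following; the endpoint URL and the JSON field names are hypothetical stand-ins, not Cloudberry's documented API.

```python
# Hypothetical sketch of a front-end request to the middleware: count
# tweets mentioning "zika", grouped by state. All names are assumptions.
import requests

query = {
    "dataset": "twitter.ds_tweet",
    "filter": [
        {"field": "text", "relation": "contains", "values": ["zika"]}
    ],
    "group": {
        "by": [{"field": "geo.state"}],
        "aggregate": [{"field": "*", "apply": "count", "as": "count"}]
    }
}
resp = requests.post("http://localhost:9000/berry", json=query)
print(resp.json())  # e.g. [{"state": "TX", "count": 1234}, ...]
```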

>> [INAUDIBLE] >> User defined aggregates,

so far no.

So we just do standard aggregates.

But I wanna mention a few things the middleware

is doing to achieve that performance.

I mentioned some of them in the meetings this morning.

One thing is caching, not surprisingly.

For example, if a user types in hurricane,

the first time the user types in hurricane,

then the engine would do the work: the middleware would

translate that query into a single query to the backend, or

even multiple queries depending on the semantics.

And then the results of the hurricane query would be

stored as a view and materialized inside the database,

as one of the tables.

And from database perspective, they don't know what it is,

just one of the tables.

So it's up to the middleware to decide the meaning of this view

and how to maintain it.

This is the logic we have here.

And later we'll cover the case of what happens if we don't

have a view available there.

And there's a second technique that we're doing there.

Now at the high level

this one looks a little bit overwhelming.

The main idea is, if the user has typed in a keyword, Zika or

hurricane, and the middleware has decided to put

the results into a view, into a database,

there's still a lot of issues we had to solve, which is, how

can we make the view consistent with the backend database.

Because, our principle is, we want to support real time query

processing, meaning when the user types in a query, we want

to give the user the latest results,

including data that just came in maybe five minutes ago.

To do it, our current solution is that the developer,

the system admin, has to decide a frequency at which

the view and the base table should be synchronized.

So let's assume it's one hour, so

every hour the middleware will trigger logic to tell

the database to put the latest tweets about hurricane,

the delta, and put the delta into the view.

Okay this is done by the middleware.

The remaining issue is, even if the view is maintained in

sync with the base table every hour, what if, say,

now it is like 1:15, right?

The last time it was synchronized was at 1 o'clock.

There are still some records that came in in the last 15

minutes.

So, we also want to include this, the latest entries

that came in the last 15 minutes, in the results.

So, what the middleware is doing is, if at this moment we

ask for hurricane tweets, it will first talk to the view, which

holds the full hurricane tweets, to fetch all the tweets, which is

much smaller compared to the entire 1-billion-record table.

In addition, it will also talk to the base table,

to get all the tweets that just came in the last 15 minutes.

Then the middleware [INAUDIBLE] combination.
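
In other words, the middleware answers one front-end request with two backend queries and merges the partial aggregates. Here is a minimal sketch of that logic; the table names, the predicate, and the generic db.query handle are assumptions for illustration, not Cloudberry's actual code.

```python
# View-plus-delta sketch: combine the materialized keyword view with the
# recent, not-yet-synced records from the base table. Illustrative only.
from collections import Counter

def hurricane_counts_by_state(db, last_sync, now):
    # 1) Aggregate over the materialized keyword view (synced hourly).
    view_rows = db.query(
        "SELECT state, COUNT(*) FROM view_hurricane GROUP BY state")
    # 2) Aggregate over only the recent delta in the big base table.
    delta_rows = db.query(
        "SELECT state, COUNT(*) FROM tweets"
        " WHERE contains(text, 'hurricane')"
        " AND create_at > ? AND create_at <= ? GROUP BY state",
        (last_sync, now))
    # 3) Combine the two partial aggregates in the middleware.
    totals = Counter()
    for state, count in list(view_rows) + list(delta_rows):
        totals[state] += count
    return totals
```

The append-only assumption mentioned earlier is what makes this merge safe: the delta can only add new rows, never change ones already counted in the view.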

>> Yeah, so I have two questions.

>> Sure.

>> So, the first one is: do you use [INAUDIBLE] as the storage engine for

the views and the data,

the same for all the parts [INAUDIBLE].

>> This is the back end, this is the middleware.

The view's maintenance logic is in the middleware.

From an AsterixDB perspective, it is a table.

>> So, this way, the AsterixDB is just storing the data-

>> Yes, just store.

>> More like a storage engine?

>> Correct, correct.

>> The other thing is about merging the results

from the archived data and the new incoming data.

>> Right. >> Do you assume

the data is sort of append [INAUDIBLE].

>> Yes.

Yeah, that's why we made the assumption; we assume the data

that came in the last 15 minutes is not modifying earlier data.

>> Okay.

>> That's why we need this assumption.

Do you have a question [INAUDIBLE]?

>> I was just thinking that, for example,

in many cases the data comes from a streaming system,

and the time for it to get into the cluster-

>> AsterixDB, yeah.

>> Right, there it's gonna be kind of too long.

If you have to pull from, let's say, I don't know something

that's coming from an Event Hub, the Microsoft stuff, or for

example, any of these.

So, an eventing system, then how quickly, that's one, and

the main question there is related to that if your query

is a dynamic query, it's not a registered standing query.

Then many other things that you're talking about the views

are not really actionable.

So, how much flexibility is there on the fly?

So, you have a rich API, but that's different from saying

can I write an ad hoc query just in time and

actually search the history.

[INAUDIBLE] >> Yeah, for

the first one about the [INAUDIBLE].

In this architecture, we treat it as the database's problem.

>> I see.

>> The data comes from the back door.

As long as the data is with me, in terms of architecture,

Cloudberry doesn't care about how it accomplishes it.

As long as it is visible through a query it will get it.

Right, so it's the database's business to make sure,

if you have fast data that comes into the system,

it is visible to the query.

The transactional visibility should be guaranteed.

But Cloudberry doesn't care about the rest.

>> Right, I'm just thinking, for many of the newly coming event

situations,

well, how realistic is that assumption, I mean.

>> Yes, it's because when we developed this software,

we treated your database as a black box.

>> With the rich query.

>> Yeah, it's basically SQL, for normal operations.

This is what we need.

And for the second issue, what

you really mean is kind of a continuous query.

If a user subscribes a

query to the system, whether the system can periodically,

or continuously give the user the update of that query.

Is that what you mean?

>> [CROSSTALK] To answer that question: you have

never seen this workload before, right?

>> Right.

>> So, there's a distinction between a query

template versus the parameters of the query.

I'm talking about the query itself, the template itself,

not the parameter of the query.

>> So far, currently, we have done the view materialization;

we assume each view has a keyword for now.

So, we have a hurricane view, we could have like an Irma view,

we could have a Zika view.

And then- >> Okay, do they have to be

known a priori?

>> No, no, no.

>> On the fly?

>> It's all on the fly.

That's the difference.

I think, in some of the earlier work,

people would have to do some kind of offline work, like

collecting histogram information.

The one principle we follow is, we start from scratch.

The views are created and maintained online.

We don't do offline datacubes.

We don't do offline like workload analysis and

see which of the views should be virtualized.

And one more thing that's very real for our case: at least in

this case, text is very common.

And the user can type in arbitrary keywords.

It's very hard to decide what kinds of keywords to prepare for.

So, we do everything on the fly, okay;

there's no offline preprocessing step.

>> Do you always maintain the view inside the same database

that the original data is in, or do you also provide for

being able to do that in the separate system.

>> Yeah, the question is really about if we always put

the views into the same database.

Currently, we do.

We treat your backend database like a storage layer,

and when I put some results there, you store them.

If possible we can always have a separate database just to

store views.

That's doable, but we didn't do it.

Okay, other questions?

>> So, I'm just curious whether the keywords,

the search is exact keyword search, or more than that?

>> So far we haven't done anything like advanced

expansion; [INAUDIBLE] those are things we have not done yet.

This is only kind of the first step.

I [INAUDIBLE] who is working on this [INAUDIBLE].

We know how to do it, it's just a matter of time.

Next thing I want to say is,

back to your question, right?

What if the views are not available?

The first time a user gives the query, hurricane, there's

nothing inside that database, in the views; how can we do it?

So, people in here are experts in the topic.

So, when we first developed the prototype one year ago,

the first pain point

was the user experience, which was really troublesome.

Because whenever I show the demo,

I always try to show [LAUGH] a keyword I have issued before.

But in the audience,

people can give me an arbitrary keyword;

this is what I usually experienced.

Which is not nice.

So, what we decided to do is this approach,

which I'll explain next.

I wanna go a little bit deeper with this topic,

cuz I talked to some people in this audience.

The reason we take this approach is

we want to treat the backend database as a black box.

We do not want to make any change in the database.

We just assume you have a standard API,

maybe a SQL API, and I want to develop a solution

that does not have any prior knowledge about your histogram.

Can I do this kind of user experience?

Let me show you the example here.

So, what do we do? I'll show you the solution.

What we do is very simple.

It's one more assumption we make.

We assume in your data, or table to be more specific, you can pick

one attribute using which you can slice your query range.

This is the assumption we make.

So in this example we assume,

we pick the time of the tweet as a dimension.

As I mentioned in the meetings this morning,

I have Twitter data for about two years, and if I do a query for

all the tweets, it takes a long time to finish.

Instead, I can first give you the results for

one week, the second week, the third week.

Hopefully, by dividing this big query into these small queries,

we call them mini-queries, each of the mini-queries is

much more responsive.

At the same time, we want to progressively keep it updated.

One very simple solution is the following, and

that is that I do this week by week, so this is the behavior.

In this implementation, our first implementation,

we just used a fixed slicing interval, like one week or one month.

I do one query to give me all the results for January, and

then results of February, results of March and

piece them together.
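
A minimal sketch of this fixed-interval slicing, assuming a hypothetical run_query(keyword, lo, hi) backend call that returns per-state counts for one time slice:

```python
# Fixed-interval slicing: break one big time range into week-long
# mini-queries and push a cumulative update after each one finishes.
# run_query is a hypothetical backend call; all names are illustrative.
from datetime import timedelta

def sliced_counts(run_query, keyword, start, end, step=timedelta(weeks=1)):
    partial = {}
    lo = start
    while lo < end:
        hi = min(lo + step, end)
        for state, count in run_query(keyword, lo, hi):  # one mini-query
            partial[state] = partial.get(state, 0) + count
        yield dict(partial)  # progressive update for the front end
        lo = hi
```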

We see a lot of issues here, first issue is skewness,

data is not evenly distributed.

So of course we assume there is a kind of index structure on

this attribute.

By accessing that index structure,

the mini-query is much faster.

But still even in this case, since the distribution of that

dimension is not uniform, you get this behavior.

Sometimes it can be really fast, sometimes it

can be very slow, especially in the hurricane

example you can see last summer, there's a peak.

Once you go to that range,

then the engine is very busy processing that query,

the user waited for a long time.

This is one drawback, the other drawback is even for

the database itself, even if the distribution is uniform,

still your database is busy with many other queries.

Even in a multi-tenant environment,

your virtual machine probably is serving lots of

different tenants.

So even the behavior

of the database can be very unpredictable.

>> So I think I'm missing the basic idea of slicing.

So from a user's point of view,

let's say I have an average query, average aggregate.

So underneath,

you're doing some slicing that I don't know about right?

>> Yes.

>> So you're giving me an answer for, let's say, some slice and

that has a certain average.

Now once you add the second slice,

that average may go up or down, change drastically.

So as a user, what do I see?

>> You see this.

So what we are essentially doing is giving the results

progressively.

>> No but I'm saying that okay,

if we look at something like online aggregation, there was

in some cases a confidence interval; here we don't have that.

>> This one, to some degree, you can say, this is also one way to

do sampling, I'm doing the sampling very consistently.

And I don't have any assumption about execution,

that's why I cannot give you any confidence.

If you do something like a random sampling or

some of the work the team here is doing.

Or you look at what Johan's team has done 20 years ago

on online aggregation, some of the approaches have

the assumption that the database engine supports random sampling.

And we don't have the luxury here, we want to take

the approach where I just have a database sitting there,

I don't know what's inside it,

and I don't want to modify database, and can I

do something in the middleware that can be very self adaptive?

So in the whole solution, when we start the system,

the Cloudberry system,

it doesn't know anything about the database, except of course the API.

And so the distribution of that dimension, I don't know,

especially since the user can type in any keyword.

I cannot build a histogram for

every keyword, or even combinations of keywords.

Even the behavior of the database engine can be

very unpredictable.

Suddenly, it becomes faster, sometimes, it can be slower.

The middleware should be able to adjust itself to give you

the very smooth experience, this is the approach we take.

And of course, which one is right depends on the scenario.

If you have the access to a source code,

you have the luxury to modify your source code to provide this

random sampling interface, you should go with that approach.

But the approach we take assumes we just have a black box.

>> So the only assumption here is data are partitioned by

time, and so you can query a specific time range?

>> Right, yes, you have one attribute,

using which you can do query slicing.

>> That actually has to match the grouping attribute?

>> Not necessarily.

>> So any attribute works?

>> But you can go deeper to see that.

I think you can come up with some requirements on the slicing

attribute, but in this query, it's grouped by state.

What we're doing is, first we get all the results for

January, and then do the count.

One requirement is the results are aggregatable, summable,

or you can keep some small results and put them together.

But at a high level, the requirement for

the slicing attribute should be very minimal.

>> So like you said, you're taking data week by week right?

>> Right.

>> So I'm just asking if whether it's configurable that

you were taking it week by week, or state by state,

because it could also do it state by state, right?

>> Yeah, so we try to make the middleware configurable

by allowing you to say, I do it week by week or month by month.

But I want to mention that you could take that approach, but

that experience is not very pleasant.

Exactly as I showed here, if you just do it with a fixed interval,

due to the behavior of the distribution and the data

engine, the times of delivered results are not very smooth.

We want to give the user a very smooth, or

what you call a rhythmic approach,

that's exactly the solution we came up with.

We just finished a paper maybe two weeks ago.

So the idea is the following: at a high level, how can the

middleware decide a slicing interval for

the backend to make sure the execution time of the mini-

query can be finished in a fixed time?

Two seconds, suppose when you're developing a system you wanna

say, the middleware should give the front end an update every

two seconds, suppose this is a configuration.

It's more of a dancing rhythm here: how can you make sure

every two seconds you give the front end an update?

Instead of delivering different results at different times.

The key challenge is, how can the middleware

model the behavior of the distribution and

also model the behavior of the engine at the black box and

then be adaptive by itself?

>> Maybe I'm missing something.

There is of course a time constraint of [INAUDIBLE]

seconds, thus if you view that as a constraint, I get that.

Is there an optimizing function?

>> In this formulation,

we don't assume you have to finish by a certain deadline.

We don't have the assumption.

>> You have the assumption?

>> No, you don't have the assumption,

we just want to get the results whatever time it takes,

we have to get the results.

We don't have this deadline notion here.

>> Okay. >> The focus of

this work is more about the rhythm.

How can I give you the results using this fixed,

two second windows?

>> But what is the trade off?

I mean, can you make them every hour, right?

>> [CROSSTALK] >> The slice can be very small.

>> We assume this two second, this parameter,

is given by the system admin.

So as the system admin, when you develop a system, you say I want

to give the user an update every two seconds.

Two seconds is a parameter you can set;

it can be two seconds, it can be one second,

it could be five seconds and- >> That's the update pace, not the

slice time right?

>> Exactly, so

the middleware has to pick the slicing time.

This is exactly the problem we want to solve.

So the admin tells the system:

give the user an update every two seconds.

>> So the question I was getting at,

okay, you have to pick the slice.

Okay, I got it, and we'll talk about that.

But either it is over-constrained here,

or it is a situation where multiple solutions are possible;

I can sort of slice in some 15 ways.

Then are there desirable properties for this?

Is there an optimizing function for

this that you're trying to get right?

And there's the other issue I'm looking at.

Let's say, with this being non-deterministic,

one can get a certain behavior one time and not another time.

And, given that there is no statistical basis for it,

I don't know how you interpret it.

>> That's the purpose of this slide okay?

So using a very simple example let's say

we consider three different ways to deliver results okay?

The first way, I just ask for the whole range, and this is called

a schedule, and this schedule takes six seconds to finish.

But during the six seconds, the user has to wait.

Okay?

A second schedule, magically,

you can give the user an update every two seconds, and

each of these four queries runs in the same amount of time.

Of course, altogether it takes longer time,

eight seconds, but the update is very regular, very rhythmic.

The third approach also issues four queries.

And the first query took one second,

the second query took three seconds,

the third took one second, the fourth took three seconds.

So these four different queries,

they finished at different times.

And whenever you get a result, you give the result

back to the user.

Well, in this one, the user sees the result after one second, and then

the user will wait for some time: after two seconds, no update.

The user has to wait for one more second to get the results.

For the third one, the user gets the results immediately after

one second.

And for the fourth one, the user still needs to wait

one second past the two-second deadline.

So back to your question: I have three different

ways to get results; how can I quantify the quality here?

So we need to come with the notion to say, which is better.

Okay?

And now when you come up with the there is no single answer

here but I give you our answer to the question.

We considered two things in this formulation.

One total running time.

Ideally we want to make it as small as possible, right?

Because overall we don't want the user to wait for a long time.

The second thing we care about is.

If there's a pace, and I say two seconds,

that means every time the user gets an update,

you do not want the user to wait for longer than two seconds.

If the next query comes in after the two-second deadline,

the longer the user waits, the more penalty we have to pay.

So, that's how we formulate the problem.

The second issue is,

what we would call the smoothness of the result delivery, which is this:

Suppose this is your previous mini-query that ends here;

to maintain this pace, we can still wait for some time for

the middleware to get the results.

And suppose this is the window.

The next one starts after the first one has finished.

If this one takes longer than expected, if it finishes here,

then during this period the user has to wait.

So the second factor we have to consider in this whole

quality metric is how much delay the user has to wait.

Ideally we want to minimize the total running time.

We also want to minimize the total waiting

time for the user.

So the solution we come up with or

the answer we have is a combination of the two factors.

The total running time, which we call T, and

the total delay; and we use a factor, a constant,

to decide how much penalty we want to give to a delay.

Of course, we can debate whether

this one is really a gold standard,

but the idea is fair: a combination of those factors.
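
Here is a sketch of that metric as I read it from the talk: the cost of a schedule is the total running time plus a constant times the total user-perceived delay, where a sequentially executed mini-query is late by however much it exceeds the pace. The exact formulation in the paper may differ.

```python
# Hedged sketch of the schedule-quality metric:
#   cost = total running time + alpha * total delay past the pace.
def schedule_cost(mini_query_times, pace=2.0, alpha=1.0):
    total_time = sum(mini_query_times)
    total_delay = sum(max(0.0, t - pace) for t in mini_query_times)
    return total_time + alpha * total_delay

# The three schedules from the example above (pace = 2 s, alpha = 1):
print(schedule_cost([6]))           # 6 + 4 = 10: one big query
print(schedule_cost([2, 2, 2, 2]))  # 8 + 0 = 8:  rhythmic updates
print(schedule_cost([1, 3, 1, 3]))  # 8 + 2 = 10: irregular updates
```

Under this reading, the rhythmic schedule wins even though it takes longer in total, which matches the talk's preference for regular delivery.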

>> But if your queries return quickly, and you use very small slices,

say one millisecond, >> Yes.

>> Then the delay is basically zero?

>> Yes, this will be very low.

>> I see so the cost will be higher.

>> Because you are sending a lot of queries in the background.

>> And how do you model that cost?

>> For now, for our proof of concept, it's just measured by time.

Again, we don't know how many CPU cycles,

how many resources, are available in the database system.

We don't have access.

So either way, it's just measured by time.

Of course, if you have the access to the database

resources, you can come up with a better way to model the cost.

>> So just a clarification again, these mini-queries,

are they being executed in parallel, some of them are, or?

>> It's the previous question.

>> So far, in the current implementation,

we assume they are done sequentially.

We did an experiment where the middleware could send them out

at the same time.

Our finding is you're kind of overdoing your work, right.

You're competing for resources with other people.

So you could do it, but for

our solution we choose to do it sequentially.

Okay.

>> So the metric that you are saying is running time or

is it like?

>> This one?

>> Yeah.

>> It's total running time.

>> But when you're trying to optimize, figure this out,

in the quote unquote in your planning phase,

you don't have access to that.

>> Right we don't have it.

So this is more like a ruler to measure your quality and

at the beginning we tried to formulate

kind of an optimization problem.

We found it's really hard.

Because it's an online problem,

everything will change as you move on to schedule.

So what we do is we take a greedy approach

meaning every time we generate the next mini-query,

based on this function I try to be very greedy.

I don't claim any optimality.

At the end I want to generate a schedule,

I use this function to measure the quality of my schedule.

So I don't go with this.

I'm just greedy.

Okay, so this is the framework of what we're doing;

we call it Drum, which stands for Delivering

Results Upon Milestones.

Okay, so one of the reasons we use a drum is the pace;

you have to keep the rhythm, meet the requirement.

So the idea,

I mean you have seen this structure in many other places.

The whole framework is very adaptive, meaning you have

a query and the middleware will initially generate some very

conservative mini-queries by using very small intervals.

Okay, you must first get some results, and

then you send the mini-query to the backend, and

then the backend returns the results.

It also keeps track of the numbers, the performance.

And that information will be fed into this module

to model the behavior, and that information will be used

by the generator for the next mini-query.

>> So this adapts the width of

the slices that you're using to generate the mini-queries.

>> Correct.

>> Do you also do any adaptation about choosing a different

dimension along which to do the slicing? >> We don't do it.

The reason is from the user experience perspective,

the UI is fixed.

It's always time.

>> Because if you switch to the space dimension,

the UI doesn't match. >> So this is the idea, okay?

So, now let's talk about details.

We want to model the relationship between the slicing

interval and the running time.

So the one thing we do have is a regression model.

So [INAUDIBLE] regression, but

you can [INAUDIBLE] any other regression model.

Any regression model cannot be perfect.

We also have a model for the behavior of the errors,

and I think I have already spent almost 40 minutes.
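
A minimal sketch of this adaptive loop, under my assumptions: a simple time-per-width estimate stands in for the regression, and a fixed safety factor stands in for the error model. The names are hypothetical, not the authors' code.

```python
# Adaptive slicing sketch: estimate seconds-per-unit-width from recent
# mini-queries, then pick the next width so the predicted running time
# fits within the pace, with a conservative safety margin.
class AdaptiveSlicer:
    def __init__(self, pace=2.0, safety=0.8, initial_width=1.0):
        self.pace = pace                  # target seconds per update
        self.safety = safety              # stand-in for the error model
        self.initial_width = initial_width
        self.history = []                 # (width, observed running time)

    def next_width(self):
        if not self.history:
            return self.initial_width     # start small and conservatively
        recent = self.history[-5:]
        rate = sum(t / w for w, t in recent) / len(recent)
        return self.safety * self.pace / max(rate, 1e-9)

    def record(self, width, elapsed):
        self.history.append((width, elapsed))
```

Each observed mini-query refines the estimate, which is how the middleware stays adaptive when the data distribution or the engine's load changes.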

I think you get the main idea of what we're doing here.

So, if you don't mind, okay, let me just wrap up this one and

move on to the next one here.

But you get the idea: we try to

use a middleware to do responsive queries, without

making very strong assumptions about the backend database.

Here is what you do, yeah.

>> Just one question.

>> Sure.

>> Did you do studies showing that users prefer

this constant query response time?

>> We don't; we were talking to HCI folks,

to see who can do a user study. >> I think this needs expertise.

We haven't done it yet.

But from my biased perception, we see

the fixed width doesn't look good.

>> So this is considering the time, right,

so I was thinking always the time might not be the thing.

It might be that [INAUDIBLE] only for one second.

However, the amount of data that comes through in that one second

could be a lot, so shouldn't you consider the slicing

as some estimation of how much data will come in that slice, and

then tie them together?

>> Exactly.

When we decide the individual slice width,

we do consider that factor.

>> How do you make that amount-of-data estimation?

>> The answer is a combination of the regression and

the error model.

Intuitively, if I wanna finish a query

within two seconds, by using the model,

maybe I should use, say, seven days.

But using that window I have a risk of missing the deadline.

So I should actually be a little bit conservative, maybe by using

five days or four days.

So that is already captured by the error model there.

So I'm not targeting just to finish the whole thing in

two seconds.

That's too risky.

I need to be a little bit conservative.

But that's >> So given discussion,

it seems from a user study of this experience seems kind of

crucial [LAUGH] >> Yeah,

that's a real question right there.

>> So, how >> Is so

much tied to the CANR >> Yes, yes, yes.

>> And not so much, >> [LAUGH] [INAUDIBLE]

>> Sir, you are being recorded.

>> Okay yeah, so just answer my question.

>> I think this is not the first time people have suggested that we do

user studies.

I think we should do it, we should do it.

>> Okay.

>> And record it.

So I want to use the remaining time to talk about

a second system.

>> So, all of this aside, this is about slicing,

how much to slice or not.

In this case, as far as I could understand,

wouldn't a simple histogram help you here?

>> I don't have a histogram >> And

was it particular to because in relation to [CROSSTALK]

>> As we discussed this morning,

if the user types in keywords, I can build the histogram for

some keywords, but for ones I don't know, I don't have one.

I start from scratch.

Keywords are different. All right.

So I want to take the remaining time to talk about the second system,

which is to me also very exciting.

I did a lot of work in text processing.

I did a startup about text processing,

and I feel text is a very rich domain.

There are a lot of opportunities.

This new system I am building was called TextDB earlier;

later we renamed it Texera.

At a high level, it is a system that supports

text processing, okay? And I'm going to show you.

The reason is, everybody in this room can write code.

Some people without an IT background

cannot write code.

But how can we make it easier for

these people, including us, to do some text processing very

easily without writing a single line of code right?

If I can finish one thing within five minutes,

I don't wanna do it in one hour.

It's about the cost, okay?

So what we're building is a web service called Texera,

using which a user can easily drag and

drop some basic operators to formulate a workflow, okay?

You don't need to install any software.

Texera is purely web-based.

So here I'm also using Twitter as an example.

And so, I can just pick a scan.

Let me refresh the page.

So I want to allow a person to analyze tweets,

without writing a single line of code.

So I can say I do a scan.

I have a lot of tables there.

I pick one table called the data for last week.

I can just see the results.

We know this is [INAUDIBLE] with a limit.

Top records.

I run the query, and

I should be able to see some of the results within ten seconds.

That's not innovative.

Let's do more analysis.

Suppose I only wanted to look at all the Tweets mentioning

a particular keyword, say, hurricane.

I go here.

I do a keyword search.

I need to specify the attribute and

the keyword, like hurricane, and

I give a name to the result operator,

search results. Then I link the

two operators.

So this is a new query, so you can click each of them,

you see this one should have hurricane here.

Now, another analysis.

Suppose I want to do entity recognition.

Okay, so what I can do is go to this menu option.

I pick entity recognition, and I pick one of the attributes, and

I pick one of the types: noun, verb...

Suppose I want to look at locations, okay, and

I say location results. Okay.
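
The workflow assembled so far (scan, then keyword search, then entity recognition, then results) could be described declaratively roughly like this; the operator and property names are my guesses for illustration, not Texera's actual schema.

```python
# Hypothetical declarative description of the drag-and-drop workflow:
# scan -> keyword search -> entity recognition -> view results.
workflow = {
    "operators": [
        {"id": "scan", "type": "ScanSource", "table": "twitter_last_week"},
        {"id": "search", "type": "KeywordSearch",
         "attribute": "text", "keyword": "hurricane"},
        {"id": "entities", "type": "EntityRecognition",
         "attribute": "text", "entityType": "location"},
        {"id": "results", "type": "ViewResults", "limit": 10},
    ],
    "links": [
        {"from": "scan", "to": "search"},
        {"from": "search", "to": "entities"},
        {"from": "entities", "to": "results"},
    ],
}
```

The front end would ship such a description to the engine, which is what lets the same workflow be shared and re-run from any browser.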

>> Sir,

are you looking for all the results.

Or, like, I want to see, is there a notion of ranking,

is there a notion of top ten, or, because-

>> We have a limit.

You know, we database people are not good at ranking.

[LAUGH] We give you time [INAUDIBLE] [LAUGH].

That's the difference between our database.

You know it.

>> So with entity [INAUDIBLE] >> In my experience,

that is an operator that is incredibly sensitive to context,

and whether I wanna recognize products,

historical figures, all of this stuff.

How can you have a general purpose single entity

recognition operator?

>> This standard entity recognizer is

wrapping Stanford NLP.

We wrap Lucene. We wrap Stanford NLP. We wrap

NLTK; we wrap all those guys.

>> But when you do, it's still... you're positioning this as

a one-stop shop for all sorts of domains and tasks, right?

>> Yes. >> So in a way

whatever you wrap >> Still has to

generalize and- >> Correct.

>> In your experience is that the case or

is there some way to customize the entity recognition [CROSSTALK]

>> It's both,

if you look at Lucene, right,

very powerful search engine: stemming, ranking, fuzzy search,

multi-language support, they're all available.

We just wrap it and bring it back to you.

And some of the operators, like the Lucene one,

are standardized wrappers.

At the same time, if the user wants to add their own logic,

we should allow the user to do it.

>> So wait, so

there's one interpretation of this which is a query builder.

All I'm doing is I have individual components and

I'm just, not necessarily by planning, but

a workflow, kind of, right.

And it's just that, it's a visual thing and

the boxes are what they are.

What is the semantics of this workflow, for

each one figure out what it is and whatever the semantics is,

I take the results and put it in this pipe and that's it, right.

Is this how I should view it or this is something else?

>> It's for text.

>> What? >> It's invented for text.

>> That's okay but.

>> Do we have a solution on the market?

I don't see it.

>> I didn't say I misunderstand everything [CROSSTALK]

>> When you work on

the [INAUDIBLE] that's fine,

but I don't see any solution on the market.

There are solutions that are standalone software packages

like KNIME or Alteryx; these are a few solutions,

either open source or proprietary.

They are not web based, they're not cloud based.

And from a user perspective,

they do not want to install software.

>> Do you have any debugging support for this?

>> We will.

>> To ask the question in a different way, let's say

I firmly believe Vivek's entity extractor is far superior.

>> [LAUGH] >> He's very good at this.

>> What would it take for me to actually integrate one

specific entity recognizer in your software, is that possible?

>> Yeah it is possible, we first of all it took us about one and

a half years to reach this milestone.

Saying and doing are different,

it took a lot of time to reach this step.

A few things that we are doing,

I want to summarize the few things we are going to do.

One is accessibility: you have your own logic or even some package

you want to wrap; we want to make the process very easy.

We want to support Python, we want to support Java and R,

these are common languages.

How can we allow the developer to wrap their own logic into

this whole pipeline?

That's one; we have to do it. Second is debuggability, and

while you have this long-running job on a very large amount of

records, the whole execution can take a long time.

How can you give a user some kind of a progress report

of where you are in the whole execution?

Allow the user to even pause one of the operators and

do some evaluation of the state of that

operator to see those immediate results,

even do some latency to see where that record comes from.

We now allow the user to debug it.

>> It's more than just debugging, right?

If I know, let's say, that all names that have

a middle name that's a single initial are misrecognized,

how do I actually then enable that feedback, right?

It's one thing to recognize that; it's another thing to

actually systematically modify my entity recognition component.

>> So far, we don't have the feedback mechanism;

we don't have it yet.

We talked about the [INAUDIBLE].

This kind of software is very powerful in terms of

allowing the users to highlight some of the places, and

then it can recommend some rules there.

We are not there yet, but it's down the road.

Let me talk about our current focus:

accessibility, debuggability and usability.

And one more thing we really care about is

scalability.

So far the whole thing runs on a single engine, but

nothing prevents us from running one operator in

a parallel environment.

Because we know text processing can be very expensive;

it can take hours or days to finish.

So if we can parallelize the execution of each operator

on a cluster, then that's even more powerful.

These capabilities have not been built yet;

this is only the initial prototype.

And the good news is that

both systems I'm presenting here are used by people.

As mentioned in my talk abstract, we are working with

some UCI researchers; they're not from ICS.

They are working in public health and

politics, using social media to do their analysis.

And they love the first system because you

can easily see the data.

And they also use this one here because they do not want to

write a program, they do not want to install software,

they do not want to apply patches.

And this is a big advantage of the cloud-based trend;

everybody here knows how important the trend toward the cloud is.

And we believe if you take

the idea of either [INAUDIBLE] or [INAUDIBLE],

push it to the cloud, and make it very easy to use,

and make the execution detached from the front end, so

the user can access the execution from anywhere,

and even allow multiple users to share that execution interface,

this is very powerful.

That's why Google Docs is very, very easy to

use: because multiple people can share the same document.

This infrastructure can allow multiple developers to share

the same workflow without using remote desktop, right?

So this is, I believe, a trend, and there are a lot of new

challenges we have to solve in this new architecture;

they are certainly embedded in it.

You give me anything in the field: Lucene, NLTK,

Stanford NLP, maybe Vivek's future model, and we can wrap it.

And then we want to make it extensible by

allowing the user to write their own logic.

So those things are already happening here.

So where is the technical part?

You already got the first technical part;

let me maybe give you one more slide here.

And one interesting thing is,

some people ask the question,

how is it different from a database engine?

At the high level it's all about operators forming a DAG;

at that level they are the same.

But the main difference is, inside a database system,

people don't interact with operators.

People interact with the system using SQL, using a string, but

here you, the user, can interact with each operator, okay?

So the architecture is very different.

A second difference is, inside a database system,

how many operators do we have? Maybe 30 or 40 at the most.

But in this text domain, there are so

many different operations you can do.

The number of operators is much, much larger.

I heard they claim to have more than 1,000 operators for

different purposes.

So we have this whole architecture that allows

a new user or developer

to contribute a new operator to the whole framework,

because the whole thing is very open, very extensible.

Now, when you develop such an engine,

how can we make it easy for a new operator to be implemented?

That's very different from the database case,

where everything is under control, but

here it's more open-ended.

So what we do, in our current engine,

is every operator has a kind of descriptor in JSON format.

The descriptor has all the information about this operator

in addition to the code;

of course you have to have the code logic.

The descriptor could have how many inputs and what the output is,

even, for the front end, what kind of icon you want to use and

what kind of description you want to attach to this operator.

And from a developer's perspective, as long as you use the framework

to describe a new operator with the code and

the metadata, the whole thing will be integrated into

the whole system very easily.
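
To make this concrete, here is a minimal sketch, in Python with invented field names, of what such a JSON descriptor could contain; the actual TextDB/Texera descriptor schema may differ.

import json

# Hypothetical descriptor for one operator: metadata the engine and the
# front end use to integrate the operator, kept separate from its code.
descriptor = {
    "operatorType": "EntityRecognizer",   # identifier the engine dispatches on
    "numInputs": 1,                       # how many upstream operators feed it
    "output": "records with entity spans",
    "icon": "icons/entity.png",           # what the front-end box displays
    "description": "Wraps an NER package and tags entities in a text field",
    "properties": {"attribute": "string", "entityType": "string"},
}

print(json.dumps(descriptor, indent=2))  # what the engine would load and parse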

>> So you're wrapping an arbitrary piece of somebody

else's code.

It may not have the same notion of an operator.

So who has to write that?

>> So first of all, we have not done that UDF part yet, right?

I'm not saying every package is wrappable, right?

A model might not be wrappable.

But there are certain protocols the person has to follow,

like what's the input, what's the output?

Right, so, I want to identify the commonality and

make it easier for the developer to write a piece of code.

But so far, the way we integrated NLTK, which is in

Python, is pretty interesting, because we tried different ways

to include Python in our Java engine.

One approach is to use a Java class to

interpret all the different Python code.

The other one is running Python as a separate process,

and the Python process uses IPC to [INAUDIBLE].

We take the second approach.

But even for the second approach,

there's one question, which is how the JVM and

the Python process share information back and forth.

How can you minimize overhead by doing batching?

Those things need to be figured out.
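
As one illustration of that second approach, here is a minimal sketch, with an assumed newline-delimited JSON protocol (not the actual TextDB code), of a Python worker that the Java engine could launch as a separate process; batching several records per message amortizes the per-call IPC overhead mentioned above.

import sys
import json

def process(text):
    # Placeholder for the wrapped Python logic (e.g., an NLTK call);
    # kept trivial here so the sketch runs without extra downloads.
    return text.split()

def main():
    for line in sys.stdin:            # each line carries one batch
        batch = json.loads(line)      # e.g., {"records": ["...", "..."]}
        results = [process(r) for r in batch["records"]]
        sys.stdout.write(json.dumps({"results": results}) + "\n")
        sys.stdout.flush()            # let the JVM read the reply right away

if __name__ == "__main__":
    main()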

>> It seems to me that [INAUDIBLE] something like

SQL Server Integration Services, which has a very similar sort of

model, in the sense that there are a bunch of built-in operators

which you as a user can connect for your tasks.

And they also have custom script operators, so you can

write your own scripts and put them into the thing, and

as long as they meet certain input/output criteria,

you can plug them in anywhere in the pipeline.

In a sense, those kinds of engines have been built, right?

So what do you see as the new challenges because it's text

processing?

>> I'm not saying this idea of formulating a workflow

using drag-and-drop operators is new.

Microsoft did it.

[INAUDIBLE] did it.

I believe that the uniqueness here is the cloud.

If you pick some idea, you can make a similar argument:

why did Microsoft move from Office to Office 365?

You have to move everything to the cloud, and

that is an architecture shift.

Previously, you ran the whole thing on your single desktop.

I don't know that software you mention.

I saw some GUI to formulate a query.

I saw that interface.

I don't know which one you're talking about.

But in general I believe the software you're talking about

is something running on your local machine.

>> Is this for the ETL scenario, primarily targeted at that?

>> Right, but

I don't know whether it's web based, I don't know yet. But-

>> But

we can think about that. >> Yeah.

>> I guess the question is basically,

because it's cloud based [INAUDIBLE]?

>> I feel few- >> As functionality, we

are fine with the examples you gave of Office 365 and whatnot.

Makes sense, right? Nobody is questioning that.

But from a technical perspective, what's changed?

[INAUDIBLE] >> I don't have a perfect

answer to your question, because we are also exploring.

I really do the [INAUDIBLE] on gut feeling here.

I see people want this.

That is my sales pitch, or non-technical, answer.

If I go deeper, I believe, first, when you go to the cloud,

there is a very big potential to automatically scale up

the whole computation, right?

If you have a job that's very expensive to run,

then in the cloud you have much more freedom to launch

multiple virtual machines to parallelize some of the operators.

The cloud gives you that opportunity.

I know they look similar, but

that's not what they're talking about.

>> The partitioning logic.

>> Yes, that one is a new opportunity you cannot explore

on a single desktop.

In addition, in terms of the user experience,

I think it is very different.

Of course, you may not call that technical, but when designing

the architecture, since we're using this client-server architecture,

the execution of your logic should be detached

from your front end.

You can easily open a new browser and

attach to that execution.

Engineering or technical, we don't know.

But you have to think about this whole thing very differently.

You can say the idea has been implemented by RapidMiner.

RapidMiner runs everything as a Java program

on a single machine, but I believe software like

RapidMiner has spent a significant amount of effort on

even the UIs that are in there.

That's why it's very hard for

them to migrate to the web cloud.

They can't.

They have a burden they have to carry with them.

As for us, we start from scratch.

From day one, we do the web interface.

Maybe one or two years down the road,

I'll have a more technical answer to your question.

>> You're going to have to beat Unix pipes.

It's very interesting to listen to the questions.

Everybody here is a database person.

[LAUGH] >> You are also a database

person.

>> The thing is, Unix pipes would do this kind of thing.

And Unix pipes were actually used exactly for

this, except it's not JSON.

It's a different data model.

It's comma- or

space-separated fields, carriage-return-separated records.

>> I would say that awk is probably [INAUDIBLE].

>> Exactly.

>> Yeah, but- >> And you talk about the cloud

in terms of scale-out, but probably the first thing that

people would worry about is how to get parallelism on

a multi-core machine with shared memory, because there

are a lot of performance issues there.

We know, just for database query processing,

how hard it is to get the most out of the machine.

I have a hard time thinking of text processing that is of

such a scale that you would go to multiple machines to get

there.

>> I have one billion tweets to analyze, and

the user does not want to wait for a long time.

So having the ability to run the whole thing on 100

machines will help for sure.

>> [INAUDIBLE] >> But back to your question,

I agree with you that even on a single machine there is a lot of

potential to parallelize the computation using multiple

cores, but, strategically, back to [INAUDIBLE] question,

I will not focus on that one yet,

because that issue also exists for single-desktop software.

>> So if I understand you right,

what you're really talking about is, I'm trying to see it in

terms of machine architecture in the cloud.

How does it look? Like, if we were deploying your boxes,

I can sort of think of them as a bit like microservices.

I can say, hey, they have their own class, so they have their

own logic, they run in their own VMs.

>> Right.

>> And your stuff runs, and

then if you [INAUDIBLE] service [INAUDIBLE] they send it back.

And you connect all these microservices.

If you really want to be cloud native, you should

think in terms of microservices in that case, right?

But then there are all kinds of issues, right?

In that case, your storage [INAUDIBLE] REST API through

which you're communicating, and that's pretty much it.

So it's somewhat a matter of view, right?

>> Yeah.

>> And who is managing them, and which side of the network are they on?

They may not be on the same rack.

You can't make any assumptions about where those machines are.

So to get there,

the question is how to get the right performance architecturally.

What are you assuming?

So as you said, the model is not new.

We are going to the cloud.

But with that comes a question, what is running where?

What do they share?

Where is the network?

How much is being transferred across the network, and

what are the options?

>> Very good point.

There are a lot of questions about architecture.

We could run each of the operators as

a microservice using some standard REST API.

Currently, we're looking at an architecture where we use,

I think it's called, the actor model [INAUDIBLE]

>> She just talked to Phil.

>> [LAUGH] >> [INAUDIBLE] I'm next on

the agenda.

>> [LAUGH] >> I think that's a coincidence?

>> [LAUGH] >> So

he gets another half an hour extra.

>> [LAUGH] >> So, hopefully by the end of

this quarter,

or by the end of this year, >> we're going to switch

the whole thing to the actor model.

Because the idea is each of the operators runs as

a thread with a separate queue.

All of the different operators communicate with each other by

sending messages to the queues.

A big advantage of this architecture is I can pause it.

Because each of the actors is running on a thread,

like a thread pool.

Currently I'm using a single-threaded pull model,

so we plan to switch to a push model with all the actors running.
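
To illustrate the design being described, here is a minimal sketch in Python (hypothetical names, not the actual Texera code) of operators running as actor-like threads with their own queues, including the pause capability mentioned earlier.

import threading
import queue

class OperatorActor(threading.Thread):
    # One operator = one thread with an input queue and an output queue.
    def __init__(self, name, fn, inbox, outbox):
        super().__init__(name=name, daemon=True)
        self.fn, self.inbox, self.outbox = fn, inbox, outbox
        self.resume = threading.Event()
        self.resume.set()                 # running by default

    def pause(self):
        self.resume.clear()               # freeze to inspect operator state

    def unpause(self):
        self.resume.set()

    def run(self):
        while True:
            record = self.inbox.get()     # message passing via the queue
            if record is None:            # end-of-stream marker
                self.outbox.put(None)
                break
            self.resume.wait()            # blocks here while paused
            self.outbox.put(self.fn(record))

# Wire two operators into a tiny pipeline: tokenize, then uppercase.
q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
tokenize = OperatorActor("tokenize", lambda s: s.split(), q1, q2)
upper = OperatorActor("upper", lambda ts: [t.upper() for t in ts], q2, q3)
tokenize.start()
upper.start()

for line in ["hello world", "actor model demo"]:
    q1.put(line)
q1.put(None)

while (out := q3.get()) is not None:
    print(out)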

So I have a gut feeling, and I don't have any ground

truth to support it, which is that microservices might be too expensive.

>> Okay.

>> That's fair. >> That's my feeling.

I think, I think.

>> I'm not recommending microservices.

I'm just asking you, in this spectrum, where you sit.

>> I guess the actor model should be more efficient, because it is

the same framework using inter-thread communication.

That should be cheaper than HTTP communication.

That's my gut feeling.

We can go to the actor model.

In fact, Cloudberry, which we saw earlier,

is built on the actor model, using Akka in Scala.

In this one, we're using Java.

All right, so I've reached the end of my talk, but

I want to say one more thing, which is, in this text domain,

machine learning models are very commonly used.

And one common question people ask is: in the whole pipeline for

data analysis, machine learning is very important, so

where does machine learning fit into this whole architecture?

So the way I see it is, we have AsterixDB

running at the backend as a database system to do ingestion.

We use Cloudberry to do the visualization.

And then we use the text system, the older one, which we call TextDB;

that does the GUI-based formulation.

And this whole suite of solutions can be used

to help you do the preparation for machine learning.

You can use it to store data, visualize data and

then analyze data, and once you use the whole suite

to prepare some

data or label the data, then you can train the model here.

You get what I mean here.

So you use the whole thing to train the model.

The model is more or less just a file, and

then this model can be integrated back into

Texera as one of the operators,

once we finish the [FOREIGN] feature.

Or, as data is being ingested into the database,

this model can be used as a kind of UDF to do some offline

processing, or even at run time while you do the visualization.

For those tweets you can even use a UDF to do the online labeling.
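
A minimal sketch of that round trip, with hypothetical names and a deliberately trivial "model" so it stays runnable: the trained model is persisted as a file, loaded back, and exposed as a UDF-style function that labels records offline or at query time.

import pickle

# Train once with the prepared/labeled data, then persist the model as a file.
model = {"positive": {"good", "great"}, "negative": {"bad", "awful"}}
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later, the engine loads the file and wraps it as an operator/UDF.
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)

def label_udf(text):
    # The UDF applied per record, during ingestion or visualization.
    words = set(text.lower().split())
    if words & loaded["positive"]:
        return "positive"
    if words & loaded["negative"]:
        return "negative"
    return "neutral"

print(label_udf("This demo is great"))   # -> positive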

So the short message is: the whole suite of solutions and

machine learning are complementary;

we are focusing on the data preparation side.

Okay, that's the conclusion, and these are the acknowledgements;

thank you especially for all the support.

[APPLAUSE]

>> [INAUDIBLE]

>> [LAUGH]

>> Excuse me, I have one question

about Cloudberry.

You mentioned that you don't have random access to the

database, and that's the reason you don't take samples

to answer the queries.

However, you have this database there, you have the data there,

and you have access to the schema of the database to translate

the query to the SQL query, is that correct?

>> Correct, the middleware has access to the data.

>> What I don't understand is that, realistically, you are either the

data owner or a third party, and when you are a third party you get

this data and store it, and now you have access to this data.

So where exactly are you computing it?

>> On some machines. The backend database has a large amount of

data, and the middleware knows the API, the schema,

but the middleware is more like an accelerator to some degree.

It's more like a data warehousing accelerator

that sits on top of the database.

So, previously the application layer talked to the database

directly to ask queries, but those queries can be slow.

By putting Cloudberry in between, which also knows

the database schema, this layer can use various techniques to make

those queries at the application layer much faster.

This is the positioning.
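
Here is a minimal, hypothetical sketch of that positioning (not Cloudberry's actual implementation): the middleware keeps a materialized partial result and, on each request, asks the backend database only for the records it has not yet covered.

# Stand-in for the backend database: (timestamp, state) tuples.
BACKEND = [(1, "CA"), (2, "TX"), (3, "CA"), (4, "NY"), (5, "CA")]

view = {"counts": {}, "covered_ts": 0}   # materialized partial result

def backend_query(since_ts):
    # In a real system this would be a SQL query such as:
    #   SELECT state, COUNT(*) FROM tweets WHERE ts > ? GROUP BY state
    return [r for r in BACKEND if r[0] > since_ts]

def count_by_state():
    # Fetch only the delta, fold it into the view, answer from the view.
    for ts, state in backend_query(view["covered_ts"]):
        view["counts"][state] = view["counts"].get(state, 0) + 1
        view["covered_ts"] = max(view["covered_ts"], ts)
    return dict(view["counts"])

print(count_by_state())     # first call scans everything
BACKEND.append((6, "TX"))
print(count_by_state())     # later calls only touch the new record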

>> They do the same, right,

the [INAUDIBLE] at Twitter?

But they also do the crawling, they have the data, and

they also do the [INAUDIBLE] of those questions.

>> Yeah, Cloudberry is general-purpose software;

it's not just for Twitter.

The database can be anything, so it's not specific to Twitter.

>> Okay, thanks a lot for

a very interesting [INAUDIBLE] >> [APPLAUSE]

For more information >> Cloudberry for Interactive Big Queries and TextDB for Cloud-Based Text Analytics - Duration: 1:12:23.

-------------------------------------------

Senate Pre-selection Endorsements for Mehreen - Duration: 1:46.

I'm supporting Mehreen Faruqi because now, more than ever, we need strong anti-racist

campaigners in federal parliament. No one has done more than Mehreen to improve

the Greens' relationships with multicultural communities across New South Wales and

fight for them. She's the only politician who has made

women's right to bodily autonomy a political priority, put it on the New

South Wales agenda and had it debated for the first time in a hundred years.

I'm supporting Mehreen Faruqi to be the next New South Wales Greens Senator. Mehreen

is honest, she's got the integrity that we'd like to see in all politicians, and

she fights hard for social justice. I want the next New South Wales Greens

Senator to be reliable, to show integrity and to be bold, to pick battles that

others are too afraid to pick, and that is why I'm supporting Mehreen in the

next Senate preselection. In the face of many detractors and much opposition,

she remains steadfast with strength and integrity, and that is what we need in

the federal parliament: politicians who will bring about real change for women.

Mehreen was the first to introduce abortion law reform legislation into New

South Wales, she was the first to introduce legislation to outlaw the shark

fin trade, and she continues to speak up for the rights of animals and the environment. She's made a

real difference when it comes to reforming the greyhound racing industry,

she's fighting hard against deforestation.

I feel represented by Mehreen, she's a principled leader who always does what's

right, not what's easy. Mehreen's the way forward.
