Transcript

0:00
next up we have a speaker from ClearStory Data. ClearStory is another one of the companies that's been championing Spark in the community for the last year to two years plus, and Vaibhav is the VP of engineering and co-founder, so he'll tell us more about their use of Spark.
0:34
thanks Andy. So I'm Vaibhav, and I'd like to thank everyone for joining me today. I am the technical founder and chief architect at ClearStory Data. I'm here to talk about how we use Spark and Shark as part of the ClearStory stack to enable fast-cycle analysis on diverse data, and this data could be private to an organization as well as external data, which could be free or premium. A little bit about our background first: ClearStory Data is a startup based out of Palo Alto, and we've been very lucky to partner with amazing investors like Kleiner Perkins, Google Ventures, and Andreessen Horowitz.
1:06
We have an incredibly diverse team that has experience working on both ends of the big data spectrum, the backend side and the application side. In my case I was an early engineer at Aster Data, building key parts of the database backend, like the integrated MapReduce capability. Other folks on the team have worked on scale-out infrastructure at Facebook and on cloud infrastructure, as well as on intuitive applications like Google Analytics and AdWords. The problem that we're focused on at ClearStory, though, brings a scale-out data platform and an integrated application together,
1:43
and one of the reasons for that is how we have seen the data landscape evolve over the last couple of years. What's becoming more and more evident is that there are many internal sources of data, and there are a ubiquitous number of external sources of data: data that's available over APIs and from premium data providers like Dun & Bradstreet, Twitter, Facebook, and other news sources. And there is increasing demand from different users within an organization, some of whom may not be as comfortable coding in R or Spark, Scala, Shark, or Hive. For them, having access to all these data sets on the fly, being able to converge them, uncover insights, and share them with the rest of the team is becoming a very obvious need, and we're seeing this with our customers and with other accounts in various verticals like financial services and retail companies and food and beverage and CPG companies. It oftentimes distills down to
2:38
these three kinds of problems that tend to appear. The first one is situational analysis, where you have fast-evolving data, maybe streaming in from social media or sensor networks or what have you, and you have to keep updated with what the data brings in and react to it really quickly; it's unreasonable to expect a really slow, tedious process to keep up with fast-arriving data. The next is multi-source analysis, where you need fresh data imported from internal and external sources, and that's where traditionally we have seen many companies have different silos and different views onto these data sources, be it dashboards or BI tools or external third-party tools, and it's really hard to bring these together. And finally, as you start breaking the barrier of what a single BI tool or spreadsheet or R session can handle, you start seeing really high volumes of data, and this is where you see gigabytes to terabytes of data in data sets coming from Hadoop systems and NoSQL stores, of course. So essentially the problem that ClearStory is trying to solve is to provide a fabric that sits on top of your internal as well as external sources, and to provide a very easy, intuitive way in which users of different skill sets can converge these data sets on the fly and uncover insights.
3:54
Let's look at a canonical example of what I mean by multi-source analysis, and for this I'm going to go back to an event that occurred not long ago which should be familiar to all of you here. It was impossible to miss: this was the Batkid event. Now this was an amazing event: that Friday the entire city of San Francisco transformed into Gotham to have this tiny Caped Crusader fight crime. This was organized by the Make-A-Wish Foundation, and this is a hypothetical scenario, note that they didn't use ClearStory for this, but if they were a data-driven organization that wanted to analyze the data coming in from different sources, how those impacted this event, who the people involved were, what the overall sentiment was, and maybe learn from this for future events like these, this is the sort of process they would go through. The event started with Facebook, where this was announced, shared, commented on, and liked by people. There's an API that these are exposed through, so you can get this data if you want to understand the spread of it. Obviously a very nice Vine by President Obama
4:57
opened it up to even more people. So if the Make-A-Wish Foundation was using a web analytics service to analyze the clickstream data, using these data sources they can understand where the traffic is coming from, what the overall sentiment is, what groups of people cluster together, and who the people are who are responsible for driving traffic to their website. The other source is that of corporate sponsors. The Make-A-Wish Foundation has many of these, and as they run more of these events in the future it'll be interesting to see what kind of audiences react to what kind of events, and to see whether employees from these organizations are signing up and looking at the website. Moving on to Facebook or Twitter messages: one thing that happened during this event was there was some negative criticism being levied against these guys for the amount of money that was spent on this campaign. So overall there was a very positive message here, but if something like this happens it's really important to get ahead of it and react to it, and the time window for reacting to negative events like these is really shrinking because of how Twitter and Facebook work and how fast information spreads today. The Make-A-Wish Foundation did a really good job here: they looked at news sources and blogs, understood the sentiment coming from these, and reacted to the negative coverage that was coming in. And finally there are donations coming in from new members, so as more and more of these events take place it's important to understand what specific users are reacting to and how that can be leveraged
6:31
going forward. So you can see that many different sources are being used here, and many of these have different formats, be it raw feeds coming in as JSON or pieces coming in through a BI tool. The internal data sets could be changing at a different rate, but there are legitimate use cases where different members in your organization want to analyze this traffic across all of these sources, and it's not really feasible to get all of them into a single system and then build something on top for these users to analyze.
6:58
So if you take a step back, and this is a bit of a generalization but bear with me, in the traditional way of performing this kind of insight derivation from data there was a very linear process: you start with a question, you approach someone with access to the data, an IT team or a data science team, and then they go and build it and do the work for you, and it's really hard to change that. The process is lossy, and you oftentimes don't know what the different moving parts are. In the modern data landscape this process has many more degrees of freedom now: if you start with a question you might have to talk to many people individually across different teams and get access from different silos, because there are different actual resting places for the data. It may be an Oracle database for your transactional data, it may be Hadoop, or an enterprise data warehouse for something else, and movement of data, even though everyone wants it to flow freely, is not really that straightforward if this data needs to be converged. And maybe there is some external data in there somewhere, being brought in via a spreadsheet, or data purchased from premium partners, that you then can merge to finally get an insight. So if you have a platform that makes this process really, really simple, allows you to bring this data in quickly, and reduces the cycle to convergence, that's something that can cause this process to grow wider within the organization, have many people collaborate, have people sharing insights, and have an overall amortization of effort, and that's the problem that we're trying to solve at
8:26
ClearStory. And an important part of our hypothesis is that Spark and Shark are very well suited to perform this kind of converged analytics. So if I gloss over the questions of how we actually get that data and how we share that data, we'll come back to those later; let's see why Spark and Shark fit the bill very well here. Obviously one of the pieces is the power of in-memory resilient distributed datasets as the computing unit of Spark, and Spark is something that works very well with the low-latency, interactive, and iterative algorithms that you need to apply. You need the flexibility to apply these operations, which are not known in advance, so you have a little more flexibility there. The RDD lineage is also important: it knows where the data comes from and the kind of operations that were applied to it, and that's where fault tolerance comes in, so if due to failures or due to memory pressure you lose access to that data, you can go back to the source, apply those transformations, and re-derive the data as it was at that point.
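As a rough illustration of that lineage idea, here is a minimal Scala sketch in the spirit of the Spark 0.8-era API; the file path and filter are hypothetical. Each transformation only records a step in the lineage, and a lost partition can be rebuilt by replaying those steps from the source.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair-RDD operations like reduceByKey

object LineageSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "lineage-sketch")

    // Base RDD read from a (hypothetical) source file; nothing is computed yet.
    val tweets = sc.textFile("hdfs:///events/batkid/tweets.json")

    // Each transformation just adds a node to the lineage graph.
    val mentionCounts = tweets
      .filter(_.contains("#SFBatKid"))
      .map(tweet => (tweet.split("\\s+").length, 1))
      .reduceByKey(_ + _)

    // Caching keeps the result in memory; if a cached partition is later lost
    // to a failure or memory pressure, Spark replays the recorded
    // transformations from the source to re-derive just that partition.
    mentionCounts.cache()
    println(mentionCounts.count())
  }
}
```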
9:30
The other unifying thing is that you're not really restricted to working with one kind of programming paradigm. SQL is great when you're doing aggregations and filtering and joins, and maybe even user-defined functions as an extension; these are operations that most business users can conceive very easily. But it doesn't quite fit the bill for things like machine learning, graph computation, maybe approximate queries, segmentation, and data mining, and that's where Spark is going to really shine.
10:01
Spark also doesn't mandate that the data be in a particular format, so if you have data lying around in HDFS, or, as a lot of folks are doing, in Cassandra or S3 or flat files or any other sources, you can easily bring them together as RDDs, and that can scale across multiple machines and really account for the variety and volume of data.
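A minimal sketch of what pulling from more than one storage system into RDDs might look like in Scala; the paths, field layouts, and join key are invented for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object MultiSourceSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "multi-source-sketch")

    // Internal clickstream sitting in HDFS, keyed by campaign id.
    val clicks = sc.textFile("hdfs:///warehouse/clickstream/2013-11-15/*")
      .map(_.split("\t"))
      .map(cols => (cols(0), cols(1)))       // (campaignId, referrer)

    // External feed staged in S3 by an ingest job, keyed the same way.
    val social = sc.textFile("s3n://external-feeds/twitter/batkid/*")
      .map(_.split(","))
      .map(cols => (cols(0), cols(2)))       // (campaignId, sentiment)

    // One logical operation over data that lives in two different systems.
    clicks.join(social).take(10).foreach(println)
  }
}
```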
10:23
And one of the important things is the open source community. I've been working with open source for quite a while, starting with the Linux kernel and the Postgres database, and it's really important to have the right alignment with the community, because otherwise you end up having to maintain all of that code yourself. For a rapidly moving community like Spark's, it's really important to stay close to where the current work is, to avail yourself of the performance fixes and new features that are coming in.
10:54
And the rest of the Berkeley Data Analytics Stack, the BDAS stack, exposes things like BlinkDB and MLbase and GraphX, and you can easily see how even the simple Batkid example benefits: regressions and clustering are something that can easily help somebody in marketing or similar teams to really be extremely effective at their jobs.
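As a toy example of the kind of segmentation this enables, here is a hedged sketch using MLlib's k-means as it looked around Spark 0.8, where points are plain Array[Double]; the features and file are made up.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans

object SegmentationSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "segmentation-sketch")

    // Hypothetical per-visitor features: visits, shares, donation amount.
    val visitors = sc.textFile("hdfs:///analytics/visitor_features.csv")
      .map(_.split(",").map(_.toDouble))

    // Cluster visitors into 4 segments with up to 20 iterations of k-means.
    val model = KMeans.train(visitors, 4, 20)

    // Tag each visitor with a segment a marketing team could act on.
    visitors.map(v => (model.predict(v), v.mkString(","))).take(5).foreach(println)
  }
}
```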
11:17
So okay, now I hope I've convinced you that this is something that Spark and Shark are really well suited for. What does the ClearStory solution look like? Let's go back to the problem that we pondered earlier, about getting access to data,
11:30
and one of the things that we have really invested in is being able to easily access data from different sources. That's what we call our data ingest and profiling engine, which is built on top of a different open source technology, Storm, coming from Twitter. What this lets us do is go and pull in data from different sources, which could be internal information in databases or files or Hadoop systems, as well as external data sets that are exposed through APIs, and bring it together and cache it as part of our in-memory, scalable data tier. Along with the cached data we also harvest some more metadata and signals about this data, because remember, one of the priorities is that we want to make this really easy for business users. We don't want to leave the last mile of this analysis to them; we want to help them by making recommendations about the right kind of data to look at while they're analyzing. An example might be: say you are drilling down to analyze tweets coming from California, you might want to look at donors from California, or weather from California, or any of these other external data sets, and this fosters the process of collaboration. Granted, correlation doesn't imply causation, but having these data sets brought together in context makes everyone in the organization really very effective.
12:53
Access to this data, once it's in the system, is enabled to the application through a process that we call harmonization, and that's something that's really near and dear to us; you'll hear us talk about it a lot. What harmonization means is that we let users very easily bring multiple data sets together. This is where we look at the signals I mentioned: how time information in one data set maps to time information in another, or comparing data at, say, the state level to data at the county level, and having those bridging operators available so that as an end user you don't have to specify that yourself. That makes everyone really very effective, solving for the last mile.
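To make the idea concrete, here is a toy Scala sketch of one such bridging operator, rolling two pair RDDs with millisecond timestamps up to a shared day-level key before joining them; the input shapes are assumptions, not ClearStory's actual operators.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._   // pair-RDD operations

object HarmonizeSketch {
  // donations: (timestampMs, amount); tweets: (timestampMs, text)
  def harmonizeByDay(donations: RDD[(Long, Double)],
                     tweets: RDD[(Long, String)]): RDD[(Long, (Double, Long))] = {
    val msPerDay = 86400000L
    val donationsByDay = donations
      .map { case (ts, amount) => (ts / msPerDay, amount) }
      .reduceByKey(_ + _)                  // total donated per day
    val tweetVolumeByDay = tweets
      .map { case (ts, _) => (ts / msPerDay, 1L) }
      .reduceByKey(_ + _)                  // tweet count per day
    // Day-level donations sitting next to day-level tweet volume.
    donationsByDay.join(tweetVolumeByDay)
  }
}
```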
13:32
So now, with yet again a slightly more conceptual diagram, this is roughly the picture of where Spark and Shark actually fit. Starting bottom-up, you see that's where we have the data sources, and since the ingest layer is an extensible API built on top of Storm, we can easily add connectors to support different data feeds. Along with the cached data itself, remember, we retain not only where the data comes from but the metadata associated with it as well, so we record things like whether it has time information or geo information, categorical hierarchies, entities, and things like that. Now this is available to our Spark and Shark cluster; that's where the cached data lives, and we write data as RCFiles right now, since the columnar format gives performance benefits, and we're looking at newer formats like Parquet as additional options we could support.
14:25
But all access to data, and how the convergence happens, is driven through the end-user application. We don't expose a programmatic API; instead we expose a browser-based application where users can collaborate and bring in their own private data sets, and all of these actions are captured and converted into an internal representation, what we call the ClearStory API, which gets planned, optimized, and executed on this Spark cluster by the harmonization engine.
14:54
Taking a slightly deeper look at how we leverage Spark and Shark underneath: as I mentioned, access to the data in this environment is through user actions. These are familiar to business users: they can view maps and bar charts, they can zoom in and drill down, and they can do more complex things like merging multiple data sets and clustering, and that's where a lot of the complexity is really masked, which makes it accessible to a wider audience of people. All of this gets captured and converted to an internal query representation.
15:25
Now these queries get sent back to our harmonization engine, which is implemented as a RESTful service. Maybe it's analogous to a Shark server, or to the Spark job server that the folks at Ooyala are building, but suited to our needs. It takes the custom API job specification from the front end and then implements it as an abstract syntax tree. One of the things that we want to do, because of the benefit of being able to express operations in Shark as well as Spark, is mix both: in some cases we see performance benefits from having parts expressed as native Spark operations. So the job spec gets planned out as an abstract syntax tree and fanned out to the cluster, and the harmonization engine can keep track of how far along the job has progressed, whether it's already done or whether it should be cancelled, and then eventually the results stream back to the application.
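Purely as a hypothetical sketch of the idea (none of these names come from ClearStory's actual API), a job specification parsed into an abstract syntax tree might look something like this in Scala, with a planner that decides per node whether a step runs as a Shark query or as native Spark code:

```scala
// Hypothetical job-spec AST: leaves reference data sets, inner nodes are
// relational or Spark-native operations.
sealed trait Op
case class Source(dataSetId: String)                       extends Op
case class Filter(child: Op, predicate: String)            extends Op
case class Aggregate(child: Op, key: String, agg: String)  extends Op
case class Join(left: Op, right: Op, key: String)          extends Op
case class Cluster(child: Op, k: Int)                      extends Op  // Spark-native step

object Planner {
  // Walk the tree and decide, node by node, whether the step is expressible
  // as a Shark query or needs to run as native Spark code.
  def plan(op: Op): String = op match {
    case Source(id)         => s"scan($id)"
    case Filter(c, p)       => s"shark-filter(${plan(c)}, $p)"
    case Aggregate(c, k, a) => s"shark-agg(${plan(c)}, $k, $a)"
    case Join(l, r, k)      => s"shark-join(${plan(l)}, ${plan(r)}, $k)"
    case Cluster(c, k)      => s"spark-kmeans(${plan(c)}, k=$k)"
  }
}
```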
16:22
We have done a lot of performance optimizations and extensions, because latency is a really important factor in our workload: being a product with a face, having an application, it's important to get feedback right away and it's important to get results. Also, a lot of the analysis that users do in ClearStory is exploratory in nature, so you might not always get a specific insight from converging multiple data sets, and in that scenario you almost have to fail fast very quickly: you need to see whether RDDs make sense composed together, and if not, try something else. That's where being able to mix and interleave execution in Spark and Shark within the same query really gives us a lot of bang for the buck. As you can imagine, a lot of our workload skews towards reads: in general there are many consumers of data while a smaller subset brings in new data, so caching gives us a lot of benefit as well, and an extension such as Tachyon can give us a lot of additional performance benefits; we're looking at integrating it. Since we can also understand how users are interacting with the system, we have a very rich signal as to which data sets are popular, which conversions on data sets are popular, which filters and projections and so on, so we can precompute a lot of these to give really low latency to end users working with the system.
17:49
Now the results that are returned back to the application are something that users can visualize, so we have descriptive statistics about the data sets as well as things such as maps and bar charts, where users can see what's going on with the data as it updates, and they can comment and collaborate with others in the organization, and we can loop in the few folks who are interested in looking at a given result.
18:16
Again, because of the usage data on how users are interacting with the system, we know which RDDs are more popular and so on, so we can use that as a cost-based optimization to cache RDDs dynamically without having the user explicitly spell it out to us. We can cache and uncache RDDs, and we can materialize them, so that further iterative operations on them happen really quickly.
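A hypothetical sketch of that usage-driven caching decision in Scala; the thresholds and bookkeeping are invented, and only cache and unpersist are real Spark calls.

```scala
import org.apache.spark.rdd.RDD
import scala.collection.mutable

// Track how often each derived data set is touched and cache or un-cache it
// without the user asking. Thresholds are illustrative only.
class CacheAdvisor(hotThreshold: Int = 3) {
  private val hits = mutable.Map[String, Int]().withDefaultValue(0)
  private val cached = mutable.Set[String]()

  def touch(name: String, rdd: RDD[_]): Unit = {
    hits(name) += 1
    if (hits(name) >= hotThreshold && !cached(name)) {
      rdd.cache()            // popular: keep it in memory for the next query
      cached += name
    }
  }

  def evict(name: String, rdd: RDD[_]): Unit = {
    if (cached(name)) {
      rdd.unpersist()        // gone cold: give the memory back
      cached -= name
    }
  }
}
```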
18:42
Now, as all of this is going on, remember that there are degrees of freedom both in what users are looking to do and in the new data that is being brought in. So even in the Batkid example, it might be tweets coming from the firehose, or you might have your backend clickstream data updated every ninety minutes or one hour, whatever the delivery interval might be. As this data is changing you need to update your views of what the users are looking at. That is also something the harmonization engine handles, by updating RDDs as new data gets brought in, and this results in a trickle-up update for the data set and for the views or visualizations that users are looking at.
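A minimal sketch of that refresh step in Scala, under the assumption that a new delta is simply unioned with the cached base RDD and the result becomes the next cached generation; the names are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object TrickleUpSketch {
  // base: cached (key, count) pairs; delta: the newly ingested batch.
  def refresh(base: RDD[(String, Long)], delta: RDD[(String, Long)]): RDD[(String, Long)] = {
    val updated = base.union(delta)
    updated.persist(StorageLevel.MEMORY_ONLY)  // new generation of the cached data set
    base.unpersist()                           // drop the stale generation
    updated                                    // downstream views recompute from this
  }
}
```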
19:24
The final point is a little orthogonal to how Spark and Shark work, but an important point nonetheless. We have invested a lot in making it really easy for us to pick up the latest changes from the AMPLab and the community for Spark and Shark, because as an open source project evolves it's important to benefit from new features, performance fixes, and bug fixes, and one of the things that we have invested in heavily is being able to operationalize Spark and Shark. Since we started working with the AMPLab stack very early on, we have our own packaging for these and the other projects that we use heavily, so that they can be easily deployed on our production and development clusters. We can monitor things using metrics and Sensu, and collect logs using Logstash, and this sort of classic DevOps way of thinking about things makes us really comfortable benchmarking the latest versions of Spark and Shark, ensuring that there is no regression in the workloads that we're seeing, so that we can easily pick up new changes, upgrade our production clusters, and have our customers using the latest version of Spark very, very easily. Now this is something that we are still in the process of packaging and testing ourselves, but it's something that we're considering contributing back to the community, if there is interest in having Chef cookbooks, data bags, and recipes that make it very easy to get started with Spark and Shark.
21:00
Looking forward to the roadmap for the AMPLab and Databricks now: there are things that we really like, and the list is a great one, but things like cancellation and progress indication are extremely important in our scenarios. For long-running queries you might want to get an asynchronous notification when a query finishes; this is something that we are very close to integrating as part of our production workloads.
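As a generic sketch of that asynchronous-notification pattern in Scala (not ClearStory's actual service), the long-running query runs on a background thread and the caller is notified on completion instead of holding a request open; runQuery and notifyFrontend are stand-ins.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

object AsyncQuerySketch {
  // runQuery stands in for a long-running Spark/Shark job; notifyFrontend for
  // whatever push channel the application uses.
  def submit(jobId: String,
             runQuery: () => Seq[String],
             notifyFrontend: (String, Seq[String]) => Unit): Unit = {
    Future { runQuery() }.onComplete {
      case Success(rows) => notifyFrontend(jobId, rows)
      case Failure(err)  => notifyFrontend(jobId, Seq("failed: " + err.getMessage))
    }
  }
}
```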
21:24
Performance, and lowering the cost of concurrent access, is something that's really important, and we know that it can be improved even further. We are seeing some of that with recent commits, especially with respect to multiple users, and that's something that we're optimistic will get fixed over time, and it's also the kind of thing we can contribute back as well. As we see more and more enterprise use, and our customers are Fortune 500 companies, it's important for them to have operational uptime, predictable latencies, and SLAs, so being able to push beyond what the fair scheduler in Spark and the Mesos scheduler offer, to have more formal workload management, is something that is really critical.
22:11
Of course, there's the rest of the AMPLab stack. I'm expecting BlinkDB to give us approximate queries, so you can start with a sample of data and then, if something looks interesting, run it on the entire data set; machine learning, and making it accessible through an easy, intuitive user interface so that it's not very confusing to business users; Tachyon, which I mentioned earlier, something you can use to easily cache data sets and RDDs in memory so that you can go back and reuse them; and GraphX, having graph-based, network-based computation for clusters of users, for supply chain analysis, and other use cases that our customers are asking about.
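The sample-first idea mentioned above can be approximated even without BlinkDB; here is a hedged Scala sketch using plain RDD sampling, with a made-up threshold for deciding when the full scan is worth it.

```scala
import org.apache.spark.rdd.RDD

object ApproxFirstSketch {
  // Answer on a small random sample for a fast estimate; only scan the full
  // data set if the estimate looks interesting. BlinkDB adds error bounds;
  // this is just RDD.sample for illustration.
  def approxThenExact(events: RDD[String], keyword: String): Long = {
    val fraction = 0.01
    val sampled = events.sample(false, fraction, 42)
      .filter(_.contains(keyword))
      .count()
    val estimate = (sampled / fraction).toLong
    if (estimate > 10000L) events.filter(_.contains(keyword)).count()
    else estimate
  }
}
```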
22:53
So, to give a quick recap, that's what I had for today. I'm really interested and excited to work with the community and contribute back. This first Spark Summit is a major milestone already, and I'm personally looking forward to more; I believe we have barely scratched the surface of what we can do with Spark and Shark, and I'll be looking forward to more meetups and more summits. If you guys have more questions you can feel free to ask me right now, and there's a bunch of us here from ClearStory, including Mark, John, and Brian; I see a few of them, so you can catch us later as well. Thank you.
23:37
Thank you very much. We have time for questions.
24:01
So, in your opinion, for a new company that is trying to put this into production, what are the areas to invest in the most, and what are the pitfalls you've run into, or that you expect a new company would run into?
24:15
Great, so yeah, it's a good question. As I mentioned earlier, there was a lot of focus that we put in initially, up front, on making sure that we can get the right insight and the right signals from various things, whether it's crash reports or things not working, and on identifying bottlenecks very easily and very quickly, so we invested in making the DevOps side of things a lot more streamlined, and in having packaging, Maven-based builds as well as RPM packages, which gives a lot more flexibility now as we look at new releases coming in. With a product that's evolving as quickly as Spark, as well as Shark, it's really very important to have enough operational visibility, to be able to debug issues very fast, and to be able to get the logs very easily. The community, of course, has been a big help whenever you're stuck with things, and the code base is also pretty easy to modify and rebuild yourself. So one of the things that helped us was to have enough operational visibility into how Spark and Shark are utilizing resources on a cluster, how they're scaling as data sizes grow and as the cluster size grows, and where you're starting to hit bottlenecks.
25:37
You mentioned using Chef for infrastructure; have you already open sourced your cookbooks? — Not yet, but we should. We have cookbooks for most of the other components we use, as well as the rest of the BDAS stack; it has been key for us to have consistency in how we deploy and develop, so that's something that I'll definitely take as feedback back to the team, and we'll try to get that done sooner. Alright, alright.
26:09