Transcript

0:01
Much of what we're going to talk about today is why you need a solution like Spark when it comes to big data, and how this fits with what customers and organizations today are really struggling with when it comes to actually accessing and working with a lot of data at scale.
0:21
So I'm going to give you a quick overview of what ClearStory actually focuses on, but to be clear, the purpose of this session is not so much to get into the specifics of what ClearStory's solution is all about, but rather what organizations are really struggling with today and the type of analytics, interactivity, and speed of analysis that they're looking for. Spark plays a big role in that, and we think it's going to be very critical when it comes to the big data problem moving forward. So, a quick overview on ClearStory: we are specifically focused on solving the problem of converging data from disparate sources, both internal and external. Internal sources could be sources like Hadoop, legacy data warehouses, databases, et cetera, along with a host of external sources that companies and users want to bring into the mix with their private data, to drive what we call a converged insight, or a connected insight.
1:26
This is becoming a bigger and bigger problem as organizations deal with a lot of heterogeneity in that data and try to tap into very disparate sources.
1:38
So what are some of these sources I'm talking about? When it comes to private data, I think you're all very familiar with the common sources of data that companies are trying to tap into, including Hadoop platforms such as those from Cloudera, Hortonworks, MapR, et cetera, as well as pre-existing data housed in their data warehouses and RDBMS systems, and cloud-based applications such as Salesforce.com, Google Analytics, and such. So there are a lot of private data sources emerging across organizations, and that has created all these islands of data everywhere. We're at a point now where organizations are trying to bring it all together in a more intelligent and smart way, so they can actually drive a converged insight.
2:25
In addition to that, there's another big phenomenon happening, which is that there's been an explosion in the amount of public and external data available over the web. If you look back about five or six years, there were fewer than about 100 open data APIs on the web; today there are over 8,000 open data APIs. Not all of these are necessarily super high value to Global 2000 companies, but there are many of them that do offer a lot of data these organizations are trying to tap into, data that's highly valuable when brought into the context of their own private data. So driving and enabling this convergence, merging these worlds of private data with public and web data, is a big phenomenon happening out there that we really believe in.
3:16
The interesting thing is, when you look at all of this data out there, especially on the external side, only about three percent of the data is actually tagged today, and only about half a percent of it is actually brought into the kind of analysis that large companies are trying to do. The reason for this is that it's very hard to tap into a lot of those sources, and there's no good way to do it, because you have to go to each unique source and figure out all its nuances. So trying to smooth over and address that issue, making all this external data more usable in the context of private data, is definitely one of the things that we at ClearStory see as an important factor moving forward when it comes to solving one of the problems in the big data world.
4:03
But when you try to do all this and bring all this data together, whether it be data from the web (searches, clicks, CRM data, social data, etc.) combined with private data from RDBMS solutions, Hadoop, or CSV files, combined with premium data sources that may contain market intelligence data, or paid data, or trial or public data that may include things like census data and weather data that's openly available on the web, when you try to pull this all together, not only are you dealing with data at large volume, at a lot of scale, but in addition the users inside most organizations that are trying to drive this converged analysis have expectations of a platform where they want to be able to see things as they unfold. They want a high degree of interactivity with all this data; they basically expect to be able to get insight very quickly.
5:04
And that's really where Spark comes in: solving the problem on top of big data platforms and bringing more interactivity and low-latency processing to these issues, so that you can now enable users across organizations to see data when they need to see it, to see it faster, and to bring a high degree of interactivity around converged analysis.
5:28
In terms of the specific examples that Stephanie is going to talk about, and again we highlight just a couple of examples here out of many, many more, the purpose of these particular examples that Stephanie is going to go through is, again, to shed light on what companies are trying to do, on the type of demands they have on big data platforms today that require solutions like Spark and Shark, and on how Matei's team at the AMPLab has been addressing this, what they're building, and what they're working on that we think is going to be very critical moving forward.
6:09
So first, when you look at multi-source analysis, as I talked about a second ago, and you bring data in from many different sources, customers expect to be able to actually interact with it in a very fast, low-latency kind of manner, because they want to be able to answer questions at the right time, and sometimes in near real time. So a high degree of interactivity, low-latency processing, and near real-time types of answers across multi-source insight is very important.
6:40
The second big area that comes up a lot is this notion of live situational analysis, which means that as events are happening out there and you're tapping into data from multiple sources, you want to be able to see insights as they unfold. So literally letting business users see insights as they happen, as data starts entering the environment from multiple sources. That's what we sometimes call live situational analysis; there are many other terms for this, but the most interesting point here is the speed and the level of interactivity that people expect when they're doing this type of analysis.
7:18
And then finally, some level of automated analytics, almost self-learning analytics, where you bring together data from many sources, you start understanding basically some of the aspects of the data from these sources, and you learn as you go along the types of trends to look for. To some extent this gets into some level of predictive analytics as well, but it's really about automating the notion of being able to see trends as they're happening.
7:51
So these are three examples, and Stephanie will give you some details around each of them, but the main point is that there are many, many examples out there of the type of big data analytics that companies are trying to do, this year or next year, et cetera, and that are very relevant to how Spark and Shark actually address this problem.
8:20
So we believe that MapReduce, being a great framework on top of Hadoop, has its purpose in the big data world: it's great for batch processing, and it's great for many other types of analytics. But when it comes to the type of analytics that requires near real-time answers, a high degree of interactivity, and low-latency processing, we believe that technologies like Spark and Shark are going to play a very important role moving forward.
8:45
And that's what Matei is here to talk about today. As it relates to ClearStory, we believe this basically becomes one interesting way to solve this problem, which Stephanie is just about to go into in detail. So with that, I'll turn it over to her, and Matei will then take it from there.

Thanks, Sharmila.
9:05
I'm going to take us through some examples that we've seen in the market with customers we're working with in each of these three different areas, just to give you a quick flavor of why, and of some of the hands-on use cases that individuals are using Spark to address. The first is interactive multi-source analysis, and this really has applicability to any organization that is focused on outperforming their competition; they need to really understand their market position and the competitive landscape.
9:38
And what you find today in most organizations is that they've put a lot of burden into building a very good store for information, typically in the form of a data warehouse or a data mart, or maybe even a Hadoop instance, and most companies are typically storing some of their sales data there. I'll make a comparison: that type of investment is like making the first investment in your stereo system, which is a mono speaker system. You're getting some benefit on the analytics, but it's a single channel of data it's providing to you.
10:19
Other organizations, to give them credit, have gone out and added to that data warehouse, that single source of truth inside their organization, a second source or channel of data; they've gone into the two-channel realm, the stereo system. And now organizations, from a business perspective, are looking at adding a second or third data feed, a premium source of data; that might be Acxiom, or Dun & Bradstreet, or Nielsen, depending on the industry that you're in. Now, with two channels you have a stereo system, and you may be happy having a two-channel stereo system in your home, but a lot of us want to move up to surround sound; that's kind of the ultimate goal for maybe a movie watcher or a television fan. And the question is: why have these organizations not moved into surround sound, and what would surround sound really look like? Well, one of the first barriers before they get that benefit is that a lot of the data outside of corporate sales data and premium data is web data.
11:21
And so the first challenge is that data sources like Twitter, like Facebook, or even something like weather, may be unstructured in format and hard to process. And, I think maybe even more interestingly, the business value of that data is not known in advance, so you need some time for the analytics to tease out what's important in that data to justify bringing it into a data store.
11:51
What we see happening is that organizations that want to move to that next level of surround sound, adding in all of those web sources and app data and making it very interactive, are struggling with getting there in the market today, in terms of being able to have multi-source analysis and also an interactive platform. For this particular customer example, put yourself in the shoes of a movie studio; I've mapped out how these sources of data might help a movie studio get to business value with these interactive sources.
12:31
If you think about it, take yourself back to maybe 2008; I don't know how many of you have seen the movie Iron Man or The Dark Knight. Remember, both those movies came to market in the same summer blockbuster year, at the same time. Movie studios put a lot of effort into predicting the opening-weekend box office. It's hard to make an accurate prediction, but a lot of movie studios are placing their bets on how large opening weekend can be, and in the last six to seven weeks of the ramp-up to the movie release they are looking for signals in the market as to how to adapt the release. That requires behavioral information from consumers.
13:11
And one of the best sources is the web. So if you think about how this type of multi-source analysis would play out in that industry: their stereo system is being able to listen to everything that's going on in social media channels like Twitter and Facebook and on their own website, and adapt their marketing strategies accordingly. There are interesting articles, if you look back (I can share them with you), that look at how The Dark Knight and Iron Man were marketed differently, even though they had very similar story lines and were hitting a very similar target audience at the exact same time in a blockbuster year.
13:45
A perfect example, if I can get to it here... all right. The other data feed I had here in the movie example was public data feeds from web sources like Rotten Tomatoes. Many may not know this, but the Rotten Tomatoes API actually gives crowdsourced ratings information as well as critic information, and they start feeding out how individuals are scoring movies about a month in advance, so that becomes another critical source in this case. There are APIs in many different industries for that type of public data, which is becoming relevant as well in today's new, diverse data world.
14:27
The second example I have here is what Sharmila mentioned as live situational analysis. Live situational analysis is really focused on being able to observe what's happening and catch an emerging trend as it's evolving, because you may have a chance to take action on it and influence that trend as it's evolving. How many folks in the crowd watched the Super Bowl this year, Super Bowl 47?
14:55
Not too many; somebody must not be raising their hand, come on, guys. You probably know what the most exciting event of this year's Super Bowl was: the power outage, the blackout. The blackout for Super Bowl 47 was like the Super Bowl wardrobe malfunction; you guys remember that from a couple of years ago. No one really cared what was going on with the game; everyone was interested in what happened during the blackout and how different organizations responded.
15:24
Now, if you were an advertiser for the Super Bowl, you had paid, in advance of the event, somewhere between three and four million dollars per 30-second spot. One particular company that did that was Oreo. And if you were Oreo, you could live through a small amount of the Super Bowl with live situational analysis: you would be watching that spot when it came on, in the first quarter of the game I believe, and evaluating how that expensive spot impacted traffic on your website.
15:51
You might be looking at media rankings after the fact, at how Nielsen was rating the impact of your commercial, but when the power outage happened, individuals who had live situational analysis were able to respond. So what Oreo did immediately, within 10 minutes of the power outage starting, was post this beautiful ad on Twitter. They had a social media team that was monitoring, with live situational awareness, what was happening as the power went out, and as a huge mass of people started tweeting about the power outage, they took advantage of that moment in time. They posted an advertisement that was very relevant: "you can still dunk in the dark." Nice play on words there.
16:33
That got not only 16,000 retweets at the point of posting, but it also got them on the front pages of many journals; there was even a Harvard Business Review article written about how smart those marketers were and what it looks like to respond in real time. The cost estimated for this ad was about 40K; four million versus 40K, I think that's an interesting investment to be made in analytics.
17:03
The last example we want to share before passing over to Matei is one of more automated analysis, and this is the space that's really exciting, because that's where the technology actually gets to inform the business user of a decision to be made. The example here is looking at how data can be brought in to assist in near real time, with the modeling of that data automated, so that differences in time or geography between multiple sources can be harmonized.
17:38
If you think about something like field data: I may collect my field data on a monthly basis, and I might want to join that field data to market penetration data, but the market penetration data may come in, you know, every two weeks, or it may be that an outsourced firm only collects that data during a survey period, and so it might be unpredictable. The harmonization of those in time is very challenging to do manually, but platforms today are now able to automate that process; that's some of the technology that we have.
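The harmonization step described here can be sketched in plain Python. This is a minimal illustration with hypothetical monthly field figures and irregular survey readings, not ClearStory's actual implementation: align each monthly figure with the latest survey value available, carrying the last observation forward when a survey period was skipped.

```python
from datetime import date

# Hypothetical data: monthly field (sales) figures keyed by month,
# and market-penetration survey results collected at irregular dates.
field_data = {date(2013, 1, 1): 120, date(2013, 2, 1): 135, date(2013, 3, 1): 128}
survey_data = [(date(2013, 1, 14), 0.18), (date(2013, 1, 28), 0.19),
               (date(2013, 3, 9), 0.22)]

def harmonize(field, surveys):
    """Pair each monthly figure with the latest survey value seen so far,
    carrying the last observation forward when a survey period was skipped."""
    out = {}
    last = None
    surveys = sorted(surveys)
    i = 0
    for month in sorted(field):
        # consume every survey taken during or before this month
        while i < len(surveys) and (surveys[i][0].year, surveys[i][0].month) <= (month.year, month.month):
            last = surveys[i][1]
            i += 1
        out[month] = (field[month], last)
    return out

joined = harmonize(field_data, survey_data)
# February had no survey, so January's latest value (0.19) is carried forward.
```

A real pipeline would also harmonize geography (e.g. rolling store-level rows up to the survey's regions), but the carry-forward-in-time idea is the same.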
18:11
The other important aspect of this use case is the ability to then apply, on top of that harmonized data, more advanced analytics and machine learning, et cetera. So if I'm someone who's looking at the penetration of my product in a market and trying to respond to that, I also ultimately want to go further and actually look at customer behavior and begin to suggest different clusters of customers that may be good targets for the product, or attributes of those customers that may indicate a way to change my product to market it better, and so on. So being able to apply things like even simple k-means analysis, which comes kind of out of the box with the Spark technology, is critical.
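The k-means clustering mentioned here can be sketched in a few lines of plain Python. This is a toy, single-machine version on made-up 2-D "customer" points; in Spark the assignment and averaging steps would run as parallel operations over an RDD.

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    """Plain-Python k-means: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the nearest center (squared Euclidean distance)
            idx = min(range(k),
                      key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2)
            clusters[idx].append(p)
        for i, c in enumerate(clusters):
            if c:  # keep the old center if no points were assigned
                centers[i] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centers

# Two obvious customer segments: one near (0, 0), one near (10, 10).
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = sorted(kmeans(pts, 2))
```

Each iteration is a full pass over the data, which is exactly why keeping the dataset in memory (as Spark does) pays off for this kind of algorithm.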
18:54
So we talked about a lot today, and hopefully Matei will dive in today on specifics of how to address this and why we think this technology is a very exciting one. We talked about how to converge data from diverse sources, how to process that data at scale, not only for human insight but also for machine-generated insight, and how to make this process very valuable and interactive for business users; business users defined as individuals who don't have a coding background, who don't know how to write R or Python or your favorite language of choice.
19:30
So we're excited about transforming analytics. Over the last thirty years we had a great transformation as everyone moved towards data and data warehousing technology and that infrastructure; we're very excited about the future, which means transforming the approach to analytics, not just through the technology but in how we use analytics and data in our daily jobs. And we believe that means moving from batch analysis to interactive analysis at scale, really enabling the businessperson, not just the data scientist, to hypothesis-test and have a more hypothesis-driven exploration experience, using converged data from anywhere, whether that's an API, or a standard relational database, or another unstructured format. So with that, Matei, I'll hand it over to you.
20:26
Okay, cool. So I'm going to talk about two software products out of Berkeley that are designed to address these upcoming needs for data analysis, called Spark and Shark.
20:54
As Sharmila and Stephanie pointed out, modern data analysis, to really take advantage of all the sources that are out there, requires integrating multiple sources of data, many of which are unstructured; it requires dealing with these in near real time, because that's how they arrive; and it requires reacting quickly to things that happen. Now, MapReduce has been a very successful technology over the past five or six years that started to address this kind of need, and part of why it did so well is that it is a way to do batch processing on unstructured data that was very easy and actually very low cost. That was great, but MapReduce is fundamentally a batch model, with latencies anywhere from tens of minutes to hours, and it can't deal with these kinds of emerging workloads.
21:46
So we at Berkeley talked with a lot of people who used MapReduce, and we saw in particular three needs that they quickly had once they got started with it. One of them was far more interactive queries; it's very clear why interactive queries help you explore data better. The second one was far more complex algorithms they wanted to run: in particular, as Sharmila and Stephanie mentioned, you might want to do things like machine learning or statistical clustering algorithms, and these are often multi-pass algorithms that don't fit well into the MapReduce model. And the final thing is, of course, more real-time processing.
22:24
So at Berkeley we set up this research project called the Berkeley Data Analytics Stack, and our goal is actually to design a unified stack to tackle all three of these emerging workloads. That's what we think people will need in the future from data analytics, and this is what we think future analytics applications will look like. This is part of the AMPLab project, a six-year research project that started a year ago and is supported by fairly big grants from DARPA and the NSF, as well as 18 companies. We're building a whole bunch of components in the stack, from resource management in Mesos to different kinds of execution engines. In this talk I'm going to cover Spark and Shark, which are two of the execution engines we're building.
23:23
a map a dislike engine that extend map a deuce to do
23:27
a bunch of new things that you care you you can to do with
23:30
with the current implementations so one of the ways and extended this within
23:34
memory storage
23:35
and computation which is in part and whenever you’re making many craze on the
23:39
same day 90 repeatedly and
23:41
it can really make a huge difference in performance
23:44
the second way is to extend this with my general execution grafton just the one
23:48
stage
23:49
you know map and reduce model that you get with Hadoop
23:52
and again it is optimize asians lets park on
23:55
up to a hundred times faster in real applications
23:58
I with the and memory data and in fact two to 10 times faster even with on this
24:02
data
24:03
depending on on the power at the same time spark is designed to be compatible
24:09
with Hadoop and its first compatible with our all the storage API as you have
24:13
in Hadoop so essentially any storage system you can plug into Hadoop
24:17
we use the same interface that had to pass called input format
24:21
the doctor at and so you can head that data inspire
24:24
and a spark on recently dined
24:27
basically just today I also add support for stream processing so this is
24:32
something is still in Alpha state but
24:34
it’s something we’re very excited about because in the same engine you can now
24:37
the stream processing interactive
24:39
and batch on wonder strike
24:43
Shark is our implementation of SQL on top of Spark, and it's basically a port of Apache Hive. So unlike folks who are building a new SQL engine from the ground up, we actually took Hive and made it faster, by running it on Spark and doing a whole bunch of optimizations inside. What's nice about this is that it's highly compatible with Hive itself: you can use it on existing Hive data, if you have a Hive warehouse it can connect to it, and you can run your existing queries, user-defined functions, and so on, and get these benefits. And it inherits Spark's execution benefits, running up to 100 times faster. These are both open source projects.
25:25
Spark actually started as a research project about three years ago, and Shark was released last spring. They started as research projects, but in the past few years they've also seen quite a bit of use and contribution from industry, so there are now multiple companies using these in production, and even more companies that have sort of played around with them or are using them on the back end for internal analytics. To give you a sense of the community, we have about 500 members who come to our San Francisco area meetups, and we've had fourteen companies contribute code to the project in the past year.
26:04
So in this talk I'm going to give you a quick overview of both of these technologies, and you can find out more about them after, but let me begin with Spark. So what does Spark provide? Spark itself gives you a bunch of high-level APIs to do data processing in Scala and Java and Python, and you can also use it interactively from the Scala and Python shells. So the same way as with Python on your local machine, where you type stuff in and work with data, you can actually do that, but working with a cluster, from Spark. And the key idea is you get to manipulate these distributed collections, called resilient distributed datasets or RDDs, and you get a whole array of parallel operations to run on them.
26:47
Just as an example of something you can type in, this is some Scala code you can type into the Spark shell to do log mining. The story here is, you have a bunch of log files, say in a Hadoop file system, and maybe something is going wrong with the application and you want to interactively search for patterns and understand what's going on in this big application. So in Spark you can load data by just calling, for example, textFile and passing it the HDFS URL, and this gives you a distributed collection of strings, one per line of the file. This is the base RDD, or resilient distributed dataset, that we're going to process. Once you have this, you can do what are called transformations on it; for example, we can filter for the lines that start with ERROR. This (the code after the colon) is the Scala code for a function; it's like a lambda in Python, and in fact in Python you can just pass a lambda function, but you can put any code you want in there, calling into libraries and so on. This gives us a transformed RDD: the lines of text that start with ERROR. You might do more transformations on it; for example, here we're going to split each line by tabs and pull out field number two, which might be the actual message in the log entry. And finally you can choose which data to keep in memory, so when we call cache, it says keep only the error messages in memory.
28:15
Now, once you've done that, you can run a bunch of queries on it. So, for example, here is the code to search how many lines contain "foo", and when you call this count (this is called an action: something that actually kicks off a job on the cluster and gives you back a result), what Spark will do here is automatically come up with an efficient execution plan for this query. So in this case it's going to look at where the data is sitting on the cluster and send tasks based on data locality, much like MapReduce would; it's going to process those and send some results back, and then each node is also going to build up a cache of the partitions of data it's built along the way. So the next time you query the same data, say "foo" wasn't the problem and you search for "bar", you can get back the answer much faster, because Spark will know that the data is in the cache and will just hit that and give you back the results.
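The shell session described above can be mimicked locally. Below is a plain-Python stand-in with made-up log lines; the real demo runs the same filter/map/cache/count chain on an RDD loaded via textFile from HDFS.

```python
# Hypothetical log lines standing in for the HDFS files in the demo.
log_lines = [
    "INFO\t2013-02-25\tstarting up",
    "ERROR\t2013-02-25\tfoo failed",
    "INFO\t2013-02-25\theartbeat",
    "ERROR\t2013-02-25\tbar timed out",
]

# Transformation 1: keep only the lines that start with ERROR.
errors = [line for line in log_lines if line.startswith("ERROR")]

# Transformation 2: split each line by tabs and pull out field number two,
# the actual message in the log entry.
messages = [line.split("\t")[2] for line in errors]

# In Spark, .cache() here would pin `messages` in cluster memory;
# locally it is simply this list.

# Action: count how many messages contain "foo".
foo_count = sum(1 for m in messages if "foo" in m)  # → 1
```

In Spark, nothing runs until the count (the action); the two transformations only record what to compute, which is what lets the engine plan the whole query at once.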
29:07
To give you a sense of the impact this makes, I'm going to talk about two results here. One of the things that we do as a demo is full-text search of Wikipedia. Wikipedia has about 60 gigabytes of data; I know it's not huge, but it's not a thing you can easily deal with on one machine. And so, on a twenty-node EC2 cluster, if you just try to do a full-text search with Hadoop, always on on-disk data, it takes 10 to 30 seconds; if you do it with Spark, you can do the same search in half a second. So as you search over this whole dataset, making it interactive really changes the types of questions you can ask about it. And to show that you can do larger datasets too, we've also done one terabyte of data on 100 nodes, and you can do a full-text search there in about five seconds, a thing that could take minutes if it was on disk.
30:01
That's kind of a crash course in Spark; I'm not going to talk too much more about how you program with it, but you'll be able to see more online. I do want to mention, though, that if the Scala thing is a little scary, we also have nice APIs in Java and, just released today, in Python. So in Java you can just pass in your functions as classes, and it's still quite a bit more concise than if you try to do this in Hadoop MapReduce. And in Python you can pass lambdas or local functions, and we do the same kind of closure handling, automatically shipping that code to the cluster. So we're very excited to have people use it in these two languages as well.
30:38
The other thing I want to mention about the model is that it does
30:41
provide full fault tolerance, in the sense that the worker nodes can go down
30:46
and the computation will recover. And the trick that it does there
30:50
is this idea called lineage, which is
30:53
that Spark will remember how each dataset was built and automatically recompute
30:58
chunks that go missing. So for example, in the example we had before, if we lost
31:02
one of the caches on a node, and we lost the map and filter results, it
31:07
would actually run that map
31:09
and filter function again. So that's,
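The lineage idea can be shown with a tiny sketch. This is illustrative, not the Spark API: each derived dataset records only its parent and the function that built it, so a lost cached partition can be recomputed on demand rather than replicated.

```python
# Toy sketch of lineage-based recovery (class and method names are made up).
class Dataset:
    def __init__(self, partitions=None, parent=None, fn=None):
        # Base datasets hold real partitions; derived ones hold just lineage.
        self._cache = dict(enumerate(partitions)) if partitions else {}
        self.parent, self.fn = parent, fn

    def map(self, f):
        # Record how to rebuild each partition instead of computing it eagerly.
        return Dataset(parent=self, fn=lambda part: [f(x) for x in part])

    def get(self, i):
        if i not in self._cache:                          # cache miss or loss:
            self._cache[i] = self.fn(self.parent.get(i))  # recompute via lineage
        return self._cache[i]

base = Dataset(partitions=[[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
doubled.get(0)              # computes and caches partition 0
doubled._cache.pop(0)       # simulate losing the cached partition on a node
print(doubled.get(0))       # transparently recomputed -> [2, 4]
```

Losing a cached result costs only the recomputation of that one chunk, which is why the talk calls this full fault tolerance without checkpoint overhead.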
31:12
that's kinda like the engine underneath.
31:15
Apart from doing interactive queries,
31:18
the engine is also a big benefit for any kind of iterative multi-pass algorithm
31:23
you might have,
31:24
and that's actually what a lot of people have been using it for. So if you have
31:28
some machine learning algorithms,
31:30
things like logistic regression or k-means clustering
31:33
are a perfect example of something that goes over the data multiple times,
31:37
and that's the thing that platforms like MapReduce don't have any support
31:41
for; you have to run many different MapReduce jobs that need to redo a lot of
31:44
work
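To make the multi-pass access pattern concrete, here is a minimal sketch of one of the algorithms named above, logistic regression trained by gradient descent. The data and learning rate are made up; the point is that every iteration rescans the same dataset, which is exactly the workload that benefits from keeping the data in memory instead of re-reading it from disk on each MapReduce-style pass.

```python
import math

# Toy one-dimensional logistic regression (illustrative example data).
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # (feature, label) pairs
w = 0.0
for _ in range(100):                 # each iteration is one full pass over data
    grad = sum((1 / (1 + math.exp(-w * x)) - y) * x for x, y in data)
    w -= 0.5 * grad                  # gradient step with a fixed learning rate
print(w > 0)                         # the learned weight separates the classes
```

In Spark, the `data` scan inside the loop would hit a cached in-memory dataset; in plain MapReduce, each of the 100 passes would be a separate job re-reading its input.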
31:45
in Hadoop. Using Spark, you can run these anywhere between,
31:49
sort of, thirty and a hundred times faster, and even algorithms like PageRank, which are quite
31:55
a bit more communication-
31:56
intensive, actually go faster with Spark as well, because of the ability to
32:02
keep
32:02
the data in memory. So this is actually one of the things that a lot of
32:06
people have been using it for.
32:08
And just to give you a sense of applications, a lot of the companies
32:13
have
32:13
talked about what they're doing. So people use it for
32:16
just standard sort of SQL analytics and reporting; they use it
32:20
for
32:21
anomaly detection. Quantifind is a company that does predictive analytics,
32:25
precisely on the kind of social media
32:28
streams that Sharmila was talking about.
32:31
There's also, you know, Yahoo exploring it for business intelligence,
32:35
and a bunch of research projects at Berkeley as well.
32:39
So that's kind of a very quick intro to Spark.
32:43
I also wanted to talk a little bit about Shark.
32:46
So Shark is the engine to do SQL, and we're very excited
32:50
about it, because it really opens up this kind of performance
32:54
to a much wider set of users. So the way we started is, we thought
32:58
Apache Hive is a great platform, highly successful, and,
33:01
you know, very widely deployed, but because it's running on the Hadoop
33:06
engine it takes minutes to do even trivial queries, not exactly interactive.
33:10
So our idea was to actually extend Hive to run on Spark.
33:14
If you wanna look inside, on the technical side, at what's behind this,
33:19
this is basically what we did. So when you look at Hive today, it's
33:22
nicely structured:
33:24
there's a metastore, which actually handles the catalogue of data in your
33:28
system,
33:29
and then a client that actually takes the SQL query, comes up with a plan,
33:33
and submits it to the MapReduce execution engine.
33:36
In Shark we only changed the client and the execution engine,
33:40
so it works with the same storage system and same metastore as before,
33:44
and basically we updated the query optimizer and planner to
33:48
deal
33:49
with it, to use the Spark engine, and we added this cache manager
33:52
to use memory. And the nice thing is it really reuses a
33:57
ton of the things in Hive
33:59
that give you compatibility with it. And on top of merely
34:03
substituting a box in there with another box,
34:06
we've actually taken on a whole bunch of optimizations in the engine
34:11
that help significantly. So one of the things we do is columnar storage and
34:15
compression,
34:16
which gives immediate speedups as well as space savings; the same data is
34:21
actually less space in memory than with your
34:23
on-disk format in Hive.
34:26
We do statistics on partitions to do smarter pruning,
34:32
we pick algorithms, like for joins, based on what happens at query time,
34:36
and we also support co-partitioning of tables,
34:40
which can speed up operations on them. So
34:44
to use Shark, essentially you just
34:47
use HiveQL, and use a special syntax to create
34:51
tables that are in memory. So all you have to do is create an in-memory table and
34:55
then
34:55
just run queries against it.
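For illustration, a hedged sketch of what that looks like: if I recall the convention in Shark releases of this era correctly, a table whose name ends in `_cached` was stored in Spark's memory; everything else is plain HiveQL. The table and column names here are made up.

```sql
-- Assumed Shark convention: the "_cached" suffix marks an in-memory table.
CREATE TABLE logs_cached AS SELECT * FROM logs;

-- Subsequent HiveQL queries against the cached table run from memory:
SELECT page, COUNT(*) FROM logs_cached GROUP BY page;
```

The appeal is that nothing else about the query language changes; existing Hive users only learn the one caching convention.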
34:59
I don't have a ton of time for this, but I do wanna show you
35:03
some results. So just to evaluate the
35:07
system, we actually took... if you remember, a while back there was
35:10
this big
35:11
sort of debate in the database community comparing the performance
35:15
of Hadoop against analytic databases, things like Vertica,
35:19
where they said, oh look, Hadoop can be a hundred times slower than these.
35:22
So we said, okay, what if we compare Hadoop against Shark?
35:25
And essentially what we found is Shark, both with on-disk and in-memory data, has
35:30
significant speedups, and actually
35:32
starts to approach the speed you see from these very expensive
35:35
parallel database systems. So this is a really simple SELECT query, where it's
35:40
going about a hundred times faster;
35:43
this is a GROUP BY, where it's going, I think, about
35:47
20 times faster; and even on complex queries with multiple joins and
35:51
things like that,
35:52
we have a significant speedup because of the more general execution engine, even
35:57
if the data is sitting on disk.
35:59
So that's Shark. I don't have a ton of time to go into it, but it's
36:03
something we're excited about,
36:04
and users who have tried it see the same thing; it's not just
36:08
these benchmark queries,
36:09
actual user queries also run significantly faster.
36:13
So the final thing I wanna end with is what we're doing
36:17
next. So I've talked so far about interactive
36:21
and complex analytics, machine learning. There are
36:24
two other things that are really exciting that are coming up. So one of them is
36:28
stream processing; we just released an alpha version of this today.
36:32
It's called Spark Streaming, and Spark Streaming gives you fault-tolerant,
36:36
exactly-once stream processing, with a really nice API similar to the one I showed
36:41
for Spark, and with actually really high performance, higher performance than
36:45
engines like Storm
36:46
that don't provide fault tolerance, and this is due to a new
36:49
way of doing fault tolerance for stream processing that we have developed,
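The core of that new approach can be sketched abstractly. This is a toy illustration of the discretized-stream idea, not the real API: the input is chopped into small batches, each batch is processed by a deterministic function, and because the batch inputs are retained, any lost result can simply be recomputed instead of being lost or double-counted.

```python
def process(batch):
    """Deterministic per-batch computation: count words in the batch."""
    return sum(len(line.split()) for line in batch)

stream = [["hello world"], ["foo bar baz"], ["spark"]]  # incoming micro-batches
results = [process(b) for b in stream]   # normal processing path
results[1] = None                        # simulate a result lost on a failed node
if results[1] is None:
    results[1] = process(stream[1])      # recover by recomputing that one batch
print(results)  # [2, 3, 1]
```

Because `process` is deterministic and the batch input is still available, recomputation yields exactly the same answer, which is what gives the exactly-once semantics the talk mentions.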
36:53
so you can look at that online.
36:56
And the last thing is just higher-level libraries;
37:00
we have a bunch of exciting things in the pipeline that build on the
37:03
Berkeley Data Analytics Stack,
37:05
including graph processing, approximate queries
37:08
based on sampling, and a library of machine learning algorithms
37:13
called MLbase, and you can look to see these over the next few years.
37:18
So I hope that's given you a bit of an introduction
37:22
to these systems. Just to say, you know, Spark and Shark are all open source, and
37:26
they're being used in a bunch of companies; we've tried to make them very
37:30
easy to deploy,
37:31
and we'd love to get more users or contributors to them.
37:34
They are designed to be highly compatible with the Hadoop ecosystem,
37:38
and if you want to learn them, the documentation is there; there's actually a
37:43
set of
37:44
video tutorials and online exercises you can do
37:48
on our website, and I'd be glad to chat after with any people
37:52
interested in learning about them as well.
37:55
So that's it, thanks.