Confirmed Sessions for Data Day Texas 2018

Take advantage of our discount room block at the official conference hotel.
Use the following link to book your room:

We are just now beginning to announce the confirmed sessions. Check this page regularly for updates.

Next Generation Real Time Architectures

Karthik Ramasamy - Streamlio

Across diverse segments in industry, there has been a shift in focus from Big Data to Fast Data. This is driven by enterprises producing data not only in volume but also at high velocity. Many daily business operations depend on real-time insights and on how enterprises react to them. In this talk, we will describe what constitutes a real-time stack and how the stack is organized to provide an end-to-end real-time experience. The next-generation real-time stack consists of Apache Pulsar, a messaging system; Heron, a distributed streaming engine; and Apache BookKeeper, which provides fast streaming storage. We will delve into the details of each of these systems and explain why they are better than previous-generation systems.

Machine Learning: From The Lab To The Factory

John Akred - Silicon Valley Data Science

When data scientists are done building their models, there are questions to ask:
* How do the model results get to the hands of the decision makers or applications that benefit from this analysis?
* Can the model run automatically without issues and how does it recover from failure?
* What happens if the model becomes stale because it was trained on data that is no longer relevant?
* How do you deploy and manage new versions of that model without breaking downstream consumers?
This talk will illustrate the importance of these questions and provide a perspective on how to address them. John will share experiences deploying models across many enterprises, some of the problems encountered along the way, and best practices for running machine learning models in production.
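
As a rough illustration of the staleness question above, the following Python sketch flags a model input whose recent values have drifted away from the training data. It is not taken from the talk: the function name `looks_stale`, the single-feature comparison, and the default threshold of 3 standard errors are all assumptions chosen for brevity.

```python
import statistics


def looks_stale(training_values, recent_values, threshold=3.0):
    """Return True when the recent mean of one feature has drifted
    far from the mean seen at training time (hypothetical check)."""
    train_mean = statistics.fmean(training_values)
    train_stdev = statistics.stdev(training_values)
    recent_mean = statistics.fmean(recent_values)
    if train_stdev == 0:
        # A constant training feature: any change at all counts as drift.
        return recent_mean != train_mean
    # Standard error of the recent mean, assuming the training spread still holds.
    std_error = train_stdev / len(recent_values) ** 0.5
    z_score = abs(recent_mean - train_mean) / std_error
    return z_score > threshold


training = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
print(looks_stale(training, [10, 9, 11, 10]))   # similar data
print(looks_stale(training, [20, 21, 19, 22]))  # shifted data
```

A production check would typically cover many features and the model's outputs as well, and would trigger retraining or an alert rather than return a boolean.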

Introduction to SparkR in AWS EMR (90 minute session)

Alex Engler - Urban Institute

This session is a hands-on tutorial on working in Spark through R and RStudio in AWS Elastic MapReduce (EMR). The demonstration will give an overview of how to launch and access Spark clusters in EMR with R and RStudio installed. Participants will be able to launch their own clusters and run Spark code during an introduction to SparkR, including the sparklyr package, for data science applications. Theoretical concepts of Spark, such as the directed acyclic graph and lazy evaluation, as well as mathematical considerations of distributed methods, will be interspersed throughout the training. Follow-up materials on launching SparkR clusters and tutorials in SparkR will be provided.
Intended Audience: R users who are interested in a first foray into distributed cloud computing for the analysis of massive datasets. No big data, dev ops, or Spark experience is required.
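
The lazy evaluation idea mentioned in the abstract can be sketched in a few lines of plain Python. This is a toy and not Spark or SparkR code: the class `LazyDataset` and its methods are hypothetical names, and the recorded plan is a simple linear chain, the simplest case of a directed acyclic graph of operations.

```python
class LazyDataset:
    """Records transformations; nothing runs until collect() is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Return a new dataset with one more recorded step; no data is touched.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def plan(self):
        # The recorded chain of operations, in execution order.
        return ["source"] + [kind for kind, _ in self._steps]

    def collect(self):
        # The action: only now are the recorded steps applied to the data.
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


dataset = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 2).filter(lambda x: x > 4)
print(dataset.plan())     # the plan exists before any computation happens
print(dataset.collect())  # computation happens here
```

Deferring the work until an action is requested is what lets an engine inspect the whole plan before running it.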

Autopiloting #realtime processing in Heron

Karthik Ramasamy - Streamlio

Several enterprises have been producing data not only at high volume but also at high velocity. Many daily business operations depend on real-time insights, so real-time processing of the data is gaining significance. Hence there is a need for a scalable infrastructure that can continuously process billions of events per day the instant the data is acquired. To achieve real-time performance at scale, Twitter developed and deployed Heron, a next-generation cloud streaming engine that provides unparalleled performance at large scale. Heron has been successfully meeting the strict performance requirements for various streaming applications and is now an open source project with contributors from various institutions. From the developers' and operators' points of view, Heron faced some crucial challenges: the manual, time-consuming and error-prone task of tuning various configuration knobs to achieve service level objectives (SLOs), as well as the maintenance of SLOs in the face of sudden, unpredictable load variation and hardware or software performance degradation.
In order to address these issues, we conceived and implemented Dhalion, which aims to bring self-regulating capabilities to streaming systems. Dhalion monitors the streaming application, identifies problems that prevent the application from meeting its targeted performance, and automatically takes actions to recover, such as restarting slow processes and scaling resources up and down in case of load variations. Dhalion has been built as an extension to Heron and contributed back as open source. In this talk, I will give a brief introduction to Heron, enumerate the challenges that we faced while running in production, and describe how Dhalion solves some of them. This is joint work with Avrilia Floratou and Ashvin Agrawal at Microsoft and Bill Graham at Twitter.
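
The monitor, diagnose, and act cycle described above might be sketched as follows in Python. This is not Dhalion's actual API or policy logic: the `Metrics` fields, the action names, and every threshold are hypothetical, chosen only to show the shape of a self-regulating decision.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """A hypothetical snapshot of one streaming application."""
    backlog_events: int            # events waiting to be processed
    slowest_instance_ratio: float  # slowest instance throughput / median throughput
    utilization: float             # fraction of provisioned capacity in use


def choose_action(metrics, backlog_limit=10_000, slow_ratio=0.5, low_utilization=0.3):
    """Map observed symptoms to one recovery action (illustrative thresholds)."""
    if metrics.backlog_events > backlog_limit:
        # The application is falling behind: decide why before acting.
        if metrics.slowest_instance_ratio < slow_ratio:
            return "restart_slow_instance"  # one straggler is the likely cause
        return "scale_up"                   # load is high across the board
    if metrics.utilization < low_utilization:
        return "scale_down"                 # over-provisioned for the current load
    return "no_action"


print(choose_action(Metrics(50_000, 0.2, 0.9)))
print(choose_action(Metrics(50_000, 0.9, 0.9)))
print(choose_action(Metrics(100, 0.9, 0.1)))
```

Separating the diagnosis from the action is the point of the sketch: the same symptom (a growing backlog) leads to different remedies depending on its cause.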

Writing Distributed Graph Algorithms

Andrew Ray - Sam's Club

Distributed graph algorithms are an important concept for understanding large scale connected data. One such algorithm, Google’s PageRank, changed internet search forever. Efficient implementations of these algorithms in distributed systems are essential to operate at scale.
This talk will introduce the main abstractions for these types of algorithms. First we will discuss the Pregel abstraction created by Google to solve the PageRank problem at scale. Then we will discuss the PowerGraph abstraction and how it overcomes some of the weaknesses of Pregel. Finally we will turn to GraphX and how it combines some of the best parts of Pregel and PowerGraph into an easier-to-use abstraction.
For all of these abstractions we will discuss the implementations of three key examples: Connected Components, Single Source Shortest Path, and PageRank. For the first two abstractions this will be in pseudocode, and for GraphX we will use Scala. At the end we will discuss some practical GraphX tips and tricks.
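
To give a flavor of the vertex-centric style that Pregel introduced, here is a minimal single-process Python sketch of Connected Components. It only simulates the superstep and message-passing pattern on one machine; it is not the talk's material, nor the Pregel or GraphX API, and the function name is an assumption.

```python
from collections import defaultdict


def pregel_connected_components(edges):
    """Label every vertex with the smallest vertex id in its component."""
    # Build an undirected adjacency list from the edge pairs.
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Superstep 0: each vertex takes its own id as label and tells its neighbors.
    labels = {v: v for v in neighbors}
    inbox = defaultdict(list)
    for v in neighbors:
        for n in neighbors[v]:
            inbox[n].append(labels[v])

    supersteps = 0
    while inbox:  # stop when no vertex received a message
        supersteps += 1
        outbox = defaultdict(list)
        for v, messages in inbox.items():
            smallest = min(messages)
            if smallest < labels[v]:
                # The label improved: adopt it and notify the neighbors.
                labels[v] = smallest
                for n in neighbors[v]:
                    outbox[n].append(smallest)
            # Otherwise the vertex stays silent (it "votes to halt").
        inbox = outbox
    return labels, supersteps


labels, steps = pregel_connected_components([(1, 2), (2, 3), (4, 5)])
print(labels)
```

In a distributed setting the vertices would be partitioned across workers and the messages sent over the network between supersteps; the per-vertex logic stays the same.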

We R What We Ask: The Landscape of R Users on Stack Overflow

Dave Robinson - Stack Overflow

Since its founding in 2008, the question and answer website Stack Overflow has been a valuable resource for the R community, collecting more than 200,000 questions about R that are visited millions of times each month. This makes it a useful source of data for observing trends about how people use and learn the language. In this talk, I show what we can learn from Stack Overflow data about the global use of the R language over the last decade. I'll examine what ecosystems of R packages are asked about together, what other technologies are used alongside it, in what industries it has been most quickly adopted, and what countries have the highest density of users. Together, the data paints a picture of a global and rapidly growing community. Aside from presenting these results, I'll introduce interactive tools and visualizations that the company has published to explore this data, as well as a number of open datasets that analysts can use to examine trends in software development.