Confirmed Sessions for Data Day Texas 2018

We have many more confirmed sessions to announce. Check this page regularly for updates.

KEYNOTE: Deep Learning in the Real World

Lukas Biewald - CrowdFlower

Deep learning has made some incredible advances in the past few years. I've watched hundreds of organizations build and deploy machine learning algorithms, and I've seen them make a huge impact on many different applications. But deep learning isn't magic, and it takes real work to make it effective. Everyone talks about algorithms, but that's rarely the biggest problem. This talk is about real machine learning from beginning to end: collecting training data, setting expectations, handling errors, dealing with potential adversaries, and explaining why the model did what it did. It will cover a variety of use cases, from medical diagnosis to sentiment analysis to self-driving cars.

Cassandra Architecture FTW!

Jeff Carpenter - DataStax

In this talk we’ll take a deep dive into the architecture of Apache Cassandra to learn why it succeeds at scales where other databases fail. We’ll introduce the key distributed system design elements that Cassandra is built on, the problems that Cassandra solves especially well, and how to pair Cassandra with complementary technologies to build even more powerful systems. If you’ve heard about Cassandra and wondered if it was right for your use case, this talk is for you.

Making Causal Claims as a Data Scientist: Tips and Tricks Using R

Lucy D'Agostino - Vanderbilt University Medical Center

Making believable causal claims can be difficult, especially with the much repeated adage “correlation is not causation”. This talk will walk through some tools often used to practice safe causation, such as propensity scores and sensitivity analyses. In addition, we will cover principles that suggest causation such as the understanding of counterfactuals, and applying Hill’s criteria in a data science setting. We will walk through specific examples, as well as provide R code for all methods discussed.

R, What is it good for? Absolutely Everything

Jasmine Dumas - Simple Finance

Good does not mean great, but good is better than bad. When we try to compare programming languages, we tend to look at the surface components (popular developer influence, singular use cases, or language development & design choices), and sometimes we forget the substantive (sometimes secondary) components that can make a programming language appropriate for use, such as versatility, environment, and inclusivity. I’ll highlight each of these themes in the presentation to show, not tell, why R is good for everything!

Something old, something new, something borrowed, something blue: Ways to teach data science (and learn it too!)

Albert Y. Kim - Amherst College

How can we help newcomers take their first steps into the world of data science and statistics? In this talk, I present ModernDive: An Introduction to Statistical and Data Sciences via R, an open source, fully reproducible electronic textbook available at, co-authored by myself and Chester Ismay, Data Science Curriculum Lead at DataCamp. ModernDive’s authoring follows a paradigm of “versions, not editions” much more in line with software development than traditional textbook publishing, as it is built using RStudio’s bookdown interface to R Markdown. In this talk, I will present details on our book’s construction, our approaches to teaching novices to use tidyverse tools for data science (in particular ggplot2 for data visualization and dplyr for data wrangling), how we leverage these data science tools to teach data modeling via regression, and preview the new infer package for statistical inference, which performs statistical inference using an expressive syntax that follows tidy design principles. We’ll conclude by presenting example vignettes and R Markdown analyses created by undergraduate students to demonstrate the great potential yielded by effectively empowering new data scientists with the right tools.

Performance Data Modeling at Scale

Aaron Ploetz - Target

The most important aspect of backing your application with Cassandra is building a good data model. In addition to designing a query-based model that distributes well, performance at scale should also be a prime consideration. After all, you want good things to happen when your application gets a sudden 10x increase in traffic. At Target, the holiday season hits our infrastructure hard, and engineering to withstand that 10x increase is our reality.
In this presentation, we will examine real-world use cases and data processing scenarios. We will cover Cassandra data modeling techniques, and considerations for both high performance and large scale. Performance engineering of existing models will also be discussed, along with ways to get that extra bit of lower latency.
Intended audience: Cassandra DBAs, developers, and data modelers.

The State of JanusGraph 2018

Ted Wilmes - Expero

Graph database adoption increased at a rapid clip in 2017 and shows no sign of slowing down as we begin 2018. When coupled with the right problem set, it's a compelling solution and word has spread from the startup world all the way to the Fortune 500. JanusGraph, an Apache TinkerPop compatible fork of the popular Titan graph database, was one of the newcomers into the marketplace last year. Its future was uncertain, but a dedicated community coalesced around it and three releases later and with an ever growing list of contributors, it is here to stay. This talk will introduce JanusGraph, discuss where it fits into the existing graph database ecosystem, and then review the progress made over the past year along with an eye to what exciting things are coming up in 2018.

Data Science Tools: Cypher for Data Munging

Ryan Boyd - Neo4j

Running data analysis using tools like Pandas, Scikit-Learn, or Apache Spark requires that your data be in a clean format. However, as data scientists, we're often forced to bring data in from many different sources and understand the relationships between the data before running our analysis.
This session will discuss and show how we can use the power of the Cypher query language to bring data in from a variety of different sources, clean it, and prepare it for analysis in a variety of tools. We'll also show how we can supplement the native functionality available in Cypher with APOC - an amazing library of hundreds of utility functions for cleaning, refactoring and analyzing data.
While Cypher is currently used in databases like Neo4j and SAP HANA to query graph structures, it can now be used on Apache Spark with the CAPS alpha project. We'll show how Cypher can be used for Data Prep in both of these scenarios.

Understanding the development of visual focus of attention in infants using computer vision tools

Qazaleh Mirsharif - CrowdFlower

Head cameras give developmental scientists access to an infant’s visual field from the infant’s own point of view. A head camera can be mounted on an infant to record their moment-to-moment visual experience: how they visually recognize objects, interact with their social partners, and assign names to those objects. Analyzing such videos requires frame-by-frame human observation of the infant’s behavior and a high level of expertise. Computer vision has recently been emerging in this field to help developmental scientists further their understanding of the development of visual focus of attention in infants, providing tools that process these videos in terms of objects and analyze the motion that generates visual attention. Such computer vision tools reveal patterns in the developmental process of visual focus of attention that cannot be estimated by humans, because the head camera is in constant motion due to the infant’s large and random head movements.

Cassandra Performance Tuning and Crushing SLAs

Jon Haddad - The Last Pickle

In an ideal world, everything would just be fast out of the box. Unfortunately, we’re not quite there yet. Getting the best performance out of a database means understanding your entire system, from the hardware and OS to the database’s internals. In this talk, Jon Haddad will discuss a wide range of performance tuning techniques. We’ll start by examining how to measure and interpret the statistics from the different components on our machines. Once we understand how to identify what exactly is holding our performance back, we can take the necessary steps to address the problem and move to the next issue. We’ll examine common pitfalls and problems, learning how to tune counters, compaction, garbage collection, compression, and more. If you’re working on a low-latency, high-throughput system, you won’t want to miss this talk.

Navigating Time and Probability in Knowledge Graphs

Jans Aasman - Franz, Inc.

The market for knowledge graphs is rapidly developing and evolving to solve widely acknowledged deficiencies with data warehouse approaches. Graph databases are providing the foundation for these knowledge graphs and in our enterprise customer base we see two approaches forming: static knowledge graphs and dynamic event driven knowledge graphs. Static knowledge graphs focus mostly on metadata about entities and the relationships between these entities but they don’t capture ongoing business processes. DBPedia, Geonames and Census or Pubmed are great examples of static knowledge.
Dynamic knowledge graphs are used in the enterprise to facilitate internal processes, improve products or services, or gather dynamic knowledge about customers. I recently authored an IEEE article describing this evolution of knowledge graphs in the enterprise, and during this presentation I will describe two critical success factors for dynamic knowledge graphs: a uniform way to model, query, and interactively navigate time, and the power of incorporating probabilities into the graph. The presentation will cover three use cases and live demos showing how the confluence of knowledge via machine learning, visual querying, distributed graph databases, and big data not only displays links between objects but also quantifies the probability of their occurrence.

Graph Analysis of Russian Twitter Trolls

William Lyon - Neo4j

As part of the US House Intelligence Committee investigation into how Russia may have influenced the 2016 US election, Twitter released the screen names of nearly 3000 Twitter accounts tied to Russia's Internet Research Agency. These accounts were immediately suspended, deleting the data from and Twitter's developer API. In this talk we show how we can reconstruct a subset of the Twitter network of these Russian troll accounts and apply graph analytics to the data to try to uncover how these accounts were spreading fake news.

This case-study-style presentation will show how we collected and munged the data, taking advantage of the flexibility of the property graph. We'll dive into how NLP and graph algorithms like PageRank and community detection can be applied in the context of social media to make sense of the data. We'll show how Cypher, the query language for graphs, is used to work with graph data. And we'll show how visualization is used in combination with these algorithms to interpret the results of the analysis and to help share the story of the data.
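
As a rough illustration of the PageRank idea mentioned above, here is a minimal power-iteration sketch in Python. It is not the speakers' code: the account names, the edge list, the damping factor of 0.85, and the iteration count are all assumptions chosen for the example, and a real analysis would more likely run a graph library or database procedure over the full network.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration over (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    # Start from a uniform distribution over all nodes.
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small "teleport" share.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for source in nodes:
            # A node with no outgoing edges spreads its rank over all nodes.
            targets = out_links[source] or nodes
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical "who retweets whom" edges between made-up accounts.
edges = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]
ranks = pagerank(edges)
for account, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(account, round(score, 3))
```

In this toy graph, account "c" ends up with the highest score because three other accounts point to it, which is the sense in which PageRank surfaces influential nodes.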

Cassandra pluggable storage engine

Dikang Gu - Facebook / Pengchao Wang - Facebook

Instagram runs one of the largest Cassandra deployments. This year, the Cassandra team at Instagram has been working on a very interesting project to make Apache Cassandra's storage engine pluggable and to implement a new RocksDB-based storage engine in Cassandra. The new storage engine can improve the performance of Apache Cassandra significantly.
In this talk, we will describe the motivation, the different approaches we considered, the high-level design of the solution we chose, and the performance metrics in benchmark and production environments.

Powers of Ten Redux

Jason Plurad - IBM

One of the first problems a developer encounters when evaluating a graph database is how to construct a graph efficiently. Recognizing this need in 2014, TinkerPop's Stephen Mallette penned a series of blog posts titled "Powers of Ten" which addressed several bulkload techniques for Titan. Since then Titan has gone away, and the open source graph database landscape has evolved significantly. Do the same approaches stand the test of time? In this session, we will take a deep dive into strategies for loading data of various sizes into modern Apache TinkerPop graph systems. We will discuss bulkloading with JanusGraph, the graph database forked from Titan, to better understand how its open source architecture can be optimized for ingestion.

Cassandra and Kubernetes

Ben Bromhead - Instaclustr

Kubernetes has become the most popular container orchestration and management API with cloud-native support from AWS, GCP, Azure and a growing enterprise support ecosystem. Leveraging Kubernetes to provide tested, repeatable deployment patterns that follow best practices is a win for both developers and operators.
In this talk, Ben Bromhead, CTO of Instaclustr, will introduce the Cassandra Kubernetes Operator, a Cassandra controller that provides robust, managed Cassandra deployments on Kubernetes. By adopting Kubernetes and Cassandra, you can provide DBaaS-like services rapidly and easily to the rest of your team and have a simple on-ramp to true multi-cloud capabilities for your environment.

Improving Graph Based Entity Resolution using Data Mining and NLP

David Bechberger - Gene by Gene

“Hey, here are those new data files to add. I ‘cleaned’ them myself so it should be easy. Right?”
Words like these strike fear into the heart of all developers but integrating ‘dirty’ unstructured, denormalized and text heavy datasets from multiple locations is becoming the de facto standard when building out data platforms.
In this talk we will look at how we can augment our graph’s attributes using techniques from data mining (e.g. string similarity/distance measures) and Natural Language Processing (e.g. keyword extraction, named entity recognition). We will then walk through an example using this methodology to demonstrate the improvements in the accuracy of the resulting matches.
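
To make the string similarity/distance measures mentioned above concrete, here is a minimal Python sketch of Levenshtein edit distance turned into a 0-to-1 similarity score. It is an illustration only, not the speaker's method: the names being compared and the 0.8 match threshold are assumptions, and the abstract does not say which measures the talk actually uses.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalize the edit distance into a score between 0.0 and 1.0."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical record pairs and threshold, purely for illustration.
THRESHOLD = 0.8
pairs = [("Jon Smith", "John Smith"), ("Jon Smith", "Jane Doe")]
for left, right in pairs:
    score = similarity(left, right)
    print(left, "|", right, "|", round(score, 2), "| match:", score >= THRESHOLD)
```

A score like this could be stored as a property on a candidate "same entity" edge, so that downstream graph queries can filter or rank matches by it.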

How to Destroy Your Graph Project with Terrible Visualization

Corey Lanum - Cambridge Intelligence

We are all using graphs for a reason - in many cases, it's because the graph model presents an intuitive view of the data. Unfortunately, the most elegant graph data models can often be stymied by bad visualizations that obscure rather than enlighten. In this talk, Corey Lanum will discuss a number of bad practices in graph visualization that are surprisingly common. He will then outline graph visualization best practices to help create visual interfaces to graph data that convey useful insight into the data.

Detecting Bias in News Articles

Rob McDaniel - Lingistic / Rakuten

Bias is a hard thing to define, let alone detect. What is bias? How many different types of bias exist? What, if any, lexical cues exist to identify bias-inducing words? Can machines help us qualify and improve news articles?
Using millions of individual Wikipedia revisions, we will discuss a supervised method for identifying bias in news articles. First, we will discuss the last several decades of linguistic research into bias and the various types of biased verbs and lexicons that exist. Then, with plenty of examples, we will explore the way that these words introduce hidden bias into a text, and will follow up with a demonstration of a model for predicting the presence of bias-inducing words.
We will conclude with an exploration of ways to automatically suggest improvements to an article, to associate bias with topics, future implications in the field of stance detection, and a discussion of the background bias of various publishers.

Text Mining Using Tidy Data Principles

Julia Silge - Stack Overflow

Text data is increasingly important in many domains, and tidy data principles and tidy tools can make text mining easier and more effective. I will demonstrate how we can manipulate, summarize, and visualize the characteristics of text using these methods and R packages from the tidy tool ecosystem. These tools are highly effective for many analytical questions and allow analysts to integrate natural language processing into effective workflows already in wide use. We will explore how to implement approaches such as sentiment analysis of texts, measuring tf-idf, and measuring word vectors.
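
The talk itself uses R, but as a language-neutral illustration of what measuring tf-idf involves, here is a small Python sketch. It assumes one common definition (term frequency as a term's share of the words in a document, inverse document frequency as the natural log of the number of documents divided by the number containing the term); the three toy documents are invented for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return {document name: {term: tf-idf score}} for a dict of texts."""
    counts = {name: Counter(text.lower().split())
              for name, text in documents.items()}
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for count in counts.values() for term in count)
    n_docs = len(documents)

    scores = {}
    for name, count in counts.items():
        total_terms = sum(count.values())
        scores[name] = {
            term: (k / total_terms) * math.log(n_docs / doc_freq[term])
            for term, k in count.items()
        }
    return scores


# Hypothetical one-line "documents".
documents = {
    "a": "the cat sat",
    "b": "the dog sat down",
    "c": "the bird flew",
}
scores = tf_idf(documents)
for name, terms in scores.items():
    top_term = max(terms, key=terms.get)
    print(name, top_term, round(terms[top_term], 3))
```

A word that appears in every document, such as "the" here, scores zero, while words distinctive to one document score highest, which is why tf-idf is useful for finding the terms that characterize a text.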

Graph Convolutional Networks for Node Classification

Steve Purves - Expero

We describe a method of classifying nodes in an information network by application of a non-Euclidean convolutional neural network. The convolutional layers are kernelized to operate directly on the natural manifold of the information space, and thus produce output more accurate than analysis on information arbitrarily embedded in a Euclidean geometry. First, we describe the benefits of operating in a non-Euclidean geometry. We then sketch out how graph convolutional networks work. Finally, we demonstrate the application of this technique by predicting the credit-worthiness of applicants based on their population characteristics and their relationships to other individuals.

Building advanced search and analytics engines over arbitrary domains...without a data scientist

Mayank Kejriwal - USC Information Sciences Institute (ISI)

Although search engines like Google work well for many everyday search needs, there are many use cases to which they don't apply. For example, Google does not allow you to limit search to a given arbitrary 'domain' of your choice, be it publications, bioinformatics, or stocks, and it does not offer customized analytics over the domain that you would get if you were able to query the Web like a database. In the past, building such domain-specific 'search and analytics' engines required a full team of engineers and data scientists that would have to collect and crawl the data, set up the infrastructure, write and configure code, and implement complex machine learning algorithms e.g., for extracting useful information from webpages using natural language processing.
The open source Domain-specific Insight Graph (DIG) architecture meets the challenge of domain-specific search by semi-automatically structuring an arbitrary-domain Web corpus into an inter-connected 'knowledge graph' of entities, attributes and relationships. DIG provides the user with intuitive interfaces to define their own schema, customize search, and ultimately, build an entire engine in just a few hours of (non-programming) effort. The search engine itself, once set up, can be used by anyone who has access credentials, and, in addition to structured, faceted and keyword-based search, allows for complex analytics that includes geospatial and temporal analysis, network analysis and dossier generation. The approach is now widely used by law enforcement in the US for important social problems like combating human trafficking, and new uses for it have continued to emerge in DARPA, IARPA and NSF projects. In this talk, I will describe the problem of domain-specific search and the knowledge graph-centric architecture of DIG. I will also cover some important use cases, especially in social domains, for which DIG has already been instantiated and deployed.

What have we done!? 10 years of Cassandra

Patrick McFadin - DataStax

10 years ago a couple of engineers at Facebook put up a project on Google Code and a legend was born. The project has grown and users have shown an enormous amount of success. Are we ready to say Apache Cassandra has won and have a party? Let me present the evidence and we can decide as a group. No other database has delivered on the initial promises of being a reliable, performant, multi-datacenter source of record for important data. No other project, vendor or cloud has done as well or, I would argue, ever will.
I will highlight the main use cases and data models that have put Apache Cassandra ahead of its peers. If you are new to Apache Cassandra, come learn how you are lied to by every other database that makes this claim. If you are a veteran, let me revive some of the thinking that got you here in the first place and give you some fresh reasons to love this database of ours.

Building Shiny Apps: Challenges and Responsibilities

Jessica Minnier - Oregon Health and Science University

R Shiny has revolutionized the way statisticians and data scientists distribute analytic results and research methods. We can easily build interactive web tools that empower non-statisticians to interrogate and visualize their data or perform their own analyses with methods we develop. However, ensuring the user has an enjoyable experience while guaranteeing the analysis options are statistically sound is a difficult balance to achieve. Through a case study of building START (Shiny Transcriptome Analysis Resource Tool), a Shiny app for "omics" data visualization and analysis, I will present the challenges you may face when building and deploying an app of your own. By allowing the non-statistician user to explore and analyze data, we can make our job easier and improve collaborative relationships, but the success of this goal requires software development skills. We may need to consider such issues as data security, open source collaborative code development, error handling and testing, user education, maintenance due to advancing methods and packages, and responsibility for downstream analyses and decisions based on the app’s results. With Shiny we do not want to fully eliminate the statistician or analyst “middle man” but instead need to stay relevant and in control of all types of statistical products we create.

Understanding People Using Three Different Kinds of Graphs

Misty Nodine - Spiceworks

There are various ways that we can learn about people using graph-based approaches.
Social graphs – These graphs help us understand people via the connections they have with other people. They are characterized by having one node type (person) and one edge type (whatever social relationship the graph is representing). Typical questions we ask in this space are: How important is this person in this relationship? How well-connected are the people? What are the interesting groups?
Knowledge graphs – These graphs represent the information we have about a user: the things we can know about them. For instance, a knowledge graph may have nodes not only for people but also for places or companies. There are also a variety of edge types, like ‘lives_in’ between a person and a city. Knowledge graphs typically take two forms: RDF or entity-relationship. The RDF representations are also related to ontologies and the semantic web. Knowledge graphs enable you to leverage existential knowledge or knowledge related to other people to understand a person. Hence, these are graphs that we reason over. Example questions that a knowledge graph might answer include: How big a company does this person work for?
Probabilistic graphical models – Probabilistic graphical models (PGMs) allow us to infer information about a person from things we have observed directly about them, using probabilistic relationships. In a PGM, the nodes represent specific things you can observe (variables), and each edge represents a conditional dependency between two variables. In real life, we observe actual values for some subset of the nodes and can then infer the probabilities for the values of the unobserved variables.
This talk will provide an overview of these three different kinds of graphs and their desirable properties, and the algorithms and approaches that you use over those graphs to understand more about a person.
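
As a small worked example of the inference step in a probabilistic graphical model, consider the simplest possible case: two variables joined by one edge, with the observed variable used to update our belief about the unobserved one via Bayes' rule. The sketch below is in Python; the variables ("works in IT", "visits a tech forum") and every probability are made-up assumptions for illustration, not figures from the talk.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """P(hidden is true | evidence observed) for a two-node model.

    prior               -- P(hidden is true)
    likelihood_if_true  -- P(evidence | hidden is true)
    likelihood_if_false -- P(evidence | hidden is false)
    """
    evidence = (prior * likelihood_if_true
                + (1.0 - prior) * likelihood_if_false)
    return prior * likelihood_if_true / evidence


# Hypothetical numbers: 20% of people work in IT; 90% of IT workers visit
# a tech forum, compared with 10% of everyone else.
p_it_given_visit = posterior(prior=0.2,
                             likelihood_if_true=0.9,
                             likelihood_if_false=0.1)
print(round(p_it_given_visit, 3))
```

Under these assumed numbers, observing a forum visit raises the probability that the person works in IT from 0.2 to roughly 0.69. Larger models chain many such conditional dependencies together, but the update at each edge follows the same principle.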

Vital Role of Humans in Machine Learning

Lynn Pausic - Expero / Chris LaCava - Expero

It doesn’t take much effort to stumble across high-profile stories faulting “automated technology” for misguided decisions made by courts of law, medical professionals, financial institutions and other important establishments. Upon further examination, these “technology failures” are often attributed to a lack of human oversight or to aiming the intelligence at ill-defined problems, rather than to some critical flaw in the algorithms per se. While the relationship between humans and machine learning (ML) is still in its infancy, one thing is clear: humans play a symbiotic if not vital role in augmenting intelligent technology. For example, training algorithms requires continuous curation, ML outcomes often need human counterparts who can sensibly apply them to real-world contexts, and any organization utilizing ML should routinely review the moral implications of decisions made using intelligent technology. Join us for a fun and engaging talk where we’ll demonstrate how the same ML can yield anything from good to very bad outcomes based on key aspects of human involvement.

Here and now: Bringing AI into the enterprise

Kristian Hammond - Narrative Science

Even as AI technologies move into common use, many enterprise decision makers remain baffled about what the different technologies actually do and how they can be integrated into their businesses. On the plus side, technology we thought was decades away seems to be showing up at our doorstep with increasing frequency. However, little effort has been made to clearly explain the value and genuine business utility of this technology.
Kristian Hammond shares a practical framework for understanding the role of AI technologies in problem solving and decision making, focusing on how they can be used, the requirements for doing so, and the expectations for their effectiveness. Kris starts with a lecture outlining this functional framework and ends with hands-on exercises so you can practice using it in the real world when evaluating data, requirements and opportunities. You’ll leave with greater knowledge of the space and the skills to apply that knowledge to your businesses, ensuring that as you build, evaluate, and compare different systems, you’ll understand and be able to articulate how they work and the resulting impact.

Integrating Semantic Web Technologies in the Real World: A journey between two cities

Juan Sequeda - Capsenta

An early vision in Computer Science has been to create intelligent systems capable of reasoning on large amounts of data. Today, this vision can be delivered by integrating Relational Databases with Semantic Web technologies via the W3C standards: a graph data model (RDF), ontology language (OWL), mapping language (R2RML) and query language (SPARQL). The research community has successfully been showing how intelligent systems can be created with Semantic Web technologies, dubbed now as Knowledge Graphs. However, where is the mainstream industry adoption? What are the barriers to adoption? What are the open scientific problems that need to be addressed to overcome the barriers?
This talk will chronicle our journey of deploying Semantic Web technologies with real world users to address Business Intelligence and Data Integration needs, describe technical and social obstacles that are present in large organizations, scientific challenges that require attention and argue for the resurrection of Knowledge Engineers.

Debugging Apache Spark

Joey Echeverria - Rocana / Holden Karau - Google

Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in your job.
The talk will wrap up with Holden trying to get everyone to buy several copies of her new book, High Performance Spark.

Lexicon Mining for Semiotic Squares: Exploding Binary Classification

Jason Kessler - CDK Global

A common task in natural language processing is category-specific lexicon mining, or identifying words and phrases that are associated with the presence or absence of a specific category. For example, lists of words associated with positive (vs. negative) product reviews may be automatically discovered from labeled corpora.
In the 1960s, the semanticists A. J. Greimas and F. Rastier developed a framework for turning two opposing categories into a network of 10 semantic classes. This talk introduces an algorithm for discovering lexicons associated with those semantic classes given a corpus of categorized documents. This algorithm is implemented as part of Scattertext, and the output can be viewed in an interactive browser-based visualization.

Go big or go home! Does it still make sense to do Big Data with Small Nodes?

Glauber Costa - ScyllaDB

In the world of Big Data, scaling out is the norm. The prospect of running massive computation on commodity hardware is enticing, but what does "commodity hardware" really mean? The usual 8-core setup people have been deploying can now be found on phones, and every cloud provider makes boxes with 32 cores and up available at the click of a button. And still, a lot of Big Data deployments are trapped in clusters made up of a sea of small boxes.
With the advent of scalable platforms like ScyllaDB, node performance is no longer an issue, and doubling the size of the nodes will usually double the available storage, memory, and processing power. So what other reasons stop people from going big in the Cloud Native world? This talk will explore some of the popular beliefs about going big and delve into which are true and which aren't.

The Lesser Known Stars of the Tidyverse

Emily Robinson - Etsy

While most R programmers have heard of ggplot2 and dplyr, many are unfamiliar with the breadth of the tidyverse and the variety of problems it can solve. In this talk, we will give a brief introduction to the concept of the tidyverse and then describe three packages you can immediately start using to make your workflow easier. The first package is forcats, designed to make working with categorical variables easier; the second is glue, for programmatically combining data and strings; and the third is tibble, an alternative to data.frames. We will cover their basic functions so that, by the end of the talk, you will be able to use them and learn more about the broader tidyverse.

Generating Natural-Language Text with Neural Networks

Jonathan Mugan - Deep Grammar

Automatic text generation enables computers to summarize text, to have conversations in customer-service and other settings, and to customize content based on the characteristics and goals of the human interlocutor. Using neural networks to automatically generate text is appealing because they can be trained through examples with no need to manually specify what should be said when. In this talk, we will provide an overview of the existing algorithms used in neural text generation, such as sequence2sequence models, reinforcement learning, variational methods, and generative adversarial networks. We will also discuss existing work that specifies how the content of generated text can be determined by manipulating a latent code. The talk will conclude with a discussion of current challenges and shortcomings of neural text generation.

Everything is not a graph problem (but there are plenty)

Dr. Denise Gosnell - DataStax

As the reality of the graph hype cycle sets in, the graph pragmatists have shown up to guide the charge. What we are seeing and experiencing is an adjustment in mindset: the convergence to multi-model database systems parallels the mentality of using the right tool for the problem. With graph databases, there is an intricate balance to find where the rubber meets the road between theorists and practitioners.
Before hammering away on the keyboard to insert vertices and edges, it is crucial to iterate and drive the development life cycle from definitive use cases. Too many times the field has seen monoglot system thinking pressure the construction of the one graph that can rule it all, which can result in some impressive scope creep. In this talk, Dr. Gosnell will walk through common solution design considerations that can make or break a graph implementation and suggest some best practices for navigating common misconceptions.

Next Generation Real Time Architectures

Karthik Ramasamy - Streamlio

Across diverse segments of industry, there has been a shift in focus from Big Data to Fast Data. This is driven by enterprises producing data not only in high volume but also at high velocity. Many daily business operations depend on real-time insights and on how enterprises react to them. In this talk, we will describe what constitutes a real-time stack and how the stack is organized to provide an end-to-end real-time experience. The next-generation real-time stack consists of Apache Pulsar, a messaging system; Heron, a distributed streaming engine; and Apache BookKeeper, which provides fast streaming storage. We will delve into the details of each of these systems and explain why they are better than the previous generation of systems.

G-CORE: A Core for Future Graph Query Languages, designed by the LDBC Graph Query Language Task Force

Juan Sequeda - Capsenta

In this talk, Juan will report on a community effort between industry and academia to shape the future of graph query languages. We argue that existing graph database management systems should consider supporting a query language with two key characteristics. First, it should be composable, meaning that graphs are the input and the output of queries. Second, the graph query language should treat paths as first-class citizens. Our result is G-CORE, a powerful graph query language design that fulfills these goals and strikes a careful balance between path query expressivity and evaluation complexity.
This work is the culmination of 2.5 years of intensive discussion between the LDBC Graph Query Language Task Force and members of industry (Capsenta, HP, Huawei, IBM, Neo4j, Oracle, SAP and Sparsity) and academia (CWI Amsterdam, PUC Chile, Technische Universiteit Eindhoven, Technische Universität Dresden, Universidad de Chile, Universidad de Talca).
Link to paper:

Making Magic with Keras and Shiny

Nicholas Strayer - Vanderbilt University

The web-application framework Shiny has opened up enormous opportunities for data scientists by giving them a way to bring their models and visualizations to the public in interactive applications with only R code. Likewise, the package keras has simplified the process of getting up and running with deep neural networks by abstracting away much of the boilerplate and bookkeeping associated with writing models in a lower-level library such as tensorflow. In this presentation, I will demo and discuss the development of a shiny app that allows users to cast 'spells' simply by waving their phone around like a wand. The app gathers the motion of the device using the library shinysense and feeds it into a convolutional neural network which predicts spell casts with high accuracy. A supplementary shiny app for gathering data will also be shown. These applications demonstrate the ability of shiny to be used at both the data-gathering and model-presentation steps of data science.

Machine Learning: From The Lab To The Factory

John Akred - Silicon Valley Data Science

When data scientists are done building their models, there are questions to ask:
* How do the model results get to the hands of the decision makers or applications that benefit from this analysis?
* Can the model run automatically without issues and how does it recover from failure?
* What happens if the model becomes stale because it was trained on data that is no longer relevant?
* How do you deploy and manage new versions of that model without breaking downstream consumers?
This talk will illustrate the importance of these questions and provide a perspective on how to address them. John will share experiences deploying models across many enterprises, some of the problems we encountered along the way, and what best practice is for running machine learning models in production.

Introduction to SparkR in AWS EMR (90 minute session)

Alex Engler - Urban Institute

This session is a hands-on tutorial on working in Spark through R and RStudio in AWS Elastic MapReduce (EMR). The demonstration will give an overview of how to launch and access Spark clusters in EMR with R and RStudio installed. Participants will be able to launch their own clusters and run Spark code during an introduction to SparkR, including the sparklyr package, for data science applications. Theoretical concepts of Spark, such as the directed acyclic graph and lazy evaluation, as well as mathematical considerations of distributed methods, will be interspersed throughout the training. Follow-up materials on launching SparkR clusters and tutorials in SparkR will be provided.
Intended Audience: R users who are interested in a first foray into distributed cloud computing for the analysis of massive datasets. No big data, dev ops, or Spark experience is required.

Autopiloting #realtime processing in Heron

Karthik Ramasamy - Streamlio

Several enterprises have been producing data not only at high volume but also at high velocity. Many daily business operations depend on real-time insights; therefore, real-time processing of the data is gaining significance. Hence there is a need for a scalable infrastructure that can continuously process billions of events per day the instant the data is acquired. To achieve real-time performance at scale, Twitter developed and deployed Heron, a next-generation cloud streaming engine that provides unparalleled performance at large scale. Heron has been successfully meeting the strict performance requirements of various streaming applications and is now an open source project with contributors from various institutions. Heron faced some crucial challenges from the developers' and operators' points of view: the manual, time-consuming and error-prone task of tuning various configuration knobs to achieve service level objectives (SLOs), as well as the maintenance of SLOs in the face of sudden, unpredictable load variation and hardware or software performance degradation.
In order to address these issues, we conceived and implemented Dhalion, which aims to bring self-regulating capabilities to streaming systems. Dhalion monitors the streaming application, identifies problems that prevent the application from meeting its targeted performance, and automatically takes actions to recover, such as restarting slow processes and scaling resources up and down in case of load variations. Dhalion has been built as an extension to Heron and contributed back as open source. In this talk, I will give a brief introduction to Heron, enumerate the challenges that we faced while running it in production, and describe how Dhalion solves some of those challenges. This is joint work with Avrilia Floratou and Ashvin Agrawal at Microsoft and Bill Graham at Twitter.

Real-time deep link analytics: The next stage of graph analytics

Dr. Victor Lee - TigerGraph

Graph databases are the fastest growing category in data management, according to DB-Engines. However, most graph queries only traverse two hops in big graphs due to limitations in most graph databases. Real-world applications require deep link analytics that traverse far more than three hops. To support real-time deep link analytics, we need the power of combining real-time data updates, big datasets, and deep link traversals.
Dr. Victor Lee offers an overview of TigerGraph’s distributed Native Parallel Graph. He discusses the techniques behind the distributed native parallel graph platform, including how it partitions graph data across machines, supports fast updates, and is still able to perform fast graph traversal and computation. He also shares a subsecond real-time fraud detection system that manages 100 billion graph elements to detect risk and fraudulent groups.
(Product Showcase)

Writing Distributed Graph Algorithms

Andrew Ray - Sam's Club

Distributed graph algorithms are an important concept for understanding large scale connected data. One such algorithm, Google’s PageRank, changed internet search forever. Efficient implementations of these algorithms in distributed systems are essential to operate at scale.
This talk will introduce the main abstractions for these types of algorithms. First we will discuss the Pregel abstraction created by Google to solve the PageRank problem at scale. Then we will discuss the PowerGraph abstraction and how it overcomes some of the weaknesses of Pregel. Finally we will turn to GraphX and how it combines some of the best parts of Pregel and PowerGraph to make an easier-to-use abstraction.
For all of these abstractions we will discuss the implementations of three key examples: Connected Components, Single Source Shortest Path, and PageRank. For the first two abstractions this will be in pseudocode and for GraphX we will use Scala. At the end we will discuss some practical GraphX tips and tricks.
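To give a feel for the vertex-centric style that Pregel popularized, here is a minimal single-machine sketch of Connected Components written as message-passing supersteps. It is an illustration only, not material from the talk: the function name `connected_components` and the edge-list input format are assumptions made for this example, and a real Pregel or GraphX job would distribute the vertices across machines.

```python
from collections import defaultdict


def connected_components(edges):
    """Label each vertex with the smallest vertex id in its component.

    Hypothetical helper for illustration: simulates Pregel-style
    supersteps in a single process.
    """
    # Build an undirected adjacency list from the edge list.
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Superstep 0: every vertex uses its own id as its label and is active.
    label = {v: v for v in neighbors}
    active = set(neighbors)

    while active:
        # Message phase: each active vertex sends its label to its neighbors.
        inbox = defaultdict(list)
        for v in active:
            for n in neighbors[v]:
                inbox[n].append(label[v])

        # Compute phase: a vertex that receives a smaller label adopts it
        # and stays active; every other vertex votes to halt.
        active = set()
        for v, messages in inbox.items():
            smallest = min(messages)
            if smallest < label[v]:
                label[v] = smallest
                active.add(v)

    return label
```

Calling `connected_components([(1, 2), (2, 3), (4, 5)])` labels vertices 1, 2 and 3 with 1 and vertices 4 and 5 with 4. The loop ends when no vertex changes its label, which mirrors the point at which every vertex has voted to halt.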

We R What We Ask: The Landscape of R Users on Stack Overflow

Dave Robinson - Stack Overflow

Since its founding in 2008, the question and answer website Stack Overflow has been a valuable resource for the R community, collecting more than 200,000 questions about R that are visited millions of times each month. This makes it a useful source of data for observing trends about how people use and learn the language. In this talk, I show what we can learn from Stack Overflow data about the global use of the R language over the last decade. I'll examine what ecosystems of R packages are asked about together, what other technologies are used alongside it, in what industries it has been most quickly adopted, and what countries have the highest density of users. Together, the data paints a picture of a global and rapidly growing community. Aside from presenting these results, I'll introduce interactive tools and visualizations that the company has published to explore this data, as well as a number of open datasets that analysts can use to examine trends in software development.

Fishing Graphs in a Hadoop Data Lake

Claudius Weinberger - ArangoDB

Hadoop clusters can store nearly everything in your data lake cheaply and blazingly fast. Answering questions and gaining insights from this ever-growing stream becomes the decisive part for many businesses. Increasingly, data has a natural structure as a graph, with vertices linked by edges, and many questions arising about the data involve graph traversals or other complex queries for which one does not have an a priori bound on the length of paths.
Spark with GraphX is great for answering relatively simple graph questions which are worth starting a Spark job for because they essentially involve the whole graph. But does it make sense to start one for every ad-hoc query or is it suitable for complex real-time queries?