Confirmed Sessions for Data Day Texas 2018

KEYNOTE: Deep Learning in the Real World

Lukas Biewald - CrowdFlower

Deep learning has made some incredible advances in the past few years. I've watched hundreds of organizations build and deploy machine learning algorithms, and I've seen them make a huge impact on many different applications. But deep learning isn't magic, and it takes real work to make it effective. Everyone talks about algorithms, but that's rarely the biggest problem. This talk is about real machine learning from beginning to end: collecting training data, setting expectations, handling errors, dealing with potential adversaries, and explaining why the model did what it did. It will cover a variety of use cases, from medical diagnosis to sentiment analysis to self-driving cars.

Pilgrim’s Progress: a journey from confusion to contribution

Mara Averick - RStudio

Navigating the data science landscape can be overwhelming. Luckily, you don't have to do it alone! In fact, I'll argue you shouldn't do it alone. Whether it's tweeting your latest mistake, asking a well-formed question, or submitting a pull request to a popular package, you can help others and yourself by "learning out loud." No matter how much (or little) you know, you can turn your confusion into contributions, and have a surprising amount of fun along the way.

Navigating Time and Probability in Knowledge Graphs

Jans Aasman - Franz, Inc.

The market for knowledge graphs is rapidly developing and evolving to solve widely acknowledged deficiencies of data warehouse approaches. Graph databases are providing the foundation for these knowledge graphs, and in our enterprise customer base we see two approaches forming: static knowledge graphs and dynamic, event-driven knowledge graphs. Static knowledge graphs focus mostly on metadata about entities and the relationships between those entities, but they don't capture ongoing business processes. DBpedia, GeoNames, Census, and PubMed are great examples of static knowledge graphs.
Dynamic knowledge graphs are used in the enterprise to facilitate internal processes, improve products or services, or gather dynamic knowledge about customers. I recently authored an IEEE article describing this evolution of knowledge graphs in the enterprise, and during this presentation I will describe two critical success factors for dynamic knowledge graphs: a uniform way to model, query, and interactively navigate time, and the power of incorporating probabilities into the graph. The presentation will cover three use cases with live demos showing how the confluence of machine learning, visual querying, distributed graph databases, and big data not only displays links between objects but also quantifies the probability of their occurrence.

Improving Graph Based Entity Resolution using Data Mining and NLP

David Bechberger - Gene by Gene

“Hey, here are those new data files to add. I ‘cleaned’ them myself so it should be easy. Right?”
Words like these strike fear into the hearts of all developers, but integrating 'dirty' unstructured, denormalized, and text-heavy datasets from multiple locations is becoming the de facto standard when building out data platforms.
In this talk we will look at how we can augment our graph's attributes using techniques from data mining (e.g. string similarity/distance measures) and natural language processing (e.g. keyword extraction, named entity recognition). We will then walk through an example using this methodology to demonstrate the improvements in the accuracy of the resulting matches.
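For readers who want a concrete feel for the string-similarity side of this, here is a minimal Python sketch; the functions, names, and sample strings are invented for illustration and are not from the talk:

```python
import re
from difflib import SequenceMatcher

def char_similarity(a, b):
    """Edit-style similarity between two entity names, normalized to 0..1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def token_jaccard(a, b):
    """Token-overlap similarity; more robust to reordered words like 'Smith, Robert'."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    return len(ta & tb) / len(ta | tb)

# candidate pair of records from two 'dirty' source files
print(char_similarity("Gene by Gene, Ltd.", "Gene By Gene Ltd"))   # ~0.9
print(token_jaccard("Robert J. Smith", "Smith, Robert"))           # ~0.67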

Building a Knowledge Graph

Dan Bennett - Thomson Reuters

Just a few years ago a knowledge graph was the domain of academic papers; today they underpin the natural language capabilities of Alexa, Siri, Cortana and Google Now. Graphs are a natural fit for this use case: treating every data item as equivalent and embracing rapid schema mutation. For the past few years, Thomson Reuters has been building a professional information knowledge graph to power our next generation of products. Our graph is RDF based, fast growing, and supports a number of different products and user experiences. In this session, Dan will cover our experiences, architecture, tools, and lessons learned from building, integrating, and maintaining a 100bn-triple graph.

Cassandra and Kubernetes

Ben Bromhead - Instaclustr

Kubernetes has become the most popular container orchestration and management API with cloud-native support from AWS, GCP, Azure and a growing enterprise support ecosystem. Leveraging Kubernetes to provide tested, repeatable deployment patterns that follow best practices is a win for both developers and operators.
In this talk, Ben Bromhead, CTO of Instaclustr, will introduce the Cassandra Kubernetes Operator, a Cassandra controller that provides robust, managed Cassandra deployments on Kubernetes. By adopting Kubernetes and Cassandra, you can provide DBaaS-like services rapidly and easily to the rest of your team and have a simple on-ramp to true multi-cloud capabilities for your environment.

Using Dockerized Cassandra and TensorFlow to Predict Future Blockchain Prices

Joaquin Casares - The Last Pickle

Join Joaquin Casares of The Last Pickle in a code-heavy presentation of how he uses Docker Compose to start all of his new projects for his day job, clients, and side projects.
The presentation will come with a companion GitHub repository that contains a Docker Compose setup with Cassandra as well as a TensorFlow app to ingest and analyze blockchain price data.
To get the most out of the project, we recommend installing the following software before the meeting:
Docker Engine (for Mac):
Docker Engine (Ubuntu):
Docker Compose:

Cassandra Architecture FTW!

Jeff Carpenter - DataStax

In this talk we’ll take a deep dive into the architecture of Apache Cassandra to learn why it succeeds at scales where other databases fail. We’ll introduce the key distributed system design elements that Cassandra is built on, the problems that Cassandra solves especially well, and how to pair Cassandra with complementary technologies to build even more powerful systems. If you’ve heard about Cassandra and wondered if it was right for your use case, this talk is for you.

Go big or go home! Does it still make sense to do Big Data with Small Nodes?

Glauber Costa - ScyllaDB

In the world of Big Data, scaling out is the norm. The prospect of running massive computation on commodity hardware is enticing, but what does "commodity hardware" really mean? The usual 8-core setup people have been deploying can now be found on phones, and every cloud provider makes boxes with 32 cores and up available at the click of a button. And still, a lot of Big Data deployments are trapped in a sea of small-box clusters.
With the advent of scalable platforms like ScyllaDB, node performance is no longer an issue, and doubling the size of the nodes will usually double the available storage, memory, and processing power. So what other reasons stop people from going big in the Cloud Native world? This talk will explore some of the popular wisdom associated with going big and delve into which claims are true, and which aren't.

Making Causal Claims as a Data Scientist: Tips and Tricks Using R

Lucy D'Agostino - Vanderbilt University Medical Center

Making believable causal claims can be difficult, especially given the much-repeated adage "correlation is not causation". This talk will walk through some tools often used to practice safe causation, such as propensity scores and sensitivity analyses. In addition, we will cover principles that suggest causation, such as the understanding of counterfactuals and the application of Hill's criteria in a data science setting. We will walk through specific examples, as well as provide R code for all methods discussed.
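The talk's own examples are in R; purely as an illustration of the propensity-score idea, a rough Python sketch (all variable names, coefficients, and data below are made up) might look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
age = rng.normal(50, 10, n)
severity = rng.normal(0, 1, n)
X = np.column_stack([age, severity])

# treatment assignment depends on the covariates, so naive comparisons are confounded
p_treat = 1 / (1 + np.exp(-(0.03 * (age - 50) + 0.8 * severity)))
treated = (rng.random(n) < p_treat).astype(int)

# propensity score: estimated P(treated | covariates)
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# inverse-probability weights balance the treated and control groups
weights = np.where(treated == 1, 1 / ps, 1 / (1 - ps))
print(round(weights.mean(), 2))
```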

AI and Graph to optimize steam process in a large process plant

Arnaud de Moissac - DCbrain / Jean-Reynald Macé - Areva

Steam production and distribution networks can cost a lot of money for a large process plant. To optimise such complex networks, you have to deal with several parameters: physical issues like boiler efficiency or the clogging of heat exchangers, availability constraints on your network with respect to business SLAs, and non-linear dependencies between the pool of boilers and the steam network. It can be tricky to build an objective function using standard physical models.
This talk will show how we first use a graph structure to model the network and extract features from it, then use deep learning to build transfer functions representing the behavior of each steam producer and consumer, and finally run an optimization on top of this meta-model to find the best way to operate the plant.

Statistics for Data Science: what you should know and why

Gabriela de Queiroz - R-Ladies

Data science is not only about machine learning. To be a successful data person, you also need a significant understanding of statistics. Gabriela de Queiroz walks you through the top five statistical concepts every Data Scientist should know to work with data.

R, What is it good for? Absolutely Everything

Jasmine Dumas - Simple Finance

Good does not mean great, but good is better than bad. When we try to compare programming languages we tend to look at the surface components (popular developer influence, singular use cases, or language development and design choices), and sometimes we forget the substantive (sometimes secondary) components of what can make a programming language appropriate for use, such as versatility, environment, and inclusivity. I'll highlight each of these themes in the presentation to show, not tell, why R is good for everything!

Debugging Apache Spark

Joey Echeverria - Rocana / Holden Karau - Google

Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in your job.
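As a taste of the accumulator-based debugging described above, here is a minimal PySpark sketch; the data and counter are invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-debug").getOrCreate()
sc = spark.sparkContext

bad_records = sc.accumulator(0)  # driver-visible counter updated on the executors

def parse(line):
    try:
        return [float(line)]
    except ValueError:
        bad_records.add(1)   # count malformed input instead of failing the job
        return []

rdd = sc.parallelize(["1.0", "2.5", "oops", "3.0"])
print(rdd.flatMap(parse).sum())               # an action forces the lazy pipeline to run
print("malformed records:", bad_records.value)
```

Note that, as the talk warns, accumulator updates can be repeated when tasks are retried or partitions recomputed, so treat these counts as debugging hints rather than exact metrics.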
The talk will wrap up with Holden trying to get everyone to buy several copies of her new book, High Performance Spark.

Arcade: play with your data (Product Showcase)

Roberto Franchini / Cody Corrington - Arcade Analytics

Arcade lets you efficiently investigate and understand the meaning of your connections through clear, concise, and contemporary images to help make you more successful. Data never stops. It's the driving force of everything around us, and it is only going to grow in the years to come. Every day, data offers the opportunity to make lives better – so let's do it together.
Find and understand the undiscovered connections in your database. Arcade works with all the most popular Graph Databases (Neo4j, OrientDB, TigerGraph and JanusGraph). Arcade executes queries against the connected DBMS in real time. Arcade puts the power of data back into your hands.

Playing well together: Big data beyond the JVM with Spark and friends

Holden Karau - Google / Rachel Warren - Independent

Many of the recent big data systems, like Hadoop, Spark, and Kafka, are written primarily in JVM languages. At the same time, there is a wealth of tools for data science and data analytics that exist outside of the JVM. Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
Holden and Rachel detail how to bridge the gap using PySpark and discuss other solutions like Kafka Streams as well. They also outline the challenges of pure Python solutions like dask. Holden and Rachel start with the current architecture of PySpark and its evolution. They then turn to the future, covering Arrow-accelerated interchange for Python functions, how to expose Python machine learning models into Spark, and how to use systems like Spark to accelerate training of traditional Python models. They also dive into what other similar systems are doing as well as what the options are for (almost) completely ignoring the JVM in the big data space.
Python users will learn how to more effectively use systems like Spark and understand how the design is changing. JVM developers will gain an understanding of how to work with Python code from data scientists and Python developers while avoiding the traditional trap of needing to rewrite everything.
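For a flavor of the Arrow-accelerated interchange mentioned above, a small sketch using Spark 2.3's vectorized (pandas) UDFs; the configuration key follows Spark 2.3 and the example data is invented:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()
# enable Arrow-based transfer for toPandas()/createDataFrame(pandas_df)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.createDataFrame([(1, 2.0), (2, 3.5), (3, 1.25)], ["id", "value"])

@pandas_udf("double", PandasUDFType.SCALAR)
def times_ten(v):
    # receives and returns whole pandas Series, one Arrow batch at a time,
    # instead of row-at-a-time pickling between the JVM and Python
    return v * 10.0

df.withColumn("scaled", times_ten("value")).show()
```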

Cassandra and the Cloud

Jonathan Ellis - DataStax

Is Apache Cassandra still relevant in an era of hosted cloud databases? DataStax CTO Jonathan Ellis will discuss Cassandra’s strengths and weaknesses relative to Amazon DynamoDB, Microsoft CosmosDB, and Google Cloud Spanner.

Using R for Advanced Analytics with MongoDB

Jane Fine - MongoDB

In the age of big data, organizations rely on data scientists to provide critical decision support and predictive analysis. Most industries now leverage new kinds of data to innovate, understand their customers, and capture new markets. MongoDB's flexible schema and scalability make it a natural choice for storing the diverse data sets needed to accomplish these tasks.
In this session, we will explore the tools and design patterns available to the data scientist to harness the power of MongoDB for data preparation and enrichment. We will focus on R for advanced analytics utilizing mongolite as well as MongoDB Spark Connector R API.

3 ways to build a near real-time recommendation engine

David Gilardi - DataStax

David Gilardi will show how to add a near real-time recommendation engine to KillrVideo, a reference application built to help developers learn Cassandra, Graph, and DataStax Enterprise. David will discuss the benefits and practical considerations of building a solution that leverages multiple, highly connected data models such as graph and tabular. He’ll also take a look at multiple recommendation engine examples, including use of domain specific languages to make development even simpler.
An introduction to KillrVideo - David will briefly introduce the reference implementation, a cloud-based video sharing application which uses the Apache Cassandra core of DataStax Enterprise as well as DSE Search and DSE Graph integrations.
What do we mean by “multiple, highly connected models”? - David will talk about what this means and discuss the benefits of these attributes in building applications that include transaction processing, search, and graph.
Adding a recommendation engine - David will discuss the task of extending KillrVideo to provide real-time video recommendations using DSE Graph and the popular Gremlin graph traversal language, using DSLs (domain-specific languages).

Knowledge Graphs: You're doing them wrong!

Michael Grove - Stardog Union

As organizations strive to become more data driven and look for ways to better manage and utilize their data, many look to Knowledge Graphs as the answer. While a Knowledge Graph is the only way to effectively analyze, utilize, and monetize enterprise data at scale, just throwing some data into a plain graph and declaring it a "Knowledge Graph" doesn't cut it, yet many organizations make this mistake. This approach simply creates another data silo, doesn't address the fundamental data challenge these organizations face, and fails to create the kind of data infrastructure they need to accomplish their goals.
In this talk we will provide a short overview of the data silo problem as well as a more robust definition of what exactly an Enterprise Knowledge Graph _is_ and the kinds of features it needs in order to provide the capabilities required to help an enterprise achieve its goals. We will also provide a demo that brings together a variety of public data sources into a Knowledge Graph, demonstrating how going beyond a simple graph structure yields a much more powerful platform for today's enterprises.

Cassandra pluggable storage engine

Dikang Gu - Facebook / Pengchao Wang - Facebook

Instagram runs one of the largest Cassandra deployments. This year, the Cassandra team at Instagram has been working on a very interesting project to make Apache Cassandra's storage engine pluggable and to implement a new RocksDB-based storage engine for Cassandra. The new storage engine can improve the performance of Apache Cassandra significantly.
In this talk, we will describe the motivation and the different approaches we considered, the high-level design of the solution we chose, and the performance metrics in benchmark and production environments.

Cassandra Performance Tuning and Crushing SLAs

Jon Haddad - The Last Pickle

In an ideal world, everything would just be fast out of the box. Unfortunately, we're not quite there yet. Getting the best performance out of a database means understanding your entire system, from the hardware and OS to the database's internals. In this talk, Jon Haddad will discuss a wide range of performance tuning techniques. We'll start by examining how to measure and interpret the statistics from the different components on our machines. Once we understand how to identify exactly what is holding our performance back, we can take the necessary steps to address the problem and move on to the next issue. We'll examine common pitfalls and problems, learning how to tune counters, compaction, garbage collection, compression, and more. If you're working on a low latency, high throughput system you won't want to miss this talk.

DBpedia - A Global Open Knowledge Network

Sebastian Hellmann - DBpedia Association

In the last 10 years DBpedia has developed into one of the most successful knowledge graph projects, with a thriving community. After the foundation of the DBpedia Association in 2014, there has been a three-year-long discussion about the new strategy and identity of DBpedia to further push Open Data as well as the economic exploitation of Open Data. The economic exploitation in particular poses a significant challenge, as there are very few successful Open Data business models compared to open-source models. Our belief is that open data models - other than content or software - work best in a networked economy, i.e. a collaborative environment. In the presentation, which was co-created by Soeren Auer with feedback from the whole DBpedia community, I will introduce the new DBpedia incubator model (inspired by the Apache Software Foundation and GitHub), which can help organisations analyse their current state of data governance, and I will show concrete scenarios where open knowledge graphs can provide economic benefits. All of this is backed up and supported by the new DBpedia platform, which is currently being developed.

infer: an R package for tidy statistical inference

Chester Ismay - DataCamp

How do you code up a permutation test in R? What about an ANOVA or a chi-square test? Have you ever been uncertain exactly which type of test you should run given the data and the questions asked? The `infer` package was created to unite common statistical inference tasks into an expressive and intuitive framework, to alleviate some of these struggles and make inference more intuitive. This talk will focus on the design principles of the package, which are firmly motivated by Hadley Wickham's tidy tools manifesto. It will also discuss the implementation, centered on the common conceptual threads that link a surprising range of hypothesis tests and confidence intervals. Lastly, we'll walk through some examples of how to use the `infer` package. The package aims to be useful to new students of statistics as well as seasoned practitioners.

Something old, something new, something borrowed, something blue: Ways to teach data science (and learn it too!)

Albert Y. Kim - Amherst College

How can we help newcomers take their first steps into the world of data science and statistics? In this talk, I present ModernDive: An Introduction to Statistical and Data Sciences via R, an open source, fully reproducible electronic textbook available at, co-authored by myself and Chester Ismay, Data Science Curriculum Lead at DataCamp. ModernDive’s authoring follows a paradigm of “versions, not editions” much more in line with software development than traditional textbook publishing, as it is built using RStudio’s bookdown interface to R Markdown. In this talk, I will present details on our book’s construction, our approaches to teaching novices to use tidyverse tools for data science (in particular ggplot2 for data visualization and dplyr for data wrangling), how we leverage these data science tools to teach data modeling via regression, and preview the new infer package for statistical inference, which performs statistical inference using an expressive syntax that follows tidy design principles. We’ll conclude by presenting example vignettes and R Markdown analyses created by undergraduate students to demonstrate the great potential yielded by effectively empowering new data scientists with the right tools.

Cognitive Graph Analytics on Company Data and News: Popularity ranking, importance and similarity

Atanas Kiryakov - Ontotext

Analyzing diverse data from multiple sources requires concept and entity awareness – the kind of knowledge that people have when saying "I am aware of X". Matching entities across data sources or recognizing mentions in text requires disambiguation – something that people do with ease and computers often fail to do right, because an average graduate has awareness of a wide set of entities and concepts and computers do not. The most common types of entities when dealing with business information are people, organizations and locations (POL). Ontotext's POL Knowledge Graph will be presented. It provides entity awareness about all locations and the globally most popular companies and people. I will demonstrate graph analytics on a knowledge graph of about 2 billion triples loaded in the Ontotext GraphDB engine. The graph combines several open data sources mapped to the FIBO ontology and interlinks their entities to 1 million news articles. The demonstration will include: importance ranking of nodes based on graph centrality; popularity ranking based on news mentions of a company and its subsidiaries; retrieval of similar nodes in a knowledge graph; and determining the distinguishing features of an entity.

Biorevolutions: Machine Learning and Graph Analysis Illuminate Biotechnology Startup Success

Gunnar Kleemann - Berkeley Data Science Group / Kiersten Henderson - Austin Capital Data Group

Biotechnology is a multi-billion dollar industry. But if only 11% of startups succeed, which companies are good investments? To gain insight into the likelihood of biotechnology startup success, we leveraged the biotech domain knowledge graph developed at Berkeley Data Science Group (BDSG). The BDSG analysis used a machine learning-based predictive model built on publicly available data about biotech startups. Based on this model, some of the major predictors of US biotech startup success are the percentage of employees who are scientists and a company's geographic location.
To further explore the relationship between startup success and these two features, we turned to a GRAKN.AI knowledge graph of scientific publications. This knowledge graph includes information on the subject matter of scientific publications as well as the scientists who collaborated to publish together. Using this publication graph, we explored the collaboration style of scientists at startups in different cities and found a range of collaboration networks, from close-knit to broad. We also investigated how scientists at startups in different parts of the country differ in terms of the breadth of their subject matter expertise. We will discuss these and other insights gleaned from applying graph analysis to patterns in biotech startups. Our analysis suggests further avenues to explore when refining the accuracy of our model for predicting biotechnology startup success.

Identifying viral bots and cyborgs in social media

Dr. Steve Kramer - Paragon Science

Particularly over the last several years, researchers across a spectrum of scientific disciplines have studied the dynamics of social media networks to understand how information propagates as the networks evolve. Social media platforms like Twitter and Facebook include not only actual human users but also bots, or automated programs, that can significantly alter how certain messages are spread. While some information-gathering bots are beneficial or at least benign, the 2016 U.S. presidential election and the 2017 elections in France made clear that bots and sock puppet accounts (that is, numerous social accounts controlled by a single person) were effective in influencing political messaging and propagating misinformation on Twitter and Facebook. It is thus crucial to identify and classify social bots to combat the spread of misinformation, and especially the propaganda of enemy states and violent extremist groups. In this talk, Steve will explain the techniques he applied to identify battling groups of viral bots and cyborgs that seek to sway opinions online.

For this research, Steve has applied techniques from complexity theory, especially information entropy, as well as network graph analysis and community detection algorithms to identify clusters of viral bots and cyborgs (human users who use software to automate and amplify their social posts) that differ from typical human users on Twitter and Facebook. In addition to commercial bots focused on promoting click traffic, Steve discovered competing armies of pro-Trump and anti-Trump political bots and cyborgs. Steve has also made the source data and analysis results available on for researchers who wish to collaborate in this work.
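As a toy illustration of the information-entropy idea (the interval data below is fabricated), an account that posts on a rigid schedule has much lower entropy than a typical human:

```python
import math
from collections import Counter

def shannon_entropy(events):
    """Entropy, in bits, of a sequence of discrete events (e.g. binned posting intervals)."""
    counts = Counter(events)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

bot_intervals = ["60s"] * 50                                          # clockwork posting
human_intervals = ["30s", "2m", "45m", "5m", "12m", "1h", "3m"] * 7   # varied gaps

print(shannon_entropy(bot_intervals))    # 0.0
print(shannon_entropy(human_intervals))  # ~2.8
```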

Hardware Accelerators for NoSQL databases

Chidamber Kulkarni / Prasanna Sukumar - Reniac

In this talk, we will discuss our experiences in building a low-latency, high-throughput data acceleration engine for NoSQL databases, such as Cassandra. We will introduce the concept of a Data Proxy that marries the concepts of a transparent data layer proxy, that can talk CQL (for Cassandra), and the concept of a storage engine, that implements caching of data across multiple memory technologies such as SRAM, DRAM, and Flash memories. We will then discuss the different accelerator technologies that can be used to build such a data acceleration engine and elucidate why FPGAs make the right choice. In this talk, we will also highlight the key design decisions and trade-offs that we made to leverage the power of FPGA devices to deliver 1/3rd to 1/10th lower latency and significantly higher throughput by saturating the 10Gb Ethernet within a server. Finally, we will present some of our benchmark results to validate our claims. To conclude, we will show how the Data Proxy can be applied to other NoSQL databases, and adjacent technologies such as Search (using Solr/ElasticSearch) both as an on-prem solution and on cloud platforms with FPGAs, such as AWS F1.

How to Destroy Your Graph Project with Terrible Visualization

Corey Lanum - Cambridge Intelligence

We are all using graphs for a reason - in many cases, it's because the graph model presents an intuitive view of the data. Unfortunately, the most elegant graph data models can often be stymied by bad visualizations that obscure rather than enlighten. In this talk, Corey Lanum will discuss a number of bad practices in graph visualization that are surprisingly common. He will then outline graph visualization best practices to help create visual interfaces to graph data that convey useful insight into the data.

How We Got to 1 Millisecond Latency in 99% Under Repair, Compaction, and Flushes with ScyllaDB

Dor Laor - ScyllaDB

Scylla is an open source reimplementation of Cassandra that performs up to 10X better, with drop-in replacement compatibility. At Scylla, performance matters, but even more important is stable performance under any circumstances.
A key factor in our consistent performance is our reliance on userspace schedulers. Scheduling in user space allows the application (the database, in our case) to have better control over the different priorities each task has and to provide an SLA to selected operations. Scylla has had an I/O scheduler for some time and recently gained a CPU scheduler.
At Scylla, we make architectural decisions that provide not only low latencies but consistently low latencies at higher percentiles. This begins with our choice of language and key architectural decisions such as not using the Linux page-cache, and is fulfilled by autonomous database control, a set of algorithms, which guarantees that the system will adapt to changes in the workload. In the last year, we have made changes to Scylla that provide latencies that are consistent in every percentile. In this talk, Dor Laor will recap those changes and discuss what ScyllaDB is doing in the future.

Graph Analysis of Russian Twitter Trolls

William Lyon - Neo4j

As part of the US House Intelligence Committee investigation into how Russia may have influenced the 2016 US election, Twitter released the screen names of nearly 3,000 Twitter accounts tied to Russia's Internet Research Agency. These accounts were immediately suspended, removing their data from the site and from Twitter's developer API. In this talk we show how we can reconstruct a subset of the Twitter network of these Russian troll accounts and apply graph analytics to the data to try to uncover how these accounts were spreading fake news.

This case-study-style presentation will show how we collected and munged the data, taking advantage of the flexibility of the property graph. We'll dive into how NLP and graph algorithms like PageRank and community detection can be applied in the context of social media to make sense of the data. We'll show how Cypher, the query language for graphs, is used to work with graph data. And we'll show how visualization is used in combination with these algorithms to interpret the results of the analysis and to help share the story of the data.
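For readers who want to try something similar on their own data, a hedged sketch using NetworkX on a made-up retweet network (the talk itself works with Neo4j and Cypher):

```python
import networkx as nx
from networkx.algorithms import community

# toy retweet network: an edge u -> v means account u retweeted account v
G = nx.DiGraph()
G.add_edges_from([("troll_a", "troll_b"), ("troll_c", "troll_b"),
                  ("troll_b", "news_src"), ("user_1", "news_src"),
                  ("user_2", "user_1")])

ranks = nx.pagerank(G)  # which accounts the retweet flow concentrates on
groups = community.greedy_modularity_communities(G.to_undirected())

print(sorted(ranks, key=ranks.get, reverse=True)[:3])
print([sorted(c) for c in groups])
```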

Detecting Bias in News Articles

Rob McDaniel - Lingistic / Rakuten

Bias is a hard thing to define, let alone detect. What is bias? How many different types of bias exist? What, if any, lexical cues exist to identify bias-inducing words? Can machines help us qualify and improve news articles?
Using millions of individual Wikipedia revisions, we will discuss a supervised method for identifying bias in news articles. First, we will discuss the last several decades of linguistic research into bias and the various types of biased verbs and lexicons that exist. Then, with plenty of examples, we will explore the way that these words introduce hidden bias into a text, and will follow up with a demonstration of a model for predicting the presence of bias-inducing words.
We will conclude with an exploration of ways to automatically suggest improvements to an article, to associate bias with topics, future implications in the field of stance detection, and a discussion of the background bias of various publishers.
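A rough sketch of the supervised framing, with toy sentences and labels invented for illustration (the talk's actual model draws on millions of labeled Wikipedia revisions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 1 = sentence contains a bias-inducing framing, 0 = neutral rewrite
texts = ["the senator finally admitted the plan had failed",
         "the senator said the plan had not met its goals",
         "critics claim the so-called reform is a disaster",
         "critics said the reform did not reach its targets"]
labels = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

# probability that an unseen sentence contains a bias-inducing word
print(model.predict_proba(["the mayor finally conceded defeat"])[0, 1])
```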

Understanding the development of visual focus of attention in infants using computer vision tools

Qazaleh Mirsharif - CrowdFlower

Head cameras enable developmental scientists to access an infant's visual field from his or her own point of view. A head camera can be mounted on an infant to capture his or her momentary visual experience: how infants visually recognize objects, interact with their social partners, and assign names to those objects. Analysis of such videos requires frame-by-frame human observation of the infant's behavior and high-level expertise. Computer vision has recently been emerging in this field to help developmental scientists further their understanding of the development of visual focus of attention in infants, by providing tools to process these videos in terms of objects and to analyze the motion that generates visual attention. Such computer vision tools reveal patterns in the developmental process of visual focus of attention in infants which cannot be estimated by humans, as the head camera is in constant motion due to the infant's large and random head movements.

Opinionated Analysis Development

Hilary Parker - Stitch Fix

Traditionally, statistical training has focused primarily on mathematical derivations and proofs of statistical tests. The process of developing the technical artifact -- that is, the paper, dashboard, or other deliverable -- is much less frequently taught, presumably because of an aversion to cookbookery or prescribing specific software choices. In this talk, I argue that it's critical to teach generalized opinions for how to go about developing an analysis in order to maximize the probability that an analysis is reproducible, accurate and collaborative. A critical component of this is adopting a blameless postmortem culture. By encouraging the use of and fluency in tooling that implements these opinions, as well as a blameless way of correcting course as analysts encounter errors, we as a community can foster the growth of processes that fail the practitioners as infrequently as possible.

Performance Data Modeling at Scale

Aaron Ploetz - Target

The most important aspect of backing your application with Cassandra is building a good data model. In addition to designing a query-based model that distributes well, performance at scale should also be a prime consideration. After all, you want good things to happen when your application gets a sudden 10x increase in traffic. At Target, the holiday season hits our infrastructure hard, and engineering to withstand that 10x increase is our reality.
In this presentation, we will examine real-world use cases and data processing scenarios. We will cover Cassandra data modeling techniques and considerations for both high performance and large scale. Performance engineering of existing models will also be discussed, along with ways to squeeze out that extra bit of latency.
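To make the query-first idea concrete, here is a hypothetical table sketch via the Python driver; the keyspace, table, and columns are invented and are not from the talk:

```python
from cassandra.cluster import Cluster

# assumes a local node and an existing keyspace named "demo"
session = Cluster(["127.0.0.1"]).connect("demo")

# one partition per user, clustered newest-first, so "latest orders for user X"
# is a single-partition, in-order read that still spreads evenly across the cluster
session.execute("""
    CREATE TABLE IF NOT EXISTS orders_by_user (
        user_id   uuid,
        placed_at timestamp,
        order_id  uuid,
        total     decimal,
        PRIMARY KEY ((user_id), placed_at, order_id)
    ) WITH CLUSTERING ORDER BY (placed_at DESC, order_id ASC)
""")
```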
Intended audience: Cassandra DBAs, developers, and data modelers.

Powers of Ten Redux

Jason Plurad - IBM

One of the first problems a developer encounters when evaluating a graph database is how to construct a graph efficiently. Recognizing this need in 2014, TinkerPop's Stephen Mallette penned a series of blog posts titled "Powers of Ten" which addressed several bulkload techniques for Titan. Since then Titan has gone away, and the open source graph database landscape has evolved significantly. Do the same approaches stand the test of time? In this session, we will take a deep dive into strategies for loading data of various sizes into modern Apache TinkerPop graph systems. We will discuss bulkloading with JanusGraph, the graph database forked from Titan, to better understand how its open source architecture can be optimized for ingestion.
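As a hedged sketch of what small-scale loading looks like against a TinkerPop-enabled server from Python (the endpoint and sample data are assumptions; the talk covers far larger bulkload strategies):

```python
from gremlin_python.structure.graph import Graph
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = Graph().traversal().withRemote(conn)

# row-at-a-time inserts: fine for small graphs, exactly what bulkloading tries to avoid
for name, age in [("alice", 34), ("bob", 29), ("carol", 41)]:
    g.addV("person").property("name", name).property("age", age).iterate()

# add an edge between two existing vertices
g.V().has("person", "name", "alice").as_("a") \
     .V().has("person", "name", "bob") \
     .addE("knows").from_("a").iterate()

print(g.V().hasLabel("person").count().next())
conn.close()
```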

Graph Convolutional Networks for Node Classification

Steve Purves - Expero

We describe a method of classifying nodes in an information network by application of a non-Euclidean convolutional neural network. The convolutional layers are kernelized to operate directly on the natural manifold of the information space, and thus produce output more accurate than analysis on information arbitrarily embedded in a Euclidean geometry. First, we describe the benefits of operating in a non-Euclidean geometry. We then sketch out how graph convolutional networks work. Finally, we demonstrate the application of this technique by predicting the credit-worthiness of applicants based on their population characteristics and their relationships to other individuals.
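One widely used propagation rule for such layers (the Kipf and Welling formulation) can be sketched in a few lines of NumPy; the toy graph and sizes below are invented:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: relu(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])                        # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W, 0)

A = np.array([[0, 1, 0, 0],      # adjacency matrix of a 4-node toy graph
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.random.rand(4, 3)         # 3 input features per node
W = np.random.rand(3, 2)         # 2 hidden units
print(gcn_layer(A, H, W).shape)  # (4, 2)
```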

Building advanced search and analytics engines over arbitrary domains...without a data scientist

Mayank Kejriwal - USC Information Sciences Institute (ISI)

Although search engines like Google work well for many everyday search needs, there are many use cases to which they don't apply. For example, Google does not allow you to limit search to a given arbitrary 'domain' of your choice, be it publications, bioinformatics, or stocks, and it does not offer customized analytics over the domain that you would get if you were able to query the Web like a database. In the past, building such domain-specific 'search and analytics' engines required a full team of engineers and data scientists that would have to collect and crawl the data, set up the infrastructure, write and configure code, and implement complex machine learning algorithms e.g., for extracting useful information from webpages using natural language processing.
The open source Domain-specific Insight Graph (DIG) architecture meets the challenge of domain-specific search by semi-automatically structuring an arbitrary-domain Web corpus into an inter-connected 'knowledge graph' of entities, attributes and relationships. DIG provides the user with intuitive interfaces to define their own schema, customize search, and ultimately, build an entire engine in just a few hours of (non-programming) effort. The search engine itself, once set up, can be used by anyone who has access credentials, and, in addition to structured, faceted and keyword-based search, allows for complex analytics that includes geospatial and temporal analysis, network analysis and dossier generation. The approach is now widely used by law enforcement in the US for important social problems like combating human trafficking, and new uses for it have continued to emerge in DARPA, IARPA and NSF projects. In this talk, I will describe the problem of domain-specific search and the knowledge graph-centric architecture of DIG. I will also cover some important use cases, especially in social domains, for which DIG has already been instantiated and deployed.

What have we done!? 10 years of Cassandra

Patrick McFadin - DataStax

10 years ago a couple of engineers at Facebook put up a project on Google Code and a legend was born. The project has grown, and users have shown an enormous amount of success. Are we ready to say Apache Cassandra has won and have a party? Let me present the evidence and we can decide as a group. No other database has delivered on the initial promises of being a reliable, performant, multi-datacenter source of record for important data. No other project, vendor or cloud has done as well or, I would argue, ever will.
I will highlight the main use cases and data models that have put Apache Cassandra ahead of its peers. If you are new to Apache Cassandra, come learn how you are lied to by every other database that makes this claim. If you are a veteran, let me revive some of the thinking that got you here in the first place and give you some fresh reasons to love this database of ours.

Building Shiny Apps: Challenges and Responsibilities

Jessica Minnier - Oregon Health and Science University

R Shiny has revolutionized the way statisticians and data scientists distribute analytic results and research methods. We can easily build interactive web tools that empower non-statisticians to interrogate and visualize their data or perform their own analyses with methods we develop. However, ensuring the user has an enjoyable experience while guaranteeing the analysis options are statistically sound is a difficult balance to achieve. Through a case study of building START (Shiny Transcriptome Analysis Resource Tool), a Shiny app for "omics" data visualization and analysis, I will present the challenges you may face when building and deploying an app of your own. By allowing the non-statistician user to explore and analyze data, we can make our jobs easier and improve collaborative relationships, but the success of this goal requires software development skills. We may need to consider issues such as data security, open source collaborative code development, error handling and testing, user education, maintenance due to advancing methods and packages, and responsibility for downstream analyses and decisions based on the app's results. With Shiny we do not want to fully eliminate the statistician or analyst "middle man," but we do need to stay relevant and in control of all types of statistical products we create.

Understanding People Using Three Different Kinds of Graphs

Misty Nodine - Spiceworks

There are various ways that we can learn about people using graph-based approaches.
Social graphs – These graphs help us understand people via the connections they have with other people. They are characterized by having one node type (person) and one edge type (whatever social relationship the graph is representing). Typical questions we ask in this space are: How important is this person in this relationship? How well connected are the people? What are the interesting groups?
Knowledge graphs – These graphs represent information we have about a user, the things we can know about them. For instance, a knowledge graph may have nodes not only for people but also for places or companies. There are also a variety of edge types, like 'lives_in' between a person and a city. Knowledge graphs typically take two forms: RDF or entity-relationship. The RDF representations are also related to ontologies and the semantic web. Knowledge graphs enable you to leverage existing knowledge, or knowledge related to other people, to understand a person. Hence, these are graphs that we reason over. Example questions that a knowledge graph might answer include: How big a company does this person work for?
Probabilistic graphical models – Probabilistic graphical models allow us to infer information about a person from things we have observed directly, based on probabilistic relationships. In a PGM, the nodes represent specific things you can observe (variables), and each edge carries the conditional dependencies between the two variables it connects. In real life, we observe actual values for some subset of the nodes and can then infer the probabilities of the values of the unobserved variables (a tiny worked example follows below).
This talk will provide an overview of these three different kinds of graphs and their desirable properties, and the algorithms and approaches that you use over those graphs to understand more about a person.
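A tiny worked example of that last kind of inference, with a made-up joint distribution over two variables:

```python
# joint distribution over (Rain, Grass) for a toy two-node model
joint = {
    ("rain", "wet"): 0.27,    ("rain", "dry"): 0.03,
    ("no_rain", "wet"): 0.14, ("no_rain", "dry"): 0.56,
}

def condition(joint, index, value):
    """P(other variable | observed variable = value), by renormalizing the joint."""
    kept = {k: p for k, p in joint.items() if k[index] == value}
    z = sum(kept.values())
    return {k[1 - index]: p / z for k, p in kept.items()}

# observe that the grass is wet; infer how likely it is that it rained
print(condition(joint, index=1, value="wet"))   # {'rain': ~0.66, 'no_rain': ~0.34}
```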

Vital Role of Humans in Machine Learning

Lynn Pausic - Expero / Chris LaCava - Expero

It doesn't take much effort to stumble across high profile stories faulting "automated technology" for misguided decisions made by courts of law, medical professionals, financial institutions and other important establishments. Upon further examination, these "technology failures" are often attributed to a lack of human oversight or to aiming the intelligence at ill-defined problems, rather than to some critical flaw in the algorithms per se. While the relationship between humans and machine learning (ML) is still in its infancy, one thing is clear: humans play a symbiotic if not vital role in augmenting intelligent technology. For example, training algorithms requires continuous curation, ML outcomes often need human counterparts who can sensibly apply them to real-world contexts, and any organization utilizing ML should routinely review the moral implications of decisions made using intelligent technology. Join us for a fun and engaging talk where we'll demonstrate how the same ML can yield anything from good to very bad outcomes based on key aspects of human involvement.

Here and now: Bringing AI into the enterprise

Kristian Hammond - Narrative Science

Even as AI technologies move into common use, many enterprise decision makers remain baffled about what the different technologies actually do and how they can be integrated into their businesses. On the plus side, technology we thought was decades away seems to be showing up at our doorstep with increasing frequency. However, little effort has been made to clearly explain the value and genuine business utility of this technology.
Kristian Hammond shares a practical framework for understanding the role of AI technologies in problem solving and decision making, focusing on how they can be used, the requirements for doing so, and the expectations for their effectiveness. Kris starts with a lecture outlining this functional framework and ends with hands-on exercises so you can practice using it in the real world when evaluating data, requirements and opportunities. You’ll leave with greater knowledge of the space and the skills to apply that knowledge to your businesses, ensuring that as you build, evaluate, and compare different systems, you’ll understand and be able to articulate how they work and the resulting impact.

Lexicon Mining for Semiotic Squares: Exploding Binary Classification

Jason Kessler - CDK Global

A common task in natural language processing is category-specific lexicon mining, or identifying words and phrases that are associated with the presence or absence of a specific category. For example, lists of words associated with positive (vs. negative) product reviews may be automatically discovered from labeled corpora.
In the 1960s, the semanticists A. J. Greimas and F. Rastier developed a framework for turning two opposing categories into a network of 10 semantic classes. This talk introduces an algorithm for discovering lexicons associated with those semantic classes given a corpus of categorized documents. This algorithm is implemented as part of Scattertext, and the output can be viewed in an interactive browser-based visualization.
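Scattertext is available on PyPI; a minimal usage sketch might look like the following, where the dataframe columns and category labels are placeholders rather than the talk's data:

```python
import pandas as pd
import scattertext as st

df = pd.DataFrame({
    "text": ["great battery life and screen", "battery died after a week",
             "love the camera", "terrible customer support"],
    "label": ["positive", "negative", "positive", "negative"],
})

# build a corpus labeled by category, then render the interactive lexicon explorer
corpus = st.CorpusFromPandas(df, category_col="label", text_col="text",
                             nlp=st.whitespace_nlp_with_sentences).build()

html = st.produce_scattertext_explorer(corpus,
                                       category="positive",
                                       category_name="Positive",
                                       not_category_name="Negative")
open("lexicon.html", "w").write(html)
```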

Introduction to SparkR in AWS EMR (90 minute session)

Alex Engler - Urban Institute

This session is a hands-on tutorial on working in Spark through R and RStudio in AWS Elastic MapReduce (EMR). The demonstration will overview how to launch and access Spark clusters in EMR with R and RStudio installed. Participants will be able to launch their own clusters and run Spark code during an introduction to SparkR, including the SparklyR package, for data science applications. Theoretical concepts of Spark, such as the directed acyclic graph and lazy evaluation, as well as mathematical considerations of distributed methods will be interspersed throughout the training. Follow up materials on launching SparkR clusters and tutorials in SparkR will be provided.
Intended Audience: R users who are interested in a first foray into distributed cloud computing for the analysis of massive datasets. No big data, dev ops, or Spark experience is required.

Everything is not a graph problem (but there are plenty)

Dr. Denise Gosnell - DataStax

As the reality of the graph hype cycle sets in, the graph pragmatists have shown up to guide the charge. What we are seeing and experiencing is an adjustment in mindset: the convergence to multi-model database systems parallels the mentality of using the right tool for the problem. With graph databases, there is an intricate balance to find where the rubber meets the road between theorists and practitioners.
Before hammering away on the keyboard to insert vertices and edges, it is crucial to iterate and drive the development life cycle from definitive use cases. Too many times the field has seen monoglot system thinking pressure the construction of the one graph that can rule it all, which can result in some impressive scope creep. In this talk, Dr. Gosnell will walk through common solution design considerations that can make or break a graph implementation and suggest some best practices for navigating common misconceptions.

Real-time deep link analytics: The next stage of graph analytics

Dr. Victor Lee - TigerGraph

Graph databases are the fastest growing category in data management, according to DB-Engines. However, most graph queries only traverse two hops in big graphs due to limitations in most graph databases. Real-world applications require deep link analytics that traverse far more than three hops. To support real-time deep link analytics, we need the power of combining real-time data updates, big datasets, and deep link traversals.
Dr. Victor Lee offers an overview of TigerGraph's distributed Native Parallel Graph. He discusses the techniques behind the platform, including how it partitions graph data across machines, supports fast updates, and is still able to perform fast graph traversal and computation. He also shares a subsecond real-time fraud detection system that manages 100 billion graph elements to detect risk and fraudulent groups.
(Product Showcase)

Generating Natural-Language Text with Neural Networks

Jonathan Mugan - Deep Grammar

Automatic text generation enables computers to summarize text, to have conversations in customer-service and other settings, and to customize content based on the characteristics and goals of the human interlocutor. Using neural networks to automatically generate text is appealing because they can be trained through examples with no need to manually specify what should be said when. In this talk, we will provide an overview of the existing algorithms used in neural text generation, such as sequence2sequence models, reinforcement learning, variational methods, and generative adversarial networks. We will also discuss existing work that specifies how the content of generated text can be determined by manipulating a latent code. The talk will conclude with a discussion of current challenges and shortcomings of neural text generation.
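To ground the discussion, here is a minimal, untrained PyTorch sketch of the decoder-side sampling loop common to these models; the model sizes and names are arbitrary illustrations, not the talk's code:

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy recurrent language model: embed -> LSTM -> next-token logits."""
    def __init__(self, vocab_size, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, state=None):
        h, state = self.lstm(self.embed(tokens), state)
        return self.out(h), state

def sample(model, start_id, length):
    """Generate tokens one at a time by sampling from the model's output distribution."""
    ids, state = [start_id], None
    x = torch.tensor([[start_id]])
    with torch.no_grad():
        for _ in range(length):
            logits, state = model(x, state)
            probs = torch.softmax(logits[0, -1], dim=-1)
            next_id = torch.multinomial(probs, 1).item()
            ids.append(next_id)
            x = torch.tensor([[next_id]])
    return ids

print(sample(TinyLM(vocab_size=50), start_id=0, length=10))
```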

Using R on small teams in industry

Jonathan Nolis - Lenati

Doing statistical analyses and machine learning in R requires many different components: data, code, models, outputs, and presentations. While one person can usually keep track of their own work, as you grow into a team of people it becomes more important to stay coordinated. This session discusses the data science work we do at Lenati, a marketing and strategy consulting firm, and why R is a great tool for us. It covers the best practices we have found for working on R code together across many projects and people, and how we handle the occasional instances where we must use other languages.

Autopiloting #realtime processing in Heron

Karthik Ramasamy - Streamlio

Many enterprises now produce data not only at high volume but also at high velocity. Many daily business operations depend on real-time insights, so real-time processing of the data is gaining significance. Hence there is a need for a scalable infrastructure that can continuously process billions of events per day the instant the data is acquired. To achieve real-time performance at scale, Twitter developed and deployed Heron, a next-generation cloud streaming engine that provides unparalleled performance at large scale. Heron has been successfully meeting the strict performance requirements of various streaming applications and is now an open source project with contributors from various institutions. Heron faced some crucial challenges from the developers' and operators' point of view: the manual, time-consuming and error-prone tasks of tuning various configuration knobs to achieve service level objectives (SLOs), as well as the maintenance of SLOs in the face of sudden, unpredictable load variation and hardware or software performance degradation.
In order to address these issues, we conceived and implemented Dhalion, which aims to bring self-regulating capabilities to streaming systems. Dhalion monitors the streaming application, identifies problems that prevent the application from meeting its targeted performance, and automatically takes actions to recover, such as restarting slow processes and scaling resources up and down in case of load variations. Dhalion has been built as an extension to Heron and contributed back as open source. In this talk, I will give a brief introduction to Heron, enumerate the challenges that we faced while running it in production, and describe how Dhalion solves some of those challenges. This is joint work with Avrilia Floratou and Ashvin Agrawal at Microsoft and Bill Graham at Twitter.

Next Generation Real Time Architectures

Karthik Ramasamy - Streamlio

Across diverse segments of industry, there has been a shift in focus from Big Data to Fast Data. This is driven by enterprises producing data not only in volume but also at high velocity. Many daily business operations depend on real-time insights and on how enterprises react to those situations. In this talk, we will describe what constitutes a real-time stack and how the stack is organized to provide an end-to-end real-time experience. The next-generation real-time stack consists of Apache Pulsar, a messaging system; Heron, a distributed streaming engine; and Apache BookKeeper, which provides fast streaming storage. We will delve into the details of each of these systems and explain why they are better than previous-generation systems.

Writing Distributed Graph Algorithms

Andrew Ray - Sam's Club

Distributed graph algorithms are an important concept for understanding large scale connected data. One such algorithm, Google’s PageRank, changed internet search forever. Efficient implementations of these algorithms in distributed systems are essential to operate at scale.
This talk will introduce the main abstractions for these types of algorithms. First we will discuss the Pregel abstraction created by Google to solve the PageRank problem at scale. Then we will discuss the PowerGraph abstraction and how it overcomes some of the weaknesses of Pregel. Finally we will turn to GraphX and how it combines together some of the best parts of Pregel and PowerGraph to make an easier to use abstraction.
For all of these abstractions we will discuss the implementations of three key examples: Connected Components, Single Source Shortest Path, and PageRank. For the first two abstractions this will be in pseudo code and for GraphX we will use Scala. At the end we will discuss some practical GraphX tips and tricks.

We R What We Ask: The Landscape of R Users on Stack Overflow

Dave Robinson - Stack Overflow

Since its founding in 2008, the question and answer website Stack Overflow has been a valuable resource for the R community, collecting more than 200,000 questions about R that are visited millions of times each month. This makes it a useful source of data for observing trends about how people use and learn the language. In this talk, I show what we can learn from Stack Overflow data about the global use of the R language over the last decade. I'll examine what ecosystems of R packages are asked about together, what other technologies are used alongside it, in what industries it has been most quickly adopted, and what countries have the highest density of users. Together, the data paints a picture of a global and rapidly growing community. Aside from presenting these results, I'll introduce interactive tools and visualizations that the company has published to explore this data, as well as a number of open datasets that analysts can use to examine trends in software development.

The Lesser Known Stars of the Tidyverse

Emily Robinson - Etsy

While most R programmers have heard of ggplot2 and dplyr, many are unfamiliar with the breadth of the tidyverse and the variety of problems it can solve. In this talk, we will give a brief introduction to the concept of the tidyverse and then describe three packages you can immediately start using to make your workflow easier. The first package is forcats, designed to make working with categorical variables easier; the second is glue, for programmatically combining data and strings; and the third is tibble, an alternative to data.frames. We will cover their basic functions so that, by the end of the talk, you will be able to use them and learn more about the broader tidyverse.

Integrating Semantic Web Technologies in the Real World: A journey between two cities

Juan Sequeda - Capsenta

An early vision in Computer Science has been to create intelligent systems capable of reasoning over large amounts of data. Today, this vision can be delivered by integrating Relational Databases with Semantic Web technologies via the W3C standards: a graph data model (RDF), ontology language (OWL), mapping language (R2RML) and query language (SPARQL). The research community has repeatedly shown how intelligent systems can be created with Semantic Web technologies, now dubbed Knowledge Graphs. However, where is the mainstream industry adoption? What are the barriers to adoption? What are the open scientific problems that need to be addressed to overcome those barriers?
This talk will chronicle our journey of deploying Semantic Web technologies with real-world users to address Business Intelligence and Data Integration needs, describe the technical and social obstacles present in large organizations, highlight the scientific challenges that require attention, and argue for the resurrection of Knowledge Engineers.

G-CORE: A Core for Future Graph Query Languages, designed by the LDBC Graph Query Language Task Force

Juan Sequeda - Capsenta

In this talk, Juan will report on a community effort between industry and academia to shape the future of graph query languages. We argue that existing graph database management systems should consider supporting a query language with two key characteristics. First, it should be composable, meaning that graphs are both the input and the output of queries. Second, the language should treat paths as first-class citizens. Our result is G-CORE, a powerful graph query language design that fulfills these goals and strikes a careful balance between path query expressivity and evaluation complexity.
This work is the culmination of 2.5 years of intensive discussion between the LDBC Graph Query Language Task Force and members of industry (Capsenta, HP, Huawei, IBM, Neo4j, Oracle, SAP and Sparsity) and academia (CWI Amsterdam, PUC Chile, Technische Universiteit Eindhoven, Technische Universität Dresden, Universidad de Chile, Universidad de Talca).
Link to paper:

Text Mining Using Tidy Data Principles

Julia Silge - Stack Overflow

Text data is increasingly important in many domains, and tidy data principles and tidy tools can make text mining easier and more effective. I will demonstrate how we can manipulate, summarize, and visualize the characteristics of text using these methods and R packages from the tidy tool ecosystem. These tools are highly effective for many analytical questions and allow analysts to integrate natural language processing into effective workflows already in wide use. We will explore how to implement approaches such as sentiment analysis of texts, measuring tf-idf, and measuring word vectors.
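As a rough illustration of this tidy approach (an assumed example, not the talk's own code), the sketch below tokenizes a corpus with tidytext, joins a sentiment lexicon, and computes tf-idf; the janeaustenr package stands in for whatever texts an analysis might target.

    # Illustrative tidy text mining sketch: tokenize, sentiment, tf-idf.
    library(dplyr)
    library(tidytext)
    library(janeaustenr)

    books <- austen_books()              # one row per line of text, with a book column

    words <- books %>%
      unnest_tokens(word, text)          # one row per word (tidy format)

    # Sentiment analysis: join against the Bing lexicon and count by book
    words %>%
      inner_join(get_sentiments("bing"), by = "word") %>%
      count(book, sentiment)

    # tf-idf: which words characterize each book?
    words %>%
      count(book, word, sort = TRUE) %>%
      bind_tf_idf(word, book, n) %>%
      arrange(desc(tf_idf))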

Speeding up R with Parallel Programming in the Cloud

David Smith - Microsoft

There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe several techniques available in R to speed up workloads like these, by running multiple iterations simultaneously, in parallel.
Many of these techniques require a cluster of machines running R, and I'll provide examples of using cloud-based services to provision clusters for parallel computations. In particular, I will describe how you can use the sparklyr package to distribute data manipulations using dplyr syntax on a cluster of servers provisioned in the Azure cloud.
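The sketch below gives a hedged flavor of the two patterns described above; the core count, simulated data, and local Spark connection are illustrative assumptions, not the talk's actual cloud setup.

    # Embarrassingly parallel simulation across local cores
    # (the same pattern scales to a remote cluster).
    library(parallel)
    cl <- makeCluster(detectCores() - 1)
    sim_means <- parLapply(cl, 1:1000, function(i) mean(rnorm(100)))
    stopCluster(cl)

    # sparklyr: dplyr syntax pushed down to a Spark cluster
    # (spark_connect would point at an Azure-hosted cluster in the talk's setting).
    library(sparklyr)
    library(dplyr)
    sc <- spark_connect(master = "local")          # stand-in for a cloud cluster
    mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
    mtcars_tbl %>%
      group_by(cyl) %>%
      summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
      collect()
    spark_disconnect(sc)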

Making Magic with Keras and Shiny

Nicholas Strayer - Vanderbilt University

The web-application framework Shiny has opened up enormous opportunities for data scientists by giving them a way to bring their models and visualizations to the public in interactive applications using only R code. Likewise, the keras package has simplified the process of getting up and running with deep neural networks by abstracting away much of the boilerplate and bookkeeping associated with writing models in a lower-level library such as tensorflow. In this presentation, I will demo and discuss the development of a Shiny app that allows users to cast 'spells' simply by waving their phone around like a wand. The app gathers the motion of the device using the library shinysense and feeds it into a convolutional neural network which predicts spell casts with high accuracy. A supplementary Shiny app for gathering data will also be shown. These applications demonstrate that Shiny can be used at both the data-gathering and model-presentation steps of data science.
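For context, a convolutional model for motion traces might be defined in the R keras package roughly as follows; the sequence length, channel count, and number of spell classes here are assumptions for illustration, not the talk's actual architecture.

    # Rough sketch of a 1D convolutional classifier for accelerometer traces.
    library(keras)

    n_timesteps <- 100   # assumed length of a recorded wand motion
    n_channels  <- 3     # assumed x/y/z motion channels
    n_spells    <- 5     # assumed number of spell classes

    model <- keras_model_sequential() %>%
      layer_conv_1d(filters = 32, kernel_size = 5, activation = "relu",
                    input_shape = c(n_timesteps, n_channels)) %>%
      layer_max_pooling_1d(pool_size = 2) %>%
      layer_conv_1d(filters = 64, kernel_size = 5, activation = "relu") %>%
      layer_global_max_pooling_1d() %>%
      layer_dense(units = n_spells, activation = "softmax")

    model %>% compile(
      optimizer = "adam",
      loss = "categorical_crossentropy",
      metrics = "accuracy"
    )
    # model %>% fit(x_train, y_train, epochs = 20, validation_split = 0.2)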

Silicon Valley vs New York: Who has Better Data Scientists? (a knowledge graph)

Denis Vrdoljak / Gunnar Kleemann - Berkeley Data Science Group

The story of Berkeley Data Science Group started when a New York Data Scientist met a Silicon Valley Data Scientist. Along the way, we built several Data Science tools, including one we showed off at Data Day Seattle recently! We built a tool to determine the skills necessary for a given job, along with potentially equivalent skills. In this talk, we'll use our tool and the underlying knowledge graph to contrast the differences in Data Science skill sets between the coasts.
We’ll cover the basics of how our system works and how we used Graph Databases to help us model and analyze traditional NLP problems. We’ll show you the ontology we used to model the information we extracted into a knowledge graph, and how we applied the concept of lazy evaluation to simplify our application. We’ll talk about our experiences with Graph Databases and why we chose the ones we use in our app.
But, most importantly, in this talk we’ll try to answer a very important question: Which coast has better Data Scientists?

Fishing Graphs in a Hadoop Data Lake

Claudius Weinberger - ArangoDB

Hadoop clusters can store nearly everything in your data lake cheaply and blazingly fast. Answering questions and gaining insights from this ever-growing stream has become the decisive capability for many businesses. Increasingly, data has a natural structure as a graph, with vertices linked by edges, and many questions about the data involve graph traversals or other complex queries for which there is no a priori bound on the length of paths.
Spark with GraphX is great for answering relatively simple graph questions that are worth starting a Spark job for because they essentially involve the whole graph. But does it make sense to start a Spark job for every ad-hoc query, and is it suitable for complex real-time queries?

The State of JanusGraph 2018

Ted Wilmes - Expero

Graph database adoption increased at a rapid clip in 2017 and shows no sign of slowing down as we begin 2018. When coupled with the right problem set, it's a compelling solution, and word has spread from the startup world all the way to the Fortune 500. JanusGraph, an Apache TinkerPop-compatible fork of the popular Titan graph database, was one of the newcomers to the marketplace last year. Its future was uncertain, but a dedicated community coalesced around it, and three releases later, with an ever-growing list of contributors, it is here to stay. This talk will introduce JanusGraph, discuss where it fits into the existing graph database ecosystem, and then review the progress made over the past year, with an eye to the exciting things coming up in 2018.