October 28th, 2016

Big Data Summer Internship Program, Summer 2016/17
Data Management Group @ Brown Computer Science

The Data Management Group at the Brown Computer Science Department is looking for highly motivated students to serve as interns during the fall and spring semesters of 2016/17. Interns will participate in research projects that revolve around challenging big data problems, specifically in the areas of interactive, user-friendly big data analytics, scientific data management, and cloud-optimized data management.

Current Projects:

20/20: Human-in-the-Loop Data Exploration

We propose to build a new class of database systems designed for human-in-the-loop (HIL) operation. We target an ever-growing set of data-centric applications in which data scientists of varying skill levels manipulate, analyze, and explore large data sets, often using complex analytics and machine learning techniques. Enabling these applications with ease of use and at “human speeds” is key to democratizing data science and maximizing human productivity.

Traditional database technologies are ill-suited to serve this purpose. Historically, databases assumed (1) text-based input (e.g., SQL) and output, (2) a point (i.e., stateless) query-response paradigm, (3) batch results, and (4) simple analytics. We will drop these fundamental assumptions and build a system that instead supports visual input and output, “conversational” interaction, early and progressive results, and complex analytics. Building a system that integrates these features requires a complete rethinking of the full database stack, from the interface down to the “guts,” as well as the incorporation of pertinent algorithms.
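
To make “early and progressive results” concrete, here is a minimal sketch of progressive aggregation in Python. It is purely illustrative and not 20/20's actual implementation; the function progressive_mean and its parameters are invented for the example. The idea: rather than blocking until a full scan completes, the engine streams back a running estimate whose confidence interval tightens as more data is seen.

    import random
    import statistics

    def progressive_mean(values, chunk_size=10_000):
        """Yield (estimate, 95% CI half-width, fraction scanned) as the scan proceeds."""
        random.shuffle(values)  # scan in random order so partial results are unbiased
        seen = []
        for start in range(0, len(values), chunk_size):
            seen.extend(values[start:start + chunk_size])
            est = statistics.fmean(seen)
            # Normal-approximation confidence interval around the running mean.
            half = 1.96 * statistics.stdev(seen) / len(seen) ** 0.5
            yield est, half, len(seen) / len(values)

    if __name__ == "__main__":
        data = [random.gauss(100, 15) for _ in range(1_000_000)]
        for est, half, frac in progressive_mean(data, chunk_size=200_000):
            print(f"scanned {frac:4.0%}: mean = {est:6.2f} +/- {half:.2f}")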

Tupleware

There is a fundamental discrepancy between the targeted and actual users of current analytics frameworks. Most systems are designed for the problems faced by the Googles and Facebooks of the world—petabytes of data distributed across large cloud deployments consisting of thousands of cheap commodity machines. Yet, the vast majority of users operate clusters ranging from a few to a few dozen nodes and analyze relatively small data sets of up to a few terabytes. Targeting these users fundamentally changes the way we should build analytics systems.

Therefore, we are developing Tupleware, a new system specifically aimed at the challenges faced by the typical user. Tupleware’s architecture brings together ideas from the database and compiler communities to create a powerful end-to-end solution for data analysis. We propose novel techniques that consider the data, computations, and hardware together to achieve maximum performance on a case-by-case basis.
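
As a rough illustration of the compiler-style thinking involved, the sketch below (a hypothetical fused_pipeline helper, not Tupleware's actual API) fuses a pipeline of map and filter UDFs into a single pass over the data, avoiding the materialized intermediates that operator-at-a-time execution would create. Tupleware applies this kind of cross-operator reasoning far more deeply, together with knowledge of the data and hardware.

    def fused_pipeline(data, ops):
        """Apply a pipeline of ('map', f) / ('filter', p) UDFs in one pass."""
        out = []
        for tup in data:              # a single loop over the input ...
            for kind, fn in ops:      # ... with every operator applied per tuple
                if kind == "map":
                    tup = fn(tup)
                elif kind == "filter" and not fn(tup):
                    break             # tuple rejected; skip the remaining operators
            else:
                out.append(tup)       # tuple survived the whole pipeline
        return out

    if __name__ == "__main__":
        ops = [("filter", lambda x: x % 2 == 0),  # keep even values
               ("map", lambda x: x * x)]          # then square them
        print(fused_pipeline(range(10), ops))     # -> [0, 4, 16, 36, 64]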

S-Store

Stream processing addresses the needs of real-time applications. Transaction processing addresses the coordination and safety of short atomic computations. In the past, these two modes of operation were found only in separate, stove-piped systems. However, with the creation of NewSQL OLTP systems, it becomes possible to perform scalable real-time operations without sacrificing transactional support. Enter S-Store, the world’s first transactional streaming database system.
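
The toy sketch below illustrates the core idea using Python's built-in SQLite rather than S-Store's engine: each incoming batch of stream events is applied to shared state as one ACID transaction, so a failure mid-batch can never leave partial results behind. The process_batch helper and the counts schema are invented for the example.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE counts (key TEXT PRIMARY KEY, n INTEGER)")

    def process_batch(events):
        """Apply one batch of events as a single atomic transaction."""
        with conn:  # commits on success, rolls back on any exception
            for key in events:
                # Upsert syntax requires SQLite 3.24+.
                conn.execute(
                    "INSERT INTO counts VALUES (?, 1) "
                    "ON CONFLICT(key) DO UPDATE SET n = n + 1", (key,))

    stream = [["a", "b", "a"], ["b", "c"]]  # two batches arriving over time
    for batch in stream:
        process_batch(batch)
    print(conn.execute("SELECT * FROM counts ORDER BY key").fetchall())
    # -> [('a', 2), ('b', 2), ('c', 1)]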

MLbase

Machine learning (ML) and statistical techniques are key to transforming big data into actionable knowledge. In spite of the modern primacy of data, the complexity of existing ML algorithms is often overwhelming and many users do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques. Furthermore, existing scalable systems that support ML are typically not accessible to ML researchers without a strong background in distributed systems and low-level primitives.

With MLbase, we tackle both of these issues simultaneously, leveraging the aligned incentives between ML researchers and non-expert practitioners to build a single platform for consuming and developing ML. Moreover, MLbase provides an attractive interface between ML and systems researchers, as the efforts of both groups naturally complement one another. MLbase provides (1) a simple declarative way to specify ML tasks, (2) a novel optimizer to select and dynamically adapt the choice of learning algorithm, (3) a set of high-level operators to enable ML researchers to scalably implement a wide range of ML methods without deep systems knowledge, and (4) a run-time optimized for the data-access patterns of these high-level operators.
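
As a toy illustration of points (1) and (2), the hypothetical do_classify below (built on scikit-learn, not MLbase itself) exposes a single declarative call and lets a simple optimizer pick among candidate learning algorithms by cross-validation; the real system searches a far larger space and adapts its choice dynamically.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    def do_classify(X, y):
        """Declarative entry point: score each candidate learner by 5-fold
        cross-validation and return the best one, fitted on all the data."""
        candidates = [LogisticRegression(max_iter=1000),
                      KNeighborsClassifier(),
                      DecisionTreeClassifier()]
        scored = [(cross_val_score(m, X, y, cv=5).mean(), m) for m in candidates]
        best_score, best_model = max(scored, key=lambda pair: pair[0])
        return best_model.fit(X, y), best_score

    X, y = load_iris(return_X_y=True)
    model, score = do_classify(X, y)
    print(type(model).__name__, f"cv accuracy = {score:.3f}")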

For more information about the Data Management Group and its activities, see: http://database.cs.brown.edu

Requirements and Compensation:

Each intern will work closely with the Brown Data Management Group (faculty and students) to build new data management systems and tools. Strong programming skills are required; research experience and familiarity with database systems and techniques are recommended. We especially encourage students who are interested in pursuing PhD-level database research in the near future. Compensation will include a salary (commensurate with skills and experience) as well as travel expenses. The standard internship period is 3-6 months.

Application:

To apply for an internship position, please e-mail your resume to <bigdata@cs.brown.edu> and include contact information for two people who can serve as references. Informal inquiries should be sent to the same address.