Google Summer of Code 2015

At Twitter, we love open source, working with students, and Google Summer of Code (GSoC)! What is GSoC? Every year, Google invites students to come up with interesting problems for their favorite open source projects and work on them over the summer. Participants get support from the community, plus a mentor who makes sure they don't get lost and that they meet their goals. Aside from the satisfaction of solving challenging problems and contributing to the open source community, students get paid and get some sweet swag for their work! In our opinion, this is a great opportunity to get involved with open source, improve your skills, and help out the community!

If you're interested in the Outreach Program for Women as an option, please see its wiki page: https://github.com/twitter/twitter.github.com/wiki/Outreach-Program-for-Women-2015

Information for Students

These ideas were contributed by our developers and our community; they are only meant to be a starting point. If you wish to submit a proposal based on one of these ideas, contact the developers to find out more about the particular suggestion you're looking at.

Being accepted as a Google Summer of Code student is quite competitive. Accepted students typically have thoroughly researched the technologies of their proposed project and have been in frequent contact with potential mentors. Simply copying and pasting an idea here will not work. On the other hand, creating a completely new idea without first consulting potential mentors is unlikely to work out.

If no specific contact is given, you can ask questions via @TwitterOSS or the twitter-gsoc mailing list.

Accepted Projects

For 2015, @TwitterOSS accepted X students to work on Y different open source projects:

TODO

The project details are listed below:

Adding a Proposal

Please follow this template:

Brief explanation:
Expected results:
Knowledge Prerequisite:
Mentor:

When adding an idea to this section, please try to include the data listed in the template above.

If you are not a developer but have a good idea for a proposal, first get in contact with the relevant developers or @TwitterOSS.

Project Ideas

Finagle

A good starting point for Finagle is the Quickstart: http://twitter.github.io/finagle/guide/Quickstart.html

You could also start digging in the code here: https://github.com/twitter/finagle/

Check out the Finagle mailing list if you have any questions.

Kerberos authentication in Mux

Brief explanation: Mux is a new RPC session protocol in use at Twitter. We would like to add Kerberos authentication to it.
Knowledge Prerequisite: Scala, Distributed systems
Mentor: Marius Eriksen (@marius) and Steve Gury (@stevegury)

Aurora

Aurora CLI Improvements

Brief explanation: Add new functionality to the Aurora client CLI to make programmers' lives easier.
Knowledge Prerequisite: Python
Mentor: Mark Chu-Carroll (@MarkCC)
JIRA Issue: AURORA-217

Aurora Configuration Documentation

Brief explanation: Add documentation for the Pystachio framework used for managing configurations in Aurora.
Knowledge Prerequisite: Python (basic knowledge is enough)
Mentor: Mark Chu-Carroll (@MarkCC)

Mesos

Libprocess Benchmark Suite

Brief explanation: Implement a benchmark suite for libprocess to identify potential performance improvements and test for performance regressions.
Knowledge Prerequisite: C++
Mentor: Ben Mahler (@bmahler) and Jie Yu (@jie_yu)
JIRA Issue: MESOS-1018

TODO

Summingbird

Summingbird is a library that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding.
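To make the "collection transformation" style concrete, here is a minimal, language-neutral sketch in plain Python. It does not use Summingbird's API (which is Scala); the names tokenize and word_count are hypothetical and exist only for illustration. It shows a word count written as ordinary collection operations, where partial results from two separate runs can simply be added together.

```python
from collections import Counter

def tokenize(sentence):
    """Split a sentence into lowercase words (hypothetical helper)."""
    return sentence.lower().split()

def word_count(sentences):
    """Count words across sentences using ordinary collection operations."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts

# Two independent runs over different inputs (for example, an older batch
# and a newer set of messages). Counters add, so partial results merge.
batch = word_count(["just setting up", "setting up my account"])
recent = word_count(["my account"])
merged = batch + recent

print(merged["setting"], merged["account"])
```

The same counting logic is reused for both inputs, and only the final merge step combines them.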

Addition of Akka backend for streaming compute

Brief explanation: Akka (http://akka.io) is a popular open source distributed actor system. Integrating it into Summingbird would widen the range of compute platforms available to users, making the system more accessible and suitable for more varied tasks.
Knowledge Prerequisite: Need to know Scala (or have strong knowledge of Java with some functional programming background). Need to be somewhat familiar with Hadoop.
Mentor: Oscar Boykin (@posco)

Addition of Samza backend for streaming compute

Brief explanation: Samza (http://samza.incubator.apache.org/) is a new Apache incubator project that allows compute to be placed between two Kafka streams. Integrating it into Summingbird would widen the range of compute platforms available to users, making the system more accessible and suitable for more varied tasks.
Knowledge Prerequisite: Need to know Scala (or have strong knowledge of Java with some functional programming background). Need to be somewhat familiar with Hadoop and YARN.
Mentor: Oscar Boykin (@posco) or Ian O'Connell (@0x138)

Better Spark Support in Summingbird

Brief explanation: We currently have an alpha version of Spark support for batch computation. This should be completed and accompanied by a demo application. After that, we should add a realtime layer using spark-streaming.
Knowledge Prerequisite: Need to know Scala (or have strong knowledge of Java with some functional programming background). Need to be somewhat familiar with Spark.
Mentor: Oscar Boykin (@posco) or Ian O'Connell (@0x138)

Addition of Tez backend for offline batch compute

Brief explanation: Tez (http://tez.incubator.apache.org) is a new Apache incubator project that generalizes and expands the map/reduce model of computation. Summingbird should be able to automatically take advantage of map-reduce-reduce plans and other optimizations that Tez enables. This should perform better than the existing Hadoop-via-Cascading-via-Scalding backend.
Knowledge Prerequisite: Need to know Scala (or have strong knowledge of Java with some functional programming background). Need to be somewhat familiar with Hadoop and YARN.
Mentor: Ian O'Connell (@0x138)

Addition of batch key/value store on Mesos or Yarn

Brief explanation: Something that is sorely missing from the open source release of Scalding is a good batch-writable, read-only key-value store to use for batch jobs. This could be something like ElephantDB (https://github.com/nathanmarz/elephantdb) or HBase. Having such a project set up with Summingbird would be a huge coup for the open source community.
Knowledge Prerequisite: Need to know Scala (or have strong knowledge of Java with some functional programming background). Ideally familiar with Mesos or YARN, and with low-latency key-value stores like HBase or ElephantDB.
Mentor: Oscar Boykin (@posco) or Ian O'Connell (@0x138)

Scalding

Scalding is Twitter's library for programming in Scala on Hadoop. It is approachable for newcomers, with a fields-based, data-frame-like API as well as a type-safe API. There is also a linear algebra API to support working with giant matrices and vectors on Hadoop.

Productionize the scalding REPL

Brief explanation: Scalding has a REPL, where users can enter commands and see them run. There are a few issues: 1) it does not currently detect which pipes need to be reevaluated; 2) it cannot load files and jars and execute user scripts; 3) the design of Scalding usually assumes units (jobs) that don't interact well with a more immutable, functional style (because the plan is mutated by the job).
Expected results: We want a Scalding executable that does the standard imports, can interact with a cluster or run in local mode, supports EMR, and does not wastefully recompute data.
Knowledge Prerequisite: Need to know Scala (or have strong knowledge of Java with some functional programming background). Need to be somewhat familiar with Hadoop. Must be familiar with graphs for modeling flows of computation.
Mentor: Oscar Boykin (@posco)
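The "do not recompute data" goal can be pictured as caching results over a graph of pipes. The following is a minimal sketch in Python, not Scalding code: the Pipe and Session classes and their fields are hypothetical, and a real REPL would also need cache invalidation when a pipe is redefined.

```python
class Pipe:
    """Hypothetical node in a computation graph: a function plus its inputs."""
    def __init__(self, name, fn, *parents):
        self.name = name
        self.fn = fn
        self.parents = parents

class Session:
    """Hypothetical REPL session that evaluates each pipe at most once."""
    def __init__(self):
        self.cache = {}
        self.evaluations = 0

    def run(self, pipe):
        # Reuse the stored result if this pipe was already evaluated.
        if pipe in self.cache:
            return self.cache[pipe]
        inputs = [self.run(parent) for parent in pipe.parents]
        self.evaluations += 1
        result = pipe.fn(*inputs)
        self.cache[pipe] = result
        return result

source = Pipe("source", lambda: [1, 2, 3])
doubled = Pipe("doubled", lambda xs: [x * 2 for x in xs], source)
total = Pipe("total", lambda xs: sum(xs), doubled)

session = Session()
first = session.run(total)     # evaluates source, doubled, total
second = session.run(doubled)  # served from the cache, nothing recomputed
print(first, second, session.evaluations)
```

Keying the cache on the pipe object keeps the sketch short; a production design would more likely key on a description of the computation so that equivalent pipes share results.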

TODO

Parquet

https://github.com/Parquet/parquet-mr/issues?labels=GSoC-2014&state=open

Parquet compatibility across tools

Brief explanation: Develop cross-tool compatibility tests for Parquet (https://github.com/Parquet/parquet-mr/issues/300)
Expected results:
- Compatibility of nested data types across tools (Pig, Hive, Avro, Thrift, etc.)
- Automated compatibility check between the Java implementation and Impala (across release versions)
Knowledge Prerequisite: Java, Hadoop, Test frameworks
Mentor: Aniket Mokashi (@aniket486) and Julien Le Dem (@J_)

Parquet optimizations from stats support

Brief explanation: Develop optimizations for Parquet using page statistics (https://github.com/Parquet/parquet-mr/issues/301)
Expected results:
- In parquet-format 2.0, the data page header supports Statistics. We need to add optimizations that make use of these page statistics.
- Index support for parquet
- Explore use of probabilistic data structures in parquet (CountMinSketch etc.)
Knowledge Prerequisite: Java
Mentor: Aniket Mokashi (@aniket486) and Julien Le Dem (@J_)
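As a rough illustration of what "making use of page statistics" could mean, the sketch below skips pages whose maximum value rules out any match for a "greater than" filter. It is written in Python for brevity and is not the parquet-mr API: the Page class, its fields, and scan_greater_than are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Page:
    """Hypothetical data page carrying precomputed min/max statistics."""
    min_value: int
    max_value: int
    values: List[int]

def scan_greater_than(pages, threshold):
    """Return (matching values, number of pages actually read)."""
    matches = []
    pages_read = 0
    for page in pages:
        # The max statistic proves no value in this page can match: skip it
        # without touching (decoding) its values.
        if page.max_value <= threshold:
            continue
        pages_read += 1
        matches.extend(v for v in page.values if v > threshold)
    return matches, pages_read

pages = [
    Page(1, 9, [1, 5, 9]),
    Page(10, 20, [10, 12, 20]),
    Page(21, 42, [21, 30, 42]),
]
matches, pages_read = scan_greater_than(pages, 15)
print(matches, pages_read)
```

The saving comes from the skipped page: its values are never decoded, which matters most when data is sorted or clustered on the filtered column.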

Decouple Parquet from the Hadoop API

Brief explanation: Allow reading and writing Parquet files independently of the Hadoop API (https://github.com/Parquet/parquet-mr/issues/305)
Expected results: Read and write Parquet files without the Hadoop libraries.
Mentor: Julien Le Dem (@J_)

Study state of the art floating point compression algorithms

(https://github.com/Parquet/parquet-mr/issues/306)

Brief explanation: Study existing papers on lossless floating point compression and implement benchmarks.
Expected results: Provide a reference implementation and a benchmark comparison, with integration into the Parquet library.
Mentor: Julien Le Dem (@J_)
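For orientation, one family of lossless approaches described in the literature XORs each value's bit pattern with its predecessor, so that similar neighbors produce residuals with many leading zero bits that are cheap to encode. The sketch below shows only that XOR step and its exact inverse in Python. It is an assumption about where a study might start, not the technique this project prescribes, and all function names are hypothetical.

```python
import struct

def double_to_bits(x):
    """Reinterpret a float as its 64-bit IEEE 754 pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def bits_to_double(b):
    """Reinterpret a 64-bit pattern as a float."""
    return struct.unpack(">d", struct.pack(">Q", b))[0]

def xor_encode(values):
    """XOR each value's bits with the previous value's bits."""
    prev = 0
    residuals = []
    for v in values:
        bits = double_to_bits(v)
        residuals.append(bits ^ prev)
        prev = bits
    return residuals

def xor_decode(residuals):
    """Invert xor_encode exactly (lossless)."""
    prev = 0
    values = []
    for r in residuals:
        bits = r ^ prev
        values.append(bits_to_double(bits))
        prev = bits
    return values

def leading_zero_bits(residual):
    """Count leading zeros in a 64-bit residual; more zeros compress better."""
    return 64 - residual.bit_length()

values = [20.5, 20.5, 20.75, 21.0]
residuals = xor_encode(values)
restored = xor_decode(residuals)
print([leading_zero_bits(r) for r in residuals])
```

A benchmark would compare how compactly different schemes encode these residuals (or their own intermediate forms) on representative data, alongside encode and decode speed.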

Netty

You can learn more about getting involved with the Netty Project here: http://netty.io/community.html

Android testsuite

Brief explanation:
- The Netty project team would like to officially support Android 4.0 (Ice Cream Sandwich), and we need an automated testsuite to achieve that goal.
Expected results:
- During the build process, an Android emulator is automatically started and stopped to run all (or all applicable) JUnit tests inside the emulator.
- The result of the JUnit tests inside the emulator affects the build result, so that we can run the Android compatibility test on our CI machine.
- All Android compatibility issues found during the test are fixed.
Knowledge Prerequisite:
- Java and Android programming
- Custom JUnit runners
- Experience with building a network application atop Netty
Mentor: Trustin Lee (@trustin)

Pants

For more information about Pants, check these out:

Contributors Guide: http://pantsbuild.github.io/howto_contribute.html
Developers Guide: http://pantsbuild.github.io/howto_develop.html
Task Developers Guide: http://pantsbuild.github.io/dev_tasks.html

C/C++ support for Pants

Brief explanation: Add C/C++ support to the Pants build system.
Expected results: Pants can compile C/C++ based applications.
Knowledge Prerequisite: Python, C/C++
Mentor: John Sirois (@johnsirois)

Eclipse Integration

Brief explanation: Add Eclipse integration to Pants
Expected results: Create a classpath container based on integrating with Pants and a launcher.
Knowledge Prerequisite: Python, Java, Eclipse
Mentor: Travis Crawford (@tc) and Chris Aniszczyk (@cra)

TwitterCLDR

Improve string collation implementation

Brief explanation: The Twitter CLDR Ruby gem provides a basic implementation of string collation (locale-aware sorting), but there are a number of ways it can be improved:
- Add support for script reordering, which allows sorting characters from a native script before characters from other scripts (e.g., sorting Cyrillic characters before Latin ones in the Russian locale).
- Switch from the deprecated XML syntax for collation rules to the basic one.
- Address issues with ignoring denormalized code points in the Collation Elements Table.
Expected results: Fixing all or some of the issues listed above and achieving better parity with the Unicode Collation Algorithm implementation from the ICU library.
Knowledge Prerequisite: Ruby.
Mentor: Kiryl Lashuk (@KL7)
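To illustrate the idea of script reordering only, here is a deliberately simplified Python sketch. It is not TwitterCLDR's API and not the Unicode Collation Algorithm: it ranks characters by whether their Unicode name starts with a preferred script name, then falls back to code point order. The function names script_rank and reorder_key are hypothetical.

```python
import unicodedata

def script_rank(ch, preferred="CYRILLIC"):
    """Rank 0 for characters of the preferred script, 1 for everything else."""
    name = unicodedata.name(ch, "")
    return 0 if name.startswith(preferred) else 1

def reorder_key(text, preferred="CYRILLIC"):
    """Sort key that places the preferred script first, then code point order."""
    return [(script_rank(ch, preferred), ch) for ch in text]

words = ["apple", "яблоко", "banana", "арбуз"]
plain = sorted(words)                      # code point order: Latin first
reordered = sorted(words, key=reorder_key) # Cyrillic first
print(plain)
print(reordered)
```

A real implementation would apply the reordering to collation weights rather than raw code points, so that accents, case, and other tailoring rules continue to work as expected.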

Port missing features from Ruby gem to JavaScript library

Brief explanation: The Twitter CLDR JavaScript library is still missing a lot of features that are available in the Ruby gem. Among them:
- Text segmentation
- Rule-based number formatting
- Localization of language codes
- String collation (though this feature might be a bit too heavy for a JavaScript library)
- etc.
Expected results: A wider range of Twitter CLDR features available in the JavaScript version of the library.
Knowledge Prerequisite: JavaScript, CoffeeScript, Ruby.
Mentor: Kiryl Lashuk (@KL7)

Project

Project URL

Project Idea (e.g., New Feature)

Brief explanation:
Expected results:
Knowledge Prerequisite:
Mentor:

General Proposal Requirements

Proposals will be submitted via http://www.google-melange.com/gsoc/homepage/google/gsoc2014, so plain text is the best way to go. We expect your application to be around 1000 words. Anything shorter will probably not contain enough information for us to determine whether you are the right person for the job. Your proposal should contain at least the following information, but feel free to include anything else that you think is relevant:

- Your name and Twitter handle
- Title of your proposal
- Abstract of your proposal
- A link to your GitHub profile (if you have one)
- Detailed description of your idea, including an explanation of why it is innovative
- Description of previous work and existing solutions (links to prototypes and bibliographies are more than welcome)
- Details of your academic studies, any previous work, and internships
- Any relevant skills that will help you achieve the goal (programming languages, frameworks)
- Any previous open source projects (or even previous GSoC projects) you have contributed to
- Any other commitments during GSoC that may affect your work, including planned vacations or holidays
- Contact details

Good luck!

Follow us at @TwitterOSS
