The Network, The Kingmaker: distributed tracing and Zipkin

Adding Zipkin instrumentation to a codebase makes it possible to create a single tracing view across an entire platform. This is the oft-alluded-to "correlation identifier" that microservices literature recommends, yet for which few solid open source solutions exist. It is an aspect of monitoring distributed platforms akin to, but separate from, the aggregation of metrics and logs.

This talk uses the example of extending Apache Cassandra's tracing to use Zipkin, demonstrating a single tracing view across an entire system: from the browser and HTTP, through a distributed platform, and into the database, down to seeks on disk. Put together, this makes it easy to identify which queries to a particular service took the longest and to trace back how the application made them.
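
To make the "correlation identifier" concrete, here is a minimal sketch (not code from the talk) of Zipkin's standard B3 header propagation over HTTP, assuming Python and the requests library; the downstream URL is hypothetical, and a real service would normally use a Zipkin tracer library rather than hand-rolled headers.

    import os
    import requests

    def new_id():
        # Zipkin ids are hex strings; 8 random bytes gives a 64-bit id.
        return os.urandom(8).hex()

    def call_downstream(trace_id, parent_span_id):
        # The same X-B3-TraceId travels with every hop of the request, so all
        # spans reported to the Zipkin collector join up into one trace.
        headers = {
            'X-B3-TraceId': trace_id,
            'X-B3-SpanId': new_id(),
            'X-B3-ParentSpanId': parent_span_id,
            'X-B3-Sampled': '1',
        }
        return requests.get('http://inventory-service/api/items', headers=headers)

    trace_id = new_id()
    root_span_id = new_id()
    call_downstream(trace_id, root_span_id)

Instrumenting Cassandra's own tracing to report its internal work as further spans against the same trace id is what extends this view down to the database.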

This presentation will raise the requirements and expectations DevOps has of its infrastructural tools. It is for people who want to take those tools to the next level, where the network is known as the kingmaker.

 
 

Outline/structure of the Session

 - Intro,
 - Concepts,
 - Zipkin,
 - Cassandra tracing,
 - Zipkin+Cassandra.

Learning Outcome

Understand what Zipkin is and the value distributed tracing adds, and get an idea of how it can be instrumented into existing distributed technologies.

Target Audience

People who want to take their infrastructural tools to the next level, where the network is known as the kingmaker.

Submitted 2 months ago

Comments

  • Josh Graham  ~  1 month ago

    Although the focus is on Zipkin, the topic covers many layers of the tech stack, and Zipkin is a reasonable choice of tool to showcase the benefits of tracing and of instrumenting the data generated by tracing.


  • Liked Mick Semb Wever

    Mick Semb Wever - Looking behind Microservices to Brewer's theorem, Externalised Replication, and Event Driven architecture
    Consultant, The Last Pickle
    30 mins / Talk / Advanced

    Scaling data is difficult, scaling people even more so.

    Today, microservices make it possible to scale both data and people effectively, by taking advantage of bounded contexts and Conway's law.
    But there is still a lot more theory coming together in our adventures with ever more data. Some of these ideas and theories are just history repeating, while others are newer concepts.

    These ideas can be seen in many Microservices platforms, within the services' code but also in the surrounding infrastructural tools we become ever more reliant upon.

    Mick will take a dive into this using examples, and offer recommendations from seven years of coding microservices around 'big data' platforms. The presentation will be relevant to people wanting to move beyond synchronous REST-based platforms to eventually consistent, asynchronous designs that aim towards the goals of linear scalability and 100% availability.

  • Liked Christopher Biggs

    Christopher Biggs - From little things, Big Data Grow - IoT at Scale
    Director, Accelerando Consulting
    30 mins / Talk / Intermediate

    The Internet of Things (IoT) is really about the ubiquity of data: the possibility of humans extending their awareness and reach globally, or further.
    IoT frees us from the tedium of physically monitoring or maintaining remote systems, but to be effective we must be able to rely on data being both accessible and comprehensible.

    This presentation covers three main areas of an IoT big data strategy:

    * The Air Gap - options (from obvious to inventive) for connecting wireless devices to the internet
    * Tributaries - designing a scalable architecture for amalgamating IoT data flows into your data lake. Covers recommended API and message-bus architectures.
    * Management and visualisation - how to characterise and address IoT devices in ways that scale to continental populations. Examples of large scale installations to which I've contributed. Coping with information overload.

  • Liked Yaniv Rodenski

    Yaniv Rodenski - Introduction to Apache Amaterasu (incubating): CD framework for your Big Data pipelines
    Developer, Shinto
    30 mins / Demonstration / Advanced

    In the last few years, the DevOps movement has introduced groundbreaking approaches to the way we manage the lifecycle of software development and deployment. Today organisations aspire to fully automate the deployment of microservices and web applications with tools such as Chef, Puppet and Ansible. However, the deployment of data-processing pipelines remains a relic from the dark ages of software development.

    Processing large-scale data pipelines is the main engineering task of the Big Data era, and it should be treated with the same respect and craftsmanship as any other piece of software. That is why we created Apache Amaterasu (Incubating) - an open source framework that takes care of the specific needs of Big Data applications in the world of continuous delivery. 

    In this session, we will take a close look at Apache Amaterasu (Incubating), a simple and powerful framework to build and dispense pipelines. Amaterasu aims to help data engineers and data scientists to compose, configure, test, package, deploy and execute data pipelines written using multiple tools, languages and frameworks.
    We will see what Amaterasu provides today, how it can help existing Big Data applications, and demo some of the new bits that are coming in the near future.

  • Liked Graham Polley

    Graham Polley - Running Apache Beam pipelines on Google Cloud Platform
    30 mins / Talk / Intermediate

    The original white paper on MapReduce was written by Google in 2004, and the open source community were quick to implement it. Since then, Hadoop has been the de facto tool when you need to process and analyse massive datasets. However, the big data landscape has begun to shift dramatically in recent years, and cloud tools are now disrupting the way in which we work with data.

    In this presentation, you will learn about Apache Beam: Google's latest contribution to open source, and the successor to MapReduce. Built on two internal Google technologies - 'Flume' and 'Millwheel' - Apache Beam offers a unified programming model for stream and batch pipelines that can be run at massive scale using Google's Cloud Dataflow, a fully managed service for running Beam pipelines. This allows teams to focus on their code, rather than wasting their time managing infrastructure, worrying about scalability, or waiting for Hadoop jobs to finish. It exemplifies how the cloud can democratize data analytics at scale.

    Beam pipelines running on Cloud Dataflow have autoscaling baked in, will build the worker pool on-the-fly for you, and even tear it down when it's done. It abstracts away all the gnarly complexities normally associated with managing a Hadoop cluster, such as dynamic load rebalancing, scaling, and stragglers.
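
    As a rough illustration (my own sketch, not material from the talk), here is the Beam Python SDK's unified model in miniature: a word count that runs locally on the DirectRunner, and runs on Cloud Dataflow when the runner option is switched and a GCP project and GCS temp location are supplied. The input strings and step names are made up.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # DirectRunner executes the pipeline locally; the same code targets
    # Cloud Dataflow with '--runner=DataflowRunner' plus project and
    # temp_location options.
    options = PipelineOptions(['--runner=DirectRunner'])

    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.Create(['big data', 'beam runs big pipelines'])
         | 'Split' >> beam.FlatMap(lambda line: line.split())
         | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
         | 'Count' >> beam.CombinePerKey(sum)
         | 'Print' >> beam.Map(print))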

  • Liked Abhishek Tiwari

    Abhishek Tiwari - Building scalable real-time data pipelines
    30 mins / Talk / Advanced

    Real-time data pipelines can solve a variety of problems. They are particularly useful when applied to use cases involving real-time insights and real-time actioning such as ad serving, digital analytics, media attribution, media monitoring, anomaly detection, etc. That said, they can also replace traditional batch-oriented ETL workloads.

    In this talk, we will discuss how to build scalable real-time data pipelines. We will cover various building blocks and how to apply them as a pattern to create different types of real-time data pipelines. Last but not least, we will cover design considerations and best practices for building real-time data pipelines.
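
    As a sketch of one such building block (a concrete stack of my choosing, not necessarily the speaker's), here is a consume-transform-publish loop on a message bus, using Kafka via the kafka-python client with hypothetical topic names.

    import json
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer(
        'raw-events',                        # hypothetical input topic
        bootstrap_servers='localhost:9092',
        value_deserializer=lambda b: json.loads(b.decode('utf-8')),
    )
    producer = KafkaProducer(
        bootstrap_servers='localhost:9092',
        value_serializer=lambda d: json.dumps(d).encode('utf-8'),
    )

    # Consume, enrich or score, and publish downstream; a production pipeline
    # would add batching, error handling, and delivery-semantics guarantees.
    for message in consumer:
        event = message.value
        event['processed'] = True
        producer.send('enriched-events', event)  # hypothetical output topic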

  • Liked Lukas Toma

    Lukas Toma - PR in age of data aka Talk to people that want to listen
    Lead Data Scientist, Prezly
    30 mins / Talk / Intermediate

    Prezly is a SaaS company building software for PR professionals and external communications departments that enables them to do their jobs better. We have realized that, in the age of ubiquitous data, people expect the switch from search to discovery in their personal as well as their professional lives. We are in the middle of an ambitious initiative with the goal of eliminating the spammy communication tactics that have been standard in our area for many years. The talk is structured around our progress in this initiative, and the wins and challenges encountered in the process.

  • Liked Mick Semb Wever

    Mick Semb Wever - From time-series to petabytes: an introduction to Apache Cassandra 3.X, and beyond
    Consultant, The Last Pickle
    30 mins / Talk / Intermediate

    The 3.x versions of Apache Cassandra brought a lot of changes to the way data is stored and manipulated. It's exciting, even for people who have been using it since version 0.4.

    In this session Mick will cover:

    1) Introduction to Cassandra.
    A quick explanation of how Cassandra operates as a cluster of machines, how it handles consistency, and the basics of reading and writing data to disk.

    2) What's new in CQL.
    Looking at recent improvements such as Materialised Views, UDFs and UDAs, per-partition limits, and why you no longer have to "freeze" your UDTs (a couple of these are sketched in code after this abstract).

    3) Storage Engine Changes.
    The new 3.x storage engine, a major rewrite of the on-disk format. The storage engine now understands the CQL layout and optimises the bytes used on disk and how they are accessed.

    4) The Bleeding Edge.
    What's arrived and what is coming, touching on ideas like SASI indexes, Change Data Capture, and Partition Level Aggregations.

    The presentation will provide an introduction and update to what Apache Cassandra offers today, and why, as one of the top ten most popular databases in the world, it is the data platform du jour for the next evolution of software services, whether they be IoT, event-based systems, or BASE architectures.
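
    As a hedged illustration of two of the CQL features mentioned in section 2 (my own sketch, not code from the talk), using the DataStax Python driver against made-up keyspace and table names:

    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('metrics')  # hypothetical keyspace

    # A materialised view has Cassandra maintain a second partitioning of the
    # same data, here keyed by sensor instead of by day.
    session.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS readings_by_sensor AS
            SELECT * FROM readings
            WHERE sensor_id IS NOT NULL AND day IS NOT NULL AND ts IS NOT NULL
            PRIMARY KEY (sensor_id, day, ts)
    """)

    # PER PARTITION LIMIT (Cassandra 3.6+) returns at most one row per
    # partition of the base table.
    for row in session.execute("SELECT * FROM readings PER PARTITION LIMIT 1"):
        print(row)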

  • Liked Gill Walker

    Gill Walker - Taming the beast of CRM data
    CRM Consultant, Opsis
    30 mins / Demonstration / Intermediate

    Behind all successful CRM solutions lies great data.

    So all the data’s in your CRM. Now how do you get it out? And how do you turn that data into useful information?

    CRM projects are great at designing ways for users (and sometimes clients and prospects) to put data into the CRM. They are often not so good at helping those same users to get information out in ways that help them serve the clients effectively.

    Many CRM projects focus huge effort on ensuring that information can be entered into the CRM, then rely on pushing it into Excel and manipulating it there each time it is needed. Not only is this tedious and time-consuming, but to make matters worse, these reports are often saved and reused even when they are out of date. Wouldn't you rather leverage the data within your CRM, so that you can access real-time information in the format you need?
    If you want to transform your CRM from a huge data repository into a powerful management tool, then this presentation is essential for you.

  • Liked Robert Lechte

    Robert Lechte - Do schema migrations better: Automate your database changes using a diff tool
    30 mins / Demonstration / Advanced

    Existing schema migration tools are terrible. Version numbers? Creating a migration file every time you need to change anything in your schema? Painful.

    Luckily, there's a better way. With the help of a database diff tool and the right test setup, we can make our database migrations and schema changes much faster, less tedious, and more testable and reliable.

    # The Bad News

    It's hard to work with database schemas. But schemas are actually good.

    Enforcing structure and consistency of data is good. The trouble is that people get frustrated with the tooling.

    Existing tools like django migrations, alembic migrations, and the other (rails-inspired) migration tools all make it far too hard. When you have to worry about version numbers and migration files each time you want to make a change, every change becomes a chore. That means it's much harder than it should be to change your schema, and when that's hard, it becomes harder to manage your data properly, harder to rapidly prototype, and there's an operational overhead each time you make a change.

    The database has become our enemy rather than our friend.

    There are more problems. Migration files build up over time, cluttering your working copy and obfuscating the setup and structure of your database.

    Typical migration tools don't offer any help at all with actually testing that your migrations will succeed, or that the result of your pile of migrations matches your intended schema. Most people just deploy and hope.

    The next problem is framework lock-in. If you want to use Django, you're stuck with Django migrations and their particular limitations. Same deal with Flask and alembic. Most of them assume you'll be using the database as a dumb table store. What if you'd like to take advantage of important database features like views and stored procedures? They're often ignored by these tools.

    # Some Theory

    What do we actually need from a migration tool?

    ## Intended states

    Why have 30 migration files sitting around when you fundamentally only care about moving between 3 database states?

    - empty (when you start your project)
    - development schema (the one you put together when designing and developing your database and applications)
    - production schema (the schema that's actually in use once deployed)

    ## Representing schema changes

    Diffs are a familiar concept. We developers most likely use them every day in git.

    Database schema changes can also be represented as diffs!

    If this is a text diff:

    + the new text
    - the old text

    Then this is a database diff:

    alter table "author" add column "new_column" text;

    The equation is simple:

    old database + diff = new database


    # Automating your schema changes with a schema diff tool: migra

    > pip install migra

    > psql
    # create database a; create database b;

    > psql b
    # create table t (id serial primary key, name text);

    > migra postgresql:///a postgresql:///b

    create sequence "public"."t_id_seq";

    create table "public"."t" (
    "id" integer not null default nextval('t_id_seq'::regclass),
    "name" text
    );

    CREATE UNIQUE INDEX t_pkey ON t USING btree (id);

    # A faster workflow

    Instead of faffing around creating and editing migration files, simply compare, generate the diff, and apply it.

    > migra postgresql:///a postgresql:///b | psql -1 a
    > migra postgresql:///a postgresql:///b

    Migra now shows no changes, so we know we've reached the right schema state.


    When we want to make further changes to the database, all we need to do is repeat the cycle: compare, review the generated diff, and apply it.

    # Building your own test and deployment workflows.

    Migra has an API, so when you want to deploy you can figure out the changes you need easily and intuitively:

    def prod_vs_app():
        with temporary_db() as CURRENT_DB_URL, temporary_db() as TARGET_DB_URL:
            load_current_production_state(CURRENT_DB_URL)
            load_from_app_model(TARGET_DB_URL)

            with S(CURRENT_DB_URL) as s_current, S(TARGET_DB_URL) as s_target:
                m = Migration(s_current, s_target)

                m.set_safety(False)
                m.add_all_changes()

                print('Differences:\n{}'.format(m.sql))

    Here we're directly comparing the schemas of two database sessions and generating the statements required to reconcile the differences between them.

    # Testability

    You can use a similar setup to confirm your migration is correct.

    def load_post_migration(dburl):
        with S(dburl) as s:
            load_sql_from_file(s, 'MIGRATIONS/production.dump.sql')

        with S(dburl) as s:
            load_sql_from_file(s, 'MIGRATIONS/pending.sql')


    def load_from_app(dburl):
        with S(dburl) as s:
            Model.metadata.create_all(s.bind.engine)
            load_sql_from_folder(s, 'bookapp/SQL')


    def test_db():
        with temporary_db() as CURRENT_DB_URL, temporary_db() as TARGET_DB_URL:
            load_post_migration(CURRENT_DB_URL)
            load_from_app(TARGET_DB_URL)

            with S(CURRENT_DB_URL) as s_current, S(TARGET_DB_URL) as s_target:
                m = Migration(s_current, s_target)
                m.set_safety(False)
                m.add_all_changes()
                assert not m.statements


    # Test before-and-after

    With the right test config you can run each test against both the pre- and post-migration state.

    with io.open('MIGRATIONS/pending.sql') as f:
        pending_contents = f.read()


    if pending_contents.strip():
        DATABASE_SETUPS_TO_TEST = [
            load_pre_migration,
            load_post_migration
        ]
    else:
        DATABASE_SETUPS_TO_TEST = [
            load_post_migration
        ]


    @pytest.fixture(params=DATABASE_SETUPS_TO_TEST)
    def db(request):
        with temporary_database() as test_db_url:
            setup_method = request.param
            setup_method(test_db_url)
            yield test_db_url

    With this level of testing you can deploy your app, deploy your migration after, and know you'll have no downtime through app-db mismatch.

    # Limitations

    Migra is a Postgres-only tool so far.

    Migrations can never be fully automatic; they always need manual review. But this tool gets us a lot closer and cuts out the busywork.

    # Result

    The Australian government is already using this approach in production. The result is a much happier developer experience, and the ability to work with our data in a far more agile and intuitive way.