Building a Case for a Standardized Data Pipeline for All Your Organizational Data
Organizations of all sizes and domains today face a data explosion problem, driven by a proliferation of data management tools and techniques. A very common scenario is the creation of silos of data and data products, which increases system complexity across the whole data lifecycle - from data modeling to storage and processing infrastructure.
High complexity = high system maintenance overhead = sluggish decision making. Another side effect is divergence of the implemented system's behaviour from high-level business objectives.
In this talk we look at Zeta's experience as a case study in reducing this complexity by defining and tackling concerns at well-defined stages, preventing complexity from building up in the first place.
Outline/Structure of the Experience Report
Context: [3 min]
- Zeta is a rapidly growing 3-year-old fintech startup with growing data size and variety and increasingly complex decision-making scenarios.
- To exemplify: within 18 months our transaction volume grew from a few thousand per month to over 10 million; the Elasticsearch cluster used for ad-hoc search and dashboarding grew from sub-GB to 7.5 TB; our offering expanded from a single product to a bouquet of products for multiple customer categories; and changes in the compliance and regulatory environment every few months required us to revisit our reporting.
- At an early stage we decided to draw up a two-year roadmap for our data products.
Problem we were solving: [3 min]
- In a rapidly growing, evolving organization there is often an explosion of data silos. This leads to multiple sub-optimal data administration, collection, storage and processing subsystems (each with a specialized workforce) and an inability to do deep, cross-dataset analyses.
- We needed to think about the whole data lifecycle feeding into the final visualization and reduce duplication wherever possible (always a good software engineering practice).
What we did: [12 min]
- First things first: the Data Model, i.e. what is it that we are managing. We imagined all the present and near-future data being generated, then classified it into a small number of categories. We clearly defined the kinds of data we would support and enumerated the nature of each kind and the engineering guarantees it would need, e.g. Events, Entities, Metrics, Logs.
- We used existing and likely near-future needs as the yardstick; we did not want to over-engineer.
- The Data Model also includes the performance specifications we need to support for each kind of data, e.g. latency, throughput, failure tolerance, scalability, retention policy, concurrency and resource prioritization, at a high level (a rough sketch follows this list).
- Building on top of this greatly helped us make sound architecture and technology decisions, and allowed us to identify several red flags at the design stage itself. For example, HDFS as a data storage layer may initially work and give sufficient latency for a low-latency fraud-detection scenario but will definitely fail once we hit our specified data scale; Elasticsearch seems great for data aggregation but will break if we expect it to power a highly multi-tenant dashboard feature, given its poor resource isolation behaviour.
- The specific data storage and transport technology used was kept decoupled from the data model (think of the ISO-OSI network stack). Higher abstractions and derivations could be built on top of this simple, standardized data model, e.g. feature vectors on top of Events and Entities.
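As a rough illustration of what such a declarative data model might look like (a hypothetical sketch, not Zeta's actual schema: the category names come from the talk, while the fields and numbers below are invented placeholders):

```python
# A minimal sketch of capturing a data model as declarative specs.
# Category names (Events, Entities, Metrics, Logs) come from the talk;
# the fields and example numbers are hypothetical placeholders.
from dataclasses import dataclass
from enum import Enum


class DataKind(Enum):
    EVENT = "event"      # immutable, append-only facts (e.g. a transaction)
    ENTITY = "entity"    # mutable business objects (e.g. a customer)
    METRIC = "metric"    # pre-aggregated time series
    LOG = "log"          # operational / debug output


@dataclass(frozen=True)
class PerformanceSpec:
    """Engineering guarantees a data kind must support."""
    max_write_latency_ms: int
    min_throughput_per_s: int
    retention_days: int
    failure_tolerance: str      # e.g. "at-least-once", "best-effort"
    multi_tenant: bool


# Example spec table that architecture candidates can be checked against.
SPECS = {
    DataKind.EVENT: PerformanceSpec(50, 10_000, 365 * 7, "at-least-once", True),
    DataKind.LOG: PerformanceSpec(1_000, 50_000, 30, "best-effort", False),
}


def storage_is_acceptable(kind: DataKind, observed_latency_ms: int) -> bool:
    """Raise a design-time red flag if a candidate store misses the spec."""
    return observed_latency_ms <= SPECS[kind].max_write_latency_ms
```

Checking candidate technologies against specs like these at design time is what surfaces red flags, such as the HDFS and Elasticsearch examples above, before anything is built.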
Key wins: [2 min]
- A data model that is simple to understand and communicate
- High correlation between the data model and the implementation infrastructure (= fewer exceptions and band-aid fixes)
- Low cost of maintenance and good value-for-money
Learning Outcome
Participants will get insights into:
- Aspects of data evolution within an organization: the data lifecycle from data modeling to collection to visualization, as well as changes in the data model and business needs themselves over time
- Complexity challenges that this will lead to
- Zeta's experience in designing a data infrastructure to tackle these challenges, and its technical and business validation over a tumultuous 18-month period in the rapidly growing startup's life.
Target Audience
Anyone interested in data administration, data storage, and building data infrastructure products aligned with business objectives.
Prerequisites for Attendees
None.
Some exposure to data complexity and scaling scenarios is suggested, as it will help participants tune into the problems discussed here.
People who liked this proposal also liked:
Joy Mustafi - The Artificial Intelligence Ecosystem driven by Data Science Community
45 Mins
Talk
Intermediate
Cognitive computing makes a new class of problems computable. To respond to the fluid nature of users' understanding of their problems, a cognitive computing system offers a synthesis not just of information sources but of influences, contexts, and insights. These systems differ from current computing applications in that they move beyond tabulating and calculating based on pre-configured rules and programs; they can infer and even reason based on broad objectives. In this sense, cognitive computing is a new type of computing with the goal of more accurately modelling how the human brain or mind senses, reasons, and responds to stimuli. It is an interdisciplinary field in which a number of sciences and professions converge, including computer science, electronics, mathematics, statistics, psychology, linguistics, philosophy, neuroscience and biology.
Project features:
- Adaptive: They MUST learn as information changes and as goals and requirements evolve. They MUST resolve ambiguity and tolerate unpredictability. They MUST be engineered to feed on dynamic data in real time.
- Interactive: They MUST interact easily with users so that those users can define their needs comfortably. They MUST interact with other processors, devices and services, as well as with people.
- Iterative and stateful: They MUST aid in defining a problem by asking questions or finding additional source input if a problem statement is ambiguous or incomplete. They MUST remember previous interactions in a process and return information that is suitable for the specific application at that point in time.
- Contextual: They MUST understand, identify, and extract contextual elements such as meaning, syntax, time, location, appropriate domain, regulation, user profile, process, task and goal. They may draw on multiple sources of information, including both structured and unstructured digital information, as well as sensory inputs (visual, gestural, auditory, or sensor-provided).
A set of cognitive systems is implemented and demonstrated as the project J+O=Y.
Saurabh Deshpande - Introduction to reinforcement learning using Python and OpenAI Gym
Saurabh Deshpande, Software Architect, SAS Research and Development India Pvt. Ltd.
90 Mins
Workshop
Advanced
Reinforcement Learning algorithms are becoming more and more sophisticated every day, as is evident from the recent wins of AlphaGo and AlphaGo Zero (https://deepmind.com/blog/alphago-zero-learning-scratch/). OpenAI has provided the OpenAI Gym toolkit for research and development of Reinforcement Learning algorithms.
In this workshop, we will focus on an introduction to the basic concepts and algorithms in Reinforcement Learning, along with hands-on coding.
Content
- Introduction to Reinforcement Learning concepts and terminology
- Setting up OpenAI Gym and other dependencies
- Introducing OpenAI Gym and its APIs (a minimal interaction-loop sketch follows this list)
- Implementing simple algorithms using a couple of OpenAI Gym environments
- Demo of Deep Reinforcement Learning using one of the OpenAI Gym Atari games
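As a taste of the Gym API mentioned above, here is a minimal sketch of the interaction loop: a random policy on the classic CartPole-v1 environment, using the classic (pre-0.26) Gym API. The environment name and episode count are just illustrative.

```python
# A minimal sketch of the OpenAI Gym interaction loop: a random policy
# on CartPole-v1 (classic Gym API: reset() returns obs, step() a 4-tuple).
import gym

env = gym.make("CartPole-v1")

for episode in range(5):
    observation = env.reset()          # start a new episode
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()             # random policy
        observation, reward, done, info = env.step(action)
        total_reward += reward
    print(f"episode {episode}: total reward = {total_reward}")

env.close()
```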
Ujjyaini Mitra - When the Art of Entertainment ties the knot with Science
20 Mins
Talk
Advanced
On the face of it, entertainment is a pure art form, but there is a great deal of science that can back the art. AI can drive many human-intensive tasks in the media industry, turning gut-based decisions into data-driven ones. Can we create a movie promo through AI? Can we know which part of a video is causing disengagement among our audiences? Could AI help content editors? How about assisting script writers through AI?
I will talk about a few specific experiments done on Voot Original content: on binging, hooking, content editing, audience disengagement, etc.
Manish Kumar Shukla - Machine learning With Quantum Systems
20 Mins
Talk
Intermediate
The domains of Machine Learning and Quantum Computation are the next big leap in the general experience of computing. Using machine learning we want to take smart decisions and enlarge our solution space; we see active research in medical image processing, self-driving cars, content summarisation, sentiment analysis, etc., all of which can be done on classical computers. Another domain of active research is quantum information processing, where we try to use the principles of quantum mechanics to gain a leap in the efficiency of information processing tasks. One such example is Shor's factoring algorithm, which solves the prime factorisation problem in O((log N)^2 (log log N)(log log log N)) time, a problem believed to be intractable on classical computers. The primary phenomenon that enables this is superposition.
The confluence of classical machine learning with quantum information theory gives rise to the field of Quantum Machine Learning, which is still in its nascent stages but nevertheless very interesting to study. It is interesting to ask two types of questions:
a. How good would a quantum computer be at learning classical information? To give an example, will a quantum computer be able to classify apples and oranges better than a classical computer?
b. Will some of the problems which are hard in the quantum world be learnable using classical computers? One such example is classifying entangled vs. separable states.
In this talk I will discuss how the two domains are similar and different, and what some proposed solutions for the above problems are.
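To make the "entangled vs. separable" example concrete, here is a small illustrative check (an addition for this write-up, not from the talk) using the Peres-Horodecki positive-partial-transpose criterion, which is exact for two-qubit states:

```python
# "Entangled vs separable" for two qubits via the Peres-Horodecki (PPT)
# criterion: a negative eigenvalue after partially transposing one qubit
# implies entanglement. (Exact for 2-qubit systems.)
import numpy as np


def is_entangled(rho: np.ndarray) -> bool:
    """PPT test for a 4x4 two-qubit density matrix."""
    # Reshape to indices [i, a, j, b], transpose qubit B's indices (a <-> b),
    # and flatten back to a 4x4 matrix.
    pt = rho.reshape(2, 2, 2, 2).transpose(0, 3, 2, 1).reshape(4, 4)
    return np.linalg.eigvalsh(pt).min() < -1e-12


# Maximally entangled Bell state (|00> + |11>) / sqrt(2)
bell = np.zeros((4, 4))
bell[[0, 0, 3, 3], [0, 3, 0, 3]] = 0.5

# Separable product state |00><00|
product = np.diag([1.0, 0.0, 0.0, 0.0])

print(is_entangled(bell))     # True
print(is_entangled(product))  # False
```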
Rohit Gupta - Image Compression with Neural Networks
20 Mins
Talk
Intermediate
Nowadays many apps and social networking sites deal with a large number of images, which leads to huge storage costs and problems in surfacing those images on client machines over bad/slow networks. In this talk we will present how we can achieve low bits per pixel (bpp) using RNNs.
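As a back-of-the-envelope illustration of the bpp metric (toy numbers, not the speaker's method): a progressive RNN codec emits a fixed number of bits per pass over an image patch, so quality can be traded off against bits per pixel by varying the number of passes.

```python
# Toy illustration of bits per pixel (bpp) for a progressive codec:
# each pass over a patch emits a fixed number of bits, so more passes
# means higher bpp and better reconstruction. Numbers are hypothetical.
PATCH_SIDE = 32        # hypothetical patch size (32x32 pixels)
BITS_PER_PASS = 128    # hypothetical bits emitted per pass


def bits_per_pixel(num_passes: int) -> float:
    """bpp = total bits emitted / number of pixels covered."""
    total_bits = num_passes * BITS_PER_PASS
    return total_bits / (PATCH_SIDE * PATCH_SIDE)


for passes in (1, 4, 8):
    print(f"{passes} pass(es): {bits_per_pixel(passes):.3f} bpp")
```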