Big Data - Verify Data Completeness and Data Correctness using Selenium/JAVA

10:45 AM - 11:30 AM, Grand Ballroom 1

Java programs built on Hadoop MapReduce are written to process large amounts of data, and these applications must be tested to confirm that they work correctly. Testing includes manually verifying the business logic on each node for MapReduce process accuracy, data aggregation/segregation rules, and generation of key-value pairs. The output data files are also verified for transformation rules, successful load, data integrity, and data accuracy.

Because of the enormous volume of data and the variety of business rules, manual testing is time-consuming and validations may be missed. Automating the tests with Selenium and Java adapters helps ensure that the data complies with all business and transformation rules and that data integrity is checked.


Outline/structure of the Session

Hadoop is used to fetch data from different data sources, and programmers apply business logic to extract the required data. For example, suppose the data gathered by Hadoop is 100 PB (petabytes), a mix of good and bad data, and the business is interested only in the good data that carries the required information, say 50 TB out of the 100 PB. Hadoop programmers write a few MapReduce programs in Java, run them across multiple nodes, and gather around 50 TB. To validate this 50 TB, the traditional manual testing process takes random sample data files and validates them against the business logic. This does not guarantee data completeness or data correctness, and validating data integrity and data accuracy manually can take months.

Manual testing also cannot handle unstructured data made up of different file types, for example audio, video, mobile calls, call center data, and images. Using automation, we can read the headers of these files and split them into structured files.

Learning Outcome

The test automation approach starts by splitting the 50 TB of data into smaller chunks and developing test Java programs known as Java adapters. The required business logic is implemented in these adapters, which are run against each individual data chunk to generate output data. The content and size of this output are then verified against the data generated by the Hadoop MapReduce programs. This verifies data completeness and data correctness, as briefly explained below.

Data Correctness - Validating actual data (produced by MapReduce) against expected data (produced by the test Java adapters).

Data Completeness - 100% data validation, rather than validation of a sample.
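
The two checks above can be illustrated with a minimal sketch. It assumes each record is a line of text and that both outputs fit in memory for one chunk; the class and method names (`DataValidator`, `compare`) are hypothetical and are not taken from the proposal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: compares MapReduce output ("actual") with the
// output of a test Java adapter ("expected"), record by record.
final class DataValidator {

    // Records missing from the actual output, and records found only there.
    static final class Result {
        final List<String> missing = new ArrayList<>();
        final List<String> unexpected = new ArrayList<>();

        // Completeness: every expected record is present.
        boolean isComplete() { return missing.isEmpty(); }

        // Correctness: nothing is missing and nothing extra was produced.
        boolean isCorrect() { return missing.isEmpty() && unexpected.isEmpty(); }
    }

    static Result compare(List<String> expected, List<String> actual) {
        // Count occurrences so that duplicate records are compared too.
        Map<String, Integer> remaining = new HashMap<>();
        for (String record : expected) {
            remaining.merge(record, 1, Integer::sum);
        }
        Result result = new Result();
        for (String record : actual) {
            Integer left = remaining.get(record);
            if (left == null) {
                result.unexpected.add(record);
            } else if (left == 1) {
                remaining.remove(record);
            } else {
                remaining.put(record, left - 1);
            }
        }
        // Anything still counted was expected but never produced.
        for (Map.Entry<String, Integer> entry : remaining.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {
                result.missing.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = compare(Arrays.asList("a,1", "b,2", "c,3"),
                           Arrays.asList("a,1", "c,3", "d,4"));
        System.out.println("missing=" + r.missing + " unexpected=" + r.unexpected);
    }
}
```

Comparing every record, rather than a sample, is what makes the completeness claim meaningful; at real volumes the same comparison would have to run per chunk.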

For output data validation, tools like Presto can be used. Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.

Using a test automation framework, large amounts of data can be verified in far less time than with manual testing, and data integrity and data accuracy are verified with 100% coverage.

Target Audience

leads, architects

Submitted 2 years ago

Comments

  • Dave Haeffner Test
    By Dave Haeffner Test  ~  2 years ago

    That's an interesting problem set and approach to solving it. But do you actually use Selenium for this kind of testing?

    • giri
      By giri  ~  2 years ago


      Thanks for your observation. Yes, Selenium is required, as it is open source and is used to maintain the test bed: the test repository, test scenarios, results reporting, and the respective Java adapters (function libraries).


      • Leo Laskin
        By Leo Laskin  ~  1 year ago

        I don't understand how you use selenium.  What web UI testing are you doing as part of this testing?  

  • Naveen Chauhan
    By Naveen Chauhan  ~  1 year ago

    You compare the data produced by the MapReduce code with the output of the Java adapters. I am not sure what we gain from this: we need to compare the output data with the data in the actual source systems it came from in the first place, because the data produced by either the Java adapters or the MapReduce program may be incorrect when compared to the source data.

    Please let me know if I am missing anything here.

  • RK Raju
    By RK Raju  ~  2 years ago

    Could you please elaborate on which parameters the test completion time depends on in this case study?

    • giri
      By giri  ~  2 years ago

      Test completion time depends on the amount of data being tested. As you may know, big data deals with the 5 V's of data (Velocity, Volume, Variety, Veracity, Value). To achieve data completeness (100% data validation), we split the extracted (MapReduce output) data into smaller chunks and run Java adapters against them in parallel. Hope this info helps.
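
The chunk-and-parallelize idea in this reply might look roughly like the following sketch. It assumes the "adapter" can be reduced to a per-record rule and uses a plain thread pool on one machine; `ParallelChunkRunner`, `split`, and `countFailures` are illustrative names, not part of the presenter's framework.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Hypothetical runner: splits records into chunks and validates the chunks in parallel.
final class ParallelChunkRunner {

    // Splits the records into chunks of at most chunkSize elements.
    static <T> List<List<T>> split(List<T> records, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < records.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, records.size());
            chunks.add(new ArrayList<>(records.subList(start, end)));
        }
        return chunks;
    }

    // Applies the rule to every chunk on a thread pool and returns the
    // total number of records that break the rule.
    static <T> long countFailures(List<List<T>> chunks, Predicate<T> rule, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (List<T> chunk : chunks) {
                tasks.add(() -> chunk.stream().filter(rule.negate()).count());
            }
            long failures = 0;
            for (Future<Long> result : pool.invokeAll(tasks)) {
                failures += result.get();
            }
            return failures;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while validating chunks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A chunk validation task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> records = Arrays.asList(5, -1, 7, 0, 3, 9, 2);
        List<List<Integer>> chunks = split(records, 3);
        // Example rule (an assumption for illustration): values must be positive.
        System.out.println("failures=" + countFailures(chunks, x -> x > 0, 2));
    }
}
```

With this shape, completion time scales with the number of chunks divided by the number of workers, which is consistent with the reply's point that data volume is the main driver.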

  • Liked rajesh sarangapani

    rajesh sarangapani - Visualizing Real User Experience Using Integrated Open Source Stack (Selenium + Jmeter + Appium + Visualization tools)

    45 mins

    The traditional approach to performance testing does not include client-side processing time (DOM content load, page render, JavaScript execution, etc.) in response times. Performance tests have always been conducted to stress the server, so tools like JMeter have been very popular for executing them. With the increasing complexity of client-side architectures (web, browser, mobile), it has become important to understand the real user experience. Commercial tools have started to provide features that give insight into the real user experience after the bytes are transferred to the client. The ability to call Selenium scripts from JMeter has opened up new ways to run real user experience tests using an open source stack. This approach:

    • Provides page load times similar to the onload time of real browsers
    • Generates a HAR file with the following statistics:
      • A summary of request times and content types
      • A waterfall chart with a page download time breakdown: DNS resolution time, connection time, SSL handshake time, request send time, wait time, and receive time

    Integrating these open source tools lets us provide the same insights that commercial off-the-shelf tools offer. At Gallop we have implemented this at multiple clients, giving them insight into various client-side bottlenecks, which helped us provide a greater value proposition.

  • Liked Trinath Babu

    Trinath Babu - Visual Test Automation using Selenium

    Trinath Babu
    Sr. Manager
    Gallop Solutions
    2 years ago
    45 mins

    Visual testing is the method of verifying that an application's GUI appears correctly to its users. Many people say visual testing is hard to automate. Given the number of variables (web browsers, operating systems, screen resolutions, responsive design, internationalization, etc.), visual testing can be complex, and verification with traditional automated functional testing tools can be very challenging. But with existing open source and commercial solutions, this complexity is manageable, making visual testing easier to automate than it once was.

    This can be achieved by integrating Selenium with Applitools. This talk focuses on verifying an application's graphical user interface (GUI) and finding visual bugs using Applitools. It is very helpful for sites with graphical content such as charts, graphs, and dashboards, and for verifying that the GUI appears correctly across all devices and browsers. The payoff is well worth the effort.

    Take pressure off manual QA: increase coverage, and test faster and more accurately. Reduce maintenance effort: automatically propagate changes across execution environments. Release faster and with confidence.

    Applitools Eyes Express captures the screen you want to test, and compares it to a baseline image – instantly, with a single click. No extra testing code necessary, no boring error logs.

    For example, a single automated visual test will look at a page and assert that every element on it has rendered correctly, effectively checking hundreds of things and telling you if any of them are out of place. This will occur every time the test is run, and it can be scaled to each browser, operating system, and screen resolution you care about.

    Put another way, one automated visual test is worth hundreds of assertions. And if done in the service of an iterative development workflow, then you’re one giant leap closer.


    Each of these tools follows some variation of the following workflow:

    1. Drive the application under test (AUT) and take a screenshot
    2. Compare the screenshot with an initial “baseline” image
    3. Report the differences
    4. Update the baseline as needed
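
Steps 2 and 3 of this workflow can be sketched with a naive pixel-by-pixel comparison. This is only an illustration of the idea, not how Applitools or any other tool compares images; the class `ScreenshotDiff`, its methods, and the tolerance parameter are assumptions, and the screenshot is assumed to have been captured already.

```java
import java.awt.image.BufferedImage;

// Hypothetical, naive image comparison for a visual check.
final class ScreenshotDiff {

    // Step 2: compare the screenshot with the baseline, pixel by pixel.
    static int countDifferentPixels(BufferedImage baseline, BufferedImage actual) {
        if (baseline.getWidth() != actual.getWidth()
                || baseline.getHeight() != actual.getHeight()) {
            throw new IllegalArgumentException("Images must have the same dimensions");
        }
        int different = 0;
        for (int y = 0; y < baseline.getHeight(); y++) {
            for (int x = 0; x < baseline.getWidth(); x++) {
                if (baseline.getRGB(x, y) != actual.getRGB(x, y)) {
                    different++;
                }
            }
        }
        return different;
    }

    // Step 3: report a pass when the share of changed pixels is within tolerance
    // (0.0 means the images must be identical).
    static boolean matches(BufferedImage baseline, BufferedImage actual, double tolerance) {
        int total = baseline.getWidth() * baseline.getHeight();
        return countDifferentPixels(baseline, actual) / (double) total <= tolerance;
    }

    public static void main(String[] args) {
        BufferedImage baseline = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        BufferedImage actual = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        actual.setRGB(1, 2, 0xFF0000); // simulate one changed pixel
        System.out.println("different pixels: " + countDifferentPixels(baseline, actual));
    }
}
```

A strict pixel comparison like this tends to flag harmless rendering differences such as anti-aliasing, which is one reason dedicated visual testing tools use more forgiving comparison methods.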
  • Liked Dilip S

    Dilip S - TestComplete supports Selenium – Other Commercial tools will follow soon?

    Dilip S
    Associate Architect
    2 years ago
    45 mins

    Oliver Wendell Holmes once said: I would not give a fig for the simplicity this side of complexity, but I would give my life for the simplicity on the other side of complexity.

    Tool evangelists around the world have been using this phrase for selling their products. They make it a point to look into customer’s eyes and say “Scalability, you know, that’s the main problem with Selenium. What about 3 years down the line, when you have multiple applications in your landscape and Selenium does not support it?”

    But they simply ignore the fact, knowingly of course, that this fast-moving technology world is heading to where Selenium is right now. Applications are shrinking into browser windows and changing tracks to align themselves with the mobile era.

    Even commercial tools like TestComplete from SmartBear have started supporting Selenium, and many will follow soon. The reason for this change is not only that most organizations prefer open source tools like Selenium as the starting point for their automation activities, but also that Selenium has proven itself to be one of the best automation tools for mobile and browser-based desktop automation.

    Our aim here is to show how seamlessly Selenium integrates with TestComplete and QAComplete, and to make clear that it is not Selenium that needs other tools to extend it, but the other way round.

  • Liked Vishal Aggarwal

    Vishal Aggarwal - Selenium Next Generation Framework

    Vishal Aggarwal
    Test Architect
    Gallop Solutions
    2 years ago
    45 mins

    This talk focuses on the technical side of automated acceptance tests for web applications. There are a lot of high-level frameworks that allow definition of acceptance tests in natural language (Robot, JBehave, Cucumber etc). But when it comes to the technical implementation of the test cases, you are often forced to use the rather low-level WebDriver API directly.

    Geb addresses exactly this problem. It is an abstraction of the WebDriver API and combines the expressive and concise Groovy language with a jQuery-like selection and traversal API. This makes the test implementation easier and the code more readable. On top of that, we get support for the page object pattern, asynchronous content lookup, and very good integration with existing test frameworks, which makes it a next-generation automation framework.

  • Liked vishnu nallani chekravarthula

    vishnu nallani chekravarthula - Extending Selenium Element Locator Strategies – Element Filtering

    45 mins

    Element locator strategies for Selenium WebDriver are highly flexible and have since been adopted by many commercial tools. They are also limited, however, in that Selenium WebDriver does not currently allow users to identify or filter UI elements with multiple locator strategies at a time, as many commercial tools do.

    This article describes a library that lets Selenium WebDriver users extend the Selenium element locator strategies with element filtering, along with a few use cases for the library.

    The solution approach allows users to continue to use the existing UI Element definitions in their tests, and extend them, using the By reference. The library will replace the existing Selenium WebDriver “By” reference.

    Filtering based on multiple locator strategies

    There are various scenarios where a complex XPath has to be written to uniquely identify a UI element. However, the element can instead be identified uniquely by using multiple locator strategies. UI elements can also be filtered when there are multiple matches on a page. This is the UI element recognition mechanism used in many commercial test automation tools.

    The algorithm for filtering UI Elements based on multiple locator strategies is based on priority of locator strategies. The priority of locator strategies when filtering is:

    1. ID
    2. Name
    3. TagName
    4. ClassName
    5. XPATH
    6. LinkText and PartialLinkText
    7. CSS

    The By.elementFilter method takes multiple locator strategies. It searches the page for elements matching one locator strategy/property and checks whether the match is unique on the page; if not, it uses the next locator strategy passed to it, and so on.

    This method is also very helpful when the application undergoes constant changes but at least one of a UI element's XPath, ID, name, tag name, class name, etc. remains unchanged. That way, it reduces much of the maintenance effort that UI element changes cause in Selenium WebDriver implementations.
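
The fallback behaviour described for By.elementFilter can be sketched without Selenium. In this sketch a "page" is a list of attribute maps and a "locator" is a predicate over one element; both are stand-ins invented for illustration, and the code shows one reading of the description rather than the library's actual implementation. The order of the locator list plays the role of the priority list above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical model of "try each locator until one matches uniquely".
final class ElementFilterSketch {

    // Stand-in for a UI element: a bag of attributes such as id, name, class.
    static Map<String, String> element(String... keyValues) {
        Map<String, String> attributes = new HashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            attributes.put(keyValues[i], keyValues[i + 1]);
        }
        return attributes;
    }

    // Stand-in for a locator strategy: matches elements whose attribute equals the value.
    static Predicate<Map<String, String>> by(String attribute, String value) {
        return e -> value.equals(e.get(attribute));
    }

    // Tries the locators in the given order and returns the element found by
    // the first locator that matches exactly one element on the page.
    static Optional<Map<String, String>> elementFilter(
            List<Map<String, String>> page,
            List<Predicate<Map<String, String>>> locators) {
        for (Predicate<Map<String, String>> locator : locators) {
            List<Map<String, String>> matches = new ArrayList<>();
            for (Map<String, String> e : page) {
                if (locator.test(e)) {
                    matches.add(e);
                }
            }
            if (matches.size() == 1) {
                return Optional.of(matches.get(0));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Map<String, String>> page = Arrays.asList(
                element("name", "q", "class", "search"),
                element("name", "q", "class", "footer"));
        List<Predicate<Map<String, String>>> locators = Arrays.asList(
                by("id", "query"),      // no match, so fall through
                by("name", "q"),        // two matches, not unique, so fall through
                by("class", "search")); // unique match
        System.out.println(elementFilter(page, locators));
    }
}
```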

    Filtering based on Index

    When there are multiple similar UI elements on a page, such as cells in a grid/table, it makes sense to identify them by index, in order of their appearance on the web page.

    The By.indexFilter method allows users to define a UI element by the index of its occurrence. The index starts from 1.
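
A minimal sketch of the 1-based lookup, assuming the matching elements have already been collected in page order. The generic `indexFilter` helper below is hypothetical and only mirrors the indexing rule stated above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating 1-based selection among multiple matches.
final class IndexFilterSketch {

    // Returns the index-th match, counting from 1 in order of appearance.
    static <T> T indexFilter(List<T> matches, int index) {
        if (index < 1 || index > matches.size()) {
            throw new IndexOutOfBoundsException(
                    "index " + index + " is outside 1.." + matches.size());
        }
        return matches.get(index - 1);
    }

    public static void main(String[] args) {
        List<String> cells = Arrays.asList("first", "second", "third");
        System.out.println(indexFilter(cells, 2));
    }
}
```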

    Filtering based on relative element

    When a UI element cannot be identified uniquely and reliably by any of its own properties, but can be located through elements in its hierarchy or relative to a particular element, this method can be used to identify it.

    The By.relationFilter method allows users to define a UI element in relation to another element. The relation can be defined as “Left”, “Right”, “Top”, “Bottom”, “Child”, or “Parent”.

    Filtering for Tables

    When dealing specifically with tables, which have the <table> HTML tag, the By.tableFilter method allows the user to quickly identify specific cells in the table without having to write complex XPaths or logic to achieve the same.

    The By.tableFilter method allows users to define a cell in the table by its row and column numbers. This allows users to use the UI element directly in their code instead of writing their own lookup logic each time. This also increases the efficiency and readability of the code.
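
The row/column addressing can be sketched on an in-memory table. The source does not say whether row and column numbers start at 0 or 1; this sketch assumes 1, to match By.indexFilter, and the `tableFilter` helper is a hypothetical stand-in rather than the library's method.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating cell lookup by row and column number.
final class TableFilterSketch {

    // Returns the cell at the given row and column (both assumed to be 1-based).
    static String tableFilter(List<List<String>> table, int row, int column) {
        if (row < 1 || row > table.size()) {
            throw new IndexOutOfBoundsException("row " + row + " is outside 1.." + table.size());
        }
        List<String> cells = table.get(row - 1);
        if (column < 1 || column > cells.size()) {
            throw new IndexOutOfBoundsException("column " + column + " is outside 1.." + cells.size());
        }
        return cells.get(column - 1);
    }

    public static void main(String[] args) {
        List<List<String>> table = Arrays.asList(
                Arrays.asList("Name", "Role"),
                Arrays.asList("giri", "Presenter"));
        System.out.println(tableFilter(table, 2, 1));
    }
}
```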