Big Data - Verify Data Completeness and Data Correctness using Selenium/JAVA

10:45 AM - 11:30 AM, Grand Ballroom 1

Java programs written for Hadoop MapReduce process large amounts of data, and these applications must be tested to confirm that they work correctly. Testing includes manually verifying the business logic on each node: the accuracy of the MapReduce process, the data aggregation/segregation rules, and the generation of key-value pairs. The output data files are also verified against the transformation rules and checked for successful load, data integrity, and data accuracy.

Because of the enormous amount of data and the variety of business rules, manual testing is time-consuming and validations can be missed. Automating the tests with Selenium and Java adapters helps ensure that the data complies with all business/transformation rules and that data integrity is checked.

Outline/structure of the Session

Hadoop is used to fetch data from different data sources, and programmers apply business logic to extract the data they need. For example, suppose Hadoop gathers 100 PB (petabytes) of data, a mix of good and bad data, and the business is interested only in the good data that carries the required information; say that is 50 TB out of the 100 PB. Hadoop programmers write a few MapReduce programs in Java, run them across multiple nodes, and gather roughly 50 TB. To validate this 50 TB, the traditional manual testing process takes random sample data files and validates them against the business logic. Sampling cannot guarantee data completeness or data correctness, and manual validation of data integrity and data accuracy takes months. Manual testing is also impractical for unstructured data, which comes in many file types, e.g. audio, video, mobile calls, call center data, and images. With automation, we can read the headers of these files and split them into structured files.
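As one possible illustration of reading file headers, the sketch below classifies a file by its leading "magic" bytes so that it can be routed to a matching parser. The class name FileTypeSniffer and the returned labels are hypothetical and not part of the session material; only four well-known signatures (PNG, JPEG, RIFF/WAVE, ID3-tagged MP3) are covered.

```java
final class FileTypeSniffer {

    /** Returns a label for the file type suggested by the first bytes of a file. */
    static String detect(byte[] header) {
        if (startsWith(header, 0, 0x89, 'P', 'N', 'G')) return "image/png";
        if (startsWith(header, 0, 0xFF, 0xD8, 0xFF)) return "image/jpeg";
        // WAV files are RIFF containers: "RIFF", a 4-byte length, then "WAVE".
        if (startsWith(header, 0, 'R', 'I', 'F', 'F')
                && startsWith(header, 8, 'W', 'A', 'V', 'E')) return "audio/wav";
        // MP3 files that carry an ID3v2 tag begin with "ID3".
        if (startsWith(header, 0, 'I', 'D', '3')) return "audio/mpeg";
        return "unknown";
    }

    /** True if data contains the given unsigned byte values starting at offset. */
    private static boolean startsWith(byte[] data, int offset, int... magic) {
        if (data.length < offset + magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[offset + i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }
}
```

In practice the header bytes would be read from the first few bytes of each file; files reported as "unknown" would need a separate rule or manual review.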

Learning Outcome

The test automation approach starts by splitting the 50 TB of data into smaller chunks and developing Java test programs, known as Java adapters. The required business logic is implemented in these adapters, which are run against each individual chunk to generate the expected output data. The content and size of this output are then verified against the data generated by the Hadoop MapReduce programs. This verifies data completeness and data correctness, as briefly explained below.
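The approach described above can be sketched in a few lines of Java. This is a minimal in-memory illustration, not the framework presented in the session: the class ChunkedVerification and its methods are hypothetical names, the adapter is passed in as a plain function, and records are assumed to be text lines. The fingerprint combines record count, byte size, and a SHA-256 digest over the sorted records, so two outputs match regardless of record order.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ChunkedVerification {

    /** Splits the records into consecutive chunks of at most chunkSize records. */
    static <T> List<List<T>> split(List<T> records, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < records.size(); i += chunkSize) {
            chunks.add(records.subList(i, Math.min(i + chunkSize, records.size())));
        }
        return chunks;
    }

    /** Runs the adapter on every chunk in parallel; results are joined in chunk order. */
    static List<String> runAdapters(List<List<String>> chunks,
                                    Function<List<String>, List<String>> adapter,
                                    int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> chunk : chunks) {
                Callable<List<String>> task = () -> adapter.apply(chunk);
                futures.add(pool.submit(task));
            }
            List<String> expected = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                expected.addAll(future.get());
            }
            return expected;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running adapters", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("adapter failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Order-independent fingerprint: "recordCount:byteSize:sha256Hex". */
    static String fingerprint(List<String> records) {
        List<String> sorted = new ArrayList<>(records);
        Collections.sort(sorted);
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            long bytes = 0;
            for (String record : sorted) {
                byte[] encoded = (record + "\n").getBytes(StandardCharsets.UTF_8);
                bytes += encoded.length;
                digest.update(encoded);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return sorted.size() + ":" + bytes + ":" + hex;
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the digests every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }
}
```

A test would call split, pass the chunks and a business-rule function to runAdapters, and then compare fingerprint(expected) with fingerprint(actual), where actual is the MapReduce output. At real data volumes the records would be streamed from files rather than held in lists.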

Data Correctness - Validating actual data (produced by MapReduce) vs. expected data (produced by the test Java adapters).

Data Completeness - 100% data validation (every record is checked, not just a sample).
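One way to make both checks concrete is a record-level reconciliation that counts every record on each side and reports only the differences. The sketch below is an assumption about how such a check could look, with a hypothetical class name Reconciler; it works in memory on one chunk at a time.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class Reconciler {

    /**
     * Returns record -> (occurrences in actual - occurrences in expected)
     * for every record whose counts differ. A negative value means the record
     * is missing from the actual (MapReduce) output; a positive value means
     * the actual output contains extra copies. An empty map means the two
     * sides agree on every record.
     */
    static Map<String, Long> diff(List<String> expected, List<String> actual) {
        Map<String, Long> delta = new HashMap<>();
        for (String record : expected) {
            delta.merge(record, -1L, Long::sum);
        }
        for (String record : actual) {
            delta.merge(record, 1L, Long::sum);
        }
        delta.values().removeIf(count -> count == 0L);
        return new TreeMap<>(delta); // sorted for stable reporting
    }
}
```

For example, Reconciler.diff(List.of("a", "b", "b"), List.of("b", "c")) yields {a=-1, b=-1, c=1}: one "a" and one "b" are missing from the actual data, and one "c" is unexpected.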

For output data validation, tools like Presto can be used. Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.

Using a test automation framework, large amounts of data can be verified in far less time than with manual testing, and data integrity and data accuracy are verified with 100% coverage.
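As a rough sketch of how a SQL engine could support this validation, the code below builds a set-difference query and runs it through the standard java.sql API. It assumes that both the MapReduce output and the adapter output are exposed as tables with identical columns; the table names used in the example (mr_output, adapter_output) and the class OutputTableCheck are hypothetical, and obtaining the Connection (driver, URL, catalog) is left to the caller.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

final class OutputTableCheck {

    /**
     * Builds a query that counts rows present in one table but not the other.
     * EXCEPT compares distinct rows, so differences in duplicate counts are
     * not detected by this query. Table names must come from trusted
     * configuration, since they are concatenated into the SQL text.
     */
    static String mismatchCountSql(String actualTable, String expectedTable) {
        return "SELECT count(*) FROM ("
                + "(SELECT * FROM " + actualTable + " EXCEPT SELECT * FROM " + expectedTable + ")"
                + " UNION ALL "
                + "(SELECT * FROM " + expectedTable + " EXCEPT SELECT * FROM " + actualTable + ")"
                + ") AS diff";
    }

    /** Runs the mismatch query; a result of 0 means no differing distinct rows. */
    static long countMismatches(Connection connection, String actualTable, String expectedTable)
            throws SQLException {
        try (Statement statement = connection.createStatement();
             ResultSet rows = statement.executeQuery(mismatchCountSql(actualTable, expectedTable))) {
            rows.next();
            return rows.getLong(1);
        }
    }
}
```

A separate count(*) on each table would complement this check by catching rows that are duplicated on one side only.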

Target Audience

leads, architects

Submitted 2 years ago

  • Dave Haeffner Test
    By Dave Haeffner Test  ~  2 years ago
    That's an interesting problem set and approach to solving it. But do you actually use Selenium for this kind of testing?

    • giri
      By giri  ~  2 years ago
      Thanks for your observation. Yes, Selenium is required, as it is open source and is used to maintain the testbed: the test repository, test scenarios, results reporting, and the respective Java adapters (function libraries).


      • Leo Laskin
        By Leo Laskin  ~  1 year ago
        I don't understand how you use Selenium. What web UI testing are you doing as part of this testing?

  • Naveen Chauhan
    By Naveen Chauhan  ~  1 year ago
    You compare the data produced by the MapReduce code with the output of the Java adapters. I'm not sure what we gain from this: we need to compare the output data with the data in the actual source systems it came from in the first place, because the data produced by either the Java adapters or the MapReduce program may be incorrect relative to the source data.

    Please let me know if I am missing anything here.

  • RK Raju
    By RK Raju  ~  2 years ago
    Could you please elaborate on which parameters the test completion time depends on in this case study?

    • giri
      By giri  ~  2 years ago
      Test completion time depends on the amount of data being tested. As you may know, big data deals with the 5 V's of data (Velocity, Volume, Variety, Veracity, Value). To achieve data completeness (100% data validation), we split the extracted (MapReduced) data into smaller chunks and run Java adapters against them in parallel. Hope this info helps.

