Hadoop-based Services for Windows Azure includes several samples you can use for learning and testing. One sample is the 10GB GraySort, a scaled-down version of the Hadoop Terasort benchmark. The sample runs three jobs; in this video, Developer Brad Sarsfield walks you through the third, Teravalidate.
See Also
- More Videos about Hadoop Services on Windows and Windows Azure
- Apache Hadoop Services on Windows - wiki Homepage
- Microsoft's Big Data channel on YouTube
Transcript
Hi, my name is Brad Sarsfield and I’m a Developer on the Hadoop-based Services for Windows and Windows Azure team.
This video is Part 3 in the 10GB GraySort series. In videos 1 and 2 we generated and sorted the data. In this video I will show you how to validate that the sort was correct.
So let’s get started.
- From the Samples page I select the 10GB GraySort sample.
- To start the process, I deploy the sample to my cluster.
- On the Create Job page, the fields are pre-populated for me, but I need to make a few changes.
- First, I rename the job from Terasort Example to Teravalidate Example to identify this as the data validation program/job.
- The first parameter I change to teravalidate – this is the name of the program that will be run from the hadoop-examples JAR.
- The 2nd parameter specifies the number of map tasks and reduce tasks to be executed. I leave the 50 map tasks and add 25 Reduce tasks.
- The 3rd parameter identifies the input and output files. The teravalidate sample takes input from the previously-created 10GB-sort-output file and will write the results to a file named 10GB-sort-validate.
- Execute the job.
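For readers who prefer a command line to the Create Job page, the three parameters above could be assembled roughly as follows. This is a sketch under stated assumptions, not the command the portal actually issues: the JAR file name `hadoop-examples.jar`, the use of the Hadoop 1.x properties `mapred.map.tasks` and `mapred.reduce.tasks` to pass the task counts, and the bare input and output names are all assumptions, and `build_teravalidate_args` is a hypothetical helper.

```python
import shlex


def build_teravalidate_args(map_tasks, reduce_tasks, input_path, output_path,
                            examples_jar="hadoop-examples.jar"):
    """Assemble a hadoop command line for the teravalidate program.

    The JAR name and the -D property names are assumptions; adjust them
    to match your cluster and Hadoop version.
    """
    return [
        "hadoop", "jar", examples_jar,
        "teravalidate",                           # 1st parameter: program name
        f"-Dmapred.map.tasks={map_tasks}",        # 2nd parameter: task counts
        f"-Dmapred.reduce.tasks={reduce_tasks}",
        input_path,                               # 3rd parameter: input ...
        output_path,                              # ... and output
    ]


args = build_teravalidate_args(50, 25, "10GB-sort-output", "10GB-sort-validate")
print(shlex.join(args))
```

Building the command as a list and joining it only for display keeps each parameter separate, which mirrors the three fields on the Create Job page.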
Behind the scenes, Hadoop is validating that each part of the file has been sorted correctly: it goes through the records and checks that they are in the correct order.
While the job is running, I switch over to the terminal services view and review the 10GB-sort-output, the output from the terasort example. Here it is. I take a look at one of these files. There are 25 files – they correspond with the number of reduce tasks we requested.
I take a closer look at one of those files, part 9. The data in this file is sorted, starting with AH and ending with AHt. So the teravalidate program is now using 10GB-sort-output, the output of the terasort, to validate that this is in fact the correct sort order.
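The idea behind this kind of check can be illustrated with a small, simplified sketch. It is not the teravalidate implementation: `validate_parts` is a hypothetical function that takes the keys of each part file as in-memory lists, confirms that the keys inside each part are in order, and confirms that each part begins at or after the point where the previous part ended. The sample keys are made up, apart from AH and AHt.

```python
def validate_parts(parts):
    """Return a list of error messages; an empty list means the parts are sorted.

    parts: a list of key lists, one per part file, in part order.
    """
    errors = []
    previous_last = None
    for index, keys in enumerate(parts):
        # Check ordering inside this part.
        for position in range(1, len(keys)):
            if keys[position - 1] > keys[position]:
                errors.append(f"part {index}: misordered at record {position}")
        if keys:
            # Check ordering across the boundary with the previous part.
            if previous_last is not None and previous_last > keys[0]:
                errors.append(f"part {index}: starts before the previous part ends")
            previous_last = keys[-1]
    return errors


print(validate_parts([["AA", "AB"], ["AH", "AHt"]]))  # no errors expected
print(validate_parts([["AH", "AHt"], ["AA", "AB"]]))  # boundary error expected
```

An empty result plays the same role here as the empty log file in the video: no errors were reported, so the order is taken to be correct.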
- The job completed successfully.
But ‘success’ doesn’t mean that the sort is valid; it just means the task completed successfully. To see whether the sort order is valid, take a look at the Exit Code and the Logs. The Exit Code is 0 and the log file is empty (a zero-byte file). This indicates the sort was correct.
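The rule described here (exit code 0 and an empty log) is simple enough to write down as a check. The sketch below is only an illustration of that rule; `sort_is_valid` is a hypothetical helper, and it assumes you have already obtained the job's exit code and the raw contents of its log file by some other means.

```python
def sort_is_valid(exit_code, log_bytes):
    """Apply the rule from the video: exit code 0 and a zero-byte log."""
    return exit_code == 0 and len(log_bytes) == 0


print(sort_is_valid(0, b""))                    # both conditions met
print(sort_is_valid(0, b"misordered record"))   # job ran, but the log is not empty
```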
That concludes the 10GB GraySort sample video series. Thank you for watching; I hope you found it helpful.