After completing the course, you will work on a live project in which you use Pig, Hive, HBase, and MapReduce to perform Big Data analytics.
Our Big Data and Hadoop Certification includes industry-specific Big Data case studies (e.g. Finance, Retail, Media, Aviation) that you can consider for your project work:
- Project #1: Analyze social bookmarking sites to find insights
Industry: Social Media
Data: Information gathered from bookmarking sites such as reddit.com and stumbleupon.com, which allow you to bookmark, review, rate, and search links on any topic. The data is in XML format and contains the URLs of links/posts, the categories that define them, and the ratings associated with them.
Problem Statement: Analyze data in the Hadoop ecosystem:
- Fetch the data from the Hadoop Distributed File System (HDFS) and analyze it with MapReduce, Pig, and Hive to find the top-rated links based on user comments, likes, etc.
- Use MapReduce to convert the semi-structured XML data into a structured format and categorize the user rating for each of the thousand links as positive or negative.
- Push the output to HDFS and then feed it into Pig, which splits the data into two parts: category data and ratings data.
- Write a Hive query to analyze the data further and push the output into a relational database (RDBMS) using Sqoop.
- Use a web server running on Grails/Java/Ruby/Python to render the results in real time on a website.
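The conversion and split steps above can be sketched in plain Python before moving to a cluster. This is a minimal illustration, not Hadoop code: the XML element and attribute names, the sample URLs, and the rating threshold of 3.0 are all assumptions made for the example, since the actual schema is not given here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real feed's element and attribute names may differ.
SAMPLE_XML = """
<links>
  <link url="http://example.com/a" category="tech" rating="4.5"/>
  <link url="http://example.com/b" category="news" rating="1.0"/>
</links>
""".strip()

POSITIVE_THRESHOLD = 3.0  # assumed cut-off between negative and positive


def parse_links(xml_text, threshold=POSITIVE_THRESHOLD):
    """Turn XML link entries into flat (url, category, rating, label) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for link in root.iter("link"):
        rating = float(link.get("rating"))
        label = "positive" if rating >= threshold else "negative"
        rows.append((link.get("url"), link.get("category"), rating, label))
    return rows


def split_category_and_ratings(rows):
    """Imitate the Pig step: one relation for categories, one for ratings."""
    categories = [(url, category) for url, category, _, _ in rows]
    ratings = [(url, rating, label) for url, _, rating, label in rows]
    return categories, ratings


rows = parse_links(SAMPLE_XML)
categories, ratings = split_category_and_ratings(rows)
for row in rows:
    # Tab-separated output, a common layout for files later loaded into Pig or Hive.
    print("\t".join(str(field) for field in row))
```

In the project itself, the parsing logic would sit in a MapReduce mapper and the split would be done by Pig; the sketch only shows the shape of the data at each stage.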
- Project #2: Analysis of Customer Complaints
Industry: Retail
Data: A publicly available dataset containing a few lakh (a few hundred thousand) observations with attributes such as CustomerId, Payment Mode, Product Details, Complaint, Location, Status of the complaint, etc.
Problem Statement: Analyze data in the Hadoop ecosystem:
- Get the number of complaints filed under each product
- Get the total number of complaints filed from a specific location
- Get the list of complaints, grouped by location, that received no timely response
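The three questions above are simple group-and-count operations. The sketch below shows them on a tiny in-memory sample. The field names and values are hypothetical, and the `timely_response` flag in particular is an assumption about how a missing timely reaction might be recorded.

```python
from collections import Counter, defaultdict

# Hypothetical records; field names and values are illustrative only.
complaints = [
    {"id": 1, "product": "Laptop", "location": "Pune", "timely_response": False},
    {"id": 2, "product": "Phone", "location": "Delhi", "timely_response": True},
    {"id": 3, "product": "Laptop", "location": "Delhi", "timely_response": False},
    {"id": 4, "product": "Phone", "location": "Pune", "timely_response": False},
]

# Number of complaints filed under each product.
per_product = Counter(c["product"] for c in complaints)


def count_for_location(records, location):
    """Total number of complaints filed from one location."""
    return sum(1 for c in records if c["location"] == location)


# Complaints without a timely response, grouped by location.
late_by_location = defaultdict(list)
for c in complaints:
    if not c["timely_response"]:
        late_by_location[c["location"]].append(c["id"])

print(dict(per_product))
print(count_for_location(complaints, "Pune"))
print(dict(late_by_location))
```

In Hive, each of these would typically be a `GROUP BY` query with a `COUNT` or a `WHERE` filter over the same columns.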
- Project #3: Tourism Data Analysis
Industry: Tourism
Data: The data comprises attributes like: City pair (combination of from and to), adults traveling, seniors traveling, children traveling, air booking price, car booking price, etc.
Problem Statement: Find the following insights from the data:
- Top 20 destinations people travel to most frequently: the most popular destinations, based on the number of trips booked for each destination.
- Top 20 locations from which the highest number of trips start, based on the booked trip count.
- Top 20 high air-revenue destinations, i.e., the 20 cities that generate the highest airline revenue from travel, so that discount offers can be given to attract more bookings for these destinations.
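All three rankings above follow the same pattern: aggregate per city, then keep the largest N. A minimal sketch is shown below, assuming each booking is reduced to a hypothetical `(origin, destination, air_booking_price)` tuple. The city codes and prices are made up, and N is set to 2 so the tiny sample stays readable.

```python
from collections import Counter

# Hypothetical bookings: (origin, destination, air_booking_price).
bookings = [
    ("DEL", "GOA", 120.0),
    ("BOM", "GOA", 90.0),
    ("DEL", "BLR", 200.0),
    ("DEL", "GOA", 110.0),
    ("BLR", "DEL", 150.0),
]


def top_n(counter, n):
    """Return the n largest (key, value) pairs, highest value first."""
    return counter.most_common(n)


# Trip counts per destination and per origin.
trips_by_destination = Counter(dest for _, dest, _ in bookings)
trips_by_origin = Counter(origin for origin, _, _ in bookings)

# Total air booking revenue per destination.
air_revenue = Counter()
for _, dest, price in bookings:
    air_revenue[dest] += price

N = 2  # the project asks for 20; 2 keeps the sample output short
print(top_n(trips_by_destination, N))
print(top_n(trips_by_origin, N))
print(top_n(air_revenue, N))
```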
- Project #4: Airline Data Analysis
Industry: Aviation
Data: Flight details of various airlines, with attributes such as: airport ID, name of the airport, main city served by the airport, country or territory where the airport is located, airport code, decimal degrees, hours offset from UTC, time zone, etc.
Problem Statement: Analyze the airlines' data to:
- Find the list of airports operating in the country
- Find the list of airlines having zero stops
- Find the list of airlines operating with code share
- Find which country or territory has the highest number of airports
- Find out the list of active airlines in the United States
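The queries above are filters and counts over three kinds of records. The sketch below assumes a simplified, hypothetical layout for airports, routes, and airlines. The tuple fields and sample rows are invented for illustration and do not reflect the actual dataset columns.

```python
from collections import Counter

# Hypothetical rows; the real dataset has more columns.
airports = [  # (airport_id, name, city, country)
    (1, "Indira Gandhi Intl", "Delhi", "India"),
    (2, "Chhatrapati Shivaji Intl", "Mumbai", "India"),
    (3, "Heathrow", "London", "United Kingdom"),
]
routes = [  # (airline_code, stops, is_codeshare)
    ("AI", 0, False),
    ("BA", 1, True),
    ("AI", 0, True),
]
airlines = [  # (name, code, country, is_active)
    ("Air India", "AI", "India", True),
    ("Delta", "DL", "United States", True),
    ("Pan Am", "PA", "United States", False),
]


def airports_in(country):
    """Names of airports located in the given country."""
    return [name for _, name, _, c in airports if c == country]


# Airlines with at least one zero-stop route, and with at least one code-share route.
zero_stop = sorted({code for code, stops, _ in routes if stops == 0})
codeshare = sorted({code for code, _, shared in routes if shared})

# Country or territory with the highest number of airports.
most_airports = Counter(country for *_, country in airports).most_common(1)[0]

# Active airlines registered in the United States.
active_us = [name for name, _, country, active in airlines
             if country == "United States" and active]

print(airports_in("India"))
print(zero_stop, codeshare)
print(most_airports)
print(active_us)
```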
- Project #5: Analyze Loan Dataset
Industry: Banking and Finance
Data: Publicly available dataset which contains all the details of loans issued, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information.
Problem Statement:
- Find the number of cases per location, categorize the count by the reason for taking the loan, and display the average risk score.
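One way to read this task is as a single grouping on (location, loan purpose) that yields both a count and an average. The sketch below follows that reading. The record layout, the state codes, and the score values are hypothetical.

```python
from collections import defaultdict

# Hypothetical loan records: (location, purpose, risk_score).
loans = [
    ("CA", "car", 700),
    ("CA", "car", 650),
    ("CA", "home", 720),
    ("NY", "car", 600),
]


def summarize(records):
    """Map (location, purpose) -> (case count, average risk score)."""
    totals = defaultdict(lambda: [0, 0.0])  # [count, sum of scores]
    for location, purpose, score in records:
        entry = totals[(location, purpose)]
        entry[0] += 1
        entry[1] += score
    return {key: (count, total / count) for key, (count, total) in totals.items()}


summary = summarize(loans)
for (location, purpose), (count, avg) in sorted(summary.items()):
    print(location, purpose, count, round(avg, 1))
```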
- Project #6: Analyze Movie Ratings
Industry: Media
Data: Publicly available data from sites such as Rotten Tomatoes, IMDb, etc.
Problem Statement: Analyze the ratings of movies by different users:
- Get the user who has rated the most movies
- Get the user who has rated the fewest movies
- Get the total number of movies rated by users who belong to a specific occupation
- Get the number of underage users
- Project #7: Analyze YouTube data
Industry: Social Media
Data: Information about YouTube videos, with attributes such as: VideoID, Uploader, Age, Category, Length, Views, Ratings, Comments, etc.
Problem Statement:
- Identify the top 5 categories in which most videos are uploaded, the top 10 rated videos, and the top 10 most viewed videos.
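Each of these is a top-N selection: one over category counts, two over per-video values. A minimal sketch is shown below, assuming each video is reduced to a hypothetical `(video_id, category, rating, views)` tuple. The sample rows are made up.

```python
import heapq
from collections import Counter

# Hypothetical rows: (video_id, category, rating, views).
videos = [
    ("v1", "Music", 4.8, 1000),
    ("v2", "Comedy", 4.1, 5000),
    ("v3", "Music", 3.9, 300),
    ("v4", "Sports", 4.9, 50),
]


def top_categories(rows, n=5):
    """Categories with the most uploaded videos, as (category, count) pairs."""
    return Counter(category for _, category, _, _ in rows).most_common(n)


def top_by(rows, field_index, n=10):
    """Return the ids of the n rows with the largest value in one column."""
    best = heapq.nlargest(n, rows, key=lambda row: row[field_index])
    return [row[0] for row in best]


print(top_categories(videos))
print(top_by(videos, 2))  # ranked by rating
print(top_by(videos, 3))  # ranked by views
```

`heapq.nlargest` avoids sorting the full dataset when only a handful of rows are needed, which is the same idea a MapReduce job would use by keeping a small top-N list per mapper.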
Apart from these, there are some twenty more use cases to choose from: