Query JSON, HiveQL, BigQuery, and Python/R Analytics - Textnotes


Learn NoSQL and Big Data SQL concepts in this comprehensive guide. Master SQL-like querying in NoSQL databases, HiveQL, BigQuery SQL, and integration with Python/R for analytics. Practice querying JSON/document data and aggregating large datasets efficiently.

1. Introduction

With Big Data and NoSQL databases, traditional SQL queries need adaptation.

  1. NoSQL databases use non-relational data models: MongoDB and Couchbase store JSON-style documents, while Cassandra is a wide-column store.
  2. Big Data frameworks (Hive, BigQuery, Spark SQL) allow SQL-like queries on massive datasets.
  3. Integration with Python or R enables advanced analytics and visualization.

Key Points:

  1. SQL-like syntax simplifies querying non-relational data.
  2. Big Data SQL tools can process millions to billions of records by distributing the work across many machines.
  3. Python/R integration allows data science workflows on top of SQL queries.

2. SQL-like Querying in NoSQL Databases

2.1 Query JSON Documents (MongoDB Example)


// Find employees with salary > 50000
db.Employees.find({ "Salary": { $gt: 50000 } })

// Project only Name and Department (and suppress the default _id field)
db.Employees.find({ "Salary": { $gt: 50000 } }, { Name: 1, Department: 1, _id: 0 })

// Aggregation: total salary per department
db.Employees.aggregate([
  { $group: { _id: "$Department", TotalSalary: { $sum: "$Salary" } } }
])
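
To make the semantics of the three operations above concrete, here is a minimal pure-Python sketch that emulates them on an in-memory list of dictionaries. It does not talk to MongoDB; the sample documents and the helper names `find_gt` and `group_sum` are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical sample documents; field names follow the queries above.
employees = [
    {"_id": 1, "Name": "Asha", "Department": "Engineering", "Salary": 72000},
    {"_id": 2, "Name": "Ben", "Department": "Engineering", "Salary": 48000},
    {"_id": 3, "Name": "Chen", "Department": "Sales", "Salary": 55000},
    {"_id": 4, "Name": "Dara", "Department": "Sales", "Salary": 39000},
]


def find_gt(docs, field, threshold, projection=None):
    """Emulate find({field: {$gt: threshold}}, projection) on a list of dicts."""
    # Documents that lack the field never match, as with $gt in MongoDB.
    matches = [d for d in docs if field in d and d[field] > threshold]
    if projection is None:
        return matches
    # Keep only the projected fields (the equivalent of {Name: 1, ..., _id: 0}).
    return [{k: d[k] for k in projection if k in d} for d in matches]


def group_sum(docs, key, value):
    """Emulate a $group stage with a $sum accumulator."""
    totals = defaultdict(int)
    for d in docs:
        # A missing value field contributes 0 to the sum.
        totals[d.get(key)] += d.get(value, 0)
    return dict(totals)


high_earners = find_gt(employees, "Salary", 50000, projection=["Name", "Department"])
salary_by_department = group_sum(employees, "Department", "Salary")

print(high_earners)
print(salary_by_department)
```

The filter runs before the projection, and the grouping keeps one accumulator per distinct key, which is the same order of work the MongoDB queries describe.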

3. HiveQL / BigQuery SQL

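Hive and BigQuery allow SQL-like queries on distributed data. Hive typically runs over files stored in Hadoop (HDFS) or cloud object storage, while BigQuery is Google Cloud's managed data warehouse.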

3.1 HiveQL Example


-- Count employees per department
SELECT Department, COUNT(*) AS TotalEmployees
FROM employees
GROUP BY Department;

-- Filter high salaries
SELECT Name, Salary
FROM employees
WHERE Salary > 50000;

3.2 BigQuery Example


-- Aggregate sales by region
SELECT Region, SUM(Sales) AS TotalSales
FROM `project.dataset.Sales`
GROUP BY Region
ORDER BY TotalSales DESC;
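
Because the aggregation queries in sections 3.1 and 3.2 use standard SQL, their logic can be checked locally on a few rows before running them on a cluster. The sketch below does this with Python's built-in `sqlite3` module and an in-memory database. The sample rows are hypothetical, the BigQuery table path is replaced by a plain `Sales` table, and an `ORDER BY Department` is added to the first query only to make the output order deterministic.

```python
import sqlite3

# In-memory database with hypothetical sample rows (not real data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (Name TEXT, Department TEXT, Salary REAL);
    INSERT INTO employees VALUES
        ('Asha', 'Engineering', 72000),
        ('Ben',  'Engineering', 48000),
        ('Chen', 'Sales',       55000);

    CREATE TABLE Sales (Region TEXT, Sales REAL);
    INSERT INTO Sales VALUES
        ('North', 120.0), ('South', 80.0), ('North', 60.0), ('East', 150.0);
""")

# Same shape as the HiveQL example: count employees per department.
headcount = conn.execute(
    "SELECT Department, COUNT(*) AS TotalEmployees "
    "FROM employees GROUP BY Department ORDER BY Department"
).fetchall()

# Same shape as the BigQuery example: total sales per region, largest first.
sales_by_region = conn.execute(
    "SELECT Region, SUM(Sales) AS TotalSales "
    "FROM Sales GROUP BY Region ORDER BY TotalSales DESC"
).fetchall()

conn.close()

print(headcount)
print(sales_by_region)
```

SQLite is only a stand-in here: it confirms the grouping and ordering logic, not the performance or the dialect-specific features of Hive or BigQuery.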

4. Integration with Python / R for Analytics

4.1 Python with Pandas & SQL


import pandas as pd
import sqlite3

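# Connect to the database (sqlite3.connect creates the file if it does not exist)
conn = sqlite3.connect('Sales.db')

# Aggregate in SQL so only one row per region is loaded into pandas;
# ORDER BY fixes the row order that the running total depends on
query = """
    SELECT Region, SUM(Sales) AS TotalSales
    FROM Sales
    GROUP BY Region
    ORDER BY TotalSales DESC
"""
df = pd.read_sql_query(query, conn)
conn.close()

# Perform analytics: cumulative total across regions, largest first
df['RunningTotal'] = df['TotalSales'].cumsum()
print(df)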
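
The running total above is one of several order-dependent calculations; moving averages and rankings follow the same pattern. The sketch below shows what each one computes using only the standard library, so it runs without pandas. The per-region totals and the helper names `moving_average` and `dense_rank` are hypothetical. In pandas, the corresponding calls would be `Series.cumsum()`, `Series.rolling(window).mean()`, and `Series.rank(method="dense", ascending=False)`.

```python
from itertools import accumulate

# Hypothetical per-region totals, already sorted from highest to lowest.
totals = [("North", 180.0), ("East", 150.0), ("South", 80.0), ("West", 40.0)]
values = [v for _, v in totals]


def moving_average(values, window):
    """Trailing moving average; returns one value per complete window."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def dense_rank(values):
    """Rank values from largest (1) to smallest; ties share a rank."""
    order = {v: r for r, v in enumerate(sorted(set(values), reverse=True), start=1)}
    return [order[v] for v in values]


running_total = list(accumulate(values))   # cumulative sum, like cumsum()
two_step_average = moving_average(values, 2)
ranks = dense_rank(values)

print(running_total)
print(two_step_average)
print(ranks)
```

Unlike this helper, `rolling(window).mean()` in pandas keeps the original length and fills the incomplete leading windows with NaN.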

4.2 R with DBI and dplyr


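library(DBI)
library(dplyr)  # querying database tables with tbl() also requires the dbplyr package

con <- dbConnect(RSQLite::SQLite(), "Sales.db")

# The pipeline is translated to SQL lazily; collect() runs it and returns a tibble
sales_summary <- tbl(con, "Sales") %>%
  group_by(Region) %>%
  summarise(TotalSales = sum(Sales, na.rm = TRUE)) %>%
  arrange(desc(TotalSales)) %>%
  collect()

print(sales_summary)
dbDisconnect(con)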

5. Practical Exercises

  1. Query JSON/document data in MongoDB for employees earning more than 50,000.
  2. Aggregate department salaries using the MongoDB aggregation pipeline.
  3. Use HiveQL or BigQuery to count records and compute sums for large datasets.
  4. Connect Python to a SQL or NoSQL database and perform grouped analytics.
  5. Compute running totals, moving averages, or rankings on large datasets in Python/R.

6. Tips for Beginners

  1. Start with small datasets before scaling to Big Data.
  2. Use aggregation pipelines in NoSQL for analytics queries.
  3. Leverage SQL-like syntax in Hive/BigQuery for easier transition from traditional SQL.
  4. Python/R integration is useful for visualization and advanced analytics.
  5. Optimize queries for large datasets: select only the columns you need and filter as early as possible.
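
As one small illustration of query optimization, the sketch below uses SQLite's `EXPLAIN QUERY PLAN` to compare how a filtered aggregate is executed before and after an index is created. This is an assumption-laden example: the table contents, the index name `idx_sales_region`, and the helper `plan` are hypothetical, and the exact wording of the plan text differs between SQLite versions. Hive and BigQuery rely on other mechanisms such as partitioning, so treat this as a local demonstration of the idea rather than a recipe for those systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Region TEXT, Sales REAL)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?)",
    [("North", 120.0), ("South", 80.0), ("North", 60.0), ("East", 150.0)],
)

query = "SELECT SUM(Sales) FROM Sales WHERE Region = ?"


def plan(connection, sql, params):
    """Return the human-readable plan text SQLite reports for a query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The description is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


plan_before = plan(conn, query, ("North",))   # no index yet: full table scan
conn.execute("CREATE INDEX idx_sales_region ON Sales (Region)")
plan_after = plan(conn, query, ("North",))    # the plan now names the index

north_total = conn.execute(query, ("North",)).fetchone()[0]
conn.close()

print(plan_before)
print(plan_after)
print(north_total)
```

The query result is the same in both cases; only the access path changes, which is why inspecting the plan is a useful habit before scaling a query up.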


Next Step: After completing this optional module, you will have a complete end-to-end roadmap from beginner to advanced SQL, including relational SQL, analytics, optimization, security, backup, and NoSQL/Big Data concepts.