Data Analytics Interview QA Preparation

October 24, 2020

1. What is Spark SQL?

Ans: Spark SQL is a component of the Spark ecosystem used to work with structured and semi-structured data.

It supports relational processing of big data and integrates with Scala, Python, Java, etc.
It also adds SQL query execution to Spark. Queries can run against data held in Spark RDDs as well as data in external sources.

2. What are the types of data in Big Data?

Ans: Data in Big Data can be classified as structured, semi-structured, or unstructured.
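The difference between the three categories is easier to see with small samples. The sketch below uses only the Python standard library; the sample values are invented for illustration.

```python
import csv
import io
import json

# Structured: fixed columns, every row follows the same schema.
structured_text = "id,name,age\n1,Asha,31\n2,Ben,27\n"
rows = list(csv.DictReader(io.StringIO(structured_text)))

# Semi-structured: self-describing keys, but fields can vary per record.
semi_structured_text = (
    '[{"id": 1, "name": "Asha", "tags": ["admin"]}, {"id": 2, "name": "Ben"}]'
)
records = json.loads(semi_structured_text)

# Unstructured: free text with no schema; any structure must be extracted.
unstructured_text = "Asha logged in at noon. Ben reported an error soon after."
words = unstructured_text.split()

print(rows[0]["name"])        # a column can be addressed by name
print("tags" in records[1])   # the second record has no "tags" field
print(len(words))             # only generic operations such as word counts apply
```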

3. What is an RDD?

Ans: RDD stands for Resilient Distributed Dataset. It is an immutable, cacheable, distributed collection of data, and it is the primary distributed dataset abstraction in Spark.
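To make the words "immutable", "cacheable", and "resilient" concrete, here is a toy single-process model in plain Python. It is not Spark's implementation: `ToyRDD` and all of its methods are hypothetical names for this sketch, and the "partitions" are just tuples in one process.

```python
class ToyRDD:
    """A toy stand-in for an RDD (illustrative only, not the Spark API)."""

    def __init__(self, compute):
        # The lineage: a function that can rebuild every partition on demand.
        self._compute = compute
        self._should_cache = False
        self._cached = None

    @classmethod
    def parallelize(cls, data, num_partitions):
        items = list(data)
        # Split the data round-robin into immutable partitions.
        parts = tuple(tuple(items[i::num_partitions]) for i in range(num_partitions))
        return cls(lambda: parts)

    def map(self, fn):
        # A transformation never modifies this dataset; it returns a new one
        # that remembers how to derive itself from its parent.
        parent = self
        return ToyRDD(lambda: tuple(tuple(fn(x) for x in p) for p in parent._partitions()))

    def cache(self):
        # Mark the dataset so its partitions are kept after first computation.
        self._should_cache = True
        return self

    def lose_cache(self):
        # Simulate losing cached partitions, for example after a node failure.
        self._cached = None

    def _partitions(self):
        if self._cached is not None:
            return self._cached
        parts = self._compute()  # recompute from lineage
        if self._should_cache:
            self._cached = parts
        return parts

    def collect(self):
        return [x for part in self._partitions() for x in part]


numbers = ToyRDD.parallelize(range(1, 7), num_partitions=2)
squares = numbers.map(lambda x: x * x).cache()

first = sorted(squares.collect())
squares.lose_cache()                 # cached partitions are gone
second = sorted(squares.collect())   # rebuilt from lineage, same result
print(first == second, sorted(numbers.collect()))
```

The parent dataset `numbers` is unchanged by `map`, and the lost cache is rebuilt by replaying the recorded transformation, which is the idea behind resilience.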

4. What is Spark SQL?

Ans: Spark SQL is the data analysis solution in the Spark family of tools. It runs SQL on top of Spark.

Spark SQL can deal with data from a wide variety of data sources such as:
- Structured data files
- Tables in Hive
- External Databases
- Existing RDDs

Spark SQL also provides a straightforward domain-specific language (DSL) for selecting, filtering, and aggregating data.

5. What are the goals of Spark SQL?

Ans: The goals of Spark SQL are:
- Support relational processing on both native RDDs and external data sources.
- Provide high performance using established DBMS techniques.
- Make it easy to add new data sources, including semi-structured data and external sources.
- Support advanced analytics algorithms such as machine learning and graph processing.

Note: Important URLs for Data Analytics Interview Preparation

Url: https://www.amazon.com/dp/B07YKQK4M9

Url: https://www.amazon.com/dp/B07QBGVF5L

Url: https://www.amazon.com/dp/B07LFDY87F

Linux Kuriosity