Wednesday, 10 June 2015

Real Time Interview Questions on Joiner Transformation

Joiner Transformation
1. What is a Joiner Transformation and why is it an Active one?
Answer:
A Joiner is an Active and Connected transformation used to join two source data streams coming from the same or heterogeneous databases or files.
The Joiner transformation joins sources with at least one matching column. The Joiner transformation uses a condition that matches one or more pairs of columns between the two sources.
In the Joiner transformation, we must configure the transformation properties, namely the Join Condition and Join Type, and optionally the Sorted Input option to improve Integration Service performance.
The join condition contains ports from both input sources that must match for the Integration Service to join two rows. Depending on the join condition and the type of join selected, the Integration Service either adds the row to the result set or discards the row. For this reason, the number of rows in the Joiner output may not equal the number of rows in the Joiner input. This is why the Joiner is considered an Active transformation.
2. State the limitations where we cannot use Joiner in the mapping pipeline.
Answer:
The Joiner transformation accepts input from most transformations. However, it has the following limitations:

  1.  A Joiner transformation cannot be used when either of the input pipelines contains an Update Strategy transformation.
  2.  A Joiner transformation cannot be used if a Sequence Generator transformation is connected directly before the Joiner transformation.

3. Out of the two input pipelines of a joiner, which one will we set as the master pipeline?
Answer:
During a session run, the Integration Service compares each row of the master source against the detail source. The master and detail sources need to be configured for optimal performance.
When the Integration Service processes an unsorted Joiner transformation, it blocks the detail source while it caches rows from the master source. Once the Integration Service finishes reading and caching all master rows, it unblocks the detail source and reads the detail rows. This is why designating the source with fewer input rows as the master keeps the cache smaller, thereby improving performance.
For a sorted Joiner transformation, use the source with fewer duplicate key values as the master source for optimal performance and disk storage. When the Integration Service processes a sorted Joiner transformation, it caches rows for one hundred keys at a time. If the master source contains many rows with the same key value, the Integration Service must cache more rows, and performance can be slowed.
The Integration Service uses blocking logic only if the master and detail inputs to the Joiner transformation originate from different sources. Otherwise, it does not use blocking logic; instead, it stores more rows in the cache.
4. What are the different types of Joins available in Joiner Transformation?
Answer:
In SQL, a join is a relational operator that combines data from multiple tables into a single result set. The Joiner transformation is similar to an SQL join except that data can originate from different types of sources.
The Joiner transformation supports the following types of joins:

  1.  Normal
  2.  Master Outer
  3.  Detail Outer
  4.  Full Outer

A normal or master outer join performs faster than a full outer or detail outer join.
5. Define the various Join Types of Joiner Transformation.
Answer:

  1.  In a normal join, the Integration Service discards all rows of data from the master and detail source that do not match, based on the join condition.
  2.  A master outer join keeps all rows of data from the detail source and the matching rows from the master source. It discards the unmatched rows from the master source.
  3.  A detail outer join keeps all rows of data from the master source and the matching rows from the detail source. It discards the unmatched rows from the detail source.
  4. A full outer join keeps all rows of data from both the master and detail sources.
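For readers who think in SQL, the four join types map roughly onto ANSI SQL joins. This is only a hedged analogy (MASTER and DETAIL are hypothetical tables standing in for the two pipelines), but it captures which side's unmatched rows survive:

-- Normal join ~ INNER JOIN: only matching rows survive
SELECT * FROM DETAIL d INNER JOIN MASTER m ON d.join_key = m.join_key;
-- Master outer join ~ all DETAIL rows plus matching MASTER rows
SELECT * FROM DETAIL d LEFT OUTER JOIN MASTER m ON d.join_key = m.join_key;
-- Detail outer join ~ all MASTER rows plus matching DETAIL rows
SELECT * FROM MASTER m LEFT OUTER JOIN DETAIL d ON d.join_key = m.join_key;
-- Full outer join ~ all rows from both sides
SELECT * FROM DETAIL d FULL OUTER JOIN MASTER m ON d.join_key = m.join_key;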

6. Describe the impact of the number of join conditions and the join order in a Joiner.
Answer:
We can define one or more conditions based on equality between the specified master and detail sources. Both ports in a condition must have the same data type.
If we need to use two ports in the join condition with non-matching data types, we must convert the data types so that they match. The Designer validates the data types in a join condition.
Additional ports in the join condition increase the time necessary to join two sources.
The order of the ports in the join condition can impact the performance of the Joiner transformation. If we use multiple ports in the join condition, the Integration Service compares the ports in the order we specified.
Only the equality operator (=) is available in the Joiner join condition.
7. How does Joiner transformation treat NULL value matching?
Answer:
The Joiner transformation does not match null values.
For example, if both EMP_ID1 and EMP_ID2 contain a row with a null value, the Integration Service does not consider them a match and does not join the two rows.
To join rows with null values, replace null input with default values in the Ports tab of the joiner, and then join on the default values.
If a result set includes fields that do not contain data in either of the sources, the Joiner transformation populates the empty fields with null values. If we know that a field will return a NULL and we do not want to insert NULLs in the target, set a default value on the Ports tab for the corresponding port.
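The underlying behavior mirrors SQL, where NULL = NULL never evaluates to TRUE. A minimal sketch, assuming two hypothetical sources SRC1 and SRC2 with numeric employee IDs (the COALESCE workaround plays the same role as the default value on the Ports tab):

-- These rows never join when both IDs are NULL:
SELECT * FROM SRC1 s1 JOIN SRC2 s2 ON s1.EMP_ID1 = s2.EMP_ID2;
-- Workaround: substitute a sentinel default on both sides before comparing
SELECT * FROM SRC1 s1 JOIN SRC2 s2
ON COALESCE(s1.EMP_ID1, -1) = COALESCE(s2.EMP_ID2, -1);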
8. When we configure the join condition, what are the guidelines we need to follow to maintain the sort order?
Suppose we configure Sorter transformations in the master and detail pipelines with the following sorted ports in order: ITEM_NO, ITEM_NAME and PRICE.
Answer:
If we have sorted both the master and detail pipelines in the order of the ports ITEM_NO, ITEM_NAME and PRICE, we must ensure that:
  1.  We use ITEM_NO in the first join condition.
  2.  If we add a second join condition, we use ITEM_NAME.
  3.  If we want to use PRICE as a join condition apart from ITEM_NO, we must also use ITEM_NAME in the second join condition.
  4.  If we skip ITEM_NAME and join on ITEM_NO and PRICE, we lose the input sort order and the Integration Service fails the session.
9. What are the transformations that cannot be placed between the sort origin and the Joiner transformation so that we do not lose the input sort order?
Answer:
The best option is to place the Joiner transformation directly after the sort origin to maintain sorted data. However, do not place any of the following transformations between the sort origin and the Joiner transformation:
  1.  Custom
  2.  Unsorted Aggregator
  3.  Normalizer
  4.  Rank
  5.  Union transformation
  6.  XML Parser transformation
  7.  XML Generator transformation
  8.  Mapplet [if it contains any one of the above mentioned transformations]

10. What is the use of sorted input in joiner transformation?
Answer:
It is recommended to join sorted data whenever possible. We can improve session performance by configuring the Joiner transformation to use sorted input. When the Joiner transformation is configured to use sorted data, performance improves because disk input and output are minimized. The performance gain is greatest when we work with large data sets.
For an unsorted Joiner transformation, designate the source with fewer rows as the master source for optimal performance and disk storage. During a session, the Joiner transformation compares each row of the master source against the detail source. The fewer unique rows in the master, the fewer iterations of the join comparison occur, which speeds the join process.
11. Can we join two tables based on a join column having different data types?
For example table 1 EMPNO (string) and table 2 EMPNUM (number)
Answer:
Yes, this is possible. When using a Joiner, we can perform the explicit data type conversion in an Expression transformation before joining the tables.
12. Implementation Scenario 1 - A Joiner transformation is joining two tables, S1 and S2. S1 has 10,000 rows and S2 has 1,000 rows. Which table will you set as master for better performance of the Joiner transformation? Why?
Answer:
Set table S2 as the master table, because the Integration Service has to keep the master table in the cache: caching 1,000 rows gives better performance than caching 10,000 rows.

Real Time Data Warehousing Interview Questions with Answers

What is data warehouse?


A data warehouse is an electronic storage of an organization's historical data for the purpose of reporting, analysis and data mining or knowledge discovery.
Other than that a data warehouse can also be used for the purpose of data integration, master data management etc.
According to Bill Inmon, a data warehouse should be subject-oriented, non-volatile, integrated and time-variant.
Explanatory Note
Note here, non-volatile means that data, once loaded into the warehouse, will not get deleted later. Time-variant means the data is tracked with respect to time, so the warehouse holds data as it changes over time.
The above definition of the data warehousing is typically considered as "classical" definition. However, if you are interested, you may want to read the article - What is a data warehouse - A 101 guide to modern data warehousing - which opens up a broader definition of data warehousing.
What are the benefits of a data warehouse?
A data warehouse helps to integrate data (see Data integration) and store it historically, so that we can analyze different aspects of the business (performance analysis, trends, predictions etc.) over a given time frame and use the results of our analysis to improve the efficiency of business processes.
Why Data Warehouse is used?
For a long time in the past, and even today, data warehouses are built to facilitate reporting on the different key business processes of an organization, known as KPIs (Key Performance Indicators). Data warehouses also help to integrate data from different sources and provide single-point-of-truth values for the business measures.
A data warehouse can be further used for data mining, which helps with trend prediction, forecasting, pattern recognition etc. Check this article to know more about data mining.


What is the difference between OLTP and OLAP?
OLTP is the transaction system that collects business data, whereas OLAP is the reporting and analysis system on that data.
OLTP systems are optimized for INSERT, UPDATE operations and therefore highly normalized. On the other hand, OLAP systems are deliberately denormalized for fast data retrieval through SELECT operations.
Explanatory Note:
In a department store, when we pay the prices at the check-out counter, the salesperson at the counter keys in all the data into a "Point-Of-Sale" machine. That data is transaction data, and the related system is an OLTP system.

On the other hand, the manager of the store might want to view a report on out-of-stock materials, so that he can place purchase orders for them. Such a report will come out of the OLAP system.
What is data mart?
Data marts are generally designed for a single subject area. An organization may have data pertaining to different departments like Finance, HR, Marketing etc. stored in the data warehouse, and each department may have a separate data mart. These data marts can be built on top of the data warehouse.
What is ER model?
ER model, or entity-relationship model, is a particular methodology of data modeling wherein the goal of modeling is to normalize the data by reducing redundancy. This is different from dimensional modeling, where the main goal is to improve the data retrieval mechanism.
What is dimensional modeling?
A dimensional model consists of dimension and fact tables. Fact tables store different transactional measurements and the foreign keys from the dimension tables that qualify the data. The goal of the dimensional model is not to achieve a high degree of normalization but to facilitate easy and fast data retrieval.
Ralph Kimball is one of the strongest proponents of this very popular data modeling technique which is often used in many enterprise level data warehouses.
If you want to read a quick and simple guide on dimensional modeling, please check our Guide to dimensional modeling.

What is dimension?

A dimension is something that qualifies a quantity (measure).
For example, consider this: if I just say… "20kg", it does not mean anything. But if I say, "20kg of Rice (product) is sold to Ramesh (customer) on 5th April (date)", then that makes meaningful sense. The product, customer and date are dimensions that qualify the measure, 20kg.
Dimensions are mutually independent. Technically speaking, a dimension is a data element that categorizes each item in a data set into non-overlapping regions.

What is Fact?

A fact is something that is quantifiable (Or measurable). Facts are typically (but not always) numerical values that can be aggregated.

What are additive, semi-additive and non-additive measures?

Non-additive Measures
Non-additive measures are those which cannot be used inside any numeric aggregation function (e.g. SUM(), AVG() etc.). One example of a non-additive measure is any kind of ratio or percentage, e.g. a 5% profit margin or a revenue-to-asset ratio. Non-numerical data can also be a non-additive measure when it is stored in fact tables, e.g. varchar flags in the fact table.
Semi Additive Measures
Semi-additive measures are those where only a subset of aggregation functions can be applied. Take account balance: a SUM() over balances does not give a useful result, but a MAX() or MIN() balance might be useful. Or consider a price rate or currency rate: SUM is meaningless on a rate, but an average might be useful.
Additive Measures
Additive measures can be used with any aggregation function, like SUM(), AVG() etc. Sales quantity is one example.
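To make the distinction concrete, here is a small SQL sketch against a hypothetical daily balance fact ACCOUNT_BALANCE_FACT(account_id, balance_date, balance). Summing a balance across dates double-counts money (semi-additive), while MAX or MIN per account is still meaningful:

-- Meaningless for a semi-additive measure: SUM across the date dimension
SELECT account_id, SUM(balance) AS nonsense_total
FROM ACCOUNT_BALANCE_FACT GROUP BY account_id;
-- Meaningful: the peak balance per account
SELECT account_id, MAX(balance) AS peak_balance
FROM ACCOUNT_BALANCE_FACT GROUP BY account_id;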
At this point, I will request you to pause and make some time to read this article on "Classifying data for successful modeling". This article helps you understand the differences between dimensional data, factual data etc. from a fundamental perspective.

What is Star-schema?

This schema is used in data warehouse models where one centralized fact table references a number of dimension tables, so that the keys (primary keys) from all the dimension tables flow into the fact table (as foreign keys), where the measures are stored. The entity-relationship diagram looks like a star, hence the name.
Consider a fact table that stores sales quantity for each product and customer on a certain time. Sales quantity will be the measure here and keys from customer, product and time dimension tables will flow into the fact table.
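A minimal DDL sketch of the star just described (all table and column names are illustrative, not taken from any particular product):

-- Dimension tables, each with a surrogate primary key
CREATE TABLE DIM_CUSTOMER (customer_key INT PRIMARY KEY, customer_name VARCHAR(100));
CREATE TABLE DIM_PRODUCT  (product_key  INT PRIMARY KEY, product_name  VARCHAR(100));
CREATE TABLE DIM_DATE     (date_key     INT PRIMARY KEY, cal_date      DATE);
-- Central fact table: one foreign key per dimension plus the measure
CREATE TABLE FACT_SALES (
  customer_key INT REFERENCES DIM_CUSTOMER(customer_key),
  product_key  INT REFERENCES DIM_PRODUCT(product_key),
  date_key     INT REFERENCES DIM_DATE(date_key),
  sales_qty    NUMERIC
);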

What is snow-flake schema?

This is another logical arrangement of tables in dimensional modeling where a centralized fact table references a number of dimension tables; however, those dimension tables are further normalized into multiple related tables.
Consider a fact table that stores sales quantity for each product and customer at a certain time. Sales quantity will be the measure here, and keys from the customer, product and time dimension tables will flow into the fact table. Additionally, all the products can be further grouped under different product families stored in a separate table, so that the primary key of the product family table also goes into the product table as a foreign key. Such a construct is called a snow-flake schema, as the product table is further snow-flaked into product family.
Note
Snow-flaking increases the degree of normalization in the design.
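Continuing the hypothetical DDL sketch from the star-schema answer, snow-flaking the product dimension means normalizing the product family out into its own table:

CREATE TABLE DIM_PRODUCT_FAMILY (
  family_key  INT PRIMARY KEY,
  family_name VARCHAR(100)
);
-- The product table now references its family instead of repeating it
CREATE TABLE DIM_PRODUCT (
  product_key  INT PRIMARY KEY,
  product_name VARCHAR(100),
  family_key   INT REFERENCES DIM_PRODUCT_FAMILY(family_key)
);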

What are the different types of dimension?

In a data warehouse model, a dimension can be of the following types:
  1. Conformed Dimension
  2. Junk Dimension
  3. Degenerate Dimension
  4. Role Playing Dimension
Based on how frequently the data inside a dimension changes, we can further classify dimensions as:
  1. Unchanging or static dimension (UCD)
  2. Slowly changing dimension (SCD)
  3. Rapidly changing Dimension (RCD)
You may also read, Modeling for various slowly changing dimension and Implementing Rapidly changing dimension to know more about SCD, RCD dimensions etc.
What is a 'Conformed Dimension'?
A conformed dimension is a dimension that is shared across multiple subject areas. Consider the 'Customer' dimension: both the marketing and sales departments may use the same customer dimension table in their reports. Similarly, a 'Time' or 'Date' dimension will be shared by different subject areas. These dimensions are conformed dimensions.
Theoretically, two dimensions which are either identical or strict mathematical subsets of one another are said to be conformed.

What is a degenerate dimension?

A degenerate dimension is a dimension that is derived from the fact table and does not have its own dimension table.
A dimension key such as a transaction number, receipt number or invoice number does not have any further associated attributes, and hence cannot be designed as a dimension table.

What is junk dimension?

A junk dimension is a grouping of typically low-cardinality attributes (flags, indicators etc.) so that those can be removed from other tables and can be junked into an abstract dimension table.
These junk dimension attributes might not be related. The only purpose of this table is to store all the combinations of the dimensional attributes which you could not fit into the different dimension tables otherwise. Junk dimensions are often used to implement Rapidly Changing Dimensions in data warehouse.

What is a role-playing dimension?

Dimensions are often reused for multiple applications within the same database with different contextual meanings. For instance, a "Date" dimension can be used for "Date of Sale", as well as "Date of Delivery" or "Date of Hire". This is often referred to as a 'role-playing dimension'.
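In SQL terms, a role-playing dimension is usually realized by joining the same physical table more than once under different aliases. A sketch with hypothetical table and column names:

-- One physical DIM_DATE plays two roles via two aliases
SELECT f.order_id,
       sd.cal_date AS date_of_sale,
       dd.cal_date AS date_of_delivery
FROM FACT_ORDER f
JOIN DIM_DATE sd ON f.sale_date_key     = sd.date_key
JOIN DIM_DATE dd ON f.delivery_date_key = dd.date_key;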

What is SCD?

SCD stands for slowly changing dimension, i.e. the dimensions where data is slowly changing. These can be of many types, e.g. Type 0, Type 1, Type 2, Type 3 and Type 6, although Type 1, 2 and 3 are most common. Read this article to gather in-depth knowledge on various SCD tables.

What is rapidly changing dimension?

This is a dimension where data changes rapidly. Read this article to know how to implement RCD.
Describe the different types of slowly changing dimensions (SCD).
Type 0: A Type 0 dimension is one where dimensional changes are not considered. This does not mean that the attributes of the dimension do not change in the actual business situation; it just means that, even if the value of the attributes changes, no history is kept and the table retains the original data.
Type 1:
A type 1 dimension is one where history is not maintained and the table always shows the most recent data. This effectively means that such a dimension table is always updated with the recent data whenever there is a change, and because of this update we lose the previous values.
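A minimal sketch of a Type 1 change, assuming a hypothetical DIM_CUSTOMER(customer, cust_group) table (the same customer/group scenario as the Type 2 example below): the row is simply overwritten and the old value is lost.

-- Overwrite in place; no history survives
UPDATE DIM_CUSTOMER SET cust_group = 'G2' WHERE customer = 'C1';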
Type 2:
A type 2 dimension table tracks historical changes by creating separate rows in the table with different surrogate keys. Consider a customer C1 who is initially under group G1 and is later moved to group G2. There will then be two separate records in the dimension table, as below:
Key | Customer | Group | Start Date   | End Date
1   | C1       | G1    | 1st Jan 2000 | 31st Dec 2005
2   | C1       | G2    | 1st Jan 2006 | NULL
Note that separate surrogate keys are generated for the two records. The NULL end date in the second row denotes that this record is the current record. Also note that, instead of start and end dates, one could keep a version number column (1, 2 … etc.) to denote the different versions of the record.
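A hedged SQL sketch of the Type 2 maintenance step for the table above (column names are illustrative; "Group" is renamed cust_group because GROUP is a reserved word, and the new surrogate key is assumed to come from a sequence or similar):

-- Step 1: expire the current record for C1
UPDATE DIM_CUSTOMER
SET end_date = DATE '2005-12-31'
WHERE customer = 'C1' AND end_date IS NULL;
-- Step 2: insert the new version with a new surrogate key and open end date
INSERT INTO DIM_CUSTOMER (cust_key, customer, cust_group, start_date, end_date)
VALUES (2, 'C1', 'G2', DATE '2006-01-01', NULL);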
Type 3:
A type 3 dimension stores the history in a separate column instead of separate rows. So unlike a type 2 dimension, which grows vertically, a type 3 dimension grows horizontally. See the example below:
Key | Customer | Previous Group | Current Group
1   | C1       | G1             | G2
This is only good when you do not need to store many consecutive changes and when the date of change is not required to be stored.
Type 6:
A type 6 dimension is a hybrid of types 1, 2 and 3 (1+2+3). It acts very much like type 2, except that you add one extra column to denote which record is the current record.


Key | Customer | Group | Start Date   | End Date      | Current Flag
1   | C1       | G1    | 1st Jan 2000 | 31st Dec 2005 | N
2   | C1       | G2    | 1st Jan 2006 | NULL          | Y
What is a mini dimension?
Mini dimensions can be used to handle rapidly changing dimension scenarios. If a dimension has a huge number of rapidly changing attributes, it is better to separate those attributes into a different table called a mini dimension. This is done because if the main dimension table is designed as SCD type 2, it will soon grow too large and create performance issues. It is better to segregate the rapidly changing attributes into a different table, thereby keeping the main dimension table small and fast.
What is a fact-less-fact?
A fact table that does not contain any measure is called a fact-less fact. This table will only contain keys from different dimension tables. This is often used to resolve a many-to-many cardinality issue.
Explanatory Note:
Consider a school, where a single student may be taught by many teachers and a single teacher may have many students. To model this situation in dimensional model, one might introduce a fact-less-fact table joining teacher and student keys. Such a fact table will then be able to answer queries like,
  1. Who are the students taught by a specific teacher?
  2. Which teacher teaches the maximum number of students?
  3. Which student has the highest number of teachers?
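As a sketch, assuming a hypothetical fact-less fact FACT_CLASS(teacher_key, student_key), the second question above becomes a simple count over the keys:

-- Which teacher teaches the maximum number of students?
SELECT teacher_key, COUNT(DISTINCT student_key) AS num_students
FROM FACT_CLASS
GROUP BY teacher_key
ORDER BY num_students DESC;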
What is a coverage fact?
A fact-less-fact table can only answer 'optimistic' (positive) queries but cannot answer negative queries. Again consider the illustration in the above example. A fact-less fact containing the keys of tutors and students cannot answer queries like the ones below:
  1. Which teacher did not teach any student?
  2. Which student was not taught by any teacher?
Why not? Because the fact-less fact table only stores the positive scenarios (like a student being taught by a tutor); if there is a student who is not being taught by any teacher, then that student's key does not appear in this table, thereby reducing the coverage of the table.
A coverage fact table attempts to answer this, often by adding an extra flag column. Flag = 0 indicates a negative condition and flag = 1 indicates a positive condition. To understand this better, let's consider a class where there are 100 students and 5 teachers. The coverage fact table will ideally store 100 x 5 = 500 records (all combinations), and if a certain teacher is not teaching a certain student, the corresponding flag for that record will be 0.
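With such a coverage table (hypothetically FACT_COVERAGE(teacher_key, student_key, teach_flag) holding all 500 combinations), the negative query becomes straightforward:

-- Teachers who did not teach any student: every combination has flag = 0
SELECT teacher_key
FROM FACT_COVERAGE
GROUP BY teacher_key
HAVING MAX(teach_flag) = 0;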
What are incident and snapshot facts?
A fact table stores some kind of measurements. Usually these measurements are stored (or captured) against a specific time, and they vary with respect to time. Now it might so happen that the business is not able to capture all of its measures for every point in time. Those unavailable measurements can either be kept empty (NULL) or be filled with the last available measurement. The first case is an example of an incident fact and the second of a snapshot fact.
What is aggregation and what is the benefit of aggregation?
A data warehouse usually captures data with the same degree of detail as available in the source. This "degree of detail" is termed granularity. But not all reporting requirements from that data warehouse need the same degree of detail.
To understand this, let's consider an example from the retail business. A certain retail chain has 500 shops across Europe. All the shops record detail-level transactions for the products they sell, and those data are captured in a data warehouse.
Each shop manager can access the data warehouse and they can see which products are sold by whom and in what quantity on any given date. Thus the data warehouse helps the shop managers with the detail level data that can be used for inventory management, trend prediction etc.
Now think about the CEO of that retail chain. He does not really care which sales girl in London sold the highest number of chopsticks or which shop is the best seller of 'brown breads'. All he is interested in is, perhaps, checking the percentage increase of his revenue margin across Europe, or maybe the year-to-year sales growth in eastern Europe. Such data is aggregated in nature, because the sales of goods in eastern Europe is derived by summing up the individual sales data from each shop in eastern Europe. Therefore, to support different levels of data warehouse users, data aggregation is needed.
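In SQL, the CEO-level view is just the shop-level data rolled up with GROUP BY. A sketch over a hypothetical detail-grain fact table:

-- Detail grain: one row per shop, product and day
-- Aggregated grain: yearly revenue per region
SELECT region, sales_year, SUM(revenue) AS total_revenue
FROM FACT_SALES_DETAIL
GROUP BY region, sales_year;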
What is slicing-dicing?
Slicing means showing a slice of the data, given a certain dimension (e.g. Product), a value (e.g. Brown Bread) and measures (e.g. sales).
Dicing means viewing the slice with respect to different dimensions and at different levels of aggregation.
Slicing and dicing operations are part of pivoting.
What is drill-through?
Drill through is the process of going to the detail level data from summary data.
Consider the above example on retail shops. If the CEO finds out that sales in eastern Europe have declined this year compared to last year, he might want to know the root cause of the decrease. For this, he may start drilling through his report to a more detailed level and eventually find out that even though individual shop sales have actually increased, the overall sales figure has decreased because a certain shop in Turkey has stopped operating. The detail-level data, which the CEO was not much interested in earlier, has this time helped him pinpoint the root cause of the declined sales. And the method he followed to obtain the details from the aggregated data is called drill-through.


Real Time interview questions on Aggregator Transformation

Aggregator Transformation
1. What is an Aggregator Transformation?
Answer:
An aggregator is an Active, Connected transformation which performs aggregate calculations like AVG, COUNT, FIRST, LAST, MAX, MEDIAN, MIN, PERCENTILE, STDDEV, SUM and VARIANCE.

2. How does an Expression Transformation differ from an Aggregator Transformation?
Answer:
An Expression Transformation performs calculations on a row-by-row basis, whereas an Aggregator Transformation performs calculations on groups.

3. Does an Aggregator Transformation support only aggregate expressions?
Answer:
Apart from aggregate expressions, the Aggregator transformation supports non-aggregate expressions and conditional clauses.

4. Give one example for each of Conditional Aggregation, Non-Aggregate expression and Nested Aggregation.
Answer:

  1.  Conditional Aggregation: use conditional clauses in the aggregate expression to reduce the number of rows used in the aggregation. The conditional clause can be any clause that evaluates to TRUE or FALSE. Example:
      SUM (SALARY, JOB = ‘CLERK’)
  2.  Non-Aggregate expression: use non-aggregate expressions in group by ports to modify or replace groups. Example:
      IIF (PRODUCT = ‘Brown Bread’, ‘Bread’, PRODUCT)
  3.  Nested Aggregation: a nested aggregation expression can include one aggregate function within another aggregate function. Example:
      MAX (COUNT (PRODUCT))

5. How does Aggregator Transformation handle NULL values?
Answer:
By default, the Aggregator transformation treats null values as NULL in aggregate functions. However, we can configure the Integration Service to treat null values in aggregate functions as NULL or as zero.
6. What are the performance considerations when working with an Aggregator Transformation?
Answer:

  1.  Filter out unnecessary data before aggregating it. Place a Filter transformation in the mapping before the Aggregator transformation to reduce unnecessary aggregation.
  2.  Improve performance by connecting only the necessary input/output ports to subsequent transformations, thereby reducing the size of the data cache.
  3.  Use sorted input, which reduces the amount of data cached and improves session performance.
  4.  Aggregator performance improves dramatically if records are sorted before being passed to the aggregator and the "Sorted Input" option under the aggregator properties is checked. The record set should be sorted on the columns that are used in the Group By operation.
  5.  It is often a good idea to sort the record set at the database level (click here to see why), e.g. inside a Source Qualifier transformation, unless there is a chance that the already sorted records from the source qualifier can become unsorted again before reaching the aggregator.

7. What are the uses of index and data cache?
Answer:
The group-by data is stored in the index cache files, whereas the row data is stored in the data cache files.

8. What differs when we choose Sorted Input for Aggregator Transformation?
Answer:
The Integration Service creates the index and data cache files in memory to process the Aggregator transformation. If the Integration Service requires more space than allocated for the index and data cache sizes in the transformation properties, it stores overflow values in cache files, i.e. it pages to disk.
One way to increase session performance is to increase the index and data cache sizes in the transformation properties.
But when we check Sorted Input, the Integration Service uses memory to process the Aggregator transformation; it does not use cache files.

9. Under what conditions will selecting Sorted Input in the aggregator still not boost session performance?
Answer:

  1.  The Incremental Aggregation session option is enabled.
  2.  The aggregate expression contains nested aggregate functions.
  3.  The session property Treat Source Rows As is set to Data Driven.

10. Under what conditions may selecting Sorted Input in the aggregator fail the session?
Answer:

  1.  If the input data is not sorted correctly, the session will fail.
  2.  Even if the input data is properly sorted, the session may fail if the sort-order ports and the group by ports of the aggregator are not in the same order.

11. Suppose we do not group by on any ports of the aggregator. What will be the output?
Answer:
If we do not use any input port either in group by or in an aggregate expression, the Integration Service returns only the last row's value for each column.
For example, if 100 rows come from the source, the aggregator outputs only the last record (the 100th record).

12. What is the expected value if a column in an aggregator transformation is neither a group by port nor an aggregate expression?
Answer:
The Integration Service produces one row for each group based on the group by ports. A column which is neither part of the group by key nor part of an aggregate expression returns the value from the last record of the group received.
However, if we specifically use the FIRST function, the Integration Service returns the value from the first row of the group. So the default is the LAST function.

13. What is Incremental Aggregation?
Answer:
We can enable the session option Incremental Aggregation for a session that includes an Aggregator Transformation. When the Integration Service performs incremental aggregation, it passes changed source data through the mapping and uses the historical cache data to perform the aggregate calculations incrementally.

14. Sorted input for an aggregator transformation will improve mapping performance. However, if sorted input is used with a nested aggregate expression or incremental aggregation, the mapping may result in session failure. Explain why?
Answer:
In case of nested aggregation, there are multiple levels of sorting involved, as each aggregation function requires one sorting pass. After the first level of aggregation, the sort order of the group by column may get jumbled, so before the second level of aggregation Informatica must internally sort the data again. However, if we have already indicated that the input is sorted, Informatica skips this internal sorting, resulting in failure. In incremental aggregation, the aggregate calculations are stored in a historical cache on the server, and the data in this historical cache may not be in sorted order. If we give sorted input, the records come presorted for that particular run, but the data in the historical cache may still not be in sorted order.

15. How can we delete duplicate records using the Informatica Aggregator?
Answer:
One way to handle duplicate records in a source batch run is to use an Aggregator Transformation, checking the Group By checkbox on the ports carrying the duplicate data. Here we have the flexibility to select either the last or the first of the duplicate records.
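The SQL analogue of this Aggregator-based de-duplication is a GROUP BY over the duplicated key, sketched below against a hypothetical staging table (MAX() here stands in loosely for the Aggregator's LAST behavior; it picks one value per group rather than literally the last-read row):

-- Keep one row per emp_id, collapsing the duplicates
SELECT emp_id, MAX(name) AS name, MAX(sal) AS sal
FROM STG_EMPLOYEE
GROUP BY emp_id;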

Top 20 SQL Interview Questions with Answers


The contents of these tables are not the same as the Oracle emp and dept tables!
What is the difference between inner and outer join? Explain with example.
Inner Join
Inner join is the most common type of Join which is used to combine the rows from two tables and create a result set containing only such records that are present in both the tables based on the joining condition (predicate).
An inner join returns rows when there is at least one match in both tables.
If none of the records match between the two tables, the INNER JOIN returns an empty set. Below is an example of an INNER JOIN and the resulting set.
SELECT dept.name DEPARTMENT, emp.name EMPLOYEE
FROM DEPT dept, EMPLOYEE emp WHERE emp.dept_id = dept.id
DEPARTMENT    EMPLOYEE
HR            Inno
HR            Privy
Engineering   Robo
Engineering   Hash
Engineering   Anno
Engineering   Darl
Marketing     Pete
Marketing     Meme
Sales         Tomiti
Sales         Bhuti

Outer Join
An outer join can be a full outer or a single outer join.
Outer Join, on the other hand, will return matching rows from both tables as well as any unmatched rows from one or both the tables (based on whether it is single outer or full outer join respectively).
Notice in our record set that there is no employee in the department 5 (Logistics). Because of this if we perform inner join, then Department 5 does not appear in the above result. However in the below query we perform an outer join (dept left outer join emp), and we can see this department.
SELECT dept.name DEPARTMENT, emp.name EMPLOYEE
FROM DEPT dept, EMPLOYEE emp
WHERE dept.id = emp.dept_id (+)
The (+) sign on the emp side of the predicate indicates that emp is the outer table here. The above SQL can be alternatively written as below (will yield the same result as above):
SELECT dept.name DEPARTMENT, emp.name EMPLOYEE
FROM DEPT dept LEFT OUTER JOIN EMPLOYEE emp
ON dept.id = emp.dept_id
What is the difference between JOIN and UNION?
SQL JOIN allows us to “lookup” records in another table based on the given conditions between two tables. For example, if we have the department ID of each employee, then we can use this department ID of the employee table to join with the department ID of the department table to look up department names.
The UNION operation allows us to add 2 similar data sets to create a resulting data set that contains all the data from the source data sets. Union does not require any condition for joining. For example, if you have 2 employee tables with the same structure, you can UNION them to create one result set that will contain all the employees from both of the tables.
SELECT * FROM EMP1 UNION SELECT * FROM EMP2;
What is the difference between UNION and UNION ALL?
UNION and UNION ALL both combine two structurally similar data sets, but the UNION operation returns only the unique records from the resulting data set, whereas UNION ALL returns all the rows, even if one or more rows are duplicates of each other.
In the following example, I am choosing exactly the same employee from the emp table and performing UNION and UNION ALL. Check the difference in the result.
SELECT * FROM EMPLOYEE WHERE ID = 5 UNION ALL SELECT * FROM EMPLOYEE WHERE ID = 5
ID MGR_ID DEPT_ID NAME SAL DOJ
5.0 2.0 2.0 Anno 80.0 01-Feb-2012
5.0 2.0 2.0 Anno 80.0 01-Feb-2012
SELECT * FROM EMPLOYEE WHERE ID = 5
UNION
SELECT * FROM EMPLOYEE WHERE ID = 5
ID MGR_ID DEPT_ID NAME SAL DOJ
5.0 2.0 2.0 Anno 80.0 01-Feb-2012
What is the difference between WHERE clause and HAVING clause?
WHERE and HAVING both filter out records based on one or more conditions. The difference is that the WHERE clause can only be applied to static, non-aggregated columns, whereas we need to use HAVING for aggregated columns.
To understand this, consider this example.
Suppose we want to see only those departments where department ID is greater than 3. There is no aggregation operation and the condition needs to be applied on a static field. We will use WHERE clause here:
SELECT * FROM DEPT WHERE ID > 3
ID NAME
4 Sales
5 Logistics
Next, suppose we want to see only those departments where the average salary is greater than 80. Here the condition is associated with non-static, aggregated information, namely the "average of salary". We need to use the HAVING clause here:
SELECT dept.name DEPARTMENT, avg(emp.sal) AVG_SAL FROM DEPT dept, EMPLOYEE emp
WHERE dept.id = emp.dept_id (+) GROUP BY dept.name HAVING AVG(emp.sal) > 80
DEPARTMENT AVG_SAL
Engineering 90
As you see above, there is only one department (Engineering) where average salary of employees is greater than 80.
What is the difference among UNION, MINUS and INTERSECT?
UNION combines the results from 2 tables and eliminates duplicate records from the result set.
The MINUS operator, when used between 2 tables, gives us all the rows from the first table except the rows which are present in the second table.
INTERSECT operator returns us only the matching or common rows between 2 result sets.
To understand these operators, let’s see some examples. We will use two different queries to extract data from our emp table and then we will perform UNION, MINUS and INTERSECT operations on these two sets of data.
UNION
SELECT * FROM EMPLOYEE WHERE ID = 5 UNION SELECT * FROM EMPLOYEE WHERE ID = 6
ID MGR_ID DEPT_ID NAME SAL DOJ
5 2 2.0 Anno 80.0 01-Feb-2012
6 2 2.0 Darl 80.0 11-Feb-2012
MINUS
SELECT * FROM EMPLOYEE MINUS SELECT * FROM EMPLOYEE WHERE ID > 2
ID MGR_ID DEPT_ID NAME SAL DOJ
1 (null) 2 Hash 100.0 01-Jan-2012
2 1 2 Robo 100.0 01-Jan-2012
INTERSECT
SELECT * FROM EMPLOYEE WHERE ID IN (2, 3, 5) INTERSECT SELECT * FROM EMPLOYEE WHERE ID IN (1, 2, 4, 5)
ID MGR_ID DEPT_ID NAME SAL DOJ
5 2 2 Anno 80.0 01-Feb-2012
2 1 2 Robo 100.0 01-Jan-2012


What is Self Join and why is it required?
Self Join is the act of joining one table with itself.
A self join is often very useful to convert a hierarchical structure into a flat structure.
In our employee table example above, we have kept the manager ID of each employee in the same row as that of the employee. This is an example of how a hierarchy (in this case the employee-manager hierarchy) is stored in an RDBMS table. Now, suppose we need to print the name of each employee's manager right beside the employee; we can use a self join. See the example below:
SELECT e.name EMPLOYEE, m.name MANAGER
FROM EMPLOYEE e, EMPLOYEE m
WHERE e.mgr_id = m.id (+)
EMPLOYEE MANAGER
Pete Hash
Darl Hash
Inno Hash
Robo Hash
Tomiti Robo
Anno Robo
Privy Robo
Meme Pete
The only reason we have performed a left outer join here (instead of an INNER JOIN) is that we have one employee in this table without a manager (employee ID = 1). If we performed an inner join, this employee would not show up.
How can we transpose a table using SQL (changing rows to columns or vice-versa)?
The usual way to do it in SQL is to use CASE statement or DECODE statement.
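For example, rows of (dept_id, sal) can be turned into one column per department using conditional aggregation. A sketch against the EMPLOYEE table used elsewhere in this article (the department IDs are illustrative):

-- Rows become columns: one output column per department
SELECT SUM(CASE WHEN dept_id = 1 THEN sal END) AS dept1_sal,
       SUM(CASE WHEN dept_id = 2 THEN sal END) AS dept2_sal,
       SUM(CASE WHEN dept_id = 3 THEN sal END) AS dept3_sal
FROM EMPLOYEE;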
How to generate row numbers in SQL without ROWNUM?
Generating a row number – that is, a running sequence of numbers for each row – is not easy using plain SQL. In fact, the method I am going to show below is not very generic either. This method only works if there is at least one unique column in the table. It will also work if there is no single unique column, but there is a collection of columns that is unique. Anyway, here is the query:
SELECT name, sal, (SELECT COUNT(*) FROM EMPLOYEE i WHERE o.name >= i.name) row_num FROM EMPLOYEE o order by row_num


NAME SAL ROW_NUM
Anno 80 1
Bhuti 60 2
Darl 80 3
Hash 100 4
Inno 50 5
Meme 60 6
Pete 70 7
Privy 50 8
Robo 100 9
Tomiti 70 10
The column that is used in the row number generation logic is called “sort key”. Here sort key is “name” column. For this technique to work, the sort key needs to be unique. We have chosen the column “name” because this column happened to be unique in our Employee table. If it was not unique but some other collection of columns was, then we could have used those columns as our sort key (by concatenating those columns to form a single sort key).
Also notice how the rows are sorted in the result set. We have done an explicit sort on the row_num column, which gives us all the row numbers in sorted order. But notice that the name column is also sorted (which is probably why this column is referred to as the sort key). If you want to change the sort order from ascending to descending, you will need to change the “>=” sign to “<=” in the query.
As I said before, this method is not very generic. This is why many databases already implement other methods to achieve this. For example, in Oracle database, every SQL result set contains a hidden column called ROWNUM. We can just explicitly select ROWNUM to get sequence numbers.
How to select first 5 records from a table?
This question, often asked in many interviews, does not make any sense to me. The problem here is how do you define which record is first and which is second. Which record is retrieved first from the database is not deterministic. It depends on many uncontrollable factors such as how database works at that moment of execution etc. So the question should really be – “how to select any 5 records from the table?” But whatever it is, here is the solution:
In Oracle,
SELECT * FROM EMP WHERE ROWNUM <= 5;
In SQL Server,
SELECT TOP 5 * FROM EMP;
Generic solution,
I believe a generic solution can be devised for this problem if and only if there exists at least one distinct column in the table. For example, in our EMP table ID is distinct. We can use that distinct column in the below way to come up with a generic solution of this question that does not require database specific functions such as ROWNUM, TOP etc.
SELECT name FROM EMPLOYEE o WHERE (SELECT count(*) FROM EMPLOYEE i WHERE i.name < o.name) < 5
name
Inno
Anno
Darl
Meme
Bhuti
I have taken the “name” column in the above example since “name” happens to be unique in this table. I could very well have taken the ID column as well.
In this example, if the chosen column was not distinct, we would have got more than 5 records returned in our output.
Do you have a better solution to this problem? If yes, post your solution in the comment.
What is the difference between ROWNUM pseudo column and ROW_NUMBER() function?
ROWNUM is a pseudo column that Oracle assigns to the rows of a result set before the ORDER BY clause is evaluated. So ORDER BY ROWNUM does not work.
ROW_NUMBER() is an analytical function which is used in conjunction with the OVER() clause, wherein we can specify ORDER BY and also PARTITION BY columns.
Suppose you want to generate row numbers in the order of descending employee salaries, for example. ROWNUM will not work, but you may use ROW_NUMBER() OVER() as shown below:
SELECT name, sal, row_number() over(order by sal desc) rownum_by_sal
FROM EMPLOYEE o
name Sal ROWNUM_BY_SAL
Hash 100 1
Robo 100 2
Anno 80 3
Darl 80 4
Tomiti 70 5
Pete 70 6
Bhuti 60 7
Meme 60 8
Inno 50 9
Privy 50 10
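Since OVER() also accepts PARTITION BY, the numbering can be made to restart per group, which plain ROWNUM cannot do. A sketch ranking employees by salary within each department (using the DEPT_ID column shown in the earlier result sets):

SELECT name, dept_id, sal,
       ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY sal DESC) AS rn_in_dept
FROM EMPLOYEE;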
What are the differences among ROW_NUMBER, RANK and DENSE_RANK?
ROW_NUMBER assigns contiguous, unique numbers from 1..N to a result set.
RANK does not assign unique numbers—nor does it assign contiguous numbers. If two records tie for second place, no record will be assigned the 3rd rank as no one came in third, according to RANK. See below:
SELECT name, sal, rank() over(order by sal desc) rank_by_sal FROM EMPLOYEE o
name Sal RANK_BY_SAL
Hash 100 1
Robo 100 1
Anno 80 3
Darl 80 3
Tomiti 70 5
Pete 70 5
Bhuti 60 7
Meme 60 7
Inno 50 9
Privy 50 9
DENSE_RANK, like RANK, does not assign unique numbers, but it does assign contiguous numbers. Even though two records tied for second place, there is a third-place record. See below:
SELECT name, sal, dense_rank() over(order by sal desc) dense_rank_by_sal FROM EMPLOYEE o
name Sal DENSE_RANK_BY_SAL
Hash 100 1
Robo 100 1
Anno 80 2
Darl 80 2
Tomiti 70 3
Pete 70 3
Bhuti 60 4
Meme 60 4
Inno 50 5