Home » Shortest Common Supersequence

Shortest Common Supersequence

by Online Tutorials Library

Shortest Common Supersequence

The shortest common supersequence is the variation of the longest common subsequence.

Problem Statement: Given two strings s1 and s2, return the shortest string that has both s1 and s2 as subsequences. If multiple answers of a problem exist, then we can return any of them.

(A String S is a subsequence of string T if deleting some characters from the string T results in the string S).

Let’s understand the above statement through an example.

Input: str1 = “abac”

str2 = “cab”

Output:

The shortest common supersequence would be “cabac”.

Explanation

The string str1 = “abac” is a subsequence of cabac as we can delete the letter ‘c’ from the string “cabac” and it becomes “abac”.

The string str2 = “cab” is a subsequence of “cabac” as we can delete the letters ‘a’ and ‘c’ from the string “cabac” and it becomes “cab”.

Consider the problem that has two strings:

S1: “abac”

S2: “cab”

The goal of the problem is to find the shortest common supersequence (SCS). Here, supersequence is also a string that maintains the sequence of S1 and S2. Let’s consider the supersequence as S3 such that we consider the string S1 in S3 then it becomes subsequence of S3. Similarly, we consider the string S2 in S3 then it also becomes a subsequence of S3.

One simple type of supersequence is that we add both the strings S1 and S2 to concatenate both the strings.

After concatenation, the string would look like:

S3: abac cab

We can observe in the above string that S1 and S2 strings are added side by side. Therefore, we can say that S3 is a supersequence. We can form a greater number of supersequence. To form any subsequence, we can add the characters in between the subsequence, but the order should be same. Another form of super sequence can be generated as:

S3: abcacab

In the above super-sequence, first we add the “ab” of string S1 in S3 then we add ‘c’ in S3, we add “ac” of string S1 and then “ab” of string S2 in S3.

Third form of super-sequence is given below:

S3: abcab

The above string is also a super-sequence. When we compare S1 with S3 then we find that string S1 is available in S3 as a subsequence. Similarly, when we compare S2 with S3 then we find that the string S2 is available as a subsequence in S3. Since the character ‘c’ is common in both the string; instead of writing the ‘c’ letter twice, we write the letter ‘c’ only once. This leads to a reduction in the length of the super-sequence. The length of the super-sequence is 5.

4th form of super-sequence is given below:

S3: cabac

The above string is also a super-sequence. When we compare S1 and S2 strings with S3 then we find that both the strings are contained in the string S3 as a subsequence. Since both the strings are contained in the string S3 as subsequence; therefore, we can say that S3 is a super-sequence of both the strings. The length of the super-sequence is 5, and it is the shortest super-sequence. Therefore, the shortest common super-sequence is “cabac”.

How to find the length of the super-sequence?

In worst case, the length of the super-sequence= (L1 + L2) = 7. We have to follow the following rules to reduce the length of the super-sequence:

  • We can match some overlapping characters and form a super-sequence of smaller length.
  • To form smallest super-sequences, we need to take longest overlapping characters.

If we consider only one common character, i.e., ‘c’ then we will add only one ‘c’ in the supersequence given as below:

S3: abacab

If we consider the longest common subsequence between the strings S1 and S2, i.e., ‘ab’ then we will add only one ‘ab’ in the supersequence rather than writing twice:

S3: cabac

Therefore,

Length of the shortest common supersequence = (L1 + L2) – LCS

In the above case,

L1 = 4

L2 = 3

LCS = 2 (‘ab’ is occurring twice)

Length of the shortest common supersequence = (4 + 3) – 2

= 7-2 = 5

Example 2: Consider two strings which are given below:

S1: AGGTAB

S2: GXTXAXB

When we compare both strings S1 and S2 then we find that the longest common subsequence is equal to GTAB.

Length of S1 = 6

Length of S2 = 7

Therefore, length of the shortest common super-sequence = (6 + 7) – 4 = 9

How to find the SCS string?

Consider two strings given below:

S1: A G G T A B

S2: G X T X A Y B

First, we will find the LCS string, and it comes out to be:

LCS: G T A B

Once the LCS string has been calculated, then we will find out the SCS string as we have already discussed, the LCS string characters should appear only once in the SCS string.

The following are the points that we need to remember while creating the LCS string:

  • Add LCS chars only once.
  • Assume 1st common char belongs to LCS char.
  • Add non-LCS chars in order of either S1 then S2 or vice versa.

We will traverse the string from left to right. Any character that appears common in both the string then we will include that character at the beginning of the LCS. Now we will dry run to understand it more clearly.

Consider the pointer P1 pointing to the string S1, the pointer P2 pointing to the string S2 and the pointer P3 pointing to the LCS string.

Initially, P1 pointer points to the character ‘A’ of S1 and P3 points to the character ‘G’. Since both the characters are different so we will include the character ‘A’ in the SCS string as shown as below:

Shortest Common Supersequence

SCS: “A”

Once the character ‘A’ is included in the SCS string then we will increment the pointer P1 and pointing to the character ‘G’. Now compare the character pointed by the P2 in S2 with the character pointed by P3. Since both the characters, i.e., ‘G’ is same so we will not include ‘G’. Since character ‘G’ is common in S1, S2 and LCS so we will include character ‘G’ in SCS string shown as below:

Shortest Common Supersequence

SCS: “AG”

Once the character ‘G’ is included in SCS string, we will increment the pointers P1, P2, and P3. Now P1 points to ‘G’, P2 points to X and P3 points to T.

Compare P1 and P3; Since ‘G’ and ‘T’ are different so we include ‘G’ in SCS string and increment the P1 pointer shown as below:

Shortest Common Supersequence

SCS: “AGG”

Compare P1 and P3; Since both the characters is same, i.e., T so we will stop here. Compare P2 and P3; Since the characters ‘X’ and ‘T’ are different so include ‘X’ in SCS string and increment the pointer P2 shown as below:

Shortest Common Supersequence

SCS: “AGGX”

Compare P2 and P3 again; Since the character ‘T’ is common in S1, S2 and LCS so include ‘T’ in SCS only once shown as below:

Shortest Common Supersequence

SCS: “AGGXT”

Once the character ‘T’ is included in SCS string, we will increment the pointers P1, P2, and P3. Now P1 points to ‘A’, P2 points to X and P3 points to A.

Compare P1 and P3; Since the characters is same, i.e.., A so we stop here. Compare P2 and P3; characters ‘X’ and ‘A’ are different so include X in SCS string and increment the pointer P2 shown as below:

Shortest Common Supersequence

SCS: “AGGXTX”

Now the characters pointed by P1, P2, and P3 are same, i.e., A, so we will include ‘A’ character in SCS string shown as below:

Shortest Common Supersequence

SCS: “AGGXTXA”

Once the character ‘A’ is included in SCS string, we will increment the pointers P1, P2, and P3. Now P1 points to ‘B’, P2 points to Y and P3 points to B. Compare P1 and P3; since the characters is same, so we will stop here.

Compare the pointers P2 and P3; since the characters are different so we will add ‘Y’ in the SCS string and increment the pointer shown as below:

Shortest Common Supersequence

SCS: “AGGXTXAY”

Now the characters pointed by P1, P2, and P3 are same, i.e., B, so we will include ‘B’ character in SCS string shown as below:

Shortest Common Supersequence

SCS: “AGGXTXAYB”

Implementation in C

You may also like