Sunday, December 11, 2011

Recurrences in Computational Complexity

Manish Sharma asked me an interesting question over chat regarding recurrences and computational complexity estimation. The question was interesting enough that I thought that the answer would be of use to many more people in general so I'm posting it to my blog. In any case, a post to my blog was long overdue. So, here it is:

The original question was:
How do you find the computational complexity of the recurrence:
T(n) = a.T(n . r) + b.T(n (1 - r)) + O(n)
or, when pretty printed:

Well, technically speaking, this question is phrased slightly ambiguously, so I'm phrasing it as below:
T(n) = a T(n r) + b T(n (1-r)) + cn + d
a > 0, b > 0, c > 0, d belonging to R

Pretty print:
and while I'm at it, I'll point out to you that without loss of generality, I can impose the additional restriction that r is in the range of 0.5 to 1.0.
Pretty print:

This is possible because I can simply take a recurrence equation where this isn't true and transform it to a recurrence where this is true by swapping the values of a and b and replacing the value of r with (1-r).

Given that the restrictions that I've imposed onto the original recurrence are valid, here are a few observations:
1. T(n) is atleast linear. This is because it already contains a cn term in the recurrence.
2. r > (1-r) in all cases given the valid range of r. A corollary of this fact is that n r > n (1- r) is always valid.
3. Since T(n) is atleast linear (and positive for n > 1 with a positive first derivative - which are the standard assumptions in such a problem), T(n r) > T(n (1 - r))
Now comes the tough part:

Case 1: T(n) is linear.

Assume T(n) = A n + B with A > 0
Therefore,
T(n) = a T(nr) + b T(n(1-r)) + cn + d = a { A nr + B } + b { A n(1-r) + B} + cn = {Aar + Ab(1-r) + c} n + {aB + bB + d}

Equating parts of the equation:
Aar + Ab(1-r) + c = A
aB + bB + d = B

Simplifying,
A = c/(1 - ar - b(1-r))
B = d/(1 - (a+b))

To satisfy our original assumption, A > 0 - we require that 1 - ar - b(1-r) > 0. That means, if ar + b(1-r) < 1 then T(n) is linear. This, of course, need not be true for all valid values of a and b, so for those values, our linearity assumption is violated.

Case 2: T(n) is superlinear.

If T(n) is a superlinear increasing function of n, then for sufficiently large values of n,
there are 3 subcases:
1. a T(nr) > b T(n(1-r)) for n > K1
2. a T(nr) = b T(n(1-r)) for all n > K2
3. a T(nr) < b T(n(1-r))  for n > K3
Subcase 1: a T(nr) > b T(n(1-r)) when n -> infinity then

T(n) < 2a T(nr) + cn + d

If 2a < 1 then the solution is O(n) - valid?
If 2a = 1 then the solution is O(n log n)
else the solution is O(n ^ log_(1/r)(2a))

Subcase 2: aT(nr) = b T(n(1-r)) for all n then

T(n) = 2a T(nr) + cn + d

If 2a < 1 then the solution is again O(n) (tight bound) - valid?
If 2a = 1 then the solution is again O(n log n) (tight bound)
else the solution is O(n^log_(1/r)(2a)) (tight bound)

Subcase 3: a T(nr) < b T(n(1-r)) then

the solutions are similar to subcase 1 except that r is replaced by 1-r and a is replaced by b.

Special case: When a = (1-r) and b = r.

The complexity should be O(n) under limit.

Note: Case 2 seems sufficient to solve the problem but I've kept Case 1 for clarity. I think there might be bugs in the analysis too. If you notice any, please comment!

Thursday, August 04, 2011

The Case Against GM Foods: Putting our food security at the hands of an American Corporation

I've always looked at Genetically Modified foods with a wary eye. Also, the fact that the US GM seed market is dominated by a single company - Monsanto - makes me even more suspicious (the obligatory Wiki Link - http://en.wikipedia.org/wiki/Monsanto). It is well known that Monsanto spends tons of money lobbying the US Food and Drug Administration (US-FDA) to ensure legal status for GM foods and to guide American legislation such that GM foods remain legal and no adverse regulatory laws are passed against GM foods.

Nowadays, every supermarket I've been to in the US keeps GM foods next to non-GM foods. There are no clear label markings mandated by law that would allow a consumer to distinguish between GM and non-GM foods at a glance. Consumers are forced to read fine print on labels to determine if GM products have been used in the food chain of the product they're buying. Then too, things are hampered by the fact that there is no clear way to determine if a long chemical-sounding-name is a natural, nature-identical or artificially created ingredient. Things have gotten so bad that while I was in the US, I heard of a case where a company using the term - "no GM foods used for feeding chicken" for marketing purposes was sued and asked to include a disclaimer that GM foods have not been proven to cause any harm whatsoever and that GM foods are safe for use. Extreme measures like these make me much more distrustful of companies promoting GM foods.

By the way, here's a chart from Wikipedia that show's Monsanto's astronomical profits.

These profits are generated year after year and are only growing as acres and acres of farmland come under GM crop cultivation. GM crops typically are modified so that they are not self-replicating due to concerns about wild, unchecked replication. Though introduced for the right reasons, this constraint has set up a monopolistic market, one that cannot be shut down very easily. The fault lies in the fact that since the crops cannot be propagated season after season from the seeds, the farmer needs to approach the company holding the patents to the GM crops every single year in order to plant his crop. Also, since the seeds are patented (and the Intellectual property is vigorously defended by lawsuits) and there is a very very long approval process for GM foods, there is huge bar for companies that intend to enter this space and the incumbent - Monsanto - enjoys a virtual monopoly in this space. Not only is this disruptive to the natural cycle of (free) seed generation used by farmers, in the Indian context, it is tantamount to handing the base of our agricultural pyramid to an American company (Monsanto controls 90% of the GM seed market in America).

Image courtesy: http://www.planetsave.com

Right now, Indian farmers are responsible for the cost of vegetable production based on the cost of production, seed storage, fertilizer costs etc. The government controls market prices by setting a minimum support price (MSP) at which it will buy produce from farmers. Now, alter the situation and put an American corporation on the other side of the fence. Monsanto will have price control over the seeds that it supplies to farmers. Since there is virtually no cost to replicating seeds year after year (the R&D cost has probably been recouped by now), every rupee charged by the monopoly of Monsanto will go straight to its profit margins and to the bottom line cost of foodgrains in our country. As a corollary, our government (which procures over 50% of our foodgrains) will be paying farmers so that they can pass on some of the money to an MNC as the cost of seeds. Just imagine, every single man, woman and child in India will effectively be paying a food tax to an American corporation. This money will be money that will be leached from our country, just like the way the Britishers leached money through their Manchester Cotton mills while using raw cotton from India. Frankly speaking, it's a terrible idea.

Courtesy: Greenpeace

Now, let's come to the obvious question - "why would people buy GM crops?" The answer is quite obvious - there is a market for foods such as seedless brinjals (the Bt Brinjal product) just as there is a market for seedless watermelons. Also, GM crops are like drugs - first - show the public the high of high quality blemish-less, super-sized onions, brinjals, oranges, wheat etc. The public will get hooked onto the product and will generate high demand. Farmers will then cave into market forces and will be forced to plant GM crops in order to realize higher margins from their farmland. They will abandon their traditional methods of hybridization and seed selection and start buying seeds off-the-shelf from Monsanto. Monsanto could start off with supplying seeds for free to get farmers onboard (after all, besides the initial R&D cost, they really don't have large recurring costs in seed generation). Farmers would try it out and sell the GM produce on the market (they could even start a black market for GM crops on the sly if they so wished). Once the farmers are hooked onto the seeds, Monsanto - being a monopoly - could revise the rates citing XYZ concerns (as most monopolies are wont to do) and then you end up with an American corporation leaching supra-normal profits from us. And once the floodgates of GM crops open for this country, there is no going back because GM crops will freely intermingle with natural crops (a recent study showed that 80% of wild canola crop in North America had atleast 1 artificial gene after a few years of the introduction of GM Canola into the country). This is a really disturbing scenario. Monsanto will start out small, but will quickly grow into a mammoth agricultural concern having control over large swathes of our food supply chain. See the chart embedded below to see how a single company can take over a country's entire crop acreage.

But before getting too far ahead of ourselves, let us analyze the Bt Brinjal issue. Farmers had opposed the Bt Brinjal project because of 2 major reasons - a dependence on a corporation for seeds every year and unproven health risks from GM foods. Indian farmers are smart in a way. But. if you notice, recently, the opposition to Bt Brinjal has quietened down and the images of demonstrating farmers have faded from public memory. Besides, Monsanto has been in India for about 3? years now and they have had plenty of time to figure out ways to win farmers over to their cause. However, despite all their efforts, there is bound to be opposition to the introduction of GM crops in India. So, how do they intend to overcome that? The answer is: "Lobby the government".

Now, the Indian government can't be lobbied overtly. So, let's do it the Indian way: bribe a few ministers, pull in a few favours and get an independent authority created by an act of Parliament. (After all, the multitude of scams involving our MPs has clearly demonstrated that more than a few of them are for sale). That independent body will function like the US-FDA and grant approvals for all GM crops intending to be introduced into India. Also, by making it an autonomous non-profit body, they could conceivable claim exemption from the Right to Information Act by virtue of it not being a "public authority" as defined under the act. Then, it would remain a simple matter of getting just a few officials to fall into line and you'll have a green light for all kinds of GM crops being introduced into India. Not only that, to go one step further, India would become a testing ground for all kinds of GM crops that have to be introduced into other countries. Given the extremely lax oversight of our government departments, this is sure to lead to lots of issues for us in the future and when things start to go wrong in the country, we will conveniently have yet another government department to blame for not performing its job properly. The only difference is that this department is being brought into being for one and only one reason - to give Monsanto a crack in the wall to introduce GM foods into our country.

Reviewing all this, you - my casual, tech inclined reader will feel that this scenario is quite far-fetched. So let me put forth a few supporting facts:

1. Two reporters from Fox news set out to research Monsanto and their claims on GM food safety. Monsanto literally buried their report under a threat of a lawsuit. It's a really explosive video (one that I'd seen some time ago). You need to watch this to get the full context of what's happening in the GM world today.

2. Greenpeace states that there is a "secretive bill" that is likely to be tabled in the monsoon session of the Indian Parliament that seeks to create a Biotechnology Regulatory Authority of India (BRAI) that will be an authority for clearing the introduction of GM foods into India. There has been no notable public discussion on the contents of this bill. More details here: http://www.greenpeace.in/take-action/save-your-food/stop-the-brai-bill.php?tyf

3. Monsanto has been trying out GM Wheat since before 2004. The plans to introduce it in North America were abandoned in 2004. Source: http://www.gmo-compass.org/eng/grocery_shopping/crops/22.genetically_modified_wheat.html . However, recently, Monsanto has carried out trials of GM wheat in Australia (which is one of the largest exporters of Wheat in the world). Source: http://www.crikey.com.au/2010/09/17/gm-seed-for-australian-wheat-great-initiative-or-costly-experiment/. India is one of the world's largest wheat growing nations and the introduction of GM wheat would give the reigns of our food supply to single corporation's products.

4. It is also important to note that there are virtually no alternative companies in the marketplace to put out contradictory claims and act as whistleblowers. The ones selling the products are also the ones certifying them as safe. That's hardly something that would put your mind at ease. To make things worse, Monsanto has a history of bringing large and costly lawsuits against companies and individuals that challenge the safety of their products - after all - their entire market base depends on the faith that the people have in the food that they are consuming and it would be highly detrimental to their plans if someone conclusively demonstrated that GM foods had an adverse impact on human health.

After seeing all this, would you trust this company with your food? Take your time and think it over.

1. A law unto itself, www.outlookindia.com, March 8, 2010
http://www.outlookindia.com/article.aspx?264454
2. India says no to Bt brinjal, for now, www.rediff.com, February 9, 2010
3. No GM trials in State: Katti, Deccan Herald, July 20, 2011
http://www.deccanherald.com/content/177877/no-gm-trials-state-katti.html
4. Biotech Bill: Sweeping powers, glaring omissions, www.rediff.com, March 11, 2010
Remember, Monsanto is a multi-billion dollar company and has deep pockets to buy out the media through advertising dollars. This negative coverage might change substantially in the future to become positive coverage. So keep your eyes open for it.

Friday, July 22, 2011

I was just browsing on Facebook and I was surprised to see a link to a video posted in the comments being shown inline. Here's how the new interface looks:

Suprisingly, the video player also seemed to be updated.

See the new style controls?

Hmm... It seems okay... Though the whole site could do with a big makeover...

Wednesday, July 20, 2011

The FBI Chargesheet for ISI Lobbying via Americans

I just came to know of the FBI chargesheet against Americans for channeling funds from the ISI for lobbying purposes. Read the entire chargesheet below.

Tuesday, June 21, 2011

Solved? LaTeX Error: pdf file is damaged - attempting to reconstruct xref table

Update: Greater experience with this issue has given me greater clarity. The issue is not related to LaTeX. Getting this error message simply means that evince tried to access the PDF file while it was being written to by pdflatex or latex and found it in a damaged state. This happens if you have a large pdf file being generated with lots of text or a number of figures. There's nothing to worry about if you get this error message. You can safely ignore it.

---- Original Post Below ---

This is just a short note for anyone coming across this problem while doing LaTeX work.
Google doesn't have a decent explanation for why this message occurs. I don't have one either
but I have isolated the cause for my particular issue and I hopefully have a fix.

The error under discussion is:

Error: PDF file is damaged - attempting to reconstruct xref table... Error: Couldn't find trailer dictionary Error: Couldn't read xref table Error: PDF file is damaged - attempting to reconstruct xref table... Error: Couldn't find trailer dictionary Error: Couldn't read xref table

I hit this error while importing a PNG file exported from GIMP into my LaTeX thesis using the
\includegraphics command. The PNG file probably had a few transparent pixels inserted due to
the editing process because the base image (which was unmodified) was imported without any
issue into the LaTeX PDF and this is something that LaTeX doesn't like.

I fixed this issue by simply removing the obnoxious PNG and converting that to a JPG. The
problem got solved.

If it helps you, let me know in the comments! :)

Friday, June 17, 2011

What is the difference between a mutex and a semaphore?

Okay. This is another common interview question. The one line answer to this question is:

"A mutex is a binary semaphore that can only be unlocked by the thread that locked it. It's primary purpose is to ensure single threaded access to a piece of code called the critical section. A semaphore, in contrast, is often used for cooperative signalling across threads, a feature that is explicitly disallowed by the construction of the mutex as a binary semaphore owned by the thread that locked it."

Here are a few of the points you should keep in mind:

Q1: What is a semaphore?
* A semaphore is a mechanism to convey events between two or more cooperating processes. Now what does this mean exactly?
In a typical multiprocessing machine, many processes are being timeshared on a single CPU. The degree to which computation will proceed in one process depends on how big a time slice it has been given by the OS CPU Scheduler. Thus, different processes will be at different stages of computation at any point of time. So, how do we coordinate across processes? We use semaphores to tell other processes that we've reached certain stages in our processing (via a sem_post) or to wait for other processes to tell us that they've reached certain well defined stages in their computation (via a sem_wait). These two major functions allow us to synchronize computation across process boundaries and this is the main function of a semaphore.

Effectively, you can think of a semaphore as a click lock with many keys. Anybody who holds a key can open the lock. There can be multiple keys and multiple people can open the lock at any time. Anybody can lock the lock.

Q2: What is a mutex?
* A mutex is a mechanism to ensure mutual exclusion. In simple terms, this means that one and only 1 process will be allowed to execute a particular piece of code guarded by a mutex. This particular piece of code is called the critical section. A mutex has 2 major properties:
1. Only one of the cooperating processes gains access to the critical section.
2. Only the process that holds access to the critical section (called the owner of the mutex) can revoke its access to the critical section. (This is called ownership of the critical section or mutex).

Effectively, a mutex is the concept of a lock + owner. There is a lock and only 1 key. Whoever uses a key to open the lock is called the owner of the mutex. Only he can lock the lock or open it. Thus, there can be at max 1 person at any time in the locked room.

By ensuring that no more than 1 process can ever own and release the mutex, mutual exclusion is ensured.

* A semaphore can be used to implement a mutex. This is possible if we associate the concept of an "owner" to a binary semaphore (a click lock with 1 key only). Look at the following pseudocode:

Global Initialization:
sem_init(&sem, 1); // Binary Semaphore (ensures that only 1 process can enter the Critical Section)
owner = 0; // No pid

Mutex Lock Code:
sem_wait(&sem);
owner = getpid();

// Critical Section

Mutex Unlock Code:
if (owner != getpid()) {
abort(); // non-owner is trying to unlock the mutex. Invalid state - abort. (Ensures Mutual Exclusion)
}
sem_post(&sem);

Note how the unlock code verifies the owner of the lock and thus fulfills the function of a mutex. This is a requirement for ensuring that only the owner of the lock is able to unlock the code.

This is in sharp contrast to some other uses of a semaphore.
Consider the following example:

Global:
sem_init(&sem1, 1);
sem_init(&sem2, 0);

void f() {
while(true) {
sem_wait(&sem1);
// CS1
sem_post(&sem2);
}
}

void g() {
while(true) {
sem_wait(&sem2);
// CS2
sem_post(&sem1);
}
}

If f() and g() are 2 processes cooperating via the 2 semaphores, the way that the code is written ensures that the critical sections will be executed one after the other (CS1 - CS2 - CS1 - CS2 etc).
This example illustrates the fact that semaphores can be used to convey events between 2 cooperating processes. In this case, the events being conveyed are the completion of the critical sections of f() and g(). Also note that in this example, the process that waits on a semaphore isn't the one that is posting to the semaphore. This is a major difference between a semaphore and a mutex because a mutex is unable to achieve such a behaviour.

Q3: What are counting semaphores?
* Semaphores also come in a variety of flavours. One such flavour is called a counting semaphore. A counting semaphore is one that is able to allow more than one process to successfully wait on it. It lets through some "count" number of processes through sem_waits. This makes a counting semaphore more useful than a mutex - for example, it can be used to implement reader-writer locks or allow a fixed level of concurrency in a piece of code. Note that mutexes only allow for one process/thread in a critical section.

* Please remember that mutexes need to be acquired and released by the same thread. A request release a mutex currently held by another process should be treated as invalid code. This is different from the behaviour of a semaphore. A semaphore may be WAIT-ed upon by one thread/process and be POST-ed by another. This is usually by design. Otherwise mutual exclusion cannot be guaranteed in all cases.

I hope this makes the differences between semaphores and mutexes very clear. If you have any more doubts, just put in your questions in the comments!

Thursday, June 16, 2011

An approach better than Pancake Sorting for a restricted sort

The original source of the question for this blog post is again Algo Geeks.

Here's the problem description:
There is an array in an external system which cannot be accessed directly.
The system exposes 3 functions whose complexity is O(1) :
length() - returns the length of the array.
get(i) - returns the element at index i.
reverse(i,j) - reverses the elements in the array from index i to index j (both indexes inclusive).

Sort the array in the best possible way using only these 3 operations?

The direct solution to sorting in this case is a Pancake Sort (see the video embedded below).

Pancake sort has a direct implementation complexity of O(N2) as locating the max element in the unsorted section of the array is an O(N) operation. Besides this, it is obvious that one cannot use standard quicksort or heapsort directly since a swap operation
isn't supplied. However, we can simulate a swap operation using the reverse operator. It's easy to do:

swap(i, j) = reverse(i + 1, j) reverse(i, i+1), reverse(i+1, j) where i, j are indices to the array.

Given this swap operation, we can then implement any O(n log n) sort that we desire since we already have random access via the get(i) function.

However, we can do better:

Let us define a run as a sequence of consecutive values in the array.

1. Go through the array and reverse descending runs - O(n) as reverse(i, j) is supplied. Now you have an array with many runs in ascending order. It looks something like a0, a1 ... ak, b0, b1, ... bl, c0, c1, ... cm ...
2. Perform a merge procedure pairwise for consecutive runs in the array. The merge is performed as follows:
If a0, a1, a2 ...ak b0, b1, b2... bl are consecutive runs, then:
a) Compare a0, b0 (using get() operation to get the values).
--- Else If b0 < a0 then perform step (b).
--- Else a0 is in its correct place, increment pointer to the 'a' array and repeat step 2 for a1... ak, b0 ... bl

b) reverse(a0 ... b0) and then reverse (ak ... a0) - giving b0, a0, a1 ... ak, b1, b2 .. bl. Repeat step 2 for the run a0, a1 ... ak, b1, b2 ... bl

Complexity analysis:
Step 1 can be done in O(n) time. Using 2 indices to locate boundaries of runs and an O(1) reverse call, this is easy to achieve.
Step 2 The merge procedure requires 2 get() operations (one of which may be optimized away with careful coding) and (in the worst case) 2 reverse(i,j) operations. Therefore, the cost of putting 1 element in its correct position is O(1). Hence 2 sequences of lengths n and m can be merged in O(min{n, m}) swaps and O(max{n, m}) time.

Overall complexity:
Let us assume that we end up with K runs of lengths L1 - Lk from step 1. Cost of this step O(N)
Cost of merging adjacent runs = O(min{Li, Li+1}) (in terms of swaps), O(max{Li, Lj}) in terms of comparisons.
Total number of merges that will take place = K

The worst case for the problem will occur when all the arguments to min{ Li, Lj } are the same. This happens when the runs are of size N/K.
Therefore, cost of merge = O( N/K ) swaps (rather reverse(i,j) calls) and total number of merges = K. Hence overall swaps complexity is O( K * N/K ) = O( N ). However, overall comparison complexity remains O(N log N). So using the reverse operation, we've be able to optimize the number of swaps. However, we cannot breach the lower bound of Omega(n log n) for the number of comparisons required for sorting as this is a comparison based sort.

A combinatorial game problem

So, I was lurking on the AlgoGeeks group when I came across this interesting question posted by Piyush from IIIT-Allahabad. Apparently, it came in a DE Shaw interview.

The original question is:
Given N heaps numbered sequentially from 1 to N that contain atleast 1 counter each, players move in turns to pick counters from the heaps. The rules of the game are such that counters can only be picked from the lowest numbered heap that is not empty. The question is: design an algorithm for optimal play that will determine which player will win given a configuration of heaps.

There are only 2 types of heaps in this problem:
1. Optional heaps -> have > 1 coins
2. Non-optional heaps -> have = 1 coins

For optional heaps, optimal play dictates that you either pick all the elements from the heap or leave just one element in the heap when you play your turn. This is an obvious constraint because otherwise, you're putting your opponent in the driving seat with a heap configuration that is similar to yours and if he can force a win, he will! :)

So, to repeat: there are only 2 things you can do with an optional heap - pick all or leave 1. Now, to get the solution for the game, read on:

Start from the rightmost heap.
This is base case - whoever gets this heap surely wins.
Look at the heaps before it.

If it is a non-optional heap (1 element only), then the assured outcome is the
inverse of the assured outcome of the next heap.

If it is an optional heap, then the assured outcome for whoever is first is a win - irrespective of the
assured outcome of the next heap. This is because the player whose turn
it is to move can either pick up all the elements of the heap thus converting it
into the assured outcome of the next heap (if that is a loss for the other player)
or can leave 1 element in the heap, thus getting a win by optimal play.

When there are multiple optional heaps in sequence, the optimal play is
to always force your opponent to keep picking the last element from each heap
so that you retain control over the fate of the game. Hence, a sequence of
optional heaps is the same as a single optional heap.

Unroll the loop till the first heap and output the assured outcome as the answer.

For example 1 2 2 2 1:
1 - Win
2 1 - Win for Player 1 at this stage
2 2 1 - Win for Player 1 at this stage
2 2 2 1 - Win for Player 1 at this stage
1 2 2 2 1 - Loss for Player 1 at this stage...
Outcome: Loss
This is equivalent to 1 2 1

For eg.
For the heap: 1 2 3 1 2
2 - win for player 1
1 2 - Loss for player 1
3 1 2 - Win for player 1
2 3 1 2 - Win for player 1
1 2 3 1 2 - Loss for player 1
This is equivalent to 1 2 1 2

For all the games discussed above:
Game 1: 2 3 -> is the same as game 2 2 which is equivalent to 2 -> Player 1 wins
Game 2: 1 2 -> is a loss for player 1 for obvious reasons
Game 3: 2 1 -> is a win because of the optional heap
Game 4: 1,2,3,1,4 -> Loss for player 1. Equal to 1 2 2 1 2 -> 1 2 1 2 -> Loss
Game 5: 1,1,2,3 -> Win for player 1. Equal to 1 1 2 2 -> 1 1 2 -> Win
Game 6: 2,1,3,4,5 -> Equal to 2 1 2 2 2 -> 2 1 2 -> Win for player 1

Nice question. Hope you enjoyed the solution too! :)

Sunday, June 12, 2011

Printing all permutations of a string containing duplicate characters (C/C++)

This is a follow up post to my earlier post on printing all possible permutations of a string (also called Anagrams) using the C++ Standard Library. The earlier post is located here. This post should have come on this blog about 3 years ago. I'm no longer interested in solving problems like this but I wanted to put this up for the millions of people who get this problem in interviews and the like and to retain the same for posterity.

So, previously we've discussed how to print all possible permutations of a string without duplicates using the standard C++ library functions. That's all fine and dandy if we want to use the standard libraries – especially next_permutation, but what happens if we have to write our own code to do so?

In this blog post I'll present recursive solutions to the problem of printing all the possible permutations of a string. I'll even take up the case where the input string contains repeated letters later on in this post. The resultant function will be able to print all possible distinct permutations of a string containing duplicate characters.

Printing all permutations of a string containing no duplicate characters

Code 1: The recursive solution for a string with no duplicate characters in C/C++

char * full_string;
void permute(char * str, int length) {
if(length == 0) {
printf(“%s\n”, full_string);
return;
} else {
for(int i = 0; i < length; ++i) {
swap(str[0], str[i]);
permute(str+1, length-1);
swap(str[0], str[i]);
}
}
}

Call like: permute(full_string, strlen(full_string));

To understand how this works, just look at the string “ABC”.

Let us assume that we have a magic function permute2 that generates all possible permutations of a given string of length 2.

Then, the permutations problem for the string “ABC” can then be broken down as:

These permute2 values themselves can be broken down to smaller subproblems.

Here, it is obvious that permute1(any char) = char itself. This is the base case of this recursion. Unrolling the recursion for the 3 required values yields:

permute2(“BC”) = { “BC”, “CB” }
permute2(“AC”) = { “AC”, “CA”}
permute2(“AB”) = {“AB”, “BA”}

So the resultant sets of strings are:
permute3(“ABC”) = {“ABC”, “ACB”} U {“BAC”, “BCA”} U {“CAB”, “CBA”}

which is the set of all permutations of the string “ABC”. It is obvious to see that we are in fact just choosing the starting prefix of the permutation and then requesting the permute function to run on a smaller subproblem of permuting a smaller string.

To generalize, we can state that we can get all the permutations of a string using the general recurrence:

This formula basically means that you select a character sk from the string S of length n to represent the first character of the permutation, find all the permutations of the smaller string formed by removing sk from S and extend the permutations thus obtained by prefixing sk.

This is essentially what the code above is implementing. The two swap statements perform the work of selecting an element sk from the string S and replacing it back after permuting the smaller string

Printing all distinct permutations of a string containing duplicate characters

To achieve this, we just modify the above code a little. To avoid duplicates, we need to enforce the constraint that no character will be selected twice to place in the prefix before calling the subproblems. So, the resulting recursion gets modified as:

This simply means that at every stage of the recursion, pick only distinct characters. So, how do we do that? We pick the sk in ascending order.

Code 2: C/C++ code for finding all distinct permutations of a string with duplicate characters in it

char* full_string;
int counter = 0;
void permute(char *str, int length) {
if(length == 0) {
printf("%s\n", full_string);
++counter;
return;
} else {
// Find the smallest char in the string set it to be the first character. Solve the subproblem for the smaller string.
char *smallest = min_element(str, str + length);
iter_swap(str, smallest);
permute(str+1, length-1);

// Look for the smallest element strictly greater than the first element of the current string
char *smallest_greater = str + length;
for(char *i = str+1; i != str+length; ++i)
if(*i > *str && (smallest_greater == str + length || *i < *smallest_greater))
smallest_greater = i;
while(smallest_greater != str+length) {
// If such an element is found, swap it into the first slot and recurse
iter_swap(str, smallest_greater);
permute(str+1, length-1);

// Repeat the loop if possible for the next greater character
smallest_greater = str + length;
for(char *i = str+1; i != str+length; ++i)
if(*i > *str && (smallest_greater == str + length || *i < *smallest_greater))
smallest_greater = i;

}
}
}

Code Explanation:
The fact that we are selecting the character to put at the beginning of the string str in strict ascending order gives us 2 major guarantees:

1. No character will be selected more than once. Hence the distinct selection requirement is satisfied.
2. All distinct characters in the string str will be selected. Hence, all the permutations will end up being generated

A small bonus of selecting the sk character in this fashion is that the permutations generated in the output will be in lexicographical order (dictionary order). This makes it easy to verify the correctness of the program.
This code uses two functions that you might not be aware of :
The functions iter_swap and min_element are provided by the standard C++ library. They are really simple and you could code your own versions easily.

• iter_swap – it just swaps the elements pointed to by the respective pointers without changing the pointers themselves. So, it's basically equivalent to the function:
void iter_swap(char *ptr1, char *ptr2) {
char tmp = *ptr1; *ptr1 = *ptr2;
*ptr2 = tmp;
}
• min_element – finds the location of the minimum element that exists in the given range. In this case, it is the char* pointer pointing to the location of that element. It can be implemented like:
char * min_element(char *start, char *end) { // end is 1 beyond the last valid element
if(start == end) return end; // empty range
char *min_pos = start;
for(char *iter = start+1; iter != end; ++iter) {
if(*iter < *min_pos) min_pos = iter;
}
return min_pos;
}

Of course, the standard library implements these as template functions so that they can be used with a variety of datatypes.

Code 3: The iterative solution.

The C++ standard library is an elegant repository of algorithms. You can also use the iterative next_permutation function to generate all the permutations of a string which might contain duplicates. Mark Nelson has done a superb job of explaining how next_permutation works at http://marknelson.us/2002/03/01/next-permutation/ - do take the time to go through his complete blog post. It is the best description of this algorithm till date. Sample code on how to use that function has already been provided in my previous post.

Code for this and many other algorithms can be found at the github archive: http://github.com/swagatata/ds_and_algos
However, all this is a work in progress.

C++ BigIntegers: Installing a static GMP library with the C++ class extensions (gmpxx.h)

So I was spending some time with GMP today, trying to install it on my Ubunty Natty (11.04). [1] I wanted to finally get BigInteger support for C/C++ in a hassle free manner. Naturally, I turned to my trusty Synaptic to get and install the libraries. A quick search for libgmp in the repository turned up some old versions of GMP and some modified packages. I was surprized! I was expecting libgmp to be in the Ubuntu repository. A quick Google confirmed that Ubuntu doesn't ship with GMP 5 (which was the latest at the time of this blog post). So I set down to install the library from source. So, first things first:

1) Get the gmp-5.0.2 library (or whatever's the latest now) from http://www.gmplib.org

2) Verify the file using gpg (Always a good practice)
gpg --verify gmp-5.0.2.tar.bz2.sig gmp-5.0.2.tar.bz2

You might get an error at this stage saying that the public key is not available. To fix that, issue
gpg --search-keys <key_id>

Alternatively, you could check the SHA-1 or SHA-256 cryptographic checksum by extracting the .tar file from the .tar.bz2 or .tar.gz file and running sha1sum gmp-5.0.2.tar or sha256sum gmp-5.0.2.tar

3) Extract the files from the archive
tar -xvjf gmp-5.0.2.tar.bz2

4) Follow the installation instructions: Now here's the catch.
If you follow the standard installation instructions which ask you to follow the 3 step procedure: ./configure , make , sudo make install then you'll end up with only the C library built and installed. I did this and I spent quite a bit of time today messing around with GMP, trying to get it to install the C++ classes for enabling BigInteger support. Unfortunately, Google didn't help in fixing this problem. I couldn't find a thread which mentioned the installation of the C++ extensions. However, reading miscellaneous blog posts on issues with installing GMP, I figured out that the switch to install the C++ extensions is --enable-cxx and it has to be passed to ./configure. It seems obvious in hindsight but was really unobvious when I was figuring out the installation issues. So, here are the correct set of commands that you need to follow to get GMP to install with the C++ extensions.

./configure --enable-cxx
make
make check
sudo make install

This will install GMP to your /usr/local/include and /usr/local/lib directories. I prefer this because you can easily remove the files and folders later with minimal system modifications should you so desire. Also, this is the recommended way for non-system installed packages (read install from source packages) to install themselves on a linux system.

5) Verify that GMP has installed correctly

Create the following 2 programs to test out GMP:

test1.c:

#include <stdio.h>
#include <stdlib.h>
#include <gmp.h>

int main() {
mpz_t a, b, c;
mpz_inits(a, b, c, NULL);
mpz_set_si(a, rand());
mpz_set_si(b, rand());
gmp_printf("%Zd + %Zd =  %Zd\n", a, b, c);
// If you're not quitting, use mpz_clears(a, b, c, NULL)
return EXIT_SUCCESS;
}
Compile and test with: gcc test1.c -lgmp -static -o test1 && ./test1

test2.cc:

#include <iostream>
#include <cstdlib>
#include <gmpxx.h>
using namespace std;

int main() {
mpz_class a, b, c;
a = rand();
b = rand();
c = a + b;
cout << a << " + " << b << " = " << c << endl;
return EXIT_SUCCESS;
}

Compile and run with: g++ test2.cc -lgmpxx -lgmp -static -o test2 && ./test2

I've added the static flag to compile with the library included in the executable. If you want to use a shared library version of GMP, then omit the -static flag and make sure your LD_LIBRARY_FLAGS variable has /usr/local/lib present in it.

[1] GMP is the GNU Multi Precision computing library (Home page: http://www.gmplib.org) that provides unlimited precision computing support for C and C++. (Actually, the maximum limit of precision is 41 billion digits but by the time you get to that... hmm... you'll probably run out of space). Their code is very optimized and if you *ever* need BigInteger support in C or C++ - that's the way to go.