Tuesday, June 21, 2011

Solved? LaTeX Error: pdf file is damaged - attempting to reconstruct xref table

Update: Greater experience with this issue has given me greater clarity. The issue is not related to LaTeX at all. Getting this error message simply means that evince tried to open the PDF file while it was still being written by pdflatex or latex, and so found it in a half-written state. This happens when a large PDF with lots of text or a number of figures is being generated. There's nothing to worry about if you see this message; you can safely ignore it.

---- Original Post Below ----

This is just a short note for anyone coming across this problem while doing LaTeX work.
Google doesn't turn up a decent explanation for why this message occurs. I don't have one either,
but I have isolated the cause of my particular issue and I have a likely fix.

The error under discussion is:

Error: PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
Error: PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table

I hit this error while importing a PNG file exported from GIMP into my LaTeX thesis using the
\includegraphics command. The editing process had probably introduced a few transparent pixels
into the PNG (the unmodified base image imported into the LaTeX PDF without any issue), and that
is apparently something this toolchain doesn't like.

I fixed the issue by simply converting the obnoxious PNG to a JPG, which has no transparency.
That solved the problem.

If it helps you, let me know in the comments! :)

Friday, June 17, 2011

What is the difference between a mutex and a semaphore?

Okay. This is another common interview question. The one line answer to this question is:

"A mutex is a binary semaphore that can only be unlocked by the thread that locked it. Its primary purpose is to ensure single-threaded access to a piece of code called the critical section. A semaphore, in contrast, is often used for cooperative signalling across threads - a feature that is explicitly disallowed by the construction of a mutex as a binary semaphore owned by the thread that locked it."

Here are a few of the points you should keep in mind:

Q1: What is a semaphore?
* A semaphore is a mechanism to convey events between two or more cooperating processes. Now what does this mean exactly?
In a typical multiprogrammed machine, many processes are timeshared on a single CPU. How far computation has proceeded in a process depends on how big a time slice the OS CPU scheduler has given it, so different processes are at different stages of their computation at any point in time. How, then, do we coordinate across processes? We use semaphores: a process tells the others that it has reached a certain stage in its processing (via a sem_post) and waits for other processes to tell it that they've reached certain well-defined stages in their computation (via a sem_wait). These two major operations allow us to synchronize computation across process boundaries, and this is the main function of a semaphore.

Effectively, you can think of a semaphore as a click lock with many keys. Anybody who holds a key can open the lock; since there can be multiple keys, multiple people can get through the lock at any time. And anybody can lock the lock.
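To make the "many keys" analogy concrete, here is a small runnable sketch (my own illustration) using a POSIX semaphore initialised with 3 keys. sem_trywait is the non-blocking form of sem_wait, so a single thread can count how many keys are available before anyone would block:

```cpp
#include <semaphore.h>

// A "click lock" with 3 keys: three waiters get through before a fourth
// would block. sem_trywait probes without blocking, so we can observe
// this from a single thread.
int keys_available_demo() {
    sem_t sem;
    sem_init(&sem, 0, 3);           // 3 "keys" (middle arg 0 = shared between threads)
    int acquired = 0;
    while (sem_trywait(&sem) == 0)  // take keys until none are left
        ++acquired;
    sem_post(&sem);                 // anybody can return a key ("lock the lock")
    sem_destroy(&sem);
    return acquired;                // 3
}
```

With a count of 1, this degenerates into the binary semaphore discussed below.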

Q2: What is a mutex?
* A mutex is a mechanism to ensure mutual exclusion. In simple terms, this means that one and only one process will be allowed to execute a particular piece of code guarded by the mutex. This particular piece of code is called the critical section. A mutex has 2 major properties:
1. Only one of the cooperating processes gains access to the critical section at a time.
2. Only the process that holds access to the critical section (called the owner of the mutex) can release it. (This is called ownership of the critical section or mutex.)

Effectively, a mutex is the concept of a lock plus an owner. There is one lock and only one key. Whoever uses the key to open the lock is called the owner of the mutex. Only he can lock the lock or open it. Thus, there can be at most 1 person in the locked room at any time.

By ensuring that no more than 1 process can ever hold and release the mutex, mutual exclusion is ensured.
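As a concrete sketch of a critical section in code, here is a minimal example using C++11's std::mutex and std::thread (the function and variable names are mine, assumed for illustration). Four threads increment a shared counter, and the lock_guard guarantees that only one of them is inside the critical section at a time:

```cpp
#include <mutex>
#include <thread>
#include <vector>

long counter = 0;          // shared state
std::mutex counter_mutex;  // guards the critical section below

void add_n(int times) {
    for (int i = 0; i < times; ++i) {
        std::lock_guard<std::mutex> lock(counter_mutex); // acquires on entry
        ++counter;  // critical section: only one thread executes this at a time
    }               // released on scope exit, by the same owning thread
}

long run_counter_demo() {
    counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back(add_n, 10000);
    for (std::thread& w : workers)
        w.join();
    return counter;  // 40000 with the mutex; typically less without it
}
```

Without the lock_guard, the unsynchronized increments would race and the final count would usually fall short of 40000.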

* A semaphore can be used to implement a mutex. This is possible if we associate the concept of an "owner" to a binary semaphore (a click lock with 1 key only). Look at the following pseudocode:

Global Initialization:
sem_init(&sem, 0, 1); // Binary semaphore (ensures that only 1 process can enter the Critical Section)
owner = 0; // No pid

Mutex Lock Code:
sem_wait(&sem);
owner = getpid();

// Critical Section

Mutex Unlock Code:
if (owner != getpid()) {
   abort(); // non-owner is trying to unlock the mutex. Invalid state - abort. (Ensures Mutual Exclusion)
}
owner = 0; // relinquish ownership before releasing the lock
sem_post(&sem);

Note how the unlock code verifies the owner of the lock and thus fulfills the function of a mutex: it ensures that only the owner of the lock is able to unlock it.
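Here is a runnable adaptation of the pseudocode above, sketched with POSIX semaphores and threads standing in for processes (so the owner is a pthread_t rather than a pid; OwnedMutex is my name for the construct, not a standard type):

```cpp
#include <cstdlib>
#include <pthread.h>
#include <semaphore.h>

// A mutex built from a binary semaphore plus an owner field, following
// the pseudocode above but adapted from processes to threads.
struct OwnedMutex {
    sem_t sem;
    pthread_t owner;
    bool held;

    OwnedMutex() : held(false) { sem_init(&sem, 0, 1); } // binary semaphore
    ~OwnedMutex() { sem_destroy(&sem); }

    void lock() {
        sem_wait(&sem);          // at most one thread gets past this point
        owner = pthread_self();  // record ownership
        held = true;
    }

    void unlock() {
        if (!held || !pthread_equal(owner, pthread_self()))
            abort();             // non-owner unlocking: invalid state
        held = false;            // give up ownership before releasing
        sem_post(&sem);
    }
};

OwnedMutex mtx;
long shared_total = 0;

void* bump(void*) {
    for (int i = 0; i < 100000; ++i) {
        mtx.lock();
        ++shared_total;          // critical section
        mtx.unlock();
    }
    return nullptr;
}

long run_owned_mutex_demo() {
    shared_total = 0;
    pthread_t t1, t2;
    pthread_create(&t1, nullptr, bump, nullptr);
    pthread_create(&t2, nullptr, bump, nullptr);
    pthread_join(t1, nullptr);
    pthread_join(t2, nullptr);
    return shared_total;
}
```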

This is in sharp contrast to some other uses of a semaphore.
Consider the following example:

Global:
sem_init(&sem1, 0, 1);
sem_init(&sem2, 0, 0);

void f() {
  while(true) {
   sem_wait(&sem1);
   // CS1
   sem_post(&sem2);
  }
}

void g() {
  while(true) {
   sem_wait(&sem2);
   // CS2
   sem_post(&sem1);
  }
}

If f() and g() are 2 processes cooperating via the 2 semaphores, the way that the code is written ensures that the critical sections will be executed one after the other (CS1 - CS2 - CS1 - CS2 etc).
This example illustrates the fact that semaphores can be used to convey events between 2 cooperating processes. In this case, the events being conveyed are the completion of the critical sections of f() and g(). Also note that in this example, the process that waits on a semaphore isn't the one that is posting to the semaphore. This is a major difference between a semaphore and a mutex because a mutex is unable to achieve such a behaviour.
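Here is a runnable version of this example, with threads standing in for the two processes, the while(true) loops bounded so the demo terminates, and a trace string (my addition) that records the order in which the critical sections execute:

```cpp
#include <pthread.h>
#include <semaphore.h>
#include <string>

sem_t sem1, sem2;
std::string trace;       // records which critical section ran at each step
const int kRounds = 5;

void* f(void*) {
    for (int i = 0; i < kRounds; ++i) {
        sem_wait(&sem1);
        trace += 'F';    // CS1
        sem_post(&sem2); // signal g() that CS1 is done
    }
    return nullptr;
}

void* g(void*) {
    for (int i = 0; i < kRounds; ++i) {
        sem_wait(&sem2);
        trace += 'G';    // CS2
        sem_post(&sem1); // signal f() that CS2 is done
    }
    return nullptr;
}

std::string run_alternation() {
    trace.clear();
    sem_init(&sem1, 0, 1);  // f() may enter first
    sem_init(&sem2, 0, 0);  // g() must wait to be signalled
    pthread_t tf, tg;
    pthread_create(&tf, nullptr, f, nullptr);
    pthread_create(&tg, nullptr, g, nullptr);
    pthread_join(tf, nullptr);
    pthread_join(tg, nullptr);
    sem_destroy(&sem1);
    sem_destroy(&sem2);
    return trace;
}
```

Because each sem_post hands control to the other thread, the trace is deterministically "FGFGFGFGFG" - exactly the CS1 - CS2 - CS1 - CS2 ordering described above.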

Q3: What are counting semaphores?
* Semaphores also come in a variety of flavours. One such flavour is the counting semaphore. A counting semaphore is one that allows more than one process to successfully wait on it: it lets a fixed "count" of processes through sem_wait before further waiters block. This makes a counting semaphore useful where a mutex is not - for example, it can be used to implement reader-writer locks or to allow a fixed level of concurrency in a piece of code. A mutex, in contrast, only ever allows one process/thread into the critical section.
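As an illustration of the concept, here is a minimal counting semaphore sketched on top of a C++11 mutex and condition variable (the class name and the try_wait probe are my own; in practice you would reach for POSIX sem_t or C++20's std::counting_semaphore):

```cpp
#include <condition_variable>
#include <mutex>

// A minimal counting semaphore: up to `count` callers get through wait()
// before anyone blocks. post() may be called by any thread.
class CountingSemaphore {
    std::mutex m;
    std::condition_variable cv;
    int count;
public:
    explicit CountingSemaphore(int initial) : count(initial) {}

    void wait() {                       // like sem_wait
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return count > 0; });
        --count;
    }

    bool try_wait() {                   // non-blocking variant
        std::lock_guard<std::mutex> lk(m);
        if (count == 0) return false;
        --count;
        return true;
    }

    void post() {                       // like sem_post; no ownership check
        std::lock_guard<std::mutex> lk(m);
        ++count;
        cv.notify_one();
    }
};
```

A CountingSemaphore initialised to N lets N callers through wait() before the (N+1)th blocks until somebody posts.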

* Please remember that a mutex needs to be acquired and released by the same thread. A request to release a mutex currently held by another thread should be treated as invalid; otherwise mutual exclusion cannot be guaranteed in all cases. This is different from the behaviour of a semaphore: a semaphore may be WAIT-ed upon by one thread/process and POST-ed by another, and this is usually by design.

I hope this makes the differences between semaphores and mutexes very clear. If you have any more doubts, just put in your questions in the comments!

Thursday, June 16, 2011

An approach better than Pancake Sorting for a restricted sort

The original source of the question for this blog post is again Algo Geeks.

Here's the problem description:
There is an array in an external system which cannot be accessed directly. 
The system exposes 3 functions whose complexity is O(1) :
length() - returns the length of the array.
get(i) - returns the element at index i.
reverse(i,j) - reverses the elements in the array from index i to index j (both indexes inclusive). 

Sort the array in the best possible way using only these 3 operations.



The direct solution to sorting in this case is a Pancake Sort (see the video embedded below).




Pancake sort has a direct implementation complexity of O(N^2), as locating the maximum element in the unsorted section of the array is an O(N) operation. Besides this, it is obvious that one cannot use standard quicksort or heapsort directly, since a swap operation
isn't supplied. However, we can simulate a swap operation using the reverse operator. It's easy to do:


swap(i, j) = reverse(i+1, j), then reverse(i, i+1), then reverse(i+1, j) - where i < j are indices into the array.


Given this swap operation, we can then implement any O(n log n) sort that we desire since we already have random access via the get(i) function.
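Here is a sketch of this identity, with a std::vector standing in for the external array and reverse_range playing the role of the system's reverse(i, j) (both indexes inclusive; the names are mine):

```cpp
#include <vector>

std::vector<int> arr;  // stands in for the external array

void reverse_range(int i, int j) {  // the only mutating primitive available
    while (i < j) {
        int t = arr[i]; arr[i] = arr[j]; arr[j] = t;
        ++i; --j;
    }
}

void swap_elems(int i, int j) {     // assumes i < j
    // arr[i..j] = [x, m1..mk, y]
    reverse_range(i + 1, j);  // -> [x, y, mk..m1]  (bring y next to x)
    reverse_range(i, i + 1);  // -> [y, x, mk..m1]  (swap the adjacent pair)
    reverse_range(i + 1, j);  // -> [y, m1..mk, x]  (restore the middle)
}
```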

However, we can do better:

Let us define a run as a maximal sequence of consecutive elements of the array that is already sorted (ascending or descending).

1. Go through the array and reverse descending runs - O(n) as reverse(i, j) is supplied. Now you have an array with many runs in ascending order. It looks something like a0, a1 ... ak, b0, b1, ... bl, c0, c1, ... cm ...
2. Perform a merge procedure pairwise for consecutive runs in the array. The merge is performed as follows:
If a0, a1, a2 ...ak b0, b1, b2... bl are consecutive runs, then:
a) Compare a0 and b0 (using the get() operation to fetch the values).
   --- If b0 < a0, perform step (b).
   --- Else a0 is already in its correct place; advance the pointer into the 'a' run and repeat step 2 for a1 ... ak, b0 ... bl.

b) reverse(a0 ... b0) and then reverse(ak ... a0), giving b0, a0, a1 ... ak, b1, b2 ... bl. Repeat step 2 for the runs a0, a1 ... ak and b1, b2 ... bl.
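Putting steps 1 and 2 together, here is a runnable sketch of the whole procedure. A std::vector simulates the external array, and the sorting code touches it only through the three primitives get(), length() and reverse(); everything else (function names, the run-boundary bookkeeping) is my scaffolding:

```cpp
#include <algorithm>
#include <vector>

std::vector<int> a;  // simulates the external array
int length() { return (int)a.size(); }
int get(int i) { return a[i]; }
void reverse(int i, int j) { std::reverse(a.begin() + i, a.begin() + j + 1); }

// Step 1: reverse descending runs so every run is ascending.
// Returns the start index of each resulting run.
std::vector<int> normalize_runs() {
    std::vector<int> starts;
    int n = length(), i = 0;
    while (i < n) {
        starts.push_back(i);
        int j = i;
        if (j + 1 < n && get(j + 1) < get(j)) {        // descending run
            while (j + 1 < n && get(j + 1) < get(j)) ++j;
            reverse(i, j);                             // make it ascending
        } else {                                       // ascending run
            while (j + 1 < n && get(j + 1) >= get(j)) ++j;
        }
        i = j + 1;
    }
    return starts;
}

// Step 2: merge adjacent ascending runs [as..ae] and [ae+1..be]
// using only get() and reverse(), as in steps (a)/(b) above.
void merge_runs(int as, int ae, int be) {
    int bs = ae + 1;
    while (as <= ae && bs <= be) {
        if (get(as) <= get(bs)) {
            ++as;                 // step (a): a0 already in place
        } else {
            reverse(as, bs);      // step (b): b0 ak..a0 | b1..
            reverse(as + 1, bs);  //           b0 a0..ak | b1..
            ae = bs;              // the a-run shifted right by one
            ++as;
            ++bs;
        }
    }
}

void sort_external() {
    std::vector<int> starts = normalize_runs();
    while (starts.size() > 1) {   // pairwise merge passes
        std::vector<int> next;
        for (size_t k = 0; k + 1 < starts.size(); k += 2) {
            int as = starts[k];
            int ae = starts[k + 1] - 1;
            int be = (k + 2 < starts.size()) ? starts[k + 2] - 1 : length() - 1;
            merge_runs(as, ae, be);
            next.push_back(as);
        }
        if (starts.size() % 2 == 1) next.push_back(starts.back());
        starts = next;
    }
}
```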

Complexity analysis:
Step 1 can be done in O(n) time. Using 2 indices to locate boundaries of runs and an O(1) reverse call, this is easy to achieve.
Step 2 The merge procedure requires 2 get() operations (one of which may be optimized away with careful coding) and (in the worst case) 2 reverse(i,j) operations. Therefore, the cost of putting 1 element in its correct position is O(1). Hence 2 sequences of lengths n and m can be merged in O(min{n, m}) swaps and O(max{n, m}) time.

Overall complexity: 
Let us assume that step 1 leaves us with K runs of lengths L1 ... Lk. The cost of this step is O(N).
Cost of merging adjacent runs = O(min{Li, Li+1}) in terms of swaps and O(max{Li, Li+1}) in terms of comparisons.
Total number of merges that will take place = K.

The worst case for the problem occurs when all the arguments to min{Li, Li+1} are the same. This happens when the runs are all of size N/K.
Therefore, the cost of a merge = O(N/K) swaps (rather, reverse(i,j) calls) and the total number of merges = K. Hence the overall swap complexity is O(K * N/K) = O(N). However, the overall comparison complexity remains O(N log N). So, using the reverse operation, we've been able to optimize the number of swaps; however, we cannot breach the lower bound of Omega(N log N) on the number of comparisons required for sorting, as this is a comparison-based sort.

Comments invited!

Also, make sure you follow me on Twitter at http://twitter.com/divyekapoor

A combinatorial game problem

So, I was lurking on the AlgoGeeks group when I came across this interesting question posted by Piyush from IIIT-Allahabad. Apparently, it came in a DE Shaw interview.

The original question is:
Given N heaps numbered sequentially from 1 to N, each containing at least 1 counter, players take turns picking counters from the heaps. The rules of the game are such that counters can only be picked from the lowest-numbered heap that is not empty. The question is: design an algorithm for optimal play that will determine which player will win given a configuration of heaps.

Here's my answer:
There are only 2 types of heaps in this problem:
1. Optional heaps -> heaps with more than 1 counter
2. Non-optional heaps -> heaps with exactly 1 counter

For optional heaps, optimal play dictates that on your turn you either pick all the counters from the heap or leave just one counter in it. This is an obvious constraint because otherwise you're putting your opponent in the driving seat with a heap configuration that is similar to yours - and if he can force a win, he will! :)

So, to repeat: there are only 2 things you can do with an optional heap - pick all or leave 1. Now, to get the solution for the game, read on:

Start from the rightmost heap.
This is the base case - whoever gets this heap surely wins.
Look at the heaps before it.

If a heap is non-optional (1 counter only), then its assured outcome is the
inverse of the assured outcome of the next heap.

If it is an optional heap, then the assured outcome for whoever is first is a win - irrespective of the
assured outcome of the next heap. This is because the player whose turn 
it is to move can either pick up all the elements of the heap thus converting it 
into the assured outcome of the next heap (if that is a loss for the other player) 
or can leave 1 element in the heap, thus getting a win by optimal play.

When there are multiple optional heaps in sequence, the optimal play is
to always force your opponent to keep picking the last element from each heap
so that you retain control over the fate of the game. Hence, a sequence of 
optional heaps is the same as a single optional heap.

Unroll the loop till the first heap and output the assured outcome as the answer.

For example 1 2 2 2 1:
1 - Win
2 1 - Win for Player 1 at this stage
2 2 1 - Win for Player 1 at this stage
2 2 2 1 - Win for Player 1 at this stage
1 2 2 2 1 - Loss for Player 1 at this stage...
Outcome: Loss
This is equivalent to 1 2 1

Another example - for the heap 1 2 3 1 2:
2 - win for player 1
1 2 - Loss for player 1
3 1 2 - Win for player 1
2 3 1 2 - Win for player 1
1 2 3 1 2 - Loss for player 1
This is equivalent to 1 2 1 2

Working through a few more games:
Game 1: 2 3 -> is the same as game 2 2 which is equivalent to 2 -> Player 1 wins
Game 2: 1 2 -> is a loss for player 1 for obvious reasons
Game 3: 2 1 -> is a win because of the optional heap
Game 4: 1,2,3,1,4 -> Loss for player 1. Equal to 1 2 2 1 2 -> 1 2 1 2 -> Loss
Game 5: 1,1,2,3 -> Win for player 1. Equal to 1 1 2 2 -> 1 1 2 -> Win
Game 6: 2,1,3,4,5 -> Equal to 2 1 2 2 2 -> 2 1 2 -> Win for player 1
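The scan described above fits in a few lines (the function name is mine); it returns true when the player who moves first wins under optimal play:

```cpp
#include <vector>

// Scan the heaps from right to left, tracking the assured outcome for
// whoever is first to move at the current heap.
bool first_player_wins(const std::vector<int>& heaps) {
    bool win = true;  // base case: whoever gets the last heap surely wins
    for (int i = (int)heaps.size() - 2; i >= 0; --i) {
        if (heaps[i] > 1)
            win = true;   // optional heap: the mover keeps control
        else
            win = !win;   // 1-counter heap: forced move, outcome inverts
    }
    return win;
}
```

Running this on the games above reproduces the traces: first_player_wins({1, 2, 2, 2, 1}) comes out false, while first_player_wins({2, 1}) comes out true.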

Nice question. Hope you enjoyed the solution too! :)
Feel free to leave a comment if you liked it and 
don't forget to follow me on Twitter at:  http://twitter.com/divyekapoor

Sunday, June 12, 2011

Printing all permutations of a string containing duplicate characters (C/C++)


This is a follow-up post to my earlier post on printing all possible permutations of a string (also called anagrams) using the C++ Standard Library. The earlier post is located here. This post should have appeared on this blog about 3 years ago. I'm no longer interested in solving problems like this, but I wanted to put this up for the millions of people who get this problem in interviews and the like, and to retain it for posterity.

So, previously we've discussed how to print all possible permutations of a string without duplicates using the standard C++ library functions. That's all fine and dandy if we want to use the standard libraries – especially next_permutation, but what happens if we have to write our own code to do so?

In this blog post I'll present recursive solutions to the problem of printing all the possible permutations of a string. I'll even take up the case where the input string contains repeated letters later on in this post. The resultant function will be able to print all possible distinct permutations of a string containing duplicate characters.

Printing all permutations of a string containing no duplicate characters

Code 1: The recursive solution for a string with no duplicate characters in C/C++

#include <cstdio>
#include <algorithm>
using std::swap;

char * full_string;
void permute(char * str, int length) {
    if(length == 0) {
        printf("%s\n", full_string);
        return;
    } else {
        for(int i = 0; i < length; ++i) {
            swap(str[0], str[i]);      // select str[i] as the first character
            permute(str+1, length-1);  // permute the rest of the string
            swap(str[0], str[i]);      // put str[i] back
        }
    }
}

Call like: permute(full_string, strlen(full_string));



To understand how this works, just look at the string “ABC”.

Let us assume that we have a magic function permute2 that generates all possible permutations of a given string of length 2.

Then, the permutations problem for the string “ABC” can be broken down as:

permute3(“ABC”) = “A” + permute2(“BC”) U “B” + permute2(“AC”) U “C” + permute2(“AB”)

(where “X” + permute2(S) means prefixing “X” to each permutation in the set). These permute2 values can themselves be broken down into smaller subproblems, e.g.:

permute2(“BC”) = “B” + permute1(“C”) U “C” + permute1(“B”)

Here, it is obvious that permute1(any char) = the char itself. This is the base case of this recursion. Unrolling the recursion for the 3 required values yields:

permute2(“BC”) = { “BC”, “CB” }
permute2(“AC”) = { “AC”, “CA”}
permute2(“AB”) = {“AB”, “BA”}

So the resultant sets of strings are:
permute3(“ABC”) = {“ABC”, “ACB”} U {“BAC”, “BCA”} U {“CAB”, “CBA”}

which is the set of all permutations of the string “ABC”. It is obvious to see that we are in fact just choosing the starting prefix of the permutation and then requesting the permute function to run on a smaller subproblem of permuting a smaller string.

To generalize, we can state that we can get all the permutations of a string using the general recurrence:

permute_n(S) = U over k = 1 to n of { sk + p : p in permute_n-1(S - sk) }

This formula basically means that you select a character sk from the string S of length n to represent the first character of the permutation, find all the permutations of the smaller string formed by removing sk from S, and extend the permutations thus obtained by prefixing sk.

This is essentially what the code above is implementing. The two swap statements perform the work of selecting an element sk from the string S and replacing it back after permuting the smaller string



Printing all distinct permutations of a string containing duplicate characters

To achieve this, we just modify the above code a little. To avoid duplicates, we need to enforce the constraint that no character will be selected twice to place in the prefix before calling the subproblems. So, the recursion gets modified as:

permute_n(S) = U over the distinct characters sk in S of { sk + p : p in permute_n-1(S - sk) }

This simply means that at every stage of the recursion, pick only distinct characters. So, how do we do that? We pick each sk in ascending order.

Code 2: C/C++ code for finding all distinct permutations of a string with duplicate characters in it

#include <cstdio>
#include <algorithm>
using std::min_element;
using std::iter_swap;

char* full_string;
int counter = 0;
void permute(char *str, int length) {
    if(length == 0) {
        printf("%s\n", full_string);
        ++counter;
        return;
    } else {
        // Find the smallest char in the string and make it the first character. Then solve the subproblem for the smaller string.
        char *smallest = min_element(str, str + length);
        iter_swap(str, smallest);
        permute(str+1, length-1);
        
        // Look for the smallest element strictly greater than the first element of the current string
        char *smallest_greater = str + length;
        for(char *i = str+1; i != str+length; ++i)
            if(*i > *str && (smallest_greater == str + length || *i < *smallest_greater))
                smallest_greater = i;
        while(smallest_greater != str+length) {
            // If such an element is found, swap it into the first slot and recurse
            iter_swap(str, smallest_greater);
            permute(str+1, length-1);

            // Repeat the loop if possible for the next greater character
            smallest_greater = str + length;
            for(char *i = str+1; i != str+length; ++i)
                if(*i > *str && (smallest_greater == str + length || *i < *smallest_greater))
                    smallest_greater = i;

        }
    }
}



Code Explanation:
The fact that we are selecting the character to put at the beginning of the string str in strict ascending order gives us 2 major guarantees:

  1. No character will be selected more than once, so the distinct-selection requirement is satisfied.
  2. All the distinct characters in the string str will be selected, so all the distinct permutations end up being generated.

A small bonus of selecting the sk character in this fashion is that the permutations generated in the output will be in lexicographical (dictionary) order. This makes it easy to verify the correctness of the program.
This code uses two functions that you might not be aware of: iter_swap and min_element, both provided by the standard C++ library. They are really simple and you could easily code your own versions.

  • iter_swap – it just swaps the elements pointed to by the respective pointers without changing the pointers themselves. So, it's basically equivalent to the function:
    void iter_swap(char *ptr1, char *ptr2) {
      char tmp = *ptr1; *ptr1 = *ptr2;
      *ptr2 = tmp;
    }
  • min_element – finds the location of the minimum element that exists in the given range. In this case, it is the char* pointer pointing to the location of that element. It can be implemented like:
    char * min_element(char *start, char *end) { // end is 1 beyond the last valid element
      if(start == end) return end; // empty range
      char *min_pos = start;
      for(char *iter = start+1; iter != end; ++iter) {
        if(*iter < *min_pos) min_pos = iter;
      }
      return min_pos;
    }

Of course, the standard library implements these as template functions so that they can be used with a variety of datatypes.

Code 3: The iterative solution.

The C++ standard library is an elegant repository of algorithms. You can also use the iterative next_permutation function to generate all the permutations of a string which might contain duplicates. Mark Nelson has done a superb job of explaining how next_permutation works at http://marknelson.us/2002/03/01/next-permutation/ - do take the time to go through his complete blog post. It is the best description of this algorithm to date. Sample code showing how to use that function has already been provided in my previous post.
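For instance, starting from the sorted string, next_permutation enumerates exactly the distinct permutations in lexicographical order - the same sequence Code 2 prints - which makes it a handy cross-check:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Enumerate all distinct permutations of s in lexicographic order.
// next_permutation must start from the sorted string to cover them all;
// for "AAB" this yields AAB, ABA, BAA.
std::vector<std::string> distinct_permutations(std::string s) {
    std::sort(s.begin(), s.end());  // start from the smallest permutation
    std::vector<std::string> out;
    do {
        out.push_back(s);
    } while (std::next_permutation(s.begin(), s.end()));
    return out;
}
```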

Code for this and many other algorithms can be found at the github archive: http://github.com/swagatata/ds_and_algos
However, all this is a work in progress.


C++ BigIntegers: Installing a static GMP library with the C++ class extensions (gmpxx.h)


GMP Logo
So I was spending some time with GMP today, trying to install it on my Ubuntu Natty (11.04) machine. [1] I wanted to finally get BigInteger support for C/C++ in a hassle-free manner. Naturally, I turned to my trusty Synaptic to get and install the libraries. A quick search for libgmp in the repository turned up some old versions of GMP and some modified packages. I was surprised! I was expecting libgmp to be in the Ubuntu repository. A quick Google search confirmed that Ubuntu doesn't ship with GMP 5 (the latest version at the time of this blog post). So I set down to install the library from source. First things first:

1) Get the gmp-5.0.2 library (or whatever's the latest now) from http://www.gmplib.org

2) Verify the file using gpg (Always a good practice)
    gpg --verify gmp-5.0.2.tar.bz2.sig gmp-5.0.2.tar.bz2

    You might get an error at this stage saying that the public key is not available. To fix that, issue
    gpg --search-keys <key_id>

    and select the SWOX key from the options you get. The KEY_ID is mentioned right below the download link on the gmplib download page. The gmp-5.0.2.tar.bz2.sig file can be downloaded from the downloads area of the site.

    Alternatively, you could check the SHA-1 or SHA-256 cryptographic checksum by extracting the .tar file from the .tar.bz2 or .tar.gz file and running sha1sum gmp-5.0.2.tar or sha256sum gmp-5.0.2.tar

3) Extract the files from the archive
    tar -xvjf gmp-5.0.2.tar.bz2

4) Follow the installation instructions: now here's the catch.
     If you follow the standard installation instructions, which describe the usual 3-step procedure (./configure, make, sudo make install), you'll end up with only the C library built and installed. I did this, and I spent quite a bit of time today messing around with GMP trying to get it to install the C++ classes that enable BigInteger support. Unfortunately, Google didn't help in fixing this problem: I couldn't find a thread that mentioned installing the C++ extensions. However, from reading miscellaneous blog posts on GMP installation issues, I figured out that the switch to build the C++ extensions is --enable-cxx, and it has to be passed to ./configure. It seems obvious in hindsight, but was really unobvious while I was debugging the installation. So, here is the correct set of commands you need to run to install GMP with the C++ extensions:

   ./configure --enable-cxx
   make
   make check
   sudo make install

This will install GMP into your /usr/local/include and /usr/local/lib directories. I prefer this because you can easily remove the files and folders later with minimal system modifications should you so desire. Also, this is the recommended way for non-system packages (read: packages installed from source) to install themselves on a Linux system.

5) Verify that GMP has installed correctly

   Create the following 2 programs to test out GMP:
   
test1.c:

   #include <stdio.h>
   #include <stdlib.h>
   #include <gmp.h>

   int main() {
      mpz_t a, b, c;
      mpz_inits(a, b, c, NULL);
      mpz_set_si(a, rand());
      mpz_set_si(b, rand());
      mpz_add(c, a, b);
      gmp_printf("%Zd + %Zd = %Zd\n", a, b, c);
      // If you're not quitting, use mpz_clears(a, b, c, NULL)
      return EXIT_SUCCESS;
   }
Compile and test with: gcc test1.c -lgmp -static -o test1 && ./test1

test2.cc:
  
   #include <iostream>
   #include <cstdlib>
   #include <gmpxx.h>
   using namespace std;

   int main() {
       mpz_class a, b, c;
       a = rand();
       b = rand();
       c = a + b;
       cout << a << " + " << b << " = " << c << endl;
       return EXIT_SUCCESS;
   }

Compile and run with: g++ test2.cc -lgmpxx -lgmp -static -o test2 && ./test2

I've added the -static flag to compile the library into the executable. If you want to use a shared-library build of GMP instead, omit the -static flag and make sure your LD_LIBRARY_PATH variable has /usr/local/lib in it.


[1] GMP is the GNU Multiple Precision arithmetic library (home page: http://www.gmplib.org), which provides arbitrary-precision computing support for C and C++. (Actually, the maximum limit of precision is about 41 billion digits, but by the time you get to that... hmm... you'll probably run out of space.) Their code is very optimized and if you *ever* need BigInteger support in C or C++ - that's the way to go.