About a month ago, I finally pulled the trigger and ditched the #! convention in favor of pushState, so that the major search engines could index our Ajax content while our URLs stayed nice and purrty. Unfortunately, along the way I messed up the UTF-8 + URL encoding for URLs that contain non-ASCII characters. I have since fixed the bug, but Google is still attempting to crawl the badly encoded URLs.
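For anyone wondering what the encoding is supposed to look like, here is a minimal Python sketch. It is not the code from our site, and `encode_path` is a made-up helper name. The idea is simply that non-ASCII characters are turned into UTF-8 bytes first, and each byte is then percent-encoded. Percent-encoding some other byte encoding (Latin-1 in this example, chosen only for illustration) is one common way to end up with broken URLs.

```python
from urllib.parse import quote, unquote

def encode_path(path: str) -> str:
    """Percent-encode a URL path using UTF-8 bytes, leaving "/" untouched.

    Hypothetical helper for illustration only.
    """
    return quote(path, safe="/", encoding="utf-8")

# Correct: "é" becomes the two UTF-8 bytes C3 A9.
good = encode_path("/café")

# Illustrative mistake: percent-encoding the Latin-1 byte E9 instead.
bad = quote("/café", safe="/", encoding="latin-1")

# The good URL round-trips through a UTF-8 decode; the bad one does not
# (unquote substitutes U+FFFD for the invalid byte by default).
round_trip_ok = unquote(good) == "/café"
round_trip_bad = unquote(bad) == "/café"
```

With these inputs `good` is `/caf%C3%A9` and `bad` is `/caf%E9`; only the first decodes back to the original path.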
After poking around Google's Webmaster Tools a bit, I found that I could tell Google to remove an individual URL from its cache and index, but only one at a time! Yes, I could've submitted a new robots.txt with those URLs blocked, but it would take some time for them to be truly unindexed, and I'd also have to keep that list around more or less forever. Some hacking was in order, so I decided to write a Chrome browser extension that allows bulk URL removals. You can get my extension here: https://github.com/noitcudni/google-webmaster-tools-bulk-url-removal.
To install it:
1) git clone https://github.com/noitcudni/google-webmaster-tools-bulk-url-removal.git
2) Go to chrome://extensions/ and turn on Developer mode.
3) Click on "Load unpacked extension" and load my extension from the directory you just cloned.
To use it:
1) Create a list of URLs to be removed and store them in a file, one URL per line (that is, separated by \n). I downloaded the entire list of problematic URLs from the health error page and wrote a quick script to extract the ones with UTF-8 errors.
2) Go to Google's Webmaster Tools.
3) Click on Optimization -> Remove URLs.
4) You should now see a new drop-down with several removal options, along with a "Choose File" button.
5) Click on the "Choose File" button.
6) Select the file you created in step 1.
7) Sit back and relax.
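If you need a starting point for the file in step 1, here is a rough Python sketch of the kind of filter script I mean. It is not my original script, and it rests on a few assumptions: your downloaded list is already reduced to one URL per line, a "UTF-8 error" means the percent-decoded path is not valid UTF-8, and the names `has_utf8_error`, `extract_bad_urls`, and `write_url_file` are made up for this example. Adjust the check to whatever your particular encoding bug looks like.

```python
from urllib.parse import unquote_to_bytes, urlsplit

def has_utf8_error(url: str) -> bool:
    """Return True if the URL's percent-decoded path is not valid UTF-8.

    Assumption: this is what "badly encoded" means for your URLs.
    """
    raw = unquote_to_bytes(urlsplit(url).path)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False

def extract_bad_urls(lines):
    """Keep only the non-empty lines whose URL fails the UTF-8 check."""
    stripped = (line.strip() for line in lines)
    return [url for url in stripped if url and has_utf8_error(url)]

def write_url_file(urls, out_path):
    """Write the URLs separated by \\n, the format described in step 1."""
    with open(out_path, "w", encoding="utf-8", newline="\n") as f:
        f.write("\n".join(urls))

# Small in-memory demo with made-up example.com URLs.
sample = [
    "http://example.com/caf%C3%A9\n",  # valid UTF-8, kept out of the list
    "http://example.com/caf%E9\n",     # lone byte E9, not valid UTF-8
    "\n",                              # blank line, skipped
]
bad_urls = extract_bad_urls(sample)
```

To use it on real data, read your exported list with `open(...)`, pass the lines to `extract_bad_urls`, and hand the result to `write_url_file` with whatever output filename you like. That output file is what you select in step 6.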
Hopefully, you will find this extension useful.