Caffeine Fueled Dreams

Sean O'Donnell's Weblog

  • Archive
  • Contact
  • RSS Feed
  • From Del.icio.us to Pinboard.in with Python 09:20 Sunday the 19th of December 2010 13 Comments

    The sad news that Yahoo plans to shut down del.icio.us reached me this week (although there's still hope). I use del.icio.us pretty much every day and was a little traumatized upon hearing this. Once I had finished wailing and gnashing my teeth, I set out looking for somewhere to go.

    There are many bookmarking sites and services out there, but I fear change, and pinboard.in seemed like the closest thing to a plain replacement. It even supports the same API as del.icio.us. There's a small charge for signing up, but no recurring fee, so I broke out the credit card and joined up.

    The next step was to figure out how to migrate my bookmarks. del.icio.us provides an export-to-HTML feature in its settings area, but a quick look at the export revealed that some data was missing (mostly extended descriptions). Rabid googling revealed a lesser-known XML export mechanism. To use it, visit https://api.del.icio.us/v1/posts/all, enter your username and password, and save the resulting XML file.
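
Before importing anything, it can be worth checking that the XML export really does carry the extended descriptions. The following is a minimal Python 3 sketch, not part of the original migration script. It assumes each bookmark is a child element of the root with the attributes the script reads (href, description, extended, tag, time, shared). The sample data and the summarize_export name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the saved export; the attribute names match the ones
# the migration script reads (href, description, extended, tag, time, shared).
SAMPLE_EXPORT = """<posts user="example">
  <post href="http://example.com/" description="Example"
        extended="A longer note" tag="test demo"
        time="2010-12-19T09:20:00Z" shared="no"/>
  <post href="http://example.org/" description="Another"
        extended="" tag="demo" time="2010-12-18T10:00:00Z"/>
</posts>"""


def summarize_export(xml_text):
    """Count the bookmarks in an export and how many carry extended notes."""
    root = ET.fromstring(xml_text)
    posts = list(root)  # assumption: every child element is one bookmark
    with_notes = [p for p in posts if p.get("extended", "").strip()]
    private = [p for p in posts if p.get("shared", "").strip() == "no"]
    return {"total": len(posts),
            "extended": len(with_notes),
            "private": len(private)}


print(summarize_export(SAMPLE_EXPORT))
```

To run it against a real backup, read the saved file's contents and pass them to summarize_export in place of SAMPLE_EXPORT.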

    Now to get my bookmarks into pinboard.in. I broke out my trusty text editor and battered together the script below, which works just fine. A few hours later all my bookmarks are in pinboard.in, their bookmarklets are installed in my browser, and I'm loving their read-later features. Sean is a happy geek again.

    You can download my migration script. To use it:

    python delmigrate.py backup.xml username password

    Here's the source for the curious.

    from xml.dom import minidom
    import sys
    
    import urllib
    import urllib2
    import time
    
    user = sys.argv[2]
    
    password = sys.argv[3]
    
    endpoint = "https://api.pinboard.in"
    
    url = "/v1/posts/add?"
    
    #open the xml file to import from and parse it
    f = open(sys.argv[1], "r")
    
    doc = minidom.parse(f).documentElement
    
    #keep count of how many urls have been imported
    urlcount = 0
    
    count = 0
    ellength = len(doc.childNodes)
    
    failcount = 0
    while count < ellength:
        e = doc.childNodes[count]
    
        if e.nodeType == e.ELEMENT_NODE:
            print "import url %s" % urlcount
    
            #get the attributes from the xml
            href = e.getAttribute("href")
            description = e.getAttribute("description")
            extended = e.getAttribute("extended")
            tags = e.getAttribute("tag")
    
            dt = e.getAttribute("time")
            rargs = dict(url=href, description=description, extended=extended,
                            tags=tags, dt=dt)
            shared = e.getAttribute("shared")
    
            if shared.strip() == 'no':
                rargs['shared'] = 'no'
    
            #encode the unicode values as utf-8 byte strings for urlencode
            rargs = dict([k, v.encode('utf-8')] for k, v in rargs.items())
    
            print rargs
            #build the request to send
            #set up http auth for pinboard.in
            #doing this for every request may seem wasteful, but urllib2
            #seems to forget the auth details after a half dozen requests
            #if you don't
            password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
            password_manager.add_password(None, endpoint, user, password)
    
            auth_handler = urllib2.HTTPBasicAuthHandler(password_manager)
            opener = urllib2.build_opener(auth_handler)
    
            urllib2.install_opener(opener)
    
    
            request = urllib2.Request(endpoint + url + urllib.urlencode(rargs))
    
            #set the user agent
            request.add_header('User-Agent','SeansDeliciousMigrater')
            try:
    
                r = opener.open(request)
                #send the request and read the response
                response = minidom.parse(r).documentElement.getAttribute("code")
    
            except Exception, e:
                response = str(e)
    
            #if we get an invalid response we are probably being throttled,
            #so wait and retry, and abort after repeated failures
            if response !="done":
                failcount += 1
    
                print "Failure: Invalid response: %s" % response
                if failcount > 4:
    
                    print "Aborting: Invalid response %s" % response
                    break
                else:
                    print "waiting for 30 seconds and retrying"
    
                    time.sleep(30)
            else:
                failcount = 0
    
                count += 1
                #put in a delay between requests to reduce odds of throttling
                time.sleep(1)
    
                urlcount += 1
        else:
            count += 1
    
    print "%s urls imported" % urlcount
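
The script above targets Python 2 (urllib2, print statements). For readers on Python 3, the request-building step could be sketched roughly as follows. This is only a sketch under that assumption: the build_add_url name and the example bookmark are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "https://api.pinboard.in"
ADD_PATH = "/v1/posts/add?"


def build_add_url(href, description, extended="", tags="", dt="", shared=""):
    """Build the posts/add URL for a single bookmark."""
    args = dict(url=href, description=description, extended=extended,
                tags=tags, dt=dt)
    # Only private bookmarks get the extra flag, as in the script above.
    if shared.strip() == "no":
        args["shared"] = "no"
    # In Python 3, urlencode percent-encodes str values as UTF-8 by default,
    # so no manual encode step is needed.
    return ENDPOINT + ADD_PATH + urlencode(args)


# Hypothetical bookmark with a non-ASCII character in it.
example = build_add_url("http://example.com/?q=caf\u00e9", "Caf\u00e9 notes",
                        tags="coffee python", shared="no")
print(example)
```

The resulting URL can then be opened with urllib.request and an HTTP basic auth handler, in the same way the Python 2 script uses urllib2.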
    
  • The Future comes to pass 00:30 Tuesday the 27th of April 2010 0 Comments

    All the shares are owned by those companies in equal measure, and I can tell you that their regulations are written in Python.

    Charles Stross - Accelerando 2005

    We are proposing to require that most ABS issuers file a computer program that gives effect to the flow of funds, or “waterfall,” provisions of the transaction. We are proposing that the computer program be filed on EDGAR in the form of downloadable source code in Python. …

    SECURITIES AND EXCHANGE COMMISSION - 17 CFR Parts 200, 229, 230, 232, 239, 240, 243 and 249 Release Nos. 33-9117; 34-61858; File No. S7-08-10 RIN 3235-AK37 ASSET-BACKED SECURITIES - 2010

    via Sean McGrath

  • Streaming uploads to S3 with Python and Poster 16:25 Sunday the 24th of January 2010 5 Comments

    Every Amazon S3 library I can lay my hands on (for Python at least) seems to read the entire file to be uploaded into memory before sending it. This might be OK when uploading lots of small files, but I have needed to upload a lot of very large files, and my poor old server would creak under the weight of that kind of memory usage.

    I managed to bolt a solution together using urllib2 and poster that has been working reliably for me for the past few months. I'm going to show you:

    1. A little about how S3 works
    2. How to use Poster
    3. A simple script to stream uploads to S3

    A little about how S3 works

    S3 is essentially a big Python dictionary in the cloud: you give it a key and a value (a file) to store, and later on you can read it back out again. S3 has a nice HTTP API, so you can read and write to the store using standard HTTP libraries.

    The area you put your files into is called a bucket. Bucket names (which have restrictions) are globally unique. That is, if you make a bucket called holiday_photos, then no one else using S3 can have a bucket called holiday_photos. This might sound weird, but it has its advantages: you can now access your files from http://holiday_photos.s3.amazonaws.com/. If you set the permissions up so anyone can read the contents of the bucket, the whole world can see your files via http://holiday_photos.s3.amazonaws.com/.

    The flip side of this is that you can upload your files, let's say "meonthebeach.jpg", by using HTTP PUT, in this case a PUT to http://holiday_photos.s3.amazonaws.com/meonthebeach.jpg.

    When uploading to S3, we need to provide a few HTTP headers along with our file data when we PUT.

    • Date - The current date and time in a specific format, e.g. Wed, 01 Mar 2006 12:00:00 GMT. I generate it with time.strftime("%a, %d %b %Y %X GMT", time.gmtime())
    • Content-Type - The MIME type of the file being uploaded, e.g. text/html. Python's mimetypes module does a good job of guessing this for any given file based on its extension: mimetypes.guess_type(filename)[0]
    • Content-Length - The length of the data to be uploaded, as defined in RFC 2616. If you are uploading the file from disk you can get this with the os module's stat function: os.stat(filename).st_size
    • x-amz-acl - Optional. This tells S3 which access control policy to apply. By default the file is available only to the owner of the bucket; to make it publicly readable, set the header to public-read
    • Authorization - This is the tricky one. S3 requires that your PUT request be accompanied by an authorization string in the following format: AWS AWS_ACCESS_KEY_ID:SIGNATURE. The AWS_ACCESS_KEY_ID is the one provided to you when you signed up to S3

      The signature is built from a string consisting of several of the headers you are sending, along with the resource you are putting, concatenated and hashed with your AWS secret access key. Constructing the signature is quite complicated in the general case, so I am going to show a method of generating it for the specific type of upload request we will be making. If you need to send headers that we are not using here, see Amazon's documentation for how to create the authentication header.

      The signature string consists of

      PUT\n\n<content-type>\n<date>\nx-amz-acl:public-read\n<resource>

      A code example of creating this:

      sig_data = "PUT\n\n%s\n%s\nx-amz-acl:public-read\n%s" % (
      content_type, date, resource)

      We then compute an HMAC-SHA1 of this string, keyed with your secret access key, and base64 encode the result.

      signature = base64.encodestring(
          hmac.new(settings.AWS_SECRET_ACCESS_KEY, sig_data, sha1).digest()
      ).strip()
      
      

      and that's your signature.
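
For readers on Python 3, where hmac needs bytes and base64.encodestring is no longer available, the same signing steps might look like the sketch below. It follows the signature string described in this post. The sign_put and http_date names and the credential values are placeholders, and email.utils.formatdate is swapped in for strftime as one way to get a locale-independent GMT date.

```python
import base64
import hmac
from email.utils import formatdate
from hashlib import sha1


def http_date():
    """Return the current time formatted like 'Wed, 01 Mar 2006 12:00:00 GMT'."""
    return formatdate(usegmt=True)


def sign_put(secret_key, content_type, date, resource):
    """Return the base64 encoded HMAC-SHA1 signature for a public-read PUT."""
    sig_data = "PUT\n\n%s\n%s\nx-amz-acl:public-read\n%s" % (
        content_type, date, resource)
    digest = hmac.new(secret_key.encode("utf-8"),
                      sig_data.encode("utf-8"), sha1).digest()
    return base64.b64encode(digest).decode("ascii")


# Placeholder values, not real credentials.
signature = sign_put("not-a-real-secret", "image/jpeg",
                     "Wed, 01 Mar 2006 12:00:00 GMT",
                     "/holiday_photos/meonthebeach.jpg")
auth_header = "AWS %s:%s" % ("NOT-A-REAL-KEY-ID", signature)
print(auth_header)
```

The date passed to sign_put has to be the same string that is sent in the Date header, otherwise the signature will not match the request.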

    How to use Poster

    Poster is a small library that works with urllib2 to allow streaming uploads. All you need to do is import it and call a single function, which registers Poster's custom URL openers with urllib2, and you are good to go.

    import urllib2
    
    from poster.streaminghttp import register_openers
    register_openers()
    

    Secondly, we need to tell urllib2 to use HTTP PUT rather than POST. We do this by creating a request object and overriding its get_method:

    request = urllib2.Request(url, data=data)
    
    request.get_method = lambda: 'PUT'
    

    And then we can make our request and read the response

    response = urllib2.urlopen(request).read()
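
As an aside for Python 3 readers: urllib2 became urllib.request, and its Request class accepts a method argument, so the get_method override is not needed there. A minimal sketch, with a hypothetical URL and payload, that only builds the request object and sends nothing:

```python
import urllib.request

# Hypothetical URL and payload, used only to build the request object;
# nothing is sent over the network here.
url = "http://holiday_photos.s3.amazonaws.com/meonthebeach.jpg"
data = b"example bytes"

# method="PUT" replaces the get_method override used with urllib2.
request = urllib.request.Request(url, data=data, method="PUT")
print(request.get_method())
```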
    
    

    The last step when using Poster is that, rather than being the file object to be uploaded, data should be an iterator that provides the file data chunk by chunk. For example:

    def read_data(file_object):
    
        while True:
            r = file_object.read(64 * 1024)
    
            if not r:
                break
            yield r
    
    #open in binary mode so the bytes are sent unchanged
    f = open("text.txt", "rb")
    
    data = read_data(f)
    
    

    data is now a generator that will return our file in chunks of up to 64 KB.
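
To see the chunking in action without touching the disk, here is a small Python 3 sketch that runs the same generator over an in-memory file. The chunk_size parameter and the 150 KB test payload are additions for illustration only.

```python
import io


def read_data(file_object, chunk_size=64 * 1024):
    """Yield the file's contents in chunks of at most chunk_size bytes."""
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            break
        yield chunk


# An in-memory stand-in for a file on disk: 150 KB of zero bytes.
fake_file = io.BytesIO(b"\0" * (150 * 1024))
sizes = [len(chunk) for chunk in read_data(fake_file)]
print(sizes)
```

Only one chunk is held in memory at a time, which is what keeps the upload's memory footprint small regardless of file size.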

    A simple script to stream uploads to S3

    Below is the source for a simple command line tool that takes a filename, a bucket name, and Amazon credentials, and uploads the file to the bucket, making it publicly readable.

    import os
    
    import sys
    import time
    import base64
    import hmac
    import mimetypes
    
    import urllib2
    
    from hashlib import sha1
    
    from poster.streaminghttp import register_openers
    
    def read_data(file_object):
        while True:
            r = file_object.read(64 * 1024)
    
            if not r:
                break
            yield r
    
    def upload_file(filename, bucket, AWS_ACCESS_KEY_ID, 
                  AWS_SECRET_ACCESS_KEY):
        length = os.stat(filename).st_size
        content_type = mimetypes.guess_type(filename)[0]
        resource = "/%s/%s" % (bucket, filename)
    
        url = "http://%s.s3.amazonaws.com/%s" % (bucket, filename)
    
        date = time.strftime("%a, %d %b %Y %X GMT", time.gmtime())
    
        sig_data = "PUT\n\n%s\n%s\nx-amz-acl:public-read\n%s" % (
                                                content_type, date, resource)
        signature = base64.encodestring(
                    hmac.new(
                        AWS_SECRET_ACCESS_KEY, sig_data, sha1).digest()).strip()
    
        auth_string = "AWS %s:%s" % (AWS_ACCESS_KEY_ID, signature)
    
        register_openers()
        input_file = open(filename, 'rb')
    
        data = read_data(input_file)
        request = urllib2.Request(url, data=data)
    
        request.add_header('Date', date)
        request.add_header('Content-Type', content_type)
    
        request.add_header('Content-Length', length)
        request.add_header('Authorization', auth_string)
    
        request.add_header('x-amz-acl', 'public-read')
        request.get_method = lambda: 'PUT'
    
        urllib2.urlopen(request).read()
    
    if __name__ == "__main__":
    
        filename = sys.argv[1]
        bucket = sys.argv[2]
    
        AWS_ACCESS_KEY_ID = sys.argv[3]
        AWS_SECRET_ACCESS_KEY = sys.argv[4]
    
        upload_file(filename, bucket, AWS_ACCESS_KEY_ID, 
                 AWS_SECRET_ACCESS_KEY)
    
  • Send me your OPML 15:22 Sunday the 7th of June 2009 1 Comments

    I used to work with a guy (Hi Daniel) who got everyone he knew to send him OPML files from their RSS readers so he could find new gems to subscribe to. I'm feeling kind of bored at the moment, so I am going to repeat his experiment. Anyone who reads this, or sees the related tweet, please send me your OPML file. If your RSS reader makes it difficult to export a list of links, then by all means send them in whatever format you like.

    In a week's time, I'll take the results, crunch 'em a little, and put them up for all to see, so you can get the benefit too. My email address can be grabbed from the contact link to the left. Come on, send me your links!

    For the curious, here is my current list of feeds.

  • Readability 00:30 Thursday the 21st of May 2009 0 Comments

    Readability is a bookmarklet that removes clutter from webpages to make them more readable. I read from computer screens a lot, but when it comes to longer text I actually prefer reading from the tiny screen on my mobile phone to reading from a laptop monitor.

    I recently began a little reading on typography, and learned of the concept of the comfortable measure. Essentially, approximately 66 characters per line is regarded as the ideal width for readable text.

    While Readability does not hit that mark exactly, it's a lot closer than the average over-wide web layout. Give it a try; it can return a lot of the pleasure of reading to computers.

© Copyright 2004-2010 Sean O'Donnell