Will

Will Webberley

Computer science PhD student in Cardiff - internet/mobile/social computing enthusiast

Talk on Open-Source Contribution

Today I gave an internal talk at the School of Computer Science & Informatics about open-source contribution.

The talk described some of the disadvantages of the ways in which hobbyists and the non-professional sector publish their code publicly. Much of the time, these projects do not receive much visibility or use from others.

Public contribution is important to the open-source community, which is driven largely by volunteers and enthusiasts, so the point of the talk was to try and encourage people to share expert knowledge through contributing documentation (wikis, forums, articles, etc.), maintaining and adopting packages, and getting more widely involved.

Seminar at King's College London

Last week, I was invited to give a seminar to the Agents and Intelligent Systems group in the Department of Informatics at King's College London.

I gave an overview of my PhD research conducted over the past two or three years, from my initial research into retweet behaviours and propagation characteristics through to studies on the properties exhibited by Twitter's social graph and the effects that the interconnection of users has on message dissemination.

I finished by outlining our methods for identifying interesting content on Twitter and by demonstrating their relative strengths and weaknesses, as made clear by the crowd-sourced validations carried out on the methodology's results.

There were some very interesting and useful questions from the audience, some of which are now being taken into consideration in my thesis. It was also good to visit another computer science department and to hear about the work done independently and collaboratively by its different research groups.

The slides from the seminar are available here and there is a blog post about it on the Department of Informatics' website.

Direct-to-S3 Uploads in Node.js

A while ago I wrote an article for Heroku's Dev Center on carrying out direct uploads to S3 using a Python app for signing the PUT request. Specifically, the article focussed on Flask but the concept is also applicable to most other Python web frameworks.

I've recently had to implement something similar, but this time as part of a Node.js application. Since the only real difference between the two approaches is the endpoint used to return a signed request URL, I thought I'd post an update on how that endpoint could be constructed in Node.

The example below assumes use of the Express.js framework, and requires the crypto Node package. The front-end code in the companion repository demonstrates an example of how the endpoint can be queried to retrieve the signed URL, and is available here. Take a look at that repository's README for information on the front-end dependencies.

var crypto = require('crypto'); // Node's built-in crypto module, used to sign the request

app.get('/sign_s3_upload', function(req, res){
    var objectName = req.query.s3_object_name;
    var mimeType = req.query.s3_object_type;
    var now = new Date();
    var expires = Math.ceil((now.getTime() + 100000)/1000); // 100 seconds from now, for example
    var amzHeaders = "x-amz-acl:public-read";
    var stringToSign = "PUT\n\n"+mimeType+"\n"+expires+"\n"+amzHeaders+"\n/"+s3_bucket+"/"+objectName;
    // Sign the string with the AWS secret key; URL-encode the signature since base64 may contain '+' or '/'
    var signature = crypto.createHmac('sha1', s3_secret).update(stringToSign).digest('base64');
    var url = 'https://'+s3_bucket+'.s3.amazonaws.com/'+objectName;
    var credentials = {
        signed_request: url+"?AWSAccessKeyId="+s3_key+"&Expires="+expires+"&Signature="+encodeURIComponent(signature),
        url: url
    };
    res.write(JSON.stringify(credentials));
    res.end();
});

The variables s3_bucket, s3_key, and s3_secret also need to be set - probably from a settings file. The full example referenced by the Python article is in a repository hosted by GitHub and may be useful in providing more context.

llavac

Have you ever wanted to be able to write Java in Welsh? No? Neither have I. However, with half an hour spare, I thought it'd be a fun (yet relatively pointless) little project to help learn the basics of Perl.

llavac is a Perl script acting as a simple wrapper for the command-line Java compiler (javac). It works by carrying out basic string replacements on a currently incomplete set of Welsh Java keywords in order to create a temporary 'English' Java source file, which is then compiled and deleted.

Below is a simple "helo!" Java program written in Welsh.

cyhoedd dosbarth Example{ 
    cyhoedd sefydlog ddi-rym main(String[] args){ 
        System.out.println("helo!"); 
    } 
} 

Use llavac in place of directly running javac to compile it, before running it as a normal Java program. Compiler errors will still be shown in English, however!

$ ./llavac.pl Example.java
$ java Example

The script is available from this repository.

Workshop Presentation in Germany

Last week I visited Karlsruhe, in Germany, to give a presentation accompanying a recently-accepted paper. The paper, "Inferring the Interesting Tweets in Your Network", was in the proceedings of the Workshop on Analyzing Social Media for the Benefit of Society (Society 2.0), which was part of the Third International Conference on Social Computing and its Applications (SCA).

Although I only attended the first workshop day, there was a variety of interesting talks on social media and crowdsourcing. My own talk went well and there was some useful feedback from the attendees.

I presented my recent work on the use of machine learning techniques to help in identifying interesting information in Twitter. I rounded up some of the results from the Twinterest experiment we ran a few months ago and discussed how this helped address the notion of information relevance as an extension to global interestingness.

I hadn't been to Germany before this, so it was also a culturally-interesting visit. I was only there for two nights but I tried to make the most of seeing some of Karlsruhe and enjoying the traditional food and local beers!

CasaStream

In my last post I discussed methods for streaming music to different zones in the house. More specifically I wanted to be able to play music from one location and then listen to it in other rooms at the same time and in sync.

After researching various methods, I decided to go with a compressed MP3 stream over RTP. Other techniques introduced too much latency, did not provide the flexibility I required, or simply did not fulfill the requirements (e.g. not multi-room, only working with certain applications, or not supporting simultaneous playback).

To streamline the procedure of compressing the stream, broadcasting the stream, and receiving and playing the stream, I have started a project to create an easily-deployable wrapper around PulseAudio and VLC. The system, somewhat cheesily named CasaStream and currently written primarily in Python, relies on a network containing one machine running a CasaStream Master server and any number of machines running a CasaStream Slave server.

The Master server is responsible for compressing and broadcasting the stream, and the Slaves receive and play the stream back through connected speakers. Although the compression is relatively resource-intensive (at least, for the moment), the Slave server is lightweight enough to be run on low-powered devices, such as the Raspberry Pi. Any machine that is powerful enough to run the Master could also simultaneously run a Slave, so a dedicated machine to serve the music alone is not required.

The Master server also runs a web interface, which allows the system to be enabled and disabled and individual Slaves to be toggled on and off. Slave servers are automatically discovered by the Master, though it is also possible to alter the scan range from the web interface. In addition, the selection of audio sources to stream (and their output volumes) and the renaming of Slaves are available as options. Sound sources are usually detected automatically by PulseAudio (if it is running), so there is generally no manual intervention required to 'force' the detection of sources.

My current setup consists of a Master server running on a desktop machine in the kitchen, and Slave servers running on various other machines throughout the house (including the same kitchen desktop connected to some orbital speakers and a Raspberry Pi connected to the surround sound in the living room). When all running, there is no notable delay between the audio output in the different rooms.

There are a few easily-installable dependencies required to run both servers. Both require Python (works on V2.*, but I haven't tested on V3), and both require the Flask microframework and VLC. For a full list, please see the README at the project's home, which also provides more information on the installation and use.

Unfortunately, there are a couple of caveats: firstly, the system is not reliable over WLAN (the sound gets pretty choppy), so a wired connection is recommended. Secondly, if using ethernet-over-power to mitigate the first caveat, then you may experience sound dropouts every 4-5 minutes. To help with this problem, the Slave servers are set to restart the stream every four minutes (by default).

This is quite an annoying issue, however, since having short sound interruptions every few minutes is very noticeable. Some of my next steps with this project, therefore, are based around trying to find a better fix for this. In addition, I'd like to reduce the dependency footprint (the Slave servers really don't need to use a fully-fledged web server), reduce the power requirements at both ends, and to further automate the installation process.

Zoned Network Sound-Streaming: The Problem

For a while now, I have been looking for a reliable way to manage zoned music-playing around the house. The general idea is that I'd like to be able to play music from a central point and have it streamed over the network to a selection of receivers, which could be remotely turned on and off when required, while still allowing multiple receivers to play simultaneously.

Apple's AirPlay has supported this for a while now, but requires the purchase of AirPlay-compatible hardware, which is expensive. It's also very iTunes-centric - and iTunes is something that I do not use.

Various open-source tools also allow network streaming. Icecast (through the use of Darkice) allows clients to stream from a multimedia server, but this causes pretty severe latency in playback between clients (ranging up to around 20 seconds, I've found) - not a good solution in a house!

PulseAudio is partly designed around being able to work over the network, and supports the discovery of other PulseAudio sinks on the LAN and the selection of a sound card to transmit to over TCP. This doesn't seem to support multiple sound card sinks very well, however.

PulseAudio's other network feature is its RTP broadcasting, and this seemed the most promising avenue for solving this problem. RTP utilises UDP, and PulseAudio effectively uses this to broadcast its sound to any devices on the network that might be listening on the broadcast address. This means that one server could be run and sink devices could be set up simply to receive the RTP stream on demand - perfect!

However, in practice, this turned out not to work very well. With RTP enabled, PulseAudio would entirely flood the network with sound packets. Although this isn't a problem for devices with a wired connection, any devices connected wirelessly to the network would be immediately disassociated from the access point due to the complete saturation of PulseAudio's packets being sent over the airwaves.

This couldn't be an option in a house where smartphones, games consoles, laptops, and so on require the WLAN. After researching this problem a fair bit (and finding many others experiencing the same issues), I found this page, which describes various methods for using RTP streaming from PulseAudio and includes (at the bottom) the key that could fix my problems - the notion of compressing the audio into MP3 format (or similar) before broadcasting it.
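As a rough illustration of this idea (not CasaStream's exact invocation - the multicast address and bitrate here are just placeholders), VLC can read from PulseAudio, transcode the audio to MPEG audio and broadcast the result over RTP, with each receiver simply playing the stream back. The first command would run on the machine producing the sound, the second on each receiving machine:

$ cvlc pulse:// --sout '#transcode{acodec=mpga,ab=128}:rtp{dst=239.255.12.42,port=5004,mux=ts}'
$ cvlc rtp://@239.255.12.42:5004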

Trying this technique worked perfectly, and it did not cause network floods anywhere near as severe as the uncompressed sound stream; wireless clients no longer lost access to the network once the stream was started and didn't seem to lose any noticeable QoS at all. In addition, when multiple clients connected, the sound output was very nearly simultaneous (at least after a few seconds to warm up).

Unfortunately, broadcasting still didn't work well over WLAN (sound splutters and periodic drop-outs), so the master server and any sound sinks would need to be on a wired network. This is a small price to pay, however, and I am happy to live with a few Ethernet-over-power devices around the house. The next stage is to think about what to use as sinks. Raspberry Pis should be powerful enough and are significantly cheaper than Apple's equivalent. They would also allow me to use existing sound systems in some rooms (e.g. the surround-sound in the living room), and other simple speaker setups in others. I also intend to write a program around PulseAudio to streamline the streaming process and a server for discovering networked sinks.

I will write an update when I have made any more progress on this!

A rather French week

I recently spent a week in France as part of a holiday with some of my family. Renting houses for a couple of weeks in France or Italy each summer has almost become a bit of a tradition, and it's good to have a relax and a catch-up for a few days. They have been the first proper few days (other than the decking-building adventure back in March) I have had away from University in 2013, so I felt it was well-deserved!

This year we stayed in the Basque Country of southern France, relatively near Biarritz, in a country farmhouse. Although we weren't really within walking distance of anywhere, the house did come with a pool in the garden, a swimmable river just beyond, and an amazing, peaceful setting.

Strangely enough, there was no Internet installation at the house, and no cellular reception anywhere nearby. This took a bit of getting used to, but after a while it became quite relaxing not having to worry about checking emails, texts, and Twitter. The only thing to cause any stress was a crazed donkey, living in the field next door, who would start braying loudly at random intervals through the night, waking everyone up.

As might be expected, the food and drink was exceptional. Although we did end up eating in the house each evening (to save having someone sacrifice themselves to be the designated driver), the foods we bought from the markets were very good, and the fact that wine cost €1.50 per bottle from the local Intermarché gave very little to complain about.

The majority of most days was spent away from the house, visiting local towns, the beaches and the Pyrenees. We spent a few afternoons walking in the mountains, with some spectacular scenery.

Fearing the Convenience of Google

I hate to use the term 'fan-boy', but I do get a bit Google-engrossed sometimes.

Like the Apple obsessives I see daily on many social networks, I often find myself going wildly out of my way to use new Google services and almost force myself to rely on them in many different aspects of my life. Last week saw Google Music All Access's UK launch, and I instantly paused Spotify and began playing around with the shiny new service - and this was nothing to do with the fact that it is 20% cheaper than Spotify Premium or its 1 month free trial. Within a day, my playlists were copied over and the Spotify app shortcut was removed from my OS X dock and my Android homescreens. Even comparatively mundane points, such as the inclusion of an equaliser in the Android app, were included in my arguments with friends on the matter.

As many people do, I use GMail, Hangouts, Chrome, Google Calendar, Google Drive, YouTube, Google Plus, Analytics, etc. (and, though this often ironically gets taken for granted, their search engine) on a daily basis as my primary products in their respective fields, and, also as many people do, I love their integration with each other, my Android devices, and the web in general. This last point is the key here - I almost feel at home and 'safe' when I'm using a Google service.

Without thinking, you can quickly get to the stage where you are nearly wholly reliant on these services for your day-to-day work, social, and personal lives - but, why not? They are the best out there; they're reliable, they're 'free', fast, up-to-date, and they're familiar. Even most iOS users I know use GMail and Google Maps in preference to their Apple equivalents, and as Google's user base grows they are able to collect more data on their customers' web usage habits, search history, and much more. The result of this for the customer is that the services become more intuitive, more useful, and more features get implemented.

So, what's the problem? Well, there isn't one, yet, and I can't really see one coming - it's mainly the fear of the problem that is the problem.

You don't have to look far to find horror stories in which people suddenly and inexplicably (they claim) lose access to their Google account. Immediately, they lose their email with several years' worth of history, invoices, important information, insurance documents, 'evidence', etc. They lose access to their calendars and schedules, the documents they've synced in Google Drive, holiday photos, music, books, the ability to download apps to their Android device. The list goes on, and includes the now-seemingly trivial aspects such as contact syncing and Chrome bookmarks, which on their own would be a pretty substantial loss.

Happily, this doesn't happen very often and most victims seem to regain their accounts eventually, but only after a pretty intense period of disconnection in their lives. However, even the demise of iGoogle and, more recently (and to much more of an uproar), Google Reader, seemed to cause big problems for a lot of people. One of the contributors to the fear, in this respect, is Google's lack of communication with 'standard' (non-paying) customers, and indeed many people complain about this as one of Google's shortcomings. However, when one considers the daily traffic to Google and its services, it'd have to be one of the largest customer services departments in the world to handle everyone's problems - not feasible for a company which focuses on innovation and creation.

Another 'fear', though one which is secondary for me (and which undoubtedly bothers far fewer people in general), is Google's position when it comes to your data.

Whilst Google, along with other tech giants (such as Facebook, Apple, and Microsoft), is pretty good at being transparent when it comes to things like government requests for customer data and for DMCA takedowns, there are also concerns from some people about data 'back doors' allowing organisations access to their data at will. In addition, articles flare up all over the web as soon as any information arises that could affect people's data (including the reaction to the wrongly-reported "customers should have no expectation of privacy" story), which just shows how much people do want their information that they freely give to large organisations to be kept private.

People were angry when they discovered that Google reads emails to target adverts at its customers, even though pretty much the exact same process occurs to detect if the email is spam. At the end of the day, the data is 'private' enough that those who you'd really be bothered about reading the information you write would never actually be given the chance to read it. The fact that many of those who do complain often don't really do any more about it, and then continue to use the services as they were anyway, suggests that they either don't really care that much about it and/or they feel they can't conveniently live without the services anyway.

This brings us back round to the first point. How would you live without these services? Well, if you suffer from either of the fears, then there are certainly options. However, most of the time, these are a hassle to set up, are unreliable (and probably actually more vulnerable to attack), and are not really in reach of the every-day Internet user. How you do it depends on how strongly you feel about losing your data or who has control of it.

Several organisations exist to help people in reducing their online observable data footprint (including Duck Duck Go, secure calendar and scheduler applications, and the use of GnuPG for email) - all of which are easy enough for many people to find, set up, and use. These kinds of services are useful if you feel more strongly about your privacy (like Richard Stallman, who is famous for his strong views regarding the use of services which place the control of his information out of his hands) and would prefer to prioritise data privacy over convenience.

If using many services is too fragmented and you feel that having to login to many different places is a hassle, then you could consider hosting services yourself. Applications such as Cozy Cloud and Own Cloud allow you to host your own versions of these services (mail, contacts, calendar, photos, to-do lists, and so on) wherever you want and in one place. If you're super keen on privacy, you could set up a machine in a trusted location and host services on that, but it's just as easy (and possibly more reliable) to get a VPS instead.

All of this is at a cost to convenience, however. These products are designed to cater for day-to-day tasks, and people don't generally have time to configure websites and services all over the web or maintain hardware for their personal mail server or keep their VPS software up-to-date. People need services that are expected to work, services that aren't taken down when the small company they entrusted their data to encounters financial difficulties, and services that don't reject important emails if the mail server trips up for a few hours. The ability to synchronise contacts, mail and photos from and between all of your mobile devices with a single sign-in is so essential for every-day users and technology-enthusiasts alike, and is usually taken completely for granted.

Relying on each independent piece of the resultant online jigsaw and living with the fear that any part could fail silently at any time is severely outweighed by the familiar structure, robustness, and reliability of having it all working together in one place and having it maintained by an organisation that knows what it's doing. Sure, I certainly fear the idea of suddenly losing access to the account (accidentally, or otherwise), and I would definitely suffer if anything did happen to it (as it's essentially a collection of tools for my online and mobile life), but sacrificing the comfort, convenience and habits associated with Google for safety and/or privacy is not feasible for me at the moment. However, I hope that putting this into perspective will make me more prepared for and embrace independent services more as they emerge on the web.

Gower Tides v1.4

Surf forecasts

Last week I released a new version of the tides Android app I'm currently developing.

The idea of the application was initially to simply display the tidal times and patterns for the Gower Peninsula, and that this should be possible without a data connection. Though, as time has gone by, I keep finding more and more things that can be added!

The latest update saw the introduction of 5-day surf forecasts for four Gower locations - Llangennith, Langland, Caswell Bay, and Hunts Bay. All the surf data comes from Magic Seaweed's API (which I talked about last time).

Location choices

The surf forecasts are shown, for each day they are available, as a horizontal scroll-view, allowing users to scroll left and right within that day to view the forecast at different times of the day (in 3-hourly intervals).
Location selection is handled by a dialog popup, which shows a labelled map and a list of the four available locations in a list view.

The backend support for the application was modified to now also support 30-minute caching of surf data on a per-location basis (i.e. new calls to Magic Seaweed would not be made if the requested location had been previously pulled in the last 30 minutes). The complete surf and weather data is then shipped back to the phone as one JSON structure.
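As a rough sketch of that caching logic (this is illustrative rather than the actual backend code - the in-memory cache, the fetch_surf_forecast() helper and the placeholder URL are all assumptions), the idea in Flask looks something like this:

import json
import time
import requests
from flask import Flask

app = Flask(__name__)

FORECAST_URL = 'https://example-surf-api/forecast?spot=%d'  # placeholder, not the real Magic Seaweed endpoint
CACHE = {}                  # location id -> (timestamp, data)
CACHE_LIFETIME = 30 * 60    # 30 minutes, in seconds

def fetch_surf_forecast(location_id):
    # Hypothetical helper that asks the surf API for this location's forecast
    return requests.get(FORECAST_URL % location_id).json()

@app.route('/surf/<int:location_id>')
def surf(location_id):
    now = time.time()
    cached = CACHE.get(location_id)
    if cached and now - cached[0] < CACHE_LIFETIME:
        data = cached[1]    # still fresh; no new call to Magic Seaweed
    else:
        data = fetch_surf_forecast(location_id)
        CACHE[location_id] = (now, data)
    return json.dumps(data)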

Tides view update

Other updates were smaller but included an overhaul of the UI (the tide table now looks a bit nicer), additional licensing information, more speedy database interaction, and so on.

If you are interested in the source, then that is available here, and the app itself is on Google Play. If you have any ideas, feedback or general comments, then please let me know!

Magic Seaweed's Awesome New API (Beta)

Back in March, I emailed Magic Seaweed to ask them if they had a public API for their surf forecast data. They responded that they didn't at the time, but that it was certainly on their to-do list. I am interested in the marine data for my Gower Tides application.

Yesterday, I visited their website to have a look at the surf reports and some photos, when I noticed the presence of a Developer link in the footer of the site. It linked to pages about their new API, with an overview describing exactly what I wanted.

Since the API is currently in beta, I emailed them requesting a key, which they were quick to respond with and helpfully included some further example request usages. They currently do not have any strict rate limits in place, but instead have a few fair practice terms to discourage developers from going a bit trigger happy on API requests. They also request that you use a hyperlinked logo to accredit the data back to them. Due to caching, I will not have to make too many requests (since the application will preserve 'stale' data for 30 minutes before refreshing from Magic Seaweed, when requested), so hopefully that will keep the app's footprint down.

I have written the app's new backend support for handling and caching the surf data ready for incorporating into the Android app soon. So far, the experience has been really good, with the API responding with lots of detailed information - almost matching the data behind their own surf forecasts. Hopefully they won't remove any of the features when they properly release it!

Accidental Kernel Upgrades on Digital Ocean

I today issued a full upgrade of the server at flyingsparx.net, which is hosted by Digital Ocean. By default, on Arch, this will upgrade every currently-installed package (where there is a counterpart in the official repositories), including the Linux kernel and the kernel headers.

Digital Ocean maintain their own kernel versions, and the kernel in use cannot currently be switched from within the droplet itself, which is something I completely forgot. I rebooted the machine and tried re-connecting, but SSH couldn't find the host. Digital Ocean's website provides a console for connecting to the instance (or 'droplet') through VNC, which I used, and through which I discovered that none of the network interfaces (except the loopback) were being brought up. I tried everything I could think of to fix this, but without being able to connect the droplet to the Internet, I was unable to download any other packages.

Eventually, I contacted DO's support, who were super quick in replying. They pointed out that the upgrade may have also updated the kernel (which, of course, it had), and that therefore the modules for networking weren't going to load properly. I restored the droplet from one of the automatic backups, swapped the kernel back using DO's web console, rebooted and things were back to where they should be.

The fact that these things can be fixed instantly from their console, combined with their quick customer support, makes Digital Ocean awesome! If that hadn't been possible, this would have been a massive issue, since the downtime also took out this website and the backend for a couple of mobile apps. If you use an Arch instance, then there is a community article on their website explaining how to make pacman ignore kernel upgrades and stop this from happening.
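The gist of that fix is to add the kernel packages to the IgnorePkg line in /etc/pacman.conf, something along these lines (the exact package names may vary with your setup):

IgnorePkg = linux linux-headers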

WekaPy

Over the last few months, I've started to use Weka more and more. Weka is a toolkit, written in Java, that I use to create models with which to make classifications on data sets.

It features a wide variety of different machine learning algorithms (although I've used the logistic regressions and Bayesian networks most) which can be trained on data in order to make classifications (or 'predictions') for sets of instances.

Weka comes as a GUI application and also as a library of classes for use from the command line or in Java applications. I needed to use it to create some large models and several smaller ones, and using the GUI version makes the process of training the model, testing it with data and parsing the classifications a bit clunky. I needed to automate the process a bit more.

Nearly all of the development work for my PhD has been in Python, and it'd be nice to just plug in some machine learning processes over my existing code. Whilst there are some wrappers for Weka written for Python (this project, PyWeka, etc.), most of them feel unfinished, are under-documented or are essentially just instructions on how to use Jython.

So, I started work on WekaPy, a simple wrapper that allows efficient and Python-friendly integration with Weka. It basically just involves subprocesses to execute Weka from the command line, but also includes several areas of functionality aimed to provide more of a seamless and simple experience to the user.

I haven't got round to writing proper documentation yet, but most of the current functionality is explained and demo'd through examples here. Below is an example demonstrating its ease of use:

model = Model(classifier_type = "bayes.BayesNet")
model.train(training_file = "train.arff")
model.test(test_file = "test.arff")

All that is needed is to instantiate the model with your desired classifier, train it with some training data and then test it against your test data. The predictions can then be easily extracted from the model as shown in the documentation.

I hope to continue updating the library and improving the documentation when I get a chance! Please let me know if you have any ideas for functionality.

Gower Tides Open-Sourced

This is just a quick post to mention that I have made the source for the Gower Tides app on Google Play public.

The source repository is available on GitHub. From the repository I have excluded:

  • Images & icons - It is not my place to distribute graphics not owned or created by me. Authors are credited in the repo's README and in the application.
  • External libraries - The app requires a graphing package and a class to help with handling locally-packaged SQLite databases. Links to both are also included in the repo's README.
  • Tidal data - The tidal data displayed in the app has also been excluded. However, the format for the data stored by the app should be relatively obvious from its access in the source.

Contribution to Heroku Dev Center

The Heroku Dev Center is a repository of guides and articles to provide support for those writing applications to be run on the Heroku platform.

I recently contributed an article for carrying out Direct to S3 File Uploads in Python, as I have previously used a very similar approach to interface with Amazon's Simple Storage Service in one of my apps running on Heroku.

The approach discussed in the article focuses on avoiding as much server-side processing as possible, with the aim of preventing the app's web dynos from becoming too tied up and unable to respond to further requests. This is done by using client-side JavaScript to asynchronously carry out the upload directly to S3 from the web browser. The only necessary server-side processing involves the generation of a temporarily-signed (using existing AWS credentials) request, which is returned to the browser in order to allow the JavaScript to successfully make the final PUT request.
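As a minimal sketch of what such a signing endpoint can look like in Flask using boto (this is illustrative rather than the exact code from the article; the route and the environment variable names are assumptions):

import os
from flask import Flask, request, jsonify
from boto.s3.connection import S3Connection

app = Flask(__name__)

@app.route('/sign_s3_upload/')
def sign_s3_upload():
    object_name = request.args.get('s3_object_name')
    mime_type = request.args.get('s3_object_type')

    # Sign a PUT request that expires shortly and marks the object as publicly readable
    conn = S3Connection(os.environ['AWS_ACCESS_KEY_ID'], os.environ['AWS_SECRET_ACCESS_KEY'])
    signed_url = conn.generate_url(
        expires_in=100,
        method='PUT',
        bucket=os.environ['S3_BUCKET'],
        key=object_name,
        headers={'Content-Type': mime_type, 'x-amz-acl': 'public-read'})

    url = 'https://%s.s3.amazonaws.com/%s' % (os.environ['S3_BUCKET'], object_name)
    return jsonify(signed_request=signed_url, url=url)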

The guide's companion git repository hopes to demonstrate a simple use-case for this system. As with all of the Heroku Dev Center articles, if you have any feedback (e.g. what could be improved, what helped you, etc.), then please do provide it!

Is Twitter's New API Really Such a Nightmare?

When the first version of the Twitter API opened, writing applications to interface with the popular microblogging service was a dream. Developers could quickly set up apps and access the many resources provided by the API and third parties were fast in creating easy-to-use wrappers and interfaces (in loads of different languages) for embedding Twitter functionality in all sorts of applications and services.

The API began by using Basic Authentication, at least for making the requests that required authentication (e.g. writing Tweets, following users, etc.). This is, generally, a Very Bad Idea, since it meant that client applications were required to handle the users' usernames and passwords and transmit these, along with any extra required parameters, in every request made to the API. Users had no idea (and no control over) what the organisations behind these applications did with the access credentials once they were provided with them.

Then, in 2010, the API moved on to OAuth. This was a much better authentication protocol, as it meant that users could directly authorise apps from Twitter itself and easily view which functions each individual app would be able to perform with their Twitter account. In addition, it meant that applications didn't need to receive and/or store the user's username and password; instead, an access token would be sent back to the app (after authentication), which would then be used to make the requests to the API. This access token could then be sent, along with the application's own key and secret key, with requests to the API, which would be able to recognise the authenticating user based on the access token and restrict/allow actions based on who the user is. Since apps could safely store the user's access token without too many security implications, it meant that the procedure was much more personalised and streamlined for the end-users.

What was cool was that there were still several methods exposed by Twitter's API that didn't require authentication. Things like retrieving a user's recent Tweets or the public timeline involved a simple JSON request that could easily be made from a client without authenticating first. This was particularly useful when used with JavaScript as clients could still request the information and, due to the distributed nature of clients (i.e. not making requests from a single IP or application signature), they wouldn't generally reach the rate limit for these methods.

It meant that you could embed a Twitter feed showing your recent Tweets on your website without having to hop through your own servers first.
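For example, under v1 a request along these lines would return a user's recent Tweets as JSON with no authentication at all (shown here in Python for brevity; it no longer works now that v1 has been retired):

import requests

# Old-style, unauthenticated API v1 call - exactly what v1.1 disallows
url = 'https://api.twitter.com/1/statuses/user_timeline.json'
tweets = requests.get(url, params={'screen_name': 'twitter', 'count': 5}).json()
print([tweet['text'] for tweet in tweets])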

Now Twitter have opened v1.1 of their API, with all methods from the previous version deprecated and expected to be removed completely some time in 2013. The main disadvantage with version 1.1 is that all requests to the API now require OAuth authentication. This means that client-side JavaScript Twitter requests will no longer be safely available (as clients would have access to the application's private key, amongst other things), and developers will be forced to use Twitter's own massive and unstylable widgets. Twitter themselves also (sensibly, I suppose) discourage users from trying to write their own client-side code for this.

Of course, you could modify your app so that your server makes the requests, authenticated with your own account, and then passes the response to the browser, but if your site is fairly popular and caching requests isn't appropriate for your purposes then you are at risk of running into rate limit issues. This leads me to another (slightly less important) disadvantage. Whilst the API used to grant each authenticated application 350 requests per hour, the rate limit system has now become unnecessarily complicated, with many methods having completely different request allowances per window (which has now been reduced to 15 minutes). On top of this, many resources actually have two rate limits - one for that particular user, and one for the app itself. They also have a handy table outlining the rate limits of each method. It's starting to become a bit more of a mess for developers, with many more things to think about.


Despite all the additional strictness with the API, there are actually several advantages. Requests that are user-focused (i.e. have a separate user-based rate limit) mean that your application, if used correctly, may be able to access more information before reaching the limits. This is also true of some of the application-based resources, such as "GET search/tweets" and "GET statuses/user_timeline", which now allow many more requests to be made in the same time frame than in API v1.

For other methods, though, it's not so great. Most of the user-based rate-limited methods allow 15 requests per window (equating to one request per minute). For me, and others who research Twitter, who require a fair amount of data, this will become a nightmare. There are also many app developers who are being impacted pretty heavily by the new changes, which includes Twitter's (slightly evil) new policy to restrict apps to 100,000 users.


Generally, there is a different set of advantages and disadvantages every way you look at it, but with the web's turn to the ubiquitous availability and propagation of information, and some other open and awesome APIs (including Foursquare's and Last.fm's), then it's hard to know in which direction Twitter is heading at the moment.

Feeling a Bit Bloated With MongoDB

When this blog was first set up, it was stored in a MongoDB database. MongoDB, a document-oriented NoSQL database, seemed perfect for a blog, since posts and their metadata could be stored easily as individual documents.

Each blog post consists of a unique numeric ID, its title, its contents, and some information on its creation time and edit times. Media is stored in a different directory, and is simply linked to from the post contents using standard HTML. Posts are, on average, around a couple of hundred words each.

Despite this, the files needed by MongoDB to store just over 15 smallish, text-only posts (at the time of writing) took up a massive 3.3GB of storage space. This isn't particularly useful to me, especially considering that the site is hosted on a virtual server instance on a shared cloud platform where my storage is limited to 20GB in total. I didn't want to consume nearly a sixth of this storage for storing a blog alone.

I realise there are various procedures that can try to compress the database down a bit (including repairing the database), but these tend to take a while to complete and block access to the data whilst they're in progress.

In the end I reverted to using a simple SQLite database, using SQLAlchemy to streamline the process and to preserve some of the document-based aspects (at least on the Python side). The file size of the new SQLite database, after transferring over the blog posts, came to 38KB, which is 0.001% of the size consumed by the same data in MongoDB.

I certainly appreciate that, in some circumstances, MongoDB can be much more useful than a simple SQLite database, but for my needs (where performance isn't massively important, but storage efficiency is), it didn't make much sense.

eartub.es

Last weekend I went to CFHack Open Sauce Hackathon. I worked in a team with Chris, Ross and Matt.

We started work on eartub.es, which is a web application for suggesting movies based on their soundtracks. We had several ideas for requirements we wanted to meet but, due to the nature of hackathons, we didn't do nearly as much as we thought we would!

For now, eartub.es allows you to search for a movie (from a database of 2.5 million movies) and view other movies with similar soundtracks. This is currently based on cross-matching composers between movies, but more in-depth functionality is still in the works. We have nearly completed Last.fm integration, which would allow the app to suggest movies from your favourite and most listened-to music, and are working towards genre-matching and other, more complex, learning techniques. The registration functionality is disabled while we add this extra stuff.

The backend is written in Python and runs as a Flask application. Contrary to my usual preference, I worked on the front end of the application, but also wrote our internal API for Last.fm integration. It was a really fun experience, in which everyone got on with their own individual parts, and it was good to see the project come together at the end of the weekend.

The project's source is on GitHub.

flyingsparx.net On Digital Ocean

My hosting for willwebberley.net has nearly expired, so I have been looking for renewal options.

These days I tend to need to use servers for more than simple web-hosting, and most do not provide the flexibility that a VPS would. Having (mostly) full control over a properly-maintained virtual cloud server is so much more convenient, and allows you to do tonnes of stuff beyond simple web hosting.

I have some applications deployed on Heroku, which is definitely useful and easy for this purpose, but I decided to complement this for my needs by buying a 'droplet' from Digital Ocean.

Droplets are DO's term for a server instance, and are super quick to set up (55 seconds from first landing at their site to a booted virtual server, they claim) and very reasonably priced. I started an Arch instance, quickly set up nginx, Python and uwsgi, and started this blog and site as a Python app running on the Flask microframework.

So far, I've had no issues, and everything seems to work quickly and smoothly. If all goes to plan, over the next few months I'll migrate some more stuff over, including the backend for the Gower Tides app.

Trials of Eduroam

I've been having trouble connecting to Eduroam, at least reliably and persistently, in some barebones GNU/Linux installs and basic window managers. Eduroam is the wireless networking service used by many universities in Europe, and whilst it would probably work fine using the tools provided by heavier DEs, I wanted something that could just run quickly and independently.

Many approaches require the editing of loads of config files (especially true for netcfg), which would need altering again after things like password changes. The approach I used (for Arch Linux) is actually really simple and involves the use of the user-contributed wicd-eduroam package available in the Arch User Repository.

Obviously, wicd-eduroam is related to, and depends on, wicd, a handy network connection manager, so install that first:

# pacman -S wicd
$ yaourt -S wicd-eduroam

(If you don't use yaourt, download the tarball and build it using the makepkg method.)

wicd can conflict with other network managers, so stop and disable them before starting and enabling wicd. This will allow it to start up at boot time. e.g.:

# systemctl stop NetworkManager
# systemctl disable NetworkManager
# systemctl start wicd
# systemctl enable wicd

Now start wicd-client (or set it to autostart), let it scan for networks, and edit the properties of the network eduroam. Set the encryption type as eduroam in the list, enter the username and password, click OK and then allow it to connect.

Cardiff Open Sauce Hackathon

Next week I, along with others in a team, am taking part in Cardiff Open Sauce Hackathon.

If you're in the area and feel like joining in for the weekend then sign up at the link above.

The hackathon is a two-day event in which teams work to 'hack together' smallish projects, which will be open-sourced at the end of the weekend. Whilst we have a few ideas already for potential projects, if anyone has any cool ideas for something relatively quick, but useful, to make, then please let me know!

A simple outbound mail server

Being able to send emails is an important part of a server's life, especially if it helps support a website. If you manage your own servers for running a website and need to send outgoing email (e.g. for newsletters, password resets, etc.), then you'll need to run an SMTP server to handle this for you.

You will need to have properly configured your DNS settings for email to work correctly. This is because most email providers will run rDNS (reverse-DNS) lookups on incoming email to ensure it isn't someone else pretending to send emails from your domain. An rDNS lookup basically involves matching the IP that your domain name (the part after the "@" sign in the email address) resolves to against the domain name that the IP maps back to in DNS. If the rDNS lookup fails, then email providers may automatically mark your emails as spam.

Your DNS host settings should point your domain name towards the IP of your host as an A record. In addition, it is sometimes necessary to add a TXT record (for the "@" subdomain) as v=spf1 ip4:xxx.xxx.xxx.xxx -all. This indicates to mail providers that the IP (represented by the x's) is authorised to send mail for this domain. This further reduces the chance that your email will be marked as spam. Since we are not intending to receive mail at this server, either leave the MX records blank, configure them to indicate a different server, set up a mail-forwarder, or something else.
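As an illustration (with placeholder values, following the same x's convention as above), the relevant records at your DNS host might look something like this:

domain.tld.    A      xxx.xxx.xxx.xxx
domain.tld.    TXT    "v=spf1 ip4:xxx.xxx.xxx.xxx -all"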

The following mail server setup is aimed at Arch Linux, but the gist of it should apply to many UNIX-based systems. The mail server I am covering is postfix. It can easily be installed (e.g. on Arch):

# pacman -S postfix

Once installed, edit the configuration file in /etc/postfix/main.cf so that these lines read something like this:

myhostname = mail.domain.tld
mydomain = domain.tld
myorigin = domain.tld

Next, edit the file /etc/postfix/aliases such that:

root: your_username

Replace your_username with the user who should receive root's mail.

Finally, refresh the alias list, enable the service so that postfix starts on boot, and then start postfix:

# cd /etc/postfix && newaliases
# systemctl enable postfix.service
# systemctl start postfix.service

You should now be able to send mail (e.g. through PHP, Python, Ruby, etc.) through this server. If you run the website on the same machine, simply tell the application to use localhost as the mail server, though this is usually default anyway.
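For example, a quick test from Python (with placeholder addresses) might look like this:

import smtplib
from email.mime.text import MIMEText

msg = MIMEText('Test message from the new mail server.')
msg['Subject'] = 'Postfix test'
msg['From'] = 'no-reply@domain.tld'   # placeholder addresses
msg['To'] = 'someone@example.com'

server = smtplib.SMTP('localhost')    # the postfix instance configured above
server.sendmail(msg['From'], [msg['To']], msg.as_string())
server.quit()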

Normal service resumed: AJAX + Python + Amazon S3

I wanted a way in which users can seamlessly upload images for use in the Heroku application discussed in previous posts.

Ideally, the image would be uploaded through AJAX as part of a data-entry form, but without having to refresh the page or anything else that would disrupt the user's experience. As far as I know, barebones jQuery does not support AJAX uploads, but this handy plugin does.

Handling the upload (AJAX)

I styled the file input nicely (in a similar way to this guy) and added the JS so that the upload is sent properly (and to the appropriate URL) when a change is detected to the input (i.e. the user does not need to click the 'upload' button to start the upload).

Receiving the upload (Python)

The backend, as previously mentioned, is written in Python as part of a Flask app. Since Heroku's customer webspace is read-only, uploads would have to be stored elsewhere. Boto's a cool library for interfacing with various AWS products (including S3) and can easily be installed with pip install boto. From this library, we're going to need the S3Connection and Key classes:

from boto.s3.connection import S3Connection
from boto.s3.key import Key

Now we can easily handle the transfer using the request object exposed to Flask's routing methods:

file = request.files['file_input_name']    # the uploaded file from the request
con = S3Connection('<AWS_KEY>', '<AWS_SECRET>')
key = Key(con.get_bucket('<BUCKET_NAME>'))
key.set_contents_from_file(file)           # push the file straight to the S3 bucket

Go to the next step for the AWS details and the bucket name. Depending on which AWS location you chose (e.g. US, Europe, etc.), your file will then be accessible at something like https://s3-eu-west-1.amazonaws.com/<BUCKET_NAME>/<FILENAME>. If you want, you can also set, among other things, stuff like the file's mime type and access type:

key.set_metadata('Content-Type', 'image/png')
key.set_acl('public-read')

Setting up the bucket (Amazon S3)

Finally you'll need to create the bucket. Create or log into your AWS account, go to the AWS console, choose your region (if you're in Europe, then the Ireland one is probably the best choice) and enter the S3 section. Here, create a bucket (the name needs to be globally unique). Now, go to your account settings page to find your AWS access key and secret and plug these, along with the bucket name, into the appropriate places in your Python file.

And that's it. For large files, this may tie up your Heroku dynos a bit while they carry out the upload, so this technique is best for smaller files (especially if you're only using the one web dyno). My example of a working implementation of this is available in this file.

A bit of light construction on an Easter weekend

It's a well-known fact that computer scientists fear all forms of physical labour above everything else (except perhaps awkward social mingling).

Despite this, I managed to turn about two tonnes of material into something vaguely resembling 'decking' in my back garden this weekend. It makes the area look much nicer, but whether it actually stays up is a completely different matter.

Gower Tides App Released

Get it on Google Play

A few posts back, I talked about the development of an Android app for tide predictions for South Wales. This app is now on Google Play.

If you live in South Wales and are vaguely interested in tides/weather, then you should probably download it :)

The main advantage is that the app does not need any data connection to display the tidal data, which is useful in areas with low signal. In future, I hope to add further features, such as a more accurate tide graph (using a proper 'wave'), surf reports, and just general UI updates.

Deploying to Heroku

In my last post, I talked about developing Python applications using Flask (with MongoDB to handle data). The next stage was to consider deployment options so that the application can be properly used.

Python is a popular language in the cloud, and so there are many cloud providers around who support this kind of application (Amazon Elastic Beanstalk, Google App Engine, Python Anywhere, etc.), but Heroku seems the most attractive option due to its logical deployment strategy, scalability and its range of addons (including providing the use of MongoDB).

First, download the Heroku toolbelt from their website. This allows various commands to be run to prepare, deploy and check the progress, logs and status of applications. Once installed, log into your account using your Heroku email address and password:

$ heroku login

Install the dependencies of your project (this should usually be done inside a virtual Python environment). In my case, these are Flask and Flask-MongoAlchemy:

$ pip install Flask
$ pip install Flask-MongoAlchemy

We now declare these dependencies so that they can be installed for your deployed app. This can be done using pip, which will populate a file of dependencies:

$ pip freeze > requirements.txt

The file requirements.txt should now list all the dependencies for the application. Next is to declare how the application should be run (Heroku has web and worker dynos). In this case, this is a web app. Add the following to a file Procfile:

web: python app_name.py

This basically tells Heroku to execute python app_name.py to start a web dyno.

The application setup can now be tested using foreman (from the Heroku toolbelt). If successful (and you get the expected outcome in a web browser), then the app is ready for deployment:

$ foreman start

Lastly, the app needs to be stored in Git and pushed to Heroku. After preparing a suitable .gitignore for the project, create a new Heroku app, initialize, commit and push the project:

$ heroku create
$ git init
$ git add .
$ git commit -m "initial commit"
$ git push heroku master

Once done (assuming no errors), check its state with:

$ heroku ps

If it says something like web.1: up for 10s then the application is running. If it says the application has crashed, then check the logs for errors:

$ heroku logs

Visit the live application with:

$ heroku open

Finally, I needed to add the database functionality. I used MongoHQ, which features useful tools for managing MongoDB databases. Add this addon to your application using:

$ heroku addons:add mongohq:sandbox

This adds the free version of the addon to the application. Visit the admin interface from the Apps section of the website to add a username and password. These (along with the host and port) need to be configured in your application in order to work. e.g.:

app.config['MONGOALCHEMY_USER'] = 'will'
app.config['MONGOALCHEMY_PASSWORD'] = 'password'
app.config['MONGOALCHEMY_SERVER'] = 'sub.domain.tld'
etc.

It may be that this step will need to be completed earlier if the application depends on the database connection to run.

Playing with Flask and MongoDB

I've always been a bit of an Apache/PHP fanboy - I find the way they work together logical and easy to set up and I enjoy how easy it is to work with databases and page-routing in PHP.

However, more recently I've found myself looking for other ways to handle web applications and data. I've messed around with Node.js, Django, etc., in the past but, particularly with Django, found that there seems to be a lot of setting-up involved even in creating quite small applications. Despite this, I understand that once set up properly Django can scale very well and managing large applications becomes very easy.

Flask is a Python web framework (like Django, except smaller) which focuses on being easy and quick to set up and on its configurability. Whilst it doesn't, by default, contain all the functionality that larger frameworks provide, it is extensible through the use of extra modules and addons.

I thought I'd use it for a quick play around to introduce myself to it. Most of this post is for my own use to look back on.

As it is Python, it can be installed through pip or easy_install:

# easy_install flask

Note: If Python is not yet installed, then install that (and its distribution tools) for your system first. For example, in Arch Linux:

# pacman -S python2 python2-distribute

In terms of data storage, I used MongoDB, a NoSQL, document-oriented approach to handling data. This can be downloaded and installed from their website, or your distro may package it. For example, in Arch:

# pacman -S mongodb

MongoDB can be started as a standard user. Create a directory to hold the database and then start it as follows:

$ mkdir -p data/db
$ mongod --dbpath data/db

This will start the server, which listens on port 27017 by default. The basic setup is now complete, and you can now start working on the application.
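As a tiny sketch of where this leads (using the pymongo driver directly here for brevity - the Heroku deployment post above uses Flask-MongoAlchemy instead, and the 'blog' database and 'posts' collection are just examples):

import json
from flask import Flask
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient('localhost', 27017).blog   # connects to the mongod started above

@app.route('/posts')
def posts():
    # Return the titles of any stored posts as a JSON list
    return json.dumps([post['title'] for post in db.posts.find()])

if __name__ == '__main__':
    app.run(debug=True)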


A complete example (including all necessary code and files) is available in this repository. This also includes a more comprehensive walkthrough to getting started.

ScriptSlide

I've taken to writing most of my recent presentations in plain HTML (rather than using third-party software or services). I used JavaScript to handle the appearance and ordering of slides. An example (to show what I mean) is here.

I bundled the JS into a single script, js/scriptslide.js, which can be configured using the js/config.js script.

There is a GitHub repo for the code, along with example usage and instructions.

Most configuration can be done by using the js/config.js script, which supports many features including:

  • Set the slide transition type (appear, fade, slide)
  • Set the logos, page title, etc.
  • Configure the colour scheme

Then simply create an HTML document, set some other styles (there is a template in css/styles.css), and put each slide inside <section>...</section> tags. The slide menu is then generated automatically when the page is loaded.
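A minimal page using it might therefore look something like this (the exact paths and script order here are assumptions; see the repo for the real usage):

<html>
  <head>
    <link rel="stylesheet" href="css/styles.css" />
    <script src="js/config.js"></script>
    <script src="js/scriptslide.js"></script>
  </head>
  <body>
    <section>
      <h1>First slide</h1>
    </section>
    <section>
      <h1>Second slide</h1>
    </section>
  </body>
</html>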

Research Poster Day

Each January the School of Computer Science hosts a poster day in order for the research students to demonstrate their current work to other research students, research staff and undergraduates. The event lets members of the department see what other research is being done outside of their own group and gives researchers an opportunity to defend their research ideas.

This year, I focused on my current research area, which concerns inferring how interesting a Tweet is by comparing simulated retweet patterns with the propagation behaviour the Tweet actually exhibits on Twitter. The poster highlights the recent work leading up to this, gives a general overview of how the approach works, and finishes with where I want to take the research in the future.

The poster is available here.

Delving into Android

Tides Main Activity

I've always been interested in the development of smartphone apps, but have never really had the opportunity to actually have a go. Whilst I'm generally OK with development on platforms I feel comfortable with, I've never seen much point in building an application for wider use without first having a clear idea of the direction it should take.

My Dad is a keen surfer and has a watch which tells the tide changes as well as the time. It shows the next event (i.e. low- or high-tide) and the time until that event, but he always complains about how inaccurate it is and how it never correctly predicts the tide schedule for the places he likes to surf.

He uses an Android phone, and so I thought I'd try making an app for him that would be more accurate than his watch, and maybe provide more interesting features. The only tricky criterion, really, was that he needed it to predict the tides offline, since the data reception is very poor in his area.

I got to work on setting up a database of tidal data, based around the location he surfs in, and creating a basic UI in which to display it. When packaging the application with an existing SQLite database, this helper class was particularly useful.

Tides Settings Activity

A graphical UI seemed the best approach for displaying the data, so I tried AndroidPlot, a highly-customisable graphing library, to show the tidal patterns day-by-day. This seemed to work OK (though not entirely accurately - tidal patterns form more of a cosine wave than the zigzags my graph produced, but the general idea is there), so I added more features, such as a tide table (the more traditional approach) and a sunrise and sunset timer.

I showed him the app at this stage, and he decided it could be improved by adding weather forecasts. Obviously, predicting the weather cannot be done offline, so, having sourced a decent weather API, I added the weather forecast for his area too. Because World Weather Online rate-limits requests, the host for this website keeps a cache in a database: when queried by the app, it makes the request on the app's behalf and stores the data until it becomes stale.
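As a rough illustration of the caching idea (a hypothetical sketch rather than the actual code running on this host, written in Python/Flask for brevity; the endpoint name, table schema and placeholder API URL are all made up), the host-side logic looks something like this:

import time
import sqlite3
import requests
from flask import Flask

app = Flask(__name__)
WEATHER_API = 'http://example.com/weather-api'  # placeholder for the real API endpoint and key
MAX_AGE = 3 * 60 * 60  # treat cached forecasts as stale after three hours

def get_db():
    db = sqlite3.connect('weather_cache.db')
    db.execute('CREATE TABLE IF NOT EXISTS cache '
               '(location TEXT PRIMARY KEY, fetched INTEGER, data TEXT)')
    return db

@app.route('/weather/<location>')
def weather(location):
    db = get_db()
    row = db.execute('SELECT fetched, data FROM cache WHERE location = ?',
                     (location,)).fetchone()
    if row and time.time() - row[0] < MAX_AGE:
        return row[1]  # still fresh: serve straight from the cache
    # Stale or missing: make one upstream request and cache the response
    data = requests.get(WEATHER_API, params={'q': location}).text
    db.execute('REPLACE INTO cache (location, fetched, data) VALUES (?, ?, ?)',
               (location, int(time.time()), data))
    db.commit()
    return data

The app only ever hits this endpoint, so the upstream API sees at most one request per location every few hours, regardless of how often the forecast is viewed.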

I added a preferences activity for some general customisation, and that's as far as I've currently got. In terms of development, I guess it's been a good introduction to the ideas behind various methodologies and features, such as the manifest file, networking, local storage, preferences, and layout design. I'll create a Github repository for it when I get round to it.

SocialShower

A few weeks ago I wrote some PHP scripts that can retrieve some of your social interactions and display them in a webpage (though the scripts could easily be modified to return JSON or XML instead). When styled, they can produce effects similar to those on the Contact page of this website (here).

SocialShower

Currently they are available for retrieving recent tweets from Twitter, recent listens from Last.fm and recent uploads to Picasa Web Albums.

The scripts run, in their current state, when the appropriate function is called from the included script, so they could seriously slow down the page-load time if invoked as part of the page request itself. If embedded in a webpage, they should instead be run through an AJAX call after the rest of the page has loaded.

The repo for the code (and example usage) is available from Github.

Seminar: Retweeting

I gave a seminar on my current research phase.

I summarised my work over the past few months; in particular, the work on the network structure of Twitter, the way in which tweets propagate through different network types, and the implications of this. I discussed the importance of precision and recall as metrics for determining a timeline's quality and how this is altered through retweeting in different network types.
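Roughly speaking (and glossing over the details of the actual experiments), these are the standard information-retrieval measures applied to a timeline: precision is the fraction of the timeline's tweets that are interesting, and recall is the fraction of all interesting tweets that appear in the timeline. A toy sketch, with made-up tweet IDs and 'interesting' labels:

def precision_recall(timeline, interesting):
    # timeline: tweets received; interesting: ground-truth set of interesting tweets
    retrieved = set(timeline)
    relevant_retrieved = retrieved & interesting
    precision = len(relevant_retrieved) / float(len(retrieved)) if retrieved else 0.0
    recall = len(relevant_retrieved) / float(len(interesting)) if interesting else 0.0
    return precision, recall

# Toy example: a timeline of four tweets, three tweets judged interesting overall
print(precision_recall(['t1', 't2', 't3', 't4'], {'t2', 't4', 't7'}))  # (0.5, 0.666...)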

I concluded by talking about my next area of research: how I may use the model from the previous experimentation to determine whether a tweet is particularly interesting based on its features. Essentially, this boils down to showing that tweets are significantly interesting (or uninteresting) by comparing them to the retweet behaviours predicted for them by the model.

The slides for the talk (not much use independently!) are available here.

ShadowSlide

I initially had a slideshow of images on the homepage of this website but, having put it up, it seemed a bit self-indulgent to have a scrolling view of images of my face.

I took the slideshow down in the end, but decided to stick the code for it on Github anyway since it used some nice (but simple) CSS to produce a kind of internal-shadow effect as shown below. Note that the effect is more heavily applied here for demonstration.

The repo for it is available here.

DigiSocial Hackathon

We recently held our DigiSocial Hackathon. This was a collaboration between the Schools of Computer Science and Social Sciences and was organised by myself and a few others.

The website for the event is hosted here.

DigiSocial logo

The idea of the event was to try and encourage further ties between the different Schools of the University. The University Graduate College (UGC) provides funding for these events, which must be applied for, in the hope that good projects or results come out of them.

We had relatively good responses from the Schools of Maths, Social Sciences, Medicine, and ourselves, and had a turnout of around 10-15 for the event on the 15th and 16th September. Initially, we started to develop ideas for potential projects. Because of the nature of the event, we wanted to make sure these were as cross-disciplinary as possible. A hackday is, in itself, pretty computer science-y, so we needed to apply a social or medical spin to our ideas.

Eventually, we settled into two groups: one working on a social-themed project based on crimes in an area (both in terms of distribution and intensity) in relation to the food hygiene levels in nearby establishments; another focusing on hospital wait times and free beds in South Wales. Effectively, then, both projects are visualisations of publicly-available datasets.

I worked on the social project with Matt Williams, Wil Chivers and Martin Chorley, and it is viewable here.

Overall the event was a reasonable success; two projects were completed and we have now made links with the other Schools which will hopefully allow us to do similar events together in the future.