In 2015, Australia passed a new piece of legislation entitled the Telecommunications (Interception and Access) Amendment (Data Retention) Act 2015. Following the introduction of this act, service providers have obligations to retain various data associated with services provided to customers.

Despite the act having taken effect in October 2015, I still see a lot of confusion in the service provider and broader community about exactly what customer data should be and is retained by providers. Having been heavily involved at my work in both preparing our implementation plan and providing guidance to the service provider industry at large, I feel I'm cognizant of the common misunderstandings and hope I can shed some light on how providers should be interpreting their obligations.

Disclaimer: I am not a lawyer and this is not advice. If you think your organisation may have a metadata obligation - the best thing you can do is contact a lawyer who is familiar with the service provider industry to get expert advice. Likewise, this website and these words are not those of my employer, so please don't hold them accountable for any opinions herein.

Finally, a lot of this information (and much more) can be found in the document Data Retention - Frequently Asked Questions for Industry published by the Attorney-General's Department (AGD). The AGD copped a lot of flak from members of industry for not being able to clearly articulate how to interpret the legislation, however I've found this document (even in its first revision) very capable of doing so - if only people take the time and effort to read and understand it with a sense of calm.

A bit of context

First things first - this law is in effect now. If you have a metadata obligation you are expected to be compliant from 13 October 2015, unless you lodged what is known as a Data Retention Implementation Plan (DRIP) with the Communications Access Coordinator (CAC) and received approval prior to this date.

Providers with approved implementation plans may have until 12 April 2017 (18 months later) to become compliant.

Why do this

Quite simply, this act requires service providers to retain data about customers buying a relevant service, along with some metadata about the use of that service.

This information is often requested by police investigating matters, but historically there has been no requirement for service providers to retain it. By introducing this legislation, the government helps support law enforcement by ensuring this important information is retained.

Criticisms

There has been a lot of criticism of this legislation; from my reading it falls broadly into one of two areas:

  • What is being collected - Misinformation about the data being collected, or misunderstanding of what is being asked of providers by overzealous industry members; and

  • Who can access the data - Misinformation about who can request access to the data from a provider.

The legislation provides a very specific (and in my opinion reasonable) list of government agencies that can request access to this data. The Attorney-General can also declare additions to this list, however such declarations are public and thus have a level of oversight.

I have heard reports of agencies outside this list making requests for information from service providers and whilst it is unclear whether the providers have made data available when they shouldn't have, it is clear that there is still confusion about who can ask what. The CAC is supposedly able to provide clarification on such matters to those that find themselves in this situation.

Exemptions

Providers are also able to apply for exemption from their obligations, both as part of an implementation plan and on an ongoing basis. An application for exemption may be made to the CAC on the basis of one of the following:

  • the long-term interests of end-users of carriage services or of services provided by means of carriage services; or
  • the efficiency and international competitiveness of the Australian telecommunications industry; or
  • the availability of accessible and affordable carriage services that enhance the welfare of Australians

Exemptions are required to be kept confidential by providers simply because public knowledge of these loopholes may provide a vector for bad actors to exploit.

Funding

The Australian government has made funding available to industry in the form of the Data Retention Industry Grants Programme, to help providers implement their metadata obligations.

Broadly speaking, applications for funding were open to providers who incurred costs preparing their DRIP; performed work to ensure compliance between 30 October 2014 and 13 October 2015; or had an implementation plan approved to become compliant.

Grants totalling up to $128.4 million were awarded to applicants in August 2016, and information on the recipients and the allocation methodology is available on the AGD website.

There has been some level of controversy within industry regarding the grants awarded - some observers have questioned funding requests disproportionate to the size of the provider's operations.

It's worth bearing in mind, however, that as with most grant programmes, recipients must agree to a funding agreement which includes reporting requirements on how the money is spent.

What data

Phew, okay - now we're ready to talk about what information is being collected...

Information is only required to be collected where the customer has a service for which a metadata obligation exists (evaluating this is covered in the next section).

Generally speaking, if the provider does not handle or generate any of the data covered here - they are not required to generate it or capture it solely for the purpose of data retention.

Any information that is covered and available must be retained by the service provider for no less than two years.

Providers are required to ensure that data retained is:

  • Adequately protected as required under the Privacy Act 1988
  • Encrypted at rest, with measures in place to control access

Customer information

A provider must retain any customer contact details (name, address, etc) and billing details held in their CRM, including historical data covering at least two years.

Communication Metadata

Where the service provider facilitates a communication, the relevant metadata must be collected (where available) for any communication made or attempted:

  • The date and time (and where relevant, the duration) of a communication
  • The source of a communication (who initiated the communication)
  • The destination of a communication (who was being communicated to)
  • The type of communication
  • The location of the equipment used in the communication at its start and end (where relevant)

In the case the provider is able to positively identify the other party of a communication (ie: the other party is also a customer of the provider) then metadata about that customer must also be retained. It is my understanding that this applies even if that other party does not consume any services that would be subject to metadata retention, though I have not sought any clarification on this.

It's really important to note here that:

  • The above items need to be considered in the context of the applicable service and may not apply
  • Providers do not need to generate or collect metadata for third-party services that a customer consumes "over the top" of a service (with the exception of some specific services)
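
To make that concrete, here's a rough sketch of the shape a per-communication retention record might take, covering the fields above. This is purely illustrative - the legislation prescribes categories of data rather than a schema, and every field name here is my own invention:

    # Illustrative only: the legislation prescribes categories of data to
    # retain, not a schema - every field name below is invented for the sketch.
    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass
    class CommunicationRecord:
        customer_id: str                 # link back to the CRM records
        started_at: datetime             # date and time of the communication
        duration_seconds: Optional[int]  # where relevant (eg: a voice call)
        source: str                      # who initiated the communication
        destination: str                 # who was being communicated to
        comm_type: str                   # eg: "voice", "sms", "email"
        start_location: Optional[str]    # where relevant (eg: tower location)
        end_location: Optional[str]      # where relevant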

What's NOT collected

Anything that falls outside of the items discussed above.

In particular, it's worth noting some specific things that aren't covered:

  • Content of communications (eg: email content, VoIP calls, etc)
  • Web Browsing History - this is explicitly exempted by the legislation
  • Customer location when service not in use (eg: mobile phone location)
  • Packet metadata (eg: NetFlow, sFlow)

The last item here is one of the things overzealous operators have jumped on - however the AGD provides specific guidance around this.

Evaluating your obligation

Metadata retention obligations are determined on a per-service basis and a provider must consider each of the following criteria.

If the provider:service combination does not meet all the criteria below, there is no obligation.

Note: Again I would like to acknowledge the fantastic Industry FAQ from the AGD from which this is derived.

Am I a Carrier, CSP or ISP?

Are you one of the following:

  • Carrier
  • Internet Service Provider (ISP)
  • Carriage Service Provider (CSP)

These are well-defined terms under existing telecommunications legislation, and generally you will know if you are one of these.

It's worth noting that if you provide certain types of listed carriage services to a third party in return for a reward, you are considered a CSP.

Is it a relevant service?

Does the service carry, or enable the carriage of, communications?

This doesn't include ancillary services needed to establish a communication (eg: DNS), just those that actually carry it.

Intent is important here: if the service isn't primarily concerned with carrying communications in normal operation, you don't need to anticipate off-the-wall scenarios (eg: DNS tunnelling with iodine, etc).

Not in "same area"

Services offered and consumed within the same property boundary are exempted.

Not in "immediate circle"

Metadata obligations do not extend to services offered to officers or employees of the provider.

Worked examples

Here are some pretty common scenarios that come up and how I would evaluate a metadata obligation for them.

Wired internet connection

There is an obligation unless the "immediate circle" exclusion applies.

Customer data would be readily available. IP address allocations and session start/stop times are acceptable metadata records. The service address would suffice as location information for this type of connection.

Telephony Provider

I've added this example mostly to cover off voice services; whilst they remain a large part of the focus of this legislation, the industry is very mature and has a solid background in generating and retaining metadata here.

Generally speaking, there's an obligation here except for free services.

CRM records, telephone number allocations, and logs of attempted inbound/outbound calls are required here, including the physical location of the handset (fixed address or mobile location) at call initiation and hangup.

Free Public Wi-Fi

If the operator of the WiFi is not a Carrier/ISP there is no obligation. This is because the service is offered for free so they would not be considered a CSP.

If the operator of the WiFi is a Carrier/ISP/CSP but the service is offered within a single property boundary there is no obligation. This is because the "same area" exclusion applies.

If the operator of the WiFi is a Carrier/ISP/CSP and the service is offered across multiple locations... get a lawyer. Strictly speaking you don't meet the "same area" exclusion - but some good lawyering might just change that.

Where an obligation exists, depending on the solution you may not have customer identifying information. MAC addresses suffice if your captive portal collects them - however not all solutions do.

If you are performing NAT you may be required to collect NAT mappings.

Paid Hotel WiFi

The hotel may be considered a CSP as they are selling internet access for reward. Fortunately for them, the "same area" exclusion may apply to these providers.

Unless they're a chain of hotels, in which case clever lawyering may be required.

Or if they outsource the operation of the WiFi, in which case the operator almost certainly has an obligation.

Staff email accounts

There is no obligation - this is because of the "immediate circle" exclusion as email accounts are offered to employees only.

Unless email is a prescribed service for CSPs (I don't think it is, but I haven't looked): if you outsource your IT externally, your IT provider shouldn't have an obligation either, unless they're considered a CSP/Carrier/ISP for other business activities.

If they were, your supply agreement may determine whether they have an obligation: if they're contracted to perform professional services on your staff email server, there is no obligation; however, if they are providing email as a contracted service, that might be a different story.

Even then, the "same area" restriction may apply if they manage a server on your premises.

Shared web hosting

There is no metadata obligation.

Whilst you operate this service, web browsing history is specifically excluded from the legislation - so web server logs are not in scope here.

Any over-the-top services operated by your customer (forums, etc) are not your responsibility to retain data for.

However, in theory any outbound traffic generated by a customer may be in scope - in which case the data you retain may need to include the process owner (if you give customers a system account and run their apps as them), as this is a similar situation to NAT (a shared resource, so retain the mappings).

If I were a hosting provider, I'd be getting a lawyer to review this with me. I'd also be preparing an application for exemption on the basis that this data would not normally be generated as part of business-as-usual operation, even if you offer a dedicated IP for SSL/other reasons.

VPS hosting

There is a metadata obligation here.

In the case of a VPS, a static IP allocation exists and the obligation is quite easy to meet.

Conclusion

I hope this helps address some of the landscape of metadata retention in Australia.

The general constraints are fairly easy to understand for engineering staff looking after these services, however your obligation does depend on how you offer the service from both a technical and commercial point of view.

My only advice is that you find (and retain) a competent technology lawyer and keep a level head!

Over the last year or so I've been looking quite a bit at Elasticsearch for use as a general purpose time series database for operational data.

Whilst there are definitely a lot of options in this space, most notably InfluxDB, I keep coming back to Elasticsearch when we're talking about large volumes of data where you're doing a lot of analytical workload.

More than a few times, I've been asked to explain what Elasticsearch looks like to the would-be developer/operations person. This isn't too surprising; the documentation isn't great at giving a real-world architectural overview, which can really help contextualise the rest of the docs.

Having been asked again today, I've decided to write one up here - so I can save some time explaining the same thing over and over.

It's important to note for the would-be reader that this is my imperfect understanding of Elasticsearch. If you spot any glaringly obvious errors please let me know and I will update this accordingly, and you'll have my eternal thanks for helping me grok this a little better.

So, as they say: the show must go on!

Elasticsearch concepts

Elasticsearch is a search engine built on top of Apache Lucene. It's great for full-text search (duh) as well as analytical workloads with ad-hoc queries and aggregations.

In NoSQL parlance, it would be classified as a document-oriented database, and that's primarily how your application will interact with it.

You insert your DOCUMENT into an INDEX.

  • An index is a namespace for documents and is analogous to a database table
  • Unlike a table, there is no schema per se
  • You define what fields are indexed for search at this level

An INDEX is backed by one or more SHARDS

  • When you create an index, you specify the number of primary shards your data will be split across.
  • Elasticsearch uses a hash function to determine what shard your document is stored in and accessed from.
  • You cannot change the number of primary shards after the index is created.
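
As an aside, documents are routed to a primary shard with hash(routing) % number_of_primary_shards (the routing value defaults to the document id), which is exactly why the primary shard count can't change after creation. Here's a minimal sketch of creating an index with explicit shard counts against the REST API - the host and index names are made up:

    # Sketch: create an index with an explicit primary shard count via the
    # REST API. Host and index names are illustrative.
    import requests

    settings = {
        "settings": {
            "number_of_shards": 5,    # fixed once the index is created
            "number_of_replicas": 1,  # dynamic - can be changed later
        }
    }
    resp = requests.put("http://localhost:9200/my-index", json=settings)
    print(resp.json())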

A SHARD can have zero or more REPLICA SHARDS

  • For each primary shard in your index you will have X replica copies (defined by the index settings)
  • If the node hosting the primary shard fails, a replica shard will be promoted to primary
  • Replica shards are used for scaling out read performance - the more replica shards you have, the more reads you can service.
  • Unlike primary shards, you can change the number of replica shards any time
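
Because the replica count is a dynamic setting, changing it is a single call to the index settings API - again, names here are illustrative:

    # Sketch: raise the replica count on a live index; Elasticsearch will
    # allocate the new replica shards across the cluster automatically.
    import requests

    resp = requests.put(
        "http://localhost:9200/my-index/_settings",
        json={"index": {"number_of_replicas": 2}},
    )
    print(resp.json())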

ELASTICSEARCH runs as a CLUSTER made up of NODES

  • Nodes automatically form a cluster when correctly configured
  • Elasticsearch will automatically distribute (and move) shards as needed by your index configuration
  • Your application can talk to any node in the cluster and they will forward your request to a node with the data to service your request
  • If you have a busy cluster, you can deploy coordinating-only ("client") nodes - these are nodes that don't store shards and can be used to direct incoming requests

A SHARD is a Lucene index

  • Every time your query needs to access a shard, the Lucene engine needs to be running for that data
  • Don't confuse a Lucene index (a shard) with an Elasticsearch Index (a collection of shards)

Care and feeding of your cluster

I won't cover setting up and maintaining quorum in the cluster, because that's pretty well covered elsewhere. If you're running on AWS there's even a managed product available which helps simplify things a lot.

For the keen observer, that last bullet point raises some interesting constraints when managing your cluster.

Elasticsearch runs Lucene within its JVM, but remember that the heap and Lucene's index data don't share the same memory space: index data is served from the operating system's filesystem cache, outside the heap. Make sure you leave enough memory free outside the heap for your index data (Lucene).

Don't allocate more than 32GB of heap to Elasticsearch. Due to how Java addresses memory (compressed object pointers stop working beyond ~32GB), this will slow things down heaps (see what I did there?).

Read this. No really.

Working with Elasticsearch

In short, deploying Elasticsearch for your search application requires some careful planning on both the ingest and query side.

Primarily you want your cluster to have enough memory so your busy shards (read or write) stay resident in physical memory. Otherwise your nodes will spend all their time paging data in and out of disk, which defeats the point!

Read the designing for scale part of the Elasticsearch guide.

Your best bet is to split the data across indexes by a meaningful criteria, and in the case of time series data this is a natural fit.

Elasticsearch has a feature called index templates, which are super useful for dynamically creating indexes with specific settings. Writes are directed to the correct index and you can automatically have the new index added to an index alias for reads.
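
As a sketch of how that hangs together for daily time-series indexes (index, alias, and field names are all invented, and this uses the template API as it stood when I wrote this):

    # Sketch: a template applied to any index matching "metrics-*", with an
    # alias for reads. Each new daily index picks these settings up on creation.
    import requests

    template = {
        "template": "metrics-*",
        "settings": {"number_of_shards": 2, "number_of_replicas": 1},
        "aliases": {"metrics-all": {}},  # new indexes join this read alias
    }
    requests.put("http://localhost:9200/_template/metrics", json=template)

    # Writing to a dated index that doesn't exist yet creates it from the
    # template automatically.
    doc = {"@timestamp": "2016-09-01T12:00:00Z", "value": 42}
    requests.post("http://localhost:9200/metrics-2016.09.01/sample", json=doc)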

Conclusion

Elasticsearch is a great tool, but requires you to plan ahead. Hopefully I've given you a good introduction to how things hang together and where the sharp edges are.

I highly recommend reading through the entire Elasticsearch: The Definitive Guide document; with the above information in mind ahead of time, I think it makes for a much more cohesive read.

Good luck and happy hacking!

Thanks to @ZacharyTong for pointing out that Lucene does in fact support paging index segments (statement removed), and @warkolm for spotting a mistake regarding number of replicas and an opportunity to clarify!

After a suggestion by someone, I got it in my mind that a certain group could really do with an NNTP (Usenet) caching proxy. NNTP proxies and caches do already exist, but none of them support cache hierarchies - that is, trying to resolve articles from peer caches before talking upstream.

The use case here is WACAN, a wireless network where each participant may want to offer their article cache for use by members on the network.

So after talking about it for a little while, I wrote one.

It's pretty terrible code actually (surprise!) - I've hand-written the parser and for now it supports the bare minimum subset of the protocol needed to support SABnzbd. NZBGet won't work, but could with some minor command support.

This ended up being useful to me for some of my other forever projects, so I've had a chance to look back at this recently and am considering revisiting it if I have time.

If that happens, I'll be looking to rewrite the parser using Ragel and targeting either Go (even though it has the pretty useful net/textproto package) or C++.

The new implementation will be a lot simpler, backing usenet requests onto stateless HTTP requests - leaving the implementation as a fairly flexible and pluggable exercise. I've done a bit of testing with this already, specifically trying to use the news and nntp schemes over HTTP - though library support for this (I'm looking at you, net/http and libcurl) is pretty average.
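
To give a flavour of the idea, here's a very rough sketch (in Python rather than Go or C++, with entirely made-up peer URLs) of serving NNTP ARTICLE commands by trying peer caches over stateless HTTP before falling back to upstream:

    # A rough sketch only: maps NNTP ARTICLE commands onto stateless HTTP GETs,
    # trying peer caches before upstream. URLs are made up; dot-stuffing and
    # most of the protocol are omitted for brevity.
    import socketserver
    from typing import Optional

    import requests

    PEERS = ["http://peer1.example/article/", "http://peer2.example/article/"]
    UPSTREAM = "http://upstream.example/article/"

    def fetch_article(message_id: str) -> Optional[bytes]:
        """Try each peer cache, then upstream, for a message-id."""
        for base in PEERS + [UPSTREAM]:
            try:
                resp = requests.get(base + message_id, timeout=2)
                if resp.status_code == 200:
                    return resp.content
            except requests.RequestException:
                continue  # peer unreachable, try the next one
        return None

    class NNTPHandler(socketserver.StreamRequestHandler):
        def handle(self):
            self.wfile.write(b"200 cache proxy ready\r\n")
            for line in self.rfile:
                cmd, _, arg = line.decode().strip().partition(" ")
                if cmd.upper() == "ARTICLE":
                    body = fetch_article(arg)
                    if body is None:
                        self.wfile.write(b"430 no such article\r\n")
                    else:
                        self.wfile.write(b"220 0 %s\r\n%s\r\n.\r\n"
                                         % (arg.encode(), body))
                elif cmd.upper() == "QUIT":
                    self.wfile.write(b"205 bye\r\n")
                    return
                else:
                    self.wfile.write(b"500 not supported in this sketch\r\n")

    if __name__ == "__main__":
        socketserver.ThreadingTCPServer(("", 1119), NNTPHandler).serve_forever()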

Watch this space, I guess.

Having been writing apps (poorly) with AngularJS for a while now, I was pretty excited to realise that by combining this with Phonegap/Cordova, I could start writing portable mobile apps!

For the uninitiated, Cordova is an open source mobile app development platform. Basically, it bundles your app for your device, runs an embedded (lightweight) webserver, and presents your app in a fullscreen browser control. Nifty. Phonegap is the Adobe commercial version, and they kindly donated the core of it to the Apache Software Foundation (thanks Adobe!), which is the bit known as Cordova.

Now what project to learn with? Hmm... aha!

puush is a pretty cool image sharing service run by a guy who I used to LAN with, and is pretty much entirely funded by their insanely popular rhythm game, osu!. It's a hobby project.

One of the problems with being a hobby project is that it doesn't get a lot of love. Specifically, their mobile app for iOS no longer works unless you still have an iPhone 3 (how quaint). I use the app /a lot/, and would love it to work on mobile without carrying an extra device around.

After vague promises of maybe adopting my code for their official app, I decided to see how hard it would be to replicate this on iOS.

As it turns out, not hard at all! After a week of development I had a fully functioning prototype with most (not all) of the features implemented. This ended up being a fantastic starter project for both size and combination of native features (camera, local storage, modal dialogs, etc).

Whilst the puush guys haven't adopted the code yet, there's nothing stopping anyone with an Apple developer account from building and publishing to the App Store. Which I've decided to do, after I polish it up a bit and make it ready for prime time.

Probably the biggest change I need to make is converting it to a grunt project so it bundles and minifies all the code locally instead of including it from cloud CDNs... Don't look at me like that - it was development code!

Anyway, check it out: puush-phonegap on GitHub

I had to split up my posts to stop my hiatus post from being an unreadable mess.

This post is a bit of a recap of some projects I've worked on or am working on. There are another two posts following this one covering some larger projects.

Startup weekend mentoring

I haven't participated in another Startup Weekend in Perth since my first one. I don't think I ever wrote it up in its full glory either; needless to say it was pretty intense, and despite not having the experience I was after, I learnt a bunch and would recommend it to anyone who is interested in that kind of thing.

In preparation for the SWPerth7 event the organisers did a call out for mentors. This is something I'm interested in, but I honestly didn't expect to get accepted. Heh, suckers.

Mentoring was a great experience, and I genuinely hope I was able to help the teams with their planning and validation - my feedback seems to suggest so. I was a bit worried I'd commit a cardinal sin and be prescriptive about things ("You should do this or that"), but I was able to keep it to leading questions and answering specific requests for advice - so yay for that.

In the future, I'd love to do this again, and aim to get some of the pitch coaching mentor timeslots. I think I might have more to offer on this side of things.

Distillation automation

I managed to get most of the parts together to completely automate my still for the purposes of, uh... extracting essential oils and distilled water.

The hardest part so far is the temperature probe, which needs to be an annoying 5mm in diameter in stainless steel. I was hoping to use a one-wire digital probe, however the smallest package for these is 5mm in diameter - leaving no room for the stainless steel shroud.

I'll have to order some K-type sensors, and once they arrive and the shed is cleaned - I should have some updates on that particular project.

Video streaming

I've been pretty interested in getting involved in the video streaming project over at the Perth Linux Users Group for a while now.

They're looking to move from DV capture to HDMI over USB and have been waiting on some custom hardware to get made, which looks like a good path forwards.

But in the meantime my work has given me some budget to put together a solution with off the shelf parts. I've happily spent the entire budget and have a nice pile of bits ready to go.

Whilst I'm still keen on the open source solution, it's going to have to wait a while for me to play with these new toys.

Expect a post on this soon.

WACAN

I'm happy to be a founding member of this organisation. A bunch of guys involved in WAFreeNet incorporated an association to further the goal of building a community-operated wireless network across the region.

The incorporation (or rather, some of the people involved on either side of the should-we-incorporate fence) has caused some division in the community, but has also achieved some really great stuff. Specifically, the organisation has a relationship with WAIA which has helped secure a tenancy for a great core node at QV1 as well as access to some CDN traffic over the network.

I've perpetually been unable to participate in these networks - since 2003(?) I've been testing line of sight from each house I've lived in with no luck. More recently I've had perfect LOS to QV1, though at the wrong angle for the sector panel - but it looks like our new house should be able to manage an OK connection.

My involvement in this group has primarily been as a member, I'm not really interested in committees any more - but I've organised a few public meetings (none of which I've been able to attend, heh).

I think I'll keep doing that for a while.

Pelican Modules

I promised a while back to put the code up for some of the pelican modules I wrote to support this site. As of sometime last year, they're now available on GitHub here and here.
