Remote Revolution

Paul Graham recently wrote an essay about immigration and the tech industry. Matt Mullenweg wrote a response mentioning distributed teams as a solution, and there was a thread on Hacker News with additional discussion about colocated versus distributed teams. Steven Sinofsky wrote a response to all of those.

I am all for changing immigration laws. If someone wants to move to Silicon Valley from outside the country to work I think that is wonderful and they should be allowed to do that. I do not, however, believe that the future of the technology industry lies in colocated teams.

Exceptional programmers are difficult to find. If we take it that exceptional programmers are rare and necessary for success, then it would seem incumbent on the companies that need these individuals to maintain work environments that remove barriers to working with them.

I have been working in the tech industry for twenty years, modulo hiatuses for creative projects. I have spent roughly half of that time colocated and half of that time distributed. I do not have conclusive or broad data to support what I am about to write; even if data were available, there may well not be enough of it yet.

Distributed teams are an innovation enabled by technology. Arguments from authority–Google is not distributed, Facebook is not distributed, ask startup founders why they do not use distributed teams–are ludicrous. Gather around and let me tell you about a little thing called disruption.

I am reminded of the late 90s when people were debating the use of Linux and other open source projects in production. “All the successful companies use Sun! Talk to CEOs they will tell you they buy Sun because Sun is what you need to get the job done! You. just. cannot. be. enterprise. with. Linux. it. is. not. true.”

The last time I dealt with Sun (circa 2003) they wanted me to renew a $35,000/year maintenance contract on two servers. The one time I had used them it took five business days to get a failed RAID drive replaced. I said no thanks and invested the money into ten FreeBSD servers, keeping five as spares. Those spares wound up as a cluster enabling the organization to innovate in ways that would not have been possible had we wasted resources doing things the old way. Today that much money would pay for ten nice EC2 instances and related supporting AWS services without upfront costs, facilities, maintenance, etc.

The feedback loop seems clear: innovation reduces friction, reduces barriers, reduces cost, thus fueling more innovation.

These are not the wild west days of outsourcing projects to hustlers cobbling together low quality teams. This is 2015; we have GitHub, Stack Overflow, job boards for distributed teams, massive numbers of contributors to open source projects. Today quality remote workers and companies can find and evaluate each other directly.

The way I see it this comes down to synchronous versus asynchronous communication. We started with synchronous communication. We built empires on synchronous communication, we developed techniques and processes and hardware architectures all around this idea of synchronous communication. But as time passed we realized what a drain of resources it is to always employ synchronous communication so we started looking at asynchronous communication. Oh no! What a disaster! We designed all the things for synchronous communication! As technologists how did we approach this problem? Did we dig our heels in and start screaming for more and more powerful hardware to keep the old ways afloat? Yup. How did that work out? It gave us a little breathing room but then we had to put on our big kid pants and acknowledge the reality that we could no longer continue to cram everything into one place and expect it to scale. We had to change processes and the way we approach problems. And what happened? We discovered better ways to do things. Not merely different ways or almost-as-good ways, but better ways.

Asynchronous communication is efficient because senders do not have to wait for a response. People choose text messages over phone calls for the same reason. In 2015 we choose abstractions, languages and frameworks–functional programming, actor systems, immutable data structures and so on–that support asynchronous communication. But eventually we get to a point, for example writing to a physical device, that requires synchronous communication. Similarly while distributed teams choose tools and processes that support asynchronous communication there are times when synchronous communication is necessary. The day-to-day implementation of this can vary by team–team video calls and 1-1 manager video calls and so on–but something I believe should not vary for long-running teams is getting together once or twice a year for a few days on the company’s time and dime to do something fun. Not a tech conference or an all-hands work meeting but something fun. It does not have to be a whitewater rafting excursion; getting together in a centrally-located city and visiting museums is great too. There is a level of intimacy that only comes from having had face-to-face interaction, but people do not have to work in an office together to make that happen.

When people are focused–in the zone–interruptions can be expensive. We respect this with surgeons because the consequences of interrupting a surgeon who is concentrating are obvious. We respect this in the library. We respect this on the golf course. We have recognized it as an issue with colocated teams for more than half a century–Herman Miller introduced what would become the cubicle in 1964 based on research into open office layouts that concluded open environments actually reduce communication between employees.

A few weeks back there was an article in The Economist about cubicles and how workers could break free of them.

“Workers, the Cornell study suggested, like closed offices best of all. But open-plan offices are preferred to cubicles.”

“For reasons of economy, if nothing else, a return to private offices seems unlikely. But mobile technology is making it possible to work anywhere. Could it also offer an escape from the cubicle farm?”

Curiously the article fails to mention remote work as a solution.

I believe a well-run distributed team has more transparent and less siloed communication that leads to better collaboration. This may be a tortured analogy but I think of it like Facebook versus Twitter–you may have more friend-to-friend bandwidth on Facebook but on Twitter everything is open and transparent which leads to insights outside of one’s circle of friends and to the natural formation of groups of collaborators. Before I really “got” Twitter I thought it sounded like a horrible idea–strangers following me and me following other people just sounded creepy and there cannot be any depth with this character limit and blah blah blah. It was not until I had to use it for work that I really got that it was not a pale substitute for Facebook but rather a genuine advancement in human consciousness and communication. This is also how I feel about distributed teams.

A well-run distributed team primarily communicates via group text chat like Slack. As much as possible all communication happens in various channels that form by team or project or whatever fits best. Communication happens asynchronously, transparently and is immutably logged. Conversations are not lost or forgotten; they are viewable and searchable. People can be included in conversations more easily because they can participate asynchronously. New group members can quickly be brought up to speed on a given topic or issue by referring them to relevant logs. Cross referencing code commits with tickets and publishing changes either directly to a group’s room or to a “newsfeed” room provides amazing documentation. If I want to know why something was done I can look at the commit, find the ticket and find all the relevant discussion about that, how the solution was worked out–it may have required something beyond just code such as infrastructure changes or contacting a service provider. Colocated teams can do this somewhat, but there will always be large gaps that require relying on people’s memories of events to resolve. Fitbit > trying to remember what happened. You may say the need for this does not come up often and you would be correct, but when it does come up it can save days or even weeks of time. It can be a lifesaver in time-sensitive situations like production outages. It can help prevent service regressions. As technologists we love data, we can learn from data, we can use data to improve our processes. Distributed teams produce more and richer data about themselves and the work they do than colocated teams.

An industry embracing Big Data should look for ways to apply those insights to itself.

Is the world ready for asynchronous work? It took Slack less than a year to reach a billion-dollar valuation. A list of companies that use distributed teams, which I started last month, already has over a hundred listings. Remoteville is well on its way to being a major tech hub and we have not even begun to tap the possibilities. Given asynchronous work environments and the decoupled nature of functional programming it is not difficult to imagine trivially outsourcing coding tasks down to the granularity of a single function if we had the software tools to frictionlessly manage such a process.

Should everyone work on distributed teams? Of course not. There are many varieties of people with different working styles and interpersonal needs at different stages of life. By all means support workers who want to cluster together, whether that means maintaining official offices or paying for cowork space or whatever works. The idea is not to force distributed teams but to create and maintain a workplace that is asynchronous by default to maximize transparency and efficiency so all kinds of workers and teams can flourish.

As technologists we do not write blog posts imploring people to please go to gyms because that is the best way to work out; instead we innovate and build wearables that people can use to monitor and increase their activity and fitness levels wherever they are. That is who we are and what we do. We are the cheap plastic solution you should never bet against.

Postscript

In the course of writing this essay I realized asynchronous work is my mission and I do not want to just talk about it–I want to make it happen. I want to make using colocated teams by default as anachronistic as the horse and buggy. I want formerly homeless people living in renovated former office buildings because we no longer know what to do with all the extra space. I want cities planning to build smaller highways instead of maintaining mile-wide strips of asphalt that are empty except during rush hour.

Harvard recently completed a 75-year study on human (male) happiness. Their conclusion? It is all about love and warm relationships. By making work asynchronous we can get the same amount of work done while spending more time with people we love and cherish, living happier lives as a result.


DynamoDB Hum

“DynamoDB is a fast, fully managed NoSQL database service that makes it simple and cost-effective to store and retrieve any amount of data, and serve any level of request traffic. All data items are stored on Solid State Drives (SSDs), and are replicated across 3 Availability Zones for high availability and durability.”

DynamoDB is interesting because of its ability to scale. Let us talk about the bits of DynamoDB relevant at scale.

Keys

Imagine our DynamoDB table is a giant hash. We call each key of that hash the hash key – clever, eh? When we query or get an item from DynamoDB we must know (or be able to deduce–more on this later) its exact hash key.

It is possible to configure DynamoDB such that there is one item–an item is a hash–per hash key. That is boring. We will not talk about that.

Instead imagine that our giant hash is full of arrays that are full of items. These arrays are indexed by each item’s range key and can be accessed using the Query API.

Consider a table containing Articles. We can use the UserId value of the article as the hash key and the UpdatedAt value as the range key. We can now use the Query API to efficiently retrieve a user’s articles for the last thirty days. DynamoDB goes to the hash key location, scans through the range keys (we can tell it which direction to look) and there we go.
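
To make that concrete, here is a rough sketch of such a query using the aws-sdk-dynamodb gem. The table layout, the attribute names and the ISO 8601 string format for UpdatedAt are assumptions for illustration, not anything DynamoDB prescribes.

    require "time"
    require "aws-sdk-dynamodb"   # assumes the aws-sdk-dynamodb gem is installed

    dynamodb = Aws::DynamoDB::Client.new(region: "us-east-1")

    # Articles for one user from the last thirty days, newest first.
    since = (Time.now.utc - 30 * 24 * 60 * 60).iso8601

    resp = dynamodb.query(
      table_name: "Articles",
      key_condition_expression: "UserId = :uid AND UpdatedAt >= :since",
      expression_attribute_values: { ":uid" => "user-123", ":since" => since },
      scan_index_forward: false   # walk the range keys backwards, i.e. newest first
    )
    resp.items.each { |article| puts article["Title"] }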

Indexes

One table with two indexes means we need roughly triple the write capacity compared to the table alone. Solutions that involve external indexes require management of write capacities across multiple tables and possible data inconsistency–boo–but in the next few weeks AWS will offer Global Secondary Indexes which will hopefully render such schemes obsolete. Local Secondary Indexes–additional range key indexes–are available now.

What is an “additional range key index”?! Sorry! Remember our giant hash filled with arrays filled with hashes, those arrays ordered on a range key value selected from the hashes they contain? LSIs allow us to specify additional range keys so, to add to our above example, we have UserId as the hash key and UpdatedAt as the range key–we add an LSI so we can make a second range key on Name. Now we can also query Articles for a user’s articles that have a certain name or start with ‘z’ or whatever.
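
Continuing the sketch from above, a query against such an LSI might look like the following; the index name is made up, and dynamodb is the client from the earlier example.

    # Query a hypothetical LSI named "NameIndex" (Name as a second range key).
    # "Name" is a DynamoDB reserved word, hence the attribute name alias.
    resp = dynamodb.query(
      table_name: "Articles",
      index_name: "NameIndex",
      key_condition_expression: "UserId = :uid AND begins_with(#n, :prefix)",
      expression_attribute_names: { "#n" => "Name" },
      expression_attribute_values: { ":uid" => "user-123", ":prefix" => "z" }
    )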

Global Secondary Indexes will allow us to add additional hash keys to index our tables. If we add Name as a GSI instead of an LSI then we can query all articles by name rather than just querying articles by name for a given user.
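
The call shape is the same for a GSI query, except the user no longer appears in the key condition at all (again, the index name is hypothetical):

    # Query a hypothetical GSI keyed on Name, across all users.
    resp = dynamodb.query(
      table_name: "Articles",
      index_name: "NameGlobalIndex",
      key_condition_expression: "#n = :name",
      expression_attribute_names: { "#n" => "Name" },
      expression_attribute_values: { ":name" => "Remote Revolution" }
    )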

Partitioning

The write capacity of a table is divided evenly among a number of partitions. We do not know the number of partitions but it increases/decreases as capacity is added/removed. In order to utilize our total provisioned write capacity we must evenly spread writes across partitions to avoid hotspots because exceeding the capacity of any one partition will cause our table to be throttled. One approach is to choose random hash keys but this requires that we always know the exact (random) hash key of the item we want. To get around this a hash key index (external or GSI) is added. But! That index has the same hash key constraint–indexes are just other DynamoDB tables so the index hash key will suffer the same partitioning hotspot issues. Thus we have not really resolved the issue by choosing a random primary hash key but rather moved it elsewhere and doubled our required write capacity for the effort.

What to do now? There are two primary options. One is we can use something useful, like a UserId, as the primary hash key and add something to help randomize it. For example we could prepend a random digit 0-9 to each hash key. If we are storing a key for a specific item somewhere we can store it with the random digit and use that to get the item directly. If, on the other hand, we need to query for an item and we are not sure which of the ten possible hash keys it uses we have to query each of them (or better if we can figure that out) until we find it.
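
A minimal sketch of that random-prefix approach might look like this; the shard count and key format are arbitrary choices for illustration.

    # Rough sketch of write sharding with a random prefix.
    SHARDS = 10

    def sharded_user_key(user_id)
      "#{rand(SHARDS)}.#{user_id}"          # e.g. "7.user-123"
    end

    # If we did not store the exact sharded key somewhere, fan the query out
    # across every possible prefix and merge the results.
    def query_all_shards(dynamodb, user_id, since)
      (0...SHARDS).flat_map do |shard|
        dynamodb.query(
          table_name: "Articles",
          key_condition_expression: "UserId = :uid AND UpdatedAt >= :since",
          expression_attribute_values: { ":uid" => "#{shard}.#{user_id}", ":since" => since }
        ).items
      end
    end

The ten fan-out queries are the price we pay for spreading one user’s writes across up to ten partitions.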

The other option is to spread writes for a given hash key out over time (or better if we can figure that out). Continuing with our example above let us imagine our app can import a user’s articles from WordPress in bulk. When that job runs for a prolific blogger we are suddenly beating the snot out of that UserId hash key. Throw those writes in a queue and work on them over time and that snot-beaten spot is not so hot.
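
A toy version of working the backlog over time, assuming the pending writes from the bulk import have already been dumped into an in-memory array (a real system would use SQS, a job framework or similar, and the rate here is an arbitrary example):

    # Drain a backlog of bulk-import writes at a fixed rate instead of
    # hammering one hash key all at once.
    WRITES_PER_SECOND = 25

    def drain(backlog, dynamodb)
      until backlog.empty?
        backlog.shift(WRITES_PER_SECOND).each do |item|
          dynamodb.put_item(table_name: "Articles", item: item)
        end
        sleep 1
      end
    end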

Which approach to take (perhaps both!) and the implementation details will naturally depend on our data and usage patterns, but now we know enough to reason about the problem.

Patterns

DynamoDB’s constraints might be off-putting for a lesser data store, but DynamoDB’s speed, scalability, simplicity and cost make its constraints worth thinking about. There are interesting and useful patterns to be discovered.

For example let us go back to our Articles table example from above. We will use the UserId for the hash key again. But this time we are going to make a composite range key by prepending a SequenceId to a randomly-generated UuId. [Note that foreign keys–from Dynamo or other data stores–are an interesting choice to use here instead of random uuids as long as they are unique within the primary hash key.] We take advantage of DynamoDB’s flexible schema and store the metadata item for a sequence–representing a Category–in position 0 and store the data items for a sequence–representing articles–in positions 1-n. We write the metadata item last to ensure consistency (potentially at the expense of some orphan items) and persist the total number of items written in it. We could also persist some sort of sequence index here, tags related to the sequence, etc. Once the metadata item is written we consider the sequence immutable. Now we can query a user’s categories by searching for all items that start with 0. From there we select a category whose articles we want; we know from the metadata item the sequence’s uuid and how many items are in it so we batch_get them. Thus if our metadata indicates our sequence contains three items we do a batch_get for hash_key[1.uuid, 2.uuid, 3.uuid] and DynamoDB efficiently retrieves the sequence for us. Yay. We could, of course, get them one at a time or in paginated batches instead. This covers a lot of use cases and does so without using an index. Naturally an appropriate solution to the partitioning problem above must be implemented as well.
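
Here is a rough sketch of that sequence pattern in Ruby; the attribute names, the position.uuid range key format and the (nonexistent) error handling are all illustrative.

    require "securerandom"

    # Write the data items first, metadata last: a crash mid-write leaves only
    # orphaned data items, never a visible-but-incomplete sequence.
    def write_sequence(dynamodb, user_id, category_name, articles)
      uuid = SecureRandom.uuid

      articles.each_with_index do |article, i|
        dynamodb.put_item(
          table_name: "Articles",
          item: article.merge("UserId" => user_id, "SeqKey" => "#{i + 1}.#{uuid}")
        )
      end

      dynamodb.put_item(
        table_name: "Articles",
        item: { "UserId" => user_id, "SeqKey" => "0.#{uuid}",
                "Category" => category_name, "Count" => articles.size }
      )
    end

    # Batch-get the whole sequence using the uuid and count from the metadata
    # item. Note batch_get_item caps out at 100 keys per call.
    def read_sequence(dynamodb, user_id, uuid, count)
      keys = (1..count).map { |i| { "UserId" => user_id, "SeqKey" => "#{i}.#{uuid}" } }
      dynamodb.batch_get_item(request_items: { "Articles" => { keys: keys } })
              .responses["Articles"]
    end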

Deleting

Deleting items from DynamoDB at scale is expensive. Avoid it if possible. Prefer to rotate tables instead–a new table for each month, for example–eventually expiring or moving old tables to cold storage.
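
A trivial sketch of the rotation, with a made-up naming scheme:

    # One table per month; old months get exported or dropped wholesale
    # instead of deleted item by item.
    def articles_table(time = Time.now.utc)
      "Articles_#{time.strftime('%Y_%m')}"
    end

    articles_table   # => "Articles_2015_01", for example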

Backups & Reporting

If we are running at scale we are probably going to be very unhappy trying to restore a DynamoDB table of any significant size from some other media. Rotating tables helps here by keeping things small but the only real way to quickly recover from, say, an accidentally dropped table, is a hot spare table. Also if we want to analyze the data on a table, as with reporting or other data mining activity, the recommended method is to use a hot spare table and do the analysis on it so as not to impact performance of the production table.

Scan API

The Scan API iterates over each item in a DynamoDB table. Just no.

Conclusion

Whew! There is plenty more to talk about but this should be enough to get us moving in the right direction. I will write a follow up article if there is interest. Have fun! Feel free to hit me up @ggfuzz.

UPDATE

I wrote a Ruby gem for this.


Ruby in Jails

A number of Ruby and Rails security vulnerabilities have come to light recently. This is an excellent article about the situation and what you should do about it. If you do not have hardened servers you should follow the given advice to the letter. If you do need to rebuild, this would be a good opportunity to migrate your infrastructure to Chef, Puppet or another tool that will allow you to easily rebuild again in the future.

But is it true that all servers will always be vulnerable? Can you not build out some servers that you can genuinely trust against common threats on an ongoing basis?

It is true that all networked servers will always be vulnerable to a suitably skilled and determined attacker. But, yes, you can build out hardened servers that are extremely difficult to compromise–they will not protect you from a crack team of NSA PhDs or, worse, a really, really smart 13-year-old–but they will be immune to automated attacks as well as determined attacks from less-than-great hackers. Anyone who can compromise a well-hardened server has the skills to be making a lot of money, so you will generally be safe from most attack vectors unless compromising your server would be worth a lot of money to someone. But hopefully if you are in banking or such this is not your first introduction to hardened servers.

The Ruby community mostly uses Linux servers. I have extensive experience with Linux but when it comes to locking things down I prefer BSD. I started using Ruby about a decade ago when I ran across portupgrade(1) which is written in Ruby and at the time was FreeBSD’s primary tool for keeping third-party software from the ports system up to date. I was fascinated by how slow yet intelligent its behavior was; I could just keep running portupgrade(1) over and over again if I got stuck and it would eventually sort everything out. I have been a Rubyist ever since.

FreeBSD’s port system includes rubygem ports for gems with external dependencies. Install those and FreeBSD will make sure a gem’s dependencies are there prior to installation and everything gets upgraded in sync with proper versions forever. My upgrades look like portupgrade -aRr && gem update

I almost never have to manually intervene; I cannot even remember the last time it happened. I have not had to install a dependency for a gem manually on FreeBSD in years.

My point here is that FreeBSD and Ruby go together like peanut butter and jelly and have for many years; this is not a new thing. Matz used to hang out on freebsd-hackers and much of Ruby’s philosophy comes from that community. That said, I do not expect many converts (and that’s fine with me–it makes me a harder target); but I do hope this will provide some ideas on how to lock down your own infrastructure, whatever platforms you may use. As with anything, one size does not fit all. In most of my environments I have at least a pair of hardened servers and try to keep the rest of the boxes as throwaway as possible. You do not have to lock every box down like Fort Meade. Unless you do.

I have included some links to the relevant FreeBSD documentation, which I find to be well-written, informative and helpful on most topics it covers.

Securing the disk: full encryption at the hardware level. This provides privacy even if the storage media itself is stolen or otherwise compromised (block-level copy).

Securing access: only run services inside of jails (shared kernel VMs) and do not allow in-band remote server access whatsoever; instead connect serial ports on your servers to a console server and only access physical hardware out of band. These days an analog modem is a worthy choice as it is completely off the radar of today’s script kiddie and, as a bonus, you can access your physical machines even if your data center’s network goes down. Plus you can access BIOS settings, single-user mode and other system functions not typically available with a remote shell. Console servers are traditionally paired with remote power distribution units (PDUs), which are essentially power strips that you can control programmatically if you, say, need to power cycle a stuck server on a given outlet. Together these devices provide the mechanisms necessary to write scripts to self-diagnose and self-correct pretty much any server failure short of actual hardware failure. Virtual versions of some of these facilities are available with virtualization software (VMware Server, etc) and server management cards (though beware a lot of the UIs for the latter are Java-only. And buggy to boot.)

Securing the process table and filesystem: securelevels, file flags, TrustedBSD/MAC framework; in the hardlink OS jail model the read-only partitions (/, /usr – BSD is very good about keeping read-only and read-write files in separate trees so they can be treated differently on disk) are all hardlinked (saving inodes across jails) and immutable from inside the jail even by root. Better yet, a jail does not have to have root access at all and only needs a bare minimum OS–even a single service from a single binary file on a read-only file system with no root access–in order to run. You can take things even further by setting file flags (such as immutable, append-only, etc) that can only be changed when the system is in single-user mode. Thus an attacker can only change these files when sitting in front of the server or when connected via console server, as networking (much less sshd) is not available in single-user mode. Or take it further still and use the MAC framework for truly deep control, including the ability to limit the powers of root. SELinux provides similar facilities in Linux.
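
For a flavor of what that looks like in practice, here is an illustrative snippet; chflags(1) and kern.securelevel are real FreeBSD facilities, but the paths and the idea of driving them from a little Ruby script are assumptions for the example.

    # Mark a jail's binaries system-immutable, then raise the securelevel so
    # the flag cannot be cleared again outside single-user mode.
    %w[/jails/www/bin /jails/www/usr/bin].each do |dir|
      system("chflags", "-R", "schg", dir)
    end
    system("sysctl", "kern.securelevel=3")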

Securing the logs: these security features can provide extremely detailed security auditing and logging (and control). With a bit of elbow grease you can capture any sort of malicious system activity you can imagine. Be sure to secure a copy of those logs; separate, hardened log servers are good. One set up to burn logs to non-rewritable optical media (DVD-R, etc) is even better. Security features can be configured to only allow appending to certain files, etc, unless the system is in single-user mode (or a similar safe state). There are many different paths you can take, just make sure your logs are safe.

Securing the cloud: You do not own the cloud, you should not show the cloud your private data. Any data not meant for public download should be encrypted before being persisted in the cloud.

Securing the data: If possible (unless you are dealing with petabytes of data, in other words) you should also be backing your data up to hardened boxes. If the worst happens you want to be 100% sure that no one can tamper with your backups. Put another set of encrypted backups in the cloud, by all means–two or three copies even to be safe–but always keep one set on a hardened server. Backups to physical media should be encrypted, no exceptions.

Securing the network: I assume you already use an external firewall of some sort, whether it belongs to you or your provider. That’s great but I always also run the pf firewall locally on each server both to enforce security (for incoming traffic and for any ports opened by an unauthorized service) and to give me eyes into what is going on with the network on that box. This also lets me adapt the firewall on the fly–for example a script can watch the logs for excessive sshd login failures and then tell the firewall to block all traffic to and from that IP address. PF also lets you route packets to and from local IP addresses so you can have any number of jails running on local IP addresses (127.0.0.2, 127.0.0.3, etc), each providing a different service on a different port on the same public IP address. In this way every service on a given public IP is provided by a different VM and a compromise to one is not a compromise to any other. Meanwhile the remote logging for all jails is running on the jail host (you can see all jail activity and read/write their filesystems from the jail host), which need not have in-band (ssh) access at all, and you have captured all the nefarious activity to your hardened log servers. Now you can make informed decisions about how to respond to the incident rather than having to assume that a compromised user account means wipe everything and start over. Which IS the correct response if you have not properly locked down your infrastructure.
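
As a sketch of the sshd-watching idea, something like the following could run on each box; the log path, the match pattern, the threshold and the pf table name are all assumptions, and a real version would want persistence and expiry of blocked addresses.

    #!/usr/bin/env ruby
    # Toy log watcher: after repeated sshd failures from one address, add it
    # to a pf table (assumes pf.conf has a matching block rule for the
    # <bruteforce> table).
    LOG   = "/var/log/auth.log"
    LIMIT = 5
    fails = Hash.new(0)

    File.open(LOG) do |log|
      log.seek(0, IO::SEEK_END)                  # only watch new entries
      loop do
        line = log.gets
        (sleep 1; next) unless line
        next unless line =~ /Failed password for .+ from (\S+)/
        ip = Regexp.last_match(1)
        fails[ip] += 1
        system("pfctl", "-t", "bruteforce", "-T", "add", ip) if fails[ip] == LIMIT
      end
    end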

You can never be entirely crackproof while connected to a network, but you can harden key bits of your infrastructure (the more layers of protection you have the more you frustrate any would-be attacker) and be prepared to quickly rebuild the rest from scratch if necessary–Chef, Puppet, PaaS from which you can spin up instances at will, whatever works best for your environment. Naturally not all of these safeguards will be available to you, depending on your OS, hosting provider and other variables. This is not meant as a how-to but rather as an introduction to some tools that I use when securing my servers. With a bit of research and effort you can no doubt find similar tools for your platforms of choice.


Buying Local

I am a big fan of buying local. I am a big fan of shopping on the Internet. I am a big fan of living a car-free lifestyle. We walked to the hospital to have our first child, and we walked him home. Our last child was born at home and, as far as I know, has never ridden in an automobile, though he does ride public transportation from time to time.

It is great to ride a bicycle and I have a trailer I can load with groceries and such, but this is the rainy Pacific Northwest and we have four little kids. What I really want is to be able to shop an Amazon-like site that knows where I am and what nearby stores have in stock and lets me buy stuff from them via credit card/Paypal/whatever. And have it delivered, either directly by the store or by someone contracted to go pick it up–cabs would probably be awesome for this, especially if it can be arranged so that they are doing the pickups nearby after dropping off a fare, such that time that was being wasted on unpaid, unfruitful return trips can now earn them extra money. And since most deliveries are not especially time-sensitive as long as they happen within a day (unless frozen food or donor organs are involved) the picked-up items could be brought back to the cab garage for later delivery, when a call for a cab from/to the area is received.

This goal is easy to achieve with today’s technology. So what’s the holdup?

Lack of a standardized API and lack of someone or something to drive adoption of such an API by POS/inventory system vendors. Lack of a geolocation-aware API directory publishing service.
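
For concreteness, a response from the kind of standardized inventory API I am imagining might look something like this; it is purely hypothetical and does not correspond to any real spec.

    require "json"

    # Hypothetical response shape for a standardized local-inventory API.
    inventory = {
      store: {
        name:     "Alberta Street Books",
        location: { lat: 45.559, lng: -122.645 },
        hours:    "10:00-18:00"
      },
      items: [
        { sku: "BK-1234", name: "Example Title", price_cents: 1599, in_stock: 3 }
      ]
    }

    puts JSON.pretty_generate(inventory)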

The way I see the future, a shop or restaurant in a well-traveled area will not need its own website–third-party developers will use the available API data to create aggregate websites (perhaps supported by advertising) that let you browse the menus/inventory of all restaurants and shops in the area. Thus rather than zooming in on Alberta Street in Google Maps and then going along checking out the websites, Yelp reviews, etc, of each place, one could just go to an Alberta Street site and see/search it all.

Less investment for the shop owners, more exposure, better data/experience for consumers, new income streams for web developers, a fantasy playground for entrepreneurs. Delivery services are a simple, obvious example; once this actually exists the ensuing creativity and innovation will produce game-changers we can only begin to imagine.

I see this less as a technical challenge and more as something that requires people and persuasion skills–coming up with a generic standardized retail/restaurant API and directory service can probably be done over a weekend. The real work will be in getting all of the various POS/inventory system vendors to agree to the spec and then either getting them to develop the interfaces for their products internally or convincing them to allow others to do it for them.