
TLS 1.3 is going to save us all, and other reasons why IoT is still insecure


As I’m writing this, four DDoS attacks are ongoing and being automatically mitigated by Gatebot. Cloudflare’s job is to get attacked. Our network gets attacked constantly.

Around the fall of 2016, we started seeing DDoS attacks that looked a little different than usual. One attack we saw around that time had traffic coming from 52,467 unique IP addresses. The clients weren’t servers or desktop computers; when we tried to connect to the clients over port 80, we got the login pages to CCTV cameras.

Obviously it’s important to lock down IoT devices so that they can’t be co-opted into evil botnet armies, but when we talk to some IoT developers, we hear a few concerning security patterns. We’ll dive into two problematic areas and their solutions: software updates and TLS.

The Trouble With Updates

With PCs, the end user is ultimately responsible for securing their devices. People understand that they need to update their computers and phones. Just 4 months after Apple released iOS 10, it was installed on 76% of active devices.

People just don’t know that they’re supposed to update IoT things the way they update their computers, because they’ve never had to update these things before. My parents are never going to install a software update for their thermometer.

And the problem gets worse over time. The longer a device stays on an older software version, the less likely it is to remain compatible with the newer version. At some point, an update may not be possible anymore. This is a very real concern, as the shelf life of a connected thing can be 10 years in the case of a kitchen appliance - how often do you buy a new refrigerator?

And that’s if the device can be patched at all. First, devices that are low on battery are often programmed not to receive updates because updating is too draining on the battery. Second, IoT devices are too lightweight to run a full operating system; they run just a compiled binary on firmware, which limits the code that can later be pushed to them. Some devices cannot receive specific patches at all.

The other thing we hear about updates from IoT developers is that often they are afraid to push a new update because it could mean breaking hundreds of thousands of devices at once.

All this may not seem like a big deal - ok, so a toaster can get hacked, so what - but two very real things are at stake. First, every device that’s an easy target makes it easier to turn other applications into targets. Second, once someone is sitting on a device, they are inside your network, which puts any traffic sent over the wire at risk.

The security model that worked for PC doesn’t work for IoT — the end user can’t be responsible, and patching isn’t reliable. We need something else. What’s the solution?

Traffic to an IoT device passes through many different networks: the transit provider from the application server, the content delivery network used to deliver device traffic, the ISP to the building where the device sits.

It is at those network layers that protection can be added. As IoT device traffic moves through these networks, packets can be filtered to let only good traffic in. Even if a device is running vulnerable code, filters added at the network level can keep hackers out.

The Trouble With TLS

TLS is used in two ways in IoT devices. First, TLS is used to encrypt data in transit; this provides data privacy and makes it harder to reverse engineer the communications used by the device. Second, devices store client TLS certificates that are used to authenticate the devices to the application, which makes it one step harder to fake a device.

There are three problems developers run into when they want to implement TLS in IoT. The first is that while IoT traffic needs to be quick and lightweight, TLS adds an additional two round trips to the start of every session. The second is that certificates can be large files, and device memory is limited in IoT. And the third is that some of the protocols that are being developed for IoT are plaintext by default.

TLS Isn’t Lightweight

IoT devices run on low power chips. An IoT device may only have 256 or 512 KB of RAM and often needs to conserve battery. They constantly send and receive lots of small pieces of information. Imagine an internet-connected wind sensor - it measures wind speed and, every 30 seconds, sends the new reading to the application server. It’s just a few bytes of data that need to get over the wire, and the device wants to do so with as little overhead as possible to conserve RAM and battery life.

Here’s an HTTP POST to do that:

[Figure: the wind sensor's HTTP POST, sent without TLS]
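For illustration, here is a rough Python equivalent of that report; the endpoint, payload format, and the use of the third-party requests library are all assumptions made for this sketch:

import requests

# One reading every 30 seconds; the payload is only a few bytes.
reading = {"wind_speed_mps": 4.2}

# Hypothetical endpoint - plain HTTP, so no handshake overhead (and no protection either).
resp = requests.post("http://sensor-api.example.com/wind", json=reading, timeout=5)
resp.raise_for_status()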

But let’s say the same device is going to use TLS. Here’s what the same POST looks like with the TLS handshake — this is with TLS 1.2:

[Figure: the same POST preceded by a full TLS 1.2 handshake, adding two round trips]

Depending on the distance between the device and the application server and the latency of the server, this can add hundreds of milliseconds. The solution is likely the newest version of TLS, TLS 1.3.

TLS 1.3 eliminates a complete round trip in the TLS handshake, which makes TLS much lighter and faster. It cuts the number of round trips in the handshake in half by predicting which key agreement protocol and algorithm the server will choose, and sending those guessed parameters and the key share directly in the client hello. If the server likes them, it sends back its own key share for the same algorithm, and the whole handshake is done.

[Figure: the TLS 1.3 handshake, completing in a single round trip]

If the same IoT device talks to the same server again, there’s actually no round trip at all. The parameters chosen in the initial handshake are sent alongside application data in the first packet.

[Figure: TLS 1.3 0-RTT resumption, with application data sent in the first packet]

Why isn’t every IoT device using 1.3 today? TLS 1.3 is still actively being developed on the IETF standards track, and while Chrome (as of version 56 in January) and Firefox (as of version 52 in March) support 1.3, not everything does. The biggest problem today is middleboxes used by ISPs and enterprises that panic when they see a 1.3 handshake and close the connection. This also happened when the world was upgrading to TLS 1.2 and middleboxes only understood TLS 1.1, so it’s just a matter of time.

TLS Certificate Size

In a TLS handshake, the server can use a server-side TLS certificate to authenticate itself to the client, and the client can use a client-side certificate to authenticate itself to the server. Devices often store certificates to authenticate themselves to the application server. However, device memory is often limited in IoT, and certificates can be large. What can we do?

Most certificates today use the RSA algorithm, which has been around since the 70’s. The certificates are large because RSA keys need to be large to be secure - typically 1,024 or 2,048 bits. However, a newer approach using elliptic curve cryptography has been in wide use since the early 2000’s and can solve this problem. With elliptic curve cryptography we can use much smaller keys with the same level of security as a larger RSA key and save space on the device.
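As a rough, hedged sketch of the size difference (assuming the third-party Python cryptography package; the exact byte counts vary with encoding):

from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import ec, rsa

def public_key_bytes(private_key) -> int:
    # Length of the DER-encoded public key (SubjectPublicKeyInfo).
    return len(private_key.public_key().public_bytes(
        encoding=serialization.Encoding.DER,
        format=serialization.PublicFormat.SubjectPublicKeyInfo,
    ))

rsa_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
ec_key = ec.generate_private_key(ec.SECP256R1())  # P-256

print("RSA-2048 public key:", public_key_bytes(rsa_key), "bytes")  # roughly 294 bytes
print("P-256 public key:", public_key_bytes(ec_key), "bytes")      # roughly 91 bytes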

Default Plaintext IoT Protocols

IoT devices need to be lightweight so two emerging protocols are replacing HTTP as the dominant transfer protocol for some IoT devices: MQTT and CoAP.

MQTT is a pub/sub protocol that has been around almost 20 years. In MQTT, a proxy server acts as a broker. An IoT device or web app publishes a message to the broker, and the broker distributes those messages to all the other IoT devices that need to receive that message.

[Figure: MQTT publish/subscribe through a broker]

When MQTT was written almost 20 years ago, it was intentionally written without security. It was designed for oil and gas companies that were just sending sensor data, and no one thought it needed to be encrypted.
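In practice, encryption is layered on by running MQTT over TLS (conventionally on port 8883). A minimal sketch, assuming the third-party paho-mqtt 1.x client and a hypothetical broker, topic, and CA bundle path:

import paho.mqtt.publish as publish

publish.single(
    "sensors/wind-speed",              # hypothetical topic
    payload="4.2",
    hostname="broker.example.com",     # hypothetical broker
    port=8883,                         # conventional MQTT-over-TLS port
    tls={"ca_certs": "/etc/ssl/certs/ca-certificates.crt"},  # CA bundle path varies by OS
)

Without the tls argument (and with port 1883), the same publish goes out in plaintext - which is exactly the default these protocols started from.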

CoAP was standardized just three years ago. It has all the same methods as HTTP, but it runs over UDP, so it’s really light.

[Figure: a CoAP request and response over UDP]

The problem is that if you want to add TLS (really DTLS, since CoAP runs over UDP), it’s no longer light.

[Figure: the same CoAP exchange with DTLS added]

The Future

It will be quite interesting to see how update mechanisms and TLS implementations change as the number of deployed IoT devices continues to grow. If this type of thing interests you, come join us.


Simple Cyber Security Tips (for your Parents)


Today, December 25th, Cloudflare offices around the world are taking a break. From San Francisco to London and Singapore, engineers have retreated home for the holidays (albeit with the on-call engineers closely monitoring their mobile phones).

Whilst our Support and SRE teams operated on a schedule to ensure fingers were on keyboards, on Saturday I headed out of London, bound for the Warwickshire countryside. Away from the barracks of the London tech scene, it didn't take long for the following conversation to happen:

  • Family member: "So what do you do nowadays?"
  • Me: "I work in Cyber Security."
  • Family member: "There seems to be a new cyber attack every day on the news! What can I possibly do to keep myself safe?"

If you work in the tech industry, you may find a family member asking you for advice on cybersecurity. This blog post will hopefully save you from stuttering whilst trying to formulate advice (like I did).

The Basics

The WannaCry Ransomware Attack was one of the most high-profile cyberattacks of 2017. In essence, ransomware works by infecting a computer, then encrypting files - preventing users from being able to access them. Users then see a window on their screen demanding payment with the promise of decrypting files. Multiple copycat viruses also sprung up, using the same exploit as WannaCry.

It is worth noting that even after paying, you're unlikely to get your files back (don't expect an honest transaction from criminals).


WannaCry was estimated to have infected over 300,000 computers around the world, including high-profile government agencies and corporations; the UK's National Health Service was one notable victim.

Despite the wide-ranging impact of this attack, a lot of victims could have protected themselves fairly easily. Security patches had already been available to fix the bug that allowed this attack to happen and installing anti-virus software could have contained the spread of this ransomware.

For consumers, it is generally a good idea to install updates, particularly security updates. Platforms like Windows XP no longer receive security updates and therefore shouldn't be used - regardless of how up-to-date they are on security patches.

Of course, it is also essential to back up your most indispensable files - and not just because of the damage security vulnerabilities can cause.

Don't put your eggs in one Basket

It may not be Easter, but you certainly should not be putting all your eggs in one basket. For this reason, it is not a good idea to use the same password across multiple sites.

Passwords have been around since 1961; however, no alternative has been found which keeps all their benefits. Users continue to set weak passwords, and website developers continue to store them insecurely.

When developers store passwords, they should do so in a way that lets them check whether a password is correct without ever being able to recover the original password. Unfortunately, many websites (including some popular ones) implement this poorly. When they get hacked, a password dump can be leaked containing everyone's emails/usernames alongside their passwords.
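To make that concrete, here is a minimal sketch of the right way to do it, assuming the third-party Python bcrypt package; the password here is obviously just an example:

import bcrypt

# On signup: store only the salted, deliberately slow hash - never the password itself.
stored = bcrypt.hashpw(b"correct horse battery staple", bcrypt.gensalt(rounds=10))

# On login: compare the attempt against the stored hash.
assert bcrypt.checkpw(b"correct horse battery staple", stored)
assert not bcrypt.checkpw(b"wrong guess", stored)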

If the same email/username and password combination are used on multiple sites, hackers can automatically use the breached user data from one site to attempt logins against other websites you use online.

For this reason, it's absolutely critical to use a different password for every site. Password manager apps like LastPass or 1Password allow you to use unique, randomly-generated passwords for each site but manage them from one encrypted wallet using a master password.

Simple passwords, based on personal information or on individual dictionary words, are far from safe too. Computers can repeatedly run through common passwords in order to crack them. Similarly, adding numbers and symbols (e.g. changing password to p4$$w0rd) does little to help.

When you have to choose a password you need to remember, you can create strong passwords from sentences. For example: "At Christmas my dog stole 2 pairs of Doc Martens shoes!" can become ACmds2poDMs! Passwords based on simple sentences can be long, but still easy to remember.

Another approach is to simply select four random dictionary words, for example: WindySoapLongBoulevard. (For obvious reasons, don't actually use that as your password.) Although this password uses solely letters, it is more secure than a shorter password that would also use numbers and symbols.
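If you want the computer to do the picking for you, here is a small sketch using Python's secrets module (the word list path is illustrative; any large dictionary file will do):

import secrets

with open("/usr/share/dict/words") as f:
    words = [w.strip() for w in f if w.strip().isalpha() and len(w.strip()) > 3]

# Four cryptographically random words, capitalised and joined together.
passphrase = "".join(secrets.choice(words).capitalize() for _ in range(4))
print(passphrase)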

Layering Security

Authentication is how computers confirm you are who you say you are. Fundamentally, it is done using one of:

  • Something you know
  • Something you have
  • Something you are

A password is an example of how you can log-in using "something you know"; if someone is able to gain access to that password, it's game-over for that online account.

Instead, it is possible to use "something you have" as well. This means, should your password be intercepted or disclosed, you still have another safeguard to protect your account.

In practice, this means that after entering your password onto a website, you may also be prompted for another code that you need to read off an app on your phone. This is known as Two-Factor Authentication.
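Those codes are typically time-based one-time passwords (TOTP): the app and the website share a secret when you first enable the feature, and both derive the same short-lived code from it. A sketch assuming the third-party pyotp package; the secret here is generated on the spot purely for illustration:

import pyotp

secret = pyotp.random_base32()   # in reality, provisioned via the QR code you scan
totp = pyotp.TOTP(secret)

code = totp.now()         # the 6-digit code the phone app displays
assert totp.verify(code)  # what the website checks at login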


Two-Factor Authentication is supported on many of the world's most popular social media, banking and shopping sites. You can find out how to enable it on popular websites at turnon2fa.com.

Know who you talk to

When you browse to a website, you may notice a lock symbol light up in your address bar. This indicates that encryption is enabled when talking to the website, which is important in order to prevent interception.


When inputting personal information into websites, it is important you check this green lock appears and that the website address starts with "https://".

It is, however, important to double check the address bar you're putting your personal information into. Is it cloudflare.com or have you been redirected away to a dodgy website at cloudflair.com or cloudflare.net?

Despite how common encrypted web traffic has become, on many sites, it still remains relatively easy to strip away this encryption - by pointing internet traffic to a different address. I describe how this can be done in:
Performing & Preventing SSL Stripping: A Plain-English Primer

It is also good guidance to be careful about the links you see in emails; are they legitimate emails from your bank, or is someone trying to capture your personal information with a fake "phishing" website that looks just like your bank's? Just because someone has a little bit of information about you doesn't mean they are who they say they are. When in doubt, avoid following links directly in emails, and check the validity of the email independently (such as by going directly to your banking website). A correct-looking "from" address isn't enough to prove an email is coming from who it says it's from.

Conclusions

We always hear of new and innovative security vulnerabilities, but for most users, remembering a handful of simple security tips is enough to protect against the majority of security threats.

In Summary:

  • As a rule-of-thumb, install the latest security patches
  • Don't use obsolete software which doesn't provide security patches
  • Use well-trusted anti-virus
  • Back up the files and folders you can't afford to lose
  • Use a Password Manager to set random, unique passwords for every site
  • Don't use common keywords or personal information as passwords
  • Adding numbers and symbols to passwords often doesn't add security but impacts how easy they are to remember
  • Enable Two-Factor Authentication on sites which support it
  • Check the address bar when inputting personal information, make sure the connection is encrypted and the site address is correct
  • Don't believe everything you see in your email inbox or trust every link sent through email, even if the sender has some information about you.

Finally, from everyone at Cloudflare, we wish you a wonderful and safe holiday season. For further reading, check out the Internet Mince Pie Database.

The History of Stock Quotes


In honor of all the fervor around Bitcoin, we thought it would be fun to revisit the role finance has had in the history of technology even before the Internet came around. This was adapted from a post which originally appeared on the Eager blog.

On the 10th of April 1814, almost one hundred thousand troops fought the battle of Toulouse in Southern France. The war had ended on April 6th. Messengers delivering news of Napoleon I’s abdication and the end of the war wouldn’t reach Toulouse until April 12th.

The issue was not the lack of a rapid communication system in France, it just hadn’t expanded far enough yet. France had an elaborate semaphore system. Arranged all around the French countryside were buildings with mechanical flags which could be rotated to transmit specific characters to the next station in line. When the following station showed the same flag positions as this one, you knew the letter was acknowledged, and you could show the next character. This system allowed roughly one character to be transmitted per minute, with the start of a message moving down the line at almost 900 miles per hour. It wouldn’t expand to Toulouse until 1834 however, twenty years after the Napoleonic battle.

[Image: Cappy Telegraph System]

Stocks and Trades

It should be no secret that money motivates. Stock trading presents one of the most obvious uses of fast long-distance communication. If you can find out about a ship sinking or a higher-than-expected earnings call before other traders, you can buy or sell the right stocks and make a fortune.

In France, however, it was strictly forbidden to use the semaphore system for anything other than government business. Being such a public method of communication, it wasn’t really possible for an enterprising investor to ‘slip in’ a message without discovery. The Blanc brothers figured out one method, though. They discovered they could bribe the operator to include one extra bit of information, the “Error - cancel last transmitted symbol” control character, with a message. If an operative spotted that symbol, they knew it was time to buy.

Semaphore had several advantages over an electric telegraph. For one, there were no lines to cut, making it easier to defend during war. Ultimately though, its slow speed, need for stations every ten miles or so, and complete worthlessness at night and in bad weather made its time on this earth limited.

Thirty-Six Days Out of London

Ships crossing the Atlantic were never particularly fast. We Americans didn’t learn of the end of our own revolution at the Treaty of Versailles until October 22nd, almost two months after it had been signed. The news came from a ship “thirty-six days out of London”.

Anyone who could move faster could make money. At the end of the American Civil War, Jim Fisk chartered high speed ships to race to London and short Confederate bonds before the news could reach the British market. He made a fortune.

It wasn’t long before high speed clipper ships were making the trip with mail and news in twelve or thirteen days regularly. Even then though, there was fierce competition among newspapers to get the information first. New York newspapers like the Herald and the Tribune banded together to form the New York Associated Press (now known just as the Associated Press) to pay for a boat to meet these ships 50 miles off shore. The latest headlines were sent back to shore via pigeon or the growing telegraph system.

The Gold Indicator


Most of the technology used by the Morse code telegraph system was built to satisfy the demands of the finance industry.

The first financial indicator was a pointer which sat above the gold exchange in New York. In our era of complex technology, the pointer system has the refreshing quality of being very simple. An operator in the exchange had a button which turned a motor. The longer he held the button down, the further the motor spun, changing the indication. This system had no explicit source of ‘feedback’, beyond the operator watching the indicator and letting go of his button when it looked right.

Soon, other exchanges were clamoring for a similar indicator. Their motors were wired to those of the Gold Exchange. This did not form a particularly reliable system. Numerous boys had to run from site to site, resetting the indicators when they came out of sync from that at the Gold Exchange.

The Ticker Tape

I am crushed for want of means. My stockings all want to see my mother, and my hat is hoary from age.

— Samuel Morse, in his diary

This same technology formed the basis for the original ticker tape machines. A printing telegraph from this era communicated using a system of pulses over a wire. Each pulse would move the print head one ‘step’ on a ratcheting wheel. Each step would align a different character with the paper to be printed on. A longer pulse over the wire would energize an electromagnet enough to stamp the paper into the print head. Missing a single pulse, though, would send the printer out of alignment, creating a 19th century version of Mojibake.

It was Thomas Edison who invented the ‘automatic rewinder’, which allowed the machines to be synchronized remotely. The first system used a screw drive. If you moved the print head through three full revolutions without printing anything, you would reach the end of the screw and it would stop rotating at a known character, aligning the printers. Printing an actual character would reset the screw. A later system of Edison’s used the polarity of the wire to reset the machines. If you flipped the polarity on the wire, switching negative and positive, the head would continue to turn in response to pulses, but it would stop at a predefined character, allowing you to ‘reset’ any of the printers which may have come out of alignment. This was a big enough problem that there is an entire US Patent Classification devoted to ‘Union Devices’ (178/41).

It will therefore be understood from the above explanation that the impression of any given character upon the type-wheel may be produced upon the paper by an operator stations at a distant point, ... simply by transmitting the proper number of electrical impulses of short duration by means of a properly-constructed circuit-breaker, which will cause the type-wheel to revolve without sensibly affecting the impression device. When the desired character is brought opposite the impression-lever the duration of the final current is prolonged, and the electro-magnet becomes fully magnetized, and therefore an impression of the letter or character upon the paper is produced.

— Thomas A. Edison, Patent for the Printing Telegraph 1870

Ticker tape machines used their own vocabulary:

IBM 4S 651/4

Meant 400 shares of IBM had just been sold for $65.25 per share (stocks were priced using fractions, not decimal numbers).

Ticker tape machines delivered a continuous stream of quotes while the market was open. The great accumulation of used ticker tape led to the famous ‘ticker tape parades’, where thousands of pounds of the tape would be thrown from windows on Wall Street. Today we still have ticker tape parades, but without the tape itself; the paper is now bought specifically to be thrown out the window.

Trans-Lux

What’s the best way to share the stock ticker tape with a room full of traders? The early solution was a chalkboard where relevant stock trades could be written and updated throughout the day. Men were also employed to read the ticker and remember the numbers, ready to recall the most recent prices when asked.

A better solution came from the Trans-Lux company in 1939, however. They devised a printer which would print on translucent paper. The numbers could then be projected onto a screen from the rear, creating the first large stock ticker everyone could read.

[Image: Trans-Lux Projection Stock Ticker]

This was improved through the creation of the Trans-Lux Jet. The Jet was a continuous tape composed of flippable cards. One side of each card was a bright color while the other was black. As each card passed by a row of electrically-controlled pneumatic jets, some were flipped, writing out a message which would pass along the display just as modern stock tickers do. The system would be controlled using a shift register which would read in the stock quotes and translate them into pneumatic pulses.

The Quotron

The key issue with a stock ticker is you have to be around when a trade of stock you care about is printed. If you miss it, you have to search back through hundreds of feet of tape to get the most recent price. If you couldn’t find the price, the next best option was a call to the trading floor in New York. What traders needed was a way of looking up the most recent quote for any given stock.

In 1960 Jack Scantlin released the Quotron, the first computerized solution. Each brokerage office would become host to a Quotron ‘master unit’, which was a reasonably sized ‘computer’ equipped with a magnetic tape write head and a separate magnetic tape read head. The tape would continually feed while the market was open, the write head keeping track of the current stock trades coming in over the stock ticker lines. When it was time to read a stock value, the tape would be unspooled between the two heads falling into a bucket. This would allow the read head to find the latest value of the stock even as the write head continued to store trades.

[Image: Quotron Keypad]

Each desk would be equipped with a keypad / printer combination unit which allowed a trader to enter a stock symbol and have the latest quote print onto a slip of paper. A printer was used because electronic displays were too expensive. In the words of engineer Howard Beckwith:

We considered video displays, but the electronics to form characters was too expensive then. I also considered the “Charactron tube” developed by Convair in San Diego that I had used at another company . . . but this also was too expensive, so we looked at the possibility of developing our own printer. As I remember it, I had run across the paper we used in the printer through a project at Electronic Control Systems where I worked prior to joining Scantlin. The paper came in widths of about six inches, and had to be sliced . . . I know Jack Scantlin and I spent hours in the classified and other directories and on the phone finding plastic for the tape tank, motors to drive the tape, pushbuttons, someone to make the desk unit case, and some company that would slice the tape. After we proved the paper could be exposed with a Xenon flash tube, we set out to devise a way to project the image of characters chosen by binary numbers stored in the shift register. The next Monday morning Jack came in with the idea of the print wheel, which worked beautifully.

The master ‘computer’ in each office was primitive by our standards. For one, it didn’t include a microprocessor. It was a hardwired combination of a shift register and some comparison and control logic. The desk units were connected with a 52-wire cable, giving each button on each unit its own wire. This was necessary because the units contained no logic themselves; their printing and button handling was all handled in the master computer.

When a broker in the office selected an exchange, entered a stock symbol, and requested a last price on his desk unit, the symbol would be stored in relays in the master unit, and the playback sprocket would begin driving the tape backwards over a read head at about ten feet per second, dumping the tape into the bin between the two heads (market data would continue to be recorded during the read operation). The tape data from the tracks for the selected exchange would be read into a shift register, and when the desired stock symbol was recognized, the register contents would be “frozen,” and the symbol and price would be shifted out and printed on the desk unit.

Only a single broker could use the system at a time:

If no desk units were in use, the master unit supplied power to all desk units in the office, and the exchange buttons on each unit were lit. When any broker pressed a lit button, the master unit disconnected the other desk units, and waited for the request from the selected desk unit. The desk unit buttons were directly connected to the master unit via the cable, and the master unit contained the logic to decode the request. It would then search the tape, as described above, and when it had an answer ready, would start the desk unit paper drive motor, count clock pulses from the desk unit (starting, for each character, when it detected an extra-large, beginning-of-wheel gap between pulses), and transmit a signal to operate the desk unit flash tube at the right time to print each character.

Ultronics

The Quotron system provided a vast improvement over a chalkboard, but it was far from perfect. For one, it was limited to the information available over the ticker tape lines, which didn’t include things like a stock’s volume, earnings, and dividends. A challenger named Ultronics created a system which used a similar hardwired digital computer, but with a drum memory rather than a magnetic tape.

[Image: Drum Memory]

The logic was advanced enough to allow the system to calculate volume, high and low for each stock as the data was stored. Rather than store one of these expensive memory units in every brokerage, Ultronics had centralized locations around the US which were connected to brokerages and each other using 1000 bps Dataphone lines.

This system notably used a form of packet addressing, possibly for the first time ever. When each stock quote was returned it included the address of the terminal which had made the request. That terminal was able to identify the responses meant for it based on that address, allowing all the terminals to be connected to the same line.

Quotron II

At one time during system checkout we had a very elusive problem which we couldn’t pin down. In looking over the programs, we realized that the symptoms we were seeing could occur if an unconditional jump instruction failed to jump. We therefore asked CDC whether they had any indication that that instruction occasionally misbehaved. The reply was, “Oh, no. That’s one of the more reliable instructions,” This was our first indication that commands could be ordered by reliability.

— Montgomery Phister, Jr. 1989

Facing competition from the Ultronics quote computers, it was time for Jack Scantlin’s team to create something even more powerful. What they created was the Quotron II. The Quotron II was powered by magnetic core memory, an early form of random-access memory which allowed them to read and store any stock’s value in any order. Unfortunately there wasn’t actually enough memory. They had 24K of memory to store 3000 securities.

One stock sold for over $1000; some securities traded in 32nds of a dollar; the prices to be stored included the previous day’s close, and the day’s open, high, low, and last, together with the total number of shares traded-the volume. Clearly we’d need 15 bits for each price (ten for the $1000, five for the 32nds), or 75 bits for the five prices alone. Then we’d need another 20 for a four-letter stock symbol, and at least another 12 for the volume. That added up to 107 bits (nine words per stock, or 27,000 words for 3000 stocks) in a format that didn’t fit conveniently into 12-bit words.

Their solution was to store most of the stocks in a compressed format. Each stock’s previous closing price was stored in 11 bits, and the other four prices were stored as six-bit increments from that number. Any stocks priced over $256, stocks which traded in fractions smaller than eighths, and increments too large to fit were stored in a separate overflow memory area.

The Quotron II system was connected to several remote sites using eight Dataphone lines which provided a total bandwidth of 16 kbps.

The fundamental system worked by having one 160A computer read stock prices from a punch tape (using about 5,000 feet of tape a day) into the common memory. A second 160A responded to quote requests over the Dataphone lines. The remote offices were connected to brokerage offices using teletype lines which could transmit up to 100 words per minute, where a device would forward the messages to the requesting terminal.

It’s somewhat comforting to learn that hack solutions are nothing new:

Once the system was in operation, we had our share of troubles. One mysterious system failure had the effect of cutting off service to all customers in the St. Louis area. Investigation revealed that something was occasionally turning off some 160A memory bits which enabled service to that region. The problem was “solved” for a time by installing a patch which periodically reinstated those bits, just in case.

The system was also notable for introducing the +/- tick to represent if a stock had gone up or down since the last trade. It also added some helpful calculated quantities such as the average price change of all NYSE stocks.

The story of Quotron II showcases the value of preparing for things to go wrong even if you’re not exactly sure how they will, and graceful degradation:

Jack Scantlin was worried about this situation, and had installed a feature in the Quotron program which discarded these common-memory programs, thus making more room for exceptions, when the market went haywire. On the day President Kennedy was assassinated, Jack remembers sitting in his office in Los Angeles watching features disappear until brokers could get nothing but last prices.

Those of us who worked on Quotron II didn’t use today’s labels. Multiprogramming, multiprocessor, packet, timesharing-we didn’t think in those terms, and most of us had never even heard them. But we did believe we were breaking new ground; and, as I mentioned earlier, it was that conviction more than any other factor that made the work fascinating, and the time fly.

It’s valuable to remember that as easy as this system might be to create with modern technology, it was a tremendous challenge at the time. “Most of us lived Quotron 12 to 14 hour days, six and a half days a week; but the weeks flew by, and before we turned around twice, five years were gone...”

NASDAQ

Anyone who has ever been involved with the demonstration of an on-line process knows what happens next. With everyone crowded around to watch, the previously infallible gear or program begins to fall apart with a spectacular display of recalcitrance. Well so it went. We set the stage, everyone held their breath, and then the first query we keyed in proceeded to pull down the whole software structure.

[Image: NASDAQ Terminal]

Feeling pressure from the SEC to link all the nation’s securities markets, the National Association of Securities Dealers decided to build an ‘automated quotation service’ for their stocks. Unlike a stock ticker, which provides the price of the last trade of a stock, the purpose of the NASDAQ system was to allow traders to advertise the prices they would accept to other traders. This was extremely valuable, as before the creation of this system, it was left to each trader to strike a deal with their fellow stock brokers, a very different system than the roughly ‘single-price-for-all’ system we have today.

The NASDAQ system was powered by two Univac 1108 computers for redundancy. The central system in Connecticut was connected to regional centers in Atlanta, Chicago, New York and San Francisco, where requests were aggregated and disseminated. As of December 1975 there were 20,000 miles of dedicated telephone lines connecting the regional centers to 642 brokerage offices.

Each NASDAQ terminal was composed of a CRT screen and a dedicated keyboard. A request for a stock would return the currently available bid and ask prices from each ‘market maker’ around the country. The market makers were the centers where stock purchases were aggregated and a price set. The trader could quickly see where the best price was available and call that market maker to execute his trade. Similarly, the market makers could use the terminal units to update their quotations and transmit the latest values. This type of detailed ‘per-market-maker’ information is actually still a part of the NASDAQ system, but it’s only accessible to paying members.

One thing this system didn’t do is support trading via computer, without calling the market maker on the phone (the AQ in NASDAQ actually stands for ‘Automated Quotations’; no stock purchasing capability was originally intended). This became a problem on Black Monday in 1987, when the stock market lost almost a quarter of its value in a single day. During the collapse, many market makers couldn’t keep up with the selling demand, leaving many small investors facing big losses with no way to sell.

In response the NASDAQ created the Small Order Execution System which allowed small orders of a thousand shares or less to be traded automatically. The theory was these small trades didn’t require the man-to-man blustering and bargaining which was necessary for large-scale trading. Eventually this was phased out, in favor of the nearly all computerized trading based system we have today.

Now

Today over three trillion dollars worth of stocks are traded every month on computerized stock exchanges. The stocks being traded represent over forty billion dollars worth of corporate value. With the right credentials and backing it’s possible to gain or lose billions of dollars in minutes.

These markets make it possible for both the average citizen and billion dollar funds to invest in thousands of companies. In turn, it allows those companies to raise the money they need to (hopefully) grow.


Our next post in this series is on the history of digital communication before the Internet came along. Subscribe to be notified of its release.

Concise (Post-Christmas) Cryptography Challenges


It's the day after Christmas; or, depending on your geography, Boxing Day. With the festivities over, you may still find yourself stuck at home and somewhat bored.

Either way, here are three relatively short cryptography challenges you can use to keep yourself momentarily occupied. Other than the hints (and some internet searching), you shouldn't need particularly deep cryptography knowledge to start diving into these challenges. For hints and spoilers, scroll down below the challenges!


Challenges

Password Hashing

The first one is simple enough to explain; here are 5 hashes (from user passwords), crack them:

$2y$10$TYau45etgP4173/zx1usm.uO34TXAld/8e0/jKC5b0jHCqs/MZGBi
$2y$10$qQVWugep3jGmh4ZHuHqw8exczy4t8BZ/Jy6H4vnbRiXw.BGwQUrHu
$2y$10$DuZ0T/Qieif009SdR5HD5OOiFl/WJaDyCDB/ztWIM.1koiDJrN5eu
$2y$10$0ClJ1I7LQxMNva/NwRa5L.4ly3EHB8eFR5CckXpgRRKAQHXvEL5oS
$2y$10$LIWMJJgX.Ti9DYrYiaotHuqi34eZ2axl8/i1Cd68GYsYAG02Icwve

HTTP Strict Transport Security

A website works by redirecting its www. subdomain to a regional subdomain (e.g. uk.); the site uses HSTS to prevent SSLStrip attacks. You can see cURL requests showing the headers from the redirects below. How would you practically go about stripping HTTPS in this example?

$ curl -i http://www.example.com
HTTP/1.1 302 Moved Temporarily
Server: nginx
Date: Tue, 26 Dec 2017 12:26:51 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
location: https://uk.example.com/
$ curl -i http://uk.example.com
HTTP/1.1 200 OK
Server: nginx
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Keep-Alive: timeout=5
Vary: Accept-Encoding
Cache-Control: no-cache
Date: Tue, 26 Dec 2017 12:26:53 GMT
Strict-Transport-Security: max-age=31536000
...

AES-256

The images below are encrypted using AES-256 in CTR mode. Can you work out what the original images were?

1_enc.bmp

[Image: encrypted bitmap 1_enc.bmp]

2_enc.bmp

[Image: encrypted bitmap 2_enc.bmp]

Hints

Password Hashing

What kind of hash algorithm has been used here? Given the input is from humans, how would you go about cracking it efficiently?

HTTP Strict Transport Security

All the information you could possibly need on this topic is in the following blog post: Performing & Preventing SSL Stripping: A Plain-English Primer.

AES-256

The original images, 1.bmp and 2.bmp, were encrypted using a command similar to the one below - the only thing which changed each time the command was run was the file names used:

openssl enc -aes-256-ctr -e -in 1.bmp -out 1_enc.bmp -k SomePassword -iv 0010011 -nosalt

Another hint: both images come from the same source image; each just contains different layers of that original image.

Solutions (Spoiler Alert!)

Password Hashing

The hashes start with a $2y$ identifier, which indicates they were created using BCrypt ($2*$ usually indicates BCrypt). The hashes also indicate they were generated using a somewhat decent work factor of 10.

Despite the recent introduction of the Argon2 key derivation function, BCrypt is still generally considered secure. Although it's infeasible to crack BCrypt itself, users have likely not been as security conscious when setting their passwords.

Let's try using a password dictionary to crack these hashes. I'll start with a small dictionary of about 10,000 passwords.

Using this password list, we can crack these hashes (the mode number 3200 represents BCrypt):

hashcat -a 0 -m 3200 hashes.txt ~/Downloads/10_million_password_list_top_10000.txt --force

Even on my laptop, it only takes 90 seconds to crack these hashes (the output below is in the format hash:plaintext):

$2y$10$DuZ0T/Qieif009SdR5HD5OOiFl/WJaDyCDB/ztWIM.1koiDJrN5eu:password1
$2y$10$TYau45etgP4173/zx1usm.uO34TXAld/8e0/jKC5b0jHCqs/MZGBi:password
$2y$10$qQVWugep3jGmh4ZHuHqw8exczy4t8BZ/Jy6H4vnbRiXw.BGwQUrHu:hotdog
$2y$10$0ClJ1I7LQxMNva/NwRa5L.4ly3EHB8eFR5CckXpgRRKAQHXvEL5oS:88888888
$2y$10$LIWMJJgX.Ti9DYrYiaotHuqi34eZ2axl8/i1Cd68GYsYAG02Icwve:hello123

HTTP Strict Transport Security

The first redirect is performed over plain-text HTTP and doesn't have HSTS enabled, but the redirect goes to a subdomain that does have HSTS enabled.

If we were to try stripping away the HTTPS in the redirect (i.e. forcibly changing HTTPS to HTTP using a Man-in-the-Middle attack), we wouldn't have much luck due to HSTS being enabled on the subdomain we're redirecting to.

Instead, we need to rewrite the uk. subdomain to point to a subdomain which doesn't have HSTS enabled, uk_secure. for example. In the worst case, we can simply redirect to an entirely phoney domain. You'll need some degree of DNS interception to do this.

From the phoney subdomain or domain, you can then proxy traffic back to the original domain - all the while acting as a Man-in-the-Middle for any private information crossing the wire.

AES-256

Generating Encrypted Files

Before we learn to crack this puzzle, let me explain how I set things up.

The original image looked like this:

[Image: the original, unencrypted bitmap]

Prior to encrypting this image, I split it into two distinct parts.

1.bmp

[Image: the first part of the original image, 1.bmp]

2.bmp

[Image: the second part of the original image, 2.bmp]

The first image was encrypted with the following command:

openssl enc -aes-256-ctr -e -in 1.bmp -out 1_enc.bmp -k hunter2#security -iv 0010011 -nosalt

Beyond the choice of cipher, there are two important options here:

  • -iv - We require OpenSSL to use a specific nonce instead of a dynamic one
  • -nosalt - Encryption keys in OpenSSL are salted prior to hashing; this option prevents that from happening

Using a Hex Editor, I added the headers to ensure the encrypted files were correctly rendered as BMP files. This resulted in the files you saw in the challenge above.

Reversing the Encryption

The critical mistake above is that the same encryption key and the same nonce were used for both files. Both ciphertexts were therefore generated with an identical keystream (and the keys were unsalted before they were hashed).

Additionally, we are using AES in CTR mode, which under these conditions is vulnerable to the “two-time pad” attack.

We firstly run an XOR operation on the two files:

$ git clone https://github.com/scangeo/xor-files.git
$ cd xor-files/source
$ gcc xor-files.c -o xor-files
$ ./xor-files /some/place/1_enc.bmp /some/place/2_enc.bmp > /some/place/xor.bmp
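Equivalently, the XOR can be done in a few lines of Python. Because both files were encrypted with the same keystream, XORing the ciphertexts cancels the keystream and leaves the XOR of the two plaintexts (C1 ⊕ C2 = P1 ⊕ P2):

with open("1_enc.bmp", "rb") as f1, open("2_enc.bmp", "rb") as f2:
    a, b = f1.read(), f2.read()

# Byte-by-byte XOR of the two ciphertexts; the keystream drops out.
xored = bytes(x ^ y for x, y in zip(a, b))

with open("xor.bmp", "wb") as out:
    out.write(xored)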

The next step is to add BMP image headers such that we can display the image:

[Image: the XOR of the two ciphertexts, rendered as a bitmap]

The resulting image is then inverted (using ImageMagick):

convert xor.bmp -negate xor_invert.bmp

We then obtain the original input:

[Image: the recovered original image]

Conclusion

If you're interested in debugging cybersecurity challenges on a network that sees ~10% of global internet traffic, we're hiring for security & cryptography engineers and Support Engineers globally.

We hope you've had a relaxing holiday season, and wish you a very happy new year!

Why TLS 1.3 isn't in browsers yet



Upgrading a security protocol in an ecosystem as complex as the Internet is difficult. You need to update clients and servers and make sure everything in between continues to work correctly. The Internet is in the middle of such an upgrade right now. Transport Layer Security (TLS), the protocol that keeps web browsing confidential (and many people persist in calling SSL), is getting its first major overhaul with the introduction of TLS 1.3. Last year, Cloudflare was the first major provider to support TLS 1.3 by default on the server side. We expected the client side would follow suit and be enabled in all major browsers soon thereafter. It has been over a year since Cloudflare’s TLS 1.3 launch and still, none of the major browsers have enabled TLS 1.3 by default.

The reductive answer to why TLS 1.3 hasn’t been deployed yet is middleboxes: network appliances designed to monitor and sometimes intercept HTTPS traffic inside corporate environments and mobile networks. Some of these middleboxes implemented TLS 1.2 incorrectly and now that’s blocking browsers from releasing TLS 1.3. However, simply blaming network appliance vendors would be disingenuous. The deeper truth of the story is that TLS 1.3, as it was originally designed, was incompatible with the way the Internet has evolved over time. How and why this happened is the multifaceted question I will be exploring in this blog post.

To help support this discussion with data, we built a tool to help check if your network is compatible with TLS 1.3:
https://tls13.mitm.watch/

How version negotiation used to work in TLS

The Transport Layer Security protocol, TLS, is the workhorse that enables secure web browsing with HTTPS. The TLS protocol was adapted from an earlier protocol, Secure Sockets Layer (SSL), in the late 1990s. TLS currently has three versions: 1.0, 1.1 and 1.2. The protocol is very flexible and can evolve over time in different ways. Minor changes can be incorporated as “extensions” (such as OCSP and Certificate Transparency) while larger and more fundamental changes often require a new version. TLS 1.3 is by far the largest change to the protocol in its history, completely revamping the cryptography and introducing features like 0-RTT.

Not every client and server support the same version of TLS—that would make it impossible to upgrade the protocol—so most support multiple versions simultaneously. In order to agree on a common version for a connection, clients and servers negotiate. TLS version negotiation is very simple. The client tells the server the newest version of the protocol that it supports and the server replies back with the newest version of the protocol that they both support.

Versions in TLS are represented as two-byte values. Since TLS was adapted from SSLv3, the literal version numbers used in the protocol were just increments of the minor version:
  • SSLv3 is 3.0
  • TLS 1.0 is 3.1
  • TLS 1.1 is 3.2
  • TLS 1.2 is 3.3, and so on

When connecting to a server with TLS, the client sends its highest supported version at the beginning of the connection:

(3, 3) → server

A server that understands the same version can reply back with a message starting with the same version bytes.

(3, 3) → server
client ← (3, 3)

Or, if the server only knows an older version of the protocol, it can reply with an older version. For example, if the server only speaks TLS 1.0, it can reply with:

(3, 3) → server
client ← (3, 1)

If the client supports the version returned by the server then they continue using that version of TLS and establish a secure connection. If the client and server don’t share a common version, the connection fails.

This negotiation was designed to be future-compatible. If a client sends a higher version than the server supports, the server should still be able to reply with whatever version it does support. For example, if a client sends (3, 8) to a modern-day TLS 1.2-capable server, the server should just reply back with (3, 3) and the handshake will continue as TLS 1.2.
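As a sketch of the intended rule (versions written as (major, minor) tuples, as above; this is illustrative Python, not a real TLS stack):

# Versions the server implements, TLS 1.0 through TLS 1.2.
SERVER_SUPPORTED = [(3, 1), (3, 2), (3, 3)]

def negotiate(client_version):
    """Return the newest version both sides support, or None to fail."""
    acceptable = [v for v in SERVER_SUPPORTED if v <= client_version]
    return max(acceptable) if acceptable else None

assert negotiate((3, 3)) == (3, 3)   # TLS 1.2 client -> TLS 1.2
assert negotiate((3, 8)) == (3, 3)   # client from the "future" -> still TLS 1.2
assert negotiate((3, 0)) is None     # SSLv3-only client -> no common version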

Pretty simple, right? As it turns out, some servers didn’t implement this correctly and this led to a chain of events that exposed web users to a serious security vulnerability.

POODLE and the downgrade dance


The last major upgrade to TLS was the introduction of TLS 1.2. During the roll-out of TLS 1.2 in browsers, it was found that some TLS 1.0 servers did not implement version negotiation correctly. When a client connected with a TLS connection advertising support for TLS 1.2, the faulty servers would disconnect instead of negotiating a version of TLS that they understood (like TLS 1.0).

Browsers had three options to deal with this situation:

  1. enable TLS 1.2 and a percentage of websites would stop working
  2. delay the deployment of TLS 1.2 until these servers are fixed
  3. retry with an older version of TLS if the connection fails

One expectation that people have about their browsers is that when they are updated, websites keep working. The misbehaving servers were far too numerous to just break with an update, eliminating option 1). By this point, TLS 1.2 had been around for a year and servers were still broken; waiting longer wasn’t going to solve the situation, eliminating option 2). This left option 3) as the only viable choice.

To both enable TLS 1.2 and keep their users happy, most browsers implemented what’s called an “insecure downgrade”. When faced with a connection failure when connecting to a site, they would try again with TLS 1.1, then with TLS 1.0, then if that failed, SSLv3. This downgrade logic “fixed” these broken servers... at the cost of a slow connection establishment. 🤦

However, insecure downgrades are called insecure for a reason. Client downgrades are triggered by a specific type of network failure, one that can be easily spoofed. From the client’s perspective, there’s no way to tell whether the failure was caused by a faulty server or by an attacker who happens to be on the network path of the connection. This means that network attackers can inject fake network failures and trick a client into connecting to a server with SSLv3, even if both support a newer protocol. At this point, there were no severe publicly-known vulnerabilities in SSLv3, so this didn’t seem like a big problem. Then POODLE happened.

In October 2014, Bodo Möller published POODLE, a serious vulnerability in the SSLv3 protocol that allows an in-network attacker to reveal encrypted information with minimal effort. Because TLS 1.0 was so widely deployed on both clients and servers, very few connections on the web used SSLv3, which should have kept them safe. However, the insecure downgrade feature made it possible for attackers to downgrade any connection to SSLv3 if both parties supported it (which most of them did). The availability of this “downgrade dance” vector turned the risk posed by POODLE from a curiosity into a serious issue affecting most of the web.

Fixing POODLE and removing insecure downgrade

The fixes for POODLE were:

  1. disable SSLv3 on the client or the server
  2. enable a new TLS feature called SCSV.

SCSV was conveniently proposed by Bodo Möller and Adam Langley a few months before POODLE. Simply put, with SCSV the client adds an indicator, called a downgrade canary, into its client hello message if it is retrying due to a network error. If the server supports a newer version than the one advertised in the client hello and sees this canary, it can assume a downgrade attack is happening and close the connection. This is a nice feature, but it is optional and requires both clients and servers to update, leaving many web users exposed.
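Sketched as server-side logic (the 0x5600 cipher-suite value is the real TLS_FALLBACK_SCSV code point from RFC 7507; the rest is simplified pseudologic, not a real TLS implementation):

TLS_FALLBACK_SCSV = 0x5600  # RFC 7507 signaling cipher suite value

def check_fallback(client_hello_version, client_cipher_suites, server_max_version):
    # The client only includes the canary when it is retrying at a lower version.
    # If the server actually supports something newer, the "failure" that triggered
    # the retry was probably an attacker, so abort the handshake.
    if (TLS_FALLBACK_SCSV in client_cipher_suites
            and client_hello_version < server_max_version):
        raise ConnectionError("inappropriate_fallback alert: possible downgrade attack")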

Browsers immediately removed support for SSLv3, with very little impact other than breaking some SSLv3-only websites. Users with older browsers had to depend on web servers disabling SSLv3. Cloudflare did this immediately for its customers, and so did most sites, but even in late 2017 over 10% of sites measured by SSL Pulse still support SSLv3.

Turning off SSLv3 was a feasible solution to POODLE because SSLv3 was not critical to the Internet. This raises the question: what happens if there’s a serious vulnerability in TLS 1.0? TLS 1.0 is still very widely used and depended on; turning it off in the browser would lock out around 10% of users. Also, despite SCSV being a nice solution to insecure downgrades, many servers don’t support it. The only way to ensure the safety of users against a future issue in TLS 1.0 is to disable insecure fallback.

After several years of bad performance due to the client having to reconnect multiple times, the majority of the websites that did not implement version negotiation correctly fixed their servers. This made it possible for some browsers to remove this insecure fallback:
Firefox in 2015, and Chrome in 2016 (https://developers.google.com/web/updates/2016/03/chrome-50-deprecations#removeinsecuretlsversionfallback). Because of this, Chrome and Firefox users are now in a safer position in the event of a new TLS 1.0 vulnerability.

Introducing a new version (again)

Designing a protocol for future compatibility is hard; it’s easy to make mistakes if there isn’t a feedback mechanism. This is the case with TLS version negotiation: nothing breaks if you implement it wrong - only hypothetical future versions do. There is no signal to developers that an implementation is flawed, and so mistakes can happen without being noticed. That is, until a new version of the protocol is deployed and your implementation fails - but by then the code is shipped, and it could take years for everyone to upgrade. The fact that some server implementations failed to handle a “future” version of TLS correctly should have been expected.

TLS version negotiation, though simple to explain, is actually an example of a protocol design antipattern. It demonstrates a phenomenon in protocol design called ossification. If a protocol is designed with a flexible structure, but that flexibility is never used in practice, some implementation is going to assume it is constant. Adam Langley compares the phenomenon to rust. If you rarely open a door, its hinge is more likely to rust shut. Protocol negotiation in TLS is such a hinge.

Around the time of TLS 1.2’s deployment, discussions were beginning around the design of an ambitious new version of TLS. It was going to be called TLS 1.3, and the version number was naturally chosen as 3.4 (or (3, 4)). By mid-2016, the TLS 1.3 draft had been through 15 iterations, and the version number and negotiation were basically set. At this point, browsers had removed the insecure fallback. It was assumed that servers that had previously mishandled TLS version negotiation had learned the lessons of POODLE and finally implemented it correctly. This turned out not to be the case. Again.

When presented with a client hello with version 3.4, a large percentage of TLS 1.2-capable servers would disconnect instead of replying with 3.3. Internet scans by Hanno Böck, David Benjamin, SSL Labs, and others confirmed that the failure rate for TLS 1.3 was very high, over 3% in many measurements. This was the exact same situation faced during the upgrade to TLS 1.2. History was repeating itself.

This unexpected setback caused a crisis of sorts for the people involved in the protocol’s design. Browsers did not want to re-enable the insecure downgrade and fight the uphill battle of oiling the protocol negotiation joint again for the next half-decade. But without a downgrade, using TLS 1.3 as written would break 3% of the Internet for their users. What could be done?

The controversial choice was to accept a proposal from David Benjamin to make the first TLS 1.3 message (the client hello) look like TLS 1.2. The version number from the client was changed back to (3, 3) and a new “supported_versions” extension was introduced with the real list of supported versions inside. The server would return a server hello starting with (3, 4) if TLS 1.3 was supported and (3, 3) or earlier otherwise. Draft 16 of TLS 1.3 contained this new and “improved” protocol negotiation logic.
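To make the shape of that change concrete, here is a minimal Go sketch of the negotiation as restructured in Draft 16. The structs are a simplified model, not the actual TLS wire encoding, and the helper names are our own assumptions for illustration.

package main

import "fmt"

const (
	versionTLS12 = 0x0303
	versionTLS13 = 0x0304
)

// A heavily simplified model of the fields relevant to version negotiation.
type clientHello struct {
	legacyVersion     uint16   // pinned to 0x0303 (TLS 1.2) on the wire
	supportedVersions []uint16 // the real list, carried in an extension
}

// pickVersion models the server side: use supported_versions when present,
// otherwise fall back to the legacy field, like a pre-1.3 server would.
func pickVersion(h clientHello, serverSupports []uint16) uint16 {
	candidates := h.supportedVersions
	if len(candidates) == 0 {
		candidates = []uint16{h.legacyVersion}
	}
	best := uint16(0)
	for _, c := range candidates {
		for _, s := range serverSupports {
			if c == s && c > best {
				best = c
			}
		}
	}
	return best
}

func main() {
	hello := clientHello{
		legacyVersion:     versionTLS12,
		supportedVersions: []uint16{versionTLS13, versionTLS12},
	}
	// A TLS 1.3 server negotiates (3, 4); either way, a server limited to
	// TLS 1.2 ends up negotiating (3, 3) and sees a normal-looking hello.
	fmt.Printf("0x%04x\n", pickVersion(hello, []uint16{versionTLS13, versionTLS12}))
	fmt.Printf("0x%04x\n", pickVersion(hello, []uint16{versionTLS12}))
}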

And this worked. Servers for the most part were tolerant to this change and easily fell back to TLS 1.2, ignoring the new extension. But this was not the end of the story. Ossification is a fickle phenomenon. It not only affects clients and servers, but it also affects everything on the network that interacts with a protocol.

The real world is full of middleboxes


Readers of our blog may remember a post from earlier this year on HTTPS interception. In it, we discussed a study that measured how many “secure” HTTPS connections were actually intercepted and decrypted by inspection software or hardware somewhere in between the browser and the web server. There are also passive inspection middleboxes that parse TLS and either block or divert the connections based on visible data, such as ISPs that use hostname information (SNI) from TLS connections to block “banned” websites.

In order to inspect traffic, these network appliances need to implement some or all of TLS. Every new device that supports TLS introduces a TLS implementation that makes assumption about how the protocol should act. The more implementations there are, the more likely it is that ossification occurs.

In February of 2017, both Chrome and Firefox started enabling TLS 1.3 for a percentage of their customers. The results were unexpectedly horrible. A much higher percentage of TLS 1.3 connections were failing than expected. For some users, no matter what the website, TLS 1.2 worked but TLS 1.3 did not.

Success rates for TLS 1.3 Draft 18:

  • Firefox & Cloudflare: 97.8% for TLS 1.2, 96.1% for TLS 1.3
  • Chrome & Gmail: 98.3% for TLS 1.2, 92.3% for TLS 1.3

After some investigation, it was found that some widely deployed middleboxes, both of the intercepting and passive variety, were causing connections to fail.


Because TLS has generally looked the same throughout its history, some network appliances made assumptions about how the protocol would evolve over time. When faced with a new version that violated these assumptions, these appliances fail in various ways.

Some of the features of TLS that were changed in TLS 1.3 were merely cosmetic. Fields like ChangeCipherSpec, session_id, and compression that had been part of the protocol since SSLv3 were removed. It turned out that some of these middleboxes treated those fields as essential features of TLS, and removing them caused connection failures to skyrocket.

If a protocol is in use for a long enough time with a similar enough format, people building tools around that protocol will make assumptions around that format being constant. This is often not an intentional choice by developers, but an unintended consequence of how a protocol is used in practice. Developers of network devices may not understand every protocol used on the internet, so they often test against what they see on the network. If a part of a protocol that is supposed to be flexible never changes in practice, someone will assume it is a constant. This is more likely the more implementations are created.

It would be disingenuous to put all of the blame for this on the specific implementers of these middleboxes. Yes, they created faulty implementations of TLS, but another way to think about it is that the original design of TLS lent itself to this type of failure. Implementers implement to the reality of the protocol, not the intention of the protocol’s designer or the text of the specification. In complex ecosystems with multiple implementers, unused joints rust shut.

Removing features that have been part of a protocol for 20 years and expecting it to simply “work” was wishful thinking.

Making TLS 1.3 work

At the IETF meeting in Singapore last month, a new change was proposed to TLS 1.3 to help resolve this issue. These changes were based on an idea from Kyle Nekritz of Facebook: make TLS 1.3 look like TLS 1.2 to middleboxes. This change re-introduces many of the parts of the protocol that were removed (session_id, ChangeCipherSpec, an empty compression field), and introduces some other changes that make TLS 1.3 look like TLS 1.2 in all the ways that matter to the broken middleboxes.

Several iterations of these changes were developed by BoringSSL and tested in Chrome over a series of months. Facebook also performed some similar experiments and the two teams converged on the same set of changes.

Chrome experiment success rates:

  • TLS 1.2: 98.6%
  • Experimental changes (PR 1091 on GitHub): 98.8%

Firefox experiment success rates:

  • TLS 1.2: 98.42%
  • Experimental changes (PR 1091 on GitHub): 98.37%

These experiments showed that it was possible to modify TLS 1.3 to be compatible with middleboxes. They also demonstrated the ossification phenomenon. As we described in a previous section, the only thing that could have rusted shut in the client hello, the version negotiation, did rust shut. That resulted in Draft 16, which moved version negotiation to an extension. As an intermediary between the client and the server, middleboxes also care about the server hello message. This message had many more hinges that were thought to be flexible but turned out not to be. Almost all of them had rusted shut. The new “middlebox-friendly” changes took this reality into account. These experimental changes were incorporated into the specification in TLS 1.3 Draft 22.

Making sure this doesn’t happen again

The original protocol negotiation mechanism is unrecoverably burnt. That means it likely can’t be used in a future version of TLS without significant breakage. However, many of the other protocol negotiation features are still flexible, such as ciphersuite selection and extensions. It would be great to keep it this way.

Last year, Adam Langley wrote a great blog post about cryptographic protocol design (https://www.imperialviolet.org/2016/05/16/agility.html) that covers similar ground to this post. In his post, he proposes the adage “have one joint and keep it well oiled.” This is great advice for protocol designers. Ossification is real, as we have seen with TLS 1.3.

Along these lines, David Benjamin proposed a way to keep the most important joints in TLS oiled. His GREASE proposal for TLS is designed to throw in random values where a protocol should be tolerant of new values. If popular implementations intersperse unknown ciphers, extensions and versions in real-world deployments, then implementers will be forced to handle them correctly. GREASE is like WD-40 for the Internet.
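As a rough illustration of the idea, here is a small Go sketch that generates GREASE-style reserved values (the 0x?a?a pattern from the proposal, later standardised in RFC 8701) and sprinkles one into an offered cipher suite list. It is a toy model of the mechanism, not an implementation of the proposal.

package main

import (
	"fmt"
	"math/rand"
)

// greaseValue returns one of the reserved GREASE code points, which all have
// the form 0xXAXA (0x0a0a, 0x1a1a, ... 0xfafa). Implementations that don't
// recognise these values are required to simply ignore them.
func greaseValue(r *rand.Rand) uint16 {
	hi := uint16(r.Intn(16)) // random high nibble, repeated in both bytes
	return hi<<12 | 0x0a<<8 | hi<<4 | 0x0a
}

func main() {
	r := rand.New(rand.NewSource(42))
	// Sprinkle a GREASE value into an otherwise ordinary cipher suite list,
	// keeping the "unknown value" joint exercised on every connection.
	suites := []uint16{greaseValue(r), 0xc02f, 0xc030}
	for _, s := range suites {
		fmt.Printf("0x%04x ", s)
	}
	fmt.Println()
}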

One thing to note is that GREASE is intended to prevent servers from ossifying, not middleboxes, so there is still potential for more types of greasing to happen in TLS.

Even with GREASE, some servers were only found to be intolerant to TLS 1.3 as late as December 2017. GREASE only identifies servers that are intolerant to unknown extensions, but some servers may still be intolerant to specific extension ids. For example, RSA's BSAFE library used the extension id 40 for an experimental extension called "extended_random", associated with a theorized NSA backdoor. Extension id 40 happens to be the extension id used for TLS 1.3 key shares. David Benjamin reported that this library is still in use by some printers, which causes them to be TLS 1.3 intolerant. Matthew Green has a detailed write-up of this compatibility issue.

Help us understand the issue

Cloudflare has been working with the Mozilla Firefox team to help measure this phenomenon, and Google and Facebook have been doing their own measurements. These experiments are hard to perform because the developers need to get protocol variants into browsers, which can take the entire release cycle of the browser (often months) to get into the hands of the users seeing issues. Cloudflare now supports the latest (hopefully) middlebox-compatible TLS 1.3 draft version (draft 22), but there’s always a chance we find a middlebox that is incompatible with this version.

To sidestep the browser release cycle, we took a shortcut. We built a website that you can use to see if TLS 1.3 works from your browser’s vantage point. This test was built by our Crypto intern, Peter Wu. It uses Adobe Flash, because that’s the only widespread cross-platform way to get access to raw sockets from a browser.

How it works:

  • We cross-compiled our Golang-based TLS 1.3 client library (tls-tris) to JavaScript
  • We built a JavaScript library (called jssock) that hooks tls-tris up to the low-level socket interface exposed through Adobe Flash
  • We connect to a remote server using TLS 1.2 and TLS 1.3 and compare the results

  • If there is a mismatch, we gather information from the connection on both sides and send it back for analysis.
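If you just want to run a similar comparison from a machine where you do have raw sockets, a minimal sketch with Go's crypto/tls looks something like the following. It assumes a Go release recent enough to expose tls.VersionTLS13, and example.com is a placeholder for the site under test; this is an analogue of the test, not the Flash-based jssock code itself.

package main

import (
	"crypto/tls"
	"fmt"
)

// tryVersion attempts a handshake pinned to exactly one TLS version and
// reports whether it succeeded.
func tryVersion(host string, version uint16) error {
	conn, err := tls.Dial("tcp", host+":443", &tls.Config{
		MinVersion: version,
		MaxVersion: version,
	})
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	host := "example.com" // substitute the site you want to test
	for _, v := range []struct {
		name    string
		version uint16
	}{
		{"TLS 1.2", tls.VersionTLS12},
		{"TLS 1.3", tls.VersionTLS13},
	} {
		if err := tryVersion(host, v.version); err != nil {
			fmt.Printf("%s failed: %v\n", v.name, err)
		} else {
			fmt.Printf("%s ok\n", v.name)
		}
	}
	// A mismatch (TLS 1.2 ok, TLS 1.3 failing for every site you try) points
	// at something on the path, such as a middlebox, rather than the origin.
}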

If you see a failure, let us know! If you’re in a corporate environment, share the middlebox information, if you’re on a residential network, tell us who your ISP is.


We’re excited for TLS 1.3 to finally be enabled by default in browsers. This experiment will hopefully help prove that the latest changes make it safe for users to upgrade.

How "expensive" is crypto anyway?


I wouldn’t be surprised if the title of this post attracts some Bitcoin aficionados, but if you are such, I want to disappoint you. For me crypto means cryptography, not cybermoney, and the price we pay for it is measured in CPU cycles, not USD.

If you got to this second paragraph you have probably heard that TLS today is very cheap to deploy. Considerable effort has been put into optimizing the cryptography stacks of OpenSSL and BoringSSL, as well as the hardware that runs them. However, aside from the occasional benchmark that can tell us how many GB/s a given algorithm can encrypt, or how many signatures a certain elliptic curve can generate, I did not find much information about the cost of crypto in real-world TLS deployments.

CC BY-SA 2.0 image by Michele M. F.

As Cloudflare is the largest provider of TLS on the planet, one would think we perform a lot of cryptography related tasks, and one would be absolutely correct. More than half of our external traffic is now TLS, as well as all of our internal traffic. Being in that position means that crypto performance is critical to our success, and as it happens, every now and then we like to profile our production servers, to identify and fix hot spots.

In this post I want to share the latest profiling results that relate to crypto.

The profiled server is located in our Frankfurt data center, and sports 2 Xeon Silver 4116 processors. Every geography has a slightly different usage pattern of TLS. In Frankfurt, 73% of the requests are TLS, with a mix of negotiated cipher suites.

Processing all of those different cipher suites, BoringSSL consumes just 1.8% of the CPU time. That’s right, a mere 1.8%. And that is not even pure cryptography; there is a considerable overhead involved too.

Let’s take a deeper dive, shall we?

Ciphers

If we break down the negotiated cipher suites by the AEAD used, we get the following picture:

BoringSSL speed tells us that AES-128-GCM, ChaCha20-Poly1305 and AES-128-CBC-SHA1 can achieve encryption speeds of 3,733.3 MB/s, 1,486.9 MB/s and 387.0 MB/s, but this speed varies greatly as a function of the record size. Indeed we see that GCM uses proportionately less CPU time.
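As a rough way to see that record-size effect for yourself, here is a small Go benchmark of AES-128-GCM sealing using the standard library rather than BoringSSL; the iteration count and record sizes are arbitrary choices for the sketch, and the absolute numbers will not match the bssl speed figures quoted above.

package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
	"time"
)

// sealThroughput measures AES-128-GCM sealing speed for a given record size.
func sealThroughput(recordSize int) float64 {
	key := make([]byte, 16)
	rand.Read(key)
	block, _ := aes.NewCipher(key)
	aead, _ := cipher.NewGCM(block)

	nonce := make([]byte, aead.NonceSize()) // a fixed nonce is fine for a benchmark, never for real traffic
	plaintext := make([]byte, recordSize)

	const iterations = 20000
	start := time.Now()
	for i := 0; i < iterations; i++ {
		aead.Seal(nil, nonce, plaintext, nil)
	}
	elapsed := time.Since(start).Seconds()
	return float64(recordSize) * iterations / elapsed / 1e6 // MB/s
}

func main() {
	// Small records pay proportionally more per-record overhead than large ones.
	for _, size := range []int{64, 1400, 16384} {
		fmt.Printf("%5d-byte records: %.0f MB/s\n", size, sealThroughput(size))
	}
}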

Still, the CPU time consumed by encryption and decryption depends on the typical record size as well as the amount of data processed, both metrics we don’t currently log. We do know that ChaCha20-Poly1305 is usually used by older phones, where connections are kept short-lived to save power, while AES-CBC is used for... well, your guess is as good as mine as to who still uses AES-CBC and for what, but it is a good thing its usage keeps declining.

Finally keep in mind that 6.8% of BoringSSL usage in the graph translates into 6.8% x 1.8% = 0.12% of total CPU time.

Public Key

Public key algorithms in TLS serve two functions.

The first function is key exchange. The prevalent algorithm here is ECDHE using the NIST P256 curve; the runner-up is ECDHE using DJB’s x25519 curve. Finally, there is a small fraction that still uses RSA for key exchange, the only key exchange algorithm currently in use that does not provide forward secrecy guarantees.

The second function is that of a signature, used to sign the handshake parameters and thus authenticate the server to the client. As a signature algorithm, RSA is very much alive, present in almost one quarter of the connections, with the other three quarters using ECDSA.

BoringSSL speed reports that a single core on our server can perform 1,120 RSA2048 signatures/s, 120 RSA4096 signatures/s, 18,477 P256 ECDSA signatures/s, 9,394 P256 ECDHE operations/s and 9,278 x25519 ECDHE operations/s.

Looking at the CPU consumption, it is clear that RSA is very expensive. Roughly half the time, BoringSSL is performing an operation related to RSA. P256 consumes twice as much CPU time as x25519, but considering that it handles twice as many key exchanges, while also being used for signatures, that is commendable.
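To get a feel for the RSA/ECDSA gap on your own hardware, a minimal Go sketch like the following will do; the iteration counts are arbitrary, it uses the standard library rather than BoringSSL, and the exact rates will not match the bssl speed figures above.

package main

import (
	"crypto"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"fmt"
	"time"
)

// rate times n calls of sign and prints a signatures-per-second figure.
func rate(name string, n int, sign func() error) {
	start := time.Now()
	for i := 0; i < n; i++ {
		if err := sign(); err != nil {
			panic(err)
		}
	}
	fmt.Printf("%s: %.0f signatures/s\n", name, float64(n)/time.Since(start).Seconds())
}

func main() {
	digest := sha256.Sum256([]byte("hello, TLS"))

	rsaKey, _ := rsa.GenerateKey(rand.Reader, 2048)
	ecKey, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)

	// RSA-2048 signing is roughly an order of magnitude slower than P-256 ECDSA.
	rate("RSA-2048", 200, func() error {
		_, err := rsa.SignPKCS1v15(rand.Reader, rsaKey, crypto.SHA256, digest[:])
		return err
	})
	rate("ECDSA P-256", 2000, func() error {
		_, _, err := ecdsa.Sign(rand.Reader, ecKey, digest[:])
		return err
	})
}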

If you want to make the internet a better place, please get an ECDSA signed certificate next time!

Hash functions

Only two hash functions are currently used in TLS: SHA1 and SHA2 (including SHA384). SHA3 will probably debut with TLS 1.3. Hash functions serve several purposes in TLS. First, they are used as part of the signature for both the certificate and the handshake; second, they are used for key derivation; finally, when using AES-CBC, SHA1 and SHA2 are used in HMAC to authenticate the records.

Here we see SHA1 consuming more resources than expected, but that is really because it is used in HMAC, whereas most cipher suites that negotiate SHA256 use AEADs. In terms of benchmarks, BoringSSL speed reports 667.7 MB/s for SHA1, 309.0 MB/s for SHA256 and 436.0 MB/s for SHA512 (SHA384, the truncated form of SHA512 used in TLS, is not visible in the graphs because its usage approaches 0%).

Conclusions

Using TLS is very cheap, even at the scale of Cloudflare. Modern crypto is very fast, with AES-GCM and P256 being great examples. RSA, once a staple of cryptography that truly made SSL accessible to everyone, is now a dying dinosaur, being replaced by faster and safer algorithms. It still consumes a disproportionate amount of resources, but even that is easily manageable.

The future, however, is less clear. As we approach the era of quantum computers, it is clear that TLS must adapt sooner rather than later. We already support SIDH as a key exchange algorithm for some services, and there is a NIST competition in place that will determine the most likely post-quantum candidates for TLS adoption, but none of the candidates can outperform P256. I just hope that when I profile our edge two years from now, my conclusion won’t change to “Whoa, crypto is expensive!”.

Keeping your GDPR Resolutions


For many of us, a New Year brings a renewed commitment to eat better, exercise regularly, and read more (especially the Cloudflare blog). But as we enter 2018, there is a unique and significant new commitment approaching -- protecting personal data and complying with the European Union’s (EU) General Data Protection Regulation (GDPR).

As many of you know by now, the GDPR is a sweeping new EU law that comes into effect on May 25, 2018. The GDPR harmonizes data privacy laws across the EU and mandates how companies collect, store, delete, modify and otherwise process personal data of EU citizens.

Since our founding, Cloudflare has believed that the protection of our customers’ and their end users’ data is essential to our mission to help build a better internet.

Image by GregMontani via Wikimedia Commons

Need a Data Processing Agreement?

As we explained in a previous blog post last August, Cloudflare has been working hard to achieve GDPR compliance in advance of the effective date, and is committed to helping our customers and their partners prepare for GDPR compliance on their side. We understand that compliance with a new set of privacy laws can be challenging, and we are here to help with your GDPR compliance requirements.

First, we are committed to making sure Cloudflare’s services are GDPR compliant and will continue to monitor new guidance on best practices even after the May 25th effective date. We have taken these new requirements to heart and made changes to our products, contracts and policies.

And second, we have made it easy for you to comply with your own obligations. If you are a Cloudflare customer and have determined that you qualify as a data controller under the GDPR, you may need a data processing addendum (DPA) in place with Cloudflare as a qualifying vendor. We’ve made that part of the process easy for you.

This is all you need to do:

  • Go here to find our GDPR-compliant DPA, which has been pre-signed on behalf of Cloudflare.
  • To complete the DPA, you should fill in the “Customer” information and sign on pages 6, 13, 15, and 19.
  • Send an electronic copy of the fully executed DPA to Cloudflare at eu.dpa@cloudflare.com.

That’s it. Now you’re one step closer to GDPR compliance.

We can’t help you with the diet, exercise, and reading stuff. But if you need more information about GDPR and more resources, you can go to Cloudflare’s GDPR landing page.

An Explanation of the Meltdown/Spectre Bugs for a Non-Technical Audience


Last week the news of two significant computer bugs was announced. They've been dubbed Meltdown and Spectre. These bugs take advantage of very technical systems that modern CPUs have implemented to make computers extremely fast. Even highly technical people can find it difficult to wrap their heads around how these bugs work. But, using some analogies, it's possible to understand exactly what's going on with these bugs. If you've found yourself puzzled by exactly what's going on with these bugs, read on — this blog is for you.

“When you come to a fork in the road, take it.” — Yogi Berra

Late one afternoon, walking through a forest near your home and navigating with the GPS, you come to a fork in the path which you’ve taken many times before. Unfortunately, for some mysterious reason your GPS is not working, and being a methodical person you like to follow it very carefully.

Cooling your heels waiting for the GPS to start working again is annoying because you are losing time when you could be getting home. Instead of waiting, you decide to make an intelligent guess about which path is most likely based on past experience and set off down the right-hand path.

After walking for a short distance the GPS comes to life and tells you which is the correct path. If you predicted correctly then you’ve saved a significant amount of time. If not, then you hop over to the other path and carry on that way.

Something just like this happens inside the CPU in pretty much every computer. Fundamental to the very essence and operation of a computer is the ability to branch, to choose between two different code paths. As you read this, your web browser is making branch decisions continuously (for example, some part of it is waiting for you to click a link to go to some other page).

One way that CPUs have reached incredible speeds is the ability to predict which of two branches is most likely and start executing it before it knows whether that’s the correct path to take.

For example, the code that checks for you clicking this link might be a little slow because it’s waiting for mouse movements and button clicks. Rather than wait the CPU will start automatically executing the branch it thinks is most likely (probably that you don’t click the link). Once the check actually indicates “clicked” or “not clicked” the CPU will either continue down the branch it took, or abandon the code it has executed and restart at the ‘fork in the path’.

This is known as “branch prediction” and saves a great deal of idling processor time. It relies on the ability of the CPU to run code “speculatively” and throw away results if that code should not have been run in the first place.

Every time you’ve taken the right hand path in the past it’s been correct, but today it isn’t. Today it’s winter and the foliage is sparser and you’ll see something you shouldn’t down that path: a secret government base hiding alien technology.

But wanting to get home fast you take the path anyway not realizing that the GPS is going to indicate that left hand path today and keep you out of danger. Before the GPS comes back to life you catch a glimpse of an alien through the trees.

Moments later two Men In Black appear, erase your memory and dump you back at the fork in the path. Shortly after, the GPS beeps and you set off down the left hand path none the wiser.

Something similar to this happens in the Spectre/Meltdown attack. The CPU starts executing a branch of code that it has previously learnt is typically the right code to run. But it’s been tricked by a clever attacker and this time it’s the wrong branch. Worse, the code will access memory that it shouldn’t (perhaps from another program) giving it access to otherwise secret information (such as passwords).

When the CPU realizes it’s gone the wrong way it forgets all the erroneous work it’s done (and the fact that it accessed memory it shouldn’t have) and executes the correct branch instead. Even though illegal memory was accessed what it contained has been forgotten by the CPU.

The core of Meltdown and Spectre is the ability to exfiltrate information, from this speculatively executed code, concerning the illegally accessed memory through what’s known as a “side channel”.

You’d actually heard rumours of Men In Black and want to find some way of letting yourself know whether you saw aliens or not. Since there’s a short space between you seeing aliens and your memory being erased you come up with a plan.

If you see aliens then you gulp down an energy drink that you have in your backpack. Once deposited back at the fork by the Men In Black you can discover whether you drank the energy drink (and therefore whether you saw aliens) by walking 500 metres and timing yourself. You’ll go faster with the extra carbs in a can of Reactor Core.

Computers have also reached high speeds by keeping a copy of frequently or recently accessed information inside the CPU itself. The closer data is to the CPU the faster it can be used.

This store of recently/frequently used data inside the CPU is called a “cache”. Both branch prediction and the cache mean that CPUs are blazingly fast. Sadly, they can also be combined to create the security problems that have recently been reported with Intel and other CPUs.

In the Meltdown/Spectre attacks, the attacker determines what secret information (the real world equivalent of the aliens) was accessed using timing information (but not an energy drink!). In the split second after accessing illegal memory, and before the code being run is forgotten by the CPU, the attacker’s code loads a single byte into the CPU cache. A single byte which it has perfectly legal access to; something from its own program memory!

The attacker can then determine what happened in the branch just by trying to read the same byte: if it takes a long time to read then it wasn’t in cache, if it doesn’t take long then it was. The difference in timing is all the attacker needs to know what occurred in the branch the CPU should never have executed.

To turn this into an exploit that actually reads illegal memory is easy. Just repeat this process over and over again once per single bit of illegal memory that you are reading. Each single bit’s 1 or 0 can be translated into the presence or absence of an item in the CPU cache which is ‘read’ using the timing trick above.

Although that might seem like a laborious process, it is, in fact, something that can be done very quickly enabling the dumping of the entire memory of a computer. In the real world it would be impractical to hike down the path and get zapped by the Men In Black in order to leak details of the aliens (their color, size, language, etc.), but in a computer it’s feasible to redo the branch over and over again because of the inherent speed (100s of millions of branches per second!).

And if an attacker can dump the memory of a computer it has access to the crown jewels: what’s in memory at any moment is likely to be very, very sensitive: passwords, cryptographic secrets, the email you are writing, a private chat and more.

Conclusion

I hope that this helps you understand the essence of Meltdown and Spectre. There are many variations of both attacks that all rely on the same ideas: get the CPU to speculatively run some code (through branch prediction or another technique) that illegally accesses memory and extract information using a timing side channel through the CPU cache.

If you care to read all the gory detail there’s a paper on Meltdown and a separate one on Spectre.

Acknowledgements: I’m grateful to all the people who read this and gave me feedback (including gently telling me the ways in which I didn’t understand branch prediction and speculative execution). Thank you especially to David Wragg, Kenton Varda, Chris Branch, Vlad Krasnov, Matthew Prince, Michelle Zatlyn and Ben Cartwright-Cox. And huge thanks to Kari Linder for the illustrations.


The Curious Case of Caching CSRF Tokens


It is now commonly accepted as fact that web performance is critical for business. Slower sites can affect conversion rates on e-commerce stores, reduce the sign-up rate of your SaaS service and lower the readership of your content.

In the run-up to Thanksgiving and Black Friday, e-commerce sites turned to services like Cloudflare to help optimise their performance and withstand the traffic spikes of the shopping season.


In preparation, an e-commerce customer joined Cloudflare on the 9th November, a few weeks before the shopping season. Instead of joining via our Enterprise plan, they were a self-serve customer who signed-up by subscribing to our Business plan online and switching their nameservers over to us.

Their site was running Magento, a notably slow e-commerce platform - filled with lots of interesting PHP, with a considerable amount of soft code in XML. Running version 1.9, the platform was somewhat outdated (Magento was totally rewritten in version 2.0 and subsequent releases).

Despite the somewhat dated technology, the e-commerce site was "good enough" for this customer and had done its job for many years.

They were the first to notice an interesting technical issue surrounding how performance and security can often feel at odds with each other. Although they were the first to highlight this issue, in the run-up to Black Friday we ultimately saw around a dozen customers on Magento 1.8/1.9 have similar issues.

Initial Optimisations

After signing-up for Cloudflare, the site owners attempted to make some changes to ensure their site was loading quickly.

The website developers had already ensured the site was loading over HTTPS; in doing so, they were able to ensure their site was loading over the new HTTP/2 protocol, and they made some changes to ensure their site was optimised for HTTP/2 (for details, see our blog post on HTTP/2 For Web Developers).

At Cloudflare we've also taken a number of steps to ensure that there isn't a latency overhead for establishing a secure TLS connection.

Additionally, they had enabled HTTP/2 Server Push to ensure critical CSS/JS assets could be pushed to clients when they made their first request. Without Server Push, a client has to download the HTML response, interpret it and then work out which assets it needs to download.
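As a rough illustration of what Server Push looks like at an origin, here is a minimal Go sketch using the standard library's http.Pusher interface; the asset paths and certificate files are placeholders, not the customer's actual setup.

package main

import (
	"log"
	"net/http"
)

func handler(w http.ResponseWriter, r *http.Request) {
	// If the connection is HTTP/2, push critical assets before the HTML so
	// the browser doesn't have to parse the page to discover them.
	if pusher, ok := w.(http.Pusher); ok {
		if err := pusher.Push("/static/critical.css", nil); err != nil {
			log.Printf("push failed: %v", err)
		}
		pusher.Push("/static/app.js", nil)
	}
	w.Write([]byte(`<html><head><link rel="stylesheet" href="/static/critical.css"></head><body>...</body></html>`))
}

func main() {
	http.HandleFunc("/", handler)
	// HTTP/2 (and therefore push) is only enabled on TLS listeners.
	log.Fatal(http.ListenAndServeTLS(":443", "cert.pem", "key.pem", nil))
}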

Big images were lazy loaded, so they were only downloaded when they needed to be seen by the user. Additionally, the developers had enabled a Cloudflare feature called Polish. With this enabled, Cloudflare dynamically works out whether it's faster to serve an image in WebP (a new image format developed by Google) or whether it's faster to serve it in a different format.

These optimisations did make some improvement to performance, but their site was still slow.

Respect The TTFB

In web performance, there are a few different things which can affect the response times - I've crudely summarised them into the following three categories:

  • Connection & Request Time - Before a request can be sent off for a website to load something, a few things need to happen: DNS queries, a TCP handshake to establish the connection with the web server and a TLS handshake to establish a secure connection
  • Page Render - A dynamic site needs to query databases, call APIs, write logs, render views, etc before a response can be made to a client
  • Response Speed - Downloading the response from web server, browser-side rendering of the HTML and pulling the other resources linked in the HTML

The e-commerce site had taken steps to improve their Response Speed by enabling HTTP/2 and performing other on-site optimisations. They had also optimised their Connection & Request Time by using a CDN service like Cloudflare to provide fast DNS and to reduce the latency of establishing TLS/TCP connections.

However, they now realised the critical step they needed to optimise was around the Page Render that would happen on their web server.

By looking at a Waterfall View of how their site loaded (similar to the one below) they could see the main constraint.

Example Waterfall view from WebSiteOptimization.com

On the initial request, you can see the green "Time to First Byte" view taking a very long time.

Many browsers have tools for viewing Waterfall Charts like the one above, Google provide some excellent documentation for Chrome on doing this: Get Started with Analyzing Network Performance in Chrome DevTools. You can also generate these graphs fairly easily from site speed test tools like WebPageTest.org.

Time to First Byte itself is an often misunderstood metric and often can't be attributed to a single fault. For example, using a CDN service like Cloudflare may increase TTFB by a few milliseconds, but do so to the benefit of the overall load time. This can be because the CDN is adding additional compression functionality to speed up the response, or simply because it has to establish a connection back to the origin web server (which isn't visible to the client).

There are instances where it is important to debug why TTFB is a problem. In this case, the e-commerce platform was taking upwards of 3 seconds just to generate the HTML response, so it was clear the constraint was the server-side Page Render.

When the web server was generating dynamic content, it was having to query databases and perform logic before a request could be served. In most instances (i.e. a product page) the page would be identical to every other request. It would only be when someone would add something to their shopping cart that the site would really become dynamic.

Enabling Cookie-Based Caching

Before someone logs into the Magento admin panel or adds something to their shopping cart, the page view is anonymous and will be served up identically to every visitor. It is only when an anonymous visitor logs in or adds something to their shopping cart that they will see a page that's dynamic and unlike every other page that's been rendered.

It therefore is possible to cache those anonymous requests so that Magento on an origin server doesn't need to constantly regenerate the HTML.

Cloudflare users on our Business Plan are able to cache anonymous page views when using Magento using our Bypass Cache on Cookie functionality. This allows static HTML to be cached at our edge, with no need for it to be regenerated from request to request.

This provides a huge performance boost for the first few page visits of a visitor, and allows them still to interact with the dynamic site when they need to. Additionally, it helps keep load down on the origin server in the event of traffic spikes, sparing precious server CPU time for those who need it to complete dynamic actions like paying for an order.

Here's an example of how this can be configured in Cloudflare using the Page Rules functionality:

[Screenshot: Page Rule set to Cache Everything, with Bypass Cache on Cookie for external_no_cache, PHPSESSID and adminhtml, and an Edge Cache TTL of a month]

The Page Rule configuration above instructs Cloudflare to Cache Everything (including HTML), but bypass the cache when it sees a request which contains any of the cookies external_no_cache, PHPSESSID or adminhtml. The final Edge Cache TTL setting instructs Cloudflare to keep HTML files in cache for a month; this is necessary as Magento by default uses headers to discourage caching.

The site administrator configured their site to work something like this:

  1. On the first request, the user is anonymous and their request indistinguishable from any other - their page can be served from the Cloudflare cache
  2. When the customer adds something to their shopping cart, they do that via a POST request - as methods like POST, PUT and DELETE are intended to change a resource, they bypass the Cloudflare cache
  3. On the POST request to add something to their shopping cart, Magento will set a cookie called external_no_cache
  4. As the site owner has configured Cloudflare to bypass the cache when we see a request containing the external_no_cache cookie, all subsequent requests go direct to origin

This behaviour can be summarised in the following crude diagram:

[Diagram: anonymous GET requests are served from the Cloudflare cache; once a visitor adds to cart via POST and receives the external_no_cache cookie, subsequent requests bypass the cache and go to the origin]
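To make that decision concrete, here is a minimal Go sketch of the logic the Page Rule encodes. It is an assumption-laden illustration (only the cookie names mirror the rule; everything else is hypothetical), not Cloudflare's actual edge code.

package main

import (
	"fmt"
	"net/http"
	"regexp"
)

// bypassCookies mirrors the Page Rule: any of these cookies means the request
// is no longer anonymous and must go to the origin.
var bypassCookies = regexp.MustCompile(`^(external_no_cache|PHPSESSID|adminhtml)$`)

// shouldBypassCache decides whether a request can be answered from cache.
func shouldBypassCache(r *http.Request) bool {
	if r.Method != http.MethodGet && r.Method != http.MethodHead {
		return true // POST/PUT/DELETE are intended to change state, never cached
	}
	for _, c := range r.Cookies() {
		if bypassCookies.MatchString(c.Name) {
			return true
		}
	}
	return false
}

func main() {
	req, _ := http.NewRequest("GET", "https://shop.example.com/product/1", nil)
	fmt.Println(shouldBypassCache(req)) // false: anonymous, serve cached HTML

	req.AddCookie(&http.Cookie{Name: "external_no_cache", Value: "1"})
	fmt.Println(shouldBypassCache(req)) // true: visitor has a cart, go to origin
}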

The site administrators initially enabled this configuration on a subdomain for testing purposes, but noticed something rather strange. When they would add something to the cart on their test site, the cart would show up empty. If they then tried again to add something to the cart, the item would be added successfully.

The customer reported one additional, interesting piece of information - when they tried to mimic this cookie-based caching behaviour internally using Varnish, they faced the exact same issue.

In essence, the Add to Cart functionality would fail, but only on the first request. This was indeed odd behaviour, and the customer reached out to Cloudflare Support.

Debugging

The customer wrote in just as our Singapore office was finishing up its afternoon, and the ticket was initially triaged by a Support Engineer in that office.

The Support Agent evaluated what the problem was and initially identified that if the frontend cookie was missing, the Add to Cart functionality would fail.

No matter which page you access on Magento, it will attempt to set a frontend cookie, even if it doesn't add an external_no_cache cookie.

When Cloudflare caches static content, the default behaviour is to strip away any cookies coming from the server if the file is going to end up in cache - this is a security safeguard to prevent customers accidentally caching private session cookies. This applies when a cached response contains a Set-Cookie header, but does not apply when the cookie is set via JavaScript - in order to allow functionality like Google Analytics to work.

They had identified that the caching logic at our network edge was working fine, but for whatever reason Magento would refuse to add something to a shopping cart without a valid frontend cookie. Why was this?

As Singapore handed their shift work over to London, the Support Engineer working on this ticket decided to escalate the ticket up to me. This was largely because, towards the end of last year, I had owned the re-pricing of this feature (which opened it up to our self-service Business plan users, instead of being Enterprise-only). That said, I had not touched Magento in many years; even when I was working in digital agencies I wasn't the most enthusiastic to build on it.

The Support Agent provided some internal comments that described the issue in detail and their own debugging steps, with an effective "TL;DR" summary:

[Screenshot: the Support Engineer's internal TL;DR summary of the issue]

Debugging these kinds of customer issues is not as simple as putting breakpoints into a codebase. Often, for our Support Engineers, the customer's origin server acts as a black box, there can be many moving parts, and they of course have to manage the expectations of a real customer at the other end. This level of problem-solving fun is one of the reasons I still like answering customer support tickets when I get a chance.

Before attempting to debug anything, I double-checked that the Support Agent was correct that nothing had gone wrong on our end - I trusted their judgement, and no other customers were reporting that their caching functionality had broken, but it is always best to cross-check manual debugging work. I ran some checks to ensure that there were no regressions in our Lua codebase that controls caching logic:

  • Checked that there were no changes to this logic in our internal code repository
  • Checked that the automated tests were still in place and building successfully
  • Ran checks on production to verify that caching behaviour still worked as normal

As Cloudflare has customers across so many platforms, I also checked to ensure that there were no breaking changes in Magento codebase that would cause this bug to occur. Occasionally we find our customers accidentally come across bugs in CMS platforms which are unreported. This, fortunately, was not one of those instances.

The next step was to attempt to replicate the issue locally, away from the customer's site. I spun up a vanilla instance of Magento 1.9 and set it up with an identical Cloudflare configuration. The experiment was successful and I was able to replicate the customer's issue.

I had an instinctive feeling that it was the Cross Site Request Forgery Protection functionality that was at fault here, and I started tweaking my own test Magento installation to see if this was the case.

Cross Site Request Forgery attacks work by exploiting the fact that one site on the internet can get a client to make requests to another site.

For example; suppose you have an online bank account with the ability to send money to other accounts. Once logged in, there is a form to send money which uses the following HTML:

<!-- a state-changing action, so the form is submitted via POST -->
<form action="https://example.com/send-money" method="post">
Account Name:
<input type="text" name="account_name" />
Amount:
<input type="text" name="amount" />
<input type="submit" />
</form>

After logging in and doing your transactions, you don't log-out of the website - but you simply navigate elsewhere online. Whilst browsing around you come across a button on a website that contains the text "Click me! Why not?". You click the button, and £10,000 goes from your bank account to mine.

This happens because the button you clicked was connected to an endpoint on the banking website, and contained hidden fields instructing it to send me £10,000 of your cash:

<form action="https://example.com/send-money" method="post">
<input type="hidden" name="account_name" value="Junade Ali" />
<input type="hidden" name="amount" value="10,000" />
<input type="submit" value="Click me! Why not?" />
</form>

In order to prevent these attacks, CSRF Tokens are inserted as hidden fields into web forms:

<form action="https://example.com/send-money" method="post">
Account Name:
<input type="text" name="account_name" />
Amount:
<input type="text" name="amount" />
<input type="hidden" name="csrf_protection" value="hunter2" />
<input type="submit" />
</form>

A cookie containing a random session identifier is first set on the client's computer. When a form is served to the client, a CSRF token is generated using that cookie. The server then checks that the CSRF token submitted in the HTML form actually matches the session cookie, and if it doesn't, it blocks the request.
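As an illustration of that pairing, here is a minimal Go sketch of one common way to implement it: derive the token from the session identifier with an HMAC, so the server can verify the pair without storing anything. The secret, cookie value and function names are assumptions for the example, not Magento's implementation.

package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

var serverSecret = []byte("replace-with-a-real-secret")

// csrfToken derives a token from the visitor's session ID. The token goes into
// the hidden form field; the session ID stays in the cookie.
func csrfToken(sessionID string) string {
	mac := hmac.New(sha256.New, serverSecret)
	mac.Write([]byte(sessionID))
	return hex.EncodeToString(mac.Sum(nil))
}

// validCSRF checks that the submitted token matches the session cookie that
// accompanied the request. A missing session cookie can never validate.
func validCSRF(sessionID, submittedToken string) bool {
	if sessionID == "" {
		return false
	}
	expected := csrfToken(sessionID)
	return hmac.Equal([]byte(expected), []byte(submittedToken))
}

func main() {
	session := "abc123"
	token := csrfToken(session)
	fmt.Println(validCSRF(session, token)) // true: cookie and token match
	fmt.Println(validCSRF("", token))      // false: no session cookie was ever set
}

The second call in main shows exactly the failure mode described in the next paragraph: with the session cookie stripped before it ever reached the visitor, no submitted token can validate.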

In this instance, as there was no session cookie ever set (Cloudflare would strip it out before it entered cache), the POST request to the Add to Cart functionality could never verify the CSRF token and the request would fail.

Due to CSRF vulnerabilities, Magento applied CSRF protection to all forms; this broke Full Page Cache implementations in Magento 1.8.x/1.9.x. You can find all the details in the SUPEE-6285 patch documentation from Magento.

Caching Content with CSRF Protected Forms

To validate that CSRF tokens were definitely at fault here, I completely disabled CSRF Protection in Magento. Obviously you should never do this in production; I found it slightly odd that they even had a UI toggle for this!

Another method which was created in the Magento community was an extension to disable CSRF Protection just for the Add To Cart functionality (Inovarti_FixAddToCartMage18), under the argument that CSRF risks are far reduced when we're talking about Add to Cart functionality. This is still not ideal; we should have CSRF Protection on every form that triggers actions which change site behaviour.

There is, however, a third way. I did some digging and identified a Magento plugin that effectively uses JavaScript to inject a dynamic CSRF token the moment a user clicks the Add to Cart button, just before the request is actually submitted. There's quite a lengthy GitHub thread which outlines this issue and references the Pull Requests which fixed this behaviour in the Magento Turpentine plugin. I won't repeat the set-up instructions here, but they can be found in an article I've written on the Cloudflare Knowledge Base: Caching Static HTML with Magento (version 1 & 2)

Effectively what happens here is that the dynamic CSRF token is only injected into the web page the moment that it's needed. This is actually the behaviour that's implemented in other e-commerce platforms and in Magento 2.0+, allowing Full Page Caching to be implemented quite easily. We had to recommend this plugin as it wouldn't be practical for the site owner to simply update to Magento 2.
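For the server-side half of that just-in-time pattern, a sketch might look like the following: a tiny uncacheable endpoint that front-end JavaScript can call right before submitting the form, returning a token bound to the visitor's session cookie. The endpoint path, cookie name and helpers are hypothetical and not part of the plugin mentioned above.

package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"log"
	"net/http"
)

var serverSecret = []byte("replace-with-a-real-secret")

// csrfToken derives the token from the session ID, as in the sketch above.
func csrfToken(sessionID string) string {
	mac := hmac.New(sha256.New, serverSecret)
	mac.Write([]byte(sessionID))
	return hex.EncodeToString(mac.Sum(nil))
}

// newSessionID generates a random identifier for first-time visitors.
func newSessionID() string {
	b := make([]byte, 16)
	rand.Read(b)
	return hex.EncodeToString(b)
}

// csrfHandler issues (or reuses) a session cookie and returns the matching
// CSRF token as JSON, so front-end JavaScript can inject it into the cached
// form just before the Add to Cart POST is submitted.
func csrfHandler(w http.ResponseWriter, r *http.Request) {
	session, err := r.Cookie("frontend")
	if err != nil {
		session = &http.Cookie{Name: "frontend", Value: newSessionID(), HttpOnly: true, Secure: true}
		http.SetCookie(w, session)
	}
	w.Header().Set("Cache-Control", "no-store") // the token response itself must never be cached
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]string{"csrf_token": csrfToken(session.Value)})
}

func main() {
	http.HandleFunc("/ajax/csrf-token", csrfHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

Because only this endpoint is dynamic, the HTML that contains the form can stay fully cacheable at the edge.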

One thing to be wary of when exposing CSRF tokens via an AJAX endpoint is JSON Hijacking. There are some tips on how you can prevent this in the OWASP AJAX Security Cheat Sheet. Iain Collins has a Medium post with further discussion on the security merits of CSRF Tokens via AJAX (that said, however you're performing CSRF prevention, Same Origin Policies and HTTPOnly cookies FTW!).

There is an even cooler way you can do this using Cloudflare's Edge Workers offering. Soon this will allow you to run JavaScript at our edge network, and you can use that to dynamically insert CSRF tokens into cached content (and then perform cryptographic validation of the CSRF token either at our edge or at the origin itself using a shared secret).

But this has been a problem since 2015?

Another interesting observation is that the Magento patch which caused this interesting behaviour had been around since July 7, 2015. Why did our Support Team only see this issue in the run-up to Black Friday in 2017? What's more, we ultimately saw around a dozen support tickets about this exact issue on Magento 1.8/1.9 over the course of 6 weeks.

When an Enterprise customer ordinarily joins Cloudflare, there is a named Solutions Engineer who gets them up and running and ensures there is no pain; however, when you sign up online with a credit card, you forgo this privilege.

Last year, we released Bypass Cache on Cookie to self-serve users when a lot of e-commerce customers were in their Christmas/New Year release freeze and not making changes to their websites. Since then, there had been no major shopping events, and most of the sites enabling this feature were newly built websites using Magento 2, where this wasn't an issue.

In the run-up to Black Friday, performance and coping under load became a key consideration for developers working on legacy e-commerce websites, and they turned to Cloudflare. Given the large but steady influx of e-commerce websites joining Cloudflare, even the low overall percentage of those on Magento 1.8/1.9 became noticeable.

Conclusion

Caching anonymous page views is an important, and in some cases essential, mechanism to dramatically improve site performance and substantially reduce site load, especially during traffic spikes. Whilst aggressively caching content for anonymous users, you can still bypass the cache and allow users to use the dynamic functionality your site has to offer.

When you need to insert dynamic state into cached content, JavaScript offers a nice compromise. JavaScript allows us to cache HTML for anonymous page visits but insert state when users interact in a certain way, in essence defusing this conflict between performance and security. In the future you'll be able to run this JavaScript logic at our network edge using Cloudflare Edge Workers.

It also remains important to respect the RESTful properties of HTTP: ensure GET, OPTIONS and HEAD requests remain safe, and use POST, PUT, PATCH and DELETE for actions that change state.

If you're interested in debugging interesting technical problems on a network that sees around 10% of global internet traffic, we're hiring for Support Engineers in San Francisco, London, Austin and Singapore.

There’s Always Cache in the Banana Stand


We’re happy to announce that we now support all HTTP Cache-Control response directives. This puts powerful control in the hands of you, the people running origin servers around the world. We believe we have the strongest support for Internet standard cache-control directives of any large scale cache on the Internet.

Documentation on Cache-Control is available here.

Cloudflare runs a Content Distribution Network (CDN) across our globally distributed network edge. Our CDN works by caching our customers’ web content at over 119 data centers around the world and serving that content to the visitors nearest to each of our network locations. In turn, our customers’ websites and applications are much faster, more available, and more secure for their end users.

A CDN’s fundamental working principle is simple: storing stuff closer to where it’s needed means it will get to its ultimate destination faster. And, serving something from more places means it’s more reliably available.


To use a simple banana analogy: say you want a banana. You go to your local fruit stand to pick up a bunch to feed your inner monkey. You expect the store to have bananas in stock, which would satisfy your request instantly. But, what if they’re out of stock? Or what if all of the bananas are old and stale? Then, the store might need to place an order with the banana warehouse. That order might take some time to fill, time you would spend waiting in the store for the banana delivery to arrive. But you don’t want bananas that badly; you’ll probably just walk out and figure out some other way to get your tropical fix.

Now, what if we think about the same scenario in the context of an Internet request? Instead of bananas, you are interested in the latest banana meme. You go to bananameme.com, which sits behind Cloudflare’s edge network, and you get served your meme faster!

Of course, there’s a catch. A CDN in-between your server (the “origin” of your content) and your visitor (the “eyeball” in network engineer slang) might cache content that is out-of-date or incorrect. There are two ways to manage this:

1) the origin should give the best instructions it can on when to treat content as stale.

2) the origin can tell the edge when it has made a change to content that makes content stale.

Cache-Control headers allow servers and administrators to give explicit instructions to the edge on how to handle content.

Challenges of Storing Ephemeral Content (or: No Stale Bananas)

When using an edge cache like Cloudflare in-between your origin and visitors, the origin server no longer has direct control over the cached assets being served. Internet standards allow for the origin to emit Cache-Control headers with each response it serves. These headers give intermediate and browser caches fine-grained instruction over how content should be cached.

The current RFC covering these directives (and HTTP caching in general) is RFC 7234. It’s worth a skim if you’re into this kind of stuff. The relevant section on Response Cache-Control is laid out in section 5.2.2 of that document. In addition, some interesting extensions to the core directives were defined in RFC 5861, covering how caches should behave when origins are unreachable or in the process of being revalidated against.

To put this in terms of bananas:

George Michael sells bananas at a small stand. He receives a shipment of bananas for resale from Anthony’s Banana Company (ABC) on Monday. Anthony’s Banana Company serves as the origin for bananas for stores spread across the country. ABC is keenly interested in protecting their brand; they want people to associate them with only the freshest, perfectly ripe bananas with no stale or spoiled fruit to their name.

To ensure freshness, ABC provides explicit instructions to its vendors and eaters of its bananas. Bananas can’t be held longer than 3 days before sale to prevent overripening/staleness. Past 3 days, if a customer tries to buy a banana, George Michael must call ABC to revalidate that the bananas are fresh. If ABC can’t be reached, the bananas must not be sold.

To put this in terms of banana meme SVGs:

Kari uses Cloudflare to cache banana meme SVGs at edge locations around the world to reduce visitor latency. Banana memes should only be cached for up to 3 days to prevent the memes from going stale. Past 3 days, if a visitor requests https://bananameme.com/, Cloudflare must make a revalidation request to the bananameme.com origin. If the request to origin fails, Cloudflare must serve the visitor an error page instead of their zesty meme.

If only ABC and Kari had strong support for Cache-Control response headers!

If they did, they could serve their banana related assets with the following header:

Cache-Control: public, max-age=259200, proxy-revalidate

Public means this banana is allowed to be served from an edge cache. Max-age=259200 means it can stay in cache for up to 3 days (3 days * 24 hours * 60 minutes * 60 seconds = 259200). Proxy-revalidate means the edge cache must revalidate the content with the origin when that expiration time is up, no exceptions.
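As an illustration of what this looks like from the origin's side, here is a minimal Go sketch of a handler serving Kari's SVG with exactly that header; the route and payload are placeholders for the example.

package main

import (
	"log"
	"net/http"
)

func bananaMeme(w http.ResponseWriter, r *http.Request) {
	// Allow edge caches to hold this response for 3 days, and force them to
	// revalidate with the origin once that time is up.
	w.Header().Set("Cache-Control", "public, max-age=259200, proxy-revalidate")
	w.Header().Set("Content-Type", "image/svg+xml")
	w.Write([]byte(`<svg xmlns="http://www.w3.org/2000/svg"><!-- zesty meme goes here --></svg>`))
}

func main() {
	http.HandleFunc("/", bananaMeme)
	log.Fatal(http.ListenAndServe(":8080", nil))
}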

For a full list of supported directives and a lot more examples (but no more bananas), check out the documentation in our Help Center.

Inside the infamous Mirai IoT Botnet: A Retrospective Analysis


This is a guest post by Elie Bursztein who writes about security and anti-abuse research. It was first published on his blog and has been lightly edited.

This post provides a retrospective analysis of Mirai — the infamous Internet-of-Things botnet that took down major websites via massive distributed denial-of-service attacks using hundreds of thousands of compromised Internet-of-Things devices. This research was conducted by a team of researchers from Cloudflare (Jaime Cochran, Nick Sullivan), Georgia Tech, Google, Akamai, the University of Illinois, the University of Michigan, and Merit Network and resulted in a paper published at USENIX Security 2017.


At its peak in September 2016, Mirai temporarily crippled several high-profile services such as OVH, Dyn, and Krebs on Security via massive distributed denial-of-service (DDoS) attacks. OVH reported that these attacks exceeded 1 Tbps, the largest on public record.

What’s remarkable about these record-breaking attacks is they were carried out via small, innocuous Internet-of-Things (IoT) devices like home routers, air-quality monitors, and personal surveillance cameras. At its peak, Mirai infected over 600,000 vulnerable IoT devices, according to our measurements.

This blog post follows the timeline of the events below:

  • Mirai Genesis: Discusses Mirai’s early days and provides a brief technical overview of how Mirai works and propagates.
  • Krebs on Security attack: Recounts how Mirai briefly silenced Brian Krebs website.
  • OVH DDoS attack: Examines the Mirai author’s attempt to take down one of the world’s largest hosting providers.
  • The rise of copycats: Covers the Mirai code release and how multiple hacking groups ended up reusing the code. This section also describes the techniques we used to track down the many variants of Mirai that arose after the release. Finally, this section discusses the targets and the motive behind each major variant.
  • Mirai's takedown of the Internet: Tells the insider story behind the Dyn attacks, including the fact that the major sites (e.g., Amazon) taken down were just massive collateral damage.
  • Mirai’s attempted takedown of an entire country: Looks at the multiple attacks carried out against Lonestar, Liberia’s largest operator.
  • Deutsche Telekom goes dark: Discusses how the addition of a router exploit to one of the Mirai variants brought a major German Internet provider to its knees.
  • Mirai original author outed?: Details Brian Krebs’ in-depth investigation into uncovering Mirai’s author.
  • Deutsche Telekom attacker arrested: Recounts the arrest of the hacker who took down Deutsche Telekom and what we learned from his trial.

Mirai Genesis


The first public report of Mirai in late August 2016 generated little notice, and Mirai mostly remained in the shadows until mid-September. At that time, it was propelled into the spotlight when it was used to carry out massive DDoS attacks against Krebs on Security, the blog of a famous security journalist, and OVH, one of the largest web hosting providers in the world.

[Chart: share of Internet telnet scans attributed to Mirai over its first days, as observed by our honeypots]

While the world did not learn about Mirai until the end of August, our telemetry reveals that it became active on August 1st, when the infection started out from a single bulletproof hosting IP. From then on, Mirai spread quickly, doubling its size every 76 minutes in those early hours.

By the end of its first day, Mirai had infected over 65,000 IoT devices. By its second day, Mirai already accounted for half of all Internet telnet scans observed by our collective set of honeypots, as shown in the figure above. At its peak in November 2016 Mirai had infected over 600,000 IoT devices.

[Chart: device types identified from the service banners of infected devices]

Retroactively looking at the service banners of infected devices using Censys' Internet-wide scanning reveals that most of the devices appear to be routers and cameras, as reported in the chart above. Each type of banner is represented separately, as the identification process was different for each, so a device might be counted multiple times. Mirai was actively removing any banner identification, which partially explains why we were unable to identify most of the devices.

Before delving further into Mirai’s story, let’s briefly look at how Mirai works, specifically how it propagates and its offensive capabilities.

How Mirai works

At its core, Mirai is a self-propagating worm, that is, it’s a malicious program that replicates itself by finding, attacking and infecting vulnerable IoT devices. It is also considered a botnet because the infected devices are controlled via a central set of command and control (C&C) servers. These servers tell the infected devices which sites to attack next. Overall, Mirai is made of two key components: a replication module and an attack module.

Replication module


The replication module is responsible for growing the botnet size by enslaving as many vulnerable IoT devices as possible. It accomplishes this by (randomly) scanning the entire Internet for viable targets and attacking them. Once it compromises a vulnerable device, the module reports it to the C&C servers so it can be infected with the latest Mirai payload, as the diagram above illustrates.

To compromise devices, the initial version of Mirai relied exclusively on a fixed set of 64 well-known default login/password combinations commonly used by IoT devices. While this attack was very low tech, it proved extremely effective and led to the compromise of over 600,000 devices. For more information about DDoS techniques, read this Cloudflare primer.

Attack module


The attack module is responsible for carrying out DDoS attacks against the targets specified by the C&C servers. This module implements most of the common DDoS techniques, such as HTTP flooding, UDP flooding, and all TCP flooding options. This wide range of methods allowed Mirai to perform volumetric attacks, application-layer attacks, and TCP state-exhaustion attacks.

Krebs on Security attack


Krebs on Security is Brian Krebs’ blog. Krebs is a widely known independent journalist who specializes in cyber-crime. Given Brian’s line of work, his blog has been targeted, unsurprisingly, by many DDoS attacks launched by the cyber-criminals he exposes. According to his telemetry (thanks for sharing, Brian!), his blog suffered 269 DDoS attacks between July 2012 and September 2016. As seen in the chart above, the Mirai assault was by far the largest, topping out at 623 Gbps.


Looking at the geolocation of the IPs that targeted Brian’s site reveals that a disproportionate number of the devices involved in the attack came from South America and Southeast Asia. As reported in the chart above, Brazil, Vietnam, and Colombia appear to be the main sources of compromised devices.

One dire consequence of this massive attack against Krebs was that Akamai, the CDN service that provided Brian’s DDoS protection, had to withdraw its support. This forced Brian to move his site to Project Shield. As he discussed in depth in a blog post, this incident highlights how DDoS attacks have become a common and cheap way to censor people.

OVH attack


Brian was not Mirai’s first high-profile victim. A few days before he was struck, Mirai attacked OVH, one of the largest European hosting providers. According to its official numbers, OVH hosts roughly 18 million applications for over one million clients, Wikileaks being one of its most famous and controversial customers.

We know little about that attack as OVH did not participate in our joint study. As a result, the best information about it comes from a blog post OVH released after the event. From this post, it seems that the attack lasted about a week and involved large, intermittent bursts of DDoS traffic that targeted one undisclosed OVH customer.


Octave Klaba, OVH’s founder, reported on Twitter that the attacks were targeting Minecraft servers. As we will see throughout this post, Mirai was extensively used in gamer wars, which is likely the reason it was created in the first place.

According to OVH telemetry, the attack peaked at 1 Tbps and was carried out using 145,000 IoT devices. While the number of IoT devices is consistent with what we observed, the reported attack volume is significantly higher than what we observed for other attacks. For example, as mentioned earlier, the attack on Brian’s site topped out at 623 Gbps.

Regardless of the exact size, the Mirai attacks are clearly the largest ever recorded. They dwarf the previous public record holder, an attack against Cloudflare that topped out at ~400 Gbps.

The rise of copycats


In an unexpected development, on September 30, 2016, Anna-senpai, Mirai’s alleged author, released the Mirai source code via an infamous hacking forum. He also wrote a forum post, shown in the screenshot above, announcing his retirement.

This code release sparked a proliferation of copycat hackers who started to run their own Mirai botnets. From that point forward, the Mirai attacks were not tied to a single actor or infrastructure but to multiple groups, which made attributing the attacks and discerning the motive behind them significantly harder.

Clustering Mirai infrastructure

To keep up with the proliferation of Mirai variants and track the various hacking groups behind them, we turned to infrastructure clustering. Reverse engineering all the Mirai versions we could find allowed us to extract the IP addresses and domains used as C&C by the various hacking groups that ran their own Mirai variants. In total, we recovered two IP addresses and 66 distinct domains.


Applying DNS expansion on the extracted domains and clustering them led us to identify 33 independent C&C clusters that had no shared infrastructure. The smallest of these clusters used a single IP as C&C. The largest sported 112 domains and 92 IP addresses. The figure above depicts the six largest clusters we found.

These top clusters used very different naming schemes for their domain names: for example, “cluster 23” favors domains related to animals such as 33kitensspecial.pw, while “cluster 1” has many domains related to e-currencies such as walletzone.ru. The existence of many distinct infrastructures with different characteristics confirms that multiple groups ran Mirai independently after the source code was leaked.

Clusters over time

Looking at how many DNS lookups were made to their respective C&C infrastructures allowed us to reconstruct the timeline of each individual cluster and estimate its relative size. This accounting is possible because each bot must regularly perform a DNS lookup to know which IP address its C&C domains resolve to.


The chart above reports the number of DNS lookups over time for some of the largest clusters. It highlights the fact that many were active at the same time. Having multiple variants active simultaneously once again emphasizes that multiple actors with different motives were competing to infect vulnerable IoT devices to carry out their DDoS attacks.


Plotting all the variants in the graph clearly shows that the number of IoT devices infected by each variant differed widely. As the graph above reveals, while there were many Mirai variants, very few succeeded in growing a botnet large enough to take down major websites.

From cluster to motive


Notable clusters
Cluster | Notes
--- | ---
6 | Attacked Dyn and gaming related targets
1 | Original botnet. Attacked Krebs and OVH
2 | Attacked Lonestar Cell

Looking at which sites were targeted by the largest clusters illuminates the specific motives behind those variants. For instance, as reported in the table above, the original Mirai botnet (cluster 1) targeted OVH and Krebs, whereas Mirai’s largest instance (cluster 6) targeted Dyn and other gaming-related sites. Mirai’s third largest variant (cluster 2), in contrast, went after African telecom operators, as recounted later in this post.

Target | Attacks | Clusters | Notes
--- | --- | --- | ---
Lonestar Cell | 616 | 2 | Liberian telecom targeted by 102 reflection attacks
Sky Network | 318 | 15, 26, 6 | Brazilian Minecraft servers hosted in Psychz Networks data centers
104.85.165.1 | 192 | 1, 2, 6, 8, 11, 15 ... | Unknown router in Akamai's network
feseli.com | 157 | 7 | Russian cooking blog
Minomortaruolo.it | 157 | 7 | Italian politician site
Voxility hosted C2 | 106 | 1, 2, 6, 7, 15 ... | Known decoy target
Tuidang websites | 100 | - | HTTP attacks on two Chinese political dissidence sites
execrypt.com | 96 | - | Binary obfuscation service
Auktionshilfe.info | 85 | 2, 13 | Russian auction site
houtai.longqikeji.com | 85 | 25 | SYN attacks on a former game commerce site
Runescape | 73 | - | World 26th of a popular online game
184.84.240.54 | 72 | 1, 10, 11, 15 ... | Unknown target hosted at Akamai
antiddos.solutions | 71 | - | AntiDDoS service offered at react.su

Looking at the most attacked services across all Mirai variants reveals the following:

  1. Booter services monetized Mirai: The wide diversity of targets shows that booter services ran at least some of the largest clusters. A booter service is a service provided by cyber criminals that offers on-demand DDoS attack capabilities to paying customers.
  2. There are fewer actors than clusters: Some clusters have strong overlapping targets, which tends to indicate that they were run by the same actors. For example, clusters 15, 26, and 6 were used to target specific Minecraft servers.


Mirai’s takedown of the Internet

On October 21, 2016, a Mirai attack targeted the popular DNS provider Dyn. This event prevented Internet users from accessing many popular websites, including Airbnb, Amazon, GitHub, HBO, Netflix, PayPal, Reddit, and Twitter, by disrupting Dyn's name-resolution service.

We believe this attack was not meant to “take down the Internet,” as it was painted by the press, but rather was linked to a larger set of attacks against gaming platforms.

We reached this conclusion by looking at the other targets of the Dyn variant (cluster 6), which were all gaming related. This is also consistent with the OVH attack: as discussed earlier, OVH was targeted because it hosted specific game servers. As sad as it seems, all the prominent sites affected by the Dyn attack were apparently just spectacular collateral damage in a war between gamers.

Mirai’s attempted takedown of an entire country


Lonestar Cell, one of the largest Liberian telecom operators, started being targeted by Mirai on October 31. Over the following months, it suffered 616 attacks, the most of any Mirai victim.

The fact that the Mirai cluster responsible for these attacks shares no infrastructure with the original Mirai or the Dyn variant indicates that they were orchestrated by a different actor than the original author.

A few weeks after our study was published, this assessment was confirmed when the author of one of the most aggressive Mirai variants confessed during his trial that he had been paid to take down Lonestar. He acknowledged that an unnamed Liberian ISP paid him $10,000 to take out its competitors. This validated that our clustering approach can accurately track and attribute Mirai’s attacks.

Deutsche Telekom going dark

On November 26, 2016, Deutsche Telekom, one of the largest German Internet providers, suffered a massive outage after 900,000 of its routers were knocked offline.


Ironically, this outage was not due to yet another Mirai DDoS attack but instead due to a particularly innovative and buggy version of Mirai that knocked these devices offline while attempting to compromise them. This variant also affected thousands of TalkTalk routers.

What allowed this variant to infect so many routers was the addition to its replication module of an exploit targeting the CPE WAN Management Protocol (CWMP). CWMP is an HTTP-based protocol used by many Internet providers to auto-configure and remotely manage home routers, modems, and other customer-premises equipment (CPE).

Besides its scale, this incident is significant because it demonstrates how the weaponization of more complex IoT vulnerabilities can lead to very potent botnets. We hope the Deutsche Telekom event acts as a wake-up call and a push toward making IoT auto-update mandatory. This is much needed to curb the significant risk posed by vulnerable IoT devices, given the poor track record of Internet users manually patching their IoT devices.

Mirai original author outed?

In the months after his website was taken offline, Brian Krebs devoted hundreds of hours to investigating Anna-senpai, the infamous Mirai author. In early January 2017, Brian announced that he believed Anna-senpai to be Paras Jha, a Rutgers student who had apparently been involved in previous game-hacking schemes. Brian also identified Josiah White as a person of interest. After being outed, Paras Jha, Josiah White, and another individual were questioned by authorities and pleaded guilty in federal court to a variety of charges, some related to their Mirai activity.

Deutsche Telekom attacker arrested

In February 2017, Daniel Kaye (aka BestBuy), the author of the Mirai botnet variant that brought down Deutsche Telekom, was arrested at Luton airport. Prior to Mirai, the 29-year-old British citizen was infamous for selling his hacking services on various dark web markets.

In July 2017, a few months after being extradited to Germany, Daniel Kaye pleaded guilty and received a suspended sentence of a year and a half. During the trial, Daniel admitted that he never intended for the routers to stop functioning; he only wanted to silently control them so he could use them as part of a DDoS botnet to increase his firepower. As discussed earlier, he also confessed to being paid by a competitor to take down Lonestar.

In August 2017, Daniel was extradited back to the UK to face extortion charges after attempting to blackmail Lloyds and Barclays banks. According to press reports, he asked Lloyds to pay about £75,000 in bitcoin for the attack to be called off.

Takeaways

The prevalence of insecure IoT devices on the Internet makes it very likely that, for the foreseeable future, they will be the main source of DDoS attacks.

Mirai and subsequent IoT botnets can be averted if IoT vendors start to follow basic security best practices. In particular, we recommend that the following be required of all IoT device makers:

  • Eliminate default credentials: This will prevent hackers from constructing a credential master list that allows them to compromise a myriad of devices, as Mirai did.
  • Make auto-patching mandatory: IoT devices are meant to be “set and forget,” which makes manual patching unlikely. Having them auto-patch is the only reasonable option to ensure that no widespread vulnerability like the Deutsche Telekom one can be exploited to take down a large chunk of the Internet.
  • Implement rate limiting: Enforcing login rate limiting to prevent brute-force attacks is a good way to mitigate the tendency of people to use weak passwords. Another alternative would be using a captcha or a proof of work. A minimal rate-limiting sketch follows this list.
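
As a rough illustration of that last recommendation, here is a minimal sketch of per-source login rate limiting using an in-memory token bucket. It is written in Python for readability; the limits are arbitrary, and a real device or gateway would persist this state and tune the thresholds. It is not taken from any particular vendor's firmware.

```python
import time

class LoginRateLimiter:
    """Allow at most `max_attempts` login attempts per `per_seconds` per source."""

    def __init__(self, max_attempts=5, per_seconds=60.0):
        self.capacity = float(max_attempts)
        self.refill_rate = max_attempts / per_seconds   # tokens added per second
        self.buckets = {}                               # source -> (tokens, last timestamp)

    def allow(self, source):
        tokens, last = self.buckets.get(source, (self.capacity, time.monotonic()))
        now = time.monotonic()
        # Refill the bucket in proportion to the time elapsed, up to capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_rate)
        if tokens < 1.0:
            self.buckets[source] = (tokens, now)
            return False    # too many attempts: reject before even checking the password
        self.buckets[source] = (tokens - 1.0, now)
        return True

# Example: the sixth rapid attempt from the same address is rejected.
limiter = LoginRateLimiter()
print([limiter.allow("203.0.113.7") for _ in range(6)])
```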

Thank you for reading this post until the end!

The Athenian Project: Helping Protect Elections


From cyberattacks on election infrastructure, to attempted hacking of voting machines, to attacks on campaign websites, the last few years have brought us unprecedented attempts to use online vulnerabilities to affect elections both in the United States and abroad. In the United States, the Department of Homeland Security reported that individuals tried to hack voter registration files or public election sites in 21 states prior to the 2016 elections. In Europe, hackers targeted not only the campaign of Emmanuel Macron in France, but government election infrastructure in the Czech Republic and Montenegro.

Cyber attack is only one of the many online challenges facing election officials. Unpredictable website traffic patterns are another. Voter registration websites see a flood of legitimate traffic as registration deadlines approach. Election websites must integrate reported results and stay online notwithstanding notoriously hard-to-model election day loads.

We at Cloudflare have seen many election-related cyber challenges firsthand. In the 2016 U.S. presidential campaign, Cloudflare protected most of the major presidential campaign websites from cyberattack, including the Trump/Pence campaign website, the website for the campaign of Senator Bernie Sanders, and websites for 14 of the 15 leading candidates from the two major parties. We have also protected election websites in countries like Peru and Ecuador.

Although election officials have worked hard to address the security and reliability of election websites, as well as other election infrastructure, budget constraints can limit the ability of governments to access the technology and resources needed to defend against attacks and maintain an online presence. Election officials trying to secure election infrastructure should not have to face a Hobson’s choice of deciding what infrastructure to protect with limited available resources.

The Athenian Project

Since 2014, Cloudflare has protected at-risk public interest websites that might be subject to cyberattack for free through Project Galileo. As part of Project Galileo, we have supported a variety of non-governmental election efforts helping to ensure that individuals have an opportunity to participate in their democracies. This support included protection of Electionland, a project to track and cover voting problems during the 2016 election across the country and in real-time.

When Project Galileo began, we did not anticipate that government websites in the United States might be similarly vulnerable because of resourcing concerns. The past few years have taught us otherwise. We at Cloudflare believe that the integrity of elections should not depend on whether state and local governments have sufficient resources to protect digital infrastructure from cyber attack and keep it online.

The common mission of those working on elections is to preserve citizen confidence in the democratic process and enhance voter participation in elections[1]. To protect voters’ voices, election websites and infrastructure must be stable and secure. Prior to an election, websites provide critical information to the public such as registration requirements, voting locations and sample ballots. After an election, websites provide election results to citizens.

The institutions in which we place our trust must have the tools to protect themselves. Voter registration websites must stay online before a registration deadline, making it possible for voters who want to register to do so. Election websites should be available on election day notwithstanding increased traffic. Voters should have confidence that officials are doing everything they can to safeguard the integrity of election and voter data, and that election results will be available online.

That is why today, we are launching the Athenian Project, which builds on our work in Project Galileo. The Athenian Project is designed to protect state and local government websites tied to elections and voter data from cyberattack, and keep them online.

U.S. state and local governments can participate in the Athenian Project if their websites meet the following criteria:

  1. The website is managed and owned by a state, county, or municipal government; and
    
  2. The website is related to
    
  • The administration of elections, including the provision of information related to voting and polling places; or
  • Voter data, including voter registration or verification; or
  • The reporting of election results.


For websites that meet these criteria, Cloudflare will extend its highest level of protection for free.

We recognize that different government actors may have different challenges. We therefore intend to work directly with relevant state and municipal officials to address each site’s needs.

Protecting our Elections

In the last few months, we have been talking to a number of government officials about how we can help protect their elections. Today, we are proud to report that we helped the State of Alabama protect its website during its special general election for the U.S. Senate on Tuesday.

“In this year’s historic Senate Special election, it was crucial that our website be able to handle spikes in traffic and remain online in the event of attack,” said Jim Purcell, Acting Secretary of Information Technology for the State of Alabama. “It is very important to our state government and democracy as a whole that voters and the public be able to access registrar, election information, and election results. Cloudflare proved to be an excellent partner, helping us achieve this goal.”

By allowing voters to exercise their rights to register to vote, speak, and access information, the Internet can and should play a helpful role in democracy. Democracies depend on voters’ voices being enabled, not silenced. Helping to provide state and local governments the tools they need to keep websites online and secure from attack as they hold and report on elections restores the Internet’s promise and serves Cloudflare’s mission of helping to build a better Internet.

To learn more and apply to the Athenian Project, please visit: cloudflare.com/athenian-project


  1. State of New York Board of Elections mission statement. ↩︎

Highlights from Cloudflare's Weekend at YHack


Along with four other Cloudflare colleagues, I traveled to New Haven, CT last weekend to support 1,000+ college students from Yale and beyond at YHack hackathon.

Throughout the weekend-long event, student attendees were supported by mentors, workshops, entertaining performances, tons of food and caffeine breaks, and a lot of air-mattresses and sleeping bags. Their purpose was to create projects to solve real world problems and to learn and have fun in the process.

How Cloudflare contributed

Cloudflare sponsored YHack. Our team of five wanted to support, educate, and positively impact the experience and learning of these college students. Here are some ways we engaged with students.


1. Mentoring

Our team of five mentors from three different teams and two different Cloudflare locations (San Francisco and Austin) was available at the Cloudflare table or via Slack for almost every hour of the event. There were a few hours in the early morning when all of us were asleep, I'm sure, but we were available to help otherwise.

2. Providing challenges

Cloudflare submitted two challenges to the student attendees, encouraging them to protect and improve the performance of their projects and/or create an opportunity for exposure to 6 million+ potential users of their apps.

Challenge 1: Put Cloudflare in front of your hackathon project

Challenge 2. Make a Cloudflare App

Prizes were awarded to all teams which completed these challenges; in total, ten teams received awards. The team which was judged to have created the best Cloudflare app won free admission to Cloudflare's 2018 Internet Summit as well as several swag items and introductions to Cloudflare teams offering internships next summer.

3. Distributing swag & cookies

Over 3,000 swag items (t-shirts, laptop stickers, and laptop camera covers) were distributed to almost every attendee. When the Cloudflare team noticed student attendees were going without snacks after dinner, I made a trip to a local grocery store and bought hundreds of cookies to distribute as well. After about an hour, none remained.


4. Sponsoring

As with most any hackathon, costs for food, A/V, venue rental, and other expenses are considerable. Cloudflare decided to financially support YHack to help offset some of these costs and be sure student participants were able to attend, learn, and build meaningful projects without having to pay admission fees.

Results of the hackathon

In groups of up to four, students teamed up to create projects. There was no set project theme, so students could focus on subject areas they were most passionate about. Judging criteria were "practical and useful", "polish", "technical difficulty", and "creativity", and several sponsors, such as Intuit, Cloudflare, JetBlue, and Yale's Poynter Fellowship in Journalism, submitted suggested challenges for teams to work on.

I saw a lot of cool projects that could help college students save money on food or other expenses, projects that could help identify fake news sources, and apps that would work great on Cloudflare Apps.

There were dozens of great completed projects by the end of the Hackathon. I'd like to highlight a few great ones which used Cloudflare best.

The Winner: TL;DR


Akrama Mirza, a second year student at University of Waterloo, created an app for Cloudflare Apps which allows a website owner to automatically generate a TL;DR summary of the content on their pages and place it anywhere on their site. This app could be used to give a TL;DR summary of a blog post, article, report, etc.

Here you can see how TL;DR would display on TechCrunch's site.
The TL;DR app won the Cloudflare challenge to "make a Cloudflare App." As a result, Akrama has been invited to attend Cloudflare's 2018 Internet Summit in San Francisco. I've also introduced him to some internal teams, so he may explore internship opportunities with Cloudflare next summer.

Read more about TL;DR on Akrama's Devpost page.

Other great projects which used Cloudflare

Of the completed projects which put Cloudflare in front of their sites, there were three I most wanted to feature.

K2


"Bringing Wall Street to Main Street through an accessible trading and back-testing platform."

K2 is a comprehensive backtesting platform for currency data, specializing in cryptocurrency and offering users the opportunity to create trading algorithms and simulate them in real-time.

The K2 team sought to equalize the playing field by enabling the general public to develop and test trading strategies. Users may specify a trading interval, time frame, and currency symbol, and the K2 backend will visualize the cumulative returns and generate financial metrics.

Read more about K2 on the team's Devpost page or website.

Money Moves


Money Moves analyzes data about financial advisors and their attributes and uses unsupervised deep learning algorithms to predict whether certain financial advisors are likely to be beneficial or detrimental to an investor's financial standing. Ultimately, Money Moves will help investors select the best possible financial advisor in their region.

The Money Moves team liked that they could easily use Cloudflare to protect and improve the performance of their site. One of the team members, Muyiwa Olaniyan, already used Cloudflare to run his personal server at home, so the team decided to use Cloudflare from the start.

Read more about Money Moves on the team's Devpost page or website.

Mad Invest


"Smart Bitcoin Investing Chatbot"

When the Facebook Messenger chatbot is asked when it is best to invest in Bitcoin, it observes current market trends and Mad Invest's model predicts how likely the price is to go up or down. A conclusion as to when it would be a good time to invest or sell is drawn and delivered to users by message.

Moving forward, the team intends to improve their model's predictions by developing a way to analyze the Chinese market, which represents 70% of Bitcoin traffic.

When telling me about their experience using Cloudflare and how it saved them time, the team delivered my favorite quote from the whole weekend. “We spent half an hour setting up Let's Encrypt for the SSL and we realized we could just put Cloudflare in front of it.” Exactly.

Read more about Mad Invest on the team's Devpost page.

Final thoughts:

I was honored to be part of a team that supported so many awesome students in the development of their projects at YHack. I was pleased to hear many attendees had already heard of or used Cloudflare before, and many teams interacted with Cloudflare to protect and improve the performance of their projects and develop new apps. I look forward to being involved in many more events in 2018 as well.

2018 and the Internet: our predictions


At the end of 2016, I wrote a blog post with seven predictions for 2017. Let’s start by reviewing how I did.

Didn’t he do well
Public Domain image by Michael Sharpe

I’ll score myself with two points for being correct, one point for mostly right and zero for wrong. That’ll give me a maximum possible score of fourteen. Here goes...

2017-1: 1Tbps DDoS attacks will become the baseline for ‘massive attacks’

This turned out to be true, but mostly because massive attacks went away as Layer 3 and Layer 4 DDoS mitigation services got good at filtering out high bandwidth and high packet rates. Over the year we saw many DDoS attacks in the 100s of Gbps (up to 0.5 Tbps) and then in September announced Unmetered Mitigation. Almost immediately we saw attackers stop bothering to attack Cloudflare-protected sites with large DDoS attacks.

So, I’ll be generous and give myself one point.

2017-2: The Internet will get faster yet again as protocols like QUIC become more prevalent

Well, yes and no. QUIC has become more prevalent as Google has widely deployed it in the Chrome browser and it accounts for about 7% of Internet traffic. At the same time the protocol is working its way through the IETF standardization process and has yet to be deployed widely outside Google.

So, I’ll award myself one point for this as QUIC did progress but didn’t get as far as I thought.

2017-3: IPv6 will become the defacto for mobile networks and IPv4-only fixed networks will be looked upon as old fashioned

IPv6 continued to grow throughout 2017 and seems to be on a pretty steady upward trajectory, although it's not yet deployed on a quarter of the top 25,000 web sites. Note the large jump in IPv6 support that occurred in the middle of 2016, when Cloudflare enabled it by default for all our customers.

The Internet Society reported that mobile networks that switch to IPv6 see 70-95% of their traffic use IPv6. Google reports that traffic from Verizon is now 90% IPv6 and T-Mobile is turning off IPv4 completely.

Here I’ll award myself two points.

2017-4: A SHA-1 collision will be announced

That happened on 23 February 2017 with the announcement of an efficient way to generate colliding PDF documents. It’s so efficient that here are two PDFs containing the old and new Cloudflare logos. I generated these two PDFs using a web site that takes two JPEGs, embeds them in two PDFs and makes them collide. It does this instantly.

They have the same SHA-1 hash:

$ shasum *.pdf
e1964edb8bcafc43de6d1d99240e80dfc710fbe1  a.pdf
e1964edb8bcafc43de6d1d99240e80dfc710fbe1  b.pdf

But different SHA-256 hash:

$ shasum -a256 *.pdf
8e984df6f4a63cee798f9f6bab938308ebad8adf67daba349ec856aad07b6406  a.pdf
f20f44527f039371f0aa51bc9f68789262416c5f2f9cefc6ff0451de8378f909  b.pdf

So, two points for getting that right (and thanks, Nick Sullivan, for suggesting it and making me look smart).

2017-5: Layer 7 attacks will rise but Layer 6 won’t be far behind

The one constant of 2017 in terms of DDoS was the prevalence of Layer 7 attacks. Even as attackers decided that large-scale Layer 3 and 4 DDoS attacks were being mitigated easily and hence stopped performing them so frequently, Layer 7 attacks continued apace, with attacks in the hundreds of krps commonplace.

Awarding myself one point because Layer 6 attacks didn’t materialize as much as predicted.

2017-6: Mobile traffic will account for 60% of all Internet traffic by the end of the year

Ericsson reported mid-year that mobile data traffic was continuing to grow strongly, rising 70% between Q1 2016 and Q1 2017. Stats show that while mobile traffic continued to increase its share of Internet traffic and passed 50% in 2017, it didn't reach 60%.

Zero points for me.

2017-7: The security of DNS will be taken seriously

This has definitely happened. The 2016 Dyn DNS attack was a wake-up call that often-overlooked infrastructure was at risk of DDoS attack. In April 2017, Wired reported that hackers took over 36 Brazilian banking web sites by hijacking DNS registration; in June, Mozilla and ICANN proposed encrypting DNS by sending it over HTTPS, and the IETF now has a working group on what's being called DoH.

DNSSEC deployment continued with SecSpider showing steady, continuous growth during 2017.

So, two points for me.

Overall, I scored myself a total of 9 out of 14, or 64% right. With that success rate in mind here are my predictions for 2018.

2018 Predictions

2018-1: By the end of 2018 more than 50% of HTTPS connections will happen over TLS 1.3

The rollout of TLS 1.3 has been stalled because of the difficulty of getting it working correctly in the heterogeneous Internet environment. Although Cloudflare has had TLS 1.3 in production and available for all customers for over a year, only 0.2% of our traffic is currently using that version.

Given the state of standardization of TLS 1.3 today we believe that major browser vendors will enable TLS 1.3 during 2018 and by the end of the year more than 50% of HTTPS connections will be using the latest, most secure version of TLS.

2018-2: Vendor lock-in with Cloud Computing vendors becomes dominant worry for enterprises

In Mary Meeker’s 2017 Internet Trends report, she gives statistics (slide 183) on the top three concerns of users of cloud computing. These show a striking change from being primarily about security and cost to worries about vendor lock-in and compliance. Cloudflare believes that vendor lock-in will become the top concern of users of cloud computing in 2018 and that multi-cloud strategies will become common.

BillForward is already taking a multi-cloud approach with Cloudflare moving traffic dynamically between cloud computing providers. Alongside vendor lock-in, users will name data portability between clouds as a top concern.

2018-3: Deep learning hype will subside as self-driving cars don't become a reality but AI/ML salaries will remain high

Self-driving cars won’t become available in 2018, but AI/ML will remain red hot as every technology company tries to hire appropriate engineering staff and finds they can’t. At the same time, deep learning techniques will be applied widely across companies and industries as it becomes clear that these techniques are not limited to game playing, classification, or translation tasks.

Expect unexpected applications of techniques that are already in use in Silicon Valley when they are applied to the rest of the world. Don’t be surprised if there’s talk of AI/ML-managed traffic management for highways, for example. Anywhere there's a heuristic we'll see AI/ML applied.

But it’ll take another couple of years for AI/ML to really have profound effects. By 2020 the talent pool will have greatly increased and manufacturers such as Qualcomm, nVidia and Intel will have followed Google’s lead and produced specialized chipsets designed for deep learning and other ML techniques.

2018-4: k8s becomes the dominant platform for cloud computing

A corollary to users’ concerns about cloud vendor lock-in and the need for multi-cloud capability is that an orchestration framework will dominate. We believe that Kubernetes will be that dominant platform and that large cloud vendors will work to ensure compatibility across implementations at the demand of customers.

We are currently in the infancy of k8s deployment, with the major cloud computing vendors deploying incompatible versions. We believe that customer demand for portability will cause cloud computing vendors to ensure compatibility.

2018-5: Quantum resistant crypto will be widely deployed in machine-to-machine links across the internet

During 2017 Cloudflare experimented with, and open sourced, quantum-resistant cryptography as part of our implementation of TLS 1.3. Today there is a threat to the security of Internet protocols from quantum computers, and although the threat has not been realized, cryptographers are working on cryptographic schemes that will resist attacks from quantum computers when they arrive.

We predict that quantum-resistant cryptography will become widespread in links between machines and data centers especially where the connections being encrypted cross the public Internet. We don’t predict that quantum-resistant cryptography will be widespread in browsers, however.

2018-6: Mobile traffic will account for 60% of all Internet traffic by the end of the year

Based on the continued trend upwards in mobile traffic I’m predicting that 2018 (instead of 2017) will be the year mobile traffic shoots past 60% of overall Internet traffic. Fingers crossed.

2018-7: Stable BTC/USD exchanges will emerge as others die off from security-based Darwinism

The meteoric rise in the Bitcoin/USD exchange rate has been accompanied by a drumbeat of stories about stolen Bitcoins and failing exchanges. We believe that in 2018 the entire Bitcoin ecosystem will stabilize.

This will partly be through security-based Darwinism as trust in exchanges and wallets that have security problems plummets and those that survive have developed the scale and security to cope with the explosion in Bitcoin transactions and attacks on their services.

Technical reading from the Cloudflare blog for the holidays


During 2017 Cloudflare published 172 blog posts (including this one). If you need a distraction from the holiday festivities at this time of year here are some highlights from the year.


CC BY 2.0 image by perzon seo

The WireX Botnet: How Industry Collaboration Disrupted a DDoS Attack

We worked closely with companies across the industry to track and take down the Android WireX Botnet. This blog post goes into detail about how that botnet operated, how it was distributed and how it was taken down.

Randomness 101: LavaRand in Production

The wall of Lava Lamps in the San Francisco office is used to feed entropy into random number generators across our network. This blog post explains how.

ARM Takes Wing: Qualcomm vs. Intel CPU comparison

Our network of data centers around the world all contain Intel-based servers, but we're interested in ARM-based servers because of the potential cost/power savings. This blog post took a look at the relative performance of Intel processors and Qualcomm's latest server offering.

How to Monkey Patch the Linux Kernel

One engineer wanted to combine the Dvorak and QWERTY keyboard layouts and did so by patching the Linux kernel using SystemTap. This blog explains how and why. Where there's a will, there's a way.

Introducing Cloudflare Workers: Run JavaScript Service Workers at the Edge

Traditionally, the Cloudflare network has been configurable by our users, but not programmable. In September, we introduced Cloudflare Workers which allows users to write JavaScript code that runs on our edge worldwide. This blog post explains why we chose JavaScript and how it works.


CC BY 2.0 image by Peter Werkman

Geo Key Manager: How It Works

Our Geo Key Manager gives customers granular control of the location of their private keys on the Cloudflare network. This blog post explains the mathematics that makes this possible.

SIDH in Go for quantum-resistant TLS 1.3

Quantum-resistant cryptography isn't an academic fantasy. We implemented the SIDH scheme as part of our Go implementation of TLS 1.3 and open sourced it.

The Languages Which Almost Became CSS

This blog post recounts the history of CSS and the languages that might have been CSS.

Perfect locality and three epic SystemTap scripts

In an ongoing effort to understand the performance of NGINX under heavy load on our machines (and wring out the greatest number of requests/core), we used SystemTap to experiment with different queuing models.

How we built rate limiting capable of scaling to millions of domains

We rolled out a rate limiting feature that allows our customers to control the maximum number of HTTP requests per second/minute/hour that their servers receive. This blog post explains how we made that operate efficiently at our scale.


CC BY 2.0 image by Han Cheng Yeh

Reflections on reflection (attacks)

We deal with a new DDoS attack every few minutes and in this blog post we took a close look at reflection attacks and revealed statistics on the types of reflection-based DDoS attacks we see.

On the dangers of Intel's frequency scaling

Intel processors contain special AVX-512 units that provide 512-bit wide SIMD instructions to speed up certain calculations. However, these instructions have a downside: when they are used, the CPU base frequency is scaled down, slowing down other instructions. This blog post explores that problem.

How Cloudflare analyzes 1M DNS queries per second

This blog post details how we handle logging information for 1M DNS queries per second using a custom pipeline, ClickHouse and Grafana (via a connector we open sourced) to build real time dashboards.

AES-CBC is going the way of the dodo

CBC-mode cipher suites have been declining for some time because of padding oracle-based attacks. In this blog we demonstrate that AES-CBC has now largely been replaced by ChaCha20-Poly1305.


CC BY-SA 2.0 image by Christine

How we made our DNS stack 3x faster

We answer around 1 million authoritative DNS queries per second using a custom software stack. Responding to those queries as quickly as possible is why Cloudflare is the fastest authoritative DNS provider on the Internet. This blog post details how we made our stack even faster.

Quantifying the Impact of "Cloudbleed"

On February 18 a serious security bug was reported to Cloudflare. Five days later we released details of the problem and six days after that we posted this analysis of the impact.

LuaJIT Hacking: Getting next() out of the NYI list

We make extensive use of LuaJIT when processing our customers' traffic and making it faster is a key goal. In the past, we've sponsored the project and everyone benefits from those contributions. This blog post examines getting one specific function JITted correctly for additional speed.

Privacy Pass: The Math

The Privacy Pass project provides a zero knowledge way of proving your identity to a service like Cloudflare. This detailed blog post explains the mathematics behind authenticating a user without knowing their identity.

How and why the leap second affected Cloudflare DNS

The year started with a bang for some engineers at Cloudflare when we ran into a bug in our custom DNS server, RRDNS, caused by the introduction of a leap second at midnight UTC on January 1, 2017. This blog explains the error and why it happened.

There's no leap second this year.


TLS 1.3 is going to save us all, and other reasons why IoT is still insecure


As I’m writing this, four DDoS attacks are ongoing and being automatically mitigated by Gatebot. Cloudflare’s job is to get attacked. Our network gets attacked constantly.

Around the fall of 2016, we started seeing DDoS attacks that looked a little different than usual. One attack we saw around that time had traffic coming from 52,467 unique IP addresses. The clients weren’t servers or desktop computers; when we tried to connect to the clients over port 80, we got the login pages to CCTV cameras.

Obviously it’s important to lock down IoT devices so that they can’t be co-opted into evil botnet armies, but when we talk to some IoT developers, we hear a few concerning security patterns. We’ll dive into two problematic areas and their solutions: software updates and TLS.

The Trouble With Updates

With PCs, the end user is ultimately responsible for securing their devices. People understand that they need to update their computers and phones. Just 4 months after Apple released iOS 10, it was installed on 76% of active devices.

People just don’t know that they are supposed to update IoT things like they are supposed to update their computers because they’ve never had to update things in the past. My parents are never going to install a software update for their thermometer.

And the problem gets worse over time. The longer a device stays on an older software version, the less likely it will be compatible with the newer version. At some point, an update may not be possible anymore. This is a very real concern as the shelf life of a connected thing can be 10 years in the case of a kitchen appliance - have you ever bought a refrigerator?

This is if the device can be patched at all. First, devices that are low battery are programmed not to receive updates because it’s too draining on the battery. Second, IoT devices are too lightweight to run a full operating system, they run just a compiled binary on firmware which means there’s a limit to the code that can later be pushed to it. Some devices cannot receive specific patches.

The other thing we hear about updates from IoT developers is that often they are afraid to push a new update because it could mean breaking hundreds of thousands of devices at once.

All this may not seem like a big deal - ok, so a toaster can get hacked, so what - but two very real things are at stake. First, every device that’s an easy target makes it easier to make other applications a target. Second, once someone is sitting on a device, they are in your network, which can put at stake any traffic sent over the wire.

The security model that worked for PC doesn’t work for IoT — the end user can’t be responsible, and patching isn’t reliable. We need something else. What’s the solution?

Traffic to an IoT device passes through many different networks: the transit provider from the application server, the content delivery network used to deliver device traffic, the ISP to the building where the device sits.

It is at those network layers that protection can be added. As IoT device traffic moves through these networks, packets can be filtered to only let in good traffic. Even if a device is running vulnerable code, filters added in the network level can keep hackers out.

The Trouble With TLS

TLS is used in two ways in IoT devices: First, TLS is used to encrypt data in transit. This is used for data privacy and to make it harder to reverse engineer the communications used by the device. Second, devices store client TLS certificates that are used to authenticate the devices to the application - makes it one step harder to fake a device.

There are three problems developers run into when they want to implement TLS in IoT. The first is that while IoT traffic needs to be quick and lightweight, TLS adds an additional two round trips to the start of every session. The second is that certificates can be large files, and device memory is limited in IoT. And the third is that some of the protocols that are being developed for IoT are plaintext by default.

TLS Isn’t Lightweight

IoT devices run on low-power chips. An IoT device may only have 256 or 512 KB of RAM and often needs to conserve battery. They constantly send and receive lots of small pieces of information. Imagine an Internet-connected wind sensor: it measures wind speed and, every 30 seconds, sends the new reading to the application server. It only needs to get a few bytes of data over the wire, and it wants to do so with as little overhead as possible to conserve RAM and battery life.

Here’s an HTTP POST to do that:

[Figure: the HTTP POST request sent by the device]
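
For illustration only, here is a minimal sketch of what such a periodic report could look like in code. The endpoint, field names, and 30-second interval are assumptions borrowed from the wind-sensor example, not the exact request shown in the figure.

```python
import json
import time
import urllib.request

ENDPOINT = "http://sensors.example.com/wind"   # hypothetical application server

def report(speed_mps):
    body = json.dumps({"sensor_id": "wind-42", "speed_mps": speed_mps}).encode()
    req = urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status     # a few bytes each way, no handshake overhead

while True:
    report(7.3)                # in a real sensor, read the measurement here
    time.sleep(30)
```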

But let’s say the same device is going to use TLS. Here’s what the same POST looks like with the TLS handshake — this is with TLS 1.2:

[Figure: the same POST preceded by a TLS 1.2 handshake]

Depending on the distance between the device and the application server and on the server's latency, this can add hundreds of milliseconds. The likely solution is the newest version of TLS, TLS 1.3.

TLS 1.3 eliminates a complete round trip in the TLS handshake, which makes TLS much lighter and faster. It cuts the number of round trips in the handshake in half by predicting which key agreement protocol and algorithm the server will decide to use and sending those guessed parameters and the key share directly in the client hello. If the server likes that, it sends back its own key share for the same algorithm, and the whole handshake is done.


If the same IoT device talks to the same server again, there’s actually no round trip at all. The parameters chosen in the initial handshake are sent alongside application data in the first packet.


Why isn’t every IoT device using 1.3 today? TLS 1.3 is still actively being developed in the IETF standards track, and while Chrome (as of version 56 in January) and Firefox (as of version 52 in March) support 1.3, not everything does. The biggest problem today is middleboxes used by ISPs and enterprises that panic when they see a 1.3 handshake and close the connection. This also happened when the world was upgrading to TLS 1.2 and middleboxes only understood TLS 1.1, so it’s just a matter of time.
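
To make this concrete, here is a minimal sketch of a client that refuses anything older than TLS 1.3, assuming Python 3.7+ and a hypothetical host name. It only demonstrates version negotiation, not a full IoT client.

```python
import socket
import ssl

HOST = "iot-api.example.com"     # hypothetical application server

ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_3    # refuse TLS 1.2 and older

with socket.create_connection((HOST, 443)) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        print(tls.version())    # 'TLSv1.3' when the server supports it
```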

TLS Certificate Size

In a TLS handshake, the server can use a server-side TLS certificate to authenticate itself to the client, and the client can use a client-side certificate to authenticate itself to the server. Devices often store certificates to authenticate themselves to the application server. However, device memory is often limited in IoT, and certificates can be large. What can we do?

Most certificates today use the RSA algorithm, which has been around since the 70s. The certificates are large because RSA keys need to be large to be secure - typically 1,024 to 2,048 bits. However, a newer approach based on elliptic curve cryptography, in wide use since the early 2000s, can solve this problem. With elliptic curve cryptography we can use much smaller keys with the same level of security as a larger RSA key and save space on the device.
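
A rough way to see the difference, assuming the third-party `cryptography` package is installed: generate a 2,048-bit RSA key and a P-256 elliptic curve key and compare the sizes of their encoded public keys. The exact byte counts vary with encoding, but the elliptic curve key is several times smaller.

```python
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import ec, rsa

def public_key_size(key):
    # Length of the DER-encoded public key, as it would appear in a certificate.
    return len(key.public_key().public_bytes(
        serialization.Encoding.DER,
        serialization.PublicFormat.SubjectPublicKeyInfo))

rsa_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
ec_key = ec.generate_private_key(ec.SECP256R1())

print("RSA-2048 public key:", public_key_size(rsa_key), "bytes")
print("P-256 public key:   ", public_key_size(ec_key), "bytes")
```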

Default Plaintext IoT Protocols

IoT devices need to be lightweight so two emerging protocols are replacing HTTP as the dominant transfer protocol for some IoT devices: MQTT and CoAP.

MQTT is a pub/sub protocol that has been around almost 20 years. In MQTT, a proxy server acts as a broker. An IoT device or web app publishes a message to the broker, and the broker distributes those messages to all the other IoT devices that need to receive that message.


When MQTT was designed almost 20 years ago, it was intentionally written without security. It was written for oil and gas companies that were just sending sensor data, and no one thought it needed to be encrypted.
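
MQTT does not have to stay plaintext, though: the usual fix is to run it over TLS on port 8883. Here is a minimal sketch assuming the third-party `paho-mqtt` package (1.x API) and a hypothetical broker; the topic, credentials, and host are placeholders.

```python
import paho.mqtt.client as mqtt

client = mqtt.Client(client_id="wind-42")
client.tls_set()                     # verify the broker using the system CA bundle
client.username_pw_set("device-user", "device-password")   # placeholder credentials
client.connect("broker.example.com", 8883)    # 8883 is the conventional MQTT-over-TLS port
client.publish("sensors/wind-42/speed", payload="7.3", qos=1)
client.disconnect()
```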

CoAP was standardized just three years ago. It has all the same methods as HTTP, but it’s over UDP so it’s really light.


The problem is that if you want to add TLS (really DTLS, because CoAP runs over UDP), it's no longer so light.


The Future

It will be quite interesting to see how update mechanisms and TLS implementations change as the number of deployed IoT devices continues to grow. If this type of thing interests you, come join us.

Simple Cyber Security Tips (for your Parents)


Today, December 25th, Cloudflare offices around the world are taking a break. From San Francisco to London and Singapore, engineers have retreated home for the holidays (albeit with those engineers on-call closely monitoring their mobile phones).

Whilst our Support and SRE teams operated on a schedule to ensure fingers were on keyboards, on Saturday I headed out of London, bound for the Warwickshire countryside. Away from the barracks of the London tech scene, it didn't take long for the following conversation to happen:

  • Family member: "So what do you do nowadays?"
  • Me: "I work in Cyber Security."
  • Family member: "There seems to be a new cyber attack every day on the news! What can I possibly do to keep myself safe?"

If you work in the tech industry, you may find a family member asking you for advice on cybersecurity. This blog post will hopefully save you from stuttering whilst trying to formulate advice (like I did).

The Basics

The WannaCry Ransomware Attack was one of the most high-profile cyberattacks of 2017. In essence, ransomware works by infecting a computer, then encrypting files - preventing users from being able to access them. Users then see a window on their screen demanding payment with the promise of decrypting files. Multiple copycat viruses also sprung up, using the same exploit as WannaCry.

It is worth noting that even after paying, you're unlikely to see your files back (don't expect an honest transaction from criminals).


WannaCry was estimated to have infected over 300,000 computers around the world, including high-profile government agencies and corporations; the UK's National Health Service was one notable victim.

Despite the wide-ranging impact of this attack, a lot of victims could have protected themselves fairly easily. Security patches had already been available to fix the bug that allowed this attack to happen and installing anti-virus software could have contained the spread of this ransomware.

For consumers, it is generally a good idea to install updates, particularly security updates. Platforms like Windows XP no longer receive security updates and therefore shouldn't be used, regardless of how up to date they are on past patches.

Of course, it is also essential to back up your most indispensable files, and not just because of the damage security vulnerabilities can cause.

Don't put your eggs in one Basket

It may not be Easter, but you certainly should not be putting all your eggs in one basket. For this reason, it is often not a good idea to use the same password across multiple sites.

Passwords have been around since 1961; however, no alternative has been found which keeps all their benefits. Users continue to set weak passwords, and website developers continue to store them insecurely.

When developers store passwords, they should do so in a way that lets them check a password is correct without ever being able to recover the original password. Unfortunately, many websites (including some popular ones) implement this poorly. When they get hacked, a password dump can be leaked with everyone's emails/usernames alongside their passwords.
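
As a minimal sketch of what that looks like, assuming Python's standard library (scrypt requires Python 3.6+ built against OpenSSL 1.1+): the site stores only a random salt and a derived digest, and verification re-derives the digest rather than ever recovering the password.

```python
import hashlib
import hmac
import os

def hash_password(password):
    salt = os.urandom(16)
    digest = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return salt, digest          # store both; the plaintext password is never stored

def verify_password(password, salt, digest):
    candidate = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return hmac.compare_digest(candidate, digest)   # constant-time comparison

salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))   # True
print(verify_password("password123", salt, digest))                    # False
```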

If the same email/username and password combination is used on multiple sites, hackers can automatically use the breached user data from one site to attempt logins against other websites you use online.

For this reason, it's absolutely critical to use a unique password for every site. Password manager apps like LastPass or 1Password allow you to use unique randomly-generated passwords for each site but manage them from one encrypted wallet using a master password.

Simple passwords, based on personal information or individual dictionary words, are far from safe too. Computers can repeatedly go through common passwords in order to crack them. Similarly, adding numbers and symbols (e.g. changing password to p4$$w0rd) will also do little to help.

When you have to choose a password you need to remember, you can create strong passwords from sentences. For example: "At Christmas my dog stole 2 pairs of Doc Martens shoes!" can become ACmds2poDMs! Passwords based on simple sentences can be long, but still easy to remember.

Another approach is to simply select four random dictionary words, for example: WindySoapLongBoulevard. (For obvious reasons, don't actually use that as your password.) Although this password uses solely letters, it is more secure than a shorter password that would also use numbers and symbols.
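
If you'd rather have a computer pick the words for you, a few lines of code will do it. This is only a sketch; the word list path and the choice of four words are placeholder assumptions:

import secrets

# Load a word list (one word per line) and pick four words at random.
# The secrets module is used because it draws from a cryptographic RNG.
with open("words.txt") as f:
    words = [line.strip() for line in f if line.strip()]

passphrase = "".join(secrets.choice(words).capitalize() for _ in range(4))
print(passphrase)  # e.g. something along the lines of "WindySoapLongBoulevard"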

Layering Security

Authentication is how computers confirm you are who you say you are. Fundamentally, it is done using one (or more) of:

  • Something you know
  • Something you have
  • Something you are

A password is an example of logging in using "something you know"; if someone is able to gain access to that password, it's game over for that online account.

Instead, it is possible to use "something you have" as well. This means, should your password be intercepted or disclosed, you still have another safeguard to protect your account.

In practice, this means that after entering your password onto a website, you may also be prompted for another code that you need to read off an app on your phone. This is known as Two-Factor Authentication.
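
Those codes are usually time-based one-time passwords (TOTP): the website and your phone share a secret, and both derive the same short-lived six-digit code from it. Here is a minimal sketch using the pyotp library (the secret below is freshly generated, not a real one):

import pyotp

# The shared secret is normally shown to you once, as a QR code, when you
# enable Two-Factor Authentication on a site.
secret = pyotp.random_base32()
totp = pyotp.TOTP(secret)

code = totp.now()          # the 6-digit code your phone app would display
print(code)
print(totp.verify(code))   # the website runs the same check on its side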


Two-Factor Authentication is supported on many of the world's most popular social media, banking and shopping sites. You can find out how to enable it on popular websites at turnon2fa.com.

Know who you talk to

When you browse to a website, you may notice a lock symbol light up in your address bar. This indicates that encryption is enabled when talking to the website, which is important in order to prevent interception.


When inputting personal information into websites, it is important you check this green lock appears and that the website address starts with "https://".

It is, however, important to double check the address bar you're putting your personal information into. Is it cloudflare.com or have you been redirected away to a dodgy website at cloudflair.com or cloudflare.net?

Despite how common encrypted web traffic has become, on many sites it still remains relatively easy to strip away this encryption by pointing internet traffic to a different address. I describe how this can be done in:
Performing & Preventing SSL Stripping: A Plain-English Primer

It is also good practice to be careful about the links you see in emails; are they legitimate emails from your bank, or is someone trying to capture your personal information with a fake "phishing" website that looks just like your bank's? Just because someone has a little bit of information about you doesn't mean they are who they say they are. When in doubt, avoid following links directly from emails and check the validity of the email independently (such as by going directly to your banking website). A correct-looking "from" address isn't enough to prove an email is coming from who it says it's from.

Conclusions

We always hear of new and innovative security vulnerabilities, but for most users, remembering a handful of simple security tips is enough to protect against the majority of security threats.

In Summary:

  • As a rule-of-thumb, install the latest security patches
  • Don't use obsolete software which doesn't provide security patches
  • Use well-trusted anti-virus
  • Back up the files and folders you can't afford to lose
  • Use a Password Manager to set random, unique passwords for every site
  • Don't use common keywords or personal information as passwords
  • Adding numbers and symbols to passwords often doesn't add much security, but does make them harder to remember
  • Enable Two-Factor Authentication on sites which support it
  • Check the address bar when inputting personal information, make sure the connection is encrypted and the site address is correct
  • Don't believe everything you see in your email inbox or trust every link sent through email; even if the sender has some information about you.

Finally, from everyone at Cloudflare, we wish you a wonderful and safe holiday season. For further reading, check out the Internet Mince Pie Database.

The History of Stock Quotes


In honor of all the fervor around Bitcoin, we thought it would be fun to revisit the role finance has had in the history of technology even before the Internet came around. This was adapted from a post which originally appeared on the Eager blog.

On the 10th of April 1814, almost one hundred thousand troops fought the Battle of Toulouse in southern France. The war had ended on April 6th, but the messengers delivering news of Napoleon I's abdication and the end of the war wouldn't reach Toulouse until April 12th.

The issue was not a lack of a rapid communication system in France; the system just hadn't expanded far enough yet. France had an elaborate semaphore system. Arranged all around the French countryside were buildings with mechanical flags which could be rotated to transmit specific characters to the next station in line. When the following station showed the same flag positions as this one, you knew the letter was acknowledged, and you could show the next character. This system allowed roughly one character to be transmitted per minute, with the start of a message moving down the line at almost 900 miles per hour. It wouldn't reach Toulouse until 1834, however, twenty years after the Napoleonic battle.

Cappy Telegraph System

Stocks and Trades

It should be no secret that money motivates. Stock trading presents one of the most obvious uses for fast long-distance communication. If you can find out about a ship sinking, or a higher than expected earnings call, before other traders, you can buy or sell the right stocks and make a fortune.

In France, however, it was strictly forbidden to use the semaphore system for anything other than government business. Being such a public method of communication, it wasn't really possible for an enterprising investor to ‘slip in’ a message without discovery. The Blanc brothers figured out one method, though. They discovered they could bribe an operator to include one extra bit of information, the “Error - cancel last transmitted symbol” control character, with a message. If an operative down the line spotted that symbol, they knew it was time to buy.

Semaphore had several advantages over an electric telegraph. For one, there were no lines to cut, making it easier to defend during war. Ultimately though, its slow speed, need for stations every ten miles or so, and complete worthlessness at night and in bad weather made its time on this earth limited.

Thirty-Six Days Out of London

Ships crossing the Atlantic were never particularly fast. We Americans didn't learn of the end of our own revolution at the Treaty of Paris until October 22nd, almost two months after it had been signed. The news came from a ship “thirty-six days out of London”.

Anyone who could move faster could make money. At the end of the American Civil War, Jim Fisk chartered high-speed ships to race to London and short Confederate bonds before the news could reach the British market. He made a fortune.

It wasn’t long before high speed clipper ships were making the trip with mail and news in twelve or thirteen days regularly. Even then though, there was fierce competition among newspapers to get the information first. New York newspapers like the Herald and the Tribune banded together to form the New York Associated Press (now known just as the Associated Press) to pay for a boat to meet these ships 50 miles off shore. The latest headlines were sent back to shore via pigeon or the growing telegraph system.

The Gold Indicator


Most of the technology used by the Morse code telegraph system was built to satisfy the demands of the finance industry.

The first financial indicator was a pointer which sat above the gold exchange in New York. In our era of complex technology, the pointer system has the refreshing quality of being very simple. An operator in the exchange had a button which turned a motor. The longer he held the button down, the further the motor spun, changing the indication. This system had no explicit source of ‘feedback’, beyond the operator watching the indicator and letting go of his button when it looked right.

Soon, other exchanges were clamoring for a similar indicator. Their motors were wired to those of the Gold Exchange. This did not form a particularly reliable system. Numerous boys had to run from site to site, resetting the indicators when they came out of sync from that at the Gold Exchange.

The Ticker Tape

I am crushed for want of means. My stockings all want to see my mother, and my hat is hoary from age.

— Samuel Morse, in his diary

This same technology formed the basis for the original ticker tape machines. A printing telegraph from this era communicated using a system of pulses over a wire. Each pulse would move the print head one ‘step’ on a ratcheting wheel. Each step would align a different character with the paper to be printed on. A longer pulse over the wire would energize an electromagnet enough to stamp the paper into the print head. Missing a single pulse, though, would send the printer out of alignment, creating a 19th century version of Mojibake.

It was Thomas Edison who invented the ‘automatic rewinder’, which allowed the machines to be synchronized remotely. The first system used a screw drive. If you moved the print head through three full revolutions without printing anything, you would reach the end of the screw and it would stop actually rotating at a known character, aligning the printers. Printing an actual character would reset the screw. A later system of Edison's used the polarity of the wire to reset the system. If you flipped the polarity on the wire, switching negative and positive, the head would continue to turn in response to pulses, but it would stop at a predefined character, allowing you to ‘reset’ any of the printers which may have come out of alignment. This was actually a big enough problem that there is an entire US Patent Classification devoted to ‘Union Devices’ (178/41).

It will therefore be understood from the above explanation that the impression of any given character upon the type-wheel may be produced upon the paper by an operator stations at a distant point, ... simply by transmitting the proper number of electrical impulses of short duration by means of a properly-constructed circuit-breaker, which will cause the type-wheel to revolve without sensibly affecting the impression device. When the desired character is brought opposite the impression-lever the duration of the final current is prolonged, and the electro-magnet becomes fully magnetized, and therefore an impression of the letter or character upon the paper is produced.

— Thomas A. Edison, Patent for the Printing Telegraph 1870

Ticker tape machines used their own vocabulary:

IBM 4S 651/4

Meant 400 shares of IBM had just been sold for $65.25 per share (stocks were priced using fractions, not decimal numbers).

Ticker tape machines delivered a continuous stream of quotes while the market was open. The great accumulation of used ticker tape led to the famous ‘ticker tape parades’, where thousands of pounds of the tape would be thrown from windows on Wall Street. We still have ticker tape parades today, but without the tape itself; the paper is now bought specifically to be thrown out the window.

Trans-Lux

What’s the best way to share the stock ticker tape with a room full of traders? The early solution was a chalkboard where relevant stock trades could be written and updated throughout the day. Men were also employed to read the ticker and remember the numbers, ready to recall the most recent prices when asked.

A better solution came from the Trans-Lux company in 1939 however. They devised a printer which would print on translucent paper. The numbers could then be projected onto a screen from the rear, creating the first large stock ticker everyone could read.

Trans-Lux Projection Stock Ticker

This was improved through the creation of the Trans-Lux Jet. The Jet was a continuous tape composed of flippable cards. One side of each card was a bright color while the other was black. As each card passed by a row of electrically-controlled pneumatic jets, some were flipped, writing out a message which would pass along the display just as modern stock tickers do. The system would be controlled using a shift register which would read in the stock quotes and translate them into pneumatic pulses.

The Quotron

The key issue with a stock ticker is you have to be around when a trade of stock you care about is printed. If you miss it, you have to search back through hundreds of feet of tape to get the most recent price. If you couldn’t find the price, the next best option was a call to the trading floor in New York. What traders needed was a way of looking up the most recent quote for any given stock.

In 1960 Jack Scantlin released the Quotron, the first computerized solution. Each brokerage office would become host to a Quotron ‘master unit’, which was a reasonably sized ‘computer’ equipped with a magnetic tape write head and a separate magnetic tape read head. The tape would continually feed while the market was open, the write head keeping track of the current stock trades coming in over the stock ticker lines. When it was time to read a stock value, the tape would be unspooled between the two heads falling into a bucket. This would allow the read head to find the latest value of the stock even as the write head continued to store trades.

Quotron Keypad

Each desk would be equipped with a keypad / printer combination unit which allowed a trader to enter a stock symbol and have the latest quote print onto a slip of paper. A printer was used because electronic displays were too expensive. In the words of engineer Howard Beckwith:

We considered video displays, but the electronics to form characters was too expensive then. I also considered the “Charactron tube” developed by Convair in San Diego that I had used at another company . . . but this also was too expensive, so we looked at the possibility of developing our own printer. As I remember it, I had run across the paper we used in the printer through a project at Electronic Control Systems where I worked prior to joining Scantlin. The paper came in widths of about six inches, and had to be sliced . . . I know Jack Scantlin and I spent hours in the classified and other directories and on the phone finding plastic for the tape tank, motors to drive the tape, pushbuttons, someone to make the desk unit case, and some company that would slice the tape. After we proved the paper could be exposed with a Xenon flash tube, we set out to devise a way to project the image of characters chosen by binary numbers stored in the shift register. The next Monday morning Jack came in with the idea of the print wheel, which worked beautifully.

The master ‘computer’ in each office was primitive by our standards. For one, it didn't include a microprocessor. It was a hardwired combination of a shift register and some comparison and control logic. The desk units were connected with a 52-wire cable, giving each button on each unit its own wire. This was necessary because the units contained no logic themselves; their printing and button handling was all handled by the master computer.

When a broker in the office selected an exchange, entered a stock symbol, and requested a last price on his desk unit, the symbol would be stored in relays in the master unit, and the playback sprocket would begin driving the tape backwards over a read head at about ten feet per second, dumping the tape into the bin between the two heads (market data would continue to be recorded during the read operation). The tape data from the tracks for the selected exchange would be read into a shift register, and when the desired stock symbol was recognized, the register contents would be “frozen,” and the symbol and price would be shifted out and printed on the desk unit.

Only a single broker could use the system at a time:

If no desk units were in use, the master unit supplied power to all desk units in the office, and the exchange buttons on each unit were lit. When any broker pressed a lit button, the master unit disconnected the other desk units, and waited for the request from the selected desk unit. The desk unit buttons were directly connected to the master unit via the cable, and the master unit contained the logic to decode the request. It would then search the tape, as described above, and when it had an answer ready, would start the desk unit paper drive motor, count clock pulses from the desk unit (starting, for each character, when it detected an extra-large, beginning-of-wheel gap between pulses), and transmit a signal to operate the desk unit flash tube at the right time to print each character.

Ultronics

The Quotron system provided a vast improvement over a chalkboard, but it was far from perfect. For one, it was limited to the information available over the ticker tape lines, which didn't include things like a stock's volume, earnings, and dividends. A challenger named Ultronics created a system which used a similar hardwired digital computer, but with a drum memory rather than a magnetic tape.

Drum Memory

The logic was advanced enough to allow the system to calculate volume, high and low for each stock as the data was stored. Rather than store one of these expensive memory units in every brokerage, Ultronics had centralized locations around the US which were connected to brokerages and each other using 1000 bps Dataphone lines.

This system notably used a form of packet addressing, possibly for the first time ever. When each stock quote was returned it included the address of the terminal which had made the request. That terminal was able to identify the responses meant for it based on that address, allowing all the terminals to be connected to the same line.

Quotron II

At one time during system checkout we had a very elusive problem which we couldn’t pin down. In looking over the programs, we realized that the symptoms we were seeing could occur if an unconditional jump instruction failed to jump. We therefore asked CDC whether they had any indication that that instruction occasionally misbehaved. The reply was, “Oh, no. That’s one of the more reliable instructions.” This was our first indication that commands could be ordered by reliability.

— Montgomery Phister, Jr. 1989

Facing competition from the Ultronics quote computers, it was time for Jack Scantlin’s team to create something even more powerful. What they created was the Quotron II. The Quotron II was powered by magnetic core memory, an early form of random-access memory which allowed them to read and store any stock’s value in any order. Unfortunately there wasn’t actually enough memory. They had 24K of memory to store 3000 securities.

One stock sold for over $1000; some securities traded in 32nds of a dollar; the prices to be stored included the previous day’s close, and the day’s open, high, low, and last, together with the total number of shares traded-the volume. Clearly we’d need 15 bits for each price (ten for the $1000, five for the 32nds), or 75 bits for the five prices alone. Then we’d need another 20 for a four-letter stock symbol, and at least another 12 for the volume. That added up to 107 bits (nine words per stock, or 27,000 words for 3000 stocks) in a format that didn’t fit conveniently into 12-bit words.

Their solution was to store most of the stocks in a compressed format. Each stock's previous closing price was stored in 11 bits, and the other four values were stored as six-bit increments from that number. Any stocks priced over $256, stocks which used fractions smaller than eighths, and increments too large to fit were stored in a separate overflow memory area.
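
As a rough modern illustration of the idea (my own sketch, not Quotron's actual encoding), the packing and unpacking might look like this, with prices in whatever fixed fractional unit the system used:

# Pack a previous close (11 bits) plus four prices stored as signed
# six-bit deltas from that close, 35 bits in total.
def pack_quote(prev_close, day_open, high, low, last):
    word = prev_close & 0x7FF
    for price in (day_open, high, low, last):
        delta = price - prev_close
        assert -32 <= delta <= 31, "too large: would go to the overflow area"
        word = (word << 6) | (delta & 0x3F)   # two's complement, 6 bits each
    return word

def unpack_quote(word):
    deltas = []
    for _ in range(4):
        d = word & 0x3F
        deltas.append(d - 64 if d >= 32 else d)  # undo two's complement
        word >>= 6
    prev_close = word & 0x7FF
    day_open, high, low, last = (prev_close + d for d in reversed(deltas))
    return prev_close, day_open, high, low, last

print(unpack_quote(pack_quote(650, 648, 655, 640, 652)))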

The Quotron II system was connected to several remote sites using eight Dataphone lines which provided a total bandwidth of 16 kbps.

The fundamental system worked by having one 160A computer read stock prices from a punched tape (using about 5,000 feet of tape a day) into the common memory. A second 160A responded to quote requests over the Dataphone lines. The remote sites were connected to brokerage offices using teletype lines, which could transmit up to 100 words per minute, where a device would forward the messages to the requesting terminal.

It’s somewhat comforting to learn that hack solutions are nothing new:

Once the system was in operation, we had our share of troubles. One mysterious system failure had the effect of cutting off service to all customers in the St. Louis area. Investigation revealed that something was occasionally turning off some 160A memory bits which enabled service to that region. The problem was “solved” for a time by installing a patch which periodically reinstated those bits, just in case.

The system was also notable for introducing the +/- tick to represent if a stock had gone up or down since the last trade. It also added some helpful calculated quantities such as the average price change of all NYSE stocks.

The story of Quotron II showcases the value of preparing for things to go wrong, even if you're not exactly sure how they will, and of graceful degradation:

Jack Scantlin was worried about this situation, and had installed a feature in the Quotron program which discarded these common-memory programs, thus making more room for exceptions, when the market went haywire. On the day President Kennedy was assassinated, Jack remembers sitting in his office in Los Angeles watching features disappear until brokers could get nothing but last prices.

Those of us who worked on Quotron II didn’t use today’s labels. Multiprogramming, multiprocessor, packet, timesharing-we didn’t think in those terms, and most of us had never even heard them. But we did believe we were breaking new ground; and, as I mentioned earlier, it was that conviction more than any other factor that made the work fascinating, and the time fly.

It’s valuable to remember that as easy as this system might be to create with modern technology, it was a tremendous challenge at the time. “Most of us lived Quotron 12 to 14 hour days, six and a half days a week; but the weeks flew by, and before we turned around twice, five years were gone...”

NASDAQ

Anyone who has ever been involved with the demonstration of an on-line process knows what happens next. With everyone crowded around to watch, the previously infallible gear or program begins to fall apart with a spectacular display of recalcitrance. Well so it went. We set the stage, everyone held their breath, and then the first query we keyed in proceeded to pull down the whole software structure.

NASDAQ Terminal

Feeling pressure from the SEC to link all the nation’s securities markets, the National Association of Securities Dealers decided to build an ‘automated quotation service’ for their stocks. Unlike a stock ticker, which provides the price of the last trade of a stock, the purpose of the NASDAQ system was to allow traders to advertise the prices they would accept to other traders. This was extremely valuable, as before the creation of this system, it was left to each trader to strike a deal with their fellow stock brokers, a very different system than the roughly ‘single-price-for-all’ system we have today.

The NASDAQ system was powered by two Univac 1108 computers for redundancy. The central system in Connecticut was connected to regional centers in Atlanta, Chicago, New York and San Francisco where requests were aggregated and disseminated. As of December 1975 there were 20,000 miles of dedicated telephone lines connecting the regional centers to 642 brokerage offices.

Each NASDAQ terminal was composed of a CRT screen and a dedicated keyboard. A request for a stock would return the currently available bid and ask price from each ‘market maker’ around the country. The market makers were the centers where stock purchases were aggregated and a price set. The trader could quickly see where the best price was available, and call that market maker to execute his trade. Similarly, the market makers could use the terminal units to update their quotations and transmit the latest values. This type of detailed ‘per-market-maker’ information is actually still a part of the NASDAQ system, but it’s only accessible to paying members.

One thing this system didn’t do is support trading via computer, without calling the market maker on the phone (the AQ in NASDAQ actually stands for ‘Automated Quotations’; no stock purchasing capability was originally intended). This became a problem on Black Monday in 1987, when the stock market lost almost a quarter of its value in a single day. During the collapse, many market makers couldn’t keep up with the selling demand, leaving many small investors facing big losses with no way to sell.

In response, NASDAQ created the Small Order Execution System, which allowed small orders of a thousand shares or less to be traded automatically. The theory was that these small trades didn’t require the man-to-man blustering and bargaining which was necessary for large-scale trading. Eventually this was phased out in favor of the nearly all-computerized trading system we have today.

Now

Today over three trillion dollars worth of stocks are traded every month on computerized stock exchanges. The stocks being traded represent over forty billion dollars worth of corporate value. With the right credentials and backing it’s possible to gain or lose billions of dollars in minutes.

These markets make it possible for both the average citizen and billion dollar funds to invest in thousands of companies. In turn, it allows those companies to raise the money they need to (hopefully) grow.


Our next post in this series is on the history of digital communication before the Internet came along. Subscribe to be notified of its release.

Concise (Post-Christmas) Cryptography Challenges


It's the day after Christmas; or, depending on your geography, Boxing Day. With the festivities over, you may still find yourself stuck at home and somewhat bored.

Either way, here are three relatively short cryptography challenges you can use to keep yourself momentarily occupied. Other than the hints (and some internet searching), you shouldn't need a particularly deep knowledge of cryptography to start diving into these challenges. For hints and spoilers, scroll down below the challenges!


Challenges

Password Hashing

The first one is simple enough to explain: here are 5 hashes (from user passwords) - crack them:

$2y$10$TYau45etgP4173/zx1usm.uO34TXAld/8e0/jKC5b0jHCqs/MZGBi
$2y$10$qQVWugep3jGmh4ZHuHqw8exczy4t8BZ/Jy6H4vnbRiXw.BGwQUrHu
$2y$10$DuZ0T/Qieif009SdR5HD5OOiFl/WJaDyCDB/ztWIM.1koiDJrN5eu
$2y$10$0ClJ1I7LQxMNva/NwRa5L.4ly3EHB8eFR5CckXpgRRKAQHXvEL5oS
$2y$10$LIWMJJgX.Ti9DYrYiaotHuqi34eZ2axl8/i1Cd68GYsYAG02Icwve

HTTP Strict Transport Security

A website works by redirecting its www. subdomain to a regional subdomain (i.e. uk.); the site uses HSTS to prevent SSLStrip attacks. You can see cURL requests showing the headers from the redirects below. How would you practically go about stripping HTTPS in this example?

$ curl -i http://www.example.com
HTTP/1.1 302 Moved Temporarily
Server: nginx
Date: Tue, 26 Dec 2017 12:26:51 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
location: https://uk.example.com/
$ curl -i http://uk.example.com
HTTP/1.1 200 OK
Server: nginx
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Keep-Alive: timeout=5
Vary: Accept-Encoding
Cache-Control: no-cache
Date: Tue, 26 Dec 2017 12:26:53 GMT
Strict-Transport-Security: max-age=31536000
...

AES-256

The images below were encrypted using AES-256 in CTR mode. Can you work out what the original images were?

1_enc.bmp


2_enc.bmp


Hints

Password Hashing

What kind of hash algorithm has been used here? Given the input is from humans, how would you go about cracking it efficiently?

HTTP Strict Transport Security

All the information you could possibly need on this topic is in the following blog post: Performing & Preventing SSL Stripping: A Plain-English Primer.

AES-256

The original images, 1.bmp and 2.bmp, were encrypted using a command similar to the one below - the only thing which changed each time the command was run was the file names used:

openssl enc -aes-256-ctr -e -in 1.bmp -out 1_enc.bmp -k SomePassword -iv 0010011 -nosalt

Another hint: both images come from the same source image - each just contains different layers of that original image.

Solutions (Spoiler Alert!)

Password Hashing

The hashes start with a $2y$ identifier; this indicates the hashes were created using BCrypt ($2*$ usually indicates BCrypt). The hashes also indicate they were generated using a somewhat decent work factor of 10.

Despite the more recent arrival of the Argon2 key derivation function, BCrypt is still generally considered secure. Although it's infeasible to brute-force BCrypt itself, users have likely not been as security conscious when setting their passwords.
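
To see why that matters, here is a toy illustration of the principle in Python (hashcat, used below, does the same thing but vastly faster); the hash and wordlist are made up for the example:

import bcrypt

wordlist = ["123456", "password", "letmein", "hotdog", "hello123"]

# Pretend this is one hash from a leaked database dump.
leaked = bcrypt.hashpw(b"hotdog", bcrypt.gensalt(rounds=10))

# Each guess is deliberately slow (that's the point of BCrypt's work factor),
# but a tiny wordlist still falls in well under a second.
for guess in wordlist:
    if bcrypt.checkpw(guess.encode(), leaked):
        print("cracked:", guess)
        break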

Let's try using a password dictionary to crack these hashes. I'll start with a small dictionary of about 10,000 passwords.

Using this password list, we can crack these hashes with hashcat (the mode number 3200 represents BCrypt):

hashcat -a 0 -m 3200 hashes.txt ~/Downloads/10_million_password_list_top_10000.txt --force

Despite running on my laptop, it only takes 90 seconds to crack these hashes (the output below is in the format hash:plaintext):

$2y$10$DuZ0T/Qieif009SdR5HD5OOiFl/WJaDyCDB/ztWIM.1koiDJrN5eu:password1
$2y$10$TYau45etgP4173/zx1usm.uO34TXAld/8e0/jKC5b0jHCqs/MZGBi:password
$2y$10$qQVWugep3jGmh4ZHuHqw8exczy4t8BZ/Jy6H4vnbRiXw.BGwQUrHu:hotdog
$2y$10$0ClJ1I7LQxMNva/NwRa5L.4ly3EHB8eFR5CckXpgRRKAQHXvEL5oS:88888888
$2y$10$LIWMJJgX.Ti9DYrYiaotHuqi34eZ2axl8/i1Cd68GYsYAG02Icwve:hello123

HTTP Strict Transport Security

The first redirect is performed over plain-text HTTP and doesn't have HSTS enabled, but the redirect goes to a subdomain that does have HSTS enabled.

If we were to try stripping away the HTTPS in the redirect (i.e. forcibly changing HTTPS to HTTP using a Man-in-the-Middle attack), we wouldn't have much luck due to HSTS being enabled on the subdomain we're redirecting to.

Instead, we need to rewrite the uk. subdomain to point to a subdomain which doesn't have HSTS enabled, uk_secure. for example. In the worst case, we can simply redirect to an entirely phoney domain. You'll need some degree of DNS interception to do this.

From the phoney subdomain or domain, you can then proxy traffic back to the original domain - all the while acting as a Man-in-the-Middle for any private information crossing the wire.

AES-256

Generating Encrypted Files

Before we learn to crack this puzzle, let me explain how I set things up.

The original image looked like this:


Prior to encrypting this image, I split it into two distinct parts.

1.bmp


2.bmp


The first image was encrypted with the following command:

openssl enc -aes-256-ctr -e -in 1.bmp -out 1_enc.bmp -k hunter2#security -iv 0010011 -nosalt

Beyond the choice of cipher, there are two important options here:

  • -iv - We require OpenSSL to use a specific nonce instead of a dynamic one
  • -nosalt - Encryption keys in OpenSSL are normally salted prior to hashing; this option prevents that from happening

Using a Hex Editor, I added the headers to ensure the encrypted files were correctly rendered as BMP files. This resulted in the files you saw in the challenge above.

Reversing the Encryption

The critical mistake above is that the same encryption key and the same nonce were used for both files. Both ciphertexts were generated from identical nonces using identical keys (which were not salted before being hashed).

Additionally, we are using AES in CTR mode, which is vulnerable to a "two-time pad" attack when a key/nonce pair is reused: both files were encrypted with the same keystream, so XORing the two ciphertexts cancels the keystream out and leaves the XOR of the two plaintexts.

We firstly run an XOR operation on the two files:

$ git clone https://github.com/scangeo/xor-files.git
$ cd xor-files/source
$ gcc xor-files.c -o xor-files
$ ./xor-files /some/place/1_enc.bmp /some/place/2_enc.bmp > /some/place/xor.bmp
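
Equivalently, the XOR can be done in a few lines of Python rather than with the C tool above (the file paths are placeholders):

# XOR the two ciphertexts byte-by-byte; the shared keystream cancels out,
# leaving the XOR of the two plaintext images.
with open("1_enc.bmp", "rb") as f1, open("2_enc.bmp", "rb") as f2:
    a, b = f1.read(), f2.read()

with open("xor.bmp", "wb") as out:
    out.write(bytes(x ^ y for x, y in zip(a, b)))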

The next step is to add BMP image headers such that we can display the image:


The resulting image is then inverted (using ImageMagick):

convert xor.bmp -negate xor_invert.bmp

We then obtain the original input:


Conclusion

If you're interested in debugging cybersecurity challenges on a network that sees ~10% of global internet traffic, we're hiring for security & cryptography engineers and Support Engineers globally.

We hope you've had a relaxing holiday season, and wish you a very happy new year!

Why TLS 1.3 isn't in browsers yet


Upgrading a security protocol in an ecosystem as complex as the Internet is difficult. You need to update clients and servers and make sure everything in between continues to work correctly. The Internet is in the middle of such an upgrade right now. Transport Layer Security (TLS), the protocol that keeps web browsing confidential (and many people persist in calling SSL), is getting its first major overhaul with the introduction of TLS 1.3. Last year, Cloudflare was the first major provider to support TLS 1.3 by default on the server side. We expected the client side would follow suit and be enabled in all major browsers soon thereafter. It has been over a year since Cloudflare’s TLS 1.3 launch and still, none of the major browsers have enabled TLS 1.3 by default.

The reductive answer to why TLS 1.3 hasn’t been deployed yet is middleboxes: network appliances designed to monitor and sometimes intercept HTTPS traffic inside corporate environments and mobile networks. Some of these middleboxes implemented TLS 1.2 incorrectly and now that’s blocking browsers from releasing TLS 1.3. However, simply blaming network appliance vendors would be disingenuous. The deeper truth of the story is that TLS 1.3, as it was originally designed, was incompatible with the way the Internet has evolved over time. How and why this happened is the multifaceted question I will be exploring in this blog post.

To help support this discussion with data, we built a tool to help check if your network is compatible with TLS 1.3:
https://tls13.mitm.watch/

How version negotiation used to work in TLS

The Transport Layer Security protocol, TLS, is the workhorse that enables secure web browsing with HTTPS. The TLS protocol was adapted from an earlier protocol, Secure Sockets Layer (SSL), in the late 1990s. TLS currently has three versions: 1.0, 1.1 and 1.2. The protocol is very flexible and can evolve over time in different ways. Minor changes can be incorporated as “extensions” (such as OCSP and Certificate Transparency) while larger and more fundamental changes often require a new version. TLS 1.3 is by far the largest change to the protocol in its history, completely revamping the cryptography and introducing features like 0-RTT.

Not every client and server support the same version of TLS—that would make it impossible to upgrade the protocol—so most support multiple versions simultaneously. In order to agree on a common version for a connection, clients and servers negotiate. TLS version negotiation is very simple. The client tells the server the newest version of the protocol that it supports and the server replies back with the newest version of the protocol that they both support.

Versions in TLS are represented as two-byte values. Since TLS was adapted from SSLv3, the literal version numbers used in the protocol were just increments of the minor version:
  • SSLv3 is 3.0
  • TLS 1.0 is 3.1
  • TLS 1.1 is 3.2
  • TLS 1.2 is 3.3, etc.

When connecting to a server with TLS, the client sends its highest supported version at the beginning of the connection:

(3, 3) → server

A server that understands the same version can reply back with a message starting with the same version bytes.

(3, 3) → server
client ← (3, 3)

Or, if the server only knows an older version of the protocol, it can reply with an older version. For example, if the server only speaks TLS 1.0, it can reply with:

(3, 3) → server
client ← (3, 1)

If the client supports the version returned by the server then they continue using that version of TLS and establish a secure connection. If the client and server don’t share a common version, the connection fails.

This negotiation was designed to be future-compatible. If a client sends a higher version than the server supports, the server should still be able to reply with whatever version it does support. For example, if a client sends (3, 8) to a modern-day TLS 1.2-capable server, the server should just reply back with (3, 3) and the handshake will continue as TLS 1.2.
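
In pseudocode terms, a well-behaved server's side of the negotiation is tiny. This is a sketch of the intended logic, not any particular implementation:

SERVER_MAX = (3, 3)   # this server speaks up to TLS 1.2

def negotiate(client_version):
    # Never echo a version higher than we support; clamp instead of failing.
    return min(client_version, SERVER_MAX)

print(negotiate((3, 8)))  # -> (3, 3): an unknown future version still works
print(negotiate((3, 1)))  # -> (3, 1): an old client gets TLS 1.0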

Pretty simple, right? As it turns out, some servers didn’t implement this correctly and this led to a chain of events that exposed web users to a serious security vulnerability.

POODLE and the downgrade dance


The last major upgrade to TLS was the introduction of TLS 1.2. During the roll-out of TLS 1.2 in browsers, it was found that some TLS 1.0 servers did not implement version negotiation correctly. When a client connected with a TLS connection advertising support for TLS 1.2, the faulty servers would disconnect instead of negotiating a version of TLS that they understood (like TLS 1.0).

Browsers had three options to deal with this situation:

  1. enable TLS 1.2 and a percentage of websites would stop working
  2. delay the deployment of TLS 1.2 until these servers are fixed
  3. retry with an older version of TLS if the connection fails

One expectation that people have about their browsers is that when they are updated, websites keep working. The number of misbehaving servers was far too large to just break with an update, eliminating option 1). By this point, TLS 1.2 had been around for a year and servers were still broken; waiting longer wasn't going to solve the situation, eliminating option 2). This left option 3) as the only viable choice.

To both enable TLS 1.2 and keep their users happy, most browsers implemented what’s called an “insecure downgrade”. When a connection to a site failed, they would try again with TLS 1.1, then with TLS 1.0, then, if that failed, SSLv3. This downgrade logic “fixed” these broken servers... at the cost of a slow connection establishment. 🤦
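
The fallback amounted to something like the sketch below (a simplified illustration of the browser behaviour, not actual browser code; try_connect stands in for the real connection routine):

# Simplified sketch of an "insecure downgrade": on any connection failure,
# retry with the next-oldest protocol version.
VERSIONS = ["TLS 1.2", "TLS 1.1", "TLS 1.0", "SSLv3"]

def connect_with_fallback(host, try_connect):
    for version in VERSIONS:
        try:
            return try_connect(host, version)
        except ConnectionError:
            continue   # could be a broken server... or an attacker on the path
    raise ConnectionError("no protocol version worked")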

However, insecure downgrades are called insecure for a reason. Client downgrades are triggered by a specific type of network failure, one that can be easily spoofed. From the client’s perspective, there’s no way to tell whether that failure was caused by a faulty server or by an attacker sitting on the network path of the connection. This means that network attackers can inject fake network failures and trick a client into connecting to a server with SSLv3, even if both support a newer protocol. At that point, there were no severe publicly-known vulnerabilities in SSLv3, so this didn’t seem like a big problem. Then POODLE happened.

In October 2014, Bodo Möller published POODLE, a serious vulnerability in the SSLv3 protocol that allows an in-network attacker to reveal encrypted information with minimal effort. Because TLS 1.0 was so widely deployed on both clients and servers, very few connections on the web used SSLv3, which should have kept them safe. However, the insecure downgrade feature made it possible for attackers to downgrade any connection to SSLv3 if both parties supported it (which most of them did). The availability of this “downgrade dance” vector turned the risk posed by POODLE from a curiosity into a serious issue affecting most of the web.

Fixing POODLE and removing insecure downgrade

The fixes for POODLE were:

  1. disable SSLv3 on the client or the server
  2. enable a new TLS feature called SCSV.

SCSV was conveniently proposed by Bodo Möller and Adam Langley a few months before POODLE. Simply put, with SCSV the client adds an indicator, called a downgrade canary, into its client hello message if it is retrying due to a network error. If the server supports a newer version than the one advertised in the client hello and sees this canary, it can assume a downgrade attack is happening and close the connection. This is a nice feature, but it is optional and requires both clients and servers to update, leaving many web users exposed.
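
The server-side check is correspondingly simple. The sketch below follows the logic described above (the canary is sent as a reserved pseudo-ciphersuite value in the client hello); it is an illustration, not a real TLS implementation:

TLS_FALLBACK_SCSV = 0x5600   # the "downgrade canary" pseudo-ciphersuite
SERVER_MAX = (3, 3)          # this server supports up to TLS 1.2

def check_downgrade(client_version, client_cipher_suites):
    # The client says "this is a retry at a lower version" but we support
    # something newer: someone in the middle probably forced the retry.
    if TLS_FALLBACK_SCSV in client_cipher_suites and client_version < SERVER_MAX:
        raise ConnectionError("inappropriate_fallback: closing the connection")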

Browsers immediately removed support for SSLv3, with very little impact other than breaking some SSLv3-only websites. Users with older browsers had to depend on web servers disabling SSLv3. Cloudflare did this immediately for its customers, and so did most sites, but even in late 2017 over 10% of sites measured by SSL Pulse still support SSLv3.

Turning off SSLv3 was a feasible solution to POODLE because SSLv3 was not critical to the Internet. This raises the question: what happens if there’s a serious vulnerability in TLS 1.0? TLS 1.0 is still very widely used and depended on; turning it off in the browser would lock out around 10% of users. Also, despite SCSV being a nice solution to insecure downgrades, many servers don’t support it. The only option to ensure the safety of users against a future issue in TLS 1.0 is to disable insecure fallback.

After several years of bad performance caused by clients having to reconnect multiple times, the majority of the websites that did not implement version negotiation correctly fixed their servers. This made it possible for some browsers to remove the insecure fallback: Firefox in 2015, and Chrome in 2016. Because of this, Chrome and Firefox users are now in a safer position in the event of a new TLS 1.0 vulnerability.

Introducing a new version (again)

Designing a protocol for future compatibility is hard; it’s easy to make mistakes if there isn’t a feedback mechanism. This is the case with TLS version negotiation: nothing breaks if you implement it wrong, only hypothetical future versions. There is no signal to developers that an implementation is flawed, so mistakes can happen without being noticed. That is, until a new version of the protocol is deployed and your implementation fails; by then the code is deployed and it could take years for everyone to upgrade. The fact that some server implementations failed to handle a “future” version of TLS correctly should have been expected.

TLS version negotiation, though simple to explain, is actually an example of a protocol design antipattern. It demonstrates a phenomenon in protocol design called ossification. If a protocol is designed with a flexible structure, but that flexibility is never used in practice, some implementation is going to assume it is constant. Adam Langley compares the phenomenon to rust. If you rarely open a door, its hinge is more likely to rust shut. Protocol negotiation in TLS is such a hinge.

Around the time of TLS 1.2’s deployment, the discussions around designing an ambitious new version of TLS were beginning. It was going to be called TLS 1.3, and the version number was naturally chosen as 3.4 (or (3, 4)). By mid-2016, the TLS 1.3 draft had been through 15 iterations, and the version number and negotiation were basically set. At this point, browsers had removed the insecure fallback. It was assumed that servers that weren’t able to do TLS version negotiation correctly had already learned the lessons of POODLE and finally implemented version negotiation correctly. This turned out to not be the case. Again.

When presented with a client hello with version 3.4, a large percentage of TLS 1.2-capable servers would disconnect instead of replying with 3.3. Internet scans by Hanno Böck, David Benjamin, SSL Labs, and others confirmed that the failure rate for TLS 1.3 was very high, over 3% in many measurements. This was the exact same situation faced during the upgrade to TLS 1.2. History was repeating itself.

This unexpected setback caused a crisis of sorts for the people involved in the protocol’s design. Browsers did not want to re-enable the insecure downgrade and fight the uphill battle of oiling the protocol negotiation joint again for the next half-decade. But without a downgrade, using TLS 1.3 as written would break 3% of the Internet for their users. What could be done?

The controversial choice was to accept a proposal from David Benjamin to make the first TLS 1.3 message (the client hello) look like TLS 1.2. The version number from the client was changed back to (3, 3) and a new “supported versions” extension was introduced carrying the list of versions the client actually supports. The server would return a server hello starting with (3, 4) if TLS 1.3 was supported and (3, 3) or earlier otherwise. Draft 16 of TLS 1.3 contained this new and “improved” protocol negotiation logic.
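
Sketching the new logic (my own simplification of the draft's behaviour, not a real implementation): the legacy version field stays pinned at TLS 1.2, and the real negotiation happens inside the extension.

LEGACY_VERSION = (3, 3)   # what goes in the old version field: always "TLS 1.2"

def select_version(client_hello, server_supported={(3, 4), (3, 3), (3, 2), (3, 1)}):
    offered = client_hello.get("supported_versions")
    if offered:                       # a TLS 1.3-aware client
        common = set(offered) & server_supported
        if common:
            return max(common)        # e.g. (3, 4) for TLS 1.3
    # No extension: an older client, so fall back to plain old negotiation.
    return min(client_hello["legacy_version"], max(server_supported))

hello = {"legacy_version": LEGACY_VERSION, "supported_versions": [(3, 4), (3, 3)]}
print(select_version(hello))          # -> (3, 4)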

And this worked. Servers for the most part were tolerant to this change and easily fell back to TLS 1.2, ignoring the new extension. But this was not the end of the story. Ossification is a fickle phenomenon. It not only affects clients and servers, but it also affects everything on the network that interacts with a protocol.

The real world is full of middleboxes


Readers of our blog may remember a post from earlier this year on HTTPS interception. In it, we discussed a study that measured how many “secure” HTTPS connections were actually intercepted and decrypted by inspection software or hardware somewhere in between the browser and the web server. There are also passive inspection middleboxes that parse TLS and either block or divert the connections based on visible data, such as ISPs that use hostname information (SNI) from TLS connections to block “banned” websites.

In order to inspect traffic, these network appliances need to implement some or all of TLS. Every new device that supports TLS introduces a TLS implementation that makes assumptions about how the protocol should act. The more implementations there are, the more likely it is that ossification occurs.

In February of 2017, both Chrome and Firefox started enabling TLS 1.3 for a percentage of their customers. The results were unexpectedly horrible. A much higher percentage of TLS 1.3 connections were failing than expected. For some users, no matter what the website, TLS 1.2 worked but TLS 1.3 did not.

Success rates for TLS 1.3 Draft 18

Firefox & Cloudflare
97.8% for TLS 1.2
96.1% for TLS 1.3

Chrome & Gmail
98.3% for TLS 1.2
92.3% for TLS 1.3

After some investigation, it was found that some widely deployed middleboxes, both of the intercepting and passive variety, were causing connections to fail.


Because TLS has generally looked the same throughout its history, some network appliances made assumptions about how the protocol would evolve over time. When faced with a new version that violated these assumptions, these appliances fail in various ways.

Some features of TLS that were changed in TLS 1.3 were merely cosmetic. Things like the ChangeCipherSpec, session_id, and compression fields that were part of the protocol since SSLv3 were removed. These fields turned out to be considered essential features of TLS to some of these middleboxes, and removing them caused connection failures to skyrocket.

If a protocol is in use for a long enough time with a similar enough format, people building tools around that protocol will make assumptions around that format being constant. This is often not an intentional choice by developers, but an unintended consequence of how a protocol is used in practice. Developers of network devices may not understand every protocol used on the internet, so they often test against what they see on the network. If a part of a protocol that is supposed to be flexible never changes in practice, someone will assume it is a constant. This is more likely the more implementations are created.

It would be disingenuous to put all of the blame for this on the specific implementers of these middleboxes. Yes, they created faulty implementations of TLS, but another way to think about it is that the original design of TLS lent itself to this type of failure. Implementers implement to the reality of the protocol, not the intention of the protocol’s designer or the text of the specification. In complex ecosystems with multiple implementers, unused joints rust shut.

Removing features that have been part of a protocol for 20 years and expecting it to simply “work” was wishful thinking.

Making TLS 1.3 work

At the IETF meeting in Singapore last month, a new change was proposed to TLS 1.3 to help resolve this issue. These changes were based on an idea from Kyle Nekritz of Facebook: make TLS 1.3 look like TLS 1.2 to middleboxes. This change re-introduces many of the parts of the protocol that were removed (session_id, ChangeCipherSpec, an empty compression field), and introduces some other changes that make TLS 1.3 look like TLS 1.2 in all the ways that matter to the broken middleboxes.

Several iterations of these changes were developed by BoringSSL and tested in Chrome over a series of months. Facebook also performed some similar experiments and the two teams converged on the same set of changes.

Chrome Experiment Success Rates

TLS 1.2 - 98.6%
Experimental changes (PR 1091 on Github) - 98.8%

Firefox Experiment Success Rates

TLS 1.2 - 98.42%
Experimental changes (PR 1091 on Github) - 98.37%

These experiments showed that it was possible to modify TLS 1.3 to be compatible with middleboxes. They also demonstrated the ossification phenomenon. As described in a previous section, the only thing that could have rusted shut in the client hello, the version negotiation, had rusted shut. This resulted in Draft 16, which moved the version negotiation to an extension. As an intermediary between the client and the server, middleboxes also care about the server hello message. This message had many more hinges that were thought to be flexible but turned out not to be. Almost all of them had rusted shut. The new “middlebox-friendly” changes took this reality into account. These experimental changes were incorporated into the specification in TLS 1.3 Draft 22.

Making sure this doesn’t happen again

The original protocol negotiation mechanism is unrecoverably burnt. That means it likely can’t be used in a future version of TLS without significant breakage. However, many of the other protocol negotiation features are still flexible, such as ciphersuites selection and extensions. It would be great to keep it this way.

Last year, Adam Langley wrote a great blog post about cryptographic protocol design (https://www.imperialviolet.org/2016/05/16/agility.html) that covers similar ground as this blog post. In his post, he proposes the adage “have one joint and keep it well oiled.” This is great advice for protocol designers. Ossification is real, as we have seen in TLS 1.3.

Along these lines, David Benjamin proposed a way to keep the most important joints in TLS oiled. His GREASE proposal for TLS is designed to throw in random values where a protocol should be tolerant of new values. If popular implementations intersperse unknown ciphers, extensions and versions in real-world deployments, then implementers will be forced to handle them correctly. GREASE is like WD-40 for the Internet.
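
The GREASE values themselves are just reserved numbers sprinkled into the lists a client sends. A quick sketch of the idea (the reserved values are the ones defined by the GREASE proposal; the surrounding code is only illustrative):

import secrets

# The reserved GREASE values: 0x0A0A, 0x1A1A, 0x2A2A, ..., 0xFAFA.
GREASE_VALUES = [(v << 8) | v for v in range(0x0A, 0xFB, 0x10)]

def greased(cipher_suites):
    # Prepend a random bogus value; a correct server must simply ignore it,
    # so any implementation that chokes on unknown values gets caught early.
    return [secrets.choice(GREASE_VALUES)] + list(cipher_suites)

print([hex(v) for v in GREASE_VALUES])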

One thing to note is that GREASE is intended to prevent servers from ossifying, not middleboxes, so there is still potential for more types of greasing to happen in TLS.

Even with GREASE, some servers were only found to be intolerant to TLS 1.3 as late as December 2017. GREASE only identifies servers that are intolerant to unknown extensions, but some servers may still be intolerant to specific extension ids. For example, RSA's BSAFE library used the extension id 40 for an experimental extension called "extended_random", associated with a theorized NSA backdoor. Extension id 40 happens to be the extension id used for TLS 1.3 key shares. David Benjamin reported that this library is still in use by some printers, which causes them to be TLS 1.3 intolerant. Matthew Green has a detailed write-up of this compatibility issue.

Help us understand the issue

Cloudflare has been working with the Mozilla Firefox team to help measure this phenomenon, and Google and Facebook have been doing their own measurements. These experiments are hard to perform because the developers need to get protocol variants into browsers, which can take the entire release cycle of the browser (often months) to get into the hands of the users seeing issues. Cloudflare now supports the latest (hopefully) middlebox-compatible TLS 1.3 draft version (draft 22), but there’s always a chance we find a middlebox that is incompatible with this version.

To sidestep the browser release cycle, we took a shortcut. We built a website that you can use to see if TLS 1.3 works from your browser’s vantage point. This test was built by our Crypto intern, Peter Wu. It uses Adobe Flash, because that’s the only widespread cross-platform way to get access to raw sockets from a browser.

How it works:

  • We cross-compiled our Golang-based TLS 1.3 client library (tls-tris) to JavaScript
  • We built a JavaScript library (called jssock) that implements tls-tris on top of the low-level socket interface exposed through Adobe Flash
  • We connect to a remote server using both TLS 1.2 and TLS 1.3 and compare the results
  • If there is a mismatch, we gather information from the connection on both sides and send it back for analysis

If you see a failure, let us know! If you’re in a corporate environment, share the middlebox information; if you’re on a residential network, tell us who your ISP is.


We’re excited for TLS 1.3 to finally be enabled by default in browsers. This experiment will hopefully help prove that the latest changes make it safe for users to upgrade.
