HTML5’s Drag and Drop Problem

The HTML5 drag and drop spec started as a reverse-engineered version of behavior introduced in IE5. This was a pragmatic decision: to get IE support for free, back when IE legacy support really mattered.

It’s not a pleasant API. It has an awkward mix of native and web behavior, and it hijacks DOM methods like event.preventDefault() for semantically suspect purposes. But it is built into most browsers now, works across document boundaries (like iframes), and remains the only way you can drag in files and content from the desktop. It’s tempting to try to use the native spec for all your drag code.

We’ll argue that, on the contrary, there is probably only one case you should use the native API for: to exchange data with the OS. The browser-provided behavior just isn’t good enough yet to use for core interactions within an app. And it’s not just an issue of significant API gaps and implementation bugs. The real deal-breakers come from the goals and design of the API itself, which are likely to see only slow improvements.

Before Native Went Web

The longer you work with HTML5 drag and drop, the more clearly you can see its native roots. Of course Microsoft, who was then just getting into hot water for IE4’s deep integration with Windows 95, would have designed IE5’s drag and drop behavior primarily for talking to other applications on the desktop–the use case of dragging to attach files in Outlook. Rich interactions within the browser, such as dragging an element from one part of an app to another, were at best a secondary concern.

This post focuses on two features of the HTML5 drag and drop API: the DataTransfer object and the drag ghost provided for user feedback. These features provide a necessary translation layer between the language of the web and the language of the OS, but at a significant design and implementation cost:

  • Data Payload: Each drag event object has a dataTransfer property holding a DataTransfer object. It accepts some data at the beginning of a drag (stored by MIME type), seals itself off from mid-drag access with a strict security policy, and then reveals its contents to the HTMLElement or native application that receives the drop.
  • Visual Feedback: The feedback image (or “ghost”) that follows the mouse cursor is just a screen shot of whatever you started dragging, and this OS-friendly rasterization can only be changed by providing another image–not an HTML element (say goodbye to hover states).

At their best, these features enable an opinionated and familiar version of drag and drop behaviors, making it difficult to innovate and provide richer, visual feedback about what can be dropped where. At their worst, the design of these features introduces bugs and workaround hacks.

Security Policies and the DataTransfer Object

Problem: The data payload of a drag event is not available before the drop event.

Back in the native desktop days, a drag cursor might have gone over many windows while moving something from a folder to an editor. If you didn’t want to share the information in that file with every app your cursor passed over, it would have been good performance and security practice to seal the drag payload until it reached its destination. Other native applications could still query the MIME type to see if they could also be a drop target, but that would have been it.

This security policy isn’t sensible on the web. With native drag and drop, dragenter, dragover, and dragleave events simply don’t have access to the drag payload; you have only the list of MIME types to drive hover states. This makes it impossible to selectively allow or disable drop events based on what you’re dragging, or to provide predictive feedback about what would happen with a certain drop.
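Here is a minimal sketch of that lifecycle in TypeScript. The TransferLike interface, the function names, and the application/x-example-block type are our own assumptions for illustration; in a browser, the DataTransfer object on event.dataTransfer has the same three members.

```typescript
// Minimal structural view of the browser's DataTransfer (only the members used here).
interface TransferLike {
  readonly types: ReadonlyArray<string>;
  setData(format: string, data: string): void;
  getData(format: string): string;
}

// Hypothetical custom type, made up for this example.
const BLOCK_TYPE = "application/x-example-block";

// dragstart: the only moment the payload can be written.
function onDragStart(dt: TransferLike, html: string): void {
  dt.setData(BLOCK_TYPE, html);
  dt.setData("text/plain", html);
}

// dragenter/dragover: only the list of types is readable,
// so the type name is all we can base a hover state on.
function acceptsDrop(dt: TransferLike): boolean {
  return dt.types.indexOf(BLOCK_TYPE) !== -1;
}

// drop: the payload itself is finally readable.
function onDrop(dt: TransferLike): string {
  return dt.getData(BLOCK_TYPE);
}
```

Note that acceptsDrop() can say "this is a block" but not "this is an H2", which is exactly the limitation at issue.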

If you wanted to build an HTML editor, for example, you wouldn’t be able to hint that dropping an H2 directly into a UL would create invalid markup. You would only be able to catch the mistake for your (non-technical) user after the H2 has been dropped in, at which point you could surprise and potentially confuse them by refusing the drop. Being able to anticipate invalid drop actions (say, by turning the cursor into an “X”) could make the difference between a frustratingly unpredictable interface and one that simply works.

Of course there are bugs filed to change this and there are a few workarounds:

  • You could use dark magic: switch statements and the flexible MIME type syntax to actually pass information along in the MIME type string (don’t actually do this). As a thought experiment, you could set your DataTransfer object with both “text/html” and “text/html; tag-type=block-level” MIME types; then you could sniff the types of data listed in the DataTransfer object and read off the strings to determine whether a given drop would be valid before it’s happened.
  • You could use a real, but architecturally compromised approach: break the native API’s encapsulation and introduce shared application state. This state could be a singleton, such as a DragManager, that would be updated with whatever your DataTransfer payload is when starting a new drag. You could then query this singleton to determine and display the validity of a drop on dragenter/dragleave.
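The second workaround might look something like the sketch below. Only the DragManager name comes from the text above; the DragPayload shape, the method names, and the deliberately simplified nesting rule are assumptions for illustration.

```typescript
// Hypothetical payload shape for an HTML-editor drag.
interface DragPayload {
  tagName: string; // e.g. "h2"
  html: string;
}

// Shared state that lives outside the DataTransfer object.
class DragManager {
  private current: DragPayload | null = null;

  // Call from dragstart, alongside dataTransfer.setData().
  begin(payload: DragPayload): void {
    this.current = payload;
  }

  // Call from dragend so a stale payload never leaks into the next drag.
  end(): void {
    this.current = null;
  }

  // Safe to call from dragenter/dragover, unlike dataTransfer.getData().
  peek(): DragPayload | null {
    return this.current;
  }
}

const dragManager = new DragManager();

// Simplified, illustrative rule: lists accept only list items,
// and list items are accepted only by lists.
function isValidDrop(containerTag: string, manager: DragManager): boolean {
  const payload = manager.peek();
  if (payload === null) return false; // e.g. a drag that started outside the app
  const child = payload.tagName.toLowerCase();
  const container = containerTag.toLowerCase();
  if (container === "ul" || container === "ol") return child === "li";
  return child !== "li";
}
```

A dragenter handler could call isValidDrop("ul", dragManager) and show the “X” cursor before the drop happens. Notice that nothing here touches the DataTransfer object.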

The second approach works, but it raises the question: with the data stored elsewhere and encapsulation broken, why use the DataTransfer object? And this reminds us of our bigger question: why use HTML5 for in-application drag events at all?

Visual Feedback

Problem: The browser does a poor job of drawing drag feedback and you can’t make it much better.

A drag “ghost” is the element that follows your cursor around the screen to let you know what you’re dragging. When dragging a file into the browser from the desktop, the ghost is generally a file icon.

Unfortunately, as this gif shows, native drag and drop is terrible at creating ghosts from elements in the DOM.

Part of the problem is that the native-generated ghost, the clipped pattern icon in this gif, has to be rasterized by the browser before it can be passed along to the OS, which then handles displaying that image inside and outside of the browser window. Chrome handles this rasterization by taking a pixel dump of the element at the very beginning of its drag. In the above image, our pattern had a :hover state on dragstart and is scrolled partially offscreen, so the resulting ghost is cropped and the :hover styles are stuck on. This simply isn’t good enough.

The API does provide an escape hatch for overriding the default ghost, but it’s tightly constrained: the override can be specified as a canvas or image. This may be adequate if none of your ghosts need to reflect user-generated content, and if you have either a limited number of ghosts or they can be easily generated programmatically. But you can’t generate a feedback image from, say, user-provided HTML.
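The escape hatch in question is DataTransfer’s setDragImage(image, x, y) method, where x and y place the cursor relative to the image’s top-left corner. A rough sketch follows; the helper name and the two small interfaces are assumptions so the example stands alone, and the image must already be drawn or loaded when dragstart fires.

```typescript
// Anything with pixel dimensions: a canvas or an img both qualify.
interface Sized {
  width: number;
  height: number;
}

// Structural view of the one DataTransfer method used here.
interface DragImageSetter<T> {
  setDragImage(image: T, x: number, y: number): void;
}

// Replace the default ghost with a pre-rendered image,
// keeping the cursor at the image's center.
function setCenteredGhost<T extends Sized>(dt: DragImageSetter<T>, image: T): void {
  const x = Math.round(image.width / 2);
  const y = Math.round(image.height / 2);
  dt.setDragImage(image, x, y);
}
```

In a dragstart handler you would pass event.dataTransfer and a canvas you painted ahead of time. Everything about the ghost still has to be decided before the drag begins.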

We can get creative again to hack in a more reliable, more configurable ghost, but the extra work doesn’t help:

  • One particularly extravagant approach could involve hijacking the renderer to create a screenshot (!) of the element as you’d like it to be styled, and then inserting it via a data URI.
  • A saner approach would be to simply absolutely position a ghost element to follow the mouse. But that takes us back to writing code that re-implements core parts of drag behavior outside of the API.
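The saner approach can be sketched in a few lines. We’re assuming a ghost element styled with position: fixed, top: 0, left: 0, and pointer-events: none, plus pointer coordinates taken from clientX/clientY; the function and interface names are made up for this example.

```typescript
interface Point {
  x: number;
  y: number;
}

// Structural view of an element whose inline transform we can set.
interface GhostElement {
  style: { transform: string };
}

// Where inside the element the user grabbed it, measured at drag start.
function grabOffset(pointer: Point, elementTopLeft: Point): Point {
  return { x: pointer.x - elementTopLeft.x, y: pointer.y - elementTopLeft.y };
}

// Move the ghost so the grabbed spot stays under the pointer.
function moveGhost(ghost: GhostElement, pointer: Point, offset: Point): void {
  const left = pointer.x - offset.x;
  const top = pointer.y - offset.y;
  ghost.style.transform = `translate(${left}px, ${top}px)`;
}
```

Because the ghost is a real element, it can carry hover states, user-generated content, or anything else CSS can style. The cost is that you now own the mouse tracking yourself.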

Design Issues, Other Bugs

Problem: The spec confuses its UI library features with its need to translate data between web and native behaviors. 

Fundamentally, the native API is probably just trying to do too many things at the same time. If you’ve worked with the API, you probably know the feeling of these behaviors not quite being in the language of the web, but also not in the language of the OS. The original IE5 implementation privileged fluid interactions with the desktop over rich behaviors in the browser, so every drag or drop was immediately turned into a native event.

This means browser implementations have to simultaneously provide 1) a translation layer between the language of the browser (DOM elements, loosely-typed object hashes) and the language of the desktop (rasterized images, payloads stored by MIME type) and 2) a UI library to create, manage, and handle the drag interactions. These are very distinct concerns, and the API does not separate them. Of course there are bugs.

We’ve focused on two issues–a limited DataTransfer object and poorly constructed ghost images–but here are some other things to consider when using the native behavior (your mileage may vary based on browser or when you’re reading this):

  • Most browsers go into a semi-modal, mostly non-interactive state when a drag event starts within them. This means that interface elements freeze, animations pause, and most (but not all) scrolling features are disabled during the drag. These things also happen in a slightly different way when dragging content into the browser from the desktop.
  • The API is still inconsistently implemented, with WebKit missing properties like relatedTarget on dragenter/dragleave events. This makes it hard to implement mouseenter/mouseleave-style behaviors, so having rich hover behavior is even more difficult.
  • Major interaction and rendering bugs. In Chrome, we’ve seen native drag events leave scrollbars on iframes in a locked state and refuse to allow any other scroll value (they would tug back). We’ve also seen the drop event reapply a :hover state to the wrong element, incorrectly highlighting a distant sibling element.
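For the missing relatedTarget, one common workaround (not something specific to our code) is to count dragenter and dragleave events as they bubble up to the drop zone. The class below is a sketch with a made-up name, and it assumes the browser fires dragenter on the newly entered child before dragleave on the one being left, which has not always been reliable.

```typescript
// Emulates mouseenter/mouseleave semantics for a drop zone
// without relying on relatedTarget.
class DragHoverTracker {
  private depth = 0;

  // Call on every dragenter that reaches the drop zone.
  // Returns true only when the drag first enters the zone.
  enter(): boolean {
    this.depth += 1;
    return this.depth === 1;
  }

  // Call on every dragleave that reaches the drop zone.
  // Returns true only when the drag has left the zone entirely.
  leave(): boolean {
    if (this.depth === 0) return false;
    this.depth -= 1;
    return this.depth === 0;
  }

  // Call on drop or dragend so the next drag starts clean.
  reset(): void {
    this.depth = 0;
  }
}
```

A drop zone would add its highlight when enter() returns true and remove it when leave() returns true, ignoring the noise from child elements in between.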

Our Approach

On the Habitat team, we’re only using native drag and drop as a supplementary behavior for data entry, like bringing in files from the desktop. For other interactions, we’re rolling out a drag and drop library that tries to play nicely with the native behaviors without mimicking or overextending them. We’re finding that this lets us write more stable code and design clearer interactions.

We looked at other libraries before rolling our own drag and drop code, but we couldn’t find any open source helpers that scratched our major itches. We wanted something that paid special attention to customizable ghosts and had a genuinely helpful API for enter, leave, and move-within-element events. Since we lean on iframes for their sandboxing properties, we also wanted to find a way to make non-native drag and drop work across multiple documents.* What we’ve written is pretty young code now, but look for it on our GitHub page after it has matured.

Good drag and drop on the web is still far from a solved problem. As the libraries improve, so eventually will the browser APIs. This is how browser innovation happens. We should be filing bugs on the browsers to improve their native behaviors, but we should also be writing our own code and improving what’s out there. Drag and drop may not always be the best tool in designing an interface, but we should be able to rely on it when it is. Direct manipulation on the web, whether on mobile or desktop, still has a long way to go, but we hope to help a little.

* Cross-document communication using the DOM methods was both easier and harder than expected. We may write a longer article on this hack, but the short story is that we used a mix of blanket divs, DOM math, and document.elementFromPoint().
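To give a flavor of the “DOM math” involved, here is a simplified sketch of one piece: translating a point in the parent window into an iframe’s coordinate space. It is an illustration under assumptions (no borders, padding, or CSS transforms on the iframe; a same-origin frame), and the names are made up for this example.

```typescript
interface Point {
  x: number;
  y: number;
}

// Shape of what getBoundingClientRect() reports for the iframe element.
interface Rect {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Convert parent-viewport coordinates into iframe-viewport coordinates.
// Returns null when the point lies outside the frame.
function toFrameCoords(point: Point, frameRect: Rect): Point | null {
  const x = point.x - frameRect.left;
  const y = point.y - frameRect.top;
  if (x < 0 || y < 0 || x >= frameRect.width || y >= frameRect.height) {
    return null;
  }
  return { x: x, y: y };
}
```

The translated point could then be handed to the iframe document’s elementFromPoint(x, y) to find what sits under the cursor inside the frame.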


Dean helps build Inkling Habitat, focusing on the place where systems design meets user experience. He just got back from Art Hack Day Berlin. Follow him and Inkling Engineering on Twitter: @deanhuntus and @InklingEng.
