<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Flipboard Engineering</title>
        <description>The official Flipboard engineering blog</description>
        <link>http://engineering.flipboard.com</link>
        <atom:link href="http://engineering.flipboard.com/feed.xml" rel="self" type="application/rss+xml" />
        
            <item>
                <title>VAST Video Ad Service</title>
                <author>https://www.linkedin.com/in/guangle-fan-13496731/ (Guangle Fan)</author>
                <description>&lt;p&gt;Videos are more ubiquitous than ever before: there are virtually no boundaries on how, when and where people can interact with video content. Flipboard as a curation platform brings multi-format content to peoples around their personal interests and passions. There has been tremendous opportunity for video ads to perform efficiently on our platform. At Flipboard, we have been building video inventory &lt;!--break--&gt; since 2013 in the form of transcoding, storing, and serving proprietary format video ads on an in-house platform. As we see high-quality beautiful video ads satisfy advertisers, we also face the challenge of scaling the platform to meet the increasing demands of our video ad inventory. To bring Flipboard’s video ad inventory to a larger group of advertisers, we need a programmatic selling channel that supports standard video ad formats, provides easy performance measurement and flexible video behavior control. This is where VAST can help.&lt;/p&gt;

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
  &lt;div class=&quot;col-sm-3&quot;&gt;&lt;img src=&quot;/assets/vastvideoad/screen_1.png&quot; style=&quot;max-width:100%;&quot; /&gt;&lt;/div&gt;
  &lt;div class=&quot;col-sm-3&quot;&gt;&lt;img src=&quot;/assets/vastvideoad/screen_2.png&quot; style=&quot;max-width:100%;&quot; /&gt;&lt;/div&gt;
  &lt;div class=&quot;col-sm-3&quot;&gt;&lt;img src=&quot;/assets/vastvideoad/vast_demo.gif&quot; style=&quot;max-width:100%;&quot; /&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;what-is-vast-&quot;&gt;What is VAST?&lt;/h2&gt;
&lt;p&gt;VAST is the Video Ad Serving Template published by the &lt;a href=&quot;https://www.iab.com/guidelines/digital-video-ad-serving-template-vast-2-0/&quot;&gt;IAB&lt;/a&gt;. It is a universal XML schema that gives video players information about which ad to play, how the ad should show up, how long it should last, and whether people are able to skip it. It’s important because it lets video players and ad servers speak the same language. Before VAST, advertisers had very little information about the publisher’s player implementation. It also allows ad platforms to easily exchange ads with each other. For example, a physical video media file can be hosted at Service A, passed through a chain of Services B, C, and D, and served to consumers on publisher platform E. Each service is able to wrap the ad with its own event trackers for performance measurement. VAST is the most widely adopted video ad standard among advertisers and publishers, and the prevailing version is VAST 2.0.&lt;/p&gt;
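To make the schema concrete, here is a tiny, entirely hypothetical VAST 2.0 inline ad parsed with Python's standard library; the ad system name, URLs, and id are invented for illustration and the document is trimmed to a few representative elements.

```python
# A minimal, hypothetical VAST 2.0 inline ad, parsed with Python's
# standard library. Element names follow the IAB VAST 2.0 schema;
# the ad system, URLs, and id are made up for illustration.
import xml.etree.ElementTree as ET

VAST_XML = """<VAST version="2.0">
  <Ad id="example-ad">
    <InLine>
      <AdSystem>ExampleAdServer</AdSystem>
      <AdTitle>Example Video Ad</AdTitle>
      <Impression>http://example.com/track/impression</Impression>
      <Creatives>
        <Creative>
          <Linear>
            <Duration>00:00:15</Duration>
            <MediaFiles>
              <MediaFile type="video/mp4" width="640" height="360">
                http://example.com/media/ad.mp4
              </MediaFile>
            </MediaFiles>
          </Linear>
        </Creative>
      </Creatives>
    </InLine>
  </Ad>
</VAST>"""

root = ET.fromstring(VAST_XML)
duration = root.findtext(".//Duration")              # how long the ad lasts
media_url = root.findtext(".//MediaFile").strip()    # the physical media file
impressions = [el.text for el in root.iter("Impression")]  # event trackers
```

A player (or a backend server acting on its behalf) reads exactly these elements to decide what to play and which tracking URLs to fire.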

&lt;h2 id=&quot;how-do-we-serve-vast-ads-&quot;&gt;How do we serve VAST ads?&lt;/h2&gt;
&lt;p&gt;At Flipboard, we have ad inventory across our iOS, Android, Samsung Briefing, and Web products. Some publishers implement a VAST-compatible video player on the client side, which works but is not ideal. In comparison, a backend-centric approach works better: a single VAST-compliant server handles all the third-party communication, while the client side behaves consistently across platforms. It is also easier to check ad quality before serving to the client and to optimize package-selection logic.
Each VAST ad transaction involves communication among multiple ad services. Each service provides a unique XML payload carrying its own event trackers (impression, error, and click-through trackers, plus user-engagement trackers) and a special HTTP URL, called a VAST tag, that points to its source third-party service or its parent ad service. These pointers form a virtual network among the services involved in the transaction. By crawling the pointers along the chain, our ad server can eventually fetch the inline ad with its media files. It then organizes all trackers by event type, assembles them with the media resources, adds our own event trackers, and sends a validated ad package down to the clients.&lt;/p&gt;
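The crawl-and-assemble flow can be sketched roughly as follows. The service payloads and helper names here are invented stand-ins (a dictionary replaces real HTTP fetches), not Flipboard's actual ad server code.

```python
# A sketch of resolving a chain of VAST wrappers into one inline ad
# package. Each "service" returns either a wrapper (pointing at the
# next VAST tag, plus its own trackers) or the final inline ad with
# media files. FAKE_SERVICES stands in for real HTTP calls.
FAKE_SERVICES = {
    "http://svc-b/vast": {"wrapper": "http://svc-a/vast",
                          "impressions": ["http://svc-b/imp"]},
    "http://svc-a/vast": {"inline": {"media": "http://cdn/ad.mp4",
                                     "impressions": ["http://svc-a/imp"]}},
}

def resolve_vast(tag_url, fetch, max_hops=5):
    """Follow VAST wrapper tags, merging trackers along the chain."""
    trackers, hops = [], 0
    while hops < max_hops:
        payload = fetch(tag_url)
        trackers.extend(payload.get("impressions", []))
        if "inline" in payload:
            inline = payload["inline"]
            trackers.extend(inline.get("impressions", []))
            # De-duplicate while preserving order, then return the package.
            return {"media": inline["media"],
                    "impressions": list(dict.fromkeys(trackers))}
        tag_url = payload["wrapper"]  # hop to the parent/source service
        hops += 1
    raise ValueError("wrapper chain too deep")

package = resolve_vast("http://svc-b/vast", FAKE_SERVICES.__getitem__)
```

The `max_hops` cap matters in practice: wrapper chains can loop or grow pathologically, and an unbounded crawl would stall the serving path.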

&lt;p&gt;Our internal inventory management tool supports two types of VAST ads. The first type connects to open-market ad platforms such as Rubicon and Tremor (with more coming), where we don’t limit sales to a particular group of buyers. Instead, inventory is bid on and taken by the winner, although we provide various topic-based packages such as business, technology, fashion, etc. Geo targeting is also available.&lt;/p&gt;

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/vastvideoad/third_party_ad.png&quot; style=&quot;max-width:90%;&quot; /&gt;
&lt;/div&gt;

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/vastvideoad/ad_network.png&quot; style=&quot;max-width:90%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;The second type comes through our direct sales channel: advertisers provide us VAST tags hosted by a third-party server (e.g., DCM or their own ad server), where they set up the tag once and distribute it to all VAST-compatible publisher platforms. This means we don’t need to transcode or host the media content, which lets us hammer out a deal much faster, with lower communication and maintenance costs.&lt;/p&gt;

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/vastvideoad/direct_sale_ad.png&quot; style=&quot;max-width:60%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;While both types increase our inventory sell-through rate, the former opens a flexible new selling channel, while the latter reinforces our direct sales channel through its unique ease of setup.&lt;/p&gt;

&lt;h2 id=&quot;some-challenges&quot;&gt;Some Challenges&lt;/h2&gt;
&lt;p&gt;The communication among third-party ad platforms, DSPs, and SSPs follows a pull-based architecture. That means inventory sellers like us need to constantly send requests whenever inventory is available, while buyers on the other side of the platform constantly check and buy what is available at the moment. The fill rate is low relative to the total number of requests; that is the nature of many ad platforms. At our scale, requests per second can sustain at 10K on average, and since we have just started selling a few packages on a few ad platforms, that number will only go up as we expand our footprint. Handling this scale efficiently is critical to achieving a high fill rate for our inventory. On the backend, we optimize our HTTP requests with connection pooling, asynchronous non-blocking calls, proper timeouts, resource recycling, etc. This keeps the usage of CPU, network, memory, and other system resources at a reasonable level without adding more hardware.&lt;/p&gt;
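The backend optimizations described (bounded concurrency, per-request timeouts, recycling resources) can be illustrated with a toy `asyncio` sketch. The simulated exchange call, the delays, and the parameters are assumptions for illustration only, not our production values.

```python
# A toy sketch of bounded concurrency (standing in for a connection
# pool) plus per-request timeouts, using only asyncio from the
# standard library. call_exchange simulates a network round-trip.
import asyncio

async def call_exchange(i, delay):
    await asyncio.sleep(delay)      # stands in for network I/O
    return {"request": i, "filled": delay < 0.05}

async def fetch_all(delays, max_in_flight=8, timeout=0.1):
    sem = asyncio.Semaphore(max_in_flight)  # cap concurrent requests

    async def guarded(i, d):
        async with sem:
            try:
                # Time out and recycle the slot instead of blocking forever.
                return await asyncio.wait_for(call_exchange(i, d), timeout)
            except asyncio.TimeoutError:
                return {"request": i, "filled": False}

    return await asyncio.gather(*(guarded(i, d) for i, d in enumerate(delays)))

results = asyncio.run(fetch_all([0.01, 0.2, 0.02]))
fill_rate = sum(r["filled"] for r in results) / len(results)
```

The key property is that a slow exchange (the 0.2s request) gives up its slot after the timeout rather than pinning CPU, memory, and a connection while thousands of other requests queue behind it.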

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/vastvideoad/monitor.png&quot; style=&quot;max-width:100%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Since serving VAST ads involves multiple third-party services by design, it comes with the challenge of quickly identifying and responding to ad quality issues. Since an ad can be wrapped and change hands several times, it’s not surprising how quickly the trace of the original buyer gets lost. In our ad server, we monitor changes in ad quality by ad system and by service domain, and identify typical causes in real time. While new causes of ad quality problems still come up, they tend to recur afterward. Our ad server traces all VAST tags that lead to a problematic inline ad, samples the bad instance, and surfaces it in the logging system UI. For each bad instance, we get a message with the preliminary cause and the chain of VAST tags, which helps us respond quickly and limit the impact.&lt;/p&gt;

&lt;p&gt;Last but not least, as with any type of ad, tracking performance is half the work. To be VAST compliant, our mobile app calls the third-party tracking URLs from the different services at the moment an event happens, alongside our in-house event trackers. On the backend, we leverage the existing data pipeline to aggregate metrics in a structured way by campaign, order, and ad. We provide insights such as video-play quartiles, expands, collapses, click-through rate, play duration, etc. We also provide dimensional insights about people’s geographic locations and interests.&lt;/p&gt;

&lt;h2 id=&quot;bottom-line&quot;&gt;Bottom Line&lt;/h2&gt;
&lt;p&gt;Although our service can handle a high throughput of third-party requests in the low-fill-rate programmatic world, we can still improve it by encouraging bidding activity with better-tailored ad packages, and by leveraging our ad selection and revenue prediction models. That is our next step.&lt;/p&gt;

&lt;p&gt;To sum up, we are excited to announce that Flipboard supports VAST ads as a new programmatic selling channel. We think it can bring us a broader group of advertisers, higher-quality ads, and a better user experience in our beautiful app.&lt;/p&gt;

</description>
                <pubDate>Tue, 30 May 2017 00:00:00 +0000</pubDate>
                <link>http://engineering.flipboard.com//2017/05/vast-video-ad</link>
                <guid isPermaLink="true">http://engineering.flipboard.com//2017/05/vast-video-ad</guid>
            </item>
        
            <item>
                <title>Detecting Trustworthy Domains</title>
                <author>https://www.linkedin.com/in/mikecora/ (Mike Vlad Cora)</author>
                <description>&lt;p&gt;High quality, truthful, diverse and informative content is Flipboard’s #1 priority. Hand-picking trusted sources guarantees quality, but is very time consuming, and can potentially miss out on the multitude of excellent but smaller publishers. To address this problem, we’ve developed a machine learning (ML) system called the &lt;em&gt;Domain Ranker&lt;/em&gt;. Its goal is to automatically distinguish authoritative domains from plagiarists, spammers and other low quality sources. It learns to predict how our editorial team would label a domain by analyzing the content and the signals generated by our users. The &lt;em&gt;Domain Ranker&lt;/em&gt; scales our editorial thinking to a much larger amount of content than we could handle manually, ensuring high quality across all topics.&lt;!--break--&gt;&lt;/p&gt;

&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;/h2&gt;

&lt;p&gt;Flipboard has indexed hundreds of millions of articles in the last year alone. In addition to the constant stream of articles from our trusted partners, any user can add any web article into their personal magazines, further expanding Flipboard’s pool to non-partner publishers.&lt;/p&gt;

&lt;p&gt;Our community support and editorial teams are constantly battling the endless churn of spam sites. Through their efforts, we have identified thousands of labeled spam domains, alongside thousands of partner and whitelisted publishers. The &lt;em&gt;Domain Ranker&lt;/em&gt; is a machine learning system that uses this labeled data to learn and generalize our editorial thinking to unlabeled sources.&lt;/p&gt;

&lt;p&gt;In this blog post I do not delve into the theory behind any of the machine learning classifiers used. They are all well known, off-the-shelf implementations in the &lt;a href=&quot;http://scikit-learn.org/&quot; target=&quot;_blank&quot;&gt;scikit-learn&lt;/a&gt; Python library. Instead I will focus on the engineering journey: managing the live data pipeline, exploring and engineering reasonable features, and experimenting with a multitude of classifiers to maximize accuracy.&lt;/p&gt;

&lt;p&gt;I compare an ML project to an open-ended “Choose Your Own Adventure” book: every path leads to an almost unlimited number of forks, fraught with perils and rewards. There are many paths that end in failure, even more leading to mediocre results, and just a few (if any) that end in happiness &lt;del&gt;ever-after&lt;/del&gt; for a while.&lt;/p&gt;

&lt;h2 id=&quot;choose-your-own-ml-adventure&quot;&gt;Choose Your Own ML Adventure&lt;/h2&gt;

&lt;p&gt;Firstly, we need to consider whether ML is the right solution. Is this adventure worth playing? Can we get by with some manually built (but static) heuristics? Or is a machine learning system that automatically learns heuristics necessary for this problem?&lt;/p&gt;

&lt;p&gt;New domains that publish content are registered daily on the web. Out of the millions of domains whose content users add to their Flipboard magazines, less than 1% have been labeled by our editorial team. There is an endless supply of usage data and content analysis features we can use. It is not clear at all what set of features distinguishes low quality from high quality, especially since spam continuously disguises itself as good content. So the cover of the &lt;em&gt;Domain Ranker&lt;/em&gt; ML adventure book looks interesting enough to crack open.&lt;/p&gt;

&lt;h3 id=&quot;data-pipeline&quot;&gt;Data Pipeline&lt;/h3&gt;

&lt;p&gt;If your data pipeline is anything like ours, it will be distributed through the multiverse, covering the breadth of technologies for good measure: Kafka queues, S3 data stores, HBase, RDS, Redis, and the like. Your ML adventure will be a lot more pleasant if you can collapse the multiverse into one source of truth, ideally a memory-mapped database. Not having to deal with the complexity of a distributed system allows more focus on exploring features and classifiers.&lt;/p&gt;

&lt;p&gt;Luckily for me, Flipboard has a custom written memory-mapped index of important article usage events that is ridiculously fast to access.&lt;/p&gt;

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/domainranking/data_pipeline.png&quot; style=&quot;max-width:100%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;To seed the domain features, I first ran a one-time job to aggregate all historical article events (views, likes, shares, etc.) into domain-level features, and stored them in the index for quick retrieval and updating. Then a continuous stream of real-time article usage events arrives through various pipes: Kafka queues, S3 file stores, MySQL databases, etc. These usage events are reprocessed into article features and &lt;a href=&quot;http://math.stackexchange.com/questions/106700/incremental-averageing&quot; target=&quot;_blank&quot;&gt;incrementally averaged&lt;/a&gt; into the domain-level features, for all labeled and unlabeled domains.&lt;/p&gt;
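The incremental averaging step can be written as a one-line update that folds each new article-level value into a running domain-level mean without storing the full history; this is a sketch, not the production pipeline code.

```python
# Incremental (running) mean: fold one new observation into an
# existing mean without keeping every past value around.
def update_mean(mean, count, new_value):
    """Return (mean, count) after folding in one more observation."""
    count += 1
    mean += (new_value - mean) / count
    return mean, count

mean, count = 0.0, 0
for view_time in [10.0, 20.0, 30.0]:  # e.g. per-article view times
    mean, count = update_mean(mean, count, view_time)
# mean == 20.0, count == 3
```

This form is what makes a streaming pipeline practical: each Kafka or S3 event touches only the stored `(mean, count)` pair for its domain.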

&lt;h3 id=&quot;feature-engineering&quot;&gt;Feature Engineering&lt;/h3&gt;

&lt;p&gt;Specialized knowledge about the data space goes a long way towards a successful ML application. In addition to the raw data, it is worth experimenting with more complex polynomial features that may describe a decision boundary. In our case, an obvious place to start was with article features that imply quality and engagement, like &lt;em&gt;clickthrough rate&lt;/em&gt;, &lt;em&gt;article length&lt;/em&gt;, &lt;em&gt;quickbacks&lt;/em&gt;, etc.&lt;/p&gt;

&lt;p&gt;It is important to use features that do not suffer from presentation bias. As an example, users tend to read articles that are higher in a feed as opposed to ones lower in a feed. If a recommender system boosts popular articles, a positive feedback loop occurs: popular articles are ranked higher in the feed, which makes them more likely to be viewed, further increasing their popularity. So &lt;em&gt;popularity&lt;/em&gt; does not imply quality. Instead, a better feature could be something that measures follow-up actions after the user has read an article (regardless of where in the feed it was).&lt;/p&gt;

&lt;h3 id=&quot;feature-selection&quot;&gt;Feature Selection&lt;/h3&gt;

&lt;p&gt;Another consideration is reducing correlated features. For example, the time spent reading an article may be a useful feature, but it is directly correlated to length. A more informative feature is percentage of &lt;em&gt;quickbacks&lt;/em&gt; for a domain: the portion of views with less than 10% completion. This indicates a bad experience through either pop-up ads, or users quickly realizing the content is low quality.&lt;/p&gt;
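The quickback feature just described can be sketched as a small function; the per-view completion values below are made up for illustration.

```python
# Quickback rate for a domain: the fraction of views that ended
# before 10% of the article was read (a proxy for pop-up ads or
# low-quality content).
def quickback_rate(completions, threshold=0.10):
    """completions: per-view fraction of the article that was read."""
    if not completions:
        return 0.0
    return sum(c < threshold for c in completions) / len(completions)

rate = quickback_rate([0.02, 0.95, 0.05, 0.60])  # 2 of 4 views bounced
```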

&lt;p&gt;Many more features than the ones listed are used in practice. &lt;a href=&quot;https://en.wikipedia.org/wiki/Feature_selection&quot; target=&quot;_blank&quot;&gt;Feature selection&lt;/a&gt; helped to identify correlated and irrelevant features. We ran a fairly exhaustive search over subsets of features and picked ones that gave the best results on our validation set. The search was performed by running various classifiers on subsets of features and comparing the results.&lt;/p&gt;

&lt;h3 id=&quot;feature-scaling&quot;&gt;Feature Scaling&lt;/h3&gt;

&lt;p&gt;Features can have wildly different scales. Article length, for instance, ranges from 20 words to thousands of words, while &lt;em&gt;quickbacks&lt;/em&gt; are a percentage. We can simplify the job for the classifiers by scaling all features to the same 0-to-1 range. This extra heuristic can also help reduce the impact of outliers. For example, if a user leaves the app open on an article, it may generate an unrealistically large view time. My rule of thumb for a reasonable scaling range was the feature average +/- 2 times the standard deviation.&lt;/p&gt;

&lt;p&gt;Some classifiers are not affected by different feature scales, while others completely fall apart. In our case, the &lt;a href=&quot;http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html&quot; target=&quot;_blank&quot;&gt;SVC&lt;/a&gt; classifier took an inordinately long time to train on unscaled features compared to other classifiers. Once the features were scaled, training got dramatically faster. Based on many experiments, scaling the features increased training speed and marginally improved the results.&lt;/p&gt;

&lt;h3 id=&quot;model-selection&quot;&gt;Model Selection&lt;/h3&gt;

&lt;p&gt;One can spend a lifetime on this chapter of the adventure. The scikit-learn library makes it quite easy to &lt;a href=&quot;http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html&quot; target=&quot;_blank&quot;&gt;test many families of classifiers&lt;/a&gt;. Thank you to all of the researchers and grad students who continue supporting this amazing library.&lt;/p&gt;

&lt;p&gt;First we establish a baseline with simple models: &lt;a href=&quot;http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html&quot; target=&quot;_blank&quot;&gt;LogisticRegression&lt;/a&gt;, &lt;a href=&quot;http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html&quot; target=&quot;_blank&quot;&gt;GaussianNB&lt;/a&gt;, &lt;a href=&quot;http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html&quot; target=&quot;_blank&quot;&gt;SVC&lt;/a&gt;, &lt;a href=&quot;http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html&quot; target=&quot;_blank&quot;&gt;MLPClassifier&lt;/a&gt;. For binary classification, the &lt;a href=&quot;https://en.wikipedia.org/wiki/Receiver_operating_characteristic&quot; target=&quot;_blank&quot;&gt;Receiver Operating Characteristic (ROC) curve&lt;/a&gt; is the guiding light towards success. The goal is to maximize the area under the curve (AUC); an AUC of 1 is perfect classification.&lt;/p&gt;
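The AUC has a handy interpretation: it is the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties count half). Here is a tiny pure-Python illustration of that definition; in practice scikit-learn's `roc_auc_score` computes it for you. The labels and scores below are made up.

```python
# AUC as the probability that a random positive outscores a random
# negative (ties count 0.5). Pure-Python illustration of the metric.
def auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]
area = auc(labels, scores)  # one of the six pos/neg pairs is mis-ordered
```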

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/domainranking/dr_baseline.png&quot; style=&quot;max-width:100%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;These curves are the average of 5-fold cross validation train/test passes for each classifier, based on this example: &lt;a href=&quot;http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#sphx-glr-auto-examples-model-selection-plot-roc-crossval-py&quot; target=&quot;_blank&quot;&gt;ROC with cross validation&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;ensemble-of-models&quot;&gt;Ensemble of Models&lt;/h3&gt;

&lt;p&gt;Success of the various classifiers is highly dependent on the nature of the features. Armed with the simple models as a baseline, I expanded the search to &lt;a href=&quot;https://en.wikipedia.org/wiki/Ensemble_learning&quot; target=&quot;_blank&quot;&gt;ensembles of models&lt;/a&gt; like &lt;em&gt;&lt;a href=&quot;http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html&quot; target=&quot;_blank&quot;&gt;RandomForest&lt;/a&gt;&lt;/em&gt; and &lt;em&gt;&lt;a href=&quot;http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html&quot; target=&quot;_blank&quot;&gt;GradientBoostingClassifier&lt;/a&gt;&lt;/em&gt;. Additionally, using &lt;a href=&quot;http://rhiever.github.io/tpot/&quot; target=&quot;_blank&quot;&gt;TPOT: your Data Science Assistant&lt;/a&gt;, I was able to squeeze out a bit more prediction accuracy. It uses genetic programming to search for an optimal machine learning pipeline. TPOT resulted in this interesting &lt;em&gt;GradientBoosting&lt;a href=&quot;https://en.wikipedia.org/wiki/Ensemble_learning#Stacking&quot; target=&quot;_blank&quot;&gt;Stack&lt;/a&gt;&lt;/em&gt;, where the results of the &lt;em&gt;&lt;a href=&quot;http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html&quot; target=&quot;_blank&quot;&gt;LinearSVC&lt;/a&gt;&lt;/em&gt; and &lt;em&gt;&lt;a href=&quot;http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html&quot; target=&quot;_blank&quot;&gt;BernoulliNB&lt;/a&gt;&lt;/em&gt; classifiers are piped into the &lt;em&gt;GradientBoostingClassifier&lt;/em&gt;:&lt;/p&gt;

&lt;pre style=&quot;overflow:auto; word-wrap: normal; white-space: pre&quot;&gt;
from sklearn.ensemble import VotingClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import LinearSVC

GBoostingStack = make_pipeline(
    make_union(VotingClassifier([(&quot;est&quot;, LinearSVC(C=20.0,
                                            dual=False,
                                            loss=&quot;squared_hinge&quot;,
                                            penalty=&quot;l1&quot;))]),
               FunctionTransformer(lambda X: X)),

    make_union(VotingClassifier([(&quot;est&quot;, BernoulliNB(alpha=0.001,
                                            fit_prior=True))]),
               FunctionTransformer(lambda X: X)),

    GradientBoostingClassifier(learning_rate=0.1,
                               max_depth=7,
                               max_features=0.9,
                               min_samples_leaf=16,
                               min_samples_split=2,
                               subsample=0.95))
&lt;/pre&gt;

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/domainranking/dr_ensembles.png&quot; style=&quot;max-width:100%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;In addition to the ROC curves and overall accuracy measures, I logged all other usual metrics: &lt;a href=&quot;https://chrisalbon.com/machine-learning/precision_recall_and_F1_scores.html&quot; target=&quot;_blank&quot;&gt;Precision, Recall, F1&lt;/a&gt;, &lt;a href=&quot;https://en.wikipedia.org/wiki/Sensitivity_and_specificity&quot; target=&quot;_blank&quot;&gt;Specificity&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/Brier_score&quot; target=&quot;_blank&quot;&gt;Brier Score&lt;/a&gt;. Accuracy is not a sufficient metric, as it is biased by the test data. If accuracy is 90% and the test data is 90% positive, then the classifier may just be labeling the entire data set as positive. &lt;em&gt;Specificity&lt;/em&gt; (true negative rate) is also important, since we are trying to catch and filter out spammers (negative labels).&lt;/p&gt;

&lt;pre style=&quot;overflow:auto; word-wrap: normal; white-space: pre&quot;&gt;
Results           Accuracy |  Precision | Specificity |  Recall |      F1 |   Brier 
-----------------------------------------------------------------------------------
GaussianNB           0.770 |      0.793 |       0.189 |   0.948 |   0.863 |   0.190
SVC                  0.800 |      0.798 |       0.180 |   0.990 |   0.884 |   0.134
MLPClassifier        0.812 |      0.808 |       0.232 |   0.989 |   0.890 |   0.139
LogisticRegression   0.820 |      0.833 |       0.369 |   0.958 |   0.891 |   0.132
RandomForest         0.865 |      0.881 |       0.578 |   0.953 |   0.915 |   0.102
GBoostingStack       0.870 |      0.891 |       0.619 |   0.946 |   0.918 |   0.100
DomainRankStack      0.875 |      0.887 |       0.601 |   0.959 |   0.922 |   0.097
&lt;/pre&gt;

&lt;p&gt;Not all classifiers fail in the same way. Combining the three with the highest &lt;em&gt;specificity&lt;/em&gt; (&lt;em&gt;RandomForest&lt;/em&gt;, &lt;em&gt;GBoostingStack&lt;/em&gt; and &lt;em&gt;LogisticRegression&lt;/em&gt;) into the &lt;em&gt;DomainRankStack&lt;/em&gt; further boosted overall accuracy to a respectably useful 87.5%. Although its &lt;em&gt;specificity&lt;/em&gt; is slightly lower than &lt;em&gt;GBoostingStack&lt;/em&gt;’s alone, the overall &lt;em&gt;F1&lt;/em&gt; and &lt;em&gt;Brier&lt;/em&gt; scores improved. The &lt;em&gt;Brier&lt;/em&gt; score is the mean squared difference between the predicted probabilities and the actual outcomes, so lower is better. It effectively communicates the confidence of the classifier, in addition to the prediction accuracy.&lt;/p&gt;
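The Brier score definition above fits in a few lines; the predicted probabilities below are made up to show why a confident, correct classifier beats a hedging one even when both get every label right at the 0.5 cutoff.

```python
# Brier score: mean squared difference between predicted probabilities
# and actual binary outcomes. Lower is better.
def brier_score(y_true, y_prob):
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

confident = brier_score([1, 0, 1, 0], [0.95, 0.05, 0.90, 0.10])
hedging   = brier_score([1, 0, 1, 0], [0.60, 0.40, 0.55, 0.45])
# Both classifiers predict every label correctly at a 0.5 threshold,
# but the confident one earns the lower (better) Brier score.
```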

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/domainranking/domainrank_stack.png&quot; style=&quot;max-width:100%;&quot; /&gt;
&lt;/div&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;A “Choose Your Own ML Adventure” never truly ends. It’s a magical book, because on every re-read, new choices may appear, promising more exciting adventures.  Over time, new features may be discovered, or new machine learning models may become available. Spammers may change tactics and find ways to fool a trained system. In production, our models are continuously retrained and verified as new content and data from our users and editorial team is collected.&lt;/p&gt;

&lt;p&gt;A machine learning system is continuously sharpened, never perfected. For this reason, the &lt;em&gt;Domain Ranker&lt;/em&gt; is just one set of “pliers” in our large tool-chest, helping to maintain the high standards of quality, truth and diversity that is expected of Flipboard.&lt;/p&gt;

&lt;h3 id=&quot;key-takeaways&quot;&gt;Key Takeaways&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Build and verify the data pipeline first, separately from all else.&lt;/li&gt;
  &lt;li&gt;If possible, aggregate the data on a single machine for much quicker experimentation.&lt;/li&gt;
  &lt;li&gt;Engineer and scale the features.&lt;/li&gt;
  &lt;li&gt;Garbage in/garbage out: be wary of biased and unbalanced training/test data.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://scikit-learn.org/&quot; target=&quot;_blank&quot;&gt;scikit-learn&lt;/a&gt; is really fun and incredibly useful.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://rhiever.github.io/tpot/&quot; target=&quot;_blank&quot;&gt;TPOT&lt;/a&gt; rocks.&lt;/li&gt;
  &lt;li&gt;Ensembles of classifiers rock.&lt;/li&gt;
  &lt;li&gt;Measure everything, rinse and repeat.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A great guide to practical ML is Martin Zinkevich’s &lt;a href=&quot;http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf&quot; target=&quot;_blank&quot;&gt;Best Practices for ML Engineering&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Special thanks to &lt;a href=&quot;http://benfrederickson.com&quot; target=&quot;_blank&quot;&gt;Ben Frederickson&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/in/arnab-bhadury-a6304768&quot; target=&quot;_blank&quot;&gt;Arnab Bhadury&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/in/style/&quot; target=&quot;_blank&quot;&gt;David Creemer&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/miaq&quot; target=&quot;_blank&quot;&gt;Mia Quagliarello&lt;/a&gt; and &lt;a href=&quot;https://twitter.com/Xtel&quot; target=&quot;_blank&quot;&gt;Christel van der Boom&lt;/a&gt; for proofreading.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Enjoyed this post? &lt;a href=&quot;https://about.flipboard.com/careers/&quot; target=&quot;_blank&quot;&gt;We’re hiring!&lt;/a&gt;&lt;/p&gt;

</description>
                <pubDate>Wed, 12 Apr 2017 00:00:00 +0000</pubDate>
                <link>http://engineering.flipboard.com//2017/04/domainranking</link>
                <guid isPermaLink="true">http://engineering.flipboard.com//2017/04/domainranking</guid>
            </item>
        
            <item>
                <title>Clustering Similar Stories Using LDA</title>
                <author>https://www.linkedin.com/in/arnab-bhadury-a6304768 (Arnab Bhadury)</author>
                <description>&lt;p&gt;There is more to a story than meets the eye, and some stories deserve to be presented from more than just one perspective. With Flipboard 4.0, we have released story roundups, a new feature that adds coverage from multiple sources to a story and provides you with a fuller picture of an event. &lt;!--break--&gt;&lt;/p&gt;

&lt;p&gt;Here’s how it looks:&lt;/p&gt;

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/storyclustering/storycluster.gif&quot; style=&quot;max-width:50%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;With our scale of millions of articles and constant stream of documents, it’s impossible to generate these roundups manually. So, we have developed a clustering algorithm that’s both fast and scalable, and in this blog post, I will explain how we create these roundups on Flipboard.&lt;/p&gt;

&lt;h2 id=&quot;why-is-this-difficult&quot;&gt;Why is this difficult?&lt;/h2&gt;

&lt;p&gt;Although there are many sophisticated automatic clustering algorithms, such as &lt;a href=&quot;https://en.wikipedia.org/wiki/K-means_clustering&quot;&gt;K-means&lt;/a&gt; or &lt;a href=&quot;https://en.wikipedia.org/wiki/Hierarchical_clustering&quot;&gt;Agglomerative clustering&lt;/a&gt;, story clustering is a non-trivial problem. Because each text document can contain any word from our vocabulary, most text document representations are extremely high-dimensional. In high-dimensional spaces, even basic clustering or similarity measures fail or are very slow.&lt;/p&gt;

&lt;p&gt;Additionally, two very similar documents often have very different word usages. For example, one article may use the term &lt;em&gt;kitten&lt;/em&gt; and another may use &lt;em&gt;feline&lt;/em&gt;, but both articles could be referring to the same &lt;em&gt;cat&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Furthermore, we don’t know beforehand how many roundups to expect. This makes it difficult for us to directly use parametric algorithms such as K-means. Our clustering algorithm also needs to be fast and easy to update, because there is a constant stream of documents coming into our system.&lt;/p&gt;

&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;/h2&gt;

&lt;p&gt;Since even the most basic distance measures fail in high dimensions, the first thing we do is lower the problem’s dimensionality. We represent each of our text documents as a &lt;a href=&quot;https://en.wikipedia.org/wiki/Bag-of-words_model&quot;&gt;bag-of-words&lt;/a&gt;, and remove stop-words and rare words from our vocabulary. Even after this aggressive trimming, the documents are still very high-dimensional. We then use &lt;a href=&quot;https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf&quot;&gt;Latent Dirichlet Allocation&lt;/a&gt; (LDA) to further lower the documents’ dimensionality. We use LDA because the algorithm is amenable to text modeling and provides us with interpretable lower-dimensional representations of documents.&lt;/p&gt;
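The pre-processing step can be sketched like this: represent each document as a bag-of-words, dropping stop-words (and, via a count threshold, words too rare to keep). The stop-word list and document are toy examples, not our production vocabulary.

```python
# Toy bag-of-words construction with stop-word removal and an
# optional minimum-count filter for rare words.
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of", "and"}

def bag_of_words(doc, min_count=1):
    words = [w for w in doc.lower().split() if w not in STOP_WORDS]
    counts = Counter(words)
    return {w: c for w, c in counts.items() if c >= min_count}

bow = bag_of_words("The kitten and the feline chase the kitten")
# {'kitten': 2, 'feline': 1, 'chase': 1}
```

Note that even after this trimming, `kitten` and `feline` remain distinct dimensions; it takes the LDA step below to map them onto the same latent topic.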

&lt;p&gt;LDA is generally used for &lt;a href=&quot;http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf&quot;&gt;qualitative understanding&lt;/a&gt; of big text corpora. The &lt;em&gt;latent&lt;/em&gt; topics that the model learns are highly interpretable and provide deep insights into the data. However, that’s not the only thing it can be used for; it is also a very natural algorithm for dimensionality reduction, as it learns a mapping of sparse document-term vectors to sparse document-topic vectors in an unsupervised setting.&lt;/p&gt;

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/storyclustering/docfactors.gif&quot; style=&quot;max-width:100%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Once the documents are represented over a tractable number of dimensions, all similarity and distance measures come into play. For our story clustering, we simply map all the documents to this conceptual level (latent topics), and look at the neighbours for each document within a certain distance. We use an approximate &lt;a href=&quot;https://en.wikipedia.org/wiki/Nearest_neighbor_search&quot;&gt;nearest neighbour&lt;/a&gt; model because it only requires us to look at a small neighbourhood of documents to generate these clusters.&lt;/p&gt;

&lt;p&gt;The following sections explain how we use LDA for this problem, and then introduce some tricks to optimize LDA using Alias tables and Metropolis-Hastings tests.&lt;/p&gt;

&lt;h2 id=&quot;high-level-overview-of-lda&quot;&gt;High-level overview of LDA&lt;/h2&gt;

&lt;p&gt;LDA is a probabilistic &lt;a href=&quot;https://en.wikipedia.org/wiki/Generative_model&quot;&gt;generative model&lt;/a&gt; that extracts the thematic structure in a big document collection. The model assumes that every topic is a distribution over the words in the vocabulary, and every document (described over the same vocabulary) is a distribution over a small subset of these topics. This is a statistical way to say that each topic (e.g. &lt;em&gt;space&lt;/em&gt;) has some representative words (&lt;em&gt;star&lt;/em&gt;, &lt;em&gt;planet&lt;/em&gt;, etc.), and each document is only about a handful of these topics.&lt;/p&gt;

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/storyclustering/lda.png&quot; style=&quot;max-width:100%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;For example, let’s assume that we have a few topics as shown in the figure above. Knowing these topics, when we see a document about &lt;em&gt;Detecting and classifying pets using deep learning&lt;/em&gt;, we can confidently say that the document is mostly about &lt;em&gt;Topic 2&lt;/em&gt; and a little about &lt;em&gt;Topic 1&lt;/em&gt;, but not at all about &lt;em&gt;Topic 3&lt;/em&gt; or &lt;em&gt;Topic 4&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;LDA automatically infers these topics given a large collection of documents and expresses those (and future) documents in terms of topics instead of raw terms. The key advantage of doing this is that we allow term variability as the document is represented at a higher conceptual (topic) level rather than at the raw word level. This is akin to many success stories in image classification tasks using deep learning, where the classification is done on a higher conceptual level instead of on the pixel level.&lt;/p&gt;

&lt;p&gt;To infer the above &lt;em&gt;latent&lt;/em&gt; topics, we do posterior inference on the model using Gibbs Sampling. What it essentially comes down to is estimating the &lt;em&gt;best&lt;/em&gt; topic for each word seen in every document. This estimate is given by:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;p (Z_{d,n} = k) \propto (C_k^d + \alpha) \times\frac{C_k^w + \beta}{C_k + V\beta}&lt;/script&gt;

&lt;p&gt;where &lt;script type=&quot;math/tex&quot;&gt;w&lt;/script&gt; is the word seen in the &lt;script type=&quot;math/tex&quot;&gt;n&lt;/script&gt;th position in document &lt;script type=&quot;math/tex&quot;&gt;d&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;C_k^d&lt;/script&gt; is the number of times the topic &lt;script type=&quot;math/tex&quot;&gt;k&lt;/script&gt; has appeared in document &lt;script type=&quot;math/tex&quot;&gt;d&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;C_k^w&lt;/script&gt; is the number of times the word &lt;script type=&quot;math/tex&quot;&gt;w&lt;/script&gt; has been estimated with topic &lt;script type=&quot;math/tex&quot;&gt;k&lt;/script&gt; in the whole corpus, and &lt;script type=&quot;math/tex&quot;&gt;C_k&lt;/script&gt; is the number of times the topic &lt;script type=&quot;math/tex&quot;&gt;k&lt;/script&gt; has been assigned in the corpus. This is a sensible equation: a topic is more likely to be assigned if it has already been assigned to other words in the same document, or if the term has been assigned to that topic several times across the whole corpus.&lt;/p&gt;

&lt;p&gt;We calculate the above equation for each topic and define a &lt;a href=&quot;https://en.wikipedia.org/wiki/Multinomial_distribution&quot;&gt;multinomial distribution&lt;/a&gt; (a weighted dice roll), and generate a random topic from that distribution. The code for LDA’s inference (CGS-LDA) looks like this:&lt;/p&gt;

&lt;style&gt;
.smallrow {
  width: 100% !important;
}
  pre {
    word-wrap: normal;
    word-break: normal;
  }
&lt;/style&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;w&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;decrement_count_matrices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;K&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CDK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CWK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;beta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;V&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;beta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;multinomial_distribution&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;increment_count_matrices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This computation can be very expensive if we try to capture more than a thousand topics, because the algorithmic complexity becomes &lt;script type=&quot;math/tex&quot;&gt;O(DNK)&lt;/script&gt; per iteration. What we would ideally like is to get rid of the &lt;script type=&quot;math/tex&quot;&gt;O(K)&lt;/script&gt; loop for each word in a document.&lt;/p&gt;

&lt;h2 id=&quot;optimizing-lda-using-alias-tables-and-metropolis-hastings-tests&quot;&gt;Optimizing LDA using Alias Tables and Metropolis-Hastings Tests&lt;/h2&gt;

&lt;p&gt;Fortunately, there has been a lot of new research (&lt;a href=&quot;https://arxiv.org/pdf/1412.1576v1&quot;&gt;LightLDA&lt;/a&gt;, &lt;a href=&quot;http://www.sravi.org/pubs/fastlda-kdd2014.pdf&quot;&gt;AliasLDA&lt;/a&gt;) to speed up the sampling process and reduce the computational complexity to &lt;script type=&quot;math/tex&quot;&gt;O(DN)&lt;/script&gt;. The key question here is: Is it really possible to generate a single sample from a weighted multinomial distribution in under &lt;script type=&quot;math/tex&quot;&gt;O(K)&lt;/script&gt; time?&lt;/p&gt;

&lt;p&gt;The answer, not surprisingly, is “no”: generating a &lt;script type=&quot;math/tex&quot;&gt;K&lt;/script&gt;-dimensional multinomial probability array &lt;script type=&quot;math/tex&quot;&gt;p&lt;/script&gt; takes at least &lt;script type=&quot;math/tex&quot;&gt;O(K)&lt;/script&gt; time, because we need to know the weight for each index. But once this array is created, generating a sample is simply a matter of generating a random number in &lt;script type=&quot;math/tex&quot;&gt;[0, sum(p)]&lt;/script&gt; and checking which index of the array the number falls in. And if we needed more samples from the same distribution, all future samples would only require &lt;script type=&quot;math/tex&quot;&gt;O(1)&lt;/script&gt; time.&lt;/p&gt;
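&lt;p&gt;A minimal sketch of this two-phase approach (hypothetical helper names; the cumulative table costs &lt;script type=&quot;math/tex&quot;&gt;O(K)&lt;/script&gt; to build once, after which each draw is just a uniform number plus a binary search over the table):&lt;/p&gt;

```python
import bisect
import itertools
import random

def make_sampler(weights, rng=random.Random(0)):
    """Pay O(K) once to build the cumulative-sum array; afterwards each
    draw is one uniform number plus a binary search over the array."""
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]

    def draw():
        return bisect.bisect_left(cumulative, rng.uniform(0.0, total))

    return draw

draw = make_sampler([0.1, 0.6, 0.3])
samples = [draw() for _ in range(10_000)]
# index 1 carries weight 0.6 and should dominate the samples
```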

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/storyclustering/prob_vector.gif&quot; style=&quot;max-width:100%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Instead of taking a single sample each time from the distribution, if we take &lt;script type=&quot;math/tex&quot;&gt;K&lt;/script&gt; samples each time the table is generated, then the amortized sampling complexity would be &lt;script type=&quot;math/tex&quot;&gt;O(1)&lt;/script&gt;. But this method has huge memory implications, because the sizes of these arrays depend on the sum of the weights, and in LDA’s case they can get extremely large due to the dependencies on the count matrices (&lt;script type=&quot;math/tex&quot;&gt;C_k^d&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;C_k^w&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;C_k&lt;/script&gt;).&lt;/p&gt;

&lt;h3 id=&quot;alias-sampling&quot;&gt;Alias Sampling&lt;/h3&gt;

&lt;p&gt;Walker’s &lt;a href=&quot;https://en.wikipedia.org/wiki/Alias_method&quot;&gt;Alias method&lt;/a&gt; is an effective way to compactly store these probability vectors (space complexity: &lt;script type=&quot;math/tex&quot;&gt;O(K)&lt;/script&gt;), while keeping the sampling complexity at &lt;script type=&quot;math/tex&quot;&gt;O(1)&lt;/script&gt;. Instead of defining a long row vector, Alias sampling defines a completely filled 2-dimensional table (dim: &lt;script type=&quot;math/tex&quot;&gt;K \times 2&lt;/script&gt;) from which it’s easy to sample in constant time.&lt;/p&gt;

&lt;p&gt;These tables are generated by first multiplying each element by &lt;script type=&quot;math/tex&quot;&gt;K&lt;/script&gt;, and then running a Robin Hood algorithm that maintains two lists: &lt;em&gt;rich&lt;/em&gt; and &lt;em&gt;poor&lt;/em&gt;. &lt;em&gt;rich&lt;/em&gt; contains all the elements with a weight greater than &lt;script type=&quot;math/tex&quot;&gt;1.0&lt;/script&gt;, and &lt;em&gt;poor&lt;/em&gt; stores the rest. The table is then filled with a simple iteration: each poor element is placed in its table cell first, and the remaining height is topped up by “stealing” from a rich element:&lt;/p&gt;

&lt;div id=&quot;aliastable&quot;&gt;&lt;/div&gt;

&lt;div style=&quot;text-align:center&quot;&gt;&lt;div style=&quot;display:inline-block;&quot;&gt;
&lt;button type=&quot;button&quot; class=&quot;btn btn-default&quot; id=&quot;randomize_aliastable&quot;&gt;
  &lt;span class=&quot;glyphicon glyphicon-refresh&quot;&gt;&lt;/span&gt; Randomize Probabilities
&lt;/button&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Each column has a height of 1.0 with a maximum of two different array indices. For sample generation, a random number is generated between &lt;script type=&quot;math/tex&quot;&gt;[0, len(table)]&lt;/script&gt; to pick the column, and then we generate a random floating point between &lt;script type=&quot;math/tex&quot;&gt;[0, 1.0]&lt;/script&gt; and see which region the decimal number falls in.&lt;/p&gt;
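&lt;p&gt;A compact sketch of the construction and the two-step draw described above (a hypothetical implementation following Vose’s variant of the Alias method; function names are ours):&lt;/p&gt;

```python
import random

def build_alias_table(probs):
    """Build the K x 2 table: column i keeps its own mass up to
    threshold[i] and 'steals' the rest of the height from alias[i]."""
    k = len(probs)
    scaled = [p * k for p in probs]                 # multiply each element by K
    poor = [i for i, s in enumerate(scaled) if s < 1.0]
    rich = [i for i, s in enumerate(scaled) if s >= 1.0]
    threshold, alias = [1.0] * k, list(range(k))
    while poor and rich:
        p, r = poor.pop(), rich.pop()
        threshold[p] = scaled[p]                    # poor element goes in first
        alias[p] = r                                # top up the column from a rich one
        scaled[r] -= 1.0 - scaled[p]
        (poor if scaled[r] < 1.0 else rich).append(r)
    return threshold, alias

def alias_sample(threshold, alias, rng=random.Random(42)):
    """O(1) draw: pick a column uniformly, then choose owner vs. alias."""
    i = rng.randrange(len(threshold))
    return i if rng.random() < threshold[i] else alias[i]

threshold, alias = build_alias_table([0.4, 0.3, 0.2, 0.08, 0.02])
counts = [0] * 5
for _ in range(100_000):
    counts[alias_sample(threshold, alias)] += 1
# empirical frequencies approximate the original probabilities
```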

&lt;h3 id=&quot;metropolis-hastings-algorithm--test&quot;&gt;Metropolis Hastings Algorithm / Test&lt;/h3&gt;

&lt;p&gt;Coming back to our LDA case, every time we need to sample a topic for a word in a document, we could generate &lt;script type=&quot;math/tex&quot;&gt;K&lt;/script&gt; samples. However, that would be wrong because the update equation is dependent on the count matrices, and stale samples don’t represent the &lt;strong&gt;true&lt;/strong&gt; probability distribution of the inference.&lt;/p&gt;

&lt;p&gt;Here’s where the &lt;a href=&quot;https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm&quot;&gt;Metropolis-Hastings&lt;/a&gt; (MH) algorithm comes into play. The MH algorithm is a &lt;a href=&quot;https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo&quot;&gt;Markov-Chain Monte Carlo&lt;/a&gt; (MCMC) method for moving around the probability space with the intention of converging to an objective. Given an objective function &lt;script type=&quot;math/tex&quot;&gt;O(\cdot)&lt;/script&gt;, a proposal function &lt;script type=&quot;math/tex&quot;&gt;P(\cdot)&lt;/script&gt; and a proposed position in the probability space, the MH algorithm acts as a guide, telling the algorithm whether it is a good or bad idea to move from the current position &lt;script type=&quot;math/tex&quot;&gt;x&lt;/script&gt; to the proposed point &lt;script type=&quot;math/tex&quot;&gt;x'&lt;/script&gt;. The acceptance of a new proposed position is computed by:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;A(x' | x) = min(1, \frac{O(x')}{O(x)}\frac{P(x | x')}{P (x' | x)})&lt;/script&gt;

&lt;p&gt;If the proposed position improves the objective (the ratio inside the &lt;em&gt;min&lt;/em&gt; is at least 1), MH will always accept the proposal; otherwise, it accepts the move with probability &lt;script type=&quot;math/tex&quot;&gt;A(x' | x)&lt;/script&gt;.&lt;/p&gt;

&lt;p&gt;The following toy example shows how we can approximate the area of a random function with Metropolis-Hastings sampling. In this example, we try to approximate the shape of a complicated non-linear function and use a standard Gaussian distribution (&lt;script type=&quot;math/tex&quot;&gt;\mathcal{N}(0, 1)&lt;/script&gt;) multiplied with a step-size to generate proposals.&lt;/p&gt;

&lt;script src=&quot;https://d3js.org/d3.v4.min.js&quot;&gt;&lt;/script&gt;

&lt;div id=&quot;mh_vis&quot; style=&quot;text-align:center;&quot;&gt;&lt;/div&gt;
&lt;div id=&quot;mh_vis_text&quot; style=&quot;text-align:center;&quot;&gt;&lt;/div&gt;
&lt;div id=&quot;mh_vis_stepsize&quot; style=&quot;text-align:center;&quot;&gt;&lt;/div&gt;
&lt;div id=&quot;mh_vis_slider&quot; style=&quot;text-align:center;&quot;&gt;
&lt;input id=&quot;mh_vis_s&quot; type=&quot;range&quot; min=&quot;0.1&quot; max=&quot;0.9&quot; step=&quot;0.1&quot; value=&quot;0.3&quot; oninput=&quot;updateStepsize(value)&quot; list=&quot;stepsizes&quot; style=&quot;text-align:center;&quot; /&gt;
&lt;datalist id=&quot;stepsizes&quot;&gt;
  &lt;option&gt;0.1&lt;/option&gt;
  &lt;option&gt;0.3&lt;/option&gt;
  &lt;option&gt;0.5&lt;/option&gt;
  &lt;option&gt;0.7&lt;/option&gt;
  &lt;option&gt;0.9&lt;/option&gt;
&lt;/datalist&gt;
&lt;/div&gt;
&lt;script src=&quot;/assets/storyclustering/mh.js&quot;&gt;&lt;/script&gt;

&lt;script src=&quot;/assets/storyclustering/alias_table.js&quot;&gt;&lt;/script&gt;

&lt;script&gt;
var table = new AliasTable([0.4, 0.3, 0.2, 0.08, 0.02]);
table.displayState(d3.select(&quot;#aliastable&quot;), 0);
function randomizeAliasTable() {
    table.stop = true;
    table = new AliasTable([Math.random(), Math.random(), Math.random(), Math.random(), Math.random()]);
    table.displayState(d3.select(&quot;#aliastable&quot;), 0);
}
d3.select(&quot;#randomize_aliastable&quot;).on(&quot;click&quot;, randomizeAliasTable);
&lt;/script&gt;

&lt;p&gt;From the above example, it can be seen that proposal functions control the convergence speed of the algorithm. In the ideal scenario, we would like a high-acceptance rate and the ability to move quickly around the space. For the toy example above, step-sizes of &lt;script type=&quot;math/tex&quot;&gt;0.2 - 0.4&lt;/script&gt; achieve the best results.&lt;/p&gt;
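&lt;p&gt;The demo above can be mirrored in a few lines of code (a bare-bones sketch with made-up names; with a symmetric Gaussian proposal the correction term &lt;script type=&quot;math/tex&quot;&gt;P(x | x') / P(x' | x)&lt;/script&gt; cancels to 1):&lt;/p&gt;

```python
import math
import random

def metropolis_hastings(objective, n_steps, step_size, rng=random.Random(7)):
    """Random-walk MH over an unnormalized 1-D density: the symmetric
    Gaussian proposal cancels in the ratio, so A = min(1, O(x') / O(x))."""
    x, samples = 0.0, []
    for _ in range(n_steps):
        proposed = x + step_size * rng.gauss(0.0, 1.0)
        acceptance = min(1.0, objective(proposed) / objective(x))
        if rng.random() < acceptance:
            x = proposed                 # accept: move to the proposed point
        samples.append(x)                # on rejection we stay at (and re-count) x
    return samples

# an unnormalized standard Gaussian stands in for a "complicated" objective
samples = metropolis_hastings(lambda v: math.exp(-v * v / 2.0), 20_000, 0.3)
```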

&lt;h3 id=&quot;back-to-lda&quot;&gt;Back to LDA&lt;/h3&gt;

&lt;p&gt;We would like to have high proposal acceptance rates, good space coverage, and low proposal generation complexity. The LightLDA authors suggest using the expressions within LDA’s update equation as proposal functions, since these expressions match the objective in certain regions. This has two further advantages: we don’t need to compute anything extra, simply reusing the statistics (count matrices) that we would have collected anyway, and we don’t actually need to create alias tables for one of the proposals.&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;p (Z_{d,n} = k) \propto \underbrace{(C_k^d + \alpha)}_{doc-proposal} \times \underbrace{\frac{C_k^w + \beta}{C_k + V\beta}}_{term-proposal}&lt;/script&gt;

&lt;p&gt;&lt;script type=&quot;math/tex&quot;&gt;Z_d&lt;/script&gt; acts as a proxy alias table for the doc-proposal because it stores the number of times each topic has appeared in the document &lt;script type=&quot;math/tex&quot;&gt;d&lt;/script&gt;, and if we generate a random number between &lt;script type=&quot;math/tex&quot;&gt;[0, len(d)]&lt;/script&gt;, we get a sample from the doc-proposal. For the word-proposals, we generate alias tables for each word and use the aforementioned Alias Sampling trick.&lt;/p&gt;

&lt;p&gt;We cycle between these two proposals, performing the MH test on the fly to accept or reject each one. We recompute the alias table for a word every time we have used &lt;script type=&quot;math/tex&quot;&gt;K&lt;/script&gt; proposals from it. The acceptance probabilities for the doc-proposal (&lt;script type=&quot;math/tex&quot;&gt;A_d&lt;/script&gt;) and word-proposal (&lt;script type=&quot;math/tex&quot;&gt;A_w&lt;/script&gt;), given a proposed topic &lt;script type=&quot;math/tex&quot;&gt;p&lt;/script&gt; and the current topic &lt;script type=&quot;math/tex&quot;&gt;k&lt;/script&gt;, can be calculated by using LDA’s update equation as the objective function, and the doc and word proposals as the proposal functions.&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;A_d = min \{ 1, \underbrace{\frac{(C_p^d + \alpha) (C_p^w + \beta) (C_k + V\beta)} {(C_k^d + \alpha) (C_k^w + \beta) (C_p + V\beta)}}_{objective} \times \underbrace{\frac{(C_k^d + \alpha)} {(C_p^d + \alpha)}}_{doc-proposal} \}&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;A_w = min \{ 1, \underbrace{\frac{(C_p^d + \alpha) (C_p^w + \beta) (C_k + V\beta)} {(C_k^d + \alpha) (C_k^w + \beta) (C_p + V\beta)}}_{objective} \times \underbrace{\frac{(C_k^w + \beta) (C_p + V\beta)} {(C_p^w + \beta) (C_k + V\beta)}}_{term-proposal} \}&lt;/script&gt;

&lt;p&gt;Using Alias tables and MH tests, the algorithm looks like the following:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;proposal&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;coinflip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;w&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;decrement_count_matrices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;proposal&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;c1&quot;&gt;// doc-proposal
&lt;/span&gt;      &lt;span class=&quot;n&quot;&gt;index&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;randomInt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]));&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;mh_acceptance&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;compute_doc_acceptance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;c1&quot;&gt;// term-proposal
&lt;/span&gt;      &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alias_sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;mh_acceptance&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;compute_term_acceptance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// MH-test
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;mh_sample&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;randomFloat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mh_sample&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mh_acceptance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;increment_count_matrices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// accept proposal, move to p
&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;increment_count_matrices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// reject proposal, revert to k
&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;	
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;By doing this, the algorithmic complexity of LDA comes down to an amortized &lt;script type=&quot;math/tex&quot;&gt;O(DN)&lt;/script&gt;, which allows us to process documents an order of magnitude faster. The following table and graph compare the runtime (in seconds) of LDA on 60,000 documents after 100 iterations (convergence) on a single process.&lt;/p&gt;

&lt;center&gt;
&lt;table align=&quot;center&quot;&gt;
  &lt;col width=&quot;160&quot; /&gt;
  &lt;col width=&quot;90&quot; /&gt;
  &lt;col width=&quot;90&quot; /&gt;
  &lt;col width=&quot;90&quot; /&gt;
  &lt;col width=&quot;90&quot; /&gt;
  &lt;col width=&quot;90&quot; /&gt;
    &lt;tr&gt;
      &lt;th&gt;    Num. of Topics    &lt;/th&gt;
      &lt;th&gt; 100 &lt;/th&gt;
      &lt;th&gt; 200 &lt;/th&gt;
      &lt;th&gt; 300 &lt;/th&gt;
      &lt;th&gt; 500 &lt;/th&gt;
      &lt;th&gt; 1000 &lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt; CGS LDA &lt;/td&gt;
      &lt;td&gt; 3427.99 &lt;/td&gt;
      &lt;td&gt; 7605.02 &lt;/td&gt;
      &lt;td&gt; 12190.54 &lt;/td&gt;
      &lt;td&gt; 25274.20 &lt;/td&gt;
      &lt;td&gt; 57492.22 &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt; MH/Alias LDA &lt;/td&gt;
      &lt;td&gt; 601.06 &lt;/td&gt;
      &lt;td&gt; 616.07 &lt;/td&gt;
      &lt;td&gt; 620.82 &lt;/td&gt;
      &lt;td&gt; 646.38 &lt;/td&gt;
      &lt;td&gt; 685.19 &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/table&gt;
&lt;/center&gt;

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/storyclustering/comparison.png&quot; style=&quot;max-width:100%;&quot; /&gt;
&lt;/div&gt;
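&lt;p&gt;To give a flavour of the alias-table half of this trick, here is a minimal sketch of Vose’s alias method in Python. This is an illustrative reimplementation, not our production sampler; the function and variable names are ours.&lt;/p&gt;

```python
import random

def build_alias(probs):
    # Vose's alias method: O(K) construction, then O(1) per draw,
    # versus O(K) per draw for naive sampling from a discrete
    # distribution over K topics.
    k = len(probs)
    scaled = [p * k for p in probs]
    prob, alias = [0.0] * k, [0] * k
    small = [i for i, s in enumerate(scaled) if 1.0 > s]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if 1.0 > scaled[l] else large).append(l)
    for i in small + large:
        prob[i] = 1.0
    return prob, alias

def draw(prob, alias):
    # One uniform picks a bucket, a second decides bucket vs. alias.
    i = random.randrange(len(prob))
    return i if prob[i] > random.random() else alias[i]
```

&lt;p&gt;Stale alias tables are what the MH test shown earlier corrects for: samples are drawn cheaply from a slightly out-of-date proposal distribution and then accepted or rejected against the true one.&lt;/p&gt;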

&lt;h2 id=&quot;clustering&quot;&gt;Clustering&lt;/h2&gt;

&lt;p&gt;LDA provides us with a sparse and robust representation of texts that reduces term variability in much lower dimensions. In 1000 or lower dimensions, most simple algorithms work really well. When each document is represented in this space, we do a fast &lt;a href=&quot;http://en.wikipedia.org/wiki/Nearest_neighbor_search#Approximate_nearest_neighbor&quot;&gt;Approximate Nearest Neighbour search&lt;/a&gt;, and cluster all documents that are within a certain distance from each other.&lt;/p&gt;

&lt;p&gt;Using a distance based metric has an added advantage of being able to capture near and exact duplicates. The documents that are mapped too close to each other (purple circle) are considered to be the same story. The documents that are a certain radius away from the exact duplicates (pink circle) make up the roundups for each story.&lt;/p&gt;
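&lt;p&gt;A brute-force sketch of this two-radius idea in Python (the exhaustive distance search and the radii here stand in for our ANN index and tuned thresholds; all names are illustrative):&lt;/p&gt;

```python
def two_radius_clusters(points, dup_r, roundup_r):
    # Merge exact/near duplicates that fall within dup_r of a cluster
    # center, then attach roundup documents that fall between dup_r
    # and roundup_r of that center.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    clusters = []  # each: {"center": ..., "dups": [...], "roundups": [...]}
    for p in points:
        home = None
        for c in clusters:
            if dup_r > dist(p, c["center"]):
                home = c
                break
        if home:
            home["dups"].append(p)          # same story (purple circle)
        else:
            clusters.append({"center": p, "dups": [p], "roundups": []})
    for p in points:
        for c in clusters:
            d = dist(p, c["center"])
            if roundup_r > d >= dup_r:
                c["roundups"].append(p)     # roundup (pink circle)
    return clusters
```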

&lt;div id=&quot;cluster_vis&quot; style=&quot;text-align:center;&quot;&gt;&lt;/div&gt;
&lt;script src=&quot;/assets/storyclustering/clustering.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;Removing exact duplicates helps in capturing different views on an event, and here is an example of one of our story-clusters where the roundups capture differing perspectives on the same news.&lt;/p&gt;

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/storyclustering/griezmann.gif&quot; style=&quot;max-width:50%;&quot; /&gt;
&lt;/div&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Story roundups directly help us in diversifying our users’ feeds while also providing users with multiple perspectives on important stories. We had a lot of fun implementing this cool new feature, not least because we came across several new tricks that can be applied in multiple domains.&lt;/p&gt;

&lt;p&gt;Alias Sampling and the MH algorithm have been around for a long time, but they are only now being used in tandem to optimize posterior and predictive inference problems.&lt;/p&gt;

&lt;h3 id=&quot;key-takeaways&quot;&gt;Key Takeaways&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;LDA is awesome not just in the context of qualitative topic analysis but also in terms of dimensionality reduction.&lt;/li&gt;
  &lt;li&gt;Simple algorithms work really well in low dimensions; (almost) everything fails in very high dimensions.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Bayesian Machine Learning doesn’t have to be slow or expensive.&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;If you made it this far, I hope you learned a new way to optimize some posterior inference problems: Alias Tables and MH tests are an amazing combination.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Extra special thanks to &lt;a href=&quot;http://benfrederickson.com&quot;&gt;Ben Frederickson&lt;/a&gt; for suggestions, edits and
the Alias Table visualization. Thanks to &lt;a href=&quot;http://www.linkedin.com/in/sizzler&quot;&gt;Dale Cieslak&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/in/mikecora&quot;&gt;Mike Vlad Cora&lt;/a&gt; and &lt;a href=&quot;https://twitter.com/miaq&quot;&gt;Mia Quagliarello&lt;/a&gt; for proofreading.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Enjoyed this post? &lt;a href=&quot;https://about.flipboard.com/careers/&quot;&gt;We’re hiring!&lt;/a&gt;
&lt;script type=&quot;text/javascript&quot; src=&quot;https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&quot;&gt;
&lt;/script&gt;&lt;/p&gt;

&lt;style type=&quot;text/css&quot;&gt;
      
      .axis path,
      .axis line {
        fill: none;
        stroke: black;
        shape-rendering: crispEdges;
      }
      
      .axis text {
        font-family: sans-serif;
        font-size: 11px;
      }
&lt;/style&gt;

</description>
                <pubDate>Wed, 08 Feb 2017 00:00:00 +0000</pubDate>
                <link>http://engineering.flipboard.com//2017/02/storyclustering</link>
                <guid isPermaLink="true">http://engineering.flipboard.com//2017/02/storyclustering</guid>
            </item>
        
            <item>
                <title>iOS Meetup</title>
                <author>https://twitter.com/timonus (Tim Johnsen)</author>
                <description>&lt;p&gt;Flipboard was honored to host a meetup of the &lt;a href=&quot;http://www.meetup.com/Palo-Alto-iOS-Developers/events/224782782/&quot;&gt;Palo Alto iOS Developers&lt;/a&gt; group about developing apps for watchOS 2 at our office in Palo Alto last week. &lt;a href=&quot;https://twitter.com/benmorrow&quot;&gt;Ben Morrow&lt;/a&gt;, writer of the &lt;a href=&quot;http://www.happy.watch/&quot;&gt;Happy Watch&lt;/a&gt; blog and host of several Apple Watch hackathons, gave a talk deep diving into various aspects of watchOS 2.
&lt;!--break--&gt;&lt;/p&gt;

&lt;center&gt;&lt;a href=&quot;http://www.meetup.com/Palo-Alto-iOS-Developers/photos/26382293/#441587485&quot;&gt;&lt;img src=&quot;/assets/ios-meetup/1.jpg&quot; style=&quot;width:100%; max-width:658px;&quot; /&gt;&lt;/a&gt;&lt;/center&gt;

&lt;p&gt;We also gave a short talk sharing a bit of the story about how we built our WatchKit app and what we’ve got in store for watchOS 2.&lt;/p&gt;

&lt;center&gt;&lt;blockquote class=&quot;instagram-media&quot; data-instgrm-version=&quot;4&quot; style=&quot; background:#FFF; border:0; border-radius:3px; box-shadow:0 0 1px 0 rgba(0,0,0,0.5),0 1px 10px 0 rgba(0,0,0,0.15); margin: 1px; max-width:658px; padding:0; width:99.375%; width:-webkit-calc(100% - 2px); width:calc(100% - 2px);&quot;&gt;&lt;div style=&quot;padding:8px;&quot;&gt; &lt;div style=&quot; background:#F8F8F8; line-height:0; margin-top:40px; padding:50.0% 0; text-align:center; width:100%;&quot;&gt; &lt;div style=&quot; background:url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAACwAAAAsCAMAAAApWqozAAAAGFBMVEUiIiI9PT0eHh4gIB4hIBkcHBwcHBwcHBydr+JQAAAACHRSTlMABA4YHyQsM5jtaMwAAADfSURBVDjL7ZVBEgMhCAQBAf//42xcNbpAqakcM0ftUmFAAIBE81IqBJdS3lS6zs3bIpB9WED3YYXFPmHRfT8sgyrCP1x8uEUxLMzNWElFOYCV6mHWWwMzdPEKHlhLw7NWJqkHc4uIZphavDzA2JPzUDsBZziNae2S6owH8xPmX8G7zzgKEOPUoYHvGz1TBCxMkd3kwNVbU0gKHkx+iZILf77IofhrY1nYFnB/lQPb79drWOyJVa/DAvg9B/rLB4cC+Nqgdz/TvBbBnr6GBReqn/nRmDgaQEej7WhonozjF+Y2I/fZou/qAAAAAElFTkSuQmCC); display:block; height:44px; margin:0 auto -44px; position:relative; top:-22px; width:44px;&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;p style=&quot; color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; line-height:17px; margin-bottom:0; margin-top:8px; overflow:hidden; padding:8px 0 7px; text-align:center; text-overflow:ellipsis; white-space:nowrap;&quot;&gt;&lt;a href=&quot;https://instagram.com/p/7LbSIjm8J_/&quot; style=&quot; color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:normal; line-height:17px; text-decoration:none;&quot; target=&quot;_top&quot;&gt;A photo posted by Flipboard (@flipboard)&lt;/a&gt; on &lt;time style=&quot; font-family:Arial,sans-serif; font-size:14px; line-height:17px;&quot; datetime=&quot;2015-09-03T18:29:20+00:00&quot;&gt;Sep 3, 2015 at 11:29am PDT&lt;/time&gt;&lt;/p&gt;&lt;/div&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; defer=&quot;&quot; src=&quot;//platform.instagram.com/en_US/embeds.js&quot;&gt;&lt;/script&gt;&lt;/center&gt;

&lt;p&gt;We’d like to thank &lt;a href=&quot;https://twitter.com/msuprovici&quot;&gt;Mike Suprovici&lt;/a&gt; and the community for giving us the opportunity to host this event.&lt;/p&gt;

&lt;p&gt;Here’s a complete video of the meetup:&lt;/p&gt;

&lt;center&gt;&lt;iframe width=&quot;640&quot; height=&quot;360&quot; src=&quot;https://www.youtube.com/embed/hfvSY238i4Q?rel=0&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot; style=&quot;max-width:640px; width:100%;&quot;&gt;&lt;/iframe&gt;&lt;/center&gt;

&lt;p&gt;P.S. &lt;a href=&quot;https://about.flipboard.com/careers/&quot;&gt;We’re hiring!&lt;/a&gt;&lt;/p&gt;
</description>
                <pubDate>Wed, 09 Sep 2015 00:00:00 +0000</pubDate>
                <link>http://engineering.flipboard.com//2015/09/ios-meetup</link>
                <guid isPermaLink="true">http://engineering.flipboard.com//2015/09/ios-meetup</guid>
            </item>
        
            <item>
                <title>Introducing PSync</title>
                <author>https://github.com/hzsweers (Zac Sweers)</author>
                <description>&lt;p&gt;Here on the Android team at Flipboard, we have a lot of settings for users to adjust their experience. If you throw in internal settings, we have about 100 total preferences to manage. This is &lt;em&gt;a lot&lt;/em&gt; of boilerplate to maintain, because preferences in Android have no built-in synchronization (unlike Resources). Our &lt;code class=&quot;highlighter-rouge&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;PreferenceFragment&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt; class has a couple hundred lines of boilerplate fields at the top where we keep in-code mirrors to these preference values. This design is tedious, brittle, and requires a lot of overhead to keep in sync with XML. We developed a Gradle plugin called PSync to solve this problem.
&lt;!--break--&gt;&lt;/p&gt;

&lt;p&gt;PSync is an Android-specific Gradle plugin that generates Java representations of XML preferences. These Java classes can then be used directly in your code. The generated code is very much inspired by how &lt;code class=&quot;highlighter-rouge&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;R.java&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt; works for resources, and should feel familiar to developers. It’s easy to use, has some simple configurations for fine-tuning your generated code, and is ready to drop into both library and application projects.&lt;/p&gt;

&lt;p&gt;A full overview can be found on the &lt;a href=&quot;https://github.com/Flipboard/psync&quot;&gt;GitHub README&lt;/a&gt;, but here’s a quick preview:&lt;/p&gt;

&lt;p&gt;Say you have a preference:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&amp;lt;CheckboxPreference
    android:key=&quot;show_images&quot;
    android:defaultValue=&quot;true&quot;
    /&amp;gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You can reference it like this:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;String theKey = P.showImages.key;
boolean current = P.showImages.get();
P.showImages.put(false).apply();

// If you use Rx-Preferences
P.showImages.rx().asObservable().omgDoRxStuff!

// If you need it
boolean theDefault = P.showImages.defaultValue();
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The generated Java code looks like so:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;public final class P {
    public static final class showImages {
        public static final String key = &quot;show_images&quot;;

        public static final boolean defaultValue() {
            return true;
        }

        public static final boolean get() {
            return PREFERENCES.getBoolean(key, defaultValue());
        }

        public static final SharedPreferences.Editor put(final boolean val) {
            return PREFERENCES.edit().putBoolean(key, val);
        }

        public static final Preference&amp;lt;Boolean&amp;gt; rx() {
            return RX_SHARED_PREFERENCES.getBoolean(key);
        }
    }
}
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It’s robust, fully tested, and should be ready for use with standard XML preferences. The process of writing tests for a plugin like PSync is worth its own blog post or talk, so look forward to that coming up. Future plans for this project include adding support for mixing in your own non-XML preferences, and supporting the full range of the SharedPreferences API.&lt;/p&gt;

&lt;p&gt;We hope you find it useful, and welcome contributions and feedback on the repo: &lt;a href=&quot;https://github.com/Flipboard/psync&quot;&gt;https://github.com/Flipboard/psync&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you like open source and working on cool projects, we are looking for talented Android engineers to join our team here in Palo Alto, California! &lt;a href=&quot;http://hire.jobvite.com/m?31pnnhwW&quot;&gt;Apply here&lt;/a&gt;&lt;/p&gt;
</description>
                <pubDate>Tue, 08 Sep 2015 00:00:00 +0000</pubDate>
                <link>http://engineering.flipboard.com//2015/09/psync</link>
                <guid isPermaLink="true">http://engineering.flipboard.com//2015/09/psync</guid>
            </item>
        
            <item>
                <title>Presenting BottomSheet</title>
                <author>https://github.com/emilsjolander (Emil Sjölander)</author>
<description>&lt;p&gt;We are happy to introduce &lt;a href=&quot;http://github.com/flipboard/bottomsheet&quot;&gt;BottomSheet&lt;/a&gt;, a new open-source Android UI component!&lt;/p&gt;

&lt;p&gt;At Flipboard, we love building visually stunning and highly interactive UIs. When building these UIs, we tend to build them as fairly standalone components. This makes it very easy for our developers to implement a similar interaction model and aesthetic across the whole product while working in parallel. BottomSheet is a UI component we developed to facilitate a new interaction model for saving an article to one of your magazines (otherwise known as “The Flip UI”).
&lt;!--break--&gt;&lt;/p&gt;

&lt;p&gt;The result of our efforts is a design you can now see in the current version of Flipboard.&lt;/p&gt;

&lt;center&gt;
    &lt;div class=&quot;row&quot;&gt;
        &lt;img src=&quot;/assets/bottomsheet/flipui.gif&quot; style=&quot;max-width:95%;&quot; /&gt;
    &lt;/div&gt;
&lt;/center&gt;

&lt;p&gt;BottomSheet fits perfectly with Google’s Material Design aesthetic as well, matching their Bottom Sheet &lt;a href=&quot;http://www.google.com/design/spec/components/bottom-sheets.html&quot;&gt;specification&lt;/a&gt;. We think this component is great for displaying many types of views, from a simple intent picker to something as complex as the example shown from Flipboard. In its simplest form, BottomSheet can be used to show any view in a sort of modal UI that’s presented from the bottom of the screen and can be interactively dismissed. At Flipboard, we think it’s very important that things feel responsive, and having interactive animations is an important part of this.&lt;/p&gt;

&lt;p&gt;Apart from providing a component for presenting views in a BottomSheet, we also have a module for common views developers might like to present in a BottomSheet. The first such component we’re releasing is the &lt;code class=&quot;highlighter-rouge&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;IntentPickerSheetView&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;. This sheet view is initialized with an intent and will show a grid of activities that can handle the intent. It works very similarly to Android’s built in IntentChooser, and looks very similar to the picker used in Lollipop and above. There are a couple of ways in which our implementation improves upon the system intent chooser:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;It can be interactively dismissed, making it feel much more responsive.&lt;/li&gt;
  &lt;li&gt;It adds an easy-to-use API for filtering and sorting the activities shown. Ever wanted to filter out the Bluetooth activity when adding sharing functionality to your app? With the &lt;code class=&quot;highlighter-rouge&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;IntentPickerSheetView&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;, that’s simple!&lt;/li&gt;
&lt;/ul&gt;

&lt;center&gt;
    &lt;div class=&quot;row&quot;&gt;
        &lt;img src=&quot;/assets/bottomsheet/intentpicker.gif&quot; style=&quot;max-width:95%;&quot; /&gt;
    &lt;/div&gt;
&lt;/center&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;IntentPickerSheetView intentPickerSheet = new IntentPickerSheetView(MainActivity.this, shareIntent, &quot;Share with...&quot;, new IntentPickerSheetView.OnIntentPickedListener() {
    @Override
    public void onIntentPicked(Intent intent) {
        bottomSheet.dismissSheet();
        startActivity(intent);
    }
});
// Filter out built in sharing options such as bluetooth and beam.
intentPickerSheet.setFilter(new IntentPickerSheetView.Filter() {
    @Override
    public boolean include(IntentPickerSheetView.ActivityInfo info) {
        return !info.componentName.getPackageName().startsWith(&quot;com.android&quot;);
    }
});
// Sort activities in reverse order for no good reason
intentPickerSheet.setSortMethod(new Comparator&amp;lt;IntentPickerSheetView.ActivityInfo&amp;gt;() {
    @Override
    public int compare(IntentPickerSheetView.ActivityInfo lhs, IntentPickerSheetView.ActivityInfo rhs) {
        return rhs.label.compareTo(lhs.label);
    }
});
bottomSheet.showWithSheetView(intentPickerSheet);
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We would love for you to join us on &lt;a href=&quot;http://github.com/flipboard/bottomsheet&quot;&gt;GitHub&lt;/a&gt; to collaborate in adding many more of these common components!&lt;/p&gt;

&lt;p&gt;And before I send you off to scour the interwebs for cat GIFs to display in your new BottomSheets, I want to take this opportunity to advertise that we are looking for talented Android engineers to join our team here in Palo Alto, California. &lt;a href=&quot;http://hire.jobvite.com/m?31pnnhwW&quot;&gt;Apply here&lt;/a&gt;&lt;/p&gt;

</description>
                <pubDate>Thu, 04 Jun 2015 00:00:00 +0000</pubDate>
                <link>http://engineering.flipboard.com//2015/06/bottomsheet</link>
                <guid isPermaLink="true">http://engineering.flipboard.com//2015/06/bottomsheet</guid>
            </item>
        
            <item>
                <title>Image Scaling using Deep Convolutional Neural Networks</title>
                <author>https://twitter.com/normantasfi (Norman Tasfi)</author>
                <description>&lt;style&gt;
    h4 {
      box-sizing: border-box;
      color: rgb(85, 85, 85);
      display: block;
      font-family: HelveticaNeue-Thin, 'Helvetica Neue', 'Segoe UI', Arial, sans-serif;
      font-size: 18px;
      font-weight: 200;
      height: 20px;
      line-height: 20px;
      margin-bottom: 10px;
      margin-left: auto;
      margin-right: auto;
      margin-top: 40px;
      max-width: 640px;
      width: 640px;
    }
&lt;/style&gt;

&lt;p class=&quot;deck&quot;&gt;
This past summer I interned at Flipboard in Palo Alto, California. I worked on machine learning problems, one of which was image upscaling. This post will show some preliminary results, discuss our model, and cover its possible applications to Flipboard’s products&lt;!--break--&gt;.
&lt;/p&gt;

&lt;p&gt;High quality and a print-like finish play a key role in Flipboard’s design language. We want users to enjoy a consistent and beautiful experience throughout all of Flipboard’s content, as if they had a custom print magazine in hand. Providing this experience consistently is difficult. Different factors, such as image quality, deeply affect the overall quality of the presented content. Image quality varies greatly depending on the image’s source. This varying image quality is especially apparent in magazines that display images across the whole page in a full bleed format.&lt;/p&gt;

&lt;p&gt;When we display images on either the web or mobile devices, they must be above a certain resolution threshold to display well. If we receive a large image on our web product, we can create breathtaking full bleed sections.&lt;/p&gt;

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/convnets/good.png&quot; /&gt;
    &lt;div class=&quot;caption&quot;&gt;
    Full bleed High Quality Image
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Lower resolution images introduce pixelation, over-smoothing, and artifacts when scaled above 100%. This is especially apparent in a full bleed presentation, as seen below, and severely reduces the quality of presentation in our products.&lt;/p&gt;

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/convnets/bad.png&quot; /&gt;
    &lt;div class=&quot;caption&quot;&gt;
    Full bleed Low Quality Image
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;What is the cause of this? In general, when we have an image of size X that needs to be displayed at size Y, it must be run through a scaling algorithm, which performs a mathematical operation to map the image’s pixels to the desired size Y. Some of the possible algorithms are bicubic, bilinear, and nearest-neighbor interpolation. Most of these algorithms interpolate between pixel values to create a transition, using the surrounding pixels to guess what the missing values in the new image should be. The problem with scaling an image to a larger size is that there are too many ‘new’ values to fill in: the algorithm’s guesses introduce errors, which lead to noise, haloing, and artifacts.&lt;/p&gt;
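&lt;p&gt;As a concrete (and deliberately crude) example, the simplest of these algorithms, nearest-neighbor interpolation, can be sketched in a few lines of Python. Each output pixel just copies the closest source pixel, which is exactly what produces the blocky pixelation described above:&lt;/p&gt;

```python
def upscale_nearest(img, factor):
    # img is a 2D list of pixel values; factor is an integer scale.
    # Every output pixel copies its nearest source pixel, so edges
    # turn into visible blocks instead of smooth transitions.
    h, w = len(img), len(img[0])
    return [[img[y // factor][x // factor] for x in range(w * factor)]
            for y in range(h * factor)]
```

&lt;p&gt;Bilinear and bicubic interpolation replace the straight copy with a weighted average of neighboring pixels, trading blockiness for the smoothing and haloing mentioned above.&lt;/p&gt;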

&lt;p&gt;Before we go any further I would like to give a high level introduction to both the traditional and convolutional flavours of Neural Networks. If you have a good grasp of them feel free to skip ahead to next section. Following the introduction to Neural Networks there is a preliminary results section, discussion of the model architectures, design decisions, and applications.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Smaller nuances of Neural Networks will not be covered in the introduction.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  &lt;a name=&quot;neuralNetworks&quot; href=&quot;#neuralNetworks&quot; class=&quot;anchor-link&quot;&gt;
    Neural Networks
  &lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Neural Networks are an amazing type of model that is able to learn from given data. They have a large breadth of applications and have enjoyed a recent resurgence in popularity in many domains such as computer vision, audio recognition, and natural language processing. Some recent feats include &lt;a href=&quot;http://googleresearch.blogspot.ca/2014/11/a-picture-is-worth-thousand-coherent.html&quot;&gt;captioning images&lt;/a&gt;, &lt;a href=&quot;http://www.forbes.com/sites/paulrodgers/2015/02/28/googles-deepmind-masters-atari-games/&quot;&gt;playing Atari&lt;/a&gt;, &lt;a href=&quot;http://blogs.nvidia.com/blog/2015/02/24/deep-learning-drive/&quot;&gt;aiding self-driving cars&lt;/a&gt;, and &lt;a href=&quot;http://104.131.78.120/&quot;&gt;language translation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Neural Networks exist in different configurations, such as &lt;a href=&quot;http://en.wikipedia.org/wiki/Convolutional_neural_network&quot;&gt;convolutional&lt;/a&gt; and &lt;a href=&quot;http://en.wikipedia.org/wiki/Recurrent_neural_network&quot;&gt;recurrent&lt;/a&gt;, that are each good at different types of tasks. Learning ‘modes’ also exist: supervised and unsupervised; we will only focus on supervised learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supervised learning&lt;/strong&gt; describes training a network on pairs of inputs and outputs; the trained network is then used to predict outputs for new inputs. An example of this would be training a network on thousands of manually labelled pictures of cats and dogs, and then asking, for a new picture: is this a cat or a dog?&lt;/p&gt;

&lt;p&gt;On a structural level, a Neural Network is a feedforward graph where each of the nodes, known as units, performs a nonlinear operation on its incoming inputs. Each input has a weight that the network learns through an algorithm known as backpropagation.&lt;/p&gt;

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/convnets/basic_nn.png&quot; /&gt;
    &lt;div class=&quot;caption&quot;&gt;
    Basic Neural Network
    &lt;br /&gt;
    Source: &lt;a href=&quot;http://en.wikipedia.org/wiki/Artificial_neural_network#/media/File:Colored_neural_network.svg&quot;&gt;Wikipedia&lt;/a&gt;
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The structure of a Neural Network is flexible and based on the task at hand. We are free to customize our network by selecting attributes such as the number of hidden layers (blue nodes above), the number of units per layer, and the number of connections per unit. These attributes, known as hyperparameters, describe our model’s structure and behaviour. Selecting these parameters correctly is critical to achieving high performance. Hyperparameters are usually chosen through random &lt;a href=&quot;http://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search&quot;&gt;grid search&lt;/a&gt;, optimization algorithms (&lt;a href=&quot;http://en.wikipedia.org/wiki/Bayesian_optimization&quot;&gt;Bayesian&lt;/a&gt; and &lt;a href=&quot;http://scipy-lectures.github.io/advanced/mathematical_optimization/&quot;&gt;gradient based&lt;/a&gt;), or simple trial and error.&lt;/p&gt;
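&lt;p&gt;A grid search over hyperparameters can be sketched in a few lines of Python. The &lt;code&gt;train_and_score&lt;/code&gt; callback here is a hypothetical stand-in for a full train/validate cycle:&lt;/p&gt;

```python
import itertools

def grid_search(train_and_score, grid):
    # Exhaustively try every combination of hyperparameter values
    # and keep the best-scoring one. `train_and_score` is assumed
    # to train a model with the given params and return a score.
    best_score, best_params = float("-inf"), None
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = train_and_score(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```

&lt;p&gt;The cost grows multiplicatively with each hyperparameter added, which is why random search and Bayesian optimization become attractive for larger models.&lt;/p&gt;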

&lt;p&gt;As mentioned previously, the units within a network perform a mathematical operation on their inputs. We can take a closer look by calculating a simple numerical example involving a single unit with a handful of inputs.&lt;/p&gt;

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/convnets/our_simple_nn.png&quot; /&gt;
    &lt;div class=&quot;caption&quot;&gt;
    Our simple model
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The unit above takes three inputs: &lt;script type=&quot;math/tex&quot;&gt;\displaystyle x_1, x_2, x_3&lt;/script&gt; and a scalar bias term &lt;script type=&quot;math/tex&quot;&gt;\displaystyle b&lt;/script&gt; (not shown). Each of these inputs has a weight, referred to as &lt;script type=&quot;math/tex&quot;&gt;\displaystyle w_1, w_2,&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\displaystyle w_3&lt;/script&gt;. The weights express the importance of the incoming input, and the bias term allows us to shift the weighted value to the left or right. For this example we will use the &lt;strong&gt;Rectified Linear (ReL)&lt;/strong&gt; function as the unit’s mathematical operation, more formally known as an &lt;strong&gt;activation function&lt;/strong&gt;, expressed as:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;% &lt;![CDATA[
\displaystyle \begin{array}{ll}
f(x) = \begin{cases}
x &amp; {\text{if}}\ x&gt;0 \\
0 &amp; {\text{if}}\ otherwise \\ 
\end{cases}
\end{array} %]]&gt;&lt;/script&gt;

&lt;p&gt;The gist of this activation function is that any input value below zero is treated as zero and anything above remains the same. It has an output range of &lt;script type=&quot;math/tex&quot;&gt;\displaystyle [0, \infty)&lt;/script&gt;. Other activation functions can also be used, such as the &lt;a href=&quot;http://en.wikipedia.org/wiki/Sigmoid_function&quot;&gt;sigmoid&lt;/a&gt;, &lt;a href=&quot;http://mathworld.wolfram.com/HyperbolicTangent.html&quot;&gt;hyperbolic tangent&lt;/a&gt;, and &lt;a href=&quot;http://arxiv.org/abs/1302.4389&quot;&gt;maxout&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So how do we calculate an output value? To do so we need to compute &lt;script type=&quot;math/tex&quot;&gt;\displaystyle f(W^Tx+b)&lt;/script&gt;. The first step is to compute the product of the transpose of &lt;script type=&quot;math/tex&quot;&gt;\displaystyle W&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\displaystyle x,&lt;/script&gt; defined as &lt;script type=&quot;math/tex&quot;&gt;\displaystyle W^Tx&lt;/script&gt;. Our bias value is then added to this result before finally being passed through our activation function.&lt;/p&gt;

&lt;p&gt;To make this example a little more concrete we will use some random numbers. Say we have the following vectors corresponding to our weights, inputs, and scalar bias value:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\displaystyle W = [1, 0.2, 0.1] \\ x =[0.74, 5, 8] \\ b=1.0&lt;/script&gt;

&lt;p&gt;Going step by step we calculate the result of &lt;script type=&quot;math/tex&quot;&gt;\displaystyle W^Tx&lt;/script&gt;:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\displaystyle 1.0(0.74)+0.2(5.0)+0.1(8.0)=2.54&lt;/script&gt;

&lt;p&gt;Then our bias value is added &lt;script type=&quot;math/tex&quot;&gt;\displaystyle W^Tx + b&lt;/script&gt;:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\displaystyle 2.54+1.0 = 3.54&lt;/script&gt;

&lt;p&gt;We then take this quantity and pass it through our activation function &lt;script type=&quot;math/tex&quot;&gt;\displaystyle f(x)&lt;/script&gt;:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\displaystyle f(3.54) = 3.54&lt;/script&gt;

&lt;p&gt;As per our function definition we see that if x is greater than zero we will get the value of x out, which in this case is 3.54.&lt;/p&gt;
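&lt;p&gt;The worked example above can be reproduced in a few lines of NumPy (a sketch for illustration, not code from our system):&lt;/p&gt;

```python
import numpy as np

# Weights, inputs, and bias from the worked example above.
W = np.array([1.0, 0.2, 0.1])
x = np.array([0.74, 5.0, 8.0])
b = 1.0

def relu(z):
    # Rectified Linear activation: elementwise max(0, z).
    return np.maximum(0.0, z)

z = W.dot(x) + b           # W^T x + b = 2.54 + 1.0 = 3.54
y = relu(z)                # positive input passes through unchanged
print(round(float(y), 2))  # 3.54
```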

&lt;p&gt;Why is this exciting? Well, in terms of a single unit, it is not too exciting. As it stands, we can tweak the weights and bias value to model only the most basic of functions; our little example lacks ‘expressive power’. To increase that expressive power we can chain and link units together to form larger networks, as seen below.&lt;/p&gt;

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/convnets/our_complex_nn.png&quot; /&gt;
    &lt;div class=&quot;caption&quot;&gt;
        Our bigger network
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The equations for the network above, where &lt;script type=&quot;math/tex&quot;&gt;\displaystyle W_{from,to}&lt;/script&gt; is the weight from one unit to another, are as follows:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\displaystyle \begin{array}{ll}
z_1 = f(W_{l1,h1}^T x + b) \\
z_2 = f(W_{l1,h2}^T x + b) \\
y = f(W_{h12,o1}^T z + b)
\end{array}&lt;/script&gt;

&lt;p&gt;&lt;script type=&quot;math/tex&quot;&gt;\displaystyle z_1&lt;/script&gt; is the same expression we calculated earlier! See how we have chained the units together to create a more complicated network? We are applying the same calculation over and over, using the output from a previous layer of units to compute the next. The values are propagated through the network until they arrive at the final layer. We call this process &lt;strong&gt;forward propagation&lt;/strong&gt;. But this process is of little use to us if we need to change the output of our network, as it makes no adjustments to the weights and biases.&lt;/p&gt;
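&lt;p&gt;Forward propagation through this small network can be sketched as below. The weight values here are made up purely for illustration; only the structure matters:&lt;/p&gt;

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hypothetical weights; in practice these are learned, not hand-picked.
x = np.array([0.74, 5.0, 8.0])
W_l1_h1 = np.array([1.0, 0.2, 0.1])   # input layer -> hidden unit 1
W_l1_h2 = np.array([0.5, 0.3, 0.2])   # input layer -> hidden unit 2
W_h_o1 = np.array([0.6, 0.4])         # hidden units -> output unit
b = 1.0

# Each layer's output feeds the next layer's units.
z1 = relu(W_l1_h1.dot(x) + b)
z2 = relu(W_l1_h2.dot(x) + b)
z = np.array([z1, z2])
y = relu(W_h_o1.dot(z) + b)
```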

&lt;p&gt;To update the weights and biases of our network we will use an algorithm known as &lt;strong&gt;backpropagation&lt;/strong&gt;. We will focus on a supervised approach where our network is given pairs of data: input x and the desired output y. In order to use &lt;strong&gt;backpropagation&lt;/strong&gt; in this supervised manner we need to quantify our network’s performance. We must define an &lt;strong&gt;error&lt;/strong&gt; that compares the result of forward propagation through our network, &lt;script type=&quot;math/tex&quot;&gt;\displaystyle \hat{y}&lt;/script&gt;(our estimated value), against the desired value &lt;script type=&quot;math/tex&quot;&gt;\displaystyle y&lt;/script&gt;.&lt;/p&gt;

&lt;p&gt;This comparison between values is formally known as a &lt;strong&gt;cost function&lt;/strong&gt;. It can be defined however you want but for this post we will use the mean squared error function:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\displaystyle E_{MSE}(x,y) = \displaystyle\frac{1}{n} \vert \hat{y}(x) - y \vert ^ 2&lt;/script&gt;

&lt;p&gt;This function gives us a magnitude of error. If our network was given randomly initialized weights and biases the output will be far from the desired y value, causing a large output from &lt;script type=&quot;math/tex&quot;&gt;\displaystyle E_{MSE}&lt;/script&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backpropagation&lt;/strong&gt; takes this magnitude and propagates it from the back of the network to the front adjusting the weights and biases along the way. The amount of adjustment to each weight and bias can be thought of as its contribution to the error and is calculated through &lt;a href=&quot;http://en.wikipedia.org/wiki/Gradient_descent&quot;&gt;gradient descent&lt;/a&gt;. This algorithm seeks to minimize the error function above by changing the weights and biases.&lt;/p&gt;

&lt;p&gt;The general process to train a Neural Network, given an input vector x and the expected output y, is as follows:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Perform a step of forward propagation through the network with input vector x. We will calculate an output &lt;script type=&quot;math/tex&quot;&gt;\displaystyle \hat{y}&lt;/script&gt;.&lt;/li&gt;
  &lt;li&gt;Calculate the error with our error function. &lt;script type=&quot;math/tex&quot;&gt;\displaystyle E_{MSE}(x,y) = \displaystyle\frac{1}{n} \vert \hat{y}(x) - y \vert ^ 2&lt;/script&gt;.&lt;/li&gt;
  &lt;li&gt;Backpropagate the errors through our network updating the weights and biases.&lt;/li&gt;
&lt;/ol&gt;
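&lt;p&gt;The three steps above can be sketched end-to-end for a single ReL unit. The toy target, initial weights, and learning rate here are all illustrative, and the gradients are written out by hand:&lt;/p&gt;

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# A toy supervised pair: map the input vector x to the target y = 2.0.
x = np.array([0.74, 5.0, 8.0])
y = 2.0
w = np.full(3, 0.1)   # small positive initial weights (fixed for determinism)
b = 0.0
lr = 0.01             # learning rate for gradient descent

for step in range(200):
    # 1. Forward propagation.
    z = w.dot(x) + b
    y_hat = relu(z)
    # 2. The error, via the mean squared error function (n = 1).
    error = (y_hat - y) ** 2
    # 3. Backpropagation: gradient of the error w.r.t. w and b,
    #    using the ReL derivative (1 if z > 0, else 0).
    grad_z = 2.0 * (y_hat - y) * (1.0 if z > 0 else 0.0)
    w = w - lr * grad_z * x
    b = b - lr * grad_z
```

After repeating the three steps a couple of hundred times, the error is driven very close to zero.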

&lt;p&gt;The above steps are repeated over and over with different x and y pairs until the weights and biases of our network give us the minimum error possible! We are minimizing the &lt;script type=&quot;math/tex&quot;&gt;\displaystyle E_{MSE}&lt;/script&gt; function.&lt;/p&gt;

&lt;h2&gt;
  &lt;a name=&quot;convNeuralNetworks&quot; href=&quot;#convNeuralNetworks&quot; class=&quot;anchor-link&quot;&gt;
    Convolutional Neural Networks
  &lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;If you have been paying attention to recent tech articles you will most likely have heard of Neural Networks breaking the state-of-the-art in several domains. These breakthroughs are due in no small part to convolutional Neural Networks.&lt;/p&gt;

&lt;p&gt;Convolutional Neural Networks (convnets) are a slightly different flavour of the typical feed-forward Neural Network. Convnets take some biological inspiration from the visual cortex, which contains small regions of cells that are sensitive to subregions of the visual field, referred to as receptive fields. We can mimic these small subfields by learning weights in the form of matrices, referred to as ‘kernels’, which, like their biologically inspired counterparts, are sensitive to similar subregions of an image. We then require a way to express the similarity between a kernel and a subregion. Since the convolution operation essentially returns a measure of ‘similarity’ between two signals, we can pass a learned kernel, along with a subregion of the image, through this operation and have a measure of similarity returned back!&lt;/p&gt;

&lt;p&gt;Below is an animation of a kernel, in yellow, being convolved over an image, in green, with the result of the operation on the right in red.&lt;/p&gt;

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/convnets/Convolution_schematic.gif&quot; /&gt;
    &lt;div class=&quot;caption&quot;&gt;
    Animation of the Convolution Operation.
    &lt;br /&gt;
Source: &lt;a href=&quot;http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution&quot;&gt;Feature extraction using convolution&lt;/a&gt; from Stanford Deep Learning
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;To illustrate this let’s run a very simple square kernel which detects directional edges over an image. The weights of our kernel:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;% &lt;![CDATA[
\displaystyle \left( \begin{array}{lll}
-5 &amp; 0 &amp; 0\\
0 &amp; 0 &amp; 0\\
0 &amp; 0 &amp; 5\\
\end{array}\right) %]]&gt;&lt;/script&gt;

&lt;p&gt;Applying the convolution operation with our kernel at every position over the left image, we get the image on the right:&lt;/p&gt;

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/convnets/conv_output.jpg&quot; /&gt;
    &lt;div class=&quot;caption&quot;&gt;
    Result of the edge detection kernel applied to an image.
    &lt;br /&gt;
    Source: &lt;a href=&quot;http://www.catenary.com/howto/diagedge.html&quot;&gt;Edge Detection with Matrix Convolution&lt;/a&gt;
    &lt;/div&gt;
&lt;/div&gt;
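&lt;p&gt;The operation itself is simple to sketch in NumPy. The toy image and helper below are illustrative; real implementations use optimized library routines, and, as is common in convnets, the kernel is not flipped (technically cross-correlation):&lt;/p&gt;

```python
import numpy as np

# The 3x3 directional edge kernel from the text.
kernel = np.array([[-5.0, 0.0, 0.0],
                   [ 0.0, 0.0, 0.0],
                   [ 0.0, 0.0, 5.0]])

def convolve2d_valid(image, kernel):
    # Slide the kernel over every 'valid' position of the image and
    # take the elementwise product-sum at each step.
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A toy image: dark left half, bright right half (a vertical edge).
image = np.zeros((5, 5))
image[:, 3:] = 1.0
response = convolve2d_valid(image, kernel)
# The response is nonzero only where the kernel straddles the edge.
```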

&lt;p&gt;So why is this useful in the context of Neural Networks?&lt;/p&gt;

&lt;p&gt;Convolution allows us to extract information from the right image above: the kernels inform us of the presence and location of directional edges. When used in a Neural Network we can learn the appropriate kernel weights to extract basic features such as edges, gradients, and blurs. If the network is deep, with enough convolutional layers, it will start learning combinations of the features from previous layers. The simple building blocks of edges, gradients, and blurs become things like eyes, noses, and hair in later layers!&lt;/p&gt;

&lt;div class=&quot;row&quot; style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/convnets/yann_filters.png&quot; /&gt;
    &lt;div class=&quot;caption&quot;&gt;
    Kernels building high level representations from earlier layers. 
    &lt;br /&gt;
    Source: &lt;a href=&quot;http://www.cs.nyu.edu/~yann/talks/lecun-ranzato-icml2013.pdf&quot;&gt;Yann Lecun “ICML 2013 tutorial on Deep Learning”&lt;/a&gt;
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Traditionally, if we wanted to do work in the image or audio domains we had to perform feature generation using other algorithms that preprocessed our data into a usable form, allowing our machine learning algorithm of choice to make sense of the incoming data. This process is tedious and fragile. By applying a convolutional frontend to our networks we let the algorithm, with minimal preprocessing on our side, create the features that work best for that specific domain and situation. The network does feature extraction on its own, and the learned features often work better than most algorithmically generated or hand-engineered ones.&lt;/p&gt;

&lt;h2&gt;
  &lt;a name=&quot;scalingCNN&quot; href=&quot;#scalingCNN&quot; class=&quot;anchor-link&quot;&gt;
    Image Scaling using Convolutional Neural Networks
  &lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Below is a collection of preliminary results that were produced from the model. The left image is the original ‘high’ resolution. This is the ground truth and what we would hope to get with perfect reconstruction. We scale this original down by a factor of 2x and send it through either a bicubic scaling algorithm or the model. The results are in the center and right positions respectively.&lt;/p&gt;

&lt;div class=&quot;row&quot; id=&quot;imagerow&quot; style=&quot;width: 100%; text-align:center; margin:0px;&quot;&gt;
    &lt;div class=&quot;col-md-4&quot;&gt;
        &lt;img class=&quot;img-responsive&quot; src=&quot;/assets/convnets/results/human_orig.jpg&quot; /&gt;
        &lt;div class=&quot;caption&quot; style=&quot;font-size: 14px&quot;&gt;
            Original
        &lt;/div&gt;
    &lt;/div&gt;

    &lt;div class=&quot;col-md-4&quot;&gt;
        &lt;img class=&quot;img-responsive&quot; src=&quot;/assets/convnets/results/human_bicubic.jpg&quot; /&gt;
        &lt;div class=&quot;caption&quot; style=&quot;font-size: 14px&quot;&gt;
            Bicubic
        &lt;/div&gt;        
    &lt;/div&gt;

    &lt;div class=&quot;col-md-4&quot;&gt;
        &lt;img class=&quot;img-responsive&quot; src=&quot;/assets/convnets/results/human_model.jpg&quot; /&gt;
        &lt;div class=&quot;caption&quot; style=&quot;font-size: 14px&quot;&gt;
            Model
        &lt;/div&gt;        
    &lt;/div&gt; 
&lt;/div&gt;

&lt;p&gt;The major differences can be seen between the hairline, eyebrows, and skin on the cheek and forehead.&lt;/p&gt;

&lt;div class=&quot;row&quot; id=&quot;imagerow&quot; style=&quot;width: 100%; text-align:center; margin:0px;&quot;&gt;
    &lt;div class=&quot;col-md-4&quot;&gt;
        &lt;img class=&quot;img-responsive&quot; src=&quot;/assets/convnets/results/building_orig.jpg&quot; /&gt;
        &lt;div class=&quot;caption&quot; style=&quot;font-size: 14px&quot;&gt;
            Original
        &lt;/div&gt;
    &lt;/div&gt;

    &lt;div class=&quot;col-md-4&quot;&gt;
        &lt;img class=&quot;img-responsive&quot; src=&quot;/assets/convnets/results/building_bicubic.jpg&quot; /&gt;
        &lt;div class=&quot;caption&quot; style=&quot;font-size: 14px&quot;&gt;
            Bicubic  
        &lt;/div&gt;        
    &lt;/div&gt;

    &lt;div class=&quot;col-md-4&quot;&gt;
        &lt;img class=&quot;img-responsive&quot; src=&quot;/assets/convnets/results/building_model.jpg&quot; /&gt;
        &lt;div class=&quot;caption&quot; style=&quot;font-size: 14px&quot;&gt;
            Model
        &lt;/div&gt;        
    &lt;/div&gt; 
&lt;/div&gt;

&lt;p&gt;The model does a good job along the hard edges on the balcony and on the sun tanning beds in the foreground.&lt;/p&gt;

&lt;div class=&quot;row&quot; id=&quot;imagerow&quot; style=&quot;width: 100%; text-align:center; margin:0px;&quot;&gt;
    &lt;div class=&quot;col-md-4&quot;&gt;
        &lt;img class=&quot;img-responsive&quot; src=&quot;/assets/convnets/results/puppy_orig.jpg&quot; /&gt;
        &lt;div class=&quot;caption&quot; style=&quot;font-size: 14px&quot;&gt;
            Original
        &lt;/div&gt;
    &lt;/div&gt;

    &lt;div class=&quot;col-md-4&quot;&gt;
        &lt;img class=&quot;img-responsive&quot; src=&quot;/assets/convnets/results/puppy_bicubic.jpg&quot; /&gt;
        &lt;div class=&quot;caption&quot; style=&quot;font-size: 14px&quot;&gt;
            Bicubic 
        &lt;/div&gt;        
    &lt;/div&gt;

    &lt;div class=&quot;col-md-4&quot;&gt;
        &lt;img class=&quot;img-responsive&quot; src=&quot;/assets/convnets/results/puppy_model.jpg&quot; /&gt;
        &lt;div class=&quot;caption&quot; style=&quot;font-size: 14px&quot;&gt;
            Model
        &lt;/div&gt;        
    &lt;/div&gt; 
&lt;/div&gt;

&lt;p&gt;This one is difficult to see at first glance. The finer details, such as the hair along the side of the ear and on the inside of the ear, are preserved in the model output.&lt;/p&gt;

&lt;div class=&quot;row&quot; id=&quot;imagerow&quot; style=&quot;width: 100%; text-align:center; margin:0px;&quot;&gt;
    &lt;div class=&quot;col-md-4&quot;&gt;
        &lt;img class=&quot;img-responsive&quot; src=&quot;/assets/convnets/results/lenna_orig.jpg&quot; /&gt;
        &lt;div class=&quot;caption&quot; style=&quot;font-size: 14px&quot;&gt;
            Original
        &lt;/div&gt;
    &lt;/div&gt;

    &lt;div class=&quot;col-md-4&quot;&gt;
        &lt;img class=&quot;img-responsive&quot; src=&quot;/assets/convnets/results/lenna_bicubic.jpg&quot; /&gt;
        &lt;div class=&quot;caption&quot; style=&quot;font-size: 14px&quot;&gt;
            Bicubic
        &lt;/div&gt;        
    &lt;/div&gt;

    &lt;div class=&quot;col-md-4&quot;&gt;
        &lt;img class=&quot;img-responsive&quot; src=&quot;/assets/convnets/results/lenna_model.jpg&quot; /&gt;
        &lt;div class=&quot;caption&quot; style=&quot;font-size: 14px&quot;&gt;
            Model
        &lt;/div&gt;        
    &lt;/div&gt; 
&lt;/div&gt;

&lt;p&gt;No image-related work is complete unless Lenna is included in some way. Observe the sharpness of the feathers, nose, lips, and eyes. The pattern on the hat also shows up better in the model output.&lt;/p&gt;

&lt;div class=&quot;row&quot; id=&quot;imagerow&quot; style=&quot;width: 100%; text-align:center; margin:0px;&quot;&gt;
    &lt;div class=&quot;col-md-4&quot;&gt;
        &lt;img class=&quot;img-responsive&quot; src=&quot;/assets/convnets/results/forest_orig.jpg&quot; /&gt;
        &lt;div class=&quot;caption&quot; style=&quot;font-size: 14px&quot;&gt;
            Original
        &lt;/div&gt;
    &lt;/div&gt;

    &lt;div class=&quot;col-md-4&quot;&gt;
        &lt;img class=&quot;img-responsive&quot; src=&quot;/assets/convnets/results/forest_bicubic.jpg&quot; /&gt;
        &lt;div class=&quot;caption&quot; style=&quot;font-size: 14px&quot;&gt;
            Bicubic
        &lt;/div&gt;        
    &lt;/div&gt;

    &lt;div class=&quot;col-md-4&quot;&gt;
        &lt;img class=&quot;img-responsive&quot; src=&quot;/assets/convnets/results/forest_model.jpg&quot; /&gt;
        &lt;div class=&quot;caption&quot; style=&quot;font-size: 14px&quot;&gt;
            Model
        &lt;/div&gt;        
    &lt;/div&gt; 
&lt;/div&gt;

&lt;p&gt;The major differences between the versions are the tree leaves, shadow shapes and tree textures.&lt;/p&gt;

&lt;h3&gt;
  &lt;a name=&quot;architecture&quot; href=&quot;#architecture&quot; class=&quot;anchor-link&quot;&gt;
    Architecture
  &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Below is one of the architectures used; its primary goal is to double the number of pixels taken in from the image. The architecture is an 8-layer Neural Network composed of three convolutional layers, each shown as stacked pinkish blocks, and four fully connected layers colored in blue. Each layer uses the rectified linear activation function. There is a final dense layer with linear Gaussian units which is not shown below.&lt;/p&gt;

&lt;div style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/convnets/architecture.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;A small section of the input image is ingested through the first convolutional layer. This section is collected from the larger image using a square sliding window. The first convolutional layer contains the largest number of filter maps. Its outputs are ‘reprojected’ into a higher dimensionality through the first two dense fully-connected layers. Their output is further processed by the next two convolutional layers. Features from these two convolutional layers are then fed into a series of fully-connected layers. The final output image is calculated by a linear Gaussian layer.&lt;/p&gt;

&lt;p&gt;No pooling operations are used after any of the convolutional layers. While pooling is useful for classification tasks, where invariance to small shifts in the input is desirable, here the location of the features detected by each kernel matters. Pooling also discards too much useful information, which is the opposite of what is needed for this use case.&lt;/p&gt;

&lt;p&gt;The weights were all initialized using the &lt;a href=&quot;http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf&quot;&gt;Xavier initialization suggested by Glorot &amp;amp; Bengio&lt;/a&gt; and then slightly tweaked during hyperparameter optimization: each weight is sampled from a normal distribution with variance &lt;script type=&quot;math/tex&quot;&gt;\displaystyle \frac{ 2 }{n_{in} + n_{out}}&lt;/script&gt;. The biases were all initialized to zero.&lt;/p&gt;
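&lt;p&gt;A minimal sketch of that initialization scheme; the layer sizes here are illustrative, not the ones used in the model:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)

def xavier_init(n_in, n_out):
    # Glorot & Bengio initialization: weights drawn from a normal
    # distribution with variance 2 / (n_in + n_out).
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

# Example: a dense layer with 512 inputs and 256 outputs.
W = xavier_init(512, 256)
b = np.zeros(256)  # biases initialized to zero, as in the post
```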

&lt;h3&gt;
  &lt;a name=&quot;dataset&quot; href=&quot;#dataset&quot; class=&quot;anchor-link&quot;&gt;
    Dataset
  &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The network was trained on a large dataset of approximately 3 million samples. The dataset consisted of natural images, including those of animals and outdoor scenes. Some images needed to be filtered out of this set as they included illustrations or text. As the images varied in size and quality, a constraint was added to keep only images containing a total of 640,000 pixels or more.&lt;/p&gt;

&lt;p&gt;Each sample within the dataset is a low and high resolution image pair. The low resolution image, the ‘x’ input, was created by downscaling a high resolution image by a certain factor, while the desired output, ‘y’, was the original high resolution image. Very mild noise and distortions were added to the input data. The data was normalized to zero mean by subtracting the global mean and to unit variance by dividing through by the standard deviation of the dataset.&lt;/p&gt;
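&lt;p&gt;The normalization step can be sketched as follows; the sample data here is random and purely illustrative:&lt;/p&gt;

```python
import numpy as np

def normalize(images, mean, std):
    # Zero-mean, unit-variance normalization using statistics
    # computed once over the whole training set.
    return (images - mean) / std

# Illustrative stand-in for the dataset: 8-bit grayscale patches.
rng = np.random.default_rng(0)
train = rng.integers(0, 256, size=(1000, 16, 16)).astype(np.float64)
mean = train.mean()  # a single global mean, as described above
std = train.std()
normalized = normalize(train, mean, std)
```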

&lt;p&gt;The dataset was divided into subsets of training, testing, and validation, following an 80%, 10%, and 10% split respectively.&lt;/p&gt;

&lt;h3&gt;
  &lt;a name=&quot;regularization&quot; href=&quot;#regularization&quot; class=&quot;anchor-link&quot;&gt;
    Regularization
  &lt;/a&gt;
&lt;/h3&gt;

&lt;h4&gt;
  &lt;a name=&quot;maxNorm&quot; href=&quot;#maxNorm&quot; class=&quot;anchor-link&quot;&gt;
    Max Norm
  &lt;/a&gt;
&lt;/h4&gt;
&lt;p&gt;Max norm constraints enforce an absolute upper bound on the magnitude of the weight vector for every unit in a layer, which stops the network’s weights from ‘exploding’ when a gradient update is too large. Max norm constraints are used in all layers except the final linear Gaussian layer. An aggressive bound was used in all of the convolutional layers while the other layers’ bounds were much more lax.&lt;/p&gt;
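&lt;p&gt;A minimal sketch of a max norm constraint applied to the incoming-weight columns of a weight matrix; the bound and the weight values are illustrative:&lt;/p&gt;

```python
import numpy as np

def max_norm(W, c):
    # Rescale any incoming-weight vector (column) whose L2 norm
    # exceeds c so that it sits exactly on the constraint boundary.
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

# Typically applied after each gradient update; c is the per-layer bound.
W = np.array([[3.0, 0.1],
              [4.0, 0.1]])  # first column has norm 5.0
W = max_norm(W, c=2.0)      # first column rescaled to norm 2.0
```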

&lt;h4&gt;
  &lt;a name=&quot;L2&quot; href=&quot;#L2&quot; class=&quot;anchor-link&quot;&gt;
    L2
  &lt;/a&gt;
&lt;/h4&gt;
&lt;p&gt;L2 regularization penalizes the network for using large weight vectors, &lt;script type=&quot;math/tex&quot;&gt;\displaystyle W&lt;/script&gt;. Its strength is set by the parameter &lt;script type=&quot;math/tex&quot;&gt;\displaystyle \lambda&lt;/script&gt;, known as the regularization strength, and it is added to the cost function as the term &lt;script type=&quot;math/tex&quot;&gt;\displaystyle \frac{1}{2}\lambda \Vert W \Vert^2&lt;/script&gt;. The optimization algorithm then tries to keep the weights small while minimizing the cost function as before. The convolutional layers and first two densely connected layers have mild regularization applied while the other densely connected layers use a stronger value.&lt;/p&gt;
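&lt;p&gt;A sketch of how the penalty enters the cost. A single lambda is used here for simplicity, whereas the model above varies it per layer; all values are illustrative:&lt;/p&gt;

```python
import numpy as np

def cost_with_l2(data_cost, weights, lam):
    # Total cost: data term plus 0.5 * lambda * ||W||^2 for each
    # weight matrix, so the optimizer also keeps the weights small.
    penalty = sum(0.5 * lam * np.sum(W ** 2) for W in weights)
    return data_cost + penalty

W1 = np.ones((2, 2))        # ||W1||^2 = 4
W2 = 2.0 * np.ones((2, 2))  # ||W2||^2 = 16
total = cost_with_l2(data_cost=1.0, weights=[W1, W2], lam=0.1)
# 1.0 + 0.5*0.1*4 + 0.5*0.1*16 = 2.0
```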

&lt;h4&gt;
  &lt;a name=&quot;dropout&quot; href=&quot;#dropout&quot; class=&quot;anchor-link&quot;&gt;
    Dropout
  &lt;/a&gt;
&lt;/h4&gt;
&lt;p&gt;&lt;a href=&quot;http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf&quot;&gt;Dropout&lt;/a&gt; randomly ‘drops’ units from a layer on each training step, creating ‘sub-architectures’ within the model. It can be viewed as a type of sampling of a smaller network within a larger network. Only the weights of the included units are updated, which makes sure the network does not become too reliant on a few units. This process of removing units happens over and over between runs, so the units being included change. The convolutional layers all have a high inclusion probability of almost 1.0 while the last two fully connected layers include about half the units.&lt;/p&gt;

&lt;div style=&quot;text-align: center&quot;&gt;
    &lt;img src=&quot;/assets/convnets/dropout.png&quot; /&gt;
&lt;/div&gt;
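&lt;p&gt;Dropout can be sketched as below, using the common ‘inverted’ formulation (the post does not specify the exact variant); the inclusion probability of 0.5 is the rough value used for the last two fully connected layers:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_include):
    # Randomly zero out units, keeping each with probability p_include.
    # The 'inverted' scaling keeps the expected activation unchanged.
    mask = rng.random(activations.shape) < p_include
    return activations * mask / p_include

# Roughly half the units of a fully connected layer are kept each step.
h = np.ones(10)
h_dropped = dropout(h, p_include=0.5)
```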

&lt;h3&gt;
  &lt;a name=&quot;training&quot; href=&quot;#training&quot; class=&quot;anchor-link&quot;&gt;
    Training
  &lt;/a&gt;
&lt;/h3&gt;
&lt;p&gt;The model was trained using Stochastic Gradient Descent with batches of 250 samples over the entire training set for ~250 epochs. A relatively large batch size was used to smooth out the updates and make better use of the GPUs, while still getting some benefit from the perturbations of smaller batches.&lt;/p&gt;

&lt;p&gt;The network is trained to minimize a mean square error function. A learning rate scale was used on all weights and biases. The formula for the weights (per layer) was &lt;script type=&quot;math/tex&quot;&gt;\displaystyle 1 - \frac{1}{2} \left( \frac{ n_i }{n_{total}} \right)^2&lt;/script&gt; where &lt;script type=&quot;math/tex&quot;&gt;\displaystyle n_i&lt;/script&gt; is the current layer position and &lt;script type=&quot;math/tex&quot;&gt;\displaystyle n_{total}&lt;/script&gt; is the total number of layers. This scaling helped the earlier layers converge. All the biases had a learning rate multiplier of 2.0. Nesterov momentum was used with an initial value of 0.15 and increased to 0.7 over 45 epochs.&lt;/p&gt;
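&lt;p&gt;The per-layer learning rate scale follows directly from the formula above; the base learning rate here is an illustrative value, not the one used in training:&lt;/p&gt;

```python
def weight_lr_scale(layer_index, total_layers):
    # Per-layer multiplier from the formula above:
    # 1 - 0.5 * (n_i / n_total)^2, so later layers take
    # smaller steps than earlier ones.
    return 1.0 - 0.5 * (layer_index / total_layers) ** 2

base_lr = 0.01      # illustrative base learning rate
total_layers = 8
for i in range(1, total_layers + 1):
    lr_w = base_lr * weight_lr_scale(i, total_layers)
    lr_b = base_lr * 2.0  # biases use a 2x learning rate multiplier
```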

&lt;p&gt;Amazon g2.2xlarge EC2 instances were used to train the network with &lt;a href=&quot;https://developer.nvidia.com/cuDNN&quot;&gt;NVIDIA’s cuDNN&lt;/a&gt; library added in to speed up training. Training the final model took approximately 19 hours.&lt;/p&gt;

&lt;h3&gt;
  &lt;a name=&quot;hyperparameters&quot; href=&quot;#hyperparameters&quot; class=&quot;anchor-link&quot;&gt;
    Hyperparameters
  &lt;/a&gt;
&lt;/h3&gt;
&lt;p&gt;The majority of the hyperparameters were selected using an in-house hyperparameter optimization library that runs over clusters of Amazon g2.2xlarge instances. This was performed using a portion of the training dataset and the validation dataset. The process took roughly four weeks and evaluated ~500 different configurations.&lt;/p&gt;

&lt;h4&gt;
  &lt;a name=&quot;variations&quot; href=&quot;#variations&quot; class=&quot;anchor-link&quot;&gt;
    Variations
  &lt;/a&gt;
&lt;/h4&gt;
&lt;p&gt;Some things that did not work out well while working on this problem:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Used a larger batch size of 1000; this worked well but ran up against local minima quickly. The jitter provided by a smaller batch was useful for bouncing out of these minima.&lt;/li&gt;
  &lt;li&gt;Used a smaller convolutional network; it was alright but did not generalize as well as the larger convolutional network.&lt;/li&gt;
  &lt;li&gt;Tried the weight initialization formula suggested by He et al., &lt;script type=&quot;math/tex&quot;&gt;\displaystyle \frac{2}{n}&lt;/script&gt;. Unfortunately this caused the network to sputter around and fail to learn. It might be specific to this configuration, as many people have used it successfully.&lt;/li&gt;
  &lt;li&gt;Used the same amount of L2 regularization on all layers; it worked much better to vary the L2 regularization based on which layers started saturating or were clamped against max norm constraints.&lt;/li&gt;
  &lt;li&gt;Used pooling after the convolutional layers. Too much information was lost between layers, and the output images turned out grainy and poor looking.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The biggest lesson learned dealing with these larger networks is how important it is to get the weight initialization right. I feel this aspect, after a few other hyperparameters are chosen, has the largest impact on how well your model will train. It is a good idea to spend time researching the different initialization techniques to understand the impact each has on your model. There are many papers and machine learning libraries out there with different initialization schemes from which you can easily learn.&lt;/p&gt;

&lt;h2&gt;
  &lt;a name=&quot;applications&quot; href=&quot;#applications&quot; class=&quot;anchor-link&quot;&gt;
    Applications
  &lt;/a&gt;
&lt;/h2&gt;
&lt;p&gt;Our goal was not to remove or replace the need for other upscaling algorithms, such as bicubic upscaling, but to try to improve quality using different technology and avenues. Our primary use case was to scale lower resolution images up when no higher resolution images are available. This happens occasionally across our platforms.&lt;/p&gt;

&lt;p&gt;Besides the primary use case of still images, this technique can be applied to other media formats such as GIFs. The GIF could be split into its separate frames, each scaled up, and then repackaged.&lt;/p&gt;

&lt;p&gt;The final use case that we thought of was saving bandwidth. A smaller image could be sent to the client, which would run a client-side version of this model to produce a larger image. This could be accomplished using a custom solution or one of the JavaScript implementations of neural networks available, such as &lt;a href=&quot;http://cs.stanford.edu/people/karpathy/convnetjs/&quot;&gt;ConvNetJS&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  &lt;a name=&quot;furtherSteps&quot; href=&quot;#furtherSteps&quot; class=&quot;anchor-link&quot;&gt;
    Further steps
  &lt;/a&gt;
&lt;/h2&gt;
&lt;p&gt;We feel this problem space has a lot of potential and there are many things to try, including some wild ideas, such as:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Larger filter sizes in the convolutional layers.&lt;/li&gt;
  &lt;li&gt;Try more layers and data.&lt;/li&gt;
  &lt;li&gt;Try different color channel formats instead of RGB.&lt;/li&gt;
  &lt;li&gt;Try using hundreds of filters in the first convolutional layer, sample from them using dropout with a very small inclusion probability and try tweaking the learning rate of the layer.&lt;/li&gt;
  &lt;li&gt;Ditch the fully connected layers and try using all convolutional layers.&lt;/li&gt;
  &lt;li&gt;Curious if &lt;a href=&quot;http://arxiv.org/abs/1503.02531&quot;&gt;distillation&lt;/a&gt; would work with this problem. Might help create a lighter version to run on client devices easily.&lt;/li&gt;
  &lt;li&gt;Look into how small/large we can make the network before the quality starts degrading.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  &lt;a name=&quot;conclusion&quot; href=&quot;#conclusion&quot; class=&quot;anchor-link&quot;&gt;
    Conclusion
  &lt;/a&gt;
&lt;/h2&gt;
&lt;p&gt;Pursuing high fidelity presentation is difficult. As with any endeavor, it takes an exceptional amount of effort to squeeze out those final few percentage points of quality. We are constantly reflecting on our product to see where those percentage points can come from, even if they don’t seem obvious or possible at first. While this won’t have its place everywhere within our product, we feel it was a good cursory step forward in improving quality.&lt;/p&gt;

&lt;p&gt;I hope you enjoyed reading through this post and have taken something interesting away from it. I would like to thank everyone at Flipboard for an outstanding internship experience. I have learnt a lot, met many awesome people, and gained invaluable experience in the process.&lt;/p&gt;

&lt;p&gt;If machine learning, large datasets, and working with great people on interesting projects excites you then feel free to apply, &lt;a href=&quot;https://flipboard.com/careers/&quot;&gt;we are hiring&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Feel free to follow me on Twitter &lt;a href=&quot;https://twitter.com/normantasfi&quot;&gt;@normantasfi&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Special thanks to &lt;a href=&quot;https://twitter.com/charlietuna&quot;&gt;Charles&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/emilsjolander&quot;&gt;Emil&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/anhbmai&quot;&gt;Anh&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/mikeklaas&quot;&gt;Mike Klaas&lt;/a&gt;, and &lt;a href=&quot;https://twitter.com/bapjuseyo&quot;&gt;Michael Johnston&lt;/a&gt; for suggestions and edits throughout the process. Shoutout to &lt;a href=&quot;https://twitter.com/gregoryscallan&quot;&gt;Greg&lt;/a&gt; for always making time to help with server setups and questions.&lt;/em&gt;&lt;/p&gt;

&lt;script type=&quot;text/javascript&quot; src=&quot;https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&quot;&gt;
&lt;/script&gt;

</description>
                <pubDate>Wed, 06 May 2015 00:00:00 +0000</pubDate>
                <link>http://engineering.flipboard.com//2015/05/scaling-convnets</link>
                <guid isPermaLink="true">http://engineering.flipboard.com//2015/05/scaling-convnets</guid>
            </item>
        
            <item>
                <title>NSUserDefaults Performance Boost</title>
                <author>https://twitter.com/timonus (Tim Johnsen)</author>
                <description>&lt;p&gt;Since iOS 8 was released we’ve noticed some sluggishness when using &lt;a href=&quot;http://flipboard.com&quot;&gt;Flipboard&lt;/a&gt; in the simulator. When taking a trace with Instruments in normal use we noticed a significant amount of time was being spent in &lt;code class=&quot;highlighter-rouge&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;CFPreferences&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;.&lt;!--break--&gt;&lt;/p&gt;

&lt;center&gt;
    &lt;div class=&quot;row&quot;&gt;
        &lt;img src=&quot;/assets/nsuserdefaults-performance/before.png&quot; style=&quot;max-width:95%;&quot; /&gt;
    &lt;/div&gt;
&lt;/center&gt;

&lt;p&gt;On Twitter, an Apple engineer acknowledged that there were some changes in iOS 8 that added something called &lt;code class=&quot;highlighter-rouge&quot;&gt;cfprefsd&lt;/code&gt;, and that emulating that in the simulator required synchronous reads from disk. This seemed to be the bottleneck we were encountering.&lt;/p&gt;

&lt;center&gt;
	&lt;blockquote class=&quot;twitter-tweet&quot; lang=&quot;en&quot;&gt;&lt;p&gt;&lt;a href=&quot;https://twitter.com/timonus&quot;&gt;@timonus&lt;/a&gt; unfortunately cfprefsd doesn’t currently exist in the simulator, and emulating its behavior requires synchronize disk IO&lt;/p&gt;&amp;mdash; David Smith (@Catfish_Man) &lt;a href=&quot;https://twitter.com/Catfish_Man/status/557977673461268481&quot;&gt;January 21, 2015&lt;/a&gt;&lt;/blockquote&gt;
	&lt;blockquote class=&quot;twitter-tweet&quot; lang=&quot;en&quot; data-conversation=&quot;none&quot;&gt;&lt;p&gt;&lt;a href=&quot;https://twitter.com/timonus&quot;&gt;@timonus&lt;/a&gt; it’s on my list to fix, but it’s a large effort and device took priority&lt;/p&gt;&amp;mdash; David Smith (@Catfish_Man) &lt;a href=&quot;https://twitter.com/Catfish_Man/status/557977746215673856&quot;&gt;January 21, 2015&lt;/a&gt;&lt;/blockquote&gt;
	&lt;script async=&quot;&quot; src=&quot;//platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;/center&gt;

&lt;p&gt;About a month ago we were talking as a team about how sluggish our app felt when debugging in the simulator, so we decided to try to speed it up. Our approach was to introduce an in-memory, man-in-the-middle write-through cache in front of &lt;code class=&quot;highlighter-rouge&quot;&gt;NSUserDefaults&lt;/code&gt;. We do so by swizzling out all of &lt;code class=&quot;highlighter-rouge&quot;&gt;NSUserDefaults&lt;/code&gt;’ setters and getters and adding a per-instance &lt;code class=&quot;highlighter-rouge&quot;&gt;NSMutableDictionary&lt;/code&gt; to cache values. We avoid compiling this for device builds using &lt;code class=&quot;highlighter-rouge&quot;&gt;TARGET_IPHONE_SIMULATOR&lt;/code&gt;, because these performance issues don’t exist on device. The performance increase is dramatic: where we once spent 80% of our time in &lt;code class=&quot;highlighter-rouge&quot;&gt;CFPreferences&lt;/code&gt;, we now spend 1%.&lt;/p&gt;
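&lt;p&gt;The strategy itself is language-agnostic. Here is a minimal sketch in JavaScript (not the Objective-C category itself); &lt;code&gt;slowStore&lt;/code&gt; is a hypothetical stand-in for the disk-backed preferences store:&lt;/p&gt;

```javascript
// Write-through cache in front of a slow key-value store: reads are
// served from memory when possible; writes update memory first, then
// fall through to the slow store so it never goes stale.
function writeThroughCache(slowStore) {
  const memory = new Map();
  return {
    get(key) {
      if (memory.has(key)) return memory.get(key); // fast path: memory
      const value = slowStore.get(key);            // slow path: disk
      memory.set(key, value);
      return value;
    },
    set(key, value) {
      memory.set(key, value);                      // cache first
      slowStore.set(key, value);                   // then write through
    }
  };
}
```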

&lt;center&gt;
    &lt;div class=&quot;row&quot;&gt;
        &lt;img src=&quot;/assets/nsuserdefaults-performance/after.png&quot; style=&quot;max-width:95%;&quot; /&gt;
    &lt;/div&gt;
&lt;/center&gt;

&lt;p&gt;If you’re seeing performance issues related to &lt;code class=&quot;highlighter-rouge&quot;&gt;NSUserDefaults&lt;/code&gt; in the simulator, I recommend trying this out; it’s &lt;a href=&quot;https://github.com/Flipboard/NSUserDefaultsSimulatorPerformanceBoost&quot;&gt;open source and available for download on GitHub&lt;/a&gt;. To get started, all you need to do is include it in your project.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Please note that use of this category may cause side effects when debugging extensions. This is &lt;a href=&quot;https://github.com/Flipboard/NSUserDefaultsSimulatorPerformanceBoost/blob/master/README.md#extensions&quot;&gt;documented&lt;/a&gt; in the GitHub project.&lt;/em&gt;&lt;/p&gt;
</description>
                <pubDate>Mon, 16 Mar 2015 00:00:00 +0000</pubDate>
                <link>http://engineering.flipboard.com//2015/03/nsuserdefaults-performance</link>
                <guid isPermaLink="true">http://engineering.flipboard.com//2015/03/nsuserdefaults-performance</guid>
            </item>
        
            <item>
                <title>Introducing GoldenGate</title>
                <author>https://twitter.com/emilsjolander (Emil Sjölander)</author>
<description>&lt;p&gt;You might not know it, but both Flipboard for iOS and Flipboard for Android make heavy use of web views. We use web views so we can ensure consistent designs for our partners’ articles across all platforms. Communication between native code and the JavaScript running in a web view is both tedious to implement and very bug-prone, as it’s mostly just string concatenation. Today we are releasing a library that makes this task easier when developing Android applications that use web views.&lt;!--break--&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.github.com/flipboard/goldengate&quot; title=&quot;GoldenGate source on GitHub&quot;&gt;GoldenGate&lt;/a&gt; is an annotation processing library which generates Java wrappers around your JavaScript code. An annotation processing library is a piece of code which runs when you compile your code and has the ability to generate new Java classes. This means that GoldenGate can ensure at compile time that you are sending the correct types into the JavaScript functions you define in your bridge. If you’re interested in knowing more about how to write your own, &lt;a href=&quot;http://hannesdorfmann.com/annotation-processing/annotationprocessing101&quot; title=&quot;Annotation processing blog post&quot;&gt;this blog post&lt;/a&gt; is a good intro.&lt;/p&gt;

&lt;p&gt;Let’s get into a quick example to really show the power of the library! First of all a quick example of how to currently call a JavaScript function in a WebView.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;webview.loadUrl(&quot;javascript:alert(&quot; + myString + &quot;);&quot;);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You should clearly see that a lot can go wrong here that the compiler won’t catch. For example, you could misspell &lt;code class=&quot;highlighter-rouge&quot;&gt;alert&lt;/code&gt;, or maybe &lt;code class=&quot;highlighter-rouge&quot;&gt;myString&lt;/code&gt; isn’t a string at all.&lt;/p&gt;
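&lt;p&gt;To make the failure mode concrete, here is a small JavaScript sketch (this is not GoldenGate code): if the argument contains a quote, naive concatenation produces a broken call, while serializing the argument with &lt;code&gt;JSON.stringify&lt;/code&gt; yields a correctly escaped literal:&lt;/p&gt;

```javascript
// Naive concatenation breaks as soon as the argument contains a quote:
//   'javascript:alert(' + "it's fine" + ');'  ->  javascript:alert(it's fine);
// which is no longer valid JavaScript. Serializing the argument instead
// produces a valid, escaped string literal.
function jsCall(fn, arg) {
  return 'javascript:' + fn + '(' + JSON.stringify(arg) + ');';
}
```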

&lt;p&gt;GoldenGate allows for the compiler to have your back. The same code as above written with GoldenGate looks like this.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;@Bridge
interface JavaScript {
	void alert(String message);
}

JavaScript bridge = new JavaScriptBridge(webview);
bridge.alert(myString);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We started by defining an interface which describes the JavaScript methods we want to call from Java. This interface is annotated with &lt;code class=&quot;highlighter-rouge&quot;&gt;@Bridge&lt;/code&gt;, which is where the magic happens. This will generate a class, named &lt;code class=&quot;highlighter-rouge&quot;&gt;JavaScriptBridge&lt;/code&gt; in this case, which implements all the methods defined by the interface.&lt;/p&gt;

&lt;p&gt;There are a couple more options for advanced usage that we won’t cover in this blog post, but they are well documented over on &lt;a href=&quot;http://www.github.com/flipboard/goldengate&quot; title=&quot;GoldenGate source on GitHub&quot;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At Flipboard we love the type safety GoldenGate gives us, and we think it can help more developers using both native and web technology in their apps. Head over to the &lt;a href=&quot;http://www.github.com/flipboard/goldengate&quot; title=&quot;GoldenGate source on GitHub&quot;&gt;GitHub repository&lt;/a&gt; to check out the code. Pull requests are very welcome!&lt;/p&gt;

</description>
                <pubDate>Tue, 24 Feb 2015 00:00:00 +0000</pubDate>
                <link>http://engineering.flipboard.com//2015/02/golden-gate</link>
                <guid isPermaLink="true">http://engineering.flipboard.com//2015/02/golden-gate</guid>
            </item>
        
            <item>
                <title>60fps on the mobile web</title>
                <author>https://twitter.com/bapjuseyo (Michael Johnston)</author>
                <description>&lt;p class=&quot;deck&quot;&gt;
Flipboard launched during the dawn of the smartphone and tablet as a mobile-first experience, allowing us to rethink content layout principles from the web for a more elegant user experience on a variety of touchscreen form factors.
&lt;/p&gt;

&lt;p&gt;Now we’re coming full circle and bringing Flipboard to the web. Much of what we do at Flipboard has value independent of what device it’s consumed on: curating the best stories from all the topics, sources, and people that you care about most. Bringing our service to the web was always a logical extension.&lt;/p&gt;

&lt;p&gt;As we began to tackle the project, we knew we wanted to adapt our thinking from our mobile experience to try and elevate content layout and interaction on the web. We wanted to match the polish and performance of our native apps&lt;!--break--&gt;, but in a way that felt true to the browser.&lt;/p&gt;

&lt;p&gt;Early on, after testing numerous prototypes, we decided our web experience should scroll. Our mobile apps are known for their book-like pagination metaphor, something that feels intuitive on a touch screen, but for a variety of reasons, scrolling feels most natural on the web.&lt;/p&gt;

&lt;p&gt;In order to &lt;a href=&quot;http://www.html5rocks.com/en/tutorials/speed/scrolling/&quot;&gt;optimize scrolling performance&lt;/a&gt;, we knew that we needed to keep paint times below 16ms and limit reflows and repaints. This is especially important during animations. To avoid painting during animations there are two properties you can safely animate: CSS transform and opacity. But that really limits your options.&lt;/p&gt;

&lt;p&gt;What if you want to animate the width of an element?&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/assets/mobileweb/follow_btn.gif&quot; style=&quot;border: 1px solid #ddd;&quot; alt=&quot;Follow button animation&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;How about a frame-by-frame scrolling animation?&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/assets/mobileweb/topbar.gif&quot; style=&quot;border: 1px solid #ddd;&quot; alt=&quot;Clipping animation&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;(Notice in the above image that the icons at the top transition from white to black. These are 2 separate elements overlaid on each other whose bounding boxes are clipped depending on the content beneath.)&lt;/p&gt;

&lt;p&gt;These types of animations have always suffered from &lt;a href=&quot;http://jankfree.org/&quot;&gt;jank&lt;/a&gt; on the web, particularly on mobile devices, for one simple reason:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The DOM is too slow.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s not just slow, it’s really slow. If you touch the DOM in any way during an animation you’ve already blown through your 16ms frame budget.&lt;/p&gt;

&lt;h2&gt;
  &lt;a name=&quot;EnterCanvas&quot; href=&quot;#EnterCanvas&quot; class=&quot;anchor-link&quot;&gt;
    Enter &amp;lt;canvas&amp;gt;
  &lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Most modern mobile devices have hardware-accelerated canvas, so why couldn’t we take advantage of this? &lt;a href=&quot;http://chrome.angrybirds.com/&quot;&gt;HTML5 games&lt;/a&gt; certainly do.  But could we really develop an application user interface in canvas?&lt;/p&gt;

&lt;h3&gt;
  &lt;a name=&quot;DrawingModes&quot; href=&quot;#DrawingModes&quot; class=&quot;anchor-link&quot;&gt;
    Immediate mode vs. retained mode
  &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Canvas is an immediate mode drawing API, meaning that the drawing surface retains no information about the objects drawn into it. This is in contrast to retained mode, a declarative style of API in which the graphics system maintains a hierarchy of the objects drawn into it.&lt;/p&gt;

&lt;p&gt;The advantage of retained mode APIs is that complex scenes, such as the DOM for your application, are typically easier to construct with them. This often comes with a performance cost, though, as additional memory is required to hold the scene and updating the scene can be slow.&lt;/p&gt;

&lt;p&gt;Canvas benefits from the immediate mode approach by allowing drawing commands to be sent directly to the GPU. But using it to build user interfaces requires a higher level abstraction to be productive. For instance something as simple as drawing one element on top of another can be problematic when resources load asynchronously, such as drawing text on top of an image. In HTML this is easily achieved with the ordering of elements or z-index in CSS.&lt;/p&gt;

&lt;h2&gt;
  &lt;a name=&quot;CanvasUI&quot; href=&quot;#CanvasUI&quot; class=&quot;anchor-link&quot;&gt;
    Building a UI in &amp;lt;canvas&amp;gt;
  &lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Canvas lacks many of the abilities taken for granted in HTML + CSS.&lt;/p&gt;

&lt;h3&gt;
  &lt;a name=&quot;CanvasText&quot; href=&quot;#CanvasText&quot; class=&quot;anchor-link&quot;&gt;
    Text
  &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;There is a single API for drawing text: &lt;code class=&quot;highlighter-rouge&quot;&gt;fillText(text, x, y [, maxWidth])&lt;/code&gt;. It takes the text string, the x-y coordinates at which to begin drawing, and an optional maximum width. But canvas can only draw a single line of text at a time. If you want text wrapping, you need to write your own function.&lt;/p&gt;
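&lt;p&gt;A minimal word-wrapping function might look like the following sketch; &lt;code&gt;measure&lt;/code&gt; stands in for &lt;code&gt;ctx.measureText(text).width&lt;/code&gt; so the wrapping logic can be shown without a canvas:&lt;/p&gt;

```javascript
// Greedy word wrap: keep adding words to the current line until the
// measured width exceeds maxWidth, then start a new line. Each returned
// entry becomes one fillText(line, x, y + i * lineHeight) call.
function wrapText(text, maxWidth, measure) {
  const lines = [];
  let line = '';
  for (const word of text.split(' ')) {
    const candidate = line ? line + ' ' + word : word;
    if (line && measure(candidate) > maxWidth) {
      lines.push(line); // current line is full
      line = word;      // start the next one with this word
    } else {
      line = candidate;
    }
  }
  if (line) lines.push(line);
  return lines;
}
```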

&lt;h3&gt;
  &lt;a name=&quot;CanvasImages&quot; href=&quot;#CanvasImages&quot; class=&quot;anchor-link&quot;&gt;
    Images
  &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;To draw an image into a canvas you call &lt;code class=&quot;highlighter-rouge&quot;&gt;drawImage()&lt;/code&gt;. This is a variadic function: the more arguments you specify, the more control you have over positioning and clipping. But canvas does not care whether the image has loaded or not, so make sure you call it only after the image’s load event.&lt;/p&gt;

&lt;h3&gt;
  &lt;a name=&quot;CanvasOverlapping&quot; href=&quot;#CanvasOverlapping&quot; class=&quot;anchor-link&quot;&gt;
    Overlapping elements
  &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;In HTML and CSS it’s easy to specify that one element should be rendered on top of another by using the order of the elements in the DOM or CSS z-index. But remember, canvas is an immediate mode drawing API. When elements overlap and either one of them needs to be redrawn, both have to be redrawn in the same order (or at least the dirtied parts).&lt;/p&gt;
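&lt;p&gt;One way to honor that rule is to collect every layer whose frame intersects the dirty rectangle and repaint those layers in ascending z order. A sketch, with frames expressed as &lt;code&gt;[x, y, width, height]&lt;/code&gt;:&lt;/p&gt;

```javascript
// Axis-aligned rectangle intersection test for [x, y, width, height] frames.
function intersects(a, b) {
  return a[0] < b[0] + b[2] && b[0] < a[0] + a[2] &&
         a[1] < b[1] + b[3] && b[1] < a[1] + a[3];
}

// Every layer touching the dirty rect must be repainted, back to front.
function layersToRedraw(layers, dirtyRect) {
  return layers
    .filter(layer => intersects(layer.frame, dirtyRect))
    .sort((a, b) => (a.zIndex || 0) - (b.zIndex || 0));
}
```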

&lt;h3&gt;
  &lt;a name=&quot;CanvasFonts&quot; href=&quot;#CanvasFonts&quot; class=&quot;anchor-link&quot;&gt;
    Custom fonts
  &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Need to use a custom web font? The canvas text API does not care if a font has loaded or not. You need a way to know when a font has loaded, and redraw any regions that rely on that font. Fortunately, modern browsers have a &lt;a href=&quot;http://dev.w3.org/csswg/css-font-loading/&quot;&gt;promise-based API&lt;/a&gt; for doing just that. Unfortunately, iOS WebKit (iOS 8 at the time of this writing) does not support it.&lt;/p&gt;

&lt;h3&gt;
  &lt;a name=&quot;CanvasBenefits&quot; href=&quot;#CanvasBenefits&quot; class=&quot;anchor-link&quot;&gt;
    Benefits of &amp;lt;canvas&amp;gt;
  &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Given all these drawbacks, one might begin to question selecting the canvas approach over the DOM. In the end, our decision came down to one simple truth:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You cannot build a 60fps scrolling list view with DOM.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Many (including us) have tried and failed. Scrollable elements are possible in pure HTML and CSS with &lt;code class=&quot;highlighter-rouge&quot;&gt;overflow: scroll&lt;/code&gt; (combined with &lt;code class=&quot;highlighter-rouge&quot;&gt;-webkit-overflow-scrolling: touch&lt;/code&gt; on iOS), but these do not give you frame-by-frame control over the scrolling animation, and mobile browsers have a difficult time with long, complex content.&lt;/p&gt;

&lt;p&gt;In order to build an infinitely scrolling list with reasonably complex content, we needed the equivalent of &lt;a href=&quot;https://developer.apple.com/library/ios/documentation/UIKit/Reference/UITableView_Class/index.html&quot;&gt;UITableView&lt;/a&gt; for the web.&lt;/p&gt;

&lt;p&gt;In contrast to the DOM, most devices today have hardware accelerated canvas implementations which send drawing commands directly to the GPU. This means we could render elements incredibly fast; we’re talking sub-millisecond range in many cases.&lt;/p&gt;

&lt;p&gt;Canvas is also a very small API when compared to HTML + CSS, reducing the surface area for bugs or inconsistencies between browsers. There’s a reason there is no &lt;a href=&quot;http://caniuse.com/&quot;&gt;Can I Use?&lt;/a&gt; equivalent for canvas.&lt;/p&gt;

&lt;h2&gt;
  &lt;a name=&quot;FasterDOM&quot; href=&quot;#FasterDOM&quot; class=&quot;anchor-link&quot;&gt;
    A faster DOM abstraction
  &lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;As mentioned earlier, in order to be somewhat productive, we needed a higher level of abstraction than simply drawing rectangles, text and images in immediate mode. We built a very small abstraction that allows a developer to deal with a tree of nodes, rather than a strict sequence of drawing commands.&lt;/p&gt;

&lt;h3&gt;
  &lt;a name=&quot;RenderLayer&quot; href=&quot;#RenderLayer&quot; class=&quot;anchor-link&quot;&gt;
    RenderLayer
  &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;A RenderLayer is the base node by which other nodes build upon. Common properties such as top, left, width, height, backgroundColor and zIndex are expressed at this level. A RenderLayer is nothing more than a plain JavaScript object containing these properties and an array of children.&lt;/p&gt;

&lt;h3&gt;
  &lt;a name=&quot;RenderImage&quot; href=&quot;#RenderImage&quot; class=&quot;anchor-link&quot;&gt;
    Image
  &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;There are Image layers which have additional properties to specify the image URL and cropping information. You don’t have to worry about listening for the image load event, as the Image layer will do this for you and send a signal to the drawing engine that it needs to update.&lt;/p&gt;

&lt;h3&gt;
  &lt;a name=&quot;RenderText&quot; href=&quot;#RenderText&quot; class=&quot;anchor-link&quot;&gt;
    Text
  &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Text layers have the ability to render multi-line truncated text, something which is incredibly expensive to do in DOM. Text layers also support custom font faces, and will do the work of updating when the font loads.&lt;/p&gt;

&lt;h3&gt;
  &lt;a name=&quot;RenderComposition&quot; href=&quot;#RenderComposition&quot; class=&quot;anchor-link&quot;&gt;
    Composition
  &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;These layers can be composed to build complex interfaces. Here is an example of a RenderLayer tree:&lt;/p&gt;

&lt;pre&gt;
{
  frame: [0, 0, 320, 480],
  backgroundColor: '#fff',
  children: [
    {
      type: 'image',
      frame: [0, 0, 320, 200],
      imageUrl: 'http://lorempixel.com/360/420/cats/1/'
    },
    {
      type: 'text',
      frame: [10, 210, 300, 260],
      text: 'Lorem ipsum...',
      fontSize: 18,
      lineHeight: 24
    }
  ]
}
&lt;/pre&gt;

&lt;h3&gt;
  &lt;a name=&quot;InvalidatingLayers&quot; href=&quot;#InvalidatingLayers&quot; class=&quot;anchor-link&quot;&gt;
    Invalidating layers
  &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;When a layer needs to be redrawn, for instance after an image loads, it sends a signal to the drawing engine that its frame is dirty. Changes are batched using &lt;code class=&quot;highlighter-rouge&quot;&gt;requestAnimationFrame&lt;/code&gt; to avoid &lt;a href=&quot;http://wilsonpage.co.uk/preventing-layout-thrashing/&quot;&gt;layout thrashing&lt;/a&gt;, and in the next frame the canvas redraws.&lt;/p&gt;
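&lt;p&gt;The batching itself can be expressed independently of the browser. In the following sketch, &lt;code&gt;schedule&lt;/code&gt; stands in for &lt;code&gt;requestAnimationFrame&lt;/code&gt;: however many layers dirty themselves within one frame, &lt;code&gt;redraw&lt;/code&gt; runs only once:&lt;/p&gt;

```javascript
// Batch invalidations so the canvas redraws at most once per frame.
// `schedule` stands in for requestAnimationFrame.
function makeInvalidator(redraw, schedule) {
  const dirty = new Set();
  let scheduled = false;
  return function invalidate(layer) {
    dirty.add(layer);     // duplicate invalidations collapse automatically
    if (!scheduled) {
      scheduled = true;   // only one frame callback in flight at a time
      schedule(() => {
        scheduled = false;
        const batch = Array.from(dirty);
        dirty.clear();
        redraw(batch);    // one redraw for the whole batch
      });
    }
  };
}
```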

&lt;h2&gt;
  &lt;a name=&quot;Scrolling60fps&quot; href=&quot;#Scrolling60fps&quot; class=&quot;anchor-link&quot;&gt;
    Scrolling at 60fps
  &lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Perhaps the one aspect of the web we take for granted the most is how a browser scrolls a web page. Browser vendors have gone to &lt;a href=&quot;http://www.chromium.org/developers/design-documents/gpu-accelerated-compositing-in-chrome&quot;&gt;great lengths&lt;/a&gt; to improve scrolling performance.&lt;/p&gt;

&lt;p&gt;It comes with a tradeoff, though. In order to scroll at 60fps on mobile, browsers used to halt JavaScript execution during scrolling for fear of DOM modifications causing reflow. Recently, iOS and Android have exposed &lt;code class=&quot;highlighter-rouge&quot;&gt;onscroll&lt;/code&gt; events that work more like they do in desktop browsers, but your mileage may vary if you are trying to keep DOM elements synchronized with the scroll position.&lt;/p&gt;

&lt;p&gt;Luckily, browser vendors are aware of the problem. In particular, the Chrome team has been &lt;a href=&quot;http://updates.html5rocks.com/2014/05/A-More-Compatible-Smoother-Touch&quot;&gt;open&lt;/a&gt; about its efforts to improve this situation on mobile.&lt;/p&gt;

&lt;p&gt;Turning back to canvas, the short answer is you have to implement scrolling in JavaScript.&lt;/p&gt;

&lt;p&gt;The first thing you need is a way to compute scrolling momentum. If you don’t want to do the math the folks at Zynga open sourced a &lt;a href=&quot;https://github.com/zynga/scroller&quot;&gt;pure logic scroller&lt;/a&gt; that fits well with any layout approach.&lt;/p&gt;

&lt;p&gt;Our scrolling technique uses a single canvas element. At each touch event, the current render tree is updated by translating each node by the current scroll offset. The entire render tree is then redrawn with the new frame coordinates.&lt;/p&gt;

&lt;p&gt;This sounds like it would be incredibly slow, but there is an important optimization technique that can be used in canvas where the result of drawing operations can be cached in an off-screen canvas. The off-screen canvas can then be used to redraw that layer at a later time.&lt;/p&gt;

&lt;p&gt;This technique can be used not just for image layers, but text and shapes as well. The two most expensive drawing operations are filling text and drawing images. But once these layers are drawn once, it is very fast to redraw them using an off-screen canvas.&lt;/p&gt;
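&lt;p&gt;A sketch of that caching layer; &lt;code&gt;createSurface&lt;/code&gt; and &lt;code&gt;draw&lt;/code&gt; are hypothetical stand-ins for creating an off-screen canvas and for a layer’s real drawing code:&lt;/p&gt;

```javascript
// Draw each layer into an off-screen surface once, then blit the cached
// surface on every subsequent frame.
function makeLayerCache(createSurface, draw) {
  const cache = new Map();
  return function drawCached(layer, ctx) {
    let surface = cache.get(layer);
    if (!surface) {
      surface = createSurface(layer.frame[2], layer.frame[3]);
      draw(layer, surface.getContext('2d')); // expensive: runs once per layer
      cache.set(layer, surface);
    }
    ctx.drawImage(surface, layer.frame[0], layer.frame[1]); // cheap blit
  };
}
```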

&lt;p&gt;In the demonstration below, each page of content is divided into two layers: an image layer and a text layer. The text layer contains multiple elements that are grouped together. At each frame in the scrolling animation, the two layers are redrawn using cached bitmaps.&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/assets/mobileweb/scrolling.gif&quot; style=&quot;border: 1px solid #ddd; width: 320px; height: auto;&quot; alt=&quot;Scrolling timeline&quot; /&gt;
&lt;/p&gt;

&lt;h3&gt;
  &lt;a name=&quot;ObjectPooling&quot; href=&quot;#ObjectPooling&quot; class=&quot;anchor-link&quot;&gt;
    Object pooling
  &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;During the course of scrolling through an infinite list of items, a significant number of RenderLayers must be set up and torn down. This can create a lot of garbage, which would halt the main thread when collected.&lt;/p&gt;

&lt;p&gt;To reduce the amount of garbage created, RenderLayers and associated objects are aggressively pooled. This means only a relatively small number of layer objects are ever created. When a layer is no longer needed, it is released back into the pool, where it can later be reused.&lt;/p&gt;
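&lt;p&gt;A minimal pool looks something like this sketch (the real implementation has more bookkeeping):&lt;/p&gt;

```javascript
// Reuse layer objects instead of allocating during scroll, so the
// garbage collector has nothing to collect mid-animation.
class LayerPool {
  constructor(create) {
    this.create = create; // factory for brand-new layers
    this.free = [];       // released layers awaiting reuse
  }
  get() {
    return this.free.pop() || this.create(); // reuse before allocating
  }
  release(layer) {
    layer.children = []; // reset state so stale data can't leak out
    this.free.push(layer);
  }
}
```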

&lt;h3&gt;
  &lt;a name=&quot;FastSnapshotting&quot; href=&quot;#FastSnapshotting&quot; class=&quot;anchor-link&quot;&gt;
    Fast snapshotting
  &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The ability to cache composite layers leads to another advantage: the ability to treat portions of rendered structures as a bitmap. Have you ever needed to take a snapshot of only part of a DOM structure? That’s incredibly fast and easy when you render that structure in canvas.&lt;/p&gt;

&lt;p&gt;The UI for flipping an item into a magazine leverages this ability to perform a smooth transition from the timeline. The snapshot contains the entire item, minus the top and bottom chrome.&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/assets/mobileweb/flip_ui.gif&quot; style=&quot;border: 1px solid #ddd;&quot; alt=&quot;Flipping an item into a magazine&quot; /&gt;
&lt;/p&gt;

&lt;h2&gt;
  &lt;a name=&quot;DeclarativeAPI&quot; href=&quot;#DeclarativeAPI&quot; class=&quot;anchor-link&quot;&gt;
    A declarative API
  &lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;We had the basic building blocks of an application now. However, imperatively constructing a tree of RenderLayers could be tedious. Wouldn’t it be nice to have a declarative API, similar to how the DOM worked?&lt;/p&gt;

&lt;h3&gt;
  &lt;a name=&quot;React&quot; href=&quot;#React&quot; class=&quot;anchor-link&quot;&gt;
    React
  &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;We are big fans of &lt;a href=&quot;http://facebook.github.io/react/&quot;&gt;React&lt;/a&gt;. Its single directional data flow and declarative API have changed the way people build apps. The most compelling feature of React is the virtual DOM. The fact that it renders to HTML in a browser container is simply an implementation detail. The recent introduction of &lt;a href=&quot;https://www.youtube.com/watch?v=KVZ-P-ZI6W4&quot;&gt;React Native&lt;/a&gt; proves this out.&lt;/p&gt;

&lt;p&gt;What if we could bind our canvas layout engine to React components?&lt;/p&gt;

&lt;h2&gt;
  &lt;a name=&quot;ReactCanvas&quot; href=&quot;#ReactCanvas&quot; class=&quot;anchor-link&quot;&gt;
    Introducing React Canvas
  &lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/flipboard/react-canvas&quot;&gt;React Canvas&lt;/a&gt; adds the ability for React components to render to &lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;lt;canvas&amp;gt;&lt;/code&gt; rather than DOM.&lt;/p&gt;

&lt;p&gt;The first version of the canvas layout engine looked very much like imperative view code. If you’ve ever done DOM construction in JavaScript you’ve probably run across code like this:&lt;/p&gt;

&lt;pre&gt;
// Create the parent layer
var root = RenderLayer.getPooled();
root.frame = [0, 0, 320, 480];

// Add an image
var image = RenderLayer.getPooled('image');
image.frame = [0, 0, 320, 200];
image.imageUrl = 'http://lorempixel.com/360/420/cats/1/';
root.addChild(image);

// Add some text
var label = RenderLayer.getPooled('text');
label.frame = [10, 210, 300, 260];
label.text = 'Lorem ipsum...';
label.fontSize = 18;
label.lineHeight = 24;
root.addChild(label);
&lt;/pre&gt;

&lt;p&gt;Sure, this works, but who wants to write code this way? In addition to being error-prone, it’s difficult to visualize the rendered structure.&lt;/p&gt;

&lt;p&gt;With React Canvas this becomes:&lt;/p&gt;

&lt;pre&gt;
var MyComponent = React.createClass({
  render: function () {
    return (
      &amp;lt;Group style={styles.group}&amp;gt;
        &amp;lt;Image style={styles.image} src='http://...' /&amp;gt;
        &amp;lt;Text style={styles.text}&amp;gt;
          Lorem ipsum...
        &amp;lt;/Text&amp;gt;
      &amp;lt;/Group&amp;gt;
    );
  }
});

var styles = {
  group: {
    left: 0,
    top: 0,
    width: 320,
    height: 480
  },

  image: {
    left: 0,
    top: 0,
    width: 320,
    height: 200
  },

  text: {
    left: 10,
    top: 210,
    width: 300,
    height: 260,
    fontSize: 18,
    lineHeight: 24
  }
};
&lt;/pre&gt;

&lt;p&gt;You may notice that everything appears to be absolutely positioned. That’s correct. Our canvas rendering engine was born out of the need to drive pixel-perfect layouts with multi-line ellipsized text. This cannot be done with conventional CSS, so an approach where everything is absolutely positioned fit well for us. However, this approach is not well-suited for all applications.&lt;/p&gt;
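&lt;p&gt;To illustrate the kind of text handling involved, here is a minimal sketch of greedy line-breaking with ellipsis truncation. This is a hypothetical example, not the React Canvas internals: the &lt;code&gt;measureWidth&lt;/code&gt; callback stands in for the canvas context’s &lt;code&gt;measureText()&lt;/code&gt; so the logic can be shown without a browser.&lt;/p&gt;

```javascript
// Hypothetical sketch (not the React Canvas internals): greedily lay words
// into lines using a caller-supplied measure function, truncating the last
// line with an ellipsis once maxLines is reached.
function ellipsize(text, maxWidth, maxLines, measureWidth) {
  var words = text.split(' ');
  var lines = [];
  var current = '';
  for (var i = 0; i < words.length; i++) {
    var candidate = current ? current + ' ' + words[i] : words[i];
    if (!current || measureWidth(candidate) <= maxWidth) {
      current = candidate;
    } else if (lines.length === maxLines - 1) {
      // Out of vertical room: trim the current line until the ellipsis fits.
      while (current.length > 0 && measureWidth(current + '…') > maxWidth) {
        current = current.slice(0, -1);
      }
      lines.push(current + '…');
      return lines;
    } else {
      lines.push(current);
      current = words[i];
    }
  }
  lines.push(current);
  return lines;
}
```

&lt;p&gt;In a real renderer the measure function would wrap &lt;code&gt;ctx.measureText(str).width&lt;/code&gt; for the layer’s font.&lt;/p&gt;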

&lt;h3&gt;
  &lt;a name=&quot;css-layout&quot; href=&quot;#css-layout&quot; class=&quot;anchor-link&quot;&gt;
    css-layout
  &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Facebook recently open sourced its &lt;a href=&quot;https://github.com/facebook/css-layout&quot;&gt;JavaScript implementation of CSS&lt;/a&gt;. It supports a subset of CSS, including margin, padding, position, and, most importantly, flexbox.&lt;/p&gt;

&lt;p&gt;Integrating css-layout into React Canvas was a matter of hours. Check out the &lt;a href=&quot;https://github.com/flipboard/react-canvas/blob/master/examples/css-layout/app.js&quot;&gt;example&lt;/a&gt; to see how this changes the way components are styled.&lt;/p&gt;
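&lt;p&gt;As a rough sketch of what that change looks like, the absolutely positioned styles from earlier could instead lean on flexbox. The style keys below (&lt;code&gt;flexDirection&lt;/code&gt;, &lt;code&gt;flex&lt;/code&gt;, margins, padding) are from the css-layout subset; the structure itself is illustrative rather than taken from the linked example.&lt;/p&gt;

```javascript
// Illustrative styles in the css-layout subset: rather than hard-coding
// left/top/width/height on every layer, children are sized and positioned
// by the flexbox algorithm.
var styles = {
  container: {
    flexDirection: 'row', // lay out children horizontally
    padding: 10
  },
  image: {
    width: 100,
    height: 100,
    marginRight: 10
  },
  // The text column flexes to fill the remaining horizontal space.
  body: {
    flex: 1
  }
};
```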

&lt;h2&gt;
  &lt;a name=&quot;DeclarativeInfiniteScrolling&quot; href=&quot;#DeclarativeInfiniteScrolling&quot; class=&quot;anchor-link&quot;&gt;
    Declarative infinite scrolling
  &lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;How do you create a &lt;em&gt;60fps&lt;/em&gt; infinite, paginated scrolling list in React Canvas?&lt;/p&gt;

&lt;p&gt;It turns out this is quite easy because of React’s diffing of the virtual DOM. In &lt;code class=&quot;highlighter-rouge&quot;&gt;render()&lt;/code&gt; only the currently visible elements are returned and React takes care of updating the virtual DOM tree as needed during scrolling.&lt;/p&gt;

&lt;pre&gt;
var ListView = React.createClass({
  getInitialState: function () {
    return {
      scrollTop: 0
    };
  },

  render: function () {
    var items = this.getVisibleItemIndexes().map(this.renderItem);
    return (
      &amp;lt;Group
        onTouchStart={this.handleTouchStart}
        onTouchMove={this.handleTouchMove}
        onTouchEnd={this.handleTouchEnd}
        onTouchCancel={this.handleTouchEnd}&amp;gt;
        {items}
      &amp;lt;/Group&amp;gt;
    );
  },

  renderItem: function (itemIndex) {
    // Wrap each item in a &amp;lt;Group&amp;gt; which is translated up/down based on
    // the current scroll offset.
    var translateY = (itemIndex * itemHeight) - this.state.scrollTop;
    var style = { translateY: translateY };
    return (
      &amp;lt;Group style={style} key={itemIndex}&amp;gt;
        &amp;lt;Item /&amp;gt;
      &amp;lt;/Group&amp;gt;
    );
  },

  getVisibleItemIndexes: function () {
    // Compute the visible item indexes based on `this.state.scrollTop`.
  }
});
&lt;/pre&gt;
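&lt;p&gt;The &lt;code&gt;getVisibleItemIndexes()&lt;/code&gt; stub above is left open; one possible implementation for fixed-height rows might look like the following. The &lt;code&gt;viewportHeight&lt;/code&gt; and &lt;code&gt;itemCount&lt;/code&gt; parameters are hypothetical, pulled out as arguments to keep the sketch self-contained (the real ListView derives these from its props).&lt;/p&gt;

```javascript
// A possible fixed-row-height implementation of the getVisibleItemIndexes()
// stub: compute the first and last rows that intersect the viewport window
// [scrollTop, scrollTop + viewportHeight) and return their indexes.
function getVisibleItemIndexes(scrollTop, viewportHeight, itemHeight, itemCount) {
  var first = Math.max(0, Math.floor(scrollTop / itemHeight));
  var last = Math.min(
    itemCount - 1,
    Math.ceil((scrollTop + viewportHeight) / itemHeight) - 1
  );
  var indexes = [];
  for (var i = first; i <= last; i++) {
    indexes.push(i);
  }
  return indexes;
}
```

&lt;p&gt;A common refinement is to overscan by an item or two beyond the viewport on each side, so rows are already rendered by the time a fast fling brings them on screen.&lt;/p&gt;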

&lt;p&gt;To hook up the scrolling, we use the &lt;a href=&quot;https://github.com/zynga/scroller&quot;&gt;Scroller&lt;/a&gt; library to &lt;code class=&quot;highlighter-rouge&quot;&gt;setState()&lt;/code&gt; on our ListView component.&lt;/p&gt;

&lt;pre&gt;
...

// Create the Scroller instance on mount.
componentDidMount: function () {
  this.scroller = new Scroller(this.handleScroll);
},

// This is called by the Scroller at each scroll event.
handleScroll: function (left, top) {
  this.setState({ scrollTop: top });
},

handleTouchStart: function (e) {
  this.scroller.doTouchStart(e.touches, e.timeStamp);
},

handleTouchMove: function (e) {
  e.preventDefault();
  this.scroller.doTouchMove(e.touches, e.timeStamp, e.scale);
},

handleTouchEnd: function (e) {
  this.scroller.doTouchEnd(e.timeStamp);
}

...
&lt;/pre&gt;

&lt;p&gt;Though this is a simplified version, it showcases some of React’s best qualities. Touch events are declaratively bound in &lt;code&gt;render()&lt;/code&gt;. Each touchmove event is forwarded to the Scroller, which computes the current scroll top offset. Each scroll event emitted from the Scroller updates the state of the ListView component, which renders only the currently visible items on screen. All of this happens in under 16ms because &lt;a href=&quot;http://calendar.perfplanet.com/2013/diff/&quot;&gt;React’s diffing algorithm is very fast&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;See the &lt;a href=&quot;https://github.com/flipboard/react-canvas/blob/master/lib/ListView.js&quot;&gt;ListView source code&lt;/a&gt; for the complete implementation.&lt;/p&gt;

&lt;h2&gt;
  &lt;a name=&quot;PracticalApplications&quot; href=&quot;#PracticalApplications&quot; class=&quot;anchor-link&quot;&gt;
    Practical applications
  &lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;React Canvas is not meant to completely replace the DOM. We utilize it in performance-critical rendering paths in our mobile web app, primarily the scrolling timeline view.&lt;/p&gt;

&lt;p&gt;Where rendering performance is not a concern, DOM may be a better approach. In fact, it’s the only approach for certain elements such as input fields and audio/video.&lt;/p&gt;

&lt;p&gt;In a sense, Flipboard for mobile web is a hybrid application. Rather than blending native and web technologies, it’s all web content. It mixes DOM-based UI with canvas rendering where appropriate.&lt;/p&gt;

&lt;h2&gt;
  &lt;a name=&quot;Accessibility&quot; href=&quot;#Accessibility&quot; class=&quot;anchor-link&quot;&gt;
    A word on accessibility
  &lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;This area needs further exploration. Using fallback content (the canvas DOM sub-tree) should allow screen readers such as VoiceOver to interact with the content. We’ve seen mixed results with the devices we’ve tested. Additionally there is a standard for &lt;a href=&quot;http://www.w3.org/TR/2010/WD-2dcontext-20100304/#dom-context-2d-drawfocusring&quot;&gt;focus management&lt;/a&gt; that is not supported by browsers yet.&lt;/p&gt;

&lt;p&gt;One approach that was raised by &lt;a href=&quot;http://vimeo.com/3195079&quot;&gt;Bespin&lt;/a&gt; in 2009 is to keep a &lt;a href=&quot;http://robertnyman.com/2009/04/03/mozilla-labs-online-code-editor-bespin/#comment-560310&quot;&gt;parallel DOM&lt;/a&gt; in sync with the elements rendered in canvas. We are continuing to investigate the right approach to accessibility.&lt;/p&gt;

&lt;h2&gt;
  &lt;a name=&quot;Conclusion&quot; href=&quot;#Conclusion&quot; class=&quot;anchor-link&quot;&gt;
    Conclusion
  &lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;In the pursuit of &lt;em&gt;60fps&lt;/em&gt; we sometimes resort to extreme measures. Flipboard for mobile web is a case study in pushing the browser to its limits. While this approach may not be suitable for all applications, for us it’s enabled a level of interaction and performance that rivals native apps. We hope that by releasing the work we’ve done with &lt;a href=&quot;https://github.com/flipboard/react-canvas&quot;&gt;React Canvas&lt;/a&gt;, other compelling use cases might emerge.&lt;/p&gt;

&lt;p&gt;Head on over to &lt;a href=&quot;https://flipboard.com/&quot;&gt;flipboard.com&lt;/a&gt; on your phone to see what we’ve built, or if you don’t have a Flipboard account, check out a &lt;a href=&quot;https://flipboard.com/@flipboard/flipboard-picks-8a1uu7ngz&quot;&gt;couple&lt;/a&gt; of &lt;a href=&quot;https://flipboard.com/@flipboard/ten-for-today-k6ln1khuz&quot;&gt;magazines&lt;/a&gt; to get a taste of Flipboard on the web. Let us know what you think.&lt;/p&gt;
</description>
                <pubDate>Tue, 10 Feb 2015 00:00:00 +0000</pubDate>
                <link>http://engineering.flipboard.com//2015/02/mobile-web</link>
                <guid isPermaLink="true">http://engineering.flipboard.com//2015/02/mobile-web</guid>
            </item>
        
    </channel>
</rss>
