What is the problem and why is it important?

As data gets more and more concentrated on the servers of a few giant private companies, analysing the patterns in which that data is accessed can reveal private information about end users even when the data itself is not secret. Netflix and YouTube are good examples: they deliver public content but can see which subset of it each user chooses to interact with. For an average individual this may look like a negligible leak, but at hyperscaler volumes these access patterns can be used to infer private information about users, often unrelated to their watch habits, that they would never disclose voluntarily. One way to avoid the problem is to stop interacting with any service that can observe your data access patterns, but in today's digital world that is practically impossible, and that is what makes Popcorn important.

What are the core techniques used in the paper?

The paper notes that the state of the art for what the authors are trying to accomplish consists of ITPIR and CPIR. ITPIR works well when multiple independent servers are available to distribute the content the end user needs. However, it requires that these parties have plaintext access to the library, which may not be acceptable when serving copyrighted material, and in order to effectively hide which object was accessed each server has to process roughly half of the library per request. CPIR, on the other hand, keeps the content encrypted, which addresses ITPIR's plaintext problem, but relies on expensive cryptographic operations. The paper proposes a new approach that combines the advantages of both and breaks the trade-off: the library is encrypted once with ordinary, relatively cheap (non-PIR) encryption, ITPIR's simple XOR arithmetic over multiple servers is used to ship the encrypted content to the client, and the expensive CPIR machinery is reserved for delivering only the small per-object encryption keys.
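A minimal sketch of that division of labour is below, assuming a toy two-server setup; the function names and the stand-in XOR "cipher" are mine rather than the paper's, and where a real deployment would fetch the per-object key via CPIR this toy just reads it directly.

```python
# Toy sketch of Popcorn's hybrid: the library is symmetrically encrypted once,
# so the ITPIR servers never see plaintext; cheap XOR-based ITPIR ships the
# encrypted object, and the (expensive) CPIR step would be reserved for the
# tiny per-object key. Names and parameters here are illustrative only.
import os
from typing import List

def encrypt(obj: bytes, key: bytes) -> bytes:
    # Stand-in stream cipher: XOR with a repeated key (illustration only).
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(obj))

def itpir_query(num_objects: int, wanted: int):
    # Client: random bit vector for server A, same vector with the wanted
    # bit flipped for server B. Neither vector alone reveals `wanted`.
    q_a = [os.urandom(1)[0] & 1 for _ in range(num_objects)]
    q_b = list(q_a)
    q_b[wanted] ^= 1
    return q_a, q_b

def itpir_answer(enc_library: List[bytes], query: List[int]) -> bytes:
    # Server: XOR together every selected (encrypted) object.
    acc = bytes(len(enc_library[0]))
    for obj, bit in zip(enc_library, query):
        if bit:
            acc = bytes(x ^ y for x, y in zip(acc, obj))
    return acc

# Toy end-to-end flow with four equal-sized objects.
keys = [os.urandom(16) for _ in range(4)]
library = [b"movie-%d-data-pad" % i for i in range(4)]
enc_library = [encrypt(m, k) for m, k in zip(library, keys)]

wanted = 2
q_a, q_b = itpir_query(len(enc_library), wanted)
ans = bytes(x ^ y for x, y in zip(itpir_answer(enc_library, q_a),
                                  itpir_answer(enc_library, q_b)))
# In Popcorn the small key for `wanted` would be fetched with CPIR;
# here we read it directly just to close the loop of the toy example.
assert encrypt(ans, keys[wanted]) == library[wanted]
```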

Additionally, the paper uses batching in order to amortise full library scans across several user requests. This introduces yet another trade-off, between the latency the end user experiences and the amortisation factor. The authors exploit a characteristic unique to the video content they are targeting to break this trade-off as well: they observe that users watching a video only need the beginning of the stream immediately. Access to the rest of the stream is sequential, and therefore predictable and far less latency-sensitive. They take advantage of this by partitioning each stream into progressively larger segments, each served in its own epoch. The beginnings of streams are always served quickly in short epochs, while later segments are served in longer epochs, thereby amortising the cost of reading the entire library once per epoch.
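A small sketch of that partitioning idea, assuming a made-up geometric growth factor rather than the paper's actual parameters:

```python
# Segment i of every stream is served in its own batching epoch; early
# segments are tiny so their epochs (and startup latency) are short, while
# later segments are large so one library scan is amortised over much more
# delivered data per request. Constants are illustrative, not the paper's.

def plan_segments(stream_len_s: int, first_seg_s: int = 10, growth: float = 2.0):
    """Return (start, length) pairs in seconds covering the whole stream."""
    segments, start, length = [], 0, first_seg_s
    while start < stream_len_s:
        length = min(length, stream_len_s - start)
        segments.append((start, length))
        start += length
        length = int(length * growth)
    return segments

# A two-hour stream: the first chunk is only 10 s (served almost immediately),
# the last chunk covers a large tail that can be delivered in a long epoch.
for start, length in plan_segments(2 * 60 * 60):
    print(f"segment starts at {start:>5}s, length {length:>5}s")
```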

What are the paper’s main limitations?

It seems the paper does not account for random access within video streams. The authors are right that streams are consumed largely in order, but it is not uncommon to jump back and forth once or twice, and the current design of Popcorn would incur very large latencies for such access patterns. The paper uses an Apache server implementation as a baseline and reports roughly 3x overhead for Popcorn. Even 3x would be unacceptable for something like YouTube or Netflix, and I suspect the real overhead is much larger, because the Apache baseline greatly underestimates how efficiently these large streaming systems actually run. Both YouTube and Netflix use analytics derived from access patterns to optimise content placement. Deploying something like Popcorn would at the very least mean they get less reliable information about access patterns (Popcorn itself does not provide any privacy-preserving analytics, although it cites related work that does). This would in turn incur additional cost, since content would have to be moved around on demand.

What are potential extensions of the work?

The idea of varying the epoch size based on the position within the media stream, informed by general consumption patterns of such streams, was really interesting. I wonder whether it could be applied in other scenarios, such as caching content in data centres closer to users (e.g. cache only the beginnings of streams and ship the rest on demand once playback starts).

As it stands, Popcorn does not allow any geolocation-based caching, since serving any client privately requires the entire library. Netflix and YouTube need analytics such as which content is popular in which region. As a straw-man example, if YouTube and Netflix simply chose not to use this access data and absorbed the cost of not knowing what to cache near the user, the services would become too expensive for anyone to use, and that is assuming zero overhead for the PIR itself. It follows that these streaming services need a mechanism for learning such regional statistics. Given that, it would be interesting to design a PIR mechanism that hides individual access patterns but preserves aggregate community statistics. I think such an approach would not only make the system more realistic for service providers, but could also decrease the PIR overhead.
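As one hedged illustration of what "preserve the aggregate, hide the individual" could look like, the sketch below uses plain randomized response, a standard local-differential-privacy trick that is my own suggestion and not something Popcorn proposes.

```python
# Each client in a region reports the title it watched truthfully only with
# probability p, and a uniformly random title otherwise; the provider can
# debias the counts to estimate regional popularity without trusting any
# single report. Purely illustrative, not part of Popcorn.
import random
from collections import Counter

def randomized_report(true_title: int, num_titles: int, p: float = 0.75) -> int:
    return true_title if random.random() < p else random.randrange(num_titles)

def estimate_counts(reports, num_titles: int, p: float = 0.75):
    n = len(reports)
    raw = Counter(reports)
    # E[raw[t]] = p * true[t] + (1 - p) * n / num_titles, so invert that.
    return {t: (raw.get(t, 0) - (1 - p) * n / num_titles) / p
            for t in range(num_titles)}

# Toy region: title 0 is genuinely popular; the debiased estimate recovers
# the trend even though every individual report is deniable.
true_choices = [0] * 700 + [1] * 200 + [2] * 100
reports = [randomized_report(t, num_titles=3) for t in true_choices]
print(estimate_counts(reports, num_titles=3))
```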