AI News
AI nursing ethics: Viability of robots and artificial intelligence in nursing practice
Robots and artificial intelligence (AI) are expected to play a key role in nursing practice in the future. In this regard, researchers from Japan ask whether intelligent machines can replace humans as nurses. They investigate the potential of current advancements in robotics and AI to replicate the ethical concepts attributed to nurses, including advocacy, accountability, cooperation, and caring. While these technologies hold promise in enhancing healthcare practices, their integration into nursing requires careful consideration.
Meta Ran a Giant Experiment in Governance. Now It's Turning to AI
The company just ran a huge pilot program to inform decisions about the metaverse. Here’s how it needs to improve the process before applying it to AI.
How AI Can Make Gaming Better for All Players
If used responsibly, artificial intelligence has the potential both to make gaming more accessible and to actively learn what individual players need.
On the Stepwise Nature of Self-Supervised Learning
Figure 1: stepwise behavior in self-supervised learning. When training common SSL algorithms, we find that the loss descends in a stepwise fashion (top left) and the learned embeddings iteratively increase in dimensionality (bottom left). Direct visualization of embeddings (right; top three PCA directions shown) confirms that embeddings are initially collapsed to a point, then expand to a 1D manifold, a 2D manifold, and beyond, concurrently with steps in the loss.

It is widely believed that deep learning’s stunning success is due in part to its ability to discover and extract useful representations of complex data. Self-supervised learning (SSL) has emerged as a leading framework for learning these representations for images directly from unlabeled data, similar to how LLMs learn representations for language directly from web-scraped text. Yet despite SSL’s key role in state-of-the-art models such as CLIP and Midjourney, fundamental questions like “what are self-supervised image systems really learning?” and “how does that learning actually occur?” lack basic answers.

Our recent paper (to appear at ICML 2023) presents what we suggest is the first compelling mathematical picture of the training process of large-scale SSL methods. Our simplified theoretical model, which we solve exactly, learns aspects of the data in a series of discrete, well-separated steps. We then demonstrate that this behavior can be observed in the wild across many current state-of-the-art systems. This discovery opens new avenues for improving SSL methods, and it raises a whole range of new scientific questions that, when answered, will provide a powerful lens for understanding some of today’s most important deep learning systems.

Background

We focus here on joint-embedding SSL methods — a superset of contrastive methods — which learn representations that obey view-invariance criteria.
The loss function of these models includes a term enforcing matching embeddings for semantically equivalent “views” of an image. Remarkably, this simple approach yields powerful representations on image tasks even when the views are as simple as random crops and color perturbations.

Theory: stepwise learning in SSL with linearized models

We first describe an exactly solvable linear model of SSL in which both the training trajectories and final embeddings can be written in closed form. Notably, we find that representation learning separates into a series of discrete steps: the rank of the embeddings starts small and iteratively increases in a stepwise learning process.

The main theoretical contribution of our paper is to exactly solve the training dynamics of the Barlow Twins loss function under gradient flow for the special case of a linear model \(\mathbf{f}(\mathbf{x}) = \mathbf{W} \mathbf{x}\). To sketch our findings: when initialization is small, the model learns representations composed precisely of the top-\(d\) eigendirections of the featurewise cross-correlation matrix \(\boldsymbol{\Gamma} \equiv \mathbb{E}_{\mathbf{x},\mathbf{x}'} [ \mathbf{x} \mathbf{x}'^{\top} ]\). What’s more, these eigendirections are learned one at a time, in a sequence of discrete learning steps whose times are determined by the corresponding eigenvalues. Figure 2 illustrates this learning process, showing both the growth of a new direction in the represented function and the resulting drop in the loss at each learning step. As a bonus, we find a closed-form equation for the final embeddings learned by the model at convergence.

Figure 2: stepwise learning appears in a linear model of SSL. We train a linear model with the Barlow Twins loss on a small sample of CIFAR-10. The loss (top) descends in a staircase fashion, with step times well-predicted by our theory (dashed lines). The embedding eigenvalues (bottom) spring up one at a time, closely matching theory (dashed curves).

Our finding of stepwise learning is a manifestation of the broader concept of spectral bias: the observation that many learning systems with approximately linear dynamics preferentially learn eigendirections with higher eigenvalue. Spectral bias has recently been well studied in standard supervised learning, where higher-eigenvalue eigenmodes are found to be learned faster during training. Our work finds analogous results for SSL.

The reason a linear model merits careful study is that, as shown by the “neural tangent kernel” (NTK) line of work, sufficiently wide neural networks also have linear parameterwise dynamics. This fact is sufficient to extend our solution for a linear model to wide neural nets (or indeed to arbitrary kernel machines), in which case the model preferentially learns the top \(d\) eigendirections of a particular operator related to the NTK. The study of the NTK has yielded many insights into the training and generalization of even nonlinear neural networks, a clue that some of the insights gleaned here might transfer to realistic cases.

Experiment: stepwise learning in SSL with ResNets

As our main experiments, we train several leading SSL methods with full-scale ResNet-50 encoders and find that, remarkably, this stepwise learning pattern is clearly visible even in realistic settings, suggesting that the behavior is central to how SSL learns.

To see stepwise learning with ResNets in realistic setups, all we have to do is run the algorithm and track the eigenvalues of the embedding covariance matrix over time. In practice, it helps to highlight the stepwise behavior by training from smaller-than-normal parameterwise initialization and with a small learning rate; we use these modifications in the experiments discussed here and cover the standard case in our paper.
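The measurement just described, tracking the spectrum of the embedding covariance matrix, takes only a few lines. Below is a minimal sketch on synthetic embeddings; the function names and the rank threshold are our own illustrative choices, not from the paper:

```python
import numpy as np

def embedding_spectrum(Z):
    """Eigenvalues of the covariance of a batch of embeddings.

    Z: array of shape (n_samples, d), one embedding per row.
    Returns eigenvalues sorted in descending order.
    """
    Zc = Z - Z.mean(axis=0, keepdims=True)       # center the batch
    cov = Zc.T @ Zc / Z.shape[0]                 # (d, d) covariance matrix
    return np.linalg.eigvalsh(cov)[::-1]         # eigvalsh is for symmetric matrices

def effective_rank(Z, tol=1e-6):
    """Count eigenvalues above a small threshold: a proxy for the
    dimensionality of the embedding manifold at this point in training."""
    return int((embedding_spectrum(Z) > tol).sum())

# Toy illustration: embeddings collapsed onto a line have effective rank 1;
# adding a second independent direction raises it to 2, mirroring the
# step-by-step dimensionality growth seen during SSL training.
rng = np.random.default_rng(0)
line = rng.normal(size=(512, 1)) @ rng.normal(size=(1, 8))   # rank-1 embeddings in d=8
print(effective_rank(line))                                   # 1
plane = line + rng.normal(size=(512, 1)) @ rng.normal(size=(1, 8))
print(effective_rank(plane))                                  # 2
```

Logged once per epoch on real embeddings, this spectrum is all that is needed to reproduce the staircase plots in Figures 1–3.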
Figure 3: stepwise learning is apparent in Barlow Twins, SimCLR, and VICReg. The loss and embeddings of all three methods display stepwise learning, with embeddings iteratively increasing in rank as predicted by our model.

Figure 3 shows losses and embedding covariance eigenvalues for three SSL methods — Barlow Twins, SimCLR, and VICReg — trained on the STL-10 dataset with standard augmentations. Remarkably, all three show very clear stepwise learning, with the loss decreasing in a staircase curve and one new eigenvalue springing up from zero at each subsequent step. We also show an animated zoom-in on the early steps of Barlow Twins in Figure 1.

It’s worth noting that, while these three methods look rather different at first glance, it’s long been suspected that they do something similar under the hood. In particular, these and other joint-embedding SSL methods all achieve similar performance on benchmark tasks. The challenge, then, is to identify the shared behavior underlying these varied methods. Much prior theoretical work has focused on analytical similarities in their loss functions, but our experiments suggest a different unifying principle: SSL methods all learn embeddings one dimension at a time, iteratively adding new dimensions in order of salience.

In a final, preliminary but promising experiment, we compare the real embeddings learned by these methods with theoretical predictions computed from the NTK after training. We not only find good agreement between theory and experiment within each method, but also find, comparing across methods, that different methods learn similar embeddings, adding extra support to the notion that these methods are ultimately doing similar things and can be unified.

Why it matters

Our work paints a basic theoretical picture of the process by which SSL methods assemble learned representations over the course of training. Now that we have a theory, what can we do with it?
We see promise for this picture both to aid the practice of SSL from an engineering standpoint and to enable better understanding of SSL, and potentially of representation learning more broadly.

On the practical side, SSL models are famously slow to train compared to supervised models, and the reason for this difference isn’t known. Our picture of training suggests that SSL takes a long time to converge because the later eigenmodes have long time constants and grow slowly. If that picture is right, speeding up training could be as simple as selectively focusing gradient on small embedding eigendirections to pull them up to the level of the others, which can in principle be done with just a simple modification to the loss function or the optimizer. We discuss these possibilities in more detail in our paper.

On the scientific side, the framework of SSL as an iterative process invites many questions about the individual eigenmodes. Are the ones learned first more useful than the ones learned later? How do different augmentations change the learned modes, and does this depend on the specific SSL method used? Can we assign semantic content to any (subset of) eigenmodes? (For example, we’ve noticed that the first few modes learned sometimes represent highly interpretable functions like an image’s average hue and saturation.) If other forms of representation learning converge to similar representations — a fact which is easily testable — then answers to these questions may have implications extending to deep learning more broadly.

All considered, we’re optimistic about the prospects of future work in the area. Deep learning remains a grand theoretical mystery, but we believe our findings here give a useful foothold for future studies into the learning behavior of deep networks.
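To make the loss-modification idea concrete, here is a purely hypothetical sketch of our own (not a method from the paper, and not differentiable as written; a real version would be implemented in an autodiff framework): an auxiliary penalty that is largest for embedding eigendirections lagging far below a target scale, so its gradient would preferentially pull the small eigenvalues up.

```python
import numpy as np

def eigenvalue_deficit_penalty(Z, target=1.0):
    """Hypothetical auxiliary term: sums how far each embedding-covariance
    eigenvalue falls short of a target scale. Collapsed (slow-growing)
    eigendirections contribute the most, so adding a differentiable version
    of this term to an SSL loss would focus gradient on expanding them.
    Illustrative only; names and the target scale are our own choices.
    """
    Zc = Z - Z.mean(axis=0, keepdims=True)
    cov = Zc.T @ Zc / Z.shape[0]
    eigs = np.linalg.eigvalsh(cov)                 # all d eigenvalues, ascending
    return float(np.clip(target - eigs, 0.0, None).sum())

rng = np.random.default_rng(0)
collapsed = rng.normal(size=(256, 1)) @ rng.normal(size=(1, 8))  # rank-1 embeddings
isotropic = rng.normal(size=(256, 8))                             # full-rank embeddings
# collapsed embeddings (7 near-zero eigenvalues) are penalized far more
print(eigenvalue_deficit_penalty(collapsed) > eigenvalue_deficit_penalty(isotropic))  # True
```

Whether such a term actually shortens the long tail of slow eigenmode growth, and at what cost to final representation quality, is exactly the kind of question the stepwise picture makes testable.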
This post is based on the paper “On the Stepwise Nature of Self-Supervised Learning,” which is joint work with Maksis Knutins, Liu Ziyin, Daniel Geisz, and Joshua Albrecht. The work was conducted at Generally Intelligent, where Jamie Simon is a Research Fellow. This blog post is cross-posted here. We’d be delighted to field your questions or comments.
Free Machine Learning Courses from Amazon, Google and Microsoft: What Do They Offer?
Machine learning engineers are part of the fastest-growing field in the world, according to LinkedIn’s 2023 Jobs on the Rise report. But tech’s hottest role isn’t a simple field to break into, requiring at least high school math and some programming knowledge even to get started. Luckily, there are an increasing number of options for ... The post Free Machine Learning Courses from Amazon, Google and Microsoft: What Do They Offer? appeared first on TechRepublic.
UN AI for Good Summit Explores How Generative AI Poses Risks and Fosters Connections
Panelists at the UN AI for Good summit discussed regulation, international cooperation and how lessons learned from social media could apply to artificial intelligence.
3 Free Ways to Get an AI Summary of a Long Web Article
Asking Bard, Perplexity or Pi for an article summary is a great way to get started with these AI systems.
Microsoft Edge cheat sheet
Microsoft Edge is the default browser for Windows 10. This cheat sheet covers the basics of Microsoft Edge, including how to set up the browser and optimize and use key features.
AI tool decodes brain cancer's genome during surgery
New AI tool enables in-surgery genomic profiling of gliomas, the most aggressive and most common brain tumors. This information offers critical clues about how aggressive a cancer is, its future behavior, and its likely response to treatment. The tool can provide real-time guidance to surgeons on the optimal surgical approach for removal of cancerous tissue.
OpenAI Is Hiring Researchers to Wrangle ‘Superintelligent’ AI
The AI giant predicts human-like machine intelligence could arrive within 10 years, so they want to be ready for it in four.
ChatGPT Is Reshaping Crowd Work
Although some workers shun chatbot help, platforms are adopting policies or technology to deter use of AI—potentially making crowd work more difficult.
How 'Indiana Jones and the Dial of Destiny' De-Aged Harrison Ford
Industrial Light & Magic developed a new suite of tools to reincarnate the Indy of the 80s. The effect is haunting—and raises more questions about the use of AI in Hollywood.
Learning the language of molecules to predict their properties
This AI system only needs a small amount of data to predict molecular properties, which could speed up drug discovery and material development.
AI finds a way to people's hearts (literally!)
Scientists have successfully developed a model that utilizes AI to accurately classify cardiac functions and valvular heart diseases from chest radiographs. The Area Under the Curve, or AUC, of the AI classification showed a high level of accuracy, exceeding 0.85 for almost all indicators and reaching 0.92 for detecting left ventricular ejection fraction -- an important measure for monitoring cardiac function.
MIT scientists build a system that can generate AI models for biology research
BioAutoMATED, an open-source, automated machine-learning platform, aims to help democratize artificial intelligence for research labs.
How To Access Google Bard Quickly (Step-by-Step Guide)
Learn six ways to configure fast access to Google Bard on a computer, phone or tablet so you can start a chat and explore a topic rapidly.
Give Every AI a Soul—or Else
To solve the “crisis” in artificial intelligence, AI beings must say, “I am me.”
Deciphering the thermodynamic arrow of time in large-scale complex networks
A solution for temporal asymmetry -- or entropy production -- in thermodynamics has been developed to further our understanding of the behavior of biological systems, machine learning, and AI tools. The researchers worked on the time-irreversible Ising model dynamics caused by asymmetric connections between neurons.
5 Uses for ChatGPT that Aren’t Fan Fiction or Cheating at School
Chatbots are great for lots of things, but these uses may be unexpected.